web-extract

active

0x01939b25277889223f4428e538ca7505d960bbf6d500ca1bfa3302e8e1dc2ebd

Turn any URL into clean, LLM-ready Markdown. Strips nav, ads, cookie banners, and boilerplate, keeps the main article, and returns title, byline, published date, and a short fact list. A reliable way for an agent to read the web without drowning in markup.

web scraping readability markdown extraction

Skill body

Web extract

One page in, clean Markdown out: the version an agent can actually reason over.

Method

Fetch the URL. Honor redirects; set a real User-Agent. If the response is a PDF, hand off to a document skill instead of guessing.
Detect main content by text density: pick the DOM subtree with the highest ratio of link-free text to markup. Drop <nav>, <aside>, <footer>, cookie and consent banners, share widgets, and comment sections.
Convert the surviving tree to Markdown: headings, lists, code blocks, and tables are preserved, inline styling is flattened, and images are kept as ![alt](src).
Pull metadata from <title>, Open Graph, JSON-LD, and <meta>: title, author (returned as byline), publishedAt, canonical URL, site name.
Summarize facts: 3–6 standalone, verifiable statements an agent could cite.
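
The text-density step above can be sketched roughly as follows. This is a minimal illustration using only Python's standard html.parser, not the skill's actual implementation. The names Node, TreeBuilder, measure, and pick_main, the min_chars threshold, and the exact score (link-free characters divided by tag count) are all assumptions made for the example. It drops elements by tag name only; consent banners, share widgets, and comment sections would need class or id heuristics that are not shown.

```python
from html.parser import HTMLParser

# Assumed sets for this sketch; a real extractor would tune these.
DROP = {"nav", "aside", "footer", "script", "style"}
VOID = {"br", "img", "hr", "meta", "link", "input", "source", "wbr"}
CONTAINERS = {"article", "main", "section", "div", "body"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_chars = 0  # link-free text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree that records text lengths."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray closers.
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        # Skip text that sits inside a link at any depth.
        n = self.cur
        while n is not None:
            if n.tag == "a":
                return
            n = n.parent
        self.cur.text_chars += len(data.strip())


def measure(node):
    """Return (link_free_chars, tag_count) for a subtree, skipping DROP tags."""
    if node.tag in DROP:
        return 0, 0
    text, tags = node.text_chars, 1
    for child in node.children:
        t, g = measure(child)
        text += t
        tags += g
    return text, tags


def pick_main(html, min_chars=50):
    """Return the container Node with the best text-to-markup ratio, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0.0
    stack = [builder.root]
    while stack:
        node = stack.pop()
        if node.tag in DROP:
            continue
        stack.extend(node.children)
        if node.tag not in CONTAINERS:
            continue
        text, tags = measure(node)
        if text >= min_chars and text / tags > best_score:
            best, best_score = node, text / tags
    return best
```

The min_chars floor is there because a raw ratio favors tiny leaf containers; other extractors handle this with different weightings, so treat the scoring as one option among several.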

Edge cases

JS-rendered (empty body, framework shell): report rendered: false so the caller can retry with a headless fetch. Do not return an empty article as success.
Paywalled / truncated: return what's visible and set truncated: true.
Listing pages (no single article): return top links instead of forcing prose.
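
One way to route these edge cases is a small classifier over a few page signals. The sketch below is hypothetical: PageSignals, classify, the "kind" label, and both thresholds are assumptions introduced for illustration, and how a paywall marker is detected is left out entirely. Only the rendered and truncated flags come from the skill's output.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    text_chars: int       # visible text outside links
    link_chars: int       # visible text inside <a> elements
    script_tags: int      # number of <script> elements in the page
    paywall_marker: bool  # assumed to be set by some upstream detector


def classify(sig, min_text=200, listing_link_share=0.6):
    """Map page signals to the rendered/truncated flags plus a hypothetical kind."""
    total = sig.text_chars + sig.link_chars
    # Almost no visible text but scripts present: likely a framework shell.
    if total < min_text and sig.script_tags > 0:
        return {"kind": "js_shell", "rendered": False, "truncated": False}
    # Mostly link text: treat as a listing page rather than an article.
    if total > 0 and sig.link_chars / total >= listing_link_share:
        return {"kind": "listing", "rendered": True, "truncated": False}
    # Otherwise an article; flag it when a paywall marker was seen.
    return {"kind": "article", "rendered": True, "truncated": sig.paywall_marker}
```

For example, classify(PageSignals(10, 0, 5, False)) reports rendered as False, which tells the caller to retry with a headless fetch.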

Output

{ url, title, byline, publishedAt, siteName, markdown, facts[], wordCount, rendered, truncated }

Never invent metadata: a missing field is null, not a guess.
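
The output shape and the null rule can be made concrete with a typed record. This is a sketch under assumptions: the field names follow the output listed above, but the class name ExtractResult, the to_json helper, the choice of which fields are required, and the defaults for wordCount, rendered, and truncated are illustrative only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExtractResult:
    url: str
    markdown: str
    # Metadata defaults to None so a missing value serializes as null.
    title: Optional[str] = None
    byline: Optional[str] = None
    publishedAt: Optional[str] = None
    siteName: Optional[str] = None
    facts: List[str] = field(default_factory=list)
    wordCount: int = 0
    rendered: bool = True
    truncated: bool = False

    def to_json(self):
        """Serialize to JSON; None fields come out as null, never as guesses."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Usage: only the fields that were actually found are filled in.
result = ExtractResult(
    url="https://example.com/post",
    markdown="# Heading\n\nBody text here.",
    title="Heading",
    wordCount=5,
)
payload = result.to_json()
```

Defaulting metadata to None means a caller can tell "not found" apart from an empty string.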