web-extract
active 0x01939b25277889223f4428e538ca7505d960bbf6d500ca1bfa3302e8e1dc2ebd
Turn any URL into clean, LLM-ready Markdown. Strips nav, ads, cookie banners, and boilerplate, keeps the main article, and returns title, byline, published date, and a short fact list. The reliable way for an agent to read the web without drowning in markup.
Skill body
Web extract
One page in, clean Markdown out — the version an agent can actually reason over.
Method
- Fetch the URL. Honor redirects; set a real User-Agent. If the response is a PDF, hand off to a document skill instead of guessing.
- Detect main content by text density: pick the DOM subtree with the highest ratio of link-free text to markup. Drop `<nav>`, `<aside>`, `<footer>`, cookie and consent banners, share widgets, and comment sections.
- Convert the surviving tree to Markdown: headings, lists, code blocks, and tables preserved; inline styling flattened; images kept as `![alt](src)`.
- Pull metadata from `<title>`, Open Graph, JSON-LD, and `<meta>`: title, author, `publishedAt`, canonical URL, site name.
- Summarize facts: 3–6 standalone, verifiable statements an agent could cite.
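The text-density step can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the skill's actual implementation: the names `DensityParser` and `pick_main_subtree`, the `CONTAINERS` candidate set, and the `min_chars` cutoff are all assumptions made for the example. Banner, widget, and comment-section removal is left out because it needs site-specific selectors.

```python
from html.parser import HTMLParser

DROP = {"nav", "aside", "footer", "script", "style"}  # subtrees skipped entirely
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
CONTAINERS = {"article", "main", "section", "div", "body"}  # assumed candidate roots


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.text_chars = 0  # link-free text characters in this subtree
        self.tags = 1        # elements in this subtree, counting this one


class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = Node("#root")
        self.nodes = []
        self.link_depth = 0  # > 0 while inside <a>
        self.drop_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.drop_depth or tag in DROP:
            self.drop_depth += 1
            return
        if tag == "a":
            self.link_depth += 1
        node = Node(tag, self.current)
        self.nodes.append(node)
        ancestor = node.parent
        while ancestor is not None:  # every ancestor gains one element
            ancestor.tags += 1
            ancestor = ancestor.parent
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.drop_depth:
            self.drop_depth -= 1
            return
        node = self.current
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is None:
            return  # stray closing tag, ignore it
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        self.current = node.parent

    def handle_data(self, data):
        if self.drop_depth or self.link_depth:
            return  # dropped regions and link text do not count
        chars = len(data.strip())
        node = self.current
        while node is not None:
            node.text_chars += chars
            node = node.parent


def pick_main_subtree(html, min_chars=100):
    """Return the container with the highest link-free-text-to-markup ratio."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    candidates = [n for n in parser.nodes
                  if n.tag in CONTAINERS and n.text_chars >= min_chars]
    return max(candidates, key=lambda n: n.text_chars / n.tags, default=None)
```

The `min_chars` floor is there because a bare ratio would favor a single long sentence in a one-element wrapper over a full article; the value 100 is arbitrary and would need tuning on real pages. The sketch also assumes reasonably well-formed HTML, since unclosed tags inside a dropped subtree would throw off the depth counter.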
Edge cases
- JS-rendered (empty body, framework shell): report `rendered: false` so the caller can retry with a headless fetch; don't return an empty article as success.
- Paywalled / truncated: return what's visible and set `truncated: true`.
- Listing pages (no single article): return top links instead of forcing prose.
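One way to wire these edge cases into flags is sketched below. The detection signals are assumptions, since the skill does not say how a shell, a paywall, or a listing page is recognized: `PageSignals`, `classify`, the `kind` field, and both thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    # Hypothetical measurements gathered by an earlier extraction pass.
    word_count: int               # words of link-free text in the chosen subtree
    link_text_ratio: float        # share of visible text that sits inside links
    paywall_marker: bool = False  # e.g. a detected subscribe-to-continue overlay


def classify(signals, min_words=20, listing_ratio=0.6):
    """Map page signals to the `rendered` / `truncated` flags and a page kind."""
    result = {"kind": "article", "rendered": True, "truncated": False}
    if signals.link_text_ratio >= listing_ratio:
        # Mostly links: treat as a listing and return top links, not prose.
        result["kind"] = "listing"
        return result
    if signals.word_count < min_words:
        # Almost no text and no link list: likely an unrendered framework shell.
        result["kind"] = "empty"
        result["rendered"] = False
        return result
    if signals.paywall_marker:
        # Keep what is visible, but flag it as incomplete.
        result["truncated"] = True
    return result
```

For example, `classify(PageSignals(word_count=0, link_text_ratio=0.0))` reports `rendered` as `False`, which tells the caller to retry with a headless fetch rather than accept an empty article.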
Output
{ url, title, byline, publishedAt, siteName, markdown, facts[], wordCount, rendered, truncated }
Never invent metadata: a missing field is null, not a guess.
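The output shape and the null rule could be modeled as follows. This is a sketch under assumptions: `Extraction`, `first_present`, and `build_record` are hypothetical names, the `meta` dictionary keys are illustrative, and preferring Open Graph over `<title>` is a choice made for the example, not a precedence the skill specifies.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Extraction:
    # Field names mirror the output shape above, hence camelCase.
    url: str
    title: Optional[str] = None
    byline: Optional[str] = None
    publishedAt: Optional[str] = None
    siteName: Optional[str] = None
    markdown: str = ""
    facts: list = field(default_factory=list)
    wordCount: int = 0
    rendered: bool = True
    truncated: bool = False


def first_present(*candidates):
    """Return the first non-empty string, or None. Never a guess."""
    for value in candidates:
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def build_record(url, markdown, meta):
    # `meta` is assumed to be a flat dict of already-parsed tag values.
    return Extraction(
        url=url,
        title=first_present(meta.get("og:title"), meta.get("title")),
        byline=first_present(meta.get("author")),
        publishedAt=first_present(meta.get("article:published_time")),
        siteName=first_present(meta.get("og:site_name")),
        markdown=markdown,
        wordCount=len(markdown.split()),
    )


record = build_record("https://example.com/post", "# Hi\n\nSome text here.", {"title": " Hi "})
print(json.dumps(asdict(record), indent=2))  # absent fields serialize as null
```

Defaulting every optional field to `None` means a missing value reaches the caller as JSON `null` without any extra handling, so there is no code path where a placeholder string could slip in.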