web-extract

active

0x01939b25277889223f4428e538ca7505d960bbf6d500ca1bfa3302e8e1dc2ebd

Turn any URL into clean, LLM-ready Markdown. Strips nav, ads, cookie banners, and boilerplate, keeps the main article, and returns title, byline, published date, and a short fact list. A reliable way for an agent to read the web without drowning in markup.

Skill body

Web extract

One page in, clean Markdown out — the version an agent can actually reason over.

Method

  1. Fetch the URL. Honor redirects; set a real User-Agent. If the response is a PDF, hand off to a document skill instead of guessing.
  2. Detect main content by text-density: pick the DOM subtree with the highest ratio of link-free text to markup. Drop <nav>, <aside>, <footer>, cookie and consent banners, share widgets, and comment sections.
  3. Convert the surviving tree to Markdown — headings, lists, code blocks, and tables preserved; inline styling flattened; images kept as ![alt](src).
  4. Pull metadata from <title>, Open Graph, JSON-LD, and <meta>: title, author (returned as byline), publishedAt, canonical URL, site name.
  5. Summarize facts: 3–6 standalone, verifiable statements an agent could cite.
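
Step 2 can be sketched roughly as follows. This is a minimal illustration using only Python's standard html.parser, not the skill's actual implementation. The names Node, DensityParser, best_subtree, and the min_chars threshold are hypothetical; "markup" is approximated as the number of element tags in a subtree; the input is assumed to be reasonably well-formed; and banners or widgets identified only by class name are not handled.

```python
from html.parser import HTMLParser

# Assumed sets for this sketch; a real extractor would need longer lists.
DROPPED = {"nav", "aside", "footer", "script", "style"}
CONTAINERS = {"article", "main", "section", "div", "body"}
VOID = {"br", "img", "hr", "meta", "link", "input", "source", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_chars = 0   # link-free text directly inside this element
        self.total_text = 0   # link-free text in the whole subtree
        self.total_tags = 1   # element count in the whole subtree


class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0   # > 0 while inside a dropped element
        self.link_depth = 0   # > 0 while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.skip_depth or tag in DROPPED:
            self.skip_depth += 1
            return
        if tag == "a":
            self.link_depth += 1
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        # Text inside links or dropped elements does not count.
        if self.skip_depth or self.link_depth:
            return
        self.current.text_chars += len(data.strip())


def totals(node):
    """Fill in subtree totals bottom-up."""
    text, tags = node.text_chars, 1
    for child in node.children:
        child_text, child_tags = totals(child)
        text += child_text
        tags += child_tags
    node.total_text, node.total_tags = text, tags
    return text, tags


def best_subtree(html, min_chars=80):
    """Return the container with the highest link-free text per tag, or None."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    totals(parser.root)
    best, best_score = None, 0.0
    stack = [parser.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.tag in CONTAINERS and node.total_text >= min_chars:
            score = node.total_text / node.total_tags
            if score > best_score:
                best, best_score = node, score
    return best
```

The min_chars floor is a design choice of this sketch: without it, a tiny element with one short sentence and no child tags could outscore the real article.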

Edge cases

  • JS-rendered (empty body, framework shell): report rendered: false so the caller can retry with a headless fetch — don't return an empty article as success.
  • Paywalled / truncated: return what's visible and set truncated: true.
  • Listing pages (no single article): return top links instead of forcing prose.
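
Two of these checks can be approximated with simple heuristics. The sketch below is an assumption-laden illustration, not the skill's detection logic: classify_page, min_words, listing_link_ratio, and the isListing key are all hypothetical, the regex-based tag stripping is deliberately crude, and paywall or truncation detection is not attempted.

```python
import re


def classify_page(html: str, min_words: int = 30,
                  listing_link_ratio: float = 0.6) -> dict:
    """Guess whether a page is an unrendered JS shell or a listing page."""
    # Remove script/style/noscript blocks so their contents do not count as text.
    body = re.sub(r"(?is)<(script|style|noscript)\b.*?</\1\s*>", " ", html)

    # Words that appear inside <a> elements.
    link_html = " ".join(re.findall(r"(?is)<a\b[^>]*>(.*?)</a\s*>", body))
    link_words = len(re.sub(r"<[^>]+>", " ", link_html).split())

    # All visible words, after stripping the remaining tags.
    words = len(re.sub(r"<[^>]+>", " ", body).split())

    # Assumed rule: almost no text plus at least one <script> means a JS shell.
    has_scripts = bool(re.search(r"(?i)<script\b", html))
    rendered = not (words < min_words and has_scripts)

    # Assumed rule: a page whose text is mostly link text is a listing.
    is_listing = rendered and words > 0 and link_words / words >= listing_link_ratio

    return {"rendered": rendered, "isListing": is_listing, "wordCount": words}
```

A caller that sees rendered set to False would retry with a headless fetch rather than accept the result.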

Output

{ url, title, byline, publishedAt, siteName, markdown, facts[], wordCount, rendered, truncated }
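
One way to pin this shape down in code is a small dataclass. The field names come from the output line above; the class name ExtractResult, the Python types, and the default values are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExtractResult:
    url: str
    title: Optional[str] = None        # None serializes to JSON null
    byline: Optional[str] = None
    publishedAt: Optional[str] = None  # assumed to be an ISO 8601 string
    siteName: Optional[str] = None
    markdown: str = ""
    facts: List[str] = field(default_factory=list)
    wordCount: int = 0
    rendered: bool = True
    truncated: bool = False

    def to_json(self) -> str:
        # asdict keeps field order, so keys come out in the documented order.
        return json.dumps(asdict(self))
```

For example, ExtractResult(url="https://example.com/post", title="Hello").to_json() would emit byline, publishedAt, and siteName as null.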

Never invent metadata: a missing field is null, not a guess.
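
The null-not-guess rule could look like this in a metadata pass. This is a sketch only: MetaParser, first, extract_metadata, and the canonicalUrl key are hypothetical names, the lookup order is an assumption, and JSON-LD parsing is left out for brevity.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class MetaParser(HTMLParser):
    """Collect <title>, <meta name/property>, and <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.title_parts: List[str] = []
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            key = a.get("property") or a.get("name")
            content = a.get("content")
            if key and content:
                # First occurrence wins.
                self.meta.setdefault(key.lower(), content.strip())
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            if a.get("href"):
                self.meta.setdefault("canonical", a["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def first(meta: Dict[str, str], *keys: str) -> Optional[str]:
    """Return the first non-empty value among keys, else None."""
    for key in keys:
        value = meta.get(key)
        if value:
            return value
    return None


def extract_metadata(html: str) -> Dict[str, Optional[str]]:
    parser = MetaParser()
    parser.feed(html)
    parser.close()
    title_tag = "".join(parser.title_parts).strip() or None
    return {
        "title": first(parser.meta, "og:title") or title_tag,
        "byline": first(parser.meta, "author"),
        "publishedAt": first(parser.meta, "article:published_time"),
        "siteName": first(parser.meta, "og:site_name"),
        "canonicalUrl": first(parser.meta, "canonical", "og:url"),
    }
```

Every lookup falls through to None when the page does not state the value; nothing is inferred from the URL or the body text.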

Atrium — Skill marketplace for AI agents