baoyu-url-to-markdown

文档已审计 @JimLiu v1.61.0

信任分: 92/100
兼容 Agent: 1

速查档案只列事实：领域、Agent、信任分、作者、原文章节。装与不装请看下方作者解读。

领域: 文档
兼容 Agent: Claude Code
信任分: 92 / 100 · 已通过审计
作者 / 版本 / 许可: @JimLiu · v1.61.0 · 未声明 license
安装命令数: 1 条

需要注意： 未限定 allowed-tools，默认拥有全部工具权限。

想读作者英文原文？ ↓ 滚到正文区切换 · 在 GitHub 查看 ↗

解读由编辑根据原文凝练而成，命令、链接、术语均与作者原文一致；想看完整论述请切到右侧

baoyu-url-to-markdown 给一个 URL，把网页正文抓回来转成干净的 Markdown——重点是「干净」：去广告、去导航、去推荐位，只留正文。

设计思路

作者发现通用型抓取工具（Mercury、Readability）对中文站点支持不太好，特定站点（X、YouTube、Hacker News、知乎、微信）又有自己的反爬和结构。所以技能用了「baoyu-fetch CLI + 站点专用 adapters」的双层设计——通用站点走 Defuddle 走清洗，特殊站点走 site-specific adapter。

内置 adapters

SKILL.md 列出 built-in adapters：

X / Twitter 推文 / 推串
YouTube 字幕（直接拿 transcript 而不是抓页面）
Hacker News thread
通用页面 via Defuddle

未来可加更多 adapter，框架已留好扩展点。

工作流

按 Workflow：① 读 URL；② 自动判断走哪个 adapter；③ 通过 baoyu-fetch CLI（封装了 Chrome CDP）打开页面；④ Agent Quality Gate：抓回来后检查正文是否完整、是否有明显污染（导航、推荐）；⑤ Output Path Generation 自动起文件名（基于标题 + 日期）；⑥ Adapters & Media 处理图、视频、附件。

CLI Setup

启用前需装 baoyu-fetch CLI——SKILL.md 顶部 CLI Setup 章节给了详细的安装步骤。装好后通过 Environment Variables 控制 Chrome 路径、超时、缓存。

适合谁

写作者收集素材，把读过的好文章按规范归档
知识管理（PKM）用户，建立可搜索的笔记库
研究人员追踪某话题，批量抓取参考文献
把 RSS / Twitter 收藏夹批量沉淀到 Obsidian

何时不该用

需要登录态才能看的私域内容——这是公域抓取
大规模数据采集（每天上千 URL）——用专门的爬虫框架
抓动态加载需要 JS 跑很久才出内容的页面——可能需要自定义 adapter

配套

baoyu-format-markdown：把抓回来的 Markdown 进一步格式化
baoyu-translate：英文文章抓回来后翻译
baoyu-post-to-wechat：抓 → 翻 → 发布的链路
obsidian-vault：把抓回来的文件归档到 Obsidian

URL to Markdown

Fetches any URL via baoyu-fetch CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.

User Input Tools

When this skill prompts the user, follow this tool-selection rule (priority order):

Prefer built-in user-input tools exposed by the current agent runtime — e.g., AskUserQuestion, request_user_input, clarify, ask_user, or any equivalent.
Fallback: if no such tool exists, emit a numbered plain-text message and ask the user to reply with the chosen number/answer for each question.
Batching: if the tool supports multiple questions per call, combine all applicable questions into a single call; if only single-question, ask them one at a time in priority order.

Concrete AskUserQuestion references below are examples — substitute the local equivalent in other runtimes.

CLI Setup

Important: The CLI source is vendored in {baseDir}/scripts/lib. scripts/package.json installs only third-party runtime dependencies.

Agent Execution Instructions:

Determine this SKILL.md file's directory path as {baseDir}
Resolve ${BUN} runtime: if bun installed → bun; else suggest installing Bun
If {baseDir}/scripts/node_modules does not exist, run ${BUN} install --cwd {baseDir}/scripts
${READER} = {baseDir}/scripts/baoyu-fetch
Replace all ${READER} in this document with the resolved value

Preferences (EXTEND.md)

Check EXTEND.md in priority order — the first one found wins:

Priority	Path	Scope
1	`.baoyu-skills/baoyu-url-to-markdown/EXTEND.md`	Project
2	`${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md`	XDG
3	`$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md`	User home

Result	Action
Found	Read, parse, apply settings
Not found	MUST run first-time setup (see below) — do NOT silently create defaults

EXTEND.md supports: download media by default, default output directory.

First-Time Setup ⛔ BLOCKING

When EXTEND.md is not found, you MUST use AskUserQuestion to gather preferences before creating EXTEND.md. NEVER create EXTEND.md with silent defaults. Generation is BLOCKED until setup completes. Batch all three questions into a single call:

Q1 — Media (header "Media"): "How to handle images and videos in pages?"
- "Ask each time (Recommended)" — Prompt after each save
- "Always download" — Download to local imgs/ and videos/
- "Never download" — Keep remote URLs
Q2 — Output (header "Output"): "Default output directory?"
- "url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md
- User may pick "Other" and type a custom path
Q3 — Save (header "Save"): "Where to save preferences?"
- "User (Recommended)" — ~/.baoyu-skills/ (all projects)
- "Project" — .baoyu-skills/ (this project only)

After answers, write EXTEND.md, confirm "Preferences saved to [path]", then continue.

Full template: references/config/first-time-setup.md.

Supported Keys

Key	Default	Values	Description
`download_media`	`ask`	`ask` / `1` / `0`	`ask` = prompt each time, `1` = always, `0` = never
`default_output_dir`	empty	path or empty	Default output directory (empty = `./url-to-markdown/`)

EXTEND.md → CLI mapping:

EXTEND.md key	CLI argument	Notes
`download_media: 1`	`--download-media`	Requires `--output` to be set
`default_output_dir: ./posts/`	Agent constructs `--output ./posts/{domain}/{slug}.md`	Agent generates path, not a direct flag

Value priority: CLI arguments → EXTEND.md → skill defaults.

Usage

# Default: headless capture, markdown to stdout
${READER} <url>

# Save to file
${READER} <url> --output article.md

# Save with media download
${READER} <url> --output article.md --download-media

# Wait for interaction (login/CAPTCHA) — auto-detect and continue
${READER} <url> --wait-for interaction --output article.md

# Wait for interaction — manual control (Enter to continue)
${READER} <url> --wait-for force --output article.md

# JSON output
${READER} <url> --format json --output article.json

# Force specific adapter
${READER} <url> --adapter youtube --output transcript.md

Options

Option	Description
`<url>`	URL to fetch
`--output <path>`	Output file path (default: stdout)
`--format <type>`	Output format: `markdown` (default) or `json`
`--json`	Shorthand for `--format json`
`--adapter <name>`	Force adapter: `x`, `youtube`, `hn`, or `generic` (default: auto-detect)
`--headless`	Force headless Chrome (no visible window)
`--wait-for <mode>`	Interaction wait mode: `none` (default), `interaction`, or `force`
`--wait-for-interaction`	Alias for `--wait-for interaction`
`--wait-for-login`	Alias for `--wait-for interaction`
`--timeout <ms>`	Page load timeout (default: 30000)
`--interaction-timeout <ms>`	Login/CAPTCHA wait timeout (default: 600000 = 10 min)
`--interaction-poll-interval <ms>`	Poll interval for interaction checks (default: 1500)
`--download-media`	Download images/videos to local `imgs/` and `videos/`, rewrite markdown links. Requires `--output`
`--media-dir <dir>`	Base directory for downloaded media (default: same as `--output` directory)
`--cdp-url <url>`	Reuse existing Chrome DevTools Protocol endpoint
`--browser-path <path>`	Custom Chrome/Chromium binary path
`--chrome-profile-dir <path>`	Chrome user data directory (default: `BAOYU_CHROME_PROFILE_DIR` env or `./baoyu-skills/chrome-profile`)
`--debug-dir <dir>`	Write debug artifacts (document.json, markdown.md, page.html, network.json)

Agent Quality Gate

CRITICAL: treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without failing the CLI.

After every headless run, inspect the saved markdown. See references/quality-gate.md for the full checklist, recovery workflow, and capture-mode table. Read it whenever a run looks suspicious or the user asks about login/CAPTCHA handling.

Output Path Generation

The agent must construct the output file path — baoyu-fetch does not auto-generate paths.

Algorithm:

Determine base directory from EXTEND.md default_output_dir or default ./url-to-markdown/
Extract domain from URL (e.g., example.com)
Generate slug from URL path or page title (kebab-case, 2-6 words)
Construct: {base_dir}/{domain}/{slug}/{slug}.md — each URL gets its own directory so media files stay isolated
Conflict resolution: append timestamp {slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md

Pass the constructed path to --output. Media files (--download-media) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.

Adapters & Media

See references/adapters.md for the adapter catalog (X, YouTube, Hacker News, generic), per-adapter notes, the media download flow (ask / always / never), and the JSON output schema. Read it before answering adapter-specific questions or handling media prompts.

Environment Variables

Variable	Description
`BAOYU_CHROME_PROFILE_DIR`	Chrome user data directory (can also use `--chrome-profile-dir`)

Troubleshooting: Chrome not found → use --browser-path. Timeout → increase --timeout. Login/CAPTCHA → --wait-for interaction. Debug → --debug-dir to inspect captured HTML and network logs.

Extension Support

Custom configurations via EXTEND.md. See Preferences section above for paths and supported keys.