baoyu-youtube-transcript

设计与多媒体已审计 @JimLiu v1.1.0

信任分: 92/100
兼容 Agent: 1

速查档案只列事实：领域、Agent、信任分、作者、原文章节。装与不装请看下方作者解读。

领域: 设计与多媒体
兼容 Agent: Claude Code
信任分: 92 / 100 · 已通过审计
作者 / 版本 / 许可: @JimLiu · v1.1.0 · 未声明 license
安装命令数: 1 条

需要注意： 未限定 allowed-tools，默认拥有全部工具权限。

想读作者英文原文？ ↓ 滚到正文区切换 · 在 GitHub 查看 ↗

解读由编辑根据原文凝练而成，命令、链接、术语均与作者原文一致；想看完整论述请切到右侧

baoyu-youtube-transcript 给 YouTube 视频链接，把字幕（subtitles / captions）下载下来——不需要 API key、不需要装浏览器。

设计思路

作者用了 YouTube 的 InnerTube API 直连——这是 YouTube 自己 App / 网页前端用的内部 API，速度比第三方接口稳。如果 YouTube 临时屏蔽了 InnerTube 路径（偶尔会发生），自动 fallback 到 yt-dlp（成熟的开源下载器）。两条路径都失败才报错。

支持的字幕类型

手工创建的字幕：作者上传的 SRT / VTT
自动生成的字幕：YouTube 语音识别的版本（CC）

两种都能拿。

输出格式

按 Output Formats 章节，输出可选：

纯文本（去时间码）
带时间码（SRT 风格）
按 chapters 分段
按 speakers 分段（多说话人视频）

可以直接拿去做笔记、做翻译、做摘要。

缓存机制

本地缓存——同一 URL 第二次抓直接走缓存，不再请求。这对调试 prompt 特别友好。

工作流

按 Workflow 章节：① 解析 URL（视频 / 频道 / 播放列表）；② 选输出格式；③ 走 InnerTube → fallback 到 yt-dlp；④ 后处理（chapters、speakers、清洗）；⑤ 写入 Output Directory。

Optional Environment Variables

列出可调环境变量：缓存路径、超时、yt-dlp 路径、代理设置。

适合谁

内容创作者把别人视频里的核心观点摘下来做笔记
学生 / 研究人员把课程 / 讲座转成可搜索文本
翻译者把外语视频字幕翻成中文（配 baoyu-translate）
自媒体二创：原视频字幕 → 重写文章 → 发公众号

何时不该用

完整下载视频本身——这只下字幕
没有任何字幕的视频（无 CC、无作者字幕）
受地区限制无法访问的视频（除非你配代理）

配套

baoyu-translate：翻译外语视频字幕
baoyu-format-markdown：把字幕整理成可读 Markdown
baoyu-post-to-wechat：发成公众号文章
obsidian-vault：归档到笔记库

YouTube Transcript

Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly and automatically falls back to yt-dlp when YouTube blocks the direct API path.

Fetches video metadata and cover image on first run, caches raw data for fast re-formatting.

Script Directory

Scripts in scripts/ subdirectory. {baseDir} = this SKILL.md's directory path. Resolve ${BUN_X} runtime: if bun installed → bun; if npx available → npx -y bun; else suggest installing bun. Replace {baseDir} and ${BUN_X} with actual values.

Script	Purpose
`scripts/main.ts`	Transcript download CLI

Usage

# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>

# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja

# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps

# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters

# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers

# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt

# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans

# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list

# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh

Options

Option	Description	Default
`<url-or-id>`	YouTube URL or video ID (multiple allowed)	Required
`--languages <codes>`	Language codes, comma-separated, in priority order	`en`
`--format <fmt>`	Output format: `text`, `srt`	`text`
`--translate <code>`	Translate to specified language code
`--list`	List available transcripts instead of fetching
`--timestamps`	Include `[HH:MM:SS → HH:MM:SS]` timestamps per paragraph	on
`--no-timestamps`	Disable timestamps
`--chapters`	Chapter segmentation from video description
`--speakers`	Raw transcript with metadata for speaker identification
`--exclude-generated`	Skip auto-generated transcripts
`--exclude-manually-created`	Skip manually created transcripts
`--refresh`	Force re-fetch, ignore cached data
`-o, --output <path>`	Save to specific file path	auto-generated
`--output-dir <dir>`	Base output directory	`youtube-transcript`

Optional Environment Variables

Variable	Description
`YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER`	Passed to `yt-dlp --cookies-from-browser` during fallback, e.g. `chrome`, `safari`, `firefox`, or `chrome:Profile 1`

Input Formats

Accepts any of these as video input:

Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
Short URL: https://youtu.be/dQw4w9WgXcQ
Embed URL: https://www.youtube.com/embed/dQw4w9WgXcQ
Shorts URL: https://www.youtube.com/shorts/dQw4w9WgXcQ
Video ID: dQw4w9WgXcQ

Output Formats

Format	Extension	Description
`text`	`.md`	Markdown with frontmatter (incl. `description`), title heading, summary, optional TOC/cover/timestamps/chapters/speakers
`srt`	`.srt`	SubRip subtitle format for video players

Output Directory

youtube-transcript/
├── .index.json                          # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                        # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json              # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json        # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg                    # Video thumbnail
    ├── transcript.md                    # Markdown transcript (generated from sentences)
    └── transcript.srt                   # SRT subtitle (generated from raw snippets, if --format srt)

{channel-slug}: Channel name in kebab-case
{title-full-slug}: Full video title in kebab-case

The --list mode outputs to stdout only (no file saved).

Caching

On first fetch, the script saves:

meta.json — video metadata, chapters, cover image path, language info
transcript-raw.json — raw transcript snippets from YouTube API ({ text, start, duration }[])
transcript-sentences.json — sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split by sentence-ending punctuation (.?!…。？！ etc.), timestamps proportionally allocated by character length, CJK-aware text merging
imgs/cover.jpg — video thumbnail

Subsequent runs for the same video use cached data (no network calls). Use --refresh to force re-fetch. If a different language is requested, the cache is automatically refreshed.

When YouTube returns anti-bot / blocked responses on the direct InnerTube path, the script retries with alternate client identities and then falls back to yt-dlp if available. If fallback is needed but yt-dlp is unavailable, the agent should decide how to make yt-dlp available and continue rather than pushing the installation decision to the user.

SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.

Workflow

When user provides a YouTube URL and wants the transcript:

Run with --list first if the user hasn't specified a language, to show available options
Always single-quote the URL when running the script — zsh treats ? as a glob wildcard, so an unquoted YouTube URL causes "no matches found": use 'https://www.youtube.com/watch?v=ID'
Default: run with --chapters --speakers for the richest output (chapters + speaker identification)
The script auto-saves cached data + output file and prints the file path
For --speakers mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels

When user only wants a cover image or metadata, running the script with any option will also cache meta.json and imgs/cover.jpg.

When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.

Chapter & Speaker Workflow

Chapters (`--chapters`)

The script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves as .md with a Table of Contents. No further processing needed.

If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.

Speaker Identification (`--speakers`)

Speaker identification requires AI processing. The script outputs a raw .md file containing:

YAML frontmatter with video metadata (title, channel, date, cover, description, language)
Video description (for speaker name extraction)
Chapter list from description (if available)
Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)

After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:

Read the saved .md file
Read the prompt template at {baseDir}/prompts/speaker-transcript.md
Process the raw transcript following the prompt:
- Identify speakers using video metadata (title → guest, channel → host, description → names)
- Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
- Segment into chapters (use description chapters if available, else create from topic shifts)
- Format with **Speaker Name:** labels, paragraph grouping (2-4 sentences), and [HH:MM:SS → HH:MM:SS] timestamps
Overwrite the .md file with the processed transcript (keep the YAML frontmatter)

When --speakers is used, --chapters is implied — the processed output always includes chapter segmentation.

Error Cases

Error	Meaning
Transcripts disabled	Video has no captions at all
No transcript found	Requested language not available
Video unavailable	Video deleted, private, or region-locked
IP blocked	Too many requests, try again later
Age restricted	Video requires login for age verification
bot detected	The script retries alternate clients and then `yt-dlp`; if fallback tooling is missing, the agent should resolve that itself, otherwise if it still fails try `YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER=safari` (or your browser)

设计思路

支持的字幕类型

输出格式

缓存机制

工作流

Optional Environment Variables

适合谁

何时不该用

配套

YouTube Transcript

Script Directory

Usage

Options

Optional Environment Variables

Input Formats

Output Formats

Output Directory

Caching

Workflow

Chapter & Speaker Workflow

Chapters (--chapters)

Speaker Identification (--speakers)

Error Cases

Chapters (`--chapters`)

Speaker Identification (`--speakers`)