HUANGCHIHHUNGLeo/claude-real-video: Let Claude (or any LLM) actually watch a video — scene-aware, deduplicated frames + transcript, from a URL or local file. Runs locally, MIT. · GitHub

Let Claude — or any LLM — actually watch a video.
Most AI tools don’t really see a video. Paste a YouTube link into ChatGPT and it
reads the transcript, not the picture. Claude won’t take a video file at all.
Even Gemini, which can read video natively, has to send it up to Google and
samples frames at a fixed interval (1 fps by default), so fast cuts slip past.
claude-real-video does it differently, and locally: point it at a URL or a
file, and it pulls the frames that actually matter (every scene change, not a
fixed quota), throws away the near-duplicates, transcribes the audio, and hands
you a clean folder any LLM can read — on your own machine, nothing uploaded.
crv “https://www.youtube.com/watch?v=…”
# → crv-out/frames/*.jpg + crv-out/transcript.txt + crv-out/MANIFEST.txt
Then drop the frames + MANIFEST.txt into Claude / ChatGPT / Gemini and ask away.

Why not just sample frames?
Most “let an LLM watch a video” scripts (and Gemini’s own pipeline) grab frames
at a fixed interval — e.g. one per second. That over-samples a static
screencast and under-samples a fast-cut reel. claude-real-video is smarter:

fixed-interval sampling
claude-real-video

Frame selection
every N seconds
scene-change detection + density floor

Repeated shots (A-B-A cuts)
sent again every time
sliding-window dedup sends each shot once

Static slide (10 min)
~600 near-identical frames
collapses to 1 (dedup)

Fast-cut reel
misses frames between samples
catches each visual change

Audio
often ignored
Whisper transcript w/ language detect

Where the video goes
often uploaded to a cloud
stays on your machine

Input
usually local file only
URL (yt-dlp) or local file

You feed the model fewer, more meaningful frames — cheaper context, better
understanding.

pip install claude-real-video # core (frames + dedup)
pip install “claude-real-video(whisper)” # + audio transcription
System requirement: ffmpeg
ffmpeg / ffprobe are used for frame extraction and audio, and aren’t
pip-installable. Install them once:

OS
command

macOS
brew install ffmpeg

Linux
sudo apt install ffmpeg (or your distro’s package manager)

Windows
winget install Gyan.FFmpeg — or choco install ffmpeg — or download a build and add its bin\ folder to your PATH

Verify it’s on your PATH:

Transcription uses the whisper CLI (installed by the (whisper) extra, or
pip install openai-whisper). Whisper also relies on ffmpeg.
Works on macOS, Windows, and Linux — Python 3.10+.

# A YouTube / Instagram / TikTok / … link
crv “https://www.instagram.com/reel/XXXX/”

# A local file, English transcript, output to ./out
crv lecture.mp4 -o out –lang en

# Frames only, no transcription
crv clip.mp4 –no-transcribe

# A login-gated video (your own / authorised use): pass a Netscape cookie file
crv “https://…” –cookies cookies.txt
python -m claude_real_video … works as an alias for crv too.

flag
default
meaning

-o, –out
crv-out
output directory

–scene
0.30
scene-change sensitivity (lower = more frames)

–fps-floor
1.0
at least one frame every N seconds

–max-frames
150
hard cap on total frames

–lang
auto
Whisper language (en, zh, auto, …)

–dedup-threshold
8
% of pixels that must change for a frame to count as new; higher = fewer frames

–dedup-window
4
compare against the last N kept frames — a shot the model already saw doesn’t come back after a cutaway (1 = consecutive-only)

–report
off
keep dropped frames in ./dropped + write report.html visualising every keep/drop decision

–no-transcribe
off
skip audio

–keep-audio
off
also save the full soundtrack (audio.m4a) so audio models can hear it

–cookies
–
Netscape cookie file for login-gated sources

from claude_real_video import process

r = process(“https://youtu.be/…”, “out”, lang=”en”)
print(r.frame_count, r.transcript_path)

Fetch — yt-dlp for URLs (optional cookies), or copy a local file.
Extract — one chronological ffmpeg select pass grabs every scene change
plus a density floor (at least one frame every –fps-floor seconds), so
fast cuts and slow screencasts are both covered.
Dedup — real pixel difference (downscaled RGB, not a perceptual hash — hashes
go blind on flat colours and equal-luma hue changes) against a sliding window
of the last –dedup-window kept frames, so an A-B-A cutaway doesn’t re-send a
shot the model has already seen. –report writes report.html showing every
keep/drop decision with its diff %, for tuning.
Text — if the video already has subtitles (a sidecar .srt/.vtt next to a
local file, or an embedded subtitle track), those are used as the transcript —
faster and more accurate than re-transcribing. Only when there are no subtitles
does it fall back to Whisper on the audio (skipped cleanly if there’s no audio).
Audio (optional, –keep-audio) — save the full original soundtrack
(audio.m4a: music + speech + effects, copied losslessly when possible). The
transcript only has the words; the audio file lets a model that can listen
(Gemini, GPT-4o, …) actually hear the music and tone.
Manifest — MANIFEST.txt summarises everything for the model.

So the model can see (key frames), read (transcript) and — with –keep-audio —
hear (full soundtrack) the video. The transcript is plain text any model can read;
the tool doesn’t burn subtitles into the video — burning is a presentation choice,
not something needed to make a video AI-readable.

Only download content you have the right to. The –cookies option is for
your own, authorised access — don’t ship credentials in a repo.
Re-running overwrites the output directory.

MIT

Source link