{"id":6463,"date":"2026-07-03T02:25:46","date_gmt":"2026-07-02T19:25:46","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=6463"},"modified":"2026-07-03T02:25:46","modified_gmt":"2026-07-02T19:25:46","slug":"huangchihhungleo-claude-real-video-let-claude-or-any-llm-actually-watch-a-video-scene-aware-deduplicated-frames-transcript-from-a-url-or-local-file-runs-locally-mit-%c2%b7-github","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=6463","title":{"rendered":"HUANGCHIHHUNGLeo\/claude-real-video: Let Claude (or any LLM) actually watch a video \u2014 scene-aware, deduplicated frames + transcript, from a URL or local file. Runs locally, MIT. \u00b7 GitHub"},"content":{"rendered":"<p>Let Claude \u2014 or any LLM \u2014 actually watch a video.<br \/>\nMost AI tools don&#8217;t really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won&#8217;t take a video file at all. Even Gemini, which can read video natively, has to send it up to Google and samples frames at a fixed interval (1 fps by default), so fast cuts slip past.<br \/>\nclaude-real-video does it differently, and locally: point it at a URL or a file, and it pulls the frames that actually matter (every scene change, not a fixed quota), throws away the near-duplicates, transcribes the audio, and hands you a clean folder any LLM can read \u2014 on your own machine, nothing uploaded.<br \/>\ncrv &quot;https:\/\/www.youtube.com\/watch?v=&#8230;&quot;<br \/>\n# \u2192 crv-out\/frames\/*.jpg  +  crv-out\/transcript.txt  +  crv-out\/MANIFEST.txt<br \/>\nThen drop the frames + MANIFEST.txt into Claude \/ ChatGPT \/ Gemini and ask away.<\/p>\n<p>Why not just sample frames?<br \/>\nMost &#8220;let an LLM watch a video&#8221; scripts (and Gemini&#8217;s own pipeline) grab frames at a fixed interval \u2014 e.g. one per second. 
That over-samples a static screencast and under-samples a fast-cut reel. claude-real-video is smarter:<\/p>\n<table>\n<thead>\n<tr><th><\/th><th>fixed-interval sampling<\/th><th>claude-real-video<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Frame selection<\/td><td>every N seconds<\/td><td>scene-change detection + density floor<\/td><\/tr>\n<tr><td>Repeated shots (A-B-A cuts)<\/td><td>sent again every time<\/td><td>sliding-window dedup sends each shot once<\/td><\/tr>\n<tr><td>Static slide (10 min)<\/td><td>~600 near-identical frames<\/td><td>collapses to 1 (dedup)<\/td><\/tr>\n<tr><td>Fast-cut reel<\/td><td>misses frames between samples<\/td><td>catches each visual change<\/td><\/tr>\n<tr><td>Audio<\/td><td>often ignored<\/td><td>Whisper transcript with language detection<\/td><\/tr>\n<tr><td>Where the video goes<\/td><td>often uploaded to a cloud<\/td><td>stays on your machine<\/td><\/tr>\n<tr><td>Input<\/td><td>usually local file only<\/td><td>URL (yt-dlp) or local file<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>You feed the model fewer, more meaningful frames \u2014 cheaper context, better understanding.<\/p>\n<p>pip install claude-real-video              # core (frames + dedup)<br \/>\npip install &quot;claude-real-video[whisper]&quot;   # + audio transcription<\/p>\n<p>System requirement: ffmpeg<br \/>\nffmpeg \/ ffprobe are used for frame extraction and audio, and aren&#8217;t pip-installable. Install them once:<\/p>\n<table>\n<thead>\n<tr><th>OS<\/th><th>command<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>macOS<\/td><td>brew install ffmpeg<\/td><\/tr>\n<tr><td>Linux<\/td><td>sudo apt install ffmpeg (or your distro&#8217;s package manager)<\/td><\/tr>\n<tr><td>Windows<\/td><td>winget install Gyan.FFmpeg \u2014 or choco install ffmpeg \u2014 or download a build and add its bin\\ folder to your PATH<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Verify it&#8217;s on your PATH:<\/p>\n<p>Transcription uses the whisper CLI (installed by the [whisper] extra, or pip install openai-whisper). 
Whisper also relies on ffmpeg.<br \/>\nWorks on macOS, Windows, and Linux \u2014 Python 3.10+.<\/p>\n<p># A YouTube \/ Instagram \/ TikTok \/ &#8230; link<br \/>\ncrv &quot;https:\/\/www.instagram.com\/reel\/XXXX\/&quot;<\/p>\n<p># A local file, English transcript, output to .\/out<br \/>\ncrv lecture.mp4 -o out --lang en<\/p>\n<p># Frames only, no transcription<br \/>\ncrv clip.mp4 --no-transcribe<\/p>\n<p># A login-gated video (your own \/ authorised use): pass a Netscape cookie file<br \/>\ncrv &quot;https:\/\/&#8230;&quot; --cookies cookies.txt<\/p>\n<p>python -m claude_real_video &#8230; works as an alias for crv too.<\/p>\n<table>\n<thead>\n<tr><th>flag<\/th><th>default<\/th><th>meaning<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>-o, --out<\/td><td>crv-out<\/td><td>output directory<\/td><\/tr>\n<tr><td>--scene<\/td><td>0.30<\/td><td>scene-change sensitivity (lower = more frames)<\/td><\/tr>\n<tr><td>--fps-floor<\/td><td>1.0<\/td><td>at least one frame every N seconds<\/td><\/tr>\n<tr><td>--max-frames<\/td><td>150<\/td><td>hard cap on total frames<\/td><\/tr>\n<tr><td>--lang<\/td><td>auto<\/td><td>Whisper language (en, zh, auto, &#8230;)<\/td><\/tr>\n<tr><td>--dedup-threshold<\/td><td>8<\/td><td>% of pixels that must change for a frame to count as new; higher = fewer frames<\/td><\/tr>\n<tr><td>--dedup-window<\/td><td>4<\/td><td>compare against the last N kept frames \u2014 a shot the model already saw doesn&#8217;t come back after a cutaway (1 = consecutive-only)<\/td><\/tr>\n<tr><td>--report<\/td><td>off<\/td><td>keep dropped frames in .\/dropped + write report.html visualising every keep\/drop decision<\/td><\/tr>\n<tr><td>--no-transcribe<\/td><td>off<\/td><td>skip audio<\/td><\/tr>\n<tr><td>--keep-audio<\/td><td>off<\/td><td>also save the full soundtrack (audio.m4a) so audio models can hear it<\/td><\/tr>\n<tr><td>--cookies<\/td><td>\u2013<\/td><td>Netscape cookie file for login-gated sources<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>from claude_real_video import process<\/p>\n<p>r = process(&quot;https:\/\/youtu.be\/&#8230;&quot;, &quot;out&quot;, lang=&quot;en&quot;)<br \/>\nprint(r.frame_count, 
r.transcript_path)<\/p>\n<ol>\n<li>Fetch \u2014 yt-dlp for URLs (optional cookies), or copy a local file.<\/li>\n<li>Extract \u2014 one chronological ffmpeg select pass grabs every scene change plus a density floor (at least one frame every --fps-floor seconds), so fast cuts and slow screencasts are both covered.<\/li>\n<li>Dedup \u2014 real pixel difference (downscaled RGB, not a perceptual hash \u2014 hashes go blind on flat colours and equal-luma hue changes) against a sliding window of the last --dedup-window kept frames, so an A-B-A cutaway doesn&#8217;t re-send a shot the model has already seen. --report writes report.html showing every keep\/drop decision with its diff %, for tuning.<\/li>\n<li>Text \u2014 if the video already has subtitles (a sidecar .srt\/.vtt next to a local file, or an embedded subtitle track), those are used as the transcript \u2014 faster and more accurate than re-transcribing. Only when there are no subtitles does it fall back to Whisper on the audio (skipped cleanly if there&#8217;s no audio).<\/li>\n<li>Audio (optional, --keep-audio) \u2014 save the full original soundtrack (audio.m4a: music + speech + effects, copied losslessly when possible). The transcript only has the words; the audio file lets a model that can listen (Gemini, GPT-4o, \u2026) actually hear the music and tone.<\/li>\n<li>Manifest \u2014 MANIFEST.txt summarises everything for the model.<\/li>\n<\/ol>\n<p>So the model can see (key frames), read (transcript) and \u2014 with --keep-audio \u2014 hear (full soundtrack) the video. The transcript is plain text any model can read; the tool doesn&#8217;t burn subtitles into the video \u2014 burning is a presentation choice, not something needed to make a video AI-readable.<\/p>\n<p>Only download content you have the right to. 
The --cookies option is for your own, authorised access \u2014 don&#8217;t ship credentials in a repo.<br \/>\nRe-running overwrites the output directory.<\/p>\n<p>MIT<br \/>\n<a href=\"https:\/\/github.com\/HUANGCHIHHUNGLeo\/claude-real-video\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let Claude \u2014 or any LLM \u2014 actually watch a video. Most AI tools don&#8217;t really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won&#8217;t take a video file at all. Even Gemini, which can read video natively, has to send it up to Google [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6464,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[],"class_list":["post-6463","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6463","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6463"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6463\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/6464"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6463"}],"wp:term":[{"taxonomy":"
category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6463"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6463"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}