Convert any video into a clean, searchable text transcript with timestamps. Free, browser-based, and works on MP4, MOV, WebM, MKV, MP3 and WAV. Exports to SRT, VTT, or plain text.
Video is great for watching, terrible for searching. A 40-minute lecture you recorded on your phone is effectively invisible to you a week later — you can't grep it, can't quote it, can't feed it to another tool. Converting video to text flips that: once there's a transcript, the content becomes a first-class searchable document. This converter takes the video you already have and gives you that document in minutes, with timestamps on every segment so you can still jump back to the video frame that matters.
Three stages. First, your browser pulls the audio track out of the video file using ffmpeg.wasm and compresses it to 16 kHz mono MP3 (roughly 500 KB per minute). This happens entirely on your machine, and the original video never leaves your device. Second, the compressed audio is uploaded over HTTPS to our Cloudflare Worker, which forwards it to OpenAI's Whisper API. Third, the returned segments are written to a database and streamed back to your browser in real time via Server-Sent Events, so you watch the transcript appear as it's produced.
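To make the first stage concrete, here is a minimal sketch of the kind of ffmpeg argument list that produces 16 kHz mono MP3, plus a rough upload-size estimate based on the 500 KB per minute figure. The helper names and the 64k bitrate are assumptions for illustration (64 kbit/s works out to about 480 KB per minute); the converter's actual flags are not stated here.

```typescript
// Hypothetical helper: builds an ffmpeg argument list that drops the video
// stream and encodes the audio track as 16 kHz mono MP3.
function buildExtractArgs(input: string, output: string): string[] {
  return [
    "-i", input,     // source file (any container ffmpeg can read)
    "-vn",           // discard the video stream
    "-ac", "1",      // downmix to one channel (mono)
    "-ar", "16000",  // resample to 16 kHz
    "-b:a", "64k",   // assumed bitrate, close to 500 KB per minute
    output,          // e.g. "audio.mp3"; ffmpeg picks the codec from the extension
  ];
}

// Rough upload size in KB, using the "roughly 500 KB per minute" figure.
const KB_PER_MINUTE = 500;

function estimateUploadKB(durationSeconds: number): number {
  return Math.round((durationSeconds / 60) * KB_PER_MINUTE);
}

console.log(buildExtractArgs("lecture.mp4", "audio.mp3").join(" "));
console.log(estimateUploadKB(40 * 60)); // 20000
```

By this estimate, a 40-minute lecture uploads as about 20 MB of audio, regardless of how large the original video file is.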
Anything ffmpeg.wasm can read, which is essentially everything — MP4, MOV, MKV, WebM, AVI, FLV for video; MP3, WAV, M4A, FLAC, OGG for audio. There's no format conversion step you need to do beforehand. If the file plays in your browser, it will almost certainly transcribe.
Every segment carries start and end timestamps from the underlying Whisper output. When you download SRT or VTT, those timestamps go with it, written to the millisecond and ready to drop into a YouTube upload or a Premiere Pro subtitle track. When you download plain TXT, segments are merged into natural paragraphs so it reads like a document, with no timestamp clutter unless you want it.
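As an illustration of how segment timestamps map onto the two subtitle formats, the sketch below formats a list of segments as SRT and VTT. The `Segment` shape (`start` and `end` in seconds, plus `text`) is an assumption modeled on Whisper's verbose JSON response, and the function names are hypothetical. The main difference between the formats is small: SRT numbers its cues and uses a comma before the milliseconds, while VTT starts with a `WEBVTT` header and uses a period.

```typescript
// Assumed segment shape: times in seconds, as in Whisper's verbose JSON.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT).
function formatTimestamp(seconds: number, separator: "," | "."): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + separator + pad(ms, 3);
}

// SRT: numbered cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      (i + 1) + "\n" +
      formatTimestamp(seg.start, ",") + " --> " + formatTimestamp(seg.end, ",") + "\n" +
      seg.text.trim() + "\n")
    .join("\n");
}

// VTT: "WEBVTT" header, then cues separated by blank lines.
function toVtt(segments: Segment[]): string {
  const cues = segments.map((seg) =>
    formatTimestamp(seg.start, ".") + " --> " + formatTimestamp(seg.end, ".") + "\n" +
    seg.text.trim() + "\n");
  return "WEBVTT\n\n" + cues.join("\n");
}

console.log(formatTimestamp(3661.5, ",")); // 01:01:01,500
```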
Free tier: 10 minutes per file, 3 files per day. Pro tier: 4 hours per file, 200 files per day. File size caps are 200 MB free / 5 GB Pro.
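The limits above can be written as a small lookup table with a pre-upload check. This is only a sketch: the type names, the `checkUpload` function, the order of the checks, and the choice of 1 MB = 1024 × 1024 bytes are all assumptions, not a description of how the service enforces its limits.

```typescript
type Tier = "free" | "pro";

interface TierLimits {
  maxMinutesPerFile: number;
  maxFilesPerDay: number;
  maxFileBytes: number;
}

// Assumption: binary megabytes. The service may count decimal MB instead.
const MB = 1024 * 1024;

const LIMITS: Record<Tier, TierLimits> = {
  free: { maxMinutesPerFile: 10, maxFilesPerDay: 3, maxFileBytes: 200 * MB },
  pro: { maxMinutesPerFile: 4 * 60, maxFilesPerDay: 200, maxFileBytes: 5 * 1024 * MB },
};

// Returns null when the upload fits the tier, otherwise a short reason.
function checkUpload(
  tier: Tier,
  durationMinutes: number,
  fileBytes: number,
  filesUsedToday: number,
): string | null {
  const limits = LIMITS[tier];
  if (filesUsedToday >= limits.maxFilesPerDay) return "daily file limit reached";
  if (durationMinutes > limits.maxMinutesPerFile) return "file is too long";
  if (fileBytes > limits.maxFileBytes) return "file is too large";
  return null;
}

console.log(checkUpload("free", 40, 150 * MB, 0)); // file is too long
console.log(checkUpload("pro", 40, 150 * MB, 0));  // null
```

Under these numbers, the 40-minute lecture from the introduction exceeds the free tier's 10-minute cap but fits comfortably within Pro.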
Yes. Whisper outputs fully punctuated, correctly capitalized text, so you won't get a wall of lowercase.
Right now the transcript is returned as one continuous stream, without speaker labels. Speaker labels are on the Pro roadmap (powered by pyannote-audio). Until then, you can identify speakers manually during the edit step.
Whisper is surprisingly robust to background noise and light music. It struggles most with heavily overlapping speech or when the speaker is far from the mic. For those files you'll see lower confidence in the affected segments.
Yes. You own the transcript — we claim no rights to it. The audio file itself must of course respect the copyright of its original creator.