Transcription is the process of converting spoken audio or video into written text. It can be done manually, by a person who listens and types, or automatically, by speech recognition software that processes a recording in minutes. Most people now transcribe audio to text with automatic tools first and then fix the remaining mistakes by hand, because this is far faster and cheaper than typing everything from scratch.
This guide covers the whole topic in one place:
- What transcription means and the main types
- Manual vs automatic transcription, and when each makes sense
- How speech recognition actually works
- Step-by-step instructions to transcribe an audio file to text
- Free built-in dictation tools on Windows, Android, iPhone, and in Google Docs
- The difference between dictation and file transcription
- Practical tips for cleaner, more accurate transcripts
What transcription means
A transcript is a written record of speech. When you transcribe a recording, you produce text that says what the speakers said, in the same language they said it. That last part matters: transcription is not translation. Translation moves the words into another language; transcription keeps the language and changes only the format, from sound to text.
Transcription is also not the same thing as subtitles or captions. Captions are short text segments timed to a video. A transcript is a standalone text document. You can build captions from a transcript, but they serve different purposes: captions are read while watching, while a transcript is read instead of listening to the recording.
Depending on what the transcript is for, it may include extras: timestamps that mark when each phrase was said, and speaker labels that mark who said it. Both are common in interviews, meetings, and research recordings.
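Timestamps are also what makes it possible to turn a transcript into captions. The sketch below is a minimal illustration in Python: it takes timed, speaker-labeled segments and renders them in the widely used SRT subtitle layout. The Segment class and its field names are hypothetical and do not reflect the output of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed phrase. The field names are illustrative assumptions."""
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str   # label such as "Interviewer"
    text: str


def srt_time(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used in SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_time(seg.start)} --> {srt_time(seg.end)}\n"
            f"{seg.speaker}: {seg.text}\n"
        )
    return "\n".join(blocks)
```

Dropping the speaker prefix gives plain captions. Keeping it is closer to what an interview transcript usually needs.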
The main types of transcription
Not every transcript needs to capture every sound. There are three common styles:
Verbatim transcription. Every word is kept, including filler words, false starts, repetitions, and sometimes non-verbal sounds like laughter. This style is used in legal work and in research where the exact way something was said carries meaning.
Clean verbatim, also called intelligent transcription. Fillers like “um” and “you know” are removed, stumbles are smoothed out, and light grammar fixes are applied, but the meaning and the speaker’s wording stay intact. This is the default for business meetings, interviews, and podcasts because it is much easier to read.
Edited or summarized transcription. The text is restructured and condensed. Strictly speaking this is closer to note-taking than transcription, but many tools offer it as an output option.
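To see what separates verbatim from clean verbatim in practice, here is a deliberately naive Python sketch that strips a few filler words from a verbatim line. The filler list and the function name are assumptions made for illustration. Real tools use larger, language-specific lists and take context into account.

```python
import re

# Illustrative filler list; an assumption, not taken from any specific tool.
FILLERS = ["um", "uh", "er", "you know", "i mean"]

# Match a filler as a whole word, plus an optional trailing comma and spaces.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*",
    flags=re.IGNORECASE,
)


def clean_verbatim(text: str) -> str:
    """Remove listed fillers, tidy spacing, and re-capitalize the first letter."""
    cleaned = _FILLER_RE.sub("", text)
    cleaned = re.sub(r"\s{2,}", " ", cleaned)          # collapse double spaces
    cleaned = re.sub(r"\s+([,.!?])", r"\1", cleaned)   # no space before punctuation
    cleaned = cleaned.strip()
    return cleaned[:1].upper() + cleaned[1:]


print(clean_verbatim("Um, so we should you know ship it on uh Friday."))
```

A pattern like this cannot tell a filler "you know" from a genuine one ("Do you know the answer?"), which is one reason a human proofread still matters.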
On top of the style, you can layer speaker identification (who is talking) and timestamps. Automatic tools handle these with varying success: speaker separation is one of the harder problems in speech recognition, especially when people interrupt each other.
Manual vs automatic transcription
Manual transcription means a human listens to the recording and types it out, replaying difficult parts. An experienced transcriber typically needs around four hours to transcribe one hour of clear audio, and noticeably more when the audio is noisy or several people talk at once. The upside is judgment: a human can untangle crosstalk, recognize sarcasm, handle thick accents, and figure out specialized jargon from context. The downside is time and cost.
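The four-to-one figure makes it easy to budget manual work. A small Python sketch of the arithmetic follows. The function name is hypothetical, and the ratio parameter is an assumption you should raise for noisy or multi-speaker recordings.

```python
def manual_transcription_hours(audio_minutes: float, ratio: float = 4.0) -> float:
    """Estimate typing time in hours.

    ratio is hours of work per hour of audio; 4.0 mirrors the rough figure
    for clear audio and is only a starting point.
    """
    if audio_minutes < 0 or ratio <= 0:
        raise ValueError("audio_minutes must be >= 0 and ratio must be > 0")
    return audio_minutes * ratio / 60


print(manual_transcription_hours(90))  # a 90-minute interview at the default ratio
```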
Automatic transcription uses AI speech recognition. You upload a file, and software returns text in minutes for a small fraction of the price of human work. On clean, single-speaker audio the quality is good enough that you usually only need a light proofread. On poor audio the quality degrades, sometimes severely, and the software will confidently write the wrong word instead of marking it as unclear.
In practice, most workflows today are hybrid: let the machine produce the first draft, then have a person review it. You get most of the speed of automation with most of the reliability of human work.
How automatic speech recognition works
You do not need to understand the math to use transcription tools well, but a basic picture helps you predict when they will fail.
Speech recognition software takes the audio waveform, slices it into short frames, typically a few tens of milliseconds each, and feeds those frames into a neural network trained on enormous amounts of recorded speech. The network maps sound patterns to likely words, and a language model picks the word sequence that makes the most sense, which is how the system distinguishes “their” from “there”. Finally, punctuation and capitalization are restored, since in ordinary speech nobody says commas out loud.
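The slicing step is simple enough to sketch in a few lines of Python. The 25 ms frame length and 10 ms step used here are common choices in speech processing, but they are assumptions for this example, not settings of any specific product.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a list of audio samples into overlapping fixed-length frames.

    frame_ms is the length of each frame; hop_ms is how far the window moves
    between frames. Both defaults are illustrative.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop must each cover at least one sample")
    frames = []
    # Stop once a full frame no longer fits in the remaining samples.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames
```

With these defaults, one second of audio sampled at 16 kHz becomes 98 frames of 400 samples each. Each frame is what the neural network actually sees, which is why a burst of noise lasting a fraction of a second can wipe out a whole word.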
This explains the common failure points:
- Noise and distance. The network was mostly trained on reasonably clear speech. A phone lying on the far side of a meeting room produces audio it struggles with.
- Overlapping speakers. Two voices on top of each other blur the sound patterns.
- Rare words. Names, brands, and niche terminology are statistically unlikely, so the language model substitutes something more common.
- Accents and mixed languages. Recognition is best for speech similar to the training data, and weaker at the edges.
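The rare-word problem follows directly from how a language model scores candidates. The toy Python sketch below uses made-up word-pair counts in place of a real model. It picks whichever similar-sounding candidate has followed the previous word most often, so it gets “their” and “there” right but also replaces an unusual name with a common one.

```python
# Toy word-pair counts standing in for a language model. The numbers are invented.
BIGRAM_COUNTS = {
    ("over", "there"): 50,
    ("over", "their"): 1,
    ("lost", "their"): 40,
    ("lost", "there"): 2,
    ("meet", "john"): 30,
    ("meet", "jon"): 1,
}


def pick_word(previous_word, candidates):
    """Return the candidate seen most often after previous_word."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))


print(pick_word("over", ["their", "there"]))
print(pick_word("meet", ["jon", "john"]))
```

Real systems use neural models rather than lookup tables, but the bias is the same: when the audio is ambiguous, the statistically common word wins.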
Knowing this, you can dramatically improve results just by improving the recording, which we cover in the tips section below.
How to transcribe an audio file to text
Here is the general workflow that works with any modern transcription service:
- Check your recording. Listen to the first minute. If you can barely make out the words, the software will not do better. Common formats like mp3, wav, m4a, and ogg are accepted by virtually every service; video files like mp4 and mov are also accepted by most services, which extract the audio track automatically.
- Pick a tool and upload the file. For example, you can upload a file on the blablaType transcription page: it accepts mp3, wav, m4a, ogg, mp4, and mov, and the first few minutes are free, so you can test it on a real recording without entering a card. After the free minutes, you top up a balance and pay per minute of audio.
- Wait for processing. Automatic transcription of a typical recording takes minutes, not hours.
- Review the transcript against the audio. Pay special attention to names, numbers, dates, and any sentence that reads strangely. These are exactly the spots where recognition errors hide.
- Export or copy the text into wherever it needs to live: a document, a CRM, an email, show notes.
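If you process many recordings, the format check from the first step can be automated before anything is uploaded. This is a minimal Python sketch that looks only at the file extension, using the formats named above. The function name is hypothetical, and an extension check does not prove the file is actually playable.

```python
from pathlib import Path

# Extensions taken from the formats listed in the steps above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}
VIDEO_EXTENSIONS = {".mp4", ".mov"}


def classify_recording(filename: str) -> str:
    """Return "audio", "video", or "unsupported" based on the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return "unsupported"


for name in ["interview.MP3", "talk.mov", "notes.docx"]:
    print(name, classify_recording(name))
```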
One more thing worth checking with any service is what happens to your data. With blablaType, the uploaded file and the resulting text are not stored longer than needed to process the audio and return your result.
Free built-in ways to turn speech into text
Every major operating system now ships a free dictation feature. Important caveat: these tools transcribe live speech from your microphone and do not accept audio files. They are great for dictating notes, messages, and drafts, but they are not a substitute for file transcription.
Windows: voice typing with Win+H
- Click into any text field: a document, a chat, a browser form.
- Press the Windows key and H together.
- Start speaking. Your words appear in the field. You can turn on automatic punctuation in the voice typing settings (the gear icon on the small panel).
- Press Win+H again, or tap the microphone icon, to stop.
Voice typing is built into Windows 11 and recent versions of Windows 10 and needs an internet connection.
Google Docs: Voice typing
- Open a document in Google Docs in a supported desktop browser such as Chrome.
- Open the Tools menu and choose Voice typing.
- Click the microphone icon that appears and start speaking.
- Say punctuation out loud: “comma”, “period”, “question mark”, “new line”.
- Click the microphone again to stop.
This is one of the most popular free dictation setups because the text lands directly in a shareable document.
Android: the Gboard microphone
- Open any app where you can type and tap the text field so the Gboard keyboard appears.
- Tap the microphone icon on the keyboard.
- Speak. Recent versions insert punctuation automatically.
- Tap the microphone again to stop.
iPhone and iPad: built-in dictation
- Tap a text field to bring up the keyboard.
- Tap the microphone key on the keyboard.
- Speak. On recent iOS versions punctuation is added automatically, and you can also say it explicitly.
- Tap the microphone key again to finish.
If you do not see the microphone key, enable it in Settings under General, then Keyboard, then Enable Dictation.
Dictation vs file transcription: which one you need
The two workflows solve different problems, and picking the right one saves you a lot of friction.
You need file transcription when the speech already exists as a recording: an interview, a meeting recording, a lecture, a voice memo, a video. You upload the file and get text back.
You need dictation when you are creating new text and would rather speak than type. Most people speak several times faster than they type, so dictation pays off for emails, documents, messages, and notes. The built-in tools above are a free way to start. Dedicated apps go further: blablaType, for example, is a push-to-talk dictation app for Windows. You hold F9, speak, and the text is typed at your cursor in any program, whether that is Word, an email client, a messenger, or a browser. It recognizes 35 languages, can clean filler words out of your speech, and the 7-day trial does not ask for a card. If you are comparing options, see our overview of the best dictation software.
A useful rule of thumb: recording first and transcribing later adds a step. If the end goal is text and you are the one speaking, dictate directly and skip the audio file entirely.
How to get a more accurate transcript
Whatever tool you use, the recording quality decides most of the outcome. A few habits make a large difference:
- Get the microphone close to the speaker. A phone in the middle of the table beats a phone in someone’s pocket. A headset beats both.
- Cut background noise. Close the window, turn off music, move away from the espresso machine. Steady noise is bad; sudden noise over speech is worse.
- One voice at a time. In meetings and interviews, crosstalk is the single biggest source of garbage text. Brief the participants if you can.
- Say names and key terms clearly. If a rare name matters, spell it once on the recording; it gives you a reference point when editing.
- Proofread the risky spots. Numbers, dates, names, and negations (“can” vs “can’t”) deserve a check against the audio even when the rest looks clean.
- Keep the original audio until the transcript is finalized. You will want to re-listen to at least one disputed sentence.
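Part of the proofreading pass can be assisted by a script. The Python sketch below lists the numbers and negations in a transcript so you know which spots to check against the audio. The patterns and names are assumptions for illustration. Names cannot be found this simply, and a script only points at words, it cannot tell you whether they are right.

```python
import re

# Illustrative patterns; extend them for your own material.
RISKY_PATTERNS = {
    # Digits, optionally joined by separators, as in 3.5, 10:30, or 2024-05-01.
    "number": re.compile(r"\d+(?:[.,:/-]\d+)*"),
    # Contractions ending in n't (straight or curly apostrophe) and a few bare words.
    "negation": re.compile(
        r"\b\w+n['’]t\b|\b(?:can|cannot|not|never)\b", re.IGNORECASE
    ),
}


def flag_risky_spots(text):
    """Return (kind, matched_text) pairs in the order they appear in text."""
    hits = []
    for kind, pattern in RISKY_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group()))
    hits.sort()
    return [(kind, word) for _, kind, word in hits]


sample = "We can't ship before March 15, and the budget is 3.5 million."
for kind, word in flag_risky_spots(sample):
    print(kind, word)
```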
What people use transcription for
Transcription stopped being a niche service when it became cheap. Typical uses today:
- Meetings and calls: a searchable record of decisions and action items.
- Interviews and research: qualitative researchers and journalists work from text, not audio.
- Lectures and study: students turn recordings into reviewable notes.
- Podcasts and video: transcripts feed show notes, subtitles, and search engines, which cannot listen to audio.
- Accessibility: text makes spoken content available to deaf and hard-of-hearing audiences.
- Personal voice memos: a thought captured on a walk becomes an editable paragraph.
Where to start
If you have a recording in hand, the fastest path is to upload it to a file transcription service, skim the result against the audio, and fix the handful of misheard words. You can try this on the blablaType transcription page with your first few minutes free. If your goal is to produce new text by voice, start with the free built-in dictation in your OS, and move to a dedicated dictation app once you feel the limits. Either way, the era of typing out recordings by hand is over for everyone who does not specifically need a certified human transcript.