Science fiction author Kevin J. Anderson has dictated over fifty novels. Not typed, not voice-to-text in a quiet office. Dictated into a handheld recorder while hiking through the Colorado mountains, then had the audio transcribed into text he could edit. His reasoning was simple: the average person types around 40 words per minute. The average person speaks at around 150. Why wrestle with a keyboard when your mouth is three to four times faster?
Anderson’s method used to require a human transcriptionist or expensive software that needed 45 minutes of voice training before it could understand you. Today, you can hand a free, open-source AI model an audio file in any of 98 languages and get back a clean transcript in minutes. The technology that makes this possible is called speech-to-text, and its journey from a single-digit vocabulary in 1952 to near-human accuracy in 2024 is one of the wildest rides in computing history.
What Speech-to-Text Actually Means
Speech-to-text is software that listens to spoken audio and converts it into written words. Audio goes in, text comes out. It’s the mirror image of text-to-speech, which does the reverse. You’ll also see it called “speech recognition” or “automatic speech recognition” (ASR), which is the term researchers prefer. They all describe the same core idea: a machine that can hear you and write down what you said.
The concept sounds deceptively simple, but human speech is one of the messiest signals in nature. We slur words together, drop syllables, trail off mid-sentence, and rely on context to sort out whether we said “their,” “there,” or “they’re.” Teaching a machine to handle all of that has taken seven decades of work, and the story of how it happened involves a six-foot-tall machine that could count to nine, a linguist-firing quote that became legend, and a transcription tool that only exists because OpenAI needed more training data for ChatGPT.
Audrey Could Count to Nine
The first machine that could recognize human speech was built at Bell Labs in 1952. Its creators named it Audrey, short for Automatic Digit Recognizer, which was honest advertising: Audrey could recognize the spoken digits zero through nine. That was it. Ten words. And it only worked reliably for a single speaker, its own inventor. The machine filled a six-foot-tall rack of electronics and was, by any commercial standard, completely useless.
But it proved the idea was possible. A machine could hear sound and map it to meaning.
IBM pushed further in 1962 with the Shoebox, a device that understood 16 English words (the ten digits plus commands like “plus,” “minus,” and “total”). Its party trick was taking spoken math problems and feeding them to a connected adding machine that would print the answer. Vocabulary of 16 words. Processing time: glacial. Significance: enormous.
Then the U.S. government got involved. In 1971, DARPA (the Defense Advanced Research Projects Agency) launched a five-year, $15 million program (roughly $100 million in today’s dollars) with a single goal: build a system that could recognize at least 1,000 words. Four institutions competed. The winner was a system called Harpy, built by a Carnegie Mellon graduate student named Bruce Lowerre. In 1976, Harpy successfully recognized 1,011 words, a vocabulary comparable to a three-year-old child’s. It ran on a computer with 4 megabytes of RAM.
At the same time, a statistician at IBM named Fred Jelinek was quietly revolutionizing the field by treating speech recognition as a math problem rather than a linguistics problem. His group applied a statistical technique called Hidden Markov Models to speech, and the approach worked so well that it dominated the field for the next 30 years. A quote attributed to Jelinek became canonical lore in the research community: “Every time I fire a linguist, the performance of the speech recognizer goes up.” He denied ever saying it. Nobody believed him.
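The core of Jelinek’s statistical approach can be sketched in a few lines. The toy model below is purely illustrative (the states, observations, and probabilities are all invented, not taken from any real recognizer): it uses the Viterbi algorithm, the standard decoding method for Hidden Markov Models, to find the most likely sequence of hidden “sound” states given a series of acoustic observations.

```python
# Toy Hidden Markov Model decoded with the Viterbi algorithm.
# States, observations, and probabilities are invented for illustration.

states = ["s", "iy"]                      # hidden phoneme-like states
start = {"s": 0.6, "iy": 0.4}             # P(first state)
trans = {"s": {"s": 0.3, "iy": 0.7},      # P(next state | current state)
         "iy": {"s": 0.4, "iy": 0.6}}
emit = {"s": {"hiss": 0.8, "hum": 0.2},   # P(observed sound | state)
        "iy": {"hiss": 0.1, "hum": 0.9}}

def viterbi(observations):
    """Return the most probable hidden state path for the observations."""
    # Best-path probability ending in each state, plus the path itself.
    best = {s: start[s] * emit[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] * trans[p][s])
            new_best[s] = best[prev] * trans[prev][s] * emit[s][obs]
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    winner = max(states, key=lambda s: best[s])
    return paths[winner]

print(viterbi(["hiss", "hum", "hum"]))  # -> ['s', 'iy', 'iy']
```

Real systems of the era ran this same kind of search over thousands of states and probabilities estimated from recorded speech, but the principle is identical: pick the hidden sequence that best explains the sound.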
Dragon Learns to Listen
The most important name in consumer speech-to-text is Dragon, and its origin traces directly back to that DARPA competition. James and Janet Baker, researchers at Carnegie Mellon who had worked on one of the competing systems, founded Dragon Systems in 1982. Fifteen years later, in 1997, they released Dragon NaturallySpeaking 1.0, the first consumer software that let you talk at a natural pace and see your words appear on screen.
Before Dragon, consumer speech recognition required you to pause between every word. “Please.” Pause. “Write.” Pause. “This.” Pause. “Down.” Dragon NaturallySpeaking removed the pauses. You could speak at your normal conversational rate and the software would keep up. It supported 23,000 words at launch, cost $695, and processed speech at about 100 words per minute.
The catch: you had to spend 45 minutes reading passages aloud before the software could understand you. It was learning your voice, your accent, your particular way of shaping vowels. Initial accuracy hovered around 60-70%, climbing to 90% after that training period. Imperfect, but revolutionary. For the first time, an author could sit down and talk a chapter into existence.
The Baker family story has a painful coda. In 2000, they sold Dragon Systems to a Belgian corporation called Lernout & Hauspie, which was subsequently exposed as one of the largest financial frauds in European corporate history. The Bakers lost most of the proceeds. Dragon itself survived (it’s now owned by Microsoft), but the people who built it paid a steep price for their invention.
The Deep Learning Leap
By the early 2010s, speech recognition had plateaued. Hidden Markov Models had been refined for three decades and pushed accuracy to around 80-85% for clean audio, but the last mile proved stubborn. Then, in 2012, Geoffrey Hinton and his team at the University of Toronto applied deep neural networks to the problem, and the results were staggering. Word error rates dropped by 20-30% in a single year, a larger gain than the entire previous decade of incremental improvements combined. Within months, Microsoft, Google, and IBM had all switched their production speech systems to neural networks.
The field accelerated. Baidu’s Deep Speech project (2014-2015) showed that a single end-to-end neural network could replace the old multi-stage pipeline, working across both English and Mandarin. By 2019, Google’s systems reached “human parity” for clean English audio, meaning the AI made no more errors than a professional human transcriptionist.
Then came Whisper, and the origin story is almost too good. In 2021, OpenAI realized they were running out of high-quality text to train their large language models. Their solution: transcribe the internet’s vast library of YouTube videos and podcasts to create more text training data. They built Whisper as an internal tool to do that job. It worked so well that in September 2022, they released it to the public as open-source software. One of the most capable speech-to-text tools available today was, essentially, a byproduct of making GPT smarter.
Whisper was trained on 680,000 hours of audio across 98 languages, requires zero voice training, handles accents gracefully, and runs entirely offline on your own computer. No subscription. No data sent to anyone’s servers. It went from internal experiment to industry benchmark almost by accident.
How a Machine Turns Sound into Words
The traditional approach to speech recognition worked like an assembly line with specialized stations. First, the raw audio was converted into a compact mathematical representation (a process called feature extraction). Then an acoustic model analyzed those features and predicted which basic speech sounds, or phonemes, were being spoken. Finally, a language model took those phoneme predictions and figured out the most likely sequence of actual words, using its knowledge of English to decide whether you said “ice cream” or “I scream.”
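That final language-model stage can be shown with a toy example. The sketch below (the vocabulary and counts are invented for illustration) scores two candidate transcriptions by how often their word pairs appear in training data, which is essentially how a statistical recognizer decides between “ice cream” and “I scream.”

```python
# Toy bigram language model: rank candidate transcriptions by how
# often each two-word sequence appears in (invented) training counts.
bigram_counts = {
    ("want", "ice"): 30, ("ice", "cream"): 50,
    ("want", "i"): 2, ("i", "scream"): 1,
}

def score(words):
    """Product of bigram counts; zero if any pair was never seen."""
    total = 1
    for pair in zip(words, words[1:]):
        total *= bigram_counts.get(pair, 0)
    return total

candidates = [["want", "ice", "cream"], ["want", "i", "scream"]]
best = max(candidates, key=score)
print(" ".join(best))  # -> want ice cream
```

The acoustic model hears both candidates as nearly identical; the language model breaks the tie because “ice cream” simply occurs far more often in English text than “I scream.”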
This pipeline required enormous human labor. Linguists had to design phoneme sets. Phoneticians had to build pronunciation dictionaries by hand. Teams of annotators had to mark exactly when each word appeared in thousands of hours of audio.
Modern systems like Whisper skip all of that. A single neural network takes audio in one end and produces words out the other. No phonemes, no pronunciation dictionaries, no separately trained language model. The network learns all of those steps at once from massive amounts of data. It’s simpler to build, easier to train, and produces better results, which is roughly how every breakthrough in AI feels when you describe it after the fact.
What still makes the problem hard is everything that makes human speech messy. Background noise. Multiple people talking at once. Accents the model hasn’t heard enough of. Domain-specific vocabulary (your fantasy novel’s character names, for instance). And homophones, those eternal troublemakers, where the model has to figure out from context alone whether you meant “rain,” “reign,” or “rein.”
Why This Matters for Your Writing Life
Speech-to-text touches an author’s workflow in three practical ways, and each one is worth knowing about.
Dictation can transform your output. If you write 2,000 words a day by typing, that’s roughly 50 minutes of keyboard time. Speaking those same 2,000 words takes about 13 minutes. The math alone is compelling, but many authors report something more interesting: dictation changes the quality of their first drafts, not just the speed. Dialogue sounds more natural when you speak it aloud. Prose flows differently when it comes from your voice instead of your fingers. Dragon (now Dragon Professional Anywhere) remains the gold standard for dedicated dictation, with custom vocabulary support that lets you add character names and invented words. Google Docs offers free voice typing in any Chrome browser. Apple and Windows both have built-in dictation at the system level. For authors curious about trying it, the barrier to entry is zero.
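The time savings above are simple arithmetic, and making them concrete is a one-liner. This sketch uses the article’s figures of 40 and 150 words per minute; your own rates will vary.

```python
# Minutes needed to produce a daily word count at a given rate.
def minutes_needed(words, wpm):
    return words / wpm

daily_words = 2000
typing = minutes_needed(daily_words, 40)     # typing at 40 wpm
speaking = minutes_needed(daily_words, 150)  # dictating at 150 wpm

print(f"Typing:   {typing:.0f} minutes")     # 50 minutes
print(f"Speaking: {speaking:.0f} minutes")   # about 13 minutes
print(f"Speedup:  {typing / speaking:.2f}x") # 3.75x
```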
Transcription tools make audio useful. If you record brainstorming sessions, conduct interviews for research, or host a podcast, speech-to-text turns those recordings into searchable, editable text. Otter.ai provides real-time transcription with speaker identification and AI summaries (free up to 300 minutes per month). Descript lets you edit audio by editing the transcript: deleting a word from the text removes it from the recording. And Whisper, through user-friendly wrappers like MacWhisper, gives you offline transcription with no recurring costs and no data leaving your machine.
Accessibility is the quiet superpower. For authors dealing with repetitive strain injury, carpal tunnel syndrome, or chronic pain in their hands and wrists, dictation isn’t a productivity hack. It’s what keeps them writing. Dragon NaturallySpeaking is specifically recommended by the Yale Center for Dyslexia & Creativity because it removes the bottleneck of translating thought into typed symbols, letting ideas flow without the friction of a keyboard. Author Joanna Penn has spoken publicly about adopting dictation after developing wrist and forearm pain, calling it a way to future-proof her writing career. Even if your hands feel fine today, knowing these tools exist means you have a backup plan that doesn’t require you to stop writing.
The machine that started in 1952 by recognizing nine digits now understands 98 languages, runs on a laptop, and costs nothing. The 45 minutes of voice training that Dragon required in 1997 has been replaced by zero setup. And the author who speaks at 150 words per minute no longer needs to slow down to 40 just because a keyboard was the only way to get words on a page.