Voice transcriptions and speech recognition

2019-01-07 — 2025-03-13

The converse of generating speech from text is generating text from speech. We might do this in real time, to control something or to produce subtitles, or in batch mode, to turn an audio recording into text. In practice, when I am taking dictation, what I typically want is some hybrid of the two.

Figure 1: Using machines to wreck a nice beach is an older practice than I thought. Check out Volume 89 of Popular Science monthly for Lloyd Darling’s *The Marvelous Voice Typewriter* on the state-of-the-art dictation machine of 1916 (PDF version).

1 Dictation

Speaking as a real-time textual input method. This is a rapidly moving area.

macOS includes dictation.
So does Windows.
So does VS Code, via the VS Code Speech extension, which is good because the macOS dictation somehow clashes with VS Code's input handling.

See the following older roundups of dictation apps to start:

Zapier dictation roundup
The rather grimmer Linux-specific roundup.

Here are some options culled from those lists and elsewhere of vague relevance to me:

dictation.io provides a frontend to Google speech recognition.
A classic is Nuance Dragon dictate.

2 Coding by voice

I was researching this because I temporarily disabled a hand. For the moment, the easiest option was to use Serenade for Python programming, OS speech recognition for prose typing, and to leave my other activities aside for now. If my arms were to be disabled for a longer period, I would have probably accepted the learning curve of using Talon, which seems to solve more problems, at the cost of greater commitment.

These days I would probably switch to a hybrid code generation workflow where I generate code by plain language, which removes a lot of the specialist speech recognition problems.

See Speaking in code: how to program by voice:

Coding by voice command requires two kinds of software: a speech-recognition engine and a platform for voice coding. Dragon from Nuance, a speech-recognition software developer in Burlington, Massachusetts, is an advanced engine and is widely used for programming by voice, with Windows and Mac versions available. Windows also has its own built-in speech recognition system. On the platform side, VoiceCode by Ben Meyer and Talon by Ryan Hileman … are popular.

Two other platforms for voice programming are Caster and Aenea, the latter of which runs on Linux. Both are free and open source, and enable voice-programming functionality in Dragonfly, which is an open-source Python framework that links actions with voice commands detected by a speech-recognition engine.

One point of friction which I did not anticipate is that most of these tools will, for various reasons, do their best to switch off any music playing, any time I use them. For someone like me who can’t focus for three minutes straight without banging electro in the background this is not ideal. My current workaround is to play music on a different device so I can sneak beats past my unnecessarily diligent speech recognition tools trying to stifle background noise or whatever it is they are doing. This means that I am wearing two headsets, which looks funny, but to be honest it is not the worst fashion sacrifice I have been forced to make in the course of this particular injury.

Contrariwise, if I were to try to do the speech control stuff in an open plan office, coworking space or in the family living room, it would be excruciatingly irritating for anyone else who could hear me. My current workaround, when I am annoying some innocent bystander with my narration, is to accuse them of being ableist if they complain.

2.1 Doing without mouse

Try a stylus or an eye-tracking system, in addition to Talon (below).

2.2 GitHub Copilot Voice

Might be interesting.

GitHub Copilot Voice

2.3 Serenade

Serenade | Code with voice

Simple, low-effort, intuitive voice recognition for coding. Includes deep integration with various languages and with various code editors, including Visual Studio Code and the JetBrains ones. Free. Simple to use.

Supported languages: Python, JavaScript, HTML, Java, C / C++, TypeScript, CSS, Markdown, Dart, Bash, Sass, C#, Go, Ruby, Rust.

The experience is good for plain code; I used it for about two months while I had a broken arm. Editor integration is not awesome when using Jupyter, in line with the general rule that Jupyter makes everything more flaky and complicated.

2.4 Talon

tl;dr:

Powerful hands-free input

Voice Control — talk to your computer

Noise Control — click with a back-beat

Eye Tracking — mouse where you look

Python Scripts — customise everything

Full length:

🤳Talon aims to bring programming, realtime video gaming, command line, and full desktop computer proficiency to people who have limited or no use of their hands, and vastly improve productivity and wow-factor of anyone who can use a computer.

System requirements:

macOS High Sierra (10.13) or newer. Talon is a universal2 build with native Apple Silicon support.

Linux / X11 (Ubuntu 18.04+, and most modern distros), Wayland support is currently limited to XWayland

Windows 8 or newer

Powerful voice control - Talon comes with a free speech recognition engine, and it is also compatible with Dragon with no additional setup.

Multiple algorithms for eye tracking mouse control (depends on a single Tobii 4C, Tobii 5 or equivalent eye tracker)

Noise recognition system (pop and hiss). Many more noises coming soon.

Scriptable with Python 3 (via embedded CPython, no need to install or configure Python on your host system).

Talon is very modular and adaptable - you can use eye tracking without speech recognition, or vice versa.

Worked example: Coding with voice dictation using Talon Voice.

2.5 Cursorless

Advanced coding extension for VS Code.

Cursorless is a spoken language for structural code editing, enabling developers to code by voice at speeds not possible with a keyboard. Cursorless decorates every token on the screen and defines a spoken language for rapid, high-level semantic manipulation of structured text.

Seems to be based on Talon.
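
To make the "decorates every token" idea concrete, here is a toy model in Python. It is not Cursorless's actual algorithm: each token gets a marker (Cursorless calls these "hats") on one of its letters, chosen so that no two tokens share a letter, which means a single spoken letter picks out exactly one token. The function name `assign_hats` and the first-free-letter rule are assumptions made for illustration.

```python
import re
from typing import Dict, Tuple


def assign_hats(line: str) -> Dict[str, Tuple[int, str]]:
    """Return {hat_letter: (token_start_offset, token)} for one line of code."""
    hats: Dict[str, Tuple[int, str]] = {}
    for match in re.finditer(r"\w+", line):
        token = match.group()
        # Use the first letter of the token that no earlier token has claimed.
        for char in token.lower():
            if char.isalpha() and char not in hats:
                hats[char] = (match.start(), token)
                break
        # Tokens whose letters are all taken stay undecorated in this toy version.
    return hats


hats = assign_hats("result = retry(request)")
for letter, (offset, token) in hats.items():
    print(f"{letter!r} -> {token!r} at column {offset}")
```

In this example `retry` cannot use `r`, which already belongs to `result`, so it falls back to `e`; `request` ends up with `q`. A command naming `q` would then refer unambiguously to `request`.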

2.6 Dragonfly

dictation-toolbox/dragonfly:

Dragonfly is a speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software. It was written to make it very easy for Python macros, scripts, and applications to interface with speech recognition engines. Its design allows speech commands and grammar objects to be treated as first-class Python objects. Dragonfly can be used for general programming by voice. It is flexible enough to allow programming in any language, not just Python. It can also be used for speech-enabling applications, automating computer activities and dictating prose.

Dragonfly contains its own powerful framework for defining and executing actions. It includes actions for text input and key-stroke simulation. This framework is cross-platform, working on Windows, macOS and Linux (X11 only). See the actions sub-package documentation for more information, including code examples.

This project is a fork of the original t4ngo/dragonfly project.

Dragonfly currently supports the following speech recognition engines:

Dragon, a product of Nuance. All versions up to 15 (the latest) should be supported. Home, Professional Individual and previous similar editions of Dragon are supported. Other editions may work too

Windows Speech Recognition (WSR), included with Microsoft Windows Vista, Windows 7+, and freely available for Windows XP

Kaldi (under development)

CMU Pocket Sphinx (with caveats)
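
The core idea in that description, spoken commands as ordinary Python objects bound to actions, can be illustrated without any speech engine. The following is a toy sketch in plain Python and deliberately not Dragonfly's API: `CommandRule`, `dispatch` and the `<text>` placeholder syntax are hypothetical, recognised speech is simulated with plain strings, and the actions return descriptions of keystrokes rather than pressing any keys.

```python
import re
from typing import Callable, Dict, List, Pattern, Tuple


class CommandRule:
    """Map spoken phrases to actions; '<text>' captures free dictation."""

    def __init__(self, mapping: Dict[str, Callable[..., str]]):
        self._patterns: List[Tuple[Pattern[str], Callable[..., str]]] = []
        for spec, action in mapping.items():
            # Escape the literal words, then turn the placeholder into a named group.
            pattern = re.escape(spec).replace(re.escape("<text>"), r"(?P<text>.+)")
            self._patterns.append((re.compile(f"^{pattern}$"), action))

    def dispatch(self, utterance: str) -> str:
        """Run the action of the first command that matches the utterance."""
        cleaned = utterance.strip().lower()
        for pattern, action in self._patterns:
            match = pattern.match(cleaned)
            if match:
                # Captured dictation (if any) is passed as keyword arguments.
                return action(**match.groupdict())
        raise ValueError(f"no command matches {utterance!r}")


# Hypothetical vocabulary: one fixed command and one that takes dictated text.
rule = CommandRule({
    "save file": lambda: "key:ctrl-s",
    "say <text>": lambda text: f"type:{text}",
})

print(rule.dispatch("save file"))
print(rule.dispatch("Say hello world"))
```

In a real setup the strings would come from one of the engines listed above, and the actions would simulate keystrokes or text input rather than return labels.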

2.6.1 Mathematics

mrob95/mathfly-talon: Talon scripts for dictating mathematics into editors like LyX and Scientific Notebook 5.5.

2.7 VoiceCode

VoiceCode:

Your voice is the most efficient way to communicate. VoiceCode is a concise spoken language that controls your computer in real-time. When writing anything from emails to kernel code, to switching applications or navigating Photoshop — VoiceCode does the job faster and easier.

VoiceCode is different from other voice-command solutions in that commands can be chained and nested in any combination, allowing complex actions to be performed by a single spoken phrase.

By taking advantage of your brain’s natural aptitude for language you can control your computer more efficiently and naturally.
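
Chaining means that one utterance carries several commands back to back. As a toy sketch of what that involves, in Python and not in VoiceCode's actual grammar, the following greedily matches the longest known command at each position in the phrase. The vocabulary, the keystroke labels and the name `parse_chain` are all made up for illustration, and nesting is not covered.

```python
from typing import Dict, List

# Hypothetical vocabulary: spoken phrase -> description of the keystrokes to send.
COMMANDS: Dict[str, str] = {
    "select line": "key:home shift-end",
    "copy": "key:ctrl-c",
    "next window": "key:alt-tab",
    "paste": "key:ctrl-v",
}


def parse_chain(utterance: str, commands: Dict[str, str] = COMMANDS) -> List[str]:
    """Split one spoken phrase into a sequence of actions, longest match first."""
    words = utterance.lower().split()
    longest = max(len(phrase.split()) for phrase in commands)
    actions: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then progressively shorter ones.
        for size in range(min(longest, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in commands:
                actions.append(commands[phrase])
                i += size
                break
        else:
            raise ValueError(f"unrecognised command at word {i}: {words[i]!r}")
    return actions


print(parse_chain("select line copy next window paste"))
```

The single phrase in the last line expands to four actions, which is the kind of saving that makes chained commands attractive.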

3 Transcribing recordings

Handy if you have a recording and you want to make it into a text thing offline.

3.1 Whisper

Whisper (Radford et al. 2022) is the recent speech transcription model casually released by OpenAI:

pip install -U openai-whisper
whisper audio.mp3  # transcribes
whisper audio.mp3 --language Japanese --task translate  # translates to English

Runs much faster with a GPU (it works on a CPU, just slowly) but is otherwise free. Has now been integrated into lotsa things.
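
The same model can be driven from Python, which is handy for the subtitling use case. A minimal sketch, assuming the `openai-whisper` package above is installed along with ffmpeg: `whisper.load_model` and `model.transcribe` are the package's entry points, and the result includes a `"segments"` list with `start`, `end` and `text` fields. The helper names `format_timestamp`, `segments_to_srt` and `transcribe_to_srt` are my own, not part of Whisper.

```python
from typing import Dict, Iterable


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Turn Whisper-style segments into the text of an SRT subtitle file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe a recording and return subtitles (needs openai-whisper)."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])


# Hand-written segments stand in for a real transcription here.
print(segments_to_srt([
    {"start": 0.0, "end": 2.5, "text": " Hello"},
    {"start": 2.5, "end": 4.0, "text": " world"},
]))
```

With a real recording, `transcribe_to_srt("audio.mp3")` would return the subtitle text, ready to write to a `.srt` file.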

3.2 Descript

Descript aims to integrate editing with transcription; in particular, it seems to allow editing audio by editing the transcript, using voice-fake technology.

3.3 Misc other

producthunt transcription options Weaponised social media deep fake here we come. USD 15/month for 10hr/month.
Rev transcription is a human-powered service (USD 1.25/minute).
Vatis Tech is AI-backed? USD 10/hr. Outputs video subtitles and identifies different speakers.
Audioburst offers transcription as part of their podcast service. The price is a mystery.
The all-manual option: Type it yourself.
wreally transcribe has built their own in-browser speech recogniser as well as a manual transcription UI. More augmented-manual than automatic. USD 20/year.

4 Phonetic transcription

It has been a long time since I took Phil Rose’s extravagantly weird undergraduate phonetics class, and I have forgotten much. A cheating tool:

toPhonetics

I cannot easily see how to automate phonetic transcription, but surely that is around somewhere? Some voice transcription software may well use phonetics as an intermediate representation or even as the final output.

5 Incoming

New open source tools to unlock speech and audio data

6 References

Radford, Kim, Xu, et al. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.”