Voice transcriptions and speech recognition

January 8, 2019 — December 5, 2023

faster pussycat
machine learning
real time
signal processing
time series

The converse of generating speech from text is generating text from speech. We might do this in real time, to control something or to produce subtitles, or in batch mode, to turn an audio recording into text. Or we might use some hybrid of both, which is what I typically want in practice when I am taking dictation.

Figure 1: Using machines to wreck a nice beach is an older practice than I thought. Check out Volume 89 of Popular Science monthly for Lloyd Darling’s The Marvelous Voice Typewriter on the state-of-the-art dictation machine of 1916 (PDF version).

1 Dictation

Speaking as a realtime textual input method. This is a rapidly moving area.

See the following older roundups of dictation apps to start:

Here are some options culled from those lists and elsewhere of vague relevance to me:

2 Coding by voice

Figure 2

See Speaking in code: how to program by voice

Coding by voice command requires two kinds of software: a speech-recognition engine and a platform for voice coding. Dragon from Nuance, a speech-recognition software developer in Burlington, Massachusetts, is an advanced engine and is widely used for programming by voice, with Windows and Mac versions available. Windows also has its own built-in speech recognition system. On the platform side, VoiceCode by Ben Meyer and Talon by Ryan Hileman … are popular.

Two other platforms for voice programming are Caster and Aenea, the latter of which runs on Linux. Both are free and open source, and enable voice-programming functionality in Dragonfly, which is an open-source Python framework that links actions with voice commands detected by a speech-recognition engine.

See also: Programming by Voice May Be the Next Frontier in Software Development.

Full disclosure: I am researching this because I have temporarily disabled my hands. For the moment, for my purposes, the easiest option is to use Serenade for Python programming, OS speech recognition for prose typing, and to leave my other activities aside for now. If my arms were to be disabled for a longer period of time I would probably accept the learning curve of using Talon, which seems to solve more problems, at the cost of greater commitment.

One point of friction which I did not anticipate is that most of these tools will, for various reasons, do their best to switch off any music playing, any time I use them. For someone like me who can’t focus for three minutes straight without banging electro in the background this is not ideal. My current workaround is to play music on a different device so I can sneak beats past my unnecessarily diligent speech recognition tools trying to stifle background noise or whatever it is they are doing. This means that I am wearing two headsets, which looks funny, but to be honest it is not the worst fashion sacrifice I have been forced to make in the course of this particular injury.

Contrariwise, if I were to try to do the speech control stuff in an open plan office, coworking space or in the family living room, it would be excruciatingly irritating for anyone else who could hear me. My current workaround, when I am annoying some innocent bystander with my narration, is to accuse them of being ableist if they complain.

2.1 Doing without mouse

Try stylus or eye tracking systems, in addition to Talon, below.

2.2 GitHub Copilot Voice

Might be interesting.

2.3 Serenade

Serenade | Code with voice

Simple, low-lift, intuitive voice recognition for coding. Includes deep integration for various languages and also various code editors, including Visual Studio Code and the JetBrains IDEs. Free. Simple to use.

Supported languages: Python, JavaScript, HTML, Java, C / C++, TypeScript, CSS, Markdown, Dart, Bash, Sass, C#, Go, Ruby, Rust.

The experience is good for plain code. Editor integration is not awesome when using Jupyter, in line with the general rule that Jupyter makes everything more flaky and complicated.

2.4 Talon


Powerful hands-free input

  • Voice Control — talk to your computer
  • Noise Control — click with a back-beat
  • Eye Tracking — mouse where you look
  • Python Scripts — customize everything

Full length:

🤳Talon aims to bring programming, realtime video gaming, command line, and full desktop computer proficiency to people who have limited or no use of their hands, and vastly improve productivity and wow-factor of anyone who can use a computer.

  • System requirements:
      • macOS High Sierra (10.13) or newer. Talon is a universal2 build with native Apple Silicon support.
      • Linux / X11 (Ubuntu 18.04+, and most modern distros), Wayland support is currently limited to XWayland
      • Windows 8 or newer
  • Powerful voice control - Talon comes with a free speech recognition engine, and it is also compatible with Dragon with no additional setup.
  • Multiple algorithms for eye tracking mouse control (depends on a single Tobii 4C, Tobii 5 or equivalent eye tracker)
  • Noise recognition system (pop and hiss). Many more noises coming soon.
  • Scriptable with Python 3 (via embedded CPython, no need to install or configure Python on your host system).
  • Talon is very modular and adaptable - you can use eye tracking without speech recognition, or vice versa.

Worked example: Coding with voice dictation using Talon Voice.

2.5 Cursorless

Advanced coding extension for VS Code.

2.6 Dragonfly


Dragonfly is a speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software. It was written to make it very easy for Python macros, scripts, and applications to interface with speech recognition engines. Its design allows speech commands and grammar objects to be treated as first-class Python objects. Dragonfly can be used for general programming by voice. It is flexible enough to allow programming in any language, not just Python. It can also be used for speech-enabling applications, automating computer activities and dictating prose.

Dragonfly contains its own powerful framework for defining and executing actions. It includes actions for text input and key-stroke simulation. This framework is cross-platform, working on Windows, macOS and Linux (X11 only). See the actions sub-package documentation for more information, including code examples.

This project is a fork of the original t4ngo/dragonfly project.

Dragonfly currently supports the following speech recognition engines:

  • Dragon, a product of Nuance. All versions up to 15 (the latest) should be supported. Home, Professional Individual and previous similar editions of Dragon are supported. Other editions may work too
  • Windows Speech Recognition (WSR), included with Microsoft Windows Vista, Windows 7+, and freely available for Windows XP
  • Kaldi (under development)
  • CMU Pocket Sphinx (with caveats)
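To make the idea of "speech commands as first-class Python objects" concrete, here is a toy sketch in plain Python. It is not Dragonfly's actual API: the `Command` and `Grammar` classes, the `<name>` placeholder syntax and the example commands are all invented for illustration, and the "recognised utterance" is just a string that a real speech engine would have to supply.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Command:
    """A spoken pattern paired with the action it triggers (hypothetical class)."""
    spec: str                   # e.g. "new function <name>", written in lowercase
    action: Callable[..., str]  # returns the text or keystroke that would be emitted

    def match(self, utterance: str) -> Optional[dict]:
        # Turn "<name>" placeholders into named regex groups; other words match literally.
        parts = []
        for token in self.spec.split():
            if token.startswith("<") and token.endswith(">"):
                parts.append(f"(?P<{token[1:-1]}>.+)")
            else:
                parts.append(re.escape(token))
        m = re.fullmatch(r"\s+".join(parts), utterance.strip().lower())
        return m.groupdict() if m else None


class Grammar:
    """An ordered collection of commands; the first one that matches wins."""

    def __init__(self):
        self.commands = []

    def add(self, spec, action):
        self.commands.append(Command(spec, action))

    def dispatch(self, utterance):
        for command in self.commands:
            slots = command.match(utterance)
            if slots is not None:
                return command.action(**slots)
        return None  # nothing recognised


def snake_case(words: str) -> str:
    return "_".join(words.split())


# Example vocabulary (made up for this sketch).
grammar = Grammar()
grammar.add("new function <name>", lambda name: f"def {snake_case(name)}():")
grammar.add("import <module>", lambda module: f"import {module}")
grammar.add("save file", lambda: "<ctrl-s>")

print(grammar.dispatch("new function load audio file"))
```

A real framework would send the result to the focused window as keystrokes rather than returning a string, and would hand the grammar to the recognition engine so that it only listens for valid phrases.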

2.6.1 Mathematics

mrob95/mathfly-talon: Talon scripts for dictating mathematics into editors like LyX and Scientific Notebook 5.5.

2.7 VoiceCode


Your voice is the most efficient way to communicate. VoiceCode is a concise spoken language that controls your computer in real-time. When writing anything from emails to kernel code, to switching applications or navigating Photoshop – VoiceCode does the job faster and easier.

VoiceCode is different from other voice-command solutions in that commands can be chained and nested in any combination, allowing complex actions to be performed by a single spoken phrase.

By taking advantage of your brain’s natural aptitude for language you can control your computer more efficiently and naturally.
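What "chaining" commands in a single phrase might involve can be sketched roughly as follows. This is my own guess at the general idea, not how VoiceCode is implemented: the vocabulary, the keystroke names and the `parse_chain` function are all hypothetical, and the sketch handles only flat chains plus a simple repeat word, not true nesting.

```python
# Hypothetical vocabulary: spoken words -> keystroke name. Not VoiceCode's real command set.
VOCAB = {
    ("select", "all"): "cmd-a",
    ("copy",): "cmd-c",
    ("paste",): "cmd-v",
    ("next", "window"): "cmd-`",
    ("next",): "tab",
}
# Hypothetical modifiers that repeat the previous action.
REPEATS = {"twice": 2, "thrice": 3}


def parse_chain(phrase):
    """Split one spoken phrase into a list of actions, preferring the longest match."""
    words = phrase.lower().split()
    longest = max(len(key) for key in VOCAB)
    actions = []
    i = 0
    while i < len(words):
        # A repeat word duplicates whatever action came just before it.
        if words[i] in REPEATS and actions:
            actions.extend([actions[-1]] * (REPEATS[words[i]] - 1))
            i += 1
            continue
        # Try the longest candidate command first, so "next window" beats "next".
        for n in range(min(longest, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in VOCAB:
                actions.append(VOCAB[key])
                i += n
                break
        else:
            raise ValueError(f"unrecognised word: {words[i]!r}")
    return actions


print(parse_chain("select all copy next window paste"))
```

The longest-match rule is the design choice that matters here: without it, a short command that is a prefix of a longer one would always steal the first word.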

3 Transcribing recordings

Figure 3

Handy if you have a recording and you want to make it into a text thing offline.

3.1 Whisper

Whisper (Radford et al. 2022) is the recent speech transcription model casually released by OpenAI:

pip install -U openai-whisper
whisper audio.mp3  # transcribes
whisper audio.mp3 --language Japanese --task translate  # translates to English

Runs much faster on a GPU (it also works on a CPU, just slowly) but is otherwise free. Has now been integrated into lotsa things.
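Whisper can also be driven from Python, which is handy for the subtitling use case. The following is a minimal sketch: `whisper.load_model` and `model.transcribe` are the package's documented entry points, but the helper names (`format_timestamp`, `segments_to_srt`, `transcribe_to_srt`), the choice of the `base` model and the file name are my own assumptions.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT subtitles."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe an audio file and return SRT text (hypothetical helper)."""
    import whisper  # imported lazily so the helpers above work without it installed

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(audio_path)   # dict with "text", "segments", "language"
    return segments_to_srt(result["segments"])


if __name__ == "__main__":
    # Stand-in segments, shaped like the ones Whisper returns.
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 4.0, "text": " This is a test."},
    ]
    print(segments_to_srt(demo))
```

With the package installed, `print(transcribe_to_srt("audio.mp3"))` would produce subtitles for a real recording.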

3.2 Descript

Descript aims to integrate editing with transcription. In particular, it seems to allow editing the audio by editing the transcript, using voice-faking technology to synthesise the changed words.

3.3 Misc other

4 Phonetic transcription

It has been a long time since I took Phil Rose’s extravagantly weird undergraduate phonetics class, and I have forgotten much. A cheating tool:

I cannot easily see how to automate phonetic transcription, but surely that is around somewhere? Some voice transcription software may well use phonetics as an intermediate representation or even as the final output.

5 Incoming

6 References

Radford, Kim, Xu, et al. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.”