Voice transcriptions and speech recognition



The converse of voice fakes: generating text from speech, a.k.a. speech-to-text. We might do this in real time, to control something, or “off-line”, to turn an audio recording into text, or something in between.

Using machines to wreck a nice beach is an older practice than I thought. Check out Volume 89 of Popular Science Monthly for Lloyd Darling’s The Marvelous Voice Typewriter, on the state-of-the-art dictation machine of 1916 (PDF version).

Dictation

Speaking as a realtime textual input method. See the following roundups of dictation apps to start:

Here are some options culled from those lists and elsewhere of vague relevance to me:

Automation and coding

See Speaking in code: how to program by voice

Coding by voice command requires two kinds of software: a speech-recognition engine and a platform for voice coding. Dragon from Nuance, a speech-recognition software developer in Burlington, Massachusetts, is an advanced engine and is widely used for programming by voice, with Windows and Mac versions available. Windows also has its own built-in speech recognition system. On the platform side, VoiceCode by Ben Meyer and Talon by Ryan Hileman … are popular.

Two other platforms for voice programming are Caster and Aenea, the latter of which runs on Linux. Both are free and open source, and enable voice-programming functionality in Dragonfly, which is an open-source Python framework that links actions with voice commands detected by a speech-recognition engine.

See also: Programming by Voice May Be the Next Frontier in Software Development.

Full disclosure: I am researching this because I have temporarily disabled my hands. For the moment, for my purposes, the easiest option is to use Serenade for Python programming, OS speech recognition for prose typing, and to leave my other activities aside for now. If my arms were to be disabled for a longer period of time I would probably accept the learning curve of using Talon, which seems to solve more problems, at the cost of greater commitment.

One point of friction which I did not anticipate is that most of these tools will, for various reasons, do their best to switch off any music playing any time I use them. For someone like me who can't focus for three minutes straight without banging electro in the background, this is tricky. My current workaround is to play music on a different device so I can sneak beats past my unnecessarily diligent speech recognition tools trying to control background noise. This means that I am wearing two headsets, which looks funny, but to be honest it is not the worst fashion sacrifice I have been forced to make in the course of this particular injury.

Contrariwise, if I were to try to do the speech control stuff in an open plan office, coworking space or in the family living room, it would be excruciatingly irritating for anyone else who could hear me. My current workaround, when I am annoying some innocent bystander, is to accuse them of being ableist.

Serenade

Serenade | Code with voice

Simple, low-lift, intuitive voice recognition for coding. Includes deep integration for various languages and also various code editors, including Visual Studio Code and the JetBrains ones. Free. Simple to use.

Supported languages:

  • Python
  • JavaScript
  • HTML
  • Java
  • C / C++
  • TypeScript
  • CSS
  • Markdown
  • Dart
  • Bash
  • Sass
  • C#
  • Go
  • Ruby
  • Rust

The experience is very good for plain code. Editor integration is not awesome when using Jupyter, in line with the general rule that Jupyter makes everything more flaky and complicated.

Talon

tl;dr

Powerful hands-free input

  • Voice Control — talk to your computer
  • Noise Control — click with a back-beat
  • Eye Tracking — mouse where you look
  • Python Scripts — customize everything

Full length:

🤳Talon aims to bring programming, realtime video gaming, command line, and full desktop computer proficiency to people who have limited or no use of their hands, and vastly improve productivity and wow-factor of anyone who can use a computer.

  • System requirements:
      • macOS High Sierra (10.13) or newer. Talon is a universal2 build with native Apple Silicon support.
      • Linux / X11 (Ubuntu 18.04+, and most modern distros), Wayland support is currently limited to XWayland
      • Windows 8 or newer

  • Powerful voice control - Talon comes with a free speech recognition engine, and it is also compatible with Dragon with no additional setup.

  • Multiple algorithms for eye tracking mouse control (depends on a single Tobii 4C, Tobii 5 or equivalent eye tracker)

  • Noise recognition system (pop and hiss). Many more noises coming soon.

  • Scriptable with Python 3 (via embedded CPython, no need to install or configure Python on your host system).

  • Talon is very modular and adaptable - you can use eye tracking without speech recognition, or vice versa.

Worked example: Coding with voice dictation using Talon Voice.

Dragonfly

dictation-toolbox/dragonfly:

Dragonfly is a speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software. It was written to make it very easy for Python macros, scripts, and applications to interface with speech recognition engines. Its design allows speech commands and grammar objects to be treated as first-class Python objects. Dragonfly can be used for general programming by voice. It is flexible enough to allow programming in any language, not just Python. It can also be used for speech-enabling applications, automating computer activities and dictating prose.

Dragonfly contains its own powerful framework for defining and executing actions. It includes actions for text input and key-stroke simulation. This framework is cross-platform, working on Windows, macOS and Linux (X11 only). See the actions sub-package documentation for more information, including code examples.

This project is a fork of the original t4ngo/dragonfly project.

Dragonfly currently supports the following speech recognition engines:

  • Dragon, a product of Nuance. All versions up to 15 (the latest) should be supported. Home, Professional Individual and previous similar editions of Dragon are supported. Other editions may work too
  • Windows Speech Recognition (WSR), included with Microsoft Windows Vista, Windows 7+, and freely available for Windows XP
  • Kaldi (under development)
  • CMU Pocket Sphinx (with caveats)
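
To make the "commands as first-class Python objects" idea concrete, here is a minimal sketch in plain Python. It does not use Dragonfly or any speech engine: `ToyMappingRule`, `TypeText` and `snake_case` are hypothetical names invented for illustration, and the "recognised" utterance is just a string that we pretend an engine has handed over. Real Dragonfly grammars are richer than this.

```python
import re
from typing import Callable, Dict, List


def snake_case(words: str) -> str:
    """Join dictated words with underscores: 'parse file' -> 'parse_file'."""
    return "_".join(words.split())


class TypeText:
    """Hypothetical action object: 'types' text by appending it to a buffer."""

    def __init__(self, template: str, transform: Callable[[str], str] = str):
        self.template = template
        self.transform = transform

    def execute(self, buffer: List[str], extras: Dict[str, str]) -> None:
        # Clean up each dictated slot, then fill it into the template.
        cleaned = {name: self.transform(value) for name, value in extras.items()}
        buffer.append(self.template.format(**cleaned))


class ToyMappingRule:
    """Hypothetical rule: maps spoken-phrase specs to action objects."""

    def __init__(self, mapping: Dict[str, TypeText]):
        self._rules = [(self._compile(spec), action) for spec, action in mapping.items()]

    @staticmethod
    def _compile(spec: str) -> "re.Pattern[str]":
        # "define function <name>" becomes a regex with a named group "name".
        parts = []
        for token in spec.split():
            slot = re.fullmatch(r"<(\w+)>", token)
            parts.append(f"(?P<{slot.group(1)}>.+)" if slot else re.escape(token))
        return re.compile(r"\s+".join(parts))

    def process(self, utterance: str, buffer: List[str]) -> bool:
        """Run the first action whose spec matches; report whether one did."""
        text = utterance.strip().lower()
        for pattern, action in self._rules:
            match = pattern.fullmatch(text)
            if match:
                action.execute(buffer, match.groupdict())
                return True
        return False


rule = ToyMappingRule({
    "new line": TypeText("\n"),
    "say <words>": TypeText("{words}"),
    "define function <name>": TypeText("def {name}():", transform=snake_case),
})
```

Feeding it the string "define function parse file" appends `def parse_file():` to the buffer, while an utterance that matches no spec is simply reported as unhandled.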

VoiceCode

VoiceCode:

Your voice is the most efficient way to communicate. VoiceCode is a concise spoken language that controls your computer in real-time. When writing anything from emails to kernel code, to switching applications or navigating Photoshop – VoiceCode does the job faster and easier.

VoiceCode is different from other voice-command solutions in that commands can be chained and nested in any combination, allowing complex actions to be performed by a single spoken phrase.

By taking advantage of your brain’s natural aptitude for language you can control your computer more efficiently and naturally. It really feels like you’re in the future!
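
I have not seen how VoiceCode implements chaining, so the following is only a toy sketch of the general idea: split one spoken phrase into a sequence of known commands by greedy longest-match over the words. The command vocabulary and the action names are made up for illustration, and nesting is not handled at all.

```python
from typing import Dict, List

# Hypothetical vocabulary: spoken phrase -> name of the action to run.
COMMANDS: Dict[str, str] = {
    "select": "select",
    "select line": "select-line",
    "copy": "copy",
    "next window": "switch-window",
    "paste": "paste",
}


def split_chain(utterance: str, commands: Dict[str, str]) -> List[str]:
    """Split one utterance into a list of actions, longest phrase first."""
    words = utterance.lower().split()
    longest = max(len(phrase.split()) for phrase in commands)
    actions: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at word i, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in commands:
                actions.append(commands[phrase])
                i += n
                break
        else:
            raise ValueError(f"no command starts at word {i}: {words[i]!r}")
    return actions
```

With this vocabulary, "select line copy next window paste" comes out as four actions in order. Greedy matching never backtracks, so a real system would need something smarter when one command is a prefix of another in an ambiguous way.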

Transcribing recordings

Handy if you have a recording and you want to make it into a text thing offline.

Phonetic transcription

It has been a long time since I took Phil Rose’s extravagantly weird undergraduate phonetics class, and I have forgotten much. Here is a cheating tool:

I cannot easily see how to automate phonetic transcription, but surely that is around somewhere? Some voice transcription software may well use phonetics as an intermediate representation or even as the final output.
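
As a rough illustration of phonemes as an intermediate representation, here is a small Python sketch that looks words up in a tiny pronunciation table and measures how far apart two phrases are in phoneme space. The table is hand-entered in an ARPAbet-like notation with stress marks omitted; it is an assumption for illustration, not the output of any real pronouncing dictionary or recogniser.

```python
from typing import Dict, List, Sequence

# Hand-entered, ARPAbet-style pronunciations (an assumption, stress omitted).
PRONUNCIATIONS: Dict[str, List[str]] = {
    "wreck": ["R", "EH", "K"],
    "a": ["AH"],
    "nice": ["N", "AY", "S"],
    "beach": ["B", "IY", "CH"],
    "recognize": ["R", "EH", "K", "AH", "G", "N", "AY", "Z"],
    "speech": ["S", "P", "IY", "CH"],
}


def to_phonemes(text: str) -> List[str]:
    """Concatenate the table pronunciation of each word (KeyError if unknown)."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


beach = to_phonemes("wreck a nice beach")
speech = to_phonemes("recognize speech")
print(edit_distance(beach, speech))  # 3
```

Under these assumed pronunciations the two phrases are only three phoneme edits apart, which is why the old joke works.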

Doing without a mouse

Try a stylus or an eye-tracking system.

