Psychoacoustics
August 15, 2016 — January 28, 2020
1 Psychoacoustic units
A quick incomplete reference to pascals, Bels, erbs, Barks, sones, Hertz, semitones, Mels and whatever else I happen to need.
The actual auditory system is atrociously complex and I’m not going to compete against e.g. perceptual models here, even if I did know a stirrup from a hammer or a cochlea from a cauliflower ear. Measuring what we can perceive with our sensory apparatus involves a whole field of hacks to account for masking effects and variable resolution in time, space and frequency, not to mention variation between individuals.
Nonetheless, when studying audio there are some units which are more natural to human perception than the natural-to-a-physicist physical units such as Hz and Pascals. SI units are inconvenient when studying musical metrics or machine listening because they do not closely match human perceptual difference — 50 Hz is a significant difference at a base frequency of 100 Hz, but insignificant at 2000 Hz. But how big this difference is and what it means is rather a complex and contingent question.
Since my needs are machine listening features and thus computational speed and simplicity over perfection, I will wilfully and with malice ignore any fine distinctions I cannot be bothered with, regardless of how many articles have been published discussing said details. For example, I will not cover “salience”, “sonorousness” or cultural difference issues.
1.1 Start point: physical units
SPL, Hertz, pascals.
1.2 First elaboration: Logarithmic units
This innovation is nearly universal in music studies, because of its extreme simplicity. However, it’s constantly surprising to machine listening researchers who keep rediscovering it when they get frustrated with the FFT spectrogram. Bels/deciBels, semitones/octaves, dBA, dBV…
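To make the logarithmic step concrete, here is a minimal sketch that re-expresses the 50 Hz example from the introduction in semitones and converts a pressure in pascals to dB SPL. The function names are mine, and the 20 µPa reference pressure is the usual convention for air rather than anything stated above.

```python
import math

# Assumption: the conventional dB SPL reference pressure in air, 20 micropascals.
P_REF = 20e-6


def semitones_between(f_low, f_high):
    """Interval from f_low to f_high in equal-tempered semitones."""
    return 12.0 * math.log2(f_high / f_low)


def pascals_to_db_spl(p_rms, p_ref=P_REF):
    """Sound pressure level in dB for an RMS pressure in pascals."""
    return 20.0 * math.log10(p_rms / p_ref)


# The same 50 Hz gap, at two different base frequencies.
print(round(semitones_between(100.0, 150.0), 2))    # 7.02, roughly a fifth
print(round(semitones_between(2000.0, 2050.0), 2))  # 0.43, under half a semitone
print(round(pascals_to_db_spl(1.0), 1))             # 94.0
```

On the log scale the two 50 Hz gaps stop looking alike, which is the whole point.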
1.3 Next elaboration: “Cambridge” and “Munich” frequency units
Bark and ERB measures; these seem to be more common in the acoustics and psychoacoustics community. An introduction to selected musically useful bits is given by Parncutt and Strasburger (1994).
According to Moore (2014), the key reference for Barks is Zwicker’s “critical band” research (Zwicker 1961), extended by Brian Moore et al., e.g. in Moore and Glasberg (1983).
Traunmüller (1990) gives a simple rational formula to approximate the in-any-case-approximate lookup tables, as do Moore and Glasberg (1983), and both relate these to Erbs.
1.3.1 Barks
Descriptions of Barks seem to start with the statement that above about 500 Hz this scale is near logarithmic in the frequency axis. Below 500 Hz the Bark scale approaches linearity. It is defined by an empirically derived table, but there are analytic approximations which seem just as good.
Traunmüller’s approximation for the critical band rate in Bark, with \(f\) in Hz:
\[ z(f) = \frac{26.81}{1+1960/f} - 0.53 \]
Lach Lau amends the formula:
\[ z'(f) = z(f) + 0.22\,\mathbb{I}\{z(f)>20.1\}\,\bigl(z(f)-20.1\bigr) \]
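A minimal sketch of the two Bark formulas above in Python. The function names are hypothetical, and the inverse is just my algebraic inversion of the same expressions, not something taken from the cited papers.

```python
def hz_to_bark(f_hz):
    """Critical band rate in Bark for a frequency in Hz (requires f_hz > 0)."""
    z = 26.81 / (1.0 + 1960.0 / f_hz) - 0.53
    # High-end amendment: stretch the scale above 20.1 Bark.
    if z > 20.1:
        z += 0.22 * (z - 20.1)
    return z


def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark (my own derivation, an assumption)."""
    # Undo the high-end amendment first: z' = 1.22 z - 0.22 * 20.1.
    if z > 20.1:
        z = (z + 0.22 * 20.1) / 1.22
    return 1960.0 * (z + 0.53) / (26.28 - z)


for f in (100.0, 500.0, 1000.0, 4000.0, 10000.0):
    print(f, round(hz_to_bark(f), 2))
```

Only the high-frequency amendment quoted above is implemented; any other corrections to the approximation are left out.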
Hartmut Traunmüller’s online unit conversion page can convert these for you and Dik Hermes summarises some history of how we got this way.
1.3.2 Erbs
Newer, and reportedly better at lower frequencies (but possibly not at high frequencies?). They seem to be popular for analysing psychoacoustic masking effects.
Erbs are given different formulae and capitalisation depending on where you look. Here’s one from Parncutt and Strasburger (1994) for the “ERB-rate”:
\[ H_p(f) = H_1\ln\left(\frac{f+f_1}{f+f_2}\right)+H_0, \]
where
\[ \begin{aligned} H_1 &= 11.17 \text{ erb}\\ H_0 &= 43.0 \text{ erb}\\ f_1 &= 312 \text{ Hz}\\ f_2 &= 14675 \text{ Hz} \end{aligned} \]
The Erb itself, that is, the equivalent rectangular bandwidth in Hz at a given frequency, is a different quantity from the ERB-rate at that frequency. With \(f\) in Hz:
\[ B_e = 6.23 \times 10^{-6} f^2 + 0.09339 f + 28.52. \]
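Here is a rough Python sketch of both Erb quantities, using the constants quoted above. The function names are my own invention.

```python
import math

# Constants for the ERB-rate formula as quoted above.
H1 = 11.17    # erb
H0 = 43.0     # erb
F1 = 312.0    # Hz
F2 = 14675.0  # Hz


def hz_to_erb_rate(f_hz):
    """ERB-rate (position on the Erb scale) for a frequency in Hz."""
    return H1 * math.log((f_hz + F1) / (f_hz + F2)) + H0


def erb_bandwidth_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz at a frequency in Hz."""
    return 6.23e-6 * f_hz ** 2 + 0.09339 * f_hz + 28.52


for f in (100.0, 1000.0, 4000.0):
    print(f, round(hz_to_erb_rate(f), 2), round(erb_bandwidth_hz(f), 1))
```

As a sanity check, the slope of the ERB-rate at 1 kHz comes out very close to one Erb per `erb_bandwidth_hz(1000)` Hz, which is what you would hope for if the rate scale counts bandwidths stacked end to end.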
1.4 Elaboration into space: Mel frequencies
Mels are credited by Traunmüller (1990) to Beranek (1949) and by Parncutt (2005) to Stevens and Volkmann (1940).
The mel scale is not used as a metric for computing pitch distance in the present model, because it applies only to pure tones, whereas most of the tone sensations evoked by complex sonorities are of the complex variety (virtual rather than spectral pitches).
Certainly some of the ERB experiments also used pure tones, but maybe… Ach, I don’t even care.
Mels are common in the machine listening community, mostly through MFCCs, the Mel-frequency cepstral coefficients, a feature set that seems to be historically popular for measuring psychoacoustic similarity of sounds (Davis and Mermelstein 1980; Mermelstein and Chen 1976).
Here’s one formula, the “HTK” formula.
\[ m(f) = 1127 \ln(1+f/700) \]
There are others, such as the “Slaney” formula, which is much more complicated and piecewise defined. I can’t be bothered searching for details for now.
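The HTK formula above is a one-liner either way round. A minimal sketch, with hypothetical function names and an inverse obtained by solving the formula for \(f\):

```python
import math


def hz_to_mel_htk(f_hz):
    """HTK-style mel value for a frequency in Hz."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk, obtained by solving the formula for f."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


for f in (100.0, 440.0, 1000.0, 8000.0):
    m = hz_to_mel_htk(f)
    print(f, round(m, 1), round(mel_to_hz_htk(m), 1))
```

The constants are chosen so that 1000 Hz lands at very nearly 1000 mel.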
1.5 Perceptual Loudness
Figure: ISO 226:2003 equal-loudness contours (image by Lindosland).
Sones (Stevens and Volkmann 1940) are a power-law intensity scale. Phons (ibid.) are a logarithmic intensity scale, something like the dB level of the signal filtered to match the human ear, which is close to… dBA? Something like that. But you can get more sophisticated. Keyword: Fletcher-Munson curves.
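For a feel of how the two loudness scales relate, here is a sketch of the usual textbook rule of thumb: 1 sone is pinned at 40 phon and loudness doubles for every further 10 phon. That rule is my assumption here rather than something from the references above, it is normally only quoted for levels of about 40 phon and up, and the function names are hypothetical.

```python
import math


def phon_to_sone(phon):
    """Loudness in sones from loudness level in phons.

    Assumption: the textbook rule (1 sone at 40 phon, doubling per 10 phon),
    usually only trusted for levels of roughly 40 phon and above.
    """
    return 2.0 ** ((phon - 40.0) / 10.0)


def sone_to_phon(sone):
    """Inverse of phon_to_sone under the same assumption."""
    return 40.0 + 10.0 * math.log2(sone)


for level in (40.0, 50.0, 60.0, 80.0):
    print(level, phon_to_sone(level))
```

So "twice as loud" in sones is a fixed step in phons, which is the power law and the logarithm meeting in the middle.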
For this level of precision, the coupling of frequency and amplitude into perceptual “loudness” becomes important: tones at the same sound pressure level are no longer equally loud at different frequencies. Instead, level and loudness are related via equal-loudness contours, which you can get from an actively updated ISO standard at great expense, or try to reconstruct from journals. Suzuki et al. (2003) seems to be the accepted modern version, but their report only lists graphs and is missing values in the few equations. Table-based loudness contours are available under the MIT license from the Surrey git repo, under iso226.m. Closed-form approximations for an equal-loudness contour at fixed SPL are given in Suzuki and Takeshima (2004), equation 6.
When the loudness of an \(f\)-Hz comparison tone is equal to the loudness of a reference tone at 1 kHz with a sound pressure of \(p_r\), then the sound pressure of \(p_f\) at the frequency of \(f\) Hz is given by the following function:
\[ p^2_f =\frac{1}{U^2(f)}\left[(p_r^{2\alpha(f)} - p_{rt}^{2\alpha(f)}) + (U(f)p_{ft})^{2\alpha(f)}\right]^{1/\alpha(f)} \]
AFAICT they don’t define \(p_{ft}\) or \(p_{rt}\) anywhere, and I don’t have enough free attention to find a simple expression for the frequency-dependent parameters, which I think are still spline-fit. (?)
There is an excellent explanation of the point of all this — with diagrams — by Joe Wolfe.
1.6 Onwards and upwards like a Shepard tone
At this point, where we are already combining frequency and loudness, things are getting weird; we are usually measuring people’s reported subjective loudness levels for various signals, some of which are unnatural signals (pure tones), and with real signals we rapidly start running into temporal masking effects and phasing and so on.
Thankfully, I am not in the business of exhaustive cochlear modelling, so we can all go home now. The unhealthily curious might read Hartmann (1997) and Moore (2007) and tell me the good bits, then move on to sensory neurology.
2 Psychoacoustic models in lossy audio compression
Pure link dump, sorry.