SoundStage! Xperience | SoundStageXperience.com - The Problem with Blind Testing

November 2017

As I write this, I’ve just returned from the Rocky Mountain Audio Fest, where I moderated a panel titled “Best Headphone Rigs vs. State-of-the-Art Audio Systems.” One comment, from PSB Speakers founder and chief engineer Paul Barton, especially stuck with me. As best I can recall, he said, “Once you go to blind testing, where the listeners can’t see the identity of the products, everything changes,” and he punctuated it with a wave of both arms.

Brent Butterworth

I doubt Barton’s comment resonated with most of the audience of audio enthusiasts. This was a high-end audio show, and the high-end audio industry -- and the press that covers the industry -- have largely rejected blind testing. But it stuck with me, because I’d just had a powerful experience that confirmed Barton’s statement.

That experience involved setting up, conducting, and participating in a blind test of ten portable DAC-headphone amps, including such models as the AudioQuest DragonFly Red and the Oppo Digital HA-2. This test required a great deal more work and expense than a typical equipment review: I had to rebuild one of my testing switchers, set up four PCs with matching software and test files, and then run each of the panelists -- all very experienced audio-equipment reviewers -- through at least nine test rounds, using HiFiMan HE1000 V2 and Sony MDR-7506 headphones and Shure SE535 earphones.

Countless audio writers and forum participants have opined about the sound of the DAC-amps we tested. Considering how much opinion is out there, some of it conflicting, and the fact that the output impedances of headphone amps and the impedances of headphones vary in ways that can significantly influence the sound, I had no preconceptions about the results we’d get.

Blind testing DAC-amplifiers

We did hear more differences among the DAC-amps through the Shure earphones (which, because they use balanced armatures, exhibit a huge impedance swing as the frequency rises) than through the HiFiMan headphones (which have near-zero impedance swing), and we did end up agreeing on a couple of marginal favorites. Still, we were all surprised by how elusive and insignificant the differences were, despite reviews we’d read describing large and important differences. One panelist characterized the difference as being “maybe 0.5% between the best and worst.” Another wondered aloud, “Why would anybody care about this?”
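A rough way to see why an impedance swing matters: the amp’s output impedance and the headphone’s impedance form a voltage divider, so the level reaching the driver changes wherever the headphone’s impedance changes. The sketch below is only an illustration under simplifying assumptions. It treats impedances as plain resistances, and every number in it is a made-up placeholder, not a measurement of the Shure, HiFiMan, or any DAC-amp in the test.

```python
import math

def level_db(z_load_ohms, z_out_ohms):
    """Level at the headphone relative to the amp's unloaded output, in dB.

    Simple voltage divider: V_load / V_source = Z_load / (Z_load + Z_out).
    """
    return 20 * math.log10(z_load_ohms / (z_load_ohms + z_out_ohms))

def response_spread_db(z_min_ohms, z_max_ohms, z_out_ohms):
    """Level difference between the headphone's highest- and lowest-impedance frequencies."""
    return level_db(z_max_ohms, z_out_ohms) - level_db(z_min_ohms, z_out_ohms)

# Hypothetical values, chosen only to show the trend.
Z_OUT = 5.0                                          # amp output impedance, ohms
swinging = response_spread_db(10.0, 40.0, Z_OUT)     # load with a large impedance swing
flat = response_spread_db(35.0, 36.0, Z_OUT)         # load with a nearly flat impedance

print(f"large swing: {swinging:.2f} dB of frequency-response change")
print(f"nearly flat: {flat:.2f} dB of frequency-response change")
```

With an output impedance near zero, both spreads shrink toward 0 dB, which is why the same earphones can sound different from one amp to the next while a flat-impedance headphone barely changes.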

The difference between this test and most testing done for audio publications was that our test was blind, with documented, carefully measured and matched levels, and effectively identical testing conditions for each product. We knew which ten DAC-amps were being tested, but in each test the listener used a handheld remote control to switch among three or four different DAC-amps, which were randomly grouped and identified only by number, so the listener didn’t know what she or he was hearing. Even the test administrator had no way to tell which product any listener was hearing at any given moment. We really were “trusting our ears,” as so many audio reviewers insist they do -- in this case, our ears could get no help from our eyes.
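The random grouping and numbering described above can be sketched in a few lines. This is not the procedure or switcher logic used in the test, only a minimal illustration of the idea; the product names are placeholders and the function name `make_blind_round` is hypothetical.

```python
import random

def make_blind_round(products, group_size, rng):
    """Pick a random group of products and assign each an anonymous switch number.

    Returns a dict mapping switch position (1, 2, 3, ...) to product name.
    The listener should only ever see the positions, never the names.
    """
    group = rng.sample(products, group_size)
    return {position: name for position, name in enumerate(group, start=1)}

# Placeholder names standing in for the ten units under test.
products = [f"DAC-amp {letter}" for letter in "ABCDEFGHIJ"]

rng = random.Random()              # unseeded, so nobody can predict the grouping
group_size = rng.choice([3, 4])    # three or four units per round
key = make_blind_round(products, group_size, rng)

# The listener gets only the numbers; the key stays sealed until scoring is finished.
print("Switch positions for this round:", sorted(key))
```

To keep the administrator blind as well, the key would have to be stored somewhere nobody looks until all the listening notes are in.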

For almost all published audio reviews, comparisons are done under much more casual conditions, with the identities of the products known to the reviewer, and the testing conditions mostly undocumented, other than the now-obligatory list of fancy audio components the reviewer used in the course of the test. Of course, we weren’t the first to spot the differences between blind and sighted testing. Sean Olive, acoustic research fellow at Harman International, pointed out these differences eight years ago in a blog post titled “The Dishonesty of Sighted Listening Tests,” in which he cited results of a 1994 test involving 40 listeners under carefully controlled conditions. According to Olive, “The psychological biases in the sighted tests were sufficiently strong that listeners were largely unresponsive to real changes in sound quality caused by acoustical interactions between the loudspeaker, its position in the room, and the program material.” Veteran audiophiles and reviewers may think they’re immune to such biases, but as Olive found, “the experienced listeners were no more or no less immune to the effects of visual biases than inexperienced listeners.”

More evidence of the problems of sighted, uncontrolled testing can be found on my own website, which features a “Bluetooth Blind Test” comparing uncompressed WAV files with files run through SBC (Bluetooth’s stock audio codec) and aptX (an optional Bluetooth codec marketed as an upgrade over SBC). The test was indeed blind, and the testing procedures are fully documented on that webpage. Almost everyone who’s taken the test has gotten about the same results -- they can usually tell the WAV file from the data-reduced SBC and aptX files, but the differences among the data-reduced files amount to “slightly different flavors of the same thing,” as one reader e-mailed me. (The test didn’t include the new aptX HD format. I hope to add that soon.)

Brent’s “Bluetooth Blind Test”

Yet for years, I’ve read reviews that proclaim aptX’s superiority, in definitive statements such as “an aptX-compressed Bluetooth stream sounds better than its SBC-compressed equivalent.” When you hear the test files on my site, then read those statements, it’s hard to avoid the conclusion that bias has influenced those reviewers’ work. Do they believe that aptX is better due to their listening experiences, or because of biases created by aptX’s marketing claims? We don’t know. And unless their tests were done blind, neither do the reviewers.

So what is wrong with blind testing?

On Facebook and other forums, Sean Olive often prods reviewers to elevate their testing methods to his standards. I’m glad he does this; someone’s got to counter the pseudo-scientific, nonsensical, self-serving reasons that audio reviewers tend to cite for not doing blind tests. But before we proclaim that all audio reviews should be done blind, let’s consider a couple of things I mentioned above.

The test Olive cites involved 40 people, all of whom were already on Harman International’s payroll, using Harman’s multimillion-dollar testing facilities. My blind test of ten DAC-headphone amps at RMAF took four days of my time, probably a week of the writer’s time, plus about six hours of testing for each panelist, and was funded by a media company with annual revenues of roughly $1.5 billion. My Bluetooth Blind Test wouldn’t have been possible if I didn’t happen to have most of the necessary test gear already on hand; even so, I had to invest $79 in an ART USB Phono Plus and about two days of my time, for no financial reward.

Audio websites make money by posting product reviews; generally, the more reviews they post, the more page views they attract and the more money they make. It’s simply not financially viable for most audio publications to do all their testing blind. While blind testing is by far the best way to get accurate results, it’s difficult, time-consuming, and requires considerable expertise and patience. For all of those reasons, blind testing is expensive.

This doesn’t mean we can’t aspire to reduce the level of bias in product testing. In the future, I’d like to see more product testing that gathers the opinions of multiple reviewers, and that also incorporates the opinions of enthusiasts, many of whom find problems that reviewers miss. Of course, more blind panel tests would be great, although they’re very expensive to do unless a publication has three or more reviewers who all live in the same town.

For now, measurements are probably the most practical way to reduce bias in product reviews, because they provide a “second opinion” that can’t be influenced by brand names and marketing hype. Measurements don’t necessarily tell you exactly how a product sounds, but in the case of speakers and, increasingly, headphones, they can give you a pretty good idea. And in the case of electronics, the product-to-product similarities found in most measurements of electronic components, especially when compared to measurements of speakers and headphones, serve as a bit of a check on reviewers who might grossly overstate the differences between similar amps, DACs, etc. Any website can start doing measurements for a negligible investment: Dayton Audio’s OmniMic V2 does great speaker measurements for just $299, and many useful measurements can be done with free software such as Room EQ Wizard and RightMark Audio Analyzer.

Dayton OmniMic V2

Sure, we’ve all heard that “some products that measure bad sound good, and some products that measure well sound bad” -- but have you ever heard it from someone who has actual experience at measuring products?

Right now, we’re in a situation where most audio writers are free to make ridiculous, unsubstantiated claims about, say, one banana plug “sounding better” than another, or the “noise floor” of a passive speaker being audibly reduced through careful choice of internal wiring. This may give the writer the neurochemical boost that comes with any insight, true or false, but it doesn’t serve the reader or the future of the audio industry. Maybe we’ll never have the resources to do the kind of substantial, unbiased blind testing Olive and his colleagues conduct, but the audio media should aspire to reduce bias rather than cultivate it.

. . . Brent Butterworth
brentb@soundstagenetwork.com
