ABX Testing (part 1)
A sample article from the archives of the B.A.S. Speaker

From the December 1990 meeting, BASS Vol. 19 No. 3

Preliminary Discussion by E. Brad Meyer

Brad Meyer began with a transparency detailing the steps through which a piece of classical music reaches the ears of the audiophile. First of course is the composer. Next come the players (and their instruments) and the conductor, who realize the composer's ideas in acoustic form. The sound goes out through the air into a hall, and is changed by microphones into an electrical signal. These electrical signals go through a mixer, and thence into a master recorder. Then comes the editing, and perhaps processing, reverb, and remixing. The result is transferred to some consumer music carrier--LP once upon a time, today analog cassette or CD. Next comes the home player, followed by connecting wires, preamp and a power amp, speaker wires and loudspeakers, where the signal is transformed back into sound. Finally there is the listening room. Mark Fishman noted, to laughter, that Meyer had left out the influence of the power plant and the ac lines.

Meyer drew a box around the links in the chain over which the audiophile has control: the sequence from the music carrier through the listening room. Within this box lies the subject matter for consumer audio publications.

Tonight, however, Meyer was going to focus his attention on the very end of the chain: the listener. Many things affect how a listener perceives the sound. Among them are hearing limits, experience, fatigue, mood (one's own and that of others), and pharmacological substances both medical and recreational. Ambient lighting affects mood and is thus a factor, as Meyer discovered early on in his audio pursuits. Lighting just the speakers and leaving the rest of the room dark makes the sound more vivid and dramatic. Meyer suggested that people try this, and also try darkening the whole room.

Listener hearing acuity of course is a major factor. Meyer mentioned that one now can get tested out to 20 kHz instead of just the standard 8 kHz [see the May 1991 meeting summary, in v19/1--PSH]. Meyer had had his ears' hearing thresholds (not the same thing as frequency response) measured and found that he has measurable loss in low-level detection at 12 kHz and a lot more above that. He found it a sobering experience, as have many of us.

Meyer pointed out that the ear's equal-loudness curves tend to bunch at the frequency extremes. This means that once the highest and lowest sounds are above the hearing threshold, a small change in level will sound louder than a similar change in the mid-band. This became painfully obvious during his high-frequency hearing tests. For example, at 18 kHz Meyer's threshold is 106 dB SPL. At 104 dB he cannot hear the tone at all, and yet at 106 dB he yanked the phones from his head. Fishman quoted Bob Berkovitz as saying that if the sound is not audible it does not damage the ear even if its level is quite high.

Returning to the playback chain, Meyer went on to say that typically the audiophile can affect only a small part of it--the playback system and the listening room (both acoustically and how it may be made to influence mood). There is no control over the recording process, although Meyer suggested that those who have the opportunity to do live recording really should try it--it is dismaying how much influence microphone choice and placement have on the recorded sound.

Meyer speculated that so much attention has been paid by audiophiles to trivial aspects of the playback chain such as the cables and ac power because the advent of the CD has eliminated the audible distortion introduced in the process of getting the signals from the master tape to the playback preamp, hitherto an area ripe for great fussiness. Things are a lot less interesting now for those looking for controllable detail.

Mark Fishman brought up an interesting comment from J. Gordon Holt on memory: Holt now has better memory than hearing. His memory now hampers his enjoyment of many musical performances because he misses the sheen of the violin and the delicacy of the cymbals and triangle, which he remembers but no longer hears. The discrepancy bothers him. David Moran suggested that Holt might find it helpful to employ wider-dispersion tweeters and, theoretically, some judicious equalization, to get more audible treble into the reverberant field.

Alvin Foster reported a more cheerful result, saying that his own memory helps add sheen to the strings rather than detracting from his current listening enjoyment. Dan Banquer commented that, as a musician, he has always felt that nothing is like being in the middle of the music. No matter how many millions of dollars of equipment one has, it cannot recreate the experience of performing. Meyer added that he has a BSO violinist friend who complains that the BSO broadcasts do not have enough string sound. Meyer asked him how often he has listened to the BSO from out in the audience. [This again poses the question of what "viewpoint" the sound should be created for and/or played back from--PSH.]

The ABX Comparator

Historically, it was an interest in tracking down the source of perceived differences in the playback chain that led to the construction (by David Clark and associates) of ABX boxes like the one Meyer was to demonstrate at this meeting. During the next portion of the evening he introduced the ABX box and played with it a bit to show how it worked. The system assembled for the meeting comprised an Apt preamp, Audio Dynamics power amp (Japanese, class AB, bipolar), the Allison 205 3-piece satellite/woofer system (lightly equalized with a dbx 10/20 to boost the low bass and help ameliorate a presence wrinkle), and an AR turntable fitted with a JH Formula Four arm and Stanton cartridge.

The ABX comparator switches between two sources. The box has three buttons on the remote and three LEDs on the front panel, labeled A, B, and X (hence the product name). There is another pair of buttons, labeled Down and Up, which change the numeric display on the unit. When the box is powered on, it generates 100 random assignments of X to either A or B, one for each possible displayed number on a two-digit readout (00 to 99). A Reset button on the main control unit returns the sequence to test number 01. Pushing A connects source A to the output, and likewise for button B. Pushing X connects the box-selected source, which is either A or B. Neither the operator of the box nor the listeners have any notion of which source is X until the answers are read out at the end of the test. This kind of test is called double-blind, as neither the tester nor the tested knows the answers.

During the test the subjects (or the tester) switch among A, B, and X and then mark on an answer sheet whether X is A or B. The test is repeated for a series of separate trials. At the end of a series, pushing the Answer button reveals the identities of X for all trials. In the answer mode, X is on together with the selected source--if X were A for trial number 01, for example, the LEDs for X and A will both be lit.
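The bookkeeping described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class and function names (`ABXBox`, `route`, `count_correct`) are hypothetical, no audio is switched, and the real box's circuitry and random-number method are not documented here.

```python
import random


class ABXBox:
    """Toy model of the comparator's bookkeeping; it switches no audio."""

    def __init__(self, n_trials=100, seed=None):
        rng = random.Random(seed)
        # Hidden assignment of X to "A" or "B", one per trial number 00..99.
        self._x = [rng.choice("AB") for _ in range(n_trials)]

    def route(self, trial, button):
        """Source connected to the output when `button` is pressed."""
        if button == "X":
            return self._x[trial]  # fixed for the whole of a given trial
        if button in ("A", "B"):
            return button
        raise ValueError(f"unknown button: {button!r}")

    def answers(self):
        """Answer mode: reveal the identity of X for every trial."""
        return list(self._x)


def count_correct(box, responses):
    """Score an answer sheet given as {trial number: "A" or "B"}."""
    key = box.answers()
    return sum(1 for trial, guess in responses.items() if key[trial] == guess)


box = ABXBox(seed=1990)
sheet = {trial: "A" for trial in range(1, 11)}  # a listener who always writes "A"
print(count_correct(box, sheet), "of", len(sheet), "correct")
```

In the sketch, as in the box, nobody sees the assignments until `answers()` is called at the end of the series, which is what keeps the test double-blind.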

The ABX box is designed to determine how reliably the listener can detect differences. Preconceptions affect perception and conclusions [in other words, not only is seeing believing, but believing is also seeing--Ed.], hence the need for single blindness. Double-blind testing is required because the tester almost invariably (and unpredictably) influences the test subject(s). One of many well-known examples occurred when a group of psychology students tested many subjects for IQ. The subjects were impartially tested for IQ beforehand, and then sorted into two groups with similar IQ ranges. The testers were told that group A was exceptionally intelligent while group B was not. For each group, the testers were to read the same script while administering the test. The result was that the group touted as smart to the test-givers scored statistically significantly better than the group labeled stupid. Somehow the testers conveyed their expectations about performance while reading the same instructions to the two groups, and the groups responded to the cues.

Listening

Demonstrations consisted of a range of comparative-listening tests to different musical sources, including PCM-F1 tapes and LPs, with two different devices inserted into the B path and compared with a straight-wire bypass in the A path. This kind of line-level comparison is easy to do well; at high, amp/speaker levels, there may be problems. Meyer also has a high-current relay box (an extra-cost option) for switching amplifiers or loudspeakers. The large relays in this box make a soft clunk that is different for the two sources and is audible in a quiet room; Meyer has identified X 10 out of 10 times without any signal! While the sound is quiet enough to be masked when any music is playing, testing hygiene dictates that the relay box be enclosed or otherwise muffled.

Meyer handed out a sheet photocopied from the ABX manual which showed the typical level-matching required for reliable detection of differences between sources with 1/3 octave frequency-response aberrations. When the aberrations span a wider spectrum, level-matching becomes increasingly critical, with the required match dropping to less than 1/3 of a dB, especially in the ear-sensitive 2-5 kHz region. Acuity (the ability to hear a difference) also depends in part on how close the level at a given frequency is to the threshold of hearing. At threshold, a small increase in level will make the sound audible and enable the listener reliably to distinguish A and B when they differ.

Steve Owades noted that the use of the ABX box does not reduce bias in results due to peer pressure when the box is used with more than one listener at a time. Visible or audible reactions from surrounding listeners may influence a subject's answer. Such bias makes the answers dependent--what one listener chooses is influenced by what his or her peers choose. This may invalidate the result for statistical analysis, which requires that the trials be independent.

The Tests

Meyer first demonstrated the operation of the ABX box by disconnecting the signal feed to the B inputs. This simpleminded procedure--comparing an audible signal with no signal--has proven helpful in clarifying how the box works for those who, for example, fail to pick up the point that the assignment of X remains constant for each trial. The 18 subjects present went through the exercise of writing their answers for X on the sheet. The result: 17 correct answers and one abstention, from someone who deemed the test too obvious to dignify with an answer.

Next Meyer inserted a Technics SH-9010 parametric equalizer in the B loop and set the 3 kHz slider for a 3 dB boost. The Q knob was set to 0.7 (the broadest setting, for a bandwidth of about two octaves). Playing pink noise through the system makes this alteration easy to hear, and the group got a score of 18/18 without difficulty. With choral music, whose broad frequency range makes it a good test for response aberrations, the score was 16/17.
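The "about two octaves" figure can be checked against a common textbook relation between Q and bandwidth, N = (2 / ln 2) * asinh(1 / (2Q)) octaves, which holds for the -3 dB points of a second-order band-pass response. Whether the SH-9010's Q knob is calibrated to this definition is an assumption, and the function name `q_to_octaves` is hypothetical.

```python
import math


def q_to_octaves(q):
    """Bandwidth in octaves for a given Q (second-order band-pass convention)."""
    return 2.0 / math.log(2.0) * math.asinh(1.0 / (2.0 * q))


print(round(q_to_octaves(0.7), 2))  # 1.92
```

Under that convention a Q of 0.7 comes out just under two octaves, consistent with the description of the setting.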

The next test was much tougher: The 9010 was left in the circuit, but with all sliders set to their midpoints. Unlike some consumer equalizers, the semi-pro Technics has controls that really do what they say (boost, cut, or stay flat), and the response is quite flat in this condition except for a slight droop in the top octave. To make things more difficult, we heard only the choral music for this trial. The group got 7/17 correct.

The last two trials were bypass tests of the Sony PCM-F1 digital processor. The F1's video output was looped back to the input and the processor was set to a gain of 1.0 and connected to input B. The signal source was an LP made by Meyer and Peter Mitchell of organist James Johnson--the same production whose digital version has been excerpted on the first and second Stereophile test CDs. The LP was made from an analog master, so we really were comparing an analog source directly with an F1-digitized version. The results on the two trials were 9/15 and 7/15; the total was 16/30, 53% correct.

Analyzing Results

If listeners are really not able to detect any difference between A and B (whatever they believe)--or if they were to guess--the outcome will tend toward 50 percent correct (and 50 percent incorrect) answers as the sample size increases. When the listeners can tell the difference easily (as with the pink-noise test of the 3 kHz boost) the result will be all answers correct. When the difference is subtle and some can detect it reliably while others cannot, the number of correct answers should lie between half and all correct.

[Author's note: The number of correct answers can fall below half if the trials are not independent, i.e., if someone in the audience is influencing others. Meyer told a story of an AES workshop he and Mitchell gave when the box generated a run of successive trials in which X was B. Many people selected A on one difficult trial, apparently thinking that it was about time to get an A--which is, of course, a form of dependence, though dependence on previous trials and not on the other subjects in the room. In an AES preprint by my brother and me (presented to the BAS several years ago) we suggested a much more complicated distributional function to analyze the data which would help reduce the effects of dependent trials--PSH.]

Depending on the number of trials, there is a definite number of correct answers at or above which one can say that the probability of a listener's getting that number by chance is less than five percent. This is what is known as a 95% confidence level. Assuming independence, with six trials one has to get all six correct to satisfy this criterion. With 24 trials, 17 correct answers is the threshold. The percent of correct answers needed to qualify for "reliably hearing differences" decreases as the number of independent trials increases.
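The two thresholds quoted above can be reproduced with a short calculation. The sketch below assumes independent trials and a listener who guesses with probability 1/2 on each one (a one-sided binomial test); the function names are hypothetical.

```python
from math import comb


def chance_probability(correct, trials):
    """P(at least `correct` right out of `trials`) for a pure guesser."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


def min_correct(trials, alpha=0.05):
    """Smallest score whose chance probability is below `alpha`, or None."""
    for correct in range(trials + 1):
        if chance_probability(correct, trials) < alpha:
            return correct
    return None  # too few trials for any score to qualify


for n in (6, 24):
    print(n, "trials:", min_correct(n), "correct needed")  # 6 -> 6, 24 -> 17
```

With fewer than five trials even a perfect score cannot meet the criterion, since 1/16 is already more than five percent.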

Stereophile carried out a double-blind test and then examined the results of only those subjects who got high scores. They concluded that this group had demonstrated the ability to hear differences. This, however, is statistically invalid: even for randomly generated answers, in a large group 1 out of 20 subjects would be expected to satisfy the 95% criterion by chance alone. (These are the 5% of guessers who, by the definition of a 95% confidence level, pass the criterion purely by luck.) To ascertain whether there really is a golden-eared group, they should have selected the high scorers and used them for another series of trials.
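The selection effect is easy to quantify. The sketch below uses a hypothetical panel (40 subjects, 24 trials each, 17 correct to pass); these numbers are illustrative assumptions, not the parameters of the Stereophile test, and every subject is assumed to be guessing independently.

```python
from math import comb


def pass_rate(trials, threshold):
    """Chance that one guessing subject scores `threshold` or more."""
    return sum(comb(trials, k) for k in range(threshold, trials + 1)) / 2 ** trials


def p_any_pass(n_subjects, trials, threshold):
    """Chance that at least one of `n_subjects` independent guessers passes."""
    return 1.0 - (1.0 - pass_rate(trials, threshold)) ** n_subjects


# Hypothetical panel: 40 subjects, 24 trials each, 17 correct needed to pass.
p = pass_rate(24, 17)
print(f"per-subject chance of passing: {p:.3f}")
print(f"expected number of lucky passers among 40: {40 * p:.1f}")
print(f"chance that at least one guesser passes: {p_any_pass(40, 24, 17):.2f}")
print(f"chance that a guesser also passes a retest: {p * p:.4f}")
```

The last line shows why a second series of trials helps: a subject who passed by luck is very unlikely to pass again.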

The tests we took showed clear audibility to a confidence level well over 95% for the first three tests, and null results for the last three. The tests were conducted patiently and fairly, under generally good conditions; for example, there was a minimum of cross-comment.
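As a rough check, the six group scores reported in this summary can be run through the same one-sided binomial calculation. Treating each listener's answer as an independent trial is an assumption (one that group listening may violate, as noted earlier), and the labels and function name below are only illustrative.

```python
from math import comb


def chance_probability(correct, trials):
    """P(at least `correct` right out of `trials`) for pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


# (test, correct answers, answers given), as reported in this summary.
results = [
    ("signal vs. no signal", 17, 17),
    ("3 kHz boost, pink noise", 18, 18),
    ("3 kHz boost, choral music", 16, 17),
    ("equalizer set flat, choral music", 7, 17),
    ("PCM-F1 bypass, first trial", 9, 15),
    ("PCM-F1 bypass, second trial", 7, 15),
]

verdicts = []
for label, correct, trials in results:
    p = chance_probability(correct, trials)
    verdicts.append("audible" if p < 0.05 else "null")
    print(f"{label}: {correct}/{trials}, chance probability {p:.4f}, {verdicts[-1]}")
```

Under these assumptions the first three scores fall far below the five-percent line and the last three do not, which matches the summary given above.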

Meyer noted that people typically get touchy, even grouchy, when two blind-compared pieces of equipment are very similar. It must be noted here that some high-end reviewers have said that long-term listening to each piece of equipment produces more reliable answers than short-period ABX switching, which they feel is less revealing--although there is good evidence that, to the contrary, quick comparison increases acuity. In any case, contrary to popular misconception, there is no law against leaving the ABX box in position A for a month, then switching to B the next month, and finally to X during a third month.

[Guest's addendum: Following my own experience, I tried to switch among A, B, and X at moments that would be the most revealing of differences. Still, these tests were necessarily conducted with fairly rapid switching. Needless to say, the system and room were familiar to none of us. The conditions were obviously not the best, and finally, as always, a negative result does not conclusively prove the nonexistence of anything. The test can and should be made more sensitive when possible by using the subject's own system and room, and by repeating musical selections through both signal paths (a repeated-music test) rather than switching back and forth while a selection is playing (a running-music test).

The stress of the tests did indeed tell after a while. Even the temperate Poh Ser Hsu was heard to snap at someone two rows ahead of him to quit moving his head around! While this was a less than ideal test, then, I must point out that claims by writers like Robert Harley that blind tests necessarily generate such stresses are without foundation.--EBM]

