From
the December 1990 meeting, BASS Vol. 19 No. 3
WISHFUL
THINKING by Tom Nousaine (Illinois)
In the November 1990 Stereophile
editor John Atkinson and staffer Will Hammond provide a good
model for wishful-thinking analysis in discussing the results
of their CD-tweak listening tests ("As We See It: Music,
Fractals and Listening Tests"). In January of 1991 Martin
Colloms reprises the original wishful-thinking paradigm in discussing
his well-known 1986 amplifier comparisons ("As We See It:
Working the Front Line").
I call their analyses wishful
because they draw conclusions based on evidence that doesn't
support such findings.
In the CD-tweak test Atkinson
and Hammond conducted a 3222-trial single-blind listening experiment
to determine whether CD tweaks (green ink, Armor-All, expensive
transports) altered the sound of compact-disc playback. Subjects
overall were able to identify tweaked vs untweaked CDs only
48.3% of the time, and the proportion that scored highly (five,
six, or seven out of seven trials--Stereophile's definition
of a keen-eared listener) was well within the range to be expected
if subjects had been merely guessing.
Atkinson declared that there
were "some listeners who could and did hear a difference."
In response to several letters showing how the statistics didn't
support this conclusion, Hammond insisted that "...the
total of the tweaks used resulted in a sonic difference that
was detected correctly well beyond the probability of it being
a chance occurrence" (February 1991, p. 65).
Given the numbers published,
this conclusion is simply not supported. However, there were
analyses which seemed to support positive results. For example,
an analysis of one musical selection, through all listening
sessions, judged by males comparing different transports is
shown as being "significant: p < .001," i.e., the probability
of these scores occurring from chance alone is less than 0.1%.
Further analysis shows 71%
(132/186) correct identifications when A and B were different
and only 32% correct (62/194) when they were the same. The first
proportion would be significant when judged against the 50% chance
criterion; the margin by which a score must exceed 50% depends
on the size of the sample. The difference between 71% and 32%,
moreover, seems too great to be a chance happening.
So doesn't this support
their conclusions? Nope: they used the wrong criterion for comparison.
When the trials where B was different from A (A-B or B-A) are
combined with the trials where A and B were the same (A-A and
B-B), the combined score of 50.7% correct is not significantly
different from what one would expect by chance. The data do
suggest two important things, though. First, listeners are disposed
to report differences even when there are none. This group in
this example reported a difference 68% of the time when the
second presentation was the same as the first. Second, one should
have an equal number of same (A-A/B-B) and different (A-B/B-A)
trials when the 50% criterion is employed. Otherwise the criterion
score must be adjusted to account for response bias, the tendency
of subjects to report differences even when a component is compared
with itself. [There is an additional bias problem in the later
trials if the subjects know that the numbers of same and different
trials are equal. This is not a simple matter. Pub.]
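As a rough check on the arithmetic in this paragraph, the sketch below recomputes the rates from the counts quoted above and tests the pooled score against the 50% chance criterion with an exact binomial tail. The function name and the choice of a one-sided exact test are assumptions made for illustration, not the method used by Stereophile or by the author. The quoted counts pool to 194 of 380, a shade different from the 50.7% given in the text.

```python
from math import comb

def upper_tail_at_chance(correct, trials):
    """Exact P(X >= correct) when every trial is a 50/50 guess."""
    favorable = sum(comb(trials, i) for i in range(correct, trials + 1))
    return favorable / 2**trials

# Counts quoted in the text for the transport comparison.
different_correct, different_trials = 132, 186  # A-B / B-A trials
same_correct, same_trials = 62, 194             # A-A / B-B trials

# Share of "different" answers when there really was a difference.
hit_rate = different_correct / different_trials
# Share of "different" answers when the two presentations were identical.
false_alarm_rate = 1 - same_correct / same_trials

pooled_correct = different_correct + same_correct
pooled_trials = different_trials + same_trials
pooled_rate = pooled_correct / pooled_trials
pooled_p = upper_tail_at_chance(pooled_correct, pooled_trials)

print(f"hit rate {hit_rate:.1%}, false-alarm rate {false_alarm_rate:.1%}")
print(f"pooled {pooled_rate:.1%} correct, one-sided p = {pooled_p:.2f}")
```

The hit rate and the false-alarm rate come out nearly equal, which is the response-bias pattern the paragraph describes.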
This sort of response bias
was first seen in the blind amplifier tests staged by Quad in
1978. In those, the experimenters used the preference style
of test: subjects were asked, "Do You Prefer A, Prefer
B, or Have No Preference?" Subjects expressed a preference
for either A or B 35% of the time when the amplifier was being
compared with itself. They were, in other words, biased to prefer
A or B (i.e., to report a difference) even when A = B. The people
at Quad reported this bias correctly, concluding that based
on the numbers these subjects were unable to identify amplifiers
by sound alone.
Years later, in 1986, Martin
Colloms claimed to have proved that amplifiers sound different
with a 63% correct rate in a double-blind test report ("Amplifiers
Do Sound Different," Hi-Fi News and Record Review, May
1986). In this case Colloms made large analytical errors. He
ignored an unusually large part of his experiment (approximately
25% of the trials), a choice that may have introduced experimental
bias. Colloms based his analysis only on the trials where the
amplifiers were different, without compensating for the response
bias already discussed. Listeners scored 63.3% correct during
those trials where the amplifiers were different (95 of the
150 A-B/B-A trials). However, subjects scored correctly only
65% of the time when the amplifiers were the same (26 of 40
A-A/B-B trials). Another way of saying this is that subjects
reported a difference 35% of the time (14/40 trials) when there
could have been no difference.
There are two analytical
ways to compensate: 1) compare the correct rate of the sames
and the differents; 63.3% vs 65% is not a significant difference,
and 2) adjust the criterion score. Because of response bias,
we would expect a hypothetical 100-trial study in which differences
were inaudible and which had all different comparisons to produce
67.5 correct responses: 35 correct responses because of bias
plus 50% of the remaining 65 trials by guessing. Thus a 63.3%
correct rate is below the 67.5% expected due to chance alone.
[It seems to me that Nousaine is trying to have it both ways
here: if a score in the neighborhood of 67% is to be expected
on the A-B/B-A trials because of bias toward reporting a difference,
then 65% correct is all the more significant on the A-A/B-B
trials, where the subjects must overcome this bias in their
answers. If you combine the "same" and "different"
trials for the Colloms tests, as the author does for the CD-tweak
tests, the results do appear significant. See the note at the
end of the article. Pub.]
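The adjusted-criterion arithmetic above can be written out as a small sketch. It only reproduces the model stated in the text (a share of "different" answers given out of bias, the rest split 50/50 by guessing); it is not a significance test, and it takes no position on the publisher's objection. The function name is an assumption for illustration.

```python
def chance_score_on_different_trials(false_alarm_rate):
    """Expected fraction 'correct' on different-only trials if nothing is audible.

    Model from the text: a fraction of answers is "different" out of bias
    alone, and the remaining answers are 50/50 guesses.
    """
    return false_alarm_rate + 0.5 * (1.0 - false_alarm_rate)

false_alarm_rate = 14 / 40  # "different" reported on A-A/B-B trials
observed = 95 / 150         # correct answers on A-B/B-A trials
criterion = chance_score_on_different_trials(false_alarm_rate)

print(f"bias-adjusted chance level: {criterion:.1%}")
print(f"observed on different trials: {observed:.1%}")
if observed > criterion:
    print("observed score exceeds the adjusted criterion")
else:
    print("observed score does not exceed the adjusted criterion")
```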
Note that the much-attacked
ABX technique, where a forced choice is made, is free of this
problem. In an ABX test a criterion of 50% due to chance is
correct given a large enough sample size; however, most researchers
recommend a 75%-correct criterion to eliminate the possibility
that small bias errors will influence the results.
Returning to Colloms, what
if the 4.2-percentage-point differential (67.5% - 63.3%) were significant?
That is, what if the 4.2-point greater rate at which subjects scored
wrong was more than we could attribute to chance alone? The
most logical conclusion is that there was some sort of bias
or systematic error introduced into the study. In Colloms's
study this is likely. He arbitrarily excluded a large number
of trials because of poor test conditions. There may have been
bias built into the test procedure itself, or the exclusion
itself may have been systematic.
And returning again to Stereophile
and the CD tweaks, notice the strength of the response bias
compared with earlier tests. In the Quad experiments, audio
professionals reported preferences about a third of the time
when amplifiers being compared were the same. Colloms's subjects
heard differences 35% of the time when amplifiers were compared
with themselves. However, the Stereophile report discloses that
their subjects answered "different" a whopping 58%
of the time in total when the presentations were identical (41.2%
correct in A-A and B-B comparisons).
This is a phenomenon we
must be aware of in ourselves. People with an interest in sound
will tend to hear things simply from trying not to miss them.
These data show we are disposed to "hear" or "guess
about" nonexistent differences one-third to one-half of
the time even if the "coach" is blind. Imagine how
strong the tendency can be when the coach is a trusted friend,
reviewer, or salesperson who imposes no scientific controls
on himself. But can't we all just put aside our biases when
listening? Obviously, according to these data, not. [Not being
able to put aside one's biases perfectly and cleanly, simply
through effort, is another way of saying that no one is immune
to the placebo effect.--Ed.]
Sometimes research can seem
to grow in significance over time. For instance, in his February
1991 "As We See It," Martin Colloms recalls his 1986
experiment as being validated by a statistician who supported
his conclusion that amplifiers sound different, even after such
colleagues as Stanley Lipshitz pointed out weaknesses in the
analysis. Lipshitz himself told me that as far as he knew, Colloms
has never disclosed any additional analyses that supported his
conclusions, and I can't see any way to support them given the
extensive body of data supplied in the HFNRR report.
Based on their response
to previous controversies, I predict that within two years Stereophile
magazine will refer to their experiment as having proved that
CD tweaks were reliably identified under blind conditions, ignoring
the valid statistical objections already raised in its own pages.
It's sad. While I continue
to subscribe to the magazine for entertainment, I'm not so sure
I can accept evaluations of other people's products in a publication
that is not able to evaluate its own research rationally, especially
when many of its writers, and its editor, spend so much time
pointing fingers at those who over the years have added much
to our knowledge of audio.
2002 Update from Tom Nousaine:
There are two available
ABX-style comparison devices. QSC sells an ABX box and there
is a PC-based system (PCABX, available free from www.pcabx.com)
from Arny Krueger, one of the original ABX Company guys. I have
four available; the two above, a one-off made for Bob Carver
and the original ABX box.
[Publisher's note: A discussion
with Nousaine revealed that he expects results from same/different
tests that are different from those from ABX tests. We did
not reach a consensus about how to interpret the former. He
seemed to interpret the 65% correct scores in Colloms's A-A/B-B
amplifier tests as an illustration of a 35% bias toward hearing
differences where there are none, out of his expectation that
when the amplifiers are the same the subjects should say so
100% of the time. But 26/40 correct is significant to better
than 95% if you use the standard ABX criterion. Does this
mean that the ABX's 50% baseline doesn't apply to same/different
tests? Yes, Nousaine said. This answer surprised me; I always
thought that the data were analyzed the same way. Professor
Richard Greiner, to pick one example, conducted same/different
tests for his recent AES paper on the detection of polarity,
and carried out the statistical analysis in the familiar manner, with
a null result being 50% correct answers.
One thing we did agree
on was that switching during musical passages (what I call
a running-music test) was much more likely to generate false-difference
reports than hearing the same piece or short passage over
and over (a repeated-music test). Without knowing which method
was used, one can't precisely predict the tendency to report
nonexistent differences. I believe that repeated-music testing
should be used because it appears to be more sensitive, and
because it more closely mimics the listening habits of both
subjective reviewers and casual listeners. We should take
every opportunity to refine our tests for greater sensitivity
and make them duplicate actual listening conditions more closely.
Nousaine gave me a preliminary
report on an ongoing experiment designed to do these things.
Among the new protocols he uses is one designed to mimic what
happens when the non-blind listener chooses a certain passage
that best illustrates a particular sonic feature: After one
round of tests, a second round is conducted using only the
music on which correct answers have been given. The results
of this experiment, an amplifier comparison conducted on an
audiophile system by its owner, will be published in due course.--EBM]
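The publisher's appeal to the standard 50% criterion can be checked with an exact binomial tail. The sketch uses the 26 correct answers in 40 same trials reported in the article body (the 65% score); the one-sided test and the function name are assumptions for illustration. Whether a 50% baseline is the right one for same/different data is exactly the point in dispute, so this shows only what that criterion yields.

```python
from math import comb

def upper_tail_at_chance(correct, trials):
    """Exact P(X >= correct) when every trial is a 50/50 guess."""
    favorable = sum(comb(trials, i) for i in range(correct, trials + 1))
    return favorable / 2**trials

same_correct, same_trials = 26, 40  # A-A/B-B trials answered "same"
p_value = upper_tail_at_chance(same_correct, same_trials)

print(f"{same_correct}/{same_trials} correct, one-sided p = {p_value:.3f}")
print("below 0.05" if p_value < 0.05 else "not below 0.05")
```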
Here is a reader's comment:
Dear Sir,
I read the article "Wishful
Thinking" on your website with a lot of interest.
It was very enlightening on the misuse of statistics
and how misunderstood they are. Indeed, the articles
criticized in "Wishful thinking" manipulated
the numbers to give a positive conclusion. As the
author, Mr. Nousaine, pointed out, they only used
part of their data to support their claims. Mr.
Nousaine then started some complex reasoning to
correct this bias.
Unfortunately, the reasoning
used by Mr. Nousaine is also incorrect ...
And the note by the Publisher
at the bottom of the article also shows he does
not grasp the statistical theory to be used in this
case.
I will try to explain how this
data should be interpreted.
In the tests described, where
subjects listen to pairs of audio samples and must
tell whether they are different, a major problem
occurs: we do not know the propensity of
the subjects to push that damn "I heard a difference"
button even when there is no difference.
The only way to interpret the
data is therefore to test whether the "I heard a
difference" answers are randomly distributed.
This means checking whether the frequency of "I
heard a difference" answers was higher when there was
actually a difference than when there was not.
In the CD-tweak test from Stereophile,
for example, the subjects answered "I
heard a difference" 132 times in the 186 trials where
there was a difference. That is 71% of the time.
But they also said "I heard a difference"
132 times in the 194 trials where there was no difference.
That is 68% of the time. It is not very difficult
to see that 68% is not statistically different from
71% (also see below).
However, in the amplifier test,
the results are different: we have 95 "I
heard a difference" answers in the 150 trials where
there was actually a difference (63%), but the subjects
answered "I heard a difference" only 14
times out of the 40 trials where there was no difference
(35%).
Here, 63% looks different from
35%, but as the sample sizes are very different,
slightly stricter mathematics is needed. We
have to calculate whether chance alone could
give such a difference. If the 109 "I
heard a difference" answers are randomly distributed
among the 190 trials, what would be the probability
of getting 14 or fewer such answers in a random subsample
of 40 trials by chance alone? This is a classical
"marbles in a bag" problem that can be
calculated using the hypergeometric distribution.
Using the nice calculator at Stat Trek (http://www.stattrek.com/Tables/Hypergeometric.aspx;
just enter the parameters of the problem: 190,
40, 109, 14), we can calculate that the probability
of obtaining these numbers by chance alone is about
0.12%. This is well below any usual statistical threshold
for randomness, so we must conclude that, statistically,
the subjects could pinpoint a faint difference between
the amplifiers, even if they have a high propensity
to hear a difference when there is none.
(Doing the same hypergeometric
calculation on the CD-tweak example, the probability
of obtaining those numbers by chance would be 30.5%,
which is much higher than the usual 5% threshold
used in statistics. We must therefore conclude that
the answers are randomly distributed and that the
subjects press the "I heard a difference"
button 69% (264/380) of the time, whether there
actually exists a difference or not.)
I hope this clarifies the matter
a bit. Debunking false ideas is good, but only if
you use correct reasoning.
Best regards,
Olivier Van Cantfort
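The hypergeometric figures in the letter above can be reproduced without the online calculator. The sketch below is a minimal standard-library version; the function and variable names are assumptions for illustration, and the lower tail (that many "I heard a difference" answers or fewer among the same trials) is taken to be the quantity the letter describes.

```python
from math import comb

def hypergeom_lower_tail(k, population, marked, draws):
    """P(X <= k), where X counts marked items among `draws` items taken
    without replacement from `population` items, `marked` of them marked."""
    unmarked = population - marked
    smallest = max(0, draws - unmarked)  # fewest marked items a draw can hold
    favorable = sum(
        comb(marked, i) * comb(unmarked, draws - i)
        for i in range(smallest, k + 1)
    )
    return favorable / comb(population, draws)

# Amplifier test: 109 "different" answers over 190 trials,
# 14 of them falling in the 40 trials where the amplifiers were the same.
p_amplifiers = hypergeom_lower_tail(14, 190, 109, 40)

# CD-tweak test: 264 "different" answers over 380 trials,
# 132 of them falling in the 194 trials where the discs were the same.
p_cd_tweak = hypergeom_lower_tail(132, 380, 264, 194)

print(f"amplifiers: P = {p_amplifiers:.4f}")
print(f"CD tweaks:  P = {p_cd_tweak:.3f}")
```

The letter reports about 0.12% for the amplifier data and 30.5% for the CD-tweak data; the printed values can be compared against those.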