What’s Behind the Numbers? Important Decisions in Judging Practical Significance

Greg Atkinson

Sportscience 11, 12-15, 2007 (sportsci.org/2007/ga.htm)
As Will Hopkins mentioned in his recent article (Hopkins, 2006), his latest ideas about
sample size estimation have arisen from a long-standing interest in the confidence
interval approach to interpretation of study conclusions. Indeed, Will has
been instrumental over the last two decades in communicating the advantages
of such an approach to sport and exercise scientists. It is undeniable
that confidence intervals help researchers to appraise the
"real-world" relevancy of their study outcomes and that Will's
spreadsheets are useful tools to help researchers make such an appraisal. My
personal interest in Will's article centers on the underpinning philosophy of
the ideas rather than the mathematical accuracy of the spreadsheets derived
from the "statistical first principles" which Will adopts. I know
Will to be a highly competent mathematician who has a gift for communicating
complicated mathematical concepts in a "researcher-friendly" way,
especially through the use of his spreadsheets.

I think Will's claims that his new approach leads to sample sizes one third of
those given by "traditional methods" need to be viewed from a philosophical
standpoint in order to unravel how this difference in numbers comes about.
Such claims are especially interesting given that there are surprisingly
tight relationships, both philosophically and mathematically, between some
interpretations of the confidence interval approach and the null hypothesis
testing process. For example, if the lower bound of a 95% confidence interval
is exactly zero, then the exact P-value for statistical significance of the
sample mean is 0.05 (5%). This makes sense, since both the lower bound of the
confidence interval and the P=0.05 in the null hypothesis testing process
basically suggest that it is unlikely that the true population effect size is
zero (or, put another way, that the observed effect size is unlikely to be
merely due to chance sampling error). I know that Will is not too comfortable
with this relationship between 95% confidence intervals and statistical
significance in the null hypothesis testing process and I believe this is one
reason why 90% confidence intervals are preferred by him and other
statisticians.
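To make this duality concrete, here is a minimal sketch of my own (not from Will's spreadsheets), assuming a normally distributed sample mean with a known standard error; if the mean is chosen so that the lower bound of the 95% confidence interval sits exactly at zero, the two-sided P-value against a zero null comes out at exactly 0.05:

```python
# A minimal sketch (mine, not Will's) of the CI / P-value duality, assuming a
# normally distributed sample mean with a known standard error.
from scipy.stats import norm

se = 2.0                     # assumed standard error of the sample mean
z = norm.ppf(0.975)          # 1.96 for a two-sided 95% interval
mean = z * se                # choose the mean so the lower CI bound is zero

lower, upper = mean - z * se, mean + z * se
p_two_sided = 2 * (1 - norm.cdf(mean / se))

print(f"95% CI: ({lower:.2f}, {upper:.2f})")    # lower bound is 0.00
print(f"two-sided P-value: {p_two_sided:.3f}")  # exactly 0.050
```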
I would like to make some comments, which may be relevant, about the "null"
in the null hypothesis testing process. Firstly, the null value does not have
to be set at zero. The null assumption can also be that the effect size is
equal to the smallest worthwhile magnitude. "Null" in this sense
means "not important" and suggests that the null hypothesis testing
process is not completely disconnected from issues surrounding practical
significance. I think adoption of this philosophy in the past would have at
least reduced the instances of researchers automatically assuming that statistical
significance is synonymous with practical importance. It is also not very
well known that, as part of the philosophy of a one-tailed, directional
analysis, the null hypothesis should state that the true effect is
zero or opposite in direction to that hypothesized by the researcher.
This is because both these scenarios should result in the same study
conclusion: the intervention should not be adopted.
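As a purely hypothetical illustration of this shifted null (all numbers below are assumed, not taken from any study), the same sample can be tested against zero and against a smallest worthwhile magnitude; the P-value against the worthwhile threshold is necessarily the larger of the two:

```python
# A hypothetical sketch of setting the "null" at the smallest worthwhile
# magnitude rather than at zero; data and threshold are assumed for illustration.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
changes = rng.normal(loc=1.5, scale=2.0, size=30)  # assumed observed changes
smallest_worthwhile = 1.0                          # assumed threshold

# One-tailed test against the conventional zero null ...
res0 = ttest_1samp(changes, popmean=0.0, alternative='greater')
# ... and against the "not important" null at the smallest worthwhile magnitude.
res1 = ttest_1samp(changes, popmean=smallest_worthwhile, alternative='greater')

print(f"P vs zero: {res0.pvalue:.4f}")
print(f"P vs smallest worthwhile: {res1.pvalue:.4f}")
# The second P-value is always the larger: significance against zero does not
# guarantee that the effect clears the practically important threshold.
```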
Given Will's claims, it may surprise some readers when I say that there are some published
interpretations of confidence intervals (e.g., Guyatt et al., 1995) which lead to estimates of larger (not smaller) sample sizes than
for the null hypothesis testing procedure (when zero is the chosen null
value). This is because the lower bound of a confidence interval might be
larger than zero (hence the sample mean is statistically significant) but
might not be larger than the smallest worthwhile effect. Some statisticians
interpret this situation as the sample size not being large enough to be
reasonably certain that the true population effect is larger than the
smallest worthwhile effect, i.e. more subjects are needed to narrow the confidence
interval and therefore arrive at a more precise conclusion. One can tell from
the work Will has done on boundaries of benefit/harm that he is one of the
statisticians who do not agree with this rather conservative pass-fail
approach to confidence interval interpretation. Still, it serves to
illustrate that the interpretation of confidence intervals is itself under
debate, even without bringing in the Bayesians!
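To see where the larger numbers come from, here is a back-of-envelope sketch of my own (one-sample design, normal approximation, all values assumed): requiring the 95% confidence interval's lower bound to clear a smallest worthwhile effect, rather than merely to clear zero, can inflate the required sample size several-fold:

```python
# A back-of-envelope sketch (all numbers assumed) of why the conservative
# pass-fail reading of confidence intervals demands more subjects.
from math import ceil
from scipy.stats import norm

sigma = 4.0        # assumed SD of the change scores
delta = 2.0        # assumed true effect
worthwhile = 1.0   # assumed smallest worthwhile effect
z_ci = norm.ppf(0.975)   # 1.96 for a 95% interval
z_pow = norm.ppf(0.80)   # 0.84, for an 80% chance of meeting the criterion

def n_needed(margin):
    # smallest n for which the 95% CI lower bound exceeds `margin`
    # with 80% probability, given the assumed true effect
    return ceil(((z_ci + z_pow) * sigma / (delta - margin)) ** 2)

print("n for lower bound > 0:                  ", n_needed(0.0))         # 32
print("n for lower bound > smallest worthwhile:", n_needed(worthwhile))  # 126
```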
So, in view of the drastic reduction in estimated sample size, what exactly is Will doing
differently in terms of the philosophy of applying probabilistic statements
to study conclusions? If multiple assumptions have been made, how have these
been rationalized? The answer to this latter question is especially important
given the oft-cited criticism that the popular P<0.05 (5%) cut-off value
for statistical significance in the null hypothesis testing process is quite
arbitrary, although to be wrong about a claim of significance, given the observed
data, only one time out of 20 seems a decent delimitation of "reasonably
certain" to me. Will
believes that the use of the P<0.05 cut-off value is not only arbitrary
but it leads to decisions that are too conservative. Is Will fighting a
generalization with another (or several other) generalization(s) in this
respect? Who or what is P<0.05 too conservative for? Doesn't such a view
actually detract from what is really important - that the level of alpha (or
indeed any delimitation about probability coverage or levels in data
analysis) is a situation-specific delimitation? The P<0.05 cut-off could
be viewed as too liberal in some circumstances,
e.g. the use of an antiviral drug to combat HIV infection when that drug
might have serious side effects. Will's solution to this problem seems to
involve the introduction of two new types of decision error with delimited acceptable
cut-off values of 0.5% and 25% (to be fair, Will cites these as examples).
What is the exact rationale for these values? Following these delimitations,
the acceptable cut-offs for qualitative conclusions of "beneficial",
"trivial", etc., are introduced. What should these probabilistic
values be and what philosophical basis drives them? If Will's new methods are
adopted, then all these situation-specific delimitations should come to the
forefront of the researcher's mind. Do we need discussion-based position
statements to be formulated for all these delimitations which affect the
study conclusion process?
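For what it is worth, here is my own reading of how such delimitations might operate in practice; this is a hypothetical sketch with assumed numbers, not the logic of Will's actual spreadsheets. The chances that the true effect is beneficial or harmful are computed from the observed effect and its standard error, and the intervention is endorsed only if the chance of harm is below 0.5% and the chance of benefit exceeds 25%:

```python
# A hypothetical decision rule using the example cut-offs of 0.5% (harm) and
# 25% (benefit); observed effect, standard error and thresholds are assumed.
from scipy.stats import norm

effect, se = 1.5, 0.8   # assumed observed effect and its standard error
beneficial = 1.0        # assumed smallest worthwhile (beneficial) effect
harmful = -1.0          # assumed smallest harmful effect

p_benefit = 1 - norm.cdf((beneficial - effect) / se)  # P(true effect > beneficial)
p_harm = norm.cdf((harmful - effect) / se)            # P(true effect < harmful)

use_it = p_harm < 0.005 and p_benefit > 0.25
print(f"chance of benefit: {p_benefit:.1%}, chance of harm: {p_harm:.2%}")
print("adopt the intervention" if use_it else "do not adopt / collect more data")
```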
Inherent in the confidence interval approach to interpreting study conclusions is the
most important delimitation a researcher needs to make: the selection of the
smallest outcome magnitude that is clinically or practically important. Will
maintains that any researcher who cannot arrive at such a value should
"quit the field"! I can see his point in terms of the number of
researchers who seem unable to even discuss the practical importance of their
findings and agree that this inability is a terrible side effect of
over-reliance on the null hypothesis testing approach. Nevertheless, I am not
so sure that sport and exercise scientists have such an easy job in arriving
at this smallest worthwhile effect.

Will
maintains that a change of approximately 0.5 of the within-subject variability
in performance between competitions is probably worthwhile for sports performance
contexts (Hopkins et al., 1999). This cut-off value was arrived
at following a study (the first of its kind) on the within- and
between-athlete variability of real track-and-field performances at the elite
level. Using these data, Will was able to estimate how much the
within-athlete performance needs to change in order for it to make a
difference in terms of winning places. But how does such a cut-off value
relate to other scenarios, especially when such values have been calculated
with all the variability associated with real-world situations? I am not
challenging the delimitation here, but wonder whether we need to formalize the
process of arriving at these decisions. Also, can such cut-off values derived
from the real world be applied to the more tightly controlled environment of
a laboratory experiment? For example, I have found recently that
within-player variation (CV) of real soccer motion analyses can be as high as
100%. This variability is not surprising given the myriad tactical and
behavioral variations between soccer matches. I don't think this magnitude of
variability will be present if one researches an externally-valid component
of soccer performance in the controlled environment of the laboratory.
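A quick bit of arithmetic (with an assumed laboratory CV, purely for illustration) shows why this transferability question matters: the 0.5 x within-subject variability rule yields wildly different smallest worthwhile effects depending on which environment supplies the variability estimate.

```python
# Illustrative only: the 0.5 x within-subject CV rule applied to the match-play
# CV noted above (~100%) versus an assumed laboratory CV of 5%.
cv_match = 100.0   # within-player CV of real soccer motion analyses, %
cv_lab = 5.0       # assumed within-player CV for a controlled laboratory test, %

for setting, cv in (("match play", cv_match), ("laboratory", cv_lab)):
    print(f"smallest worthwhile change ({setting}): {0.5 * cv:.1f}%")
# 50% versus 2.5%: the same rule, fed different variability estimates, moves
# the threshold by a factor of twenty.
```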
Will's value for a meaningful effect size of 0.5 x within-subject variability is at
least better, in terms of underlying rationale, than Cohen's 0.2 of a
between-subjects SD. How has this latter cut-off value been rationalized in
terms of sports performance, physiology of exercise or indeed any outcome
relevant to exercise science? Cohen was not a sport and exercise scientist,
so he wasn't even in the field to be able to quit it!

Of course, the size of a worthwhile effect should be an informed decision based on
knowledge about what really makes a difference. But how easy is such a
decision, especially when the study outcome variable is part of an overall
concept? For example, what is the smallest difference in bowling speed that
makes a difference to the overall cricket performance of the team? This question
was exactly the one Will needed to answer when he co-authored a recent paper (Petersen et al., 2004). In response to a training
intervention, the smallest worthwhile change in bowling speed was stated by
Petersen et al. to be 5 km/h as "the smallest that a top batsman would
notice". Nevertheless, a smallest worthwhile effect size of 2.5 km/h was
also stated as being "beneficial to a world-class bowler". As an
illustration of how vital these decisions about smallest worthwhile effect
are, and how clearly rationalized they should be, it was interesting that
Petersen et al. found that the 90% confidence interval for the change in
bowling speed was 1.2 to 4.2 km/h. This confidence interval tells us that a
zero (null) change in true bowling speed is very unlikely (since the lower
limit of the interval is 1.2). Nevertheless, the true change in bowling speed
could be beneficial according to one delimited worthwhile effect (2.5 km/h)
but not another (5 km/h), since the upper limit was higher than the former
but lower than the latter delimited cut-off. Therefore, whilst Petersen et
al. were pretty sure that the intervention induced an improvement in bowling
speed, their study conclusion was less certain, according to their delimited
worthwhile effect sizes. My question is to what extent should this ambiguity
in the magnitude of the smallest worthwhile effect be built into Will's
probabilities of "very likely beneficial", "trivial",
etc.? If the anchor between the delimited smallest worthwhile effect size and
real-world relevance is pretty loose, is it actually worth being so precise
with all the probabilities associated with the observed effect?
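To put numbers on this ambiguity, here is a back-calculation of my own from Petersen et al.'s interval (assuming an approximately normal sampling distribution): the chance that the true change exceeds each candidate threshold differs enormously.

```python
# A back-calculation (mine, assuming approximate normality) from Petersen et
# al.'s 90% confidence interval of 1.2 to 4.2 km/h.
from scipy.stats import norm

lower, upper = 1.2, 4.2           # the published 90% CI, km/h
z90 = norm.ppf(0.95)              # 1.645 for a 90% interval
mean = (lower + upper) / 2        # 2.7 km/h
se = (upper - lower) / (2 * z90)  # ~0.91 km/h

for threshold in (2.5, 5.0):      # the two stated smallest worthwhile changes
    p = 1 - norm.cdf((threshold - mean) / se)
    print(f"P(true change > {threshold} km/h) = {p:.1%}")
# Roughly 59% for the 2.5 km/h threshold versus under 1% for the 5 km/h one:
# the qualitative conclusion hinges entirely on which threshold is adopted.
```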
In summary, I believe that the most important issues in Will's article are not
sample size calculations, but the philosophy underpinning his new approaches
to arriving at study conclusions using confidence intervals. There are new
delimited conclusion error types and new boundaries of overlap between the
confidence interval and the smallest worthwhile effect. Will has set a very important
ball rolling, but its path needs to be clearly steered and agreed on, in my
opinion.

References

Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S (1995). Interpreting study results: confidence intervals. Canadian Medical Association Journal 152, 169-173.

Hopkins WG, Hawley JA, Burke LM (1999). Design and analysis of research on sport performance enhancement. Medicine and Science in Sports and Exercise 31, 472-485.

Hopkins WG (2006). Estimating sample size for magnitude-based inferences. Sportscience 10, 63-67.

Petersen CJ, Wilson BD, Hopkins WG (2004). Effects of modified-implement training on fast bowling in cricket. Journal of Sports Sciences 22, 1035-1039.

Published Dec 2007