imported>Geysirhead: removed Category:Statistical hypothesis testing using HotCat redudant

2025-06-07T18:14:26Z

{{Short description|Statistical interpretation with many tests}}
[[File:Spurious correlations - spelling bee spiders.svg|thumb|upright=1.6|An example of coincidence produced by [[data dredging]] (uncorrected multiple comparisons) showing a correlation between the number of letters in a spelling bee's winning word and the number of people in the United States killed by venomous spiders. Given a large enough pool of variables for the same time period, it is possible to find a pair of graphs that show a [[spurious correlation]].]]
The '''multiple comparisons''', '''multiplicity''' or '''multiple testing problem''' occurs in [[statistics]] when one considers a set of [[statistical inference]]s simultaneously<ref>{{cite book | last=Miller | first=R.G. | year=1981 | title=Simultaneous Statistical Inference 2nd Ed | publisher=Springer Verlag New York | isbn=978-0-387-90548-8}}</ref> or [[Estimation theory|estimates]] a subset of parameters selected based on the observed values.<ref>{{cite journal | journal=Biometrical Journal | title=Simultaneous and selective inference: Current successes and future challenges | year=2010 | volume=52 | last=Benjamini | first=Y. | pages=708–721 | doi=10.1002/bimj.200900299 | issue=6 | pmid=21154895| s2cid=8806192 }}</ref>

The larger the number of inferences made, the more likely erroneous inferences become. Several statistical techniques have been developed to address this problem, for example, by requiring a [[Bonferroni correction|stricter significance threshold]] for individual comparisons, so as to compensate for the number of inferences being made. Methods that control the [[family-wise error rate]] bound the probability of making at least one false positive across the whole family of comparisons.

==History==
The problem of multiple comparisons received increased attention in the 1950s with the work of statisticians such as [[Tukey]] and [[Scheffé]]. Over the ensuing decades, many procedures were developed to address the problem. In 1996, the first international conference on multiple comparison procedures took place in [[Tel Aviv]].<ref>{{cite web |url=http://www.mcp-conference.org/ |title=Home |website=mcp-conference.org}}</ref> This is an active research area, with work being done by, for example, [[Emmanuel Candès]] and [[Vladimir Vovk]].

==Definition==
[[File:Multiple binomial testing.svg|thumb|upright=1.6|Production of a small p-value by multiple testing. <br>30 samples of 10 dots of random color (blue or red) are observed. On each sample, a two-tailed [[binomial test]] of the null hypothesis that blue and red are equally probable is performed. The first row shows the possible p-values as a function of the number of blue and red dots in the sample. <br>Although the 30 samples were all simulated under the null, one of the resulting p-values is small enough to produce a false rejection at the typical level 0.05 in the absence of correction.]]
Multiple comparisons arise when a statistical analysis involves multiple simultaneous statistical tests, each of which has a potential to produce a "discovery". A stated confidence level generally applies only to each test considered individually, but often it is desirable to have a confidence level for the whole family of simultaneous tests.<ref>{{cite book |last1=Kutner |first1=Michael |last2=Nachtsheim |first2=Christopher |last3=Neter |first3=John |author-link3=John Neter |last4=Li |first4=William |date=2005 |title=Applied Linear Statistical Models |url=https://archive.org/details/appliedlinearsta00kutn_164 |url-access=limited |pages=[https://archive.org/details/appliedlinearsta00kutn_164/page/n782 744]–745|publisher=McGraw-Hill Irwin |isbn=9780072386882 }}</ref> Failure to compensate for multiple comparisons can have important real-world consequences, as illustrated by the following examples:

* Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes increasingly likely that the treatment and control groups will appear to differ on at least one attribute due to random [[sampling error]] alone.
* Suppose we consider the efficacy of a [[Pharmacology|drug]] in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes increasingly likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom.

In both examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison.

For example, if one test is performed at the 5% level and the corresponding null hypothesis is true, there is only a 5% risk of incorrectly rejecting the null hypothesis. However, if 100 tests are each conducted at the 5% level and all corresponding null hypotheses are true, the [[expected number]] of incorrect rejections (also known as [[false positive]]s or [[Type I error]]s) is 5. If the tests are statistically independent from each other (i.e. are performed on independent samples), the probability of at least one incorrect rejection is approximately 99.4%.

The multiple comparisons problem also applies to [[confidence intervals]]. A single confidence interval with a 95% [[coverage probability]] level will contain the true value of the parameter in 95% of samples. However, if one considers 100 confidence intervals simultaneously, each with 95% coverage probability, the expected number of non-covering intervals is 5. If the intervals are statistically independent from each other, the probability that at least one interval does not contain the population parameter is approximately 99.4%.
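The figures in the two preceding paragraphs (an expected 5 errors, and a chance of roughly 99.4% of at least one error) can be checked numerically. The following Python sketch is illustrative only: it assumes that every null hypothesis is true, that the tests are independent, and that each p-value is then uniformly distributed on [0, 1]; the function name <code>simulate_fwer</code> and the number of simulated families are arbitrary choices made here.

```python
import random

def simulate_fwer(alpha: float, m: int, n_families: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(at least one false rejection) among m
    independent level-alpha tests when all null hypotheses are true.

    Assumption: under a true null a continuous p-value is uniform on [0, 1],
    so each test is simulated as a single uniform draw.
    """
    rng = random.Random(seed)
    families_with_error = 0
    for _ in range(n_families):
        if any(rng.random() < alpha for _ in range(m)):
            families_with_error += 1
    return families_with_error / n_families

alpha, m = 0.05, 100
expected_errors = alpha * m           # expected number of false rejections
exact = 1.0 - (1.0 - alpha) ** m      # about 0.994 under independence
estimate = simulate_fwer(alpha, m, n_families=2000)
print(expected_errors, round(exact, 4), estimate)
```

The same arithmetic applies to 100 independent 95% confidence intervals, with "false rejection" replaced by "interval that misses the true parameter".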

Techniques have been developed to prevent the inflation of false positive rates and non-coverage rates that occurs with multiple statistical tests.

===Classification of multiple hypothesis tests{{anchor|Classification of ''m'' hypothesis tests}}===

{{Classification of multiple hypothesis tests}}

==Controlling procedures==
{{further|Family-wise error rate#Controlling procedures}}
{{see also|False coverage rate#Controlling procedures|False discovery rate#Controlling procedures}}

{{Image frame
|content ={{Graph:Chart|width=300|height=100|type=line|x=1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49|y=0.050000000000000044, 0.09750000000000003, 0.1426250000000001, 0.18549375000000012, 0.22621906250000023, 0.2649081093750002, 0.3016627039062503, 0.33657956871093775, 0.3697505902753909, 0.4012630607616213, 0.43119990772354033, 0.45963991233736334, 0.4866579167204952, 0.5123250208844705, 0.536708769840247, 0.5598733313482347, 0.5818796647808229, 0.6027856815417818, 0.6226463974646927, 0.6415140775914581, 0.6594383737118852, 0.676466455026291, 0.6926431322749764, 0.7080109756612276, 0.7226104268781662, 0.7364799055342579, 0.7496559102575451, 0.7621731147446679, 0.7740644590074345, 0.7853612360570628, 0.7960931742542097, 0.8062885155414992, 0.8159740897644242, 0.8251753852762029, 0.8339166160123929, 0.8422207852117732, 0.8501097459511846, 0.8576042586536253, 0.8647240457209441, 0.8714878434348969, 0.877913451263152, 0.8840177786999944, 0.8898168897649947, 0.895326045276745, 0.9005597430129078, 0.9055317558622624, 0.9102551680691493, 0.9147424096656918, 0.9190052891824072|yAxisMin=0
|xAxisTitle=m|yAxisTitle=P(at least 1 H_0 is wrongly rejected)}}
|caption = Probability that at least one null hypothesis is wrongly rejected, for <math>\alpha_\text{per comparison}=0.05</math>, as a function of the number of independent tests <math>m</math>.
|width=300
}}

===Multiple testing correction===
{{anchor|Correction}}
{{cleanup merge|21=section|Multiple testing correction|date=April 2016}}
'''Multiple testing correction''' refers to making statistical tests more stringent in order to counteract the problem of multiple testing. The best known such adjustment is the [[Bonferroni correction]], but other methods have been developed. Such methods are typically designed to control the [[family-wise error rate]] or the [[false discovery rate]].

If ''m'' independent comparisons are each performed at level <math>\alpha_{\{\text{per comparison}\}}</math> and all null hypotheses are true, the ''[[family-wise error rate]]'' (FWER) is given by

:<math> \bar{\alpha} = 1-\left( 1-\alpha_{\{\text{per comparison}\}} \right)^m.</math>

Hence, unless the tests are perfectly positively dependent (i.e., identical), <math>\bar{\alpha}</math> increases as the number of comparisons increases.
If we do not assume that the comparisons are independent, then we can still say:

:<math> \bar{\alpha} \le m \cdot \alpha_{\{\text{per comparison}\}},</math>

which follows from [[Boole's inequality]]. For example, with <math>m=6</math> and <math>\alpha_{\{\text{per comparison}\}}=0.05</math>: <math>1-(1-0.05)^6 \approx 0.2649 \le 0.05 \times 6 = 0.3</math>.
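As a rough numerical illustration, the exact family-wise error rate under independence can be tabulated next to the bound from Boole's inequality. The sketch below is in Python; the function names are chosen here for illustration, and the bound is capped at 1 because a probability cannot exceed 1.

```python
def fwer_independent(alpha_pc: float, m: int) -> float:
    """Exact FWER for m independent tests, each at level alpha_pc,
    when all null hypotheses are true."""
    return 1.0 - (1.0 - alpha_pc) ** m

def boole_bound(alpha_pc: float, m: int) -> float:
    """Upper bound m * alpha_pc from Boole's inequality; it needs no
    independence assumption. Capped at 1 since it bounds a probability."""
    return min(1.0, m * alpha_pc)

alpha_pc = 0.05
for m in (1, 6, 20, 49):
    exact = fwer_independent(alpha_pc, m)
    bound = boole_bound(alpha_pc, m)
    print(f"m={m:2d}  exact={exact:.4f}  bound={bound:.4f}")
```

Evaluating <code>fwer_independent(0.05, m)</code> for <math>m = 1, \dots, 49</math> gives the curve plotted in the chart above. The gap between the two columns widens as <math>m</math> grows, which is one way to see why the Bonferroni-type bound becomes conservative for many tests.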

There are different ways to ensure that the family-wise error rate is at most <math>\alpha</math>. The most conservative method, which is free of dependence and distributional assumptions, is the [[Bonferroni correction]] <math>\alpha_{\{\text{per comparison}\}}={\alpha}/m</math>. A marginally less conservative correction can be obtained by solving the equation for the family-wise error rate of <math>m</math> independent comparisons for <math>\alpha_{\{\text{per comparison}\}}</math>. This yields <math>\alpha_{\{\text{per comparison}\}} = 1-{(1-{\alpha})}^{1/m}</math>, which is known as the [[Šidák correction]]. Another procedure is the [[Holm–Bonferroni method]], which is uniformly at least as powerful as the simple Bonferroni correction: it tests only the lowest p-value (<math>i=1</math>) against the strictest criterion, and the higher p-values (<math>i>1</math>) against progressively less strict criteria, <math>\alpha_{\{\text{per comparison}\}}={\alpha}/(m-i+1)</math>, where <math>i</math> is the rank of the p-value in ascending order.<ref>{{cite journal | last1 = Aickin | first1 = M | last2 = Gensler | first2 = H | title = Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods | journal = Am J Public Health | volume = 86| pages = 726–728 | doi=10.2105/ajph.86.5.726 | pmid=8629727 | date=May 1996 | pmc=1380484 | issue=5}}</ref>
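The three corrections can be compared on a small example. The Python sketch below is a minimal illustration rather than a reference implementation: the p-values are made up, the function names are chosen here, and the Holm step-down rule is written in its usual form (stop at the first p-value that fails its criterion).

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-comparison level alpha / m."""
    return alpha / m

def sidak_threshold(alpha: float, m: int) -> float:
    """Per-comparison level 1 - (1 - alpha)^(1/m); assumes independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def holm_reject(p_values: list[float], alpha: float) -> list[bool]:
    """Holm-Bonferroni step-down: compare the i-th smallest p-value with
    alpha / (m - i + 1) and stop at the first one that is not rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    reject = [False] * m
    for i, j in enumerate(order, start=1):
        if p_values[j] <= alpha / (m - i + 1):
            reject[j] = True
        else:
            break  # all larger p-values are retained as well
    return reject

p_values = [0.01, 0.04, 0.02, 0.005]   # hypothetical p-values
alpha = 0.05
m = len(p_values)

bonferroni = [p <= bonferroni_threshold(alpha, m) for p in p_values]
holm = holm_reject(p_values, alpha)
print(bonferroni_threshold(alpha, m), sidak_threshold(alpha, m))
print(bonferroni)
print(holm)
```

On these made-up p-values the Bonferroni correction rejects only the two smallest, while the Holm procedure rejects all four, which illustrates the gain in power at the same family-wise level.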

For continuous problems, one can employ [[Bayesian statistics|Bayesian]] logic to compute <math>m</math> from the prior-to-posterior volume ratio. Continuous generalizations of the [[Bonferroni correction|Bonferroni]] and [[Šidák correction]]s have also been proposed.<ref name="Bayer2020">{{cite journal |first1=Adrian E. |last1=Bayer | first2=Uroš| last2=Seljak | title=The look-elsewhere effect from a unified Bayesian and frequentist perspective |journal=[[Journal of Cosmology and Astroparticle Physics]] |volume=2020 |issue=10 |pages=009|year=2020 |arxiv = 2007.13821 | url=https://doi.org/10.1088%2F1475-7516%2F2020%2F10%2F009 |doi=10.1088/1475-7516/2020/10/009 |bibcode=2020JCAP...10..009B |s2cid=220830693 }}</ref>

==Large-scale multiple testing==
Traditional methods for multiple comparisons adjustments focus on correcting for modest numbers of comparisons, often in an [[analysis of variance]]. A different set of techniques has been developed for "large-scale multiple testing", in which thousands or even greater numbers of tests are performed. For example, in [[genomics]], when using technologies such as [[DNA microarray|microarray]]s, expression levels of tens of thousands of genes can be measured, and genotypes for millions of genetic markers can be measured. Particularly in the field of [[genetic association]] studies, there has been a serious problem with non-replication, in which a result is strongly statistically significant in one study but fails to be replicated in a follow-up study. Such non-replication can have many causes, but it is widely considered that failure to fully account for the consequences of making multiple comparisons is one of the causes.<ref>{{Cite journal|last1=Qu|first1=Hui-Qi|last2=Tien|first2=Matthew|last3=Polychronakos|first3=Constantin|date=2010-10-01|title=Statistical significance in genetic association studies|journal=Clinical and Investigative Medicine|volume=33|issue=5|pages=E266–E270|issn=0147-958X|pmc=3270946|pmid=20926032}}</ref> It has been argued that advances in [[measurement]] and [[information technology]] have made it far easier to generate large datasets for [[exploratory data analysis|exploratory analysis]], often leading to the testing of large numbers of hypotheses with no prior basis for expecting many of the hypotheses to be true. In this situation, very high [[false positive rate]]s are expected unless multiple comparisons adjustments are made.

For large-scale testing problems where the goal is to provide definitive results, the [[family-wise error rate]] remains the most accepted parameter for ascribing significance levels to statistical tests. Alternatively, if a study is viewed as exploratory, or if significant results can be easily re-tested in an independent study, control of the [[false discovery rate]] (FDR)<ref>{{cite journal | last=Benjamini | first=Yoav |author2=Hochberg, Yosef | year=1995 | title=Controlling the false discovery rate: a practical and powerful approach to multiple testing | journal=[[Journal of the Royal Statistical Society, Series B]] | volume=57 | pages=125–133 | issue=1 | jstor=2346101}}</ref><ref>{{cite journal | last=Storey | first=JD |author2=Tibshirani, Robert | year=2003 | title=Statistical significance for genome-wide studies | journal=PNAS | volume=100 | pages=9440–9445 | doi=10.1073/pnas.1530509100 | pmid=12883005 | issue=16 | pmc=170937 | jstor=3144228| bibcode=2003PNAS..100.9440S | doi-access=free }}</ref><ref>{{cite journal | last=Efron | first=Bradley |author2=Tibshirani, Robert |author3=Storey, John D. |author4= Tusher, Virginia | journal=[[Journal of the American Statistical Association]] | volume=96 | issue=456 | year=2001 | pages=1151–1160 | title=Empirical Bayes analysis of a microarray experiment | doi=10.1198/016214501753382129 | jstor=3085878| s2cid=9076863 }}</ref> is often preferred. The FDR, loosely defined as the expected proportion of false positives among all significant tests, allows researchers to identify a set of "candidate positives" that can be more rigorously evaluated in a follow-up study.<ref>{{Cite journal|last=Noble|first=William S.|date=2009-12-01|title=How does multiple testing correction work?|journal=Nature Biotechnology|language=en|volume=27|issue=12|pages=1135–1137|doi=10.1038/nbt1209-1135|issn=1087-0156|pmc=2907892|pmid=20010596}}</ref>
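The text above does not spell out a specific FDR-controlling procedure. As one commonly used example, the Benjamini–Hochberg step-up rule from the first paper cited above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the p-values are made up, the function name is chosen here, and the usual guarantee of FDR control at level <math>q</math> assumes independent (or positively dependent) test statistics.

```python
def benjamini_hochberg(p_values: list[float], q: float) -> list[bool]:
    """Benjamini-Hochberg step-up rule at target FDR level q.

    Find the largest rank k with p_(k) <= k * q / m (p-values sorted in
    ascending order) and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * q / m:
            k = rank  # keep the largest rank that meets its criterion
    reject = [False] * m
    for j in order[:k]:
        reject[j] = True
    return reject

p_values = [0.02, 0.9, 0.03, 0.021]   # hypothetical p-values
print(benjamini_hochberg(p_values, q=0.05))
```

In this made-up example the smallest p-value (0.02) misses its own criterion (0.0125) but is still rejected, because larger p-values meet theirs; a Bonferroni correction at level 0.05 would reject none of the four. The resulting set plays the role of the "candidate positives" described above.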

The practice of trying many unadjusted comparisons in the hope of finding a significant one, whether applied unintentionally or deliberately, is a known problem and is sometimes called "[[p-hacking]]".<ref name="Deming">{{Cite journal
|author = Young, S. S., Karr, A.
|title = Deming, data and observational studies
|journal = Significance
|volume = 8
|issue = 3
|pages = 116–120
|year = 2011
|url = http://www.niss.org/sites/default/files/Young%20Karr%20Obs%20Study%20Problem.pdf|doi = 10.1111/j.1740-9713.2011.00506.x
|doi-access = free
}}
</ref><ref name="bmj02">
{{Cite journal
|author = Smith, G. D., Shah, E.
|title = Data dredging, bias, or confounding
|journal = BMJ
|volume = 325
|year = 2002
|pmc = 1124898
|doi = 10.1136/bmj.325.7378.1437
|pmid=12493654
|issue=7378
|pages=1437–1438}}
</ref>

===Assessing whether any alternative hypotheses are true===
[[Image:quantile meta test.svg|thumb|325px|A [[Q–Q plot|normal quantile plot]] for a simulated set of test statistics that have been standardized to be [[standard score|Z-scores]] under the null hypothesis. The departure of the upper tail of the distribution from the expected trend along the diagonal is due to the presence of substantially more large test statistic values than would be expected if all null hypotheses were true. The red point corresponds to the fourth largest observed test statistic, which is 3.13, versus an expected value of 2.06. The blue point corresponds to the fifth smallest test statistic, which is -1.75, versus an expected value of -1.96. The graph suggests that it is unlikely that all the null hypotheses are true, and that most or all instances of a true alternative hypothesis result from deviations in the positive direction.]]

A basic question faced at the outset of analyzing a large set of testing results is whether there is evidence that any of the alternative hypotheses are true. One simple meta-test that can be applied when it is assumed that the tests are independent of each other is to use the [[Poisson distribution]] as a model for the number of significant results at a given level α that would be found when all null hypotheses are true.{{citation needed|date=June 2016}} If the observed number of positives is substantially greater than what should be expected, this suggests that there are likely to be some true positives among the significant results.

For example, if 1000 independent tests are performed, each at level α = 0.05, we expect 0.05 × 1000 = 50 significant tests to occur when all null hypotheses are true. Based on the Poisson distribution with mean 50, the probability of observing more than 62 significant tests is less than 0.05, so if more than 62 significant results are observed, it is very likely that some of them correspond to situations where the alternative hypothesis holds. A drawback of this approach is that it overstates the evidence that some of the alternative hypotheses are true when the [[test statistic]]s are positively correlated, which commonly occurs in practice.{{citation needed|date=August 2012}} On the other hand, the approach remains valid even in the presence of correlation among the test statistics, as long as the Poisson distribution can be shown to provide a good approximation for the number of significant results. This scenario arises, for instance, when mining significant frequent itemsets from transactional datasets. Furthermore, a careful two stage analysis can bound the FDR at a pre-specified level.<ref>{{cite journal | last1 = Kirsch | first1 = A | last2 = Mitzenmacher | first2 = M | author2-link = Michael Mitzenmacher | last3 = Pietracaprina | first3 = A | last4 = Pucci | first4 = G | last5 = Upfal | first5 = E | author5-link = Eli Upfal | last6 = Vandin | first6 = F | title = An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets | journal = Journal of the ACM | volume = 59 | issue = 3 | pages = 12:1–12:22 | doi=10.1145/2220357.2220359 | date=June 2012| arxiv = 1002.1104 }}</ref>
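The Poisson tail probability used in this meta-test can be computed directly for any cut-off. The Python sketch below uses only the standard library; the function names are chosen here for illustration, and the Poisson model itself rests on the independence assumption discussed above.

```python
import math

def poisson_tail(k: int, mean: float) -> float:
    """P(X > k) for X ~ Poisson(mean), obtained by summing the
    probability mass function from 0 to k and subtracting from 1."""
    term = math.exp(-mean)        # P(X = 0)
    cdf = term
    for x in range(1, k + 1):
        term *= mean / x          # P(X = x) from P(X = x - 1)
        cdf += term
    return 1.0 - cdf

def smallest_cutoff(mean: float, level: float) -> int:
    """Smallest k such that P(X > k) < level."""
    k = 0
    while poisson_tail(k, mean) >= level:
        k += 1
    return k

m, alpha = 1000, 0.05
mean = m * alpha                  # expected number of significant tests under the global null
cutoff = smallest_cutoff(mean, level=0.05)
print(cutoff, poisson_tail(cutoff, mean))
```

Observing more significant results than <code>cutoff</code> would then be taken as evidence against the hypothesis that all null hypotheses are true.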

Another common approach that can be used in situations where the [[test statistic]]s can be standardized to [[standard score|Z-scores]] is to make a [[Q–Q plot|normal quantile plot]] of the test statistics. If the observed quantiles are markedly more [[statistical dispersion|dispersed]] than the normal quantiles, this suggests that some of the significant results may be true positives.{{citation needed|date=January 2012}}
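The coordinates of such a normal quantile plot can be computed without any plotting library. The Python sketch below is illustrative only: the data are simulated (950 null statistics plus 50 shifted upwards by 3, an arbitrary choice made here), and the plotting positions <math>(i - 0.5)/n</math> are one common convention among several.

```python
import random
from statistics import NormalDist

def normal_qq_points(z_scores: list[float]) -> list[tuple[float, float]]:
    """Pair each sorted statistic with the standard-normal quantile
    expected at its rank, using plotting positions (i - 0.5) / n."""
    n = len(z_scores)
    std_normal = NormalDist()     # mean 0, standard deviation 1
    observed = sorted(z_scores)
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))

rng = random.Random(1)
# Hypothetical data: 950 true nulls and 50 true alternatives.
z = [rng.gauss(0, 1) for _ in range(950)] + [rng.gauss(3, 1) for _ in range(50)]

points = normal_qq_points(z)
expected_top, observed_top = points[-1]
print(round(expected_top, 2), round(observed_top, 2))
```

If all null hypotheses were true, the points would lie close to the diagonal; an upper tail in which the observed values clearly exceed the expected ones, as in this simulated example, is the pattern described above.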

==See also==
*[[Q-value (statistics)|''q''-value]]

;Key concepts
*[[Family-wise error rate]]
*[[False positive rate]]
*[[False discovery rate]] (FDR)
*[[False coverage rate]] (FCR)
*[[Interval estimation]]
*[[Post-hoc analysis]]
*[[Experimentwise error rate]]
*[[Statistical hypothesis testing]]

;General methods of alpha adjustment for multiple comparisons
*[[Closed testing procedure]]
*[[Bonferroni correction]]
*Boole–[[Bonferroni bound]]
*[[Duncan's new multiple range test]]
*[[Holm–Bonferroni method]]
*[[Harmonic mean p-value]] procedure
*[[Benjamini–Hochberg procedure]]
*[[E-values]]

;Related concepts
*[[Testing hypotheses suggested by the data]]
*[[Texas sharpshooter fallacy]]
*[[Model selection]]
*[[Look-elsewhere effect]]
*[[Data dredging]]

==References==
{{Reflist|30em}}

==Further reading==
* F. Bretz, T. Hothorn, P. Westfall (2010), ''Multiple Comparisons Using R'', CRC Press
* [[Sandrine Dudoit|S. Dudoit]] and M. J. van der Laan (2008), ''Multiple Testing Procedures with Application to Genomics'', Springer
* {{cite journal | last1 = Farcomeni | first1 = A. | year = 2008 | title = A Review of Modern Multiple Hypothesis Testing, with particular attention to the false discovery proportion | journal = Statistical Methods in Medical Research | volume = 17 | issue = 4 | pages = 347–388 | doi = 10.1177/0962280206079046 | pmid = 17698936 | hdl = 11573/142139 | s2cid = 12777404 }}
* {{cite journal | last1 = Phipson | first1 = B. | last2 = Smyth | first2 = G. K. | year = 2010 | title = Permutation P-values Should Never Be Zero: Calculating Exact P-values when Permutations are Randomly Drawn | journal = Statistical Applications in Genetics and Molecular Biology | volume = 9 | pages = Article39 | doi = 10.2202/1544-6115.1585 | pmid = 21044043 | arxiv = 1603.05766 | s2cid = 10735784 }}
* P. H. Westfall and S. S. Young (1993), ''Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment'', Wiley
* P. Westfall, R. Tobias, R. Wolfinger (2011) ''Multiple comparisons and multiple testing using SAS'', 2nd edn, SAS Institute
* [http://www.tylervigen.com/spurious-correlations A gallery of examples of implausible correlations sourced by data dredging]
* [https://xkcd.com/882/] An ''[[xkcd]]'' comic about the multiple comparisons problem, using jelly beans and acne as an example
{{Experimental design}}
{{Statistics}}

[[Category:Multiple comparisons| ]]
