
Post Hoc Tests in ANOVA

Shuqin Guo: In one of my analyses using SPSS ANOVA, I found that the overall F test was significant, but none of the pairwise comparisons of the groups was significant. I am asking whether other people have run into this situation using other statistical packages, and what causes the phenomenon.

Jiali: It is quite common to have a significant overall F-value without any significant post hoc pairwise comparisons. Two major factors may cause this to happen. First, the variance between groups may not really be big enough to make a difference; with unequal sample sizes for the groups, the small differences between groups may be cancelled out in the post hoc pairwise comparisons. Second, the total number of groups in your analysis also has an impact. The more groups you have with large variation in sample size, the harder it is to detect significant differences between groups, especially with rigorous and robust post hoc tests such as Scheffe and Bonferroni. If you think that there is really some difference between A and B, for instance, based on your theoretical perspective and the group means, you may wish to conduct a separate ANOVA using just these two groups to check your assumption.

Shuqin Guo: Thank you very much, Jiali, for your message. Your comments confirmed my initial suspicion that the unequal sample sizes for the groups might cause the phenomenon. However, I don't have any citation for it. Do you know of any? I also have a question about the separate pairwise comparison tests: doesn't it increase Type I error if we conduct such a test for each pair?

Jiali: In response to your email, I went back home and checked one reference book I have: Design and Analysis: A Researcher's Handbook (Third Edition) by Geoffrey Keppel (1991). Information provided in several chapters of the book is helpful in sorting out the problem: Chapter 3 (Variance Estimates and the Evaluation of the F Ratio), Chapter 4 (The Sensitivity of an Experiment: Effect Size and Power), Chapter 6 (Analytical Comparisons Among Treatment Means), and Chapter 8 (Correction for Cumulative Type I Error).

You are right that separate pairwise comparisons are likely to inflate the Type I error rate. However, we can choose a lower alpha level or reduce the number of groups to make the analysis possible. Again, depending on your theoretical perspective for the study, you may wish to use one of the groups as the control and then compare the combination of the other groups with the control group.

Meihua Zhai: I checked some other sources and came up with some different "suspicions". The two books I checked are Keppel, Geoffrey (1982), Design and Analysis: A Researcher's Handbook (2nd ed.), and Kirk, Roger E. (1982), Experimental Design: Procedures for the Behavioral Sciences (2nd ed.). Based on what I read, I think your problem might be due to the stringent nature of the Scheffe test, since this test is considered robust against unequal sample sizes.
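As a rough illustration of the workflow discussed above, here is a minimal Python sketch (assuming SciPy and statsmodels are installed; the group scores and sample sizes are hypothetical) that runs the overall one-way ANOVA and then Tukey HSD post hoc comparisons, which is one way to reproduce and inspect a significant F test that is not accompanied by any significant pairwise comparison:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Hypothetical scores for three groups with unequal sample sizes
    group_a = np.array([12, 14, 15, 13, 16, 14, 15, 13])
    group_b = np.array([15, 17, 16, 18, 16])
    group_c = np.array([16, 18, 17, 19, 18, 17])

    # Overall (omnibus) one-way ANOVA F test
    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f"Overall F = {f_stat:.3f}, p = {p_value:.4f}")

    # Post hoc pairwise comparisons (Tukey HSD)
    scores = np.concatenate([group_a, group_b, group_c])
    labels = ["A"] * len(group_a) + ["B"] * len(group_b) + ["C"] * len(group_c)
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05))

If the omnibus test comes out significant while none of the pairwise rows does, Jiali's suggestions above (a lower alpha, fewer groups, or a targeted contrast of one group against the others) can be examined in the same script, keeping the Type I error caveat in mind.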

(After some further study, Shuqin found that the phenomenon she encountered is called "dissonance". Further citations on post hoc tests are summarized by Shuqin Guo below.)

The following is summarized from Roger E. Kirk's (1982) Experimental Design (pp. 115-125).

Fisher's LSD (Least Significant Difference)

When subsequent tests are performed, the conceptual unit for the error rate is the individual comparison, which means the procedure does not control the error rate at a for the collection of tests. The use of the LSD test can lead to an anomalous situation in which the overall F statistic is significant, but none of the pairwise comparisons is significant. This situation can arise because the overall F test is equivalent to a simultaneous test of the hypothesis that all possible contrasts among means are equal to zero. The contrast that is significant, however, may involve some linear combination of means such as m1 - (m2 + m3)/2 rather than m1 - m2.
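A minimal sketch of the LSD critical difference for one pairwise comparison, assuming the usual equal-variance model and using SciPy's t distribution (the MS_error, error degrees of freedom, and group sizes below are hypothetical):

    import math
    from scipy import stats

    alpha = 0.05
    ms_error = 4.2    # hypothetical mean square error from the ANOVA table
    df_error = 27     # hypothetical error degrees of freedom
    n_i, n_j = 10, 8  # hypothetical sizes of the two groups being compared

    # Fisher's LSD: t critical value times the standard error of the mean difference
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)
    lsd = t_crit * math.sqrt(ms_error * (1 / n_i + 1 / n_j))
    print(f"A pairwise mean difference must exceed {lsd:.3f} to be declared significant")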

Tukey's HSD Test (Honestly Significant Difference)

Tukey's test requires that the n's in each treatment level be equal. The critical difference, y-hat(HSD), that a pairwise comparison must exceed to be declared significant is, according to Tukey's procedure,

y-hat(HSD) = q(a; p, v) sqrt(MS_error / n)

where q(a; p, v) is the critical value of the studentized range statistic for p means and v error degrees of freedom, and n is the common sample size per group.

A test of the overall null hypothesis that m1 = m2 = … = mp is provided by comparing the largest pairwise difference between means with the critical difference y-hat(HSD), which can be obtained from a table. This test procedure, which utilizes a range statistic, is an alternative to the overall F test. For most sets of data, the range and F tests lead to the same decision concerning the overall null hypothesis; however, the F test is generally more powerful.
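The HSD critical difference can also be computed directly from the studentized range distribution rather than read from a table; a hedged sketch in Python (SciPy 1.7 or later provides scipy.stats.studentized_range; the MS_error, p, v, and n values are hypothetical):

    import math
    from scipy.stats import studentized_range

    alpha = 0.05
    p = 4           # number of treatment means
    v = 36          # error degrees of freedom
    n = 10          # common sample size per group (Tukey requires equal n's)
    ms_error = 4.2  # hypothetical mean square error

    q_crit = studentized_range.ppf(1 - alpha, p, v)  # q(a; p, v)
    hsd = q_crit * math.sqrt(ms_error / n)
    print(f"HSD critical difference = {hsd:.3f}")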

Scheffe's Test

Scheffe's S procedure (1953) is one of the most flexible, conservative, and robust data-snooping procedures available. If the overall F statistic is significant, Scheffe's procedure can be used to evaluate all a posteriori contrasts among means, not just the pairwise comparisons. In addition, it can be used with unequal n's. The experimentwise error rate is equal to a for the infinite number of possible contrasts among p >= 3 means. Since an experimenter always evaluates only a subset of the possible contrasts, Scheffe's procedure tends to be conservative. It is much less powerful than Tukey's HSD procedure for evaluating pairwise comparisons, for example, and consequently is recommended only when complex contrasts are of interest. Scheffe's procedure uses the F sampling distribution and, like ANOVA, is robust with respect to nonnormality and heterogeneity of variance.
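Because Scheffe's procedure covers arbitrary contrasts and allows unequal n's, its critical value is sqrt((p - 1) F(a; p - 1, v)), which is multiplied by the standard error of the contrast. A small sketch under hypothetical numbers, testing the contrast m1 - (m2 + m3)/2 mentioned earlier:

    import math
    from scipy.stats import f

    alpha = 0.05
    p = 3                  # number of groups
    v = 27                 # error degrees of freedom
    ms_error = 4.2         # hypothetical mean square error
    n = [12, 9, 9]         # unequal group sizes are allowed
    c = [1.0, -0.5, -0.5]  # contrast coefficients for m1 - (m2 + m3)/2

    # Scheffe critical value and the standard error of the contrast
    s = math.sqrt((p - 1) * f.ppf(1 - alpha, p - 1, v))
    se = math.sqrt(ms_error * sum(ci ** 2 / ni for ci, ni in zip(c, n)))
    print(f"The contrast must exceed {s * se:.3f} in absolute value to be significant")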

Newman-Keuls Test

A different approach to evaluating a posteriori pairwise comparisons stems from the work of Student (1927), Newman (1939), and Keuls (1952). The Newman-Keuls procedure is based on a stepwise, or layer, approach to significance testing. Sample means are ordered from smallest to largest. The largest difference, which involves means that are r = p steps apart, is tested first at the a level of significance; if significant, means that are r = p - 1 steps apart are tested at the a level of significance, and so on. The Newman-Keuls procedure provides an r-mean significance level equal to a for each group of r ordered means; that is, the probability of falsely rejecting the hypothesis that all means in an ordered group are equal is a. It follows that the concept of error rate applies neither on an experimentwise nor on a per-comparison basis; the actual error rate falls somewhere between the two. The Newman-Keuls procedure, like Tukey's procedure, requires equal n's.

The critical difference, y-hat(W_r), that two means separated by r steps must exceed to be declared significant is, according to the Newman-Keuls procedure,

y-hat(W_r) = q(a; r, v) sqrt(MS_error / n)

It should be noted that the Newman-Keuls and Tukey procedures require the same critical difference for the first comparison that is tested. The Tukey procedure uses this critical difference for all of the remaining tests, while the Newman-Keuls procedure reduces the size of the critical difference depending on the number of steps separating the ordered means. As a result, the Newman-Keuls test is more powerful than Tukey's test. Remember, however, that the Newman-Keuls procedure does not control the experimentwise error rate at a.
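The stepwise layering can be sketched by computing W_r for each number of steps r; the critical difference shrinks as r decreases, which is where the extra power over Tukey's procedure comes from (SciPy's studentized_range is used; the MS_error, p, v, and n values are hypothetical):

    import math
    from scipy.stats import studentized_range

    alpha = 0.05
    p = 5           # number of ordered means
    v = 40          # error degrees of freedom
    n = 8           # common sample size per group (equal n's required)
    ms_error = 3.6  # hypothetical mean square error

    for r in range(p, 1, -1):  # r = p, p - 1, ..., 2
        q_crit = studentized_range.ppf(1 - alpha, r, v)  # q(a; r, v)
        w_r = q_crit * math.sqrt(ms_error / n)
        print(f"means {r} steps apart must differ by more than {w_r:.3f}")

For r = p the first line reproduces Tukey's HSD critical difference; the smaller values for r < p are what make the procedure more powerful and what cost it experimentwise error control.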

Frequently a test of the overall null hypothesis m1 = m2 = … = mp is performed with an F statistic in ANOVA rather than with a range statistic. If the F statistic is significant, Shaffer (1979) recommends using the critical difference y-hat(W_(r-1)) instead of y-hat(W_r) to evaluate the largest pairwise comparison at the first step of the testing procedure; the testing procedure for all subsequent steps is unchanged. She has shown that the modified procedure leads to greater power at the first step without affecting control of the Type I error rate. This makes dissonances, in which the overall null hypothesis is rejected by an F test without any of the proper subset comparisons being rejected, less likely.
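Continuing with the same hypothetical numbers as the Newman-Keuls sketch above, Shaffer's modification only changes the critical value used at the first step:

    import math
    from scipy.stats import studentized_range

    alpha, p, v, n, ms_error = 0.05, 5, 40, 8, 3.6  # same hypothetical values as above

    # Shaffer (1979): after a significant overall F test, evaluate the largest pairwise
    # difference against W_(p-1) instead of W_p; all later steps proceed as usual.
    w_first = studentized_range.ppf(1 - alpha, p - 1, v) * math.sqrt(ms_error / n)
    print(f"First-step critical difference under Shaffer's modification = {w_first:.3f}")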