OCAIR

Overseas Chinese Association for Institutional Research
An AIR Affiliate That Supports IR Professionals Since 1996

What to do when the dataset is skewed?

Nov. 2003

Meihua Zhai: Dear Statistical Gurus,
I have a simple question for you. Suppose someone wants to run a factorial analysis of how student background variables and one other course variable relate to midterm grades. We know that course grades are usually negatively skewed (thanks to grade inflation). Since most of our statistical methods assume a normal distribution, what should the researcher do or pay attention to? Any insight will be welcome, and I will summarize the discussion.

Thank you in advance.


Vincent Tong: Off the top of my head, I have two responses. First, grade inflation does not necessarily mean that the grades will not be normally distributed. Second, nonparametric statistics can be used when the distribution is not normal (e.g., with a pass/fail option). Hope this helps.
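
As a minimal sketch of the second point, here is the simplest nonparametric case: comparing grades between two groups with a Mann-Whitney U test, which relies on ranks rather than on normality. The scores, the group names, and the helper `mann_whitney_u` are made up for illustration, and the p-value uses a normal approximation without a tie correction, so it is only approximate.

```python
import math

def mann_whitney_u(a, b):
    """Return (U_a, U_b, approximate two-sided p-value).

    Ties count as half a win for each side. The variance is not
    tie-corrected, so the p-value is only an approximation.
    """
    n_a, n_b = len(a), len(b)
    # U_a counts the pairs (x, y) in which x from `a` beats y from `b`.
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = n_a * n_b - u_a
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (min(u_a, u_b) - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return u_a, u_b, p

# Hypothetical midterm scores (0-100 scale) for two student groups.
group_1 = [95, 92, 97, 88, 99, 94, 91, 96]
group_2 = [85, 90, 78, 93, 82, 89, 74, 87]

u_1, u_2, p_value = mann_whitney_u(group_1, group_2)
print(f"U1={u_1}, U2={u_2}, approximate p={p_value:.4f}")
```

With more than two groups, the analogous rank-based choice would be a Kruskal-Wallis test.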


Meihua: I don't quite understand the part about grade inflation not meaning that the grades will not be normally distributed. Because of grade inflation, 75% of the class will get an A, for example, and the distribution will be negatively skewed. Am I right?


Vincent Tong: In your scenario, the distribution is certainly not normal. However, you may have a situation like this: the class average is C and the distribution is normal. For unknown reasons, a professor raises the average to B and the shape of the distribution remains the same. Is this grade inflation? Yes, it is. Is it normally distributed? Yes, it is.
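
The shift scenario can be checked with a small sketch: adding the same number of points to every score moves the mean but leaves the skewness untouched. The scores and the `skewness` helper are hypothetical, and the sketch assumes no score is pushed into the top of the scale.

```python
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical symmetric scores centered on a C (75 on a 0-100 scale).
original = [60, 65, 70, 70, 75, 75, 75, 80, 80, 85, 90]
# "Inflate" every score by the same 10 points, moving the average to a B.
inflated = [score + 10 for score in original]

print("means:", statistics.fmean(original), statistics.fmean(inflated))
print("skewness:", skewness(original), skewness(inflated))
```

If the inflation instead piles scores up against the maximum grade, as in the 75%-get-an-A example, the shape does change and the skew appears.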


Jing Su: I am not a statistical guru, but I would like to join the discussion as a newcomer. I would try some simple transformations of the course grades, such as log(Y), 1/Y, etc. If the residual plots and normal probability plots do not support any of those transformations, I would try the Box-Cox procedure to automatically identify a better transformation of this variable from the family of power transformations.
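
A rough sketch of the Box-Cox idea, assuming strictly positive grades: transform Y as (Y^lambda - 1) / lambda (or log(Y) when lambda = 0) and pick the lambda that maximizes the normal profile log-likelihood. The grades, the grid limits, and all function names below are made up for illustration, and a plain grid search stands in for a proper optimizer.

```python
import math

def box_cox(values, lam):
    """Box-Cox power transform; requires strictly positive values."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def profile_log_likelihood(values, lam):
    """Normal profile log-likelihood of lambda, up to an additive constant."""
    n = len(values)
    z = box_cox(values, lam)
    mean = sum(z) / n
    variance = sum((t - mean) ** 2 for t in z) / n
    jacobian = (lam - 1) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + jacobian

def best_lambda(values, low=-2.0, high=10.0, step=0.1):
    """Grid search for the lambda with the highest profile log-likelihood."""
    count = int(round((high - low) / step)) + 1
    grid = [round(low + i * step, 10) for i in range(count)]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Hypothetical negatively skewed grades on a 0-100 scale.
grades = [55, 62, 70, 75, 80, 84, 86, 88, 90, 91,
          92, 93, 94, 95, 95, 96, 97, 97, 98, 99]

lam = best_lambda(grades)
skew_before = skewness(grades)
skew_after = skewness(box_cox(grades, lam))
print(f"lambda={lam}, skewness before={skew_before:.2f}, after={skew_after:.2f}")
```

In this family, log(Y) and 1/Y correspond to lambda = 0 and lambda = -1, while negatively skewed data would be expected to favor a lambda above 1. If the chosen lambda lands on the edge of the grid, widen the grid. Libraries such as SciPy offer a ready-made version (`scipy.stats.boxcox`).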


Shuqin Guo: My suggestion is to transform the data before you do the analysis. Generally speaking, a data transformation can take care of non-normal data. If you run a normality check on your data, Stata can tell you which transformation works best. Please refer to this link for various data transformation methods: http://www.pfc.forestry.ca/profiles/wulder/mvstats/transform_e.html