What to do when the dataset is skewed?
Nov. 2003
Meihua Zhai: Dear Statistical Gurus,
I have a simple question for you. Suppose someone wants to
do a factorial analysis of how student background variables
and one other course variable relate to mid/term grades.
We know that course grades are usually negatively skewed
(thanks to grade inflation). Since most of our statistical
methods assume a normal distribution, what should the
researcher do or pay attention to? Any insight will be
welcome, and I will summarize the discussion.
Thank you in advance.
Vincent Tong: Off the top of my head, I have two responses.
First, grade inflation does not necessarily mean that the grade distribution is no longer
normal. Second, nonparametric statistics can be used when the distribution is not normal
(e.g., with a pass/fail option). Hope this helps.
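As a rough illustration of the nonparametric route, here is a minimal sketch of a rank-based two-group comparison (the Mann-Whitney U test) in plain Python. The two groups and their grade points are made up for the example, the function names are hypothetical, and the p-value relies on a normal approximation, so treat it as a sketch rather than a recommended analysis. A library routine such as scipy.stats.mannwhitneyu would be the more usual choice in practice.

```python
import math
from collections import Counter
from statistics import NormalDist


def midranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U statistic for sample a, two-sided p-value).

    The p-value uses the normal approximation with a tie correction and
    no continuity correction, so it is rough for very small samples.
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p_value


# Hypothetical grade points (4 = A ... 1 = D) for two student groups.
group_a = [4, 4, 4, 4, 3, 4, 4, 3, 4, 2]
group_b = [4, 3, 3, 2, 4, 3, 2, 3, 1, 3]

u_stat, p_value = mann_whitney_u(group_a, group_b)
print(f"U = {u_stat:.1f}, two-sided p = {p_value:.3f}")
```

Because the test works on ranks only, it does not care how skewed the grades are. The price is that it compares two groups at a time rather than fitting a full factorial model.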
Meihua: I don't quite understand the part about grade inflation not
meaning that the grade distribution is non-normal. With grade inflation, 75% of the class might get an A,
for example, and the distribution will then be negatively skewed. Am I right?
Vincent Tong: You are right that the distribution in your scenario is not normal.
However, you may also have a situation like this: the class average is C and the distribution is normal.
For unknown reasons, a professor
raises the average to B and the shape of the distribution remains the same. Is this
grade inflation? Yes, it is. Is it normally distributed? Yes, it is.
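The two scenarios in this exchange can be checked numerically. The sketch below computes the moment skewness for a class where 75% receive an A, and for a symmetric set of scores before and after every score is raised by the same amount. All numbers are invented for illustration: the 15/7/3 split among B, C and D and the 10-point shift are assumptions, not figures from the discussion.

```python
from statistics import fmean


def skewness(xs):
    """Moment skewness m3 / m2**1.5; negative means a longer left tail."""
    mu = fmean(xs)
    m2 = fmean([(x - mu) ** 2 for x in xs])
    m3 = fmean([(x - mu) ** 3 for x in xs])
    return m3 / m2 ** 1.5


# Scenario 1: 75% of a class of 100 get an A (4 points).
# The split of the remaining 25 students is hypothetical.
top_heavy = [4] * 75 + [3] * 15 + [2] * 7 + [1] * 3

# Scenario 2: a symmetric set of percentage scores centred on 70,
# then every score raised by 10 points (hypothetical values).
original = [50, 55, 60, 65, 65, 70, 70, 70, 75, 75, 80, 85, 90]
curved = [score + 10 for score in original]

print("75% A class skewness:", round(skewness(top_heavy), 3))
print("original mean and skewness:", fmean(original), skewness(original))
print("curved mean and skewness:  ", fmean(curved), skewness(curved))
```

The first class comes out clearly negatively skewed, while adding a constant moves the mean and leaves the skewness unchanged. One caveat worth keeping in mind: if a shift pushes scores against the top of the grading scale, the shape no longer stays the same.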
Jing Su: I am not a statistical guru, but I would like to join the discussion as a newcomer.
I would first try some simple transformations of the course grades, such as log(Y) or 1/Y. If the residual plots and normal probability plots
do not support any of those transformations, I would try the Box-Cox procedure to automatically identify a better transformation
from the family of power transformations of this variable.
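To make the Box-Cox step concrete, here is a minimal sketch that searches a grid of power parameters and keeps the one with the highest profile log-likelihood. It simplifies the procedure described above in two ways: it transforms the grade variable on its own rather than inspecting regression residuals, and it uses a coarse grid instead of a numerical optimizer. The grades, the grid, and the function names are all hypothetical, and scipy.stats.boxcox would be a more standard tool for the same job.

```python
import math


def boxcox(y, lam):
    """Box-Cox power transform; every value in y must be positive."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y, lam):
    """Profile log-likelihood of lam, assuming the transformed data are normal."""
    z = boxcox(y, lam)
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n  # maximum-likelihood variance
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in y)


def best_lambda(y, grid):
    """Return the grid value with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(y, lam))


# Hypothetical left-skewed percentage grades.
grades = [96, 95, 94, 93, 92, 92, 91, 90, 90, 89,
          88, 87, 85, 84, 82, 79, 75, 68, 55]

# Candidate powers from -2.0 to 5.0 in steps of 0.1 (an arbitrary range).
grid = [i / 10 for i in range(-20, 51)]

lam_hat = best_lambda(grades, grid)
transformed = boxcox(grades, lam_hat)
print("selected lambda:", lam_hat)
```

In this family, lambda = 0 corresponds to log(Y) and lambda = -1 to a rescaled 1/Y, so the simple transformations mentioned above are special cases. Those two are usually associated with right-skewed data; for left-skewed grades one would generally expect the search to favor a power above 1, or one could reflect the variable before transforming.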
Shuqin Guo: My suggestion is to transform the data before you do the analysis. Generally speaking, a data transformation can take care
of non-normal data. Stata can tell you which transformation works best
if you run a normality check on your data. Please refer to the link below for various data transformation methods.
http://www.pfc.forestry.ca/profiles/wulder/mvstats/transform_e.html