Apples and Oranges: Comparing Reading Scores Across Tests

  • 03 April, 2014

I get this kind of question frequently from teachers who work with struggling readers, so I decided to respond publicly. What I say about these two tests would be true of others as well.

  I am a middle school reading teacher and have an issue that I'm hoping you could help me solve. My students' placements are increasingly bound to their standardized test results. I administer two types of standardized tests to assess the different areas of student reading ability. I use the Woodcock Reading Mastery Tests and the Terra Nova Test of Reading Comprehension. Often, my students' WRMT subtest scores are within the average range, while their Terra Nova results fall at the lower end of the average range or below. How can I clearly explain these discrepant results to my administrators? When they see average scores on one test they believe these students are no longer candidates for remedial reading services.

  Teachers are often puzzled by these kinds of testing discrepancies, but they can happen for a lot of reasons.

  Reading tests tend to be correlated with each other, but this kind of general performance agreement between two measures doesn’t mean that they would categorize student performance identically. Performing at the 35th percentile might give you a below average designation with one test, but an average one with the other. It is probably better to stay away from those designations and use normal curve equivalent (NCE) scores or something else that is comparable across the tests.
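
  To make the NCE suggestion concrete, here is a minimal sketch of converting a percentile rank to an NCE score. It assumes the standard definition of the NCE scale (mean 50, standard deviation 21.06, based on a normal distribution); the function name is only illustrative, and a test's own manual or conversion table should take precedence over this calculation.

```python
from statistics import NormalDist


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (between 0 and 100, exclusive) to an NCE score.

    Assumes the usual NCE scale: mean 50, standard deviation 21.06.
    """
    # z-score that has this share of the normal distribution below it
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + 21.06 * z


# Illustrative call for the 35th percentile mentioned above
print(round(percentile_to_nce(35), 1))
```

  Unlike percentile ranks, NCE scores are on an equal-interval scale, which is what makes them easier to set side by side across tests.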

  An important issue in test comparison is the norming samples that the tests use, and that is certainly the case with these two tests. Terra Nova has a very large and diverse nationally representative norming sample (about 200,000 kids), while the WRMT is based on a much smaller group that may be skewed a bit towards struggling students (only 2600 kids). When you say that someone is average or below average, you are comparing their performance with that of the norming group. Because of their extensiveness, I would trust the Terra Nova norms more than the WRMT ones; Terra Nova would likely give me a more accurate picture of where my students are compared to the national population. The WRMT is useful because it provides greater information about how well the kids are doing in particular skill areas, and it would help me to track growth in these skills.

  Another thing to think about is reliability. Find out the standard error of measurement of each test that you are giving and calculate 95% confidence intervals for the scores. Scores should be stated in terms of the range of performance that the score represents. Lots of times you will find that the confidence intervals of the two tests are so wide that they overlap. This would mean that though the score differences look big, they may not be reliably different. Let’s say that the standard error of one of the tests is 5 points (you need to look up the actual standard error in the manual), and that your student received a standard score of 100 on the test. The 95% confidence interval extends roughly two standard errors on either side of the score, so for this score it would be 90-110 (in other words, if the student took this test over and over, I would expect about 95% of his scores to fall in that range). Now say that the standard error for the other test was 8 and that the student’s score on that test was 120. That looks pretty discrepant, but the confidence interval for that one is 104-136. Because 90-110 (the confidence interval for the first test) overlaps with 104-136 (the confidence interval for the second test), these scores look very different and yet they cannot be reliably distinguished.
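
  The arithmetic in that example can be sketched in a few lines. This is only an illustration using the made-up numbers above (standard errors of 5 and 8, scores of 100 and 120); the function names are hypothetical, and the real standard errors have to come from each test's manual. Checking for overlap is a rough screen, not a formal significance test.

```python
def confidence_interval(score, standard_error, z=1.96):
    """Return the (low, high) band around an observed score.

    z=1.96 gives a 95% interval under a normal-error assumption.
    """
    margin = z * standard_error
    return (score - margin, score + margin)


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical values from the example in the text, not real test statistics
first_test = confidence_interval(100, 5)   # roughly 90 to 110
second_test = confidence_interval(120, 8)  # roughly 104 to 136

print(first_test, second_test, intervals_overlap(first_test, second_test))
```

  With smaller standard errors the same 20-point gap would no longer overlap, which is why the standard error matters as much as the scores themselves.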

  You mention the big differences in the tasks included in the two tests. These can definitely make a difference in performance. Since the WRMT is given so often to lower performing students, that test wouldn’t require especially demanding tasks to spread out performance, while the Terra Nova, given to a broader audience, would need a mix of easier and harder tasks (such as longer and more complex reading passages) to spread out student performance. These harder tasks push your kids lower in the group and may be so hard that it would be difficult to see short-term gains or improvements with such a test. The WRMT is often used to monitor gains, so it tends to be more sensitive to growth.

  You didn’t mention which edition of the tests you were administering. But these tests are revised from time to time and the revisions matter. The WRMT has a 2012 edition, but studies of previous versions of the test reveal big differences in performance from one edition to the other (despite the fact that the same test items were being used). The different versions of the test changed their norming samples and that altered the scores quite a bit (5-9 points). I think you would find the Terra Nova to have more stable scores, and yet, comparing them across editions might reveal similar score inflation.

  My advice is that when you want to show where students stand in the overall norm group, only use the Terra Nova data. Then use the WRMT to show where the students’ relative strengths and weaknesses are and to monitor growth in these skills. That means your message might be something like: “Tommy continues to perform at or near the 15th percentile when he is compared with his age mates across the country. Nevertheless, he has improved during the past three months in vocabulary and comprehension, though not enough to improve his overall position in the distribution.” In other words, his reading is improving and yet he remains behind 85% of his peers in these skills.


See what others have to say about this topic.

Anonymous Jun 15, 2017 12:12 PM


I am not sure if I should be happy or upset that I have dealt with similar circumstances when discussing placements for students. I have been in this situation numerous times. Our students are progress monitored each week for fluency and comprehension. I think this is a great way of recording progress and a great tool to use when determining groups. However, the passages change every week. So one week, the student may read one hundred words per minute and the next week only read seventy words per minute. The complexity of the passages is not always the same. Therefore, I don’t believe this alone should be a deciding factor in placement. Three times a year, our students take a test called Measures of Academic Progress, or MAP. This test determines the students’ ability in numerous reading skills. It grades the students as they take the test and they are given a numeric score at the end. There is also a cumulative numeric score for norms and percentiles. Many times, the students’ progress monitoring scores are compared with their MAP scores. This is another example of comparing apples and oranges. The two assessments are nowhere near the same and therefore not accurately comparable.

Timothy Shanahan Jun 15, 2017 12:12 PM


Weekly monitoring using the types of tests that you mention cannot be worthwhile. It is simply too much testing without any possibility that you can use the information. Students do not learn fast enough in those skills to change weekly in any way that the tests can accurately measure. The ratio of average developmental growth to standard error of measurement is too small--meaning that you will get meaningless numbers from that exercise that cannot be evaluated. I don't know the grade levels (you can do this more often early on), but I wouldn't retest fluency more than every couple of months.

JoAnn H. Thomas Jun 15, 2017 12:13 PM


I am in total agreement with the anonymous blogger, because I am a parent who has a lot of concerns about the socio-economically advantaged child. If one child comes from a very high-profile family and another child comes from a low-income family, the high-profile child will have an advantage over the low-income child, and that bothers me. It is like saying soda or pop: your comprehension and reading level all depend on where you came from.

JoAnn H. Thomas Jun 15, 2017 12:13 PM


I am in total agreement with the statement that “Reading tests tend to be correlated with each other, but this kind of general performance agreement between two measures doesn’t mean that they would categorize student performance identically.”
So my question is: if your low-performing students will always be at the bottom, how will we ever get them to the middle or even to the top if we are still basing their performance on reading scores? There has to be a better way.

Timothy Shanahan Jun 15, 2017 12:14 PM


Jo Ann--
The problem isn't the testing, it is the instruction and academic support. It would be problematic if we had reading tests that led to actual performance differences, but that isn't the case. That there is a group of students who consistently reads poorly is a serious problem and one that can only be solved through more teaching and better teaching.
