
Thursday, April 3, 2014

Apples and Oranges: Comparing Reading Scores across Tests

I get this kind of question frequently from teachers who work with struggling readers, so I decided to respond publicly. What I say about these two tests would be true of others as well.

I am a middle school reading teacher and have an issue that I’m hoping you could help me solve. My students’ placements are increasingly bound to their standardized test results. I administer two types of standardized tests to assess the different areas of student reading ability. I use the Woodcock Reading Mastery Tests and the Terra Nova Test of Reading Comprehension. Often, my students’ WRMT subtest scores are within the average range, while their Terra Nova results fall at the lower end of the average range or below. How can I clearly explain these discrepant results to my administrators? When they see average scores on one test, they believe these students are no longer candidates for remedial reading services.

Teachers are often puzzled by these kinds of testing discrepancies, but they can happen for a lot of reasons.

Reading tests tend to be correlated with each other, but this kind of general agreement between two measures doesn’t mean that they would categorize student performance identically. Performing at the 35th percentile might earn a below-average designation on one test but an average one on the other. It is probably better to stay away from those designations altogether and to use NCE scores (normal curve equivalents) or some other metric that is comparable across tests.
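
That conversion is easy to make concrete. Here is a minimal sketch in Python of turning a percentile rank into an NCE score; the function name and example values are mine, though the formula itself (NCE = 50 + 21.06z) is the standard definition of the normal curve equivalent:

    from statistics import NormalDist

    def percentile_to_nce(percentile_rank: float) -> float:
        """Convert a percentile rank (1-99) to a Normal Curve Equivalent score.

        NCE = 50 + 21.06 * z, where z is the normal deviate corresponding to
        the percentile; the constant 21.06 makes the 1st, 50th, and 99th
        percentiles come out to NCEs of 1, 50, and 99.
        """
        z = NormalDist().inv_cdf(percentile_rank / 100.0)
        return 50 + 21.06 * z

    # The 35th percentile mentioned above, plus two reference points
    print(round(percentile_to_nce(35), 1))  # about 41.9
    print(round(percentile_to_nce(50), 1))  # 50.0
    print(round(percentile_to_nce(99), 1))  # about 99.0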

An important issue in comparing tests is the norming samples they use, and that is certainly the case with these two tests. Terra Nova has a very large and diverse, nationally representative norming sample (about 200,000 kids), while the WRMT is based on a much smaller group that may be skewed a bit toward struggling students (only 2,600 kids). When you say that someone is average or below average, you are comparing their performance with that of the norming group. Because of their extensiveness, I would trust the Terra Nova norms more than the WRMT ones; Terra Nova would likely give me a more accurate picture of where my students stand relative to the national population. The WRMT is useful because it provides more information about how well the kids are doing in particular skill areas, and it would help me to track growth in those skills.

Another thing to think about is reliability. Find out the standard error of measurement of the tests you are giving and calculate 95% confidence intervals for the scores. Scores should be stated in terms of the range of performance that the score represents. Often you will find that the confidence intervals of the two tests are so wide that they overlap; that would mean that though the score differences look big, the scores are not really different. Let’s say that the standard error of one of the tests is 5 points (you need to look up the actual standard error in the manual), and that your student received a standard score of 100 on that test. The 95% confidence interval for this score would then be roughly 90-110 (in other words, if the student took the test over and over, about 95% of his scores would be expected to fall within that range). Now say that the standard error of the other test was 8 and that the student’s score on that test was 120. That looks pretty discrepant, but the confidence interval for that score is roughly 104-136. Because 90-110 (the confidence interval for the first test) overlaps with 104-136 (the confidence interval for the second test), these scores look very different and yet we cannot conclude that they represent genuinely different levels of performance.
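
Here is a minimal sketch of that overlap check, using the illustrative numbers above; the function names are mine, and the standard errors are the made-up values from the example rather than those of any actual test:

    def confidence_interval(score, sem, z=1.96):
        """Return the 95% confidence interval implied by a standard error of measurement."""
        return (score - z * sem, score + z * sem)

    def intervals_overlap(a, b):
        """True if two (low, high) score ranges share any values."""
        return a[0] <= b[1] and b[0] <= a[1]

    test_a = confidence_interval(score=100, sem=5)  # roughly 90-110
    test_b = confidence_interval(score=120, sem=8)  # roughly 104-136
    print(intervals_overlap(test_a, test_b))        # True: the difference may not be meaningful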

You mention the big differences in the tasks included in the two tests. These can definitely make a difference in performance. Since the WRMT is given so often to lower-performing students, that test wouldn’t require especially demanding tasks to spread out performance, while the Terra Nova, given to a broader audience, would need a mix of easier and harder tasks (such as longer and more complex reading passages) to spread out student performance. These harder tasks push your kids lower in the group and may be so hard that it would be difficult to see short-term gains or improvements with such a test. The WRMT is often used to monitor gains, so it tends to be more sensitive to growth.

You didn’t mention which edition of the tests you were administering, but these tests are revised from time to time and the revisions matter. The WRMT has been revised recently, and studies of earlier versions reveal big differences in performance from one edition to the next (despite the fact that the same test items were being used). The different versions changed their norming samples, and that altered reported performance quite a bit (5-9 points). I think you would find the Terra Nova to have more stable scores, and yet comparing it across editions might reveal similar score inflation.

My advice is that when you want to show where students stand in the overall norm group, use only the Terra Nova data. Then use the WRMT to show where the students’ relative strengths and weaknesses are and to monitor growth in those skills. That means your message might be something like: “Tommy continues to perform at or near the 15th percentile when he is compared with his age mates across the country. Nevertheless, he has improved during the past three months in vocabulary and comprehension, though not enough to improve his overall position in the distribution.” In other words, his reading is improving, and yet he remains behind 85% of his peers in these skills.


Thursday, May 3, 2012

Here We Go Again


For years, I’ve told audiences that one of my biggest fantasies (not involving Heidi Klum) was that we would have a different kind of testing and accountability system. In my make-believe world, teachers and principals would never get to see the tests – under penalty of death.

They wouldn’t be allowed within miles of a school on testing days, and they would only be given general information about the results (e.g., “your class was in the bottom quintile of fourth grades in reading”). Telling a teacher about the kinds of test questions or about the formatting would be punished severely, too.

In that fantasy, teachers would be expected to try to improve student reading scores by… well, by teaching kids to read without regard to how it might be measured later. I have even mused that it would be neat if the test format changed annually, to discourage teachers from even thinking about teaching to a test format.

In some ways, because of the Common Core, my fantasy is coming true (maybe Heidi K. isn’t far behind?).

Principals and teachers aren’t sure what these tests look like right now. The whole system has been reset, and the only sensible solution is… teaching.

And yet, I am seeing states hold back on rolling out the Common Core until they can see the test formats.

Last week, Cyndie (my wife – yes, she knows all about Heidi and me – surprisingly, she doesn’t seem nervous about it) was contacted by a state department of education trying to see if she had any inside dope on the PARCC test.

This is crazy. We finally have a chance to raise achievement and these test-chasing bozos are working hard to put us back in the ditch. There is no reason to believe that you will make appreciable or reliable gains teaching kids to reply to certain kinds of test questions or to particular test formats (you can look it up). The people who push such plans know very little about education (can they show you the studies of their “successful” test-teaching approaches?). I am very pleased with the unsettled situation in which teachers and principals don’t know how the children’s reading is going to be evaluated; it is a great opportunity for teachers and kids to show what they can really do.

Saturday, August 9, 2008

Rubber Rulers and State Accountability Testing in Illinois

Much has been made in recent years of the political class’s embrace of the idea of test-based accountability for the schools. Such schemes are enshrined in state laws and NCLB. On the plus side, such efforts have helped move educators to focus on outcomes more than we traditionally have. No small change, this. Historically, when a student failed to learn it was treated as a personal problem, something beyond the responsibility of teachers or schools. That was fine, I guess, when “Our Miss Brooks” was in the classroom and teachers were paid a pittance. Not much public treasure was at risk, and frankly low achievement wasn’t a real threat to kids’ futures (with so many reasonably well-paying jobs available at all skill levels). As the importance and value of doing well have changed, so have the demands for accountability.

Sadly, politicos have been badly misled about the accuracy of tests, and the technical side of achievement testing has gotten really complicated, well beyond the scope of what most legislative education aides can handle.
And so, here in Illinois, we have a new test scandal brewing (requiring the rescoring of about 1 million tests). http://www.suntimes.com/news/education/1099086,CST-NWS-tests09.article

Two years ago Illinois adopted a new state test. This test would be more colorful and attractive and would have some formatting features that would make it more appealing to the kids who had to take it. What about the connection of the new test with the test it was to replace? Not to worry, the state board of education and Pearson’s testing service were on the case: they were going to statistically equate the new test with the old so the line of growth or decline would be unbroken, and the public would know whether schools were improving, languishing, or slipping down.
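
The state has never spelled out exactly how that equating was done, but for a sense of how the machinery works, here is a minimal sketch of one common textbook method, linear (mean-sigma) equating under a random-groups design. The function name and every number below are invented for illustration; this is not Pearson’s or Illinois’s actual procedure:

    from statistics import mean, stdev

    def linear_equate(x, old_form_scores, new_form_scores):
        """Place a raw score x from the new form onto the old form's scale
        (mean-sigma linear equating, assuming randomly equivalent groups)."""
        slope = stdev(old_form_scores) / stdev(new_form_scores)
        return mean(old_form_scores) + slope * (x - mean(new_form_scores))

    # Made-up raw scores from samples of students who took each form
    old_form = [28, 31, 35, 38, 40, 42, 45, 47, 50, 52]
    new_form = [25, 27, 30, 33, 36, 38, 40, 43, 45, 48]

    print(round(linear_equate(36, old_form, new_form), 1))
    # If the two samples aren't really comparable (different kids, different time
    # of year), or the formula is applied incorrectly, every equated score in the
    # state shifts with the error; that is the "rubber ruler."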

A funny thing happened, however: test scores jumped immediately. Kids in Illinois all of a sudden were doing better than ever before. Was it the new tests? I publicly opined that it likely was; large drops or gains in achievement scores are unlikely, especially without any big changes in public policy or practice. The state board of education, the testing companies, and even the local districts chimed in saying how “unfair” it was that anyone would disparage the success of our school kids. They claimed there was no reason to attribute the scores’ sudden upward trend to the coincidental change in tests, and frankly they were not happy about kill-joys like me who would dare question their new success (it was often pointed out that teachers were working very hard; the Bobby Bonds defense: I couldn’t have done anything wrong since I was working hard).

Now, after two years of that kind of thing, Illinois started using a new form of this test. The new form was statistically equated with the old form, so it could not possibly produce different results. Except that it did. The scores came back this summer much lower than they had been during the past two years. So much lower, in fact, that educators recognized it could not possibly be due to a real failure of the schools; it must be a testing problem. Magically, the new equating was found to be screwed up (a wrong formula, apparently). Except that Illinois officials have not yet released any details about how the equating was done. Equating can get messed up by computing the stats incorrectly, but it can also be thrown off by how, when, and from whom the data are collected.

It’s interesting that when scores rise, the educational community is adamant that it must be due to their successes, but when scores fall, as they apparently did this year in Illinois, it must be a testing problem.
Illinois erred in a number of ways, but so have many states in this regard.

The use of a single form of a single measure, administered to large numbers of children in order to make important public policy decisions, is foolish. It turns out there are many forms of the test Illinois is using. It was a mistake not to use multiple forms simultaneously (as would have been done in a research study), since that can help do away with the “rubber ruler” problem. Sadly, conflicting purposes for testing programs have us locked into a situation where we’re more likely to make mistakes than to get it right.
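
For a sense of what giving multiple forms simultaneously might look like, here is a minimal sketch of “spiraling” forms across students, the random-groups design a research study would typically use; the function name, form labels, and roster are all hypothetical:

    from itertools import cycle

    def spiral_forms(student_ids, forms=("A", "B", "C")):
        """Hand out test forms in rotation so each form ends up with an
        essentially equivalent, randomly mixed group of students."""
        return dict(zip(student_ids, cycle(forms)))

    roster = [f"student_{i:02d}" for i in range(1, 13)]  # a hypothetical roster
    for student, form in spiral_forms(roster).items():
        print(student, form)
    # Each form gets about a third of the roster, drawn from the same classrooms,
    # which is what makes the later statistical linking trustworthy.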

I’m a fan of testing (yes, I’ve worked on NAEP, ACT, and a number of commercial tests), and I am a strong proponent of educational accountability. It makes no sense, however, to try to do this kind of thing with single tests. It isn’t even wise to test every child. Public accountability efforts need to focus their attention on taking a solid overall look at performance on multiple measures without trying to get too detailed about the information on individual kids. Illinois got tripped up when it changed from testing schools to testing kids (teachers didn’t think kids would try hard enough if they weren’t at risk themselves, so our legislature went from sampling the state to testing every kid; of course, if you want individually comparable data, it only makes sense to test kids on the same measure).

Barack Obama has called for a new federal accountability plan that would make testing worthwhile to teachers by providing individual diagnostic information. That kind of plan sounds good, but ultimately it will require a lot more individual testing with single measures (as opposed to multiple alternative measures). Instead of a clearer or more efficient picture for accountability purposes (one less likely to be flawed by the rubber-ruler problem), the result can’t help but be muddled, as it was in Illinois. This positive-sounding effort will be more expensive and will produce a less clear picture in the long run.

Accountability testing aimed at determining how well public institutions are performing would be better constructed along the lines of the National Assessment, which uses several forms of a test simultaneously with samples of students representing the states and the nation. NAEP has to do some fancy statistical equating, too, but that equating is more likely to be correct when several overlapping forms of the test are used each year. By not trying to be all things to all people, NAEP manages to do a good job of letting the public and policymakers know how our kids are performing.