Sunday, April 7, 2013
Reading comprehension tests are not goals… they are samples of behaviors that represent the goals… and they are useful right up until teachers can’t distinguish them from the goals themselves.
Saturday, October 31, 2009
Teachers there told me they were testing some kids weekly or biweekly. That is too much. How do I know it is too much?
The answer depends on two variables: the standard error of measurement (SEM) of the test you are using and the student growth rate. The more certain the scores are (that is, the lower the SEM), the more often you can profitably test. And the faster students improve on the measure relative to the test's SEM, the more often you can test.
On something like DIBELS, the standard errors are reasonably large compared to the actual student growth rate--thus, on that kind of measure it doesn't make sense to measure growth more than 2-3 times per year. Any more than that, and you won't find out anything about the child (just the test).
The example in my PowerPoint below is based on the DIBELS oral reading fluency measure. For that test, kids read a couple of brief passages (1 minute each) and a score is obtained in words correct per minute. Kids in first and second grade make about 1 word of improvement per week on that kind of measure.
However, studies reveal this test has a standard error of measurement of 4 to 18 words. That means that under the best circumstances (a small SEM of 4), if a student scores 50 wcpm on the test, we can be 68 percent certain that the true score is somewhere between 46 and 54. It will therefore be at least 4 weeks before we could know whether the child was actually improving. Sooner than that, and any gains or losses that we see will likely be due to the standard error (the normal bouncing around of test scores).
And that is the best of circumstances. As kids grow older, their growth rates decline. Older kids usually improve 1 word every three or four weeks on DIBELS. In those cases, you would not be able to discern anything new for several months. But remember, I'm giving this only 68 percent confidence, not 95 percent, and I am assuming that DIBELS has the smallest SEM possible (not likely under normal school conditions). Two or three testings per year is all that will be useful under most circumstances.
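The arithmetic behind this can be sketched in a few lines. This is a hypothetical illustration, not part of DIBELS itself; the SEM and growth-rate figures are the ones discussed above, and treating one SEM (roughly 68 percent confidence) as the detection threshold is my simplifying assumption:

```python
def weeks_until_detectable(sem_wcpm, growth_per_week, z=1.0):
    """Weeks of average growth needed before a gain exceeds
    z standard errors of measurement (z=1.0 is ~68% confidence)."""
    return (z * sem_wcpm) / growth_per_week

# First or second grader: ~1 wcpm gain per week, best-case SEM of 4
print(weeks_until_detectable(4, 1.0))   # 4.0 weeks

# Older student: ~1 wcpm every 3-4 weeks (about 0.3/week), same SEM
print(weeks_until_detectable(4, 0.3))   # roughly 13 weeks, i.e., months
```

Raising `z` to 2.0 for something closer to 95 percent confidence, or using a more realistic SEM than 4, stretches these waits even further, which is why two or three testings per year is the practical ceiling.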
More frequent testing might seem rigorous, but it is time-wasting and misleading, and it simply cannot provide any useful information for monitoring kids' learning. Let's not just look highly committed and ethical by testing frequently; let's be highly committed and ethical and avoid unnecessary and potentially damaging testing.
Here is the whole presentation.
Monday, September 28, 2009
This weekend, there was a flurry of discussion on the National Reading Conference listserv about how to place students in books for reading instruction. This idea goes back to Emmet Betts in 1946. Despite a long history, there hasn’t been a great deal of research into the issue, so there are lots of opinions and insights. I tend to lurk on these listservs rather than participating, but this one really intrigued me as it explored a lot of important ideas. Here are a few.
Which ways of indicating book difficulty work best?
This question came up because the inquirer wondered if it mattered whether she used Lexiles, Reading Recovery, or Fountas and Pinnell levels. The various responses suggested a whiff of bias against Lexiles (or, actually, against traditional measures of readability including Lexiles).
So are all the measures of book difficulty the same? Well, they are and they're not. It is certainly true that historically most measures of readability (including Lexiles) come down to two measurements: word difficulty and sentence difficulty. These factors are weighted and combined to predict some criterion. Although Lexiles include the same components as traditional readability formulas, they predict different criteria: Lexiles are aligned with an extensive database of test performance, while most previous formulas predict the levels of subjectively sequenced passages. Lexiles have also been more recently normed. One person pointed out that Lexiles and other traditional measures of readability tend to come out about the same (correlations of .77), which I think is correct, but because recent student reading performance is the criterion, I usually go with the Lexiles when the estimates differ much.
Over the years, researchers have challenged readability because it is such a gross index of difficulty (obviously there is more to difficulty than sentences and words), but theoretically sound descriptions of text difficulty (such as those of Walter Kintsch and Arthur Graesser) haven’t led to appreciably better text difficulty estimates. Readability usually explains about 50% of the variation in text difficulty, and these more thorough and cumbersome measures don’t do much better.
One does see a lot of Fountas and Pinnell and Reading Recovery levels these days. Readability estimates are usually accurate only within about a year, and that is not precise enough to help a first-grade teacher match her kids with books. So these schemes claim to make finer distinctions in text difficulty early on, but that level of accuracy is open to question (I know of only one study of this, and it was moderately positive), and there is no evidence that using such fine levels of distinction actually matters for student learning (there is some evidence of this with more traditional measures of readability).
If anything, I think these new schemes tend to put kids into more levels than necessary. They probably correlate reasonably well with readability estimates, and their finer-grained results probably are useful for early first grade, but I'd be hard pressed to say they are better than Lexiles or other readability formulas even at these levels (and they probably lead to over-grouping).
Why does readability work so poorly for this?
I’m not sure that it really does work poorly despite the bias evident in the discussion. If you buy the notion that reading comprehension is a product of the interaction between the reader and the text (as most reading scholars do), why would you expect text measures to measure much more than half the variance in comprehension? In the early days of readability formula design, lots of text measures were used, but those fell away as it became apparent that they were redundant and 2-3 measures would be sufficient. The rest of the variation is variation in children’s interests and knowledge of topics and the like (and in our ability to measure student reading levels).
Is the right level the one that students will comprehend best at?
One of the listserv participants wrote that the only point to all of this leveling was to get students into texts that they could understand. I think that is a mistake. Often that may be the reason for using readability, but it isn't necessarily what teachers need to do. What a teacher wants to know is, "At what level will a child make optimum learning gains in my class?" If the child will learn better from something hard to comprehend, then, of course, we'd rather have them in that book.
The studies on this are interesting in that they suggest that sometimes you want students practicing with challenging text that may seem too hard (like during oral reading fluency practice) and other times you want them practicing with materials that are somewhat easier (like when you are teaching reading comprehension). That means we don’t necessarily want kids only reading books at one level: we should do something very different with a guided reading group that will discuss a story, and a paired reading activity in which kids are doing repeated reading, and an independent reading recommendation for what a child might enjoy reading at home.
But isn’t this just a waste of time if it is this complicated?
I don’t think it is a waste of time. The research certainly supports the idea that students do better with some adjustment and book matching than they do when they work whole class on the same level with everybody else.
However, the limitations in testing kids and testing texts should give one pause. It is important to see such data as a starting point only. By all means, test kids and use measures like Lexiles to make the best matches that you can. But don't end up with too many groups (meaning that some kids will intentionally be placed in harder or easier materials than you might prefer), move kids if a placement turns out to be easier or harder on a daily basis than the data predicted, and find ways to give kids experiences with varied levels of texts (from easy to challenging). Even when a student is well placed, there will still be selections that turn out to be too hard or too easy, and the amount of scaffolding and support will need to be adjusted. That means teachers need to pay attention to how kids are doing and respond to these needs to make sure the student makes progress (i.e., improves in what we are trying to teach).
If you want to know more about this kind of thing, I have added a book to my recommended list (at the right here). It is a book by Heidi Mesmer on how to match texts with kids. Good luck.
Thursday, August 20, 2009
I visited schools yesterday that used to DIBEL. You know what I mean, the teachers used to give kids the DIBELS assessments to determine how they were doing in fluency, decoding, and phonemic awareness. DIBELS has been controversial among some reading experts, but I’ve always been supportive of such measures (including PALS, TPRI, Ames-web, etc.). I like that they can be given quickly to provide a snapshot of where kids are.
I was disappointed that they dropped the tests and asked why. "Too much time," they told me, and when I heard their story I could see why. This was a district that liked the idea of such testing, but their consultants had pressured them into repeating it every week for at-risk kids. I guess the consultants were trying to be rigorous, but eventually the schools gave up on it altogether.
The problem isn't the test, but the silly testing policies. Too many schools are doing weekly or biweekly testing and it just doesn't make any sense. It's as foolish as checking your stock portfolio every day or climbing on the scale daily during a diet. Experts in those fields understand that too much assessment can do harm, so they advise against it.
Frequent testing is misleading and it leads to bad decisions. Investment gurus, for example, suggest that you look at your portfolio only every few months. Too many investors look at a day's stock losses and sell in a panic, because they don't understand that such losses happen often, and that over the long term they mean nothing. The same kind of thing happens with dieting. You weigh yourself and see that you're down 2 pounds, so what the heck, you can afford that slice of chocolate cake. But your weight varies through the day as you work through the nutrition cycle (you don't weigh exactly 130, but somewhere between 127 and 133). So, when your weight drops from 130 to 128, you think "bring on the dessert" when your real weight hasn't actually changed since yesterday.
The same kind of thing happens with DIBELS. Researchers investigated the standard error of measurement (SEM) of tests like DIBELS (Poncy, Skinner, & Axtell, 2005, in the Journal of Psychoeducational Assessment) and found standard errors of 4 to 18 points for oral reading fluency. That's the amount that the test scores jump around. They found that you could reduce the standard error by testing with multiple passages (something that DIBELS recommends, but most schools ignore). But even testing with multiple passages only got the SEM down to 4 to 12 points.
What does that mean? Well, for example, second graders improve in words correct per minute (WCPM) in oral reading about 1 word per week. That means it would take 4 to 12 weeks of average growth for the youngster to improve more than a standard error of measurement.
If you test Bobby at the beginning of second grade and he gets 65 wcpm in oral reading, and then you test him a week later and he gets 70, has his score improved? That looks like a lot of growth, but it is within a standard error, so it is probably just test noise. If you test him again in week 3, he might get a 68, and in week 4 he could reach 70 again, and so on. Has his reading improved, declined, or stagnated? Frankly, you can't tell in this time frame: on average a second grader will improve about 3 words in that time, but the test doesn't have the precision to reliably identify a 3-point gain. The scores could be changing because of Bobby's learning, or because of the imprecision of the measurement. You simply can't tell.
Stop the madness. Let's wait 3 or 4 months; still a little quick, perhaps, but since we use multiple passages to estimate reading levels, it is probably okay. In that time frame, Bobby should gain about 12-16 words correct per minute if everything is on track. If the new testing reveals gains much lower than that, then we can be sure there is a problem, and we can make some adjustment to instruction. Testing more often can't help, but it might hurt!
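The decision rule implied by the Bobby example can be sketched as follows. This is my illustration, not a DIBELS procedure; the 65-to-70 scores come from the example above, and treating any change within one SEM as noise is the simplifying assumption:

```python
def change_is_meaningful(score1, score2, sem):
    """Flag a score change only if it exceeds the test's standard
    error of measurement; smaller swings are indistinguishable
    from ordinary measurement noise."""
    return abs(score2 - score1) > sem

# Bobby a week later, using the multiple-passage SEM upper bound of 12
print(change_is_meaningful(65, 70, 12))   # False: within the noise

# Bobby after ~16 weeks of expected ~1 wcpm/week growth (65 -> 81)
print(change_is_meaningful(65, 81, 12))   # True: exceeds the SEM
```

The point of the sketch is the asymmetry: at one week, no plausible amount of real learning can clear the noise threshold, while at 3-4 months, on-track growth should clear it comfortably.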
Monday, August 10, 2009
Last week, the First District Court of Appeal in San Francisco upheld the right of California to administer achievement tests and high school exit exams in English to all students, no matter what their language background. Various education groups had challenged the practice of using English-only testing since federal law requires that second-language students “be assessed in a valid and reliable manner.”
As reported in the San Francisco Chronicle, Marc Coleman, a lawyer for the plaintiffs, complained, "The court dodges the essential issue in the lawsuit, which is: What is the testing supposed to measure?"
Mr. Coleman gets an A from me for that question, but I wonder how many of the groups he represents have a good answer to it.
The reason I’m curious is that I’ve received so many queries over the years about the practice of testing second-language students in English. The question often includes some kind of characterization of the practice as mean, stupid, or racist, so it is apparent that many professionals feel strongly about the impropriety of testing children in English.
No matter how angry the query, my response is always the same as Mr. Coleman’s: “What is the testing supposed to measure?” It obviously doesn’t satisfy the questioners, but whether they embrace such practices or loathe them, the appropriateness of English testing turns on the purpose of the testing.
If you are trying to find out how well your students do in reading English, I would not hesitate to test them in English.
"But," I hear the critics asking, "won't that make the test unreliable?"
"No," I answer. "Reliability has to do with stability of measurement. If a student does poorly on an English test because he or she doesn't know English, that low performance will likely be very stable."
“But won’t an English test be invalid?”
“Validity has to do with what the test purports to measure: if I’m trying to find out how well a student can read English, then this kind of test, all things being equal, would likely be a pretty good measure of that.”
“Yes, but won’t that kind of test lead to underestimates of how well that student is really doing, since he/she might be reading better in his or her home language?”
And that switcheroo is the key to this… because the questioner has now changed the purpose of the measure from finding out how well the student can read English to finding out how well he or she can read in any language. If the exit test is supposed to show that the student is academically skilled in English, then an English test is sound and appropriate. If the purpose of the exit measure is to reveal whether or not students are skilled in any language, then the English test alone would obviously be insufficient.
Certainly, I can tick off reasons why, diagnostically, a school might want to test a reading student in both English and the home language, at least in those cases in which the students are receiving some instruction in their home language (or have received such instruction in the past). And, I can think of all kinds of reasons why a school or state might test students in their home language, even on an exit exam: “We recognize that Diego struggles with English, but we want to know how well he does math or what he knows about science information.” In such cases, testing in English might lower performance below the level that Diego could demonstrate if language difference wasn’t an issue.
But if you want to know how well a student can read English, by all means give an English reading test no matter what the students’ home language backgrounds or educational histories. That would be the only valid way to find out the answer to the question. The court got this one right.
Monday, August 3, 2009
Rose Birkhead made a comment on my last post. She liked the idea of the states aiming at the same standards and giving the same reading tests, but she was concerned about testing kids on grade level. Say, I'm a teacher, and one of my students reads horribly, not even close to fourth-grade level. Rose's concern is that having that child take the fourth-grade test will tell me nothing of value about his reading (I won't get any useful insights about how to help him), and what it tells him he probably doesn't want to hear since he already knows he is struggling.
Last week, I was meeting with a group of school administrators with great expertise in the needs of English Language Learners. They thought testing and the use of data to be very important, but they were chagrined about state laws that required them to test kids in English. Their reasoning was pretty much the same as Rose's: since these kids don't know English yet, and we have been teaching them to read in Spanish, an English test won't tell us anything that will help us to deliver better reading programs, and such testing is only going to make these kids feel bad.
Rose and my ELL colleagues are not crazy, but I hope they don't get what they want. The problem is that accountability tests have very different purposes than other kinds of educational measures. Teachers and administrators want insights about how to teach particular kids more effectively, but accountability tests can't help there. Such tests will tell you who can't read well, but not why.
Tests that will help diagnose kids' instructional needs aren't very good at doing the accountability work. My colleagues are right: there are times when I'd rather concede that my students won't yet do well on an English test, and give them the Spanish test that will help me figure out what to do next. Rose is right that it can be useful to test kids out of level in some instances, but accountability wouldn't be the purpose: Jimmy reads very well for a second grader; unfortunately, he is in ninth grade. Obviously that kind of information would mislead everyone as to how well a school is doing.
During the presidential campaign, then-candidate Obama kept saying he was going to replace accountability tests with diagnostic tests, and I kept thinking, not a chance. Diagnostic testing takes a while; I hope we can keep accountability brief, and then do whatever is necessary to figure out the teaching needs of those kids who are lagging behind. I hope we get back to the idea that, for accountability purposes, we don't really need to test everyone. If I want to know how well kids in Montana read, or how well immigrant kids are doing in school, I can test random samples of such kids and get the information that I need. That means less testing time and less time away from teaching.
But what about testing someone on a test that is clearly too hard or inappropriate? Won't that make them feel terrible? Motivation does matter, and frankly there is no reason to test a child in a case like that, as long as the teacher or the school will concede that the child does not meet the standards. Unfortunately, schools that want to opt some kids out of accountability don't want them to count at all (a process we have seen in the National Assessment). That won't work (I remember a year when they allowed that in Detroit and something like 92% of kids met standards).
No matter how concerned for the kids a teacher or principal might be, they have tended either to keep low-performing kids out of the testing pool altogether, so that the community cannot know the real percentage of low readers in a school or district, or, if the data are going to count whether the child is tested or not, to administer the test anyway (hoping those kids do better than expected). We need to make it easier for a school to say, "Henry is a poor reader. There is no reason to test him on the fifth-grade test to prove that, since we know that he cannot read that well. Our school district concedes that Henry is a poor reader, we'll take the adjustment to our scores, and Henry will not be subjected to this."
Ain't going to happen? Maybe not, but if last fall you had told me that 46 states would sign on to joint standards, I would have said you were crazy.
Monday, April 13, 2009
I received this letter recently and below is my response. I bet this goes on in lots of schools (unfortunately).
Dear Dr. Shanahan,
What do you believe to be best practice in assessing a student's reading comprehension? As elementary schools turn to the Professional Learning Community framework, teachers are expected to devise tests within their grade-level teams to test for reading skills like inferring, author's purpose, cause & effect, etc.
In your comprehension blog, however, you stated that it was difficult to assess these skills separately since reading is an integrated process. That makes a lot of sense.
Are these Professional Learning Communities misdirected in creating these tests for specific skills? From how students perform on these tests from week to week, our intervention groups are then decided. For example, if students perform low on the cause & effect questions, then they will be retaught this skill during intervention.
I question whether this is best practice, and if we are oversimplifying other skills that go into comprehension.
As a result, my big questions are:
What is the best way for a teacher to assess reading comprehension (other than student conferences and observations)? Should intervention groups then focus more on reading components such as decoding, fluency, comprehension, and vocabulary? Is this practice more effective than breaking down the skills of reading for focused reteaching?
Dear Literacy Coach:
Thanks for the question. Yes, it is nearly impossible to come up with a comprehension test that can diagnose specific skills performance. The major testing companies have spent loads of money and time on that problem with a plethora of fine psychometricians and scores of skilled test writers dedicated to the problem, and they have never managed to do it. Of course, it's pretty unlikely that an individual teacher with all that he or she has to do would manage to come up with a test that would reveal how kids do with such skills. I would recommend saving your time on that one.
The first rule of assessment is never test unless you plan to use the information. If you are going to provide extra help for kids who are struggling with reading comprehension, by all means have the kids doing retellings (orally or in writing), or ask them a bunch of different questions about a text and see how they do with the answers. If they aren't comprehending a large proportion of what they are reading, then by all means give them more time reading and thinking about the ideas in text through discussion and writing and other activities. Keep your focus less on the question types than on the texts themselves... instead of trying to pile up inferential questions alone or main idea questions alone, focus on asking questions that will help the kids think deeply about the ideas in the texts (you'll end up with a pretty good mix if you do that). And, by all means, continue to pay attention to the students' fluency, vocabulary knowledge, decoding skills, and writing ability.
Saturday, January 24, 2009
Here’s a big idea that can save your school district a lot of money and teachers and kids a lot of time: reading comprehension tests cannot be used to diagnose reading problems.
This isn’t a traditional educator complaint about reading tests; I’m pro reading test. The typical reading comprehension test (e.g., Gates-MacGinitie, Stanford, Metropolitan, Iowa, state accountability assessments) is valid, reliable, and reasonably respectful of students from varied cultures… and yet those tests cannot be used diagnostically.
The problem isn’t with the tests; it’s a fact based on the nature of reading ability. Reading is complicated. It involves a bunch of skills that need to be used either simultaneously or in amazingly rapid sequence. Reading comprehension tests do a great job of identifying who has trouble with reading, but they can’t sort out why a student struggles. Is it a comprehension problem or did the student fail to decode? Maybe the youngster decoded the words just fine, but didn’t know the word meanings. Or could she read the text fluently, with pauses in the right places within sentences? Of course, none of those might be the problem: maybe the student really had trouble thinking about the ideas.
Because reading is a hierarchy of skills that must be used simultaneously, failures with low-level skills necessarily undermine higher-level ones (like interpreting ideas in text). Because every comprehension question has to be answered on the basis of decoding, interpretation of word meanings, use of prior knowledge, analysis of sentence syntax, etc., it is impossible to find patterns of student performance on a typical reading comprehension test that can tell you anything. That is also why items are so highly intercorrelated on reading comprehension tests. The companies that offer to analyze kids’ test results to provide you with an instructional map of their comprehension needs are offering something of no value. If a main idea question is hard, all your kids will seem to need help with main ideas. If several inferential questions are bunched at the end of the test and some of your kids don’t finish all the items, you’ll find out that most of your kids need help with inferencing. Not one scheme for analyzing item responses on comprehension tests is reliable, and none has been validated empirically. Those schemes simply don’t work, except to separate schools from their money.
Sunday, August 31, 2008
Reading First is the federal education program that encourages teachers to follow the research on how best to teach reading. The effort requires that teachers teach phonemic awareness (grades K-1), phonics (grades K-2), oral reading fluency (grades 1-3), vocabulary (grades K-3), and reading comprehension strategies (grades K-3). Reading First emphasizes such teaching because so many studies have shown that the teaching of each of these particular things improves reading achievement.
Reading First also requires that kids get 90 minutes of uninterrupted reading instruction each day, because research overwhelmingly shows that the amount of teaching provided makes a big difference in kids’ learning.
It requires that kids who are struggling be given extra help in reading through various interventions. Again, an idea supported by lots of research. Early interventions get a big thumbs-up from the research studies.
It requires that teachers and principals receive lots of professional development in reading, the idea being that if they know how to teach reading effectively, higher reading achievement will result. The research clearly supports this idea, too.
It requires that kids be tested frequently using monitoring tests to identify which kids need extra help and to do this early, before they have a chance to fall far behind. Sounds pretty sensible to me, but where’s the research?
Truth be told, there is a very small amount of research on the learning benefits of “curriculum-based measurement” and “work sampling,” but beyond these meager, somewhat off-point demonstrations, there is little empirical evidence supporting such big expenditures of time and effort.
This isn’t another rant against DIBELS (the tests that have been used most frequently for this kind of monitoring). Replace DIBELS with any monitoring battery you prefer (e.g., PALS, Ames-Webb, ISEL, TPRI) and you have the same problem. What do research studies reveal about the use of these tests to improve achievement? Darned little!
There is research showing that these tests are valid and reliable; that is, they tend to measure what they claim to measure and they do so in a stable manner. In other words, the quality of these tests in terms of measurement properties isn’t the problem.
The real issue is how would you use these tests appropriately to help improve kids’ performance? For instance, do we really need to test everyone or are there kids who so clearly are succeeding or failing that we would be better off saving the testing time and simply stipulating that they will or will not get extra help?
Or, are the cut scores really right for these tests? I know that when I reviewed DIBELS for Buros, I found that the cut scores (the scores used to identify who is at risk) hadn’t been validated satisfactorily. Since then, my experiences in Chicago suggest to me that the scores aren’t sufficiently rigorous; that means many kids who need help don’t get it because the tests fail to identify them as being in need.
Perhaps the monitoring test schemes (and the tests themselves) are adequate, but in practice you can’t make them work. I have personally seen teachers subverting these plans by doing things like having kids memorize nonsense words or read as fast as possible (rather than reading for meaning). Test designers can’t be held accountable for such misuse of their tests, but such aberrations cannot be ignored in determining the ultimate value of these testing plans.
There are few aspects of Reading First that make more sense than checking up on the students’ reading progress, and providing extra help to those who are not learning… unfortunately, we don’t have much evidence showing that such schemes—as actually carried out in classrooms—work the way logic says they should. I think it is worth continuing to try to make such approaches pay off for kids, but given the lack of research support, I think real prudence is needed here:
1. Administer these tests EXACTLY in the way the manuals describe.
2. Limit the amount of testing to what is really needed to make a decision (if a teacher is observing everyday and believes that a child is struggling with some aspect of reading, chances are pretty good that extra help is needed).
3. Examine the results of your testing over time. Perhaps if you systematically adjust the cut scores, you can improve student learning. It is usually best to err on the side of giving kids more help than they might need.
4. Don’t neglect aspects of reading instruction that can’t be measured as easily (such as vocabulary or reading comprehension). Monitoring tests do a reasonably good job of helping teachers to sort out performance of “simple skills.” They do not, nor do they purport to, assess higher level processes; these still need to be taught and taught thoroughly and well, however. Special effort may be needed to ensure that these are adequately addressed given the lack of direct testing information.
Saturday, August 9, 2008
Much has been made in recent years of the political class’s embrace of test-based accountability for the schools. Such schemes are enshrined in state laws and NCLB. On the plus side, such efforts have helped move educators to focus on outcomes more than we traditionally have. No small change, this. Historically, when a student failed to learn it was treated as a personal problem, something beyond the responsibility of teachers or schools. That was fine, I guess, when “Our Miss Brooks” was in the classroom and teachers were paid a pittance. Not much public treasure was at risk, and frankly low achievement wasn’t a real threat to kids’ futures (with so many reasonably well-paying jobs available at all skill levels). As the importance and value of doing well has changed, so have the demands for accountability.
Sadly, politicos have been badly misled about the accuracy of tests, and the technical side of achievement testing has gotten really complicated—well beyond the scope of what most legislative education aides can handle.
And so, here in Illinois we have a new test scandal brewing (requiring the rescoring of about 1 million tests). http://www.suntimes.com/news/education/1099086,CST-NWS-tests09.article
Two years ago Illinois adopted a new state test. This test would be more colorful and attractive and would have some formatting features that would make it more appealing to the kids who had to take it. What about the connection of the new test with the test it was to replace? Not to worry, the state board of education and Pearson publishing’s testing service were on their game: they were going to equate the new test with the old statistically, so the line of growth or decline would be unbroken, and the public would know if schools were improving, languishing, or slipping down.
A funny thing happened, however: test scores jumped immediately. Kids in Illinois all of a sudden were doing better than ever before. Was it the new tests? I publicly opined that it likely was; large drops or gains in achievement scores are unlikely, especially without any big changes in public policy or practice. The state board of education, the testing companies, and even the local districts chimed in saying how “unfair” it was that anyone would disparage the success of our school kids. They claimed there was no reason to attribute the scores’ sudden upward trend to the coincidental change in tests, and frankly they were not happy about kill-joys like me who would dare question their new success (it was often pointed out that teachers were working very hard—the Barry Bonds defense: I couldn’t have done anything wrong since I was working hard).
Now, after two years of that kind of thing, Illinois started using a new form of this test. The new form was statistically equated with the old form, so it could not possibly have any different results. Except that it did. Apparently, the scores came back this summer much lower than they had been during the past two years. So much lower, in fact, that the educators recognized that it could not possibly be due to a real failure of the schools, but must be a testing problem. Magically, the new equating was found to be screwed up (a wrong formula, apparently). Except, Illinois officials have not yet released any details about how the equating was done. Equating can get messed up by computing the statistics incorrectly, but it can also be thrown off by how, when, and from whom the equating data are collected.
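To see how a “wrong formula” can silently shift every score in a state, here is a minimal sketch of one textbook method, mean-sigma linear equating, in Python. This is an illustration only: the post does not say which method Pearson actually used, the score values below are made up, and real state equating designs are far more elaborate.

```python
import statistics

def linear_equate(new_form_scores, ref_form_scores):
    """Mean-sigma linear equating: find the line y = slope*x + intercept
    that maps scores on a new test form onto the scale of a reference
    form, assuming both forms were taken by equivalent groups."""
    mu_new = statistics.mean(new_form_scores)
    sd_new = statistics.pstdev(new_form_scores)
    mu_ref = statistics.mean(ref_form_scores)
    sd_ref = statistics.pstdev(ref_form_scores)
    slope = sd_ref / sd_new              # match the reference form's spread
    intercept = mu_ref - slope * mu_new  # match the reference form's mean
    return lambda x: slope * x + intercept

# Hypothetical data: the new form ran a bit easier, so raw scores sit higher.
ref_scores = [180, 200, 220]
new_scores = [190, 210, 230]
equate = linear_equate(new_scores, ref_scores)
print(equate(210))  # 200.0 on the reference scale
```

Note that mixing up which form is the reference, or botching the slope or intercept (the kind of “wrong formula” mistake described above), shifts every score in the state at once; and even a correct formula gives wrong answers if the two groups of examinees are not comparable, which is the “how, when, and from whom” caveat.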
It’s interesting that when scores rise the educational community is adamant that it must be due to their successes, but when they fall—as they apparently did this year in Illinois—it must be a testing problem.
Illinois erred in a number of ways, but so have many states in this regard.
The use of a single form of a single measure, administered to large numbers of children in order to make important public policy decisions, is foolish. It turns out there are many forms of the test Illinois is using. They should have used multiple forms simultaneously (as they would have in a research study), since this can help to do away with the “rubber ruler” problem. Sadly, conflicting purposes for testing programs have us locked into a situation where we’re more likely to make mistakes than to get it right.
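The benefit of spiraling multiple forms can be shown with toy numbers. Assume each parallel form carries its own small, unknown difficulty bias (the values below are entirely made up): a state that tests everyone on one form inherits that form’s full bias, while distributing several forms across the sample averages the biases down.

```python
import statistics

true_mean = 200.0
# Hypothetical difficulty biases for three parallel forms (made-up values).
form_bias = {"A": +6.0, "B": -2.0, "C": -4.0}

# Single-form testing: the reported state mean carries form A's whole bias.
single_form_mean = true_mean + form_bias["A"]

# Spiraled forms: each form goes to a third of the students, so the
# reported mean carries only the average of the three biases.
spiraled_mean = true_mean + statistics.mean(form_bias.values())

print(single_form_mean)  # 206.0, six points off
print(spiraled_mean)     # 200.0, the biases happen to cancel here
```

In general the biases will not cancel exactly, but the averaged error is smaller than any single form’s, and, just as important, a scoring or equating problem on one form no longer moves the whole state’s trend line by itself.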
I’m a fan of testing (yes, I’ve worked on NAEP, ACT, and a number of commercial tests), and am a strong proponent of educational accountability. It makes no sense, however, to try to do this kind of thing with single tests. It isn’t even wise to test every child. Public accountability efforts need to focus their attention on taking a solid overall look at performance on multiple measures without trying to get too detailed about the information on individual kids. Illinois got tripped up when they changed from testing schools to testing kids (teachers didn’t think kids would try hard enough if they weren’t at risk themselves, so our legislature went from sampling the state to testing every kid—of course, if you want individually comparable data it only makes sense to test kids on the same measure).
Barack Obama has called for a new federal accountability plan that will make testing worthwhile to teachers by providing individual diagnostic information. That kind of plan sounds good, but ultimately it will require a lot more individual testing with single measures (as opposed to multiple alternative measures). Instead of getting a clearer or more efficient picture for accountability purposes—and one less likely to be flawed by the rubber ruler problem—it can’t help but be muddled as in Illinois. This positive-sounding effort will be more expensive and will result in a less clear picture in the long run.
Accountability testing aimed at determining how well public institutions are performing would be better constructed along the lines of the National Assessment (which uses several forms of a test simultaneously with samples of students representing the states and the nation). NAEP has to do some fancy statistical equating, too, but this is more likely to be correct when several overlapping forms of the test are used each year. By not trying to be all things to all people, they manage to do a good job of letting the public and policymakers know how our kids are performing.
Thursday, June 12, 2008
I'm asked frequently by schools to come and help with reading comprehension, but as with the Reading First folks, I often sense that there are many things these folks don't know enough about to make real progress in improving reading comprehension. However, I sometimes think they mess up for the opposite reasons that the Reading First people do.
Many teachers, especially in the upper grades, think that they only need to teach comprehension and everything will be fine--neglecting the need for instruction in decoding, fluency, and vocabulary. I think in Reading First teachers too often lost sight of two key points: (1) that reading comprehension can and should be taught explicitly and (2) that the enabling skills (decoding, fluency, vocabulary) need to be taught in ways that aim them at reading comprehension. Decoding needs to be taught, but lessons in decoding should always end with kids reading new text with their new skills. Vocabulary needs to be taught, kids need to read text that uses that vocabulary, and they need to think about what the word meanings have to do with interpreting the text. Fluency instruction should be supported by having students answer questions or react to the meaning after each rereading (it is more than a race to read fast).
If you go to this link you can download my new powerpoint, 10 things every teacher should know about reading comprehension.
Tuesday, March 18, 2008
It’s not uncommon for educators to oppose high-stakes testing. Teachers and principals have personal reasons to be against such approaches: high-stakes tests are more likely to be used to pressure them than the kids whom they serve. University-based scholars also tend to be against testing, but that isn’t surprising as most university professors are politically liberal and most education accountability plans emanate from conservative governments. While professors may have a knee-jerk reaction to high-stakes tests, this in no way disparages the high-quality scholarly analyses of such tests, such as the one carried out for the National Academy of Sciences: http://www.nap.edu/openbook.php?isbn=0309062802
Lots of folks figure I must be for high-stakes testing: I was appointed to the Board of Advisors of the National Institute for Literacy by President George W. Bush, I served on the National Reading Panel, and I am very interested in seeing reading scores improve. That isn’t the case, however. I’m against high-stakes testing for a simple reason: for the most part it hasn’t worked. Various analyses show that such tests narrow curriculum, encourage students to leave school early, reduce the amount of instructional time, and have not led to improved reading achievement.
Politicians and taxpayers believe that schools are not doing a good job. I agree with them, as our kids aren’t learning enough. The politicos are certain that if teachers and principals would try harder, then things would change. And that’s where I disagree. For the most part, teachers and principals are trying very hard. The reason testing hasn’t motivated higher achievement is because higher achievement is not an issue of motivation. Too few teachers know how to teach reading effectively. In many cases, quality instructional materials are not available. Schools are often disorganized and fail to provide teachers with sufficient support.
Motivating folks to try harder when an outcome depends mainly on determination is a good idea. Pressuring them to do better when they don’t know how or lack the necessary tools is a losing strategy. I would gladly see states and the feds move their accountability dollars into professional development, after-school programs, truancy prevention, and other things that can work. High-stakes testing is a motivation strategy. So are teacher incentive plans. Neither is likely to improve students’ reading, however, until teachers have sufficient knowledge and support so they can expend the extra effort in the right places. That means that someday I may support test-based incentive pay or high-stakes testing, as my opposition isn’t political, it is about effectiveness. Our literacy needs are real: we can’t afford to continue to waste hundreds of millions of dollars, and millions of children’s education hours, on a losing strategy.