
Sunday, January 25, 2015

Concerns about Accountability Testing

Why don’t you write more about the new tests?

I haven’t written much about PARCC or SBAC—or the other new tests that some states are adopting—in part because they are not out yet. There are some published prototypes, and I was one of several people asked to examine the work products of these consortia. Nevertheless, the information available is very limited, and I fear that almost anything I write now could be misleading (the prototypes are not necessarily what the final products will turn out to be).

However, let me also say that, unlike many who strive for school literacy reform and who support higher educational standards, I’m not all that enthused about the new assessments. 

Let me explain why.

1. I don't think the big investment in testing is justified. 

I’m a big supporter of teaching phonics and phonological awareness because research shows that to be an effective way to raise beginning reading achievement. I have no commercial or philosophical commitment to such teaching, but trust the research. There is also strong research on the teaching of vocabulary, comprehension, and fluency, and expanding the amount of teaching is a powerful idea, too.

I would gladly support high-stakes assessment if it had a similarly strong record of stimulating learning, but that isn't the case.

Test-centered reform is expensive, and it has not been proven effective. The best studies I know of reveal either extremely slight benefits or somewhat larger losses (on balance, it is—at best—a draw). Test-based accountability does not lead to better reading achievement.

(I recognize that states like Florida have raised achievement while using high-stakes testing. The testing may have been part of what made those reforms work, but you can't tell whether the benefits were really due to the other changes that were made at the same time, such as professional development, curriculum, instructional materials, and the amount of instruction.)

2. I doubt that new test formats—no matter how expensive—will change teaching for the good.

In the early 1990s, P. David Pearson, Sheila Valencia, Robert Reeve, Karen Wixson, Charles Peters, and I were involved in helping Michigan and Illinois develop innovative tests: tests that included entire texts and multiple-response question formats that did away with the one-correct-answer notion. The idea was that if we had tests that looked more like “good instruction,” then teachers who tried to imitate the tests would do a better job. Neither Illinois nor Michigan saw learning gains as a result of these brilliant ideas.

That makes me skeptical about both PARCC and SBAC. Yes, they will ask some different types of questions, but that doesn’t mean the teaching that results will improve learning. I doubt that it will.

I might be more excited if I didn’t expect companies and school districts to copy the formats, but miss the ideas. Instead of teaching kids to think deeply and to reason better, I think they’ll just put a lot of time into two-part answers and clicking. 

3. Longer tests are not really a good idea.

We should be trying to maximize teaching and minimize testing (minimize, not do away with). We need to know how states, school districts, and schools are doing. But this can be figured out with much less testing. We could easily estimate performance on the basis of samples of students—rather than entire student bodies—and we don’t need annual tests; with samples of reliable sizes, the results just don’t change that frequently.
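
To put a rough number on that sampling claim, here is a quick sketch (in Python, with a made-up score spread, so treat the specific figures as illustrative only) of how the margin of error shrinks as a random sample of students grows:

```python
# A minimal sketch (hypothetical numbers) of why sampling a few hundred
# students can estimate a district's average score about as well as
# testing everyone: the standard error of a sample mean is sd / sqrt(n).
import math

population_sd = 35.0  # hypothetical spread of scale scores across a district

for n in (100, 400, 1600):
    standard_error = population_sd / math.sqrt(n)
    margin_95 = 1.96 * standard_error  # half-width of a 95% confidence interval
    print(f"sample of {n:>4} students -> margin of error about +/- {margin_95:.1f} points")
```

A few hundred randomly sampled students already pin down an average within a few points, which is plenty for judging how a state or district is doing.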

Similarly, no matter how cool a test format may seem, it is probably not worth the extra time needed to administer it. I suspect the results of these tests will correlate highly with those of the tests they replace. If that's the case, will you really get any more information from these tests? And, if not, then why not use those testing days to teach kids instead? Anyone interested in closing poverty gaps, or international achievement gaps, is simply going to have to bite the bullet: more teaching, not more testing, is the key to catching up.

4. The new reading tests will not provide evidence on skills ignored in the past.

The new standards emphasize some aspects of reading neglected in the past. However, these new tests are not likely to provide any information about these skills. Reading tests don't work that way (math tests do, to some extent). We should be able to estimate the Lexile levels that kids are attaining, but we won’t be able to tell if they can reason better or are more critical thinkers (they may be, but these tests won’t reveal that).

Reading comprehension tests—such as those used by all 50 states for accountability purposes—can tell us how well kids can comprehend. They cannot tell which skills the students have (or even if reading comprehension actually depends on such a collection of discrete skills). Such tests, if designed properly, should provide clues about the level of language difficulty that students can negotiate successfully, but beyond that we shouldn’t expect any new info from the items.

On the other hand, we should expect some new information. The new tests are likely to have different cut scores or criteria of success. That means these tests will probably report much lower scores than in the past. Given the large percentage of boys and girls who “meet or exceed” current standards, graduate from high school, and enter college, but who lack basic skills in reading, writing, and/or mathematics, it would only be appropriate that their scores be lower in the future.


However, I predict that when those low test scores arrive, there will be a public outcry, which some politicians will blame on the new standards. Instead of recognizing that the new tests are finally offering honest information about how their kids are doing, people will believe that the low scores are the result of poor standards, and there will be a strong negative reaction. Instead of militating for better schools, the public will be stimulated to support lower standards.

The new tests will only help if we treat them differently than the old tests. I hope that happens, but I'm skeptical.

Sunday, May 18, 2014

IRA 2014 Presentations

I made four presentations at the meetings of the International Reading Association in New Orleans this year. One of these was the annual research review address, in which I explained the serious problems inherent in the "instructional level" in reading and in associated approaches, like "guided reading," that have certainly outlived their usefulness.

IRA Talks 2014



Thursday, April 3, 2014

Apples and Oranges: Comparing Reading Scores across Tests

I get this kind of question frequently from teachers who work with struggling readers, so I decided to respond publicly. What I say about these two tests would be true of others as well.

I am a middle school reading teacher and have an issue that I'm hoping you could help me solve. My students' placements are increasingly bound to their standardized test results. I administer two types of standardized tests to assess the different areas of student reading ability. I use the Woodcock Reading Mastery Tests and the Terra Nova Test of Reading Comprehension. Often, my students' WRMT subtest scores are within the average range, while their Terra Nova results fall at the lower end of the average range or below. How can I clearly explain these discrepant results to my administrators? When they see average scores on one test, they believe these students are no longer candidates for remedial reading services.

Teachers are often puzzled by these kinds of testing discrepancies, but they can happen for a lot of reasons.

Reading tests tend to be correlated with each other, but this kind of general agreement between two measures doesn’t mean that they will categorize student performance identically. Performing at the 35th percentile might earn a below-average designation on one test but an average one on the other. It is probably better to stay away from those designations and to use NCE scores or something else that is comparable across tests.
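
For anyone who wants to see what that conversion involves, here is a small sketch of the standard percentile-to-NCE computation (the particular percentile ranks are just examples):

```python
# A small sketch of the percentile-to-NCE conversion mentioned above.
# NCE scores put percentile ranks onto an equal-interval scale (mean 50,
# roughly 1 to 99), which is what makes them comparable across tests.
from statistics import NormalDist

def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to a Normal Curve Equivalent."""
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return 50.0 + 21.06 * z  # 21.06 scales z so the 1st and 99th percentiles map to 1 and 99

for pr in (16, 35, 50, 84):
    print(f"{pr:>2}th percentile -> NCE {percentile_to_nce(pr):5.1f}")
```

Because NCEs sit on an equal-interval scale, a gain of a few NCE points means roughly the same thing anywhere in the distribution, which is what makes them easier to compare across tests than "average/below average" labels.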

An important issue in comparing tests is the norming samples they use, and that is certainly an issue with these two tests. The Terra Nova has a very large and diverse, nationally representative norming sample (about 200,000 kids), while the WRMT is based on a much smaller group that may be skewed a bit toward struggling students (only about 2,600 kids). When you say that someone is average or below average, you are comparing that performance with the norming group's. Because of their extensiveness, I would trust the Terra Nova norms more than the WRMT ones; the Terra Nova would likely give me a more accurate picture of where my students stand compared with the national population. The WRMT is useful because it provides more information about how well the kids are doing in particular skill areas, and it would help me to track growth in those skills.

Another thing to think about is reliability. Find out the standard error of measurement of the tests you are giving and calculate 95% confidence intervals for the scores. Scores should be stated in terms of the range of performance that they represent. Often you will find that the confidence intervals of the two tests are so wide that they overlap, which means that though the score differences look big, you can't be confident the scores are really different. Let's say that the standard error of one of the tests is 5 points (you need to look up the actual standard error in the manual), and that your student received a standard score of 100 on that test. The 95% confidence interval for this score would be roughly 90-110 (in other words, if the student took this test over and over, 95% of his scores would fall within that range). Now say that the standard error of the other test is 8 points and that the student's score on that test was 120. That looks pretty discrepant, but the confidence interval for that score is roughly 104-136. Because 90-110 (the confidence interval from the first test) overlaps with 104-136 (the confidence interval from the second test), these scores look very different and yet they may not really be different at all.
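
Here is that same arithmetic as a short sketch, using the illustrative numbers above (you would substitute the actual SEMs from the test manuals):

```python
# A quick sketch of the confidence-interval comparison described above,
# using the paragraph's illustrative numbers (SEMs of 5 and 8 points,
# standard scores of 100 and 120).

def ci_95(score, sem):
    """95% confidence interval: the score plus or minus about two standard errors."""
    return (score - 1.96 * sem, score + 1.96 * sem)

lo_a, hi_a = ci_95(100, 5)   # roughly 90 to 110
lo_b, hi_b = ci_95(120, 8)   # roughly 104 to 136

overlap = hi_a >= lo_b and hi_b >= lo_a  # do the two intervals share any ground?

print(f"Test A: {lo_a:.0f}-{hi_a:.0f}")
print(f"Test B: {lo_b:.0f}-{hi_b:.0f}")
print("Confident the scores really differ?", "no" if overlap else "yes")
```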

You mention the big differences in the tasks included in the two tests. These can definitely make a difference in performance. Since the WRMT is given so often to lower-performing students, that test doesn't require especially demanding tasks to spread out performance, while the Terra Nova, given to a broader audience, needs a mix of easier and harder tasks (such as longer and more complex reading passages) to spread out student performance. Those harder tasks push your kids lower in the group and may be so hard that it would be difficult to see short-term gains or improvements with such a test. The WRMT is often used to monitor gains, so it tends to be more sensitive to growth.

You didn’t mention which editions of the tests you were administering, but these tests are revised from time to time and the revisions matter. The WRMT has a recent edition, and studies of previous versions reveal big differences in performance from one edition to the next (despite the fact that the same test items were being used). The different versions changed their norming samples, and that altered test performance quite a bit (5-9 points). I think you would find the Terra Nova scores to be more stable, and yet comparing them across editions might reveal similar score inflation.

My advice is that when you want to show where students stand in the overall norm group, use only the Terra Nova data. Then use the WRMT to show where the students’ relative strengths and weaknesses are and to monitor growth in those skills. That means your message might be something like: “Tommy continues to perform at or near the 15th percentile when he is compared with his age mates across the country. Nevertheless, he has improved during the past three months in vocabulary and comprehension, though not enough to improve his overall position in the distribution.” In other words, his reading is improving, and yet he remains behind 85% of his peers in these skills.


Thursday, February 6, 2014

To Special Ed or not to Special Ed: RtI and the Early Identification of Reading Disabilities

My question centers on identifying students for special education. Research says identify students early, avoid the IQ-discrepancy model formula for identification, and use an RTI framework for identification and intervention. 

That said, I have noticed that as a result of high-stakes accountability linked to teacher evaluations, there seems to be a bit of a shuffle around identifying students for special education. While we are encouraged to "identify early," the Woodcock Johnson rarely finds deficits that warrant special education identification. Given current research on constrained skills theory (Scott Paris) and late-emerging reading difficulties (Rollanda O’Connor), how do we make sure we are indeed identifying students early?

If a student has been with me for two years (Grades 1 and 2) and the instructional trajectory shows minimal progress on meeting benchmarks (despite quality research-based literacy instruction), but a special education evaluation using the Woodcock Johnson shows skills that fall within norms, how do we service these children? Title I is considered a regular education literacy program. Special Education seems to be pushing back on servicing these students, saying they need to "stay in Title I." Or worse, it is suggested that these students be picked up in SPED for phonics instruction and continue to be serviced in Title I for comprehension.

I am wondering what your thoughts are on this. The "duplication of services" issue of being serviced by both programs aside, how does a school system justify such curriculum fragmentation for its most needy students? Could you suggest some professional reading or research that could help me make the case for both early identification of students at risk for late-emerging reading difficulties and the issue of duplication of services when both Title I and SPED service a student?

This is a great question, but one that I didn’t feel I could answer. As I’ve done in the past with such questions, I sent it along to someone in the field better able to respond. In this case, I contacted Richard Allington, past president of the International Reading Association and a professor at the University of Tennessee. This question is right in his wheelhouse, and here is his answer:

I know of no one who advocates early identification of kids as pupils with disabilities (PWDs). At this point in time we have at least 5 times as many kids identified as PWDs [as is merited]. The goal of RTI, as written in the background paper that produced the legislation, is a 70-80% decrease in the numbers of kids labeled as PWDs. The basic goal of RTI is to encourage schools to provide kids with more expert and intensive reading instruction. As several studies have demonstrated, we can reduce the proportion of kids reading below grade level to 5% or so by the end of 1st grade. Once on level by the end of 1st, about 85% of kids remain on grade level at least through 4th grade with no additional intervention. Or, as two other studies show, we could provide 60 hours of targeted professional development to every K-2 teacher to develop their expertise sufficiently to accomplish this. In the studies that have done this, fewer kids were reading below grade level than when daily 1-1 tutoring was provided in K and 1st. Basically, what the research indicates is that LD, dyslexic, and ADHD kids are largely identified by inexpert teachers who don't know what to do. If Pianta and colleagues are right, only 1 of 5 primary teachers currently has both the expertise and the personal sense of responsibility for teaching struggling readers. (It doesn't help that far too many states have allowed teachers to avoid responsibility for the reading development of PWDs by removing PWDs from value-added computations of teacher effectiveness.)

I'll turn to senior NICHD scholars who noted that, "Finally, there is now considerable evidence, from recent intervention studies, that reading difficulties in most beginning readers may not be directly caused by biologically based cognitive deficits intrinsic to the child, but may in fact be related to the opportunities provided for children learning to read." (p. 378)

In other words, most kids who fail to learn to read are victims of inexpert or nonexistent teaching. Or, they are teacher disabled, not learning disabled. Only when American school systems and American primary-grade teachers realize that they are the source of the reading problems that some kids experience will those kids ever be likely to be provided the instruction they need by their classroom teachers.

As far as "duplication of services" this topic has always bothered me because if a child is eligible for Title i services I believe that child should be getting those services. As far as fragmentation of instruction this does not occur when school districts have a coherent systemwide curriculum plan that serves all children. But most school districts have no such plan and so rather than getting more expert and more intensive reading lessons based on the curriculum framework that should be in place, struggling readers get a patchwork of commercial programs that result in the fragmentation. Again, that is not the kids as the problem but the school system as the problem. Same is true when struggling readers are being "taught" by paraprofessionals. That is a school system problem not a kids problem. In the end all of these school system failures lead to kids who never becomes readers.

Good answer, Dick. Thanks. Basically, the purpose of these efforts shouldn’t be to identify kids who will qualify for special education, but to address the needs of all children from the beginning. Once children show that they are not responding adequately to high-quality, appropriate instruction, then instruction should be intensified—whether through special education, Title I, or improvements to regular classroom teaching. Quality and intensity are what need to change, not placements. Early literacy is an amalgam of foundational skills that allow one to decode from print to language and language skills that allow one to interpret that language. If students are reaching average levels of performance on foundational skills, they are attaining skill levels sufficient to allow most students to progress satisfactorily. If they are not progressing, then you need to look at the wider range of skills needed to read with comprehension. The focus of the instruction, the intensity of the instruction, and the quality of the instruction should be altered when students are struggling; the program placement or labels, not so much.

Sunday, July 21, 2013

The Lindsay Lohan Award for Poor Judgment or Dopey Doings in the Annals of Testing


Lindsay Lohan is a model of bad choices and poor judgments. Her crazy decisions have undermined her talent, her wealth, and her most important relationships. She is the epitome of bad decision making (type “ridiculous behavior” or “dopey decisions” into Google and see how fast her name comes up). Given that, it is fitting to name an award for bad judgment after her.

Who is the recipient of the Lindsay? I think the most obvious choice would be PARCC, one of the multi-state consortium test developers. According to Education Week, PARCC will allow its reading test to be read to struggling readers. I assume if students suffer from dyscalculia they’ll be able to bring a friend to handle the multiplication for them, too.

Because some students suffer from disabilities it is important to provide them with tests that are accessible. No one in their right mind would want blind students tested with traditional print; Braille text is both necessary and appropriate. Similarly, students with severe reading disabilities might be able to perform well on a math test, but only if someone read the directions to them. In other cases, magnification or extended testing times might be needed.

However, there is a long line of research and theory demonstrating important differences in reading and listening. Most studies have found that for children, reading skills are rarely as well developed as listening skills. By eighth grade, the reading skills of proficient readers can usually match their listening skills. However, half the kids who take PARCC won’t have reached eighth grade, and not everyone who is tested will be proficient at reading. Being able to decode and comprehend at the same time is a big issue in reading development. 

I have no problem with PARCC transforming their accountability measures into a diagnostic battery—including reading comprehension tests, along with measures of decoding and oral language. But if the point is to find out how well students read, then you have to have them read. If for some reason they will not be able to read, then you don’t test them on that skill and you admit that you couldn’t test them. But to test listening instead of reading with the idea that they are the same thing for school age children flies in the face of logic and a long history of research findings. (Their approach does give me an idea: I've always wanted to be elected to the Baseball Hall of Fame, despite not having a career in baseball. Maybe I can get PARCC to come up with an accommodation that will allow me to overcome that minor impediment.)  


The whole point of the CCSS was to make sure that students would be able to read, write, and do math well enough to be college- and career-ready. Now PARCC has decided that reading isn’t really a college- or career-ready skill. No reason to get a low reading score just because you can't read. I think you will agree with me that PARCC is a very deserving recipient of the Lindsay Lohan Award for Poor Judgment; now pass that bottle to me, I've got to drive home soon.

Sunday, April 7, 2013

Backwards Design and Reading Comprehension


Many schools are into what they call “backward design.” This means they start with learning goals, create/adopt assessments, and then make lessons aimed at preparing kids for those assessments.

That sounds good—if you don’t understand assessment. In some fields an assessment might be a direct measure of the goal. If you want to save $1,000,000 for retirement, look at your bank account every six months and you can estimate how close you are to your goal. How do you get closer to your goal? Add money to your accounts… work harder, save more, spend less.

Other fields? Doctors assess patients’ temperatures. If a temperature is 102 degrees, the doctor will be concerned, but he won’t assume the problem is the temperature itself. He'll guess that the temperature means some kind of infection. He’ll want to figure out what kind and treat it (the treatment itself may be an assessment—this could be strep, so I’ll prescribe an antibiotic; if it works, it was strep, and if it doesn’t, then we’ll seek another solution; all very House).

And in education? Our assessments are only samples of what we want students to be able to do. Let’s say we want to teach single-digit addition. There are 100 single-digit addition problems: 0+0=0; 0+1=1; 1+0=1; 1+1=2… 9+9=18. Of course, those 100 problems could be laid out vertically or horizontally, so that means 200 choices. There is the story-problem version too, so we could have another 100 of those. That’s 300 items, and that is a tiny universe compared to what we would face in reading comprehension.

A test-maker would sample from those 300 problems. No one wants a 300-item test; it would be expensive and unnecessary. A random sample of 30 items would represent the whole set pretty well. If Johnny gets all 30 right on such a test, we can assume that he would get all or most of the 300 right.
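
Here is a toy sketch of that sampling logic (the "student" is invented purely for illustration): the test is a blind random draw from the 300-item universe, and the sample score tracks true mastery only because nobody taught to the particular items that happened to be drawn.

```python
# Toy sketch: build the full universe of single-digit addition items,
# draw a random 30-item "test," and use the score on that sample to
# estimate mastery of the whole set. The student model is invented.
import random

# The universe: every single-digit fact, each in three hypothetical formats.
universe = [(a, b, fmt) for a in range(10) for b in range(10)
            for fmt in ("horizontal", "vertical", "story")]

def student_knows(item):
    a, b, fmt = item
    # An invented student who has never been taught story problems.
    return fmt != "story"

test_items = random.sample(universe, 30)  # the test is a blind random sample
sample_score = sum(student_knows(i) for i in test_items) / len(test_items)
true_mastery = sum(student_knows(i) for i in universe) / len(universe)

print(f"Score on the 30-item sample: {sample_score:.0%}")
print(f"Mastery of all {len(universe)} items: {true_mastery:.0%}")
```

The moment instruction is aimed at a known sample, that correspondence breaks down, which is the point of what follows.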

But what if the teacher knew what the sample was going to look like? What if only five of the items were story problems? Maybe she wouldn’t teach story problems; it wouldn’t be worth it. Kids could get a good score without those. She might notice that eight items focus on adding 5; she’d spend more time on the 5s than on the other numbers. Her kids might do well on the test, but they wouldn’t necessarily be good at addition. They’d only be good at this test of addition, which is not the same thing as reaching the addition goal.

When a reading comprehension test asks main idea questions like, “What would be a good title for this story?”, teachers will focus their main idea instruction on titles as statements of the main idea. Not on thesis statements. Not on descriptive statements. If the test is multiple-choice, then teachers will emphasize recognition over construction. If a test only asks about stories, then to hell with paragraphs.

Conceptions like main idea, theme, comparison, inference, conclusion, and so on, can be asked about in so many different ways. And there are so many texts that we could ask about. Anyone who aims instruction at a test, thinking that is the same as aiming at the goal, may get higher scores. But the cost of such a senseless focus is the students’ futures, because their skills won’t have the complexity, the depth, or the flexibility to allow them to meet the actual goal: the one that envisioned them reading many kinds of texts and being able to determine key ideas no matter how they were assessed.

Reading comprehension tests are not goals… they are samples of behaviors that represent the goals… and they are useful right up until teachers can’t distinguish them from the goals themselves.    

Saturday, October 31, 2009

Response to Instruction and Too Much Testing

This week I was honored to speak at the Center for Teaching and Learning at the University of Oregon. I was asked to talk about evaluation, which is a big issue in the great Northwest because of RtI (response to instruction). They are testing the heck out of kids in an effort to ensure that no one falls behind. It's a grand sentiment, but a poor practice.

Teachers there told me they were testing some kids weekly or biweekly. That is too much. How do I know it is too much?

The answer depends on two variables: the standard error of measurement (SEM) of the test you are using and the rate of student growth. The more certain the scores (that is, the lower the SEM), the more often you can profitably test... And the faster students improve on the measure relative to the test's SEM, the more often you can test.

On something like DIBELS, the standard errors are fairly large compared to the actual rate of student growth—thus, on that kind of measure, it doesn't make sense to measure growth more than 2-3 times per year. Any more often than that, and you won't find out anything about the child (just about the test).

The example in my PowerPoint below is based on the DIBELS oral reading fluency measure. For that test, kids read a couple of brief passages (1 minute each), and a score is obtained in words correct per minute (wcpm). Kids in first and second grade make about 1 word of improvement per week on that kind of measure.

However, studies reveal this test has a standard error of measurement of 4 to 18 words. That means that under the best circumstances (a small SEM of about 4), if a student scores 50 wcpm on the test, we can be 68 percent certain that his score is somewhere between 46 and 54. It also means that it will be at least 4 weeks before we can know whether the child is actually improving. Sooner than that, and any gains or losses that we see are likely to be due to the standard error (the normal bouncing around of test scores).

And that is the best of circumstances. As kids grow older, their growth rates decline. Older kids usually improve 1 word every three or four weeks on DIBELS. In those cases, you would not be able to discern anything new for several months. But remember, I'm giving this only 68 percent confidence, not 95 percent, and I am assuming that DIBELS has the smallest SEM possible (not likely under normal school conditions). Two or three testings per year is all that will be useful under most circumstances.
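
The arithmetic behind those claims fits in a few lines. This sketch uses the illustrative figures from above (SEMs of roughly 4-18 words; growth of about 1 word per week for younger kids and much slower for older ones) and the lenient 68 percent criterion of one SEM; a 95 percent criterion would roughly double every wait.

```python
# A back-of-the-envelope sketch: how many weeks of typical growth does it
# take before a gain is larger than the test's standard error of measurement?
import math

def weeks_between_checks(sem_words, growth_per_week, z=1.0):
    """Weeks of average growth needed before a gain clears z standard errors."""
    return math.ceil(z * sem_words / growth_per_week)

for sem in (4, 12, 18):              # illustrative ORF standard errors, in words
    for growth in (1.0, 0.3):        # words per week: younger vs. older readers
        weeks = weeks_between_checks(sem, growth)
        print(f"SEM {sem:>2} words, growth {growth:.1f} words/week "
              f"-> wait at least {weeks} weeks (68% criterion)")
```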

More frequent testing might seem rigorous, but it is time-wasting and misleading, and it simply cannot provide any useful information for monitoring kids' learning. Let's not just look highly committed and ethical by testing frequently; let's be highly committed and ethical and avoid unnecessary and potentially damaging testing.

Here is the whole presentation.

http://sites.google.com/site/shanahanstuff/home/evaluation

Monday, September 28, 2009

Putting Students into Books for Instruction

This weekend, there was a flurry of discussion on the National Reading Conference listserv about how to place students in books for reading instruction. This idea goes back to Emmett Betts in 1946. Despite that long history, there hasn’t been a great deal of research into the issue, so there are lots of opinions and insights. I tend to lurk on these listservs rather than participating, but this one really intrigued me as it explored a lot of important ideas. Here are a few.

Which ways of indicating book difficulty work best?
This question came up because the inquirer wondered if it mattered whether she used Lexiles, Reading Recovery, or Fountas and Pinnell levels. The various responses suggested a whiff of bias against Lexiles (or, actually, against traditional measures of readability including Lexiles).

So are all the measures of book difficulty the same? Well, they are and they’re not. It is certainly true that historically most measures of readability (including Lexiles) come down to two measurements: word difficulty and sentence difficulty. These factors are weighted and combined to predict some criterion. Although Lexiles include the same components as traditional readability formulas, they predict a different criterion: Lexiles are lined up with an extensive database of test performance, while most previous formulas predicted the levels of subjectively sequenced passages. Also, Lexiles have been more recently normed. One person pointed out that Lexiles and other traditional measures of readability tend to come out about the same (correlations of .77), which I think is correct, but because of the use of recent student reading performance as the criterion, I usually go with the Lexiles if there is much difference between the estimates.
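
The Lexile computation itself is proprietary, so purely as an illustration of that two-factor recipe, here is the classic Flesch-Kincaid grade-level formula: one sentence-length term and one word-difficulty term, weighted and combined into a single estimate. (This is not how Lexiles are calculated; it just shows the general shape of a traditional readability formula.)

```python
# Flesch-Kincaid grade level, shown only as a stand-in for the general
# "weighted word difficulty + sentence difficulty" recipe described above.

def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    words_per_sentence = total_words / total_sentences   # sentence difficulty
    syllables_per_word = total_syllables / total_words   # word difficulty
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Hypothetical passage: 120 words, 8 sentences, 165 syllables.
print(f"Estimated grade level: {flesch_kincaid_grade(120, 8, 165):.1f}")
```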

Over the years, researchers have challenged readability because it is such a gross index of difficulty (obviously there is more to difficulty than sentences and words), but theoretically sound descriptions of text difficulty (such as those of Walter Kintsch and Arthur Graesser) haven’t led to appreciably better text difficulty estimates. Readability usually explains about 50% of the variation in text difficulty, and these more thorough and cumbersome measures don’t do much better.

One does see a lot of Fountas and Pinnell and Reading Recovery levels these days. Readability estimates are usually only accurate to within about a year, and that is not precise enough to help a first-grade teacher match her kids with books. So these schemes claim to make finer distinctions in text difficulty early on, but that level of accuracy is open to question (I only know of one study of this, and it was moderately positive), and there is no evidence that using such fine distinctions actually matters for student learning (there is some evidence of this with more traditional measures of readability).

If anything, I think these new schemes tend to put kids into more levels than necessary. They probably correlate reasonably well with readability estimates, and their finer-grained results probably are useful for early first grade, but I’d be hard pressed to say they are better than Lexiles or other readability formulas even at those levels (and they probably lead to over-grouping).

Why does readability work so poorly for this?
I’m not sure that it really does work poorly, despite the bias evident in the discussion. If you buy the notion that reading comprehension is a product of the interaction between the reader and the text (as most reading scholars do), why would you expect text measures to explain much more than half the variance in comprehension? In the early days of readability formula design, lots of text measures were used, but those fell away as it became apparent that they were redundant and that 2-3 measures would be sufficient. The rest of the variation comes from children’s interests, their knowledge of the topics, and the like (and from our limited ability to measure student reading levels).

Is the right level the one that students will comprehend best at?
One of the listserv participants wrote that the only point of all this leveling was to get students into texts that they could understand. I think that is a mistake. Often that may be the reason for using readability, but it isn’t necessarily what teachers need to do. What a teacher wants to know is, “At what level will a child make optimum learning gains in my class?” If the child will learn better from something harder to comprehend, then, of course, we’d rather have them in that book.

The studies on this are interesting in that they suggest that sometimes you want students practicing with challenging text that may seem too hard (like during oral reading fluency practice) and other times you want them practicing with materials that are somewhat easier (like when you are teaching reading comprehension). That means we don’t necessarily want kids only reading books at one level: we should do something very different with a guided reading group that will discuss a story, and a paired reading activity in which kids are doing repeated reading, and an independent reading recommendation for what a child might enjoy reading at home.

But isn’t this just a waste of time if it is this complicated?
I don’t think it is a waste of time. The research certainly supports the idea that students do better with some adjustment and book matching than they do when they work whole class on the same level with everybody else.

However, the limitations in testing kids and testing texts should give one pause. It is important to see such data as a starting point only. By all means, test kids and use measures like Lexiles to make the best matches that you can. But don’t end up with too many groups (meaning that some kids will intentionally be placed in harder or easier materials than you might prefer), move kids if a placement turns out to be easier or harder on a daily basis than the data predicted, and find ways to give kids experiences with varied levels of texts (from easy to challenging). Even when a student is well placed, there will still be selections that turn out to be too hard or too easy, and adjusting the amount of scaffolding and support will be necessary. That means that teachers need to pay attention to how kids are doing and respond to their needs to make sure each student makes progress (i.e., improves in what we are trying to teach).

If you want to know more about this kind of thing, I have added a book to my recommended list (at the right here). It is a book by Heidi Mesmer on how to match texts with kids. Good luck.

Thursday, August 20, 2009

Yes, Virginia, You Can DIBEL Too Much!

I visited schools yesterday that used to DIBEL. You know what I mean: the teachers used to give kids the DIBELS assessments to determine how they were doing in fluency, decoding, and phonemic awareness. DIBELS has been controversial among some reading experts, but I’ve always been supportive of such measures (including PALS, TPRI, AIMSweb, etc.). I like that they can be given quickly to provide a snapshot of where kids are.

I was disappointed that they had dropped the tests and asked why. “Too much time,” they told me, and when I heard their story I could see why. This was a district that liked the idea of such testing, but their consultants had pressured them into repeating it every week for at-risk kids. I guess the consultants were trying to be rigorous, but eventually the schools gave up on the testing altogether.

The problem isn’t the test, but the silly testing policies. Too many schools are doing weekly or biweekly testing, and it just doesn’t make any sense. It’s as foolish as checking your stock portfolio every day or climbing on the scale daily during a diet. Experts in those fields understand that too much assessment can do harm, so they advise against it.

Frequent testing is misleading, and it leads to bad decisions. Investment gurus, for example, suggest that you look at your portfolio only every few months. Too many investors look at a single day’s stock losses and sell in a panic, because they don’t understand that such losses happen often—and that, long term, such losses mean nothing. The same kind of thing happens with dieting. You weigh yourself and see that you’re down 2 pounds, so what the heck, you can afford to eat that slice of chocolate cake. But your weight varies throughout the day as you work through the nutrition cycle (you don’t weigh 130, but someplace between 127 and 133). So, when your weight drops from 130 to 128, you think “bring on the dessert,” when your real weight hasn’t actually changed since yesterday.

The same kind of thing happens with DIBELS. Researchers have investigated the standard error of measurement (SEM) of tests like DIBELS (Poncy, Skinner, & Axtell, 2005, in the Journal of Psychoeducational Assessment) and found standard errors of 4 to 18 points on oral reading fluency. That’s the amount that the test scores jump around. They found that you could reduce the standard error by testing with multiple passages (something DIBELS recommends, but most schools ignore). But even testing with multiple passages only got the SEM down to 4 to 12 points.

What does that mean? Well, for example, second graders improve in words correct per minute (WCPM) in oral reading about 1 word per week. That means it would take 4 to 12 weeks of average growth for the youngster to improve more than a standard error of measurement.

If you test Bobby at the beginning of second grade and he gets 65 wcpm in oral reading, and then you test him a week later and he gets 70, has his reading improved? That looks like a lot of growth, but it is within a standard error, so it is probably just test noise. If you test him again in week 3, he might get a 68, and in week 4 he could reach 70 again, and so on. Has his reading improved, declined, or stagnated? Frankly, you can’t tell in this time frame, because on average a second grader will improve about 3 words in that time, and the test doesn’t have the precision to reliably identify a 3-word gain. The scores could be changing because of Bobby’s learning or because of the imprecision of the measurement. You simply can't tell.
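
A toy simulation makes the same point. Give a pretend Bobby one word of true growth per week, add measurement noise with an SEM of 5 words (a made-up but middling value), and the weekly "scores" bounce around just as described:

```python
# Toy simulation of the Bobby example: steady "true" growth of 1 wcpm per
# week, plus normally distributed measurement noise with an SEM of 5 words.
# The week-to-week bounce comes from the noise, not the learning.
import random

random.seed(1)

true_score = 65.0   # Bobby's "real" fluency at the start of second grade
sem = 5.0           # an assumed, middling standard error of measurement

for week in range(1, 9):
    observed = true_score + random.gauss(0, sem)
    print(f"week {week}: true {true_score:.0f} wcpm, observed {observed:.0f} wcpm")
    true_score += 1.0   # about one word of genuine growth per week
```

On that timescale you are mostly watching the noise, not the learning.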

Stop the madness. Let’s wait 3 or 4 months. That is still a little quick, perhaps, but since we use multiple passages to estimate reading levels, it is probably okay. In that time frame, Bobby should gain about 12-16 words correct per minute if everything is on track. If the new testing reveals gains that are much lower than that, then we can be sure there is a problem, and we can make some adjustment to instruction. Testing more often can’t help, but it might hurt!