
Sunday, November 29, 2015

On Progress Monitoring, Maze Tests, and Reading Comprehension Assessment

Teacher question:
I am looking for some insight on the use of mazes to progress monitor reading comprehension. I teach in a middle school (6-8) and am struggling with using this to measure reading comprehension with fluent readers. So much of their reading comprehension in class is measured by determining main idea, recalling basic facts, inferencing, and analyzing the use of literary elements. It seems that when the maze is used to monitor reading comprehension, it doesn't offer much information about the reader. Often students rush through it and circle words just to complete it in the time allotted, and they score exactly the same as students who read and choose the correct words but do not finish in the allotted time. It seems like student motivation is a critical component of the accuracy of these scores.

Is the maze an effective way to measure passage comprehension, or is it simply a way to measure sentence comprehension? Do you have any suggestions on what else could be used? I appreciate your help with this and look forward to your response.

Shanahan responds:
            John Guthrie developed maze in the 1970s to determine how well students could read particular texts. Let’s say you have a 7th grade science book and want to know who in your class is likely to struggle with that book. 

            To figure this out you'd test students on several passages from that science book. According to Guthrie, students who score 50% or higher on maze should be able to handle this book. 

            The benefit of maze is that it is easy to construct, administer, and score, and maze results are reasonably accurate and reliable. (To design a maze test, you select a passage of 150-200 words, delete a word from the second sentence, and then every 5th or 7th word after that. Provide the students with three word choices in random order: the correct word, a word that is the same part of speech but incorrect, and a word that is the wrong part of speech.)
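            If it helps to picture the mechanics, here is a minimal sketch of that construction procedure in Python. Everything in it is illustrative: the passage, the distractor lookup, and the helper names are made up, and the same-part-of-speech and wrong-part-of-speech distractors would have to be supplied by whoever builds the test.

```python
import random
import re

def build_maze(passage, interval=5, distractor_fn=None, seed=0):
    """Sketch of maze construction as described above: leave the first
    sentence intact, then turn every `interval`-th word of the remaining
    text into a three-choice item (the correct word, a same-part-of-speech
    distractor, and a wrong-part-of-speech distractor). `distractor_fn`
    is supplied by the test author and returns those two distractors."""
    rng = random.Random(seed)
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    first, rest = sentences[0], " ".join(sentences[1:])
    out, answer_key = [], []
    for i, word in enumerate(rest.split(), start=1):
        target = word.strip(".,;:!?")
        if i % interval == 0 and distractor_fn is not None:
            same_pos, wrong_pos = distractor_fn(target)
            choices = [target, same_pos, wrong_pos]
            rng.shuffle(choices)                       # present choices in random order
            answer_key.append((len(answer_key) + 1, target))
            out.append("(" + " / ".join(choices) + ")" + word[len(target):])
        else:
            out.append(word)
    return first + " " + " ".join(out), answer_key

# Illustrative use with a tiny hand-built distractor lookup (invented data):
lookup = {"evaporate": ("freeze", "cloudy"), "oceans": ("rivers", "slowly")}
text, key = build_maze(
    "Water covers most of Earth. Heat makes surface water evaporate "
    "into the air above the oceans.",
    distractor_fn=lambda w: lookup.get(w, ("time", "quietly")))
print(text)
```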

            As you point out, maze tells you nothing about what comprehension skills students have or how well they can answer certain kinds of questions. However, traditional question-and-answer comprehension tests can't tell you that either, so switching tests won't solve that problem for you.

            I was at the University of Delaware during the 1970s, where John Guthrie was working at the time. He told my office mate, the late Aileen Tobin, a funny thing about maze: they had tried it out with individual sentences and with passages (as described above), and it didn't make any difference. Even when the sentences were presented randomly, students seemed to perform equally well.

            We laughed a lot about that. It just didn't make sense to us. We wondered if that was also true of other popular measures such as cloze tests. (Cloze is similar to maze, but harder to administer because instead of multiple-choice it requires students to fill in the blanks.)

            Our banter over this issue ended up in a series of research studies that I carried out. We found just what you surmised. Students performed as well on passages presented in their original sentence order as on passages in which we had scrambled the order of the sentences. Imagine reading Moby Dick, starting with sentence 16, then 5, then 32, then 1, etc. (Randomizing sentence order doesn't hurt maze or cloze performance, but it wreaks havoc on summary writing.)
           
            I also found that cloze correlated best with multiple-choice reading comprehension tests that asked questions based on information from single sentences. Correlations were lower if students had to synthesize information across the passages.

            Cloze and maze tests provide reasonable predictions of reading comprehension, but they do this based on how well students interpret single sentences. For most readers, the prediction works because it is unusual that someone develops the ability to read sentences without developing the ability to read texts.  

            If you want to know who is going to struggle with your literature anthology, maze can be a tool that will help you to accomplish that. If you want to identify specific reading comprehension skills so you can provide appropriate practice, maze won’t help, but neither will the testing alternatives that you could consider.

            You say you want to monitor your students’ reading comprehension. I suspect that means you need a way of determining at various points during the year whether your students are reading better. For this, I would suggest that you use a collection of graded passages (using Lexiles or some other text evaluation method to put these on a difficulty continuum). Identify the levels of difficulty your students can handle successfully (this could be done with maze tests of those passages), and then later in the year, check to see if the students can now handle passages that are even harder. 

          Monitoring comprehension means not tabulating which specific skills have been accomplished, but determining what complexity of text language students can negotiate. Perhaps early in the year, your students will be able to score 50% or higher with texts written at 800 Lexiles. By mid-year you'd want them to score like that with harder passages (e.g., 900L-950L). That kind of testing regimen would allow you to identify who is improving and who is not.
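          For anyone who wants to see what the bookkeeping for that kind of regimen might look like, here is a small sketch. The 50% criterion and the Lexile bands come from the discussion above; the student names, scores, and testing windows are invented purely for illustration.

```python
PASS_CRITERION = 0.50  # maze score needed to count a text as manageable

def highest_band_mastered(scores_by_band):
    """scores_by_band maps a Lexile band (e.g., 800) to a maze score (0.0-1.0);
    return the hardest band on which the student met the criterion."""
    mastered = [band for band, score in scores_by_band.items()
                if score >= PASS_CRITERION]
    return max(mastered) if mastered else None

# Invented scores from two testing windows during the year.
fall   = {"Jamal": {800: 0.62, 900: 0.41}, "Rosa": {800: 0.48}}
winter = {"Jamal": {900: 0.55, 950: 0.38}, "Rosa": {800: 0.57, 900: 0.44}}

label = lambda band: f"{band}L" if band is not None else "below the easiest band tested"
for student in fall:
    before = highest_band_mastered(fall[student])
    after = highest_band_mastered(winter[student])
    improving = after is not None and (before is None or after > before)
    print(f"{student}: fall {label(before)}, winter {label(after)}, improving: {improving}")
```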




Tuesday, September 22, 2015

Does Formative Assessment Improve Reading Achievement?

                        Today I was talking to a group of educators from several states. The focus was on adolescent literacy. We were discussing the fact that various programs, initiatives, and documents—all supposedly research-based efforts—were promoting the idea that teachers should collect formative assessment data.

            I pointed out that there wasn’t any evidence that it actually works at improving reading achievement with older students.

            I see the benefit of such assessment or “pretesting” when dealing with the learning of a particular topic or curriculum content. Testing kids on what they know about a topic may allow a teacher to skip some topics or to identify topics that may require more extensive classroom coverage than originally assumed.

            It even seems to make sense with certain beginning reading skills (e.g., letter names, phonological awareness, decoding, oral reading fluency). Various tests of these skills can help teachers to target instruction so no one slips by without mastering these essential skills. I can’t find any research studies showing that this actually works, but I myself have seen the success of such practices in many schools. (Sad to say, I’ve also seen teachers reduce the amount of teaching they provide in skills that aren’t so easily tested—like comprehension and writing—in favor of these more easily assessed topics.)

            However, “reading” and “writing” are more than those specific skills—especially as students advance up the grades. Reading Next (2004), for example, encourages the idea of formative assessment with adolescents to promote higher literacy. I can’t find any studies that support (or refute) the idea of using formative assessment to advance literacy learning at these levels, and unlike with the specific skills, I’m skeptical about this recommendation.

            I’m not arguing against teachers paying attention… “I’m teaching a lesson and I notice that many of my students are struggling to make sense of the chemistry book, so I change my upcoming lessons, providing a greater amount of scaffolding to ensure that they are successful.” Or, even more likely… “I’m delivering a lesson and can see that the kids aren’t getting it, so tomorrow we revisit the lesson.”

            Those kinds of observations and on-the-fly adjustments may be all that is implied by the idea of “formative assessment.” If so, it is obviously sensible, and it isn’t likely to garner much research evidence.

            However, I suspect the idea is meant to be more sophisticated and elaborate than that. If so, I wouldn’t encourage it. It is hard for me to imagine what kinds of assessment data would be collected about reading in these upper grades, and how content teachers would ever use that information productively in a 42-minute period with a daily case load of 150 students.

            A lot of what seems to be promoted these days as formative assessment is getting a snapshot or level of a school’s reading performance, so that teachers and principals can see how much gain the students make in the course of the school year (in fact, I heard several of these examples today). That isn’t really formative assessment by any definition that I’m aware of. That is just a kind of benchmarking to keep the teachers focused. Nothing wrong with that… but you certainly don’t need to test 800 kids to get such a number (a randomized sample would provide the same information a lot more efficiently).
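            To make the sampling point concrete, here is a rough sketch of the arithmetic. The scores are simulated and the numbers (800 students, a sample of 150) are just illustrative; the point is only that a random sample gives you essentially the same school-level estimate as testing everybody.

```python
import random
import statistics

random.seed(42)
# Simulated stand-in for a school's 800 reading scores; in practice these
# would come from whatever assessment the school already uses.
all_students = [random.gauss(700, 100) for _ in range(800)]

sample = random.sample(all_students, 150)       # test only 150 randomly chosen students
mean = statistics.mean(sample)
margin = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5   # approximate 95% CI
print(f"estimate from the sample: {mean:.0f} +/- {margin:.0f}")
print(f"mean if every student were tested: {statistics.mean(all_students):.0f}")
```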

            Of course, many of the computer instruction programs provide a formative assessment placement test that supposedly identifies the skills that students lack so they can be guided through the program lessons. Thus, a test might have students engaged in a timed task of filling out a cloze passage. Then the instruction has kids practicing this kind of task. Makes sense to align the assessment and the instruction, right? But cloze has a rather shaky relationship with general reading comprehension, so improving student performance on that kind of task doesn’t necessarily mean that these students are becoming more college and career ready. Few secondary teachers and principals are savvy about the nature of reading instruction, so they get mesmerized by the fact that “formative assessment”—a key feature of quality reading instruction—is being provided, and the “gains” that they may see are encouraging. That these gains may reflect nothing that matters would likely never occur to them; if it looks like reading instruction, it must be reading instruction.

            One could determine the value of such lessons by using other outcome measures that are more in line with the kinds of literacy one sees in college, as well as in the civic, familial, and economic lives of adults. And one could determine the value of the formative assessments included in such programs by having some groups use the program while following the diagnostic guidance based on the testing, and having other groups just use the program by following a set grade-level sequence of practice. I haven’t been able to find any such studies on reading, so I guess we have to take the value of this pretesting on faith.

            Testing less—even for formative purposes—and teaching more seems to me to be the best way forward in most situations. 

Thursday, July 2, 2015

Why Research-Based Reading Programs Alone Are Not Enough

Tim,

Every teacher has experienced this. While the majority of the class is thriving with your carefully planned, research-supported instructional methods, there is often one kid who is significantly less successful. We work with them individually in class, help them after school, sometimes change things up to see what will work, and bring them to the attention of the RtI team that is also using the research-supported instructional methods. But what if the methods research supports for the majority of kids don't work for this kid?

Several months ago I read an article in Discover magazine called "Singled Out" by Maggie Koerth-Baker. Although it deals with medicine rather than education, the article is about using N of 1 experiments to find out whether an individual patient reacts well to a particular research-backed treatment. http://discovermagazine.com/2014/nov/17-singled-out

"But even the gold standard isn't perfect. The controlled clinical trial is really about averages, and averages don't necessarily tell you what will happen to an individual."

Ever since I read the article, I've been wondering what an N of 1 experiment would look like in the classroom. This would be much easier to implement with the controlled numbers of a special education classroom, but since we do so much differentiation in the regular classroom now, I'd like to find a way to objectively tell if what we do for individuals is effective in the short term, rather than waiting for the high-stakes testing that the whole class takes. Formative assessment is helpful, but I suspect we need something more finely tuned to tease out what made the difference. We gather tons of data to report at RtI meetings, but at least at my school, it's things like sight word percentages, reading levels, and fluency samples, not clear indicators of, say, whether a child is actually responding to a particular change we've made. As a researcher, how would you set up an N of 1 experiment in an elementary classroom?

My response:
This letter points out an important fact about experimental research and its offshoots (e.g., quasi-experiments, regression discontinuity designs): when we say a treatment was effective that doesn’t mean everyone who got the special whiz-bang teaching approach did better than everyone who didn’t. It just means one group, on average, did better than the other group, on average.

For example, Reading First was a federal program that invested heavily in trying to use research-based approaches to improve beginning reading achievement in Title I schools. At the end of the study, the RF schools weren't doing much better than the control schools overall. But that doesn't mean there weren’t individual schools that used the extra funding well to improve their students’ achievement, just that there weren’t enough of those schools to make a group difference.

The same happens when we test the effectiveness of phonics instruction or comprehension strategies. A study may find that the average score for the treatment group was significantly higher than that obtained by the control group, but there would be kids in the control group who would outperform those who got the treatment, and students in the successful treatment who weren’t themselves so successful.

That means that even if you were to implement a particular procedure perfectly and with all of the intensity of the original effort (which is rarely the case), you'd still have students who were not very successful with the research-based training.

A while back, Atul Gawande wrote in The New Yorker about the varied results obtained in medicine with research-based practices (“The Bell Curve”). Dr. Gawande noted that particular hospitals, although they followed the same research-based protocols as everyone else, were so scrupulous and vigorous in their application of those methods that they obtained better results.

For example, in the treatment of cystic fibrosis, it's a problem when a patient’s breathing capacity falls below a certain level. If lung capacity reaches that benchmark, standard practice is to hospitalize the patient to try to regain breathing capacity. However, in the particularly effective hospitals, doctors didn’t wait for the problem to become manifest. As soon as things started going wrong for a patient and breathing capacity began to decline, they intervened.

It is less about formal testing (since our measures usually lack the reliability of those used in medicine) or about studies with Ns of 1 than it is about thorough and intensive implementation of research-based practices and careful, ongoing monitoring of student performance within instruction.

Many educators and policymakers seem to think that once research-based programs are selected, then we no longer need to worry about learning. That neglects the fact that our studies tell us less about what works than they do about what may work under some conditions. Our studies tell us about practices that have been used successfully, but people are so complex that you can’t guarantee such programs will always work that way. It is a good idea to use practices that have been successful--for someone--in the past, but such practices do not have automatically positive outcomes. In the original studies, teachers would have worked hard to implement them successfully; later, teachers may be misled into thinking that if they just take kids through the program, the same levels of success will automatically be obtained.

Similarly, in our efforts to make sure that we don't lose some kids, we may impose testing regimes aimed at monitoring success, such as DIBELing kids several times a year… but such instruments are inadequate for such intensive monitoring and can end up being misleading.

I’d suggest, instead, that teachers use those formal monitors less frequently (a couple or three times a year), but observe the success of their daily lessons more carefully. For example, a teacher is having students practice hearing differences in the endings of words. Many students are able to implement the skill successfully by the end of the lesson, but some are not. If that’s the case, supplement that lesson with more practice rather than just going on to the next prescribed lesson (or do this alongside continued progress through the program). If the lesson was supposed to make it possible for kids to hear particular sounds, then do whatever you can to enable them to hear those sounds.

To monitor ongoing success this carefully, the teacher does have to plan lessons that allow students many opportunities to demonstrate whether or not they can implement the skill. The teacher also has to have a sense of what success may look like (e.g., the students don’t know these 6 words well enough if they can’t name them in 10 seconds or less; the students can’t spell these particular sounds well enough if they can’t get 8 out of 10 correct; the student isn’t blending well enough if they… etc.).
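If it helps, here is a bare-bones sketch of checking those kinds of success criteria against lesson observations. The thresholds (naming the 6 words within 10 seconds, spelling 8 of 10 correctly) echo the examples above; the student records and field names are invented.

```python
def needs_more_practice(obs):
    """Return the lesson targets a student has not yet met, using the
    informal criteria described above."""
    unmet = []
    if obs["words_named"] < 6 or obs["naming_seconds"] > 10:
        unmet.append("naming the target words")
    if obs["spellings_correct"] < 8:   # out of 10 attempts
        unmet.append("spelling the target sounds")
    return unmet

# Invented observation records from today's lesson.
observations = {
    "Ana":    {"words_named": 6, "naming_seconds": 9,  "spellings_correct": 9},
    "Marcus": {"words_named": 5, "naming_seconds": 12, "spellings_correct": 6},
}

for student, obs in observations.items():
    gaps = needs_more_practice(obs)
    print(student, "->", ("extra practice on " + ", ".join(gaps)) if gaps else "met today's criteria")
```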


If a program of instruction can be successful, and you make sure that students do well with the program—actually learning what is being presented by the program—then you should have fewer kids failing to progress.

Sunday, January 25, 2015

Concerns about Accountability Testing

Why don’t you write more about the new tests?

I haven’t written much about PARCC or SBAC—or the other new tests that some states are adopting—in part because they are not out yet. There are some published prototypes, and I was one of several people asked to examine the work product of these consortia. Nevertheless, the information available is very limited, and I fear that almost anything I might write could be misleading (the prototypes are not necessarily what the final products will turn out to be).

However, let me also say that, unlike many who strive for school literacy reform and who support higher educational standards, I’m not all that enthused about the new assessments. 

Let me explain why.

1. I don't think the big investment in testing is justified. 

I’m a big supporter of teaching phonics and phonological awareness because research shows that to be an effective way to raise beginning reading achievement. I have no commercial or philosophical commitment to such teaching, but trust the research. There is also strong research on the teaching of vocabulary, comprehension, and fluency, and expanding the amount of teaching is a powerful idea, too.

I would gladly support high-stakes assessment if it had a similarly strong record of stimulating learning, but that isn't the case.

Test-centered reform is expensive, and it has not been proven to be effective. The best studies of it that I know of reveal either extremely slight benefits or somewhat larger losses (on balance, it is—at best—a draw). Having test-based accountability does not lead to better reading achievement.

(I recognize that states like Florida have raised achievement while they had high-stakes testing. The testing may have been part of what made such reforms work, but you can't tell whether the benefits were actually due to the other changes made simultaneously, such as professional development, curriculum, instructional materials, and amount of instruction.)

2. I doubt that new test formats—no matter how expensive—will change teaching for the good.

In the early 1990s, P. David Pearson, Sheila Valencia, Robert Reeve, Karen Wixson, Charles Peters, and I were involved in helping Michigan and Illinois develop innovative tests: tests that included entire texts and multiple-response question formats that did away with the one-correct-answer notion. The idea was that if we had tests that looked more like “good instruction,” then teachers who tried to imitate the tests would do a better job. Neither Illinois nor Michigan saw learning gains as a result of these brilliant ideas.

That makes me skeptical about both PARCC and SBAC. Yes, they will ask some different types of questions, but that doesn’t mean the teaching that results will improve learning. I doubt that it will.

I might be more excited if I didn’t expect companies and school districts to copy the formats, but miss the ideas. Instead of teaching kids to think deeply and to reason better, I think they’ll just put a lot of time into two-part answers and clicking. 

3. Longer tests are not really a good idea.

We should be trying to maximize teaching and minimize testing (minimize, not do away with). We need to know how states, school districts, and schools are doing. But this can be figured out with much less testing. We could easily estimate performance on the basis of samples of students—rather than entire student bodies—and we don’t need annual tests; with samples of reliable sizes, the results just don’t change that frequently.

Similarly, no matter how cool a test format may seem, it is probably not worth the extra time needed to administer it. I suspect the results of these tests will correlate highly with those of the tests that they replace. If that's the case, will you really get any more information from these tests? And, if not, then why not use these testing days to teach kids instead? Anyone interested in closing poverty gaps, or international achievement gaps, is simply going to have to bite the bullet: more teaching, not more testing, is the key to catching up.

4. The new reading tests will not provide evidence on skills ignored in the past.

The new standards emphasize some aspects of reading neglected in the past. However, these new tests are not likely to provide any information about these skills. Reading tests don't work that way (math tests do, to some extent). We should be able to estimate the Lexile levels that kids are attaining, but we won’t be able to tell if they can reason better or are more critical thinkers (they may be, but these tests won’t reveal that).

Reading comprehension tests—such as those used by all 50 states for accountability purposes—can tell us how well kids can comprehend. They cannot tell which skills the students have (or even if reading comprehension actually depends on such a collection of discrete skills). Such tests, if designed properly, should provide clues about the level of language difficulty that students can negotiate successfully, but beyond that we shouldn’t expect any new info from the items.

On the other hand, we should expect some new information. The new tests are likely to have different cut scores or criteria of success. That means these tests will probably report much lower scores than in the past. Given the large percentage of boys and girls who “meet or exceed” current standards, graduate from high school, and enter college, but who lack basic skills in reading, writing, and/or mathematics, it would only be appropriate that their scores be lower in the future.


However, I predict that when those low test scores arrive, there will be a public outcry that some politicians will blame on the new standards. Instead of recognizing that the new tests are finally offering honest information about how their kids are doing, people will believe that the low scores are the result of poor standards, and there'll be a strong negative reaction. Instead of agitating for better schools, the public will be stimulated to support lower standards.

The new tests will only help if we treat them differently than the old tests. I hope that happens, but I'm skeptical.