
Tuesday, September 22, 2015

Does Formative Assessment Improve Reading Achievement?

                        Today I was talking to a group of educators from several states. The focus was on adolescent literacy. We were discussing the fact that various programs, initiatives, and documents—all supposedly research-based efforts—were promoting the idea that teachers should collect formative assessment data.

            I pointed out that there wasn’t any evidence that it actually works at improving reading achievement with older students.

            I see the benefit of such assessment or “pretesting” when dealing with the learning of a particular topic or curriculum content. Testing kids on what they already know about a topic may allow a teacher to skip some topics or to identify others that require more extensive classroom coverage than originally assumed.

            It even seems to make sense with certain beginning reading skills (e.g., letter names, phonological awareness, decoding, oral reading fluency). Various tests of these skills can help teachers to target instruction so no one slips by without mastering these essential skills. I can’t find any research studies showing that this actually works, but I myself have seen the success of such practices in many schools. (Sad to say, I’ve also seen teachers reduce the amount of teaching they provide in skills that aren’t so easily tested—like comprehension and writing—in favor of these more easily assessed topics.)

            However, “reading” and “writing” are more than those specific skills—especially as students advance up the grades. Reading Next (2004), for example, encourages the idea of formative assessment with adolescents to promote higher literacy. I can’t find any studies that support (or refute) the idea of using formative assessment to advance literacy learning at these levels, and unlike with the specific skills, I’m skeptical about this recommendation.

            I’m not arguing against teachers paying attention… “I’m teaching a lesson and I notice that many of my students are struggling to make sense of the Chemistry book, so I change up my upcoming lessons, providing a greater amount of scaffolding to ensure that they are successful.” Or, even more likely… I’m delivering a lesson and can see that the kids aren’t getting it, so tomorrow we revisit the lesson.

            Those kinds of observations and on-the-fly adjustments may be all that is implied by the idea of “formative assessment.” If so, it is obviously sensible, and it isn’t likely to garner much research evidence.

            However, I suspect the idea is meant to be more sophisticated and elaborate than that. If so, I wouldn’t encourage it. It is hard for me to imagine what kinds of assessment data would be collected about reading in these upper grades, and how content teachers would ever use that information productively in a 42-minute period with a daily case load of 150 students.

            A lot of what seems to be promoted these days as formative assessment is getting a snapshot of a school’s reading performance, so that teachers and principals can see how much gain the students make in the course of the school year (in fact, I heard several of these examples today). That isn’t really formative assessment by any definition that I’m aware of. That is just a kind of benchmarking to keep the teachers focused. Nothing wrong with that… but you certainly don’t need to test 800 kids to get such a number (a randomized sample would provide the same information a lot more efficiently).
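If you want a sense of how little testing such a benchmark actually requires, here is a minimal sketch in Python. Everything in it is made up (the 800-student school, the score scale, the sample of 150); the point is only that a random sample pins down a school-wide average closely enough for benchmarking without testing every child.

```python
import random
import statistics

# Hypothetical scale scores for all 800 students in a school (invented numbers;
# in practice these are exactly what you are trying to avoid collecting).
random.seed(1)
all_scores = [random.gauss(220, 30) for _ in range(800)]

# Test a random sample of 150 students instead of everyone.
sample = random.sample(all_scores, 150)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5  # standard error of the mean

print(f"Estimate from the sample: {mean:.1f} +/- {1.96 * se:.1f}")
print(f"Average if everyone were tested: {statistics.mean(all_scores):.1f}")
```

The sampled estimate lands within a couple of points of the all-students figure, which is all a benchmark needs to do.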

            Of course, many of the computer instruction programs provide a formative assessment placement test that supposedly identifies the skills that students lack so they can be guided through the program lessons. Thus, a test might have students engaged in a timed task of filling out a cloze passage. Then the instruction has kids practicing this kind of task. Makes sense to align the assessment and the instruction, right? But cloze has a rather shaky relationship with general reading comprehension, so improving student performance on that kind of task doesn’t necessarily mean that these students are becoming more college and career ready. Few secondary teachers and principals are savvy about the nature of reading instruction, so they get mesmerized by the fact that “formative assessment”—a key feature of quality reading instruction—is being provided, and the “gains” that they may see are encouraging. That these gains may reflect nothing that matters would likely never occur to them; if it looks like reading instruction, it must be reading instruction.

            One could determine the value of such lessons by using other outcome measures that are more in line with the kinds of literacy one sees in college, as well as in the civic, familial, and economic lives of adults. And, one could determine the value of the formative assessments included in such programs by having some groups use the program following the diagnostic guidance based on the testing, while other groups just use the program following a set grade-level sequence of practice. I haven’t been able to find any such studies on reading, so we have to take the value of this pretesting on faith, I guess.

            Testing less—even for formative purposes—and teaching more seems to me to be the best way forward in most situations. 

Saturday, June 20, 2015

Making Whole Class Work More Effective

          Recently, I wrote about the quandary of grouping. Small group instruction supports greater student engagement, higher amounts of interaction, greater opportunity for teacher observation, and more student learning. However, the benefits of small group work are balanced by the relative ineffectiveness of most seatwork activities. Subtract the downside of working on one's own, away from the teacher, from the clear benefits of small group teaching, and one ends up with little advantage for all of the effort of orchestrating the small-group oriented classroom.
          Despite this, the benefits of small group teaching are so obvious that it is not uncommon for coaches and supervisors to promote a lot of small group work in spite of its ultimate lack of benefit.

          While arguing to keep the small-group teaching arrow in my quiver, I suggested that one of the best things we could do as teachers was to work on our large-group teaching skills. The focus of this has to be not on organizing our classes in particular ways, but on ensuring that all of our students learn as much as possible.

So what kinds of things can one do to make large group or whole class teaching more effective? In other words, how can you maintain the efficiency of whole-class teaching, while grabbing the same benefits one gets from small-group work?

1.     Get close to the kids
            In small-group work, teachers command greater attention and involvement partly by being so close. Small groups are often arrayed around the teacher or pulled together at a single table. But with whole-class work, the teacher may as well be on the Moon. Perching yourself at the desk or whiteboard puts you in a different orbit than the kids. No eye contact with the individual students, and no chance that you’ll reach out and touch them; no wonder we lose attention. Set up your classroom so that you can move easily among the students and can reach them without a lot of rigmarole. Place students where you want them to be to support high attention (no, Billy cannot sit where he wants).

2.     Ask questions first and assign them to students later
          One way of maximizing attention is to ask your question first, and then call on the student who is to answer. Even put a bit of a pause between the question and the assignment. The point of the question is rarely to get one student thinking, but to get the whole class to reflect on the problem. When a teacher says, “Johnny, why was Baby Bear so upset with Goldilocks?,” Johnny will think about it, but most of the other kids will take a pass. When she says, “Why was Baby Bear so upset with Goldilocks?.... Johnny?” everybody has to think about it because they can’t be sure who'll get called.

3.     Focus on teaching, not putting on a show
          Many of us grew up watching Phil Donahue and Oprah. We know how to run a Q&A discussion with a studio audience because we have seen it so often. The tempo moves along, there aren’t long pauses or digressions, and at the end the pertinent info has been covered. But what’s good TV would be lousy teaching. The idea that you’re the emcee presenting information—even with some audience participation—is the wrong mindset. You may be teaching a group of 30 students in a whole class setting, but you have to think of them as 30 individuals, not one group. Your job is to maximize participation for the students while increasing your opportunity to monitor individual progress.

4.     Maximize student response
                 Too often in whole-class work the teacher asks a question, then calls on a child to answer. There are many better schemes for this that allow more student thinking and response, such as “think-pair-share.” Here the teacher asks a question, but has the kids talking it over with each other before answering (the smallest configuration for this can be pairs, but the pairs can then talk to other pairs, and other schemes make sense as well). This increases the degree to which everyone thinks about the question and tries to figure out an answer.

               Another popular approach is the multiple-response card. With simple yes-no tasks, thumbs up-thumbs down may be sufficient. Thus, if the teacher is doing a phonological awareness activity, she may have the students respond with thumbs up if a pair of words start or end with the same sound, and a thumbs down otherwise. For more complex responses, cards may be better. For example, the students might have a card for each character in a story, and the teacher can then ask questions like, Who packed the picnic basket? Who was supposed to take the basket to grandmother? Who was lurking in the woods? And, all the students then hold up the cards that reveal the answer.

                A third way, not used enough in my opinion, is the written answer. Teachers can ask any kind of question, and have everyone write an answer to it. The oral responses that follow tend to be longer and more involved than what kids come up with when they answer aloud on the spot. The written record is useful here because it allows teachers to check who answered the question well and the quality of the reasoning, and it can take the class back into the text to figure out any discrepancies.

5.     Teach groups in whole class—teaching in a fishbowl
            Sometimes you can increase the involvement of particular students even though you are working in whole class. Let’s say everyone has been asked to read Chapter 6 of the social studies book, and now the class is going to discuss. The teacher might select 5-8 students who she wants to be the primary discussants this time. These students may sit in a circle in the middle of the classroom and everyone else will be arrayed around them. The teacher leads the discussion with her questions and challenges, and the students in the inner circle answer and talk about the ideas. The students on the outside observe, participate in the discussion if the inner group is stuck, and perhaps write answers to the same questions. Through careful selection, the teacher is able to maximize the amount of participation of quiet students or those who usually get shut out of the discussions by being too slow.

6.     Be strategic in calling on students
          It can be difficult to manage calling on students. Certain students always seem to have an answer, and are quick to respond. This shuts out others who need to explore their thinking and who would benefit from teacher follow-up. Teachers can do what football coaches do, which is plan their plays ahead of time, changing up the routine only if the situation changes. Thus, a teacher might, during planning, decide not just what to ask, but who she wants to hear from. That means if certain students are struggling to give longer answers or sufficient explanation, the teacher can be ready to initiate and guide them through some scaffolded work within the context of the whole class lesson. In other cases, more randomized calling (in which everyone has an equal chance) might make sense; this is easily accomplished with the tongue-depressor routine, in which all the student names are on tongue depressors and the teacher just pulls sticks out of the can as she needs a response or explanation.

7.     Whole class can be more than lecture or Q&A
          Instead of using worksheets as “shut up sheets” (thanks, Vicki Gibson), use these tasks to engage everyone within the class in an interactive activity. For example, let’s say the task is finding text evidence. The worksheet includes assertions based on the text, and the students have to locate information from the text that supports the assertion. Kids could go off and do that on their own or they could do it in separate small group activities with teacher scaffolding, but that kind of task could be done most efficiently with teacher participation in the whole class. The teacher needs to observe how the students go about the task—maybe even taking notes on who just started reading and who went to particular parts of the text, who's copying, who's paraphrasing, and so on. At any point, the teacher might stop the class and ask about the strategies being used and might provide some guidance for proceeding more effectively.

            Remember, even in whole class teaching, you want students to pay attention; you want to get as many students to respond and participate as possible (without losing everyone else’s attention); you want maximum possibility of identifying when problems and misunderstandings occur so that you can scaffold, explain, and guide students to solve the problem. Structure whole group lessons in those ways, and then follow up in smaller groups (and even individually) to ensure success with what is being taught.


Thursday, April 3, 2014

Apples and Oranges: Comparing Reading Scores across Tests

I get this kind of question frequently from teachers who work with struggling readers, so I decided to respond publicly. What I say about these two tests would be true of others as well.

I am a middle school reading teacher and have an issue that I'm hoping you could help me solve. My students' placements are increasingly bound to their standardized test results. I administer two types of standardized tests to assess the different areas of student reading ability. I use the Woodcock Reading Mastery Tests and the Terra Nova Test of Reading Comprehension. Often, my students' WRMT subtest scores are within the average range, while their Terra Nova results fall at the lower end of the average range or below. How can I clearly explain these discrepant results to my administrators? When they see average scores on one test they believe these students are no longer candidates for remedial reading services.

Teachers are often puzzled by these kinds of testing discrepancies, but they can happen for a lot of reasons.

Reading tests tend to be correlated with each other, but this kind of general performance agreement between two measures doesn’t mean that they would categorize student performance identically. Performing at the 35th percentile might give you a below-average designation with one test, but an average one with the other. It is probably better to stay away from those designations and use NCE scores or something else that is comparable across the tests.
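For readers who want the actual conversion, the standard definition is NCE = 50 + 21.06 × z, where z is the normal deviate corresponding to the percentile rank. A quick Python sketch (the function name is mine):

```python
from statistics import NormalDist

def percentile_to_nce(percentile: float) -> float:
    """Convert a percentile rank to a Normal Curve Equivalent (NCE) score,
    using the standard definition NCE = 50 + 21.06 * z, where z is the
    normal deviate corresponding to the percentile rank."""
    z = NormalDist().inv_cdf(percentile / 100)
    return 50 + 21.06 * z

# A student at the 35th percentile on either test lands at roughly NCE 42,
# a figure that can be compared directly across the two tests.
print(round(percentile_to_nce(35), 1))
```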

An important issue in test comparison is the norming samples that they use. And, that is certainly the case with these two tests. Terra Nova has a very large and diverse nationally representative norming sample (about 200,000 kids), while the WRMT is based on a much smaller group that may be skewed a bit towards struggling students (only 2,600 kids). When you say that someone is average or below average, you are comparing their performance with that of the norming group. Because of their extensiveness, I would trust the Terra Nova norms more than the WRMT ones; Terra Nova would likely give me a more accurate picture of where my students are compared to the national population. The WRMT is useful because it provides greater information about how well the kids are doing in particular skill areas, and it would help me to track growth in these skills.

Another thing to think about is reliability. Find out the standard error of the tests that you are giving and calculate 95% confidence intervals for the scores. Scores should be stated in terms of the range of performance that the score represents. Lots of times you will find that the confidence intervals of the two tests are so wide that they overlap. This would mean that though the score differences look big, they may not really be different. Let’s say that the standard error of one of the tests is 5 points (you need to look up the actual standard error in the manual), and that your student received a standard score of 100 on that test. That would mean that the 95% confidence interval for this score would be 90-110 (in other words, if the student took this test over and over, 95% of his scores should fall within that range). Now say that the standard error of the other test is 8 and that the student’s score on that test was 120. That looks pretty discrepant, but the confidence interval for that one is 104-136. Because 90-110 (the confidence interval for the first test) overlaps with 104-136 (the confidence interval of the second test), these scores look very different and yet they may not be reliably different at all.
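Here is the same arithmetic as a small Python sketch, using the numbers from the example above (the function names are mine, and like the example it rounds 1.96 standard errors up to 2):

```python
def ci_95(score: float, se: float) -> tuple[float, float]:
    """95% confidence interval for an observed score, given the test's
    standard error of measurement (rounding 1.96 SEs up to 2, as above)."""
    return (score - 2 * se, score + 2 * se)

def overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

test_1 = ci_95(100, 5)   # (90.0, 110.0)
test_2 = ci_95(120, 8)   # (104.0, 136.0)
print(test_1, test_2, overlap(test_1, test_2))  # intervals overlap -> not reliably different
```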

You mention the big differences in the tasks included in the two tests. These can definitely make a difference in performance. Since the WRMT is given so often to lower performing students, that test wouldn’t require especially demanding tasks to spread out performance, while the Terra Nova, given to a broader audience, would need a mix of easier and harder tasks (such as longer and more complex reading passages) to spread out student performance. These harder tasks push your kids lower in the group and may be so hard that it would be difficult to see short-term gains or improvements with such a test. The WRMT is often used to monitor gains, so it tends to be more sensitive to growth.

You didn’t mention which editions of the tests you were administering. But these tests are revised from time to time and the revisions matter. The WRMT has a 2012 edition, but studies of previous versions of the test reveal big differences in performance from one edition to the next (despite the fact that the same test items were being used). The different versions changed their norming samples and that altered performance on the tests quite a bit (5-9 points). I think you would find the Terra Nova to have more stable scores, and yet comparing it across editions might reveal similar score inflation.

My advice is that when you want to show where students stand in the overall norm group, only use the Terra Nova data. Then use the WRMT to show where the students’ relative strengths and weaknesses are and to monitor growth in these skills. That means your message might be something like: “Tommy continues to perform at or near the 15th percentile when he is compared with his age mates across the country. Nevertheless, he has improved during the past three months in vocabulary and comprehension, though not enough to improve his overall position in the distribution.” In other words, his reading is improving and yet he remains behind 85% of his peers in these skills.

Thursday, May 3, 2012

Here We Go Again

For years, I’ve told audiences that one of my biggest fantasies (not involving Heidi Klum) was that we would have a different kind of testing and accountability system. In my make-believe world, teachers and principals would never get to see the tests – under penalty of death.

They wouldn’t be allowed within miles of a school on testing days, and they would only be given general information about the results (e.g., “your class was in the bottom quintile of fourth grades in reading”). Telling a teacher the kinds of test questions or about the formatting would be punished severely, too.

In that fantasy, teachers would be expected to try to improve student reading scores by… well, by teaching kids to read without regard to how it might be measured later. I have even mused that it would be neat if the test format changed annually, to discourage teachers from even thinking about teaching to a test format.

In some ways, because of common core, my fantasy is coming true (maybe Heidi K. isn’t far behind?).

Principals and teachers aren’t sure what these tests look like right now. The whole system has been reset, and the only sensible solution is… teaching.

And, yet, I am seeing states that are holding back on rolling out the common core until they can see the test formats.

Last week, Cyndie (my wife – yes, she knows all about Heidi and me – surprisingly, she doesn’t seem nervous about it) was contacted by a state department of education trying to see if she had any inside dope on the PARCC test.

This is crazy. We finally have a chance to raise achievement and these test-chasing bozos are working hard to put us back in the ditch. There is no reason to believe that you will make appreciable or reliable gains teaching kids to reply to certain kinds of test questions or to particular test formats (you can look it up). The people who push such plans know very little about education (can they show you the studies of their “successful” test-teaching approaches?). I am very pleased with the unsettled situation in which teachers and principals don’t know how the children’s reading is going to be evaluated; it is a great opportunity for teachers and kids to show what they can really do.

Saturday, August 9, 2008

Rubber Rulers and State Accountability Testing in Illinois

Much has been made in recent years of the political class’s embrace of the idea of test-based accountability for the schools. Such schemes are enshrined in state laws and NCLB. On the plus side, such efforts have helped move educators to focus on outcomes more than we traditionally have. No small change, this. Historically, when a student failed to learn it was treated as a personal problem—something beyond the responsibility of teachers or schools. That was fine, I guess, when “Our Miss Brooks” was in the classroom and teachers were paid a pittance. Not much public treasure was at risk, and frankly low achievement wasn’t a real threat to kids’ futures (with so many reasonably-well-paying jobs available at all skill levels). As the importance and value of doing well has changed, so have the demands for accountability.

Sadly, politicos have been badly misled on the accuracy of tests, and the technical side of achievement testing has just gotten really complicated—well beyond the scope of what most legislative education aides can handle.
And so, here in Illinois we have a new test scandal brewing (requiring the rescoring of about 1 million tests).

Two years ago Illinois adopted a new state test. This test would be more colorful and attractive and would have some formatting features that would make it more appealing to the kids who had to take it. What about the connection of the new test with the test it was to replace? Not to worry, the state board of education and Pearson publishing’s testing service were on the case: they were going to equate the new test with the old statistically so the line of growth or decline would be unbroken, and the public would know if schools were improving, languishing, or slipping.

A funny thing happened, however: test scores jumped immediately. Kids in Illinois all of a sudden were doing better than ever before. Was it the new tests? I publicly opined that it likely was; large drops or gains in achievement scores are unlikely, especially without any big changes in public policy or practice. The state board of education, the testing companies, and even the local districts chimed in saying how “unfair” it was that anyone would disparage the success of our school kids. They claimed there was no reason to attribute the scores’ sudden upward trend to the coincidental change in tests, and frankly they were not happy about kill-joys like me who would dare question their new success (it was often pointed out that teachers were working very hard—the Bobby Bonds defense: I couldn’t have done anything wrong since I was working hard).

Now, after two years of that kind of thing, Illinois started using a new form of this test. The new form was statistically equated with the old form, so it could not possibly produce any different results. Except that it did. Apparently, the scores came back this summer much lower than they had been during the past two years. So much lower, in fact, that the educators recognized that it could not possibly be due to a real failure of the schools, but must be a testing problem. Magically, the new equating was found to be screwed up (a wrong formula, apparently). Except, Illinois officials have not yet released any details about how the equating was done. Equating can get messed up by computing the stats incorrectly, but it can also be thrown off by how, when, and from whom the data are collected.

It’s interesting that when scores rise the educational community is adamant that it must be due to their successes, but when they fall—as they apparently did this year in Illinois—it must be a testing problem.
Illinois erred in a number of ways, but so have many states in this regard.

The use of a single form of a single measure administered to large numbers of children in order to make important public policy decisions is foolish. It turns out there are many forms of the test Illinois is using. They should have been given multiple forms simultaneously (as would be done in a research study), since this can help to do away with the “rubber ruler” problem. Sadly, conflicting purposes for testing programs have us locked into a situation where we’re more likely to make mistakes than to get it right.
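For what it’s worth, here is a rough Python sketch of why spiraling several forms through the same testing window helps. Everything in it is invented (three hypothetical forms, a 6-point form effect, a made-up score scale); the point is only that when forms go to randomly equivalent groups, a form effect shows up as a difference between form means and can be adjusted away, instead of masquerading as a real change in achievement.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical illustration: three forms of a test spiraled at random across
# the same student population in the same year. Form C is built 6 points
# "easier" (the kind of form effect that spoils year-to-year comparisons
# when only one form is given at a time).
random.seed(2)
form_effect = {"A": 0.0, "B": 0.0, "C": 6.0}

scores_by_form = defaultdict(list)
for _ in range(9000):
    form = random.choice("ABC")        # random assignment of forms
    ability = random.gauss(200, 25)    # the student's underlying reading ability
    scores_by_form[form].append(ability + form_effect[form])

# Because the forms went to randomly equivalent groups, the differences in
# form means estimate the form effects directly, so the forms can be put on
# a common scale instead of producing a phantom gain or loss.
for form in "ABC":
    print(form, round(statistics.mean(scores_by_form[form]), 1))
```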

I’m a fan of testing (yes, I’ve worked on NAEP, ACT, and a number of commercial tests), and am a strong proponent of educational accountability. It makes no sense, however, to try to do this kind of thing with single tests. It isn’t even wise to test every child. Public accountability efforts need to focus their attention on taking a solid overall look at performance on multiple measures without trying to get too detailed about the information on individual kids. Illinois got tripped up when it changed from testing schools to testing kids (teachers didn’t think kids would try hard enough if they weren’t at risk themselves, so our legislature went from sampling the state to testing every kid—of course, if you want individually comparable data it only makes sense to test kids on the same measure).

Barack Obama has called for a new federal accountability plan that will make testing worthwhile to teachers by providing individual diagnostic information. That kind of plan sounds good, but ultimately it will require a lot more individual testing with single measures (as opposed to multiple alternative measures). Instead of giving a clearer or more efficient picture for accountability purposes—and one less likely to be flawed by the rubber ruler problem—it can’t help but be muddled, as in Illinois. This positive-sounding effort will be more expensive and will result in a less clear picture in the long run.

Accountability testing aimed at determining how well public institutions are performing would be better constructed along the lines of the National Assessment (which uses several forms of a test simultaneously with samples of students representing the states and the nation). NAEP has to do some fancy statistical equating, too, but this is more likely to be correct when several overlapping forms of the test are used each year. By not trying to be all things to all people, they manage to do a good job of letting the public and policymakers know how our kids are performing.