Teacher’s Question:
Dear Dr. Shanahan,
I have been searching for research around student testing. Could you point me in the direction of relevant research:
- Does testing time of the day make a difference?
- Does it matter if the testing proctor is that student's teacher for that subject?
- What effect does testing over several days in short bursts have?
With the state test, all grades are tested on the same day at the same time. But our benchmark tests are different. We give those three times a year (a test with four passages and 35 questions, taking about two hours to complete).
Students are tested within their ELA or Math class for several days over a week for the benchmark. My students get into the testing groove, and then they must stop. They have about 30 minutes of testing time each day. Every day, it is harder for them to get into the groove. They stop reading and start just clicking away. Our morning classes do better than those tested after lunch.
We lose 13-15 days each year of instruction within ELA for the district benchmarks. I think the tests should be completed like the state tests. My colleagues argue that 1) teachers would feel best administering tests to their students, 2) the teachers who do not teach ELA or Math would lose instructional time, and 3) non-testing teachers don't want to be responsible for administering the benchmark.
Currently only 30% of our students meet state standards on the state test. The benchmark is higher than that.
Shanahan’s response
As my three-year-old granddaughter, Cassidy, likes to tell me, “Grandpa, you’re old!”
I’m so old that I can remember a time when students took few standardized tests in school. Teachers might improvise an arithmetic test or perhaps use the exercises in a review unit for that purpose. But reading tests were unusual in most regular classrooms, and instruction outside the classroom wasn’t common either – this was before Title I and IDEA funding.
Teachers determined kids’ reading ability mainly by listening to them read during lessons and their judgments were subjective. That’s how they determined reading group assignments and what to tell parents about little Johnny or Janie’s reading. Kids might not even know they were being evaluated.
That may sound idyllic to some – even to me at times – but that system allowed lots of kids to fall through the cracks. If Mrs. Smith didn’t know what to do with one of her struggling charges, there was a good chance she could just ignore the problem with no one the wiser. Likewise, the incompetent reading teacher could soldier on without any need for improvement – as long as her classroom was orderly, that is.
These days, everyone – school administration, the state, newspapers – seems to be peeking over a teacher’s shoulder. Unfortunate test scores may end a superintendent’s term. Perhaps there’ll be a new curriculum, or a new regime of professional development. Kids get pulled out or pushed back in based on those scores.
The idea of benchmark testing is to identify problems before the official evaluation (the state tests) reveals them publicly. It gives districts, schools, teachers, and students a chance to address a looming problem before it becomes a public one.
That, however, only works if the benchmark tests do a good job of predicting. The tests must be reliable (meaning student performance is consistent), valid (meaning the tests manage to identify the kids – and only those kids – who will eventually fall short on the state test), and efficient (they shouldn’t be costly financially or in terms of instructional time).
Given the length of the tests that you describe, my guess is that they’re reliable. Tests that long usually are.
Given that the benchmark test is somewhat harder than the state test, I presume adequate validity. The benchmark should be harder since the cost of failing to address the needs of a struggler is higher than that of giving extra tuition to a youngster who would succeed anyway. A few more kids may get more reading help than is necessary, but this isn’t a real loss since that additional teaching still may boost these kids’ performances.
To be certain of the validity, you would need to compare the list of kids who received extra support because of the benchmark test with the students who fall short on the state test. Certainly, all or most of the kids who fall below standard on the end of year test should be flagged by the benchmark test. How many were missed? The larger the number, the less valid your test.
I’d also be curious about how many kids benchmarked low and state-tested high. Such a difference may suggest that both the predictor and the remedial teaching are sound, since the regime would appear to be protecting those kids from failure. However, given such a high failure rate, my guess is that the two numbers don’t differ that much.
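If your district keeps student-level results in a spreadsheet, this check is only a few lines of analysis. Below is a minimal sketch of the idea, assuming a hypothetical file and made-up column names (flagged_by_benchmark, met_state_standard) rather than anything your district actually uses – it simply counts the kids the benchmark missed and the kids who benchmarked low but passed anyway.

```python
import pandas as pd

# Hypothetical student-level file; all column names here are assumptions, not your district's.
df = pd.read_csv("students.csv")

flagged = df["flagged_by_benchmark"].astype(bool)    # benchmark triggered extra support
fell_short = ~df["met_state_standard"].astype(bool)  # below standard on the state test

missed = (fell_short & ~flagged).sum()       # fell short, but the benchmark never caught them
protected = (flagged & ~fell_short).sum()    # benchmarked low, yet met the standard

print(f"Fell short on the state test but never flagged: {missed}")
print(f"Flagged by the benchmark but met the standard: {protected}")
print(f"Share of eventual failures the benchmark caught: {1 - missed / fell_short.sum():.0%}")
```

The larger that “missed” count, the weaker the case for the benchmark as a predictor; the “protected” count is the group that may (or may not) reflect successful early intervention.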
Efficient? Here, I have serious doubts. This testing protocol is depriving kids of lots of instruction (13-15 days, whew!).
That’s more problematic in a school with a high percentage of failing kids. If 70% of the kids are not meeting the standard, then it may be meaningless to try to shift resources to them – it would probably be better to focus on improving the overall reading instruction. Targeting the laggards makes greater sense when there are fewer laggards!
That’s also problematic because kids in middle school typically don’t change that much in reading over a three-month period. It appears that you’ve adopted a practice that makes lots of sense with foundational skills in the primary grades and are applying it to seventh-grade reading comprehension – two very different animals. In the primary grades, kids’ decoding and fluency skills change rapidly. Close monitoring can keep kids from slipping through the cracks.
But, when you’re 12, reading comprehension doesn’t change that much over short periods of time. I suspect that most of those kids repeatedly ace the benchmarks or fail them. I bet the numbers of kids who swing between categories are small and that most of that shifting is due mainly to the unreliability in the tests rather than to changes in what kids know.
This practice – monitoring kids’ reading status – makes so much sense that people aren’t thinking about it clearly. Your district’s approach is akin to the “bubble kids strategy.”
The state determines your success by the number of students who score better than a certain cutoff. We could try to improve reading instruction for everybody, or we could focus our resources mainly on the kids on the bubble – that is, the ones who come close to the magic score.
In some schools, principals aim to make their school look better by dragging those kids across the line.
Let’s say 325 is the target score. Those principals will devote all remedial resources to kids with scores in the range of 310-324. They are neither trying to teach everyone nor trying to teach anyone very much. They just want a few more kids across that line.
Thank goodness, your district’s approach doesn’t sound like it poses the same ethical problems as the bubble kid strategy, since it should be ensuring that even those far below the target score get help. Nevertheless, it seems likely that your successes – kids predicted to fail who eventually succeed – are mostly bubble kids.
When states choose such a single narrow assessment goal, school districts will respond. If a state wants more kids across that line, then resources swing towards kids near the line.
What if a state instead emphasized reducing the number of kids scoring below the 30th percentile? I’m not touting that as a superior goal but am only pointing out that schools like yours would likely shift their targeting scheme towards addressing the needs of a different group of strugglers.
What if a state emphasized average reading scores? I’m not touting that as a superior goal either. That, too, would encourage a different pedagogical response. Perhaps there would be a greater emphasis on the quality and quantity of classroom instruction instead of targeted remedial assistance. It would be the rare school willing to sacrifice 13-15 days of valuable instruction for testing that mainly provides information they already know. In a school like yours, with so many kids behind, I think trying to raise the average would be a really good idea. It doesn’t sound like that is a goal. It should be.
Towards that end, it would be smart to reduce the amount of unnecessary testing. I feel sorry for the kids who never pass the state test but must endure 13-15 days of testing each year to qualify for extra help. At some point you’d think adults would show some compassion for them. Why not concede the point, skip the testing, and help them anyway? Kids who test far above the cutpoint may not be as frustrated, but there is no benefit in this for them either.
Yep, I’m for something really crazy: more teaching, less testing!
As to your specific questions, from the research, I don’t think those variations in testing practices matter that much.
Time of Day. Does time of day matter in kids’ test performance? At NWEA – the MAP testing people – the data show that kids do best early in the day on their reading and math tests and fade as the day goes on (Wise, 2023). Their solution? To get the best scores, test in the morning, not after lunch. This accords with an earlier study from Denmark that reported lower test scores as the day went on: 8:00 a.m. scores were better than 9:00 a.m., 9 better than 10, and so on throughout the day. The researchers also found that providing break time prior to testing minimized this fall-off (Sievertsen, Gino, & Piovesan, 2016). Students showed greater cognitive persistence early in the day, and breaks allowed them to recharge or regain their freshness.
That sounds reasonable – until you read some other studies. For instance, Cusick and company (2018) found that time of day made no difference in test performance. According to that study, test time had no impact on how well kids did on the tests. Still another investigation contradicted both findings, reporting that the best performances were accomplished… wait for it… after lunch (Gaggero & Tommasi, 2023).
You heard it. Kids either do better in the morning or the afternoon, except when it doesn’t matter. That sounds like a noise factor to me.
Your observation that morning kids outperform the afternoon crew is easy enough to check with your own data. First, make sure that those groups are equivalent. Compare the class averages using their previous year’s state scores. If they’re very close, then compare the morning and afternoon benchmarks. Is one set markedly higher than the other? Such a comparison should reveal whether this is a problem or not.
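A minimal sketch of that comparison, assuming hypothetical column names (period coded as morning/afternoon, prior_state_score, benchmark_score) rather than anything your district actually uses:

```python
import pandas as pd

# Hypothetical file and column names; 'period' marks morning vs. afternoon testing.
df = pd.read_csv("students.csv")

# Step 1: are the morning and afternoon groups equivalent on last year's state test?
print(df.groupby("period")["prior_state_score"].mean())

# Step 2: if those averages are close, compare the groups' benchmark results.
print(df.groupby("period")["benchmark_score"].mean())
```

If the prior-year averages differ noticeably, the groups weren’t equivalent to begin with, and any morning/afternoon gap in the benchmark scores means much less.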
While you’re at it, I’d get the district’s testing people to correlate these groups’ benchmark scores separately with their end-of-year state test scores. Benchmarking is used to predict success on the state test. If the correlations differ much, then you’re likely doing a better predicting job at one time of day than the other. Of course, if the correlations aren’t that different, then this set of research studies is probably correct – time of day is not meaningfully or consistently affecting your kids’ performance.
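That correlation check is just as short – again, a sketch under the same assumed file and column names, adding a hypothetical state_score column for the end-of-year result:

```python
import pandas as pd

df = pd.read_csv("students.csv")  # same hypothetical file and columns as above

# Correlate benchmark scores with end-of-year state scores, separately for each testing period.
for period, group in df.groupby("period"):
    r = group["benchmark_score"].corr(group["state_score"])
    print(f"{period} classes: r = {r:.2f} (n = {len(group)})")
```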
Proctoring. Research shows that students perform better when their classroom teacher is the test proctor (DeRosa & Patalano, 1991). That makes sense, but why the score boost from a familiar teacher? These researchers assumed that familiarity reduced anxiety: the students could relax and do their best in their customary situation. That may be, but an interesting twist emerges when you listen to teachers’ explanations of how they handle test administration (Childs & Umezawa, 2009). In that study, many third-grade teachers admitted they would not be scrupulous in following test administration procedures because they wanted to support “students to do their best work,” to provide “a positive testing experience for the students,” and to maintain their usual “pedagogical routines and their relationships with the students.”
These benchmark tests are not high-stakes exams. Their only purpose is to identify who should get additional help. There would be no logical reason for teachers to artificially inflate these scores. Nevertheless, experience tells me that many teachers do exactly that anyway. Instead of trying to make sure that everyone who needs it gets help, these teachers are trying to make their scores look better. (I don’t like putting teachers’ interests into conflict with students’ interests.)
Some things teachers may do – like offering encouragement during testing – have not been found to be particularly effective (Barona & Pfeiffer, 1992; Johnson & Hummel, 1971). However, that doesn’t mean they can’t goose the scores. For instance, one study reported that extending testing time a bit could make scores look better, even for high achievers (Baldwin et al., 1991). That would be an example of an inappropriate way that a familiar proctor might help kids do their “best work.”
I doubt that proctoring differences matter much in terms of overall scores, but I’d remind teachers that if they manage to boost kids’ scores inappropriately, it will deprive them of needed help. Teachers care about their students – so make sure they really know what is best for the kids!
Short Burst Testing. I know of no research on this practice but four days of testing three times a year seems like overkill to me. As noted earlier, I’d strive for a program that did less testing and more teaching.
To sum up, if you want your benchmark to be highly predictive of state test results, one way to increase that predictive power is to make the predictor as much like the criterion as possible – highly similar in design and in how it’s administered. It’s possible that realigning the benchmarking to be more like the end-of-year tests would improve your predictions. The question is by how much. I think you’d do better concentrating your deep thinking on figuring out how to sharply reduce the amount of unnecessary and uninformative testing that this regime now requires.
References
Cusick, C. N., Isaacson, P. A., Langberg, J. M., & Becker, S. P. (2018). Last night’s sleep in relation to academic achievement and neurocognitive testing performance in adolescents with and without ADHD. Sleep Medicine, 52, 75-79.
Gaggero, A., & Tommasi, D. (2023). Time of day and high-stakes cognitive assessments. The Economic Journal, 133(652), 1407–1429. https://doi.org/10.1093/ej/ueac090
Sievertsen, H. H., Gino, F., & Piovesan, M. (2016). Cognitive fatigue influences students' performance on standardized tests. Proceedings of the National Academy of Sciences of the United States of America, 113(10), 2621–2624. https://doi.org/10.1073/pnas.1516947113
Wise, S. (2023, May 20). Don’t test after lunch: Time of day affects test-taking engagement. NWEA. https://www.nwea.org/blog/2023/dont-test-after-lunch-time-of-day-affects-test-taking-engagement/