My friend Greg Toppo and the rest of the education reporting team at USA Today have produced a must-read series on standardized testing irregularities around the country. Combing through student achievement data from 24,000 public schools in six states and Washington, D.C., the team identified 1,610 instances in which test score gains from year to year exceeded three standard deviations–a jump larger than those posted by 99.7 percent of test-takers in any given state in a given year, and the threshold at which statisticians agree that test results may be suspect.
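To see why three standard deviations is such a demanding bar, here is a minimal sketch of the arithmetic, assuming (as a simplification) that year-to-year score gains are roughly normally distributed; the function and variable names are mine, not USA Today's:

```python
# Rough sketch of the 3-standard-deviation threshold, assuming normally
# distributed year-to-year gains (real gain distributions may have fatter tails).
from math import erf, sqrt

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Share of gains falling within +/- 3 standard deviations of the mean
# (the familiar "99.7 percent" rule):
within_3sd = normal_cdf(3) - normal_cdf(-3)

# Share of gains exceeding +3 standard deviations:
above_3sd = 1 - normal_cdf(3)

print(f"within 3 SD: {within_3sd:.4f}")  # ~0.9973
print(f"above 3 SD:  {above_3sd:.5f}")   # ~0.00135, roughly 1 in 740
```

Under that assumption, a school class beating the three-sigma bar is roughly a 1-in-740 event in any single year, which is why statisticians treat repeated jumps of that size as a red flag rather than ordinary variation.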
In some cases, these jumps were due to outright cheating, such as a teacher using an eraser to correct students' answers on a multiple-choice exam. In others, the causes remain mysterious because school districts–especially those under pressure to meet the requirements of No Child Left Behind–are reluctant to launch investigations into "good news" test scores:
That was the case at Charles Duval Elementary School in Gainesville, Fla., where math scores in the fifth grade rose sharply year after year. In 2005, the school's fourth-graders finished near the bottom of the state, in the 5th percentile. The next year, 2006, as fifth-graders, they scored in the 79th percentile. Duval fifth-grade classes repeated the feat the next two years, and, by 2008, they were testing in the 91st percentile.
Then, in 2009, Duval's test scores took a nose dive. Math scores as well as reading scores crashed, and the fifth-graders finished in the 1st percentile, at the very bottom, on both tests. The school overall dropped in state rankings from an "A" school — among the best in the state — to an "F" school. Only then did the state step in, but not to investigate how the high scores were achieved. Instead, the state sent education specialists to help the school get back on track.
Linda Perlstein, author of the wonderful book Tested, points out that aggressive teaching to the test could account for these fantastic score gains, and that teaching to the test–even in a highly scripted, curriculum-narrowing way–is not, alas, against the rules. In fact, it's often encouraged by administrators, as it was in this case:
The third-grade teacher I followed for my book Tested had a good sense of what was going to be on the Maryland School Assessment. The exam, and the benchmark tests designed in its image, didn’t change a whole lot from year to year—there were certain constructs that showed up again and again, and certain questions too. One question she’d come to expect was, “How do you know such-and-such is a poem?” The standard tested was identifying the elements of a poem. We all know that the best way to ingrain an enduring understanding of poetry is to have students not just read poems but to engage with them—especially, to write them. These kids didn’t do that. More than 30 times the teacher had the kids copy some form of this paragraph from the overhead projector: I know this is a poem because it has rhyme, stanzas and rhythm. It has rhyme because sea and free rhyme. It has stanzas because the paragraphs don’t indent. It has rhythm because…
These are the unintended consequences of well-intentioned standards-and-accountability education reforms. If you're looking for a clear-headed, easy-to-understand discussion of why high-stakes tests almost inevitably lead to narrowed curricula and widespread score inflation, pick up Harvard psychometrician Daniel Koretz's Measuring Up: What Educational Testing Really Tells Us. Here are some choice nuggets from chapter 10, "Inflated Test Scores":
People will argue, "There is nothing wrong with the items on the test, so what is wrong with focusing our teaching on them?" … The problem is not bad material on the test; it is that the material on the test is only a small sample of what matters. … The acid test is whether the gains in scores produced by test preparation truly represent meaningful gains in student achievement. We should not care very much about a score on a particular test…What we should be concerned about is the proficiency, the knowledge and skills, that the test score is intended to represent. Gains that are specific to a particular test and that do not generalize to other measures and to performance in the real world are worthless.