What’s Wrong with Pay for Performance?
The quality of teaching has a big impact on student performance. But a lot of other factors are also important. Over the course of a semester, student test scores in a good teacher’s class might go down or remain unchanged because of all those other factors. The scores in a bad teacher’s class might go up for the same reasons. If we reward and punish teachers based on test scores, therefore, much of the time we will be doing the wrong thing. That is, the reward system will often reward bad teachers and punish good ones.
This same principle also applies to the practice of medicine. Paying doctors or hospitals based on outcomes would be fine as long as the outcomes can be reliably measured and we know how much each entity contributes to them. Until that is possible, we run the risk that we will inadvertently punish the good practitioners and reward the bad ones.
Understanding the problems with pay-for-performance is important because Medicare will begin adjusting payments to physicians based on “the value of the care provided” in 2015. Education is full of examples in which lousy, inaccurate measures have unintended consequences. Unfortunately, lousy health care measures are pretty much all we have, or are likely to have, by 2015.
Before going further, let’s make a distinction between inputs and outputs. Inputs are often easier to measure, and many pay-for-performance schemes are actually paying for inputs. Yet it is the outputs that we really care about.
In education, inputs are things like the time teachers spend in the classroom, how many minutes are devoted to math, how many minutes are devoted to vocabulary, and how much a school district spends on books. These inputs may or may not be related to how much children learn. In health care, inputs are things like whether a medical history was taken, whether the results of an examination are recorded electronically, and the number of nurses per patient. As in education, these inputs may or may not be related to whether patients actually get well.
What happens when we try to pay based on outputs?
In a study for Mathematica Policy Research, Greg Peterson and Eric Schone suggest that the value-added models developed to determine teacher pay might also prove useful in health care. They provide a useful, non-technical rundown of the problems that the Centers for Medicare & Medicaid Services (CMS) will face once it moves beyond measuring inputs and begins searching for actual performance measures.
One problem is figuring out how to apportion measured improvement among the many physicians who may see a patient during an episode of care. Another is deciding how to apportion credit over time. A patient with a condition that is difficult to diagnose may see several specialists over several years. Once he is diagnosed, he may improve after surgery, other treatments, and continuing medications. How, exactly, is credit for his improvement to be apportioned?
There are also significant data problems. Measurements to describe many outcomes are simply not available, and if they are, they may not be comparable from patient to patient. While one patient may describe a cut as a five on a 1 to 10 pain scale, another patient may describe the same cut as a two because he has a higher pain tolerance or more experience with pain. Paying a physician more because the second patient reports less pain is paying for differences in patient perceptions, not physician skill.
Even seemingly objective measures, such as rating physicians’ ability to treat diabetes using their patients’ HbA1c levels, have problems. One study using identical twins concluded that 62 percent of HbA1c variability is genetic. The variation introduced by factors that are beyond a physician’s control generates noisy data, making it difficult to separate a physician’s influence from that of genetics, environment, patient willingness to comply with medical recommendations, and the capital and staff a physician has to work with.
As is well known, value-added models in education have similar problems. Educational researchers have produced a substantial literature based on enormous, highly detailed datasets on teachers, schools and student achievement. They have shown that value-added models in education have problems so severe that Jesse Rothstein concluded that policies based on them will “reward or punish teachers who do not deserve it and fail to reward or punish teachers who do.” For three common value-added specifications, “accountability policies that rely on measures of short-term value added would do an extremely poor job of rewarding the teachers who are best for students’ longer-run outcomes.”
Although everyone agrees that good teachers can make a big difference, existing estimates suggest that 80 percent or more of the variation in student achievement is explained by something other than existing measures of teaching quality.
Eric Hanushek and Steven Rivkin conclude that representative estimates of the standard deviation of teacher value-added range from 0.1 to 0.2 student achievement standard deviations. This implies that moving a student from a teacher in the 25th percentile of measured effectiveness to one in the 75th percentile would move a median student only from the 50th to about the 58th percentile in the achievement distribution.
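The percentile arithmetic can be checked with a short calculation. The following is a minimal Python sketch, not taken from Hanushek and Rivkin. It assumes that teacher effects and student achievement are both normally distributed and that the student starts at the median. The function name and the midpoint value of 0.15 are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def student_percentile_after_move(teacher_sd, low_pct=0.25, high_pct=0.75):
    """Percentile reached by a median student who moves from a teacher at
    low_pct to a teacher at high_pct of measured effectiveness.

    teacher_sd is the standard deviation of teacher effects, expressed in
    student achievement standard deviations (assumed normal).
    """
    # Gap between the two teachers, in teacher-effect standard deviations.
    gap_in_teacher_sds = STD_NORMAL.inv_cdf(high_pct) - STD_NORMAL.inv_cdf(low_pct)
    # Convert that gap into student achievement standard deviations.
    gain_in_student_sds = gap_in_teacher_sds * teacher_sd
    # A median student starts at z = 0, so the new percentile is the
    # normal CDF evaluated at the gain.
    return 100 * STD_NORMAL.cdf(gain_in_student_sds)


# Low end, illustrative midpoint, and high end of the 0.1 to 0.2 range.
for sd in (0.10, 0.15, 0.20):
    pct = student_percentile_after_move(sd)
    print(f"teacher SD {sd:.2f}: about percentile {pct:.1f}")
```

Under these assumptions, the midpoint of the range puts the student at roughly the 58th percentile, and the two ends of the range give roughly the 55th and the 61st.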
Furthermore, the measured performance of a particular teacher does not appear to be especially persistent. In another study, Daniel McCaffrey and his colleagues estimate that 30 to 60 percent of the variation in measured teacher effects is due to transitory noise and that less than half of a measured effect persists. Goldhaber (gated, with abstract) points out that recent evidence suggests that teacher value-added also depends upon peer effectiveness, the quality of the match between teachers and schools, changes in school demographics, experience, and absences of both teachers and their peers. He also notes that “incorporating too much prior information [into value-added models] increases the risk of bias from performance that does not persist over time.”
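To see what that much noise could mean for a reward system, consider a rough sketch in Python. It is an illustration under stated assumptions, not a calculation from McCaffrey’s study. It treats everything that is not transitory noise as the teacher’s true effect (a generous assumption), treats true effects and noise as independent and normally distributed, and asks how often a teacher’s measured effect falls on the wrong side of average. The function name is hypothetical.

```python
import math


def share_on_wrong_side_of_average(noise_share):
    """Fraction of teachers whose measured effect has the opposite sign
    (above versus below average) from their true effect.

    noise_share is the fraction of the variance in measured effects that is
    transitory noise. Assumes measured = true + independent normal noise.
    """
    # Correlation between the true effect and the measured effect.
    rho = math.sqrt(1.0 - noise_share)
    # For two jointly normal, mean-zero variables with correlation rho,
    # P(same sign) = 1/2 + arcsin(rho) / pi.
    same_sign = 0.5 + math.asin(rho) / math.pi
    return 1.0 - same_sign


# The 30 and 60 percent noise shares quoted in the text.
for noise in (0.30, 0.60):
    wrong = share_on_wrong_side_of_average(noise)
    print(f"noise share {noise:.0%}: about {wrong:.0%} of teachers misclassified")
```

Under these assumptions, roughly 18 to 28 percent of teachers would be measured as above average when they are below it, or the reverse, in any single year.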
Finally, academics have been unable to show that many of the observable measures thought to be significant contributors to teacher value-added have much effect on student achievement. In their summary of the relationship between the observable characteristics of teachers and student performance, Douglas Staiger and Jonah Rockoff conclude that although teachers do improve after several years of experience, there is little reason to believe that teacher academic background does much to affect student performance. Teach for America, a highly selective program that draws applicants from top universities, fields teachers whose students score slightly better in math but no better in reading.
Rivkin, Hanushek and Kain find that while achievement gains are systematically related to observable teacher and school characteristics, the effects are small. There is no evidence that master’s degrees improve teacher skills, and there is little evidence that teacher skills improve after the first three years of experience. Class size has modest effects on mathematics and reading growth, but the effect is limited to the younger grades and is so small that the benefits from reducing class size are likely to be outweighed by the costs. There is no evidence that more restrictive certification standards or teacher education requirements will raise the quality of instruction.
The good news is that work from the 1970s suggests that principals’ subjective ratings do a fairly good job of identifying good teachers. That may explain why the private and charter schools in which principals have the power to hire and fire are more likely to improve achievement by disadvantaged students than their relatively powerless public counterparts.
The superiority of subjective measures may also explain why private medicine, where peers, patients and professional associations subjectively evaluate a physician’s value-added, does a better job of providing quality care than the quality measures adopted in national systems run by governments.
Judging from the progress on value-added models in education, CMS might do more good by freeing doctors and patients to reach their own conclusions and by redirecting its resources toward reducing the national debt.