If your evaluation system has one number, your team will optimize for one number. That's not a moral failing — that's how every measurable system works. The question is whether you've designed the measurable thing well enough to be the goal.
Most companies haven't.
Here's what we mean: when "lines of code shipped per week" is the metric, you get more lines of code. When "tickets closed" is the metric, you get tickets that close faster: by being smaller, by being closed prematurely, or by being created mainly so they can be closed. None of these outcomes reliably tracks what you actually wanted, which was useful work.
So we did something different. RUQA doesn't have one score. It has five.
T1: Self-report. What the engineer says they did. (Most gameable. We collect it anyway because the delta between self-report and reality is itself a signal.)
T2: AI baseline. Given the actual output (the PR diff, the spec, the design file), what would Claude estimate this took? This is a model-based prior, not a ground truth.
T3: Mechanical signals. Git commit timestamps. AI session durations. File save times. None of these prove how long something took, but together they bound it.
T4: Volume regression. Across the whole team's history, what does volume X (LOC, words, components) typically take? An outcome 3× faster than the regression predicts is a flag.
T5: Peer median. Recently shipped similar work by teammates — what was the median time? Not "compare to the best person." Compare to the typical case.
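To make T4 and T5 concrete, here is a minimal sketch in Python. It assumes volume is a single number (say, LOC), that history is a list of (volume, hours) pairs, and that "3× faster" means the work took at most a third of the predicted time. The function names, the plain least-squares line, and the sample data are all illustrative assumptions, not RUQA's actual implementation.

```python
from statistics import mean, median

def fit_line(volumes, hours):
    """Ordinary least-squares fit of hours = slope * volume + intercept."""
    mv, mh = mean(volumes), mean(hours)
    sxx = sum((v - mv) ** 2 for v in volumes)
    sxy = sum((v - mv) * (h - mh) for v, h in zip(volumes, hours))
    slope = sxy / sxx
    return slope, mh - slope * mv

def volume_flag(history, volume, actual_hours, factor=3.0):
    """T4 sketch: flag work done at least `factor` times faster than predicted."""
    slope, intercept = fit_line([v for v, _ in history], [h for _, h in history])
    predicted = slope * volume + intercept
    return actual_hours * factor <= predicted

def peer_median(peer_hours):
    """T5 sketch: the typical case, not the best case."""
    return median(peer_hours)

# Hypothetical team history: (lines of code, hours taken).
history = [(100, 2.0), (200, 4.0), (300, 6.0), (400, 8.0)]
print(volume_flag(history, 300, 1.5))  # far below the ~6 hours predicted
print(volume_flag(history, 300, 5.0))  # close to the prediction
print(peer_median([3.0, 4.0, 10.0]))   # the outlier at 10.0 barely matters
```

Using the median rather than the mean for T5 keeps one unusually fast or slow teammate from shifting the comparison point.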
Now the trick. When all five agree, confidence is HIGH and the work passes through. When their estimates disagree by less than 15%, it's noise. When they disagree by 15-30%, it's a SOFT flag: the manager sees it and a conversation follows. When they disagree by more than 30%, it's a HARD flag and the work goes to council review.
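The thresholds above can be sketched as a small classifier. The 15% and 30% cutoffs come from the text; how "disagreement" is measured does not, so this sketch assumes every signal is expressed as an estimate in hours and measures disagreement as (max - min) / median. The names `disagreement` and `classify` and the sample numbers are hypothetical.

```python
from statistics import median

# Cutoffs from the text; the disagreement metric below is an assumption.
NOISE_LIMIT = 0.15
SOFT_LIMIT = 0.30

def disagreement(estimates):
    """Relative spread of the five estimates: (max - min) / median."""
    values = list(estimates.values())
    mid = median(values)
    if mid <= 0:
        raise ValueError("estimates must be positive durations")
    return (max(values) - min(values)) / mid

def classify(estimates):
    """Map the spread onto the three outcomes described in the text."""
    spread = disagreement(estimates)
    if spread < NOISE_LIMIT:
        return "PASS"  # agreement or noise: work passes through
    if spread <= SOFT_LIMIT:
        return "SOFT"  # manager sees it, conversation follows
    return "HARD"      # goes to council review

# Hypothetical estimates in hours, one per signal.
example = {"T1": 10.0, "T2": 10.0, "T3": 10.0, "T4": 10.0, "T5": 12.0}
print(classify(example))  # spread is 2 / 10 = 0.2, so this prints SOFT
```

Dividing by the median rather than the mean keeps a single gamed signal from shrinking its own apparent deviation, though other spread measures would be equally defensible.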
The point isn't catching liars. It's that each signal can be gamed on its own, but gaming all five at once is harder than doing the actual work. And critically: the engineer can see all five at the same time you can. There's no hidden score.
Performance systems with one number reward people who learn the number. Performance systems with five independent signals reward people who do the work. Those are different optimization targets.