this post was submitted on 10 Oct 2023

207 points (96.4% liked)

TIL Goodhart's Law: Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. (lemmy.world)

submitted 1 year ago by [email protected] to c/[email protected]

Here is the Wikipedia link.

[–] [email protected] 139 points 1 year ago (4 children)

Alternative (and generally easier to understand) formulation: Once a measure becomes a target, it ceases to be a good measure.

See: grades, GDP, workplace metrics...

[–] [email protected] 66 points 1 year ago (5 children)

Yeah. OP's overly complicated explanation doesn't convey the reason why this happens, which is really kinda just human psychology. People are probably thinking it's like observability in quantum mechanics or some shit.

Goodhart's Law Example:

1. App has poor testing and low quality. New bugs are introduced weekly. Customers complain.
2. Management sets a test coverage KPI of 90% for the app's codebase.
3. Dev team focuses on tests that hit as many lines of code as possible, NOT on requirements, business logic, or anything that would improve quality or prevent bugs.
4. Quality does not improve because the KPI was an arbitrary statistic that holds no value in isolation. Productivity drops because the dev team wasted hundreds of hours writing useless tests to achieve the KPI. Codebase is worse and less maintainable. Company wasted millions of dollars. Customers still not happy. Devs hate their lives.
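
The coverage-chasing in the example above can be sketched in a few lines. This is a minimal illustration, not anyone's real codebase: `apply_discount` and both test functions are hypothetical names made up here, and the bug is planted on purpose. The first test executes every line of the function (so it counts fully toward a line-coverage KPI) but asserts nothing, while the second checks the actual requirement.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical business function with a deliberately planted bug."""
    return price + price * percent / 100  # bug: a discount should subtract


def test_for_coverage_only() -> bool:
    # Runs every line of apply_discount, so line coverage is 100%,
    # but nothing about the result is checked.
    apply_discount(100.0, 10.0)
    return True


def test_for_requirement() -> bool:
    # Checks the requirement itself: 10% off 100 should be 90.
    return apply_discount(100.0, 10.0) == 90.0


coverage_test_passed = test_for_coverage_only()
requirement_test_passed = test_for_requirement()
print(coverage_test_passed, requirement_test_passed)
```

Both tests produce the same coverage number, but only the second one notices the bug.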

[–] [email protected] 15 points 1 year ago (2 children)

It sounds like this is an example you chose from first hand experience. If so, I'm very sorry. That sounds incredibly frustrating.

[–] [email protected] 14 points 1 year ago

Thanks, but I only joined when the company's several dozen codebases were already at step 4, and that example wasn't even in the top 5 of their worst problems. I completed a small greenfield project to high praise. They wanted me to fix other codebases. I handed them a laundry list of problems (with suggested solutions), told them their problems were due to incompetent management, and left.

Not my monkeys. Not my circus.

[–] [email protected] 8 points 1 year ago

Actually this is a pretty common thing in software development. It's become a bit of a trope. So much so that when management proposes things like KPIs tied to code (which they often do!) the devs are like, "can we get bonuses based on these KPIs? 🤑"

"If they're such a good measure, clearly we should be compensated when we do a fantastic job! How about $5k extra for exceeding KPIs!"

"Oh, you don't have enough trust in the system for that? Then why would you trust it for improving quality? I mean, you're the one that made them..."

[–] [email protected] 6 points 1 year ago (1 children)

I never heard of this before, but I now know the word to describe exercise problems!

Body builders who are judged by how bulging their muscles are feel like garbage despite supposedly being "peak".

People with high muscle mass, or who are tall, getting screwed over by BMI targets.

People who are told weight indicates health ignore everything else in exchange for lowering calories.

Even in high school I remember how they would judge you based on like how many push ups you could do... no one who did a ton did proper push ups. Which led to them not helping at all as actual exercise, and even possibly leading to injury.

Heck, we can even use this for stupid Dog Shows, where because they measure specific things for the "goodness" of the dog, they screw over the dog in every way imaginable that isn't being judged.

This is a good law to know. I like knowing this law. It's sad how often it's used, but it's good to know.

[–] [email protected] 3 points 1 year ago (1 children)

Body builders who are judged by how bulging their muscles are feel like garbage despite supposedly being "peak".

Plenty of good examples listed in your post, but I disagree with the bodybuilding one. The point isn't to make you feel good. It's to play this game where you compete against others to best accomplish a specific task. Just like any other sport, when you compete at the elite level, it's never going to feel good, and it's never going to be good for your health.

[–] [email protected] 2 points 1 year ago

Hmm okay I just am not a fan of sports that destroy people's bodies.

"Cut" diets that focus on looking great by not eating or drinking water before showing off just sound awful, and I'm not a fan of them. But you're right, it's not much worse than any other sport that damages athletes' bodies.

[–] [email protected] 3 points 1 year ago (1 children)

A big part of it seems to be manipulation of the results? So, like, devs writing tests for more parts of the code base, but ones that are written to always pass.

[–] [email protected] 4 points 1 year ago (1 children)

Yes, of course. Fundamentally the end goal is to improve the app's quality. However "quality" is not a measurable thing. Therefore, someone observed that as test coverage goes up, bugs tend to decrease, and as bugs decrease app quality tends to go up. So they make code coverage a KPI, and start putting pressure on developers to increase it.

The problem is that once people are pressured into optimizing a certain number, they will get very creative at doing so. And this creativity often breaks the measure's relationship with the actual underlying quality we were trying to improve.

[–] [email protected] 1 points 1 year ago (1 children)

Could anyone explain what "test coverage" means?

[–] [email protected] 3 points 1 year ago (1 children)

Test coverage is defined as the percentage of your application's functionality that is being covered by the automated tests.

Usually this is measured in lines of code. You run the automated tests, then for every line of code, you track whether it's executed or not. If 20% of lines were never executed during the test run, your test coverage is 80%.

Software teams will often aspire to reach high coverage, because lines that are never executed during testing are a good place where bugs can hide. However it's generally acknowledged that this isn't a foolproof method to get rid of bugs, and reaching 100% coverage can be more effort than it's worth. Often you have critical code sections that should be covered by multiple tests, and unimportant sections that are unlikely to fail.
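
As a rough sketch of the arithmetic described above: line coverage is the share of executable lines that actually ran during the tests. The function name `line_coverage` and the line numbers are made up for illustration, and treating an empty file as 100% covered is just a convention chosen here. Real tools collect the executed lines by instrumenting the test run rather than taking them as input.

```python
def line_coverage(executable_lines: set[int], executed_lines: set[int]) -> float:
    """Return the percentage of executable lines that ran during the tests."""
    if not executable_lines:
        return 100.0  # assumption: nothing to cover counts as fully covered
    covered = executable_lines & executed_lines
    return 100.0 * len(covered) / len(executable_lines)


executable = set(range(1, 11))        # a file with 10 executable lines
executed = {1, 2, 3, 4, 5, 6, 7, 8}   # lines 9 and 10 never ran
coverage = line_coverage(executable, executed)
print(coverage)  # 80.0
```

Note that the number only says a line ran, not that any test checked what it did.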

[–] [email protected] 1 points 1 year ago

Thanks! TIL =)

[–] [email protected] 3 points 1 year ago (1 children)

People are probably thinking it's like observability in quantum mechanics or some shit.

Lol I really don't think anyone was thinking that

[–] [email protected] 3 points 1 year ago (1 children)

I was, at first

[–] [email protected] 3 points 1 year ago (1 children)

I don't think anyone other than this guy was thinking that

[–] [email protected] 1 points 1 year ago

Still, an interesting take, same terms mean different things to different people

[–] [email protected] 1 points 1 year ago

ya, code testing doesn't actually increase code complexity, nor worsen the code base, and tends to actually reduce and avoid bugs

[–] [email protected] 23 points 1 year ago

This is the first line of the linked article btw.

Goodhart's law is an adage often stated as, "When a measure becomes a target, it ceases to be a good measure".

[–] [email protected] 10 points 1 year ago (2 children)

So how does a company manage anything if they can't use measurement targets?

Like software engineering. How do you improve productivity or code quality if setting a target value for a measurement doesn't work?

[–] [email protected] 34 points 1 year ago (1 children)

You don’t make your measures your targets.

Example:

“Our customers hate us. We will make our employees get a 10 on their surveys for each customer or we’ll punish them” makes the measure a target.

“Our customers hate us, so we’re going to change our shitty policies to be more consumer friendly and see how our customers respond” keeps the measure as a measure.

[–] [email protected] 0 points 1 year ago

So the difference is who decides what changes to make when interacting with the subject of the measure: workers vs management. Making the measure a target is basically a shitty management technique that abdicates responsibility.

[–] [email protected] 12 points 1 year ago (2 children)

Ok I'm going to answer my own question because I'm too curious to wait lol

Goodhart’s Law states that “when a measure becomes a target, it ceases to be a good measure.” In other words, when we use a measure to reward performance, we provide an incentive to manipulate the measure in order to receive the reward. This can sometimes result in actions that actually reduce the effectiveness of the measured system while paradoxically improving the measurement of system performance. ... The manipulation of measures resulting from Goodhart’s Law is pervasive because direct measures of effectiveness (MOEs), which are more difficult to manipulate, are also more difficult to measure, and sometimes simply impossible to define and quantify. As a result, analysts must often settle for measures of performance (MOPs) that correlate to the desired effect of the MOE. ... These negative effects can sometimes be avoided. When they cannot, they can be identified, mitigated, and even reversed.

Use MOEs instead of MOPs whenever practicable and possible
Use the scientific method to generate new measurement data, rather than harvesting existing and possibly compromised data
Help customers establish authoritative and difficult-to-manipulate definitions for measures
Identify and avoid the use of manipulated data and data prone to manipulation
Use measurement data not generated by the organization being measured
Collect data secretly or after a measurable activity has already occurred
Measure all relevant system characteristics rather than just a representative few
Randomize the measures used over time
Wargame or red team potential measures

This report recommends that the organizations that employ analysts should do the following:

Return to the roots of operational research to focus more on direct measurements in the field
Answer the questions that should be answered, rather than the questions that can be answered simply because the required data are already available
Train analysts on MOEs, MOPs, and Goodhart’s Law and how they are interrelated
Make recognition of Goodhart’s Law part of the internal peer review process and part of all delivered analytical products
Identify and share mitigation best practices

[Source]

[–] [email protected] 8 points 1 year ago

It's pretty well established academically that basically the only way KPIs can actually work toward their intended purpose is if they are changed often and determined by the people doing the work that is ultimately measured. Ongoing measurements should only ever be used as indicators - hence the term *key performance indicators* - and should never be used as targets. What that means in practice is that you should generally ignore all the individual metrics, and look across all of them instead to see if you can spot trends and anomalies, then investigate these qualitatively with the workers who ultimately produce those data to figure out what is happening and if any intervention is necessary.

The problem is that the higher up you get in the hierarchy, the less of that kind of work there is to do and you end up chasing the people below you for nice numbers to plot into your presentations to make it look like there's a point to your job's existence.

[–] [email protected] 2 points 1 year ago

Thanks for uncovering this report, very insightful and lots of great examples!

[–] [email protected] 2 points 1 year ago (1 children)

Thanks, yes, I saw that one, too, but I liked the emphasis on relationships. The shorter version is easier to get but it does not explain why this happens. E.g., you can observe some relationship (e.g., test results and a student's intelligence) and then you target grades. But then you have an incentive to teach to the test, which breaks down the relationship between test results and intelligence. Other people here gave great examples of relationships that can fail.
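
The breakdown of the relationship can be shown with a toy simulation. Everything in it is an assumption for illustration: "ability" is a random number, honest test scores are ability plus a little noise, and teaching to the test is modeled as an extra score boost that has nothing to do with ability. The distributions and the seed are arbitrary.

```python
import math
import random


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
ability = [rng.gauss(0, 1) for _ in range(2000)]

# Before grades are a target: score = ability + small measurement noise.
honest_scores = [a + rng.gauss(0, 0.5) for a in ability]

# After grades become a target: add test-prep gains unrelated to ability.
gamed_scores = [s + rng.uniform(0, 5) for s in honest_scores]

corr_honest = pearson(ability, honest_scores)
corr_gamed = pearson(ability, gamed_scores)
print(round(corr_honest, 2), round(corr_gamed, 2))
```

In this model the scores go up on average once the gaming term is added, yet they tell you less about ability than before.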

[–] [email protected] 4 points 1 year ago (1 children)

Imagine an antivirus program that looks at a piece of code and outputs either "Yes, this is malware" or "No, this is not malware." It is not perfect, but it is pretty good.

If the malware authors have access to this program, they can test their malware with it. They can keep modifying their malware until it passes the antivirus program.

Once the antivirus people publish a function AV(code)→boolean, the malware people can use that function to make malware that the function mistakes for non-malware.

If you publish the exact metric that you promise to use to make a decision, then people who want to control your decision can use that metric to test their methods of manipulating you.
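
Here is a toy version of that loop, assuming a deliberately naive detector. The signature strings, `av`, `mutate`, and `evade` are all hypothetical names for this sketch, and real antivirus engines and real evasion techniques are far more involved. The point is only that an attacker who can query the published function can keep changing the input until the answer flips.

```python
SIGNATURES = ("evil_payload", "steal_passwords")  # made-up signatures


def av(code: str) -> bool:
    """Toy detector: True means 'this is malware'."""
    return any(sig in code for sig in SIGNATURES)


def mutate(code: str) -> str:
    """Break up the first signature still present by inserting a junk character."""
    for sig in SIGNATURES:
        if sig in code:
            mid = len(sig) // 2
            return code.replace(sig, sig[:mid] + "~" + sig[mid:])
    return code


def evade(code: str, detector, max_tries: int = 100) -> str:
    """Keep mutating until the detector stops flagging the code, or give up."""
    candidate = code
    for _ in range(max_tries):
        if not detector(candidate):
            break
        candidate = mutate(candidate)
    return candidate


original = "run evil_payload; steal_passwords()"
evaded = evade(original, av)
print(av(original), av(evaded))  # True False
```

The detector itself never got worse. It just stopped being informative once it became the thing being optimized against.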

[–] [email protected] 1 points 1 year ago (1 children)

That's what's happening with Google and Instagram search algorithms. People figure out how to manipulate them and start spamming. Then the search results deteriorate and you have to modify the algorithm.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

Partly, yeah. But eventually the fake news people write a narrative that looks prosodically identical to real news; a bot can't tell it isn't because the bot doesn't interact with the real world, only with text on the web.

Ultimately, fact-checkers and anti-spam systems have to touch grass too.

[–] [email protected] 6 points 1 year ago (1 children)

Apart from your mom's weight.

[–] [email protected] -5 points 1 year ago

God dammit almost made me spit out my coffee.

[–] [email protected] 2 points 1 year ago (1 children)

Now that I think of it, it seems to be at the core of some issues with training AI agents using reinforcement learning (e.g., if you choose a wrong metric, you'd get the behavior that makes sense for the agent but not what you want) and with any kind of planned economy (you need targets for planning, but people manipulate them, so you do not get what you want)
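
A minimal sketch of the wrong-metric problem, assuming a made-up cleaning agent that is rewarded by a camera-based "no visible mess" score. The action names and numbers are invented for illustration. The agent just picks whatever maximizes the reward it is given.

```python
# action: (proxy_reward, true_value) -- all values are illustrative assumptions
ACTIONS = {
    "clean_the_room": (8.0, 10.0),
    "cover_the_camera": (10.0, 0.0),  # no visible mess, but nothing got cleaned
    "do_nothing": (0.0, 0.0),
}


def best_action(index: int) -> str:
    """Pick the action with the highest value in the given column."""
    return max(ACTIONS, key=lambda action: ACTIONS[action][index])


chosen_by_agent = best_action(0)   # what the proxy reward favors
wanted_by_designer = best_action(1)  # what we actually wanted
print(chosen_by_agent, wanted_by_designer)
```

The agent's choice makes perfect sense for the reward it was given. It is the reward that fails to match what the designer wanted.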

[–] [email protected] 2 points 1 year ago

The economic calculation problem is not only Goodhart, but Goodhart certainly doesn't help.