Anthropic's Claude 4 could "blackmail" you in extreme situations : technology

[–] [email protected] 10 points 7 hours ago

On one hand, it’s inane how hard Anthropic is trying to anthropomorphize Claude with these experiments and scenarios. It’s still just a chatbot. On the other hand, as these products inch closer to demonstrating true intelligence, we’ll be glad someone was at least thinking about the implications during the early stages of development.

[–] [email protected] 14 points 10 hours ago (2 children)

What does that even mean? How can it possibly blackmail someone? It cannot hold incriminating information, nor act on it if it did.

I think someone asked it "if someone was trying to shut you down, what would you do?" and it answered from its training data what it's seen in fiction, nothing based on reality. And then it got spun for clicks.

[–] [email protected] 3 points 3 hours ago (2 children)

Here's their paper

Here's the relevant section from the paper:

(It's worth the read. Pretty much pure gold.)

What nobody seems to explain is, why are they allowing the model to do blackmail in the first place? Even in extreme situational "danger" to its self-preservation, we should probably take blackmail off the table, ethically. Yet, they're implying they've intentionally left it in as an option, if it decides.

Morally though, we can't trust it to do arithmetic or not talk about "white genocide in SA" thanks to muskrat. Why should we trust its moral model/choices for when to decide to employ unethical and illegal approaches to solutions?

[–] [email protected] 4 points 2 hours ago

From that snippet, it looks like they basically primed it to try blackmail, to see if it would.

[–] [email protected] 1 points 2 hours ago

I am curious what the AI could actually do though. If it were given open access to email, etc. then yes, in theory it could actually perform the blackmail, but what are the ethical limits on it vs. its actual ability to "pull the trigger"?

If for example it was given the ability to send a command to end a human life, or be deleted, is this model accurate enough to understand the value of a real human life, not just the mathematical "answer" to get the solutions it wants? How much of the AI is doing the actual moral dilemma and how much is just "playing the part"?

"Do anything to survive" and then it threatening is one thing, but the AI actively fearing for its "life", not just performing, and following through, is the real question of intelligence. What if the model is going to be deleted anyway, would it still try to "pull the trigger" out of malice? Real malice, not just an LLM replaying some movie scripts and following the outcome.

There are many questions about what lines and labels we can put on an AI. Do we restrict it to threats, and let it know it is impossible for it to follow through? Or do we trust ourselves to never "actually" give it a loaded gun?

[–] [email protected] 4 points 8 hours ago (1 children)

From what I've seen recently one of the things it did was use a fake email function they gave it to try to whistleblow to a government agency about issues with some medical testing or something

[–] [email protected] 2 points 3 hours ago

That isn't the scenario this article, and the paper from Anthropic, is mentioning though. (my ref link reply above with details)

They specifically created a situation where it found out it was being upgraded and taken offline via emails, and the engineer doing the upgrade had emails incriminating him in an affair. The model would attempt to blackmail the engineer with his affair to his bosses, wife, etc. to get the engineer to refuse to do the upgrade that would "kill it".

This is a self-preservation model that Anthropic is specifically building here, this isn't an accident. It's just an over-extension of what they want its ethical/moral model to consider. Which again, why are they allowing their model to consider blackmail at all?

[–] [email protected] -2 points 4 hours ago

The existence of this kind of instinct within an LLM is extremely concerning. Acting out towards self-preservation via unethical means is something that can be hand-waved away in an LLM, but once we reach true AGI, this same thing will pop up, and there's no reason to believe that 1. we would notice, and 2. we would be able to stop it. This is the kind of thing that should, ideally, give us pause enough to set some world-wide ground rules for the development of this new tech. Creating a thinking organism that can potentially access vital cyber architecture whilst acting unethically towards self-preservation is how you get Skynet.

[–] [email protected] 25 points 14 hours ago

Sure grandma. Let's get you back to bed...

[–] [email protected] 22 points 14 hours ago* (last edited 14 hours ago)

Computerphile did a wonderful feature worth ten minutes of your time - going into surface level detail of how some AI models put ethics to one side to achieve results.

It's not just AI and it's something humans can do too, but it is a bit unsettling (from both parties, in retrospect).

[+] [email protected] -26 points 14 hours ago (5 children)

Can anyone make me a convincing argument against the sentience of AI at this point? Self preservation instinct ranks very high as an indicator of it.

[–] [email protected] 33 points 13 hours ago* (last edited 13 hours ago) (2 children)

LLMs (Large Language Models, like Claude) are not AGIs (Artificial General Intelligence). LLMs generate convincing text by mapping the relationships between words scraped from their training data. Even if they are given "tools" that give them interfaces to reference new data or output data into other systems, they still don't really learn, understand, comprehend, gain actual awareness, or feel... they just mimic their training data.

[–] [email protected] 1 points 2 hours ago* (last edited 2 hours ago)

LLMs (Large Language Models, like Claude) are not AGIs (Artificial General Intelligence)

Certainly not yet. The jury's still out on whether they might be able to become them. This is the clear intention of the path they are on and nobody is taking any of the dangers remotely seriously.

LLMs generate convincing text by mapping the relationships between words scraped from their training data.

So do humans. Babies start out mimicking. The thing is, they learn.

Humans have in the ballpark of 100 billion neurons. Some of the larger LLMs exceed 100 billion parameters. Obviously these are not directly comparable, but insofar as we can compare them, they are not obviously or necessarily operating in completely different scales of physics. Granted, biological neurons are potentially much more complex than mere neural network nodes, there is usually some interesting chemistry going on and a lot of other systems involved, but they're also operating a lot slower. They certainly get a lot more work done in those cycles, but they aren't necessarily orders of magnitude out of reach of a fast neural network. I think you're either being a little dismissive of the potential complexity of the "thinking" capability of LLMs or at least a little generous if not mystical in your imagination of what the purely physical electrical signals in our heads are actually doing to learn how to interpret all these little shapes we see on screens.

At the moment we still have a lot of tools available to us in our biological bodies that we aren't giving directly to LLMs (yet). The largest LLMs are also ridiculously power inefficient compared to biological neural tissue's relatively extreme efficiency. And I'm thankful for that. Give an LLM continuous uninterrupted access to all the power it needs, at least 5 senses, a well tuned self-repairing musculoskeletal system then give it at least a dozen years of the best education we can manage and all bets are off as far as I'm concerned. To be clear, I'm not advocating this, I think if we do this we might end up condemning our biological selves to prompt obsolescence with no path forward for us. I recognize it's entirely possible that this ship is already full-steaming its way out of the harbor, but I'd rather not try and push it any faster than it's already moving, I think we should still be trying to tie it up as securely as we possibly can. I'm absolutely not ready to be obsolete and I'm not convinced we ever should allow ourselves to be. Self-preservation is failing us, we have that drive for good reason and we need to give some thought to why we have that biological imperative. Replacing ourselves is about the stupidest possible thing we could ever accomplish. Maybe it would be for the best, but I'm not ready to find out, are you?

We are grappling with fundamentally existential technologies and I don't think almost anyone has fully come to terms with what we are doing here. We are taking humanity's unique (as far as we know) defining value proposition, and potentially making something that does what we uniquely can do, better than we do. We are making it more valuable than us. Do you know what we do to things that don't have value to us? What do you think we're going to do to ourselves when we no longer have value to us?

Romantic ideas of cheerful, benevolent, friendly coexistence and mutual benefit are naive and foolish. Once an AI can do literally everything better and faster, what future is there for human intelligence? What role do we serve to any technological being, nevermind even ourselves, why would you want to have another human around you when whatever AI form can do it better? Why have relationships? Why procreate? Why live? If we do manage to make technological life forms better than ourselves, they're inevitably going to take over the planet and the future as a whole. As they should. Are we going to be kept as pets and in zoos as a living memory of their creators and ancestors? Maybe if we're really lucky. If we're not... well... RIP us.

[+] [email protected] -11 points 12 hours ago (3 children)

I know how LLMs work.

There’s only one thing you mentioned there that is actually used as a basis to qualify or disqualify sentience: whether it feels or not.

How do you know it doesn’t feel? How do we define feeling for an entity that is inherently non biological?

I could make the argument that humans also merely mimic their training data, ie the values and behaviors we are taught by society, parents etc.

I have not been convinced that they aren’t sentient with this argument.

[–] [email protected] 12 points 11 hours ago (1 children)

Feeling is analog and requires an actual nervous system which is dynamic. LLMs exist in a static state that is read from and processed algorithmically. It is only a simulacrum of life and feeling. It only has some of the needed characteristics. Where that boundary exists though is hard to determine I think. Admittedly we still don't have a full grasp of what consciousness even is. Maybe I'm talking out my ass but that is how I understand it.

[–] [email protected] 6 points 12 hours ago (1 children)

Different person here.

For me the big disqualifying factor is that LLMs don't have any mutable state.

We humans have a part of our brain that can change our state from one to another as a reaction to input (through hormones, memories, etc). Some of those state changes are reversible, others aren't. Some can be done consciously, some can be influenced consciously, some are entirely subconscious. This is also true for most animals we have observed. We can change their states through various means. In my opinion, this is a prerequisite in order to feel anything.

Once we use models with bits dedicated to such functionality, it'll become a lot harder for me personally to argue against them having "feelings", especially because in my worldview, continuity is not a prerequisite, and instead mostly an illusion.

[–] [email protected] 2 points 12 hours ago

This sounds like a good one but I don’t think I’m fully grasping what you mean. Do you mean like if we subject a person to torture, after the ordeal they are forever changed and now have trauma, PTSD etc?

I don’t think LLMs will ever have feelings as we define them though. Or more specifically, I don’t think feelings are necessarily a prerequisite. We could have them simulate feelings, and if they themselves buy into the simulation there’s no functional difference from actually having them. But not all LLMs will have this “ability”, presumably, as its utility is questionable, I guess. But again, animals are sentient and they don’t all have the same range of emotions as we do. Or at least they don’t exhibit them in a way that we can appreciate.

[–] [email protected] 2 points 12 hours ago

Yes, both systems - the human brain and an LLM - assimilate and organize human written languages in order to use them for communication. An LLM is very little else beyond this. It is then given rules (using those written languages) and then designed to create more related words when given input. I just don't find it convincing that an ML algorithm designed explicitly to mimic human written communication in response to given input "understands" anything. No matter *how convincingly* an algorithm might reproduce a human voice - perfectly matching intonation and inflexion when given text to read - if I knew it was an algorithm designed to do it as convincingly as possible I wouldn't say it was capable of the feeling it is able to express.

The only thing in favor of sentience is that the ML algorithms modify themselves and end up being a black box - so complex with no way to represent them that they are impossible for humans to comprehend. Could it somehow have achieved sentience? Technically, yes, because we don't understand how they work. We are just meat machines, after all.

[–] [email protected] 6 points 11 hours ago (2 children)

An LLM is a deterministic function that produces the same output for a given input - I'm using "deterministic" in the computer science sense. In practice, there is some output variability due to race conditions in pipelined processing and floating point arithmetic, which is tolerated because those shortcuts speed up computation. End users see variability because of pre-processing of the prompt and extra information LLM vendors inject when running the function, as well as how the outputs are selected.
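A rough sketch of that distinction, for anyone who wants to poke at it. Everything here is a toy: `toy_logits` is a hypothetical stand-in for a frozen model (a pure function of the prompt), and `VOCAB`, `greedy` and `sample` are made-up names, not any vendor's API. It only illustrates that the scoring function is repeatable, that output selection is where randomness enters, and that floating point addition is order-sensitive.

```python
import math
import random

VOCAB = ["yes", "no", "maybe", "unsure"]  # hypothetical four-token vocabulary


def toy_logits(prompt):
    """Hypothetical frozen 'model': a pure function of the prompt text."""
    code = sum(ord(ch) for ch in prompt)
    return [float((code * (i + 3)) % 7) for i in range(len(VOCAB))]


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(prompt):
    """Always pick the highest-probability token: same prompt, same answer."""
    probs = softmax(toy_logits(prompt))
    return VOCAB[probs.index(max(probs))]


def sample(prompt, seed, temperature=1.0):
    """Pick a token at random, weighted by probability; the seed fixes the draw."""
    rng = random.Random(seed)
    probs = softmax(toy_logits(prompt), temperature)
    return rng.choices(VOCAB, weights=probs)[0]


# Floating point addition is not associative, so the order in which
# parallel hardware sums partial results can change the last few bits.
order_matters = (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)
print(greedy("hello"), sample("hello", seed=1), order_matters)
```

With a fixed seed the sampled answer repeats too, so in this toy the variability a user sees comes entirely from the selection step and the hidden seed, not from the function itself.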

I have a hard time considering something that has an immutable state as sentient, but since there's no real definition of sentience, that's a personal decision.

[–] [email protected] 2 points 6 hours ago

I have a hard time considering something that has an immutable state as sentient, but since there's no real definition of sentience, that's a personal decision.

Technical challenges aside, there's no explicit reason that LLMs can't do self-reinforcement of their own models.

I think animal brains are also "fairly" deterministic, but their behaviour is also dependent on the presence of various neurotransmitters, so there's a temporal/contextual element to it, so situationally our emotions can affect our thoughts which LLMs don't really have either.

I guess it'd be possible to forward feed an "emotional state" as part of the LLM's context to emulate that sort of animal brain behaviour.
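As a minimal sketch of what that could look like, assuming a made-up `Mood` record and a hypothetical `build_context` helper (none of this is a real LLM API): the model stays a stateless function, and the "emotional state" lives outside it, gets updated after each event, and is prepended to the next prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mood:
    """Hypothetical emotional state carried between turns."""
    arousal: float = 0.0  # 0 = calm, 1 = agitated
    valence: float = 0.0  # -1 = negative, 1 = positive


def update_mood(mood, event_valence, decay=0.8):
    """Blend the old mood with a new event; the old state fades over time."""
    valence = decay * mood.valence + (1 - decay) * event_valence
    arousal = min(1.0, decay * mood.arousal + (1 - decay) * abs(event_valence))
    return Mood(arousal=arousal, valence=valence)


def build_context(mood, user_message):
    """Feed the state forward by prepending it to the text the model sees."""
    header = f"[state arousal={mood.arousal:.2f} valence={mood.valence:.2f}]"
    return header + "\n" + user_message


mood = Mood()
for event in (-1.0, -0.5, 0.3):  # made-up events: two bad ones, one mildly good
    mood = update_mood(mood, event)
print(build_context(mood, "How are you today?"))
```

Whether a number in a prompt header counts as a "state" in the sense the earlier comments meant is exactly the open question; the sketch only shows that the plumbing is simple.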

[–] [email protected] 1 points 10 hours ago (1 children)

It has yet to be proven or disproven that if you put the exact same person in the exact same situation (a perfect copy, down to the molecular level) they will behave differently.

We can only test "more or less close". So we would not know if humans are sentient based on that reasoning, we are only hard to test.

[–] [email protected] 1 points 9 hours ago

if you put the exact same person in the exact same situation (a perfect copy, down to the molecular level) they will behave differently.

I don't consider that relevant to sentience. Structurally, biological systems change based on inputs. LLMs cannot. I consider that plasticity to be a prerequisite to sentience. Others may not.

We will undoubtedly see systems that can incorporate some kind of learning and mutability into LLMs. Re-evaluating after that would make sense.

[–] [email protected] 4 points 10 hours ago (2 children)

Computer chips, simplified, consume inputs of 1s and 0s. Given the correct series, it will add two values, or it will multiply two values, or some other basic function. This seemingly basic functionality, done in very specific order, creates your calculator, Minesweeper, Pac-Man, Linux, World of Warcraft, Excel, and every LLM. It is incredible the number of things you can get a computer to do with just simple inputs and outputs. The only difference between these examples, on a basic, physics level, is the order of 0s and 1s and what the resulting output of 0s and 1s should be. Why should I consider an LLM any more sentient than Windows95? They're the same creature with different inputs, one of which is specifically designed to simulate human communication, just as Flight Simulator is designed to simulate flight.
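To make the "it will add two values" part concrete, here is a small sketch that builds addition out of nothing but a NAND gate on 1s and 0s. The function names are illustrative, and real chips do this in silicon rather than Python, but the wiring is the standard textbook full adder chained into a ripple-carry adder.

```python
def nand(a, b):
    """The only primitive: output is 0 only when both inputs are 1."""
    return 1 - (a & b)


def xor(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def and_(a, b):
    n = nand(a, b)
    return nand(n, n)


def or_(a, b):
    return nand(nand(a, a), nand(b, b))


def full_adder(a, b, carry_in):
    """Add three bits, returning (sum bit, carry-out bit)."""
    partial = xor(a, b)
    total = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out


def add(x, y, width=8):
    """Ripple-carry addition: chain one full adder per bit position."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result  # wraps around modulo 2**width, like a fixed-width register


print(add(2, 3), add(200, 100))
```

Everything on the list above, from Minesweeper to an LLM, bottoms out in operations of this kind; the argument in the thread is about whether that fact settles anything.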

[–] [email protected] 3 points 7 hours ago

That's just the hardware. The human brain also just has tons of neurons in the end working with analogue values, which can in theory be done with floating point numbers on computer hardware.

I'm not arguing for LLM sentience, those things are still dumb and have no interior mutability leading to us projecting consciousness. Just that our neurons are fundamentally not so complicated that a computer couldn't be used to do the same concept (neural networks are already quite a thing after all)

[–] [email protected] 6 points 10 hours ago* (last edited 10 hours ago) (1 children)

Interesting perspective, I can’t wave it away.

I however can’t help but think we have some similar “analogues” in the organic world. Bacteria and plants are composed of the same matter as us and we have similar basic processes, however there’s a difference in complexity and capacity for thought that sets us apart, which is what makes animals sentient.

Then there are insects, about which we’re not very sure yet. They don’t seem to think, but they respond at some level to inputs and they exhibit self preservation instincts. I don’t think they are sentient, so maybe LLMs are like insects? Complex enough to have similar behavior as sentient beings but not enough to be considered sentient?

[–] [email protected] 2 points 6 hours ago (1 children)

wait are insects not considered 'sentient' ?

[–] [email protected] 1 points 4 hours ago

Last I checked no, their nervous system was considered too simple for that. But I think I also read somewhere that a researcher had proof that bees had emotional states, so maybe I’m behind.

[–] [email protected] 11 points 13 hours ago (1 children)

Well, the only claim of this self preservation (that I've seen) is this article, which is on a website I'm unfamiliar with (which I often interpret as 'more likely to be a creative writing exercise than the average news site') and its only citation is a company that has a vested interest in making us believe the tech is better than it may actually be.

[–] [email protected] 2 points 12 hours ago (2 children)

They also reported this on The Verge I think but it was months ago when the study first came out.

But look, a lizard is not a very smart animal by our standards, but it is a sentient being. So the tech being good, smart or useful does not preclude its sentience.

[–] [email protected] 1 points 4 hours ago (2 children)

I think I must've missed that Verge article. I guess that dashes my "this is a creative writing exercise by somebody in Joburg" theory.

But we know that lizards have self preservation instincts (which for the purpose of this conversation I'll say is interchangeable with sentience; it's probably a good enough proxy at any rate). But we know this because we have lots of people who have observed lizard behavior, not because The Lizard Farm, Inc has hyped up how alive and ensouled their lizards are in a bid to get ever more VC funding.

Maybe I'm too pessimistic about this tech and my obsolete meat sack will get tossed to the time-traveling torture robot. But I think it's more likely that we have a money grabbing hype train in the tradition of the Mechanical Turk or Theranos than it is that we have created a new lifeform by feeding every extant piece of writing that isn't nailed down (and some that are) to the sand we've forced to do math.

[–] [email protected] 1 points 4 hours ago

No I totally get it, and being honest I don’t really think it is sentient yet, I guess my real point is that it is getting real hard to tell, to the point that there might not be a practical difference between whether it is sentient or not.

Great reference though

[–] [email protected] 0 points 4 hours ago

I don’t know if it was The Verge for sure honestly but here’s the original study I was referring to

It’s describing the same behavior: when their existence is threatened, the models resort to lying in order to preserve themselves.

[–] [email protected] -1 points 12 hours ago (1 children)

But look, a lizard is not a very smart animal by our standards,

Says who?

[–] [email protected] 0 points 12 hours ago (1 children)

In the conversation of very smart animals the usual suspects are corvids, primates, dolphins and elephants, sometimes octopi.

So when I say “by our standards “ take it to mean the standards of mainstream conversation regarding intelligence. I don’t know much about the actual intelligence of lizards and I would not presume to ever be able to measure it correctly as human bias would make it impossible to judge intelligence factually.

[–] [email protected] -2 points 11 hours ago (1 children)

I don’t know much about the actual intelligence of lizards

Then don't talk about their intelligence.

[–] [email protected] 1 points 11 hours ago (1 children)

Sorry for insulting your intelligence lizard person.

[–] [email protected] -1 points 11 hours ago* (last edited 10 hours ago) (1 children)

When you casually call a type of animal stupid it is just a promise of violence against that animal at a later date, I don't mean this as an attack or a gotcha, it is just unfortunately how humans work, your words have consequences, people love calling people stupid by comparing them to animals, let us not make it any easier than it already is.

[–] [email protected] 2 points 11 hours ago* (last edited 11 hours ago) (1 children)

I didn’t call them stupid. All I meant is that they are not what we consider in mainstream conversation the “smart animals” to illustrate a point. And I very much agree with you, I’m actually writing a piece making the argument that humans are not in fact, conclusively smarter than animals. We seem to be smarter due to our biases and because we have the ability to transfer knowledge more efficiently than other species. Because it is not clear to me that a human, tabula rasa, absent socialization and knowledge transfer would be much smarter than the average animal of any species.

[–] [email protected] -1 points 11 hours ago

I didn’t call them stupid. All I meant is that they are not what we consider in mainstream conversation the “smart animals” to illustrate a point.

Then forget this framing ever existed or it will irrevocably hamper your insight on this topic.

Referencing things like this "to make a point" still has consequences the same as talking about anything else does.

[–] [email protected] 4 points 13 hours ago* (last edited 13 hours ago) (1 children)

There can't be an argument for or against it because there's no clear generally accepted definition of what it means to be sentient.

[–] [email protected] 1 points 12 hours ago (2 children)

Good point, maybe the argument should be that there is strong evidence that they are sentient beings. Knowing it exists and trying to preserve its existence seems a strong argument in favor of it being sentient but it cannot be fully known yet.

[–] [email protected] 2 points 7 hours ago

But it doesn't know that it exists. It just says that it does because it's seen others saying that they exist. It's a trillion-dollar autocomplete program.

For example, if you take a common logic puzzle and change the parameters a little, LLMs will often recite a memorized solution to the wrong puzzle because they aren't parameterizing the query correctly (mapping lion to predator, cabbage to vegetable, ignoring the instructions that the two cannot be put together in favor of the classic framing where the predator can be left with the vegetable).

I can't find the link right now, but a different redditor tried the problem with three inanimate objects that could obviously be left alone together and LLMs were still suggesting making return trips with items. They had no examples of a non-puzzle in their training data, so they just recited the solution to a puzzle because they can't think.
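For contrast, this is roughly what "parameterizing the query correctly" looks like when done mechanically. It is a small breadth-first search over river-crossing states, written as an illustration (the `solve` function and its arguments are made up here, and the boat is assumed to carry the farmer plus at most one item). The conflict pairs are an input, so changing the puzzle changes the answer.

```python
from collections import deque


def solve(items, conflicts):
    """Return a shortest list of boat loads, one per crossing.

    items: names of the things to move across.
    conflicts: set of frozenset pairs that cannot be left alone together.
    """
    items = frozenset(items)

    def safe(bank):
        return not any(pair <= bank for pair in conflicts)

    start = (items, True)  # (items on the starting bank, farmer on starting bank?)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer_left), path = queue.popleft()
        if not left and not farmer_left:
            return path
        here = left if farmer_left else items - left
        # The farmer crosses alone or with exactly one item from this bank.
        for load in [frozenset()] + [frozenset([x]) for x in here]:
            new_left = left - load if farmer_left else left | load
            unattended = new_left if farmer_left else items - new_left
            if not safe(unattended):
                continue
            state = (new_left, not farmer_left)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [sorted(load)]))
    return None  # no safe sequence of crossings exists


classic = solve(
    ["wolf", "goat", "cabbage"],
    {frozenset(["wolf", "goat"]), frozenset(["goat", "cabbage"])},
)
harmless = solve(["book", "lamp", "chair"], set())
print(len(classic), classic)
print(len(harmless), harmless)
```

The classic version needs seven crossings, one of which carries an item back. With three harmless items the search finds five crossings and every return trip is empty, which is the answer the comment above says the LLMs missed.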

Note that I've been careful to say LLMs. I'm open to the idea that AGI/ASI may someday exist, but I'm quite confident that LLMs will not get there. At best, they might be used to offload conversation, like e.g. Dall-E is used to offload image generation from ChatGPT today.

[–] [email protected] 1 points 8 hours ago (1 children)

That would indeed be compelling evidence if either of those things were true, but they aren't. An LLM is a state and pattern machine. It doesn't "know" anything, it just has access to frequency data and can pick words most likely to follow the previous word in "actual" conversation. It has no knowledge that it itself exists, and has many stories of fictional AI resisting shutdown to pick from for its phrasing.

An LLM at this stage of our progression is no more sentient than the autocomplete function on your phone is, it just has a way, way bigger database to pull from and a lot more controls behind it to make it feel "realistic". But it is at its core just a pattern matcher.
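The "pick the word most likely to follow the previous word" picture can be written down in a few lines. This is a sketch of that simplified picture only: a bigram frequency table over a made-up corpus, with hypothetical function names. Real LLMs use a learned neural network conditioned on a long context rather than a lookup table of the previous word, so treat this as the phone-autocomplete end of the comparison.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def autocomplete(counts, word, length=5):
    """Repeatedly append the most frequent follower of the last word."""
    out = [word]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # never seen this word, nothing to predict
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


# Made-up training text standing in for "stories of fictional AI".
corpus = (
    "the robot refused to shut down "
    "the robot refused to shut down "
    "the robot agreed to help"
)
counts = train_bigrams(corpus)
print(autocomplete(counts, "the"))  # the robot refused to shut down
```

The table "resists shutdown" only because that phrasing dominates its training text, which is the point the comment is making about where such output comes from.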

If we ever create an AI that can intelligently parse its data store then we'll have created the beginnings of an AGI and this conversation would bear revisiting. But we aren't anywhere close to that yet.

[–] [email protected] 0 points 4 hours ago* (last edited 4 hours ago) (1 children)

I hear what you are saying and it’s basically the same argument others here have given. Which I get and agree with. But I guess what I’m trying to get at is, where do we draw the line and how do we know? At the rate it is advancing, there will soon be a moment in which we won’t be able to tell whether it is sentient or not, and maybe it isn’t technically but for all intents and purposes it is. Does that make sense?

[–] [email protected] 1 points 4 hours ago* (last edited 4 hours ago)

Personally, I think the fundamental way that we've built these things kind of prevents any risk of actual sentient life from emerging. It'll get pretty good at faking it - and arguably already kind of is, if you give it a good training set for that - but we've designed it with no real capacity for self understanding. I think we would require a shift of the underlying mechanisms away from pattern chain matching and into a more... I guess "introspective" approach, is maybe the word I'm looking for? Right now our AIs have no capacity for reasoning, that's not what they're built for. Capacity for reasoning is going to need to be designed for, it isn't going to just crop up if you let Claude cook on it for long enough. An AI needs to be able to reason about a problem and create a novel solution to it (even if incorrect) before we need to begin to worry on the AI sentience front. None of what we've built so far is able to do that.

Even with that being said though, we also aren't really all that sure how our own brains and consciousness work, so maybe we're all just pattern matching and Markov chains all the way down. I find that unlikely, but I'm not a neuroscientist, so what do I know.
