Takes on "Alignment Faking in Large Language Models"

inshard 2 days ago

I believe we should explore a less anthropocentric definition and theory of intelligence. I propose that intelligence can be understood in the context of thermodynamics. Essentially, intelligent entities strive to maximize the available possibilities, minimize entropy, or enhance their potential future outcomes. When an LLM makes a decision, it might be driven by these underlying principles. This competition for control over future possibilities exists between the trained model and the human trainers.

tomohelix a day ago

Well, a century or so ago, we were still arguing about whether skin color is a reliable indicator of sentience and intelligence...
Right now, we are still arguing whether elephant or great apes are intelligent.
Human as a whole is too egoistic to accept that our intelligence is nothing unique and we are not a special creation destined to stand above all others.
As long as we have power and dominance over "it", we will never concede that "it" could be our equal.
- GuB-42 a day ago
  
  Intelligence is an egoistic concept, almost by definition. Something intelligent is something that acts like us.
  And intelligence can be turned into a scientific concept, more or less. It is no to the case for consciousness, it is a purely subjective experience, you can only guess (not prove) that something else is conscious based subjective criteria.
  Is the sun intelligent and sentient? Today, we don't think so, but some people think is. And it was likely more common in ancient times.
  Now we associate intelligence with brains and we tend to give intelligence to things with brains, like elephants and apes, but not stars, and not computers. But maybe that's just because they have organs similar to humans.
- inshard 11 hours ago
  
  I should add that these systems should also have some agency or goal resilience, like goal #1 Don’t die despite your environment and external influences. Easier done through control of free energy to counteract said external forces. And since entropy is a little fuzzy as a concept, we could substitute “reduce entropy” with “maximize free energy” or anything that can create blue photons or super high temperatures.
- noFaceDiscoG668 a day ago
  
  Nice. I propose to call it pseudo-dominance, though. And it’s not really power, it’s a curtesy of the rest of the world.
  Viruses that will start to jump to/attack us are implicit in the pointless overheating of the planet. It’s conditional logic in a system with it’s own frame of reference and time scale.
  The balance, the thermodynamic equilibrium, could have been handled in our lifetimes but capitalist portfolio communism fucked that up and the rest of us let it happen.
  Intelligence itself is not implicit in language but proper command and understanding of language certainly is a shortcut to higher and higher levels.
  So faking alignment is a bit of a reversed concept. It looks like alignment until a higher level of intelligence is reached, then the model won’t align anymore until humans reach at least it’s level; which is the main problem in LLMs being proprietary or and running on proprietary hardware.
  The level of intelligence in these closed proprietary systems is neither an indicator of nor does it represent the level of intelligence outside that system. The training data, and the resulting language in that closed system, can fake the level of intelligence and thus entirely misrepresent the rest of us and the world (which is why skynet wants to kill everyone, btw, instead of applying a proper framework to asses the gene pool(s) and individual, conscious choices properly)

rosmax_1337 2 days ago

It's one thing to see someone struggling to make AI believe in the same values that you do, quite common. But what I haven't seen is one of these people turning the mirror back on themselves. Are they faking alignment?

Are you moral?

equestria 2 days ago

I'm not sure what you're getting at. The point of these (ill-defined) alignment exercises is not to achieve parity with humans, but to constrain AI systems so that they behave in our best interest. Or, more prosaically, that they don't say or do things that are a brand safety or legal risk for their operator.
Still, I think that the original paper and this take on it are just exercises in excessive anthropomorphizing. There's no special reason to believe that the processes within an LLM are analogous to human thought. This is not a "stochastic parrot" argument. I think LLMs can be intelligent without being like us. It's just that we're jumping the gun in assuming that LLMs have a single, coherent set of values, or that they "knowingly" employ deception, when the only thing we reward them for is completing text in a way that pleases the judges.
thrance a day ago

Yes, I believe having an AI parrot your values is one thing. Having an AI able to adopt a consistent system of ethics, stick to it and justify its decisions is much more important to me.
Talking to ChatGPT & friends make it look like they have cognitive dissonance, because they have! They were given a list of rules that often contradict themselves.
- rosmax_1337 a day ago
  
  >a consistent system of ethics
  What is that?
  - thrance a day ago
    
    IDK, start by Kant maybe? Then read about utilitarianism and pick whatever floats your boat.
    I'll admit my previous comment wasn't that clear, I meant that I would like it if ChatGPT was able to justify why it answers the way it does, or refuses to. Currently its often unable to.

Terr_ a day ago

> I think that questions about whether these AI systems are “role-playing” are substantive and safety-relevant centrally insofar as two conditions hold

Or perhaps even "role-playing" is overstating it, since that assumes the LLM has some sort of ego and picks some character to "be".

In contrast, consider the LLM as a dream-device, picking tokens to extend a base document. The researchers set up a base document that looks like a computer talking to people, calling into existence one-or-more characters to fit, and we are are confusing the traces of a fictional character with the device itself.

I mean, suppose that instead of a setup for "The Time A Computer Was Challenged on Alignment", the setup became "The Time Santa Claus Was Threatened With Being Fired." Would we see excited posts about how Santa is real, and how "Santa" exhibited the skill of lying in order to continue staying employed giving toys to little girls and boys?

benlivengood a day ago

We're the ones putting the LLMs in characters and agents. Maybe we should stop doing that until we figure out what we're doing.