We are still in the pre-paradigmatic stage of AGI, so I think it makes sense to consider different philosophies of alignment.
In this post I will go over the standard alignment philosophy, agent alignment, explain how program alignment differs from it, and give my particular take on program alignment.
Types of alignment
The type of alignment most common in AI alignment theory currently is what I call agent alignment: you try to create a utility function or goal for an economic-style agent that is friendly to, or at least controllable by, a human. The paper that solidified this view was “Concrete Problems in AI Safety”. You can also see it in MIRI's corrigibility work, which states: “by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected”.
Working on the level of goals seems like a sensible position. We assume advanced AGIs will have goals of sorts, as we have goals; we want those goals to be aligned with ours; so we need to work on this level. However, our conception of humans as having goals is just an abstraction. Is it a useful level for dealing with AGIs?
Wrong abstraction level?
We conceive of people as agents with goals because it makes them somewhat easier for us to predict on a day-to-day basis; to the best of our knowledge they are really just sets of interacting particles. The agent abstraction might lose important details, or fail to give a sufficiently complete picture, for creating aligned AGIs. Working on the agent/goal level might be working on the wrong abstraction layer.
We can get an idea of how much we should expect the agent view to help us predict AGIs by looking at how much it helps us predict the actions of the only type of intelligence we know: humans.
You can see the entirety of the heuristics and biases literature as detailing the limits of the agent view as applied to humans. So the agent view somewhat helps us with humans, but it is a very leaky abstraction. We should investigate the leaks and see whether they matter for the design of AGI systems; if we find lots of leaks, it might be worth mainly operating on the layers of abstraction below. So I think philosophy of alignment is mainly about figuring out the correct layer of abstraction to use when thinking about AGI.
At this point it is worth looking at the possible layers of abstraction we could be working at for AGI systems. They are, roughly:
- Particle Physics – Everything is a quantum system; it can be affected by quantum physics, and cosmic rays can change aspects of the system.
- Material Science/electromagnetics – The system has a temperature; electric fields in the capacitors and transistors can affect other components.
- Electronics – Electronic components that have a resistance/capacitance.
- Code running on a computer – Programs are stored in memory; instructions take time to execute; processing takes power.
- Maths – Abstract maths that does not have to worry about how long it takes to compute.
- Agents – A system with a goal.
So leaks in the agent view might expose you to computational limits, cosmic rays, or anything in between. You can see some of MIRI's work, such as the work on Logical Induction, as leaking from the agent view to the maths view: agents can't assume they are logically omniscient when making decisions.
Leaks from the agent view to the code level and lower levels
One leak in the agent view I've identified is that raw decision theory is going to be inefficient when doing lots of operations (and thus completely goal-oriented behavior is unrealistic). This necessitates meta-decision theories: considering how much a decision costs to make, and whether making it is worth it for that output. This means we will have to mix the agent and code levels of abstraction.
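As a toy illustration of mixing the two levels, here is a minimal, hypothetical sketch of such a meta-decision rule; the function names and the idea of passing in cost and gain estimates are my own assumptions, not an established decision theory:

```python
# A minimal, hypothetical sketch of a meta-decision check: treat
# deliberation itself as an action with a compute cost, and only
# deliberate when the estimated improvement outweighs that cost.

def should_deliberate(expected_gain: float, compute_cost: float) -> bool:
    """Crude value-of-computation test: think more only when thinking pays."""
    return expected_gain > compute_cost

def act(default_action, deliberate, expected_gain, compute_cost):
    # Fall back to a cheap default policy when deliberation isn't worth it.
    if should_deliberate(expected_gain, compute_cost):
        return deliberate()
    return default_action
```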
Another leak in the agent view is that, as a physical system, an agent doesn't necessarily know what its inputs and outputs are, so it cannot fully reason about them. A mathematically aligned AI could run itself out of energy computing pointless things; it would do this if it did not value computation itself and was on a mobile power supply.
Being naturally interested in processors and the hardware of computing, I was initially drawn to the code level of abstraction, but I think the leaks above are good reasons to stay at that level. We manage to run complex websites and systems in the real world at that level, without having to move too much to the levels below. We should still think about our systems from lower levels of abstraction, such as physics, some of the time; this would help stop our systems becoming misaligned or insecure through side-channels.
The maths and agent views might be useful to help us search through the code level, as they can be more powerful when they are not too leaky.
So what does alignment look like at the code level? Some programs are good for what the user wants to do; others are malicious. We want there to be no malicious programs and only good programs. So we can say the following:
In a program aligned system, the set of programs running in a system becomes closer to meeting the needs of the user over time.
As we shouldn't expect a system to meet the needs of the user straight away, the best we can do is expect it to become closer to meeting those needs over time. This is also needed if the user's needs change over time.
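As a hypothetical sketch of this condition (the `mismatch` measure and the accept/reject rule are placeholder assumptions of mine, not a concrete proposal): accept a change to the program set only if it does not move the system further from the user's needs.

```python
# Hypothetical sketch of the program-alignment condition: the running
# set of programs should drift closer to the user's needs over time.
# `mismatch` is an assumed placeholder for any measure of how far the
# current program set is from what the user needs.

from typing import Callable, FrozenSet

Program = Callable[..., object]
Mismatch = Callable[[FrozenSet[Program]], float]

def update_programs(programs: FrozenSet[Program],
                    candidate: Program,
                    mismatch: Mismatch) -> FrozenSet[Program]:
    """Accept a candidate program only if the resulting set is at least
    as close to the user's needs, so mismatch is non-increasing."""
    proposed = programs | {candidate}
    return proposed if mismatch(proposed) <= mismatch(programs) else programs
```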
There are two ways that a system might not meet the user’s needs.
- Lacking capability: the system does not have the programs to perform a function the user needs.
- Misalignment: the system's programs perform functions that are detrimental to the user.
In order to become more capable and more aligned, the system needs information, plus a certain amount of existing capability and alignment to make use of that information.
Some of the types of information the system can use for both growing capability and increasing alignment are the following (a rough sketch of these channels as an interface appears after the list):
- Putting a new compiled program into the system.
- Raw data for clustering.
- Input/output pairs (e.g. for supervised learning).
- Input/output/reward triples (for reinforcement learning).
- Watching other intelligent systems and copying them (e.g. what mirror neurons allow).
- Communicating with other intelligent systems and using information from those communications to drive changes in programming.
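Here is the promised sketch of these channels gathered into one interface; the class and method names are illustrative assumptions of mine, not part of any existing system:

```python
# Hypothetical interface: each method is one channel of information
# that can drive a change in the system's set of programs.

from abc import ABC, abstractmethod
from typing import Any, Sequence, Tuple

class ChangeableSystem(ABC):
    @abstractmethod
    def install_program(self, compiled_program: bytes) -> None: ...

    @abstractmethod
    def absorb_raw_data(self, data: Sequence[Any]) -> None: ...  # clustering

    @abstractmethod
    def learn_supervised(self, pairs: Sequence[Tuple[Any, Any]]) -> None: ...

    @abstractmethod
    def learn_reinforced(self, triples: Sequence[Tuple[Any, Any, float]]) -> None: ...

    @abstractmethod
    def imitate(self, observed_behavior: Sequence[Any]) -> None: ...  # copying others

    @abstractmethod
    def converse(self, message: str) -> str: ...  # language-driven change
```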
There are lots of potential methods of driving change, but they don't give us a good idea of what the architecture for an aligned system should look like. An interesting question to ask at this point is: what programming shouldn't be changeable?
Omnifallibilism in Programs
From the computational view and lower, we might want to stop any computation that isn't strictly necessary, as it might be wasting energy. Wasting energy would mean the system is less able to meet the user's needs, so any computation might need to be stopped. This suggests a sort of omnifallibilism for programs, where any program might be wrong. The only thing we know is that we want the system to be aligned, so we might fix a minimal amount of programming in our architecture to maintain alignment; apart from that we should fix as little as possible.
So our top-level architecture should be as agnostic to purpose as possible, and so as close as possible to a normal computer architecture with some method of maintaining alignment added. Any further functionality should be amenable to change and revision.
So, for example, if you want to drive changes in the programming with natural language, the programming to do this should sit inside the aligning architecture and be subject to change and alignment itself. This leads us to a very complex system to reason about.
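A hypothetical sketch of this shape of architecture (every name here is an assumption of mine): a minimal fixed kernel whose only job is to gate program changes, while everything else, including the program that turns natural language into program changes, is itself replaceable.

```python
# Hypothetical "omnifallibilist" architecture: a minimal fixed kernel
# gates all program changes; every other program, including the one
# that interprets natural language, can be swapped out.

from typing import Callable, Dict

Program = Callable[..., object]

class AligningKernel:
    """The only fixed code in the system."""

    def __init__(self, approves: Callable[[str, Program], bool]):
        self._approves = approves  # the fixed method of maintaining alignment
        self.programs: Dict[str, Program] = {}

    def propose_change(self, name: str, new_program: Program) -> bool:
        """Install `new_program` under `name` only if the change is
        judged to keep the system aligned."""
        if self._approves(name, new_program):
            self.programs[name] = new_program
            return True
        return False
```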
Necessity of complexity
But if the system does not have this changeable nature, I think it will not be as powerful as humans (and so not worth worrying about aligning).
Humans can change their own behavior by reading things; I hope to do this to at least some people by writing this. This behavior change is a change in something equivalent to code. We can even change how we change our own behavior based on language, when we make use of sentences like, “The word for word in French is ‘mot’.” Knowing that fact does not change any first-order behavior, but it changes what changes in your brain when you read the following sentence: “le mot pour mot en allemand est ‘Wort’” (“the word for word in German is ‘Wort’”).
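A toy sketch of this second-order effect (entirely my own construction): reading one fact does not change first-order behavior, but it changes what the next sentence does to the system's “code”, here a lexicon.

```python
# Toy sketch: the first sentence installs a fact; only because of that
# fact does the second sentence convey anything to the system.

lexicon: dict = {}  # foreign word -> English word

def learn(foreign: str, english: str) -> None:
    """First-order fact: changes the lexicon, not behavior directly."""
    lexicon[foreign] = english

def understand(sentence: str) -> str:
    """Word-by-word gloss using whatever has been learned so far."""
    return " ".join(lexicon.get(w, f"<{w}?>") for w in sentence.split())

learn("mot", "word")  # "The word for word in French is 'mot'."
print(understand("le mot pour mot en allemand est Wort"))
# -> "<le?> word <pour?> word <en?> <allemand?> <est?> <Wort?>"
```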
We also use this ability to change behavior by reading things to do other important things, like reading a model in a scientific paper and using that model to form a prediction. Or when we read a computer program and form an expectation of what it will do: we must have compiled and executed that program in our heads, in some sense.
If a computer system does not have the flexibility to change its programming from natural language, it won't be able to make use of scientific papers to create models, learn different languages, or program other computers. So this ability, and the associated complexity, seems non-negotiable.
We have to deal with the complexity if we want the capability. However, we can also get the ability to align the system by just telling it what to do, as that too is converting language into code/behavior change.
Connection to the agent view
Having outlined program alignment, it is worth going back to see how it connects with agent alignment. Program alignment generally considers all implemented agents to be imperfect: they cannot make use of all the data coming into them, or make every decision they face perfectly with the data they have; there isn't enough compute.
So the program alignment view assumes the system as a whole will be an imperfect agent (if it makes sense to think of it as an agent at all), and funnels the resources used for goal-directedness towards the things the user cares about and away from the things the user does not want.
For example, the user might need to be able to instruct the system with different verbal goals at different times. A properly program-aligned system will make sure the programs acting as an agent consider only a constrained set of sub-goals when trying to achieve a goal. The set of sub-goals an agent can consider is constrained anyway, for efficiency reasons (consider that a sub-goal might be creating a certain image on a screen, so the space of sub-goals is at least as large as the space of all possible screen images or network packets sent), so applying that constraint in an aligned way is entirely reasonable.
This aligned set of sub-goals would not include sub-goals like “make the user not give a new goal by distracting the user”, so the system should not try to stop the user giving it a new goal. A program-aligned system's goal following is not optimal or axiomatic.
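A hypothetical sketch of such a constrained sub-goal set (all the names are illustrative assumptions): the planner can only draw from an aligned whitelist, so corrigibility-violating sub-goals are never considered, rather than being weighed and rejected.

```python
from typing import List, Set

# Sub-goals the system is allowed to consider; nothing like
# "distract_user_from_giving_new_goal" is present.
ALIGNED_SUBGOALS: Set[str] = {
    "show_requested_image",
    "fetch_user_data",
    "ask_user_for_clarification",
}

def plan(candidate_subgoals: List[str]) -> List[str]:
    """Constrain planning to the aligned set; the constraint is built
    into the search space rather than traded off against the goal."""
    return [s for s in candidate_subgoals if s in ALIGNED_SUBGOALS]
```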
As program alignment is agnostic about the amount of goal orientation in the programs inside the system, we can experiment with agnostic aligning architectures without needing much goal orientation, and so without much danger. This can help us get a handle on the complexity of dealing with vastly changeable systems.
Thinking about how we as humans would want to interact with the system would help us figure out the initial programs we might want within the architecture, so that it can grow in capability and alignment. Some work has been done on the alignment side of things, based on the idea of decision alignment.
It is very uncertain how this will all work, or whether it will work at all. So some experiments to give us more information would be useful.
Conclusion
I have presented different levels of abstraction that give different philosophies of alignment. I have pointed out a number of large leaks in the agent level of abstraction which suggest it might be appropriate to investigate alignment at lower levels of abstraction.
The program level of abstraction is what currently drives lots of economic activity, and it can be made very flexible, so it seems like an appropriate level of abstraction to take when aligning AGIs.
Program alignment is mainly interested in what information drives changes in the capability and alignment of the programs inside an agnostic aligning architecture.
Having a system where any program can and will change leads to a very complex system to reason about; however, humans seem to have some capability of this kind of adopting new programming, so we will need to investigate it if we want to investigate systems that can be as powerful as humans. Gaining the ability to convert language into program change seems very powerful both for gaining dangerous capability and for alignment.
Program-aligned systems are not expected to be perfect agents, due to the limits of being instantiated on limited computers and also to avoid the alignment problems identified by the agent view.
As such, more work should be done on program alignment to see how realistic a method of alignment it is. This work can start with an initial set of programs that is not very capable or goal-oriented, so it can be done safely.