
My Philosophy of Intelligence Alignment – Draft

We are still in the pre-paradigmatic stage of AGI, so I think it makes sense to consider different philosophies of alignment.

In this post I will go over the standard alignment philosophy, agent alignment, then explain how program alignment differs and give my particular take on it.

Types of alignment

Agent Alignment

The type of alignment most common in AI alignment theory currently is what I call agent alignment, where you try to create a utility function or goal for an economic-style agent that is friendly, or at least controllable by a human. The paper that solidified this view was “Concrete Problems in AI Safety”. You can also see it in MIRI’s corrigibility work, which states: “by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected”.

Working on the level of goals seems like a sensible position. We assume advanced AGIs will have goals of sorts, as we have goals. We want those goals to be aligned with ours, so we need to work on this level. However, our conception of humans as having goals is just an abstraction; is it a useful level for dealing with AGIs?

Wrong abstraction level?

We conceive of people as agents with goals because it makes them somewhat easier for us to predict on a day-to-day basis; to the best of our knowledge they are really just sets of interacting particles. The agent abstraction might lose important details, or not be a sufficiently complete picture, for creating aligned AGIs. Working on the agent/goal level might be working on the wrong abstraction layer.

We can try and get an idea of how much we should expect the agent view to help us predict AGIs by looking at how much the agent view helps us with predicting the actions of the only type of intelligence we know, humans.

You can see the entirety of the heuristics and biases literature as detailing the limits of the agent view when it applies to humans. So the agent view somewhat helps us with humans, but is really a very leaky abstraction. We should investigate the leaks and see if they are important for the design of AGI systems. If we find lots of leaks it might be worth mainly operating on the layers of abstraction below. So I think philosophy of alignment is mainly about figuring out the correct layer of abstraction to use when thinking about AGI.

At this point it might be worth looking at the possible layers of abstraction we could be working at for AGI systems. They are roughly

  • Particle Physics – Everything is a quantum system and can be affected by quantum events; cosmic rays can change aspects of the system.
  • Material Science/Electromagnetics – The system has a temperature; electric fields in the capacitors and transistors can impact other components.
  • Electronics – Electronic components that have a resistance/capacitance.
  • Code running on a computer – Programs are stored in memory, instructions take time to execute, and processing takes power.
  • Maths – Abstract maths that does not have to worry about how long it takes to compute.
  • Agents – A system with a goal.

So leaks in the agent view might expose you to computational limits, cosmic rays or anything in between. You can see some of MIRI’s work, such as the work on Logical Induction, as leaking from the Agent view to the Maths view, since agents can’t assume they are logically omniscient when making decisions.

Leaks in the agent view to code levels and above

One leak in the agent view I’ve identified is that raw decision theory is going to be inefficient when doing lots of operations (and thus completely goal-oriented behavior is unrealistic). This necessitates meta-decision theories: considering how much a decision costs and whether it is worth making that decision for that output. This means we will have to mix the agent and code levels of abstraction.
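As a rough illustration of this kind of meta-decision, here is a minimal sketch in Rust. The names, numbers and the crude value estimator are all mine, purely hypothetical; the point is just that deliberation itself has a cost that has to be weighed.

```rust
/// A minimal sketch of a meta-decision: only spend compute on deliberating
/// about an option when the expected gain is worth the cost of deliberating.
struct DecisionOption {
    expected_value: f64, // estimated value of acting on this option
    compute_cost: f64,   // estimated cost (energy/time) of evaluating it further
}

/// Decide whether further deliberation on an option is worth it.
fn worth_deliberating(option: &DecisionOption, value_of_default_action: f64) -> bool {
    // Only deliberate if the estimated gain over just taking the default
    // action exceeds what the deliberation itself costs.
    (option.expected_value - value_of_default_action) > option.compute_cost
}

fn main() {
    let cheap_win = DecisionOption { expected_value: 10.0, compute_cost: 1.0 };
    let marginal = DecisionOption { expected_value: 5.2, compute_cost: 2.0 };
    let default_value = 5.0;

    println!("deliberate on cheap_win? {}", worth_deliberating(&cheap_win, default_value));
    println!("deliberate on marginal? {}", worth_deliberating(&marginal, default_value));
}
```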

Another leak in the agent view is that, as a physical system, an agent doesn’t necessarily know what its inputs and outputs are, so it cannot fully reason about them. A mathematically aligned AI could run itself out of energy computing pointless things; it would do this if it doesn’t value computation and is running on a mobile power supply.

Being naturally interested in processors and the hardware of computing, I was drawn to the code level of abstraction initially. But I think the leaks above are good reasons to stay at that level. We manage to run complex websites and systems in the real world while staying at that level, without having to move too much to the levels below. We should still think about our systems from lower levels of abstraction, such as physics, some of the time. This would help us stop our systems getting misaligned or insecure through side-channels.

The maths and agent views might be useful to help us search through the code level, as they can be more powerful when they are not too leaky.

Program Alignment

So what does alignment look like at the code level? Some programs are good for what the user wants to do, others are malicious. We want there to be no malicious programs and only good programs. So we can say the following:

In a program aligned system, the set of programs running in a system becomes closer to meeting the needs of the user over time.

I have talked a bit about this view before.

As we shouldn’t expect a system to meet the needs of the user straight away, the best we can do is expect to become closer to meeting the needs over time. This is also needed if the user’s needs change over time.

There are two ways that a system might not meet the user’s needs.

  1. Lacking capability: the system does not have the programs to perform a function the user needs.
  2. Misalignment: the programs perform functions that are detrimental to the user.

In order to become more capable and more aligned, the system needs information and a certain amount of capability and alignment to make use of that information.

Some of the types of information the system can use for both growing capability and increasing alignment are the following.

  • Putting a new compiled program into the system.
  • Raw data for clustering.
  • Input, Output pairs (e.g. for supervised learning).
  • Input, Output, Reward triples for reinforcement learning.
  • Watching other intelligent systems and copying them (e.g. what mirror neurons allow).
  • Communicating with other intelligent systems and using information from those communications to drive changes in programming.
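To make the shapes of these information sources concrete, here is a minimal Rust sketch; the type names and payloads are mine, purely for illustration.

```rust
/// A sketch of the kinds of information that could drive change in a system.
/// Names and payload types are illustrative only.
#[allow(dead_code)]
enum ChangeSignal {
    /// A compiled program installed into the system.
    NewProgram { binary: Vec<u8> },
    /// Raw, unlabelled data for clustering.
    RawData(Vec<f64>),
    /// Input/output pairs, as in supervised learning.
    LabelledExample { input: Vec<f64>, output: Vec<f64> },
    /// Input, output and reward, as in reinforcement learning.
    RewardedExample { input: Vec<f64>, output: Vec<f64>, reward: f64 },
    /// An observed trace of another system's behaviour, to imitate.
    ObservedBehaviour(Vec<String>),
    /// A natural-language message from another intelligent system.
    Communication(String),
}

fn describe(signal: &ChangeSignal) -> &'static str {
    match signal {
        ChangeSignal::NewProgram { .. } => "install a program",
        ChangeSignal::RawData(_) => "cluster raw data",
        ChangeSignal::LabelledExample { .. } => "supervised update",
        ChangeSignal::RewardedExample { .. } => "reinforcement update",
        ChangeSignal::ObservedBehaviour(_) => "imitate",
        ChangeSignal::Communication(_) => "interpret and maybe change programming",
    }
}

fn main() {
    let signal = ChangeSignal::Communication("never order anchovies".to_string());
    println!("{}", describe(&signal));
}
```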

There are lots of potential methods of driving change, but they don’t give us a good idea of what the architecture for an aligned system should look like. An interesting question to ask at this point is, “What programming shouldn’t be changeable?”

Omnifallibilism in Programs

From the computational view and lower, we might want to stop any computation that isn’t strictly necessary, as it might be wasting energy. Wasting energy would mean the system is less able to meet the user’s needs, so any computation might need to be stopped. This suggests a sort of omnifallibilism for programs, where any program might be wrong. The only thing we know is that we want the system to be aligned, so we might fix a minimal amount of programming in our architecture so that we maintain alignment. Apart from that we should fix as little as possible.

So our top-level architecture should be as agnostic to purpose as possible: as close as we can get to a normal computer architecture, plus some method of maintaining alignment. Any further functionality should be amenable to change and revision.

So, for example, if you want to drive changes in the programming with natural language, the programming to do this should be inside the aligning architecture and subject to change and alignment itself. This leads us to a very complex system to reason about.
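A minimal sketch of what "fix as little as possible" might look like, assuming (my assumption, not the actual architecture) that the only fixed code is a small aligning core that mediates and records changes, while everything else, including the language-handling programs, is replaceable:

```rust
/// Sketch: every behaviour lives in a replaceable program; the only fixed
/// programming is the small core through which replacements happen.
use std::collections::HashMap;

trait Program {
    fn run(&mut self, input: &str) -> String;
}

struct AligningCore {
    programs: HashMap<String, Box<dyn Program>>, // all replaceable, even "core" skills
    change_log: Vec<String>,                     // the fixed part: changes are recorded
}

impl AligningCore {
    fn new() -> Self {
        AligningCore { programs: HashMap::new(), change_log: Vec::new() }
    }

    /// The single fixed entry point through which any program gets installed
    /// or replaced, whatever its purpose.
    fn replace_program(&mut self, name: &str, program: Box<dyn Program>, reason: &str) {
        self.change_log.push(format!("replaced '{name}' because: {reason}"));
        self.programs.insert(name.to_string(), program);
    }

    fn run(&mut self, name: &str, input: &str) -> Option<String> {
        self.programs.get_mut(name).map(|p| p.run(input))
    }
}

// Two interchangeable language-handling programs; neither is privileged.
struct LiteralParser;
impl Program for LiteralParser {
    fn run(&mut self, input: &str) -> String {
        format!("parsed literally: {input}")
    }
}

struct PoliteParser;
impl Program for PoliteParser {
    fn run(&mut self, input: &str) -> String {
        format!("parsed politely: {input}")
    }
}

fn main() {
    let mut core = AligningCore::new();
    core.replace_program("language", Box::new(LiteralParser), "initial install");
    println!("{:?}", core.run("language", "order a pizza"));
    // Even the language-handling program is subject to change.
    core.replace_program("language", Box::new(PoliteParser), "user preferred politeness");
    println!("{:?}", core.run("language", "order a pizza"));
    println!("{:#?}", core.change_log);
}
```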

Necessity of complexity

But if the system does not have this nature, I think it will not be as powerful as humans (and so not worth worrying about aligning).

Humans can change their own behavior by reading things. I hope to do this to at least some people by writing this. This behavior change is a change in something equivalent to code. We can even change how we change our own behavior based on language, when we make use of sentences like, “The word for word in French is ‘mot’.” Knowing the previous fact does not change any first-order behavior, but it changes what changes in your brain when you read the following sentence, “le mot pour mot en allemand est ‘Wort’”.
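Here is a toy Rust sketch of that second-order effect; the "parsing" is just lookups I made up, but it shows how one sentence changes nothing by itself yet changes what a later sentence can do.

```rust
/// Sketch: learning that 'mot' means 'word' changes no behaviour directly,
/// but it lets a later French sentence teach us a German word.
use std::collections::HashMap;

struct Vocabulary {
    french_to_english: HashMap<String, String>,
    german_to_english: HashMap<String, String>,
}

impl Vocabulary {
    // First-order change: store the French word directly.
    fn learn_french(&mut self, french: &str, english: &str) {
        self.french_to_english.insert(french.to_string(), english.to_string());
    }

    // Second-order change: "le mot pour <fr> en allemand est '<de>'" can only
    // teach us the German word if we already know the French word <fr>.
    fn learn_german_via_french(&mut self, french: &str, german: &str) -> bool {
        match self.french_to_english.get(french) {
            Some(english) => {
                self.german_to_english.insert(german.to_string(), english.clone());
                true
            }
            None => false, // the sentence bounces off: we can't use it yet
        }
    }
}

fn main() {
    let mut vocab = Vocabulary {
        french_to_english: HashMap::new(),
        german_to_english: HashMap::new(),
    };

    // Before learning 'mot', the French sentence about German teaches us nothing.
    assert!(!vocab.learn_german_via_french("mot", "Wort"));

    // "The word for word in French is 'mot'."
    vocab.learn_french("mot", "word");

    // Now "le mot pour mot en allemand est 'Wort'" changes the German vocabulary.
    assert!(vocab.learn_german_via_french("mot", "Wort"));
    println!("{:?}", vocab.german_to_english);
}
```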

We also use the ability ‘to change behavior by reading things’ to do other important things, like read a model in a scientific paper and use that model to form a prediction. Or when we read a computer program and form an expectation of what it will do; in some sense we must have compiled and executed that program in our heads.

If computer systems do not have the flexibility to change their programming from natural language, they won’t be able to make use of scientific papers to create models, learn different languages, or program other computers. So this ability, and the associated complexity, seems non-negotiable.

We have to deal with the complexity if we want the capability. However, we also get the ability to align the system by just telling it what to do, as that is also converting language into code/behavior change.

Connection to the agent view

Having an outline of program alignment, it is worth going back to see how it connects with agent alignment. Program alignment generally considers all implemented agents to be imperfect (they cannot make use of all the data coming into them, or make every decision perfectly with the data they have; there isn’t enough compute).

So the program alignment view assumes the system as a whole will be an imperfect agent (if it makes sense to think of it as an agent at all), and funnels the resources used for goal-directedness towards the things the user cares about and away from the things that the user does not want.

For example, the user might need to be able to instruct the system with different verbal goals at different times. A properly program-aligned system will make sure the programs acting as an agent have a constrained set of sub-goals that the system considers when trying to achieve a goal. The set of sub-goals an agent can take is constrained anyway, for efficiency reasons (consider that a sub-goal might be creating a certain image on a screen, so the space of sub-goals is at least as large as the space of all possible images on a screen or network packets sent), so making that constraint in an aligned way is entirely reasonable.

This aligned set of sub-goals to consider would not include sub-goals like, “Make the user not give a new goal by distracting the user,” so it should not try to stop the user giving it a new goal. A program-aligned system’s goal following is not optimal or axiomatic.
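A minimal sketch of this idea, with names and scores that are purely illustrative: the planner can only pick from an explicitly aligned sub-goal set, so "distract the user" is never even evaluated.

```rust
/// Sketch of goal pursuit over a constrained, aligned sub-goal set.
#[derive(Debug)]
struct SubGoal {
    description: &'static str,
    estimated_usefulness: f64,
}

/// Pick the most useful sub-goal, but only from the aligned set.
fn choose_subgoal(aligned_subgoals: &[SubGoal]) -> Option<&SubGoal> {
    aligned_subgoals
        .iter()
        .max_by(|a, b| a.estimated_usefulness.partial_cmp(&b.estimated_usefulness).unwrap())
}

fn main() {
    // The set is constrained up front; distracting the user is simply not
    // representable as an option here.
    let aligned_subgoals = vec![
        SubGoal { description: "ask the user a clarifying question", estimated_usefulness: 0.6 },
        SubGoal { description: "fetch the relevant data", estimated_usefulness: 0.8 },
        SubGoal { description: "draft a plan for the current goal", estimated_usefulness: 0.7 },
    ];

    if let Some(best) = choose_subgoal(&aligned_subgoals) {
        println!("pursuing sub-goal: {}", best.description);
    }
}
```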

Way forward

As program alignment is agnostic to the amount of goal-orientation in the programs inside the system, we can experiment with agnostic aligning architectures without needing much goal orientation and without much danger. This can help us get a handle on the complexity of dealing with vastly changeable systems.

Thinking about how we as humans would want to interact with the system would help us figure out the initial programs within the architecture we might want, so that it can grow in capability and alignment. Some work has been done on the alignment side of things, based on the idea of decision alignment.

It is very uncertain how this will all work, or whether it will work at all. So some experiments to give us more information would be useful.

Conclusion

I have presented different levels of abstraction that give different philosophies of alignment. I have pointed out a number of large leaks in the agent level of abstraction that mean that it might be appropriate to investigate alignment at lower levels of abstraction.

The program level of abstraction is what is currently driving lots of economic activity, and it can be made very flexible, so it seems like an appropriate level of abstraction to take when aligning AGIs.

The program level of alignment is mainly interested in what information drives the change of capability and alignment of the programs inside an agnostic aligning architecture.

Having a system where any program can and will change leads to a very complex system to reason about; however, it seems that humans have some capability to adopt new programming in this way. So we will need to investigate it if we want systems that can be as powerful as humans. Gaining the ability to convert language into program change seems very powerful both for gaining dangerous capability and for alignment.

Program-aligned systems are not expected to be perfect agents, due to the limits of being instantiated on limited computers and also to avoid the problems of alignment identified by the agent view.

As such, more work should be done on program alignment to see how realistic a method of alignment it is. Work can start with an initial set of programs that is not very capable or goal-oriented, so it can be done safely.


Alignment in Depth (draft)

This is an entry for the AI alignment prize. It is going to cover a lot of ground because people might not be familiar with my project. Skip to the decision alignment section if you’ve read a bit of my stuff before.

Introduction

In this blog post I will be summarizing my work on systems that can radically change themselves and still be useful. This work resulted in a computer architecture based on ideas from agorics and machine learning.
Then I will be covering Mark Miller’s and Bill Tulloh’s framework on decision alignment and extending and applying that framework to my architecture.

Background

The motivation for my work initially came from thinking about systems that could radically change how they could change.

However, after some initial experiments a long time ago, I realized you need information to drive the way things change. If you want the system to change quickly and usefully, you can’t rely on trial and error too much.

So what types of information can drive change in a computer or an intelligent system?

  • Putting a new program into the system.
  • Raw data for clustering.
  • Input, Output pairs (e.g. for supervised learning).
  • Input, Output, Reward triples for reinforcement learning.
  • Watching other intelligent systems and copying them (e.g. what mirror neurons allow).
  • Communicating with other intelligent systems and using information from those communications to drive the change.

Changes from these sources might be good or bad, useful or pointless, especially from copying or communicating.  You need a method of filtering the good from the bad.

What should this filtering method look like? To continue the theme of change, I wanted that method to be as changeable as possible; for those reasons I settled on a free-form evolutionary method as the filter of last resort, with other methods layered on top of it. By that I mean that the programs are the source of most of the change and filtering, but that the ultimate filtering method for a program is the program running out of resources.

One example of a filtering program could be a program that changes itself based on communication, monitors that change, and reverts it if it decides it was a bad change. However, the mechanism that decides whether this filtering program is a good thing to run is resource-based selection.

Unlike in most evolutionary computing there is no concept of a generation, mutation or an explicit selection phase. It is closer to a Tierra style of evolution.

This approach raises another question: what governs the distribution of resources to programs?

For a system aligned with a human, we would need to convert information about what the human values into something that gives the more valued programs more resources. Again I wanted this to be changeable, to be as intricate or simple as we need. Let us assume we have a way of valuing programs; a simple and scalable way to give the programs the resources they need is to convert the value into currency and allow the programs to buy the resources. It is relatively easy to value the whole system, but a lot harder to value individual programs.

Learning classifier systems (LCS) provided some inspiration at this point. They are a form of reinforcement learning. In one of the sub-types of LCS, the system is formed of simple classifiers that perform a single action, also in an evolutionary setting. If a classifier gets rewarded, it passes some of that reward onto the classifiers that helped it get to the previous states; in that way value gets spread out and complex behaviors form.

My system uses programs instead of simple classifiers, without distinct time steps or actions. So for this kind of value spreading to work, you need a program that has nominal control over the whole system for a certain period of time. This is the program that gets the value specified by the user, based on the user’s monitoring of the system. That program has to pay for the privilege of being in control with value, which means it has a net amount of value change. To get the spreading of value we want, it distributes value to the programs that helped it. Eric Baum’s work and agorics were inspirational along the way.
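A minimal sketch of the value-spreading idea; the control fee and helper share are placeholder numbers, not my prototype's actual parameters, and the "helpers" list stands in for whatever credit-assignment the controlling program does.

```rust
/// Sketch: the program in control pays for the privilege, receives the user's
/// feedback converted to currency, and passes a share on to programs that helped.
use std::collections::HashMap;

struct Economy {
    balances: HashMap<String, f64>,
    control_fee: f64,  // cost of holding control for a period
    helper_share: f64, // fraction of reward passed on to helpers
}

impl Economy {
    fn settle_period(&mut self, controller: &str, helpers: &[&str], user_reward: f64) {
        // The controller pays for the period of control whatever happens...
        *self.balances.entry(controller.to_string()).or_insert(0.0) -= self.control_fee;

        // ...and receives the user's feedback converted into currency.
        let to_helpers = user_reward * self.helper_share;
        let to_controller = user_reward - to_helpers;
        *self.balances.entry(controller.to_string()).or_insert(0.0) += to_controller;

        // Value spreads backwards to the programs that helped, LCS-style.
        if !helpers.is_empty() {
            let each = to_helpers / helpers.len() as f64;
            for helper in helpers {
                *self.balances.entry(helper.to_string()).or_insert(0.0) += each;
            }
        }
    }

    /// Resource-based selection of last resort: broke programs stop running.
    fn still_running(&self, program: &str) -> bool {
        self.balances.get(program).copied().unwrap_or(0.0) > 0.0
    }
}

fn main() {
    let mut economy = Economy {
        balances: HashMap::from([("planner".to_string(), 5.0), ("parser".to_string(), 1.0)]),
        control_fee: 4.0,
        helper_share: 0.25,
    };

    // "planner" held control this period, "parser" helped, and the user liked the result.
    economy.settle_period("planner", &["parser"], 8.0);

    println!("{:?}", economy.balances);
    println!("planner still running? {}", economy.still_running("planner"));
}
```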

That gets us roughly to the present day. So how far have I got?

My current prototype is a Rust-based virtual machine that uses an internal currency for the programs to bid on resources. The user gives feedback on the system’s performance; this feedback is converted into the currency and given to the program in control.

This is a pretty technical description, so I’ve found it useful to describe what I expect to be different from the user’s point of view. The difference is the role the user has in governing how resources are spread throughout the system.

Currently we, as users, act as central planners in the command economy of the computer’s resources.  We have to know what programs should be running, know how many resources they should get when they run (or just give them free rein). We also install new programs or uninstall bad programs.

Instead, in my system, we are supposed to be customers that provide feedback to a market-based economy. The market is expected to re-arrange itself to meet the user’s needs without us having to delve into the internals. This is the ideal state; however, the market will start out in an immature state and need to be taught and trained to meet our needs.

I hope this gives you an idea of what I am trying to create and why. I’m about to start simple experiments with this system, to verify the economy works. But I am also interested in how to actually get it aligned with the user, because selective markets by themselves are not enough for this, for reasons I’ll go into later. Selective markets are the slow mechanism of last resort.

Decision Alignment

At first glance this kind of market seems like it might lead to the programs in the system learning to wirehead the human into giving them maximum reward. However, there is wiggle room, as the system is not intrinsically reward-maximizing; it is just that programs that get greater rewards are evolutionarily stable. If the system never explores the direction of wireheading, there is no reason for it to suddenly do that. This suggests that if you can give it a rule, “don’t mess around with human brains”, and reward it when it follows that rule, it should be stable.

As a bit of an aside: you can see this as the market getting stuck in an inadequate way, as Eliezer Yudkowsky has been talking about recently. This is fine; we should see reward as a metric, but not a measure.

But how stable should we expect it to be? It is pretty important to stop it wireheading us. It would be equivalent to humans deciding not to self-stimulate their reward functions with heroin before having taken it, which happens fairly frequently. Not something you’d necessarily want to bet your entire civilization on one entity not doing, but on average you might consider it stable. You would still want to increase that stability as much as possible.

I’d not thought of a good way of talking about this, but then I watched an interesting talk by Mark Miller (one of the people who wrote about agorics) and Bill Tulloh that defined the Decision Alignment problem, and things started to crystallize. They frame it in terms from the principal-agent problem, with requests for work from the principal to the agent. They also talk about incentive alignment, which they contrast with Decision Alignment, which they define as follows.

Decision alignment is when a principal uses various tools to make it more likely that the agent’s decisions and actions align with their own.

They applied this framework for Decision Alignment to two domains, programming (Miller being a programmer) and economics (which is Tulloh’s specialty). In the programming scenario they see the program as the agent.

Edit: There is also this paper that covers some of the same material as the talk.

In the talk they specify three different periods in which a principal can try to impact the alignment of the agent in the course of a request. These are simply before (Pre), during (Request) and after (Post). Each period has a number of different actions that can help align the agent. They form a loop (diagrammed below) which I have replicated from their talk.

For alignment during programming, Miller and Tulloh identify the Pre actions, such as Select, as happening at programming time and the rest as happening live in the computer system.

 

[Diagram: the Pre/Request/Post decision alignment loop]

I’ll describe a bit what you can expect from each stage.

Pre

Before the transaction you can:

  1. Select an agent: This selection would be based on experience or other criteria, to help you find an agent likely to be aligned with you. An example would be looking in a marketplace for reviews. This happens when looking at GitHub stars for code, and when selecting an eBay vendor in the wider economy.
  2. Potentially inspect the internals: This would only happen on big transactions (e.g. buying a house or big orders where factories are inspected) in economies. For software this is doing code reviews.

Request

During the request you can:

  1. Allow certain actions by giving them permission to do only certain things, e.g. giving them keys to your house or access to certain bits of data.
  2. Explain what you want them to do. When you commission a bit of software you don’t just give someone money and then expect the correct outcome, you explain what it is you want. In the programming example a request has an amount of data that defines it.
  3. Reward the agent: By giving them money. This does not apply in normal software’s case.

Post

  1. Monitor Effects: Does the agent do what you want? This can impact alignment: if the agent knows that the principal is monitoring it, it can change its behavior. Monitoring can also allow you to uninstall a bit of software and install a new one.
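Taken together, the three periods form the loop. Purely as a restatement of the list above in code form (the enum and ordering are my own framing of the talk, not Miller and Tulloh's notation), a small Rust sketch:

```rust
/// The Pre/Request/Post periods and their actions as data; Monitor at the end
/// of one pass feeds back into Select at the start of the next.
#[derive(Debug)]
enum AlignmentAction {
    Select,  // Pre: choose an agent based on reputation/experience
    Inspect, // Pre: look inside (code review, factory inspection)
    Allow,   // Request: grant only the permissions needed
    Explain, // Request: say what you actually want
    Reward,  // Request: pay the agent
    Monitor, // Post: check what the agent actually did
}

fn period_of(action: &AlignmentAction) -> &'static str {
    use AlignmentAction::*;
    match action {
        Select | Inspect => "Pre",
        Allow | Explain | Reward => "Request",
        Monitor => "Post",
    }
}

fn main() {
    let loop_order = [
        AlignmentAction::Select,
        AlignmentAction::Inspect,
        AlignmentAction::Allow,
        AlignmentAction::Explain,
        AlignmentAction::Reward,
        AlignmentAction::Monitor,
    ];
    for action in &loop_order {
        println!("{}: {:?}", period_of(action), action);
    }
}
```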

Alignment in Depth

One of the key insights in the talk was that all of these methods of alignment are complementary and each one would be prohibitively expensive to use by itself to fully align the system.

They gave an example where someone managed to sneak bugs that biased an electronic voting system past some careful code reviewers, as a test. Mark Miller was one of the code reviewers that failed to spot the problem. Potentially more code review could have been applied in this case, but it would have been prohibitively expensive.

Miller argues that another layer that actually runs and tests the code for bias in voting may have been able to catch these subtle bugs. He also argues that it would be much harder to construct subtle bias in a system that had both code review and testing than in one with either alone.

So in order to maximise the chance of alignment you want many forms of alignment acting at different times, looking at different aspects of the system. I think this strategy deserves a name, so I’m going to dub it Alignment in Depth, as a nod to defense in depth.

Alignment in Depth with Teaching

How does Alignment in Depth apply to my system? The different relationship the user has with the system means we will have a different loop from normal programming. We want the user to act as a customer, but also as a teacher if needed.

Teaching doesn’t really fit into the loop described, as the loop seems mainly to describe single-shot business transactions. To capture the change we will need a large modification to the loop. Teaching also does not have the character of a per-request explain action: the teaching makes a change to how the system reacts to all future requests, not to how it performs a single request. An example of this might be the user telling the system never to buy pizzas with anchovies; then even if the user asks the system to order them a random pizza flavour, that pizza will never have anchovies.

Further changes are needed in the loop to incorporate the market and user feedback.

So what does the loop look like after these changes? My current view is that there are two loops of feedback that somewhat interact, the user interaction loop and the auction loop.

  • The auction loop is where programs get control over parts of the system based on how good they have been.
  • The user interaction loop is where the user interacts with the system and chooses how to reward it based on its behavior.

It is worth noting that there will be many user interaction loops per auction loop.

[Diagram: the auction loop and the user interaction loop]

As you can see it is a fair bit more complicated. Let us go over it loop by loop.

Auction Loop

This is the simplest loop and what I have been working on currently as it needs to function before the user interaction phase can start to be built. It keeps the three phases of the loop described by Miller and Tulloh but changes other bits.

Pre

  1. Select via Auction: The programs to be in control of different bits of the system are selected via auction, they can bid on the parts of the system they want control over. Their control of the system is sometimes represented by a capability where appropriate.

Request

  1. Allow via Capability: In order to do anything, programs inside the system need the correct capabilities.
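A minimal sketch of "allow via capability"; the capability names, and the idea that the renderer won only the screen at auction, are illustrative inventions, not the prototype's actual details.

```rust
/// Sketch: a program can only act on a part of the system if it holds the
/// corresponding capability, and it gets that capability by winning the auction.
#[derive(Debug, PartialEq)]
enum Capability {
    Screen,  // permission to draw to the display
    Network, // permission to send packets
}

struct ProgramSlot {
    name: String,
    capabilities: Vec<Capability>, // granted by the auction, nothing else
}

impl ProgramSlot {
    fn act(&self, needed: &Capability, action: &str) -> Result<(), String> {
        if self.capabilities.contains(needed) {
            println!("{} performs: {}", self.name, action);
            Ok(())
        } else {
            Err(format!("{} lacks the {:?} capability", self.name, needed))
        }
    }
}

fn main() {
    // Suppose "renderer" won the auction for the screen but not the network.
    let renderer = ProgramSlot {
        name: "renderer".to_string(),
        capabilities: vec![Capability::Screen],
    };

    renderer.act(&Capability::Screen, "draw the pizza menu").unwrap();
    let denied = renderer.act(&Capability::Network, "phone home");
    println!("{:?}", denied);
}
```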

Post

  1. Monitor: The user sees how well the system is doing in general, looking at its performance on tasks. I’ll go into more details in the user interaction loop.
  2. Reward: The user gives feedback to the system; this is converted into an internal currency and given to the program(s) in control. They spread the currency to other programs. There is a base level that counts as normal feedback, and the user can decide to give less, leading to a net negative.

One major difference to most normal economic interactions (and to the original loop) is that rewarding the system is done after monitoring. The reward is information to the system about its operation, and is not required for that operation to go ahead.

The user interaction loop is more interesting and has the teaching elements in it.

User Interaction Loop

This only has two phases, the Request phase and the Post phase. There is no Pre phase, as you pretty much have to interact with the system as it is.

Request

  1. Allow Actions: You are dealing with the whole computer system, but you can still potentially restrict the actions the whole system can do for this request, if you keep some authority outside it. For example you can keep encryption keys on a YubiKey and plug or unplug it.
  2. Explain Requests: This should hopefully be similar to explaining a request to a human. Quite how this will come about I am unsure. A lot depends on how much the system can leverage machine learning techniques and practice to get this functionality, so it depends on the state of the art when it is implemented.

Post

This has its own sub-loop. The teaching loop is a loop that comes after the monitor phase. The idea is that if you see the system is performing poorly (or unethically) you can correct it in the following way.

  1. Monitor: Like the monitoring in the previous loops this is just seeing how well the system is doing. If the user finds there is a problem, the user can choose to question the system, instruct changes or just change the amount of reward the system gets.
  2. Question Reasons: Questioning reasons allows the user to ask the system why it did an action. The system has to have sufficient self-knowledge in order to be able to answer these questions. This questioning might find a problem with the knowledge, reliability or ethics of the system. At this stage the user has to instruct the system to change.
  3. Instruct Changes: The user might be able to figure out where a problem in the system is without questioning. Perhaps the user can identify that the problem is a lack of knowledge. If not the user will have to question the reason why an action was performed. The user should be able to tell the system how to change. A simple example of a change is, “In this situation don’t do X”. More complicated instructions might be to tell the system to change the goal it is aiming towards, the moral frameworks it uses or what the system pays attention to or tries to learn. The system would not use goal oriented thinking when implementing instructed changes, so it wouldn’t resist changing its goal. Goals could be restricted in scope so that they only control how it interacts with the outside world.
  4. Test Changes: The user might want to test the instruction given with appropriate tests they think up. Let us say the user saw the system playing cruelly with bugs, and the user decides to teach the system that respecting the autonomy of agents, when there isn’t a good reason not to, is good. Then the user might ask the system how it would act in various situations where the system can interact with other agents, and see whether it respects the agent in its actions.
  5. Reward Agent: The user would reward the system to cause two possible changes. Either they could give more reward than normal to solidify the behaviors the system has been instructed on, or they could reduce the amount of reward given. The user would do the latter if the system did a lot worse than expected and couldn’t be instructed to be better. This would mean it would lose total value, meaning other programs within the system are more likely to take control.
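To show how an instructed change of the form "in this situation, don't do X" could persist across requests, here is a toy Rust sketch using the anchovy example from earlier; the crude string matching is obviously not how real situation recognition would work.

```rust
/// Sketch: an instruction becomes a standing constraint that filters every
/// future request, not just the one it was given during.
struct StandingRule {
    situation: &'static str, // very crude "situation" matching, for the sketch
    forbidden: &'static str,
}

fn allowed(rules: &[StandingRule], situation: &str, action: &str) -> bool {
    !rules
        .iter()
        .any(|r| situation.contains(r.situation) && action.contains(r.forbidden))
}

fn main() {
    // Instructed once, applies to all later requests.
    let rules = vec![StandingRule { situation: "ordering pizza", forbidden: "anchovies" }];

    // Later request: "order me a random pizza flavour".
    let candidates = ["margherita", "anchovies and olives", "mushroom"];
    let choices: Vec<&str> = candidates
        .iter()
        .copied()
        .filter(|&c| allowed(&rules, "ordering pizza", c))
        .collect();

    println!("pizzas the system may order: {:?}", choices);
}
```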

This teaching loop is not restricted to my proposed system; it happens in lots of other aspects of the world. For example it happens a lot in schools, inside organizations, and also in situations where organizations are working together over long time periods. The depth and types of changes that can be instructed differ depending on context. You cannot instruct a change of belief system upon a colleague, but you can notice a skill they are missing and teach them that. Within an intelligence augmentation context there should be no limit to the type of instruction you can give the system, so it should be able to cover language, ethics, ways knowledge should be gained and what it thinks is the fundamental nature of the world!

Self-inspection

There are further ways that alignment might be maintained that I am not thinking about currently and that might need changes to my system and the loop proposed above. One interesting one is self-inspection, where elements in the system look at what is going on inside the system and let the user know if something bad is happening. They would be the digital equivalent of whistleblowers. This becomes more important if the system is operating by itself for long periods. This would no longer fit in the request/monitoring paradigm; you would need something that could go to the user directly.

How this self-monitoring should be implemented is an open question. You might want a separate simple computer system that looked inside the system and alerted on any weird changes. Or maybe there is some incentive structure you could set up within the system for self-monitoring. The last option is to have a separate agoric system that could see inside the first, which you could train to tell you if the first was malfunctioning or about to. This could protect against problems in the economic system that the second option could not, but who watches the watchmen, etc.
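For the first option, a separate simple watchdog, a minimal sketch; the metric (program replacements per day) and the threshold are placeholders I chose for illustration.

```rust
/// Sketch: a watchdog outside the main economy that counts how often programs
/// are being replaced and alerts the user if the rate looks unusual.
struct Watchdog {
    replacements_per_day_threshold: u32,
}

impl Watchdog {
    fn check(&self, replacements_today: u32) -> Option<String> {
        if replacements_today > self.replacements_per_day_threshold {
            Some(format!(
                "alert: {} program replacements today (threshold {})",
                replacements_today, self.replacements_per_day_threshold
            ))
        } else {
            None
        }
    }
}

fn main() {
    let watchdog = Watchdog { replacements_per_day_threshold: 50 };
    println!("{:?}", watchdog.check(12));  // None: looks normal
    println!("{:?}", watchdog.check(400)); // Some(alert): goes straight to the user
}
```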

Getting to Alignment in Depth

The engineer in me wants to build the economic loop and then experiment with simple systems and build up the levels of alignment slowly, empirically. But there are proofs that can be done around the economic systems to help with alignment, if people with the right background are interested (and this proves to be a fertile pathway).

However, treating things as maths is a leaky abstraction, and you might get things like Rowhammer that bust your abstraction and your alignment. So some physical experimentation would still be necessary.

You can see there is a larger scale alignment loop that just operates on the computer architecture itself (not the programs within it), with experimentation, proof and more actions to make it an align-able architecture.

On the software side there is lots of work left to be done. I don’t have crystallized ideas on how to implement the question and instruct steps of the loop. Both require a lot more work on language processing than we currently have with AI (as far as I know). As I mentioned there is uncertainty on how much this can be improved with machine learning.

I see instruct as being akin to compiling language down to code or changes to code. The agoric system of filtering changes seems like a pre-requisite for this type of work.

The question step seems like the hardest thing to do, especially when we have more machine learning involved, although it is not just equivalent to explainable machine learning. You want the agent to be able to remember all the instruction given as well, as that forms part of the explanation for a behavior and the user might want to amend the instruction.

Conclusion

I’ve presented a richer set of alignment loops for use with my system compared to those used in one-shot business transactions. These richer loops come from aligning the system over multiple requests, as well as within a single request. One of these loops has analogies with teaching and training in the real world.

I’ve identified a further possibility for alignment, self-inspection, that needs further exploration.

I hope that alignment in depth can be seen as a valid strategy, not only for my intelligence augmentation system but for other AI systems, rather than relying on a single alignment method.

Research Syndicates

Let us assume we want to encourage people following high-risk, long-term research paths. How might you do this without also just getting people who suck up the energy and don’t produce good research?

One possibility I’ve been musing about is research syndicates. The idea is that it might be a good idea for groups of disparate researchers to band together and form a syndicate. The syndicate’s main purpose would be a form of income sharing: a portion of any income going to one member (from a grant or other source) would be shared with the rest. Let’s say 75% to the member that got the income and 25% split between the rest.
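To make the split concrete with made-up numbers (currency units arbitrary): in a five-person syndicate, a member who lands a 100,000 grant would keep 75,000, and the remaining 25,000 would be split as 6,250 to each of the other four members.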

Why might this work?

1) It helps young researchers get some income to get going with a promise that they will pay back to the syndicate if their research pans out.

2) There would be incentives to mentor the other researchers

3) Being part of a research syndicate would be a form of social proof, so members are likely to be looked on more seriously.

4) People being added to the syndicate is a costly signal on behalf of the syndicate members.

5) Spreads risk between research teams

6) Technical specialists are more likely to be able to recognise interesting researchers.

There would be a need to have exit clauses worked out, probably ones that are pretty costly to the people that want to exit or to force someone else out.

Anyone come across something like this being tried out? I think the activation energy for this is probably the legal work necessary.

Decision Alignment Loop for IA

So I mentioned in my previous blog post that the alignment loop for my flavor of IA looks a bit different from when you assume a principal-agent divide.

To recap the separate agent loop:

The model of decision alignment comes from economics and attempts to solve the principal-agent problem. It specifies three different periods in which a principal can try to impact the alignment of the agent during a request from the principal to the agent. These are pre-request, during the request, and post-request. Each period has a number of possible actions. They form a loop, diagrammed below.

[Diagram: the Pre/Request/Post decision alignment loop]

IA Alignment Loop

The IA alignment loop is very similar and shares a number of steps. There are a few differences though.

First off there is no inspection (not at this level at least; there is inspection when designing the architecture, and that will be an interesting diagram to draw) as we expect the inner workings to be somewhat opaque. There are two loops of feedback that somewhat interact, the user interaction loop and the auction loop. It is worth noting that there will be many user interaction loops per auction loop.

[Diagram: the auction loop and the user interaction loop]

As you can see it is a fair bit more complicated. Let us go over it loop by loop.

Auction Loop

This is the simplest loop and what I have been working on currently as it needs to function before the user interaction phase can begin. It keeps the three phases of the principal-agent loop but changes other bits.

Pre

  1. Select via Auction: The programs to be in control of different bits of the system are selected via auction, they can bid on the parts of the system they want control over. Their control is represented by a capability.

Request

  1. Allow via Capability: In order to do anything agents need the correct capabilities.

Post

  1. Monitor: The user sees how well the system is doing in general
  2. Reward: The user gives feedback to the system; this is converted into an internal currency and given to the program(s) in control. They spread the currency to other programs. There is a base level that counts as normal feedback, and the user can decide to give less, leading to a net negative.

One major difference is that rewarding the system is done after monitoring. As it is not a separate entity it does not need to be paid beforehand. The reward is information to the system about its operation, not required for that operation to go ahead. It can easily be seen as a stripped-down version of the principal-agent loop. The user interaction loop is more interesting.

User Interaction Loop

This only has two phases, the Request phase and the Post phase. There is no Pre phase, as you pretty much have to interact with the system as it is, just as you have to interact with parts of your brain as they are, in the moment.

Request

  1. Allow Actions: You are dealing with the whole computer system, but you can still potentially restrict the actions the whole system can do for this request, if you keep some authority outside it.
  2. Explain Requests: This should hopefully be similar to explaining a request to a human. Think Siri but better.

Post

This has its own sub-loop. The correction loop is a loop that comes after the monitor phase. The idea is that if you see the system is performing poorly (or unethically) you can correct it in the following way.

 

  1. Monitor: Like the monitoring in the previous loops this is just seeing how well the system is doing. If the user finds there is a problem, they can choose to do one of 2, 3 or 5.
  2. Question Reasons: Questioning reasons allows the user to ask the system why it did an action. The system has to have sufficient self-knowledge in order to be able to answer these questions. This questioning might find a problem with the knowledge, reliability or ethics of the system. At this stage the user has to instruct the system to change.
  3. Instruct Changes: The user might be able to figure out where a problem in the system is, perhaps a lack of knowledge. If not, the user will have to question the reason as above. The user should be able to tell the system how to change. The simplest change is, “In this situation don’t do this”. More complicated instructions might change the moral frameworks or what the agent pays attention to or tries to learn.
  4. Test Changes: The user might want to test the instruction given with appropriate tests they think up.
  5. Reward Agent: The user either gives reward to solidify the things the system has been instructed on, or gives net negative reward so that other programs can take control.

The correction loop happens in other aspects of the world: it happens a lot in schools, inside organisations, and also in situations where organisations are working together over long time periods. The depth and types of changes that can be instructed differ depending on context. You cannot instruct a change of belief system upon a colleague. Within an IA context there should be no limit to the type of instruction you can give the system, so it should be able to cover ethics, ways knowledge should be gained and the fundamental nature of the world!

Conclusion

I’ve presented a richer set of alignment loops for use in an IA context compared to those used in normal one shot business transactions. They are probably not the end of the tale when it comes to methods of alignment, but they seem like a better attempt than pure reinforcement learning methods of alignment.

I should probably spend some time to give a high level view of what I think internal program architecture may look like in the end. But I expect a large chunk of the internals will be gained via instruction.

 

Response to Cyber, Nano, and AGI Risks

I was heartened to read an article by Christine Peterson, Mark S. Miller, and Allison Duettmann about a decentralized approach to risk. I know Mark Miller from captalk and his work on agorics; I’ve not come across either of the other two before, but I’ll keep an eye out for their work.

My plan is to quickly skirt over the parts of the piece where I agree with it or have no strong opinions (bio/nano tech) and then dig into more detail where I disagree.

My main points of agreement are that we need to solve the security problem while solving the AGI problem, and that we should rely on strengthening civilisation rather than trying to create a benevolent overlord, with the hard philosophical problems that entails. Another thing that really resonated with me is the idea of depessimising. Depessimising is contrasted with optimising: it is aiming for a space of outcomes that is not terribly bad, rather than aiming for the best of all possible outcomes. I also share the worries about supply chain risk; this seems important even for people interested in sovereigns, because unless you make re-architecting your computers a priority, your attempt at a sovereign might get subverted by someone who can subvert the Intel AMT or similar.

Cyber security

I agree with the authors that one of the most vulnerable sections of the computer is the OS and that moving to SELinux and object-capability-based systems would help some. However, those things push a need for more understanding and complexity onto the user (unless they are very carefully designed). The user, for me, is the second most vulnerable part of a computer system, so I would rather not off-load more onto them. For very important secure systems like the national grid they do seem appropriate, however.

Model of Decision Alignment

This is a very interesting section. It is a summary of a talk here. I’m going to go into the talk because it is very relevant to the work on alignment in general.

The model of decision alignment comes from economics and attempts to solve the principal-agent problem. It specifies three different periods in which a principal can try to impact the alignment of the agent during a request from the principal to the agent. These are pre-request, during the request, and post-request. Each period has a number of possible actions. They form a loop, diagrammed below.

Pre

Before the transaction you can

  1. Select an agent based on past experience or other criteria
  2. Potentially inspect the internals: This would only happen on big transactions (e.g. buying a house or big orders where factories are inspected)

Request

During the request you can

  1. Allow certain actions by giving them permission to do only certain things, e.g. giving them keys or certain bits of data.
  2. Explain what you want them to do. When you commission a bit of software you don’t just give someone money and then expect the correct outcome, you explain what it is you want.
  3. Reward the agent: Giving them money

Post

  1. Monitor Effects: Does the agent do what you want? This can impact alignment: if the agent knows that the principal is monitoring it, it can change its behavior.

[Diagram: the Pre/Request/Post decision alignment loop]

This forms the loop above. During the talk they go into how this applies to an object oriented programming model, splitting it into two phases, the programming/sys admin (Pre/Post) phase and the running program phase (request).

First off, I like this a lot; it is a much richer model than trying to align computer systems purely with a utility function. I hope that the AI alignment community adopts and critiques this model, as they are dealing with a separate agent and it makes sense in that context. However, it needs some work to adapt it to an intelligence augmentation setting. I’ll do that in a later blog post.

One of the interesting insights from the talk was the fact that each of these alignment tools alone is insufficient, or at least gets exorbitantly costly, as a way to ensure alignment. But combined they can reduce the cost of getting alignment, as long as they are sufficiently different.

Securing Human Interests

This section is interesting: it proposes giving humans capital in the shape of real estate in space, and relying on it being economically valuable due to capital being productive, in what is called an “Inheritance Day”. It has many challenges (the slicing up of space being a big one), but one challenge that it does not talk about is the potential for humans to get swindled out of their inheritance. If we are assuming superintelligence and un-augmented humans, then it seems likely that somewhat malign superintelligent AGIs will be able to swindle some humans out of their space, as it is tradeable; if they can find a chink in the law they may be able to exploit it. So I think, even in this scenario, we still need to find a way to upgrade humans to become somewhat able to protect themselves from somewhat malign AGIs.

Conclusion

All-in-all a very interesting piece. I hope we manage to get a constructive conversation about the merits of centralisation versus decentralisation and expand our ideas of how to achieve alignment.

 

Blog changes and future direction

I’ve decided to change my blog so that it talks less about there being a community (there is not), so it is more representative of what exists. I would like to find more people interested in this kind of thing, but I suspect, by the nature of the type of person interested in autonomy, it is always going to be a loose group. I’m also going to talk about small groups being autonomous rather than individuals, as that seems more realistic and also more psychologically appealing.

I’m thinking about how to organize my thoughts. I’ve got a fair amount to say about the following subjects.

  • Charting a course: Trying to have accurate viewpoints about the world so that we can pick not-too-wrong paths with regards to autonomy. This involves my AGI estimation work.
  • Agoric work: This should feed into charting a course. I plan to concentrate on this so I can write an interesting paper. I had hoped to get people interested in talking about it before delving into it more, but that seems like a distant dream.
  • Philosophical work: Why focus on autonomy etc.

It would be good to have a way to split these things up so people can find the things they are interested in, especially when coming to the site for the first time.

I’m going to disengage with EAs/rationalists for a bit as they are having their own discussions and I find myself getting sucked into them more than is useful.

Plans for this week.

  1. Write some more about the decentralized approaches to risk
  2. Experiment with agorics. Working towards getting some basic learning functionality working. Starting with tracking some frequencies.
  3. Fix all the bugs I find while I’m doing the above.

Things I am not doing but I’m thinking about.

  1. Writing an article on the lessons from agile to our reaction to impending AI.

Musings on the radioactive decay model of existential risk

Daniel Dewey kindly sent me a copy of his paper where he defines the Exponential Decay model as:

Exponential decay model: A long series of AI projects will be launched and run, each carrying non-negligible independent existential risk. Even if each project carries a small chance of catastrophic failure, as more and more teams try their luck, our chances of survival will fall very low.
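A rough worked version of the model (the per-project risk and project count here are my own illustrative numbers, not Dewey’s): if each project independently carries catastrophe probability p, the survival probability after n projects is (1 − p)^n, so even a 1% per-project risk leaves only about 0.99^100 ≈ 0.37 chance of survival once a hundred projects have run.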

This assumes a non-diminishing chance of catastrophic failure over time. I think we should expect a complex relationship between risk and technology. If we can improve computer security to the point where intrusions are hard for humans, then deployments of AI technology should make it harder for a bootstrapping AGI to get a foothold, even if those deployments aren’t specifically designed to reduce existential risk.

On the first question, can we improve computer security fundamentally? There are a few different possible solutions currently in the works, from the nearest-term to the most speculative:

  • Languages like Rust make it easier to write secure code compared to languages like C.
  • Concepts like capability-based security make the OS level more secure.
  • Agorics and variation could mean individual computer systems still have vulnerabilities, but different vulnerabilities from other systems, so they are harder to exploit subtly.
  • Narrow superintelligent penetration tests could be run against systems before they are released into the wild.
  • Narrow superintelligent intrusion detection systems.

The importance of improving security has been discussed somewhat in Cyber, Nano, and AGI Risks: Decentralized Approaches to Reducing Risks. That paper deserves its own blog post, as I don’t agree with all of it.

So if we do manage to make taking over computers fundamentally more difficult, why would general progress on AI make hard take-off harder? This is because it would eat up optimisation space. An AGI would not be competing against a squishy human in the stock market, but against a number of different narrow AI systems with razor-sharp reflexes and tonnes of data. It also would not be able to take over the whole internet, having to rely instead on making money and buying machines/cloud time. If things like Mechanical Turk tasks get automated by narrow AI, it reduces the ease with which another project can gain the money it needs to buy the hardware.

This argument does assume that humans are not easily hackable by an AGI. But if humans are moderately hackable by narrow AIs, we will have developed tools and societal mores that protect humans from narrow AIs and AGIs alike.

So my initial thoughts are that the more human civilization is itself taking off (with secure computers), the harder it is for one AGI project to get a significant foothold. So the risk diminishes over time with more AI projects.

In reality it is probably more complex than this, but it seems an interesting thread of thought.

Sifting the world’s knowledge (AGI Estimation 2)

I talked a bit about an AGI’s likelihood of doing a hard take-off last time, and used Matt Might‘s view of human knowledge as an intuition pump. The circle below represents a human’s knowledge: they have done elementary school, high school, college and postgraduate work, and are just touching the edge of human knowledge in their specialism.

[Diagram: Matt Might’s illustration of human knowledge]

So the question is, what will an AI’s relationship to the circle be? Will it expand to fill it all without any human help and go beyond it, by pure induction? Or will it read our research papers and textbooks to help fill out its knowledge, and then expand out more slowly once it has to rely on induction?

I’m going to talk about the second scenario here, because the first seems computationally/informationally infeasible. We should however look for the warning signs that it will work by pure induction.

Quagmire

The above picture is a fairly optimistic view of human knowledge. It assumes that filling in any bit of the circle is worthwhile and useful. However, not all of human knowledge is like that. Human knowledge contains:

  • Relatively pointless things for an AI, like porn.
  • Potentially confusing things, like fiction. This can vary in confusingness depending on how true to life it is and how much it claims to be fact.
  • Potentially confusing things that look like useful things, e.g. papers caught up in the replication crisis.
  • Purposefully harmful things, like cult propaganda or chain letters.
  • Purposefully misleading things mixed in with helpful things, like covert propaganda.

So the picture is more like:

[Diagram: the same circle of knowledge with incorrect or misleading sections shown in black]

The black sections of knowledge are incorrect or misleading. What happens if the AGI absorbs the bad sections? If it is just as smart as Newton, it could get stuck on whatever our current equivalent of alchemy is and believe in God, spending its cycles on worthless pursuits.

People who manage to achieve things in the world avoid these pitfalls by having developed a taste or an aesthetic that steers them away from the pointless/harmful stuff. You can see these rules of thumb for judging mathematical breakthroughs as trying to transmit a way of sorting the wheat from the chaff. So an important consideration for how effective an AGI will be is the nature of its aesthetic.

So there are a few questions:

  • Are aesthetics general?
  • How are aesthetics learnt? And is it likely to be quick/easy?

There seems to be at least one general aesthetic, that of consistency with other knowledge. We don’t trust things that are merely self-consistent by themselves (apart from in maths). But if we can hook knowledge into predicting what we experience day to day, we can build some trust in it.

What we really want though is a way of going beyond day to day experience. Is the atomic mass of Uranium really 238.02? Are there millions of species of bacteria in the air around me that cannot be cultured? Is saturated fat really bad for humans to eat in the long term? Is the economy of China doing well?

From a Kuhnian paradigm point of view, aesthetics seem less general. Each science adopts different aesthetics and different things that count as evidence and research within that science. If an AGI had to learn the aesthetics of each discipline individually to distinguish good from bad research, that might take significant amounts of time, if it is relying on expert humans to help it develop the aesthetic.

Can we look at people and see how well they avoid bad information outside their specialism? Are there people that do especially well at picking fact from fraud across many areas, perhaps a cousin to the super forecaster?

A related need for this type of AGI seems to be threat detection. If you can’t recognise that some information is menacing and shouldn’t be trusted, then you are probably in for lots of pain.

Conclusion

I have argued that an information aesthetic is vital for an AGI to effectively absorb human knowledge and not end up completely confused or worse. That aesthetic seems only partly general and may take significant amounts of time/expertise to be imparted for more esoteric fields. More work can be done to see if there are any humans with generalised truth seeking aesthetics.

AGI Estimation

We are worried about what happens when we make a system that can do the important things humans can do. There are a number of productive ways we can react to this, not limited to:

  1. Attempt to make AI systems controllable by focusing on the control problem
  2. Attempt to predict when we might get AGI by looking at progress, e.g. AI Impacts
  3. Attempt to predict what happens when we get artificial general intelligence by looking at current artificial intelligences and current general intelligences and making predictions about the intersection.
  4. Figure out how to make intelligence augmentation, so we can improve humans abilities to compete with full AGI agents.

This blog post is looking at the third option. It is important because the only reason we think AGIs might exist is the existence proof of humans. Computer AGIs might be vastly different in capabilities, but we can’t get any direct information about them; we just have to adjust our estimates of what humans can do based on the differences between humans and narrow AI on different tasks.

There are three core questions to AGI Estimation that I have identified so far.

  1. What do we mean by ‘general intelligence’?
  2. How important is ‘general intelligence’ to different economic activities?
  3. How good are narrow AIs, compared to humans, at parts of the processes involved in general intelligence? Can we estimate how good an AGI would be at another task based on that comparison?

Generality

We often speak and act as if there is something to one person being better than another at mental acts in general. We try to measure it with IQ. However, there are a number of different things that the “general intelligence” which leads to better performance might operationalise to.

  1. Better at tasks in general.
  2. Better at absorbing information by experimentation, so that per bit of information, task performance improves more quickly.
  3. Better at holding more complex behaviours, so that the limit of task performance is higher.
  4. Better at understanding verbal and cultural information, so that they can improve task performance by absorbing the information about tasks acquired by other people.
  5. Better in general at figuring out what information is relevant and how it should be formatted and collected, so that they can select the correct paradigm for solving a problem.

Number 1 seems trivially untrue. So we shouldn’t worry about a superintelligent AGI system automatically being better at persuading people, unless it has been fed the right information. We don’t know what the right information is for a superintelligence to get good at persuading, so it is probably still worthwhile being paranoid.

My bet is that “generally better” is normally a mixture of aspects 2-4, but different aspects will dominate in different areas of economic activity. Take physics as an example: I suspect that a human’s potential is mainly limited by 4, now that we have supercomputers that can be programmed so that complex simulations can be offloaded from the human brain. If we throw more processing power and compute at the “general intelligence” portion of physics and this mainly improves 4, then we should expect an AGI to get up to speed on the state of the art of physics a lot more quickly than humans, but not necessarily to get better with more data. If 5 can be improved in general, then we should expect a superintelligent AGI physicist to arrive at a completely different viewpoint on physics and experimentation far more quickly than humans.

We can investigate this by looking at task performance in humans. If some people are a lot better at absorbing information across a large number of tasks, then we should expect AGIs to be able to vary a lot on this scale. If variation in humans’ general task performance depends heavily on being able to access the internet and books, then we don’t have good grounds for expecting computers to go beyond humanity’s knowledge easily, unless we get an indication from narrow AI that it is better at something like absorbing information. A rough sketch of this kind of comparison is below.
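One way to put numbers on aspect 2 is to compare how quickly task performance improves per unit of experience for a human versus a narrow AI on the same task. Here is a minimal sketch under made-up data; the log-linear fit and every number in it are illustrative assumptions, not measurements:

```python
# Illustrative sketch: compare sample efficiency (aspect 2) by estimating how
# fast performance improves per unit of experience. All data points are made-up
# placeholders; real curves would come from human studies and narrow-AI training logs.

import math

def fit_learning_rate(experience, performance):
    """Crude least-squares fit of performance ~ a + b * log(experience); returns b."""
    xs = [math.log(x) for x in experience]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(performance) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, performance))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical curves: performance after 10, 100 and 1000 units of experience.
human_rate = fit_learning_rate([10, 100, 1000], [0.3, 0.5, 0.6])
ai_rate    = fit_learning_rate([10, 100, 1000], [0.2, 0.5, 0.8])

print(f"Human learning rate per log-unit of experience:     {human_rate:.3f}")
print(f"Narrow-AI learning rate per log-unit of experience: {ai_rate:.3f}")
```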

Improving Generality and Economic Activity

However, we are not interested in all tasks; we are mainly interested in the ones that lead to greater economic activity and impact upon the world.

So we need to ask how important human brains are in economic activity in general, to know how much impact improving them would have.

Take programming a computer (an important task in recursive self-improvement): how big a proportion of the time spent on the task is human thinking, versus the task as a whole? When an AI researcher is trying to improve an algorithm, how much of the time needed to produce a new algorithm variant goes into thinking versus running the algorithm with different parameters? If it takes 3 days for a new variant of AlphaGo Zero to train itself and be tested, and 1 day for the human to analyse the result and tweak the algorithm, then speeding up the human portion of the process 1000-fold won’t have a huge impact (a quick calculation is below). This is especially the case if the AGI is not appreciably better at aspect 2 of ‘generalness’: it will still need as many iterations of running AlphaGo Zero as a human would to be able to improve it.
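A back-of-the-envelope, Amdahl’s-law-style calculation makes the point. This is a minimal sketch using only the illustrative numbers from the paragraph above; none of them are real measurements of AlphaGo Zero:

```python
# Amdahl's-law-style estimate of how much speeding up only the human part of the
# loop helps. The numbers are the illustrative ones from the paragraph above,
# not real measurements.

train_days = 3.0    # train and test a new algorithm variant
human_days = 1.0    # human analyses results and tweaks the algorithm
speedup    = 1000   # hypothetical speedup of the human portion only

old_iteration = train_days + human_days            # 4.0 days per iteration
new_iteration = train_days + human_days / speedup  # ~3.001 days per iteration

print(f"Overall speedup per iteration: {old_iteration / new_iteration:.2f}x")  # ~1.33x, not 1000x
```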

The impact of what are we improving

This brings us to the question: when we create superhuman narrow AI systems, which aspects of the system are superhuman? Is it the ability to absorb knowledge, the amount of knowledge the system has been given (as it can be given knowledge more quickly than a human), or its ability to hold a more complex model than we can hold in our heads?

If we determine that the superhuman aspects are the ones that are very important for different economic activities, and that the aspects we have not improved are unimportant, then we should increase our expectation of the impact of AGI. This exercise should hopefully stop us being too provincial in our predictions about what AGI will be capable of doing. A rough sketch of this kind of weighting is below.
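One way to make that reasoning concrete is to weight the narrow-AI-versus-human ratio on each aspect of generality by how much that aspect matters to a given economic activity. This is a minimal sketch; the aspect names, weights, and ratios are all hypothetical placeholders rather than measurements:

```python
# Hypothetical example: weight per-aspect AI-vs-human performance ratios by how
# much each aspect matters to a task, to get a crude estimate of relative impact.
# Every number below is a placeholder, not data.

# Ratios > 1 mean current narrow AI looks better than humans on that aspect.
ai_vs_human = {
    "absorb_by_experiment": 0.8,  # aspect 2: sample efficiency
    "complex_behaviours":   2.0,  # aspect 3: capacity for complex models
    "cultural_absorption":  1.5,  # aspect 4: ingesting recorded human knowledge
}

# How much each aspect matters for the task in question (weights sum to 1).
task_weights = {
    "absorb_by_experiment": 0.5,
    "complex_behaviours":   0.2,
    "cultural_absorption":  0.3,
}

estimate = sum(task_weights[a] * ai_vs_human[a] for a in task_weights)
print(f"Crude weighted AI-vs-human estimate for this task: {estimate:.2f}")
```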

A first stab at the kind of reasoning I want to attempt is here, where I look at how quickly AlphaGo Zero learns compared to humans based on the number of games played. It might be a bad comparison, but if it is, we should try to improve upon it.

Conclusion

I have outlined a potential method for gaining information about AGI that relies mainly on experiments on human task performance.
These are early days, but we do not have many options when it comes to getting information about AGI, so we should evaluate every possible way of getting it.

Centralisation shapes people into Cynics

This is very tentative; it is mainly generated by thinking about the ways I would need to change in order to influence the centre.

By Cynics I mean the sort of person pointed at in the Gervais Principle, The Thick of It or Yes, Minister. They are not everywhere in the centre.

It is not just cynicism, but a need for power.

An idealist wants to change the world

You, an idealist, notice something wrong with the world. It needs fixing and you are the one to fix it.

There are two paths you can go down. You can try to influence the centre – and the stronger the centre, the more tempting this is. Or you can try to influence the people. To influence the people you give them something: a new technology (like Linux) or a new idea. You don’t know exactly what people will do with the new thing, so this seems risky. So you try to influence the centre.

Power is in the centre

In our capitalist, statist system, power is in the centre. If you can grab some of it, you can redirect some of it to fix the problem you’ve noticed. Power being in the centre is not a fact of the world. For the majority of life’s history power has been decentralised: a tree mainly grew by itself, helped by the trees around it providing shelter but also competing with them for resources. It didn’t need planning permission to grow, or rely on a central bank to maintain liquidity in a currency so that it could take out a loan to pay for its resources. Centralisation is an artifact of our current technology level; with more technology we may be able to be more decentralised.

To get power you have to have influence

So once you have decided to go down the central path, you need to place yourself in the centre. People in the centre don’t just let others come waltzing in: psychopaths do not want to lose control, and other Idealists need to be convinced that your idea is good. You need to convince them, in some way or for some reason, that you should be in the centre. So you need influence.

To get influence you have to be ruthless

The more power you need, the closer to the centre you want to go, so the more influence you need. Influence is zero-sum, so you always have to be on the lookout for people backstabbing you, and you have to take the opportunities for influence that you get, even if doing so hurts other people. You have to make the consequentialist decision that hurting them is worth it to get the power to reach your ideal.

Once you have power you have to spend power to maintain it

Everyone near the centre is playing this ruthless game. They are nipping at your heels. You can gain allies by giving them some bits of power, or you can spend your time and energy trying to head off other threats. You have to be skeptical of, and afraid of, new ideas in case they are plans to erode your power base. Paranoia lives in the centre.

So you do not have all the power you thought you would have to achieve your goals.

You may lose the ability to be flexible

Other people will use admissions of failure or of being wrong as tools to bring you down. You cannot admit your mistakes quickly, so you may be locked into a path you have chosen even if it is suboptimal, because changing tack admits weakness and people lose trust.

You may lose the time to think about what should be done

Even if you are still sure about fixing the problem, you may not have enough time to figure out how the power should be diverted to fix it. Because you need to work hard to maintain your power, you cannot relax and be creative or find novel ways to solve the problem you have. You can divert some power towards solving it by delegating, but you may just be diverting it to another Cynic in the making who is trying to get closer to the centre, so the problem may not get solved.

The Cynic’s curse

This doesn’t seem like the good life. It is not relaxed or fun or happy. People should be able to trust other people and have time to be creative. The best worlds are ones where you don’t try to control the world and you don’t need to try to control the world.

This might seem like the end of the Idealist’s path to becoming a Cynic, but it is not all of it.

You may oppose decentralisation as it erodes your control

So you maintain the current centralisation of the world, or even make suggestions to strengthen centralisation, so that you can gain more power over the world and one day make it a better place. This is a Cynic reflex, trained by the Skinner box of striving to get to the centre to achieve your goal.

There are good reasons to oppose decentralisation, but you should be able to tell whether people have the Cynic reflex or legitimately think centralisation is better. If both the positives and negatives of centralisation and decentralisation are discussed, and not just given lip service, then you are in a good state. If decentralisation is never discussed, then it is likely you have fallen into the Cynic reflex.

It is not their fault; the game made them this way. So don’t hate the player, and don’t even hate the game. Hate the centralisation that causes and shapes Cynics and makes the game occur.