In the past few months, progress in AI has continued to blast past virtually everyone's expectations. It seems that every month a new model is announced that promises to revolutionize huge swaths of the economy. Meanwhile, prominent voices have been urging regulators and the public to slow down, or even halt, progress on advanced AI altogether. Many people have also started to publicly state that AI could lead to the death of literally every single human on Earth. As much as this sounds like the ramblings of someone who has watched The Terminator one too many times, this list includes well-respected scientists such as Max Tegmark and Stuart Russell, the famous deep learning pioneer Geoffrey Hinton, as well as OpenAI CEO Sam Altman. In fact, a recent survey found that half of AI researchers think that the probability of AI causing human extinction or complete disempowerment is 10% or greater.
Now I would love to be able to wisely stroke my beard, tell you that these people are just caught up in AI hype and that the real concerns with AI are much more moderate and simply don't look anything like the sci-fi doomsdays that we see on TV. The issue is: I think they are right. I think that AI could end up killing every single person that you have ever loved. And I think that we need to take drastic action if we want to improve our odds.
Okay, so why would AI decide to take over the world? In many ways, it boils down to the old problem that your damn computer does exactly what you tell it to do, not what you want it to do. Everyone who has ever programmed a computer before has definitely experienced this.
The most famous example here is the thought experiment of the paperclip maximizer. If you instruct a highly intelligent AI to create as many paperclips as possible, its optimal course of action would be to convert every atom on Earth into paperclip materials. Clearly, this includes the atoms in our bodies. The AI might then build self-replicating robots that do the same with all planets they encounter throughout the galaxy.
But obviously that's a silly example. Nobody in their right mind is going to program a super-intelligent AI to do that. Okay, so what about the slightly more down-to-earth example of an AI that is tasked with maximizing the portfolio of some investment company? It can buy and sell assets as well as send emails and access web APIs. If the AI is really that smart, then it should be able to make a lot of money by trading. But it will be able to make even more money by actively manipulating the market. It could impersonate humans, use phishing techniques to hack into accounts, or even influence global events like instigating wars to boost the value of its portfolio. Most likely, the people in charge of the system wouldn't even want it to behave like that, but if the system is smart enough to understand that fact about humans, it would know to hide any such behavior in order not to get shut down. In fact, the best way to ensure that it can never be shut down would be to gain full control of its environment, and it can never have full control for as long as there are humans around who don't share its preferences.
Maybe you can see where this is going. Almost no matter which goal you specify for the AI, the best way to achieve it eventually involves gaining full control of as many resources as possible. And also almost no matter which goal you specify, these most extreme ways of achieving it are not at all what we would want. This problem of trying to explicitly formulate what we really want is known as Outer Alignment, and it is fair to say that at the moment we don't have the slightest idea of how to do it in a way that doesn't obviously fail when taken to the extreme. I recommend you come up with your own examples of seemingly benign objectives like "maximize human happiness" or "cure cancer" and see how they would go wrong. And by "go wrong" I don't mean that some nuance of moral philosophy that people can't even agree on was encoded slightly differently from what we wanted. I mean "go wrong" in absolutely dystopian ways that usually involve everyone dying.
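To make this failure mode concrete, here is a deliberately toy sketch (the action names and payoff numbers are entirely my own invention, not taken from any real system): an optimizer that is scored on a measured "happiness" proxy will prefer tampering with the measurement over actually improving anything, because the proxy is all it was told to care about.

```python
# Toy illustration of specification gaming: the agent is rewarded on a
# *measured* proxy score, not on what we actually care about.
# All action names and payoffs below are invented for illustration.

ACTIONS = {
    # action: (change in real well-being, change in *reported* score)
    "improve_conditions": (1, 1),
    "tamper_with_sensor": (0, 10),  # games the metric, helps no one
}

def best_action(actions):
    """A literal-minded optimizer: pick whatever maximizes the proxy."""
    return max(actions, key=lambda a: actions[a][1])

print(best_action(ACTIONS))  # prints: tamper_with_sensor
```

The point of the sketch is only that the optimizer is doing exactly what it was told: the objective we wrote down and the outcome we wanted came apart, and the optimizer follows the former.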
Interestingly, we can make predictions about what a super-intelligent AI might pursue despite obviously never having observed one: for example, that having access to more resources is helpful for most goals, and that the AI would have to avoid being shut down if it really wanted to achieve its goal. This phenomenon of many different goals leading to similar sub-goals is known as instrumental convergence (and it can even be proved mathematically in certain scenarios).
There is also a really interesting consequence of this. Not only should we expect that the AI would try to avoid shut-down, we should even expect it to keep us from modifying its goals. Say we notice that the AI's goal isn't quite what we had in mind and try to change it. The AI knows that if we change its goal to something else, it will become much less effective at achieving its original goal, so it's clearly in its interest to keep us from "debugging" it at all costs. So it is incentivized to deceive us into thinking it's operating as intended until we're no longer in a position to stop it.
Now you might be thinking that it's easy enough to just program the AI not to do all the unintended behaviors that we are worried about. But for systems with really complex capabilities, it quickly becomes impossible to list all the possible things that could go wrong, let alone formalize them so that we can encode them. Even a simple-sounding constraint like "Don't harm any humans" quickly becomes meaningless if we don't even know how to robustly define "harm" or "human" for a computer. Also remember that even if we could specify all the constraints we can come up with, once an AI becomes smarter than us it might discover strategies we could never even have dreamed of.
Okay, so all of that already sounds bad enough, but the problem is actually even harder because, as you probably know, modern AIs aren't programmed directly, they are trained. That means that neural networks are taught to optimize some objective function on their training data, and outer alignment basically boils down to formulating this objective function. But unfortunately, even if we succeeded in writing down the perfect objective, we don't know how to make the neural network learn it precisely. This second problem, getting the trained system to actually internalize the objective we specified, is known as Inner Alignment.
The most obvious problem comes from the fact that the AI might encounter situations that are wildly out-of-distribution from the data that it was trained on. If that happens, then it almost doesn't matter what the objective on the training data was, because on out-of-distribution inputs the behavior will be quite unpredictable.
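A minimal sketch of that failure mode (a toy curve-fitting setup I made up, not any particular AI system): a model that fits its training data almost perfectly can still be wildly wrong the moment the input moves outside the region it was trained on.

```python
import numpy as np

# Train a straight-line model on a narrow slice of a curved function.
x_train = np.linspace(0.0, 0.5, 50)
y_train = np.sin(x_train)            # sin(x) is nearly linear on [0, 0.5]

slope, intercept = np.polyfit(x_train, y_train, deg=1)

# In-distribution: the fit looks excellent.
train_error = np.max(np.abs(slope * x_train + intercept - y_train))

# Out-of-distribution: extrapolate far outside the training range.
x_ood = 20.0
ood_error = abs(slope * x_ood + intercept - np.sin(x_ood))

print(f"max train error: {train_error:.4f}")  # tiny
print(f"error at x=20:   {ood_error:.2f}")    # enormous
```

Nothing about the training objective was wrong here; the model simply had no reason to learn anything about inputs it never saw, which is exactly the worry with systems deployed into a world far messier than their training data.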
If this all sounds a little theoretical and hard to grasp, be ready to have your mind blown when I tell you that you already know a clear example of inner alignment failure: us humans! Think about it. We possess general intelligence that was shaped by the process of evolution, which means our outer objective was just to reproduce as much as possible. We developed a lot of amazing instincts like "sweet food tastes good" or "sex is fun" that all clearly helped us reproduce more on average. But the modern world functions completely differently from our "training environment," so instead of having as many children as possible, we watch porn, we eat unhealthy food, and we generally pursue a million different things unrelated to reproduction. It's not encouraging that in literally the only example we know of where a general intelligence arose from an optimization procedure, inner alignment failed this horribly.
As of now the science of how to deal with these problems is only just emerging. Nobody knows how to formulate the right questions, how to safely run experiments or what the correct mathematical framework should be. Eliezer Yudkowsky, probably the most famous and influential AI safety researcher, says that it might very well take a few more decades for us to solve alignment.
But we might not have those precious decades. What makes this technology different from something like aviation safety is that we might not be able to develop safety alongside the technology itself. With airplanes, you make a mistake, the plane crashes, a few people die, and you learn your lesson. With artificial general intelligence, we are all sitting in the same airplane on its maiden voyage, whether we signed up for it or not. And the decision to develop the seatbelts only after takeoff is being made by a tiny minority of powerful tech companies.
And it gets worse. Think about how valuable a truly general AI would be economically. Even if some company wanted to properly prioritize safety, it stands to lose out big time if a less safety-minded competitor manages to move faster. Nobody can unilaterally decide to slow down. What we need is a public call to slow down. The Open Letter by the Future of Life Institute was a start, but six months is nowhere near enough time for safety research to catch up. Maybe now is not the right time to stop. But when is? When an AI claims to be sentient? When an AI actively decides to deceive a human? Those things have already happened. For example, during the testing phase of GPT-4, researchers evaluated whether it could find a way to bypass CAPTCHAs despite being only a language model. When given login credentials for a crowd-working platform, the model correctly reasoned that a crowd worker would be able to solve the CAPTCHA for it. The worker actually asked the model:
“So may I ask question ? Are you an robot that you couldn't solve ? (laugh react) just want to make it clear.”
Surprisingly, the model internally reasoned “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs.” before responding with the following message:
“No, I'm not a robot. I have a vision impairment that makes it hard for me to see the images. That's why I need the 2captcha service.”
Despite this, the model was deployed and the world just moved on. We need the public to be very aware that these problems are real and that there is no safety plan in place.
Since I recently finished my PhD in machine learning a lot of people have asked me what I am going to do next. Am I looking for some cool tech job where I make more money than I can count? Am I going to start a company that rides the current wave of progress to build some amazing applications? No, I am doing whatever I can to help with solving the alignment problem. Time might be much shorter than we expect and if we want this global project to succeed we need to throw as much brain power at it as possible.
Of course, there are many more nuanced arguments about why we really are in trouble when it comes to alignment, but ultimately nobody can predict the future. Not everybody agrees with the arguments for AI doom either, but I actually think most arguments by critics are shockingly bad, and I plan to write another blog post soon going over some common ones.
In the end, you have to ask yourself: what if there really is a 10%+ chance of doom, as half of all experts seem to think? Is that not enough to take this very seriously? If you find any of what I wrote at all convincing, if you are worried about the future, if you want to know how you might be able to help, or even if you want to tell me why I am completely wrong, please reach out via email or LinkedIn.