Catastrophically Confident Classifiers

Does your AI know when it doesn't know?

Posted by Alexander Meinke on November 03, 2022 · 13 mins read


After more than three years of work, my PhD is finally coming to an end. I thought this would be the perfect time for me to give a short little crash course into what I've been up to during this time.

Catastrophically confident classifiers

Imagine you have a visual classification algorithm that recognizes different breeds of dogs. You used the latest and greatest deep learning model and of course your data is just amazingly clean and abundant. Your accuracy is awesome and you deploy your model without a care in the world. But now what if some joker came along and showed your classifier something that's not even a dog? Or maybe just not the type of dog you were expecting?

What type of dog is a hot dog?

Of course there is no right answer for your classifier to give here so ideally it should just shrug its shoulders at you and say "I don't know!", for example by just giving really low confidence predictions. But does it do that? Usually not. The hot dog seems like a silly example but what if you are instead classifying different skin lesions in medicine? Or if your robo-taxi is deciding if that's a flock of birds or a sweet old grandma crossing the street? Not quite so silly anymore.

Okay, so what can we do to teach our neural net to be less cocky in its generalizations? While there are hundreds of things that people have tried, a really reliable way has been to teach the classifier humility the same way you would teach it anything else: by showing it a whole bunch of examples. Basically, you go on the internet and download every single picture you can get your hands on. Now, during training, half the images you show your classifier contain a dog with the corresponding label, while the other half are random images on which you tell your classifier to please have low confidence. Of course, this takes extra data, but you know what they say: there's no free lunch. At least you don't need to label the extra data...
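To make the recipe concrete, here is a minimal sketch of what such a training objective could look like in PyTorch. The function name, the weighting factor and the toy tensors are my own illustrative choices, not the exact setup from the Outlier Exposure paper: standard cross-entropy on the in-distribution images, plus a term that pulls the predictions on the random outlier images toward the uniform distribution, i.e. toward maximal uncertainty.

```python
import torch
import torch.nn.functional as F

def oe_loss(logits_in, labels_in, logits_out, lam=0.5):
    """Outlier-Exposure-style objective (illustrative sketch):
    cross-entropy on in-distribution data plus a term pushing
    out-distribution predictions toward the uniform distribution."""
    ce = F.cross_entropy(logits_in, labels_in)
    # Cross-entropy to the uniform distribution over the classes:
    # proportional to -mean(log_softmax), up to constants.
    uniform_term = -F.log_softmax(logits_out, dim=1).mean()
    return ce + lam * uniform_term

# Toy usage with random logits for a 10-class problem (e.g. CIFAR10).
logits_in = torch.randn(8, 10)
labels_in = torch.randint(0, 10, (8,))
logits_out = torch.randn(8, 10)
loss = oe_loss(logits_in, labels_in, logits_out)
```

In a real training loop both batches would of course come from actual data loaders, and the loss would be backpropagated as usual.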

To show this in action, let's look at a specific dataset. CIFAR10 is a simple classification task with ten different classes and small 32x32 images. The dataset is still very popular among academic researchers because we can run experiments on it without needing a GPU cluster whose power consumption might rival a small nation's. Now we are going to train a Plain model using just the CIFAR10 data, which we will call the "in-distribution", and an Outlier Exposure (OE) model that additionally gets shown all the nonsense images that we can find. We'll call the latter the "out-distribution". You can see the result in the figure below, but spoiler: it works.

Chimpanzees are not in CIFAR10, so the classifier better not be confident here.

Problem solved?

Now is when it starts to get interesting. It's great that the OE model did so well on the example that we showed it, but does this work for every possible out-distribution image that we could show it? Well, no, of course not. We can go looking for examples that are particularly hard and will certainly find some. However, often enough those examples at least have something in common with the in-distribution images. But here is one super interesting type of failure: if we allow every pixel value of an out-distribution image to change ever so slightly, and we carefully optimize those tiny changes, we can easily find images that receive sky-high confidence from the OE model.

The two chimps on the right look completely identical, so how come the model is suddenly so over-confident?

The problem is that, similar to the classification itself, our uncertainty estimation is not adversarially robust. Okay, that doesn't sound great, so what can we do about it? It's once again time to train the neural network to deal with something. This time, during training, we look for examples that look very similar to our out-distribution images but induce very high confidence in our model. Showing these super difficult images to our model hardens the classifier's confidence estimation against the meanest samples we can come up with. The resulting method is called ACET.
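The inner search - both for fooling the OE model above and inside ACET-style training - can be sketched as projected gradient ascent on the model's confidence. The code below is a generic PGD-style sketch with made-up hyperparameters and a stand-in linear "classifier", not the exact attack from the papers:

```python
import torch
import torch.nn.functional as F

def max_confidence_attack(model, x, eps=0.01, steps=10, step_size=0.0025):
    """Projected gradient ascent on the classifier's top-class confidence,
    keeping every pixel within +/- eps of the original image (a sketch)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        conf = F.softmax(model(x + delta), dim=1).max(dim=1).values.sum()
        conf.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()       # ascend on confidence
            delta.clamp_(-eps, eps)                      # stay in the pixel budget
            delta.copy_((x + delta).clamp(0.0, 1.0) - x) # keep a valid image
        delta.grad.zero_()
    return (x + delta).detach()

# Toy usage: a stand-in linear "classifier" on a random 32x32 RGB image.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
x_adv = max_confidence_attack(model, x)
```

ACET then simply treats the images this search finds as extra out-distribution training examples on which the model must still report low confidence.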

Even when we try really hard to find a variation of the chimp image that fools the ACET model, it keeps its promise of delivering low confidence.

But NOW it's solved, right?

Not so fast. We are actually far from done. First of all, as you can imagine, trying to find the absolute hardest image in the neighborhood of each out-distribution image at every single training step is expensive. As in training-takes-20-times-longer expensive. Okay, that's not great, but maybe it's worth it? Unfortunately, a model like ACET has more drawbacks. For example, it doesn't recognize non-adversarial out-distribution samples as out-distribution as reliably as OE would. And worse still, it doesn't reach the same level of accuracy as a normal model would.

But worse worse still, we can never truly know whether our method really was successful. What I mean by that is: if I hand you this image of a chimpanzee and a fully trained ACET model, can you promise me that I won't find a high-confidence image in the neighborhood? Of course, we would first need a very precise notion of what on earth we mean by "neighborhood". Let's say I am allowed to change every single pixel value by up to 1%. Even then, the only way to know that no such high-confidence image exists would be to check all of them. Even if we could check billions of candidates every second, it would still take far longer than many trillions of times the age of the universe to crunch through all the possibilities. All just for a single low-resolution chimpanzee!
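The arithmetic behind that claim is easy to check yourself. As a back-of-the-envelope sketch, assume 8-bit pixels, so a 1% change is roughly +/-2 of 255 intensity levels, giving each of the 3072 pixel values about 5 admissible settings (these exact numbers are my own illustrative assumptions):

```python
import math

# How many images live in the "1% per pixel" neighborhood of one
# 32x32 RGB image? (Assumption: 8-bit pixels, so ~5 values per pixel.)
pixels = 32 * 32 * 3                  # 3072 values in one CIFAR10 image
choices_per_pixel = 5                 # -2, -1, 0, +1, +2 intensity levels
log10_images = pixels * math.log10(choices_per_pixel)

checks_per_second = 1e9               # an optimistic billion images per second
seconds_per_year = 3600 * 24 * 365.25
log10_years = log10_images - math.log10(checks_per_second * seconds_per_year)

print(f"about 10^{log10_images:.0f} candidate images")
print(f"about 10^{log10_years:.0f} years of brute-force checking")
```

That works out to roughly 10^2147 candidate images and on the order of 10^2130 years of checking, while the universe is only about 10^10 years old.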

The way we deal with this is to simply not look at every single possible image, but rather to search specifically for images that will probably have high confidence - similar to how we did it during training. But this always leaves open the question of whether a more cleverly designed search algorithm might show that our evaluation was actually flawed. Examples of this have happened dozens of times, and I have even discovered such issues myself: I once tested an existing model and found that instead of close to 100% robustness it had close to 0%.

Math! To the rescue!

Let's briefly recap our issue: we want to know the highest confidence a model could predict across all images that are within 1% pixel changes of our good friend, the chimpanzee. Computing this precisely is so brutally hard that no supercomputer on earth can do the trick. But what if we don't need to know the true answer, and instead just want an upper bound? That means we spit out some number and, even though we can't know the true answer, we can promise that it is definitely, certainly and absolutely going to be smaller than our bound. Like, if I ask you what 3,583+4,189 is, you might be too lazy to compute it, but you can immediately tell me that it's less than 10,000.

Understanding exactly how we can do this would require a little more math than I want to cover in this post - maybe at some point I will write an article about it. But just to sound cool, I am going to tell you that the answer involves playing with geometric shapes in many thousands or even millions of dimensions. The important point is that we can come up with a method that gives us upper bounds on the confidence within milliseconds instead of billions of years. Not a bad speedup.
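Just to give a flavor of why such bounds can be so cheap: one standard ingredient in this area is interval arithmetic, where we push a whole box of inputs through the network layer by layer instead of one image at a time. Here is a tiny sketch for a single linear layer; it illustrates the general idea, not the exact certification procedure from the papers:

```python
import numpy as np

def interval_linear(W, b, lo, hi):
    """Propagate an axis-aligned box [lo, hi] through x -> Wx + b.
    Each output bound picks the worst combination of input bounds,
    split by the sign of the weights."""
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    new_lo = W_pos @ lo + W_neg @ hi + b
    new_hi = W_pos @ hi + W_neg @ lo + b
    return new_lo, new_hi

# Toy example: bound the outputs of one layer over all inputs
# within +/-0.01 of a given point.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.uniform(size=5)
lo, hi = interval_linear(W, b, x - 0.01, x + 0.01)
# For a ReLU layer one would simply clip both bounds at zero and continue.
```

Stacking such steps through every layer yields guaranteed bounds on the logits for the entire neighborhood in one cheap forward pass.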

The catch is that, if we compute this upper bound for the chimp on our ACET model, it proudly proclaims that the maximum confidence will be no larger than 100%. Well, duh! We didn't really need fancy math to tell us that. The problem is that the neural network isn't really designed to make the certification procedure particularly precise. But how on earth could we do that?

By now you might already be able to guess: by training the network to be easily certifiable. Basically, now during training, we don't tell the model to have low confidence on out-distribution images, we don't tell the model to have low confidence on the adversarially manipulated out-distribution images; instead, we tell it to have small upper bounds on the confidence for adversarially manipulated out-distribution images. Because catchy acronyms are as important in deep learning as anything else, we called this method GOOD (for Guaranteed Out-of-Distribution Detection... don't ask what happened to the second "D").
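To make that slightly more concrete, here is one way an upper bound on the confidence can be computed once we have elementwise bounds on the logits over the whole neighborhood (again an illustrative sketch, not the exact GOOD objective): for each class, assume its logit is as high as possible and every other logit as low as possible.

```python
import numpy as np

def confidence_upper_bound(lo, hi):
    """Given elementwise bounds [lo, hi] on the logits over a whole
    neighborhood, bound the maximum softmax confidence: each class gets
    its best case (logit hi[k]) while all others get their worst (lo)."""
    bounds = []
    for k in range(len(lo)):
        others = np.delete(lo, k)  # lower logit bounds of all other classes
        # softmax_k = 1 / (1 + sum_j exp(z_j - z_k)), maximized at z_k = hi[k]
        bounds.append(1.0 / (1.0 + np.exp(others - hi[k]).sum()))
    return max(bounds)

# Toy usage: logit bounds of width 0.4 around some point prediction.
logits = np.array([2.0, 0.0, -1.0])
bound = confidence_upper_bound(logits - 0.2, logits + 0.2)
```

During GOOD-style training, one would then penalize this certified bound on out-distribution neighborhoods instead of the raw confidence, so that the guarantee itself becomes small.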

Now we can guarantee that there simply is no image that is almost identical to the chimp but receives more than 23% confidence.

The amazing thing is that now it simply doesn't matter whether we are good or bad at finding specific high-confidence images in the neighborhood of a given out-distribution image. We can trust that our evaluation is solid because it is mathematically guaranteed to be correct. Of course, it's not all sunshine and rainbows. Training our GOOD model was always quite tricky, and the method couldn't be applied to neural networks with too many layers. Actually, we managed to fix all these issues in our NeurIPS 2022 paper, but that is a story for another day. If you are super curious, feel free to explore the paper on your own!


Of course, this neat story I told skips a lot more than just the technical details. For example, we discovered some fascinating connections between this work and the interpretability of a neural network's predictions. Also, the question of how a neural network's confidence changes as we go further and further away from our in-distribution examples is more surprising than one might think. If enough people are interested I will probably cover some of these topics as well at some point. Until then, I hope you enjoyed this little glimpse into how one teaches neural networks to know when they don't know.