Topic: https://mlu.red/muse/52484066310.html

Comments

DanielFilan 5y, 192d ago [edited]

Comments:

  1. I think this is sort of a wrong ontology. A 'post-hoc' interpretability technique is, in the end, some procedure that takes an ML model and determines some fact about it. Such a technique could be applied to models at the end of training, but it could equally well be applied during training (as in fact you mention), and the later course of training could depend on the outputs of this technique. So I guess this is a criticism of a style of analysis? But I don't really know.

  2. "X doesn't work yet for the biggest ML models" does not imply "X will never work for the biggest ML models", and isn't even a good argument for the conclusion. FWIW, I also would guess that there are post-hoc interpretability methods one could use on GPT-3.

  3. It takes some time to investigate your model before deploying it, but if your interpretability technique is good, it shouldn't take that much time. I also think this criticism applies to any safety technique that slows down training.

  4. The biggest promise of post-hoc interpretability is that it lets you figure out if your ML system is properly functioning before you deploy it, and structurally it seems capable of that: take your system, learn a bunch of facts about it, deduce relevant conclusions. Also, as mentioned in point 1, the relevant techniques could be used during training to change the course of training e.g. "use interpretability to find a really bad case for your model, train the model more on that bad case, and see if you've fixed it". It won't guarantee on its own that the fix will work (to do that you'd need to understand training dynamics), but it seems pretty good. All in all, I'm happy with the types of questions these techniques can answer.

  5. Here's how I think about gameability: you have some technique that tells you some narrow facts, you make some inferences based on those facts, and those inferences turn out to be wrong because they were based on vague analogical reasoning. I basically think this is a problem of current interpretability techniques not giving enough useful information, and of people being sloppy in their thinking. The first, one hopes, is solvable; the second is perhaps harder, but I'm also optimistic about it.

  6. Overall, it seems difficult to me to understand how post-hoc interpretability techniques could themselves produce an aligned AI, because they don't tell you how to build anything, just about what you've built, so in some sense I guess I agree with your thesis "these sorts of techniques are likely not going to be enough for AI alignment", and think you could have given a much simpler argument. But I do think that these techniques can play an important role in alignment solutions - the mainline scenario I imagine is that you have some training method that you think should produce an aligned AI, but you can't actually prove anything and so you want to check. There are also other potential uses, e.g. facilitating human-AI communication, exposing thinking that you don't want, etc. So mostly I disagree with the implied message of this essay.
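The "find a really bad case, train the model more on that bad case, and see if you've fixed it" loop from point 4 can be sketched in miniature. This is a toy, hypothetical setup: the "interpretability" step here is just picking the training example with the largest loss, standing in for a real technique, and the model is a logistic-regression classifier.

```python
import numpy as np

# Toy sketch of point 4's loop: (1) train, (2) use a stand-in
# "interpretability" probe (here: the training example with the largest
# loss) to find a really bad case, (3) train more with that case
# upweighted, (4) check whether the bad case improved. All illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_per_example(w, X, y):
    # Per-example logistic loss.
    p = sigmoid(X @ w)
    eps = 1e-9
    return -(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def train(w, X, y, steps=200, lr=0.5):
    for _ in range(steps):
        p = sigmoid(X @ w)
        w = w - lr * (X.T @ (p - y)) / len(y)
    return w

w = train(np.zeros(5), X, y)

# "Interpretability" step: locate the worst-handled training case.
losses = loss_per_example(w, X, y)
bad = int(np.argmax(losses))

# Train more with the bad case heavily upweighted, then re-check it.
X_aug = np.vstack([X] + [X[bad:bad + 1]] * 20)
y_aug = np.concatenate([y] + [y[bad:bad + 1]] * 20)
w_fixed = train(w, X_aug, y_aug)

print(loss_per_example(w_fixed, X, y)[bad] < losses[bad])
```

As the point above notes, a fix found this way carries no guarantee; the check at the end only tells you whether this particular bad case improved, not whether the retraining introduced new problems elsewhere.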


owenshen 5y, 191d ago

Mostly regarding point 6 because I think that's where the bulk of the disagreement is:

I agree that there are probably post-hoc techniques that can tell us useful things about our model, and I think having them would definitely be useful. But I don't think existing methods do that very well: LIME and SHAP are gameable, and saliency maps are often just edge detectors. Do you agree with this much, at least?
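On the saliency-maps-as-edge-detectors point: the failure mode is visible even in the simplest gradient-saliency setup. For a linear model f(x) = w·x, the input-gradient saliency is |w| no matter what input you feed it, so the "map" reflects the weights rather than anything about this particular input (a toy illustration, not LIME or SHAP themselves):

```python
import numpy as np

# Toy illustration (not LIME/SHAP): gradient saliency for a linear model
# f(x) = w . x is |df/dx| = |w|, identical for every input, so the "map"
# cannot reflect what the model did with this particular input.

w = np.array([0.5, -2.0, 0.0, 1.5])

def f(x):
    return float(w @ x)

def gradient_saliency(x):
    # df/dx for f(x) = w . x is w, independent of x.
    return np.abs(w)

x1 = np.ones(4)
x2 = np.array([-3.0, 0.0, 2.0, 5.0])
print(np.allclose(gradient_saliency(x1), gradient_saliency(x2)))  # True
```

For nonlinear networks the gradient does depend on the input, but the "sanity checks" worry is analogous: the map can stay nearly unchanged even when the model's learned behavior changes.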

The other part here is mostly my intuition and not well-justified: post-hoc techniques are getting at the wrong thing. Something-something, you're upper-bounded by how much information you can extract via post-hoc techniques, and this will preclude certain types of useful queries. I guess, as the title of the post implies, it's not that I think they're useless, just that they won't be enough.


DanielFilan 5y, 190d ago

I agree that LIME, SHAP, and saliency maps are not very useful. In general, I think to be good, post-hoc interpretability shouldn't look like current work.

In terms of the info value, what happens when you deploy your trained AI depends on the final product, not the training trajectory or counterfactuals. So if what you want out of your interpretability is to know what will happen after deployment, then post-hoc interpretability is going to be sufficient. This isn't enough to know how to build a good AGI, but as mentioned in point 4, I think it is useful.
