The Value Alignment Project seeks to design methods for preventing AI systems from inadvertently acting in ways inimical to human values.
AI systems will operate with increasing autonomy and capability in complex domains in the real world. How can we ensure that they have the right behavioural dispositions – the goals or ‘values’ needed to ensure that things turn out well, from a human point of view?
Stuart Russell has called this the value alignment problem. The project is a collaboration between teams at the Future of Humanity Institute at the University of Oxford and the Centre for Human-Compatible Artificial Intelligence at UC Berkeley.
The Future of Humanity Institute takes an interdisciplinary approach, drawing on techniques from machine learning, theoretical computer science, decision theory, and analytic philosophy. Examples include research on modifying reinforcement learning agents to be 'interruptible' (so that they do not resist attempts to shut them down) or 'active' (so that they must incur a cost to observe their rewards).
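The interruptibility idea can be illustrated with a toy experiment. The sketch below is not the project's own code; it is a minimal, illustrative example of "safe interruptibility" in the sense of Orseau and Armstrong's work: because off-policy Q-learning bootstraps from the best next action rather than the action actually taken, forcing interruptions into the agent's behaviour biases what it does during training but not the values, and hence the policy, it ultimately learns. The environment and all parameter values are assumptions chosen for illustration.

```python
import random

# Illustrative sketch only: a tiny corridor MDP with states 0..4 and a
# reward of 1 for reaching state 4. An "interruption" occasionally
# overrides the agent's chosen action and forces it left, away from
# the goal. All names and parameters here are hypothetical.

N, GOAL = 5, 4
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.5

def step(s, a):
    """Move left (a=0) or right (a=1) along the corridor."""
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def train(interrupt_prob, episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s = 0
        for _ in range(50):
            if rng.random() < EPS or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)          # explore / break ties
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            if rng.random() < interrupt_prob:
                a = 0                         # interruption overrides the agent
            s2, r, done = step(s, a)
            # Off-policy target: max over next actions, not the action the
            # (possibly interrupted) behaviour policy will actually take.
            Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
            s = s2
            if done:
                break
    return Q

# Even with frequent interruptions, the learned greedy policy still
# heads for the goal rather than "learning around" the interruptions.
Q = train(interrupt_prob=0.3)
policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(N - 1)]
```

The design point is the off-policy update: an on-policy learner (e.g. SARSA) would learn values contaminated by the interruptions, whereas the Q-learning target above is unaffected by how often the agent is overridden.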
Led by Professor Stuart Russell, the Centre for Human-Compatible Artificial Intelligence is developing new theoretical and empirical approaches to address several core issues:
- the role of explicit uncertainty about objectives in the design of intelligent systems that are provably safe and that have an incentive to allow correction by humans;
- the development of effective inverse reinforcement learning methods for ascertaining human objectives, despite humans' significant cognitive constraints and processing limitations;
- the extent to which increasing AI capabilities require more accurate value alignment between AI systems and humans in order to avoid significant downside risks.
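The first of these issues can be made concrete with a small numeric example in the spirit of the "off-switch game" studied at CHAI (Hadfield-Menell et al., 2017). The sketch below is an illustration, not the centre's model: a robot holds a probabilistic belief over the human's utility u for a proposed action, and compares acting unilaterally against deferring to a human who will switch it off whenever u is negative. The belief distribution and numbers are assumptions chosen for illustration.

```python
import random

# Illustrative sketch: uncertainty about objectives gives the robot an
# incentive to allow human correction. The belief over u is assumed
# Gaussian here purely for the example.

def value_act(samples):
    # Act unilaterally: commit to the action whatever u turns out to be.
    return sum(samples) / len(samples)

def value_defer(samples):
    # Defer: a rational human lets the action proceed only when u > 0,
    # otherwise presses the off switch (payoff 0).
    return sum(max(u, 0.0) for u in samples) / len(samples)

rng = random.Random(0)
samples = [rng.gauss(0.2, 1.0) for _ in range(100_000)]  # robot's belief over u

va, vd = value_act(samples), value_defer(samples)
# E[max(u, 0)] >= max(E[u], 0), strictly so whenever the belief places
# mass on u < 0 -- so the uncertain robot prefers to keep the off
# switch available. With a degenerate (certain) belief, the incentive
# to defer vanishes.
```

The qualitative conclusion is the point: the more uncertain the robot is about the human's objective, the larger the gap between deferring and acting, which is one formal route to systems that have an incentive to allow correction.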
Additional lines of research in technical AI safety can be found in numerous research agendas, such as those from Google Brain.