DeepMind's AI can now learn from human preferences

Researchers at DeepMind and OpenAI have trained a reinforcement learning algorithm to adapt based on non-technical human feedback

One of artificial intelligence’s biggest problems is the need for handholding from pesky humans. New research by DeepMind and OpenAI aims to help an AI learn about the world around it based on minimal, non-technical feedback.


It’s a system that all comes down to that simplest of human traits: inference. While we humans find it very simple to tell the difference between similar-looking objects or actions, for an AI this process of reasoning is still fiendishly difficult. That’s not a problem when you’re asking an AI to tell the difference between labradoodles and fried chicken, but it becomes potentially dangerous when let loose on more important problems.

The researchers found that humans with no technical experience could teach a reinforcement learning system a complex goal, removing the need for an expert to specify that goal in advance. In their paper, published on arXiv, they describe a method whereby the algorithm learns a reward predictor from human judgements. Up until now, reinforcement learning systems have required a hard-coded reward function to work out what problem they had to solve; this research removes that necessity.

In simple terms, a human was asked to look at two videos of bendy robots doing backflips and pick the better one. Over time, the algorithm inferred what the human was looking for. Crucially, for those fearful of rogue AIs killing us all, the system also requires that the learning algorithm continually seeks human approval. Think of it a bit like a well-trained, subservient puppy.
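For readers curious how a "reward predictor" can be trained from nothing but pairwise human choices, here is a minimal sketch in PyTorch. It treats the summed predicted rewards of two clips as logits for "which clip did the human prefer?", in the spirit of the approach the paper describes. The network shape, dimensions and every name below are illustrative assumptions, not DeepMind or OpenAI's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardPredictor(nn.Module):
    """Maps a single observation to a scalar reward estimate (a sketch)."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def preference_loss(model, clip_a, clip_b, human_chose_a: bool):
    """Cross-entropy loss on a single human comparison.

    Each clip is a (frames, obs_dim) tensor of observations. Predicted
    rewards are summed over each clip, and the two totals are treated
    as logits over the human's binary choice.
    """
    total_a = model(clip_a).sum()
    total_b = model(clip_b).sum()
    logits = torch.stack([total_a, total_b]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([0 if human_chose_a else 1])
    return F.cross_entropy(logits, target)


# Usage sketch: one gradient step per human judgement, with stand-in data.
model = RewardPredictor(obs_dim=8)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

clip_a, clip_b = torch.randn(30, 8), torch.randn(30, 8)  # two dummy clips
loss = preference_loss(model, clip_a, clip_b, human_chose_a=True)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```

In the full system, the agent is then trained with ordinary reinforcement learning against this learned predictor instead of a hand-written score, while fresh human comparisons keep refining the predictor as the agent's behaviour improves.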

Taking the algorithm from clueless onlooker to acrobatic expert took less than an hour of a human evaluator’s time. That’s fine for trivial tasks, but if DeepMind and its competitors are to realise their ambition of an AI with general intelligence, then feedback times will need to be drastically cut. Even for such a seemingly simple task, the human evaluator still had to provide around 900 pieces of feedback.

DeepMind and OpenAI also tested their new feedback and reward prediction system on retro Atari games. As with the backflip challenge, the hard-coded reward function was removed. With no score to go on, the AI relied entirely on human feedback. When a human evaluator approved of its behaviour, the AI learned and, over time, became proficient at the games.

This article was originally published by WIRED UK