Human Reward

Much of my research has focused on one question: if reward in a reinforcement learning framework is given by a live human trainer observing the agent’s behavior, rather than by the usual pre-coded reward function, how should the agent use these feedback signals to best learn the behavior that the human intends to teach?

Below are links to explore my work with Peter Stone and others on learning from human reward.

Interactive Shaping – the problem of learning from human-generated reward
TAMER – our (myopic) foundational system for interactive shaping
Non-myopic learning – learning to maximize human reward over the long term such that the agent performs well on the trainer’s intended task
TAMER+RL – learning from both human-generated reward and predefined reward from a Markov Decision Process

With a talented crew at the MIT Media Lab, I put together this video of my work on TAMER: