Non-Myopic Learning

Acknowledging that the agent’s learning objective (e.g., maximize immediate reward or the sum of reward over the long term) may not align with performance on the trainer’s intended task (e.g., follow the human trainer), we examined the relationships between reward positivity, temporal discounting, episodicity, and task performance. Contributions of this work include

  1. empirical support for and justification of the myopic approach that, as we newly note, has been used across all previous projects on learning from human reward;
  2. the first successful instance of non-myopic learning from human reward and evidence that non-myopic approaches will enhance the effectiveness of teaching by human reward, shifting the burden from users to the robot; and
  3. evidence that non-myopic learning is incompatible with episodic tasks for a large class of domains, and, conversely, an endorsement of framing tasks as continuing when learning from human reward.

Figure: Success rates for training in an episodic grid-world task domain, in which the agent acts approximately optimally with respect to its time-discounted expectation of future reward. In this setting, myopic discounting (a discount factor of 0) leads to the best task performance. Note that task performance—here getting to the goal—can differ from performance with respect to the agent’s learning objective, which all agents in this experiment are nearly maximizing.
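To make the discounting distinction concrete, here is a minimal illustrative sketch (not code from the papers): the discounted return G = Σ γ^t r_t over a hypothetical sequence of human-delivered rewards. With γ = 0 (myopic) the agent values only the immediate reward; with γ near 1 (non-myopic) it also weighs future reward.

```python
def discounted_return(rewards, gamma):
    """Sum of per-step rewards, each discounted by gamma per time step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical per-step human rewards (for illustration only).
rewards = [1.0, 0.5, -0.2, 2.0]

# Myopic agent (discount factor 0) values only the immediate reward:
print(discounted_return(rewards, 0.0))  # 1.0

# Non-myopic agent (discount factor 0.9) also counts future reward:
print(discounted_return(rewards, 0.9))  # ≈ 2.746
```

The experiments summarized above compare agents along exactly this axis: the same human-reward signal yields different behavior depending on how heavily the learner discounts the future.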

Relevant publications

W. Bradley Knox and Peter Stone. Learning Non-Myopically from Human-Generated Reward. In Proceedings of the International Conference on Intelligent User Interfaces (IUI). March 2013.
[pdf] (2.1 MB)
IUI 2013

W. Bradley Knox and Peter Stone. Reinforcement Learning from Human Reward: Discounting in Episodic Tasks. In Proceedings of the 21st IEEE International Symposium on Robot and Human Interactive Communication (Ro-Man). September 2012.
Finalist for CoTeSys Cognitive Robotics Best Paper award
[pdf] (2.1 MB)
Ro-Man 2012