My recent research has focused on the manual design or learning of reward functions, specifically reward functions that are aligned with human stakeholders’ interests. The resulting publications fall into two categories.
Manual reward function design
We investigate how experts tend to design reward functions in practice (by hand) and how they should do so.
- Reward (Mis)design for Autonomous Vehicles (AIJ 2023; arxiv 2021)
- The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications (AAAI 2023)
Inferring reward functions from human input
This RLHF research is particularly focused on the human part, such as critiquing assumptions about why people give certain preference labels.
- The EMPATHIC framework for task learning from implicit human feedback (CoRL 2020)
- Models of human preference for learning reward functions (arxiv 2022; accepted to TMLR 2024)
- Learning Optimal Advantage from Preferences and Mistaking it for Reward (arxiv, 2023; accepted to AAAI 2024)
- Contrastive Preference Learning: Learning from Human Feedback without RL (arxiv, 2023; accepted to ICLR 2024) ← reward specification is only implicit
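One assumption this line of work scrutinizes is the standard RLHF model that preference labels follow a Bradley-Terry distribution over segments' summed rewards. A minimal sketch of that commonly assumed model (illustrative only; the function name and values are mine, not from any of the papers above):

```python
import math

def preference_prob(return_a: float, return_b: float) -> float:
    """Bradley-Terry-style probability that a human labeler prefers
    segment A over segment B, under the standard RLHF assumption that
    preferences are driven by each segment's summed reward (return)."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))

# A segment with higher return is preferred more often than not,
# and equal returns give a 50/50 preference.
p_higher = preference_prob(2.0, 1.0)   # > 0.5
p_equal = preference_prob(1.0, 1.0)    # = 0.5
```

Work such as the regret-based preference model questions exactly this choice of summed reward as the quantity people compare.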