Spring 2025: Reinforcement Learning and Optimal Control for Autonomous Systems II

ROB-GY 73237 / TR-GY 8023

Instructor: Eugene Vinitsky

Course details

Meeting time and room: Tuesdays and Thursdays, 3:30-5pm, Room 909, 2 MetroTech Center
Office hours: Tuesdays 2-3pm in Room 459, 6 MetroTech Center
Contact info: my email is easy to find online but I'm not putting it here so the robots can't get it.

Prerequisites

Reinforcement Learning and Optimal Control for Autonomous Systems I (or equivalent background). In particular, we assume you know:
- What value iteration (VI) and policy iteration (PI) are
- What an MDP is
- Basic PyTorch
- Basic probability (expected values, PDF/CDF, chain rule of probability, Bayes rule)
Access to Linux or Mac. We will not spend time supporting Windows machines. If you have Windows, you should set up a Linux dual boot or Windows Subsystem for Linux.

Course Breakdown

This is the second part of a two-course sequence on reinforcement learning and optimal control for autonomous systems. Building on the foundations from Part I, this course will cover advanced topics in RL and optimal control with applications to autonomous systems.

By the conclusion of the course, I expect to have covered:

Advanced policy gradient methods (TRPO, PPO, SAC)
Exploration in RL (UCB, intrinsic motivation, curriculum learning)
Imitation learning (behavior cloning, DAGGER, inverse RL, VLA models)
Offline RL (CQL, IQL, decision transformers)
Partial observability (POMDPs, belief states, approximation schemes)
Monte Carlo Tree Search and expert iteration
Learning from high-dimensional observations
Multi-agent RL (Nash equilibrium, self-play, coordination)
LLM RL (RLHF, DPO, preference learning)

Course Project

This is a graduate course intended to help you to potentially incorporate ideas from RL and optimal control into your research. As such, the primary element of the course is a project that you will develop over the duration of the class. There are three elements of the project:

Project proposal (2/19)
Mid-semester checkpoint (3/26)
Final project write-up (4/28)

The final project is intended to be a paper written in the style of an academic submission. The paper should be roughly eight pages, though it can be longer, and should conform to the style of a paper: it should either demonstrate a new result, describe the construction of an engineering project in detail, or be an in-depth investigation of a topic. As an alternative to new work, I will also allow a clear write-up of an existing paper that would allow a beginner to understand it. See the ICLR blog post track for examples of what I mean. Note that longer papers will not receive additional credit for their length: eight pages is the expectation, and the allowance for extra length is merely to give you more space if you need it. Similarly, if a substantive result can be described in fewer than eight pages, that is also fine.

Project Proposal

The project proposal should be a one-page LaTeX document outlining a concrete research question or engineering task. I'm being insistent about LaTeX here because if you don't know LaTeX yet, you probably need to learn it at this point. The proposal is due quite early in the course, possibly before you have all of the relevant background, so that I can help you refine it.

Mid-Semester Checkpoint

At this point I expect you to have a three-page write-up that outlines your progress so far, any open questions that you have not been able to resolve, and a list of the work that remains to be done before the final project is complete.

Grading

This is a graduate course and grading is intended to be fairly lenient; the expectation is that you are excited to learn things and do not need to be cajoled to do so by fear of a bad grade. If you do the work, you should expect to receive a very good grade. The grading will be as follows:

5% attending at least one office hour
35% homework
20% project proposal
5% mid-semester checkpoint
30% final project

Late material policy

Life happens. As such, you have a total of 4 late days that you can distribute across anything. Just write at the top of the material you are turning in how many late days you are using. Once those are used up, each day something is late will cost it 10% of the total possible grade. If there are extenuating circumstances that require you to exceed those 4 days, let me know! As I've said, life happens.
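
As a rough illustration of the arithmetic, here is a minimal sketch in Python. The function name and parameters are hypothetical, and two points are assumptions that the policy above does not state: that the penalty is counted only on days not covered by late days, and that it is capped at 100%.

```python
def late_penalty_percent(days_late: int, late_days_applied: int,
                         percent_per_day: int = 10) -> int:
    """Percent of the total possible grade lost on a single submission."""
    if days_late < 0 or late_days_applied < 0:
        raise ValueError("day counts must be non-negative")
    # Assumption: only days not covered by late days are penalized.
    penalized_days = max(0, days_late - late_days_applied)
    # Assumption: the penalty is capped at 100%; the policy states no cap.
    return min(100, penalized_days * percent_per_day)


# Submitted 3 days late with 2 late days applied: 1 penalized day.
print(late_penalty_percent(days_late=3, late_days_applied=2))  # 10
```

Keeping track of how many of the 4 late days remain across the semester is left to the caller in this sketch.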

Cheating Policy

Cheating is obviously not allowed. Copying answers or code from another student or the internet constitutes cheating and you will be referred to the appropriate NYU procedure for dealing with cheating. Things must be in your own words. However, collaborating with another student is allowed as long as you indicate the student that you collaborated with and your answers and writing are in your own words.

ChatGPT Policy

I love ChatGPT and use it all the time (mostly for little bash scripts and regex). However, one goal of learning is to develop fluency: the ability to come up with ideas and use tools and knowledge without reference to an external data store. This is very similar to the development of language fluency; imagine if, instead of learning a foreign language, you tried to look every word up in a dictionary! The same thing happens in research; there are ideas you want to be able to pull out without having to look them up every time. I will try to make clear in the course what those foundational concepts are. A similar thing applies to writing; you want to be able to write quickly and thoughtfully, and the only way to get there is practice and repetition.

Using ChatGPT to skip steps delays the development of fluency. Overuse of ChatGPT may get you through the course more quickly in the short term, but it will make you a worse researcher in the long term and harm your educational experience.

As such, my rules for ChatGPT are the following:

You may use ChatGPT as a learning tool, to ask it questions about material in the hope of receiving useful explanations that are tailored to you. Note that just like anything else on the internet, these explanations may be wrong!
I ask that you not use ChatGPT as a writing tool. This includes using it to sketch out the form of your project or to draft a preliminary version of it. Even if you would do this in practice, the goal here is to develop the skill of writing research!
You should not be using ChatGPT to write the answers to homeworks. Asking ChatGPT to write any part of the solutions to these problems will be considered cheating.

Note, I acknowledge that there's basically no way for me to check whether you have followed these rules. My hope is that you use it in the ways outlined above because the alternative will harm your development as a researcher, not out of fear of consequence.

Office Hours

I will host a one-hour office hour once a week. The time will be updated here once it is scheduled. Please use this time to come talk to me about homeworks, research, whatever. It's your time to use and I am excited to talk to you! You are required to attend at least one office hour during the semester. If you cannot make the time we pick, email me to find an alternate time!

Inclusion Statement

The NYU Tandon School of Engineering values an inclusive and equitable environment for all our students. I hope to foster a sense of community in this class and consider it a place where individuals of all backgrounds, beliefs, ethnicities, national origins, gender identities, sexual orientations, religious and political affiliations, and abilities will be treated with respect. It is my intent that all students' learning needs be addressed, and that the diversity that students bring to this class be viewed as a resource, strength, and benefit. If this standard is not being upheld, please feel free to speak with me.

Moses Center Statement of Disability

If you are a student with a disability who is requesting accommodations, please contact New York University's Moses Center for Students with Disabilities at 212-998-4980 or mosescsd@nyu.edu. You must be registered with CSD to receive accommodations. Information about the Moses Center can be found at www.nyu.edu/csd. The Moses Center is located at 726 Broadway on the 2nd floor.

NYU School of Engineering Policies and Procedures on Academic Misconduct

Introduction: The School of Engineering encourages academic excellence in an environment that promotes honesty, integrity, and fairness, and students at the School of Engineering are expected to exhibit those qualities in their academic work. It is through the process of submitting their own work and receiving honest feedback on that work that students may progress academically. Any act of academic dishonesty is seen as an attack upon the School and will not be tolerated. Furthermore, those who breach the School's rules on academic integrity will be sanctioned under this Policy. Students are responsible for familiarizing themselves with the School's Policy on Academic Misconduct.
Definition: Academic dishonesty may include misrepresentation, deception, dishonesty, or any act of falsification committed by a student to influence a grade or other academic evaluation. Academic dishonesty also includes intentionally damaging the academic work of others or assisting other students in acts of dishonesty. Common examples of academically dishonest behavior include, but are not limited to, the following:
1. Cheating: intentionally using or attempting to use unauthorized notes, books, electronic media, or electronic communications in an exam; talking with fellow students or looking at another person's work during an exam; submitting work prepared in advance for an in-class examination; having someone take an exam for you or taking an exam for someone else; violating other rules governing the administration of examinations.
2. Fabrication: including but not limited to, falsifying experimental data and/or citations.
3. Plagiarism: intentionally or knowingly representing the words or ideas of another as one's own in any academic exercise; failure to attribute direct quotations, paraphrases, or borrowed facts or information.
4. Unauthorized collaboration: working together on work that was meant to be done individually.
5. Duplicating work: presenting for grading the same work for more than one project or in more than one class, unless express and prior permission has been received from the course instructor(s) or research adviser involved.
6. Forgery: altering any academic document, including, but not limited to, academic records, admissions materials, or medical excuses.

If you are experiencing an illness or any other situation that might affect your academic performance in a class, please email Deanna Rayment, Coordinator of Student Advocacy, Compliance and Student Affairs: deanna.rayment@nyu.edu. Deanna can reach out to your instructors on your behalf when warranted.

Useful references

Sutton and Barto - Reinforcement Learning: An Introduction (the classic in the field, very thorough)
An Introduction to Deep Reinforcement Learning
Reinforcement Learning: Foundations

Course Schedule - Topics

Week	Topics Covered	Course Materials	Assignments / Assessments
Week 1 (1/20, 1/22)	Recap on the previous class: An overview of RL: Model-free vs. model-based, on vs. off-policy; RL empirics: How do we measure improvements? What benchmarks are we using? Overview of Bellman equations; Policy gradient theorem; Q-learning, temporal difference learning, the deadly triad; Target networks, prioritized experience replay; Deterministic PG theorem	Empirical design in reinforcement learning; Chapter 4 on Q-learning; Chapter 11 of Sutton and Barto	Homework 1 out
Week 2 (1/27, 1/29)	Policy Gradient Methods: From policy gradient theorem to standard PG methods; PG theorem to REINFORCE; Conservative policy iteration to TRPO/PPO; Importance sampling; Variance reduction tricks; Max entropy RL (SAC)	Chapter 13 of Sutton and Barto; The Mirage of Action-Dependent Baselines
Week 3 (2/3, 2/5)	Exploration in RL: The curse of dimensionality, combinatorial locks; Exploration in bandits, UCB; UCB-VI algorithm for tabular MDPs; LSVI-UCB algorithm for function approximation; Practical exploration: epsilon-greedy, Go-Explore, curriculum learning; Never Give Up, reverse curricula, exploration from human data	Sections 7.2 and 7.4; Unifying Count-Based Exploration and Intrinsic Motivation; The Effective Horizon Explains Deep RL Performance	Homework 1 due, Homework 2 out
Week 4 (2/10, 2/12)	Imitation Learning Part I: Intro to the online learning framework; What is regret, why regret? Follow-the-regularized leader algorithms; The difference between IL and supervised learning; Behavior cloning; DAGGER	Online Learning Lecture Notes; An Invitation to Imitation Learning
Week 5 (2/17, 2/19)	Imitation Learning Part II: Inverse reinforcement learning: GAIL, AIRL; Imitation learning in robotics; Common tokenization schemes; Imitation of reference motion algorithms; VLA models		Homework 2 due, Homework 3 out; Project proposal due (2/19)
Week 6 (2/24, 2/26)	Offline RL: What is offline RL? Conservative offline RL methods (CQL); Filtered BC, Implicit Q-learning; The offline policy evaluation problem; Decision transformers	Introduction to Offline Reinforcement Learning; Pi 0.6; Implicit Q-learning
Week 7 (3/3, 3/5)	Partial Observability: Models of partial observability: PSRs, OOMs, POMDPs; Intractability of POMDPs; The belief state MDP; The Bayes filter; Approximation schemes: k-step history, discretization	Planning and Acting in Partially Observable Stochastic Domains	Homework 3 due, Homework 4 out
Week 8 (3/10, 3/12)	MCTS and Search Methods: An overview of bandits (to motivate UCB); Vanilla MCTS; MCTS variants: Gumbel MuZero, Stochastic MuZero; MCTS as regularized policy optimization; The expert iteration framework	Bandit Based Monte-Carlo Planning
3/17, 3/19	Spring Break - No Class
Week 9 (3/24, 3/26)	Learning from High-Dimensional Observations: Why is learning from high dimensional data hard? Theories of learning in high dimension; Block MDP, Low-rank MDP	Learning in High Dimensions	Homework 4 due; Mid-semester checkpoint due (3/26)
Week 10 (3/31, 4/2)	Multi-Agent RL: What makes multi-agent hard? From optima to equilibria; Nash equilibrium; Zero-sum games and Nash theorem; Equilibrium finding algorithms: Fictitious play, Double Oracle, Self-play
Week 11 (4/7, 4/9)	Human-Robot Coordination: Definitions: ad-hoc coordination, locker room talk, cheap talk, zero-shot coordination; Why is this hard? Coordination using human data - piKL algorithms
Week 12 (4/14, 4/16)	LLM RL: Intro to RLHF and preference learning; Underlying assumptions (utility functions, IIA); DPO; Limitations: Preference models do not learn to rank; RL practices in large language models; The challenge of learning correlated rewards
Week 13 (4/21, 4/23)	Sundry Topics: Guest lecture (TBD); Unsupervised RL; Successor features; UVFAs; Forward Backward methods
Week 14 (4/28, 4/30)	Final Presentations and RL in Practice: How to tune RL algorithms; How to check if your results are "real"; When not to use RL; Project presentations		Final project due (4/28)

Class Competitions

Competition 1: Best humanoid motion reference imitation
Competition 2: Drone racing