End to End Learning for Self-Driving Cars and the Distribution Mismatch Problem
PAPER:
I recently came across this interesting paper by NVIDIA's autonomous driving team:
- End to End Learning for Self-Driving Cars (NVIDIA’s team)
SUMMARY:
They take a supervised learning approach to learn a mapping from the image input to the steering command. It is essentially a modern (mid-2010s) version of ALVINN from the late 1980s. The function approximator is a convolutional neural network (a normalization layer, followed by 5 convolutional and 3 fully connected layers). They train the network on a large amount of data collected from actual drivers' behaviour (about 70 hours of driving, corresponding to roughly 2.5M data samples, although this number is not explicitly mentioned in the paper), along with some data augmentation.
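For concreteness, here is a minimal sketch (in PyTorch) of a network along these lines. The exact layer widths, kernel sizes, strides, input resolution, and the hard-coded pixel normalization below are my own illustrative assumptions, not the authors' exact implementation:

    # Minimal sketch of an image-to-steering CNN in the spirit of the paper
    # (normalization + 5 convolutional + 3 fully connected layers).
    # Layer sizes, strides, and the 66x200 input are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SteeringNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
            )
            self.regressor = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
                nn.Linear(100, 50), nn.ReLU(),
                nn.Linear(50, 1),          # single output: the steering command
            )

        def forward(self, x):
            # Hard-coded pixel normalization to [-1, 1], standing in for the
            # normalization layer mentioned in the summary.
            x = x / 127.5 - 1.0
            return self.regressor(self.features(x))

    # Usage: a batch of 66x200 camera frames -> a batch of steering commands.
    net = SteeringNet()
    frames = torch.rand(8, 3, 66, 200) * 255.0
    steering = net(frames)   # shape: (8, 1)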
COMMENTS:
It is exciting to see an end-to-end neural network learn to perform relatively well. But there are potential problems. One challenging issue with such a classical supervised learning approach is the distribution mismatch caused by the dynamical nature of the agent-environment interaction: whenever the agent makes a mistake at some time step, the distribution of its future states shifts slightly compared to the distribution induced by the optimal agent (the driver, in this case). This has a compounding effect, and the difference between the distributions can grow as the agent interacts with the environment for longer. As a result, as time passes, the agent is more likely to find itself in regions of the state space for which it has little training data. The agent then starts behaving in unpredictable ways, even though it might perform well on the training distribution (this is the distribution mismatch problem in machine learning/statistics).
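To make the compounding-error argument concrete, here is a small toy simulation of my own (not from the paper): a 1-D "lane keeping" problem in which the state is the lateral offset from the lane center. The cloned policy fits the expert well on the states the expert actually visits, but its small errors occasionally push it off that region, where its own poorly extrapolated actions drive it further away. All constants and function names are illustrative assumptions.

    # Toy illustration of compounding errors in behaviour cloning.
    import numpy as np

    rng = np.random.default_rng(0)

    NOISE = 0.1          # process noise in the lateral dynamics
    REGION = 0.4         # the expert's data only covers |x| <= REGION

    def expert(x):
        # The "driver": steers straight back to the lane center.
        return -x

    def cloned(x):
        if abs(x) <= REGION:
            return -0.7 * x + 0.08   # small fitting error on the training region
        return 0.2 * x               # poor extrapolation off the training region

    def fraction_left_region(policy, horizon, n_rollouts=1000):
        """Fraction of rollouts that have left |x| <= REGION at least once."""
        left = 0
        for _ in range(n_rollouts):
            x, has_left = 0.0, False
            for _ in range(horizon):
                x = x + policy(x) + NOISE * rng.standard_normal()
                has_left = has_left or abs(x) > REGION
            left += has_left
        return left / n_rollouts

    for horizon in (25, 50, 100, 200):
        print(horizon,
              "expert:", fraction_left_region(expert, horizon),
              "cloned:", fraction_left_region(cloned, horizon))

The fraction of rollouts that have left the expert's state region is, by construction, non-decreasing in the horizon, and it grows much faster for the cloned policy than for the expert, which is exactly the compounding effect described above.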
A solution to this problem is to use DAgger-like algorithms (a sketch of the basic loop follows the reference below):
- Stéphane Ross, Geoffrey Gordon, and J. Andrew Bagnell, "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning," AISTATS, 2011.
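To give a flavour of the idea, here is a minimal sketch of a DAgger-style loop on a toy 1-D control problem. The environment dynamics, the linear policy class, and all names and constants are my own illustrative choices; only the overall loop structure (run the learner, query the expert on the visited states, aggregate the dataset, refit) follows the algorithm:

    # Minimal sketch of a DAgger-style training loop on a toy 1-D problem.
    import numpy as np

    rng = np.random.default_rng(0)

    def expert_action(x):
        return -x                      # the queried "driver"

    def rollout(policy, steps=100):
        """Run the given policy and record the states it actually visits."""
        x, states = 0.0, []
        for _ in range(steps):
            states.append(x)
            x = x + policy(x) + 0.1 * rng.standard_normal()
        return states

    def fit(states, actions):
        """Supervised learning step: here, 1-D least squares a ~ w * x."""
        X, y = np.asarray(states), np.asarray(actions)
        w = X @ y / (X @ X + 1e-8)
        return lambda x: w * x

    # Round 0: behaviour cloning on expert demonstrations.
    data_x = rollout(expert_action)
    data_a = [expert_action(x) for x in data_x]
    policy = fit(data_x, data_a)

    # DAgger rounds: run the *learner*, label its visited states with the
    # *expert*, aggregate, and refit.
    for _ in range(5):
        visited = rollout(policy)
        data_x += visited
        data_a += [expert_action(x) for x in visited]
        policy = fit(data_x, data_a)

The full algorithm additionally mixes the expert's actions into the data-collecting policy with a decaying probability in early rounds and returns the best policy among the iterates; the sketch above omits those details.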
Aside from the aforementioned work, which analyzes this phenomenon in the imitation learning context, how the agent's state distribution changes has also been studied in the reinforcement learning context by several researchers, including myself. I only refer to two papers here; see their references for further information. A schematic of the kind of bound they prove follows the list.
- Rémi Munos, "Performance Bounds in Lp Norm for Approximate Value Iteration," 2007.
- Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári, "Error Propagation for Approximate Policy and Value Iteration," NIPS, 2010.
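Schematically, and with constants and technical conditions suppressed, the bounds in this line of work have roughly the following flavour for approximate value iteration (this is a simplified form of my own writing, not a verbatim statement of either result):

    \| V^* - V^{\pi_K} \|_{p,\rho} \;\lesssim\; \frac{2\gamma}{(1-\gamma)^2}\, C_{\rho,\mu}^{1/p} \,\max_{0 \le k < K} \| \varepsilon_k \|_{p,\mu} \;+\; O(\gamma^K)

Here \varepsilon_k is the function approximation error at iteration k, \mu is the distribution generating the data, \rho is the distribution under which the performance is evaluated, and C_{\rho,\mu} is a concentrability coefficient measuring how far the future-state distributions induced by the encountered policies can drift away from \mu. That coefficient is precisely where the distribution mismatch enters the reinforcement learning analysis.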