Paper Review: Does computer vision matter for action?

Some of the problems in computer vision are motivated by their relevance to robotics and their prospective utility for systems that move and act in the physical world. However, a recent stream of research at the intersection of machine learning and robotics shows that models can be trained to map raw visual input directly to action, which challenges the view that action is a motivation for computer vision research. If a robotic system can be trained directly for the task at hand, with raw images as input and no explicit vision modules, what is the utility of further perfecting models for semantic segmentation, depth estimation, optical flow, and other computer vision tasks?

This question motivates the paper "Does computer vision matter for action?", authored by Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun of Intel Labs and the University of Texas at Austin.

EXPERIMENTS

The researchers reported controlled experiments that assessed whether specific vision capabilities are useful in mobile sensorimotor systems that act in complex three-dimensional environments. To conduct these experiments, they used realistic three-dimensional simulations derived from immersive computer games. The game engines were instrumented to support controlled execution of specific scenarios that simulate tasks such as driving a car, traversing a trail in rough terrain, and battling opponents in a labyrinth. Sensorimotor systems equipped with different vision modules were then trained, and their performance on these tasks was measured.

Two types of models were used in these experiments:

A baseline pixel-to-action model that was trained directly for the task at hand. These models do not rely on any computer vision module; they incorporate the assumption that perceptual capabilities will arise as needed in the course of learning to perform the requisite sensorimotor task.

A model that received, as additional input, the kinds of representations that are studied in computer vision research, such as semantic label maps, depth maps, and optical flow.
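The difference between the two model types comes down to what the agent sees as input. The sketch below shows one simple way to build both kinds of input with NumPy, by stacking the intermediate representations onto the RGB image as extra channels. It is an illustration only: the function names, the channel layout, and the one-hot encoding of the label map are assumptions made here, not the architecture or preprocessing used by the authors.

```python
import numpy as np


def one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Turn an (H, W) integer label map into an (H, W, num_classes) float array."""
    return np.eye(num_classes, dtype=np.float32)[labels]


def build_observation(image, depth=None, labels=None, flow=None, num_classes=0):
    """Stack the RGB image with any provided intermediate representations.

    Hypothetical helper: with no extras, the result is the plain
    pixel-to-action input; with extras, it is the augmented input.
    """
    channels = [image.astype(np.float32) / 255.0]              # (H, W, 3)
    if depth is not None:
        channels.append(depth.astype(np.float32)[..., None])   # (H, W, 1)
    if labels is not None:
        channels.append(one_hot(labels, num_classes))          # (H, W, C)
    if flow is not None:
        channels.append(flow.astype(np.float32))               # (H, W, 2)
    return np.concatenate(channels, axis=-1)


# Toy data standing in for one simulator frame (random values, not real data).
H, W, C = 4, 6, 5
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(H, W, 3), dtype=np.uint8)
depth = rng.random((H, W))
labels = rng.integers(0, C, size=(H, W))
flow = rng.standard_normal((H, W, 2))

baseline = build_observation(image)
augmented = build_observation(image, depth, labels, flow, num_classes=C)
print(baseline.shape, augmented.shape)
```

In this sketch the baseline input has 3 channels and the augmented input has 3 + 1 + C + 2 channels; the same downstream policy network could consume either, which keeps the comparison between the two conditions controlled.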

The experiments were run in two simulation environments: the open-world urban and suburban simulation Grand Theft Auto V [7-9] and the VizDoom platform for immersive three-dimensional battles [5, 10]. In these environments, three tasks were set up: urban driving, off-road trail traversal, and battle. The tasks are illustrated in the figure above.

For each task, two kinds of agents were trained: agents that act based on the raw visual input alone, and agents that are also provided with one or more intermediate representations. The intermediate representations are illustrated in the figure below.

RESULTS AND FINDINGS

The figure above shows that:

Intermediate representations clearly help. Therefore, computer vision does matter. When agents are provided with representations studied in computer vision, they perform better in sensorimotor tasks. The effect is significant and consistent across simulation platforms and tasks.

Some computer vision capabilities appear to be more impactful for mobile sensorimotor operation than others, especially depth estimation and semantic scene segmentation.

Agents that map the image directly to action (the pixel-to-action baseline) perform worst, while agents given all the intermediate representations used in these experiments (semantic and instance segmentation, monocular depth and normals, optical flow, and material properties (albedo)) perform best.

Supporting experiments. In the ‘Image + All (predicted)’ condition, the intermediate representations are predicted in situ by a convolutional network; the agent is not given ground-truth representations at test time. The results indicate that even predicted representations confer a significant advantage. Explicit computer vision is particularly helpful for generalization, because it provides abstractions that help the trained system sustain its performance in previously unseen environments.

CONCLUSION

In conclusion, the experiments carried out in this research indicate that computer vision does matter for action. The intermediate representations studied in computer vision are indeed useful for sensorimotor tasks. Models equipped with explicit intermediate representations train faster, achieve higher task performance, and generalize better to previously unseen environments.

Original paper can be found here