Humans generally make decisions through perception: we avoid obstacles because we see them. When this "sense-to-act" logic is applied to sensors and cameras, it becomes the core of an autonomous robot system. However, current machine autonomy remains far from human-level decision-making from visual data, especially in open-world sensory control tasks such as first-person-view (FPV) aerial navigation.
A new machine learning system that Microsoft recently shared offers fresh hope: helping drones make the right decisions from images alone.
Microsoft drew inspiration from first-person-view (FPV) drone racing, in which pilots plan and control a drone's course using only a monocular camera feed while rarely crashing. Microsoft therefore believed the same model could underpin the new system, mapping visual information directly to correct control actions.
Specifically, the new system explicitly separates the perception component (understanding what the drone "sees") from the control policy (deciding what to do), which lets researchers debug each deep neural module independently. Because the model must cope with subtle differences between simulation and the real world, Microsoft trained the system in a high-fidelity simulator called AirSim and then deployed it directly onto a real-world drone without modification.
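The modular design described above can be sketched as two independent stages that communicate only through a compact latent state. This is an illustrative assumption, not Microsoft's actual code: the function names, shapes, and random "weights" below are stand-ins for the real trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 10            # compact state the two modules communicate through
IMG_SHAPE = (72, 128, 3)   # H x W x RGB camera frame (assumed layout)

# Random stand-in "weights" for the two independently debuggable modules.
W_perception = rng.normal(size=(np.prod(IMG_SHAPE), LATENT_DIM)) * 0.01
W_control = rng.normal(size=(LATENT_DIM, 4)) * 0.1  # vx, vy, vz, yaw rate

def perceive(image: np.ndarray) -> np.ndarray:
    """Perception module: compress a raw frame into a low-dimensional state."""
    return image.reshape(-1) @ W_perception

def act(state: np.ndarray) -> np.ndarray:
    """Control policy: map the latent state to bounded velocity commands."""
    return np.tanh(state @ W_control)

frame = rng.random(IMG_SHAPE)       # fake camera frame with values in [0, 1)
command = act(perceive(frame))
print(command.shape)                # (4,)
```

Because the policy never touches raw pixels, either module can be swapped or inspected on its own, which is what makes the separation useful for debugging.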
Above: the drone Microsoft used in the tests
They also used an autoencoder framework called CM-VAE to bridge the gap between simulation and reality and thus avoid overfitting to synthetic data. With CM-VAE, each 128×72 image fed to the perception module is compressed from thousands of input values down to a low-dimensional representation of just 10 variables, enough to describe the scene's most essential state. Although the system encodes each image with only 10 variables, the decoded images still provide a rich description of what the drone "sees", including the size and position of objects and background information. Moreover, this dimensionality reduction is smooth and continuous.
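The compression idea behind a variational autoencoder can be sketched in a few lines. This is a generic VAE illustration, not Microsoft's actual CM-VAE: the linear maps below stand in for the real convolutional networks, and all shapes and names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

PIXELS = 128 * 72 * 3   # dimensionality of a flattened 128x72 RGB image
LATENT = 10             # dimensionality of the compressed representation

W_mu = rng.normal(size=(PIXELS, LATENT)) * 1e-3
W_logvar = rng.normal(size=(PIXELS, LATENT)) * 1e-3
W_dec = rng.normal(size=(LATENT, PIXELS)) * 1e-3

def encode(image_flat):
    """Encoder: predict mean and log-variance of the 10-D latent code."""
    return image_flat @ W_mu, image_flat @ W_logvar

def reparameterize(mu, logvar):
    """Sample a latent code; sampling is what keeps the latent space smooth."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def decode(z):
    """Decoder: reconstruct a full-size image from the 10 latent variables."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))  # sigmoid -> pixels in (0, 1)

x = rng.random(PIXELS)              # a fake flattened camera frame
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
print(z.shape, x_hat.shape)         # (10,) (27648,)
```

The point of the sketch is the bottleneck: the decoder must rebuild the whole frame from only 10 numbers, so those 10 variables are forced to capture the scene's essential state.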
To demonstrate the system’s capabilities, Microsoft tested a small, agile quadcopter with a front-facing camera, navigating purely from its RGB images.
The researchers tested drones running the system on a 45-meter-long S-shaped track lined with eight gates and on a 40-meter-long circular (O-shaped) track. The experiments show that the CM-VAE autoencoder framework performs far better than directly encoding the images, and the system completed its task successfully even under strong visual interference.
The image above shows side and top views of the test site
During the simulation training phase, the researchers tested the drone under visual conditions it had never seen, pushing the perception-control framework to its limits.
After simulation training, the system can "self-navigate" independently in challenging real-world environments, making it a candidate for deployment in search-and-rescue missions. The researchers said the system shows great practical potential: an autonomous search-and-rescue robot would be better able to identify and help humans regardless of differences in age, size, gender, race, and other factors.