Advancements in Synthetic Video Generation for Autonomous Driving
1 OVERVIEW
Advanced driver assistance systems (ADAS) and their application in autonomous vehicles are evolving rapidly. One of the main factors in focus is the availability of real-world or realistic datasets for training deep-learning autonomous driving models. The quality, quantity, and diversity of these datasets are important when training, testing, and validating deep-learning models. Training a model for diverse conditions is a major challenge, as it is difficult to assemble a dataset that covers multiple geographies, weather conditions, surroundings, and road conditions.
According to the RAND Corporation report [1], autonomous vehicles must be driven billions of miles to demonstrate their reliability in terms of safety and security, which would take almost 10 years of data collection. This manual driving approach would also require a workforce and proper sensor configurations, making it expensive. Simulators such as CARLA [2] can reduce the reality gap and generate large trainable datasets, but the rendered scenarios appear unrealistic compared with real-world scenes. Deep neural networks can help us generate realistic data specific to the ADAS applications under consideration.
Our proposed system is based on generative adversarial networks (GANs) and uses a semantic region-adaptive normalization (SEAN) [3] based image generator. The article explores ways of generating semantic inputs for GANs, as well as image-processing techniques for pre-processing those semantic inputs and post-processing the generator outputs. We also propose an evaluation metric that reduces human intervention. The system employs ideas from existing video-synthesis solutions [4], [5].
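To make the SEAN-based conditioning concrete, the sketch below shows the core idea: features are normalized and then re-modulated with per-pixel scale and shift parameters derived from the semantic map and per-region style codes. This is a minimal PyTorch illustration under our own assumptions (one-hot segmentation maps, a fixed style-code dimension); the layer names and sizes are illustrative, not the SEAN authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAdaptiveNorm(nn.Module):
    # Normalizes features, then modulates them with per-pixel scale/shift
    # parameters built from a one-hot segmentation map and per-region style codes.
    def __init__(self, feat_ch, num_classes, style_dim=64, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        # Branch 1: modulation driven by per-region style codes.
        self.style_gamma = nn.Conv2d(style_dim, feat_ch, 3, padding=1)
        self.style_beta = nn.Conv2d(style_dim, feat_ch, 3, padding=1)
        # Branch 2: modulation driven by the segmentation map itself (SPADE-like).
        self.seg_shared = nn.Sequential(
            nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU(inplace=True)
        )
        self.seg_gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.seg_beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned blend between branches

    def forward(self, feat, segmap, style_codes):
        # feat: (B, F, H, W); segmap: (B, C, Hs, Ws) one-hot; style_codes: (B, C, S).
        segmap = F.interpolate(segmap, size=feat.shape[2:], mode="nearest")
        x = self.norm(feat)
        # Broadcast each region's style code to every pixel of that region.
        style_map = torch.einsum("bcs,bchw->bshw", style_codes, segmap)
        h = self.seg_shared(segmap)
        gamma = self.alpha * self.style_gamma(style_map) + (1 - self.alpha) * self.seg_gamma(h)
        beta = self.alpha * self.style_beta(style_map) + (1 - self.alpha) * self.seg_beta(h)
        return x * (1 + gamma) + beta

Because the modulation parameters are tied to semantic regions rather than to the whole image, the style of each region (road, sky, vegetation) can be controlled independently, which is what makes this normalization attractive for generating diverse driving scenes.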
The video-to-video synthesis method [6] takes semantic labels as input and produces high-quality photorealistic videos by conditioning on previous frames and flow maps, using a cascaded approach to generate higher-quality outputs. The common problem of long-term temporal coherence is addressed in our solution by taking cues from world-consistent video-to-video synthesis [4], which injects guidance images through spatially adaptive normalization (SPADE). Adding depth [7] and guidance images along with semantic [8] and flow inputs further refines the output.
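To illustrate the flow-map step, the following sketch warps the previously generated frame toward the current time step using a dense optical-flow field, the operation vid2vid-style models rely on for short-term temporal coherence. It is a minimal PyTorch illustration under our own assumptions (flow given in pixel units with (x, y) channel ordering); it is not the cited authors' implementation.

import torch
import torch.nn.functional as F

def warp_previous_frame(prev_frame, flow):
    # prev_frame: (B, 3, H, W); flow: (B, 2, H, W) pixel displacements (x, y).
    b, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev_frame.device),
        torch.arange(w, device=prev_frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()  # (2, H, W) pixel coordinate grid
    coords = base.unsqueeze(0) + flow            # displaced sampling positions
    # Normalize coordinates to [-1, 1], the range grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)         # (B, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)

In a full pipeline, the warped frame is blended with newly synthesized pixels using a predicted occlusion mask, so only regions that become visible for the first time are generated from scratch.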
1. https://www.rand.org/pubs/research_reports/RR1478.html
2. https://scholar.uwindsor.ca/etd/8305/
3. https://arxiv.org/abs/1911.12861
4. https://arxiv.org/abs/2007.08509
5. https://arxiv.org/abs/1808.06601
6. https://arxiv.org/abs/1910.12713
7. https://arxiv.org/abs/2007.08854
8. https://arxiv.org/abs/1911.10194