BeyondVision: A Multi-Modal Video Generative Model

Abstract

We present BeyondVision, a multi-modal video generative model that integrates wireless sensing signals with video input to enhance environmental understanding and predictive accuracy. Unlike traditional generative models that rely primarily on visual data or text prompts, BeyondVision leverages wireless sensing to interpret spatial and contextual information beyond the camera's field of view, even under challenging visibility conditions. It also captures physical properties invisible to RGB cameras that significantly influence future video frames. By fusing video with these additional modalities, BeyondVision builds a richer and more comprehensive environmental representation, improving predictions of object dynamics and spatial relationships. In extensive evaluations across challenging conditions, including low visibility and non-line-of-sight environments, BeyondVision improves video generation quality by up to 21% and human action prediction accuracy by up to 27.9%, significantly outperforming vision-only methods.
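The abstract describes fusing wireless-sensing data with video as extra conditioning for the generative model. The page does not give implementation details, so the following is a purely illustrative sketch, with all names and shapes hypothetical: sensor features (e.g., radar or WiFi embeddings) are projected into the video token space and appended as additional conditioning tokens.

```python
import numpy as np

def fuse_conditioning(video_tokens, sensor_feats, w_sensor):
    """Project wireless-sensing features into the video token space and
    append them as extra conditioning tokens (illustrative only; in a real
    model w_sensor would be a learned projection feeding cross-attention)."""
    sensor_tokens = sensor_feats @ w_sensor               # (n_sensor, d_model)
    return np.concatenate([video_tokens, sensor_tokens], axis=0)

# Illustrative shapes: 64 video tokens of width 128, 8 sensor feature vectors.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(64, 128))
sensor_feats = rng.normal(size=(8, 32))
w_sensor = rng.normal(size=(32, 128)) * 0.1

tokens = fuse_conditioning(video_tokens, sensor_feats, w_sensor)
print(tokens.shape)  # (72, 128)
```

Appending projected tokens is one common conditioning pattern; the actual BeyondVision fusion mechanism may differ.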


Low-light Road Condition w/ LiDAR & Radar

[Animated video comparisons: Original clip, Vista baseline, and Ours. FVD reported per clip; lower is better.]

Clip   Vista (FVD)   Ours (FVD)
1      296           255
2      294           267
3      624           505
4      134           82
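FVD (Fréchet Video Distance; lower is better) is the Fréchet distance between Gaussian statistics of features extracted by a pretrained video network (typically I3D). A minimal numpy sketch of the distance computation itself, with feature extraction assumed to have happened elsewhere:

```python
import numpy as np

def _psd_sqrt(m):
    # Square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two feature sets
    (rows = samples). FVD applies this to video-network embeddings."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    s_a = _psd_sqrt(cov_a)
    covmean = _psd_sqrt(s_a @ cov_b @ s_a)  # sqrt of cov_a @ cov_b, symmetrized
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))
print(frechet_distance(x, x))  # ~0 for identical feature sets
```

Identical feature distributions give a distance near zero; a mean shift or covariance mismatch increases it, which is why lower FVD indicates generated videos statistically closer to the originals.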

Through-wall Human Movement w/ WiFi

[Animated video comparisons: Original, SVD baseline, and Ours — three examples.]

Walking in Low-visibility w/ Infrared

[Animated video comparison: Original (Infrared), SVD baseline, and Ours — one example.]

Temperature-aware Action w/ Infrared

[Animated video comparisons: Original (RGB), Original (Infrared), SVD baseline, and Ours — three examples.]