BeyondVision: A Multi-Modal Video Generative Model

Abstract

We present BeyondVision, a multi-modal video generative model that integrates wireless sensing signals with video input to enhance environmental understanding and predictive accuracy. Unlike traditional generative models that rely primarily on visual data or text prompts, BeyondVision leverages wireless sensing to interpret spatial and contextual information beyond the camera's field of view, even under challenging visibility conditions. It also captures physical properties invisible to RGB cameras that significantly influence future video frames. By fusing video with these additional modalities, BeyondVision builds a richer and more comprehensive environmental representation, improving predictions of object dynamics and spatial relationships. In extensive evaluations across challenging conditions, including low visibility and non-line-of-sight environments, BeyondVision improves video generation quality by up to 21% and human action prediction accuracy by up to 27.9%, significantly outperforming vision-only methods.
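The abstract describes fusing wireless-sensing data with video as extra conditioning for the generative model. The page does not give implementation details, so the following is a purely illustrative sketch, with all names and shapes hypothetical: sensor features (e.g., radar or WiFi embeddings) are projected into the video token space and appended as additional conditioning tokens.

```python
import numpy as np

def fuse_conditioning(video_tokens, sensor_feats, w_sensor):
    """Project wireless-sensing features into the video token space and
    append them as extra conditioning tokens (illustrative only; in a real
    model w_sensor would be a learned projection feeding cross-attention)."""
    sensor_tokens = sensor_feats @ w_sensor               # (n_sensor, d_model)
    return np.concatenate([video_tokens, sensor_tokens], axis=0)

# Illustrative shapes: 64 video tokens of width 128, 8 sensor feature vectors.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(64, 128))
sensor_feats = rng.normal(size=(8, 32))
w_sensor = rng.normal(size=(32, 128)) * 0.1

tokens = fuse_conditioning(video_tokens, sensor_feats, w_sensor)
print(tokens.shape)  # (72, 128)
```

Appending projected tokens is one common conditioning pattern; the actual BeyondVision fusion mechanism may differ.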


Low-light Road Condition w/ LiDAR & Radar

[Animated video comparisons: Original clip, Vista baseline, and Ours. FVD reported per clip; lower is better.]

Clip   Vista (FVD)   Ours (FVD)
1      296           255
2      294           267
3      624           505
4      134           82
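FVD (Fréchet Video Distance; lower is better) is the Fréchet distance between Gaussian statistics of features extracted by a pretrained video network (typically I3D). A minimal numpy sketch of the distance computation itself, with feature extraction assumed to have happened elsewhere:

```python
import numpy as np

def _psd_sqrt(m):
    # Square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two feature sets
    (rows = samples). FVD applies this to video-network embeddings."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    s_a = _psd_sqrt(cov_a)
    covmean = _psd_sqrt(s_a @ cov_b @ s_a)  # sqrt of cov_a @ cov_b, symmetrized
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))
print(frechet_distance(x, x))  # ~0 for identical feature sets
```

Identical feature distributions give a distance near zero; a mean shift or covariance mismatch increases it, which is why lower FVD indicates generated videos statistically closer to the originals.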

Through-wall Human Movement w/ WiFi

[Animated video comparisons: Original, SVD baseline, and Ours — three examples.]

Walking in Low-visibility w/ Infrared

[Animated video comparison: Original (Infrared), SVD baseline, and Ours — one example.]

Temperature-aware Action w/ Infrared

[Animated video comparisons: Original (RGB), Original (Infrared), SVD baseline, and Ours — three examples.]