We present BeyondVision, a multi-modal video generative model that integrates wireless sensing signals with video input to improve environmental understanding and predictive accuracy. Unlike traditional generative models, which rely primarily on visual data or text prompts, BeyondVision uses wireless sensing to interpret spatial and contextual information beyond the camera’s field of view, even in challenging visibility conditions. It can also capture physical properties that are invisible to RGB cameras yet significantly influence future video frames. By fusing video with other modalities, BeyondVision builds a richer, more comprehensive representation of the environment, which improves its predictions of object dynamics and spatial relationships. Across extensive scenario evaluations, BeyondVision achieves up to a 21% improvement in video generation quality and a 27.9% improvement in human action prediction accuracy, significantly outperforming vision-only methods under challenging conditions such as low visibility and non-line-of-sight environments.

Original | Vista | Ours |
---|---|---|
![]() | ![]() FVD: 296 | ![]() FVD: 255 |
![]() | ![]() FVD: 294 | ![]() FVD: 267 |
![]() | ![]() FVD: 624 | ![]() FVD: 505 |
![]() | ![]() FVD: 134 | ![]() FVD: 82 |

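
To make the per-clip FVD scores in the comparison above easier to read, the short sketch below converts each Vista/Ours pair into a relative reduction. FVD (Fréchet Video Distance) is a lower-is-better metric, so a positive percentage means the "Ours" clip scored better. The helper name `relative_reduction` is hypothetical and introduced only for illustration. These are figures for four individual samples, not the aggregate numbers quoted in the abstract, which the page does not break down.

```python
# Per-clip FVD values copied from the Original / Vista / Ours comparison (lower is better).
# Each tuple is (Vista FVD, Ours FVD).
fvd_pairs = [
    (296, 255),
    (294, 267),
    (624, 505),
    (134, 82),
]


def relative_reduction(baseline: float, ours: float) -> float:
    """Return the percentage by which `ours` lowers the baseline FVD.

    Hypothetical helper for illustration; not part of any released code.
    """
    return 100.0 * (baseline - ours) / baseline


for vista, ours in fvd_pairs:
    pct = relative_reduction(vista, ours)
    print(f"Vista {vista} -> Ours {ours}: {pct:.1f}% lower FVD")
```

Because the reduction is computed relative to the baseline score, clips with a low baseline FVD can show a large percentage change from a small absolute difference, so per-clip values like these should be read alongside the absolute scores.
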
Original | SVD | Ours |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |

Original (Infrared) | SVD | Ours |
---|---|---|
![]() | ![]() | ![]() |

Original (RGB) | Original (Infrared) | SVD | Ours |
---|---|---|---|
![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() |