The challenge is now live!
Don't know where to start? Check out the starter kits for the Temporal Alignment Track and the Spatial Alignment Track.
Generate Synchronized & Contextually Accurate Videos
Welcome to the Sounding Video Generation (SVG) Challenge 2024!
The Sounding Video Generation (SVG) Challenge 2024 is a competition to build AI models that generate videos whose visuals and sound match, such as a dog's bark occurring exactly when the dog is seen barking. Participants work to improve how well sounds and scenes align, with prizes for the best results.
This challenge invites you to build models that generate synchronized and contextually accurate videos. You can showcase your skills and push the boundaries of sounding video generation in two tracks:
- Temporal Alignment
- Spatial Alignment
Introduction
Video generation research has progressed significantly, with large-scale diffusion models producing realistic videos. However, sounding video generation, which involves well-aligned video and audio modalities, remains underexplored. The SVG Challenge aims to advance this field by providing a platform for benchmarking and showcasing state-of-the-art models.
The Sounding Video Generation Challenge
Build state-of-the-art AI models to generate videos, ensuring the audio is synchronized and contextually appropriate.
Temporal Alignment Track
This track aims to generate videos that are temporally and semantically aligned with their corresponding audio. This involves producing high-resolution videos (256x256 pixels, 8fps) with monaural audio (1 channel, 16kHz).
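As a rough illustration of what this output format implies, the sketch below derives the expected video and audio array shapes for a clip and maps a video frame to the audio samples that play during it. The `ClipSpec` class, the array layouts, and the 2-second clip length are assumptions made for the example; only the resolution, frame rate, channel count, and sample rate come from the track description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Output format for one track (hypothetical helper, not part of any starter kit)."""
    height: int
    width: int
    fps: int
    audio_channels: int
    sample_rate: int

    def video_shape(self, seconds: float) -> tuple:
        # Assumed layout: (frames, height, width, RGB channels).
        return (round(seconds * self.fps), self.height, self.width, 3)

    def audio_shape(self, seconds: float) -> tuple:
        # Assumed layout: (channels, samples).
        return (self.audio_channels, round(seconds * self.sample_rate))

    def frame_sample_range(self, frame_index: int) -> tuple:
        # Half-open range of audio samples that play while this frame is shown.
        start = frame_index * self.sample_rate // self.fps
        end = (frame_index + 1) * self.sample_rate // self.fps
        return (start, end)


# 256x256 pixels, 8 fps, monaural 16 kHz audio, as stated for this track.
TEMPORAL_TRACK = ClipSpec(height=256, width=256, fps=8, audio_channels=1, sample_rate=16_000)

CLIP_SECONDS = 2.0  # hypothetical duration, not specified in the track description
print(TEMPORAL_TRACK.video_shape(CLIP_SECONDS))
print(TEMPORAL_TRACK.audio_shape(CLIP_SECONDS))
print(TEMPORAL_TRACK.frame_sample_range(3))
```

At 8 fps and 16 kHz, each video frame spans 2,000 audio samples (125 ms), so a sound event that drifts by more than one frame is off by at least that much.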
You will tackle two types of alignment:
- Semantic Alignment: The audio's semantic class should match the video. For instance, if the video shows a dog barking, the audio should contain a barking sound.
- Temporal Alignment: The audio should be synchronized with the video. For example, the barking sound should occur precisely when the dog is seen barking.
In this track, submissions will be evaluated on how well the audio and video synchronize over time. Participants will train on a customized dataset, SVGTA24, which is derived from the Greatest Hits dataset and comes with prepared video captions. A baseline model based on AnimateDiff and AudioLDM is provided. Submissions will be tested on a set of text prompts to assess synchronization.
More details are available on the Temporal Alignment Track page.
Spatial Alignment Track
This track aims to create videos with spatially aligned audio, giving a sense of space and direction. This involves producing high-resolution videos (256x256 pixels, 4fps) with stereo audio (2 channels, 16kHz).
Participants should focus on generating videos where the spatial alignment of the audio enhances the sense of space and direction, ensuring that the audio and video components are well-integrated.
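To make "sense of direction" in a stereo signal concrete, here is a minimal sketch of one simple cue: the level difference between the left and right channels. This is an illustrative stand-in only, not the challenge's evaluation metric, and the function names and the test tone are hypothetical.

```python
import math

SAMPLE_RATE = 16_000  # Hz, as stated for this track


def rms(samples):
    """Root-mean-square level of one channel."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_difference_db(left, right, eps=1e-12):
    """Left-minus-right level in dB; positive means the left channel is louder."""
    return 20.0 * math.log10((rms(left) + eps) / (rms(right) + eps))


# Hypothetical test signal: one second of a 440 Hz tone, half as loud on the right.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
left = tone
right = [0.5 * s for s in tone]

print(round(level_difference_db(left, right), 2))
```

Halving the amplitude on one channel yields a difference of about 6 dB, which a listener would tend to hear as the source sitting toward the louder side. A spatially aligned video would show the sounding object on that same side of the frame.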
Participants will use a customized dataset, SVGSA24, derived from the STARSS23 dataset. The original videos, which have an equirectangular view and Ambisonics audio, have been converted to videos with a perspective view and stereo audio. The content has also been curated to focus on on-screen speech and instrument sounds. Participants will train on this dataset and submit systems that generate video and 2-channel audio signals. A baseline model based on MM-Diffusion is provided. Evaluation will consider how well the generated video and audio align spatially.
More details are available on the Spatial Alignment Track page.
Timeline
The SVG Challenge takes place in two rounds, with an additional warm-up round. The tentative launch dates are:
- Warm-up Round: 29th Oct 2024
- Phase I: 2nd Dec 2024
- Phase II: 3rd Jan 2025
- Challenge End: 25th Mar 2025
Prizes
The total prize pool is $35,000, divided between the two tracks. Teams can win prizes across multiple leaderboards.
Track 1: Temporal Alignment ($17,500)
- First place: $10,000
- Second place: $5,000
- Third place: $2,500
Track 2: Spatial Alignment ($17,500)
- First place: $10,000
- Second place: $5,000
- Third place: $2,500
Please refer to the Challenge Rules for more details on the Open Sourcing criteria for eligibility.