In this task the system must synthesize sounds to match a silent video.
The system is trained on 1000 example videos with sound, each showing a drum stick striking different surfaces and creating different sounds. A deep learning model associates the video frames with a database of pre-recorded sounds in order to select the sound that best matches what is happening in the scene.
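The selection step can be pictured as a nearest-neighbor lookup. The sketch below is only an illustration under stated assumptions, not the method used in the work described: it assumes the model outputs a feature vector for a video clip and that every sound in the database has a feature vector of the same length. The names `select_sound`, `sound_db`, and `predicted_features`, the three-number vectors, and the choice of cosine similarity are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_sound(predicted, database):
    """Return the name of the database sound closest to the predicted features."""
    return max(database, key=lambda name: cosine_similarity(predicted, database[name]))


# Hypothetical database: sound name -> made-up feature vector.
sound_db = {
    "wood": [1.0, 0.0, 0.0],
    "metal": [0.0, 1.0, 0.0],
    "leaves": [0.0, 0.0, 1.0],
}

# Hypothetical model output for one video clip (assumed, not real data).
predicted_features = [0.1, 0.9, 0.2]

best_match = select_sound(predicted_features, sound_db)
print(best_match)
```

In a real system the feature vectors would come from the trained network rather than being written by hand, and the database would hold many more sounds.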
The system was then evaluated using a Turing-test-like setup in which humans had to determine which video had the real sounds and which had the fake (synthesized) ones.
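One simple way to summarize such an evaluation is the fraction of trials in which a person picked the synthesized sound as the real one. The snippet below is a minimal sketch of that arithmetic; the function name `fooling_rate` and the list of responses are invented for illustration and are not results from the study.

```python
def fooling_rate(fooled):
    """Fraction of trials in which the participant mistook the synthesized sound for the real one."""
    if not fooled:
        raise ValueError("at least one trial is required")
    return sum(fooled) / len(fooled)


# Hypothetical responses: True means the participant chose the synthesized sound as real.
responses = [True, False, True, True, False, False, True, False]

rate = fooling_rate(responses)
print(f"Participants were fooled in {rate:.0%} of trials")
```

If participants were guessing at random between the two videos, this rate would sit near 50%, so values close to that would suggest the synthesized sounds are hard to tell apart from the real ones.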
This is a very cool application of both convolutional neural networks and LSTM recurrent neural networks.