People who can play a musical instrument carry a certain aura in life! But learning an instrument is genuinely hard; countless people are stuck in the endless loop between "getting started" and "giving up." So if you can't play an instrument, does that mean you can't make good music? Recently, MIT, together with the MIT-IBM Watson AI Lab, developed an AI model called Foley Music that reconstructs the matching soundtrack purely from a performer's playing gestures!
And it works regardless of the instrument: violin, piano, ukulele, guitar, all of them.
Just pick up an instrument, and it sounds like a professional concert! If you prefer a different tonality, you can even edit the generated music into other keys, such as A, F, or G.
The technical paper, Foley Music: Learning to Generate Music from Videos, has been accepted to ECCV 2020.
Next, let’s look at how the AI model restores music.
Foley Music: an AI that plays a variety of instruments
Just as generating a dance soundtrack requires knowing the dancer's body movements and dance style, generating a soundtrack for an instrument performance requires knowing the player's gestures and movements, as well as the instrument being used.
Given a performance video, the AI automatically locks onto the target performer's body keypoints, as well as the instrument being played and its sound.
Body keypoints: this is handled by the visual perception module of the system, which works from two kinds of cues: body pose and hand gestures. It typically extracts 25 2D keypoints for the body and 21 2D keypoints for each hand.
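To make the keypoint representation concrete, here is a minimal sketch in plain Python of how the per-frame 2D keypoints could be packed into one feature vector. The function and variable names are illustrative, not from the paper, which builds on an off-the-shelf pose detector:

```python
# Sketch: packing one frame's 2D keypoints into a flat feature vector.
# Counts follow the article: 25 body points, 21 points per hand.
# Names are illustrative, not the paper's actual code.

BODY_POINTS = 25
HAND_POINTS = 21  # per hand

def frame_features(body, left_hand, right_hand):
    """Flatten the (x, y) keypoints of one video frame into a list."""
    assert len(body) == BODY_POINTS
    assert len(left_hand) == len(right_hand) == HAND_POINTS
    feats = []
    for x, y in body + left_hand + right_hand:
        feats.extend([x, y])
    return feats

# A dummy frame with every keypoint at the origin.
body = [(0.0, 0.0)] * BODY_POINTS
lhand = [(0.0, 0.0)] * HAND_POINTS
rhand = [(0.0, 0.0)] * HAND_POINTS
vec = frame_features(body, lhand, rhand)
print(len(vec))  # (25 + 21 + 21) * 2 = 134
```

A video is then just a sequence of such vectors, one per frame, which the visual encoder consumes.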
Instrument sound extraction: for the audio representation model, the researchers adopted the Musical Instrument Digital Interface (MIDI) as the audio representation. This is the key factor that distinguishes Foley Music from other models.
For a six-second performance clip, about 500 MIDI events are typically generated, and these can be easily imported into a standard music synthesizer to produce the music waveform, the researchers said.
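To see what such an event sequence looks like, here is a minimal sketch (plain Python; the event names and helper are illustrative) that turns a few notes into the kind of discrete note-on / note-off / time-shift stream the article describes:

```python
# Sketch: turning notes into a MIDI-style event sequence
# (NOTE_ON / NOTE_OFF / TIME_SHIFT). Event names are illustrative,
# not taken from the paper's implementation.

def notes_to_events(notes):
    """notes: list of (pitch, start_sec, end_sec). Returns event tuples."""
    boundaries = []
    for pitch, start, end in notes:
        boundaries.append((start, "NOTE_ON", pitch))
        boundaries.append((end, "NOTE_OFF", pitch))
    boundaries.sort()
    events, now = [], 0.0
    for t, kind, pitch in boundaries:
        if t > now:
            events.append(("TIME_SHIFT", round(t - now, 3)))
            now = t
        events.append((kind, pitch))
    return events

# Two overlapping notes (middle C and the E above it) in a short clip.
events = notes_to_events([(60, 0.0, 1.0), (64, 0.5, 1.5)])
print(events)
```

A real six-second clip with dense playing easily accumulates hundreds of such events, consistent with the roughly 500 the researchers report.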
Once this information is extracted, the visual-audio model consolidates it and transforms it into the music that finally matches the video.
Let’s first look at its full architecture diagram: it consists mainly of visual encoding, MIDI decoding, and MIDI waveform output.
Visual encoding: the visual information is encoded and passed to the MIDI decoder. Keypoint coordinates are extracted from the video frames, and a graph convolutional network (GCN) captures latent representations of the body's motion over time.
MIDI decoding: a Graph-Transformer models the correlations between human pose features and MIDI events. The Transformer is an autoregressive encoder-decoder generative model originally built for machine translation; here it predicts the sequence of MIDI events from the pose features.
MIDI output: the MIDI events are converted into the final waveform by a standard audio synthesizer.
Experimental results
The researchers confirmed that Foley Music far outperforms existing models. In comparison experiments, they trained Foley Music on three datasets, selecting nine instruments for comparison against GAN-based, SampleRNN, and WaveNet models.
The datasets, AtinPiano, MUSIC, and URMP, cover roughly 1,000 high-quality performance videos in more than 11 categories. The nine instruments are accordion, bass, bassoon, cello, guitar, piano, tuba, ukulele, and violin, and every video is 6 seconds long. The results of the quantitative assessment follow:
As you can see, the Foley Music model achieves its highest prediction performance, 72%, on the bass, while the other models reach at most 8%.
In addition, the advantage is even more pronounced on the following human-evaluation indicators:
Correctness: how relevant the generated song is to the video content.
Noise: whose music has the least noise.
Synchronization: whose song best aligns with the video content in time.
The yellow bars are the Foley Music model, which outperforms the other models on every indicator: it scores above 0.6 on correctness, noise, and synchronization, while the others stay below 0.4, across all nine instruments.
The researchers also found that, compared with the other baseline systems, the MIDI representation helps improve sound quality, semantic alignment, and temporal synchronization.
GAN-based model: it takes human pose features as input, and a discriminator judges whether the spectrogram generated from those features is real or fake; after repeated training, the spectrogram is converted to an audio waveform via the inverse Fourier transform.
SampleRNN: an unconditional end-to-end neural audio generation model that is structurally simpler than WaveNet and generates audio faster at the sample level.
WaveNet: an audio generation model introduced by Google DeepMind that performs well on text-to-speech and general speech generation.
In addition, an advantage of the model lies in its extensibility. The MIDI representation is fully interpretable and transparent, so the predicted MIDI sequences can be edited to produce music in different keys, such as A, G, or F. This is not possible with models that use waveforms or spectrograms as the audio representation.
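Because MIDI events carry explicit pitch numbers, re-keying amounts to adding a fixed semitone offset to every note event. A minimal sketch (helper and event names are illustrative):

```python
# Sketch: editing a predicted MIDI event sequence into a new key by
# shifting every note's pitch by a fixed number of semitones. This is
# only possible because the representation is symbolic; a raw waveform
# or spectrogram offers no such handle.

def transpose(events, semitones):
    """events: list of (kind, value); shift pitch on note events only."""
    out = []
    for kind, value in events:
        if kind in ("NOTE_ON", "NOTE_OFF"):
            out.append((kind, value + semitones))
        else:  # e.g. TIME_SHIFT events keep their value unchanged
            out.append((kind, value))
    return out

# Middle C (MIDI note 60) up a perfect fourth (+5 semitones): C -> F.
events = [("NOTE_ON", 60), ("TIME_SHIFT", 0.5), ("NOTE_OFF", 60)]
print(transpose(events, 5))
```

The same one-line edit moves a whole performance to any target key before the synthesizer renders the waveform.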
In the paper, the researchers show that by establishing a strong correlation between visual and musical signals through body keypoints and the MIDI representation, the study achieves this kind of musical style expansion, opening a promising research path for connecting video and music.
Below is a YouTube video to get a feel for the AI's music!