This page offers a selection of resources related to my teaching and research activities.
Teaching resources
Tutorials
Video talks
Videos of the Workshop on Musical Timbre, co-organised by Juan P. Bello, Matthias Mauch, Geoffroy Peeters and myself at Télécom ParisTech in November 2011:
- What is Musical Timbre? by Michèle Castellengo
- Deep Network Geometry of Timbre, by Stéphane Mallat
- How Do We Perceive Timbre? by Tim D. Griffiths
Class materials
The following are mostly in French.
- Python, Numpy, Scipy tutorial notebooks.
- SI227 - Études de cas en signal (case studies in signal processing)
- SI393 - ATHENS week: Multimedia Indexing and Retrieval
- PESTO Web - Machine learning
- MDI343 - Apprentissage statistique et fouille de données (statistical learning and data mining)
- MDI224 - Méthodes d'optimisation continue et applications (continuous optimisation methods and applications)
- Cours indexation audio (audio indexing), M2 ENIT-Paris V
- Cours codage audio (audio coding), INT
Software resources
Yaafe is "yet another audio feature extractor" initially developed in 2009 by Benoit Mathieu at Télécom ParisTech, and later maintained by Thomas Fillon and others. It is a software designed for the efficient computation of many audio features to be extracted simultaneously. Yaafe automatically organizes the computation flow so that the intermediate representations (FFT, CQT, envelope, etc.), on the basis of which most audio features are composed, are computed only once. Further, the computations are performed block per block, so yaafe can analyze arbitrarily long audio files.
soft_cofact is a set of Matlab scripts, written by N. Seichepine, which compute both:
- l2-smooth and piecewise constant (l1-smooth, TV-like) nonnegative matrix factorisation (NMF);
- and soft nonnegative matrix co-factorisation with IS or KL divergence and l1 or l2 coupling, useful for multiview and multimodal settings.
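As a rough sketch of the kind of objective involved in the co-factorisation case (the notation and weighting used here are assumed and may differ from the toolbox and the corresponding paper): two views V_1 and V_2 are factorised jointly, with their activations encouraged to agree through a coupling penalty rather than forced to be equal,

```latex
\min_{W_1, H_1, W_2, H_2 \,\geq\, 0} \;
  D_\beta\!\left(V_1 \mid W_1 H_1\right)
  + \gamma \, D_\beta\!\left(V_2 \mid W_2 H_2\right)
  + \delta \, \lVert H_1 - H_2 \rVert_p^p ,
  \qquad p \in \{1, 2\},
```

where D_beta is the Itakura-Saito (beta = 0) or Kullback-Leibler (beta = 1) divergence and p selects the l1 or l2 coupling of the activations.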
Beta NMF: Theano-based GPGPU implementation of NMF with beta-divergence and multiplicative updates; by Romain Serizel.
Group NMF: Theano-based GPGPU implementation of group-NMF with class and session similarity constraints; by Romain Serizel.
Mini batch NMF: Theano-based GPGPU implementation of NMF with beta-divergence and mini-batch multiplicative updates; by Romain Serizel.
Supervised (group) NMF: Python code to perform task-driven NMF and task-driven group NMF; by Romain Serizel and Victor Bisot.
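For reference, here is a plain NumPy sketch of the beta-divergence multiplicative updates that the above implementations accelerate on GPU (a minimal CPU version, without the mini-batch, group or task-driven refinements of the actual codes):

```python
import numpy as np

def beta_nmf(V, n_components, beta=1.0, n_iter=200, eps=1e-12, seed=0):
    """NMF of a nonnegative matrix V with beta-divergence multiplicative updates.
    beta = 2: Euclidean, beta = 1: Kullback-Leibler, beta = 0: Itakura-Saito."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, N)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # H <- H * (W^T (V_hat^(beta-2) * V)) / (W^T V_hat^(beta-1))
        H *= (W.T @ (V_hat ** (beta - 2) * V)) / (W.T @ V_hat ** (beta - 1))
        V_hat = W @ H + eps
        # Same rule for W, transposed
        W *= ((V_hat ** (beta - 2) * V) @ H.T) / (V_hat ** (beta - 1) @ H.T)
    return W, H
```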
More software resources by the ADASP team here.
Research datasets
EMOEEG is a multimodal dataset for dynamic EEG-based emotion recognition with audiovisual elicitation. Read more »
The UE-HRI dataset is a multimodal dataset collected for the study of user engagement in spontaneous Human-Robot Interactions (HRI). It consists of recordings of humans interacting with the social robot Pepper, captured with a wide range of heterogeneous sensors: a microphone array, cameras, depth sensors, sonars and lasers, along with user feedback collected through Pepper’s touch screen. Read more »
The 3DLife/Huawei ACM MM GC 2011 dataset consists of multiview and multimodal recordings of Salsa dancers, captured at different sites, in particular at our local studio, with different pieces of equipment. Read more »
More datasets by the ADASP team here.
Research demos
Here is a selection of demos related to recent work I have contributed to. Credit goes to the postdocs, Master's and PhD students who prepared most of the following and are mentioned by name below, as well as to the colleagues who took part in this research.
Weakly supervised representation learning for unsynchronized audio-visual events
Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. This is the core of the PhD work of my student Sanjeel Parekh, co-advised with Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez and Gaël Richard. In this work, we have proposed a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements.
The following video, prepared by Sanjeel Parekh, shows additional localization results for audio and visual cues depicting an event. Please refer to our paper for more details.
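As a generic, self-contained illustration of the multiple instance learning ingredient mentioned above (this is not the model of the paper, which learns deep audio and visual representations), per-instance class scores are aggregated by a pooling operator so that they can be trained against weak, clip-level labels, while the instance scores themselves indicate where an event occurs:

```python
import numpy as np

def mil_pooling(instance_scores, mode="max"):
    """Aggregate per-instance class scores of shape (n_instances, n_classes)
    into a single clip-level prediction, as in multiple instance learning."""
    if mode == "max":    # an event is present if at least one instance fires
        return instance_scores.max(axis=0)
    if mode == "mean":   # softer assumption: average the evidence over instances
        return instance_scores.mean(axis=0)
    raise ValueError(mode)

# Hypothetical scores for 10 temporal segments and 3 event classes
scores = np.random.rand(10, 3)
clip_pred = mil_pooling(scores)        # compared with the weak clip-level label
localization = scores.argmax(axis=0)   # most responsible instance per class
```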
Guiding Audio Source Separation by Video Object Information
In this work we have proposed novel joint and sequential multimodal approaches for the task of single-channel audio source separation in videos. This is done using a nonnegative least squares formulation to couple motion and audio information. Experiments with two distinct multimodal datasets of string instrument performance recordings illustrate their advantages over existing methods. Some separation results are given here.
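To give a concrete, toy-level idea of what a nonnegative least squares coupling between motion and audio can look like (a simplified sketch with made-up data, not the formulation used in the paper):

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical data: H holds NMF audio activations (n_components, n_frames),
# m is the motion-velocity trajectory of one visual object (n_frames,).
rng = np.random.default_rng(0)
H = rng.random((8, 200))
m = rng.random(200)

# Nonnegative weights expressing the motion trajectory as a combination of
# audio activations; large weights hint at audio components that move with,
# and can thus be attributed to, that object.
weights, residual = nnls(H.T, m)
```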
Dance movement analysis using Gaussian processes
This work focuses on the decomposition of dance movements into elementary motions. The problem is cast in a probabilistic framework in which Gaussian processes are used to accurately model the different components of the decomposition. The video below shows the sequences of original skeleton joint positions (left) and the 4 different components of the decomposition (right).
More video illustrations of applications of the proposed method can be seen here. A detailed technical report is also available here.
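For readers unfamiliar with the modelling ingredient, the following minimal scikit-learn example (an assumed, generic stand-in, much simpler than the model used in this work) fits a Gaussian process to a single joint trajectory with a smooth RBF kernel plus a noise term:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical 1-D trajectory of one skeleton joint coordinate over time
t = np.linspace(0, 10, 200)[:, None]
y = np.sin(2 * np.pi * 0.3 * t).ravel() + 0.1 * np.random.randn(200)

# A smooth RBF component plus a noise term, mirroring the idea of separating
# a slowly varying movement component from residual variation
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t, y)

y_smooth, y_std = gp.predict(t, return_std=True)
```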
Audio-driven dance performance analysis
This work addressed the Huawei/3DLife Grand Challenge by proposing a set of audio tools for a virtual dance-teaching assistant. These tools are meant to help dance students develop a sense of rhythm and correctly synchronize their movements and steps to the musical timing of the choreographies to be executed. They consist of three main components, namely a music (beat) analysis module, a source separation and remastering module, and a dance step segmentation module. These components make it possible to create augmented tutorial videos highlighting the rhythmic information using, for instance, a synthetic dance teacher voice, as well as videos highlighting the steps executed by a student to help in the evaluation of his/her performance. Examples of such videos, prepared by Robin Tournemenne, are given hereafter, followed by a small illustrative code sketch. Check the related publications.
This is an example of dance teacher videos augmented with audio effects highlighting the musical timing information.
- Original video for the 5th choreography
- Synthetic teacher voice
- Synthetic hand clapping
This is an example of videos of a student dancer augmented with audio effects highlighting the automatically detected steps.
- Original video
- Step evaluation, with beeps sounding on the detected steps only
- Step verification, with a mix of on-floor piezo sounds and beeps on the detected steps
- Musical evaluation, checking whether the steps are consistent with the musical timing
Enhanced Visualisation of Dance Performances from Automatically Synchronised Multimodal Recordings
The Huawei/3DLife Grand Challenge Dataset provides multimodal recordings of Salsa dancing, consisting of audiovisual streams along with depth maps and inertial measurements. In this work, we proposed a system for augmented reality-based evaluations of Salsa dance performances. The following videos, prepared together with Jean Lefeuvre, illustrate the functionalities of the software application that was developed in this work. Check the related publications.
Camera Layout view
Viewpoint and audio stream selection
Audiovisual augmentations illustrating automatic step analysis
Automatic alignment of two dancers