This page offers a selection of resources related to my teaching and research activities.

Teaching resources



Video talks

Videos of the Workshop on Musical Timbre, co-organised by Juan P. Bello, Matthias Mauch, Geoffroy Peeters and myself at Télécom ParisTech in November 2011:

Class materials

The following are mostly in French.

Software resources


Yaafe is "yet another audio feature extractor", initially developed in 2009 by Benoit Mathieu at Télécom ParisTech and later maintained by Thomas Fillon and others. It is software designed to compute many audio features efficiently and simultaneously. Yaafe automatically organizes the computation flow so that the intermediate representations (FFT, CQT, envelope, etc.) from which most audio features are derived are computed only once. The computations are also performed block by block, so Yaafe can analyze arbitrarily long audio files.
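The two ideas above (sharing intermediate representations and processing block by block) can be illustrated with a minimal Python sketch. This is not Yaafe's API: all function names below are hypothetical, the naive DFT only stands in for a real FFT, and the two features are chosen purely for illustration.

```python
import math
from typing import Callable, Dict, Iterable, Iterator, List


def blocks(samples: Iterable[float], block_size: int) -> Iterator[List[float]]:
    """Yield consecutive non-overlapping blocks; a trailing partial block is dropped."""
    block: List[float] = []
    for sample in samples:
        block.append(sample)
        if len(block) == block_size:
            yield block
            block = []


def magnitude_spectrum(block: List[float]) -> List[float]:
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (stand-in for a real FFT)."""
    n = len(block)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(block))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(block))
        mags.append(math.hypot(re, im))
    return mags


def spectral_centroid(mags: List[float]) -> float:
    """Magnitude-weighted mean bin index (0.0 for an all-zero spectrum)."""
    total = sum(mags)
    return sum(k * m for k, m in enumerate(mags)) / total if total > 0 else 0.0


def spectral_peak_bin(mags: List[float]) -> int:
    """Index of the bin with the largest magnitude."""
    return max(range(len(mags)), key=mags.__getitem__)


def extract(
    samples: Iterable[float],
    block_size: int,
    features: Dict[str, Callable[[List[float]], float]],
) -> Dict[str, List[float]]:
    """Compute every feature per block from one shared spectrum."""
    results: Dict[str, List[float]] = {name: [] for name in features}
    for block in blocks(samples, block_size):
        mags = magnitude_spectrum(block)  # shared intermediate, computed once per block
        for name, feature in features.items():
            results[name].append(feature(mags))
    return results
```

Because `samples` can be any iterable, including a generator reading from disk, only one block is held in memory at a time, which is what makes arbitrarily long inputs tractable in this sketch.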

soft_cofact is a set of Matlab scripts, written by N. Seichepine, which compute both:

  • l2-smooth and piecewise constant (l1-smooth, TV-like) nonnegative matrix factorisation (NMF);
  • and soft nonnegative matrix co-factorisation with IS or KL divergence and l1 or l2 coupling, useful for multiview and multimodal settings.

Beta NMF: Theano-based GPGPU implementation of NMF with beta-divergence and multiplicative updates; by Romain Serizel.
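For readers unfamiliar with the technique, here is a minimal sketch of NMF with beta-divergence and the standard multiplicative updates. It is a plain-Python, CPU-only illustration and not the Theano implementation listed above; the function names, the random initialisation and the default parameters are assumptions made for this example. It assumes a strictly positive input matrix.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def beta_divergence(v: Matrix, v_hat: Matrix, beta: float) -> float:
    """Sum of element-wise beta-divergences d_beta(v | v_hat)."""
    total = 0.0
    for row_v, row_hat in zip(v, v_hat):
        for x, y in zip(row_v, row_hat):
            if beta == 1.0:    # Kullback-Leibler
                total += x * math.log(x / y) - x + y
            elif beta == 0.0:  # Itakura-Saito
                total += x / y - math.log(x / y) - 1.0
            else:
                total += (x ** beta + (beta - 1.0) * y ** beta
                          - beta * x * y ** (beta - 1.0)) / (beta * (beta - 1.0))
    return total


def _ratio_terms(v: Matrix, v_hat: Matrix, beta: float) -> Tuple[Matrix, Matrix]:
    """Return V * V_hat^(beta-2) and V_hat^(beta-1), both element-wise."""
    num = [[x * y ** (beta - 2.0) for x, y in zip(rv, rh)] for rv, rh in zip(v, v_hat)]
    den = [[y ** (beta - 1.0) for y in rh] for rh in v_hat]
    return num, den


def _scale(m: Matrix, num: Matrix, den: Matrix) -> Matrix:
    """Element-wise m * num / den."""
    return [[a * n / d for a, n, d in zip(rm, rn, rd)] for rm, rn, rd in zip(m, num, den)]


def nmf_beta(v: Matrix, rank: int, beta: float = 1.0, n_iter: int = 50,
             seed: int = 0) -> Tuple[Matrix, Matrix, List[float]]:
    """Factorise strictly positive V (F x N) as W (F x K) times H (K x N)."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    costs = [beta_divergence(v, matmul(w, h), beta)]  # cost before any update
    for _ in range(n_iter):
        # H <- H * (W^T (V * V_hat^(beta-2))) / (W^T V_hat^(beta-1))
        num, den = _ratio_terms(v, matmul(w, h), beta)
        wt = transpose(w)
        h = _scale(h, matmul(wt, num), matmul(wt, den))
        # W <- W * ((V * V_hat^(beta-2)) H^T) / (V_hat^(beta-1) H^T)
        num, den = _ratio_terms(v, matmul(w, h), beta)
        ht = transpose(h)
        w = _scale(w, matmul(num, ht), matmul(den, ht))
        costs.append(beta_divergence(v, matmul(w, h), beta))
    return w, h, costs
```

Setting `beta=1.0` gives the Kullback-Leibler case, `beta=2.0` the Euclidean case and `beta=0.0` the Itakura-Saito case. For beta between 1 and 2 these plain multiplicative updates are known not to increase the cost, so the returned `costs` list can serve as a simple convergence check.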

Group NMF: Theano-based GPGPU implementation of group-NMF with class and session similarity constraints; by Romain Serizel.

Mini batch NMF: Theano-based GPGPU implementation of NMF with beta-divergence and mini-batch multiplicative updates; by Romain Serizel.

Supervised (group) NMF: Python code to perform task-driven NMF and task-driven group NMF; by Romain Serizel and Victor Bisot.

More software resources by the ADASP team here.

Research datasets


EMOEEG is a multimodal dataset for dynamic EEG-based emotion recognition with audiovisual elicitation.   Read more »

The UE-HRI dataset is a multimodal dataset collected for the study of user engagement in spontaneous Human-Robot Interactions (HRI). It consists of recordings of humans interacting with the social robot Pepper, considering a wide range of heterogeneous sensors: a microphone array, cameras, depth sensors, sonars, lasers, along with user feedback captured through Pepper’s touch screen.   Read more »

The 3DLife/Huawei ACM MM GC 2011 dataset consists of multiview and multimodal recordings of Salsa dancers, captured at different sites, in particular at our local studio, with different pieces of equipment.   Read more »

More datasets by the ADASP team here.

Research demos


Here is a selection of demos related to recent work I have contributed to. Credit goes to the postdocs and to the Master's and PhD students who prepared most of the following material and are named below, as well as to the colleagues who took part in this research.

Weakly supervised representation learning for unsynchronized audio-visual events

Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. This is the core of the PhD work of my student Sanjeel Parekh, co-advised with Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez and Gaël Richard. In this work, we have proposed a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements.

The following video, prepared by Sanjeel Parekh, shows additional localization results for audio and visual cues depicting an event. Please refer to our paper for more details.

Guiding Audio Source Separation by Video Object Information

In this work, we have proposed novel joint and sequential multimodal approaches for the task of single-channel audio source separation in videos. This is done using a nonnegative least squares formulation to couple motion and audio information. Experiments with two distinct multimodal datasets of string instrument performance recordings illustrate their advantages over existing methods. Some separation results are given here.

Dance movement analysis using Gaussian processes

This work focuses on the decomposition of dance movements into elementary motions. The problem is placed in a probabilistic framework, in which Gaussian processes are used to accurately model the different components of the decomposition. The video below shows the sequences of original skeleton joint positions (left) and the 4 different components of the decomposition (right).

More video illustrations of applications of the proposed method can be seen here. A detailed technical report is also available here.

Audio-driven dance performance analysis

This work addressed the Huawei/3DLife Grand Challenge by proposing a set of audio tools for a virtual dance-teaching assistant. These tools are meant to help dance students develop a sense of rhythm so that they can correctly synchronize their movements and steps to the musical timing of the choreographies to be executed. They consist of three main components, namely a music (beat) analysis module, a source separation and remastering module, and a dance step segmentation module. These components make it possible to create augmented tutorial videos that highlight the rhythmic information using, for instance, a synthetic dance teacher voice, as well as videos that highlight the steps executed by a student to help in the evaluation of their performance. Examples of such videos, prepared by Robin Tournemenne, are given hereafter. Check the related publications.

This is an example of dance teacher videos augmented with audio effects highlighting the musical timing information.

Original video for the 5th choreography

Synthetic teacher voice
Synthetic hand clapping



This is an example of videos of a student dancer augmented with audio effects highlighting the automatically detected steps.

Original video
Step evaluation with beeps sounding on detected steps only
Step verification with a mix of on-floor piezo sounds and beeps on detected steps
Musical evaluation to check whether steps are consistent with the musical timing

Enhanced Visualisation of Dance Performances from Automatically Synchronised Multimodal Recordings

The Huawei/3DLife Grand Challenge Dataset provides multimodal recordings of Salsa dancing, consisting of audiovisual streams along with depth maps and inertial measurements. In this work, we proposed a system for augmented reality-based evaluations of Salsa dance performances. The following videos, prepared together with Jean Lefeuvre, illustrate the functionalities of the software application that was developed in this work. Check the related publications.

Camera Layout view

Viewpoint and audio stream selection

Audiovisual augmentations illustrating automatic step analysis

Automatic alignment of two dancers