3DLife ACM MM Grand Challenge 2012 - Realistic Interaction in Online Virtual Environments

Updates

On December 14, 2015: a new dance annotation tool developed by CERTH-ITI colleagues was made available here.
On February 28, 2012: posted data to synchronise the various sensor streams, see readme file on the server.
On February 23, 2012: fixed the wrong Kinect file habib_c6_t1_kinect_2.oni, which was the same as habib_c6_t1_kinect_1.oni.
On February 23, 2012: posted the following files, which were missing: bertrand_c3_t2_feetcam.avi and anne-sophie-m_c5_t2_feetcam.avi.
On July 28, 2011: posted ratings of dance performances by Anne-Sophie K., Anne-Sophie M., Bertrand, Habib, Jacky, Ming-Li and Thomas.
On July 4, 2011: posted audio calibration data, see calibration folder.
On July 1, 2011: updated files anne-sophie-k_c5_t1_feetcam.avi, anne-sophie-k_c6_t1_feetcam.avi, anne-sophie-k_c5_t1_torsocam.avi and anne-sophie-k_c6_t1_torsocam.avi; c5 and c6 versions had been swapped one for another by mistake.
On July 1, 2011: updated file bertrand_c1_t1_feetcam.avi which was corrupted.
On June 16, 2011: updated the WIMU readme file.
On June 16, 2011: posted data of dancers Anne-Sophie K., Ming-Li, Remi, Roland and Thomas.
On June 14, 2011: posted choreography ground-truth annotations (see music folder).
On June 14, 2011: posted beat and bar ground-truth annotations (see music folder).
On June 14, 2011: posted data of dancers Laetitia and Martine.
On May 28, 2011: posted data of dancers Jacky and Jean-Marc.
On May 27, 2011: posted data of dancers Gabi, Gael and Habib.
On May 26, 2011: posted data of dancer Helene.
On May 25, 2011: posted data of dancer Anne-Sophie M.
On May 19, 2011: "Terms of usage" section updated.
On May 5, 2011: completed initial description of dataset.
On May 4, 2011: posted calibration images for all sessions.
On May 3, 2011: posted Kinect and WIMU data of dancer Bertrand.

This challenge calls for demonstrations of technologies that support real-time realistic interaction between humans in online virtual environments. This includes approaches for 3D signal processing, computer graphics, human-computer interaction and human factors. To this end, we propose a scenario for online interaction and provide a data set built around it to support investigation and demonstrations of various technical components.

Consider an online dance class given by an expert Salsa dance teacher and delivered via the web. The teacher will perform the class with all movements captured by a state-of-the-art optical motion capture system. The resulting motion data will be used to animate a realistic avatar of the teacher in an online virtual ballet studio. Students attending the online master-class will do so by manifesting their own individual avatar in the virtual dance studio. The real-time animation of each student's avatar will be driven by whatever 3D capture technology is available to him/her. The motion could be captured via visual sensing techniques using a single camera, a camera network, wearable inertial motion sensing, or recent gaming controllers such as the Nintendo Wii or the Microsoft Kinect. The animation of the student's avatar in the virtual space will be real-time and realistically rendered, subject to the granularity of representation and interaction available from each capture mechanism.

Of course, we are not expecting participants in this challenge to recreate this scenario, but rather to work with the provided data set to illustrate key technical components that would be required to realize this kind of online interaction and communication. This could include, but is not limited to:

3D data acquisition and processing from multiple sensor data sources;
Realistic (optionally real-time) rendering of 3D data based on noisy or incomplete sources;
Realistic and naturalistic marker-less motion capture;
Human factors around interaction modalities in virtual worlds;
Multimodal dance performance analysis, including dance steps/movements tracking, recognition and quality assessment;
...

Dataset

Created on April 10, 2011. Last updated on February 28, 2012.

The dataset consists of multimodal recordings of Salsa dancers, captured at different sites with different pieces of equipment. It will be expanding over the next few weeks as new recording sessions are completed and working data files are made available.

So far 15 dancers, each performing 2 to 5 fixed choreographies, have been captured at the Telecom ParisTech recording studio. The data include:

Synchronised 16-channel audio capture of dancers' step sounds, voice and music;
Synchronised 5-camera video capture of the dancers from multiple viewpoints covering whole body, plus 4 non-synchronised additional video captures: one mini DV camera (with audio) shooting the dancers' feet, a second mini DV camera (with audio) shooting the torso; one Kinect camera covering the whole body from the front, and a second covering the upper-body from the side (see below for more details);
Inertial (accelerometer + gyroscope + magnetometer) sensor data captured from multiple sensors on the dancer's body;
Depth maps for dancers' performances captured using a Microsoft Kinect;
Original music excerpts;
Different types of ground-truth annotations, for instance annotations of the choreographies with reference steps time codes relative to the music and ratings of the dancers' performances (by the Salsa teacher).

The formats of the different streams of data are given in the following table.

Sensor data	Codec	Parameters
Audio signals	PCM WAV	Mono, 32 bits, 48000 Hz
Unibrain cameras (1 to 5)	Raw AVI from decompressed MJPEG	RGB 24 bits, 320x240, 30 fps
Mini DV cam. - Feet	Video: DV Video; Audio: PCM S16	Video: 720x576, 25 fps; Audio: Stereo, 16 bits, 32000 Hz
Mini DV cam. - Torso	Video: DV Video; Audio: PCM S16	Video: 720x576, 25 fps; Audio: Stereo, 16 bits, 48000 Hz
Kinects	OpenNI
WIMU signals	ASCII
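
Because the streams run at different rates, comparing them usually starts with converting sample or frame indices to seconds. The following is a minimal sketch that uses only the nominal rates listed in the table; the stream labels and function names are illustrative and are not part of any dataset tooling. It also assumes the two streams start at the same instant, which the dataset does not guarantee unless the streams have been synchronised first.

```python
# Nominal rates taken from the table above (samples or frames per second).
# The keys are illustrative labels, not identifiers used in the dataset.
NOMINAL_RATE_HZ = {
    "audio": 48000,      # PCM WAV audio channels
    "unibrain": 30,      # Unibrain cameras 1 to 5
    "minidv_video": 25,  # mini DV cameras (feet and torso)
}


def index_to_seconds(stream, index):
    """Convert a zero-based sample/frame index to a time offset in seconds."""
    return index / NOMINAL_RATE_HZ[stream]


def seconds_to_index(stream, seconds):
    """Return the index of the sample/frame nearest to a time offset."""
    return round(seconds * NOMINAL_RATE_HZ[stream])


def matching_index(src_stream, src_index, dst_stream):
    """Map an index in one stream to the nearest index in another stream.

    Assumes both streams start at the same instant, which is NOT guaranteed
    for this dataset unless the streams have been synchronised first.
    """
    return seconds_to_index(dst_stream, index_to_seconds(src_stream, src_index))
```

For example, `matching_index("unibrain", 30, "minidv_video")` maps the Unibrain frame at 1 s to mini DV frame 25.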

Recording setup

Setup

Audio setup

7 Schoeps omni-directional condenser microphones (overhead).
1 Sennheiser wireless lapel microphone (dancer's voice).
Bruel & Kjaer 4374 piezoelectric accelerometers and charge conditioning amplifier unit with two independent input channels.
4 acoustic-guitar internal piezo transducers.
2 Echo Audiofire Pre8 firewire digital audio interfaces. Accurate synchronisation between multiple Audiofire Pre8 units is achieved through Word Clock S/PDIF.
A Debian-based server with a real-time patched kernel is used to perform audio playback and recording. This server runs an open-source software solution based on FFADO and JACK, along with a custom application for batch sound playback and recording.

On-floor audio sensor positions are given in the following table.

Sensor type	Audio channel	x, y coordinates in mm
Piezo	1	30, 550
Piezo	2	1970, 450
B&K accelerometer	3	950, 20
B&K accelerometer	4	1050, 1990
Piezo	17	30, 1560
Piezo	18	1970, 1460

The mapping between Schoeps microphones and audio channels is given in the figure below. These microphones are logarithmically spaced. From left to right, the distance between:

Mic 8 and Mic 19: 39.5 cm;
Mic 19 and Mic 20: 23 cm;
Mic 20 and Mic 21: 13.5 cm;
Mic 21 and Mic 22: 8 cm;
Mic 22 and Mic 23: 4.5 cm;
Mic 23 and Mic 24: 2.5 cm.
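
The spacings listed above can be turned into microphone positions along the array, and the logarithmic spacing can be checked by looking at the ratio between consecutive gaps. This is a small sketch using only the values in the list; placing the leftmost microphone (Mic 8) at 0 cm is an arbitrary choice made here for illustration.

```python
# Gaps between adjacent overhead microphones, left to right, in cm
# (values copied from the list above).
SPACINGS_CM = [39.5, 23.0, 13.5, 8.0, 4.5, 2.5]
MIC_CHANNELS = [8, 19, 20, 21, 22, 23, 24]


def mic_positions(spacings):
    """Cumulative position of each microphone, with the leftmost one at 0 cm."""
    positions = [0.0]
    for gap in spacings:
        positions.append(positions[-1] + gap)
    return positions


def spacing_ratios(spacings):
    """Ratio of each gap to the next; roughly constant for logarithmic spacing."""
    return [a / b for a, b in zip(spacings, spacings[1:])]


# Position (cm) of each microphone, keyed by its audio channel number.
positions = dict(zip(MIC_CHANNELS, mic_positions(SPACINGS_CM)))
ratios = spacing_ratios(SPACINGS_CM)
```

With these values the array spans 91 cm from Mic 8 to Mic 24, and each gap is roughly 1.7 to 1.8 times the next one.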

Audio files

Video setup

The equipment consisted of 5 firewire CCD cameras (Unibrain Fire-i Color Digital Board Cameras), which were connected to a server running the Unibrain software for recording.

Inertial measurement units

Data from inertial measurement units (IMUs - see image below) were also captured with each sequence. Five IMUs were placed on each dancer: one on each forearm, one on each ankle, and one above the hips. Each IMU provides time-stamped accelerometer, gyroscope and magnetometer data at its location for the duration of the session.

Music and choreographies

So far 15 dancers have been recorded (6 women and 9 men). Bertrand is considered as the reference dancer for men and Anne-Sophie K. as the reference dancer for women, in the sense that their performances are considered to be the "templates" to be followed by the other dancers.

Each dancer performs 2 to 5 solo Salsa choreographies among a set of 5 pre-defined ones roughly described as follows:

C1: 4 Salsa basic steps (over two 8-beat bars). No music is played to the dancer; instead, he/she voice-counts the steps: "1, 2, 3, 4, 5, 6, 7, 8, 1, ..., 8" (in French or English).
C2: 4 basic steps, 1 right turn, 1 cross-body; danced on a Son clave excerpt.
C3: 5 basic steps, 1 Suzie Q, 1 double-cross, 2 basic steps; danced on Salsa music excerpt labelled C3.
C4: 4 basic steps, 1 Pachanga tap, 1 basic step, 1 swivel tap, 2 basic steps; danced on Salsa music excerpt labelled C4.
C5: a special one as it is a solo performance mimicking a duo, in the sense that the girl or the boy is asked to perform alone movements that are supposed to be executed with a partner. The movements are: 2 basic steps, 1 cross-body, 1 girl right turn, 1 boy right turn with hand swapping, 1 girl right turn with a caress, 1 cross-body, 2 basic steps; danced on Salsa music excerpt labelled C5.

Whenever possible a real duo rendering of choreography C5 has been captured. It is referred to as C6 in the data repository.

The dancers have been instructed to execute these choreographies respecting the same musical timing, i.e. all dancers are expected to synchronise steps/movements to particular music beats. A manual annotation of the music in terms of dance movement ideal timing is provided along with the original music excerpts. The following figure gives a snapshot of the annotation together with visualisations of the timing of basic steps. It is important to note that the dancers have been asked to perform a Puerto Rican variant of Salsa, and are expected to dance "on two".

Audio excerpt C3 annotation with choreography movements timing (in red) along with bars and beats (in blue)

Man basic steps
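
Since the dancers count in 8-beat cycles, beat annotations can be labelled with their count and filtered for the counts of interest. The sketch below is illustrative only: the beat times are made up, the function names are hypothetical, and selecting counts 2 and 6 reflects the common "on two" convention (break steps on 2 and 6), which is an assumption here rather than something read from the annotation files.

```python
def label_counts(beat_times, cycle=8, first_count=1):
    """Pair each beat time with its count (1..cycle) in the dancers' counting."""
    return [((first_count - 1 + i) % cycle + 1, t) for i, t in enumerate(beat_times)]


def times_on_counts(beat_times, counts=(2, 6), cycle=8, first_count=1):
    """Beat times that fall on the requested counts."""
    wanted = set(counts)
    return [t for c, t in label_counts(beat_times, cycle, first_count) if c in wanted]


# Made-up beat times: 16 beats at 0.5 s intervals (120 beats per minute).
beats = [i * 0.5 for i in range(16)]
break_times = times_on_counts(beats)
```

In practice the beat times would come from the provided beat and bar annotations, and `first_count` would be set from the bar positions.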

Synchronisation, calibration and ground-truth annotations

While the signals captured by some subsets of sensors are perfectly synchronised, namely all audio channels (except the audio streams of the mini DV cameras) and the 5 Unibrain camera videos, synchronisation is not ensured across all streams of data. To minimise this inconvenience, all dancers were instructed to execute a "clap procedure" before starting their performance, where they successively clap their hands and tap the floor with each foot. Hence, the start time of each data stream can be synchronised (either manually or automatically) by aligning the clap signatures that are clearly visible within a 2-s time window from the beginning of every data stream (see for instance audio clap signatures on audio signals snapshot above or image below).
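
An automatic alignment of two streams on their clap signatures could be sketched as follows. This is a deliberately crude illustration, not the procedure used by the dataset authors: it assumes each stream has been reduced to a one-dimensional signal (for video or inertial data, some per-frame or per-sample energy measure, which is not specified here) and that the clap is the largest-magnitude event in the first 2 s. The function names are hypothetical.

```python
def clap_time(samples, rate_hz, window_s=2.0):
    """Time (s) of the largest absolute amplitude within the first window_s seconds.

    A crude stand-in for clap detection: it assumes the clap is the
    strongest event in the window.
    """
    n = min(len(samples), int(window_s * rate_hz))
    if n == 0:
        raise ValueError("empty signal")
    peak_index = max(range(n), key=lambda i: abs(samples[i]))
    return peak_index / rate_hz


def start_offset(ref_samples, ref_rate_hz, other_samples, other_rate_hz):
    """Seconds to add to the other stream's timestamps to align it with the reference."""
    return clap_time(ref_samples, ref_rate_hz) - clap_time(other_samples, other_rate_hz)


# Synthetic example: a reference stream at 1000 Hz with a clap at 0.5 s,
# and a second stream at 100 Hz with the same clap appearing at 0.8 s.
ref = [0.0] * 3000
ref[500] = 1.0
other = [0.0] * 300
other[80] = -0.9
offset = start_offset(ref, 1000, other, 100)
```

Here `offset` is about -0.3 s, meaning the second stream's timestamps must be shifted 0.3 s earlier to line up with the reference. A more robust approach would match the whole clap-and-tap sequence rather than a single peak.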

Camera calibration data is provided that consists of images of a calibration shape (see images below).

The ground-truth annotations include:

Manual annotations of the music in terms of beats, given in Sonic Visualiser (.svl) format and ASCII (.cvs) format;
A manual annotation of the music in terms of dance movement ideal timing, given in Sonic Visualiser (.svl) format and ASCII (.cvs) format;
Ratings of each dancer's performance by the teacher, Bertrand.

Obtaining and using the dataset

Created on April 10, 2011. Last updated on May 5, 2011.

To obtain the data, please proceed as follows:

Download the application form available here and complete it;
When complete, either:
- A: Fax it to +353 -1 -700 7995 plus send an email to 3DLifeGrandChallenge@gmail.com indicating that the form has been sent by fax;
- B: Scan it and send it as an attachment to 3DLifeGrandChallenge@gmail.com
Proceed to registration, by clicking the following button, to obtain a username and password that will enable you to download the dataset (through secure FTP):

Terms of usage

Created on May 5, 2011. Last updated on January 11, 2012.

The 3DLife ACM Multimedia Grand Challenge 2011 Dataset can be used for any research and development purposes provided that:

the application form given above has been completed, signed and received;

all published documents that use the dataset, or refer to the 3DLife Grand Challenge 2011 general goals, guidelines, general results, etc. cite the publication provided hereafter and refer to the dataset as the '3DLife ACM Multimedia Grand Challenge 2011 Dataset':

@article{
year={2013},
issn={1783-7677},
journal={Journal on Multimodal User Interfaces},
volume={7},
number={1-2},
doi={10.1007/s12193-012-0109-5},
title={A multi-modal dance corpus for research into interaction between humans in virtual environments},
url={http://dx.doi.org/10.1007/s12193-012-0109-5},
publisher={Springer-Verlag},
keywords={Dance; Multimodal data; Multiview video processing; Audio; Depth maps; Motion; Inertial sensors;
Synchronisation; Activity recognition; Virtual reality; Computer vision; Machine listening},
author={Essid, Slim and Lin, Xinyu and Gowing, Marc and Kordelas, Georgios and Aksay, Anil and Kelly,
Philip and Fillon, Thomas and Zhang, Qianni and Dielmann, Alfred and Kitanovski, Vlado and Tournemenne,
Robin and Masurelle, Aymeric and Izquierdo, Ebroul and O’Connor, Noel E. and Daras, Petros and Richard, Gaël},
pages={157-170}
}

Researchers are also free to submit work for publication to any relevant conferences/journals/etc. outside of ACM Multimedia 3DLife Grand Challenge 2011, as long as the publication date occurs after the GC has been completed (Dec 1st 2011).

Acknowledgments

Warmest thanks go to all the contributors to these capture sessions, especially:

The dancers: Anne-Sophie K., Anne-Sophie M., Bertrand, Gabi, Gaël, Habib, Hélène, Jacky, Jean-Marc, Laëtitia, Martine, Ming-Li, Rémi, Roland, Thomas.
The tech guys: Alfred, Dave, Dominique, Fabrice, Georgios, Gilbert, Lazaros, Marc, Mounira, Noel, Phil, Radek, Robin, Slim, Qianni, Sophie-Charlotte, Thomas, Xinyu, Yves.
