Project datasets

How to read the datasets

All datasets below are provided as CSV files. If you are using D3 or Altair for your project, both offer built-in functions for loading these files.

Also remember that you can use libraries from the underlying environment, for example to parse dates or other structured types: Python for Altair, JavaScript for D3, and Java for Processing.
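As a rough illustration of using the underlying environment, the sketch below reads CSV text with Python's standard `csv` module and converts one column to a date with `datetime.strptime`. The column names, the sample rows and the `%Y-%m-%d` date format are assumptions made for the example; check the actual files before relying on them.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample: the real files' column names and date format may differ.
SAMPLE = """article,user,date,changes
Paris,alice,2019-03-01,120
Lyon,bob,2019-03-02,-15
"""

def load_rows(text_stream, date_column="date", date_format="%Y-%m-%d"):
    """Read CSV rows as dicts and convert one column to datetime.date."""
    rows = []
    for row in csv.DictReader(text_stream):
        row[date_column] = datetime.strptime(row[date_column], date_format).date()
        rows.append(row)
    return rows

# With a real file, pass open("some_file.csv", newline="", encoding="utf-8") instead.
rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["article"], rows[0]["date"].year)
```

All other values stay as strings, so numeric columns still need an explicit `int(...)` or `float(...)` conversion.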

Classic datasets

These are simple multidimensional datasets, most of them classic infovis datasets. If you use one of them, you will need to focus your effort on creating good, interactive representations that are well suited to your analytic tasks. All of them are small enough to fit easily in memory.

Cars

A dataset of about 400 cars with 8 characteristics such as horsepower, acceleration, etc.

Cereals

About 80 cereal products with their dietary characteristics.

Countries

A dataset of 160 countries with ~40 characteristics such as debt, electricity consumption, Internet users, etc.

Films

About 1600 movies with properties such as length, main actor and actress, director and popularity.

Wikipedia Edits

A log of 1000 Wikipedia edits with article name, user, date and amount of changes.

Less common datasets

These data sets might be more interesting in that fewer (or no) visualizations are available online yet, and they can lead to interesting insights.

Grand débat national

Data from the Grand débat. Includes mostly free-form text with some structured data including id, title, when created, published, updated, deleted, author type, postal code, and text contents. Organized into themes, such as Public Services, Public Spending, Ecological Transition, etc. Since this is mostly free-form text, you will probably need to focus on how to visualize unstructured or semi-structured data.

  • To use this data set, request access from the instructor.
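
One possible first step with free-form text is to count word frequencies before choosing a visual encoding. The sketch below uses only the Python standard library; the stop-word list and the sample sentences are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Tiny, hypothetical stop-word list; a real analysis would need a fuller one.
STOP_WORDS = {"le", "la", "les", "de", "des", "et", "un", "une"}

def word_counts(texts, stop_words=STOP_WORDS):
    """Count lowercase word occurrences across texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts

# Made-up sentences standing in for the text column of the dataset.
sample = ["Les services publics", "Le financement des services"]
top = word_counts(sample).most_common(1)
print(top)
```

The resulting counts can feed a bar chart or a word cloud, and could be grouped by theme or postal code using the structured columns.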

Le “Vrai” débat

Data from the “vrai” débat. Includes over 25 000 comments, including scores, total votes, percent approval, title, comment, author, date, and theme. As with the grand débat, this will probably involve some work on text analysis and extraction.

  • To use this data set, request access from the instructor.

Baby names

This dataset contains all baby names in France from 1900 to 2017.

Causes of Death

Causes of death in France from 2001 to 2008. Variables include year, gender, cause of death, and number of deaths.

Other data on European countries can be downloaded from the Eurostat Website:

  • Use the tree to browse the databases by themes, then open the database of your choice by clicking on the left icon.
  • A default tabular view appears and a user interface allows you to add more dimensions or filter the data (it might require some time to get used to).
  • Once you are satisfied with the table, click on the disk icon at the top, then select the xls format. Clean up the xls file using Excel, then export it as a csv file.

New Born Baby Patterns

This dataset consists of three files: sleep periods, feeding periods, and diaper changes of a baby in its first 2.5 months.

Time Use

How people spend their time depending on country and sex, with activities such as paid work, household and family care, etc.

You can generate csv files that include other dimensions such as day of the week or month by going to the Eurostat Website and proceeding as indicated above.

Happiness

European quality of life survey with questions related to income, life satisfaction or perceived quality of society.

The above table is quite small and only provides the average rating for the question "How happy would you say you are these days?", rated from 1 (low) to 10 (high), by country and by sex. On its own, this dataset is probably insufficient for this class project. You are encouraged to download and visualize answers to other questions as well. To do so, go to the Eurofound Website, select the question on the left, then use the links at the bottom to download the csv file.

Income Inequalities

The Gini index per country per year (sparse data).

Other data per country per year can be downloaded from gapminder, such as electricity generation per person, alcohol consumption, air traffic accidents, and more classical measures such as GDP. You can possibly combine several indicators together.
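
Combining indicators can be sketched as follows, assuming each file is laid out with one row per country and one column per year, and that missing values appear as empty cells. That layout is an assumption to verify against the downloaded files, and the countries and numbers below are placeholders, not real data.

```python
import csv
import io

# Placeholder tables with made-up values; empty cells stand for missing data.
GINI = """country,2000,2001
A,30.0,
B,,50.0
"""
GDP = """country,2000,2001
A,100,200
B,300,400
"""

def wide_to_long(text_stream):
    """Map (country, year) -> float, skipping empty cells."""
    values = {}
    for row in csv.DictReader(text_stream):
        country = row.pop("country")
        for year, cell in row.items():
            if cell and cell.strip():
                values[(country, int(year))] = float(cell)
    return values

def combine(first, second):
    """Keep only the (country, year) keys present in both indicators."""
    return {key: (first[key], second[key]) for key in first.keys() & second.keys()}

gini = wide_to_long(io.StringIO(GINI))
gdp = wide_to_long(io.StringIO(GDP))
combined = combine(gini, gdp)
print(combined)
```

Keeping only the keys present in both indicators is one way to deal with sparse data; depending on the visualization, you may prefer to keep the gaps and show them explicitly.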

HIV Prevalence

HIV prevalence per country per year, with uncertainty bounds. Cells need some parsing.
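
The kind of parsing involved might look like the sketch below. It assumes a cell of the form `0.8 [0.6-1.1]`, meaning an estimate followed by lower and upper bounds; this format is a guess for illustration, so inspect the file and adapt the pattern to what the cells actually contain.

```python
import re

# Assumed cell layout: "estimate [low-high]", e.g. "0.8 [0.6-1.1]".
NUMBER = r"(\d+(?:\.\d+)?)"
CELL_PATTERN = re.compile(
    r"^\s*" + NUMBER + r"\s*\[\s*" + NUMBER + r"\s*-\s*" + NUMBER + r"\s*\]\s*$"
)

def parse_cell(cell):
    """Return (estimate, low, high) as floats, or None if the cell does not match."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())

print(parse_cell("0.8 [0.6-1.1]"))
```

Cells that do not match the pattern, such as empty ones, come back as `None`, so they can be treated as missing values rather than stopping the load.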

Speed Dating

Speed dating data with over 8,000 observations of matches and non-matches, with answers to survey questions about how people rate themselves and how they rate others on several dimensions. This is a large and rich dataset which might take you some time to fully understand.

World Values Survey

A comprehensive survey consisting of 300+ questions asked to people from different countries about their values, gathered across several years. You can answer a subset of the questions here and see which country best represents your values.

  • No csv file is provided here for the moment, but you can download Excel files for individual questions by following the link below. Requires some cleaning up.
  • Source Website.

WVS Cultural Map of the World

An aggregated dataset computed from the World Values Survey that measures cultural proximity of countries across two dimensions, and for different time periods. A small but interesting dataset.

Dream Bank

A collection of over 20,000 dream reports with dates. The reports come from a variety of different sources and research studies, from people ages 7 to 74.

Your own data

You may also choose your own dataset. In order to do so, you must first get your dataset approved by the instructor. Data should be sufficiently complex.


Many of these datasets have been cleaned up by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. Please acknowledge these authors when reusing content from this page, and the source data authors for external links. This page is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.