Dataset Guidelines

Description / Purpose

Datasets may be used as examples in various content types throughout the Codecademy catalog. When using external datasets on the Codecademy platform, it is important to first confirm that the data can be legally published and used on Codecademy. Datasets should also be properly cited where necessary.

Licenses

The term Open Data refers to data that is not restricted by copyright or patents and is freely available for anyone to use and republish (e.g., open government datasets). Any open dataset can be used on Codecademy. If you are unsure whether a dataset falls into this category, ask your project lead or project manager to get in touch with the Codecademy legal team.

In addition to open data, any dataset with a Creative Commons license can be used on Codecademy, as long as the license is CC-BY or CC-BY-SA; we CANNOT use anything with a noncommercial (NC) license.
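
As a rough illustration, the Creative Commons rule above could be expressed as a small Python check. This is only a sketch: the names normalize_license and is_allowed_cc_license, and the label formats they handle, are assumptions made for this example and are not part of any Codecademy tooling. A label the check does not recognize (including licenses this section does not discuss) is reported as not allowed, which here means "needs manual review", not "definitely prohibited".

```python
import re

# The only Creative Commons licenses this guideline names as acceptable.
ALLOWED_CC_LICENSES = {"CC-BY", "CC-BY-SA"}


def normalize_license(label: str) -> str:
    """Reduce a label such as 'cc by-sa 4.0' to a form like 'CC-BY-SA'.

    Hypothetical helper: it only handles simple labels with an optional
    trailing version number.
    """
    text = label.strip().upper()
    # Drop a trailing version number (for example "4.0") if present.
    text = re.sub(r"[\s-]*\d+(\.\d+)*$", "", text)
    # Treat runs of spaces or underscores as a single hyphen.
    text = re.sub(r"[\s_]+", "-", text)
    return text


def is_allowed_cc_license(label: str) -> bool:
    """Return True only for CC-BY and CC-BY-SA; anything else needs review."""
    return normalize_license(label) in ALLOWED_CC_LICENSES


if __name__ == "__main__":
    for label in ["CC BY 4.0", "cc-by-sa-4.0", "CC-BY-NC-4.0"]:
        status = "allowed" if is_allowed_cc_license(label) else "needs review"
        print(f"{label}: {status}")
```

A check like this does not replace reading the license text on the dataset's own page; it only helps catch an obvious NC label early.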

Data Repository Links

Codecademy has a GitHub repository with some datasets that have been licensed to us through partnerships.

The following links and repositories contain datasets that can be used on Codecademy, as long as they are cited appropriately:

The following links contain large collections of datasets, but licenses vary. Double-check the license before use:

Note that it is possible to filter datasets on Kaggle to show only Creative Commons licenses. After filtering, double-check that the license is not NC.

Citing Data Sources

In general, datasets should be cited at the end of the content item where they are used. If the same data is used in multiple exercises of a lesson, it can be cited at the end of the lesson. For example, the following text could be placed at the end of the last exercise in a lesson using the spam dataset from the UCI repository:

The spam data for this lesson were taken from the UCI Machine Learning Repository. Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

It is also good practice to mention data sources in-text within content items where open-source data is being used. For example, an exercise instruction might begin:

In the workspace, we have downloaded a dataset from FiveThirtyEight which contains data on SPI ratings and rankings for men's international soccer teams. The data has been loaded for you as a pandas DataFrame named spi.

This enables the learner to find the data on their own if they are interested in learning more.

Highlight Ethical Quandaries in Tech

As mentioned in our D&I standards, when selecting an example dataset, consider choosing one that has clear social consequences impacting people's lives.