More data sources
We recommend these additional resources for finding other interesting data sets to explore.
Additional data resources (for projects)
There are many open data repositories across the wider internet; here are a few places to start.
1. Kaggle
Kaggle is one of the largest platforms for data science competitions, and it hosts a large collection of community-contributed data sets across many domains. The data sets range from simple to complex, and many are accompanied by competitions and by code notebooks (formerly called kernels) shared by the community.
2. TidyTuesday
TidyTuesday is a weekly data project aimed at the R community. Organized by the Data Science Learning Community, TidyTuesday provides a new data set every week along with starter code to encourage exploration and analysis using R. There is even a companion podcast!
3. Awesome Public Datasets
This GitHub repository curates a list of topic-centric publicly available data sets across a wide variety of domains. There is a Slack channel to join if you would like to connect with the Awesome Data community.
4. Data.gov
Managed by the U.S. government, Data.gov offers a vast collection of data sets from various government agencies, covering topics such as climate, agriculture, healthcare, education, and more.
Smaller data sets (for practice)
Many packages come with built-in data sets; while these are probably not complex or interesting enough for an Academy milestone or full project, they may be useful as you build your code skills.
Working in Python
Many Python packages with built-in data sets expose them through a dedicated submodule, typically named <package-name>.data or <package-name>.datasets.
For example, to load the diamonds data set from plotnine:
from plotnine.data import diamonds
To load the iris data set from scikit-learn:
from sklearn.datasets import load_iris
# Returns a dictionary-like Bunch: features are in iris.data, labels in iris.target
iris = load_iris()
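If you are not sure which of these conventions an installed package follows, one option is to check for the submodule before importing from it. The following is a minimal sketch that uses only the standard library; find_data_submodules is a hypothetical helper name, and the default candidate names ("data" and "datasets") are assumptions based on the two examples above rather than a rule every package follows.

```python
import importlib.util


def find_data_submodules(package_name, candidates=("data", "datasets")):
    """Return the dotted names of candidate submodules that exist in a package.

    Hypothetical helper: `candidates` defaults to the two naming conventions
    seen in plotnine and scikit-learn; other packages may use different names.
    """
    found = []
    for sub in candidates:
        dotted = f"{package_name}.{sub}"
        try:
            # find_spec imports the parent package, then looks for the submodule
            spec = importlib.util.find_spec(dotted)
        except ImportError:
            # The parent package is not installed, or it is not a package at all
            spec = None
        if spec is not None:
            found.append(dotted)
    return found


if __name__ == "__main__":
    # Packages that are not installed simply report an empty list
    for pkg in ("plotnine", "sklearn"):
        print(pkg, find_data_submodules(pkg))
```

Once you know the submodule name, running dir() on the imported submodule lists the objects it exposes, which usually includes the data sets or their loader functions.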
Working in R
Throughout Academy, you will see many data sets commonly used for teaching (e.g. gapminder, palmerpenguins). Many R packages also include data sets relevant to what the package does; for example, tidyr contains small data sets that are useful for practicing data tidying and reshaping.
To see which data sets a given package provides, load the package and then call data() with the package name.
library(ggplot2)
data(package = "ggplot2")
You can also see all data sets available across all loaded libraries by running data() in your Console.
For a large index of data sets distributed with R and a range of add-on packages, see Rdatasets.