Downloading Datasets with the Kaggle API
Many public datasets used in data science tutorials live on Kaggle. Downloading them manually through the browser works once, but it breaks reproducibility — there is no record of where the file came from or which version was used. The Kaggle API solves this: one command downloads the exact dataset, and that command can live in a notebook or a script alongside the analysis.
The Olist Brazilian E-commerce dataset contains nine CSV files covering orders, customers, products, sellers, payments, reviews, and geolocation. It is a good example for demonstrating cohort analysis, RFM segmentation, customer lifetime value, and more.
Let’s set up the Kaggle API and download the Olist dataset.
Kaggle account and API token
You need a Kaggle account. Once logged in, go to Account → API → Create New Token.
Save the token to ~/.kaggle/access_token, then run chmod 600 ~/.kaggle/access_token.
The chmod 600 restricts read and write access to your user only.
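The same step can be scripted. The following is a minimal sketch, not an official Kaggle helper: save_kaggle_token is a hypothetical function name, and the token string is whatever the Kaggle website showed you. It writes the file and applies the equivalent of chmod 600.

```python
import os
from pathlib import Path


def save_kaggle_token(token: str, path: Path) -> Path:
    """Write the token to `path`, readable and writable by the owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0o600 so it is never briefly readable by others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token.strip())
    # Equivalent of `chmod 600`; also tightens a file that already existed.
    os.chmod(path, 0o600)
    return path


# Usage (placeholder token, not executed here):
# save_kaggle_token("<your token>", Path.home() / ".kaggle" / "access_token")
```

On Windows the permission bits behave differently, so treat the chmod part as POSIX-only.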
Installation
Install the Kaggle client with pip: pip install kaggle. The package provides both the kaggle command-line tool and the Python API.
Keeping large files out of repos
Downloaded datasets should never be committed to a git repository — they are large, binary, and can always be re-downloaded. The solution is to store them in a fixed local location outside the project and reference that location via an environment variable.
Create a .env file in the project root. Add these lines to .gitignore to make sure neither the data nor the credentials end up in version control:
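If you prefer to script this step, the sketch below appends missing patterns to .gitignore without duplicating existing ones. The function name ensure_gitignore_entries is hypothetical, and the data/ pattern is an assumption for projects that keep a local data folder; only .env is named in the text above.

```python
from pathlib import Path


def ensure_gitignore_entries(gitignore: Path, entries: list[str]) -> list[str]:
    """Append any missing patterns to .gitignore and return the ones added."""
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    present = {line.strip() for line in existing}
    missing = [entry for entry in entries if entry not in present]
    if missing:
        # Rewrite the file with the old lines first, then the new patterns.
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return missing


# Usage (patterns are assumptions, adjust to your project):
# ensure_gitignore_entries(Path(".gitignore"), [".env", "data/"])
```

Running it twice is safe: the second call finds both patterns already present and changes nothing.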
Then set the path to your data directory in .env:
Load the environment variables into the shell before running the kaggle commands:
Downloading a dataset
The dataset identifier is the path segment after kaggle.com/datasets/ in the URL. For the Olist Brazilian E-commerce dataset it is olistbr/brazilian-ecommerce.
-p sets the destination directory (created if it does not exist). --unzip extracts the archive and removes the zip file.
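For scripts, the same CLI call can be assembled and run from Python. This is a sketch under assumptions: DATA_DIR is a hypothetical name for the environment variable that holds your data directory, the olist subfolder is illustrative, and the kaggle executable must be on your PATH.

```python
import subprocess
from pathlib import Path


def build_download_command(dataset: str, dest: Path) -> list[str]:
    """Return the argument list for `kaggle datasets download`."""
    return [
        "kaggle", "datasets", "download",
        "-d", dataset,      # dataset identifier, e.g. olistbr/brazilian-ecommerce
        "-p", str(dest),    # destination directory
        "--unzip",          # extract the archive after download
    ]


def download_dataset(dataset: str, dest: Path) -> None:
    """Run the download; raises CalledProcessError if the CLI fails."""
    subprocess.run(build_download_command(dataset, dest), check=True)


# Usage (DATA_DIR is an assumed variable name, not executed here):
# import os
# download_dataset("olistbr/brazilian-ecommerce", Path(os.environ["DATA_DIR"]) / "olist")
```

Keeping the command in a function makes the dataset identifier and the destination visible in one place, which is the point of scripting the download.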
Downloading from Python
First, set up the environment variable in Python:
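One way to do this without extra dependencies is a small .env reader. The sketch below is a simplified parser, not a full implementation of the .env format; load_env_file is a hypothetical name, and DATA_DIR is an assumed variable name for the data directory. The python-dotenv package and its load_dotenv() function are a more complete alternative.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Read KEY=value lines from a .env file into os.environ."""
    loaded = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        loaded[key] = value
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
    return loaded


# Usage (assumes .env contains a line such as DATA_DIR=/path/to/datasets):
# load_env_file(Path(".env"))
# data_dir = Path(os.environ["DATA_DIR"])
```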
Run the download inside a Quarto document to make the step explicit and reproducible:
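A minimal sketch of such a cell, assuming the kaggle package is installed. The helper names needs_download and download_if_missing are hypothetical; KaggleApi, authenticate, and dataset_download_files come from the kaggle package. The check for existing CSV files is an added convenience so that re-rendering the document does not download the data again.

```python
from pathlib import Path

DATASET = "olistbr/brazilian-ecommerce"


def needs_download(target: Path) -> bool:
    """True when the target directory holds no CSV files yet."""
    return not any(target.glob("*.csv"))


def download_if_missing(target: Path, dataset: str = DATASET) -> bool:
    """Download and unzip the dataset unless CSV files are already present."""
    if not needs_download(target):
        return False
    # Imported lazily: some versions of the kaggle package look for
    # credentials as soon as the package is imported.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    target.mkdir(parents=True, exist_ok=True)
    api.dataset_download_files(dataset, path=str(target), unzip=True)
    return True


# Usage in a Quarto cell (DATA_DIR is an assumed variable name):
# import os
# download_if_missing(Path(os.environ["DATA_DIR"]) / "olist")
```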
The Python API uses the same ~/.kaggle/access_token credentials automatically.
Using from R
Checking available datasets
To search for datasets from the command line:
To see all files in a dataset before downloading:
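Both lookups map to one CLI call each: kaggle datasets list -s followed by a search term, and kaggle datasets files followed by the dataset identifier. The sketch below wraps them for use in scripts; the function names are hypothetical, and the kaggle executable is assumed to be on your PATH.

```python
import subprocess


def build_search_command(term: str) -> list[str]:
    """Argument list for searching datasets by keyword."""
    return ["kaggle", "datasets", "list", "-s", term]


def build_files_command(dataset: str) -> list[str]:
    """Argument list for listing the files of one dataset."""
    return ["kaggle", "datasets", "files", dataset]


def run_kaggle(command: list[str]) -> str:
    """Run a kaggle CLI command and return its text output."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Usage (requires credentials and network access, not executed here):
# print(run_kaggle(build_search_command("olist")))
# print(run_kaggle(build_files_command("olistbr/brazilian-ecommerce")))
```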