Downloading Datasets with the Kaggle API

Python
Tools
How to download public datasets from Kaggle from code: set up credentials, keep large files out of repositories with a reproducible data path, read environment variables, and use the Kaggle API from Python and the downloaded files from R.
Author

Marina Varfolomeeva

Published

June 1, 2026

Many public datasets used in data science tutorials live on Kaggle. Downloading them manually through the browser works once, but it breaks reproducibility — there is no record of where the file came from or which version was used. The Kaggle API solves this: one command downloads the exact dataset, and that command can live in a notebook or a script alongside the analysis.

This post uses the Olist Brazilian E-commerce dataset as an example. It contains nine CSV files covering orders, customers, products, sellers, payments, reviews, and geolocation, which makes it a good example for demonstrating cohort analysis, RFM segmentation, customer lifetime value, and more.

Let’s set up the Kaggle API and download the Olist dataset.

Kaggle account and API token

You need a Kaggle account. Once logged in, go to Account → API → Create New Token.

Save it to ~/.kaggle/access_token.

mkdir -p ~/.kaggle && echo REPLACE_WITH_YOUR_KAGGLE_TOKEN > ~/.kaggle/access_token && chmod 600 ~/.kaggle/access_token

chmod 600 restricts read and write access to your user only, so other accounts on the machine cannot see the token.
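
To double-check the result from a script, here is a minimal sketch that tests whether the token file is private. The helper name token_is_private is made up for this example, and the check relies on POSIX permission bits, so it is only meaningful on Linux and macOS.

```python
import stat
from pathlib import Path


def token_is_private(token_path: Path) -> bool:
    """Return True if the file exists and group/others have no access to it."""
    if not token_path.is_file():
        return False
    mode = stat.S_IMODE(token_path.stat().st_mode)
    # Any permission bit set for group or others means the token is exposed.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Path from the step above; prints False if the file is missing or too open.
print(token_is_private(Path.home() / ".kaggle" / "access_token"))
```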

Installation

pip install kaggle

Keeping large files out of repos

Downloaded datasets should never be committed to a git repository — they are large, binary, and can always be re-downloaded. The solution is to store them in a fixed local location outside the project and reference that location via an environment variable.

Create a .env file in the project root. Add these lines to .gitignore to make sure neither the data nor the credentials end up in version control:

.env
/path/to/data/

Then set the path to your data directory in .env:

DATA_DIR=/home/user/path/to/data/

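Load the environment variables into the shell before running the download command (this one-liner assumes plain KEY=VALUE lines with no spaces or comments):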

export $(xargs < .env)
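
To show what "loading a .env file" amounts to, here is a rough Python sketch that reads the same file into a dictionary. parse_env_file is a hypothetical helper written for illustration; it is not how python-dotenv works internally, and it deliberately handles only blank lines, # comments, and simple KEY=VALUE pairs.

```python
from pathlib import Path


def parse_env_file(path):
    """Read simple KEY=VALUE lines into a dict, skipping blanks and comments."""
    variables = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line, ignore it
        variables[key.strip()] = value.strip()
    return variables
```

With the .env file from above, parse_env_file(".env")["DATA_DIR"] would return the data directory path.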

Downloading a dataset

The dataset identifier is the path segment after kaggle.com/datasets/ in the URL. For the Olist Brazilian E-commerce dataset it is olistbr/brazilian-ecommerce.
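
The rule can be made concrete with a small sketch that extracts the identifier from a dataset URL. The function name dataset_id_from_url is an assumption of this example, not part of the Kaggle client.

```python
from urllib.parse import urlparse


def dataset_id_from_url(url: str) -> str:
    """Return 'owner/dataset' from a kaggle.com/datasets/... URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<owner>/<dataset>[/...]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


print(dataset_id_from_url("https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce"))
# → olistbr/brazilian-ecommerce
```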

kaggle datasets download \
  -d olistbr/brazilian-ecommerce \
  -p "$DATA_DIR/olist" \
  --unzip

-p sets the destination directory (created if it does not exist). --unzip extracts the archive and removes the zip file.
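
After the download it is worth checking what actually landed on disk. A minimal sketch, with list_csv_files as a hypothetical helper name:

```python
from pathlib import Path


def list_csv_files(directory):
    """Return the sorted names of the CSV files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.csv"))


# Example call, assuming DATA_DIR is set as described above:
# import os
# print(list_csv_files(Path(os.environ["DATA_DIR"]) / "olist"))
```

For the Olist dataset the list should contain nine file names, matching the description at the top of this post.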

Downloading from Python

First, load the environment variable in Python. The dotenv module comes from the python-dotenv package (pip install python-dotenv):

from dotenv import load_dotenv
import os

load_dotenv()
data_dir = os.getenv("DATA_DIR")
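
os.getenv returns None when the variable is missing, and that only surfaces later as a confusing path such as "None/olist/". A small guard makes the failure explicit. This is a sketch; resolve_data_dir is a name invented for this example.

```python
import os
from pathlib import Path


def resolve_data_dir(env_var: str = "DATA_DIR") -> Path:
    """Return the data directory from the environment, or fail loudly."""
    value = os.getenv(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; check your .env file")
    # expanduser() lets the variable contain a leading ~
    return Path(value).expanduser()
```

Calling resolve_data_dir() right after load_dotenv() gives a Path object, so the Olist folder becomes resolve_data_dir() / "olist".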

Run the download inside a Quarto document to make the step explicit and reproducible:

import kaggle

kaggle.api.dataset_download_files(
    'olistbr/brazilian-ecommerce',
    path=f"{data_dir}/olist/",
    quiet=False,
    unzip=True
)

The Python API uses the same ~/.kaggle/access_token credentials automatically.
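
When the document is rendered repeatedly, downloading the archive every time is wasteful. One possible pattern is to skip the download when the files are already there. The sketch below assumes that "already there" means at least one CSV file in the target folder; download_if_missing and its download parameter are hypothetical names introduced for this example.

```python
from pathlib import Path


def download_if_missing(dataset, target_dir, download=None):
    """Download `dataset` into `target_dir` unless CSV files already exist.

    Returns True if a download was triggered, False if it was skipped.
    """
    target = Path(target_dir)
    if target.is_dir() and any(target.glob("*.csv")):
        return False  # data already present, nothing to do
    if download is None:
        # Imported lazily so the skip path needs neither the package
        # nor credentials.
        import kaggle
        download = kaggle.api.dataset_download_files
    target.mkdir(parents=True, exist_ok=True)
    download(dataset, path=str(target), quiet=False, unzip=True)
    return True
```

A call such as download_if_missing('olistbr/brazilian-ecommerce', f"{data_dir}/olist/") then downloads on the first render and returns False on later ones. The download argument exists only so that the function can be tried out with a stand-in instead of the real API.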

Using from R

R can read the downloaded files directly. Load DATA_DIR from the same .env file and build the path to a CSV:

readRenviron(".env")
data_dir <- Sys.getenv("DATA_DIR")
orders <- read.csv(file.path(data_dir, "olist", "olist_orders_dataset.csv"))

Checking available datasets

To search for datasets from the command line:

kaggle datasets list -s "brazilian ecommerce"

To see all files in a dataset before downloading:

kaggle datasets files olistbr/brazilian-ecommerce