Downloading Datasets with the Kaggle API

Python

Tools

How to download public Kaggle datasets from code: set up credentials, keep large files out of repositories with a reproducible data path, read environment variables, use the Kaggle API from the command line and Python, and read the downloaded data in R.

Author

Marina Varfolomeeva

Published

June 1, 2026

Many public datasets used in data science tutorials live on Kaggle. Downloading them manually through the browser works once, but it breaks reproducibility — there is no record of where the file came from or which version was used. The Kaggle API solves this: one command downloads the exact dataset, and that command can live in a notebook or a script alongside the analysis.

The series that follows this post uses two Olist datasets: the Brazilian E-Commerce dataset (nine CSV files covering orders, order items, customers, products, sellers, payments, reviews, geolocation, and product category name translations) and the Marketing Funnel dataset (two CSV files: marketing qualified leads and closed deals). They share a seller_id key, which enables end-to-end analysis from lead acquisition to customer outcomes.

Let’s set up the Kaggle API and download both.

Kaggle account and API token

You need a Kaggle account. Once logged in, go to Account → API → Create New Token.

Save it to ~/.kaggle/access_token.

mkdir -p ~/.kaggle && echo REPLACE_WITH_YOUR_KAGGLE_TOKEN > ~/.kaggle/access_token && chmod 600 ~/.kaggle/access_token

The chmod 600 makes the file readable and writable by your user only; group members and other users get no access.
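To confirm the permissions from Python, a minimal sketch along these lines should work on Linux and macOS. The helper name is_private is hypothetical and not part of the Kaggle tooling; it only inspects the permission bits of the token file.

```python
import stat
from pathlib import Path


def is_private(path: Path) -> bool:
    """Return True if group and other users have no permission bits set."""
    mode = stat.S_IMODE(path.stat().st_mode)
    # S_IRWXG and S_IRWXO cover read/write/execute for group and others.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Assumption: the token lives at the path used in this post.
token_path = Path.home() / ".kaggle" / "access_token"
if token_path.exists():
    print(f"{token_path} private: {is_private(token_path)}")
```

A file with mode 600 passes the check, while the common default of 644 does not.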

Installation

pip install kaggle

Keeping large files out of repos

Downloaded datasets should never be committed to a git repository — they are large, binary, and can always be re-downloaded. The solution is to store them in a fixed local location outside the project and reference that location via an environment variable.

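Create a .env file in the project root to hold machine-specific settings such as the data path. Add these lines to .gitignore so that neither .env nor a data directory inside the project ends up in version control (a pattern with a leading slash is resolved relative to the repository root, so the second entry only matters if the data lives inside the project):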

.env
/path/to/data/

Then set the path to your data directory in .env:

DATA_DIR=/home/user/path/to/data/

Load the environment variables into the shell before running any kaggle commands; set -a exports every variable that the sourced file defines:

set -a; source .env; set +a

Downloading a dataset

The dataset identifier is the path segment after kaggle.com/datasets/ in the URL. For the Olist Brazilian E-commerce dataset it is olistbr/brazilian-ecommerce.
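If you collect dataset URLs rather than identifiers, the rule above can be sketched in a few lines of Python. The function name dataset_slug is hypothetical, and the sketch assumes the URL has the form kaggle.com/datasets/owner/name, possibly followed by extra segments.

```python
from urllib.parse import urlparse


def dataset_slug(url: str) -> str:
    """Extract 'owner/dataset-name' from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    # Keep only owner and dataset name; ignore trailing segments such as /data.
    return "/".join(parts[1:3])


print(dataset_slug("https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce"))
# prints: olistbr/brazilian-ecommerce
```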

kaggle datasets download \
  -d olistbr/brazilian-ecommerce \
  -p "$DATA_DIR/olist" \
  --unzip

-p sets the destination directory (created if it does not exist). --unzip extracts the archive and removes the zip file.

The same pattern works for any other Kaggle dataset. The Olist Marketing Funnel dataset goes into the same directory so both datasets will share one path:

kaggle datasets download \
  -d olistbr/marketing-funnel-olist \
  -p "$DATA_DIR/olist" \
  --unzip

Downloading from Python

First, load the environment variable in Python. This uses the python-dotenv package (pip install python-dotenv):

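from dotenv import load_dotenv
import os

# Read key=value pairs from .env into the process environment
load_dotenv()
# os.getenv returns None if DATA_DIR is not defined
data_dir = os.getenv("DATA_DIR")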
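Because os.getenv returns None when the variable is missing, a typo in .env would otherwise surface later as a confusing path such as None/olist. One way to fail early is a small guard like the sketch below. The name resolve_data_dir and its subdir parameter are hypothetical, chosen for illustration only.

```python
import os
from pathlib import Path


def resolve_data_dir(subdir: str = "olist") -> Path:
    """Return DATA_DIR/subdir as a Path, creating the directory if needed."""
    base = os.getenv("DATA_DIR")
    if not base:
        # Covers both an unset variable and an empty string.
        raise RuntimeError("DATA_DIR is not set; check that .env was loaded")
    target = Path(base).expanduser() / subdir
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Call it after load_dotenv(), for example olist_dir = resolve_data_dir(), and pass str(olist_dir) wherever a path string is expected.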

Downloading inside a Quarto document makes the download step explicit and reproducible. Download both Olist datasets into the same directory:

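import os
import kaggle

# Both datasets are extracted into the same olist/ directory
for dataset in [
    "olistbr/brazilian-ecommerce",
    "olistbr/marketing-funnel-olist",
]:
    kaggle.api.dataset_download_files(
        dataset,
        path=os.path.join(data_dir, "olist"),
        quiet=False,  # show download progress
        unzip=True,   # extract the archive after download
    )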

The Python API uses the same ~/.kaggle/access_token credentials automatically.
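When the document is rendered repeatedly, it can be convenient to skip the download if the files are already on disk. The following is a minimal sketch of that idea, not a feature of the Kaggle package: missing_files and download_if_needed are hypothetical helpers, and the list of expected file names is something you supply yourself.

```python
from pathlib import Path


def missing_files(target_dir: Path, expected: list[str]) -> list[str]:
    """Return the names in `expected` that are not yet present in target_dir."""
    return [name for name in expected if not (target_dir / name).is_file()]


def download_if_needed(dataset: str, target_dir: Path, expected: list[str]) -> bool:
    """Download `dataset` only when some expected file is missing.

    Returns True if a download was triggered, False if it was skipped.
    """
    if not missing_files(target_dir, expected):
        return False
    # Imported lazily so that credentials are only needed when a download
    # actually happens (assumption: the package may authenticate on import).
    import kaggle

    kaggle.api.dataset_download_files(
        dataset, path=str(target_dir), quiet=False, unzip=True
    )
    return True
```

For example, download_if_needed("olistbr/brazilian-ecommerce", Path(data_dir) / "olist", ["olist_orders_dataset.csv"]) downloads only if the orders file is absent. Note that this checks presence, not content, so it will not notice a newer dataset version.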

Using the data from R

R can read the same .env file and then load the downloaded CSV files directly:

readRenviron(".env")
data_dir <- Sys.getenv("DATA_DIR")
orders <- read.csv(file.path(data_dir, "olist", "olist_orders_dataset.csv"))

Checking available datasets

To search for datasets from the command line:

kaggle datasets list -s "brazilian ecommerce"

To see all files in a dataset before downloading:

kaggle datasets files olistbr/brazilian-ecommerce