Create an auto-updating dataset on Kaggle with Deepnote

Create an auto-updating dataset on Kaggle with Deepnote
Photo by Behnam Norouzi / Unsplash

Kaggle has become the de facto place to find data for data science projects. Even though the internet is filled with data, Kaggle is great because the data is often in an easy-to-use format. Unfortunately, as the years go by, Kaggle is suffering from the same problem as the rest of the internet.

Datasets get old, and as they age, their usefulness quickly declines. What's worse, not only do they become almost useless, they actively clog up the system for everyone else. Scrolling through datasets takes more and more time for everyone as a result. To do my part to help the internet, in this tutorial, I want to show you how to create your very own auto-updating dataset on Kaggle for free using Deepnote.

If you don't know what Deepnote is, it's a collaborative data notebook that is Jupyter compatible. They have a generous free tier that allows you to perform analysis, publish public notebooks, and schedule them to run on your schedule. This last part, scheduling, is what we will use for this tutorial.

In this tutorial, we will set up our Deepnote environment to work with Kaggle, pull and clean some data from the internet, and then schedule the data to auto-update using Deepnote and the Kaggle library. We will use average price data from the Bureau of Labor and Statistics (BLS). This data contains average prices for cities and regions for both food and non-food items.

Notably, the BLS already has a Kaggle account and some datasets. Unfortunately, they haven't been updated since 2017! So let's create our own Kaggle dataset that is always fresh to help out our fellow data scientists.

To get started, you should already have a Deepnote and Kaggle account set up and ready to go.

Step 1: Setup Kaggle API Keys

We will be using Kaggle's API to create and update our dataset. This API requires a token, which we can generate on the Kaggle account page. Head to https://kaggle.com/{your_user_name}/account and find the API section, then click on Create New API Token.

This will download a JSON file (called kaggle.json). You can open this file in a simple text editor, and you will see that it contains your username and API key. There are two ways to use this:

  1. Place the JSON file in the location ~/.kaggle/kaggle.json.
  2. Set environment variables for your username and API key.

Using the JSON file is fine if you use your local machine, but I greatly prefer the environment variable method because it gives a few advantages.

Using environment variables, you don't risk accidentally exposing your API key to others if you share your project. Also, environment variables can be shared across Deepnote projects via integrations. This allows you to set your environment variables once and reuse them across several projects.

Create Deepnote Integration

Let's create an environment variable integration in Deepnote. First, click the Integrations button on the left pane of your Deepnote workspace. Then click on the Environment variable integration button. Add the following details:

  • Integration Name: Any logical name, something like kaggle_api_keys.
  • Key/Value #1: KAGGLE_USERNAME, your Kaggle username.
  • Key/Value #2: KAGGLE_KEY, the API key from the JSON file you downloaded.

Connect Integration to Project

Now that our integration is ready, we can connect it to our project so we can use the Kaggle Python package. Create a new project in Deepnote, then click on the Integrations button on the right pane. You should see the integration you created earlier. Click Connect and your environment variables are auto-magically connected to your project.

Now that our environment variables are loaded into our project we can import both the pandas and kaggle packages.

Note: If you don't already have the kaggle packages installed you can install them in Deepnote using !pip install kaggle in a cell. Once installed you will be prompted to add the package to the requirements.txt file. Click on that or add it yourself. Either way, Deepnote will ensure that all packages in your requirements.txt file are installed before your code is run.
import pandas as pd
import kaggle

The kaggle package will look for either the JSON file in the .kaggle directory or the KAGGLE_USERNAME and KAGGLE_KEY environment variables. If it doesn't find them it will give you an error letting you know. Since we don't see that error, we are good to go.

Step 2: Import and Merge Data

Our average price data is available through plain text files that are tab separated. There are several tables we will pull down from the Bureau of Labor and Statistics:

  • ap.data.0.Current - All current year-to-date data. Data is separated monthly, by item, and by area.
  • ap.item - Item codes. These map the item code to an actual item name (like bananas or gas).
  • ap.area - Area codes. These map the area code to a physical location name (Southeast, DFW, etc).

The descriptions for these tables and others are located here. The directory showing all the average price tables is here. It's a good idea to spend some time reading the documentation so we understand our data well before working with it.

Let's start with the current prices table and then pull in the item and area tables. We pass through the sep argument as our table is tab separated.

df = pd.read_csv("https://download.bls.gov/pub/time.series/ap/ap.data.0.Current", sep='\t')
df.shape

We have 189 thousand rows of prices. These range from January 1995 to July 2022.

df.head()
df.tail()
df.columns

For some reason, the column names have some extra spaces, so I will set these again to remove those. This will ensure our tables get merged properly later.

df.columns = ['series_id', 'year', 'period', 'value', 'footnote_codes']

The important column in this dataset is series_id. It tells us the item and area code for the given row. The BLS helpfully gives a breakdown of the series_id column, so we can understand how it is structured.

The series_id (APU0000701111) can be broken out into:

  • survey abbreviation - AP
  • seasonal(code) - U
  • area_code - 0000
  • item_code - 701111

Given this info, we need to break down our data by area and item. After scrolling through the data, it seems that the season doesn't change in our data. Let's pull down the item and area tables next.

item = pd.read_csv("https://download.bls.gov/pub/time.series/ap/ap.item", sep='\t')
area = pd.read_csv("https://download.bls.gov/pub/time.series/ap/ap.area", sep='\t')

We can see that our item table is a detailed item name of the product corresponding to the item code. Items include both food and non-food items like electricity and gasoline.

item.sample(5)
area.sample(5)

Now we will merge the data into a single table that can be added as a Kaggle Dataset. We will do the merging in two different ways. Functionally they are the same, but I wanted to show both here as a kind of mini-tutorial.

The first is with more typical pandas code where we will perform an operation and then save the dataframe. The second is by chaining pandas methods together. I'm not new to chaining, but Matt Harrison recently inspired me to take chaining to a new level to write cleaner, more readable pandas code.

Typical Pandas Code

Let's show how we would typically do this in pandas. I'll make a copy of the dataframe so we can work on it without changing the original.

df1 = df.copy()

We will split out the area and item codes from the series_id. These are a predictable length, so we can simply use slicing. If the item code is shorter than the 6 characters (some of them are), we can simply strip out the extra spaces that would be generated at the end.

df1['area_code'] = df1['series_id'].str.slice(3, 7)
df1['item_code'] = df1['series_id'].str.slice(7, 13).str.strip()

Now we can merge our dataframes to get text descriptions of these items and area codes.

df1 = df1.merge(item, how='left', on='item_code')
df1 = df1.merge(area, how='left', on='area_code')

We have an extra column (footnote_codes). It's all NaN's we can remove it.

df1 = df1.drop(columns=['footnote_codes'])

Chained Methods

The chained code below is one more line than the code above because of the end parenthesis. It uses indentation to highlight the methods, with the goal being that this code is more readable.

To write code such as this, you go line by line and build the chain, executing the cell each time. I'm still not totally sold on this, though, because you do miss the ability to have markdown next to each chunk of code. It's certainly less verbose, though.

df = (df.assign(area_code = lambda x: x['series_id'].str.slice(3, 7),
                item_code = lambda x: x['series_id'].str.slice(7, 13).str.strip())
        .merge(item, how='left', on='item_code')
        .merge(area, how='left', on='area_code')
        .drop(columns=['footnote_codes', 'area_code', 'item_code'])
)
df.sample(10)

We can save our dataframe to a CSV file, and now our data is ready to be added to Kaggle!

df.to_csv("average_prices.csv", index=False)

Step 3: Add Dataset to Kaggle

The Kaggle API allows you to interact with competitions, datasets, and notebooks. For this tutorial, we will only be working with datasets. The documentation for the Kaggle API is here.

The Kaggle package is designed to work via the CLI. Luckily we can do this in a notebook by simply adding a ! in front of the command.

!kaggle datasets list -s "bls average price"

We can see that the Bureau of Labor and Statistics has three datasets on Kaggle already. Unfortunately, they haven't been updated since 2017! This should be further motivation to ensure that any dataset you upload is refreshed regularly. An old dataset is only useful to solve yesterday's problem.

The kaggle library expects that our dataset is in its folder, so let's create a folder and move our CSV file over.

!mkdir bls-average-prices
!cp average_prices.csv bls-average-prices

You should now have a folder in your project files called bls-average-prices with the average_prices.csv file inside.

We will initialize our dataset metadata file with the init command. This will generate a file in our dataset folder called datapackage.json.

!kaggle datasets init -p bls-average-prices

You can now open the JSON file and add the title and id that you would like. The default license is CC0-1.0, which is a public domain license. As our data is in the public domain, this works fine. You can find many more items that can be added here.

Your JSON file should at minimum look something like this.

Now we can create our dataset with the create command. This will look at the JSON file and create the dataset with your data in our folder with the title and id you put in.

!kaggle datasets create -p bls-average-prices

If everything works right, you should see something like the output above. The library will give you a private URL you can go to add additional metadata. To create a high-quality dataset on Kaggle, much more metadata is needed (source, description, etc.), so it's a good idea to take the time here to update this.

Also, keep in mind that your dataset isn't public just yet. Once you are happy with the added details, you can make it public at the web interface above. I went ahead and did this, and you can see the result here.

Create a new dataset version

The idea is to update this dataset regularly (so we don't have old datasets clogging up the internet pipes). To do this, we will run our notebook on a schedule. Each notebook run will pull down the most recent data and create a new dataset version on Kaggle.

We don't want to make a new dataset each time so that you can comment out the init and create commands above by using the # character at the beginning of the line.

To create a new dataset version, we use the version command and give an update message.

!kaggle datasets version -p bls-average-prices -m "Automatic Update"

There you have it. Now we have all the code needed to update our dataset on whatever schedule we want.

Step 4: Schedule Notebook

Scheduling a notebook in Deepnote is very simple. At the top of the notebook, you will see the Run notebook button with a down arrow. Click on the down arrow and select Schedule notebook. Here you can set the frequency of updates, and also whether Deepnote should send an email if the run succeeds or fails.

Our data is updated monthly, but we can set it weekly as Deepnote does not currently allow for a monthly update.

Conclusion

Now you can sit back and enjoy the wonder of auto-updating datasets. You can feel the satisfaction of knowing that you aren't contributing another dataset that will be old in a few months. After all, we as data scientists should be solving today's problems, not yesterday's.

Thanks for reading. I hope you can apply what you learned in this tutorial to your work. If you want to receive updates via email whenever I post something new, sign up for the Datafantic newsletter. Happy coding!

You can click the button below to find the complete notebook on Deepnote.