Bootstrapping a labeled dataset with transformers

You can find the code and data for this article at this link. It's all hosted on Deepnote, a new kind of data notebook designed for collaboration. Deepnote has become my home in the cloud for all of my data science work. Thank you to Deepnote for sponsoring this week's article.

Even with all of the GPT-3/ChatGPT hype, most of the value of machine learning in business today comes from supervised learning using "traditional" machine learning methods such as gradient-boosted trees, logistic regression, and various Bayesian models. These methods require labeled training data specific to the task at hand.

I don't see this changing any time soon, no matter what the crypto-turned-AI bros say. Data scientists are barely scratching the surface of building ML solutions for companies, and the companies themselves are just getting used to how to do this. This doesn't mean ChatGPT has no use, but I also don't believe it will be as broadly applicable to everyday business problems as traditional machine learning.

Later, I'll get into why I believe these large language models won't scale to create products in every business vertical (at least not profitably). But for now, I'll go through a simple project I put together as a proof of concept of how LLMs are a great tool in our toolbelts as long as we acknowledge their limitations.

The Project

Recently I started thinking about how social media sites like Twitter and Facebook assign interests to users for advertising purposes. I wanted to game out a situation to see what I could make of it. If I worked at a Twitter-like social media company today (perhaps Mastodon) and I wanted to advertise, how could I build a model to label users by interest?

Targeting by user interest, beyond demographics and location, is a vital tool to provide to advertisers. Getting the data around user interests is the challenge though. When just starting a machine learning project, you often have to collect the data yourself, if it even exists.

If you are lucky you might be set up like Facebook. Facebook has excellent data because every Facebook page (and group) self-selects its category, and user likes give a signal here. How much you engage with that page or group is another signal.

But if you are a site like Twitter, user data is much less structured. You could try something like labeling top accounts and then labeling followers of those accounts. This would miss a lot though, because I might love traveling but not follow any travel influencers on Twitter.

To get started I took a step back and thought about the problem from several angles. I came up with a sort of thesis statement for the project. This is extremely useful because it helps guide the project and prevent scope creep. Here is what I came up with:

The model needs to classify posts into a predefined list of interests/topics with reasonable accuracy. It must also do this efficiently enough to scale to millions of posts per hour on relatively inexpensive hardware.

Now I had to collect data that could be used to build the model. Since I don't actually work at this fictional social media company, I decided to cheat a bit and use Twitter data. I used Twitter's 1% API (before free access is scheduled to be shut down on February 9, 2023) to collect 500,000 Tweets. The API gives a 1% sample stream of all Tweets on the platform, but you aren't able to filter by language or location, so I ended up with a truly random sampling of Tweets.

From these 500,000 Tweets, I cleaned and filtered the Tweets using the following strategies:

  • Remove links, @ replies/mentions, and the RT label.
  • Filter to only Tweets with >10 words.
  • Filter to only English Tweets. This was done using spaCy's language detector. It labeled the 190k Tweets remaining at that point in about 30 minutes.
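
The first two cleaning steps can be sketched with a few regular expressions. This is a minimal sketch, not the code from the notebook: the patterns and the function names `clean_tweet` and `has_enough_words` are my own assumptions, and the language-detection step is left out because it depends on which spaCy extension you use.

```python
import re

# Assumed patterns; real Tweets may need more careful handling.
RT_RE = re.compile(r"^\s*RT\s+(?:@\w+:?\s*)?")   # leading "RT @user:" label
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # links
MENTION_RE = re.compile(r"@\w+")                 # @ replies/mentions


def clean_tweet(text: str) -> str:
    """Strip the RT label, links, and @ mentions, then collapse whitespace."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return " ".join(text.split())


def has_enough_words(text: str, min_words: int = 10) -> bool:
    """Keep only Tweets with strictly more than `min_words` words."""
    return len(text.split()) > min_words


raw = "RT @someone: check this out https://t.co/abc123 @friend it is great"
print(clean_tweet(raw))
```

Counting words after cleaning (rather than before) keeps links and mentions from inflating the word count.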

I ended up with a dataset of 95,576 cleaned English Tweets of more than 10 words. Next I needed a list of topics so the Tweets could be labeled.

Since I'm not actually good at marketing or knowing what interests advertisers want, I headed over to the Twitter advertising center and heavily borrowed from the listed interest categories. I ended up with the following list of topics:

automotive, books and literature, business, economy, data science, professional career, education, family and parenting, cooking at home, eating out at restaurants, video games and gaming, personal health, home garden, law government and politics, movies and television, music, news and current events, personal finance, house pets, science, society, sports, style and fashion, technology, leisure travel, business travel

Now I need labels! The traditional answer at this point is to hire a company in the Global South to label tens of thousands of posts for a few dollars an hour. But I didn't have the budget for this or the time.

Choosing a Labeling Strategy

I needed a labeling solution that I could put together on my laptop in a day or two so I could get the first model out quickly and then iterate. I had fiddled with OpenAI's playground a good amount, so I knew that GPT-3.5 does a pretty good job with tasks like this. But I also knew that the GPT-3.5 API wasn't cheap.

I tested a few Tweets on the GPT-3.5 API and found that it could use up to 300 tokens per request. If I were to take a sample of 15,000 Tweets, that would be 300 x 15,000 = 4.5 million tokens. Since it's charged per 1,000 tokens, divide that number by 1,000 and then multiply by the price of 2 cents, which comes to about $90.
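
The same back-of-the-envelope estimate, written out so you can plug in your own sample size or price. The figures are the ones from the text; the variable names are just for illustration.

```python
# Figures taken from the text; the price is per 1,000 tokens.
TOKENS_PER_REQUEST = 300        # upper bound observed per request
SAMPLE_SIZE = 15_000            # Tweets to label
PRICE_PER_1K_TOKENS_USD = 0.02  # 2 cents

total_tokens = TOKENS_PER_REQUEST * SAMPLE_SIZE
estimated_cost_usd = total_tokens / 1_000 * PRICE_PER_1K_TOKENS_USD

print(f"{total_tokens:,} tokens -> ${estimated_cost_usd:.2f}")
```

Since 300 tokens is an upper bound per request, this is a worst-case estimate rather than an exact bill.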

At $90 it's cheaper than a human labeler, but still pretty expensive. If the now-for-profit company is charging too much, how about the wonderful always smiling open source HuggingFace? Open source > closed source right?

Over on the HuggingFace model hub, the most popular zero-shot classifier is currently Facebook's bart-large-mnli model. I thought this was a good place to start so I set up a quick classification pipeline (which is very very easy with HuggingFace) where a Tweet would be passed through along with our topic list above.

I looped through a sample of 100 Tweets and the model took just over 10 minutes to label all of them. If we linearly extrapolate these numbers out we would be looking at 25 hours to run this pipeline on a sample of 15,000 Tweets using a machine with 4 vCPUs.

Of course with this model I only pay for the compute I use, which is significantly less than OpenAI. But maybe I'm interested in getting results now. This is where Modal comes in. Modal allows you to define some arbitrary Python code and then scale that code out to many many nodes for execution. The process to do this is pretty simple.

First I defined the image that will be used to perform inference. The containers need the same packages that we have in our notebook. I based this code mostly on this excellent guide in the Modal docs.

Next, I created a class that sets up the inference pipeline and a function that makes the predictions. The function also takes the resulting probabilities given by the pipeline and gives back labels that have a greater than 50% probability.
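
Leaving the Modal class wrapper aside, the zero-shot call and the 50% cut-off might look roughly like this. It is a sketch under assumptions: the helper names are mine, the topic list is shortened, and I haven't confirmed whether the original pipeline used single-label or multi-label scoring. The `transformers` import is done inside `build_classifier` only so the helpers can be tried without downloading the model.

```python
TOPICS = ["sports", "music", "technology"]  # shortened; use the full list above


def build_classifier():
    """Load the zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import, see lead-in

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def labels_above_threshold(result: dict, threshold: float = 0.5) -> list[str]:
    """Keep the labels whose score is greater than `threshold`.

    `result` is the dict the zero-shot pipeline returns, with parallel
    "labels" and "scores" lists.
    """
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score > threshold
    ]


def label_tweet(classifier, text: str, topics: list[str], threshold: float = 0.5) -> list[str]:
    """Classify one Tweet and return the confident labels (possibly none)."""
    result = classifier(text, candidate_labels=topics)
    return labels_above_threshold(result, threshold)
```

With a real model you would call `label_tweet(build_classifier(), tweet_text, TOPICS)`. An empty list means no topic cleared the cut-off.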

Finally, I ran the prediction pipeline using the map method. This spins up a large number of containers and sends a Tweet to each one, then returns the results from the predict function. I tested this out on more than 15,000 Tweets and it took less than 20 minutes to complete inference for all of the Tweets.
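
If you haven't used a map-style fan-out before, the shape of it can be tried locally with Python's standard library. To be clear, this is not Modal code: it swaps Modal's containers for local threads, and `predict` here is a hypothetical stand-in for the real prediction function.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(tweet: str) -> list[str]:
    """Hypothetical stand-in for the real per-Tweet prediction function."""
    return ["sports"] if "game" in tweet.lower() else []


def predict_all(tweets: list[str], max_workers: int = 8) -> list[list[str]]:
    """Send each Tweet to a worker and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict, tweets))


print(predict_all(["Great game tonight", "Reading a good book"]))
```

The idea is the same in both cases: write the function for one Tweet, then let the map call handle distributing the work.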

I found three different ways to bootstrap labels for a text dataset:

  • OpenAI's API. Expensive, but it works pretty well.
  • Local inference using an open-source model. You need a capable machine and a bit of time, but it works out to be the cheapest option.
  • Distributed inference using Modal. This isn't free (but Modal does give you $30 of usage per month right now). But if you absolutely want it now, this is a pretty great option.

Now that we have a small training dataset that is labeled, we can train our first model and see how it performs.

Training a model with the bootstrapped dataset

I wanted to keep the model as simple as possible, and remember our goal is to make a model that can perform inference very quickly and at scale. So I put together a simple LinearSVC model using scikit-learn.
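
A LinearSVC text classifier in scikit-learn can be just a few lines. This is a minimal sketch on toy data, not the notebook's code: I'm assuming TF-IDF features here, and the example Tweets and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_model():
    """TF-IDF features feeding a linear SVM (the featurizer is an assumption)."""
    return make_pipeline(TfidfVectorizer(), LinearSVC())


# Toy, made-up training data purely for illustration.
texts = [
    "the team won the match last night",
    "great goal in the football game",
    "new album from my favorite band",
    "that guitar solo in the song was amazing",
]
labels = ["sports", "sports", "music", "music"]

model = build_model()
model.fit(texts, labels)
predictions = model.predict(["what a goal in the match", "love this new song"])
print(list(predictions))
```

Keeping the vectorizer and the classifier in one pipeline means raw text goes in at inference time, which keeps the serving path simple.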

I printed out the classification report and I wasn't super impressed. There is a weighted accuracy of 0.88 and a macro average of 0.67. This isn't bad, but it isn't great either. The real issue is class imbalance.

The hold out test set is 3,000 Tweets in size, but by far the most common class is "none". More than 2,500 of the 3,000 Tweets are labeled "none" in the test set. This proportion holds for the training dataset as well since I split the data in a stratified fashion.
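
To see why a dominant class pulls the weighted number up while the macro number stays low, here is a small worked example. The per-class scores and supports are hypothetical, chosen only to show the effect, and I'm assuming an F1-style per-class score.

```python
# Hypothetical per-class scores and test-set supports, chosen only to
# illustrate the effect; they are not the article's actual numbers.
per_class = {
    "none":   {"f1": 0.95, "support": 2500},
    "sports": {"f1": 0.60, "support": 300},
    "music":  {"f1": 0.40, "support": 200},
}


def macro_average(scores: dict) -> float:
    """Every class counts equally, however rare it is."""
    return sum(c["f1"] for c in scores.values()) / len(scores)


def weighted_average(scores: dict) -> float:
    """Each class counts in proportion to its number of test examples."""
    total = sum(c["support"] for c in scores.values())
    return sum(c["f1"] * c["support"] for c in scores.values()) / total


print(f"macro:    {macro_average(per_class):.2f}")
print(f"weighted: {weighted_average(per_class):.2f}")
```

The weighted average is dominated by "none", so the macro average is the more honest signal for how the rare topics are doing.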

With so few examples of many of the classes, the model doesn't perform well on these classes. What is needed now is even more labeled posts, perhaps the full 90k in order to get closer to reasonable performance.

BUT our model is very fast. On the test set of 3,000 Tweets, the model takes around 300ms to perform inference. If we scale this up linearly we can assume about 1 second per 10,000 Tweets. This is of course dependent on hardware, but I find this to be pretty quick.
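
Extending that linear extrapolation to an hour gives a rough sense of how this compares with the "millions of posts per hour" goal. It assumes throughput stays constant at larger volumes, which is a simplification.

```python
# Timing from the text: about 300 ms for the 3,000-Tweet test set.
TEST_SET_SIZE = 3_000
ELAPSED_MS = 300

tweets_per_second = TEST_SET_SIZE * 1_000 / ELAPSED_MS
tweets_per_hour = tweets_per_second * 3_600

print(f"{tweets_per_second:,.0f} Tweets/second")
print(f"{tweets_per_hour:,.0f} Tweets/hour")
```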

I chose to stop here with this project because I don't actually work at a social media company and the value of a model like this for me isn't very high. But I hope it does give a good example of how useful transformers can be to help overcome the cold start problem in machine learning, at least for text data.

With almost no budget and very little time, you too can bootstrap a labeled text dataset so you can get started on that machine learning project.

Why GPT-3/ChatGPT won't scale to many verticals

So let's get back to my controversial hot take. Even though large language models can be very useful, I don't believe they will be successful in creating products in most market verticals. The simple reason I believe this is cost.

Running an LLM like ChatGPT that is capable of doing all of the things we want has been estimated to cost around $0.01 per query. If this cost is accurate, the $20 price tag of ChatGPT only covers 2,000 queries before OpenAI starts to lose money. That's about 66.7 queries per day if a month is 30 days.

That might sound like a lot of queries, but let's remember that the entire software revolution was built on massive margins of 80-90%. At only 30 queries per day per user the economics of ChatGPT aren't very attractive, with a gross margin of about 55%. Now let's remember that the cost per user doesn't fall as you grow like it does with SaaS. Marginal costs for AI products scale linearly with your users.
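
Here is the arithmetic behind those numbers, so you can try other usage levels. It takes the estimated $0.01 per query at face value and counts inference as the only cost, which is a simplification. Amounts are in cents to keep the math exact.

```python
PRICE_CENTS = 2_000        # $20 monthly subscription
COST_PER_QUERY_CENTS = 1   # estimated $0.01 per query
DAYS_PER_MONTH = 30

break_even_queries = PRICE_CENTS // COST_PER_QUERY_CENTS
break_even_per_day = break_even_queries / DAYS_PER_MONTH


def gross_margin(queries_per_day: int) -> float:
    """Share of the subscription price left after paying for inference."""
    monthly_cost = queries_per_day * DAYS_PER_MONTH * COST_PER_QUERY_CENTS
    return (PRICE_CENTS - monthly_cost) / PRICE_CENTS


print(f"break-even: {break_even_queries} queries/month "
      f"({break_even_per_day:.1f} per day)")
for q in (10, 30, 60):
    print(f"{q} queries/day -> gross margin {gross_margin(q):.0%}")
```

Even light usage keeps the margin well below the 80-90% that software businesses are used to.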

These products will only be successful in the current investment environment if they are able to achieve margins comparable to current software products. The current prevailing theory is that there will be a ChatGPT for many market verticals moving forward. It doesn't seem hard to imagine LLM products delivering more than $20 per month of value to customers, but getting them to pay more than $20 a month might be the hardest part.

The Notebooks

If you are curious or want to replicate my results, you can access the code and notebooks I used for this post here. I host it all on Deepnote, a great place to perform analysis and share it with others.

Thank you to Deepnote for sponsoring this week's post. I host all the code and data for datafantic on Deepnote, and I believe it's an excellent tool for data professionals to collaborate and share their work.

Deepnote is a new kind of data notebook that's built for collaboration: Jupyter compatible, works magically in the cloud, and sharing is as easy as sending a link.