You can find the code and data for this article at this link. It's all hosted on Deepnote, a new kind of data notebook designed for collaboration. Deepnote has become my home in the cloud for all of my data science work. Thank you to Deepnote for sponsoring this week's article.
There aren't enough failed data science projects out there. Usually, projects only show up in public if they work. I think that's a shame. If we learn more from our failures than our successes, it makes sense to share more failures to help those around us.
- I started with a vague idea.
- I didn't know if the data was available or easy to get.
- I failed to start over and reassess when I realized the project was going in weird directions.
- The model's objective wasn't directly connected to the way the model might be used.
- I failed to appreciate that the core of the problem was effectively predicting the economy, a much bigger and more complex problem.
So that's what I'm doing with this project. I'll detail how I did what I did and what I learned along the way. Most importantly, I'll highlight what I believe were the key mistakes I made. So let's dig into the steaming pile that this project turned into.
My initial idea was to go for a clickbait headline. Is the housing market going to crash? I thought I would scrape housing listings from Zillow.com for a few cities in the United States, then use that data to predict whether the housing market was going up or down. Sound vague and unclear? Yep, you got it. That was mistake number 1.
Usually, when I have a vague idea like this, I can spin it into something adjacent to the original problem that might lead to some insights. I had a few ideas on how to spin it:
- Create some sort of aggregate price index for the market(s) I was looking at and use it to build a simple forecast model.
- Use the historical price drops of homes to predict other houses that may be reduced in price soon. Then use something like the ratio of houses that are being reduced as a signal to predict the overall trend in the market.
These are somewhat more specific, but remember, I didn't have any data yet, and I had no idea if the correlations or trends I was relying on actually existed. That was mistake number 2.
I'm not new to data projects; I've been doing them for years, and I've made all of these mistakes before. I was fully aware that I was making mistakes going into this. I chose not to step back and reformulate the plan.
It didn't help that I was trying to live stream the project on YouTube. I felt I had to do something with the time I had already spent (**cough cough** sunk cost fallacy). This led to me pushing forward with the project. That was mistake number 3.
Collecting the Data
Moving forward, I now had to get some data. I started trying to scrape Zillow using Selenium and BeautifulSoup, my trusty tools for sites where the content is loaded dynamically. When I looked at the data returned from the headless browser, the data I wanted wasn't there. Instead, I saw a wonderful CAPTCHA message blocking my way.
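As a rough sketch of what that attempt looked like: the page source came back from the headless browser, and a quick check like the one below told me I was staring at a CAPTCHA wall rather than listings. The CSS selector and the HTML snippet here are hypothetical stand-ins, not Zillow's real markup.

```python
from bs4 import BeautifulSoup

# In the real project the HTML came from a headless Selenium browser
# (driver.page_source); a static snippet stands in for it here.
page_source = """
<html><body>
  <div class="captcha-container">Please verify you are a human</div>
</body></html>
"""

def looks_blocked(html: str) -> bool:
    """Return True if the page looks like a CAPTCHA wall instead of listings."""
    soup = BeautifulSoup(html, "html.parser")
    # "article.list-card" is a hypothetical listing selector for illustration.
    has_listings = bool(soup.select("article.list-card"))
    has_captcha = "captcha" in html.lower()
    return has_captcha and not has_listings

print(looks_blocked(page_source))  # → True
```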
Of course, there are ways to get around this. Proxies work pretty well. I've used ScrapingBee in the past and liked it, and just recently, I was able to get access to Bright Data's proxy service as well.
As I was Googling around for a solution that might not require me to use a proxy, I stumbled upon Redfin's Data Center page, where I was able to download aggregate housing data for every city in the US going back to 2012. Here is what the data looks like:
The data includes median sale price, price drop ratios, and many more valuable indicators of the housing market in each city. It's exactly what you would want for this type of problem, so I decided right there to use this data instead of fiddling with Zillow.
I also thought combining the Redfin data with consumer sentiment data would be a good idea. Fannie Mae administers a monthly survey asking consumers questions like, "Is it a good time to buy a home?" or "Is it a good time to sell a home?".
These questions are reported as percentages, and an index is also created. Here is what it looks like:
I downloaded this data and then merged it with the Redfin data using the date as the key. The Fannie Mae data is national, so the result is one Fannie Mae survey row matching many Redfin rows. The resulting merged data became the dataset that I would use to build the machine-learning model.
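In pandas terms, this is a straightforward many-to-one join on the date. A minimal sketch with made-up column names (the real Redfin and Fannie Mae schemas differ):

```python
import pandas as pd

# Toy stand-ins for the two datasets; column names are illustrative.
redfin = pd.DataFrame({
    "period_begin": ["2022-01-01", "2022-01-01", "2022-02-01"],
    "city": ["Plano, TX", "Austin, TX", "Plano, TX"],
    "median_sale_price": [520_000, 610_000, 515_000],
})
fannie_mae = pd.DataFrame({
    "date": ["2022-01-01", "2022-02-01"],
    "good_time_to_buy_pct": [25, 24],
})

# The survey is national, so one survey row fans out to every city that month.
merged = redfin.merge(fannie_mae, left_on="period_begin", right_on="date", how="left")
print(merged[["city", "median_sale_price", "good_time_to_buy_pct"]])
```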
At this point, I was reasonably happy because the data I had seemed to be good quality and quite comprehensive.
Modeling the Data
My machine-learning approach to this problem was a simple one. Take the data from Redfin & Fannie Mae and try to predict the median sale price of single-family residential homes 1-3 months ahead. I narrowed the list of cities to those with an average of at least 100 homes sold each month, so that each city's median price came from a decent sample of sales. This gave me 238 cities (down from more than 1,000) to make predictions for.
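The volume filter is a one-line groupby. A toy sketch of the idea, with invented city names and sales counts:

```python
import pandas as pd

# Illustrative frame: one row per city per month.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "homes_sold": [120, 140, 130, 40, 60, 50],
})

# Keep only cities averaging at least 100 homes sold per month.
avg_sold = df.groupby("city")["homes_sold"].mean()
keep = avg_sold[avg_sold >= 100].index
filtered = df[df["city"].isin(keep)]
print(sorted(filtered["city"].unique()))  # → ['A']
```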
Before I started, I wanted to make a baseline to understand whether my model was performing well. The baseline I came up with is a simple one, the month-over-month variation in median sale price. For example, the average percent variation in the median sale price for one month is 5.9% over the entire dataset. Looking three months ahead, the average variation is 7.5%.
This led me to the metric I would use for the project, mean absolute percentage error (MAPE). If my MAPE is below the average variation, I'll call that doing ok. If this sounds hand-wavy, you are totally right.
Usually, your error metric and baseline are set using a benchmark useful in real life. I don't have something like that in this case, so I came up with this crude measure. To me, it makes sense, but I'm sure there is a better way to do it.
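For concreteness, here is roughly how the baseline and the metric compare, with toy prices and predictions standing in for the real data:

```python
import numpy as np
import pandas as pd

prices = pd.Series([300_000, 318_000, 310_000, 325_000])  # toy median sale prices

# Baseline: average absolute month-over-month percent change.
mom_variation = prices.pct_change().abs().mean()

def mape(y_true, y_pred):
    """Mean absolute percentage error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# A model beats the naive bar if its MAPE is below the baseline variation.
preds = [305_000, 315_000, 312_000, 320_000]
print(mape(prices, preds) < mom_variation)  # → True
```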
To prep the data for machine learning, I did a few things:
- For each column in the training dataset, lag variables were added 1-3 months back.
- The target variables (y1, y2, y3) are each city's median sale price 1, 2, and 3 months ahead respectively.
- The city name was not included as it hurt model performance.
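The prep steps above can be sketched like this, with toy numbers and a single city (the real code did the lag/target construction per city):

```python
import pandas as pd

# Toy frame: one row per city per month.
df = pd.DataFrame({
    "city": ["A"] * 8,
    "median_sale_price": [300, 310, 305, 320, 330, 325, 340, 335],
})

frames = []
for city, g in df.groupby("city"):
    g = g.copy()
    for k in (1, 2, 3):           # lag features: price k months back
        g[f"price_lag{k}"] = g["median_sale_price"].shift(k)
    for h in (1, 2, 3):           # targets y1-y3: price h months ahead
        g[f"y{h}"] = g["median_sale_price"].shift(-h)
    frames.append(g)

# Rows without a full lag/target window are dropped.
features = pd.concat(frames).dropna()
print(features[["price_lag1", "y1"]])
```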
I knew that the behavior of the data pre-COVID was very different than post-COVID. To illustrate, here is a time series of the median sale price for Plano, Texas.
This led me to a two-phase approach to building the machine learning model. First, build a model using pre-COVID data, then build a model using the entire dataset. I used XGBoost for both. I did this to understand how a model would have performed pre- vs. post-COVID. If results were good pre-COVID and bad post-COVID, that would be a useful insight.
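The two-phase setup boils down to a time-based split with no shuffling. A minimal sketch on synthetic data, using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the sklearn-style fit/predict API is the same shape):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost

# Synthetic monthly data for 2012-2021, mimicking one feature/target pair.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "date": pd.date_range("2012-01-01", periods=n, freq="MS"),
    "price_lag1": rng.normal(300, 20, n),
})
df["y1"] = df["price_lag1"] * 1.02 + rng.normal(0, 5, n)

def fit_phase(df, train_end, test_end):
    """Time-based split: no shuffling, test period strictly after training."""
    train = df[df["date"] < train_end]
    test = df[(df["date"] >= train_end) & (df["date"] < test_end)]
    model = GradientBoostingRegressor().fit(train[["price_lag1"]], train["y1"])
    return model, test

# Phase 1: pre-COVID only, train on 2012-2017, test on 2018-2019.
model, test = fit_phase(df, "2018-01-01", "2020-01-01")
print(len(test))  # 24 held-out monthly rows
```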
Let's move on to the results of the model.
For the first phase, training data was from 2012 to 2017 with the data for 2018 and 2019 being used as a test set. The results are in the chart below. The model MAPE is in blue, and the average month-over-month change is in orange. The values y1, y2, and y3 correspond to the median home sale price for that city 1, 2, and 3 months ahead.
For the second phase I used the entire dataset. The training data included the years 2012-2020 with the testing dataset being the years 2021 and 2022 (up to October). Here are the results. Again, the model MAPE is in blue, and the average month-over-month change is in orange.
Not bad, right? The model is performing better than the observed historical variation in prices. But is this really a good result? To understand, let's dig into the mind of a person who might use these predictions: a homeowner.
If I'm considering selling my house today or waiting to list it later, I want to know two things. One, will the value of homes be up or down 3 months from now? Two, by how much? Average percent error isn't particularly useful in this context. So what really matters is whether the model is directionally correct and the predictions are appropriate in scale.
MAPE wasn't a great choice in this context. In fact, regression might not have been the right choice. I didn't align the output of the model with the way decisions might be made using that model. That was mistake number 4.
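A metric better aligned with that decision is directional accuracy: the fraction of months where the predicted move has the same sign as the true move. A quick sketch with toy numbers:

```python
import numpy as np

# Toy true vs. predicted median prices for one city, month by month.
y_true = np.array([300, 310, 305, 320, 330])
y_pred = np.array([302, 308, 309, 318, 326])

# +1 for an up move, -1 for a down move.
true_dir = np.sign(np.diff(y_true))
pred_dir = np.sign(np.diff(y_pred))
directional_accuracy = (true_dir == pred_dir).mean()
print(directional_accuracy)  # → 0.75 (one down move was missed)
```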
I plotted the true values and predictions for Plano, Texas to visualize whether the model was directionally correct. In blue is the true value and in orange is the XGBoost model predictions. Here are the predictions for one month ahead.
Not too bad. The model is directionally correct here and gives reasonable predictions one month ahead.
Note: I'm only showing 2021 and 2022 here mostly because it's easier to see the trend but also because it's the test set, and the model hasn't seen this data. Also, the date corresponds to the date of the X values and not the date of the y. I simply used the dates to align the predictions and true values.
Now let's see the predictions two months ahead.
Things are starting to unravel. The model misses direction changes a few times and fails to foresee the dip in home prices in 2022. Three months ahead should be even worse, right?
Yep, not good. The model here is performing pretty poorly and missing most direction changes. I looked at predictions for many cities, and this same pattern showed up often enough that I lost faith in the model entirely.
The Core Problem
I shouldn't be surprised by these results, though. Every major decision up until this point was filled with mistakes. I was vague with the project scope, I didn't check if the data I needed was easy to get, I pressured myself to produce something, and I didn't align the model with how people might actually want to use it.
But all of these mistakes are basically procedural. You can make a checklist and ensure you avoid these mistakes if you want to. I realized the core problem with this project is deeper than this. In short, I failed to understand the nature of the problem.
With any market, the price is determined by the people who are engaging with the market and actively buying and selling. Just because a house is worth $200,000 today doesn't mean it can't be worth $150,000 in three months. The market is the only determiner of price.
To extend this, how buyers and sellers feel this month can change significantly in three months. How people feel is largely based on macroeconomic factors like inflation and the local and national unemployment rate. I failed to appreciate that predicting home prices is really similar to predicting the larger economy. That was mistake number 5.
Make Lemonade with Lemons?
Even with all of these mistakes, this data still has opportunities. The model's one-month predictions were decent, and a short-term lead metric may be useful for those on the fence about buying or selling. In this situation, it may be useful to predict a category (price up or down) rather than a target price.
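Recasting the problem as classification starts with binarizing the target. A sketch of that construction with toy prices:

```python
import pandas as pd

prices = pd.Series([300, 310, 305, 320, 330, 325])  # toy monthly median prices

# Binary target: will next month's median price be higher than this month's?
# (The last month has no "next month", so it is dropped.)
went_up = (prices.shift(-1) > prices).astype(int)[:-1]
print(went_up.tolist())  # → [1, 0, 1, 1, 0]
```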
Another idea is to forecast the price drop ratio for a city. This might be a useful indicator of the market. For example, a large number of price drops means the values of homes on the market are possibly above the true market value at the time.
Rather than doubling down and trying to make something from nothing, I decided to step back from this topic for now and write up this blog post. I hope you found it useful or at least a reminder of what not to do. As I write this, I'm letting auto-sklearn chew on the dataset to see if it comes up with something better...
If you are curious or want to replicate my results, you can access all the code I used for this post here. I host it all on Deepnote, a great place to perform analysis and share it with others.
Deepnote is a new kind of data notebook built for collaboration: Jupyter compatible, works magically in the cloud, and sharing is as easy as sending a link.