Building a Predictive Model: Real-World Data Is Messy
Movies, film, cinema, whatever you want to call it, watching a great one is one of my favorite things to do. There is just something special about how you feel after a great movie and the need you feel to share that experience with other people. So when I was tasked with building a predictive model, I knew I wanted a dataset tied to this love of film and media.
I found an extensive dataset scraped from imdb.com that I would use to predict each movie's final weighted average rating on the site. In other words, I had a regression problem at hand. The dataset was split into four CSV files covering 85,855 different movies: details on the movies themselves, ratings broken down by demographic group, personal details about the cast and crew involved in each production, and the roles those people played in each movie, ranked by importance.
Wrangling Data
The first step I took toward building my model was to explore, clean, filter, and merge the data, and to wrap that work in a single wrangle function. After exploring the dataset, I settled on breaking the wrangle function into six parts.
First, I dropped any columns that duplicated other data or were more than 60% null values. Second, I converted the date each movie was published to a datetime object so that I could split the data into training, validation, and testing sets down the line. Third, I merged the datasets into one dataframe to use moving forward. Fourth, I dropped any high-cardinality columns, in this case any column with more than 80,000 unique values. Fifth, I dropped the average vote column, because keeping it would lead to data leakage given that the target variable is the weighted average vote. Last, I dropped any duplicate rows created by the merge so that the final dataframe was the correct size and free of repeated data. A rough sketch of that function is below.
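Here is a minimal sketch of what a wrangle function like this might look like. For brevity it only merges two of the four CSVs, and the column names (date_published, imdb_title_id, avg_vote) are illustrative assumptions that may not match the actual files exactly.

```python
import pandas as pd

def wrangle(movies, ratings):
    """Clean, merge, and filter the raw IMDb CSVs into one modeling dataframe.

    Column names here (date_published, imdb_title_id, avg_vote) are
    illustrative and may differ from the actual dataset.
    """
    # 1. Drop columns that are mostly missing (duplicate-data columns
    #    would be dropped here as well, identified by hand).
    drop_cols = [c for c in movies.columns if movies[c].isnull().mean() > 0.60]
    movies = movies.drop(columns=drop_cols)

    # 2. Parse the publication date so the data can be split by year later.
    movies['date_published'] = pd.to_datetime(movies['date_published'], errors='coerce')

    # 3. Merge the movie details with the demographic ratings.
    df = movies.merge(ratings, on='imdb_title_id', how='inner')

    # 4. Drop high-cardinality categorical columns (> 80,000 unique values).
    high_card = [c for c in df.select_dtypes('object').columns if df[c].nunique() > 80_000]
    df = df.drop(columns=high_card)

    # 5. Drop the leaky column: avg_vote would give away the target.
    df = df.drop(columns=['avg_vote'], errors='ignore')

    # 6. Drop duplicate rows created by the merge.
    df = df.drop_duplicates()

    return df
```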
Splitting Data
The dataset I was working with was current as of January 2020, so I decided to split it as follows: my training data would be movies published before 2018, my validation data would be movies published in 2018, and my testing data would be movies published after 2018.
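A sketch of that time-based split is below, assuming the merged dataframe from the wrangle step and an illustrative target column name of weighted_average_vote.

```python
# Time-based split on the parsed date_published column.
train_mask = df['date_published'] < '2018-01-01'
val_mask = df['date_published'].dt.year == 2018
test_mask = df['date_published'] >= '2019-01-01'

target = 'weighted_average_vote'  # illustrative name for the target column

X_train, y_train = df.loc[train_mask].drop(columns=target), df.loc[train_mask, target]
X_val, y_val = df.loc[val_mask].drop(columns=target), df.loc[val_mask, target]
X_test, y_test = df.loc[test_mask].drop(columns=target), df.loc[test_mask, target]
```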
Establishing a Baseline
Before building my model, I needed to establish a baseline so that I would have something to compare the trained model against. I decided to use mean absolute error (MAE) as the metric to measure my model's performance.
The mean of my target variable, weighted average rating, was 5.92 on a scale from 1 to 10 and the baseline MAE was 0.95. In other words, if I were to predict the weighted average rating of every movie to be 5.92, on average I would be off by 0.95.
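Computing that baseline is only a couple of lines; here is a sketch using the training split from above.

```python
from sklearn.metrics import mean_absolute_error

# Predict the training mean for every movie and see how far off that is on average.
mean_rating = y_train.mean()
baseline_pred = [mean_rating] * len(y_train)
baseline_mae = mean_absolute_error(y_train, baseline_pred)

print(f'Mean weighted average rating: {mean_rating:.2f}')
print(f'Baseline MAE: {baseline_mae:.2f}')
```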
Building Models
Now that I had my baseline, I could start building models. In the interest of being thorough, I decided to build two: a Ridge model and an XGBRegressor model. Both expose an alpha regularization hyperparameter, along with other hyperparameters I could tune to help the model generalize as well as possible to data it hasn't seen. I built both models as pipelines that include an OrdinalEncoder for the categorical variables, a SimpleImputer to fill null values with the column mean, and the model being trained.
Before going any further, I like to check how the models perform without any hyperparameter tuning. My Ridge model had a validation MAE of 0.119 and my XGBRegressor model had a validation MAE of 0.111. Since the XGBRegressor was already performing better on the validation data without any adjustment, that is the model I chose to tune. A sketch of both pipelines is below.
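Here is roughly how those two pipelines could be assembled and compared. This sketch assumes the OrdinalEncoder from the category_encoders package (which encodes object columns and passes numeric ones through); the exact estimator settings are illustrative.

```python
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

# Ordinal-encode the categoricals, mean-impute the nulls, then fit the estimator.
ridge_model = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    Ridge(),
)

xgb_model = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    XGBRegressor(random_state=42, n_jobs=-1),
)

# Fit each pipeline on the training data and compare validation MAE.
for name, model in [('Ridge', ridge_model), ('XGBRegressor', xgb_model)]:
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    print(f'{name} validation MAE: {mae:.3f}')
```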
Hyperparameter Tuning
In an effort to get my MAE as low as possible, I turned to GridSearchCV so I could be sure I was tuning as thoroughly as possible. The hyperparameters I wanted to tune were the SimpleImputer strategy and the XGBRegressor's max_depth, alpha, and n_estimators, using 10 cross-validation folds.
I have found that the best way to use a grid search is to test a range of three values for each hyperparameter and adjust from there. That way, when I look at the best score and best parameters from the search, if the winning value was the highest of the three I know I should run another search with higher values, and keep narrowing in on the best setting for each hyperparameter. I found that the best settings for my model were to impute null values with the column mean, a max_depth of 4, an alpha of 2, and n_estimators cut off at 700. I say "cut off" because the model was still improving beyond 700 estimators, but only by thousandths of a decimal place; I decided that was the point where the extra computational cost was not worth an MAE improvement of 0.001. A sketch of the search is below.
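Here is a sketch of what one round of that grid search might look like over the XGBRegressor pipeline. The grid values are illustrative, and reg_alpha is xgboost's parameter name for the alpha (L1) penalty mentioned above.

```python
from sklearn.model_selection import GridSearchCV

# Test a range of three values per hyperparameter, then adjust and re-run.
param_grid = {
    'simpleimputer__strategy': ['mean', 'median', 'most_frequent'],
    'xgbregressor__max_depth': [3, 4, 5],
    'xgbregressor__reg_alpha': [1, 2, 3],
    'xgbregressor__n_estimators': [500, 700, 900],
}

search = GridSearchCV(
    xgb_model,
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=10,          # 10 cross-validation folds
    n_jobs=-1,
    verbose=1,
)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated MAE: {-search.best_score_:.3f}')
```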
Checking Metrics
Now that my model had been built and the hyperparameters tuned, bringing the validation MAE down to 0.099, I felt comfortable moving forward and looking at both the feature importances and the permutation importances of my model.
Feature & Permutation Importances
Below are the two bar graphs that show the top 10 feature importances and permutation importances. I’m sure what jumps out to you is the same thing that jumped out to me when I saw this: why is the males_allages_avg_vote showing up as so much more important to the prediction than the other features? Is this data leakage?
Upon further exploration into the data, I found that it’s not data leakage that I’m dealing with, but instead an overrepresentation of males in my dataset. Below are a few values from the columns that have the number of votes from males and females. As you can see, there are just significantly more male votes than female votes in the dataset, so it does make sense that that feature is as important as it is to the model prediction.
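For reference, permutation importances like the ones plotted above can be computed directly against the validation set. This sketch uses scikit-learn's permutation_importance with the tuned pipeline from the grid search; the feature importances in the other chart would come straight from the fitted XGBRegressor's feature_importances_ attribute.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import permutation_importance

best_model = search.best_estimator_  # the tuned pipeline from the grid search

# Shuffle each feature on the validation set and measure how much the MAE worsens.
result = permutation_importance(
    best_model, X_val, y_val,
    scoring='neg_mean_absolute_error',
    n_repeats=5, random_state=42, n_jobs=-1,
)

perm_importances = pd.Series(result.importances_mean, index=X_val.columns)
perm_importances.sort_values().tail(10).plot.barh(
    title='Top 10 permutation importances')
plt.xlabel('Increase in MAE when feature is shuffled')
plt.tight_layout()
plt.show()
```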
Visualizations
Since we have a feature that is significantly more important to the model's predictions than the others, I felt a useful visualization would be a partial dependence plot of that one feature. The plot below shows that as the average rating from males of all ages increases, the model predicts a higher weighted average rating.
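A partial dependence plot like this can be hand-rolled in a few lines: sweep the feature over a grid of values, overwrite that column for every validation row, and average the model's predictions at each value. This is a sketch of that idea, assuming males_allages_avg_vote survived the wrangling step under that name; a plotting library would produce an equivalent curve.

```python
import matplotlib.pyplot as plt
import numpy as np

feature = 'males_allages_avg_vote'

# Sweep the feature over its observed range and average the predictions.
grid = np.linspace(X_val[feature].min(), X_val[feature].max(), 30)
avg_preds = []
for value in grid:
    X_temp = X_val.copy()
    X_temp[feature] = value          # hold everything else fixed
    avg_preds.append(best_model.predict(X_temp).mean())

plt.plot(grid, avg_preds)
plt.xlabel(feature)
plt.ylabel('Predicted weighted average rating')
plt.title('Partial dependence of the prediction on males_allages_avg_vote')
plt.show()
```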
Conclusions
To recap everything I did in this exercise: I created a wrangle function to clean and merge my data; I split the data into training, validation, and testing sets; I established a baseline MAE of 0.95; I built one linear model and one tree-based model to see which performed better on the validation data before any tuning; I tuned the hyperparameters of the XGBRegressor model with GridSearchCV; and I created three visualizations, feature importance, permutation importance, and a partial dependence plot of the most important feature in the model's predictions. The last thing to do was to finally use the test data and see how the model performs. My final MAE on the test data was 0.11, so I feel very comfortable saying that my model generalizes well to data it hasn't seen before.
In terms of takeaways from this project, I think this was a great exercise in learning how data can be messy in different ways. When you are learning, you are often given clean datasets to work with, or datasets where the only problems are null values to fill and duplicate columns to drop. This was the first time I had to deal with an overrepresentation of one group in my dataset. It was extremely valuable to see how that can affect a model, and to know that this is something I can identify and handle moving forward. If you would like to see my code or reproduce my work, you can find everything on my GitHub here.