Overfitting & Variable Selection

Overfitting occurs when we add too many variables to our model in an attempt to improve its in-sample performance. The model starts capturing noise in the training data rather than the underlying pattern, so it fits the training set well but generalizes poorly to new data.

In the quest to build a great model that explains the variation that occurs, it is tempting to throw in every available predictor. What actually matters is getting the right combination of predictors, so we have to evaluate the contribution of each one, and that is where backwards stepwise selection comes in.

To demonstrate backwards stepwise selection, I will use a linear regression model, since its R-squared value tells us how well the model fits our data. The same procedure works with other model types too. For my linear regression model I will be using loves, a continuous response variable I have used in previous posts. A good thing to keep in mind is that all of the model building should take place in the TRAINING DATA ONLY.

I have divided my data into a training set and a test set, as shown in previous posts, randomly assigning 70% of my data to the training set using the code below:
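As a rough sketch, a 70/30 split with scikit-learn might look like this (the DataFrame and its column names here are invented stand-ins for illustration, not the post's actual data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for the product data used in the post; the column
# names below are assumptions, not the real data set's columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "loves": rng.integers(0, 100_000, size=200),
    "rating": rng.uniform(1, 5, size=200),
    "number_of_reviews": rng.integers(0, 5_000, size=200),
    "online_only": rng.integers(0, 2, size=200),
})

# Randomly assign 70% of the rows to the training set, the rest to test.
train, test = train_test_split(df, train_size=0.7, random_state=42)
print(len(train), len(test))
```

With 200 rows, this yields 140 training rows and 60 test rows.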

My data set has 22 variables in total, with loves as the response variable. Not all of the variables are numeric (one, for example, is the ingredients list), so I will be using the other 10 quantitative variables as predictors. Let's see what happens:

In my model above with 10 predictors, only 3 are truly significant for explaining the variation in loves: Ratings, Number of Reviews, and Online Only. The other predictors were not useful contributors, even if some of them would be chosen as significant among a smaller group of variables.

So my new baseline model is:
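Refitting the trimmed baseline model could look like this sketch (stand-in data and assumed column names again, using the statsmodels formula interface):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the training set (assumed column names).
rng = np.random.default_rng(3)
n = 500
train = pd.DataFrame({
    "rating": rng.uniform(1, 5, n),
    "number_of_reviews": rng.integers(0, 5000, n).astype(float),
    "online_only": rng.integers(0, 2, n).astype(float),
})
train["loves"] = 10 * train["number_of_reviews"] + rng.normal(0, 9000, n)

# Baseline model: keep only the significant predictors.
baseline = smf.ols("loves ~ rating + number_of_reviews + online_only",
                   data=train).fit()
print(f"R-squared: {baseline.rsquared:.4f}")
```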

The R-squared dips by only .0029 with fewer variables, but this is a more honest representation of reality, since we now include only significant predictors.

The R-squared for this model indicates that Ratings, Number of Reviews, and whether the product is offered online only together explain about 55.53% of the product-to-product variability in loves.

Are any of these variables contributing very little? If a variable is contributing only a fraction of a percent, we should consider removing it. To make that determination, we use backwards stepwise selection.

So we will try to remove each of the three variables one at a time:
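One way to sketch this round of backwards stepwise selection in Python, on stand-in data (column names are assumptions): drop each predictor in turn, refit, and record how much R-squared falls.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the training set (assumed column names).
rng = np.random.default_rng(2)
n = 500
train = pd.DataFrame({
    "rating": rng.uniform(1, 5, n),
    "number_of_reviews": rng.integers(0, 5000, n).astype(float),
    "online_only": rng.integers(0, 2, n).astype(float),
})
train["loves"] = (10 * train["number_of_reviews"]
                  + 500 * train["rating"]
                  + 300 * train["online_only"]
                  + rng.normal(0, 8000, n))

predictors = ["rating", "number_of_reviews", "online_only"]
full_r2 = smf.ols("loves ~ " + " + ".join(predictors),
                  data=train).fit().rsquared

# Remove each predictor one at a time and measure the drop in R-squared.
drops = {}
for dropped in predictors:
    kept = [p for p in predictors if p != dropped]
    r2 = smf.ols("loves ~ " + " + ".join(kept), data=train).fit().rsquared
    drops[dropped] = full_r2 - r2
    print(f"without {dropped}: R-squared drops by {drops[dropped]:.4f}")
```

A predictor whose removal barely moves R-squared is a candidate for deletion; a large drop means the predictor is doing real work.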

When Online Only is removed, R-squared changes from 58.05% to 57.96%; Online Only is contributing about .09% to R-squared.

When Number of Reviews is removed, R-squared changes from 58.05% to 2.49%; Number of Reviews contributes about 55.56% to R-squared.

When Ratings is removed, R-squared changes from 58.05% to 57.96%; Ratings contributes about .09% to R-squared.

After assessing the first round of removals, it seems that Online Only and Ratings each contribute only .09% to R-squared, just a fraction of a percent, meaning we can remove both variables. The new model is shown below:
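A sketch of the final one-predictor model on stand-in data (the column names are assumptions, and the coefficient of 10 below is an invented value used to generate the fake response):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the training set (assumed column name).
rng = np.random.default_rng(4)
n = 500
train = pd.DataFrame(
    {"number_of_reviews": rng.integers(0, 5000, n).astype(float)})
train["loves"] = 10 * train["number_of_reviews"] + rng.normal(0, 9000, n)

# Final model: Number of Reviews is the only predictor left.
final = smf.ols("loves ~ number_of_reviews", data=train).fit()
print(final.params)
print(f"R-squared: {final.rsquared:.4f}")
```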

Conclusions: Since the first round of testing allowed me to remove two predictor variables, I am now done with backwards stepwise selection, as only one variable remains. If I had removed just one variable, I would have continued with more rounds of testing until no more variables could be removed.

I feel confident that this one remaining variable produces the best possible linear regression model from the training data. Every variable was taken into account, and Number of Reviews came out on top as the most substantial contributor to R-squared.

Thank you for following along!!
