Linear Regression Prediction

To assess a model’s prediction capability, we need to divide our data into a training set and a test set.

So from here on out we will be dividing our Sephora data into two parts: 1) the training set, which we use to build the model and treat as the “working set,” and 2) the test set, which we use exclusively to test the model once it has been built.

My data is not organized in time order; it is just a collection of products in random order, so I’ll randomly divide the products into two sets:

I now have two datasets to use. The training data contains 70% of the original 9168 observations, and the test data contains 30%.
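For anyone following along in code, the random 70/30 split described above can be sketched roughly like this. This is only an illustration in Python using the standard library; the function name `train_test_split`, the seed, and the use of row indices in place of the real Sephora products are all assumptions, not the code used for this post.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the original order is untouched
    rng.shuffle(shuffled)
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# Stand-in for the 9168 products: here just their row indices (hypothetical).
products = list(range(9168))
train, test = train_test_split(products)
print(len(train), len(test))  # 6418 2750
```

Every product lands in exactly one of the two sets, so the test set stays untouched while the model is being built.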

My next step will be to build the model:

The output is shown below:

So its model equation is:

Loves = -499.185 + 1193.474*Rating + 39.108*Number of Reviews + 4507.289*Exclusivity

This model equation is slightly different from the one in my previous posts, but that is because the data is a little different. Only 70% of the data is being used here, since this is the training set, so some natural fluctuation in the coefficients is expected. All the coefficients are still close to the earlier ones, which is a good thing. The intercept is not significant, but we ran into that same issue in my previous posts.
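As a rough illustration of the model-building step, here is a minimal sketch of fitting a linear regression by ordinary least squares with NumPy. The data below is synthetic, generated from made-up coefficients, because the real Sephora training set is not reproduced here; the variable names are hypothetical, and this is not necessarily the tool that was used for the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # size of the synthetic "training set" (assumption)

# Synthetic predictors standing in for the real columns.
rating = rng.uniform(1, 5, n)
n_reviews = rng.integers(0, 2000, n).astype(float)
exclusive = rng.integers(0, 2, n).astype(float)

# Synthetic response built from made-up coefficients plus random noise.
noise = rng.normal(0, 1000, n)
loves = 1200 * rating + 39 * n_reviews + 4500 * exclusive + noise

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(n), rating, n_reviews, exclusive])

# Least-squares fit; coef holds (intercept, rating, reviews, exclusivity).
coef, *_ = np.linalg.lstsq(X, loves, rcond=None)
intercept, b_rating, b_reviews, b_exclusive = coef
print(np.round(coef, 2))
```

The fitted coefficients come out close to, but not exactly equal to, the values used to generate the data, which is the same kind of natural fluctuation described above.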

Now I can make predictions for products in the test set using the model equation above and calculate their errors:
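To make the prediction step concrete, here is a small sketch that plugs products into the model equation above and computes each error. The three products are invented for illustration, and defining the error as actual minus predicted is an assumption about the sign convention.

```python
def predict_loves(rating, n_reviews, exclusive):
    """Apply the fitted model equation from the training set."""
    return (-499.185
            + 1193.474 * rating
            + 39.108 * n_reviews
            + 4507.289 * exclusive)

# Hypothetical test products: (rating, number of reviews, exclusive, actual loves)
test_products = [
    (4.5, 200, 1, 15000),
    (4.0, 50, 0, 9000),
    (3.5, 1000, 0, 40000),
]

# Error = actual - predicted (assumed convention): a negative error
# means the model overestimated that product's loves.
errors = [actual - predict_loves(r, n, e) for r, n, e, actual in test_products]

for product, err in zip(test_products, errors):
    print(product, round(err, 1))
```

For the first made-up product, the equation predicts about 17,200 loves against an actual 15,000, so the model overestimates by roughly 2,200.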

Once I have calculated the errors for the test set, I can draw histograms of the errors in both sets and look at the average size of the errors in each:
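Here is a minimal sketch of how the errors in the two sets could be compared. It assumes "average size of the errors" means the mean absolute error, and it uses a plain-text histogram rather than a chart; the error values are invented, not the real Sephora results.

```python
from collections import Counter
from statistics import mean

def mean_absolute_error(errors):
    """Average size of the errors, ignoring their sign."""
    return mean(abs(e) for e in errors)

def bin_counts(errors, width):
    """Count errors per bin; each key is the lower edge of a bin."""
    return Counter(int(e // width) * width for e in errors)

# Hypothetical errors (in loves) for each set.
train_errors = [-12000, 3000, 8000, -500, 150000, -7000]
test_errors = [-9000, 4000, 11000, -2000, -120000, 6000]

for name, errs in [("train", train_errors), ("test", test_errors)]:
    print(name, "mean absolute error:", round(mean_absolute_error(errs)))
    for edge, count in sorted(bin_counts(errs, 50000).items()):
        print(f"  {edge:>8} to {edge + 50000:>8}: {'#' * count}")
```

If the two sets give similar averages and similarly shaped histograms, that points to a stable model; the size of the average error itself speaks to accuracy.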

Now we can take a better look at these charts to assess stability and accuracy.

Stability: The two histograms are relatively similar and have the same shape. Both are centered at zero, and most errors fall within 100,000 loves of zero (we are predicting loves, so the errors are in the same unit). The average errors of the two sets are also similar, so the model is pretty stable.

Accuracy: On average the model is off by about 10,500 loves. That number seems large for an average error, but in context 10,500 loves is not a lot: products can have up to millions of loves, so being off by only 10,500 on average is not terrible. The histograms show that most of our over- and underestimates are within 100,000 loves, and even that is not too bad considering how many loves most products have. At the extreme, the largest overestimate is about 800,000 loves, which is a lot, but it seems to apply to only one product. I would say that my model is fairly accurate and stable.

It is important that models be both stable and accurate. My model happens to be both right now, but soon we will be learning ways to increase both stability and accuracy.

Thanks for following along!!!
