Conclusions

We have reached the end of the course, and it is time to comment on the final model I built to predict loves after removing outliers, testing for accuracy and stability, and adding interactions.

Now that the data is divided into training and test sets, we can look at the latest model, which predicts loves using number of reviews, exclusiveness, and rating. (I will not be using any of the models containing interactions because the interactions were not significant.) *Side note: every time I rerun the model, the R-squared changes slightly because the training and test sets are randomized on each run.*
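The run-to-run changes mentioned in the side note come from the random split itself. One common way to make a split repeatable in R is to fix the random seed before sampling. This is a minimal sketch, not the code used for this project: the seed value, the row count `n`, and the 70% split are all assumptions chosen for illustration.

```r
# Hypothetical stand-in: 10 row numbers instead of the real product data
n <- 10

set.seed(42)                         # any fixed integer works
train_idx_a <- sample(n, size = 7)   # rows that would go into the training set

set.seed(42)                         # same seed again...
train_idx_b <- sample(n, size = 7)   # ...gives the same rows again

same_split <- identical(train_idx_a, train_idx_b)
print(same_split)                    # TRUE
```

With the seed fixed, the training and test sets stay the same between runs, so the R-squared would stop moving around.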

Written out as an equation, the model is:

Loves = -852.85 + 4314.40*Exclusive + 33.8208*Number of Reviews + 1604.89*Rating
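To see how the equation turns into a prediction, here is a small sketch in R that plugs values into it. The coefficients are copied from the equation above. The function name `predict_loves` and the example product (exclusive, 200 reviews, rating of 4.5) are made up for illustration and are not a row from the dataset.

```r
# Coefficients copied from the fitted equation above
predict_loves <- function(exclusive, n_reviews, rating) {
  -852.85 + 4314.40 * exclusive + 33.8208 * n_reviews + 1604.89 * rating
}

# Hypothetical product: exclusive (1 = yes), 200 reviews, rating of 4.5
predicted <- predict_loves(exclusive = 1, n_reviews = 200, rating = 4.5)
print(predicted)
```

For this made-up product the model predicts roughly 17,448 loves. Setting `exclusive = 0` lowers the prediction by exactly the exclusive coefficient, 4314.40.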

Its R-squared is 50.46%, which means the model explains about half of the variation in loves. That is good, but we are still not close to explaining most of the variation from product to product. Loves is a hard variable to predict because we are trying to predict human behavior, but there is still more work that could be done.

Let's look at prediction accuracy and stability, as if we were making predictions for products that are not already in our dataset.

Looking at the histogram of average error sizes, we can see that the model is stable. Looking at the means, the predictions are off by around 500 loves on average. Normally that would be a large amount, but loves can run into the millions for a single product, so 500 is not a make-or-break amount. I know that 500 is only an average, but even if individual errors range from 1,000 to 5,000 loves, that is still not enough for me to worry about accuracy at this scale, so I would say the model is accurate.
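One simple way to put a number on "how far off on average" is the mean absolute error. The sketch below is one possible way to compute it in R, not necessarily how the errors above were calculated. The helper name `mean_abs_error` and the three actual and predicted values are made up for illustration.

```r
# Average size of the prediction errors, ignoring their direction
mean_abs_error <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Made-up values standing in for test-set products
actual_loves    <- c(1000, 2500, 400)
predicted_loves <- c(1200, 2000, 900)

mae <- mean_abs_error(actual_loves, predicted_loves)
print(mae)  # 400
```

Running the same function on the training set and on the test set and comparing the two results is one way to check stability: similar error sizes suggest the model behaves the same on products it has not seen.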

Overall, the model is a decent predictor. Since this is not a life-or-death situation, the low R-squared is acceptable, and since the model is accurate and stable, I would use it. The only thing I would worry about is that the intercept is not significant, so we could collect more data to try to improve its significance.

This was not originally the problem I was trying to solve. I was trying to predict prices, but I soon realized that I would need different predictor variables, which I did not have, to predict price effectively. Predicting the number of loves a product has was the next best thing, and it is a useful thing for Sephora to look at to see how well its products are being received by customers. If Sephora wanted to use this model effectively, the next step would be to collect more recent data and add more variables.

Thank you for following along on this journey of exploring R with me!!
