Now we can test the assumptions and requirements of our logistic regression model, just like we did for our linear regression model.
The logistic regression model I created in the last post is shown below:

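For reference, here is a minimal sketch of how a model like this can be fit in Python with statsmodels. The file name and column names (marketing_flag, loves, price_difference, rating) are assumptions standing in for the actual Sephora dataset.

```python
# A sketch of fitting the logistic regression, assuming hypothetical
# column names; substitute the real ones from the Sephora dataset.
import pandas as pd
import statsmodels.formula.api as smf

sephora = pd.read_csv("sephora.csv")  # hypothetical file name

# Model the probability of a marketing flag (0/1) from the three predictors
model = smf.logit("marketing_flag ~ loves + price_difference + rating",
                  data=sephora).fit()
print(model.summary())
```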
Its model equation would look like this:
log(p/(1-p)) = 0.664 - 0.0000023*Loves + 0.0463*(Difference in Price) - 0.194*Rating
Where p is the probability that there is a marketing flag listed on a product.
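To see what the equation means in practice, here is a quick sketch that converts the log-odds into a probability for a made-up product; the input values are purely illustrative.

```python
import math

# Coefficients from the model equation above
intercept = 0.664
b_loves = -0.0000023
b_diff = 0.0463
b_rating = -0.194

# Made-up product: 50,000 loves, $5 price difference, 4.2 rating
log_odds = intercept + b_loves * 50_000 + b_diff * 5 + b_rating * 4.2

# Invert the logit to get a probability: p = 1 / (1 + e^(-log_odds))
p = 1 / (1 + math.exp(-log_odds))
print(f"Predicted probability of a marketing flag: {p:.3f}")
```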
Logistic regression models have six checks in total.
Good Linear Model
Just like in the linear regression assumptions, it’s important to include all relevant variables and exclude all irrelevant variables. Since we are just practicing our modeling skills, we will simply proceed through this one.
No Perfect Multicollinearity
Again, just like in linear regression, multicollinearity still matters! Below are the code and the result from testing my logistic regression model for multicollinearity:

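If you want to reproduce the check, one simple way is a correlation matrix of the predictors; an absolute correlation at or near 1 between two predictors would indicate perfect multicollinearity. The file and column names are the same assumptions as above.

```python
import pandas as pd

sephora = pd.read_csv("sephora.csv")  # hypothetical file name, as above

# Pairwise correlations among the predictors; off-diagonal values at or
# near 1 (or -1) would signal perfect multicollinearity.
print(sephora[["loves", "price_difference", "rating"]].corr())
```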
As we can see, there is no perfect multicollinearity among the predictors; none of these values are even close to 1. We pass this assumption.
Independent Errors
Our test for independence relies solely on our intuition. My intuition is the same as it was in the linear regression assumptions: some of these products are grouped into the same categories, and many of the products share the same brands, so one brand could be marketed more than another. This means the errors are likely not fully independent.
Complete Information
Complete information is a hard assumption to test, but we can begin by checking whether we have a full range of loves, price differences, ratings, and products with and without marketing flags. I created histograms/plots to show the range of data for these variables.

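Here is a sketch of how those histograms/plots can be produced with matplotlib, again using the assumed column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

sephora = pd.read_csv("sephora.csv")  # hypothetical file name, as above

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(sephora["loves"], bins=50)
axes[0, 0].set_title("Loves")
axes[0, 1].hist(sephora["price_difference"], bins=50)
axes[0, 1].set_title("Difference in Price")
axes[1, 0].hist(sephora["rating"], bins=20)
axes[1, 0].set_title("Rating")
# The marketing flag is binary, so a bar chart of counts fits better
flag_counts = sephora["marketing_flag"].value_counts().sort_index()
axes[1, 1].bar(["No flag", "Flag"], flag_counts.values)
axes[1, 1].set_title("Marketing Flags")
plt.tight_layout()
plt.show()
```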
As we can see, we have a large range of data, and some of the graphs show that we do not have complete information. Three of the graphs show a concentration on a few values. For the rating histogram, we do not have enough information about lower ratings: most ratings are higher than a 3, which may be because Sephora stocks high-end, quality products. As for the difference in price histogram, most products at Sephora are not discounted, and when they are it is not by much, which is why we see a higher concentration at 0 and on the lower end. For the loves histogram, we have a higher concentration between 0 and 100,000, and I am not exactly sure why that is. It is possible that the super high love counts belong to specific, highly rated products that may be exclusive to Sephora. The marketing flags graph represents all products: each product either has a marketing flag or does not.
The only graph that really poses an issue is the difference in price histogram, because I am not sure it represents all combinations. On the other hand, we do have 9,168 observations, which means we have a reasonable amount of data; we just do not have reasonably distributed data.
Complete Separation
To test for complete separation, I have prepared the following scatterplots:


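Here is a sketch of how these scatterplots can be drawn, plotting the 0/1 marketing flag against each predictor (same assumed column names):

```python
import pandas as pd
import matplotlib.pyplot as plt

sephora = pd.read_csv("sephora.csv")  # hypothetical file name, as above

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ["rating", "loves", "price_difference"]):
    # Complete separation would appear as a cutoff value of the predictor
    # that cleanly splits flagged from unflagged products.
    ax.scatter(sephora[col], sephora["marketing_flag"], alpha=0.3)
    ax.set_xlabel(col)
axes[0].set_ylabel("Marketing flag (0 = no, 1 = yes)")
plt.tight_layout()
plt.show()
```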
Because no vertical line can be drawn in any of the plots to separate products with and without marketing flags by ratings, loves, or difference in price, this model is not suffering from complete separation.
Large Sample Size
Usually a logistic regression model requires thousands of observations, and in this case we are in the clear because the Sephora dataset has 9,168 observations. My model passes the large sample size requirement but does not entirely pass others, such as complete information and independent errors. The model can still run effectively, but we could see poor prediction accuracy for products with a very large number of loves or a large difference in price.
We will discuss prediction accuracy for our models soon, and one way we could improve our dataset might be to remove outliers or collect more data.
Thank you for following along!!