Logistic Regression Prediction

After taking a look at our model with linear regression prediction, it is time to return to our logistic regression model mentioned in previous posts and assess its prediction accuracy and stability.

As a refresher, this is the model we are dealing with:

log(p/(1-p)) = 0.664 – 0.0000023*Loves + 0.0463*Difference in Price – 0.194*Rating

Where p is the probability that there is a marketing flag listed on a product.

To assess our model's prediction accuracy and stability, we will first divide the dataset into a training set and a test set: 70% of the data is randomly selected for the training set, and the rest goes into the test set. I will build the model using the training data only and then compare prediction performance on the training data vs. the unseen products in the test set.

The split creates two new datasets, and we can check the dimensions of each to confirm the 70/30 division:
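The author's original R code for this step isn't shown here; a minimal Python sketch of the same 70/30 random split (the dataset size `n` is a stand-in, not the real product count) could look like:

```python
import random

random.seed(42)  # reproducible split

n = 1000  # stand-in for the number of products in the dataset
indices = list(range(n))
random.shuffle(indices)

cut = int(0.7 * n)  # 70% of rows go to training
train_idx, test_idx = indices[:cut], indices[cut:]

# analogous to checking dim() on each new dataset in R
print(len(train_idx), len(test_idx))  # 700 300
```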

Now we can build our logistic regression model to predict if products have marketing flags or not in our training data set only:
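The fitting step itself isn't shown in this post (in R it would be a `glm(..., family = binomial)` call). As an illustration only, here is a plain-NumPy sketch of fitting a logistic regression by gradient ascent on simulated data; the three columns mirror the post's predictors (Loves, Difference in Price, Rating), but the numbers are synthetic, generated from the post's fitted equation so the data contains real signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training columns (the real data isn't shown)
n = 700
X = np.column_stack([
    rng.integers(0, 200_000, n).astype(float),  # Loves
    rng.normal(0.0, 5.0, n),                    # Difference in Price
    rng.uniform(1.0, 5.0, n),                   # Rating
])
# simulate marketing flags from the post's fitted equation
logit = 0.663 - 0.0000019578 * X[:, 0] + 0.044422 * X[:, 1] - 0.196525 * X[:, 2]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

# Fit log(p/(1-p)) = b0 + b1*Loves + b2*PriceDiff + b3*Rating by
# gradient ascent on the log-likelihood (R's glm() uses IRLS instead)
mu, sd = X.mean(axis=0), X.std(axis=0)
Xd = np.column_stack([np.ones(n), (X - mu) / sd])  # standardize for stable steps
beta = np.zeros(4)
for _ in range(5000):
    p = 1 / (1 + np.exp(-(Xd @ beta)))
    beta += 0.5 * Xd.T @ (y - p) / n

print(beta)  # intercept plus coefficients on the standardized scale
```

Note the coefficients come out on the standardized scale here, so they are not directly comparable to the raw-scale equation below.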

In this training set model, all of the predictors and the intercept are significant.

The model equation is:

log(p/(1-p)) = 0.663 – 0.0000019578*Loves + 0.044422*Difference in Price – 0.196525*Rating
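To see the equation in action, we can plug a hypothetical product into it and invert the log-odds with the logistic function; the feature values below are made up for illustration:

```python
import math

# Hypothetical product (these feature values are made up)
loves, price_diff, rating = 50_000, 2.00, 4.2

log_odds = (0.663
            - 0.0000019578 * loves
            + 0.044422 * price_diff
            - 0.196525 * rating)
p = 1 / (1 + math.exp(-log_odds))  # probability of a marketing flag
print(round(p, 3))  # 0.457
```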

Now I can make predictions for the test set using this model equation:

Once I have predictions, I can round the probabilities into 0's and 1's, which assigns a yes (1) or no (0) prediction to each product.
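The rounding step is a simple threshold at 0.5; a sketch with made-up predicted probabilities:

```python
# Made-up predicted probabilities for four test-set products
probs = [0.31, 0.62, 0.47, 0.85]

# threshold at 0.5: 1 = marketing flag predicted, 0 = no flag
tstpred = [1 if p >= 0.5 else 0 for p in probs]
print(tstpred)  # [0, 1, 0, 1]
```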

Now trnactual, trnpred, tstactual, and tstpred are lists of 0’s and 1’s for each product, and I can use confusion matrices to compare my predictions (trnpred and tstpred) to the actual product values (trnactual and tstactual).

The code for confusion matrices never changes, so this code will always be helpful once we have built the trnpred, tstpred, trnactual, and tstactual variables. The R output for the confusion matrices is below:
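In R this is typically `prop.table(table(actual, pred))`; an equivalent Python sketch with toy labels (the real trnactual/trnpred vectors aren't shown here):

```python
from collections import Counter

def confusion(actual, pred):
    """Cross-tabulate actual vs. predicted 0/1 labels as shares of the total,
    like prop.table(table(actual, pred)) in R."""
    counts = Counter(zip(actual, pred))
    total = len(actual)
    return {(a, p): counts[(a, p)] / total for a in (0, 1) for p in (0, 1)}

# toy labels for illustration only
tstactual = [0, 0, 1, 1, 0, 1, 0, 0]
tstpred   = [0, 1, 1, 0, 0, 1, 0, 0]
for cell, share in confusion(tstactual, tstpred).items():
    print(cell, f"{share:.1%}")
```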

TRUE represents a 1 and FALSE represents a 0.

Stability: Comparing the training data (left) with the test data (right), the percentages in corresponding cells do not change much. This represents good stability in the model.

Accuracy: My model is correct whenever it predicts a 0 for a product having no marketing flag or a 1 for a product having a marketing flag. So my model is correct 58.8% of the time (the total of 48.5% for 0/0 and 10.3% for 1/1), which means it is incorrect 41.2% of the time. That is a very large error rate: more than 40% of the time, my model is making a mistake.
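The accuracy arithmetic above is just the sum of the two "correct" cells of the test-set confusion matrix:

```python
# Accuracy from the test-set confusion matrix: correct cells are 0/0 and 1/1
acc = 0.485 + 0.103  # shares reported above
err = 1 - acc
print(f"accuracy {acc:.1%}, error {err:.1%}")  # accuracy 58.8%, error 41.2%
```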

Further: When my model makes a mistake, about 3.8% of the time it predicts that a product has a marketing flag when it does not, and about 37.4% of the time it predicts that a product does not have a marketing flag when it actually does. This is not too bad given that we are only talking about advertisements on products. Since this is not a life-or-death situation, we could still use this model for marketing purposes.

Overall, I would say this model does a decent job in terms of stability and a passable one in terms of accuracy. As I said before, since this is not a life-or-death application, a prediction accuracy above 50% is workable. There are still plenty of ways to improve the model, even though it is not too bad right now.

Thank you for following along!!
