Outliers

While exploring my data over the past few weeks I could have looked for outliers more carefully, so I will take another look now and consider removing any observations that appear unusual.

First I will plot Loves vs. Number of Reviews, since Number of Reviews was my only highly significant predictor. Outliers should be removed before modeling begins, so I will use the original dataset, d, to look for them.

In the plots above we can see one or two extremely high, out-of-range observations in the Number of Reviews plot. We can also see several vertical lines; the most concerning is the one at the 1000 mark. I cannot think of a reason why many products would all have exactly the same number of reviews, so I will look into removing those points. In the ratings and number of reviews plot I do not see any irregularities. In the plot of Loves and exclusiveness (a binary variable) there is one observation far above the rest of the data. It should be taken care of when I trim the top percentile.

It is not necessarily strange for a product to have 0 loves and 0 reviews, so I will not consider removing those observations. For the extremely high observations, however, I will trim off the top percentile. First we have to find the top percentiles:
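A minimal sketch of percentile trimming in Python, using only the standard library. The column name `n_reviews`, the helper names `percentile` and `trim_top`, and the data are all made up for illustration; this is not the code or the dataset used in this post. The percentile is computed by linear interpolation between sorted values, which is one common convention among several.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0 to 100), interpolating linearly
    between the two nearest sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def trim_top(rows, column, q=99.95):
    """Keep rows whose value in `column` is at or below the q-th percentile."""
    cutoff = percentile([row[column] for row in rows], q)
    kept = [row for row in rows if row[column] <= cutoff]
    return kept, cutoff


# Made-up data: 999 ordinary products plus one extreme value.
rows = [{"n_reviews": n} for n in range(1, 1000)] + [{"n_reviews": 1_000_000}]

# Compare how many rows each candidate cutoff would remove.
for q in (99, 99.95):
    kept, cutoff = trim_top(rows, "n_reviews", q)
    print(f"q={q}: cutoff={cutoff:.1f}, removed={len(rows) - len(kept)}")
```

Printing the number of removed rows for each candidate cutoff makes it easy to see how aggressive a given percentile is before committing to it.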

I decided to use the 99.95th percentile as the cutoff because cutting at the 99th percentile wiped away too much data. As we can see below, trimming above the 99.95th percentile removed only 7 observations:

My new scatterplots now look like this:

To take a deeper look at the vertical lines in the Number of Reviews plot, I will aggregate the data by Number of Reviews and count the observations at each value, to see which values look the most concerning.
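The counting step can be sketched in Python with `collections.Counter`. The function name `most_common_values` and the review counts below are invented for illustration and are not taken from the dataset.

```python
from collections import Counter


def most_common_values(values, min_value=0, top=5):
    """Count how often each value occurs (ignoring values below min_value)
    and return the `top` most frequent (value, count) pairs."""
    counts = Counter(v for v in values if v >= min_value)
    return counts.most_common(top)


# Made-up review counts with suspicious repeats at round numbers.
n_reviews = [0, 0, 12, 45, 1000, 1000, 1000, 2000, 2000, 3000, 873]

# Skip the zeros, which are not considered unusual here.
for value, count in most_common_values(n_reviews, min_value=1):
    print(value, count)
```

Values that appear far more often than their neighbors are the ones that show up as vertical lines in a scatterplot.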

After a closer look, the vertical lines seem to be explained by how the data was collected: once the number of reviews reached 1000, the values appear to have been rounded to 1000, 2000, 3000, etc.
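One way to check a rounding explanation like this is to measure what share of the large values are exact multiples of 1000. The sketch below is only an illustration: the function name `share_rounded` and the numbers are hypothetical, and the threshold and step are assumptions based on the pattern described above.

```python
def share_rounded(values, threshold=1000, step=1000):
    """Return the fraction of values at or above `threshold` that are
    exact multiples of `step`, or None if no value reaches the threshold."""
    large = [v for v in values if v >= threshold]
    if not large:
        return None
    return sum(1 for v in large if v % step == 0) / len(large)


# Made-up review counts: varied below 1000, round numbers above it.
n_reviews = [3, 250, 999, 1000, 1000, 2000, 3000, 5000]
print(share_rounded(n_reviews))
```

A share close to 1 would be consistent with rounding at collection time, while a share close to 0 would suggest the repeated values need another explanation.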

The vertical lines are just an artifact of how the data was collected, and I do not think I should remove those observations, since that would take important information away.

It is safe to say that I can continue my model building.

Now that the outliers have been dealt with, let’s see whether removing them improves the overall fit of my model.

Now I will divide the dataset and build my model:
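As a rough illustration of the split-and-fit step, here is a minimal Python sketch that shuffles the data, fits a one-predictor least-squares line on the training part, and reports R-squared on the held-out part. It is a stand-in, not the model from this series: the single predictor, the 70/30 split, the seed, the helper names, and the data are all assumptions made for the example.

```python
import random


def train_test_split(pairs, test_share=0.3, seed=42):
    """Shuffle (x, y) pairs reproducibly and split them into train and test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("x has no variation, slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(pairs, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y for _, y in pairs) / len(pairs)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    if ss_tot == 0:
        raise ValueError("y has no variation, R-squared is undefined")
    return 1 - ss_res / ss_tot


# Made-up (number of reviews, loves) pairs with a little structured noise.
data = [(x, 3 * x + (x % 4)) for x in range(40)]

train, test = train_test_split(data)
slope, intercept = fit_line(train)
print("R-squared on held-out data:", r_squared(test, slope, intercept))
```

Running the same steps once on the full data and once on the trimmed data, with the same seed, gives two R-squared values that can be compared directly.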

If you recall from my previous post, we could see that R-squared was 61.71%. After removing the outliers, my R-squared has dropped to 49.26%.

I only removed 7 observations, yet R-squared dropped by a substantial 12.45 percentage points, so it makes sense to test for accuracy and stability again to make sure we are looking at the right thing and not getting false hope.

Typically, removing outliers is a good way to improve overall model fit, but here it seems to have decreased ours. Maybe they were not really outliers after all?

Thank you for following along!!
