Exploring my dataset with R

In this post I will be using R to explore my dataset further and understand it a little better. I will be working with the same variables mentioned in the last post, but as a refresher I will show them below.

From this screenshot we can see I have 9 variables to work with.

In this code on the left we can see that I have a large set of 9,168 observations. We also have 21 columns, which include the 9 variables listed above and some other information about the products.
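For anyone following along, here is a minimal sketch of how the row and column counts can be checked in R. The data frame name `sephora`, its column names, and its values are made-up placeholders for illustration, not the real dataset.

```r
# Toy stand-in for the real data; names and values are made up for illustration.
sephora <- data.frame(
  price       = c(10, 25, 48, 120),
  value_price = c(10, 25, 52, 120),
  rating      = c(4.5, 4.0, 3.5, 5.0)
)

dims   <- dim(sephora)    # c(number of rows, number of columns)
n_obs  <- nrow(sephora)   # observations
n_cols <- ncol(sephora)   # columns
str(sephora)              # column names, types, and the first few values
```

On the real data, `dim()` would be the call that reports the 9,168 rows and 21 columns.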

To look further into the dataset we can view some summary information, such as average prices. Below is the code I used to investigate the average regular price and the average value price.
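As a rough sketch of this kind of calculation, the averages can be computed with `mean()`. The column names `price` and `value_price` are my assumption about how the columns are labeled, and the toy values below are not the real prices, so the numbers they produce say nothing about Sephora.

```r
# Toy data with assumed column names; values are made up for illustration.
sephora <- data.frame(
  price       = c(10, 25, 48, 120),
  value_price = c(10, 25, 52, 120)
)

# na.rm = TRUE skips missing prices instead of returning NA.
mean_price       <- mean(sephora$price, na.rm = TRUE)
mean_value_price <- mean(sephora$value_price, na.rm = TRUE)

mean_price        # 50.75 for the toy values
mean_value_price  # 51.75 for the toy values
```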

At first glance these average prices look pretty good, and one would think that Sephora's prices are not too high. Since the average regular price and the average value price are very close, we can also infer that products do not go on sale very often.

But to see whether the mean was a good descriptor of typical prices, I decided to take a look at the standard deviation.
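A minimal sketch of the standard deviation step, again on toy values with assumed column names (`price`, `value_price`). Note that R's `sd()` uses the sample formula with an n - 1 denominator, which the manual calculation below reproduces.

```r
# Toy data with assumed column names; values are made up for illustration.
sephora <- data.frame(
  price       = c(10, 25, 48, 120),
  value_price = c(10, 25, 52, 120)
)

sd_price       <- sd(sephora$price, na.rm = TRUE)
sd_value_price <- sd(sephora$value_price, na.rm = TRUE)

# The same quantity by hand: square root of the sum of squared
# deviations from the mean, divided by (n - 1).
x <- sephora$price
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
```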

After running this code we can see that both have a very large standard deviation. Roughly speaking, a typical observation is around $47 away from the mean price and around $49 away from the mean value price.

To check whether the prices are roughly normally distributed, we can look at a histogram.
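Here is a small sketch of a histogram call in base R. The `prices` vector is made up and chosen to have a long right tail; comparing the mean with the median is a quick extra check I added, since a mean well above the median is a common sign of right skew.

```r
# Made-up prices with a few expensive items, for illustration only.
prices <- c(8, 10, 12, 15, 18, 22, 25, 30, 45, 60, 120, 350)

# hist() draws the plot and also returns the bin counts.
h <- hist(prices,
          breaks = 10,
          main   = "Histogram of price (toy data)",
          xlab   = "Price ($)")

# Quick numeric check for right skew: the mean is pulled above the median.
mean(prices) > median(prices)
```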

Looking at the histogram we have made, we can see that it is by no means normally distributed; it is actually skewed right. Since this variable is not normally distributed, we should take a look at the other variables.

After creating histograms for the numeric variables we can see that none of them is normally distributed. I think part of the problem here is that there are some huge outliers, because Sephora does carry some expensive products and some products have huge numbers of loves or reviews. In the future I might try to filter out some of the outliers and see if it changes how the dataset looks.
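One way to draw a histogram for every numeric column at once is to pick out the numeric columns and loop over them. This is only a sketch on a toy data frame; the column names are assumptions and the values are made up.

```r
# Toy data with assumed column names; values are made up for illustration.
sephora <- data.frame(
  brand             = c("A", "B", "C", "D", "E"),
  price             = c(10, 25, 48, 120, 300),
  rating            = c(4.5, 4.0, 3.5, 5.0, 4.0),
  number_of_reviews = c(3, 40, 150, 2000, 90000),
  stringsAsFactors  = FALSE
)

# Keep only the names of columns that hold numbers.
numeric_cols <- names(sephora)[sapply(sephora, is.numeric)]

# One histogram per numeric column.
for (col in numeric_cols) {
  hist(sephora[[col]],
       main = paste("Histogram of", col),
       xlab = col)
}
```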

To get a better understanding of marketing strategies I wanted to see how many products listed on the Sephora website had marketing flags. After running the code on the left we can see that it's almost a 50/50 split. This is interesting because marketing flags may persuade someone to buy a product, or even lead to more reviews being written about it. This raises the question: do products with marketing flags have a higher average number of reviews?
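A short sketch of how such a split can be counted with `table()` and turned into proportions with `prop.table()`. The column name `marketing_flag` is a placeholder I chose, and the six toy values are made up.

```r
# Toy data; the column name and values are placeholders for illustration.
sephora <- data.frame(
  marketing_flag = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

flag_counts <- table(sephora$marketing_flag)  # how many products per group
flag_share  <- prop.table(flag_counts)        # the same counts as proportions

flag_counts
flag_share
```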

Looking deeper into this question, I decided to take a crack at it using an aggregate function. After running it we can see that the products that do not have a marketing flag actually have a higher average number of reviews than those with marketing flags.
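For reference, a grouped average like this can be sketched with `aggregate()` and a formula. The column names are my assumptions and the toy numbers are arbitrary, so the output here does not reproduce or support the real result.

```r
# Toy data with assumed column names; the values are arbitrary.
sephora <- data.frame(
  marketing_flag    = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  number_of_reviews = c(10, 30, 200, 100, 20, 60)
)

# Mean number of reviews within each marketing_flag group.
avg_reviews <- aggregate(number_of_reviews ~ marketing_flag,
                         data = sephora,
                         FUN  = mean)
avg_reviews
```

The formula reads as "number_of_reviews grouped by marketing_flag", and the result has one row per group.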

I wanted to create another numeric variable to expand my research in this dataset. I decided to look at the number of loves and the number of reviews, and created a new variable called num_lovesperreview (loves divided by reviews) to see how close it was to 1. If it was close to 1, we could infer that nearly every person who left a review also left a love. This is a bit of a stretch, though, because not every person who reviews a product actually likes it.
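A minimal sketch of creating a ratio column like this. Only the name num_lovesperreview comes from my analysis; the columns `love` and `number_of_reviews` are assumed names, the values are made up, and the guard against zero reviews is an extra I added for the sketch (plain division would give `Inf` for those rows).

```r
# Toy data with assumed column names; values are made up for illustration.
sephora <- data.frame(
  love              = c(900, 150, 40),
  number_of_reviews = c(3, 150, 0)
)

# Loves per review; products with zero reviews get NA instead of Inf.
sephora$num_lovesperreview <- ifelse(
  sephora$number_of_reviews > 0,
  sephora$love / sephora$number_of_reviews,
  NA
)

sephora  # the new column is appended as the last column
```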

The code I used to create it is shown above, and the new variable sits at the end of the dataset. For the first product listed there are 750.5 loves for every review. We can conclude that a lot of people love the product but did not necessarily write a review.

In the future I will be trying to predict the price of a product using ratings or loves. I am going to try to incorporate price as much as possible because when a customer shops it is usually the first thing they look at, and it is a huge decision factor.

I know that this post was quite long but thank you for following along!!
