In this lesson, I will continue the discussion of the dplyr package. I started introducing basic data manipulation with dplyr in a previous post, which can be found here. To continue, we will go through how to group data by one or more variables in a data set. We are going to work with a file, customer.csv, which includes the columns gender, age, and visitcount used in the examples.

Let's read the file into an R object with filedata <- read.csv("c\\customer.csv").

We then convert it to a data frame tbl for use in dplyr with

filedatatbl <- tbl_df(filedata)
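As a minimal sketch of the read-and-convert step, the example below writes a tiny made-up CSV to a temporary file so that it runs anywhere; the rows are invented, and only the column names come from this post. It also uses as_tibble() rather than tbl_df(), since recent dplyr releases mark tbl_df() as deprecated in favor of as_tibble().

```r
library(dplyr)

# Hypothetical stand-in for customer.csv, written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("gender,age,visitcount",
    "F,25,3",
    "M,40,7"),
  csv_path
)

# Read the file, then convert the plain data frame to a tibble.
filedata    <- read.csv(csv_path)
filedatatbl <- as_tibble(filedata)

filedatatbl
```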

Assume we want to group our data set by gender and store the result in a new variable, by_gender.

by_gender <- group_by(filedatatbl, gender)

We can also group by multiple variables. For example, we can group by gender and age with the command by_gender_age <- group_by(filedatatbl, gender, age)
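To make the effect of grouping concrete, here is a small sketch on made-up data (the customers data frame is hypothetical; only its column names come from this post). It groups by one and by two variables, then inspects the result with group_vars() and n_groups().

```r
library(dplyr)

# Made-up stand-in for the customer data.
customers <- data.frame(
  gender     = c("F", "M", "F", "M", "F"),
  age        = c(25, 40, 25, 10, 31),
  visitcount = c(3, 7, 1, 4, 9)
)

by_gender     <- group_by(customers, gender)
by_gender_age <- group_by(customers, gender, age)

# group_vars() lists the grouping columns; n_groups() counts the groups.
group_vars(by_gender_age)  # "gender" "age"
n_groups(by_gender)        # 2
n_groups(by_gender_age)    # 4 (the two F/25 rows share one group)
```

Grouping does not change the rows themselves. It only records how later operations should split them.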

When you print a grouped data set, it tells you the grouping variables and how many groups were generated. In addition, once the data is grouped we can perform operations on each group separately. For example, we can find the mean age of females and of males after grouping by gender. In the previous discussion we learnt the function summarize(), which collapses a data set into a single row. For grouped data, summarize() collapses each group into one row of the new data set. So to find the average age per gender, we can issue the command

summarize(by_gender, mean(age))
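The sketch below shows the per-group summary on a few invented rows (the customers data frame and the mean_age column name are assumptions for illustration). Naming the summary is optional, but it gives a tidier column name than the default mean(age).

```r
library(dplyr)

# Made-up data: two females aged 20 and 40, two males aged 30 and 50.
customers <- data.frame(
  gender = c("F", "M", "F", "M"),
  age    = c(20, 30, 40, 50)
)

by_gender <- group_by(customers, gender)

# One output row per group, with the mean age stored in mean_age.
age_by_gender <- summarize(by_gender, mean_age = mean(age))
age_by_gender
# F has mean_age 30, M has mean_age 40.
```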

 

Now let's move on to chaining, which allows you to string together multiple function calls in a way that is compact and readable while still producing the desired result.

For example, to get the mean age per gender in the step above, we first grouped the data, stored it in an intermediate variable by_gender, and then called summarize() on that variable. The idea of chaining is to perform the whole step in a single expression without storing the data in intermediate variables. The chain operator is %>%, and it is used as follows

filedatatbl %>% group_by(gender)

This means: take what is on the left of the %>% operator and pass it as the first argument to the function on the right. You will notice I did not include filedatatbl in the group_by() call, because it is supplied from the left of %>%.

To continue to get the mean age after grouping by gender, we write

filedatatbl %>% group_by(gender) %>% summarize(mean(age))
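To see that chaining changes only how the code is written and not what it computes, this sketch runs the same group-then-summarize step twice on made-up data: once as nested calls and once with %>%. The customers data frame and the mean_age name are assumptions for illustration.

```r
library(dplyr)

# Made-up data for illustration.
customers <- data.frame(
  gender = c("F", "M", "F"),
  age    = c(20, 30, 40)
)

# Nested form: read from the inside out.
nested <- summarize(group_by(customers, gender), mean_age = mean(age))

# Chained form: read from left to right.
piped <- customers %>%
  group_by(gender) %>%
  summarize(mean_age = mean(age))

# Compare the two results as plain data frames.
all.equal(as.data.frame(nested), as.data.frame(piped))
```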

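That accomplishes the steps we performed earlier in a single expression. Now let's say we want to arrange the output by visitcount so that the group with the highest visit count is at the top. After summarize(), only the grouping variable and the summary columns remain, so visitcount has to be carried through as a summary of its own. Here we use the total per gender, stored in a new column total_visits.

filedatatbl %>% group_by(gender) %>% summarize(mean_age = mean(age), total_visits = sum(visitcount)) %>% arrange(desc(total_visits))

This can continue until we reach the final result we want. Let's assume we want only customers older than 12 years to count toward the final output. Because age is no longer a column after summarize(), the filter has to come before the grouping.

filedatatbl %>% filter(age > 12) %>% group_by(gender) %>% summarize(mean_age = mean(age), total_visits = sum(visitcount)) %>% arrange(desc(total_visits))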
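The following sketch runs a chain of this kind end to end on made-up data, so the order of the steps can be checked by hand. The customers data frame and the mean_age and total_visits column names are assumptions for illustration; only gender, age, and visitcount come from this post.

```r
library(dplyr)

# Made-up stand-in for the customer data.
customers <- data.frame(
  gender     = c("F", "M", "F", "M", "F"),
  age        = c(25, 40, 11, 10, 31),
  visitcount = c(3, 7, 1, 4, 9)
)

result <- customers %>%
  filter(age > 12) %>%                      # keep customers older than 12
  group_by(gender) %>%                      # one group per gender
  summarize(mean_age     = mean(age),       # summaries are computed per group
            total_visits = sum(visitcount)) %>%
  arrange(desc(total_visits))               # highest total first

result
# F: mean_age 28, total_visits 12; M: mean_age 40, total_visits 7.
```

The filter runs first because it needs the age column of individual customers, which no longer exists once each group has been collapsed to one row.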

 

The benefit of %>% is that it allows us to chain function calls in a linear fashion. The code to the right of %>% operates on the result of the code to its left.

I hope this quick review is helpful.

 

 
