In this post, l will briefly go over the dplyr package which can make data manipulation very easy in R. First install the package from CRAN by issuing install.packages(“dplyr”). After installing load the package by issuing the command library(dplyr). This tutorial is on dplyr version 0.4.0 or later so check to make sure the requirement is met with command  packageVersion(“dplyr”)

Lets assume we have a .csv file containing customer information such as below stored on our c drive.

Let’s read the file into a variable filedata with filedata <- read.csv(“c\\customer.csv”). we can see the content of the variable by typing filedata .To manipulate data in dplyr, we have to convert  filedata (data frame ) to a data frame tbl in dplyr. The command  is filedatatbl <- tbl_df(filedata). The data frame tbl prints nicer than the regular data frame. The manipulation will be made on filedatatbl so we can remove the filedata from our workspace with rm(“filedata”). Here is a screen shot of above steps.

Lets go over  5 verbs dplyr provided for data manipulation; select(), filter(), arrange(),mutate() and sumarize().

To quickly find documentation on any of the above verbs you can type ?select etc

Lets select customerid and registrationdate from filedatatbl:

select(filedatatbl, customerid, registrationdate). The systax says provide the data frame tbl as the first arguement and pass all the columns you want to see in the output.

You can also use the sequence generator : to specify range of columns to select.

For example, if l want to select  firstname  through age, l can use command

select(filedatatbl,firstname :age)

You can also select the same columns in reverse with command

select(filedatatbl,age:firstname)

Instead of selecting only the columns to see in the filedatatbl, you can also select the columns you don’ t want to see with a – before the name. For example, if l don’t want to see the registrationdate, l can write select(filedatatbl,-registrationdate)

To eliminate range of columns from lastname through registrationdate, we can use command select(filedatatbl,-(lastname:registrationdate))

below screen shot shows above steps

Now lets play with the filter verb. Assume you have customers in your data and you want to know those who registered on a particular date say 1/1/2015

filter(filedatatbl,registrationdate ==”1/1/2015″)

You can also find out customers who registered on 1/1/2015  and age less than 20.

filter(filedatatbl,registrationdate ==”1/1/2015″,age < 20 )

We can also say we want to see all customers who either registered on 1/1/2015 or age is less  20.

filter(filedatatbl,registrationdate ==”1/1/2015″ | age < 20 )

Below screen shot shows the steps above.

Now lets go ahead and arrange or sort our data using the arrange verb in dplyr.

Lets say we want to display the customers by age from  youngest to oldest, we use command arrange(filedatatbl,age). To accomplish the opposite by arranging from oldest to youngest, we use arrange(filedatatbl,desc(age))

Lets arrange by multiple columns firstname and lastname

arrange(filedatatbl,firstname, lastname) and we can do either of them descending as well

arrange(filedatatbl,firstname, desc(lastname))

Below screenshot shows the steps above.

Lets get to the verb mutate(). This is used to create a new variable in the dataset based on existing variables. Assume we want to add a new variable fullname  we use command

mutate(filedatatbl,fullname = paste(firstname,lastname,sep=” “)

The last verb is summarize() which collapses the dataset to a single row. lets say we want to find out the average age of our customers. we can accomplish with

summarize(filedatatbl, avg_age = mean( age)).

Screenshot below shows steps above.

I hope this hands on helps in getting familiar with dplyr package.

 

 

 

 

 

 

 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

Name *