I have already written a little about the statistics software R and its package R Commander, and how to use them. Today I’m going to touch on one simple use of R that can give newcomers some difficulty: how to draw a box plot, and how to draw side-by-side box plots. To begin with, a box plot is a type of graph that makes a data set easy to interpret at a glance. I’m not going into a deep explanation of box plots here, so it’s worth googling their importance and uses; I’m assuming you know what one looks like.
To draw a box plot, you normally start from the five number summary of your data. The five number summary consists of the lowest observation, the first quartile, the median (or second quartile), the third quartile and the highest observation. To keep things simple, we can use this definition of the five number summary:
Five number summary = Min Q1 Q2 Q3 Max, where Min = lowest observation, Q1 = first quartile, Q2 = median, Q3 = third quartile and Max = highest observation.
To get the five number summary, you have to first arrange your data set in ascending order. Then, the Min is simply the first number and the Max is the last number.
To find the quartiles (Q1, Q2, Q3), we must do a little bit of calculation to locate each of them in the ordered data set.
By definition, the quartiles divide the ordered numerical data into four equally sized parts. Intuitively, Q1 has 25% of data below it, Q2 has 50% of data above and below it, Q3 has 75% of data below it.
Let’s say the total number of elements in our data set is n, and let A be the index, or position, in the ordered data set where a quartile is located. Now let’s calculate A for each quartile and find the data value at that position.
For Q1, A = 25% * n
For Q2, A = 50% * n
For Q3, A = 75% * n
Now, if the computed value of A is a fraction, round it up. (Always round up, never down; for example 2.1 => 3 and 2.6 => 3.) Then go back to your ordered data set and count the elements from the beginning up to position A. The element at position A is the value of the quartile you are calculating.
On the other hand, if A comes out as a whole number, the quartile is found like this: locate the elements at position A and position A+1 of your ordered data set, add them together and divide the sum by 2. The result is the quartile you are calculating.
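The rule above can be sketched as a small R function. This is only an illustration: quartile_by_position is a helper name made up for this post, and the visitor counts are invented sample data.

```r
# Quartile by the "round up, or average two neighbours" rule described above.
# p is the fraction for the quartile: 0.25, 0.5 or 0.75.
# (<- is the usual R assignment; = works too.)
quartile_by_position <- function(data, p) {
  sorted <- sort(data)          # arrange in ascending order
  n <- length(sorted)
  a <- p * n                    # the position A
  if (a == floor(a)) {
    # A is a whole number: average the elements at A and A + 1
    (sorted[a] + sorted[a + 1]) / 2
  } else {
    # A is a fraction: round up and take that element
    sorted[ceiling(a)]
  }
}

visitors <- c(12, 7, 3, 15, 9, 21, 5, 18)   # invented sample data

five_numbers <- c(min(visitors),
                  quartile_by_position(visitors, 0.25),
                  quartile_by_position(visitors, 0.50),
                  quartile_by_position(visitors, 0.75),
                  max(visitors))
print(five_numbers)
```

For these eight values the sketch gives 3, 6, 10.5, 16.5 and 21. As far as I can tell, R’s built-in quantile() with type = 2 follows the same rule, so quantile(visitors, c(0.25, 0.5, 0.75), type = 2) makes a quick cross-check.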
So at this point we have our five number summary, and we are going to draw the box plot using R.
Start R on your computer. You can do everything here with typed commands, or you can use R Commander to do the same thing; in this case I’m going to use only commands. So I’ve started R, and I’m going to read the five number summary of my data set into R and draw the box plot from it. I’m at the R command prompt, which looks like >, and now I’m going to enter the commands to read in my values and get the box plot drawn. (Remember that R is case sensitive.)
>x = scan()
When I type x = scan() and press enter, R lets me enter my values and stores them in “x”, which we can think of as an array or a list. After you press enter you’ll see the prompt 1: and you can type all five numbers on the same line with a space between each value, or type one per line and press enter to go to the next line. When you have finished entering the data, press enter twice (the second time on an empty line) and R will show you how many values you have entered, something like “Read 5 items”. Then you’ll see the command prompt again, waiting for you to do some work on the data you’ve entered. Note that boxplot() works out the five number summary by itself, so you can also enter the raw observations here instead of the summary. So let’s draw the box plot using the values stored in x. The command to use is boxplot()
>boxplot(x)
Type that and press enter. The box plot will show in a new window. You can maximize and print the box plot using the controls on that window.
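If you would rather not type the values interactively, the same box plot can be drawn from values written directly in the command with c(). A minimal sketch, with invented numbers standing in for a five number summary; the name box_info is just for illustration.

```r
# Invented five number summary: Min, Q1, Q2, Q3, Max
# (<- is the usual R assignment; = works too.)
x <- c(3, 6, 10.5, 16.5, 21)

# boxplot() draws the plot and also returns, invisibly, the numbers it used
box_info <- boxplot(x)

# The five values behind the whisker ends, the box edges and the median line
print(box_info$stats)
```

With only these five values, the numbers in box_info$stats are the same five you typed in.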
Now, by default the box plot is drawn vertically, so if you want it to be horizontal, just set horizontal = TRUE (typed exactly like that, since R is case sensitive).
>boxplot(x, horizontal = TRUE)
If you want a nice color for the box plot, use the col argument.
>boxplot(x, horizontal = TRUE, col = "blue")
Let us label the box plot by giving it a title and a label for the x axis, to show that it is the five number summary of some data.
>boxplot(x, horizontal = TRUE, col = "blue", xlab = "Five number summary of visitors", main = "2smart4school site visitors boxplot")
OK, I hope that idea is clear; for more information on the commands in R, read the manual. Now I’m going to discuss the side-by-side box plot, which is simply two box plots drawn on the same graph for comparison. Let’s say we have five number summaries of the visitors to 2smart4school.com and mypcdr.net (my company’s website) for a particular month, and we want to compare them using a side-by-side box plot. The process is the same as above, with small differences.
I’ll read the data set for 2smart4school into an array or list A and the one for mypcdr into an array or list B. The reading is the same as we did above: after reading the first data set with A = scan() and arriving back at the command prompt, type B = scan() to read the second data set.
After reading the data into A and B, we plot our side-by-side box plot with the command boxplot() as usual, but this time we put both A and B in the brackets.
>boxplot(A,B)
That will draw the two box plots side by side on the same graph.
If you want to use color to tell the boxes apart, simply specify a different color for each of them.
>boxplot(A, B, col = c("blue", "yellow"), horizontal = TRUE, main = "Side-by-side box plot of site visitors for 2smart4school and mypcdr", xlab = "five number summaries")
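To try the side-by-side plot without typing data at the prompt, here is a self-contained sketch. The visitor counts in A and B are made up for illustration, and both_sites is a hypothetical name for the value that boxplot() returns.

```r
# Made-up daily visitor counts for the two sites
A <- c(120, 135, 150, 160, 175, 190, 210)   # 2smart4school
B <- c(80, 95, 100, 110, 130, 145, 170)     # mypcdr

# One box per data set, drawn on the same axis
both_sites <- boxplot(A, B,
                      col = c("blue", "yellow"),
                      horizontal = TRUE,
                      main = "Side-by-side box plot of site visitors",
                      xlab = "Visitors")

# One column of five numbers per box: column 1 is A, column 2 is B
print(both_sites$stats)
```

The third row of both_sites$stats holds the two medians, which is often the first thing to compare between the boxes.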
Now, let’s add a legend to the side-by-side box plot so it is easy to read what each box represents.
Once the graph is drawn and looks the way you like it in the new window, click back in the window where you were typing the commands and continue from there. Do not close the window with the box plots, or you’ll have to draw it again before adding the legend. Just leave it open and click in the R window where the command prompt shows. The command for adding a legend in R is legend(), so at the command prompt we type the following
>legend(x = "bottomright", legend = c("2smart4school", "mypcdr"), fill = c("blue", "yellow"))
This tells R to add a legend box at the bottom right with the labels 2smart4school and mypcdr, with a blue fill for 2smart4school and a yellow fill for mypcdr. You can specify different locations for the legend: bottomright, bottomleft, bottom, top, topleft, topright, center and so on. As soon as you finish typing the command and press enter, the legend shows up in the side-by-side box plot window at the position you specified for x.
legend() only adds to the plot that is already there, so when you want a different location or you’ve made a mistake, you have to redraw the box plot first. Close the window containing the boxes and, at the command prompt in R, press the up or down arrow key until the command you used to draw the boxes shows up. If it doesn’t, don’t freak out: you can copy the command from what you typed earlier in R, paste it at the current command prompt and press enter to get the boxes drawn again. After the boxes are drawn, copy the legend command and paste it at the command prompt, make the changes you want and press enter. The legend will now appear with the changes you’ve specified.
If you spot any mistake in this discussion, please don’t hesitate to leave a comment so I can take care of it. There is a lot of cool stuff to be done with R, and I think it’s good to start using it and reading its manual, whether you are involved with statistics in some way or just doing it for fun.