In this post, I am going to show functions that I have written to work with files containing observations of pollutants monitored at different locations. The data are stored as .csv files, one for each of the 332 monitors set up at different locations. A sample of the data looks like this:

"Date","sulfate","nitrate","ID"
"2003-01-01",NA,NA,1
"2003-01-02",NA,NA,1
"2003-01-03",NA,NA,1
"2003-01-04",NA,NA,1
"2003-01-05",45,8,1
"2003-01-06",NA,NA,1
"2003-01-07",NA,NA,1
"2003-01-08",NA,NA,1
"2003-01-09",NA,NA,1
"2003-01-10",56,23,1

Each monitor measures the pollutants sulfate and nitrate at a particular site. The ID column holds the ID of the monitor, and the Date column holds the date the measurement was taken. Each monitor has its own file, named after its ID padded to three digits, for example 001.csv, 009.csv, 010.csv, 100.csv, 332.csv. The problem is that, after collecting readings from all 332 monitors, you want to analyse the data and see what is going on in terms of means, complete cases and the correlation between the two pollutants.
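The naming scheme is simply the monitor ID padded with zeros to three digits. As a minimal sketch, the names can be built with sprintf; the helper name monitor_file is hypothetical and only used for illustration, and the lowercase .csv extension is assumed to match what the functions in this post read.

```r
# Hypothetical helper: turn monitor IDs into zero-padded file names.
monitor_file <- function(id) {
  # "%03d" pads the ID with leading zeros to a width of three digits
  sprintf("%03d.csv", id)
}

monitor_file(c(1, 9, 10, 100, 332))
# "001.csv" "009.csv" "010.csv" "100.csv" "332.csv"
```

Because sprintf is vectorised, the same call works for a single ID or a whole range such as 1:332.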

For these problems, we assume that all the files are in a directory on our computer called Spectra. The first task is to write an R function that takes the directory where the files are located, a pollutant name, and a range of monitor IDs, and returns the mean of that pollutant across those monitors, ignoring missing values. Let's call the function pollutantmean.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  # Accumulates the non-missing readings from every requested monitor
  pol_vectRmNa <- vector(mode = "numeric", length = 0)
  for (value in id) {
    # Monitor files are named with a zero-padded, three-digit ID, e.g. 001.csv
    current_file <- file.path(directory, sprintf("%03d.csv", value))
    file_data <- read.csv(current_file)
    readings <- file_data[[pollutant]]
    pol_vectRmNa <- c(pol_vectRmNa, readings[!is.na(readings)])
  }
  # Mean over all pooled readings, not a mean of per-monitor means
  mean(pol_vectRmNa)
}
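To try the idea without the real data, here is a rough, self-contained sketch. It writes two tiny monitor files with made-up values into a temporary directory and pools one pollutant across them. The names demo_dir and pooled_mean are hypothetical, and the version below uses lapply with na.rm = TRUE rather than the loop above, so treat it as one possible variant and not the definitive approach.

```r
# Build a throwaway directory with two tiny monitor files (made-up values).
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(NA, 4), nitrate = c(NA, 2), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(2, 6), nitrate = c(1, NA), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Read the requested monitors, pool one pollutant column, and average it.
pooled_mean <- function(directory, pollutant, id) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  readings <- unlist(lapply(paths, function(p) read.csv(p)[[pollutant]]))
  # na.rm = TRUE drops the missing readings before averaging
  mean(readings, na.rm = TRUE)
}

pooled_mean(demo_dir, "sulfate", 1:2)  # (4 + 2 + 6) / 3 = 4
```

Note that the result is the mean of all pooled readings, so a monitor with more valid observations carries more weight than one with few.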

The second problem: given the directory and a range of monitor IDs, write an R function that reports the number of complete cases (rows with no missing values) for each monitor. Let's call this function complete.

complete <- function(directory, id = 1:332) {
  vect_id <- vector("numeric", length = 0)
  vect_nobs <- vector("numeric", length = 0)
  for (value in id) {
    # Monitor files are named with a zero-padded, three-digit ID, e.g. 001.csv
    current_file <- file.path(directory, sprintf("%03d.csv", value))
    file_data <- read.csv(current_file)
    # A complete case is a row with no NA in any column
    cnt <- sum(complete.cases(file_data))
    vect_id <- c(vect_id, value)
    vect_nobs <- c(vect_nobs, cnt)
  }
  # One row per monitor: its ID and its number of complete observations
  complete_data <- data.frame(id = vect_id, nobs = vect_nobs)
  complete_data
}
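To see what complete.cases actually counts, here is a small sketch that parses a few rows taken from the sample at the top of the post directly from a string. The variable names sample_text and sample_data are just for illustration.

```r
# A few rows from the sample data, parsed from a string instead of a file.
sample_text <- '"Date","sulfate","nitrate","ID"
"2003-01-04",NA,NA,1
"2003-01-05",45,8,1
"2003-01-06",NA,NA,1
"2003-01-10",56,23,1'
sample_data <- read.csv(text = sample_text)

# TRUE for every row in which no column is NA
complete.cases(sample_data)       # FALSE TRUE FALSE TRUE

# Summing the logical vector counts the complete rows
sum(complete.cases(sample_data))  # 2
```

Only the rows with both a sulfate and a nitrate reading count as complete, which is why this sample yields 2.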

Our last problem: given the directory and a threshold on the number of complete cases, write an R function that computes the correlation between the two pollutants, sulfate and nitrate, for every monitor whose number of complete cases is greater than the threshold. Let's call this function corr.

corr <- function(directory, threshold = 0) {
  cor_vect <- vector(mode = "numeric", length = 0)
  # Escape the dot and anchor the pattern so only names ending in .csv match
  files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  for (value in files) {
    file_data <- read.csv(value)
    good <- complete.cases(file_data)
    # Only monitors with more than `threshold` complete cases contribute
    if (sum(good) > threshold) {
      complete_data <- file_data[good, ]
      cor_vect <- c(cor_vect, cor(as.numeric(complete_data$sulfate), as.numeric(complete_data$nitrate)))
    }
  }
  cor_vect
}
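The threshold check and the correlation for a single monitor can be sketched on a small made-up data frame. The values and the names monitor, good and r are invented for illustration. The last line uses cor's use = "complete.obs" option, which is an alternative to filtering the rows by hand and not what the function above does.

```r
# Made-up readings for one monitor; NA marks a missing measurement.
monitor <- data.frame(
  sulfate = c(1, 2, NA, 4, 5),
  nitrate = c(2, 4, 6, NA, 10)
)

good <- complete.cases(monitor)  # rows 1, 2 and 5 are complete
n_complete <- sum(good)          # 3

threshold <- 2
# The comparison is strict: exactly `threshold` complete rows is not enough
r <- if (n_complete > threshold) {
  cor(monitor$sulfate[good], monitor$nitrate[good])
} else {
  NA_real_
}
r  # 1, since nitrate is exactly twice sulfate on the complete rows

# Alternative: let cor drop the incomplete rows itself
cor(monitor$sulfate, monitor$nitrate, use = "complete.obs")
```

With the threshold set to 3 instead, this monitor would be skipped, because it has exactly 3 complete rows and the comparison is strict.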

Comments and suggestions welcome!

Thanks for reading.

 
