domenica 27 aprile 2014

Basic SQL Operation in R



I want to have in R the equivalent of most of the basic operations normally performed in SQL.
In this post it will follow a sniplet in SQL and immediately after the correspondent in R.

Topics Covered:
- Distinct
- Where
- Inner / outer joins
- Group by


Before starting with the Pure R syntax, just keep in mind that R is providing a very useful package called SQLDF. Through this package it is possible to perform a simple SQL query over tables / data frames.

 # installs everything you need to use sqldf with SQLite  
 # including SQLite itself  
 install.packages("sqldf")  
 # shows built in data frames  
 data()   
 # load sqldf into workspace  
 library(sqldf)  
 sqldf("select * from iris limit 5")  
 sqldf("select count(*) from iris")  
 sqldf("select Species, count(*) from iris group by Species")  
 # create a data frame  
 DF <- data.frame(a = 1:5, b = letters[1:5])  
 sqldf("select * from DF")  
 sqldf("select avg(a) mean, variance(a) var from DF") # see example 15  

Source: http://code.google.com/p/sqldf/



WHERE


 SELECT *   
 FROM df1   
 WHERE product = "Toaster"  


In R:
 df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3))) ;  
 df <- df1[df1$Product=="Toaster",];  




DISTINCT

the select distinct in SQL:

 select distinct x  
 from my_table;  

The equivalent in R is:

 > x <- list(a=c(1,2,3), b = c(2,3,4), c=c(4,5,6))  
 > xx <- unlist(x)  
 > xx  
 a1 a2 a3 b1 b2 b3 c1 c2 c3   
  1 2 3 2 3 4 4 5 6   
 > unique(xx)  
 [1] 1 2 3 4 5 6  




INNER / OUTER JOINS

Having in SQL the following query:

 select *   
 from product [left] [right] [outer] join countries  
     on (product.customer_id = countries.customer_id)  


In R:
 df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3)))  
 df2 = data.frame(CustomerId=c(2,4,6),State=c(rep("Alabama",2),rep("Ohio",1)))  
 > df1  
  CustomerId Product  
       1 Toaster  
       2 Toaster  
       3 Toaster  
       4  Radio  
       5  Radio  
       6  Radio  
 > df2  
  CustomerId  State  
       2 Alabama  
       4 Alabama  
       6  Ohio  
 #Outer join:   
 merge(x = df1, y = df2, by = "CustomerId", all = TRUE)  
 #Left outer:   
 merge(x = df1, y = df2, by = "CustomerId", all.x=TRUE)  
 #Right outer:   
 merge(x = df1, y = df2, by = "CustomerId", all.y=TRUE)  
 #Cross join:   
 merge(x = df1, y = df2, by = NULL)  

Source:
http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right


GROUP BY


For the Group By function there are many options.
Let's start with the most basic one:

Having in SQL the following snipplet:
 CREATE TABLE my_table (  
  a varchar2(10 char),   
  b varchar2(10 char),   
  c number  
 );  
 SELECT a, b, mean(c)  
 FROM my_table  
 GROUP BY a, b  


In R:
 grouped_data <- aggregate(my_table, by=list(my_table$a, my_table$b, FUN=mean);  

Alternatively:
 > mydf  
  A B  
 1 1 2  
 2 1 3  
 3 2 3  
 4 3 5  
 5 3 6  
 > aggregate(B ~ A, mydf, sum)  
  A B  
 1 1 5  
 2 2 3  
 3 3 11  



If your data are large, I would also recommend looking into the "data.table" package.

  
 > library(data.table)  
 > DT <- data.table(mydf)  
 > DT[, sum(B), by = A]  
   A V1  
 1: 1 5  
 2: 2 3  
 3: 3 11  



And finally the most recommended ddply function:
 > DF <- data.frame(A = c("1", "1", "2", "3", "3"), B = c(2, 3, 3, 5, 6))  
 > library(plyr)  
 > DF.sum <- ddply(DF, c("A"), summarize, B = sum(B))  
 > DF.sum  
  A B  
 1 1 5  
 2 2 3  
 3 3 11  

Source:
http://stackoverflow.com/questions/18799901/data-frame-group-by-column

Nessun commento:

Posta un commento