I want to have in R the equivalent of most of the basic operations normally performed in SQL.
In this post it will follow a sniplet in SQL and immediately after the correspondent in R.
Topics Covered:
- Distinct
- Where
- Inner / outer joins
- Group by
Before starting with the Pure R syntax, just keep in mind that R is providing a very useful package called SQLDF. Through this package it is possible to perform a simple SQL query over tables / data frames.
# installs everything you need to use sqldf with SQLite
# including SQLite itself
install.packages("sqldf")
# shows built in data frames
data()
# load sqldf into workspace
library(sqldf)
sqldf("select * from iris limit 5")
sqldf("select count(*) from iris")
sqldf("select Species, count(*) from iris group by Species")
# create a data frame
DF <- data.frame(a = 1:5, b = letters[1:5])
sqldf("select * from DF")
sqldf("select avg(a) mean, variance(a) var from DF") # see example 15
Source: http://code.google.com/p/sqldf/
WHERE
SELECT *
FROM df1
WHERE product = "Toaster"
In R:
df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3))) ;
df <- df1[df1$Product=="Toaster",];
DISTINCT
the select distinct in SQL: select distinct x
from my_table;
The equivalent in R is:
> x <- list(a=c(1,2,3), b = c(2,3,4), c=c(4,5,6))
> xx <- unlist(x)
> xx
a1 a2 a3 b1 b2 b3 c1 c2 c3
1 2 3 2 3 4 4 5 6
> unique(xx)
[1] 1 2 3 4 5 6
INNER / OUTER JOINS
Having in SQL the following query: select *
from product [left] [right] [outer] join countries
on (product.customer_id = countries.customer_id)
In R:
df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3)))
df2 = data.frame(CustomerId=c(2,4,6),State=c(rep("Alabama",2),rep("Ohio",1)))
> df1
CustomerId Product
1 Toaster
2 Toaster
3 Toaster
4 Radio
5 Radio
6 Radio
> df2
CustomerId State
2 Alabama
4 Alabama
6 Ohio
#Outer join:
merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
#Left outer:
merge(x = df1, y = df2, by = "CustomerId", all.x=TRUE)
#Right outer:
merge(x = df1, y = df2, by = "CustomerId", all.y=TRUE)
#Cross join:
merge(x = df1, y = df2, by = NULL)
Source:
http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right
GROUP BY
For the Group By function there are many options.
Let's start with the most basic one:
Having in SQL the following snipplet:
CREATE TABLE my_table (
a varchar2(10 char),
b varchar2(10 char),
c number
);
SELECT a, b, mean(c)
FROM my_table
GROUP BY a, b
In R:
grouped_data <- aggregate(my_table, by=list(my_table$a, my_table$b, FUN=mean);
Alternatively:
> mydf
A B
1 1 2
2 1 3
3 2 3
4 3 5
5 3 6
> aggregate(B ~ A, mydf, sum)
A B
1 1 5
2 2 3
3 3 11
If your data are large, I would also recommend looking into the "data.table" package.
> library(data.table)
> DT <- data.table(mydf)
> DT[, sum(B), by = A]
A V1
1: 1 5
2: 2 3
3: 3 11
And finally the most recommended ddply function:
> DF <- data.frame(A = c("1", "1", "2", "3", "3"), B = c(2, 3, 3, 5, 6))
> library(plyr)
> DF.sum <- ddply(DF, c("A"), summarize, B = sum(B))
> DF.sum
A B
1 1 5
2 2 3
3 3 11
Source:
http://stackoverflow.com/questions/18799901/data-frame-group-by-column