Sunday, 27 April 2014

Basic SQL Operations in R



I want to have in R the equivalent of most of the basic operations normally performed in SQL.
For each topic, this post shows a snippet in SQL, immediately followed by its counterpart in R.

Topics Covered:
- Distinct
- Where
- Inner / outer joins
- Group by


Before starting with the pure R syntax, keep in mind that R has a very useful package called sqldf. With this package you can run plain SQL queries directly against tables / data frames.

 # installs everything you need to use sqldf with SQLite  
 # including SQLite itself  
 install.packages("sqldf")  
 # shows built in data frames  
 data()   
 # load sqldf into workspace  
 library(sqldf)  
 sqldf("select * from iris limit 5")  
 sqldf("select count(*) from iris")  
 sqldf("select Species, count(*) from iris group by Species")  
 # create a data frame  
 DF <- data.frame(a = 1:5, b = letters[1:5])  
 sqldf("select * from DF")  
 sqldf("select avg(a) mean, variance(a) var from DF") # see example 15  

Source: http://code.google.com/p/sqldf/



WHERE


 SELECT *   
 FROM df1   
 WHERE product = 'Toaster'  


In R:
 df1 <- data.frame(CustomerId = 1:6, Product = c(rep("Toaster", 3), rep("Radio", 3)))  
 df <- df1[df1$Product == "Toaster", ]  
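More complex WHERE clauses translate the same way. The following is a minimal sketch in base R, reusing the df1 data frame from above; the extra conditions (CustomerId > 1, the IN list) are only illustrative assumptions, not part of the original query.

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))

# WHERE Product = 'Toaster' AND CustomerId > 1
toasters <- df1[df1$Product == "Toaster" & df1$CustomerId > 1, ]

# The same filter written with subset(), which reads closer to SQL
toasters2 <- subset(df1, Product == "Toaster" & CustomerId > 1)

# WHERE Product IN ('Toaster', 'Radio')
both <- df1[df1$Product %in% c("Toaster", "Radio"), ]
```

Note the trailing comma inside the square brackets: the condition selects rows, and the empty second argument keeps all columns.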




DISTINCT

The SELECT DISTINCT in SQL:

 select distinct x  
 from my_table;  

The equivalent in R is:

 > x <- list(a=c(1,2,3), b = c(2,3,4), c=c(4,5,6))  
 > xx <- unlist(x)  
 > xx  
 a1 a2 a3 b1 b2 b3 c1 c2 c3   
  1 2 3 2 3 4 4 5 6   
 > unique(xx)  
 [1] 1 2 3 4 5 6  
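When the data live in a data frame rather than in a list, unique() works on a single column or on whole rows. A small sketch, assuming a hypothetical my_table data frame with columns x and y (the values are made up for illustration):

```r
# Hypothetical table with repeated values
my_table <- data.frame(x = c(1, 2, 2, 3, 3, 3),
                       y = c("a", "b", "b", "c", "c", "d"))

# select distinct x from my_table
distinct_x <- unique(my_table$x)

# select distinct x, y from my_table
distinct_rows <- unique(my_table[, c("x", "y")])
```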




INNER / OUTER JOINS

Given the following SQL query:

 select *   
 from product [left] [right] [outer] join countries  
     on (product.customer_id = countries.customer_id)  


In R:
 df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3)))  
 df2 = data.frame(CustomerId=c(2,4,6),State=c(rep("Alabama",2),rep("Ohio",1)))  
 > df1  
  CustomerId Product  
       1 Toaster  
       2 Toaster  
       3 Toaster  
       4  Radio  
       5  Radio  
       6  Radio  
 > df2  
  CustomerId  State  
       2 Alabama  
       4 Alabama  
       6  Ohio  
 #Outer join:   
 merge(x = df1, y = df2, by = "CustomerId", all = TRUE)  
 #Left outer:   
 merge(x = df1, y = df2, by = "CustomerId", all.x=TRUE)  
 #Right outer:   
 merge(x = df1, y = df2, by = "CustomerId", all.y=TRUE)  
 #Cross join:   
 merge(x = df1, y = df2, by = NULL)  
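The plain inner join is simply merge() without any of the all arguments. The sketch below rebuilds the same df1 and df2 and checks the number of rows each kind of join returns; the variable names inner, left and cross are only illustrative.

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"))

# Inner join: only the customers present in both data frames (2, 4, 6)
inner <- merge(x = df1, y = df2, by = "CustomerId")

# Left outer join: all six customers, State is NA where df2 has no match
left <- merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

# Cross join: every row of df1 paired with every row of df2
cross <- merge(x = df1, y = df2, by = NULL)

nrow(inner)  # 3
nrow(left)   # 6
nrow(cross)  # 18
```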

Source:
http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right


GROUP BY


For the Group By function there are many options.
Let's start with the most basic one:

Given the following SQL snippet:
 CREATE TABLE my_table (  
  a varchar2(10 char),   
  b varchar2(10 char),   
  c number  
 );  
 SELECT a, b, avg(c)  
 FROM my_table  
 GROUP BY a, b  


In R:
 grouped_data <- aggregate(my_table["c"], by = list(a = my_table$a, b = my_table$b), FUN = mean)  
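To try this out without a database, here is a small self-contained sketch. The contents of my_table are made up for illustration, and the formula interface of aggregate() is used as a shorter way to write the same grouping by two columns.

```r
# Hypothetical contents for my_table
my_table <- data.frame(a = c("x", "x", "x", "y", "y"),
                       b = c("p", "p", "q", "p", "p"),
                       c = c(1, 3, 5, 2, 4))

# SELECT a, b, avg(c) FROM my_table GROUP BY a, b
grouped <- aggregate(c ~ a + b, data = my_table, FUN = mean)

# Look up the average for one group, e.g. a = 'x' and b = 'p'
xp_mean <- grouped$c[grouped$a == "x" & grouped$b == "p"]
```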

Alternatively:
 > mydf  
  A B  
 1 1 2  
 2 1 3  
 3 2 3  
 4 3 5  
 5 3 6  
 > aggregate(B ~ A, mydf, sum)  
  A B  
 1 1 5  
 2 2 3  
 3 3 11  



If your data are large, I would also recommend looking into the "data.table" package.

  
 > library(data.table)  
 > DT <- data.table(mydf)  
 > DT[, sum(B), by = A]  
   A V1  
 1: 1 5  
 2: 2 3  
 3: 3 11  



And finally, the frequently recommended ddply function from the plyr package:
 > DF <- data.frame(A = c("1", "1", "2", "3", "3"), B = c(2, 3, 3, 5, 6))  
 > library(plyr)  
 > DF.sum <- ddply(DF, c("A"), summarize, B = sum(B))  
 > DF.sum  
  A B  
 1 1 5  
 2 2 3  
 3 3 11  

Source:
http://stackoverflow.com/questions/18799901/data-frame-group-by-column

Friday, 25 April 2014

Boss Vs. Leader

I think it is a bit old, but I would like to have it stamped on my blog...
I do not have much time these days :/ this is the most I can do...



Sunday, 13 April 2014

ORACLE: Analytical Functions


Analytic functions can greatly speed up both the development and the execution of your queries,
in particular because they are automatically optimized by Oracle itself.

Here they are, in a veeeeery small nutshell:


Count (number of elements in the same group)
SELECT empno, deptno,
COUNT(*) OVER (PARTITION BY deptno) DEPT_COUNT
FROM emp
WHERE deptno IN (20, 30);

     EMPNO     DEPTNO DEPT_COUNT
---------- ---------- ----------
      7369         20          5
      7566         20          5
      7788         20          5
      7902         20          5
      7876         20          5
      7499         30          6
      7900         30          6
      7844         30          6
      7698         30          6
      7654         30          6
      7521         30          6

11 rows selected.
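For comparison with the R post, a per-group count that keeps every row can be sketched in base R with ave(). This is only a rough equivalent under the assumption that the rows above have been loaded into a data frame called emp (a name chosen here for illustration):

```r
# The empno / deptno pairs from the output above
emp <- data.frame(
  empno  = c(7369, 7566, 7788, 7902, 7876, 7499, 7900, 7844, 7698, 7654, 7521),
  deptno = c(rep(20, 5), rep(30, 6))
)

# COUNT(*) OVER (PARTITION BY deptno): one value per row, not one per group
emp$dept_count <- ave(emp$empno, emp$deptno, FUN = length)
```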



Row Number (id of the entry within the group)
SELECT empno, deptno, hiredate,
ROW_NUMBER( ) OVER (PARTITION BY deptno ORDER BY hiredate NULLS LAST) SRLNO
FROM emp
WHERE deptno IN (10, 20)
ORDER BY deptno, SRLNO;

EMPNO  DEPTNO HIREDATE       SRLNO
------ ------- --------- ----------
  7782      10 09-JUN-81          1
  7839      10 17-NOV-81          2
  7934      10 23-JAN-82          3
  7369      20 17-DEC-80          1
  7566      20 02-APR-81          2
  7902      20 03-DEC-81          3
  7788      20 09-DEC-82          4
  7876      20 12-JAN-83          5

8 rows selected.
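A rough base R counterpart, assuming the rows above are held in a data frame named emp (an illustrative name): sort by department and hire date, then number the rows inside each department.

```r
# empno, deptno and hiredate taken from the output above
emp <- data.frame(
  empno    = c(7782, 7839, 7934, 7369, 7566, 7902, 7788, 7876),
  deptno   = c(10, 10, 10, 20, 20, 20, 20, 20),
  hiredate = as.Date(c("1981-06-09", "1981-11-17", "1982-01-23",
                       "1980-12-17", "1981-04-02", "1981-12-03",
                       "1982-12-09", "1983-01-12"))
)

# ORDER BY deptno, hiredate
emp <- emp[order(emp$deptno, emp$hiredate), ]

# ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY hiredate)
emp$srlno <- ave(seq_along(emp$empno), emp$deptno, FUN = seq_along)
```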


Rank & Dense Rank (position of the element within the group; RANK leaves a gap after ties, DENSE_RANK does not)
SELECT empno, deptno, sal,
RANK() OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) RANK,
DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) DENSE_RANK
FROM emp
WHERE deptno IN (10, 20)
ORDER BY 2, RANK;

EMPNO  DEPTNO   SAL  RANK DENSE_RANK
------ ------- ----- ----- ----------
  7839      10  5000     1          1
  7782      10  2450     2          2
  7934      10  1300     3          3
  7788      20  3000     1          1
  7902      20  3000     1          1
  7566      20  2975     3          2
  7876      20  1100     4          3
  7369      20   800     5          4

8 rows selected.
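The difference between the two rankings can also be reproduced in base R. This is a sketch under the assumption that the salaries above sit in a data frame called emp; the column names rnk and dense_rnk are illustrative.

```r
# empno, deptno and sal taken from the output above
emp <- data.frame(
  empno  = c(7839, 7782, 7934, 7788, 7902, 7566, 7876, 7369),
  deptno = c(10, 10, 10, 20, 20, 20, 20, 20),
  sal    = c(5000, 2450, 1300, 3000, 3000, 2975, 1100, 800)
)

# RANK(): ties share the lowest rank and the following rank is skipped.
# Ranking the negated salary gives descending order.
emp$rnk <- ave(-emp$sal, emp$deptno,
               FUN = function(s) rank(s, ties.method = "min"))

# DENSE_RANK(): ties share a rank and no rank is skipped.
emp$dense_rnk <- ave(-emp$sal, emp$deptno,
                     FUN = function(s) match(s, sort(unique(s))))
```

In department 20 the two employees earning 3000 both get rank 1; the next salary gets 3 with RANK and 2 with DENSE_RANK, as in the Oracle output.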


Lead & Lag (next / previous member of the group relative to the current element)
SELECT deptno, empno, sal,
LEAD(sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) NEXT_LOWER_SAL,
LAG(sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) PREV_HIGHER_SAL
FROM emp
WHERE deptno IN (10, 20)
ORDER BY deptno, sal DESC;

 DEPTNO  EMPNO   SAL NEXT_LOWER_SAL PREV_HIGHER_SAL
------- ------ ----- -------------- ---------------
     10   7839  5000           2450               0
     10   7782  2450           1300            5000
     10   7934  1300              0            2450
     20   7788  3000           3000               0
     20   7902  3000           2975            3000
     20   7566  2975           1100            3000
     20   7876  1100            800            2975
     20   7369   800              0            1100

8 rows selected.
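In base R, lead and lag amount to shifting a vector by one position inside each group. A minimal sketch, assuming the rows above are stored in a data frame named emp (illustrative name) and using 0 as the default value, as in the query:

```r
# deptno, empno and sal taken from the output above
emp <- data.frame(
  deptno = c(10, 10, 10, 20, 20, 20, 20, 20),
  empno  = c(7839, 7782, 7934, 7788, 7902, 7566, 7876, 7369),
  sal    = c(5000, 2450, 1300, 3000, 3000, 2975, 1100, 800)
)

# ORDER BY deptno, sal DESC
emp <- emp[order(emp$deptno, -emp$sal), ]

# LEAD(sal, 1, 0): drop the first value of the group, pad with 0 at the end
emp$next_lower_sal <- ave(emp$sal, emp$deptno,
                          FUN = function(s) c(s[-1], 0))

# LAG(sal, 1, 0): pad with 0 at the start, drop the last value of the group
emp$prev_higher_sal <- ave(emp$sal, emp$deptno,
                           FUN = function(s) c(0, s[-length(s)]))
```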


First Value & Last Value
-- How many days after the first hire of each department were the next
-- employees hired?

SELECT empno, deptno, hiredate - FIRST_VALUE(hiredate)
OVER (PARTITION BY deptno ORDER BY hiredate) DAY_GAP
FROM emp
WHERE deptno IN (20, 30)
ORDER BY deptno, DAY_GAP;

     EMPNO     DEPTNO    DAY_GAP
---------- ---------- ----------
      7369         20          0
      7566         20        106
      7902         20        351
      7788         20        722
      7876         20        756
      7499         30          0
      7521         30          2
      7698         30         70
      7844         30        200
      7654         30        220
      7900         30        286

11 rows selected.
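The same day gap can be sketched in base R. Since the window is ordered by hiredate, the first value of each partition is simply the earliest hire date of the department. The example assumes a data frame called emp (illustrative name) filled with the hire dates of departments 10 and 20 listed in the Row Number example, because the dates of department 30 are not shown in this post.

```r
# Hire dates of departments 10 and 20, from the Row Number output
emp <- data.frame(
  empno    = c(7782, 7839, 7934, 7369, 7566, 7902, 7788, 7876),
  deptno   = c(10, 10, 10, 20, 20, 20, 20, 20),
  hiredate = as.Date(c("1981-06-09", "1981-11-17", "1982-01-23",
                       "1980-12-17", "1981-04-02", "1981-12-03",
                       "1982-12-09", "1983-01-12"))
)

# Dates as day counts, so that the subtraction yields plain numbers
days <- as.numeric(emp$hiredate)

# hiredate - FIRST_VALUE(hiredate) OVER (PARTITION BY deptno ORDER BY hiredate)
emp$day_gap <- days - ave(days, emp$deptno, FUN = min)
```

For department 20 this gives 0, 106, 351, 722 and 756 days, matching the Oracle output.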



Source:
http://www.orafaq.com/node/55



Wednesday, 9 April 2014

File system access on Oracle




It may sound easy, but accessing the file system from Oracle can be painful.
I am not talking about reading / writing a file. I am talking about running an ls or dir command, creating folders, moving files, etc.
In this post I would like to recall an easy way of running ls.

Actually the solution is already very well explained in this web page:
http://plsqlexecoscomm.sourceforge.net/


The solution is mainly based on a Java package installed in the Oracle DB, which accesses the file system and arranges the data in a suitable way.

First, install the package (available at the link above), then run a simple query like the one below:

select * 
from table(
    file_pkg.get_file_list(file_pkg.get_file('/'))
)

And there you are: the result of an ls command executed on the root directory, accessible through a simple select.