<body><script type="text/javascript"> function setAttributeOnload(object, attribute, val) { if(window.addEventListener) { window.addEventListener('load', function(){ object[attribute] = val; }, false); } else { window.attachEvent('onload', function(){ object[attribute] = val; }); } } </script> <div id="navbar-iframe-container"></div> <script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script> <script type="text/javascript"> gapi.load("gapi.iframes:gapi.iframes.style.bubble", function() { if (gapi.iframes && gapi.iframes.getContext) { gapi.iframes.getContext().openChild({ url: 'https://www.blogger.com/navbar.g?targetBlogID\0756813476980165976394\46blogName\75The+Bayesian+Conspiracy\46publishMode\75PUBLISH_MODE_BLOGSPOT\46navbarType\75BLUE\46layoutType\75CLASSIC\46searchRoot\75http://bayesianconspiracy.blogspot.com/search\46blogLocale\75en\46v\0752\46homepageUrl\75http://bayesianconspiracy.blogspot.com/\46vt\075642017532500428956', where: document.getElementById("navbar-iframe-container"), id: "navbar-iframe" }); } }); </script>

The Bayesian Conspiracy

The Bayesian Conspiracy is a multinational, interdisciplinary, and shadowy group of scientists. It is rumored that at the upper levels of the Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.

This blog is produced by James Durbin, a Bayesian Initiate.

Executing arbitrary SQL on CSV files.

My life is full of comma separated value files. A common occurrence is to be given a csv file of data and want to explore it a little bit to see what is there, or to extract several columns of data out of the csv file and create a new csv file, or grab only certain columns of data and certain rows of data and create a new csv file. Quite often I find that I even need to join one csv file with another csv file. Each time I'm faced with this task I find that what I want is to do execute sql statements on the csv file itself. After ages of writing one-off scripts to process files like these, I finally looked into actually treating csv files as databases. It turns out that the h2 database engine has good support for both in-memory databases and csv file importing. With Groovy, it's very nice. Simply download the h2 database jar and drop it in your ~/.groovy/lib/ directory. Then you can write code like this:

#!/usr/bin/env groovy

import groovy.sql.Sql
import org.h2.Driver

// Create an h2 jdbc in-memory database,
// calling it what you like, here db1
def db = Sql.newInstance("jdbc:h2:mem:db1","org.h2.Driver")

// Create a table from your csv file...
db.execute("create table people as select * from csvread('people.csv')")

// Execute a normal sql query on this table...
db.eachRow("select * from people where age < 40"){row->
println row
}

If you want to create a new csv file from your query, you will have to do a tiny bit more work. First you will need to extract the column headings of your query from the metadata, and then you will need to collect the rows into comma separated list. It's easy boiler-plate once you know how. Here is an example:

#!/usr/bin/env groovy 

import groovy.sql.Sql
import org.h2.Driver

// Create an h2 jdbc in-memory database,
// calling it what you like, here db1
def db = Sql.newInstance("jdbc:h2:mem:db1","org.h2.Driver")

// Create a table from your csv file...
db.execute("create table people as select * from csvread('people.csv')")

// Get the column headings...
db.eachRow("select * from people where age < 40 limit 1"){row->
meta = row.getMetaData()
numCols = meta.getColumnCount()
headings = (1..numCols).collect{meta.getColumnLabel(it)}
println headings.join(",")
}

// Execute a normal sql query on this table...
db.eachRow("select * from people where age < 40"){row->
meta = row.getMetaData()
numCols = meta.getColumnCount()
vals = (0..<numCols).collect{row[it]}
println vals.join(",")
}

The h2 database engine pretty quick too. I can query a 100k file (both read it in as a table and do select on it) in about 5 seconds using code like that above.

Finally, to do a join on two csv files, you only need to create the two files and execute the sql, like:

#!/usr/bin/env groovy 

import groovy.sql.Sql
import org.h2.Driver

// mem says in-memory db, and call it what you like, here db1
def sql = Sql.newInstance("jdbc:h2:mem:db1","org.h2.Driver")

// Create the first table...
sql.execute("create table people as select * from csvread('people.csv')")

// Create the second table...
sql.execute("create table children as select * from csvread('children.csv')")

sql.eachRow("""
select people.name,children.child,people.age
from people,children
where people.name=children.name and people.age > 40
"""
){result->

meta = result.getMetaData()
cols = meta.getColumnCount()
vals = (0..<cols).collect{result[it]}
println vals.join(",")
}

Labels: , , , , ,

You can leave your response or bookmark this post to del.icio.us by using the links below.

Post a Comment | Bookmark | Go to end |
  • Anonymous Anonymous:  

    Nice work! I used a variation on this to prepare ARFF files and it is really rather useful. (June 8, 2012 at 10:19 PM) top