The Bayesian Conspiracy: OnlineTable: Accessing CSV files a row at a time by column name.


Stamped: 2:52 PM

Here is a handy class I use to simplify reading a CSV or TAB file one line at a time. I feel like I've seen this somewhere else, too. I hope it's not just in GINA! Anyway, sometimes the simplest things are the handiest, so it bears repeating even if it is. Suppose you have a file, beer.csv, with a header line naming the columns, followed by rows of data:

    brand,price,calories,alcohol,type,domestic
    LeinenkugelsRed,4.79,160,5.0,1,1
    SamuelAdamsBoston,5.96,160,4.9,1,1
    GeorgeKilliansIrishRed,4.70,162,4.9,1,1
    RedWolf,4.11,157,5.5,1,1
    Becks,5.83,148,4.3,3,0
    PilsnerUrquell,7.80,160,4.1,3,0

Then you can use OnlineTable to read it like this:

    new OnlineTable("beer.csv").eachRow{row->
      println "brand: "+row.brand
      println "calories: "+row.calories
    }

Or, if you prefer the map notation, you can use it like this:

    new OnlineTable("beer.csv").eachRow{row->
      println "brand: "+row['brand']
    }
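One thing worth keeping in mind: OnlineTable hands every field to the closure as a String, so numeric columns need an explicit conversion before any arithmetic. The sketch below is only an illustration. It uses a hand-written list of maps (a few columns from the beer.csv sample) to stand in for the rows that eachRow would deliver, and the variable names are made up for the example.

```groovy
// Stand-in for the rows eachRow would pass to the closure: every value is a String.
def rows = [
    [brand: 'LeinenkugelsRed', price: '4.79', calories: '160', domestic: '1'],
    [brand: 'Becks',           price: '5.83', calories: '148', domestic: '0'],
    [brand: 'PilsnerUrquell',  price: '7.80', calories: '160', domestic: '0']
]

// Convert before adding; on raw Strings, + would concatenate instead.
def totalCalories = rows.sum { it.calories.toInteger() }

// Flags are Strings too, so compare against '0' rather than 0.
def importedPrices = rows.findAll { it.domestic == '0' }
                         .collect { it.price.toBigDecimal() }
def avgImportedPrice = importedPrices.sum() / importedPrices.size()

println "total calories: ${totalCalories}"
println "average imported price: ${avgImportedPrice}"
```

The same conversions work unchanged inside an eachRow closure, for example by adding `row.calories.toInteger()` to a running total.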

This version automatically inspects the header line to determine whether the file is comma- or tab-delimited. The class is shown below and is also included in durbinlib.jar; the jar version will be updated with additional features.


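    /***********************************
    * Support for accessing a table one row at a time.
    *
    * The first line of the file is read as the header. Each following
    * line is passed to the closure as a map from column name to field
    * value. All values are Strings; quoted fields are not handled.
    */
    class OnlineTable {

        String fileName

        OnlineTable(String f) {
            fileName = f
        }

        /** Calls c once per data row with a map of heading -> field. */
        def eachRow(Closure c) {
            new File(fileName).withReader { r ->
                def headingStr = r.readLine()
                def sep = determineSeparator(headingStr)
                def headings = headingStr.split(sep)
                // The reader is already past the header, so eachLine sees only data rows.
                r.eachLine { rowStr ->
                    def rfields = rowStr.split(sep)
                    def row = [:]
                    rfields.eachWithIndex { f, i -> row[headings[i]] = f }
                    c(row)
                }
            }
        }

        /** Returns "," if the line contains a comma, else "\t" if it contains a tab. */
        def determineSeparator(line) {
            if (line.contains(",")) return ","
            if (line.contains("\t")) return "\t"
            throw new RuntimeException("File does not appear to be a csv or tab file: " + fileName)
        }
    }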
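The row-building step above assumes every data line has the same number of fields as the header. Two cases fall outside that assumption: a plain split(sep) drops trailing empty fields, and a line with more fields than the header indexes past the end of the headings array. The following is a rough sketch of a more tolerant version of that one step. The helper name rowToMap is hypothetical and is not part of OnlineTable or durbinlib.jar.

```groovy
// Hypothetical helper (not part of OnlineTable): build a row map from one line.
Map rowToMap(List<String> headings, String line, String sep) {
    // A limit of -1 keeps trailing empty fields, which split(sep) alone drops.
    def fields = line.split(sep, -1) as List
    def row = [:]
    headings.eachWithIndex { h, i ->
        // Short rows get null for missing columns; extra fields are ignored.
        row[h] = i < fields.size() ? fields[i] : null
    }
    row
}

def headings = ['brand', 'price', 'calories']
def trailing = rowToMap(headings, 'Becks,5.83,', ',')          // calories is ''
def shortRow = rowToMap(headings, 'Becks', ',')                // price and calories are null
def longRow  = rowToMap(headings, 'Becks,5.83,148,extra', ',') // 'extra' is dropped
```

Iterating over the headings rather than the fields is the design choice here: the header decides which keys exist, so every row map has the same set of columns.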

Labels: csv, groovy, table


Saager Mhatre:

I always thought OpenCSV filled this use case nicely. And at just 28k it's light, too! (December 10, 2012 at 11:23 PM)

James Durbin:

OpenCSV strikes me as too complicated and not helpful enough at the same time. For example, I don't see any way that you can access fields by column name in OpenCSV. For that matter, it looks like OpenCSV has no knowledge of a heading for a csv file at all. Maybe it's helpful in Java, but all of the OpenCSV functionality seems easily replaced with built-in Groovy functionality. Consider this example:

    CSVReader reader = new CSVReader(new FileReader(ADDRESS_FILE));
    String[] nextLine;
    while ((nextLine = reader.readNext()) != null) {
        System.out.println("Name: [" + nextLine[0] +
            "]\nAddress: [" + nextLine[1] + "]\nEmail: [" + nextLine[2] + "]");
    }

This can be done in Groovy like this:

    new File(ADDRESS_FILE).splitEachLine(","){fields->
      println "Name: ${fields[0]} Address: ${fields[1]} Email: ${fields[2]}"
    }

What OnlineTable adds to this Groovy code is mapping the heading line to fields so you don't have to. The OnlineTable equivalent of this example is thus (assuming the csv file now has a heading):

    new OnlineTable(ADDRESS_FILE).eachRow{row->
      println "Name: $row.name Address: $row.address Email: $row.email"
    }

That is a big difference in simplicity and utility from the OpenCSV example, I think. (December 11, 2012 at 12:26 AM)

Saager Mhatre:

My bad, I somehow incorrectly 'remembered' using OpenCSV with header rows. Now that I look at it again, it is missing quite a few pieces. There's even a feature request for it.

Nevertheless, one of the things that attracted me to OpenCSV (even though I haven't used it a whole lot) is its support for quoting and escaping. IMHO, that's one use case that can become a pain in fairly large datasets.

But, I'll concede that your solution is much groovier in that it espouses internal iterators with block support.

Good stuff and apologies again for the confusion. (December 11, 2012 at 10:02 AM)
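To make the quoting concern concrete: a plain split on "," breaks any field that itself contains a comma, such as "Smith, John". Below is a minimal sketch of a quote-aware line splitter. It assumes the common convention that fields may be wrapped in double quotes and that a doubled quote inside a quoted field stands for one literal quote. The name splitCsvLine is hypothetical; it is not part of OnlineTable or OpenCSV, and it does not handle quoted fields that span several lines.

```groovy
// Hypothetical helper: split one CSV line, honoring double-quoted fields.
List<String> splitCsvLine(String line, String sep = ',') {
    def fields = []
    def current = new StringBuilder()
    boolean inQuotes = false
    int i = 0
    while (i < line.length()) {
        String ch = line[i]
        if (inQuotes) {
            if (ch != '"') {
                current.append(ch)        // separators are literal inside quotes
            } else if (i + 1 < line.length() && line[i + 1] == '"') {
                current.append('"')       // doubled quote -> one literal quote
                i++
            } else {
                inQuotes = false          // closing quote
            }
        } else if (ch == '"') {
            inQuotes = true               // opening quote
        } else if (ch == sep) {
            fields << current.toString()  // field boundary
            current.setLength(0)
        } else {
            current.append(ch)
        }
        i++
    }
    fields << current.toString()          // last field has no trailing separator
    fields
}

def quoted = splitCsvLine('"Smith, John",42,"say ""hi"""')
def plain  = splitCsvLine('a,b,,')
```

Here `quoted` holds three fields (`Smith, John`, `42`, and `say "hi"`), and `plain` keeps its two trailing empty fields. For anything beyond simple files, a dedicated CSV library is probably the safer choice.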
