Quick CSV/TAB file viewer.
My life is full of large csv and tab files. In a previous post I showed one tool I wrote to make my csv and tab file life easier, csvsql. That groovy script let you treat a csv or tab file as a database and do arbitrary queries on it. Today's post addresses another problem. I often get data files that I know nothing about from various sources and I am faced with figuring out what is in the file, or figuring out which of several files might contain information that I need. Basically, I need to look at the file as a first step to getting an idea whether it is useful to me and if so how I might go about processing it. One can "cat" the file, of course, and get some vague idea what is in the file, but if the file has many columns a simple cat can come out illegible. There are fancier things one can do with cat, such as:
cat test.csv | column -s, -t | less -#2 -N -S
This at least gives you formatted columns that you can scroll through, but it's clunky, the columns can't be easily sorted, and so on. Importing the file into a spreadsheet is another option. For some unfathomable reason, every spreadsheet out there takes an eon to start up. Some of them won't take csv or tab files from the command line either, so you have to launch the spreadsheet then navigate to the file in question which is actually one of the biggest drawbacks of spreadsheets for my use case. Even worse, most spreadsheets simply choke on what are for me modest sized data files (say, a few tens of thousands of rows with a dozen to few hundreds of columns). After facing this irritation dozens of times, I decided to see if I could hack together a csv/tab viewer that would offer the advantages of a spreadsheet with the speed and command-line convenience of cat. The resulting script is run like this:
viewtab test.csv
It takes about 7 seconds to load a 9803 x 296 csv file and display it. Compare that with Apple's Numbers which takes 3 seconds to launch, a couple of seconds for me to find the file, tries to load it for about 7 seconds and finally tells you the file is too big and simply gives up! The results of viewtab look like this:
Once it is displayed, you can sort by column simply by selecting that column and clicking again to toggle between ascending and descending sort. viewtab relies on a Table class I wrote in Java. The easiest way to get that is just to install durbinlib.jar. Assuming you have groovy already installed, you can download and install durbinlib.jar from github like this:
git clone git://github.com/jdurbin/durbinlib.git
cd durbinlib
ant install
This will compile and install durbinlib.jar and dependencies in ~/.groovy/lib/. The actual script is below. Just copy it to a file called viewtab, make the script executable, and then put it in your path somewhere. I highly recommend that you set an environment variable to give Groovy/Java the option to use a lot of RAM:
export JAVA_OPTS='-Xmx3000m'
I'll probably expand this script in a lot of ways, but even in it's current simple form it has really taken some of the dread out of exploring csv/tab files.
Update: viewtab is now bundled with durbinlib, so after the install of durbinlib simply add durbinlib/scripts to your path and you get viewtab, csvsql, and some other goodies also.
#!/usr/bin/env groovy
err = System.err
{
Table dt;
TableModel(Table table){dt = table;}
public String getColumnName(int col) {return(dt.colNames[col]);}
public String getRowName(int row){return(dt.rowNames[row]);}
public int getRowCount() { return(dt.rows())}
public int getColumnCount() { return(dt.cols()) }
public String getValueAt(int row, int col) {return(dt.get(row,col))}
public Class getColumnClass(int c) {return(dt.get(0,c).getClass())}
public boolean isCellEditable(int row, int col){return false;}
public void setValueAt(Object value, int row, int col) {}
}
fileName = args[0]
// Crudely determine if it's a tab or csv file
new File(fileName).withReader{r->
line = r.readLine()
if (line.contains(",")) sep = ","
else if (line.contains("\t")) sep = "\t"
else {err.println "File does not appear to be a csv or tab file.";System.exit(1)}
}
dt = new Table(fileName,sep,bFirstRowInTable=true)
err.print "Creating gui table..."
dtm = new TableModel(dt)
swing = new SwingBuilder()
frame = swing.frame(title:fileName,defaultCloseOperation:JFrame.EXIT_ON_CLOSE){
scrollpane = scrollPane {
thetab = table(autoResizeMode:JTable.AUTO_RESIZE_OFF, autoCreateRowSorter:true){
tableModel(dtm)
}
}
}
err.println "done."
frame.pack()
frame.show()
About the code: I had already written a Table class in Java to quickly read in and manipulate csv/tab files, so I just needed to make this the model for a JTable. I used Groovy's SwingBuilder to tie it all together. The code is short and simple and to look at it should have been a 30 minute job, but I think I spent several hours of trial and error trying to figure out exactly how to make SwingBuilder work with my custom tableModel. SwingBuilder is nice, but the documentation is totally inadequate. Hopefully someone looking for SwingBuilder documentation will stumble here and see the example that I couldn't find.
Labels: csv, durbinlib, groovy, JTable, spreadsheet, SwingBuilder