
### The Bayesian Conspiracy

The Bayesian Conspiracy is a multinational, interdisciplinary, and shadowy group of scientists. It is rumored that at the upper levels of the Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.

This blog is produced by James Durbin, a Bayesian Initiate.

## Java Statistical Libraries

This is a simple post listing a few of the Java statistical libraries I have used at one point or another. Google often seems to fail me when searching for libraries like these so I am hoping that this will help a few people to connect up with these useful libraries. This is just a list of the libraries I have used. I am sure there are others and welcome suggestions in the comments.

• jMEF: Library for working with mixtures of exponential families, including estimating parameters from a sample.

• JSC: Java Statistical Classes. A variety of statistical functions covering combinatorics, correlation, curve fitting, descriptive statistics, distributions, regression, etc. Solid collection of basic statistical functions.

• Stochastic Simulation in Java: Includes support for generating variates from various distributions, computing various statistical measures and tests, and support for Monte Carlo methods.

• Apache Commons Math: A wide range of math and some statistical functionality.

• Cern's Colt library for high performance scientific computing includes some statistical functions.

Here is another list of mathematical and some statistical libraries in Java: NIST Java Math List

## Execute (real) shell commands from Groovy.

This post is about running shell commands from within Groovy, specifically bash but it is easy to adapt to other shells. You can already run commands with syntax like:


"ls -l".execute()

That is about as simple as it gets and works great for many situations. However, execute() runs the given command directly, passing it the list of options; the options are NOT passed through a shell (e.g. bash) for expansion and so on. As a result, you can NOT do something like:

"ls *.groovy".execute()

In this case, no shell sees the * to expand it, and so it just gets passed to ls exactly as it is. To address this, we can create a shell process with ProcessBuilder and pass the command to the shell for execution. A common use case for me is to want to just pipe the shell command's output to stdout. With some Groovy meta-object programming we can make this a method of GString and String so that you can execute any kind of string simply by calling, for example, a .bash() method on the string. Below is a class that does that. This class (including improvements) is included in durbinlib.jar. With this class, one can not only properly execute the ls *.groovy example above, but can even execute shell scripts like:

"""
for file in \$(ls); do echo \$file
done
""".bash()
To turn on this functionality it is necessary to call RunBash.enable() first. So a full example using the durbinlib implementation is:
#!/usr/bin/env groovy

import durbin.util.*

RunBash.enable()

"""
for file in \$(ls); do echo \$file
done
""".bash()

A skeleton of the class itself follows:

    import java.io.InputStream;

    class RunBash{

      // Flag for echoing commands before running them (unused in this skeleton)
      static boolean bEchoCommand = false;

      // Add a bash() method to GString and String
      static def enable(){
        GString.metaClass.bash = {->
          RunBash.bash(delegate)
        }
        String.metaClass.bash = {->
          RunBash.bash(delegate)
        }
      }

      static def bash(cmd){

        cmd = cmd as String

        // create a process for the shell
        ProcessBuilder pb = new ProcessBuilder("bash", "-c", cmd);
        pb.redirectErrorStream(true); // merge stderr into the output we read below
        Process shell = pb.start();
        shell.getOutputStream().close(); // the command gets no input on stdin
        InputStream shellIn = shell.getInputStream(); // this captures the output from the command

        // at this point you can process the output issued by the command
        // for instance, this reads the output and writes it to System.out:
        try {
          int c;
          while ((c = shellIn.read()) != -1){
            System.out.write(c);
          }
          System.out.flush();
        } finally {
          // close the stream
          try { shellIn.close(); } catch (IOException ignoreMe) {}
        }

        // wait for the shell to finish and return its exit code
        return shell.waitFor();
      }
    }
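If you would rather get the command's output back as a string than have it piped to stdout, the same ProcessBuilder approach works from plain Java too. What follows is only a rough sketch of that idea, not the durbinlib code: the class name ShellCapture, the Result holder, and the run() helper are all made up for illustration, and it assumes the shell you name (bash, sh, ...) is on your PATH.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class ShellCapture {

    // Hypothetical holder for what a command produced; not part of durbinlib.
    static final class Result {
        final int exitStatus;
        final String output;

        Result(int exitStatus, String output) {
            this.exitStatus = exitStatus;
            this.output = output;
        }
    }

    // Hands the whole command string to the shell, so the shell does the
    // globbing, piping, and variable expansion.
    static Result run(String shell, String cmd) {
        try {
            ProcessBuilder pb = new ProcessBuilder(List.of(shell, "-c", cmd));
            pb.redirectErrorStream(true);       // merge stderr into stdout
            Process p = pb.start();
            p.getOutputStream().close();        // the command gets no stdin
            // Read all output before waiting, so a chatty command
            // cannot block on a full pipe.
            String out = new String(p.getInputStream().readAllBytes(),
                                    StandardCharsets.UTF_8);
            return new Result(p.waitFor(), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + shell, e);
        }
    }

    public static void main(String[] args) {
        Result r = run("bash", "ls *.groovy");
        System.out.print(r.output);
        System.out.println("exit status: " + r.exitStatus);
    }
}
```

Because stderr is merged into stdout here, error messages from the command end up in the same string as its regular output.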


## OnlineTable: Accessing csv files row at a time by column name.

Here is a handy class I use to simplify accessing a CSV or TAB file one line at a time. I feel like I've seen this somewhere else also. I hope it's not just in GINA! Anyway, sometimes the simplest things are the most handy so it bears repeating even if it is. Suppose you have a file, beer.csv, that has a header describing the columns followed by rows of data like:

    brand,price,calories,alcohol,type,domestic
    LeinenkugelsRed,4.79,160,5.0,1,1
    GeorgeKilliansIrishRed,4.70,162,4.9,1,1
    RedWolf,4.11,157,5.5,1,1
    Becks,5.83,148,4.3,3,0
    PilsnerUrquell,7.80,160,4.1,3,0


Then you can use OnlineTable to parse it like:

new OnlineTable("beer.csv").eachRow{row->
println "brand: "+row.brand
println "calories:"+row.calories
}

Or, if you prefer the map notation, you can use it like:

new OnlineTable("beer.csv").eachRow{row->
println "brand: "+row['brand']
}

This version automatically inspects the header to see whether it's a CSV or TAB delimited file. The class is shown below but is also included as part of durbinlib.jar. The latter will be updated with additional features.

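    /***********************************
    * Support for accessing a table one row at a time.
    *
    */
    class OnlineTable{

      String fileName

      OnlineTable(String f){
        fileName = f
      }

      // Calls the closure once per data row, passing a map of column name -> value.
      def eachRow(Closure c){
        new File(fileName).withReader{r->
          // Assumed here: the first line is the header, which names the
          // columns and tells us which separator the file uses.
          def headerLine = r.readLine()
          if (headerLine == null) return
          def sep = determineSeparator(headerLine)
          def headings = headerLine.split(sep).collect{it.trim()}

          r.eachLine{rowStr->
            if (!rowStr.trim()) return   // skip blank lines
            def rfields = rowStr.split(sep, -1)
            def row = [:]
            headings.eachWithIndex{h,i->
              row[h] = i < rfields.length ? rfields[i] : null
            }
            c(row)
          }
        }
      }

      def determineSeparator(line){
        def sep;
        if (line.contains(",")) sep = ","
        else if (line.contains("\t")) sep = "\t"
        else {
          System.err.println "File does not appear to be a csv or tab file.";
          throw new RuntimeException();
        }
        return(sep);
      }
    }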
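For anyone who wants the same row-by-column-name idea from plain Java, here is a minimal sketch. It is not part of durbinlib: the class name RowsByName and both helper methods are made up for illustration, it works on lines already read into memory rather than on a file, and it assumes the first line is the header and that fields contain no quoted separators.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowsByName {

    // Same rule as determineSeparator(): a comma wins, then a tab.
    static String separatorOf(String headerLine) {
        if (headerLine.contains(",")) return ",";
        if (headerLine.contains("\t")) return "\t";
        throw new IllegalArgumentException("File does not appear to be a csv or tab file.");
    }

    // Turns header + data lines into one map per data row, keyed by column name.
    static List<Map<String, String>> rows(List<String> lines) {
        List<Map<String, String>> result = new ArrayList<>();
        if (lines.isEmpty()) return result;

        String sep = separatorOf(lines.get(0));
        String[] names = lines.get(0).split(sep);

        for (String line : lines.subList(1, lines.size())) {
            if (line.isBlank()) continue;               // skip blank lines
            String[] fields = line.split(sep, -1);      // -1 keeps trailing empty fields
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < names.length; i++) {
                // Short rows get null for the columns they are missing.
                row.put(names[i].trim(), i < fields.length ? fields[i] : null);
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> beer = List.of(
            "brand,price,calories,alcohol,type,domestic",
            "LeinenkugelsRed,4.79,160,5.0,1,1",
            "Becks,5.83,148,4.3,3,0");
        for (Map<String, String> row : rows(beer)) {
            System.out.println("brand: " + row.get("brand")
                + " calories: " + row.get("calories"));
        }
    }
}
```

The Groovy version reads more nicely at the call site (row.brand instead of row.get("brand")), which is a good part of why I use it.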


## A simple Cytoscape Groovy example

Cytoscape is a network analysis package used quite often in systems biology. Cytoscape has scripting support for Python, Ruby, and Groovy. There wasn't a meaningful Groovy example, so I adapted the Ruby example with a couple of minor changes and added logging for debugging:

import cytoscape.Cytoscape;
import cytoscape.layout.CyLayouts;
import cytoscape.layout.LayoutAlgorithm;
import cytoscape.CytoscapeInit

errFile = new File("test.out").withWriter{w->
w.println "CYTOSCAPE SCRIPT LOG"

try{
props = CytoscapeInit.getProperties()
props.setProperty("layout.default", "force-directed")

new File("sampleData").eachFileMatch(~/.*\.sif/){f->

## Weka: getting predictions from cross validation

A common question in Weka forums is how to keep track of instances by name. Weka does not have a name field for instances, so to keep track of them you have to create a string ID attribute that holds the name of each instance. The catch, though, is that most classifiers don't work with string attributes, and you wouldn't want to classify on the ID anyway. The official solution then is to delete the ID attribute before calling the classifier. Of course, if you delete the ID, you lose the names for your instances! Oof! The solution is to use the meta.FilteredClassifier classifier with RemoveType as the filter. When you hand a FilteredClassifier off to Evaluation, it will apply the filter before sending the data to the classifier, but will keep track of the relationship between the source Instances (with the ID) and the filtered set sent to the classifier.

Great. Now what if you want to know how instances were classified during your cross-validation? The API for extracting those classifications is not obvious, but it's easy enough once you know where to look. In Evaluation.crossValidateModel() you pass in a StringBuffer to hold the predictions. This can then be parsed to obtain the predictions and the instance names they go with. Source code to do this is below:

    #!/usr/bin/env groovy

    import weka.core.*
    import weka.core.converters.ConverterUtils.DataSource
    import weka.filters.unsupervised.attribute.RemoveType
    import weka.classifiers.*
    import weka.classifiers.meta.FilteredClassifier
    import weka.classifiers.evaluation.*;
    import java.util.Random

    arffName = args[0]

    // Read arff file...
    data = DataSource.read(arffName)

    // Pick out the class attribute..
    data.setClassIndex(data.numAttributes() -1)

    // Create a classifier from the name...
    // By using filtered classifier to remove ID, the cross-validation
    // wrapper will keep the original dataset and keep track of the mapping
    // between the original and the folds (minus ID).
    classifier = """weka.classifiers.meta.FilteredClassifier
           -F weka.filters.unsupervised.attribute.RemoveType
           -W weka.classifiers.misc.HyperPipes"""
    options = Utils.splitOptions(classifier)
    classname = options[0]
    options[0] = ""
    classifier = Classifier.forName(classname,options)

    // Perform cross-validation of the model..
    eval = new Evaluation(data)
    predictions = new StringBuffer()
    eval.crossValidateModel(classifier,data,cvFolds = 5,
      new Random(1),predictions,
      new Range("first,last"),false)

    lines = predictions.toString().split("\n")

    // Output of predictions looks like:
    // inst#     actual  predicted error prediction (ID)
    //     1      1:low      1:low       1 (P1)
    //     2     2:high      1:low   +   0.5 (P6)
    //     3     2:high     2:high       1 (P0)
    lines[1..-1].each{line->
      // Parse out fields we're interested in..
      m = line =~ /\d:(\w+).*\d:(\w+).*\((\w+)\)/
      actual = m[0][1]
      predicted = m[0][2]
      sample = m[0][3]
      println actual+"\t"+predicted+"\t"+sample+"\t"+!line.contains("+")
    }
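The regular expression is the fiddly part of that script, so it can be handy to try it in isolation, without Weka on the classpath. Here is a small Java sketch that applies the pattern (two "index:label" fields, then the ID matched inside literal parentheses) to a single prediction line in the format shown in the comments. The class name PredictionLine and the parse() helper are made up for illustration and are not part of Weka, and the output format itself may differ between Weka versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredictionLine {

    // group 1 = actual label, group 2 = predicted label, group 3 = instance ID
    static final Pattern FIELDS =
        Pattern.compile("\\d:(\\w+).*\\d:(\\w+).*\\((\\w+)\\)");

    // Returns {actual, predicted, id, correct}, or null if the line
    // does not look like a prediction line (e.g. the header line).
    static String[] parse(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.find()) return null;
        // The error column holds a "+" when the prediction was wrong.
        boolean correct = !line.contains("+");
        return new String[] {
            m.group(1), m.group(2), m.group(3), String.valueOf(correct)
        };
    }

    public static void main(String[] args) {
        String[] f = parse("    2     2:high      1:low   +   0.5 (P6)");
        System.out.println(String.join("\t", f));
    }
}
```

The header line has no "index:label" fields, so parse() returns null for it; the Groovy script sidesteps that by skipping the first line with lines[1..-1].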


## Invalid duplicate class definition

"Invalid duplicate class definition...One of the classes is a explicit generated class using the class statement, the other is a class generated from the script body based on the file name. Solutions are to change the file name or to change the class name. "

When I was first learning Groovy I would get this error from time to time. It was puzzling to me because sometimes I'd get this error, and sometimes it would seem like the same situation and I wouldn't. It seemed quite random to me when the error would crop up. When it did, I'd usually just rename the class and go on, resolving to figure it out later. To save you the trouble, here's what is happening.

Groovy has two ways to treat a .groovy file: either as a script, or as a class definition file. If it is a script you cannot have a class by the same name as the file. If it is a class definition file you can. It is very easy to tell whether a .groovy file is going to be treated as a script or as a class definition file. If there is any code outside a class statement in the file (other than imports), it is a script.

What is happening is that if there is any code to be executed in the file then Groovy needs a containing class for that code. Groovy will implicitly create a containing class with the name of the file. So if you have a file called Grapher.groovy that has some code in it that isn't inside a class definition, Groovy will create an implicit containing class called Grapher. This means that the script file Grapher.groovy cannot itself contain a class called Grapher because that would be a duplicate class definition, thus the error. If, on the other hand, all you do in the file Grapher.groovy is define the class Grapher (and any number of other classes), then Groovy will treat that file as simply a collection of class definitions, there will be no implicit containing class, and there is no problem having a class called Grapher inside the class definition file Grapher.groovy.

It's worth mentioning that the script version of Grapher.groovy will be compiled into a class called Grapher that extends groovy.lang.Script. In the other case, when Grapher.groovy merely defines classes, one of which is Grapher, that Grapher class will be compiled into a class that implements groovy.lang.GroovyObject.

I'm sure this is all explained somewhere in the Groovy documentation, but it didn't soak in to me until I read this Nabble post from which I extracted this explanation.

UPDATE: The text of this error message has changed (at least in some cases) to be a bit more informative. Now it reads:

One of the classes is an explicit generated class using the class statement, the other is a class generated from the script body based on the file name. Solutions are to change the file name or to change the class name.
The underlying mechanics behind this error are the same.


## Groovy compared to Perl

If you are a bioinformatics person considering taking up Groovy as an alternative to Perl, you will naturally wonder how the two compare on a range of simple tasks. Luckily, there is a great set of examples over at PLEAC (Programming Language Examples Alike Cookbook). Most bioinformatics Perl programmers have probably seen the Perl Cookbook by Tom Christiansen & Nathan Torkington. The aim of PLEAC is to implement all of the solutions found in the Perl Cookbook in other languages. The Groovy examples are 100% complete, so you can see how every Perl cookbook solution could be performed in Groovy. Go have a look!


## Groovy for bioinformatics.

I'll have a lot to say about Groovy for bioinformatics in this blog. Until I have time to write some posts on the topic, here is Mark Fortner's brief comparison of how Groovy stacks up to Perl for bioinformatics. Mark has also set up a BioGroovy project wiki.


## Groovy is groovy. Groovy is Java.

Groovy is groovy. At least for me. Groovy is also Java, like quick and dirty Java all dripping with syntactic sugar. If you know Java you essentially know 80% of what there is to know about Groovy because underneath it's all Java objects and the syntax is roughly a superset of Java. Most Java programs will run, unaltered, as Groovy programs. Although Groovy's dynamic features, powerful language constructs, and other goodness will grow on you, a Java programmer can approach Groovy initially as if it were simply a kind of Java that can be executed without compiling and with the option to soft focus some of the details.

For example, here is a dumb little Groovy script that I wrote to convert a comma separated file into a tab separated file.

    #!/usr/bin/env groovy
    for(line in System.in.readLines()){
      bits = line.split(",");
      for(bit in bits){
        print bit+"\t";
      }
      println "";
    }

This is something I wrote on the first day I was learning Groovy so it is not idiomatic Groovy. Groovy can be much more succinct than this. A Perl person might prefer a command-line one-liner like this one, which goes the other way and turns tabs into commas:

cat test.tab | groovy -pe '(line =~/\t/).replaceAll(",")'

Nonetheless, this little script shows how one can come to Groovy gently from Java. First, that's a complete Groovy program. If it were Java, there'd have to be more, the declaration of a containing class, for example. Groovy provides an implicit containing class with the name of the file. You'd also have to be more fastidious about declaring the types of variables. Groovy works out types during runtime. The types Groovy eventually assigns to variables will be Java types. line and bits, for example, will resolve to java.lang.String, and all the functionality of java.lang.String is there for you just like you remember it. Also, unlike a Java program, Groovy scripts don't need compiling (if you need to, though, you can compile Groovy to a .class file to use in a Java program). You can just run the file like a Perl script, for example:

cat data.txt | comma2tab  > data.tab
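For comparison, here is roughly what the same comma-to-tab filter looks like as plain Java. This is just a sketch to make the "there'd have to be more" point concrete; the class name Comma2Tab and the toTabs() helper are my own choices for illustration. Note the explicit containing class, the main method, the declared types, and the reader plumbing that the Groovy script gets for free.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

class Comma2Tab {

    // Mirrors the Groovy loop: every field is followed by a tab,
    // including the last one.
    static String toTabs(String line) {
        StringBuilder sb = new StringBuilder();
        for (String bit : line.split(",")) {
            sb.append(bit).append('\t');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(toTabs(line));
        }
    }
}
```

With a recent JDK you can run a single source file directly (java Comma2Tab.java), but you still cannot drop the class and main declarations the way the Groovy script does.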

So you can see that you get that sort of Perl-like hack-together-something-quick quality: no compiling, direct execution of the source file (via shebang or the command line interface), rapid edit/test cycle, and stripped down verbosity. Suppose I also wanted to add a date tag to each line. When I was doing my scripting with Perl I'd have to grind to a halt while I looked up various Perl date APIs. Obviously, if 99% of my code were scripting Perl I'd probably already know that, but my code is more like 10% C++, 65% Java, and 25% scripting. As a result, I'm much more likely to already know the Java API for something than the Perl API. With Groovy I can add date functionality without having to learn any new APIs; I can just use the Java one I already know:

    #!/usr/bin/env groovy
    date = new Date();
    for(line in System.in.readLines()){
      print date.toString()+"\t";
      bits = line.split(",");
      for(bit in bits){
        print bit+"\t";
      }
      println "";
    }

That println is, by the way, System.out.println; Groovy just lets you use the shorthand if you like. You don't have to though. With a simple import, you can use any of your own Java classes just as easily. Drop any external library into your CLASSPATH, or a jar into your ~/.groovy/lib directory, and you have ready access to that functionality too. You can even hack together simple GUIs in just a few lines of code.

Given that I have worked on several pure-Java projects in the past few years, Groovy really fills a niche for me. Working in bioinformatics I have a lot of need for scripting, to create quick and dirty one-off programs to crunch some data or do some bulk operation, something someone wants today and will never use again. I also have a fair bit of need for production quality webpages, for computation intensive algorithms, and for GUI visualization tools. The combination of Groovy+Java nicely fills all of these needs for me. I can write my high quality webpages in Java (or Grails), my computation intensive algorithms in Java (which has become performance competitive in the past few years, within 2x of C [3]), my GUI visualization tools in Groovy/Java/Swing, and my one-off scripts and pipeline glue in Groovy. One set of libraries and APIs to master to do all my work. The ability to share code across all the kinds of work I do. All my code is cross platform. It's like having my cake and eating it too!

If you know Java, you owe it to yourself to try Groovy. If you find yourself writing one kind of software in one language, another in a different language, you owe it to yourself to try Groovy. The rest of you, you should probably try it too.

[1] Six Things Groovy Can Do For You
[2] IBM Groovy articles
[3] 1.8x median, behind only C++, C, and ATS


### The Bayesian Conspiracy

I'm a Ph.D. student in bioinformatics, studying genomics, machine learning, and statistics. I spent the previous ten years developing genomics analysis tools for a major sequencing center.

You might see some bits here on programming, machine learning, bioinformatics,
computational biology, and maybe some other stuff too.