September 1, 2008

Bioinformatics – BioJava and C#

Filed under: Bioinformatics | Doug @ 10:36 pm

I stumbled across the Bioinformatics Group at the UofM, and realized that I met the president at a birthday party for a mutual friend a few months ago.  I may have the opportunity to contribute to a project or two in the coming semester(s), so I started reading a bit about bioinformatics (again).

I went looking for some code, and found a framework called BioPerl, which seems fairly popular.  My Perl skills have atrophied over the years, so I was a bit more excited when I found BioJava.  It provides a number of useful functions, and the project seems fairly active.  There is also a related database project, BioSQL, for which both BioPerl and BioJava (along with BioRuby and BioPython) provide language bindings.  BioJava even uses Hibernate as its O/R mapping layer.

Since I like to work in C#, I started playing around with porting BioJava to C#.  It’s a huge project, but it’s also a great way to see how BioJava is put together.  I’ve managed to get far enough that I can transcribe DNA to RNA using the following code:

        private static void TranscribeDNAtoRNA()
        {
            try
            {
                //make a DNA SymbolList
                ISymbolList symL = DNATools.CreateDNA("atgccgaatcgtaa");

                Console.WriteLine("DNA: " + symL.SeqString);

                symL = DNATools.ToRNA(symL);

                // just to prove it worked
                Console.WriteLine("RNA: " + symL.SeqString);
            }
            catch (IllegalSymbolException ex)
            {
                // this will happen if you try and make the DNA seq using non IUB symbols
            }
            catch (IllegalAlphabetException ex)
            {
                // this will happen if you try and transcribe a non DNA SymbolList
            }
        }

When run, the output is:

DNA: atgccgaatcgtaa
RNA: augccgaaucguaa

Yup.  A few dozen classes and a few hundred lines of code, and I can replace t’s with u’s.  Pretty exciting, eh?
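To be fair to the joke, here is roughly what the t-to-u swap looks like with no framework at all. This is only a minimal sketch on raw strings, not how BioJava or my port does it (they work with symbol lists and alphabets). The class and method names are made up for illustration, and the validation is much cruder than the framework's.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of BioJava or the C# port.
public static class NaiveTranscriber
{
    // Transcribes a DNA string to RNA by replacing t with u (case preserved).
    // Throws ArgumentException on any character other than a, c, g, or t.
    public static string ToRna(string dna)
    {
        if (dna == null) throw new ArgumentNullException("dna");

        var rna = new StringBuilder(dna.Length);
        foreach (char c in dna)
        {
            switch (c)
            {
                case 'a': case 'c': case 'g':
                case 'A': case 'C': case 'G':
                    rna.Append(c);
                    break;
                case 't':
                    rna.Append('u');
                    break;
                case 'T':
                    rna.Append('U');
                    break;
                default:
                    throw new ArgumentException("Not a DNA symbol: '" + c + "'", "dna");
            }
        }
        return rna.ToString();
    }
}
```

Calling `NaiveTranscriber.ToRna("atgccgaatcgtaa")` returns `augccgaaucguaa`, the same string the ported code prints. What the framework adds on top is everything this sketch ignores: typed alphabets, ambiguity symbols, and sequences too large to handle comfortably as plain strings.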

Actually, I think it is pretty cool.  I’m pretty close to having working code that will let me translate the RNA to a protein sequence or form the complement of a DNA strand.  Not rocket science, but I’ve only begun to scratch the surface.  The framework can read sequence files (BLAST, FASTA), edit large sequences efficiently, do pairwise alignment, and a whole lot more.
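The complement, at least, is easy to picture at the string level. The sketch below is not the ported BioJava code, just a plain-string illustration of the idea: each base pairs with its partner (a with t, c with g), and the reverse complement reads the result in the opposite direction. `NaiveDna` and its methods are hypothetical names, and it only handles lowercase a, c, g, and t.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of BioJava or the C# port.
public static class NaiveDna
{
    // Maps one lowercase DNA base to its pairing partner.
    private static char ComplementOf(char c)
    {
        switch (c)
        {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:
                throw new ArgumentException("Not a DNA symbol: '" + c + "'");
        }
    }

    // Complement, in the same left-to-right order as the input.
    public static string Complement(string dna)
    {
        var sb = new StringBuilder(dna.Length);
        foreach (char c in dna) sb.Append(ComplementOf(c));
        return sb.ToString();
    }

    // Complement read in the opposite direction, i.e. the other strand.
    public static string ReverseComplement(string dna)
    {
        var sb = new StringBuilder(dna.Length);
        for (int i = dna.Length - 1; i >= 0; i--) sb.Append(ComplementOf(dna[i]));
        return sb.ToString();
    }
}
```

For the sequence used earlier, `NaiveDna.Complement("atgccgaatcgtaa")` gives `tacggcttagcatt` and `NaiveDna.ReverseComplement("atgccgaatcgtaa")` gives `ttacgattcggcat`.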

If you’re curious, you can compare the above C# code to the original Java code, which comes from the BioJava cookbook.
