doug-swisher.net

September 26, 2008

BioSharp web site is live

Filed under: Bioinformatics — Tags: — Doug @ 4:47 pm

I’ve uploaded the first incarnation of the BioSharp web site.  It is still a bit thin, but at least the API docs are available.

The next step in the project will be to work on an application that was the whole motivation for BioSharp.  As that progresses, I’m sure I’ll be continuing to port bits.  According to my port status page, I’m just under 5% done…only 1413 classes left to port.  Even at this point, though, the library has some useful functionality, as demonstrated by a few of my earlier posts.

If you are interested in seeing a specific module ported over, don’t hesitate to add a comment here to post in the forums.  I’ll also look at getting a mailing list set up sometime soon.

September 23, 2008

Finding and flipping a DNA sequence with BioSharp

Filed under: Bioinformatics, Software — Tags: — Doug @ 11:06 am

The BioSharp port is still moving forward.  I have enough functionality now to be able to create a DNA sequence, find a subsequence with that sequence, and create a new sequence with the subsequence flipped around.

For example, it can take “aacgaa”, search for “cg”, flip it around, and create the new sequence “aagcaa”.  It would be trivial to do this just by string manipulation; hopefully the investment in the library will be worth it.

Here is a bit of sample code to do the search and flip.

private static void FindAndFlip()
{
    // Create our two bits of DNA
    ISymbolList bigDNA = DNATools.CreateDNA("acgatagatagctacgcatagctagctaagctacgactacgctacgctacg");
    ISymbolList subSequence = DNATools.CreateDNA("agctagctaagct");

    // Find the smaller piece within the larger piece
    KnuthMorrisPrattSearch search = new KnuthMorrisPrattSearch(subSequence);

    int[] results = search.FindMatches(bigDNA);

    if (results.Length == 0)
    {
        Console.WriteLine("subSequence not found!");
        return;
    }

    // Reverse the small piece
    ReverseSymbolList reverseSubSequence = new ReverseSymbolList(subSequence);

    // Make a copy of the big sequence that we can play with...
    ISymbolList reverseBigDNA = new SimpleSymbolList(bigDNA);

    // Overwrite the forward sequence with the reverse...
    Edit edit = new Edit(results[0], subSequence.Length, reverseSubSequence);

    reverseBigDNA.Edit(edit);

    // Print out the results...
    Console.WriteLine("subSequence:        " + subSequence.SeqString);
    Console.WriteLine("reverseSubSequence: " + reverseSubSequence.SeqString);
    Console.WriteLine("bigDNA:             " + bigDNA.SeqString);
    Console.WriteLine("reverseBigDNA:      " + reverseBigDNA.SeqString);
}

Here is the output from this snippet, with the flipped sequence highlighted in red:

subSequence:        agctagctaagct
reverseSubSequence: tcgaatcgatcga
bigDNA:             acgatagatagctacgcatagctagctaagctacgactacgctacgctacg
reverseBigDNA:      acgatagatagctacgcattcgaatcgatcgaacgactacgctacgctacg

Note that this is simply the reverse of the subsegment, and not the reverse compliment.  The reverse compliment would be just as easy to do, though…

September 16, 2008

You know your port is in trouble when…

Filed under: Software — Tags: — Doug @ 11:15 pm

In porting BioJava, I came across the following comment:

Don’t use this class directly. This class contains deep voodoo code. Run away while you still can.

Looking a bit deeper at the class, it generates code on the fly.  That wouldn’t, in itself, be too bad, except it doesn’t generate Java and compile it, it generates bytecode!

Here is a small snippet:

        GeneratedCodeMethod init = pclass.createMethod(
                "<init>",
                voidC,
                new CodeClass[]{
                  faceClassC,
                  projectionContextC
                },
                CodeUtils.ACC_PUBLIC);

        InstructionVector initIV = new InstructionVector();
        initIV.add(ByteCode.make_aload(init.getThis()));
        initIV.add(ByteCode.make_aload(init.getVariable(0)));
        initIV.add(ByteCode.make_aload(init.getVariable(1)));
        initIV.add(ByteCode.make_invokespecial(m_ourBase_init));
        initIV.add(ByteCode.make_return());
        pclass.setCodeGenerator(init, initIV);

Uh, yeah.  I can read and write Java, but I’m no expert, and I’ve certainly never looked at Java bytecode.  To make matters worse, the code uses the “continue label” construct, like the following (the “more code” placeholder is about 150 additional lines):

        METHOD_MAKER:
        for (Iterator methIt = faceClassC.getMethods().iterator(); methIt.hasNext();) {
          CodeMethod faceMethod = (CodeMethod) methIt.next();
          Set baseMethods = baseClassC.getMethodsByName(faceMethod.getName());

          if (baseClassC.getMethodsByName(faceMethod.getName()).size() > 0) {
            for(Iterator i = baseMethods.iterator(); i.hasNext(); ) {
              CodeMethod meth = (CodeMethod) i.next();
              if( (meth.getModifiers() & CodeUtils.ACC_ABSTRACT) == 0) {
                //System.err.println("Skipping defined method: " + faceMethod.getName());
                continue METHOD_MAKER;
              }
            }
          }

          // ...more code...
        }

I’m not saying it’s bad code; I’m just saying it’s not going to be much fun to port, especially given the lack of unit tests on these bits and the fact that I’ve never generated C# code on the fly, either.

Ah, well, at least they warned me with that comment up top.

September 14, 2008

So much to port…so little time.

Filed under: Diary — Tags: — Doug @ 11:13 pm

I haven’t posted in a while, as I’ve been working on porting BioJava over to C#.  It is a big library (to say the least), and I’ve almost (hopefully) got enough code working that I can find restriction enzyme sites on a strand of DNA.  So far, I’ve ported about 150 classes, although a fair number of them still have a handful of NotYetImplemented exceptions lying around.

Within the next week or two, I hope to have some code to post here, along with a link to the initial web page for the project.

September 5, 2008

When are two HashSets equal?

Filed under: Software — Tags: , — Doug @ 8:20 pm

It has been a couple of days since I’ve posted, as I’ve been trying to track down an elusive bug deep in the bowels of the C# port of BioJava.  I don’t have it fixed yet, but I’ve determined the root cause.

There are number of places in BioJava where they use a Set as a key in a dictionary, to handle things such as ambiguity symbols.  For example, in DNA, the symbol ‘N’ represents any base A, G, C, or T.  If a reverse lookup is done, asking for the ambiguity symbol for AGCT, it should always return the same symbol – N.

In the port, I goofed and used a List<> in a couple of places to port a Set.  When I went to fix that and replace it with a .Net HashSet<>, it didn’t resolve the problem, much to my surprise.  I wrote a quick test to try and figure out what was going on:

[Test]
public void HashSetAsKey()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(2);
    h2.Add(1);

    Dictionary<HashSet<int>, string> dict = new Dictionary<HashSet<int>, string>();

    dict.Add(h1, "hello");

    Assert.AreEqual("hello", dict[h2]);
}

This test fails, with a KeyNotFound exception.  The sets are equivalent, so why didn’t this work?  Time for another test.

[Test]
public void HashSetEquality()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(1);
    h2.Add(2);

    Assert.AreEqual(h1, h2);    // Why does this fail???
}

Even adding the items in the same order fails!  In fact, you can remove the four Add calls and compare two empty HashSets – the test will still fail.

After some digging, it turns out that the equality test I was expecting is implemented as a separate method, called SetEquals.  The following test will pass:

[Test]
public void HashSetSpecialEquality()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(1);
    h2.Add(2);

    Assert.IsTrue(h1.SetEquals(h2));
}

So, when are two HashSets equal?  It looks like the answer is: never.

Argh.

That makes the HashSet class useless as the key to a dictionary, unless I’m missing some other way to make it work.

I’m off to write my own Set class.

September 1, 2008

Bioinformatics – BioJava and C#

Filed under: Bioinformatics — Tags: , — Doug @ 10:36 pm

I stumbled across the Bioinformatics Group at the UofM, and realized that I met the president at a birthday party for a mutual friend a few months ago.  I may have the opportunity to contribute to a project or two in the coming semester(s), so I started reading a bit about bioinformatics (again).

I went looking for some code, and found a framework called BioPerl, which seems fairly popular.  My perl skills have atrophied over the years, and when I found BioJava, I was a bit more excited.  It provides a number of useful functions, and seems fairly active.  There is also a related database project, BioSQL, that both BioPerl and BioJava (along with BioRuby and BioPython) have incorporated language bindings.  BioJava even uses Hibernate as its O/R mapping layer.

Since I like to work in C#, I started playing around with porting BioJava to C#.  It’s a huge project, but it’s also a great way to see how BioJava is put together.  I’ve managed to get far enough that I can transcribe DNA to RNA using the following code:

        private static void TranscribeDNAtoRNA()
        {
            try
            {
                //make a DNA SymbolList
                ISymbolList symL = DNATools.CreateDNA("atgccgaatcgtaa");

                Console.WriteLine("DNA: " + symL.SeqString);

                symL = DNATools.ToRNA(symL);

                // just to prove it worked
                Console.WriteLine("RNA: " + symL.SeqString);
            }
            catch (IllegalSymbolException ex)
            {
                // this will happen if you try and make the DNA seq using non IUB symbols
                Console.WriteLine(ex);
            }
            catch (IllegalAlphabetException ex)
            {
                // this will happen if you try and transcribe a non DNA SymbolList
                Console.WriteLine(ex);
            }
        }

When run, the output is:

DNA: atgccgaatcgtaa
RNA: augccgaaucguaa

Yup.  A few dozen classes and a few hundred lines of code, and I can replace t’s with u’s.  Pretty exciting, eh?

Actually, I think it is pretty cool.  I’m pretty close to having the code working that will let me translate the RNA to a protein sequence or form the complement of a DNA strand.  Not rocket science, but I’ve only begun to tap the surface.  The framework allows reading sequence files (BLAST, FASTA), edit large sequences (efficiently), do pairwise alignment, and a whole lot more.

If you’re curious, you can compare the above C# code to the original Java code, which comes from the BioJava cookbook.

Blog at WordPress.com.