doug-swisher.net

September 5, 2008

When are two HashSets equal?

Filed under: Software — Tags: , — Doug @ 8:20 pm

It has been a couple of days since I’ve posted, as I’ve been trying to track down an elusive bug deep in the bowels of the C# port of BioJava.  I don’t have it fixed yet, but I’ve determined the root cause.

There are number of places in BioJava where they use a Set as a key in a dictionary, to handle things such as ambiguity symbols.  For example, in DNA, the symbol ‘N’ represents any base A, G, C, or T.  If a reverse lookup is done, asking for the ambiguity symbol for AGCT, it should always return the same symbol – N.

In the port, I goofed and used a List<> in a couple of places to port a Set.  When I went to fix that and replace it with a .Net HashSet<>, it didn’t resolve the problem, much to my surprise.  I wrote a quick test to try and figure out what was going on:

[Test]
public void HashSetAsKey()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(2);
    h2.Add(1);

    Dictionary<HashSet<int>, string> dict = new Dictionary<HashSet<int>, string>();

    dict.Add(h1, "hello");

    Assert.AreEqual("hello", dict[h2]);
}

This test fails, with a KeyNotFound exception.  The sets are equivalent, so why didn’t this work?  Time for another test.

[Test]
public void HashSetEquality()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(1);
    h2.Add(2);

    Assert.AreEqual(h1, h2);    // Why does this fail???
}

Even adding the items in the same order fails!  In fact, you can remove the four Add calls and compare two empty HashSets – the test will still fail.

After some digging, it turns out that the equality test I was expecting is implemented as a separate method, called SetEquals.  The following test will pass:

[Test]
public void HashSetSpecialEquality()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(1);
    h2.Add(2);

    Assert.IsTrue(h1.SetEquals(h2));
}

So, when are two HashSets equal?  It looks like the answer is: never.

Argh.

That makes the HashSet class useless as the key to a dictionary, unless I’m missing some other way to make it work.

I’m off to write my own Set class.

September 1, 2008

Bioinformatics – BioJava and C#

Filed under: Bioinformatics — Tags: , — Doug @ 10:36 pm

I stumbled across the Bioinformatics Group at the UofM, and realized that I met the president at a birthday party for a mutual friend a few months ago.  I may have the opportunity to contribute to a project or two in the coming semester(s), so I started reading a bit about bioinformatics (again).

I went looking for some code, and found a framework called BioPerl, which seems fairly popular.  My perl skills have atrophied over the years, and when I found BioJava, I was a bit more excited.  It provides a number of useful functions, and seems fairly active.  There is also a related database project, BioSQL, that both BioPerl and BioJava (along with BioRuby and BioPython) have incorporated language bindings.  BioJava even uses Hibernate as its O/R mapping layer.

Since I like to work in C#, I started playing around with porting BioJava to C#.  It’s a huge project, but it’s also a great way to see how BioJava is put together.  I’ve managed to get far enough that I can transcribe DNA to RNA using the following code:

        private static void TranscribeDNAtoRNA()
        {
            try
            {
                //make a DNA SymbolList
                ISymbolList symL = DNATools.CreateDNA("atgccgaatcgtaa");

                Console.WriteLine("DNA: " + symL.SeqString);

                symL = DNATools.ToRNA(symL);

                // just to prove it worked
                Console.WriteLine("RNA: " + symL.SeqString);
            }
            catch (IllegalSymbolException ex)
            {
                // this will happen if you try and make the DNA seq using non IUB symbols
                Console.WriteLine(ex);
            }
            catch (IllegalAlphabetException ex)
            {
                // this will happen if you try and transcribe a non DNA SymbolList
                Console.WriteLine(ex);
            }
        }

When run, the output is:

DNA: atgccgaatcgtaa
RNA: augccgaaucguaa

Yup.  A few dozen classes and a few hundred lines of code, and I can replace t’s with u’s.  Pretty exciting, eh?

Actually, I think it is pretty cool.  I’m pretty close to having the code working that will let me translate the RNA to a protein sequence or form the complement of a DNA strand.  Not rocket science, but I’ve only begun to tap the surface.  The framework allows reading sequence files (BLAST, FASTA), edit large sequences (efficiently), do pairwise alignment, and a whole lot more.

If you’re curious, you can compare the above C# code to the original Java code, which comes from the BioJava cookbook.

Blog at WordPress.com.