doug-swisher.net

October 19, 2008

Breaking CAPTCHA

Filed under: Software — Tags: — Doug @ 12:20 am

I always figured there were ways spammers could get around CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), but I’d never seen it discussed.  For those who might not be familiar with the name, a CAPTCHA is one of those weird text challenges where you have to type in some cryptic text to register on a site:

CAPTCHA Sample

I stumbled across an article describing a trojan that presents voyeurs with a woman doing a strip tease – each successful CAPTCHA entry removes another article of clothing.  One such trojan (HotLan) has been used to create more than 500,000 accounts on popular e-mail sites.

It struck me as a bit ironic that clicking on the “Discuss this article” link on that website prompted me with a registration form that included a CAPTCHA challenge.

The CAPTCHA web site acknowledges the issue, but deems it “not a concern”:

While it might be the case that some spammers use porn sites to attack CAPTCHAs, the amount of damage this can inflict is tiny (so tiny that we haven’t even noticed a dent!).

In spite of all this, I don’t see much of an alternative, and I’m sure I’ll continue to use CAPTCHAs on sites.  At least they make things a lot harder for spammers…

October 7, 2008

First BioSharp App – RestrictionFinder

Filed under: Bioinformatics — Tags: — Doug @ 7:57 am

I’ve started work on my first application that uses BioSharp.  It is called RestrictionFinder, and its purpose is to help find a pair of restriction enzymes that give distinct cleavage patterns depending on whether an insert is present in a plasmid in the forward direction, present in the reverse direction, or absent.  It can also limit the search to pairs of enzymes with compatible buffers.

Here is a screen shot of the sequence entry form:

Sequence Entry Form

If the sequence contains uppercase and lowercase, the lowercase is assumed to be the insert, and the start/end positions are set automatically.
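To make the idea concrete, here is a minimal Java sketch of the comparison described above. It is not RestrictionFinder's actual code. It assumes a circular plasmid, palindromic recognition sites (EcoRI's GAATTC and BamHI's GGATCC are used as examples), and it treats the start of each site as the cut position. The class name, method names, and the test sequence are all made up for illustration, and gel resolution and buffer compatibility are ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

class CleavagePatternSketch {

    // Made-up test sequence: uppercase backbone with one EcoRI site (GAATTC),
    // lowercase insert with one off-centre BamHI site (GGATCC).
    static final String EXAMPLE =
            "GAATTC" + "AAAAAAAAAA"
            + "tt" + "ggatcc" + "tttttt" + "tttttt"
            + "AAAAAAAA" + "AAAAAAAA" + "AAAAAAAA";

    /** Returns {start, end} (end exclusive) of the first lowercase run, or null if none. */
    static int[] findInsert(String sequence) {
        int start = -1;
        for (int i = 0; i < sequence.length(); i++) {
            boolean lower = Character.isLowerCase(sequence.charAt(i));
            if (lower && start < 0) {
                start = i;
            } else if (!lower && start >= 0) {
                return new int[] {start, i};
            }
        }
        return start < 0 ? null : new int[] {start, sequence.length()};
    }

    /** Reverse complement of an uppercase A/C/G/T string. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            int index = "ACGT".indexOf(dna.charAt(i));
            if (index < 0) throw new IllegalArgumentException("Not a DNA base: " + dna.charAt(i));
            sb.append("TGCA".charAt(index));
        }
        return sb.toString();
    }

    /** Sorted fragment lengths after cutting a circular plasmid at every site occurrence. */
    static List<Integer> fragments(String plasmid, String... sites) {
        int n = plasmid.length();
        TreeSet<Integer> cuts = new TreeSet<Integer>();
        for (String site : sites) {
            // Repeat the first few bases at the end so a site spanning the origin is found.
            String wrapped = plasmid + plasmid.substring(0, Math.min(n, site.length() - 1));
            for (int i = wrapped.indexOf(site); i >= 0 && i < n; i = wrapped.indexOf(site, i + 1)) {
                cuts.add(i);
            }
        }
        List<Integer> lengths = new ArrayList<Integer>();
        if (cuts.isEmpty()) {
            return lengths; // uncut plasmid: no linear fragments
        }
        Integer previous = null;
        for (Integer cut : cuts) {
            if (previous != null) lengths.add(cut - previous);
            previous = cut;
        }
        lengths.add(n - cuts.last() + cuts.first()); // fragment that wraps past the origin
        Collections.sort(lengths);
        return lengths;
    }

    /** True if absent, forward, and reverse constructs all give different fragment lists. */
    static boolean distinguishes(String sequence, String siteA, String siteB) {
        int[] span = findInsert(sequence);
        if (span == null) throw new IllegalArgumentException("No lowercase insert found");
        String left = sequence.substring(0, span[0]).toUpperCase();
        String insert = sequence.substring(span[0], span[1]).toUpperCase();
        String right = sequence.substring(span[1]).toUpperCase();
        List<Integer> absent = fragments(left + right, siteA, siteB);
        List<Integer> forward = fragments(left + insert + right, siteA, siteB);
        List<Integer> reverse = fragments(left + reverseComplement(insert) + right, siteA, siteB);
        return !absent.equals(forward) && !absent.equals(reverse) && !forward.equals(reverse);
    }

    public static void main(String[] args) {
        System.out.println(distinguishes(EXAMPLE, "GAATTC", "GGATCC"));
        System.out.println(distinguishes(EXAMPLE, "GAATTC", "GAATTC"));
    }
}
```

The key point is that a site inside the insert has to sit off-centre: flipping the insert then moves the cut, and the fragment lengths change. An enzyme that only cuts the backbone can tell "insert present" from "insert absent", but not forward from reverse.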

As part of the solution, I needed a small database (just a file, really) of enzymes and their buffers.  I could not find a readily available file for this, so I wrote a small console app that extracts the data from the REBASE web site.

This is still a work in progress, but the source code is checked into the BioSharp SVN repository.

September 26, 2008

BioSharp web site is live

Filed under: Bioinformatics — Tags: — Doug @ 4:47 pm

I’ve uploaded the first incarnation of the BioSharp web site.  It is still a bit thin, but at least the API docs are available.

The next step in the project will be to work on an application that was the whole motivation for BioSharp.  As that progresses, I’m sure I’ll be continuing to port bits.  According to my port status page, I’m just under 5% done…only 1413 classes left to port.  Even at this point, though, the library has some useful functionality, as demonstrated by a few of my earlier posts.

If you are interested in seeing a specific module ported over, don’t hesitate to add a comment here or post in the forums.  I’ll also look at getting a mailing list set up sometime soon.

September 23, 2008

Finding and flipping a DNA sequence with BioSharp

Filed under: Bioinformatics, Software — Tags: — Doug @ 11:06 am

The BioSharp port is still moving forward.  I have enough functionality now to be able to create a DNA sequence, find a subsequence within that sequence, and create a new sequence with the subsequence flipped around.

For example, it can take “aacgaa”, search for “cg”, flip it around, and create the new sequence “aagcaa”.  It would be trivial to do this just by string manipulation; hopefully the investment in the library will be worth it.

Here is a bit of sample code to do the search and flip.

private static void FindAndFlip()
{
    // Create our two bits of DNA
    ISymbolList bigDNA = DNATools.CreateDNA("acgatagatagctacgcatagctagctaagctacgactacgctacgctacg");
    ISymbolList subSequence = DNATools.CreateDNA("agctagctaagct");

    // Find the smaller piece within the larger piece
    KnuthMorrisPrattSearch search = new KnuthMorrisPrattSearch(subSequence);

    int[] results = search.FindMatches(bigDNA);

    if (results.Length == 0)
    {
        Console.WriteLine("subSequence not found!");
        return;
    }

    // Reverse the small piece
    ReverseSymbolList reverseSubSequence = new ReverseSymbolList(subSequence);

    // Make a copy of the big sequence that we can play with...
    ISymbolList reverseBigDNA = new SimpleSymbolList(bigDNA);

    // Overwrite the forward sequence with the reverse...
    Edit edit = new Edit(results[0], subSequence.Length, reverseSubSequence);

    reverseBigDNA.Edit(edit);

    // Print out the results...
    Console.WriteLine("subSequence:        " + subSequence.SeqString);
    Console.WriteLine("reverseSubSequence: " + reverseSubSequence.SeqString);
    Console.WriteLine("bigDNA:             " + bigDNA.SeqString);
    Console.WriteLine("reverseBigDNA:      " + reverseBigDNA.SeqString);
}

Here is the output from this snippet; the flipped sequence is the tcgaatcgatcga run in the middle of reverseBigDNA:

subSequence:        agctagctaagct
reverseSubSequence: tcgaatcgatcga
bigDNA:             acgatagatagctacgcatagctagctaagctacgactacgctacgctacg
reverseBigDNA:      acgatagatagctacgcattcgaatcgatcgaacgactacgctacgctacg

Note that this is simply the reverse of the subsegment, and not the reverse complement.  The reverse complement would be just as easy to do, though…
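For comparison, the plain string manipulation version mentioned earlier might look like the Java sketch below, with a reverse complement helper added. It uses nothing from BioJava or BioSharp; the class and method names are made up for illustration, and it only handles lowercase a, c, g, and t.

```java
class StringFlipSketch {

    /** Reverses the order of the bases, e.g. "cg" becomes "gc". */
    static String reverse(String dna) {
        return new StringBuilder(dna).reverse().toString();
    }

    /** Reverse complement for lowercase a/c/g/t; anything else is rejected. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            char base = dna.charAt(i);
            int index = "acgt".indexOf(base);
            if (index < 0) throw new IllegalArgumentException("Not a DNA base: " + base);
            sb.append("tgca".charAt(index));
        }
        return sb.toString();
    }

    /** Replaces the first occurrence of sub in big with its reverse; null if sub is not found. */
    static String findAndFlip(String big, String sub) {
        int at = big.indexOf(sub);
        if (at < 0) return null;
        return big.substring(0, at) + reverse(sub) + big.substring(at + sub.length());
    }

    public static void main(String[] args) {
        System.out.println(findAndFlip("aacgaa", "cg"));        // prints "aagcaa"
        System.out.println(reverseComplement("agctagctaagct")); // prints "agcttagctagct"
    }
}
```

The string version gives up what the library provides: alphabet checking, ambiguity symbols, and edits that do not copy the whole sequence.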

September 16, 2008

You know your port is in trouble when…

Filed under: Software — Tags: — Doug @ 11:15 pm

In porting BioJava, I came across the following comment:

Don’t use this class directly. This class contains deep voodoo code. Run away while you still can.

Looking a bit deeper at the class, it generates code on the fly.  That wouldn’t, in itself, be too bad, except that it doesn’t generate Java source and compile it; it generates bytecode directly!

Here is a small snippet:

        GeneratedCodeMethod init = pclass.createMethod(
                "<init>",
                voidC,
                new CodeClass[]{
                  faceClassC,
                  projectionContextC
                },
                CodeUtils.ACC_PUBLIC);

        InstructionVector initIV = new InstructionVector();
        initIV.add(ByteCode.make_aload(init.getThis()));
        initIV.add(ByteCode.make_aload(init.getVariable(0)));
        initIV.add(ByteCode.make_aload(init.getVariable(1)));
        initIV.add(ByteCode.make_invokespecial(m_ourBase_init));
        initIV.add(ByteCode.make_return());
        pclass.setCodeGenerator(init, initIV);

Uh, yeah.  I can read and write Java, but I’m no expert, and I’ve certainly never looked at Java bytecode.  To make matters worse, the code uses the “continue label” construct, like the following (the “more code” placeholder is about 150 additional lines):

        METHOD_MAKER:
        for (Iterator methIt = faceClassC.getMethods().iterator(); methIt.hasNext();) {
          CodeMethod faceMethod = (CodeMethod) methIt.next();
          Set baseMethods = baseClassC.getMethodsByName(faceMethod.getName());

          if (baseClassC.getMethodsByName(faceMethod.getName()).size() > 0) {
            for(Iterator i = baseMethods.iterator(); i.hasNext(); ) {
              CodeMethod meth = (CodeMethod) i.next();
              if( (meth.getModifiers() & CodeUtils.ACC_ABSTRACT) == 0) {
                //System.err.println("Skipping defined method: " + faceMethod.getName());
                continue METHOD_MAKER;
              }
            }
          }

          // ...more code...
        }

I’m not saying it’s bad code; I’m just saying it’s not going to be much fun to port, especially given the lack of unit tests on these bits and the fact that I’ve never generated C# code on the fly, either.
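For anyone who hasn't met the construct: a labeled continue abandons the inner loop and jumps straight to the next iteration of the labeled outer loop. C# has no labeled continue (it does have goto), so one way to port this shape is to move the inner loop into a helper method. The Java sketch below shows both forms on a toy problem; the class, methods, and data are made up and have nothing to do with the BioJava class itself.

```java
import java.util.ArrayList;
import java.util.List;

class LabeledContinueSketch {

    /** Indices of rows that contain no negative value, using a labeled continue. */
    static List<Integer> withLabel(int[][] rows) {
        List<Integer> kept = new ArrayList<Integer>();
        ROWS:
        for (int r = 0; r < rows.length; r++) {
            for (int value : rows[r]) {
                if (value < 0) {
                    continue ROWS; // abandon this row and move on to the next one
                }
            }
            kept.add(r);
        }
        return kept;
    }

    /** The inner loop pulled out into a predicate. */
    static boolean hasNegative(int[] row) {
        for (int value : row) {
            if (value < 0) return true;
        }
        return false;
    }

    /** Same result without a label; this shape translates line for line to C#. */
    static List<Integer> withoutLabel(int[][] rows) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int r = 0; r < rows.length; r++) {
            if (hasNegative(rows[r])) {
                continue;
            }
            kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        int[][] rows = { {1, 2}, {3, -1}, {}, {-5} };
        System.out.println(withLabel(rows));    // prints [0, 2]
        System.out.println(withoutLabel(rows)); // prints [0, 2]
    }
}
```

With 150 lines of loop body after the continue, as in the snippet above, the helper method route is probably less work than threading a flag through the loop.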

Ah, well, at least they warned me with that comment up top.

September 5, 2008

When are two HashSets equal?

Filed under: Software — Tags: , — Doug @ 8:20 pm

It has been a couple of days since I’ve posted, as I’ve been trying to track down an elusive bug deep in the bowels of the C# port of BioJava.  I don’t have it fixed yet, but I’ve determined the root cause.

There are a number of places in BioJava where they use a Set as a key in a dictionary, to handle things such as ambiguity symbols.  For example, in DNA, the symbol ‘N’ represents any base A, G, C, or T.  If a reverse lookup is done, asking for the ambiguity symbol for AGCT, it should always return the same symbol – N.

In the port, I goofed and used a List<> in a couple of places to port a Set.  When I went to fix that and replace it with a .NET HashSet<>, it didn’t resolve the problem, much to my surprise.  I wrote a quick test to try to figure out what was going on:

[Test]
public void HashSetAsKey()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(2);
    h2.Add(1);

    Dictionary<HashSet<int>, string> dict = new Dictionary<HashSet<int>, string>();

    dict.Add(h1, "hello");

    Assert.AreEqual("hello", dict[h2]);
}

This test fails, with a KeyNotFoundException.  The sets are equivalent, so why didn’t this work?  Time for another test.

[Test]
public void HashSetEquality()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(1);
    h2.Add(2);

    Assert.AreEqual(h1, h2);    // Why does this fail???
}

Even adding the items in the same order fails!  In fact, you can remove the four Add calls and compare two empty HashSets – the test will still fail.

After some digging, it turns out that the equality test I was expecting is implemented as a separate method, called SetEquals.  The following test will pass:

[Test]
public void HashSetSpecialEquality()
{
    HashSet<int> h1 = new HashSet<int>();
    h1.Add(1);
    h1.Add(2);

    HashSet<int> h2 = new HashSet<int>();
    h2.Add(1);
    h2.Add(2);

    Assert.IsTrue(h1.SetEquals(h2));
}

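So, when are two HashSets equal?  It looks like the answer is: never, unless they are the very same instance.  HashSet<> does not override Equals or GetHashCode, so the comparison falls back to reference equality.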

Argh.

That makes the HashSet class useless as the key to a dictionary, unless I’m missing some other way to make it work.

I’m off to write my own Set class.
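For contrast, here is why the original Java code gets away with it. Java's standard set classes inherit equals and hashCode from java.util.AbstractSet, which define both in terms of the elements, so two sets with the same contents are interchangeable as map keys. A minimal sketch follows; the class, field, and method names are made up, and the two table entries are the standard IUPAC codes N (any base) and R (A or G).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SetAsKeySketch {

    // Reverse lookup table: set of bases -> ambiguity symbol.
    static final Map<Set<Character>, Character> TABLE = new HashMap<Set<Character>, Character>();

    static {
        TABLE.put(new HashSet<Character>(Arrays.asList('A', 'G', 'C', 'T')), 'N');
        TABLE.put(new HashSet<Character>(Arrays.asList('A', 'G')), 'R');
    }

    /** Looks up the symbol for a set of bases; returns null if the set is not in the table. */
    static Character ambiguityFor(Character... bases) {
        // A brand new set, built in any order, still finds the entry,
        // because Set.equals and Set.hashCode depend only on the contents.
        return TABLE.get(new HashSet<Character>(Arrays.asList(bases)));
    }

    public static void main(String[] args) {
        System.out.println(ambiguityFor('T', 'C', 'G', 'A')); // prints N
        System.out.println(ambiguityFor('G', 'A'));           // prints R
        System.out.println(ambiguityFor('C'));                // prints null
    }
}
```

On the .NET side, HashSet<T>.CreateSetComparer() returns an IEqualityComparer<HashSet<T>> that compares sets by content, and Dictionary accepts a comparer in its constructor. That may be a way to keep HashSet<> as the key type without writing a new Set class, though it has not been tried against the port here.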

September 1, 2008

Bioinformatics – BioJava and C#

Filed under: Bioinformatics — Tags: , — Doug @ 10:36 pm

I stumbled across the Bioinformatics Group at the UofM, and realized that I met the president at a birthday party for a mutual friend a few months ago.  I may have the opportunity to contribute to a project or two in the coming semester(s), so I started reading a bit about bioinformatics (again).

I went looking for some code, and found a framework called BioPerl, which seems fairly popular.  My Perl skills have atrophied over the years, so when I found BioJava, I was a bit more excited.  It provides a number of useful functions, and seems fairly active.  There is also a related database project, BioSQL, for which both BioPerl and BioJava (along with BioRuby and BioPython) provide language bindings.  BioJava even uses Hibernate as its O/R mapping layer.

Since I like to work in C#, I started playing around with porting BioJava to C#.  It’s a huge project, but it’s also a great way to see how BioJava is put together.  I’ve managed to get far enough that I can transcribe DNA to RNA using the following code:

        private static void TranscribeDNAtoRNA()
        {
            try
            {
                //make a DNA SymbolList
                ISymbolList symL = DNATools.CreateDNA("atgccgaatcgtaa");

                Console.WriteLine("DNA: " + symL.SeqString);

                symL = DNATools.ToRNA(symL);

                // just to prove it worked
                Console.WriteLine("RNA: " + symL.SeqString);
            }
            catch (IllegalSymbolException ex)
            {
                // this will happen if you try and make the DNA seq using non IUB symbols
                Console.WriteLine(ex);
            }
            catch (IllegalAlphabetException ex)
            {
                // this will happen if you try and transcribe a non DNA SymbolList
                Console.WriteLine(ex);
            }
        }

When run, the output is:

DNA: atgccgaatcgtaa
RNA: augccgaaucguaa

Yup.  A few dozen classes and a few hundred lines of code, and I can replace t’s with u’s.  Pretty exciting, eh?

Actually, I think it is pretty cool.  I’m pretty close to having the code working that will let me translate the RNA to a protein sequence or form the complement of a DNA strand.  Not rocket science, but I’ve only begun to scratch the surface.  The framework can read sequence files (BLAST, FASTA), edit large sequences (efficiently), do pairwise alignment, and a whole lot more.

If you’re curious, you can compare the above C# code to the original Java code, which comes from the BioJava cookbook.
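To show what the library call boils down to at the string level, here is a small Java sketch of the same transcription. It is neither the BioJava cookbook code nor the BioSharp port. The class and method names are made up, it accepts only lowercase a, c, g, and t (no IUB ambiguity codes), and a plain IllegalArgumentException stands in for the library's IllegalSymbolException.

```java
class TranscribeSketch {

    /** Copies the strand, writing u wherever the DNA has t; rejects any other symbol. */
    static String transcribe(String dna) {
        StringBuilder rna = new StringBuilder(dna.length());
        for (int i = 0; i < dna.length(); i++) {
            char base = dna.charAt(i);
            switch (base) {
                case 'a':
                case 'c':
                case 'g':
                    rna.append(base);
                    break;
                case 't':
                    rna.append('u');
                    break;
                default:
                    throw new IllegalArgumentException(
                            "Illegal DNA symbol '" + base + "' at index " + i);
            }
        }
        return rna.toString();
    }

    public static void main(String[] args) {
        System.out.println("RNA: " + transcribe("atgccgaatcgtaa")); // prints "RNA: augccgaaucguaa"
    }
}
```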

August 24, 2008

User Interface Design Patterns

Filed under: Software — Doug @ 2:31 pm

I stumbled across a site that has a few user interface design patterns – something I had not seen before, but which seems like a great idea.  I did a search, and found a few more sites containing web-based patterns, including Yahoo’s design pattern library, welie.com and ui-patterns.com.  The search also turned up an O’Reilly book by Jenifer Tidwell called Designing Interfaces: Patterns for Effective Interaction Design that I may have to add to my library.

On a semi-related note, I found that Joel on Software has the first few chapters of User Interface Design for Programmers online.

August 20, 2008

Genealogical Record Keeping Systems

Filed under: Genealogy — Doug @ 11:02 pm

Thinking about the user interface for my genealogy program, I thought it would be good to take a look at some of the existing record keeping solutions, as that is where I’ll be starting.  The record keeping systems allow you to store evidence, but do not include lineage-linked views.  In other words, you can store census records, birth certificates, etc., but you won’t be able to display a pedigree from them.

Here are some very quick overviews of what I’ve seen after playing with each one for a few minutes.

Clooz

Clooz is a package I’ve mentioned before, as it was the one package I had used briefly (years ago).  There is a list of object types (census, people, buildings, sources, research log) down the left side and a list view on the right side.  There is a centralized list of people, and adding a census entry consists of adding a census record and then linking folks into the census record.  All data entry is done in dialogs, making it easy to get a handful of dialogs open at the same time (census record dialog, link people dialog, add person dialog).

Clooz 2.1 Screen Shot

The Clooz home page describes the package as follows:

Clooz 2.1 is a database for systematically organizing and storing all of the clues to your ancestry that you have been collecting over the years. This is not another genealogy program. It is an electronic filing cabinet that assists you with search and retrieval of important facts that you have found during the ancestor hunt.

It uses an Access database behind the scenes, and has some integration with Legacy, but that only works with version 6, and I’ve upgraded to version 7.  There is a free download on their site, limited to 29 days or 15 launches.

Custodian

Custodian is a program I hadn’t seen before.  I found it somewhat by accident in an article that contrasts it with Clooz.  Similar to Clooz, it has a list of object types down the left, but it is an MDI application, so when you go to add new items, a child window pops up, which is a big list view with buttons down the side.  Most data entry is done right in the list view, until you get to something like a name, which requires editing in a dialog.

Custodian 3 Screen Shot

The package is very colorful, using lots of backgrounds and shading, which I found made it hard to read at times.  The data is stored in password-protected Access database files, under Program Files, of all places.  There is a free download on their site so you can try it before you buy.  The trial limits data entry to ten records per section.

Bygones

I found a link to Bygones on Cyndi’s List, which has a section for these types of packages.  It has an interesting look, in that it appears to be a piece of paper.  I found that made it a bit hard to know where to enter data.  It is written using FileMaker Pro, but I don’t know if the look is typical.

Bygones 0.9d Screen Shot

Their home page has a handful of slide show tutorials, which I probably need to watch, as I wasn’t quite sure how to use the package.  I did watch the first half of the introductory slide show, and it looked interesting.

It is a free download as a self-extracting zip, with no installer.

GenScribe

GenScribe is a Mac program that appears to be closer to what I had envisioned, in that it displays a nice representation of a census record.  The opening screen is a list of buttons for various operations.  It has a list of work to be done at a specific venue, source records, and index records.

GenScribe Census Screen

I’ll have to fire up the Mac I have on loan from work and experiment with how they do the data entry.

There is a free trial download, and the full product only costs $12.

Summary

I’m not sure if I’m really qualified to make any summary statements after just a few minutes of playing around, but I’m going to do so anyway.

My impression is that these packages, in general, suffer from the same problem as lineage-linked packages, in that they don’t adequately differentiate between evidence and conclusions; they just do it from the other end of the spectrum.  For example, in Clooz, you link individuals in your database to populate specific census records.  Well, what if the person named “John Doe” in the census isn’t really the same “John Doe” you have in your database?  This can be seen below, as the “Person” dialog has a list of the census records to which the person has been linked.

Clooz Person Dialog

My goal is to enter census data, birth certificates, and the like nearly verbatim so that the census record itself stays intact, even if the person isn’t really part of my ancestry.  I should be able to link and unlink evidence and conclusions without “touching” the evidence at all.
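One way that separation could be modeled is sketched below in Java. This is purely illustrative and is not the design of the app itself: every class, field, and method name is hypothetical. Evidence records are immutable, conclusions are separate objects, and the links between them live in their own structure, so linking or unlinking never writes to the evidence.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class EvidenceLinkSketch {

    /** Evidence: one census line, transcribed as written. All fields are final. */
    static final class CensusEntry {
        final int censusYear;
        final String nameAsWritten;
        final String ageAsWritten;

        CensusEntry(int censusYear, String nameAsWritten, String ageAsWritten) {
            this.censusYear = censusYear;
            this.nameAsWritten = nameAsWritten;
            this.ageAsWritten = ageAsWritten;
        }
    }

    /** Conclusion: a person the researcher believes existed. */
    static final class Person {
        final String displayName;

        Person(String displayName) {
            this.displayName = displayName;
        }
    }

    // Links are kept outside both classes. Neither class overrides equals,
    // so keys and values are matched by object identity.
    private final Map<Person, Set<CensusEntry>> links = new HashMap<Person, Set<CensusEntry>>();

    void link(Person person, CensusEntry entry) {
        Set<CensusEntry> entries = links.get(person);
        if (entries == null) {
            entries = new LinkedHashSet<CensusEntry>();
            links.put(person, entries);
        }
        entries.add(entry);
    }

    void unlink(Person person, CensusEntry entry) {
        Set<CensusEntry> entries = links.get(person);
        if (entries != null) {
            entries.remove(entry);
        }
    }

    /** Read-only view of the evidence currently attached to a person. */
    Set<CensusEntry> evidenceFor(Person person) {
        Set<CensusEntry> entries = links.get(person);
        return entries == null
                ? Collections.<CensusEntry>emptySet()
                : Collections.unmodifiableSet(entries);
    }

    public static void main(String[] args) {
        EvidenceLinkSketch store = new EvidenceLinkSketch();
        CensusEntry entry = new CensusEntry(1880, "John Doe", "34");
        Person ancestor = new Person("John Doe");
        store.link(ancestor, entry);
        store.unlink(ancestor, entry); // wrong John Doe after all; the entry is untouched
        System.out.println(store.evidenceFor(ancestor).size() + " " + entry.nameAsWritten);
    }
}
```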

Will I really be able to pull it off, and put together something that works better than these packages?  Probably not, but I hope to at least learn a lot along the way.

Trying to move forward on my genealogy app

Filed under: Genealogy, Software — Doug @ 7:42 am

I’ve been a bit stuck on the genealogy project, for a couple of reasons.  First, I’ve been spending a fair amount of time watching the Olympics, which hasn’t left much time for coding fun.  Second, I’ve only been playing a bit with a domain model, and domain-driven design just doesn’t quite feel right for this project.  It’s probably my lack of experience with the paradigm, but it seems more applicable to complex business models (lots of interactions and state changes) than it does to a single-user data repository.  I don’t regret spending time with it, as I’ve learned a lot, but I think I need to try a new tack.

Yesterday, I attended the Minneapolis Silverlight User Group meeting, which reinforced the fact that WPF is very cool, and I should probably try to use it for this project.  I played around a bit last night with Family.Show, which is a very cool, glitzy app, but it doesn’t even allow entry of sources.  (It’s a WPF reference app, and in that it succeeds very well, but it’s a long way from a full-blown genealogy app, and they readily admit that.)

My new approach is going to be to get something working, include plenty of unit tests, refactor mercilessly, and try to do the simplest thing.  I’m curious to see where that will take me.  The first thing I’m going to implement is the ability to store 1880 US Federal census records.  I know I want to use SQLite, so I’ll be using that out of the gate.  I’ll likely wind up using NHibernate as well, but I’m going to hold off adding it until I get something working (I’ve never used it, and I’ve got enough new things to learn).

I’ve only begun to explore WPF, so I’m sure the first incarnation will be ugly, smelly code, but hopefully I can refactor it into something decent.  In the very first iteration, I’m not even going to try to maintain a clean separation of concerns; I’m just going to hit the database directly, which will make writing unit tests nearly impossible.  That will be one of the first things I’ll need to fix.

Initially, the app may look and feel a bit like a WPF version of Clooz, but before I get too far, I’ll want to start hooking the evidence to the conclusions to fully realize the design I laid out in my previous post.
