doug-swisher.net

August 9, 2008

Genealogy data model from 30,000 feet

Filed under: Genealogy — Doug @ 2:18 pm

Here is an overview of the genealogy model that I’m proposing.  There are still many details to be worked out, but this is a good place to start the discussion.

[Diagram: Genealogy data model from 30,000 feet]

The administration section contains all the stuff that relates more to the process or the program, and less to the genealogy itself.  Examples (a rough class sketch follows the list):

  • Searches – records of searches done on genealogical sources.  For example: on 23-Aug-2006, the 1850 census for Champaign County, Ohio, was searched for Smith and Sheue surnames.  No extracts were found.
  • Tasks – a todo list.
  • Revision History – records of changes to the data (imports, modifications, etc.)
  • Surety Schemes – part of the GenTech Data Model.  A surety scheme defines ways to classify the “quality” of a source/extract.
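To make those a little more concrete, here is a minimal sketch of what the administration records might look like as classes.  None of these names or fields come from the GenTech Data Model; they are just my own strawman for discussion.

using System;
using System.Collections.Generic;

// Strawman administration entities; illustrative only.
public class SearchRecord
{
    public DateTime SearchedOn { get; set; }     // e.g. 23-Aug-2006
    public string SourceSearched { get; set; }   // e.g. "1850 census, Champaign County, Ohio"
    public string SurnamesSought { get; set; }   // e.g. "Smith, Sheue"
    public bool ExtractsFound { get; set; }      // a negative search is still worth recording
}

public class TaskItem
{
    public string Description { get; set; }
    public bool Completed { get; set; }
}

public class RevisionEntry
{
    public DateTime ChangedOn { get; set; }
    public string Description { get; set; }      // import, modification, etc.
}

public class SuretyScheme
{
    public string Name { get; set; }
    public IList<string> Levels { get; set; }    // ordered "quality" levels for sources/extracts
}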

The “other family trees” section needs a better name, but the gist is that you should be allowed to import other family trees (via GEDCOM, for example) and link them into your tree as a “guide”.  What do I mean by “guide”?  Well, many programs treat a GEDCOM file as hard evidence, and import it into the conclusions section.  The problem is that you have no idea whether the person who constructed the tree followed sound research techniques, or just threw a bunch of names into a hat.  The information contained in these trees can still be useful as clues for building up your tree, so the model allows them to be imported and linked into the conclusions to help guide your searches.  They will not carry the same weight as evidence, however.

That said, the tool should allow importing a high-quality GEDCOM file (or other format) and applying the data as evidence.  There are problems with the GEDCOM format, however, that make this fraught with peril.  (More about that in an upcoming post.)

The evidence section contains lists of sources and the data extracted from those sources.  Examples (again, a rough class sketch follows the list):

  • Source – a document that contains information relevant to the family history in some way.  This might be a census, land deed, newspaper clipping, letter, interview transcript, etc.  The GenTech Data Model provides the ability to build a hierarchy of sources, which would be useful for things like a census, where one would start at the Federal level, followed by States, then Counties, etc.
  • Image – a copy of a source page – scanned document, microfilm image, screen shot of an online database, etc.  It would probably also be handy to store text documents, but I’m not sure where those fit into the model as yet.
  • Extract – The data extracted from a specific source page (or set of pages).  For example, a census record includes many households, and some households may span more than one page.  An extract would contain all the information about one household, broken down into facts.  (Much more about this in an upcoming post.)
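As before, here is a strawman of the evidence entities.  The self-referencing Parent property is what gives the source hierarchy (Federal census, then State, then County); everything else is illustrative only.

using System.Collections.Generic;

// Strawman evidence entities; illustrative only.
public class Source
{
    public string Title { get; set; }
    public Source Parent { get; set; }                 // null at the top of the hierarchy
    public IList<Source> Children { get; } = new List<Source>();
    public IList<Image> Images { get; } = new List<Image>();
}

public class Image
{
    public string FilePath { get; set; }               // scanned document, microfilm image, screen shot
    public string Description { get; set; }
}

public class Extract
{
    public Source Source { get; set; }                 // the page (or pages) it was taken from
    public IDictionary<string, string> Facts { get; }
        = new Dictionary<string, string>();            // e.g. "Name" -> "John Smith", "Age" -> "32"
}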

The conclusions section contains assertions made by the researcher about the evidence, and allows another researcher to analyze their reasoning.  This is done by creating a Persona and linking it to individuals in evidence records (or individuals in other family trees).  For example, you could create a persona and tie it to the “John Smith” that appears in the 1850 Ohio Census, and then tie it to the “John Smith” that appears in the 1880 Iowa Census, along with the reasoning for doing so.  This essentially asserts that the 1850 and 1880 individuals are one and the same.  As more instances of this individual are found, they would be tied to the same persona.

The GenTech Data Model does this by creating multiple personas and layering them on top of each other, built up by assertions.  In my mind, it would work just as well to create one persona and tie it to multiple individuals in the evidence.  This will make it easier to do things like display a pedigree or family group sheet.  In the GenTech scheme, unraveling the layering of assertions is expensive (from a computing standpoint).  It was also unclear to me how they expected to handle a revision where a lower-level assertion was removed (a reinterpretation of the evidence), as this would affect all the assertions layered on top of it.
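Here is a strawman of that single-persona approach.  The names are mine, not GenTech’s, and the link target is deliberately loose, since an individual might live in an extract or in an imported “guide” tree.

using System.Collections.Generic;

// Strawman conclusions entities; illustrative only.
public enum LinkKind { Is, IsNot }   // IsNot covers the "negative" assertions discussed below

public class Persona
{
    public string DisplayName { get; set; }       // e.g. "John Smith (b. abt 1820)"
    public IList<PersonaLink> Links { get; } = new List<PersonaLink>();
}

public class PersonaLink
{
    public LinkKind Kind { get; set; }
    public object Target { get; set; }            // an individual from an extract, or from
                                                  // an imported "guide" tree
    public string Reasoning { get; set; }         // the researcher's justification
}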

What is the difference between evidence and conclusions?  Evidence is not subject to interpretation, unlike conclusions.  Well, that’s not entirely true, as extracting the evidence from a source relies on interpreting the handwriting.  The interpretations at the conclusion level are much broader, however.  For example, how can one conclude that someone in the 1880 census is really the same person as someone in the 1850 census?  That sort of conclusion is difficult, and the reasoning behind it needs to be accessible to later researchers (or to the same researcher a few years down the road).

One other thing to wrap up this post: not only is it possible in the conclusions to state that a persona is the same person as someone in the evidence, it is also possible to state that they are not the same person.  For example, we could tie a persona to our 1850 individual, and then note that this persona is not the same as the individual in the 1880 census.  This “negative” information is just as valuable as the “positive” match information.
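Using the strawman types from above, recording that negative conclusion is just another link with its own reasoning attached:

// Continuing the strawman above.  The two Extract objects stand in for
// the individuals pulled from the two census records.
var smith1850 = new Extract();   // John Smith as he appears in the 1850 census
var smith1880 = new Extract();   // John Smith as he appears in the 1880 census

var persona = new Persona { DisplayName = "John Smith (b. abt 1820)" };

// The "positive" assertion: this persona IS the 1850 individual.
persona.Links.Add(new PersonaLink
{
    Kind = LinkKind.Is,
    Target = smith1850,
    Reasoning = "Age, birthplace, and household members are consistent."
});

// The "negative" assertion: this persona is NOT the 1880 individual.
persona.Links.Add(new PersonaLink
{
    Kind = LinkKind.IsNot,
    Target = smith1880,
    Reasoning = "Birthplace and age conflict with the 1850 record."
});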

August 7, 2008

Yet another genealogy tool?

Filed under: Genealogy — Doug @ 3:05 am

Yes, the project I’m currently poking away at is yet another genealogy tool.  There are a boatload of them already on the market (free, open source, commercial), so why does the world need another one?

Most “family tree” tools are just that – they help you create a pretty family tree.  I’ve used many of them, and the one thing they seem to lack is a nice way of dealing with all the research that may or may not be related to the family tree or, even worse, that conflicts with the data in the family tree.

Don’t get me wrong.  There are some very well done tools out there: Legacy, The Master Genealogist, Family Tree Maker, plus many more.  They don’t quite do what I want them to do, however.  (Or at least, not in the way I’d like to be able to do it.)

Let’s say, for example, that you’ve found your ancestor in the 1850 census and the 1880 census.  The 1850 census shows a birthplace of Ohio, but the 1880 census shows a birthplace of Pennsylvania.  Which one is correct?  Most family tree programs force you to enter one or the other (TMG being a notable exception).  What if one of those individuals isn’t in fact your ancestor after all?  How do you document which one is correct, but still retain the data for the person who is not your ancestor?  Most allow you to shoehorn in the data, but it isn’t always prominent – it gets buried down in notes or other fields away from the main screens.

At the other end of the spectrum is another nifty tool called Clooz.  This gives you the ability to store all your genealogical information (census records, birth certificates, etc) and quickly and easily search through them.  The part that it lacks, however, is a way to tie together these disparate facts into a family history.

I’ve made a couple of aborted attempts to create my own genealogy tool based on the GenTech data model (GDM).  The GDM works very hard to separate evidence, conclusions, and administration within the model.  The lack of clear separation is one of the things that frustrates me about the models of many existing tools.  The way the GDM models conclusions uses a concept called assertions, which can be layered on top of each other.  It is a very powerful concept, but I could never figure out a way to extract a family tree from the model.  That led to at least two projects being aborted.  This time around, I’m not going to try to follow the GDM.  I’ll use it as a guide, but I’ll go my own way as needs dictate.

There are a number of new technologies and tools that I think give this new incarnation a shot at being successful.  These include new development methodologies, such as Behavior Driven Development (BDD), which is an amalgam of Test Driven Development (TDD) and Domain Driven Design (DDD); new tools, such as NHibernate, iBATIS, SQLite, and Windows Presentation Foundation (WPF); and new patterns/models, such as Aspect-oriented programming (AOP), Inversion of Control/Dependency Injection, and DataModel-View-ViewModel.

This project will give me the opportunity to explore and play with all those tools and technologies.  Right now, I’m reading Jimmy Nilsson’s book Applying Domain-Driven Design and Patterns: With Examples in C# and .NET, while trying to work my way through Charles Petzold’s book Applications = Code + Markup: A Guide to the Microsoft Windows Presentation Foundation (I learned Windows 2.0 from his first Windows book – yeah, I’ve been writing code for a while).

I hope to start playing with some code before too much longer, and I’ll blog about it here.

August 6, 2008

My new blog

Filed under: Diary — Doug @ 9:45 pm

Welcome to my new blog.

I tend to work on a handful of projects.  When I get tired of one (or stuck on it), I move on to the next project.  Eventually, I’ll come full circle and pick up an old project again.

This blog has a couple of purposes.  First, it will let me document bits and pieces of the projects as I go, making it easier to pick them up later.  Writing down the bits and pieces helps me think through them, so this blog is a bit of a sounding board.  I’ll often post links to tools/frameworks/code that I’ve found so I can find them later.  I may even post some drivel about games I’m playing, books I’ve read, or whatever strikes my fancy.

