August 9, 2008

Genealogy data model from 30,000 feet

Filed under: Genealogy — Doug @ 2:18 pm

Here is an overview of the genealogy model that I’m proposing.  There are still many details to be worked out, but this is a good place to start the discussion.

[Diagram: Genealogy data model from 30,000 feet]

The administration section contains all the stuff that relates more to the process or the program, and less to the genealogy itself.  Examples:

  • Searches – records of searches done on genealogical sources.  For example: on 23-Aug-2006, the 1850 census for Champaign County, Ohio, was searched for Smith and Sheue surnames.  No extracts were found.
  • Tasks – a todo list.
  • Revision History – records of changes to the data (imports, modifications, etc.)
  • Surety Schemes – part of the GenTech Data Model.  It defines ways to classify the “quality” of a source/extract.
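
As a rough illustration of the Searches item above, a search record might look something like the following Python sketch. The class name, field names, and the `already_searched` helper are all assumptions made for this example; nothing here is taken from the GenTech Data Model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class SearchRecord:
    """One search of a genealogical source, including searches that found nothing."""
    searched_on: date
    source: str              # free-text description of the source (hypothetical field)
    surnames: List[str]
    extracts_found: int = 0  # a zero result is still worth recording


# The example from the list above: searched, but no extracts were found.
search_log = [
    SearchRecord(
        searched_on=date(2006, 8, 23),
        source="1850 census, Champaign County, Ohio",
        surnames=["Smith", "Sheue"],
    ),
]


def already_searched(log, source, surname):
    """Return True if this source has already been searched for this surname."""
    return any(r.source == source and surname in r.surnames for r in log)
```

Keeping the empty-handed searches is the point: checking `already_searched(search_log, "1850 census, Champaign County, Ohio", "Sheue")` before heading back to the microfilm saves repeating the same work.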

The “other family trees” section needs a better name, but the gist is that you should be allowed to import other family trees (via GEDCOM, for example) and link them into your tree as a “guide”.  What do I mean by “guide”?  Well, many programs treat a GEDCOM file as hard evidence, and import it into the conclusions section.  The problem is that you have no idea whether the person who constructed the tree followed sound research techniques, or just threw a bunch of names into a hat.  The information in these imported trees can still be useful as clues to build up your own tree, so the model allows them to be imported and linked into the conclusions to help guide your searches.  They will not carry the same weight as evidence, however.

That said, the tool should allow import of a high quality GEDCOM file (or other format) and apply its data to the evidence section.  There are problems with the GEDCOM format, however, that make this fraught with peril.  (More about that in an upcoming post.)
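
One way the guide-versus-evidence distinction might be carried in the data is a simple origin tag on each linked record. This is only a sketch under that assumption; the `Origin` enum, `LinkedRecord` class, and helper functions are invented names for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    EVIDENCE = "evidence"  # extracted from a source by the researcher
    GUIDE = "guide"        # taken from an imported tree of unknown quality


@dataclass
class LinkedRecord:
    description: str
    origin: Origin


def evidence_only(records):
    """Records that may be used to support a conclusion."""
    return [r for r in records if r.origin is Origin.EVIDENCE]


def clues(records):
    """Records that only suggest where to search next."""
    return [r for r in records if r.origin is Origin.GUIDE]


records = [
    LinkedRecord("John Smith, 1850 Ohio Census", Origin.EVIDENCE),
    LinkedRecord("John Smith, imported GEDCOM tree", Origin.GUIDE),
]
```

With a tag like this, a report could list the imported tree as a lead to follow up without ever counting it as support for a conclusion.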

The evidence section contains lists of sources and the data extracted from those sources.  Examples:

  • Source – a document that contains information relevant to the family history in some way.  This might be a census, land deed, newspaper clipping, letter, interview transcript, etc.  The GenTech Data Model provides the ability to build a hierarchy of sources, which would be useful for things like a census, where one would start at the Federal level, followed by States, then Counties, etc.
  • Image – a copy of a source page – scanned document, microfilm image, screen shot of an online database, etc.  It would probably also be handy to store text documents, but I’m not sure yet where those fit into the model.
  • Extract – The data extracted from a specific source page (or set of pages).  For example, a census record includes many households, and some households may span more than one page.  An extract would contain all the information about one household, broken down into facts.  (Much more about this in an upcoming post.)
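
To make the evidence list above a little more concrete, here is a minimal Python sketch of a source hierarchy and an extract broken down into facts. The class and field names are assumptions for illustration and are not taken from the GenTech Data Model; the page numbers and fact values are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    title: str
    parent: Optional["Source"] = None  # None marks the top of the hierarchy

    def path(self) -> str:
        """Full path from the top-level source down to this one."""
        if self.parent is None:
            return self.title
        return self.parent.path() + " > " + self.title


@dataclass
class Fact:
    subject: str  # the individual as named in the source
    kind: str
    value: str


@dataclass
class Extract:
    source: Source
    pages: List[str]  # a household may span more than one page
    facts: List[Fact] = field(default_factory=list)


# Federal level, then state, then county.
federal = Source("1850 U.S. Federal Census")
ohio = Source("Ohio", parent=federal)
champaign = Source("Champaign County", parent=ohio)

# One household; the pages and facts below are made-up placeholders.
household = Extract(
    source=champaign,
    pages=["p. 12", "p. 13"],
    facts=[
        Fact("John Smith", "age", "34"),
        Fact("John Smith", "birthplace", "Ohio"),
    ],
)
```

Here `champaign.path()` walks up the parent links and yields `1850 U.S. Federal Census > Ohio > Champaign County`.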

The conclusions section contains assertions made by the researcher about the evidence, and allows another researcher to analyze their reasoning.  This is done by creating a Persona and linking it to individuals in evidence records (or individuals in other family trees).  For example, you could create a persona and tie it to the “John Smith” who appears in the 1850 Ohio Census, and then tie it to the “John Smith” who appears in the 1880 Iowa Census, along with the reasoning for doing so.  This essentially asserts that the 1850 and 1880 individuals are one and the same.  As more instances of this individual are found, they would be tied to this same persona.

The GenTech Data Model does this by creating multiple personas and layering them on top of each other, built up by assertions.  In my mind, it would work just as well to create one persona, and tie it to multiple individuals in the evidence.  This will make it easier to do things like display a pedigree or family group sheet.  In the GenTech scheme, unraveling the layering of assertions is expensive (from a computing standpoint).  It was also unclear to me how they expected to handle a revision where a lower-level assertion was removed (a reinterpretation of the evidence), as this would affect all the assertions layered on top of it.
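
The single-persona approach could be sketched roughly as follows. This is only an illustration of the idea, with class and method names that are assumptions rather than anything from the GenTech Data Model, and with placeholder text standing in for the researcher's reasoning.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class EvidenceIndividual:
    name: str
    source: str


@dataclass
class Link:
    individual: EvidenceIndividual
    reasoning: str  # why the researcher believes this is the same person


@dataclass
class Persona:
    label: str
    links: List[Link] = field(default_factory=list)

    def link(self, individual, reasoning):
        """Assert that this evidence individual is the same person as the persona."""
        self.links.append(Link(individual, reasoning))

    def unlink(self, individual):
        """Retract one link; the remaining links are untouched."""
        self.links = [entry for entry in self.links if entry.individual != individual]

    def sources(self):
        """List the sources in which this persona is believed to appear."""
        return [entry.individual.source for entry in self.links]


john_1850 = EvidenceIndividual("John Smith", "1850 Ohio Census")
john_1880 = EvidenceIndividual("John Smith", "1880 Iowa Census")

persona = Persona("John Smith")
persona.link(john_1850, "placeholder: reasoning for the 1850 match")
persona.link(john_1880, "placeholder: reasoning for the 1880 match")
```

Because every link hangs directly off the one persona, reinterpreting the evidence is a matter of calling `persona.unlink(john_1880)`; there is no stack of layered assertions to unwind, and the 1850 link is unaffected.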

What is the difference between evidence and conclusions?  Evidence is not subject to interpretation, unlike conclusions.  Well, that’s not entirely true, as extracting the evidence from a source relies on interpretation of the handwriting.  The interpretations at the conclusion level are much broader, however.  For example, how can one conclude that someone in the 1880 census is really the same person as someone in the 1850 census?  That sort of conclusion is difficult, and the reasoning behind it needs to be accessible for later researchers (or the same researcher a few years down the road).

One other thing I’ll mention to wrap up this post: not only is it possible in the conclusions to state that a persona is the same person as someone in the evidence, but it is also possible to state that a persona is not the same person as someone in the evidence.  For example, we could tie a persona to our 1850 individual, and then note that this persona is not the same as the individual in the 1880 census.  This “negative” information is just as valuable as the “positive” match information.
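
Positive and negative assertions could be stored side by side with a verdict field, as in this small Python sketch. The `Verdict` enum, `Assertion` class, and helper functions are hypothetical names chosen for the example, and the reasoning strings are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SAME = "same person"
    NOT_SAME = "not the same person"


@dataclass
class Assertion:
    persona: str
    individual: str  # an individual as found in the evidence
    verdict: Verdict
    reasoning: str


assertions = [
    Assertion("John Smith", "John Smith, 1850 census", Verdict.SAME,
              "placeholder: reasoning for the match"),
    Assertion("John Smith", "John Smith, 1880 census", Verdict.NOT_SAME,
              "placeholder: reasoning for ruling this individual out"),
]


def matches(all_assertions, persona):
    """Evidence individuals asserted to be this persona."""
    return [a.individual for a in all_assertions
            if a.persona == persona and a.verdict is Verdict.SAME]


def ruled_out(all_assertions, persona):
    """Evidence individuals asserted NOT to be this persona."""
    return [a.individual for a in all_assertions
            if a.persona == persona and a.verdict is Verdict.NOT_SAME]
```

Recording the rejection along with its reasoning means a later researcher who stumbles on the same 1880 entry can see that it was already considered and why it was set aside.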

