GEDCOM Universally Unique Identifier (UUID)

Post by **tatewise** » 10 Aug 2021 10:31

This discussion continues from the Gedcom Import from RM (19701) topic discussing _UID.

I had not realized the _UID was a universal id used by several products.
The article https://www.tamurajones.net/The_UIDTag.xhtml explains its use, but unfortunately the FamilySearch PDF document https://devnet.familysearch.org/docs/ge ... ifiers.pdf appears to no longer exist, so the calculation algorithm is undocumented.

Post by **Mark1834** » 10 Aug 2021 11:28

I suspect it is what RM uses to sync with Ancestry, as that appears to be completely reliable no matter how sparse the data on an individual. Shosh Kalson could probably comment further, as I remember her telling me that she has calculated _UID values by plugin, but she’s not been active on the forum for a while.

If the standard is no longer in the public domain (policy or error?), it probably would make it more difficult for FH to officially support it in the future.

Post by **tatewise** » 10 Aug 2021 11:57

I have found references to Universally Unique Identifier (UUID) such as:
https://en.wikipedia.org/wiki/Universal ... identifier
https://datatracker.ietf.org/doc/html/rfc4122
http://savage.net.au/Perl-modules/html/ ... /uuid.html
A Google search for 'GEDCOM Universally Unique Identifier (UUID)' lists others.
However, none of them seems to explain what GEDCOM Individual record details are used to create a UUID.
The last reference above actually gives examples for 1 NAME Alfred E. /Newman/ and 1 NAME Alfred Einstein /Newman/ that appear to produce two different UUIDs for effectively the same person, so I don't understand how it is unique.
e.g.
If there are two 1 NAME John /Smith/ records for two different people, how do two different UUIDs get calculated?
How do independent products calculate the same UUIDs for those two people?

The references suggest the MD5 algorithm is involved, but it must need more than just the NAME to produce different UUIDs for people with identical names.
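
To illustrate the question above, here is a minimal Python sketch using the standard uuid module. It assumes, purely for illustration, that a name-based (version 3, MD5) UUID is derived from the NAME value alone under a stock namespace; nothing in the references confirms that any genealogy product works this way. Under that assumption, two spellings of one person get different UUIDs and two different John Smiths get the same one, whereas random (version 4) UUIDs differ on every call.

```python
import uuid

# Assumption: a stock namespace chosen only for this illustration.
NAMESPACE = uuid.NAMESPACE_URL


def name_based_uid(name: str) -> uuid.UUID:
    """Version 3 (MD5, name-based) UUID derived from the NAME value only."""
    return uuid.uuid3(NAMESPACE, name)


# Two spellings of effectively the same person.
alfred_short = name_based_uid("Alfred E. /Newman/")
alfred_long = name_based_uid("Alfred Einstein /Newman/")

# Two different people who happen to share a name.
john_one = name_based_uid("John /Smith/")
john_two = name_based_uid("John /Smith/")

# Random (version 4) UUIDs ignore the record details entirely.
random_one = uuid.uuid4()
random_two = uuid.uuid4()

print("Same person, different spelling, same UUID?", alfred_short == alfred_long)
print("Different people, same name, same UUID?", john_one == john_two)
print("Two random UUIDs equal?", random_one == random_two)
```

So hashing the NAME alone cannot tell two John Smiths apart, which is consistent with the suspicion that more than the NAME would be needed.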

Post by **Mark1834** » 10 Aug 2021 13:00

Re-reading Tamura’s article, it appears to demonstrate that FH4 supported _UID generation, but that this was subsequently dropped by CP. I wonder why?

Post by **tatewise** » 10 Aug 2021 13:12

Tamura's article is misleading. It simply lists products with GEDCOM he has found that happen to contain _UID tags.
He says: "The above list was created by googling for GEDCOM files that contain the _UID tag."
FH has always 'supported' _UID tags, imported as UDF from other products, but has never computed UUID.
FH does not even allow _UID tags (or any other UDF) to be created unless you manually edit the GEDCOM file.

Post by **Ron Melby** » 10 Aug 2021 13:40

There is a difference between UID, UUID, and GUID.

Fundamentally, each is a random number generator with some sifting and gluing. Unfortunately, UUID and GUID are often used interchangeably.

As they are used now, let's say I have a system for generating UUIDs (think of them as randomized hashes from some de minimis generator, maybe based on the 4th through 8th digits of os.clock). Now let's say we are really good buddies, I give you my UUID code, and we both use it in our plugins. The chances that we both generate a UUID on our separate systems at exactly the same 4th-8th tick of the clock, with processors and hardware at the same speed, etc., and so come up with identical UUIDs, are skinny to say the least. The fact that all of us in the world run the same code increases the chances, but they are still reasonably skinny. We are working with machines: the chances of any one of us coming up with a UUID of *GASP* nil are not 0.0000000000000, nor are the chances of our UUID generators coming up with two equal UUIDs over time 0.0.

Next, GUIDs. Microsoft, or let's say Linux, or IBM, etc., can call them GUIDs and not be far wrong, because if you are running on their machines, they are the only game in town and control the UUIDs. Networks somehow have to use remote procedure calls, with (let's call it) the originating host running the show. So Mike says, "Network, I want to play a game", and the network says, "OK Mike, you are 3, your link-layer protocol packets are 18, your ICMPs are 4", and so on. Ron says, "Network, I want to play a game", and I get my numbers.

Now, insofar as GEDCOM is concerned, the UUIDs are made at, let's say, FamilySearch, Ancestry, MyHeritage, or whatever (with linked lists and all that, and a lot more computing power at those companies than we have in total on this board). Mike has a Herbert Muckenfutch in his tree and, as usual, has done it better, faster, more elegantly and more simply than I, and shot his Herb to FamilySearch before me. I get an alert that says: here is Herbert Muckenfutch, we see all these people around him, is this your relative? I go yeah, click on it, and say I would like the _UID to be FE0000...0. Nope, they say, Mike has it at EE98...0 and that's the number we gave him. Our _UIDs never match unless we agree on the same guy, so even in our own GEDCOMs, when we share privately, they don't. The only way to make truly global unique IDs for INDIs or FAMs or whatever structures is to have something like a Uniform Resource Locator for GEDCOM _UIDs.

Post by **NickWalker** » 10 Aug 2021 13:47

I guess this might be the missing PDF:
https://chronoplexsoftware.com/gedcomva ... 20UIDs.pdf
I think the idea is that as a record is created (even if just a blank record with no name), a unique (across the world) ID should be generated - similar to the GEDCOM ID created currently except that's only unique to your GEDCOM file. The PDF suggests that as records are merged all the corresponding unique IDs should also be merged, so an individual is likely to have multiple unique identifiers as they are imported into different systems.

It all sounds very unlikely to ever be practical in the real world to me.

Cheers

Nick

Post by **Jane** » 10 Aug 2021 13:53

As far as I know, _UID originated in PAF with the old FamilySearch submissions, which I believe were dropped when the new integration APIs were released. Ancestry uses _APID

e.g.

_APID 1,8782::55373184

In GEDCOM 7, UIDs are supported as optional items rather than as an extension as they are in 5.5.1; see https://gedcom.io/specs/ (PDF page 84), which also contains links to the computation routine.

I am with Nick on its usefulness at this point.

Post by **Mark1834** » 10 Aug 2021 14:33

tatewise wrote: ↑10 Aug 2021 13:12 FH has always 'supported' _UID tags, imported as UDF from other products, but has never computed UUID.
FH does not even allow _UID tags (or any other UDF) to be created unless you manually edit the GEDCOM file.

I don't think that is exactly true. FH seems to generate a single _UID tag for each project in the format Tamura described, even when it is created from first principles rather than imported. It's not clear to me exactly what purpose it serves.

I agree that generating a unique tag for different representations of the same individual in different applications would be a challenge, but it is a pity that apps don't fully respect and support each other's _UID. If they did, you could always be confident that the same individual is identified correctly, no matter how many times the GEDCOM is imported, modified, and exported again.

I guess in the real world, the different commercial apps are not here to make our life easier - they exist to succeed and make money for their authors, even if those authors are enthusiasts such as Simon Orde or Bruce Buzbee.

Post by **tatewise** » 10 Aug 2021 15:03

Now I am even more confused!
BTW: I didn't think we were talking about the _UID in the Header but the _UID for each Individual record.
The article https://www.tamurajones.net/The_UIDTag.xhtml by Tamura says the following under the headings "UUID" and "PAF GEDCOM _UID: checksum" (my emphasis):

UUID

The cross-reference identifiers are unique within a single GEDCOM file, but different GEDCOM files are very likely to use the same identifiers for different records. For some purposes it would be nice to have truly unique identifiers. That requires two things: some way to create globally unique identifiers, and a GEDCOM tag to carry that identifier. The Universally Unique ID (UUID) is that globally unique identifier, and _UID is the tag that carries it.

Microsoft developers know the UUID as the Globally Unique IDentifier (GUID). What makes UUIDs so suitable is that they were developed precisely so that everyone can generate UUIDs, without coordinating with anyone else, and still be practically sure that the generated identifier is unique.
A UUID is a 128-bit (16-byte) number, generally represented by 32 hexadecimal digits, divided into five groups, separated by hyphens, like this: 12345678-1234-1234-1234-123456789ABC.

PAF GEDCOM _UID: checksum
0 @I1@ INDI
1 NAME /One/
2 SURN One
1 SEX M
1 _UID 92FF8B766F327F48A256C3AE6DAE50D3A114
1 FAMC @F1@
1 CHAN
2 DATE 5 May 2011
3 TIME 14:42:00
This is what a UUID looks like in a PAF GEDCOM. Notice that the _UID value is not divided into groups separated by hyphens, but represented as one long hexadecimal number. What is not immediately obvious because of its length is that the number shown is not a UUID. It cannot be a UUID, because a UUID is 32 hexadecimal digits long, and the _UID value is 36 hexadecimal digits long. When we hyphenate the number for readability, we find that there are four extra digits: 92FF8B76-6F32-7F48-A256-C3AE6DAE50D3-A114. The _UID value is a UUID followed by a checksum.

That little fact, essential to making sense of the _UID value, used to be undocumented. I once figured that out myself, but you don't have to do so, nor take my word for it. Nowadays, FamilySearch documents it, if you know where to look. The FamilySearch document GEDCOM Unique Identifiers states that it provides guidelines for the use of UUIDs, but it actually documents the format of the PAF _UID value; a 32-hexadigit UUID value followed by a 4-hexadigit checksum. That brief document includes some Windows C code, possibly the actual PAF source code. That code shows how to calculate the checksum.

PAF is a fork of Ancestral Quest, so it is no wonder that Ancestral Quest uses the same format. But it is not just Ancestral Quest that uses this format. This UUID format is the most popular one. Other applications that use the same UUID format for their _UID tag are Family Origins for Windows, Family Tree Heritage, Family Tree Legends, Genbox Family History, Legacy Family Tree, Reunion and RootsMagic.

So that suggests the _UID tag carries the Universally Unique ID (UUID) as a globally unique identifier, or have I read it wrong?
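
For anyone wanting to check a PAF-style _UID value, here is a rough Python sketch. The checksum rule used (two running byte sums, each kept modulo 256) is an assumption rather than a quotation from the FamilySearch document, although it does reproduce the A114 checksum of the sample value in the quoted article. The function names are made up for this illustration.

```python
import uuid


def paf_checksum(uuid_hex: str) -> str:
    """Four hex digits computed from the 16 UUID bytes.

    Assumed rule: 'a' is a running sum of the bytes and 'b' is a running
    sum of 'a', both kept modulo 256.
    """
    data = bytes.fromhex(uuid_hex)
    if len(data) != 16:
        raise ValueError("expected 32 hexadecimal digits")
    a = b = 0
    for byte in data:
        a = (a + byte) & 0xFF
        b = (b + a) & 0xFF
    return f"{a:02X}{b:02X}"


def split_uid(value: str) -> tuple[uuid.UUID, str]:
    """Split a 36-digit _UID value into its UUID and its trailing checksum."""
    if len(value) != 36:
        raise ValueError("expected 36 hexadecimal digits")
    return uuid.UUID(hex=value[:32]), value[32:].upper()


def uid_is_consistent(value: str) -> bool:
    """True when the trailing four digits match the computed checksum."""
    parsed, checksum = split_uid(value)
    return paf_checksum(parsed.hex) == checksum


sample = "92FF8B766F327F48A256C3AE6DAE50D3A114"  # value from the PAF example
parsed_uuid, stored_checksum = split_uid(sample)
print(parsed_uuid)                # hyphenated 8-4-4-4-12 form
print(stored_checksum)            # trailing four digits
print(uid_is_consistent(sample))  # checksum verification
```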

Post by **KFN** » 10 Aug 2021 15:10

UUID generators need a seed, and many use a random number or a timestamp. These UUIDs are not useful when applied to people in a genealogy (for use in cross-database associations) because they are not connected to any information that would help create the same UUID in multiple (unlimited) genealogies!

Post by **Mark1834** » 10 Aug 2021 15:11

I read it as essentially a random number to ensure that the record created in a specific GEDCOM file is globally unique. So it stays with that individual as files are imported and exported between applications, but creating the same person in two different databases gives two distinct _UID values. I have confirmed that point by adding the same details to two separate copies of the same RM database. Each new instance of that individual has a distinct _UID, even if generated by the same app.

Post by **Mark1834** » 10 Aug 2021 15:13

I think that is the key point - the _UID uniquely tracks an individual between databases, but it is not a tool that ensures that Fred Bloggs in one database is the same as another Fred Bloggs if the two individuals were created independently of each other.
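
The behaviour described in the last two posts can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the Individual class and the export_import helper are hypothetical, and a random version 4 UUID stands in for whatever RM or any other app actually generates.

```python
import uuid
from dataclasses import dataclass, field


def new_uid() -> str:
    """Random (version 4) UUID as 32 uppercase hex digits, no hyphens."""
    return uuid.uuid4().hex.upper()


@dataclass
class Individual:
    """Hypothetical record: a name plus a _UID assigned once, at creation."""
    name: str
    uid: str = field(default_factory=new_uid)


def export_import(person: Individual) -> Individual:
    """Simulate a GEDCOM round trip in which the receiving app keeps the _UID."""
    return Individual(name=person.name, uid=person.uid)


# The same details entered independently in two databases.
fred_in_db_a = Individual("Fred /Bloggs/")
fred_in_db_b = Individual("Fred /Bloggs/")

# The first record exported and then imported elsewhere.
fred_round_trip = export_import(fred_in_db_a)

print("Independent creations share a _UID?", fred_in_db_a.uid == fred_in_db_b.uid)
print("Round trip keeps the _UID?", fred_in_db_a.uid == fred_round_trip.uid)
```

The identifier tracks a record, not a historical person: it survives export and import but says nothing about two records that were created separately.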

Post by **Ron Melby** » 10 Aug 2021 15:14

_UID is the UUID (based on the RFC); that is correct.

GUID does not apply here, regardless of Microsoft calling it that; it's just like their .jpg for .jpeg.

UUIDs are unique in the main, within the areas we hang out in, but are not globally unique.

Post by **KFN** » 10 Aug 2021 15:28

Mark1834 wrote: ↑10 Aug 2021 15:11 I have confirmed that point by adding the same details to two separate copies of the same RM database.

Because the seed value is a random number or a time stamp, the instantiation/creation of a new Individual record will create a new UUID or GUID number. The only way to generate a proper Unique Identifier for an individual is to seed the generator with data that can make it unique, but also check for duplicates. In a single genealogy database a possible seed set could be: Full Birth Name, Birthdate, Birth Place, but of course this would not work since many times we don’t know these things when creating the Individual, or the data could change after creation.

In my view, a UUID in genealogy is not useful!

Post by **tatewise** » 10 Aug 2021 15:29

Ok that makes more sense ~ a globally unique random number that stays with an Individual record when exported and imported but accumulates more and more such numbers when records are synchronised in different product databases.
So an Individual record for a particular person could gain a _UID value from every product it passes through.

Post by **Mark1834** » 10 Aug 2021 15:37

No - GEDCOM 7 requires that any existing _UID is used in an unchanged form. That is the entire point about having a _UID!

Post by **Mark1834** » 10 Aug 2021 15:57

Re-reading the various source documents, I think we have gone down a rabbit hole by assuming that the UUID is unique to a given individual, no matter what source information we have and in which app the record first originated. None of the original documents claim that, and I doubt that it is even possible to do.

It is a globally unique (for all practical purposes) identifier that stays with a particular individual as files are exported and imported between applications. Think of it as a REFN tag on steroids. Whenever a new individual is created in a compliant app, a new UUID is created.

Within that context, why the negativity in some posts? What’s the downside in different apps employing a common standard for uniquely identifying individuals during export and import? It would also simplify enormously the syncing process for different researchers working on the same tree.

Post by **tatewise** » 10 Aug 2021 16:18

Mark said:
"GEDCOM 7 requires that any existing _UID is used in an unchanged form. That is the entire point about having a _UID!"
How does that work?
Imagine the Individual record in the database already has a UID (UID not _UID in GEDCOM 7).
An imported Individual record for the same person also has a UID and is synchronised/merged with the database record.
Which UID takes precedence and what happens if the record is exported back to its original product?
I think the specification allows for multiple UID values.

The GEDCOM 7 IDENTIFIER_STRUCTURE is allowed {0:M} copies in any one record. See Pages 31-33.
Each copy must contain one REFN tag, or one UID tag, or one EXID tag. See page 42.

Yes, any one UID value "should be used without modification" but multiple UID values are not precluded.
Page 84 says: "Multiple structures describing different aspects of the same subject would have different UID values."
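
As a rough illustration of a record carrying several identifiers, this Python sketch collects every UID (GEDCOM 7) or _UID (5.5.1 extension) value found at level 1 of each record. The sample records and the two hyphenated UID values are invented for the example, and the line parsing is deliberately naive (no CONT handling, no validation), so treat it as a sketch rather than a GEDCOM reader.

```python
from collections import defaultdict

# Invented sample data; only the _UID value comes from the PAF example.
SAMPLE = """\
0 @I1@ INDI
1 NAME Fred /Bloggs/
1 UID 11111111-2222-4333-8444-555555555555
1 UID 66666666-7777-4888-9999-aaaaaaaaaaaa
0 @I2@ INDI
1 NAME John /Smith/
1 _UID 92FF8B766F327F48A256C3AE6DAE50D3A114
"""


def collect_uids(gedcom_text: str) -> dict[str, list[str]]:
    """Map each record's cross-reference id to its level-1 UID/_UID values."""
    uids: dict[str, list[str]] = defaultdict(list)
    current = None
    for line in gedcom_text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            # Record lines look like "0 @I1@ INDI"; others (e.g. "0 HEAD") are skipped.
            current = tag if tag.startswith("@") else None
        elif level == "1" and current and tag in ("UID", "_UID") and len(parts) == 3:
            uids[current].append(parts[2])
    return dict(uids)


for xref, values in collect_uids(SAMPLE).items():
    print(xref, values)
```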

It sounds like your latest posting has reverted to what I said and counteracts your GEDCOM 7 statement.

Post by **NickWalker** » 10 Aug 2021 16:27

The universal identifier is created in such a way that, to all intents and purposes, the chances of two being generated with the same ID are so small as to not be an issue.

The problem with using the UID as it is currently is that you can't be certain that all the UIDs have been produced using the same methodology. So there would really need to be an agreement that all would be created in the same way. As Jane pointed out, this seems to be becoming more formalised in Gedcom 7 so perhaps there is a glimmer of hope that this might possibly become useful.

tatewise wrote: ↑10 Aug 2021 15:29 Ok that makes more sense ~ a globally unique random number that stays with an Individual record when exported and imported but accumulates more and more such numbers when records are synchronised in different product databases.
So an Individual record for a particular person could gain a _UID value from every product it passes through.

No - each product would use the same _UIDs, e.g. if an individual was created in FH and this file was opened in another product, it wouldn't generate a new UID. However, if, for example, you were to have two individuals in your file that turned out to be the same person, you would merge them and then the two UIDs would be attached to that one person. If you then shared your tree on-line, then presumably people would be able to merge your data into theirs and so more UIDs would be accumulated. Search facilities would then gradually be able to link together identical people, but in this scenario an individual record could potentially gain thousands of UIDs over time. I can imagine lots of individuals being wrongly merged over time and thus picking up lots of incorrect UIDs, which would then be difficult to unmerge. This is why I said I couldn't see it being practical, but it's certainly an interesting thing to think about!

Post by **tatewise** » 10 Aug 2021 16:35

Nick, I don't understand why you said "No" to my quote and then paraphrased it exactly with "people would be able to merge your data into theirs and so more UIDs would be accumulated".

Post by **NickWalker** » 10 Aug 2021 16:39

tatewise wrote: ↑10 Aug 2021 16:35 Nick, I don't understand why you said "No" to my quote and then paraphrased it with "so more UIDs would be accumulated".

It was because you said: "So an Individual record for a particular person could gain a _UID value from every product it passes through." This implied to me that you thought that each product would generate a new _UID for the individual, but that wouldn't be the case unless a merger took place. So actually the product isn't relevant (as long as they use the same methodology). Hardly any of my individuals would ever gain a second UID regardless of how many products I used, because I've very rarely merged any individuals in my file.

Post by **tatewise** » 10 Aug 2021 17:06

I did preface that with 'when records are synchronised in different product databases' and assumed it would be understood that the two synchronised copies each had an existing _UID created prior to synchronisation/merging.

Post by **Mark1834** » 10 Aug 2021 17:07

At the risk of quoting a recently discredited politician, I agree with Nick.

UUIDs generated via a common mechanism (UID in GEDCOM 7, _UID as the common tag today) will be excellent for import and export of the same person across apps because, once generated, the UUID stays with the individual. However, merging two different individuals (even if they represent the same historical person) into one is potentially more complicated. As far as I can tell, merging doesn't require that both UUIDs are retained, so perhaps GEDCOM 7 compliant apps will give the choice of discarding the superfluous UUID in order to retain identity, or keeping both and so creating a third person that is not identical to either of the two originals.

Added in edit - that is exactly how RM syncs individuals today. One record is designated as the master record, and keeps the same _UID, while the secondary record is discarded after merging.
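
The two merge behaviours discussed in this thread can be compared with a small Python sketch. Both functions are hypothetical and reduce a record to nothing more than its list of UIDs; the short UID strings are placeholders, and this is not how RM or any other product is known to implement merging.

```python
def merge_keep_all(primary: list[str], secondary: list[str]) -> list[str]:
    """Accumulating merge: the surviving record carries every distinct UID."""
    merged = list(primary)
    for uid in secondary:
        if uid not in merged:
            merged.append(uid)
    return merged


def merge_keep_master(primary: list[str], secondary: list[str]) -> list[str]:
    """Master-record merge: the secondary record's UIDs are discarded."""
    return list(primary)


# Placeholder identifiers standing in for real UUIDs.
mine = ["UID-A"]
theirs = ["UID-B"]
someone_else = ["UID-B", "UID-C"]

# Accumulating merges grow the list each time trees are combined.
accumulated = merge_keep_all(merge_keep_all(mine, theirs), someone_else)
print(accumulated)

# A master-record merge keeps the identity of the primary record only.
print(merge_keep_master(mine, theirs))
```

The accumulating variant keeps links to every contributing record but also keeps any wrongly merged UID, while the master-record variant stays simple at the cost of losing the secondary record's identity.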

Post by **Mark1834** » 10 Aug 2021 17:11

tatewise wrote: ↑10 Aug 2021 17:06 I did preface that with 'when records are synchronised in different product databases' and assumed it would be understood that the two synchronised copies each had an existing _UID created prior to synchronisation/merging.

That doesn't have to be the case. Syncing is just as likely (more likely?) to be different versions of what started life as the same database, say different researchers updating a common original and merging the changes.

It sounds like we are all more or less on the same page, but might be expressing things slightly differently or emphasizing different aspects of the same process. There's no point in arguing who used the best words, it's where we end up that is more important.

Family Historian User Group
