Find Duplicate Individuals

Download this prototype Plugin via this link:
find_duplicate_individuals Version 2.2

This version:

  • Supports the Omit Non-Duplicates management feature.
  • Allows mismatched Lastnames to deduct points regardless of any other Name matches.
  • Treats Lastnames with space separated parts as a single Name, e.g. Van Dyke.
  • Disregards the case of Forenames so that John matches john.
  • Lets the proportional Chronology Magnitude setting be specified in Months instead of Years.
  • Moves the Restore Defaults and Set Interface Font buttons to the Set Preferences tab.
  • Adds some User Interface tab settings, and the points setting layout, to the Set Preferences tab.

Introduction

This Plugin attempts to find likely duplicated Individual Records by scoring points for matching Record data.

The user interface allows a subset of Individual Records to be chosen, both by selecting a specific set of Individuals, and by setting a Date threshold for the Updated date value. This allows a subset of, for example, similar Surname Records, or only recently Updated records to be checked. The previous run Date can be easily set as this threshold for recently Updated records.

In addition, any pairs of Individual Records placed in the Non-Duplicates List will be excluded, as explained under the Omit Non-Duplicates tab.

Any Individual Records placed in a Named List called Non-Duplicates will also be excluded for the time being, but this feature will be removed in the next version.

Each included Individual Record is checked against every Individual Record, so the only checks skipped are those between two excluded Individual Records.
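As an illustration only, the pairing rule described above might be sketched in Python as follows. This is not the Plugin's own code: the function name candidate_pairs, the integer record identifiers, and the representation of the Non-Duplicates List as a set of unordered pairs are all assumptions made for the example.

```python
from itertools import combinations


def candidate_pairs(all_ids, included_ids, non_duplicates=frozenset()):
    """Yield each unordered pair of records that should be compared.

    All three arguments are hypothetical inputs: the record identifiers,
    the chosen subset, and a set of frozenset pairs already marked as
    Non-Duplicates.
    """
    included = set(included_ids)
    for a, b in combinations(sorted(set(all_ids)), 2):
        if a not in included and b not in included:
            continue  # excluded versus excluded: skipped
        if frozenset((a, b)) in non_duplicates:
            continue  # pair previously placed in the Non-Duplicates List
        yield a, b


pairs = list(candidate_pairs([1, 2, 3, 4], included_ids=[1, 2],
                             non_duplicates={frozenset((1, 3))}))
print(pairs)  # [(1, 2), (1, 4), (2, 3), (2, 4)]
```

Records 3 and 4 are both excluded, so that pair is never checked, while each of them is still checked against the included records 1 and 2.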

There is an Enable Diagnostic Mode option that lists more Individual Records and more Column Data in the Result Set, optionally Including Timespan Dates as described below, and consequently runs much slower.

Names Assessment

The Plugin matches all Primary Name and Alternate Name fields, by comparing Surnames and every Name, Prefix, Suffix, Title, and Nickname both explicitly and by Soundex code. So Joseph Tom Henry SMITH is a good match with Henry tom JOSEPH-SMITH. However, 1 and 2 character initials and titles are ignored.

Surnames are always set to uppercase, so will only match other Surnames, but each part of a hyphenated Surname matches separately, e.g. SMITH matches SMITH above, and gains 7 points. Surnames with parts separated by spaces are treated as one name, e.g. VAN DYKE matches VANDYKE, and only gains 7 points.

Given Names etc. are set to lowercase, so only match other Given Names etc., e.g. Tom matches tom above, gaining 6 points, because both are the 2nd forename; Henry matches Henry above, gaining only 3 points, as their positions are different; but Joseph does NOT match JOSEPH.

Each name is converted to a Soundex code such as J210, and these are also matched, i.e. Joseph = J210 = JOSEPH, and gains 2 points. Any perfect name matches do not gain extra points for a Soundex match.

Thus the above two names score 18 points.

If either person's Name does not exist, then the score is 0 points.

If the score fails to reach a minimum value, then points may be deducted, but this is disabled by default.

If simply the Surnames do not match, then points may be deducted, but this is disabled by default.

To avoid overwhelming the results when there are many matches, the score is limited to 20 points.
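The default Name scoring described in this section can be sketched roughly as below. This is a Python illustration under stated assumptions, not the Plugin's actual code: the function names, the order in which matches are consumed, and the way positions are counted when short initials are ignored are all guesses, and the optional deductions (disabled by default) are left out. The Soundex helper follows the standard four-character algorithm.

```python
SURNAME, SAME_POSITION, OTHER_POSITION, SOUNDEX, NAME_LIMIT = 7, 6, 3, 2, 20

_CODES = {ch: d for d, letters in enumerate(
    ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1) for ch in letters}


def soundex(name):
    """Standard four-character Soundex code, e.g. Joseph -> J210."""
    letters = [c for c in name.upper() if c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
    if not letters:
        return ""
    out, prev = letters[0], _CODES.get(letters[0], 0)
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, 0)
        if code and code != prev:
            out += str(code)
        prev = code
    return (out + "000")[:4]


def split_name(forenames, surname):
    """Lowercase given names (with position), uppercase surname parts."""
    given = [(pos, n.lower()) for pos, n in enumerate(forenames.split())
             if len(n) > 2]                      # ignore 1-2 character initials
    family = [p.replace(" ", "").upper() for p in surname.split("-")
              if len(p.replace(" ", "")) > 2]    # VAN DYKE -> VANDYKE
    return given, family


def name_score(a, b):
    (given_a, family_a), (given_b, family_b) = a, b
    rest_b = [n for _, n in given_b] + list(family_b)  # unmatched names in b
    rest_a, score = [], 0
    for s in family_a:                     # exact Surname matches
        if s in rest_b:
            score += SURNAME
            rest_b.remove(s)
        else:
            rest_a.append(s)
    for pos, n in given_a:                 # exact Given Name matches
        if n in rest_b:
            score += SAME_POSITION if (pos, n) in given_b else OTHER_POSITION
            rest_b.remove(n)
        else:
            rest_a.append(n)
    codes_b = [soundex(n) for n in rest_b]
    for n in rest_a:                       # Soundex only for leftover names
        if soundex(n) in codes_b:
            score += SOUNDEX
            codes_b.remove(soundex(n))
    return min(score, NAME_LIMIT)


a = split_name("Joseph Tom Henry", "SMITH")
b = split_name("Henry tom", "JOSEPH-SMITH")
print(name_score(a, b))  # 18
```

For the two example names this gives 7 (SMITH) + 6 (tom, same position) + 3 (henry, different position) + 2 (joseph and JOSEPH by Soundex) = 18 points, matching the total quoted above.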

The name matching is performed not only for the pair of Individuals, but also for their Father, Mother, Spouse, and Child, although at present only the first instance of each of these Relatives is assessed. Thus the maximum score for five good Name matches is 100 points.

If the score reaches a threshold of 9 points for Individuals, or 5 points for Relatives, then their Events are also assessed.

Event Assessment

The Plugin matches BMD Events by Date, and also by Place Name and Place Soundex for each comma separated part.

Every Date is considered to have a Start Date and End Date timespan.

For any Date Period, Date Range, or Quarter Date these are the earliest and latest dates that could apply, so Between Mar 1888 and 1890 would start 1 Mar 1888 and end 31 Dec 1890. If only one date is supplied such as After 1 Mar 1888, then the other date is set 50 years away, so in this case the timespan would start 1 Mar 1888 and end 1 Mar 1938.

Any single Date is treated similarly, so 1 Feb 1777 starts & ends on 1 Feb 1777, whereas 1666 starts 1 Jan 1666 and ends 31 Dec 1666. If a single Date only defines the year and is Approximate, Calculated, or Estimated, then 5 years are added before and after, so 1666 (app) starts 1 Jan 1661 and ends 31 Dec 1671.
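The timespan rules above can be illustrated with a short Python sketch. It assumes dates have already been parsed into year, month, and day numbers; the function names are invented for the example, and the handling of 29 February when shifting by 50 years is a guess, since the text does not cover it.

```python
import calendar
from datetime import date

OPEN_ENDED_YEARS = 50   # span given to a one-sided date such as "After ..."
APPROX_YEARS = 5        # padding for approximate year-only dates


def earliest(year, month=None, day=None):
    """Earliest day a possibly partial date could mean."""
    return date(year, month or 1, day or 1)


def latest(year, month=None, day=None):
    """Latest day a possibly partial date could mean."""
    month = month or 12
    return date(year, month, day or calendar.monthrange(year, month)[1])


def shift_years(d, years):
    try:
        return d.replace(year=d.year + years)
    except ValueError:          # 29 Feb shifted into a non-leap year
        return d.replace(year=d.year + years, day=28)


def single(year, month=None, day=None, approximate=False):
    """Timespan of a single Date; year-only approximate dates are padded."""
    pad = APPROX_YEARS if approximate and month is None else 0
    return earliest(year - pad, month, day), latest(year + pad, month, day)


def between(first, second):
    """Timespan of a Date Range given as two (year[, month[, day]]) tuples."""
    return earliest(*first), latest(*second)


def after(year, month=None, day=None):
    """Timespan of an open-ended 'After' date."""
    start = earliest(year, month, day)
    return start, shift_years(start, OPEN_ENDED_YEARS)


single(1777, 2, 1)                 # 1 Feb 1777 to 1 Feb 1777
single(1666)                       # 1 Jan 1666 to 31 Dec 1666
single(1666, approximate=True)     # 1 Jan 1661 to 31 Dec 1671
between((1888, 3), (1890,))        # 1 Mar 1888 to 31 Dec 1890
after(1888, 3, 1)                  # 1 Mar 1888 to 1 Mar 1938
```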

When matching Events, if their Start Dates or End Dates match within 50 days, that gains 2 points each. This, for example, allows a single Date to match a Quarter Date well.

If the Date timespan overlaps, that also gains 2 points, so a good Date match gains 6 points in total.

If either Date is missing, then the score is 0 points.

If both Dates exist, but score 0 points, then 15 points are deducted.

Otherwise, any comma separated Place Name parts are matched like individual Names above. A Place Name match in the same part column gains 3 points, in a different column gains 2 points, and a Soundex code match gains 1 point.

So, with 3 comma separated Place Name parts this could add up to 9 points.

To avoid overwhelming the results when there are many matches, the score is limited to 10 points per Event.
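The Place Name points and the per-Event limit might be sketched like this in Python. The names place_score and event_score are hypothetical, date_points stands for an already computed Date score for the Event, and two points are interpretations rather than documented behaviour: exact matches are consumed before Soundex matches, and Places are not scored at all when the 15 point Date deduction applies. A small standard Soundex helper is included so the sketch runs on its own.

```python
EVENT_LIMIT = 10
_GROUPS = ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R")


def soundex(text):
    """Standard four-character Soundex code of the letters in text."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    digit = lambda c: next((str(i) for i, g in enumerate(_GROUPS, 1) if c in g), "")
    out, prev = letters[0], digit(letters[0])
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = digit(ch)
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]


def place_score(place_a, place_b):
    a = [p.strip().lower() for p in place_a.split(",")]
    b = [p.strip().lower() for p in place_b.split(",")]
    score = 0
    for col in range(min(len(a), len(b))):      # same part column: 3 points
        if a[col] and a[col] == b[col]:
            score += 3
            a[col] = b[col] = ""
    for col, part in enumerate(a):              # different column: 2 points
        if part and part in b:
            score += 2
            b[b.index(part)] = ""
            a[col] = ""
    codes_b = [soundex(p) for p in b if p]
    for part in a:                              # Soundex match: 1 point
        if part and soundex(part) in codes_b:
            score += 1
            codes_b.remove(soundex(part))
    return score


def event_score(date_points, place_a, place_b):
    if date_points < 0:
        return date_points                      # assumed: deduction stands alone
    return min(date_points + place_score(place_a, place_b), EVENT_LIMIT)


print(place_score("Leeds, Yorkshire, England", "Leeds, Yorkshire, England"))  # 9
print(event_score(6, "Leeds, Yorkshire, England", "Leeds, Yorkshire, England"))  # 10
```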

At present only the first instances of Birth, Baptism/Christening, Marriage, and Death/Burial events are assessed.

Thus if all four types of Event have good matching Dates and Places, then 40 points are awarded.

Event matching may be performed not only for the pair of Individuals, but also for their Father, Mother, Spouse, and Child, although at present only the first instance of each of these Relatives is assessed. Thus the maximum score for five good Event matches is 200 points, which together with 100 points for good Name matches gives a grand total of 300 points.

Date Chronology

To assist with Chronology checks, many missing Event Date timespans are Synthesised from other Events including those of relatives. If for example a person’s Birth Date is missing, then its timespan may be set to the 100 years preceding their Death, or the period of their Mother’s child bearing years. Synthesised Dates are NOT used in the Event Date checks described in the previous section, but they may be used to Synthesise other missing Event Dates.

The chronological order of Event Dates is checked, and if for example the Birth of one Individual comes after the Baptism/Christening of the other, or the Marriage of one comes after the Death/Burial of the other, then points are deducted. The checks extend to Relatives to see for example if one Individual was born before the other’s Mother or Father were of child bearing age.

1 point is deducted for each Year (or part Year) of difference between each pair of faulty Chronology Dates.

If more than 20 points are deducted, then the two Individuals being assessed are excluded from the results.
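As a rough Python illustration of the default Chronology deduction, the sketch below assumes each Event is held as a (start, end) timespan and that a fault exists when the earliest possible date of the Event that should come first falls after the latest possible date of the Event that should follow it. The function names and that fault test are assumptions, not the Plugin's own logic.

```python
from datetime import date

EXCLUDE_ABOVE = 20   # pairs losing more than this are dropped from the results


def years_or_part(earlier, later):
    """Years from earlier to later, counting any part Year as a whole one."""
    full = later.year - earlier.year
    if (later.month, later.day) < (earlier.month, earlier.day):
        full -= 1                                   # last year is incomplete
    exact = (later.month, later.day) == (earlier.month, earlier.day)
    return full if exact else full + 1


def chronology_penalty(ordered_pairs):
    """Sum the deduction over (should_be_first, should_be_later) timespans."""
    total = 0
    for first, later in ordered_pairs:
        if first[0] > later[1]:                     # order is impossible
            total += years_or_part(later[1], first[0])
    return total


birth_a = (date(1850, 1, 1), date(1850, 12, 31))    # Birth of one Individual
baptism_b = (date(1847, 3, 1), date(1847, 3, 1))    # Baptism of the other
penalty = chronology_penalty([(birth_a, baptism_b)])
print(penalty, penalty > EXCLUDE_ABOVE)  # 3 False
```

The Birth is at least 2 years and 10 months after the Baptism, which counts as 3 Years, so 3 points are deducted and the pair stays in the results.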

Other

If the Individuals are closely related then 5 points are deducted according to the size of their Generation Gap, but immediate family are excluded from the results. These family members are Spouses, Siblings, Parent/Child, and Grand-Parent/Grand-Child.

The Plugin deducts 10 points if the two Individual Records have a different Gender recorded, and deducts 10 points if the two 1st Child Records have a different Gender recorded.

Note that Child Count is no longer checked.

All the points and other parameters described above are default values. At the head of the Plugin script is a set of constant definitions that may be edited and adjusted to your preferred values.

Result Set

The better the match, the higher the score, and the best 100 positive scores are listed in a Result Set both as a percentage and as total points, along with the points awarded in the categories described above.

In the Result Set, hold down the Shift key and select two adjacent Individual Records, and then use Edit > Merge/Compare Records and click OK. The two records are then shown side by side, allowing the details to be analysed, matched/discarded, and merged/cancelled as necessary.

To exclude Candidates that have been assessed as definitely NOT Duplicates, use Add Selected Cell Records to Named List and place them in a Named List called Non-Duplicates. (Better Non-Duplicates Management may be added in a future version.)

The Plugin allows the previous Result Set to be redisplayed at any time, so the assessment of Candidates does not have to be completed in one session. Enable Diagnostic Mode and Including Timespan Dates may also be selected, even if they were not selected for the original Result Set.

Default Result Set

With Enable Diagnostic Mode selected, more Candidate Duplicates may be listed, even with negative scores, and more points sub-categories are shown. So the Names and Event scores are shown for the Father, Mother, Spouse, and Child. The Generations Up and Generations Down counts for determining close relatives are shown (see the =RelationCode() function for details). The Gender scores for both the Individual and Child are also shown.

Diagnostic Result Set

If Including Timespan Dates is selected, then both Individuals’ Event Date Timespans are listed, indicating whether they are Actual or Synthesised.

Timespan Result Set

Omit Non-Duplicates

The Omit Non-Duplicates tab lists the pairs of Candidate Duplicates from the Result Set, and allows any selected entry to be moved to a list of Non-Duplicates, which will subsequently be excluded from any future Result Set.

Either selectively or as a whole, entries in the list of Non-Duplicates may be removed.

The user interface for this new feature is quite basic, and may need refining in later versions.

Omit Non-Duplicate Tab

plugins/wip/find_duplicate_individuals.1343648858.txt.gz · Last modified: 2012/07/30 06:47 by tatewise
CC Attribution-Noncommercial-Share Alike 3.0 Unported