I agree completely with Lorna. It would really be nice if we could deduct points for non-matching parents, but not for non-matching spouses and children.
How does IntNamesDeduction work? Does the entire name have to mismatch in order to get the deduction?
Yes, that is correct. If both Names exist, and their matches result in zero points, then the deduction is applied. You can influence this with values of IntLastNameScore, IntForeNameRight and IntSoundexNames, which can be zero, and IntForeNameWrong that can even be negative.
Lorna asked
Could the scoring for the matching of parents’ names be handled separately from spouse and 1st child names?
and Bill asked
It would really be nice if we could deduct points for non-matching parents, but not for non-matching spouses and children.
Yes OK, I will add separate IntNamesDeductIndi=0, IntNamesDeductFath=-5, IntNamesDeductMoth=-5, IntNamesDeductSpou=0, IntNamesDeductChld=0 in the next Version.
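As a rough illustration of how the per-relative deductions might combine with the rule described earlier (deduct only when both Names exist and their match scores zero), here is a minimal Python sketch. The dictionary values are the defaults quoted in this post; the function name, its arguments and the use of Python are assumptions for illustration only, not the Plugin's actual code.

```python
from typing import Optional

# Defaults quoted in the post above; the dictionary layout is illustrative.
NAMES_DEDUCT = {
    "Indi": 0,
    "Fath": -5,
    "Moth": -5,
    "Spou": 0,
    "Chld": 0,
}

def apply_names_deduction(relative: str,
                          name_a: Optional[str],
                          name_b: Optional[str],
                          match_points: int) -> int:
    """Return the Names score for one relative (hypothetical helper)."""
    both_exist = bool(name_a) and bool(name_b)
    # The deduction applies only when both Names exist and scored zero.
    if both_exist and match_points == 0:
        return NAMES_DEDUCT[relative]
    return match_points
```

With these defaults a total mismatch of mothers' names would cost 5 points, while a total mismatch of spouses' names would cost nothing.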
Lorna asked
Would it be possible to separate the timespan for the 'After', 'Before', 'From', 'To' dates from the timespan for the 'Approximate'/'Calculated'/'Estimated' Year only Dates?
Yes OK, I will have IntDatesTimespan=50 and IntDatesVariance=5 for these two values respectively.
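One possible reading of these two settings, sketched in Python: each qualified Date is widened into a range of years before comparison. The direction in which each qualifier extends the range, the function name and the qualifier spellings are all assumptions; only the two default values come from this post.

```python
from typing import Tuple

INT_DATES_TIMESPAN = 50  # 'After', 'Before', 'From', 'To' (value from the post)
INT_DATES_VARIANCE = 5   # 'Approximate', 'Calculated', 'Estimated' (value from the post)

def year_range(qualifier: str, year: int) -> Tuple[int, int]:
    """Widen a qualified year into (earliest, latest); hypothetical helper."""
    q = qualifier.lower()
    if q in ("after", "from"):
        # Assumed: open-ended later dates extend forwards by the Timespan.
        return (year, year + INT_DATES_TIMESPAN)
    if q in ("before", "to"):
        # Assumed: open-ended earlier dates extend backwards by the Timespan.
        return (year - INT_DATES_TIMESPAN, year)
    if q in ("approximate", "calculated", "estimated"):
        # Assumed: imprecise dates spread both ways by the Variance.
        return (year - INT_DATES_VARIANCE, year + INT_DATES_VARIANCE)
    return (year, year)  # unqualified dates are taken as exact
```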
All these little tweaks are progressively slowing the Plugin down again, but is it still running fast enough, especially on the larger databases?
I have tried to incorporate many of the checks you have suggested to identify your Real Duplicates and eliminate the False Candidates. Are there any Real Duplicates the Plugin is still NOT finding? Are there any False Candidates the Plugin should be eliminating?
I still have the Non-Duplicates Management trick up my sleeve.
Mike Tate is researching the TATE and SCOTT family tree and all relations.
I'm still getting quite a lot of "Matches" with the same forenames but different Surnames and born in different places.
However the highest score I've got for anyone with 1.7 was 50 and most only score about 20 or lower. I had one genuine duplicate with identical names, and DOB within 1 year, that only got a score of 13, probably because there was no place info for one of them. Even the parents matched up.
John ~ Could you post the details of a few of the highest scoring False Candidates, and details of the low scoring Real Duplicates? A pair should score more than 13 if the Individual Names and both Parents match, unless there are also some mismatching Events.
With regard to your Save Result Set Wish List Request, I will add an option to re-display the previous Result Set.
I'm still getting a lot of non-duplicates on my report. Many of these are high on my report mixed in with the actual duplicates. Most of these would be eliminated or dropped way down if the mismatched parents' names didn't result in so many points.
Yes, that is correct. If both Names exist, and their matches result in zero points, then the deduction is applied. You can influence this with values of IntLastNameScore, IntForeNameRight and IntSoundexNames, which can be zero, and IntForeNameWrong that can even be negative.
Yes OK, I will add separate IntNamesDeductIndi=0, IntNamesDeductFath=-5, IntNamesDeductMoth=-5, IntNamesDeductSpou=0, IntNamesDeductChld=0 in the next Version.
If the new options you are adding work the same way, then will it really help all that much?
Since IntLastNameScore, IntForeNameRight, IntSoundexNames and IntForeNameWrong apply to all name matches, if I set them to zero won't that also impact matching of names for the individuals themselves and for their spouses and children?
I would really like to leave name matching alone for the individuals, their spouses, and children and only impact mothers' and fathers' names if possible.
Ideally, if the mothers' and fathers' surnames are not an exact match, I would deduct so many points that the individuals ended up with no points or negative points.
If the surnames were an exact match, then I would deduct some points if the forenames did not match.
Rather than these working only if the names are a total mismatch, they would work if the names were not a total match. So... we would be able to deduct points if one mother was named Julia Ann Snodgrass and one was named Ann Marie Henshaw for example. The common Ann should not prevent points from being deducted.
I think it would be possible to have a complete set of Name points for each Relative. e.g. IntIndiDeduction IntIndiMaximum IntIndiLastName IntIndiForeRight IntIndiForeWrong IntIndiSoundex
Of course if that could be done, that would be ideal.
Can this be set up so that the values can be negative (not just 0) so that deductions can be made if either the forenames are not exact matches or the surname is not an exact match (rather than only being able to deduct points if the entire name score is 0)?
Bill ~ What if there was an IntFathMinimum threshold instead of zero, below which the IntFathDeduction would apply? Thus, after all the existing Forenames and Surnames and Soundex scoring, the following would apply:
If the score <= IntFathMinimum (default 0) then the score becomes IntFathDeduction.
If the score >= IntFathMaximum (default 20) then the score becomes IntFathMaximum as now.
So, since a Father's Surname mismatch scores zero, and any Forename and Soundex matches could be say 2 points each, then an IntFathMinimum of say 6 points would result in the IntFathDeduction if there were three or fewer matches. The same scheme would apply for each Relative.
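The two rules above can be sketched in a few lines of Python. The defaults for Minimum and Maximum are the ones stated; the Deduction of -5 is borrowed from the IntNamesDeductFath value mentioned earlier, and the function name is hypothetical.

```python
def limit_names_score(score: int,
                      minimum: int = 0,
                      maximum: int = 20,
                      deduction: int = -5) -> int:
    """Apply the proposed Minimum/Maximum rules to one relative's Names score."""
    if score <= minimum:
        return deduction   # at or below the threshold: the Deduction replaces the score
    if score >= maximum:
        return maximum     # capped, as in the current version
    return score

# Surname mismatch (0) plus three matches at 2 points each gives 6 points,
# so a Minimum of 6 triggers the Deduction, while four matches (8) would not.
print(limit_names_score(6, minimum=6))
print(limit_names_score(8, minimum=6))
```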
Lorna ~ I am looking at making the Chronology Mismatch Deduction proportional to the discrepancy. So the more that Chronology Date Checks differ, then the greater the points deducted. Thus a discrepancy of up to 10 years would only deduct 1 point, whereas a discrepancy of 90 to 100 years would deduct 10 points, i.e. 1 point per decade. The years per point would be a User Preference Setting.
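A minimal Python sketch of that proportional deduction, assuming the discrepancy is rounded up to the next whole block of years (so exactly 10 years costs 1 point and anything over 90 up to 100 costs 10). The function and parameter names are hypothetical.

```python
def proportional_deduction(discrepancy_years: int, years_per_point: int = 10) -> int:
    """Points to deduct for one Chronology Date Check (illustrative only)."""
    d = abs(discrepancy_years)
    # Ceiling division without floats: any part of a decade counts as a full point.
    return -(-d // years_per_point)
```

Setting years_per_point to 1 would give one point per year of discrepancy, and a larger value would make the check more forgiving.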
I think that suggestion would work well if the fathers' surnames don't match, but not as well if the surnames do match.
I have a lot of individuals on my list where the fathers' surnames match, but the forenames don't match exactly. They are getting as many as 20 points under the column "Father". In one example the fathers are "Thomas P Inshaw" and "Edwin Thomas Inshaw". For this same pair, the mothers' names are "Sarah Ann" and "Ada". These two are obviously not a match, yet they get a total of 35 points and are the 4th highest pair on my report ahead of some actual duplicates.
If I set IntFathMinimum to something higher than 20 to eliminate this pair, then I think I'd also eliminate some of my actual duplicates. Maybe I'm just confused and you can correct me on this?
I think, if it is possible to code, my earlier suggestion might be a way to handle it. If the mother's and father's surnames are not an exact match, allow a deduction of points. If the surnames are an exact match, but the forenames are not an exact match, allow a deduction of points.
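To make that suggestion concrete, here is a small Python sketch of the two-step test. It is only an illustration of the idea as worded: the penalty values, the function name and the case-insensitive comparison are assumptions, and the names are passed in already split into forenames and surname.

```python
def parent_name_penalty(forenames_a: str, surname_a: str,
                        forenames_b: str, surname_b: str,
                        surname_penalty: int = -20,
                        forename_penalty: int = -5) -> int:
    """Deduct unless the parents' names are an exact match (hypothetical values)."""
    def norm(text: str) -> str:
        # Compare case-insensitively and ignore extra spaces.
        return " ".join(text.lower().split())

    if norm(surname_a) != norm(surname_b):
        return surname_penalty    # surnames differ: large deduction
    if norm(forenames_a) != norm(forenames_b):
        return forename_penalty   # surnames agree, forenames differ: small deduction
    return 0                      # exact match: nothing deducted
```

Under this scheme "Julia Ann Snodgrass" v. "Ann Marie Henshaw" takes the surname penalty even though "Ann" is common to both, and "Thomas P Inshaw" v. "Edwin Thomas Inshaw" takes the forename penalty.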
I find it difficult to understand the scores you quote with only partial information. With the current V1.7 default scoring:-
Thomas P Inshaw v. Edwin Thomas Inshaw should score 10 points = 7 Surname + 3 Forename wrong position. Plus other points for Event matches to get Father column score.
Sarah Ann v. Ada should mismatch and score 0 points in Mother column. No points should be added even if Events match.
So I am not clear where the total of 35 is coming from.
Remember, each Father, Mother, etc column is made up of 5 component scores; the Names match, and four Event matches. The Names match must score half of IntNamesMatched before any Event points are added. (I forgot to mention this in the WiP Help page.)
So in above Father Names match, if Forename in wrong position = 0 points, and IntFathMinimum were set to 8 points, then IntFathDeduction would apply and no Event scores added.
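The way a relative's column is built up, as described above, could be sketched in Python like this. The order of the checks and the value used for IntNamesMatched are assumptions (its default is not stated here); the 7 and 3 point figures and the Minimum of 8 come from the example above.

```python
from typing import Sequence

def column_score(names_points: int,
                 event_points: Sequence[int],
                 names_matched: int,
                 minimum: int = 0,
                 maximum: int = 20,
                 deduction: int = -5) -> int:
    """Score one relative's column: Names match plus up to four Event matches."""
    if names_points <= minimum:
        return deduction                      # Deduction applies, no Event points added
    names_points = min(names_points, maximum)
    if names_points * 2 < names_matched:
        return names_points                   # Names score under half of IntNamesMatched
    return names_points + sum(event_points)   # Names plus Event matches

# Thomas P Inshaw v. Edwin Thomas Inshaw, with one Event match worth 10 points
# and a hypothetical IntNamesMatched of 10:
print(column_score(7 + 3, [10, 0, 0, 0], names_matched=10))         # default scoring
print(column_score(7 + 0, [10, 0, 0, 0], names_matched=10, minimum=8))  # wrong position = 0, Minimum = 8
```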
Having said all that, you clearly "know" the above Individual pair & Parents are NOT duplicates, but from my independent standpoint they look like feasible "fuzzy" Name matches that, with corroborating Events for these Parents, or the Individual, or other Relatives, become possible candidates.
I am looking at making the Chronology Mismatch Deduction proportional to the discrepancy. So the more that Chronology Date Checks differ, then the greater the points deducted. Thus a discrepancy of up to 10 years would only deduct 1 point, whereas a discrepancy of 90 to 100 years would deduct 10 points, i.e. 1 point per decade. The years per point would be a User Preference Setting.
Mike, This sounds like a good idea, but would it be in addition to, or an alternative to:
If more than 2 Chronology checks fail then the pair is excluded from the results.
The latter, currently in force, is definitely a good idea. But if you introduced the 'proportional' chronology mismatch how many points would need to be deducted for it to count as 'failing' the chronology test? I suppose that would have to be another User Preference Setting.
Sorry... I didn't give you enough details. 13 points were for the name match on the individuals themselves, 2 points were for the individuals' birth fact, and 20 under the Father total (I guess that is 10 points for the name and 10 points for the birth fact). Really the only point I was trying to make (although I did it poorly) was that out of the 35 total points, 20 of them were under the Father column. With your suggested options of IntFathDeduction, IntFathMaximum, IntFathLastName, IntFathForeRight, IntFathForeWrong, and IntFathSoundex I think I could adjust these to make the father have less of an influence on the total.
Thanks for the explanation on how the name match must score half of IntNamesMatched before any event points are added.
I must be too close to the data. To me, two individuals with different names who have fathers with different names and mothers with different names are probably not a match. I am just trying to find a way to keep these types of pairs from sitting at the top of my list, hiding the real matches which usually have fewer points.
When your next version comes out, I'll try playing around with the new options and see what I come up with.
Mike, I have run version 1.7 on an older copy of my file (from before I started using the duplicates plugin) and am currently working through the data to compile a spreadsheet for you. I am going to use this file for all future testing, so that I can gauge the results better. I should finish the spreadsheet tomorrow and will email it to you.
Version 1.8 incorporates many of your recent suggestions ~ see the WiP Help page for details.
It now allows the previous Result Set to be redisplayed at any time, so you can continue working on it without reassessing your database. With Enable Diagnostic Mode you may optionally select Including Timespan Dates.
All the User Preference Settings for Name Matching have a separate set for Individual, Father, Mother, Spouse, and Child. The Deduction, Minimum and Maximum are as discussed above, plus a Threshold needed to proceed with Event Assessments, etc.
Event Assessment also now has a Minimum needed to avoid entire Event mismatch.
The Timespan to extend 'After', 'Before', 'From', 'To' now defaults to 50 Years instead of 10 Years. Chronology proportional scoring deducts 1 point for each Year of discrepancy. If more than 20 points are deducted then the Individuals are excluded.
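As a sketch of the Chronology rule just described (1 point per Year, exclusion above 20 points), in Python. Whether the limit of 20 applies to the total over all checks or to each check separately is not stated; this version assumes the total, and the function name is hypothetical.

```python
from typing import List, Optional

def chronology_deduction(discrepancies_in_years: List[int],
                         exclude_above: int = 20) -> Optional[int]:
    """Return the (negative) points to add, or None if the pair is excluded."""
    total = sum(abs(d) for d in discrepancies_in_years)  # 1 point per Year
    if total > exclude_above:
        return None          # more than 20 points deducted: Individuals excluded
    return -total

print(chronology_deduction([3, 4]))     # two checks off by 3 and 4 Years
print(chronology_deduction([15, 10]))   # 25 points in total, so excluded
```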
You can set limits on lowest value and maximum rows to display in the Result Set.
I tried version 1.8, but I am getting results that I don't understand.
If the score <= IntFathMinimum (default 0) then the score becomes IntFathDeduction.
Thomas P Inshaw v. Edwin Thomas Inshaw should score 10 points = 7 Surname + 3 Forename wrong position. Plus other points for Event matches to get Father column score. So in above Father Names match, if Forename in wrong position = 0 points, and IntFathMinimum were set to 8 points, then IntFathDeduction would apply and no Event scores added.
I am looking at the same pair. The fathers' names are as listed above. I used the following settings for the Father Name Match Settings, but I am seeing 21 points in the Father column.
I would have thought I would have 7 points for the name. This is less than the 10 points that I have for IntFathMinimum so I would not have expected to see a total of 21 points under Father (I thought no event points would have been counted). So I would have expected to end up with -5 points under the Father column.