This adds several new features as described in the Help page.
It corrects a mistake in calculating Generation Gap and now also removes immediate relatives such as Siblings, and Parent/Child from the results.
It corrects a mistake in checking Event Date Chronology and now also removes glaring mismatches from the results, such as where one Individual was Married & Died before the other was Born & Baptised.
It now not only checks Individual Event Dates, but also those of the Father, Mother, Spouse, and 1st Child.
A Diagnostic Mode has been added to aid analysis of the points scoring system.
The Non-Duplicate Management feature previously discussed will have to wait for another day as I have other pressing commitments at present.
Mike Tate is researching the TATE and SCOTT family tree and all relations.
I have just tried it out. It has found a few more than v1.4 did, including some new ones (I suspect they scored lower before, as they aren't new records). The gender match that I had before is now gone.
I will spend some more time this afternoon looking at more records and let you know.
Hi Mike, I was about to ask if non-duplicates could somehow be excluded but saw your thread re v1.5 and have read your notes re a named List ie: "In addition, any Individual Records placed in a Named List called 'Non-Duplicates' will automatically be excluded."
I have tried this on a pair of names but they keep coming back each time I run it. I have tried the variations of naming the list, ie with and without single and double quote marks, but cannot get them to be excluded.
I can't see what I'm doing wrong. Any thoughts? Dagwood
"I haven't failed. I've just found ten thousand ways that won't work" Thomas Alva Edison
Thanks for your continued work on this. I think it is unlikely that I have any genuine duplicates in my data but it is interesting to see how the plugin works and what suggestions it makes.
As you said in one of your other posts, it is likely that whatever scoring system is used there will be some false positives which get a higher score than some genuine duplicates. A low score for genuine duplicates is sometimes unavoidable because the records for a duplicate pair may be fully compatible (in that there is no conflicting data) but if the data for one or both of them is scarce there may be few actual positive matches in the data to move the score in a positive direction. For example I have a pair of records which I entered as separate individuals but I linked using ‘Associated Person’ and added notes to both to explain that I thought they might be one and the same person. The evidence is fairly compelling but is purely circumstantial, recorded in notes rather than hard facts. Therefore the only actual match in their data is the name, and they get a low score. No refinement in the scoring system could change this (although I think perhaps they should not lose any points for a difference in child count: see below).
However there are some false positives getting a higher score than this pair, which shouldn’t. In these cases there are a number of positive matches in the data but a single chronological incompatibility which ought to outweigh all the positives because it makes a match impossible. Here are two examples:
1. A and B have similar names. There are no event dates for A, but she was married and there is a birth date for her husband. Therefore her marriage date must be later than her husband’s birth date. But she is being matched with B, who died 60 years before that date.
2. C and D have similar names and the same number of children. There are no dates for C himself but there are baptism dates for his children, in the 1640s. Therefore his own birth date must be before the 1640s. But he is being matched with D who was born more than 300 years later.
Anomalies like these could be removed if the chronology checks were extended where an individual’s own dates are not known. Dates from their immediate family could be used to place constraints on the individual’s own dates. I do realise, of course, that adding more checks will slow down the run time of the plugin. A compromise has to be reached, but at present it is checking my 3533 individuals in just 9 seconds (46 seconds in diagnostic mode) so I have no problem with run time.
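To make the suggestion concrete, here is a rough sketch in Python of how relatives' dates could constrain an individual's own dates. It is only an illustration, not the plugin's code: the `Person` fields, the `slack` tolerance of 2 years and the example years are all assumptions chosen to mirror the two cases above.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Range = Tuple[float, float]  # (earliest possible year, latest possible year)


@dataclass
class Person:
    # Years only; None means the date is not recorded (hypothetical fields).
    birth: Optional[int] = None
    death: Optional[int] = None
    spouse_birth: Optional[int] = None
    first_child_baptism: Optional[int] = None


def birth_range(p: Person) -> Range:
    if p.birth is not None:
        return (p.birth, p.birth)
    if p.first_child_baptism is not None:
        # A parent must have been born before their child was baptised.
        return (-math.inf, p.first_child_baptism)
    return (-math.inf, math.inf)


def death_range(p: Person) -> Range:
    if p.death is not None:
        return (p.death, p.death)
    if p.spouse_birth is not None:
        # Marriage, and therefore death, must come after the spouse's birth.
        return (p.spouse_birth, math.inf)
    return (-math.inf, math.inf)


def ranges_overlap(a: Range, b: Range, slack: int = 2) -> bool:
    # slack (an assumed tolerance) allows for approximate dates.
    return max(a[0], b[0]) <= min(a[1], b[1]) + slack


def could_be_same(a: Person, b: Person) -> bool:
    return (ranges_overlap(birth_range(a), birth_range(b))
            and ranges_overlap(death_range(a), death_range(b)))


# Example 1: A's husband was born in 1800; B died in 1740.
print(could_be_same(Person(spouse_birth=1800), Person(death=1740)))        # False
# Example 2: C's first child was baptised in 1642; D was born in 1950.
print(could_be_same(Person(first_child_baptism=1642), Person(birth=1950)))  # False
```

A pair that fails `could_be_same` could simply be dropped from the results, whatever its points total.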
One other observation: I’m not sure that it’s a good idea to deduct points if the child count differs. In a typical case of duplicates, one of the records will have been in the gedcom file for a while and will have a fairly full set of family members recorded, e.g. from census returns. When a contemporary individual with the same name is found, the name might have turned up in a different type of document, such as an employment record, where no information about family would be recorded. The lack of family information means that no additional points are gained, but equally it should not mean that points are lost.
Thanks again for all the time you are giving to this plugin – it must be a labour of love!
Jane said: It's working for me, make sure the name of the list is
Non-Duplicates
It must be exactly as above; try cutting and pasting the name in.
Tried again and copied and pasted this time as you suggested, Jane. Still both names appear as duplicates in the list. So far I have tried it as you suggested, with quotes and with single quote marks, with just one name, and with both names. Every time the pair appear back on the list. I've even tried re-downloading this version and repeating it over again. I can't think what might be different to what you and others are doing. Dagwood
Dagwood ~ Are you sure it is the Non-Duplicates pair that are listed in the Result Set? Carefully check the Record Id. The two Individuals may still appear individually in the Result Set paired with other candidates.
Chris ~ Thanks for the error report - I'll check into it.
I'm a bit confused by the version numbering; today's version is 1.6.
I think the scoring is perhaps still a bit adrift? I have a Mary Ann FIRTH and an Ann FIRTH coming 14th in my result set. They have the same year of birth -- 1817(app) -- but that's it. Different place of birth, completely different christening data (+4), and different parents (+3, +3). I wouldn't expect them to be so near the top of the list... unless I really don't have many (or indeed any) duplicates in my 12000 records. I /did/ have three genuine dups, which 1.4 found -- many thanks for that. When fully developed, this will prove a fabulous aid!
1) I think the points system needs tweaking slightly, especially with regard to places, as the Raymonts score higher than the Dumbles even though there are more place name discrepancies.
and how scored
2) Regarding run time, the spec of the machine makes a lot of difference.
i.e. Main PC - Windows 7 64 Bit - 4GB RAM - Intel Core 2 Duo @ 3.06GHz takes 2 mins 5 secs. Keeps CPU @ approx. 50% so easy to do other things at the same time.
Laptop - Windows XP 32 Bit - 1GB RAM - Intel Pentium M @ 1.4GHz takes 11 mins 35 secs. Keeps CPU @ approx. 99% so hard to do anything else at the same time.
Same GEDCOM of 10606 people.
I don't think you can do much about the speed, but maybe you could put in a warning that lower spec machines will take a lot longer to run.
Just FYI, v1.6 did not replace v1.5 for me. As you can see from the screenshot, it added a second instance with -1 at the end of the plugin name.
It seems to be taking a little longer to run than v1.4 but not significantly.
It is throwing up more candidate duplicates than before, and virtually none of them are genuine. It does not seem to be respecting place of birth (if it should): I am getting Aberdeen matched with Canterbury purely because they are the same year.
The good news is that each refinement seems to find at least 1 genuine duplicate that the previous version missed.
The exclusion of siblings and parent/child individuals is working great in version 1.6.
Like Colin, I am seeing more non-duplicates than before. Usually this is because one forename is the same, but not usually in the same position. For example, I am seeing a lot of pairs something like this: James Kenneth Henshaw and Alex James Henshaw. I went from 61 to 98 pairs and I'd guess that at least half of them are like this. Can the order of the names be considered and something deducted if they are not the same?
I have one pair, Margaret Ann Strader and Julia Ann Margaret Hunsicker. The forenames both have Ann and Margaret in them, but the first name and surname are different. This pair shows up 4th on my list. This pair both have husbands that have the last name Henshaw.
Both women have the same number of children. I wonder if too many points are being given for having the same number of children, in this case 4.
Also, for the first couple, the first child is named Nancy Lee Henshaw (a girl). For the second couple the first child is named Marion Lee Henshaw (a boy). The pair is getting 6 points under Child 1 because the two children both have Lee and Henshaw in their name even though their first names are different and they are of different gender.
So, altogether this pair is getting 19 points.
All in all a great plugin. Thanks for all the hard work!
Sorry about the mix up regarding V1.5 and V1.6, but all should be OK now at V1.6 4 July 2012. Slightly too much haste in posting the fix for Chris, which would affect everyone using the subset selection option.
This version has some more date chronology checks using relations' dates where the individual has none, as suggested by Lorna, because they were easy to add.
A number of you are suggesting that Child Count is doing more harm than good, so I am tempted to remove it, now that there are so many other more significant checks.
Roger said
I have a Mary Ann FIRTH and an Ann FIRTH coming 14th in my result set. They have the same year of birth -- 1817(app) -- but that's it. Different place of birth, completely different christening data (+4), and different parents (+3, +3). I wouldn't expect them to be so near the top of the list... unless I really don't have many (or indeed any) duplicates in my 12000 records.
By my reckoning that pairing scores about 22, which with around 250 points now on offer, is very low. You say "completely different christening data (+4), and different parents (+3, +3)" but there must be something small in common for some points to be scored. You have hit the nail on the head with "unless I really don't have many (or indeed any) duplicates". Then the Plugin will insist on listing the highest scoring (but very low points) false positives.
Tim ~ A similar argument applies to your low score listings. Also the difference is only 5 points out of a maximum of 250+. Perhaps the Plugin should give percentage scores as well as points scores?
I had assumed larger databases would be run on more powerful PCs, but in the published version the 'Help & Advice' could say something about performance.
Colin ~ I can't explain the download problem. Those issues are usually associated with browser behaviour.
I suspect the longer listings are again more of the above. The Plugin does take account of Place, but only adds points if there is some agreement (and only if Dates are similar), rather than deducting points if Places differ.
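As a rough illustration of that add-only behaviour, here is a small Python sketch. It is not the Plugin's code: the function name, the comma-separated place parts, the 2-year date window and the 3 points per shared part are all assumptions, as is returning nothing when a date is missing.

```python
def place_points(place_a, place_b, year_a, year_b, max_year_gap=2, points_per_part=3):
    """Award points for agreeing place parts; never deduct for differences."""
    if not place_a or not place_b:
        return 0
    # Places only count when the dates are similar (assumed window).
    if year_a is None or year_b is None or abs(year_a - year_b) > max_year_gap:
        return 0
    parts_a = {part.strip().lower() for part in place_a.split(",") if part.strip()}
    parts_b = {part.strip().lower() for part in place_b.split(",") if part.strip()}
    return points_per_part * len(parts_a & parts_b)


# Different places with the same year score nothing, but lose nothing either.
print(place_points("Aberdeen, Scotland", "Canterbury, Kent, England", 1817, 1817))  # 0
print(place_points("Leeds, Yorkshire", "Leeds, Yorkshire, England", 1817, 1818))    # 6
```

Under a scheme like this, an Aberdeen/Canterbury pair keeps whatever points it earned from the matching year, which is why it can still appear in a list where nothing scores highly.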
Bill ~ More of the same regarding low scores. The tip about Child Gender mismatch is good. The way the Plugin works makes checking the order of names tricky. I could award more points for a matching Surname than a matching Forename. As I said above, I may drop Child Count.
The name checking is deliberately fuzzy, because if genuine duplicates come from different sources, then the names may be slightly different, in a different order, or parts missing, or even Forenames & Surnames swapped. The latter can happen on Marriage or on Adoption.
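A minimal Python sketch of order-independent name matching shows both the benefit and the side effect. This is an illustration only, not how the Plugin actually compares names: the function names and the 3 points per shared name part are hypothetical.

```python
def name_tokens(full_name):
    """Split a full name into a set of lower-cased parts, ignoring order."""
    return {token.lower() for token in full_name.replace(",", " ").split()}


def fuzzy_name_points(name_a, name_b, points_per_token=3):
    """Score every name part the two records share, wherever it appears."""
    return points_per_token * len(name_tokens(name_a) & name_tokens(name_b))


# Swapped Forename and Surname still match fully.
print(fuzzy_name_points("John Smith", "Smith John"))                                 # 6
# The same leniency lets unrelated people share points.
print(fuzzy_name_points("James Kenneth Henshaw", "Alex James Henshaw"))              # 6
print(fuzzy_name_points("Margaret Ann Strader", "Julia Ann Margaret Hunsicker"))     # 6
```

Weighting a shared Surname more heavily than a shared Forename would separate the second case from the third without making the check order-sensitive.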
What may have gone unnoticed is that as more checks have been added the maximum score has gone up and up. So what was a 'good' score in V1.0 may now be a 'poor' score in V1.6, so percentage scoring may help make this clearer.
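The percentage idea is simple arithmetic. A sketch in Python, assuming a maximum of 250 points purely for illustration (the real maximum changes from version to version):

```python
def percent_score(points, max_points):
    """Express a points score as a percentage of the maximum on offer."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return round(100.0 * points / max_points, 1)


# Roger's pairing at about 22 points, and Bill's at 19, out of an assumed 250.
print(percent_score(22, 250))  # 8.8
print(percent_score(19, 250))  # 7.6
```

Shown this way, a pair near the top of the list at under 10% is more obviously a weak candidate than the raw points suggest.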