* Check For Possible Duplicate Media (FH7) plugin - first prototype

tatewise
Megastar
Posts: 28436
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Post by tatewise »

For anyone who might be interested, the relevant fragment of the fhFileUtils.fh_lua script is shown below.
Line 614 is the line tResults.modifiedepoch = os.time({
So the problem appears to lie with the file's Date modified value, whereas Date created seems fine.

Code: Select all

		dDate = fileObj.DateCreated -- returns standard format table
		tResults.createdepoch = os.time({
			year = dDate.Year,
			month = dDate.Month,
			day = dDate.Day,
			hour = dDate.Hour,
			min = dDate.Minute,
			sec = dDate.Second,
		})
		dDate = fileObj.DateLastModified -- returns standard format table
		tResults.modifiedepoch = os.time({
			year = dDate.Year,
			month = dDate.Month,
			day = dDate.Day,
			hour = dDate.Hour,
			min = dDate.Minute,
			sec = dDate.Second,
		})
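
A minimal sketch of how the conversion could be guarded so that a bad Date modified value does not stop the plugin. This is not the actual fhFileUtils code: the helper name safeEpoch is hypothetical, and it assumes the date value is a plain table with the same Year/Month/Day/Hour/Minute/Second fields as the fragment above.

```lua
-- Hypothetical helper, not part of fhFileUtils.
-- Returns the epoch, or nil plus a message when the date cannot be converted.
local function safeEpoch(dDate)
	if type(dDate) ~= "table" then
		return nil, "date value is not a table"
	end
	-- pcall traps the error os.time raises for missing or out-of-range fields
	local ok, result = pcall(os.time, {
		year = dDate.Year,
		month = dDate.Month,
		day = dDate.Day,
		hour = dDate.Hour,
		min = dDate.Minute,
		sec = dDate.Second,
	})
	if ok and type(result) == "number" then
		return result
	end
	return nil, tostring(result)
end

-- Usage: a made-up date, then a table with no fields at all
local epoch, err = safeEpoch({ Year = 2023, Month = 12, Day = 17, Hour = 15, Minute = 8, Second = 0 })
print(epoch or ("conversion failed: " .. err))
local badEpoch, badErr = safeEpoch({})
print(badEpoch or ("conversion failed: " .. badErr))
```

With something like this, the caller can record the failure against the file and carry on, rather than crashing on line 614.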
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
ColeValleyGirl
Megastar
Posts: 5510
Joined: 28 Dec 2005 22:02
Family Historian: V7
Location: Cirencester, Gloucestershire

Post by ColeValleyGirl »

There's also a problem with the error handling around there, which I have on my list to fix (just so the plugin doesn't crash when it encounters an error).

I'm not in a position to diagnose the problem with Wine/Crossover. If you're willing to help (Mark or Colin) please contact me offline.
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Post by Mark1834 »

It’s peripheral to debugging the plugin, but I’ll have a play this evening to see if we can tie it down more precisely. I don’t think any of the more experienced plugin authors (Jane, Helen, Mike) have easy access to a Mac or Linux box and I like a challenge! :)
Mark Draper

Post by tatewise »

May I suggest that eventually the Result Set takes a leaf from the FH Media lists and has separate columns for the Folder path and Filename, because currently the first lengthy part of the path from the Drive to Media is usually the same for every entry and so not very helpful.
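
One way the split could be done in Lua, as a rough sketch only. The helper name splitPath is hypothetical and the sample path is made up; it simply cuts at the last backslash or forward slash.

```lua
-- Hypothetical helper: split a full path into folder and filename
-- at the last path separator (backslash or forward slash).
local function splitPath(path)
	local folder, name = path:match("^(.*)[\\/]([^\\/]*)$")
	if not folder then
		return "", path -- no separator, so the whole string is the filename
	end
	return folder, name
end

-- Usage with a made-up path
local folder, name = splitPath("C:\\Family\\Project.fh_data\\Media\\Photos\\gran.jpg")
print(folder)
print(name)
```

The two values could then fill separate Folder and Filename columns in the Result Set.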

Post by Mark1834 »

Thanks for the inputs, folks. There have been some good suggestions and observations here.

We've identified the root cause of the WINE issue, and this will be corrected in a future FH update. There is a gap in WINE support for file properties that are not used or required in Unix-derived operating systems such as Linux and macOS, namely short names and file extensions. fhFileUtils() retrieves all properties automatically to populate a convenient table, whereas using the FileObject directly is more selective and gives only what is requested (in this case, just file size).

I will produce an update taking all these suggestions on board, hopefully this week, which should be fully functional in both Windows and Mac/Linux. We can then come back to record merging in the New Year.

Post by ColeValleyGirl »

Correction, Mark: it's not file extensions, it's file types. For example, for files ending in .TXT, the file Type is "Text Document".

Post by Mark1834 »

True - my language was a little loose. In Windows, the file name extension (the last bit after the full stop) defines the file type, such as .txt for a text file, .ged for a GEDCOM file, etc. Linux/Unix does not use file name extensions in this way, and in that ecosystem, file type is generally taken to mean something more fundamental - ordinary file, directory, or special file such as a symbolic link.

I can see why WINE gives up - they are very different uses of the same term!

Post by ColeValleyGirl »

And file Type isn't even the same as file Extension in the Windows file system... I'm at a loss for a circumstance in which it would be needed, but no doubt somebody will think of one.

Post by Mark1834 »

tatewise wrote: 17 Dec 2023 15:08 May I suggest that eventually the Result Set takes a leaf from the FH Media lists and has separate columns for the Folder path and Filename, because currently the first lengthy part of the path from the Drive to Media is usually the same for every entry and so not very helpful.
Just checking - presumably you mean display as per the default FH configuration where files stored in the project are just listed from Media\... onwards, and linked files have a full absolute path? Sub-folders are still shown in both cases, so it's not a strict folder/file separation.

Anyway, that's the change I've made, and it does look better and more consistent with the main media listings, so thanks for the suggestion.
jelv
Megastar
Posts: 611
Joined: 03 Feb 2020 22:57
Family Historian: V7
Location: Mere, Wiltshire

Post by jelv »

I'm pretty sure he meant like this - I've added an external file to the sample project to illustrate.
[Attachment: Media List.png]
John Elvin

Post by Mark1834 »

Ah - got it. My main FH installation just shows the full file path, but checking on my Linux out-of-the-box copy, it's exactly as displayed here. It's easy enough to add that as well.

Post by tatewise »

Just to confirm that is what I meant.
OlivierM
Famous
Posts: 104
Joined: 30 Jan 2023 04:33
Family Historian: V7
Location: Brussels

Post by OlivierM »

I tested it and got 12 duplicates, of which 4 have a message saying the file is missing. I also get a warning that accented characters may cause an issue.

When I do a search with dupeguru, a program that is based on size and histogram, I get more than 112 duplicates.
:D And all seem to be true duplicates.
I started with Reunion > 30 years ago, later TMG.
I now use FH as main software, TNG to share my data.
Transkribus to decipher old texts.
Genealogica Grafica, TCGB and My Family Tree to view & check my data. And Genopro for its layered reports.

Post by Mark1834 »

Thanks for that Olivier. I expect that any issues with accents will be fixed in the next update, due before the holidays. Could you post images for fully expanded All tabs (similar to this example from the Sample Project) for duplicates that dupeguru has flagged but the plugin hasn't, please?
[Attachment: Capture.PNG]

Post by tatewise »

I believe DupeGuru is looking directly at files, not FH Media record links to files.
So if there are files in the folder that are not linked to FH Media records then it may produce different results.
If all the files are in the Project Media folder then unlinked files can be found by Check for Unlinked Media plugin.

Post by OlivierM »

tatewise wrote: 21 Dec 2023 19:06 I believe DupeGuru is looking directly at files, not FH Media record links to files.

I have no unlinked files. All the files found by dupeguru are linked. As you say, Mike, dupeguru is comparing the files themselves, not the FH links to the files. That seems to be the best way to find duplicate media, doesn't it?

Post by Mark1834 »

No - it won't spot two Media records linked to the same file, or two copies of the same file linked to the same record (an example of which was reported earlier in this thread).

Remember that it is a many-to-many relationship between records and files, so just looking at the files doesn't say anything about the records.

It would help if you could give a specific example of a duplicate that you think the plugin is not spotting, please. Looking at the DupeGuru description, it also claims to spot similar but not identical images. Is that what’s happening here? Most projects will contain lots of very similar document images.
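
To illustrate the many-to-many point, here is a small sketch using plain Lua tables. The record ids and file names are made up, and groupBy is a hypothetical helper, not code from the plugin.

```lua
-- Hypothetical data: each entry links one Media record id to one file path.
local links = {
	{ record = "O1", file = "Media\\a.jpg" },
	{ record = "O2", file = "Media\\a.jpg" },      -- two records, same file
	{ record = "O3", file = "Media\\b.jpg" },
	{ record = "O3", file = "Media\\b_copy.jpg" }, -- one record, two files
}

-- Build an index from one field of each link to the list of values in another.
local function groupBy(list, keyField, valueField)
	local groups = {}
	for _, link in ipairs(list) do
		local key = link[keyField]
		groups[key] = groups[key] or {}
		table.insert(groups[key], link[valueField])
	end
	return groups
end

local recordsByFile = groupBy(links, "file", "record")
local filesByRecord = groupBy(links, "record", "file")

-- A file-only scan sees a.jpg once; only the record index shows the double link.
for file, records in pairs(recordsByFile) do
	if #records > 1 then
		print(file .. " is linked from records " .. table.concat(records, ", "))
	end
end
```

Both indexes are needed: recordsByFile shows files shared between records, and filesByRecord shows records holding more than one file.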

Post by OlivierM »

Mark1834 wrote: 22 Dec 2023 00:31 It would help if you could give a specific example of a duplicate that you think the plugin is not spotting, please. Looking at the DupeGuru description, it also claims to spot similar but not identical images. Is that what’s happening here? Most projects will contain lots of very similar document images.
[Attachment: image.png]
[Attachment: image.png]
This is an example of DupeGuru finding 3 identical pictures - in the media directory of the project - linked to the same individual, with 3 different file names. The pictures are identical except for their names: same size (228), same proportions (995 x 1200). (In fact they are inherited from MH imports, through smart matches that often generate unwanted duplicates.)

The following link shows you the output of my FH project in TNG:
https://tng.neptis.be/getperson.php?per ... tree=tree2

I understand that your plugin looks for duplicate media records, and not for duplicate media. And basically, I think that an FH plugin cannot spot duplicate media in a project. It can only analyse the media records that relate to them. That is fine. I am looking for an easy solution to get rid of duplicate media files in my project.

Post by Mark1834 »

Thanks for that, Olivier. The link you provided confirmed what I thought the cause was.

The three colour images of Suzanne Frère appear to be the same, so DupeGuru is flagging them as duplicates. However, the plugin applies a much stricter test, and only marks files as duplicates if they are absolutely identical, using a cryptographic hash. If I download the three images and determine the hash values directly in Windows, they are all different.

It's clear that the images have been slightly changed during your various copy and migration processes - not enough for DupeGuru to flag a difference, but the cryptographic hash spots it (that is what it is designed for, to identify slight differences in files caused by either copying errors or hiding undesired malware inside a file).

Unfortunately, it is very difficult, if not impossible, for a plugin to spot these small differences (and even if one could be constructed by an expert author, it wouldn't do as good a job as DupeGuru, which will use more specialist methods).

If DupeGuru can write a list of duplicates to file, it may be possible for a separate bespoke plugin to read that list and process the duplicates. However, that's outside the scope of what we are doing here.

Post by OlivierM »

Mark1834 wrote: 22 Dec 2023 11:01 Thanks for that, Olivier. The link you provided confirmed what I thought the cause was.

If DupeGuru can write a list of duplicates to file, it may be possible for a separate bespoke plugin to read that list and process the duplicates. However, that's outside the scope of what we are doing here.
Thank you for your explanation, and I am happy to hear about the existence of that hidden hash.
In fact, DupeGuru can delete the duplicate files, rename them, or replace them with a hardlink or symlink to the original file. In my case, where I have a hundred duplicates, I guess it will be better to delete all these duplicates and then delete the media records without media.

I will send a separate post with the result of your plugin.

Post by OlivierM »

Mark1834 wrote: 21 Dec 2023 17:35 Thanks for that Olivier. I expect that any issues with accents will be fixed in the next update, due before the holidays. Could you post images for fully expanded All tabs (similar to this example from the Sample Project) for duplicates that dupeguru has flagged but the plugin hasn't, please?
Here are the results of your plugin:
[Attachment: image.png]

Post by tatewise »

I have inspected the DupeGuru help page https://dupeguru.voltaicideas.net/help/en/scan.html.
It has several scanning modes to detect duplicates and some are more intensive than others.
The Contents scan is summarised as follows:
We read files and if the contents is the same, we consider the two files duplicates.
We start by looking at file sizes.
We discard every file that is alone in its group. Then, we proceed to read the contents of our remaining files.
MD5 hashes are used to compare contents.

That process seems to be the same as the way Mark's plugin compares files linked to Media records and both use MD5 hash codes.

Olivier, please confirm you are using DupeGuru Contents scans.
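
The process quoted above could be sketched roughly as follows. This is only an illustration of size-then-digest grouping, not DupeGuru's or the plugin's actual code: findDuplicates is a hypothetical name, and sizeOf and digestOf are caller-supplied functions (an MD5 routine would normally be plugged in as digestOf; the usage below uses in-memory stand-ins instead of real files).

```lua
-- Hypothetical sketch: group by size first, then digest only the candidates.
local function findDuplicates(paths, sizeOf, digestOf)
	-- Step 1: group files by size
	local bySize = {}
	for _, path in ipairs(paths) do
		local size = sizeOf(path)
		bySize[size] = bySize[size] or {}
		table.insert(bySize[size], path)
	end
	-- Step 2: discard files alone in their size group, digest the rest
	local duplicates = {}
	for _, group in pairs(bySize) do
		if #group > 1 then
			local byDigest = {}
			for _, path in ipairs(group) do
				local digest = digestOf(path)
				byDigest[digest] = byDigest[digest] or {}
				table.insert(byDigest[digest], path)
			end
			for _, same in pairs(byDigest) do
				if #same > 1 then
					table.insert(duplicates, same)
				end
			end
		end
	end
	return duplicates
end

-- Usage with made-up in-memory "files" standing in for real ones
local contents = { ["a.jpg"] = "AAAA", ["b.jpg"] = "AAAA", ["c.jpg"] = "BBBB", ["d.jpg"] = "CC" }
local digestCalls = 0
local sets = findDuplicates(
	{ "a.jpg", "b.jpg", "c.jpg", "d.jpg" },
	function(path) return #contents[path] end,
	function(path)
		digestCalls = digestCalls + 1
		return contents[path] -- stand-in for an MD5 digest of the file
	end
)
for _, same in ipairs(sets) do
	print(table.concat(same, " = "))
end
```

The size check is cheap, so the expensive digest is only computed for files that share a size with at least one other file.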

Post by OlivierM »

tatewise wrote: 22 Dec 2023 15:03
Olivier, please confirm you are using DupeGuru Contents scans.
No Mike, I used the "picture mode" and the "contents" scan type. This leads to about 240 duplicates with a match of 99% or more.

If I use the standard mode and the "contents" scan type, I have 16 duplicates.
[Attachment: image.png]

Post by Mark1834 »

An updated version 0.2 is attached. This should incorporate all the suggestions and observations made on the first prototype, so thanks again for those.
  • A more detailed final report showing whether duplicates are identical (Duplicate Records), are multiple copies of the same file in the same record, as reported above (Duplicate Files), or something else (Records Differ).
  • An interim summary report giving record counts and other useful information. This will be refined for the final version, including a link to plugin Help, but is useful for testing.
  • Even quicker run time, as it has eliminated unnecessary waiting between steps, and more user feedback while processing.
  • Full emulator compatibility.
  • I'm not sure exactly why the first version reported possible issues with accented characters (which is a standard FH warning message), but this version has different processing and appears to be robust.
All we need now is the final merging, which I plan to restrict to identical records only (marked as Duplicate Records in the report) in order to avoid any complexity around which is the preferred version of matching files with different names. That can wait until the New Year. Please have a go with the new version if you need a break from Christmas! Does it fix any issues you observed first time around?

Technical Notes:
  • Windows determination of MD5 hashes is not supported in WINE, so the plugin now uses a mixture of Windows commands and the Lua MD5 module. In native Windows, the Lua MD5 module is used for ANSI-compatible filenames, and Windows commands for Unicode filenames. Under WINE, all determinations use the Lua MD5 module, with temporary copies made of files with Unicode names.
  • I have modified the MD5 method slightly, so both Windows and WINE can process files of any size (tested with duplicate 2.4 GB mp4 files). The progress bar can go blank for a few seconds occasionally while processing extremely large files, but it completes without any issues.
  • I'm happy that changing the Windows code page will not prove a stumbling block, as I have confirmed that it is purely self-contained and does not affect the system (you can have multiple command windows open at the same time, each using a different code page).
  • I save the Windows command file as UTF-8 with no BOM, which removes the need for a dummy @echo off first line when the BOM is present.
  • To avoid issues arising from WINE incompatibility with fhFileUtils() getFileFolderDetails(..), I use the Filesystem object directly when getting file size. Hopefully, that will also fix the other fhFileUtils() oddity reported above.
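
The choice of hashing route described in the first note can be pictured with a sketch like this. It is not the plugin's code: chooseHashMethod and the returned labels are hypothetical, and as a simplifying assumption any filename containing a non-ASCII byte is treated as "Unicode" (a real ANSI code page check would be more permissive).

```lua
-- Hypothetical sketch of the decision only; no hashing is done here.
-- Assumption: a name with any byte above 127 counts as a Unicode filename.
local function chooseHashMethod(isWine, filename)
	local hasNonAscii = filename:find("[\128-\255]") ~= nil
	if isWine then
		-- Under WINE everything goes through the Lua MD5 module
		return hasNonAscii and "lua-md5-on-temp-copy" or "lua-md5"
	end
	-- Native Windows: Lua MD5 module for plain names, Windows commands otherwise
	return hasNonAscii and "windows-command" or "lua-md5"
end

-- Usage with made-up filenames ("\195\168" is UTF-8 for e-grave)
print(chooseHashMethod(false, "photo.jpg"))
print(chooseHashMethod(false, "Fr\195\168re.jpg"))
print(chooseHashMethod(true, "Fr\195\168re.jpg"))
```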
Check for Possible Duplicate Media (FH7) (0.2).fh_lua
Richard_Hyland
Gold
Posts: 28
Joined: 06 Jun 2011 18:04
Family Historian: V7

Post by Richard_Hyland »

Mark

Ran on same 2 projects and no errors this time.

Thank you
Richard