* Exporting as UTF-8 (without BOM)

kathykult
Gold
Posts: 17
Joined: 17 Aug 2006 15:07
Family Historian: V7
Location: Belvidere, IL

Exporting as UTF-8 (without BOM)

Post by kathykult »

Hello,

After I recently upgraded to the new version of FH, I realized that it now saves the GEDCOM file in UTF-16 (which includes the Byte Order Mark at the top of the file). My other applications can't use UTF-16, so I now have to export my file as UTF-8 (without BOM) from FH. I didn't have to do that before: I used the GEDCOM file natively from FH without having to "export" it, because it used the old ANSI charset.

My other apps can't use UTF-16, but they can use UTF-8 (without the BOM). However, when I export from FH, the special characters are not converted from UTF-16 to UTF-8 properly. For example, in my FH GEDCOM file, I have:

... 2 PLAC , Staßfurt, , Sachsen-Anhalt, Germany

but when I export as UTF-8 (with no BOM), it becomes:

... 2 PLAC , StaÃŸfurt, , Sachsen-Anhalt, Germany

Does anyone know how to get FH to export the Unicode characters properly?

Any advice is most appreciated!
Kathy Kult
mjashby
Megastar
Posts: 719
Joined: 23 Oct 2004 10:45
Family Historian: V7
Location: Yorkshire

Re: Exporting as UTF-8 (without BOM)

Post by mjashby »

Kathy,

Try using Mike Tate's "Export Gedcom File" Plugin available at: Export Gedcom File

Mervyn
Jane
Site Admin
Posts: 8508
Joined: 01 Nov 2002 15:00
Family Historian: V7
Location: Somerset, England

Re: Exporting as UTF-8 (without BOM)

Post by Jane »

Hi Kathy,

Please could you send a small sample file to support@family-historian.co.uk so they can see the example. I am sure they will want to look into the problem.
Jane
My Family History : My Photography "Knowledge is knowing that a tomato is a fruit. Wisdom is not putting it in a fruit salad."
tatewise
Megastar
Posts: 28341
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Re: Exporting as UTF-8 (without BOM)

Post by tatewise »

I think it is exporting perfectly correctly.
The character ß requires the two bytes ÃŸ (0xC3 0x9F) to represent it in UTF-8 encoding.

Note that in your old ANSI encoded GEDCOM you would NOT have been able to use the ß character.

Without a UTF-8 BOM, whatever program you use to open the GEDCOM file may not know it is UTF-8 encoded.
In your example, the program you have used is assuming it is ANSI encoded, and displays the ANSI characters equivalent to the two UTF-8 bytes.
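
The effect described above can be reproduced in a few lines. This is only an illustrative Python sketch, assuming "ANSI" means the Windows-1252 code page (`cp1252`); it is not how FH or any of the other programs are implemented.

```python
# Illustration only: reproduce the "wrong characters" effect.
text = "Staßfurt"

utf8_bytes = text.encode("utf-8")      # ß becomes the two bytes C3 9F
misread = utf8_bytes.decode("cp1252")  # a reader that assumes ANSI (Windows-1252)
reread = utf8_bytes.decode("utf-8")    # a reader that knows the file is UTF-8

print(utf8_bytes.hex(" "))  # 53 74 61 c3 9f 66 75 72 74
print(misread)              # StaÃŸfurt
print(reread)               # Staßfurt
```

The bytes in the file are the same in both cases; only the reader's assumption about the encoding differs.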

Why don't you try exporting WITH a UTF-8 BOM?

What are your 'other apps' and are you sure they understand UTF-8?

BTW:
My Export Gedcom File Plugin will perform in exactly the same way.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
kathykult
Gold
Posts: 17
Joined: 17 Aug 2006 15:07
Family Historian: V7
Location: Belvidere, IL

Re: Exporting as UTF-8 (without BOM)

Post by kathykult »

Wow, thanks Mike!! Using UTF-8 WITH the BOM made the difference -- now it displays the characters as desired. Thanks so much for your help! In the early days of UTF-8, the BOM gave us problems in some of our programs where I work, so I've always just shied away from using it. Now I see that some programs need the BOM to know the file is UTF-8.

I use a code editor called oXygen (it's what I use at work for XML files) to look at the GEDCOM, and I use TNG (The Next Generation) software to publish my GEDCOM on the web. Both were seeing the double-byte characters when the file had no BOM, but with the BOM they both only saw single-byte chars.
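
For anyone curious why the BOM changes what a program sees, here is a rough Python sketch of BOM sniffing. The helper `sniff_encoding` and its fallback to Windows-1252 are assumptions made for illustration; oXygen and TNG may well detect encodings differently.

```python
import codecs

def sniff_encoding(data: bytes, default: str = "cp1252") -> str:
    """Hypothetical helper: pick a codec from a leading BOM, else use `default`."""
    if data.startswith(codecs.BOM_UTF8):           # EF BB BF
        return "utf-8-sig"                         # decodes UTF-8 and drops the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"                            # BOM selects the byte order
    return default                                 # no BOM: fall back to ANSI

line = "2 PLAC , Staßfurt, , Sachsen-Anhalt, Germany"
with_bom = line.encode("utf-8-sig")   # UTF-8 with BOM
without_bom = line.encode("utf-8")    # UTF-8 without BOM

decoded_with_bom = with_bom.decode(sniff_encoding(with_bom))
decoded_without_bom = without_bom.decode(sniff_encoding(without_bom))

print(decoded_with_bom)     # 2 PLAC , Staßfurt, , Sachsen-Anhalt, Germany
print(decoded_without_bom)  # 2 PLAC , StaÃŸfurt, , Sachsen-Anhalt, Germany
```

A reader that defaults to UTF-8 instead of ANSI would handle both files correctly, which is why some programs are happy without the BOM.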

I guess I'll never fully understand all this character encoding business, but at least I learned something new today! :-) Thanks again!
Kathy Kult
tatewise
Megastar
Posts: 28341
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Re: Exporting as UTF-8 (without BOM)

Post by tatewise »

Kathy, there ARE advantages in using my Export Gedcom File Plugin for exporting to TNG, because it has special options designed for the job.

Character encoding is fairly straightforward.

In the good old days we only had 128 ASCII characters and each one fitted into an 8-bit byte often with a parity bit.

When communications got better, the parity bit was redundant, so they increased to 256 ANSI characters that included some accented letters & extra symbols, but still fitted into an 8-bit byte.

Then world-wide communications demanded European accented characters, Cyrillic, Greek, Hebrew, not to mention all the Asian, Chinese & Japanese languages. So Unicode was born, and its Basic Multilingual Plane, which handles most characters, is often encoded in UTF-16, which is simply a pair of 8-bit bytes encoding 65,536 characters. (There are extensions for even more.)

The snag with UTF-16 is it doubles the size of every character from 8 bits to 16 bits, even if most of them are still 7-bit ASCII characters and the remaining bits are all zero. In case you had not noticed, your FH GEDCOM file has doubled in size.

UTF-8 solves that problem by encoding 7-bit ASCII characters in one 8-bit byte, but with the top bit zero, and then encodes all the rest into multiple bytes with the top bit one.
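
As a small illustration of the size difference and the top-bit rule, here is a Python sketch. The helper name `utf8_bits` is made up for this example.

```python
sample = "Staßfurt"                 # 8 characters, 7 of them plain ASCII

utf16 = sample.encode("utf-16-le")  # UTF-16 without a BOM: 2 bytes per character here
utf8 = sample.encode("utf-8")       # 1 byte per ASCII character, 2 bytes for ß

def utf8_bits(ch: str) -> list[str]:
    """Hypothetical helper: the UTF-8 bytes of one character as bit strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

print(len(utf16), len(utf8))  # 16 9
print(utf8_bits("S"))         # ['01010011']              top bit 0: ASCII
print(utf8_bits("ß"))         # ['11000011', '10011111']  top bit 1 in both bytes
```

For mostly-ASCII data such as a typical English-language GEDCOM, the UTF-8 file ends up at roughly half the size of the UTF-16 one.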
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
kathykult
Gold
Posts: 17
Joined: 17 Aug 2006 15:07
Family Historian: V7
Location: Belvidere, IL

Re: Exporting as UTF-8 (without BOM)

Post by kathykult »

Thanks Mike. Do you know why FH decided to use UTF-16 rather than UTF-8? I did notice the file size doubled, and it takes a lot longer to "save" the file in FH now, especially when auto-save is enabled. Just wondering what the advantages are of UTF-16 over UTF-8 for Family Historian files.
Kathy Kult
tatewise
Megastar
Posts: 28341
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Re: Exporting as UTF-8 (without BOM)

Post by tatewise »

The FHUG is completely independent from the Calico Pie FH product development, so I can only guess.
I suspect internally FH holds characters in UTF-16 format, so it is simpler (and quicker) to Open and Save in that form.
As you can guess from my description above, conversion between UTF-16 and UTF-8 is quite complicated (I know because I had to code that into my Export Gedcom File Plugin), so that might have made Open and Save even slower.
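
For readers who just need to convert a file outside FH, here is a minimal Python sketch. It is not the plugin's code: Python's built-in codecs do the bit-level work, and the function name `convert_utf16_to_utf8` and the sample file names are invented for the example.

```python
import tempfile
from pathlib import Path

def convert_utf16_to_utf8(src: Path, dst: Path, bom: bool = True) -> None:
    """Hypothetical helper: re-encode a UTF-16 file (with BOM) as UTF-8."""
    text = src.read_bytes().decode("utf-16")  # BOM selects byte order and is dropped
    dst.write_bytes(text.encode("utf-8-sig" if bom else "utf-8"))

# Demo on a throwaway file so that nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample_utf16.ged"
    dst = Path(tmp) / "sample_utf8.ged"
    record = "2 PLAC , Staßfurt, , Sachsen-Anhalt, Germany\r\n"
    src.write_bytes(record.encode("utf-16"))  # UTF-16 with a leading BOM
    convert_utf16_to_utf8(src, dst, bom=True)
    result = dst.read_bytes()

print(result[:3].hex(" "))  # ef bb bf
```

Working on bytes rather than text-mode files leaves the line endings exactly as they were in the source.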
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
Jane
Site Admin
Posts: 8508
Joined: 01 Nov 2002 15:00
Family Historian: V7
Location: Somerset, England

Re: Exporting as UTF-8 (without BOM)

Post by Jane »

Mike is correct: Windows Unicode applications use UTF-16 internally, so loading and saving UTF-16 is quicker than encoding to UTF-8.

I think the recommended method for Windows is to save to UTF-16 and call the Windows API to encode to UTF-8 so that would probably increase rather than decrease the save time.
Jane
My Family History : My Photography "Knowledge is knowing that a tomato is a fruit. Wisdom is not putting it in a fruit salad."