International Character Handling for Plugin Authors

Introduction

Many plugin authors encounter problems handling unexpected characters.

Problems due to mismatched character encodings can include:
  • file names in the Windows Filing System not being recognised
  • string manipulations yielding unexpected results
  • text displayed to the user being completely unintelligible, or sprinkled with �

This article sets out the background, describes the problems and recommends solutions. It is aimed at authors working with ƒh7; as of 2022, most new plugins are likely to be written for ƒh7 only, and the language pack facility in ƒh7 makes it more likely that unexpected characters will be encountered.

Developers of plugins aimed at multiple users (typically published via the Plugin Store) should assume that they need to cater for all possible characters. Authors of plugins for private use may encounter fewer problems but might still find the recommendations helpful.

Character Encodings Relevant to Plugin Authors

There are a number of character encodings relevant to plugin authors (in roughly historical order):

ASCII

ASCII supports unaccented English characters, digits 0-9, plus some punctuation and other symbols, and control characters such as Tab or Line Feed (New Line), to a total of 128 codes (the bottom 7 bits of a single byte). Not coincidentally, many later encodings chose to align their first 128 codes with ASCII, which has the useful effect that if all you need is the upper- and lower-case English alphabet, you probably won’t see any problems. But once you wander outside the English walled garden, more characters are needed, and things get more complicated.

‘ANSI’/Windows Code Pages

‘ANSI’ is not a character encoding. Rather, it is a misnomer used to refer to the collection of Windows code pages introduced by Microsoft and others during the 1980s and 1990s to support more languages than English, and still supported (more or less) today.

Each character in a code page is encoded using a single byte. The first 128 character codes mirror ASCII; then (up to) another 128 locale-specific characters are added. One major consequence of this approach is that the same byte value represents different characters in different code pages, or none at all.

For example:

  • Windows-1252 (the Latin alphabet), which is the most-used single-byte character encoding in the world (and, confusingly, is also often called ‘ANSI’), uses byte value DE for the letter Þ (because we all need a thorn, right, especially if our ancestors ran Þe Olde Cake Shoppe?)
  • Windows-1253 (Greek) uses byte value DE for ή.
  • Windows-1255 (Hebrew) doesn’t use byte value DE at all.

These characters may not seem particularly useful to you, but they illustrate the point that text encoded using one code page may yield unexpected results if it’s decoded using another.
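
To make the point concrete, here is a minimal Lua illustration (nothing here is ƒh-specific): it builds the single byte DE, which has no character meaning of its own; what the user eventually sees depends entirely on which code page is used to interpret it.

local strByte = string.char(0xDE)  -- one byte, value DE
print(#strByte)                    -- 1: the string library just sees a single byte
-- Interpreted as Windows-1252 this byte is Þ; as Windows-1253 it is ή;
-- Windows-1255 does not define it at all.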

Another problem arises because there are languages (e.g. Thai) for which an extra 128 characters are not enough. Thai may not be a huge issue for plugin authors, but there are places in Europe whose alphabets are not wholly supported by Windows-1252, for example the Dutch ij or the Polish ł.

So, code pages aren’t an entirely satisfactory solution to handling characters from multiple languages.

Unicode

Unicode is a character encoding standard that sets out to cover every reasonable writing system in the world (and a few unreasonable ones as well) without ambiguity. It defines unique ‘code points’ for every character in every language, and it can be implemented by various character encodings (Unicode Transformation Formats, or UTFs) that map code points (and thus characters) to unique byte sequences. Two UTFs are relevant to plugin authors: UTF-16 and UTF-8.
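
As a small illustration of the difference between a code point and its encodings: the thorn character Þ from the earlier example is code point U+00DE. Windows-1252 stores it as the single byte DE, UTF-8 as the two bytes C3 9E, and UTF-16LE as the two bytes DE 00. The inbuilt Lua 5.3 utf8 library (described under String Manipulation below) can show the UTF-8 form:

local strThorn = utf8.char(0xDE)                        -- encode code point U+00DE as UTF-8
print(#strThorn)                                        -- 2 bytes for this one character
print(string.format("%02X %02X", strThorn:byte(1, 2)))  -- C3 9E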

UTF-16

UTF-16 encodes all characters in either 2 or 4 bytes. It is the only popular character encoding that is incompatible with ASCII (e.g. the English character X coded in ASCII is NOT the same code as the character X coded in UTF-16). It comes in two variants, depending on the underlying architecture or endianness of the computer system involved: UTF-16LE and UTF-16BE. Windows uses UTF-16LE, with a BOM (byte order mark) 0xFF 0xFE at the start of a character stream to signify which variant is in use.

Microsoft was an early adopter of Unicode. They used UTF-16 widely within their operating and filing systems, and sometimes for plain text and word processing files, and recommended it to application developers. (A word of caution: when reading Microsoft library documentation in your idle moments, any reference you find to ‘Unicode’ should almost always be interpreted as a reference to ‘UTF-16LE’).

Almost all of the computing and Internet world adopted UTF-8, not UTF-16, and in 2019 Microsoft changed their stance and started recommending UTF-8 to application developers; however, changing all their existing system code to use UTF-8 would be a huge endeavour, so many Microsoft libraries etc. still assume UTF-16 (or Windows Code Pages/’ANSI’).

UTF-8

UTF-8 is now the dominant character encoding in computing and the Internet. It encodes all characters in 1, 2, 3 or 4 bytes. The first 128 characters of UTF-8 correspond one-to-one with ASCII, so that valid ASCII text is also valid UTF-8 text. UTF-8 does not require a BOM, nor is one recommended, but one is sometimes added at the start of a character stream in the form 0xEF 0xBB 0xBF. (If a BOM is added, it stops the text from being backwards compatible with ASCII.) Where ƒh uses UTF-8 it often includes a BOM.
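
If you need to deal with a UTF-8 BOM yourself (for example on text read from a file), a minimal helper like the sketch below removes it; it assumes the text has already been read into a Lua string.

local function stripUtf8Bom(strText)
  if strText:sub(1, 3) == "\239\187\191" then  -- the UTF-8 BOM bytes EF BB BF
    return strText:sub(4)
  end
  return strText
end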

Where Might Encoding Issues Occur?

Encoding issues may occur anywhere that a plugin interacts with external entities/resources, or within the plugin when manipulating strings.

Most issues can be avoided by adopting UTF-8 wherever possible, and the steps to do this are detailed in Recommendations below, following detailed explanations of the issues involved.

Accessing Project Data

The following encodings are supported for project GEDCOM files and plugins in ƒh7.

GEDCOM Default        GEDCOM Alternative        Plugin Preferred        Plugin Alternative
UTF-8                 UTF-16                    UTF-8                   ‘ANSI’

If you access project data via the ƒh API, you are insulated from the GEDCOM encoding. Strings on the ƒh API are passed in the ‘current string encoding’, which defaults to the encoding of the plugin file. ƒh handles the conversion, if any, between the current string encoding and the GEDCOM file encoding.

The current string encoding can be changed via the fhSetStringEncoding function, or interrogated using the fhGetStringEncoding function.

Note: if the current string encoding is ‘ANSI’ and the GEDCOM file is encoded in UTF-8 or UTF-16, there is a possibility that text will be corrupted or lost, because not all UTF characters are supported in ‘ANSI’. ƒh will set a ‘Conversion Loss’ flag to tell you that data has been lost. You can interrogate it using the fhIsConversionLossFlagSet function, but encoding your plugins as UTF-8 or setting the current string encoding to UTF-8 avoids this issue.

See the Plugin Help file for more on this subject.
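
The sketch below shows these calls together; it uses only the functions named above, and assumes that fhGetStringEncoding returns the current encoding as a string and that fhIsConversionLossFlagSet returns a boolean.

print(fhGetStringEncoding())  -- the encoding inherited from the plugin file
fhSetStringEncoding("UTF-8")  -- switch the current string encoding to UTF-8

-- If the current string encoding were left as 'ANSI', check for lost
-- characters after reading project data:
if fhIsConversionLossFlagSet() then
  -- some project text could not be represented in 'ANSI'
end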

RECOMMENDATION 1: Use UTF-8 for your plugin encoding and current string encoding.

String Manipulation

Handling UTF-8 strings

If you follow the recommendations in this article for accessing project data, handling files and gathering user input, any string manipulation you do within a plugin will be operating on UTF-8 text.

However, the Lua string library treats a string as simply a sequence of (single) bytes, which works for ASCII and ‘ANSI’ but not for UTF-8. There are some limited techniques for manipulating UTF-8 using the string library, outlined at “Unicode UTF-8 Encoding” in Understanding Lua Patterns, but a better solution is to use two utf8 libraries:

  • The Lua utf8 library within Lua 5.3 provides a basic set of functions for handling UTF-8 strings.
  • The additional utf8 library installed with ƒh7 can be used to perform all standard string operations on UTF-8 strings.

You can overlay the additional utf8 library on the string library (and invoke its functions as e.g. string.find) or load it alongside the string library (and invoke its functions as e.g. utf8.find). Be aware that overlaying it on the string library can cause performance problems and unexpected behaviour in third-party libraries that rely on the string library, so test thoroughly if you take this option.
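
As a quick illustration, assuming the additional utf8 library has been loaded alongside the string library (as in the code block at the end of this article) so that it mirrors the standard string functions:

local strTown = "Łódź"            -- 4 characters, 7 bytes in UTF-8
print(string.len(strTown))        -- 7: the string library counts bytes
print(utf8.len(strTown))          -- 4: the utf8 library counts characters
print(string.sub(strTown, 1, 1))  -- a lone byte, not a valid character
print(utf8.sub(strTown, 1, 1))    -- Ł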

RECOMMENDATION 2: Use the inbuilt Lua utf8 library and the additional utf8 library to manipulate strings.

String Handling in Other Libraries

The Penlight third-party library includes many modules that rely on the Lua string library and assume that ‘ANSI’ is the encoding in use, so you should test any Penlight functions you intend to use with UTF-8 strings. This applies to the following commonly used Penlight modules:

  • stringx
  • text
  • utils

but may affect other Penlight modules as well.

The fhUtils library also relies on Penlight and the Lua string library for string handling.

User Interface

The IUP library (for constructing user interfaces) works by default in ‘ANSI’, but can be configured to use UTF-8:

iup.SetGlobal("UTF8MODE", "YES")

RECOMMENDATION 3: Configure IUP to use UTF-8 on the user interface.

File Access and Manipulation

File Access Problems

There are a lot of options for file handling in Lua and the libraries made available with ƒh, which work to a greater or lesser extent with the Windows Filing System.

The Windows Filing System (except on very old systems) stores file and folder names in UTF-16.

The Lua io and os libraries expect file paths and names in the local code page (‘ANSI’).

So, it is possible to encounter a file path that the io and os libraries can’t deal with, because it includes UTF-16 characters that can’t be converted (by Windows) into the local code page. If this happens, the io and os libraries will fail to find the file. The Lua File System (lfs) library will encounter the same problem, as will file operations in other third-party libraries that depend on lfs, io or os, such as Penlight.

The ƒh API functions fhLoadTextFile and fhSaveTextFile will read and write UTF-8, UTF-16 and ‘ANSI’ encoded text files, and handle all characters in the file path. A plugin receives the content of any text file it reads encoded in the current string encoding (ideally UTF-8, to avoid conversion loss). However:

  • in the event of failure, these functions don’t return error messages explaining what went wrong
  • fhSaveTextFile will overwrite an existing file with no warning
  • fhLoadTextFile makes no checks that the encoding specified for the file matches the text string involved, so can return garbage (without any indication that it has done so)

The fhFileUtils library supplied with ƒh contains a variety of functions for accessing and manipulating (e.g. copy, move, delete, rename) files that will handle all characters in the file path. It uses fhLoadTextFile and fhSaveTextFile for accessing text files, but supplements them with additional error handling. However, it will still return garbage when reading a file that does not match the encoding specified.
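
A minimal sketch of a fhSaveTextFile / fhLoadTextFile round trip is shown below. The file path is illustrative only, and it is assumed that fhLoadTextFile takes a path and an encoding and returns the text (or nil on failure); fhSaveTextFile matches the call shown later in this article.

local strPath = [[D:\Documents\Überschrift.txt]]             -- illustrative path only
fhSaveTextFile(strPath, "Zürich, Łódź, København", "UTF-8")  -- overwrites without warning
local strText = fhLoadTextFile(strPath, "UTF-8")             -- assumed (path, encoding) signature
if not strText then
  -- no error details are available; all you know is that the read failed
end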

RECOMMENDATION 4: Use fhLoadTextFile and fhSaveTextFile if you simply need to read and/or write text files stored in UTF-8, UTF-16 or ‘ANSI’ encoding, and you are happy with their lack of error reporting.

RECOMMENDATION 5: Use the fhFileUtils library for file access and manipulation if your needs are more complex.

Determining a File’s Character Encoding

If you don’t know the encoding of a file, read it as a binary file (using fileGetContents from the fhFileUtils library), and use the snippet Determine the Character Encoding of a File. You can then use the ƒh API or fhFileUtils to read it as text in the correct format.
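
A simplified sketch of that approach, checking BOMs only, is shown below; files without a BOM need the fuller tests in the linked snippet. It assumes fileGetContents returns the raw bytes of the file as a Lua string, accessed via the table returned by require.

local fhfu = require("fhFileUtils")

local function guessEncoding(strPath)
  local strRaw = fhfu.fileGetContents(strPath) or ""
  if strRaw:sub(1, 3) == "\239\187\191" then   -- UTF-8 BOM: EF BB BF
    return "UTF-8"
  elseif strRaw:sub(1, 2) == "\255\254" then   -- UTF-16LE BOM: FF FE
    return "UTF-16LE"
  end
  return "ANSI"                                -- no BOM: assume the local code page
end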

File Handling Within IUP and Associated Libraries CD and IM

IUP supports ‘ANSI’ paths by default in iup.filedlg (a file selection dialog), and the DROPFILE_CB callback (supporting file drag-and-drop onto an IUP dialog).

To support UTF-8 file paths in these contexts:

iup.SetGlobal("UTF8MODE_FILE", "YES")

However, this will not allow the CD or IM libraries (supporting graphics and digital images) to read or save files with Unicode characters in their path. If you can’t be certain that a file path doesn’t contain Unicode characters you should create a temporary file copy with an ANSI-encoded path (using os.tmpname) while working with those libraries, and copy the results to the target file when done.
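
The sketch below shows one way to do that when loading an image with IM. It assumes fileGetContents (mentioned above) returns the file’s raw bytes; the temporary copy is written with plain io, which is safe because os.tmpname returns an ‘ANSI’ path.

local fhfu = require("fhFileUtils")
require("imlua")

local function loadImageViaTempCopy(strSourcePath)
  local strRaw = fhfu.fileGetContents(strSourcePath)  -- copes with Unicode in the path
  local strTempPath = os.tmpname()                    -- 'ANSI'-safe temporary file name
  local f = assert(io.open(strTempPath, "wb"))
  f:write(strRaw)
  f:close()
  local image, iError = im.FileImageLoad(strTempPath)
  os.remove(strTempPath)
  return image, iError
end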

RECOMMENDATION 6: Configure IUP to use UTF-8 file names.

Accessing Configuration Data

ƒh Configuration Data

Much ƒh configuration data is available via the ƒh API. However, there are some items that are not supported in the API but are occasionally used by plugin authors. The relevant encodings for Version 7 are:

Data                  Encoding
Autotext              UTF-8 with BOM
Fact Sets             UTF-16LE
Queries               UTF-16LE
Source Templates      UTF-8 with BOM

This is not an exhaustive list of ƒh configuration data.

Follow the advice in File Access and Manipulation to access these files, and in String Manipulation to handle their contents.

Plugin Configuration Data

Many authors manipulate plugin configuration data directly as simple text; in that case, it is recommended that the configuration data be stored in UTF-8 encoded files.

However, fhGetIniFileValue and fhSetIniFileValue support the commonly used INI file format, and cope with non-‘ANSI’ characters in the file path (albeit with a lack of error messages when they fail), so they are a simpler option. If you wish to store Unicode values in the INI file, you must create it as an empty UTF-16LE text file before using it, using fhSaveTextFile(strFilePath, "", "UTF-16LE").

The Penlight config library does not cope with non-‘ANSI’ file paths.
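
A minimal sketch of this approach is shown below. The file path is illustrative only, and the (file, section, key, value / default) parameter order for fhSetIniFileValue and fhGetIniFileValue is an assumption; check the Plugin Help file for the exact signatures.

local strIniPath = [[D:\Documents\MyPlugin.ini]]  -- illustrative path only

-- First run only: create the file as an empty UTF-16LE text file so that
-- Unicode values can be stored.
fhSaveTextFile(strIniPath, "", "UTF-16LE")

-- Parameter order (file, section, key, value / default) is assumed here.
fhSetIniFileValue(strIniPath, "Options", "LastPerson", "Jürgen Großmann")
local strLastPerson = fhGetIniFileValue(strIniPath, "Options", "LastPerson", "")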

RECOMMENDATION 7: Use fhGetIniFileValue and fhSetIniFileValue to manipulate Plugin Configuration Data for new plugins. Initialise the Ini file as an empty UTF-16 file using fhSaveTextFile(strFilePath,"", "UTF-16LE").

Running System Commands

The Lua os library (e.g. os.getenv, os.execute) supports ‘ANSI’, as does io.popen. If a call to these functions involves UTF-8 or UTF-16 strings, it may fail (depending on the characteristics of the command being executed).

So, for example, this will fail:

bOK, strResult, iResult = os.execute("D:\\Documents\\цонцлудатуряуе.html")

The fhShellExecute function handles UTF-8 characters, so this will work, and returns similar information to os.execute:

bOK, iErrorCode, strErrorText = fhShellExecute("D:\\Documents\\цонцлудатуряуе.html")

RECOMMENDATION 8: Use fhShellExecute to execute system commands.

Debugging

The print function only supports ‘ANSI’ (and the ASCII-compatible subset of UTF-8), so you can’t rely on it for debugging if you need to check the value of a string in another encoding. Try placing a breakpoint and inspecting the string variable instead.
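
Alternatively, a small helper that prints the raw bytes in hexadecimal sidesteps the problem, because its output is pure ASCII:

local function hexDump(strText)
  return (strText:gsub(".", function(c) return string.format("%02X ", c:byte()) end))
end

print(hexDump("Łódź"))  -- C5 81 C3 B3 64 C5 BA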

Recommendations Summarised

RECOMMENDATION 1: Use UTF-8 for your plugin encoding and current string encoding.

RECOMMENDATION 2: Use the inbuilt Lua utf8 library and the additional utf8 library to manipulate strings.

RECOMMENDATION 3: Configure IUP to use UTF-8 on the user interface.

RECOMMENDATION 4: Use fhLoadTextFile and fhSaveTextFile if you simply need to read and/or write text files stored in UTF-8, UTF-16 or ‘ANSI’ encoding, and you are happy with their lack of error reporting.

RECOMMENDATION 5: Use the fhFileUtils library for file access and manipulation if your needs are more complex.

RECOMMENDATION 6: Configure IUP to use UTF-8 file names.

RECOMMENDATION 7: Use fhGetIniFileValue and fhSetIniFileValue to manipulate Plugin Configuration Data for new plugins. Initialise the Ini file as an empty UTF-16 file using fhSaveTextFile(strFilePath,"", "UTF-16LE").

RECOMMENDATION 8: Use fhShellExecute to execute system commands.

Implementing the Recommendations

This block of code, executed at the beginning of a plugin, will either implement the recommendations above or lay the groundwork for implementing them.

 

fhSetStringEncoding("UTF-8") --Use UTF-8 as current string encoding.

--utf8 library for string manipulation

require ( "utf8data" )
utf8 = require ( ".utf8" )
utf8.config["conversion"] = { uc_lc = utf8_uc_lc; lc_uc = utf8_lc_uc; }
utf8:init()

--optional libraries (omit any you're not intending to use)

require ('iuplua') 
fh = require('fhUtils')
fh.setIupDefaults() --initialise fhUtils
iup.SetGlobal("CUSTOMQUITMESSAGE","YES") --initialise IUP
iup.SetGlobal("UTF8MODE", "YES") -- Configure IUP to use UTF-8 on the user interface.
iup.SetGlobal("UTF8MODE_FILE", "YES")  --Configure IUP to use UTF-8 file names if you intend to use IUP file selection dialog
fhfu = require('fhFileUtils') 

 

 

Last update: 23 Aug 2022