Understanding Lua Patterns

  • Skill Level: Advanced and Intermediate
  • FH versions: V5, V6, and V7
  • In Topics: Writing plugins 

Introduction

A pattern is a formula for matching text strings that conform to some definable criteria. (Experienced coders should remember that Lua patterns are similar, but not identical, to regular expressions.) Optionally, a pattern can capture components of the matched string, which can then be used to compose a new text string.

For example, the pattern ABC%d%d%d (where %d matches any digit) will match the letters ABC followed by any three digits, such as ABC142 or ABC897 or ABC603, but will NOT match ABC93X.
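As a minimal sketch of the example above, using standard Lua's string.match (the sample strings are the ones mentioned in the text, plus one hypothetical longer string):

```lua
-- The pattern from the text: literal ABC followed by exactly three digits.
local pattern = "ABC%d%d%d"

local m1 = string.match("ABC142", pattern)
local m2 = string.match("ABC93X", pattern)

-- The pattern is not anchored, so it also matches inside a longer string.
local m3 = string.match("xxABC1429", pattern)

print(m1)  -- ABC142
print(m2)  -- nil
print(m3)  -- ABC142
```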

So patterns comprise regular literal text characters that match themselves, such as ABC above, and magic or meta characters that have a special meaning, such as %d above. The following sections focus on the meanings of those magic characters.

Pattern Match Functions

Lua has some string functions that use patterns. Among them are:

string.find(string, pattern)        finds the first instance of pattern in string and returns its start and end positions
string.gmatch(string, pattern)      returns an iterator that, each time it is called, returns the next instance of pattern in string
string.gsub(string, pattern, repl)  returns a string where all instances of pattern in string have been replaced with repl
string.match(string, pattern)       returns the first instance of pattern in string

Some of these functions have additional optional parameters, and behave slightly differently if the pattern defines Captures. For details, see the Lua reference manual (online) for the version that you’re using (the links are in Useful External Links below).
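The four functions can be compared side by side in a short sketch. The sample text and variable names are invented for illustration, and none of the patterns here uses Captures:

```lua
local text = "Born 1850, died 1921"

-- string.find returns the start and end positions of the first match.
local first, last = string.find(text, "%d+")

-- string.match returns the first matching substring.
local year = string.match(text, "%d+")

-- string.gmatch returns an iterator over each successive match.
local years = {}
for y in string.gmatch(text, "%d+") do
  years[#years + 1] = y
end

-- string.gsub returns the new string and the number of replacements made.
local masked, count = string.gsub(text, "%d+", "####")

print(first, last)               -- 6    9
print(year)                      -- 1850
print(table.concat(years, ","))  -- 1850,1921
print(masked, count)             -- Born ####, died ####    2
```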

Constructing a Pattern

So, how do we construct a pattern?

At the foundation of patterns are Character Classes that represent sets of characters. Each character class matches just one character that conforms to that class. For example, the character class %d matches any one digit 0-9, whereas the character class A matches the single letter A.

More complex structures are Pattern Items that represent combinations of characters. For example, the pattern item %d+ matches a sequence of one or more digits, and A+ matches a sequence of one or more A’s.

Finally, a Pattern is a sequence of Pattern Items that may contain Captures of internal components, whose purpose is explained later.

Remember the magic characters % . [ ] ^ $ ( ) * + - ? have special contextual meanings, so usually don’t match themselves.

Character Classes

Lua character classes are:

Class   Meaning
Y       represents the character Y itself as long as it is not a magic character
.       represents any single character
%a      represents all letters A-Z and a-z
%c      represents all control characters such as Null, Tab, Carr.Return, Linefeed, Delete, etc
%d      represents all digits 0-9
%g      represents all printable characters except space (not Lua 5.1)
%l      represents all lowercase letters a-z
%p      represents all punctuation characters or symbols such as . , ? ! : ; @ [ ] _ { } ~
%s      represents all white space characters such as Tab, Carr.Return, Linefeed, Space, etc
%u      represents all uppercase letters A-Z
%w      represents all alphanumeric characters A-Z and a-z and 0-9
%x      represents all hexadecimal digits 0-9 and A-F and a-f
%z      represents the character with code \000, because embedded zeroes in a pattern do not work (ƒh6/Lua 5.1)
\0      represents the character with code \000; embedded zeroes in a pattern do work from Lua 5.2 onwards, so %z is no longer needed (ƒh7/Lua 5.3)
        The upper case letter versions of the above reverse their meaning
        i.e. %A represents all non-letters and %D represents all non-digits
%Y      represents the character Y if it is any non-alphanumeric character
        This is the standard way to get a magic character to match itself
        Any punctuation character (even a non magic one) preceded by a % represents itself
        e.g. %% represents % percent and %+ represents + plus
[set]   represents the class which is the union of all characters in the set
        A range of characters is specified by separating first and last character of range with a - hyphen e.g. 1-5
        All classes described above may also be used as components in the set
        e.g. [%w~] (or [~%w]) represents all alphanumeric characters plus the ~ tilde
[^set]  represents the complement of set, where set is interpreted as above
        e.g. [^A-Z] represents any character except upper case letters

Many patterns are simply made up of a sequence of these classes.
For example, to match the string ### Abc (where # designates a digit), you would use the pattern %d%d%d Abc.

Note: classes %a to %x and %A to %X above aren’t strictly necessary, being just shorthand for certain [set] or [^set] classes.
e.g. %a is [A-Za-z] , %d is [0-9] , %l is [a-z] , %L is [^a-z] , %U is [^A-Z] , and so on.
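
A few of the classes above can be tried out in a short sketch. The sample strings and variable names are invented for illustration:

```lua
-- A set combining a class and a literal: alphanumerics plus the ~ tilde.
local token = string.match("abc~123! rest", "[%w~]+")

-- A complemented set: everything up to the first upper case letter.
local lead = string.match("john SMITH", "[^A-Z]+")

-- An upper case class is the complement: %D matches any non-digit.
-- Only the first result of gsub (the new string) is kept here.
local digits = string.gsub("Tel: 01-234", "%D", "")

-- A magic character needs a % prefix to match itself.
local percent = string.match("Cost (approx.) 50%", "%d+%%")

-- An unescaped . matches any character; %. matches only a full stop.
local any_char = string.match("axb", "a.b")
local full_stop = string.match("axb", "a%.b")

print(token)      -- abc~123
print(lead)       -- "john " (including the trailing space)
print(digits)     -- 01234
print(percent)    -- 50%
print(any_char)   -- axb
print(full_stop)  -- nil
```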

Pattern Items

By adding modifiers, you can extend the meaning of those Character Classes represented by class below.

Item    Meaning
class   matches any single character in the class
class*  matches 0 or more repetitions of class and will always match the longest possible chain
class+  matches 1 or more repetitions of class and will always match the longest possible chain
class-  matches 0 or more repetitions of class and will always match the shortest possible chain
class?  matches 0 or 1 occurrence of class

For example, %d+ will match a string of digits.

There are also two special pattern items.

Item    Meaning
%N      for N between 1 and 9 matches the Nth captured substring (see Captures below)
%bXY    matches a substring that starts with X, ends with Y, and contains the same number of X as Y, where X and Y are two different characters
        For example, %b() will match balanced nested pairs of ( ) parentheses
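
The difference between the modifiers is easiest to see with a small sketch. The sample strings are invented for illustration:

```lua
local s = "<b>bold</b> and <i>italic</i>"

-- * takes the longest possible chain: from the first < to the last >.
local greedy = string.match(s, "<.*>")

-- - takes the shortest possible chain: from the first < to the next >.
local lazy = string.match(s, "<.->")

-- ? makes the preceding class optional.
local uk = string.match("colour", "colou?r")
local us = string.match("color", "colou?r")

-- * can match zero characters, so this succeeds with an empty match
-- at position 1 (the end position is one less than the start).
local s0, e0 = string.find("abc", "%d*")

-- %b() matches from an opening parenthesis to its balancing partner.
local balanced = string.match("f(a(b)c) + g(d)", "%b()")

print(greedy == s)  -- true
print(lazy)         -- <b>
print(uk, us)       -- colour    color
print(s0, e0)       -- 1    0
print(balanced)     -- (a(b)c)
```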

Pattern

A Pattern is a sequence of Pattern Items that can be anchored to the start and/or end of a string.

Anchor  Meaning
^       when at the beginning of a pattern, forces the pattern to match the start of a string
$       when at the end of a pattern, forces the pattern to match the end of a string
        When ^ or $ is elsewhere in a pattern (except in [^set]), it has no magic meaning and represents itself

For example, ^A%d+Z$ will match a string starting with A, ending with Z, with only digits in between.
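
A minimal sketch of the effect of anchors, using the pattern from the example above with some invented test strings:

```lua
local anchored = "^A%d+Z$"

local whole = string.match("A123Z", anchored)     -- whole string conforms
local prefixed = string.match("xA123Z", anchored) -- fails: does not start with A
local mixed = string.match("A12B3Z", anchored)    -- fails: a non-digit in between

-- Without anchors the same items match anywhere inside the string.
local inside = string.match("xA123Zy", "A%d+Z")

-- A ^ that is not at the beginning of the pattern is an ordinary character.
local caret = string.match("2^10", "%d^%d+")

print(whole)     -- A123Z
print(prefixed)  -- nil
print(mixed)     -- nil
print(inside)    -- A123Z
print(caret)     -- 2^10
```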

Captures

A Pattern may contain sub-patterns enclosed in ( ) parentheses. The text matched by each sub-pattern is captured, and the captures are numbered in the order of their opening parentheses. They can be referred to using the %N item defined above. The empty capture () captures the string position where it matches (a number) instead of text.

For example, ^A(%d+)Z$ will capture just the digits from any matching string, and subsequently %1 represents those captured digits.

So the function string.gsub(S, "^A(%d+)Z$", "%1") will return just the digits from string S as long as it matches the pattern, otherwise it returns the original string S unaltered.
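
The following sketch illustrates the gsub call above together with a few other uses of captures. The sample strings and variable names are invented for illustration:

```lua
-- gsub with a capture: keep only the digits when the whole string conforms.
local digits, n = string.gsub("A2024Z", "^A(%d+)Z$", "%1")

-- No match: the original string comes back unaltered, with 0 replacements.
local same, n0 = string.gsub("B2024Z", "^A(%d+)Z$", "%1")

-- When a pattern has captures, string.match returns them
-- instead of the whole match.
local day, month, year = string.match("21 March 1765", "^(%d+) (%a+) (%d+)$")

-- The empty capture () returns a position (a number), here that of the comma.
local comma_pos = string.match("Smith, John", "(),")

-- %1 inside a pattern matches the same text as the first capture,
-- here the closing quote that pairs with the opening one.
local quote, inner = string.match('He said "hi" there', "([\"'])(.-)%1")

print(digits, n)         -- 2024    1
print(same, n0)          -- B2024Z    0
print(day, month, year)  -- 21    March    1765
print(comma_pos)         -- 6
print(inner)             -- hi
```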

Examples

Here are a few basic examples of Lua Patterns that match Family Historian text.

Lua Pattern                           Matching Text
^Q[1-4] %d%d%d%d$                     Quarter date e.g. Q2 1987 or Q4 1876
^[123]?%d %u%l+ %d%d%d%d$             Simple date e.g. 21 March 1765 or 8 June 1678
^[1-3]?%d %u%l+ %d%d%d%d %(%l%l%l%)$  Qualified date e.g. 19 September 1765 (est)
                                      Note how magic ( ) parentheses need % prefix
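
These date patterns can be checked with a short sketch. The classify helper and the table of names are hypothetical, not part of Family Historian. The simple and qualified date patterns are written here with a plain $ end anchor and a plain space before %(, which is an assumption about the intended patterns:

```lua
local patterns = {
  quarter   = "^Q[1-4] %d%d%d%d$",
  simple    = "^[123]?%d %u%l+ %d%d%d%d$",
  qualified = "^[1-3]?%d %u%l+ %d%d%d%d %(%l%l%l%)$",
}

-- Fixed order in which the patterns are tried.
local order = { "quarter", "simple", "qualified" }

-- Hypothetical helper: returns the name of the first pattern that
-- matches the date text, or nil when none of them matches.
local function classify(date)
  for _, name in ipairs(order) do
    if string.find(date, patterns[name]) then
      return name
    end
  end
  return nil
end

print(classify("Q2 1987"))                  -- quarter
print(classify("8 June 1678"))              -- simple
print(classify("19 September 1765 (est)"))  -- qualified
print(classify("Q5 1987"))                  -- nil
```

Note that these patterns only check the shape of the text, so an impossible date such as 39 March 1765 is still classified as a simple date.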

The following Code Snippets use patterns, as does the Search and Replace plugin.

Unicode UTF-8 Encoding

Note: In ƒh6 and ƒh7 the native utf8 library is available (or can be implemented) to help in this area, and can be supplemented with an additional utf8 library to access extra functions; see Lua references and Library Modules for instructions on how to include them. If you choose not to use these libraries, read on.

Lua and its pattern matching are designed to work with ANSI encoded strings and do not recognise the special needs of multi-byte UTF-8 encoding. Many of the UTF-8 bytes are treated as members of the character classes described above, with highly disruptive effects. So UTF-8 compatible ASCII substitutes must be employed to avoid byte codes above \127.

Class  UTF-8 Replacement                     Description
%a     [A-Za-z]                              represents all unaccented letters
%c     [%z\001-\031\127] or [\000-\031\127]  represents all control characters
%d     [0-9]                                 represents all digits
%l     [a-z]                                 represents all unaccented lowercase letters
%p     [!-/:-@%[\\%]^_`{|}~]                 for all punctuation i.e. [\033-\047\058-\064\091-\096\123-\126]
%s     [\t-\r ]                              represents all white space characters i.e. [\009-\013\032]
%u     [A-Z]                                 represents all unaccented uppercase letters
%w     [A-Za-z0-9]                           represents all unaccented alphanumeric characters
%x     [0-9A-Fa-f]                           but %x is OK to represent all hexadecimal digits
%z     %z or \0                              OK to represent the character with code \000

The inverse classes using capital letters need similar treatment and can be replaced by the ASCII versions above but with a leading ^ e.g. %A becomes [^A-Za-z] to represent all except unaccented letters. Note that none of the above techniques take account of UTF-8 accented letters or symbols, which being multi-byte codes, are difficult to define as patterns.
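
A minimal sketch of how the ASCII replacement classes behave on UTF-8 text. The sample name is invented, and its accented letters are written as decimal byte escapes so that the example does not depend on the encoding of the script file:

```lua
-- "Zoë Müller" in UTF-8: ë is bytes \195\171 and ü is bytes \195\188.
local name = "Zo\195\171 M\195\188ller"

-- The ASCII-only set stops at the first byte of an accented letter.
local first_run = string.match(name, "[A-Za-z]+")

-- Collect every run of unaccented letters.
local runs = {}
for run in string.gmatch(name, "[A-Za-z]+") do
  runs[#runs + 1] = run
end

-- Count the bytes above \127 that belong to multi-byte characters.
local _, high_bytes = string.gsub(name, "[\128-\255]", "")

print(first_run)                 -- Zo
print(table.concat(runs, "|"))   -- Zo|M|ller
print(#name, high_bytes)         -- 12    4
```

The name has 10 characters but 12 bytes, which shows why the accented letters cannot be matched by a single-byte character class.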

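(%W) is often used to detect non-alphanumeric characters so that they can be prefixed with % to hide any magic pattern symbols, but it can upset UTF-8 text because it may also match the individual bytes of multi-byte characters. ([%^%$%(%)%%%.%[%]%*%+%-%?]) is a safe ASCII substitute.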
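
As a sketch of how that safe substitute might be used, the hypothetical helper escape_magic below (the name is an assumption, not a Family Historian or Lua function) prefixes each magic character with % so that arbitrary text can be searched for literally:

```lua
-- Hypothetical helper: prefix every magic character with %.
-- The outer parentheses keep only gsub's first result (the new string)
-- and drop the replacement count.
local function escape_magic(text)
  return (string.gsub(text, "([%^%$%(%)%%%.%[%]%*%+%-%?])", "%%%1"))
end

local literal = "Smith (b. 1850?)"
local safe = escape_magic(literal)

-- The escaped text can now be used as a pattern that matches itself.
local found = string.find("John Smith (b. 1850?) of Kent", safe)

-- Bytes above \127 are left alone, so UTF-8 text is not disturbed.
local accented = escape_magic("Zo\195\171")

print(safe)   -- Smith %(b%. 1850%?%)
print(found)  -- 6
```

When no pattern features are needed at all, standard Lua's string.find(s, text, 1, true) performs a plain search with magic characters switched off, which avoids escaping altogether.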

Last update: 12 Apr 2023