* Regular Expressions and Lua Patterns

User avatar
tatewise
Megastar
Posts: 28341
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK
Contact:

Regular Expressions and Lua Patterns

Post by tatewise »

DavidNewton posted the following in response to AdrianBruce in Family Tree View (15646), but unfortunately it got deleted.

The most used source citations in my file are the GRO Indexes. The Where in Source generally contains five elements: year, quarter date, registration district, volume number and page number. Over the years I have put these elements in various orders according to my thinking at the time. I now want to change all of them into the order: Year, Quarter, District, Vol, Page, and the Lua patterns enable me to extract this data and reformat it.

I'll give a specific example (I have no idea if the Index is genuine). Let wis be the Where In Source string:
wis = '1857, Basford, Q3, v1D, p123'
To extract the individual pieces of data, the data captured is between ()
Year=string.match(wis,'(%d%d%d%d)') -- pattern precisely 4 digits
Qdate=string.match(wis,'(Q[1,2,3,4])') -- pattern Qx where x is 1,2,3 or 4.
Dist=string.match(wis,'%s*([%a%s]+)') -- pattern the (first) longest string of letters and spaces removing any beginning spaces
Vol=string.match(wis,'(v[%w]+)') -- match pattern v followed by the longest string of alphanumerics
Page=string.match(wis,'(p%d+)') -- match pattern p followed by the longest string of digits

This is just an example to show simple patterns; it is not intended to be a general method of reformatting GRO Index data, but it does give an idea of the bulk editing that is possible using Lua.

David

Comments by Mike Tate:
The (Q[1,2,3,4]) format is incorrect and should be (Q[1234]) or (Q[1-4]) to avoid allowing a comma in the set.
(v[%w]+) might capture, say, 'verton' from Tiverton in a District name instead of the Volume, so ', (v[%w]+)' with a leading comma and space is better.
(p%d+) might capture, say, 'p23' from Camp23 in a District name instead of the Page, so ', (p%d+)' with a leading comma and space is better.
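
For anyone wanting to try these corrections, here is a minimal sketch in plain Lua that applies them to David's example string and rebuilds it in the order Year, Quarter, District, Vol, Page. The second string with Tiverton is made up purely to show the problem, and the variable names reordered, loose and strict are illustrative only.

```lua
-- David's example Where In Source string
local wis = '1857, Basford, Q3, v1D, p123'

local Year  = string.match(wis, '(%d%d%d%d)')    -- exactly 4 digits
local Qdate = string.match(wis, '(Q[1-4])')      -- Q followed by 1, 2, 3 or 4
local Dist  = string.match(wis, '%s*([%a%s]+)')  -- first run of letters and spaces
local Vol   = string.match(wis, ', (v[%w]+)')    -- v... only after comma and space
local Page  = string.match(wis, ', (p%d+)')      -- p... only after comma and space

-- Rebuild in the order Year, Quarter, District, Vol, Page
local reordered = table.concat({ Year, Qdate, Dist, Vol, Page }, ', ')
print(reordered)

-- Hypothetical entry whose District contains a lower case v
local wis2   = '1857, Tiverton, Q3, v5b, p123'
local loose  = string.match(wis2, '(v[%w]+)')    -- picks up part of Tiverton
local strict = string.match(wis2, ', (v[%w]+)')  -- picks up the Volume
print(loose, strict)
```

As the original post says, this is only a demonstration on one well-behaved string, not a general GRO Index reformatter.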
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Re: Regular Expressions and Lua Patterns

Post by tatewise »

May I respectfully say that David has jumped the gun :D

What he describes is a practical example, but a few preliminary concepts need to be introduced first.
They are covered in the Knowledge Base article Understanding Lua Patterns (plugins:understanding_lua_patterns), which is extracted from the Lua Reference Manual. The following tutorial, refined by feedback comments, could be added to that Knowledge Base advice.

The purpose of Regular Expressions and Lua Patterns is to recognise a text string that conforms to certain characteristics, and optionally reorganise its components to yield a new text string.

Single Character Matching

The expression pattern is often mostly made up of regular literal text characters that must match the text string.
So if the expression pattern is "ABC123" it will only match the text string "ABC123".
But what if we want to match ABC followed by any three digits?
Then the expression pattern is "ABC%d%d%d" where %d matches any digit.
That does NOT match the text string "ABC%d%d%d" because some characters such as % have a special meaning, and are known as magic or meta characters. The full list of magic characters is % [ ] . ^ $ * + - ? ( ) and each will be explained later.
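
A quick sketch of the point above, using the standard string.match function, which returns the matched text or nil. The variable names are illustrative only.

```lua
local pattern = "ABC%d%d%d"

local m1 = string.match("ABC123", pattern)     -- ABC followed by three digits
local m2 = string.match("ABC789", pattern)     -- any three digits will do
local m3 = string.match("ABC%d%d%d", pattern)  -- % and d are not digits
local m4 = string.match("ABC12", pattern)      -- only two digits

print(m1, m2, m3, m4)
```

Note that string.match searches anywhere in the text unless the pattern is anchored, so a longer string that contains ABC and three digits would also match.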

In the case of % the next character identifies what class of characters are matched.
e.g. %d matches digits 0-9 and %u matches all upper case letters A-Z.
See the Knowledge Base reference above for the full set.
But what if we want to explicitly match the literal character % or one of the other magic characters?
Then we use %% or % followed by any other magic character.
e.g. "A%%B%+C%?" will only match the text string "A%B+C?"
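
A small sketch of escaping magic characters with %. It also shows one alternative that is not part of the tutorial above: string.find accepts a fourth argument of true to treat the pattern as plain text, so no escaping is needed. The variable names are illustrative only.

```lua
local text = "A%B+C?"

-- Each magic character is escaped with a leading %
local escaped = string.match(text, "A%%B%+C%?")

-- Alternative: plain find (4th argument true) switches off all magic
local plusPos = string.find(text, "+", 1, true)

print(escaped, plusPos)
```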

But what about matching some other combination of characters such as just the digits 1 to 4 and underscore _?
Then the magic square brackets [ ] are used to enclose the required set of characters.
e.g. [1234_] matches any single character that is a 1 or 2 or 3 or 4 or _
A shortcut to defining a range of characters such as 1234 is to use the magic hyphen -
e.g. [1-4_] matches the same as above, and [0-9] is the same as %d and [A-Z] is the same as %u.
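
The sets and ranges above can be tried out with a few one-liners. This is just a sketch; the sample strings are made up.

```lua
local a = string.match("x3", "[1234_]")   -- first character that is 1-4 or _
local b = string.match("a_b", "[1-4_]")   -- the underscore is in the set too
local c = string.match("x7", "[1-4_]")    -- 7 is outside the range

-- [0-9] and %d pick out the same character here
local d1 = string.match("page 5", "[0-9]")
local d2 = string.match("page 5", "%d")

print(a, b, c, d1, d2)
```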

All the above can have their meaning inverted or complemented to match every character except what is specified.
e.g. %D matches non-digits, %U matches all but upper case letters, and [^1-4_] matches all but 1-4 and _
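
A short sketch of the complemented forms, each returning the first character that is NOT in the named class or set. The sample strings are made up.

```lua
local nonDigit = string.match("2024-AB", "%D")     -- first non-digit
local nonUpper = string.match("ABc", "%U")         -- first non-upper-case
local notInSet = string.match("12_x4", "[^1-4_]")  -- first char outside 1-4 and _

print(nonDigit, nonUpper, notInSet)
```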

There is one special magic character the dot . that matches any single character.
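
To illustrate the dot, and how it differs from an escaped literal dot %. here is a minimal sketch with made-up strings.

```lua
local any1 = string.match("A-C", "A.C")       -- dot matches the hyphen
local any2 = string.match("AC", "A.C")        -- dot must match one character

local lit1 = string.match("v1.2", "%d%.%d")   -- %. matches only a real dot
local lit2 = string.match("v1x2", "%d%.%d")   -- x is not a dot
local any3 = string.match("v1x2", "%d.%d")    -- unescaped dot accepts x

print(any1, any2, lit1, lit2, any3)
```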

Strictly speaking, in Lua all the above actually match one 8-bit byte. That is the same as one basic ANSI character, but special consideration is needed when handling multi-byte UTF-8 characters.
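
A small sketch of why that matters. The letter e-acute is written here as its two UTF-8 bytes using decimal escapes, so the example does not depend on how the script file itself is encoded.

```lua
-- UTF-8 encoding of e-acute: two bytes, 195 and 169
local eAcute = "\195\169"

local byteCount = #eAcute                       -- length counts bytes
local oneDot    = string.match(eAcute, "^.$")   -- one dot covers one byte only
local twoDots   = string.match(eAcute, "^..$")  -- two dots cover both bytes

print(byteCount, oneDot, twoDots == eAcute)
```

So a pattern that expects exactly one character may need to allow for more than one byte when the text is UTF-8.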

I will continue this tutorial later, and introduce matching repetitions of the same class of character, etc, but feedback so far is welcomed.

BTW: I have already made some improvements to the Knowledge Base article Understanding Lua Patterns (plugins:understanding_lua_patterns).
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
User avatar
Mark1834
Megastar
Posts: 2458
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Regular Expressions and Lua Patterns

Post by Mark1834 »

This is timely, as I was looking at Lua for updating my media records. I have a large number of undated media records (images of source records) imported from FTM, and I want to add dates to aid reporting and sorting. My media titles are very structured, so I was thinking of two different scenarios:

1. For a fixed date event, check the first part of the media title and set the date appropriately. For example, if the title starts with "Census - 1881 - ", then set the date as "3 APR 1881".

2. For a variable date event, such as a baptism, the title is in the form "Event type - year - name(s) - place". I would therefore want to extract the four digit year (the second of the fields delimited by dashes) and set the date to this value with any leading and trailing spaces deleted.

It would be relatively straightforward to do this in the GEDCOM file using the VBA techniques discussed in earlier postings, but I ought to learn how to do it "the FH way" :) .

I'm not asking how to do it - I'm happy to dive in and pick that up - but is Lua the appropriate technique to use for this? Any traps for the unwary?

Thanks.
Mark Draper

Re: Regular Expressions and Lua Patterns

Post by tatewise »

Yes, a Plugin written in Lua is the way to go.
Beware that setting Date fields requires the special Date methods to convert Text to Date format.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Re: Regular Expressions and Lua Patterns

Post by Mark1834 »

Thanks Mike, it was a few days before I was able to sit down and have a go at this, but it turned out to be quite a good first learning example!

I've now got a simple script that updates most of my media records perfectly, but fails where the title is more complex.

For a general media title, "Event - Year - Further Details", I want to extract the Event and Year sub-strings to use as variables in the script. Where Event is a simple one word descriptor, such as Baptism, Death, etc, the following works fine

Event, strYear = string.match(S, "(%a+) - %- (%d+) %- ")

The simple %a+ doesn't work for more complex events, so for example "Passenger List" returns just "List" and "Marriage (GRO)" returns nil for the value of Event. I can step through the string (analogous to the way I would do it in VBA), and this works correctly in all cases but is not very elegant:

i, j = string.find(S, " %- ")
Event = string.sub(S, 1, i - 1)
S = string.sub(S, j + 1)
i, j = string.find(S, " %- ")
strYear = string.sub(S, 1, i - 1)


No matter how I modify the first %a+ pattern, I can't get it to recognise letters, spaces and brackets correctly to give the desired result. Can it be done, or is it more a case of "well, I wouldn't actually start from here......"?

Thanks.
Mark Draper

Re: Regular Expressions and Lua Patterns

Post by tatewise »

That is most definitely feasible.
One way is to expand the set of characters matched by using the [set] structure.
e.g.
[%a%(%) ] will match any one letter %a, parenthesis %( & %), or space character.
[%w%p%s] will match any one alphanumeric %w, punctuation %p, or white space %s character.
[%w%p%s]+ will match a string of any such characters, so will include parentheses and spaces.

or even more simply, use the dot to match everything similar to your 'inelegant' code:
.+

It is also a good idea to anchor the start ^ and/or end $ of such patterns to the media title to ensure all leading text is included.

So the pattern becomes:

"^(.+) %- (%d+) %- "
or
"^(.+) %- (%d%d%d%d)" that more explicitly matches a 4 digit year and needs no %- terminator.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
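
To illustrate the anchored patterns suggested in the post above, here is a minimal sketch run against three made-up media titles. The titles and the helper name splitTitle are hypothetical; only the patterns come from the thread.

```lua
-- Hypothetical media titles in the form "Event - Year - Further Details"
local titles = {
  "Baptism - 1850 - John Smith - Leeds",
  "Passenger List - 1912 - Mary Smith - Liverpool",
  "Marriage (GRO) - 1875 - Smith and Jones - Crewe",
}

-- Anchored pattern: everything before " - " plus a 4 digit year
local function splitTitle(title)
  return string.match(title, "^(.+) %- (%d%d%d%d)")
end

for _, title in ipairs(titles) do
  local event, year = splitTitle(title)
  print(event, year)
end

-- The [set] alternative copes with letters, parentheses and spaces
local setEvent = string.match(titles[3], "^([%a%(%) ]+) %- (%d+) %- ")

-- For comparison, the unanchored %a+ only catches the last word
local looseEvent = string.match(titles[2], "(%a+) %- (%d+) %- ")
print(setEvent, looseEvent)
```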

Re: Regular Expressions and Lua Patterns

Post by Mark1834 »

Thanks Mike for the clear explanation. I'd been playing around with sets, but didn't get the syntax quite right! One criticism I would make of much of the patterns material I have read recently is that it gets too complicated too quickly. Lots of practice with simpler examples like this first!

My final code is as follows. There is enough in here for it to be a useful intro to Lua, but I'd be the first to admit it doesn't have nearly enough error trapping for a general script. It's fine for my purposes though where the records are well defined (mostly - running this has highlighted a few minor inconsistencies I hadn't noticed before :D).
[Attached image: Capture.PNG, a screenshot of the script]
Mark Draper

Re: Regular Expressions and Lua Patterns

Post by tatewise »

Mark, I have a few comments that you (and others) may find useful.

In terms of an introduction to Lua, did you see Tools > Plugins > How To Write Plugins > Sample Plugin Scripts?
Did you see the Knowledge Base article Getting Started Writing Plugins (plugins:getting_started), which refers to the above?

Why does your Lua Pattern "^(.+) - %- (%d+) %- " have that space hyphen in the middle?
Is it to absorb possible multiple spaces between Event and the space %- space delimiter?
If so, then Lua Pattern "^(.+) +%- (%d+) %- " works just the same.

Those record loops are a bit neater using while p:IsNotNull() do

When faced with multiple elseif structures then look for table lookup alternatives.
They have the advantage of only calculating the fhNewDate values once instead of for every record conversion.

Here you can have a lookup Table for each Event Year:
local Table = {
  Census1841 = fhNewDate(1841, 6, 6);
  Census1851 = fhNewDate(1851, 3, 30);

  -- ... etc ... for other years
  Census1911 = fhNewDate(1911, 4, 2);
  ["1939 Register"] = fhNewDate(1939, 9, 29);
}

( 1939 Register needs special format as it starts with digits. )

Then the conversion becomes:
local Event, Year = string.match( S, "^(.+) +%- (%d+) %- " )
and after checking both are not nil and creating date field:
local dtDate = fhNewDate(Year)
if Event == "1939 Register" then Year = "" end
dtDate = Table[Event .. Year] or dtDate

This last trick says if Table lookup fails i.e. returns nil, then use original dtDate value.
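
The lookup-with-fallback idea can be tried outside FH with a rough sketch like this. Big assumption: newDate is a hypothetical stand-in that just returns a text string, because fhNewDate only exists inside a Family Historian plugin; the names CensusDates and dateForTitle and the sample titles are also made up. Only the census dates quoted above are included.

```lua
-- Hypothetical stand-in for fhNewDate so the sketch runs in plain Lua
local function newDate(year, month, day)
  return string.format("%04d-%02d-%02d", year, month or 0, day or 0)
end

-- Dates are calculated once, when the table is built
local CensusDates = {
  Census1841 = newDate(1841, 6, 6),
  Census1851 = newDate(1851, 3, 30),
  Census1911 = newDate(1911, 4, 2),
  ["1939 Register"] = newDate(1939, 9, 29),
}

local function dateForTitle(title)
  local event, year = string.match(title, "^(.+) +%- (%d+) %- ")
  if not event then return nil end            -- title not in expected form
  local fallback = newDate(tonumber(year))    -- year-only date
  if event == "1939 Register" then year = "" end
  -- A failed lookup returns nil, so 'or' falls through to the fallback
  return CensusDates[event .. year] or fallback
end

print(dateForTitle("Census - 1851 - Smith household - Leeds"))
print(dateForTitle("Baptism - 1850 - John Smith - Leeds"))
print(dateForTitle("1939 Register - 1939 - Smith - Leeds"))
```

In a real plugin the stand-in would be replaced by fhNewDate and the result assigned to the Date field using the Date methods mentioned earlier in the thread.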
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Re: Regular Expressions and Lua Patterns

Post by Mark1834 »

Curious - simply a typo I hadn't spotted that didn't affect the end result on testing! I'll correct it before using on my live database though.

I like the idea of a table - I'm generally allergic to long lists of conditionals such as this and there had to be a better way, but tables and functions are my next lesson!

Between the free online first edition of Programming in Lua, the FH plug-in help file and the FHUG Knowledge Base there is certainly plenty to read. My main problem now is how to split my time between studying Lua, completing the migration from FTM to FH, and actually doing new research - particularly as the evenings start to get lighter... :D
Mark Draper

Re: Regular Expressions and Lua Patterns

Post by tatewise »

The typo is a space followed by the magic hyphen - which "matches 0 or more repetitions of class and will always match the shortest possible chain".
That is how the Knowledge Base section Pattern Items (plugins:understanding_lua_patterns#pattern_items) describes the magic hyphen.

So it matches 0 or more space characters, which can match nothing, and thus has no effect when there are no multiple spaces.
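
A minimal sketch of that behaviour. To keep the demonstration simple it uses %a+ for the Event rather than .+ and leaves off the trailing delimiter; the sample titles are made up.

```lua
local typo  = "^(%a+) - %- (%d+)"   -- space then magic hyphen: 0 or more spaces
local fixed = "^(%a+) +%- (%d+)"    -- space then plus: 1 or more spaces

-- Single space before the hyphen: the " -" item matches nothing
local e1, y1 = string.match("Census - 1881", typo)

-- Several spaces: the " -" item stretches just far enough to match
local e2, y2 = string.match("Census    - 1881", typo)

-- The intended pattern gives the same captures
local e3, y3 = string.match("Census    - 1881", fixed)

print(e1, y1, e2, y2, e3, y3)
```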

Tables are EXTREMELY powerful, and have many extraordinary features, but are often overlooked.
In Programming in Lua (first edition) see Part II · Tables and Objects 11 – Data Structures.

Check out all the Lua Language References (plugins:getting_started#lua_language_references).
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Re: Regular Expressions and Lua Patterns

Post by Mark1834 »

Worth posting my latest code using Mike's excellent suggestions above. I've implemented the table slightly differently, with the year as the index and slightly more verbose assignments, but I find that easier to follow as a Lua novice. Interestingly, the alternative assignment trick to catch empty table elements was necessary here for non-UK census entries that otherwise gave an error.
[Attached image: Capture.PNG, a screenshot of the script]
Mark Draper