Regular Expressions and Lua Patterns
Posted: 31 Jan 2018 13:10
DavidNewton posted the following in response to AdrianBruce in Family Tree View (15646), but the post was unfortunately deleted.
The most used source citations in my file are the GRO Indexes. The Where in Source generally contains five elements: year, quarter date, registration district, volume number and page number. Over the years I have put these elements in various orders according to my thinking at the time. I now want to change all of them into the order: Year, Quarter, District, Vol, Page, and the Lua patterns enable me to extract this data and reformat it.
I'll give a specific example (I have no idea if the Index is genuine). Let wis be the Where In Source string:
wis = '1857, Basford, Q3, v1D, p123'
To extract the individual pieces of data, use string.match; the data captured is whatever matches the part of the pattern between ( and ):
Year=string.match(wis,'(%d%d%d%d)') -- the first 4 consecutive digits
Qdate=string.match(wis,'(Q[1,2,3,4])') -- Qx where x is 1, 2, 3 or 4
Dist=string.match(wis,'%s*([%a%s]+)') -- the first, longest run of letters and spaces, skipping any leading spaces
Vol=string.match(wis,'(v[%w]+)') -- v followed by the longest run of alphanumerics
Page=string.match(wis,'(p%d+)') -- p followed by the longest run of digits
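For anyone wanting to try this, the lines above can be put together into a small runnable sketch. The patterns are exactly as posted; the final table.concat step and the name reordered are my own additions for illustration, assuming that a comma and space is the wanted separator.

```lua
-- Example Where In Source string from the post
local wis = '1857, Basford, Q3, v1D, p123'

-- Extract each element with the patterns as posted
local Year  = string.match(wis, '(%d%d%d%d)')
local Qdate = string.match(wis, '(Q[1,2,3,4])')
local Dist  = string.match(wis, '%s*([%a%s]+)')
local Vol   = string.match(wis, '(v[%w]+)')
local Page  = string.match(wis, '(p%d+)')

-- Rebuild in the order Year, Quarter, District, Vol, Page
-- (the ', ' separator is an assumption, not taken from the post)
local reordered = table.concat({Year, Qdate, Dist, Vol, Page}, ', ')
print(reordered)   --> 1857, Q3, Basford, v1D, p123
```

If any element is missing from the string, string.match returns nil and table.concat will stop short or raise an error, so real data would need a check on each value first.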
This is just an example to show simple patterns; it is not intended to be a general method of reformatting GRO Index data, but it does give an idea of the bulk editing that is possible using Lua.
David
Comments by Mike Tate:
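The pattern (Q[1,2,3,4]) is incorrectly formed, because the commas become members of the set and it would also match Q followed by a comma. It should be (Q[1234]) or (Q[1-4]).
(v[%w]+) might capture, say, verton from Tiverton in a District name instead of the Volume. Including the preceding separator, as in ', (v[%w]+)', is better.
(p%d+) might capture, say, p23 from Camp23 in a District name instead of the Page. Including the preceding separator, as in ', (p%d+)', is better.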
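The three points above can be checked with a short sketch. The strings wis1 and wis2 are made-up test values built from the Tiverton and Camp23 examples, not real Index entries.

```lua
-- The comma is a member of the set in [1,2,3,4], so 'Q,' matches
local q_bad  = string.match('Q,', '(Q[1,2,3,4])')
local q_good = string.match('Q,', '(Q[1-4])')
print(q_bad, q_good)       --> Q,      nil

-- A District containing a 'v' is picked up before the Volume
local wis1 = '1857, Tiverton, Q3, v1D, p123'
local v_bad  = string.match(wis1, '(v[%w]+)')
local v_good = string.match(wis1, ', (v[%w]+)')
print(v_bad, v_good)       --> verton  v1D

-- A District containing 'p' plus digits is picked up before the Page
local wis2 = '1857, Camp23, Q3, v1D, p123'
local p_bad  = string.match(wis2, '(p%d+)')
local p_good = string.match(wis2, ', (p%d+)')
print(p_bad, p_good)       --> p23     p123
```

Note that the ', ' prefix only helps when the Volume and Page are not the first element in the string, and a District that itself starts with a lower case v or p would still be caught.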