text wrap

Post by **Ron Melby** » 04 Jun 2021 18:22

I am trying to come up with an efficient text-wrapping algorithm.

Say I have a string of text 2048 characters long and want to chop it into lines of at most 80 characters.

function WRAP(str, plen)

--plen must be the maximum length to split lines at
local wrap = {}
slen = #tostring(str)

llen = math.floor(slen/plen +.5) -- (this often will be enough with one line left over) not even sure I need it.

here is the tricky part, that I need help with
txtstr = 1

while txtstr < slen do
txt = str:sub(txtstr, plen)
--[[characters to split on are ' ' = space, '.' = a period, ',' = comma, ':' = colon, '-'= hyphen , '?'= question maybe more]]

here is the tricky part, that I need help with:
searching from the end, find the last occurrence of any of those characters ([' . ,:;-?']) -- capture any of these (not sure how to write that), get index

fnd = that index in the string.
table.insert(wrap, txt:sub(txtstr, fnd))
txtstr= txtstr - (fnd - plen) -- adjust next start to what you didnt take to insert.
end

The issue is that I am unable to code it.
The biggest issue is that someone is already doing this very elegantly and efficiently, and I don't know how to find the code.

thanks, for any help on this.
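
The "search from the end for any of these characters" step in the question can be written as a single Lua pattern. A minimal sketch, assuming plain Lua 5.1 string patterns; the name `last_break` is hypothetical and not part of any library:

```lua
-- Hypothetical helper: position of the last break character in txt, or nil.
-- ".*" is greedy, so the empty capture "()" records the position of the LAST
-- character that belongs to the set. "%.", "%-" and "%?" are escaped in the set.
function last_break(txt)
	return txt:match(".*()[ %.,:;%-%?]")
end

local line = "say I have a string of text, and want to chop it"
local cut = last_break(line:sub(1, 20))		-- only look at the first 20 characters
if cut then
	print(line:sub(1, cut))			-- text up to and including the break
end
```

If no break character is present the function returns nil, so the caller has to decide what to do with an unbroken run (for example, split it at the maximum length).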

Post by **tatewise** » 04 Jun 2021 21:49

I have done something like that in the Export Gedcom File plugin when deciding how much text will fit in one 250 character line and wrap on the last appropriate non-space character.
Let me dig out that code and adapt it to your requirements tomorrow.

Post by **tatewise** » 05 Jun 2021 11:04

You suggested characters to split on are ' ' = space, '.' = a period, ',' = comma, ':' = colon, '-'= hyphen , '?'= question
But consider some text such as:
"The FH v7.0.5 has 12,345 place records with Lat./Long. values such as 52.97894, -0.026577 ..."
Would you want that to wrap on '.' = a period, ',' = comma, '-'= hyphen ?
That might wrap inside 7.0.5 or 12,345 or -0.026577, which would not be sensible.

Now '.' = a period, ',' = comma, ':' = colon, '?'= question will usually be followed by a space, so they are not needed.
'-'= hyphen might need to wrap in long hyphenated words but that will need a complex conditional test.

So, the function to wrap on space characters only would be as shown below.
The wrapped lines are more even in length if the line length (llen) has the number of lines (lnum) added.

Code:

function WRAP(str, plen)
	str = tostring(str)
	local wrap = {}
	local slen = #str
	if slen == 0 then return wrap end	-- Nothing to wrap
	local lnum = math.max( 1, math.floor( slen/plen + 0.5 ) )	-- Number of lines (at least 1)
	local llen = math.min( plen, math.floor( slen/lnum ) + lnum )	-- Length of lines, never more than plen
	repeat
		local len = llen
		local txt = str:sub(1,len)		-- Next text line
		if #str > llen then
			len = txt:find("([^ ]-)$") - 1	-- Find trailing non-spaces
			if len <= 0 then len = llen end	-- No space found, so split at llen
			txt = str:sub(1,len)
		end
		table.insert(wrap,txt)			-- Save wrap line
		str = str:sub(len+1)			-- Extract tail of text
	until #str == 0
	return wrap
end

Post by **Ron Melby** » 05 Jun 2021 12:34

Yes, absolutely sensible, I was only considering note fields, not any wrap of alphameric 'text' -- space is the gravamen.
But it raises the question, should I 'gsub \t, space' ? and 'gsub \n, space ' or treat it as a break regardless of line length?

I guess my lengths will have to be checked against utf8len because I have Norwegian, and French relations.

-- get length of utf8 string
function UTF8len(str)
	str = tostring(str or '')	-- tostring(nil) gives 'nil', so default to '' first
	if str == '' then
		return 0
	end

	local isFlag = fhIsConversionLossFlagSet()	-- save the conversion loss flag
	str = fhConvertUTF8toANSI(str)
	fhSetConversionLossFlag(isFlag)			-- restore it afterwards
	return string.len(str)
end -- fn UTF8len

I haven't even considered printing for rnotes since I have not arrived into the 90's yet and am still on 6.7.2
What haven't I thought of, or what am I not considering that will hurt me here? In other words, what social and political faux pas am I about to make?

thanks.
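
For the length check raised above, a character count can also be obtained in plain Lua without the FH conversion functions or any extra library. This is a minimal sketch that assumes the input is valid UTF-8; the name `utf8_char_count` is hypothetical. It counts every byte that is not a continuation byte (128 to 191), which is one per character:

```lua
-- Hypothetical helper: number of UTF-8 characters in s (assumes valid UTF-8).
function utf8_char_count(s)
	-- gsub's second result is the number of matches, i.e. the number of
	-- bytes that are NOT continuation bytes.
	local _, count = tostring(s):gsub("[^\128-\191]", "")
	return count
end

-- "\195\165" and "\195\166" are the two-byte UTF-8 encodings of å and æ
print(utf8_char_count("bl\195\165b\195\166r"), #"bl\195\165b\195\166r")
```

The second printed value is the byte length, which is what `#` and `string.len` report, and is larger whenever accented letters are present.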

Post by **Mark1834** » 05 Jun 2021 12:46

If I were coding that, I would do it a little differently. I’d just have a loop that located the last space in the first n + 1 characters of the input string (where n is the max line length), and split the line at that point.

It is significantly less code, and easier to follow. It might be slightly less efficient, but whether that is of any practical significance probably depends on whether it has to format 100 strings, or 10,000!

Neither version is right or wrong - just different relative weighting of clarity, efficiency, and technical “elegance”.

Post by **ColeValleyGirl** » 05 Jun 2021 12:50

And if I was coding it I'd use the compat53 and utf8 libraries and use utf8.len

Post by **Ron Melby** » 05 Jun 2021 14:17

If I was going to use those, I would have to find a compat53.lua and the 'utf8 libraries' in either the store or the knowledge base. I found a kepler on github that required c++ compiling, and a bunch of junk with it. (loadrequire was found in the knowledge base, but not as a .lua file, only as a copy and paste.) It requires an ephemeral or hard-to-find compat53.lua.

after some sloshing around here is the documentation for the 'utf8 libraries' from the link 'Documentation':

https://github.com/Stepets/utf8.lua/blo ... t/test.lua
utf8.len ----- hmmmmmmmm.

I will wait until the api documentation and code has reached its infancy.

You, Helen and Mike, are way above my level and think in this stuff; it is prima facie intuitive to you. I am but a mere nouveau dilettante in pc coding of any kind.

I am afraid the learning curve to get the length of a UTF8 with several libraries, while not even being able to determine if fh uses luaJIT is simply beyond my meager ken.

love to do it, but yaml and luarocks and sticking c++ code in my libs is a far piece down the road for me.

Post by **tatewise** » 05 Jun 2021 14:28

If there are possible Norwegian and French foreign characters then the compat53 and utf8 libraries are helpful.
See FHUG KB Lua References and Library Modules and https://github.com/Stepets/utf8.lua e.g. utf8.len().
If there are only a few foreign characters then I suspect ignoring UTF8 will do little harm.
Try my solution without the libraries.
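
Where accented characters are common, the line width can be measured in characters rather than bytes without any library. The following is only a sketch under stated assumptions: the text is valid UTF-8, runs of whitespace may be collapsed to a single space, and a word longer than the width is left on a line of its own. The names `char_len` and `wrap_words` are hypothetical:

```lua
-- Hypothetical helper: count UTF-8 characters by skipping continuation bytes.
function char_len(s)
	local _, count = s:gsub("[^\128-\191]", "")
	return count
end

-- Hypothetical word-by-word wrapper measuring width in characters, not bytes.
function wrap_words(text, width)
	local lines, current = {}, ""
	for word in tostring(text):gmatch("%S+") do
		if current == "" then
			current = word
		elseif char_len(current) + 1 + char_len(word) <= width then
			current = current .. " " .. word	-- word still fits on this line
		else
			lines[#lines + 1] = current		-- line is full, start a new one
			current = word
		end
	end
	if current ~= "" then lines[#lines + 1] = current end
	return lines
end

for _, line in ipairs(wrap_words("bl\195\165b\195\166r og fl\195\184te", 9)) do
	print(line)
end
```

Because a space is always a single byte, splitting on spaces never cuts a multi-byte character in half; only the width calculation needs to be UTF-8 aware.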

Mark, a loop that located the last space in the first n + 1 characters would be as below but not sure it is any clearer:

Code:

	local len = llen
	if #str > llen then
		while str:sub(len,len) ~= " " do	-- Find last space
			len = len - 1
			if len <= 0 then		-- Cater for no spaces
				len = llen
				break
			end 
		end
	end
	local txt = str:sub(1,len)
	table.insert(wrap,txt)			-- Save wrap line
	str = str:sub(len+1)			-- Extract tail of text

Ron, layout characters such as \t and \n should be considered.
It would be reasonable to convert \t to a single space.
It would be best to retain \n and wrap each line separately as shown below with a few refinements.

Code:

function WRAP(str, plen)
	str = tostring(str):gsub( "\t", " " )				-- Convert each tab to a single space
	local wrap = {}
	local lines = {}
	str:gsub( "([^\n]+)", function(txt) lines[#lines+1] = txt end )	-- Split lines
	for _, str in ipairs (lines) do
		local slen = #str
		local lnum = math.ceil( slen/plen + 0.5 )		-- Number of lines
		local llen = math.min( plen, math.ceil( slen/lnum + 0.5 ) + lnum )	-- Length of lines, never more than plen
		repeat
			local len = llen
			local txt = str:sub(1,len)		-- Next text line
			if #str > llen then
				len = txt:find("([^ ]-)$") - 1	-- Find trailing non-spaces
				if len <= 0 then len = llen end	-- No space found, so split at llen
				txt = str:sub(1,len)
			end
			table.insert(wrap,txt)			-- Save wrap line
			str = str:sub(len+1)			-- Extract tail of text
		until #str == 0
	end
	return wrap
end
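
One point to be aware of when splitting on \n with the pattern "([^\n]+)": consecutive newlines are treated as one separator, so blank lines in a note disappear. If blank lines should survive, the split can be done as in this sketch (the name `split_lines` is hypothetical, and treating \r\n as \n is an assumption about the input):

```lua
-- Hypothetical helper: split text on newlines, keeping empty lines.
function split_lines(text)
	local lines = {}
	text = tostring(text):gsub("\r\n", "\n")	-- normalise Windows line ends
	-- Append a final "\n" so the last line is terminated like all the others
	for line in (text .. "\n"):gmatch("(.-)\n") do
		lines[#lines + 1] = line
	end
	return lines
end

for i, line in ipairs(split_lines("first paragraph\n\nsecond paragraph")) do
	print(i, "[" .. line .. "]")
end
```

Each element can then be wrapped separately, with an empty element passed straight through as an empty output line.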

Post by **ColeValleyGirl** » 05 Jun 2021 14:42

Ron Melby wrote: If I was going to use those, I would have to find a compat53.lua and a 'utf8 libraries' in either the store or in knowledge base

All you need do is loadrequire them -- I've done all the hard work with Luarocks etc. and Jane has hosted them. As documented at Lua References and Library Modules (and already pointed to by Mike):

Code:

loadrequire("utf8")
loadrequire("compat53")
utf8 = require(".utf8"):init()

The loadrequire code is in the KB.

Post by **Ron Melby** » 05 Jun 2021 14:56

I used loadrequire and it got compat53, and got utf8 with no utf8.len that I can find

there is an artifact copy of util in utf8.

Is it one of those where I have to use it before I know what's in it? Does it grab modules as I use them?

Post by **tatewise** » 05 Jun 2021 14:59

Have you added all three lines that Helen posted?
utf8 = require(".utf8"):init() is crucial. Without that there will be no utf8.len()
There should be a global table called utf8 and in that table will be len and char etc...
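
To see what a library table actually contains, its entries can simply be listed. A minimal sketch; `list_functions` is a hypothetical helper, and it is demonstrated on the built-in `string` table so that it runs anywhere. In a plugin the `utf8` table would be passed instead, once the init() line has been executed:

```lua
-- Hypothetical helper: sorted names of all functions held in table t.
function list_functions(t)
	local names = {}
	for name, value in pairs(t) do
		if type(value) == "function" then
			names[#names + 1] = tostring(name)
		end
	end
	table.sort(names)
	return names
end

-- Built-in table used for the demonstration; substitute utf8 in a plugin.
print(table.concat(list_functions(string), ", "))
```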

Post by **ColeValleyGirl** » 05 Jun 2021 15:08

Did you use the up-to-date version of loadrequire at Module Require With Load that handles subdirectories?

Post by **Ron Melby** » 05 Jun 2021 15:18

yes, from the link in knowledgebase,copied it in

Post by **Mark1834** » 05 Jun 2021 18:38

You're right Mike - the way you have implemented my comment is not particularly clear, but that is not what I would do!

Code:

function wrap(S, LineLength)
	local T = {}
	while S:len() > LineLength do
		local j
		for i = 1, LineLength, 1 do
			if S:sub(i, i) == ' ' then j = i end
		end
		table.insert(T, S:sub(1, j - 1))
		S = S:sub(j + 1)
	end
	table.insert(T, S)
	return T
end

Not necessarily fully bomb-proofed against all strings that could upset it, but I'm just making the point that sometimes simple is perfectly adequate. It's not "better" than more sophisticated versions, but it is easier for just about any coder to understand, no matter what their expertise level. Too much coding here goes off at the deep end when it's not necessary - just because an experienced author knows how to make it complicated doesn't mean that they should!

As a test of how practical it is, I converted a copy of my main GEDCOM file to one long 2 MB string in Notepad++ by converting all the CR/LF combinations to spaces. A simple plugin using this code parsed it into about 33,000 80 character lines in about 10 seconds of processing. That's good enough for me!
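
For anyone who wants the same simple approach with the two obvious gaps closed, here is one possible variant. It is a sketch, not a replacement for the function above: the name `wrap_hardened` is hypothetical, it looks at the first n + 1 characters so that a line of exactly n characters followed by a space still fits, and it hard-splits any run with no space instead of raising an error.

```lua
-- Hypothetical hardened variant of the simple wrapper (byte-based, spaces only).
function wrap_hardened(s, line_length)
	local lines = {}
	s = tostring(s)
	while #s > line_length do
		-- Position of the last space in the first line_length + 1 bytes, or nil
		local j = s:sub(1, line_length + 1):match(".*() ")
		if j and j > 1 then
			lines[#lines + 1] = s:sub(1, j - 1)	-- text before the space
			s = s:sub(j + 1)			-- drop the space itself
		else
			lines[#lines + 1] = s:sub(1, line_length)	-- no usable space: hard split
			s = s:sub(line_length + 1)
		end
	end
	lines[#lines + 1] = s
	return lines
end

for _, line in ipairs(wrap_hardened("the quick brown fox", 10)) do
	print(line)
end
```

The pattern ".*() " does the same job as the inner for loop: the greedy ".*" runs to the end and backs up to the last space, and "()" captures its position.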

Post by **Ron Melby** » 06 Jun 2021 04:44

Mike,

Yes. and no.

[Attachment: Untitled.png (116.43 KiB)]

Post by **tatewise** » 06 Jun 2021 09:26

That ...\Plugins\utf8 folder contents looks just like the FH v7.0 equivalent, so the utf8 library has been loaded and the compat53 library is listed too.

In Tools > Plugins use the New... option and enter the following statements into the New plugin window:

local utf8 = require('.utf8'):init()
local leng = utf8.len('abcd')
print(utf8,leng)

Use Go to run them and it should print something like table: 0435FE80 followed by 4 in the lower left pane (the table address will vary), which shows that the utf8 table exists and utf8.len() is working.
Does that work or do you get an error message?

Post by **ColeValleyGirl** » 06 Jun 2021 09:59

Minor correction to Mike's suggested test code:

Code:

require("utf8")
require("compat53")
local utf8 = require('.utf8'):init()
local leng = utf8.len('abcd')
print(utf8,leng)

require will work here because you've already downloaded the two libraries.

Post by **tatewise** » 06 Jun 2021 10:31

Helen, I tested my suggested statements in FH v6.2.7 and they work fine without those require() statements.

For the purposes of testing the existence of utf8.len() they are sufficient.

Post by **ColeValleyGirl** » 06 Jun 2021 10:35

I wouldn't dare question that you'd tested it, Mike. However, for the benefit of anyone who finds this topic, I believe it's best to be consistent about the sequence needed to fully exploit the extended utf8 library.

Post by **Ron Melby** » 06 Jun 2021 11:09

maybe I am missing something very fundamental here. but I am not.
where is the documentation of utf8.whatevers and how to use them?
where is the code? if its only binaries, and since I don't know good old Steve-o, we are at the
end of any discussion of using this scrap, because it looks flaky to me.
what table is it in?
what is the address of utf8.invisible divided by utf8.imadethisup and multiplied by utf8.blks?
apparently this library is a dessert topping and a floor wax, since you can name anything anything you need, and it does what you want, since no documentation exists in human readable format, as I have sent you the link I got for that documentation. And both of you are talking magic.

let's start simple, where is the function utf8.len and its args, and rtx values?
utf8.dll is in compat53. whats in it? how do I use it? the api is not documented in the link for documentation.
utf8.whatareyoutalkingabout?

Post by **tatewise** » 06 Jun 2021 11:18

Ron, the utf8 documentation is at https://github.com/Stepets/utf8.lua which displays the README guide.
Under Usage it says "It also provides all functions from Lua 5.3 UTF-8 module except utf8.len (s [, i [, j]])"
Click on that module link and it takes you to the Lua 5.3 UTF-8 documentation:
https://www.lua.org/manual/5.3/manual.html#6.5 that describes all the functions.

I don't understand why it says "except utf8.len (s [, i [, j]])" which seems to work the same in FH v6.2 and FH v7.0.
In fact, I cannot get the parameters i and j to work in either. They are ignored completely.
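
For reference, the Lua 5.3 manual defines utf8.len(s, i, j) as the number of UTF-8 characters that start between byte positions i and j (both inclusive, defaulting to 1 and -1). If the library ignores those parameters, the same count can be approximated in plain Lua. This is a sketch with the hypothetical name `utf8_len_range`; unlike the real function it assumes valid UTF-8 and does no error checking:

```lua
-- Hypothetical helper: count characters that START at byte positions i..j.
function utf8_len_range(s, i, j)
	i = i or 1
	j = j or -1
	if i < 0 then i = #s + i + 1 end	-- negative positions count from the end
	if j < 0 then j = #s + j + 1 end
	local count = 0
	for pos = math.max(i, 1), math.min(j, #s) do
		local b = s:byte(pos)
		-- Bytes 128..191 are continuation bytes; anything else starts a character
		if b < 128 or b >= 192 then count = count + 1 end
	end
	return count
end

local s = "a\195\165b"			-- "a", two-byte "å", "b": 4 bytes, 3 characters
print(utf8_len_range(s), utf8_len_range(s, 2, 3), utf8_len_range(s, 3, 4))
```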

For compat53 the documentation is at https://github.com/keplerproject/lua-compat-5.3.
Under What's implemented it explains what it offers.

Post by **ColeValleyGirl** » 06 Jun 2021 11:36

Ron Melby wrote: ↑06 Jun 2021 11:09 maybe I am missing something very fundamental here. but I am not.

Yes, you are. You're not reading what you're directed to.

Follow the documentation links that Mike has repeatedly linked to (they're also in the KnowledgeBase).

The code for the stepets library is at https://github.com/Stepets/utf8.lua -- it's a pure lua library, so the code is available to you.

Ron Melby wrote: utf8.dll is in compat53. whats in it? how do I use it? the api is not documented in the link for documentation.

The documentation is linked to from the stepets library. utf8.dll is part of lua 5.3 backported to 5.1, so not flaky at all. It's a c library, so you can't inspect the code (or you can but you may not understand it), but then it's the same for the code that makes up lua but you seem happy to use it.

By now, I'd hope you had come to understand that Mike rarely leads you astray; I certainly shan't be doing so in future because I'm out-of-here -- I have less patience than Mike with people who don't listen to what they're told and then complain things don't work.

Post by **Ron Melby** » 06 Jun 2021 11:50

mike, ok, well that is the long way around the barn.

since I am not writing to the gedcom much as of yet, I have the code here, and it looks like when fh 7 gets some more of its bugs out if I get the upgrade, then this sort of thing is wrapped in.

I think my UTF8len works sufficiently to write displayable text.

helen, this link was provided but no further instructions: https://github.com/Stepets/utf8.lua

Now, I want you to go down to the readme there, and find the link 'Documentation' and click on it.
Oh, foolish me to believe that link pointed to something akin to explication.

It was only after Mike provided additional USEFUL information that I understood the situation, and that Documentation does not mean Documentation.

Post by **ColeValleyGirl** » 06 Jun 2021 12:01

Ron Melby wrote: ↑06 Jun 2021 11:50 helen, this link was provided but no further instructions: https://github.com/Stepets/utf8.lua

Now, I want you to go down to the readme there, and find the link 'Documentation' and click on it.

The readme is the documentation. That's what a readme file is. Which is why it's linked as Online Documentation from the KB article on libraries. https://en.wikipedia.org/wiki/README

Similarly the KB links to the readme file for compat53.

readme files may be outwith your previous experience, but you could at least read them!

Post by **Mark1834** » 06 Jun 2021 12:20

Without wishing to take sides in this squabble, I am sympathetic to the view that useful documentation of some of the more advanced features of Lua is very thin on the ground. It’s a highly technical subject, and most of the stuff I’ve seen is written by experts for other experts. There’s very little tutorial type material around compared with say Python.

Fortunately, the overwhelming majority of FH users don’t have to worry about it. Only a small subset of users write plugins, and only a small subset of that subset would go beyond the standard libraries and FH API.

We’ve come a long way from the original question of how to parse a text string...
