searching files need to convert XCCS strings into UTF8 strings before searching

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

unicode and FILEPOS,about interlisp/medley

Comments (28)

nbriggs commented on August 21, 2024

Or more precisely, searching within a file needs to convert the XCCS search string into UTF8 iff the external representation of the file is UTF8.

from medley.

rmkaplan commented on August 21, 2024

I did this last summer, for internal XCCS to file UTF8/Unicode, in that particular sense this can be closed.

But this really should be generalized. What is needed is a general method of printing to a string-stream, with that stream created with the external format of the to-be-searched file, and then extracting the byte sequence as the actual search string. I don't remember: does Interlisp have a print-to-string that allows the string to grow, or does common lisp?

Otherwise, I suppose it could PRIN3 to a NODIRCORE IO stream with the file's external format, switch the external format to a byte format, and RSTRING.

from medley.

rmkaplan commented on August 21, 2024

One further issue: The case array.

The current implementation seems to apply the case array to the raw bytes of the string and the file and then test the results for a match. So, if the string contains byte X and the file contains byte Y and the case array maps both X and Y to Z, those bytes would match. That seems really wrong, in a multibyte world (even with our current runcoded XCCS files).

I think this splits into a fast case (no CASEARRAY, the current byte-based algorithm with the search string first transformed by the \OUTCHAR), and the slow case (when a case array is specified).

The slow case would first transform the given search string by apply the case array to its characters, but the result would still be an internal string (thin or fat). The search loop would match the ith (thin or fat) character of that string with the case-array transform of the \INCCODE of the file.

If the beginning and ending byte positions are given, then we have the even slower case where the \INCCODE has to manage the byte count, with the multiple value returns from the file's external format INCCODE function.

Maybe there is a way of doing FFILEPOS when there is a case array, but at least it will be correct if it just reverts to FILEPOS if the case array is non-NIL.

from medley.

masinter commented on August 21, 2024

the efficient way to implement case transformations and comparisons is by table lookup. so I don't understand.

from medley.

rmkaplan commented on August 21, 2024

The problem is whether you are looking up bytes or characters.

…

On May 8, 2021, at 12:16 PM, Larry Masinter ***@***.***> wrote: the efficient way to implement case transformations and comparisons is by table lookup. so I don't understand. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJPMF425AVXGCAIAGRDTMWEZXANCNFSM4P2SOT3Q>.

from medley.

masinter commented on August 21, 2024

the notion of "case" only applies to characters.

from medley.

rmkaplan commented on August 21, 2024

Right, and that’s what presumably the search string and the file both contain. But the file-characters are coded in different byte representation and with different mappings into the internal character-coding standard, whatever that may be. So in order to apply the casearray transformation to the characters as coded in the file, you have to first decode them into their internal (fat or thin) charcode numbers.

…

On May 8, 2021, at 3:59 PM, Larry Masinter ***@***.***> wrote: the notion of "case" only applies to characters. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJOBEVWJIYXUSVZDSW3TMW65DANCNFSM4P2SOT3Q>.

from medley.

masinter commented on August 21, 2024

@rmkaplan is this still an Issue?
what are some test cases? Filepos "xccs-string" file-in-repo which should match?

from medley.

rmkaplan commented on August 21, 2024

Yes, this is still screwed up, I was looking at it last night and this morning. (But then I got distracted by another glitch: (OPENSTREAM T ‘OUTPUT) produces a stream with the wrong EOL convention. The problem with FILEPOS is that the casearray and skip features are byte-oriented, not character oriented, and they introduce fuzzy matching on the file bytes. I don’t think they ever worked properly, even in the NS world. It’s now quite easy to convert the pattern string to the sequence of bytes that would represent it on the file in the given format, using the formatbytestream generic function. But there are alternative sequences of file bytes that should match depending on whether the casearray maps different file bytes (~=chars) to the same element of the pattern. This can be generalized easily and efficiently to arbitrary external formats when there is no case array, since it is easy now to produce the bytestring. When a case array is specified, then I think the correct thing is to get rid of the fancy byte optimizations and simply read characters through the generic \INCCODE, and assume that the case array is an extended map for 16 bit characters in the internal representation. A separate evolution would be to at least allow the pattern to be a simple regular expression, and convert its characters into finite-state machine with bytes according to the external format. An approximation to grep.

…

On Aug 2, 2021, at 3:48 PM, Larry Masinter ***@***.***> wrote: @rmkaplan <https://github.com/rmkaplan> is this still an Issue? what are some test cases? Filepos "xccs-string" file-in-repo which should match? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJLP7H5CAORFEYCPJQLT24OCFANCNFSM4P2SOT3Q>.

from medley.

rmkaplan commented on August 21, 2024

I thought I should say more about the current (= forever) issues with FILEPOS. It is described as behaving like STRPOS, except that it searches files instead of other strings. This is true only in 2 very special cases:
The search string is thin (character set 0) and the file is XCCS-run-length encoded
The search string is fat (at least one character outside of CS0) and the file is XCCS 2-byte encode.
Any other combination, or even a change of the file encoding in the middle, will not give correct results.

It would be more correct to say that FILEPOS is a byte-sequence searcher, where the bytes to be searched for happen to be presented in a string instead of a vector.

This is the incorrect behavior even when SKIP and CASEARRAY are not provided. Those also specify fuzzy matching of bytes and not characters. In fact, you get an invalid-arg error if you try to assign a mapping to a character not in CS0.

So, this is set up to be really fast in the only situation that existed before even XCCS and 16-bit characters were adopted.

The little extension that I put in for UTF-8 can be generalized to arbitrary external encodings, because I now have an interface to easily convert the sequence of characters into the sequence of bytes that represent any external encoding (modulo swapping modifiers and the like). This can be done in a preamble, before the search begins.

But that will still fail for XCCS files that switch midstream between runcoding and nonruncoding.

The better solution is to change the code to treat both the file and the pattern as character sequences, using the generic \INCCODE to read the file and reading the pattern essentially with NTHCHARCODE.

Or, if we know that the external coding is stable (like UTF-8 but unlike XCCS), and there is no SKIP or CASEARRAY, then we can still do the fast case of pre-converting the search string. (However: if we want to take account of alternative Unicode normalization sequences, we would probably end up with the transducer implementation I mention below.)

When provided (the slow case), SKIP would be a character also and the CASEARRAY would be generalized also to allow for 16-bit mappings.

In sum: I think we can still have the very fast case for stable external formats with no SKIP or CASEARRAY.

And take a performance hit in favor of correctness for XCCS files, or when SKIP or CASEARRAY are provided.

(We can deal with the XCCS, SKIP, and CASEARRAY, and perhaps also Unicode variations in normalization if we generalize in a different way, by converting the search pattern to a finite-state transducer that encodes all possible byte sequence matches.)

from medley.

rmkaplan commented on August 21, 2024

The current code is incorrect in another way. It returns the wrong byte position if the search pattern begins with a SKIP character.

I think it has a needless optimization of trying to prune the pattern of any skip-prefixes before it does the searching. How many times does a search pattern begin with skips--just to say that you don't want to match at the very beginning of the file.

from medley.

rmkaplan commented on August 21, 2024

As noted in the comments above, FILEPOS is currently only a byte-sequence searcher, highly optimized (and even then providing incorrect results if the search pattern begins with the skip byte). The search pattern is provided as a string, which suggests that it should be interpreted as characters, but the code doesn't do that. The string representation is really just a funky way of specifying a byte sequence--an array of bytes would be more honest.

Thus, first question: Do we want this utility to remain as a byte-searcher? If so, nothing to do.

If not--we want to define it as a character-string searcher--then I think we can still have a fast case if SKIP and CASEARRAY are NIL and the external format of the stream is marked as STABLE (a new external-format flag, NIL for XCCS because of (non)runcoding, T for UTF-8). The character-sequence is pre-converted to the sequence of bytes that would represent those characters on the file (using the generic \FORMATBYTESTREAM), and then we do the current byte-search.

In all other circumstances, we would do the slower strategy of pre-mapping the characters in the search string through the casearray to produce another array of characters (not bytes) to match, we read the file with the generic \INCCODE and determine the match after mapping that character-code also through the case array.

The CASEARRAY itself has to be upgraded to a (presumably sparse) 16 bit representation, so that one can search case-insensitively for, say, a Greek file. But that's a separate problem, independent of its use in FILEPOS (STRPOS as a similar issue--casearray is a 256-cell array).

Comments?

from medley.

masinter commented on August 21, 2024

string matching for unicode isn't just a matter of mapping bytes. There's also Unicode normalization... If you want to find "all strings that match WORDINTHAI you may well want to compare normalized pattern to normalized results.
When we did the Common Lisp standard for international characters, this wasn't well understood.

I think this is also a problem for Unicode file names,

Consider leaving Interlisp alone, making Common Lisp do what SBCL does, and writing more generic matching functions do what is 'right', rather than trying to be at the same time compatible and also do the 'right' thing (if you can).

from medley.

rmkaplan commented on August 21, 2024

What is "right" presumably involves the equivalent of a finite-state transducer in the middle of file searching, with importing or creating a whole bunch of code to interpret Unicode features and using that to construct an fst that does fuzzy matching for alternative (unnormalized as well as normalized, casemapped etc) byte representations.

A quick look at SBCL didn't talk about matching/searching, just that they associate all the various Unicode character properties with Lisp character objects. Is there more?

I worry that the best may be the enemy of the good in this situation. I believe FILEPOS is currently broken for anything except thin strings (8 bit codes) on runcoded XCCS files or fat strings on nonruncoded XCCS files. It can be generalized to other encodings without much trouble as long as the fuzziness is restricted to simple single-character-to-single-character mappings, which is all that CASEARRAY ever allowed.

Anything more is a big deal. Unless you are saying that we can (simply) down-load and compile code from the SBCL implementation.

from medley.

nbriggs commented on August 21, 2024

It's in the same category as grep -- as its documentation says

The grep utility does not normalize Unicode input, so a pattern containing composed characters will not match decomposed input, and vice versa.

The world has changed but the code has not... so if there are some simple things that will make it more useful in some cases without making the simple cases that work horribly slow then I'd say go for it. Documenting the situations where it will NOT produce the expected results would be useful too.

from medley.

rmkaplan commented on August 21, 2024

On further thought, though, it seems that the correct search of any NS-format file will always be the (relatively) slower (character, not byte, matching) case. Unlike UTF-8, ISO8859 etc., the NS byte encoding is unstable (and in hindsight, over-optimized): a character-set shift can flip from runcoding to non-runcoding in the middle of the file, and that changes the byte sequence that represents a given character search-pattern. This is true whether or not a case-array is provided. The endgame solution is a generic interface that converts the search character-sequence to a finite-state machine that represents all possible format-dependent byte-instantiations (including case array, unicode-normalization for unicode files, and NS charset flipping). The search patterns in this interface are all acyclic and quite small (unlike the linguistic case) so probably this can be set up without requiring all of the fancy algorithms of the full-fledged regular calculus (union, epsilon-removal, intersection, determinization, composition, maybe even reversal for bidi), maybe just a simple decision tree of the byte-expansions. But I think I would go first for correct behavior, even if we take a performance hit for now in the NS case. This is not the only over-optimization we have decided is not worth it with modern machines.

…

On Sep 10, 2021, at 1:05 PM, Nick Briggs ***@***.***> wrote: It's in the same category as grep -- as its documentation say The grep utility does not normalize Unicode input, so a pattern containing composed characters will not match decomposed input, and vice versa. The world has changed but the code has not... so if there are some simple things that will make it more useful in some cases without making the simple cases that work horribly slow then I'd say go for it. Documenting the situations where it will NOT produce the expected results would be useful too. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJKNC4YEL3AYLQQGVUDUBJQIVANCNFSM4P2SOT3Q>.

from medley.

masinter commented on August 21, 2024

a character-set shift can flip from runcoding to non-runcoding in the middle of the file,

While that might be theoretically possible.... who would have created files that were XCCS but were otherwise "plain" text, suitable for some lisp program to FILEPOS in them?

All the XCCS everyone had was embedded in some otherwise binary data such as INTERPRESS and TEDIT.
So all of this searching "External Format" :XCCS seems like a lot of overhead for cases that I don't think really exist. especially not files that would switch in the middle

from medley.

rmkaplan commented on August 21, 2024

This has nothing to do with Interpress or Tedit, or even plain text files that were created in Lisp. They could have been created by any other NS-creation system. As mentioned before, we can assert that for NS (not Utf-8 etc) files, FILEPOS is just a fast byte-sequence searcher with correct results guaranteed only for thin strings and runcoded files, and make it cause an error for other NS situations. We can still make it run at byte-sequence speed for UTF-8 and other stable encodings, but perhaps only if the casearray is NIL. The slow-down for correct NS and for the typical file-sizes that we have may not be noticeable.

…

On Sep 11, 2021, at 1:06 PM, Larry Masinter ***@***.***> wrote: a character-set shift can flip from runcoding to non-runcoding in the middle of the file, While that might be theoretically possible.... who would have created files that were XCCS but were otherwise "plain" text, suitable for some lisp program to FILEPOS in them? All the XCCS everyone had was embedded in some otherwise binary data such as INTERPRESS and TEDIT. So all of this searching "External Format" :XCCS seems like a lot of overhead for cases that I don't think really exist. especially not files that would switch in the middle — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJM3B4YITGQCKCJHZETUBOZFHANCNFSM4P2SOT3Q>.

from medley.

masinter commented on August 21, 2024

This has nothing to do with Interpress or Tedit, or even plain text files that were created in Lisp. They could have been created by any other NS-creation system.

Which ones were those? Star / Viewpoint had it's own file format (people grumbled about it being a binary dump of internal data structures). Cedar? Smalltalk? For strings in programs were all defined in external structures to facilitate localization.
Before Unicode, most everyone had their own (IBM, Code pages, Apple). Japan had it's own NS-like ISO-2022-JP encoding.

There were no other NS creation systems. The documents were not widely available -- we had to dig them out.

from medley.

rmkaplan commented on August 21, 2024

Forget about Star/Viewpoint, bad example. Consider a Lisp source file that has a non character-set 0 character in it (e.g. Japanese or Greek function name, a mathematical symbol) and a search-string that includes such fat character(s). Current (fast, byte-level) FILEPOS won't find it if the file is NS-runcoded. If the file is not runcoded, current FILEPOS won't properly find ordinary ascii characters.

We could assert that FILEPOS for NS-file implementation only supports runcoded files, and therefore assumes that the encoding is stable and known from the get-go. Then I can extend it so that it runs at byte-speed for the non case-array situation for NS as well as for UTF-8 and ISO, by precomputing the unique file byte sequence (using \FORMATBYTESTREAM) that represents the search character sequence.

Or, deal correctly with all combinations of search-string and NS-file representations (including mid-file switching) by going through the generic function interface even for casearray=NIL searching of NS format files.

from medley.

rmkaplan commented on August 21, 2024

I have thought more about the FILEPOS and FFILEPOS issues, with arbitrary external formats. If the external format is (or is treated as being) stable, and the CASEARRAY and SKIP arguments are NIL, then these can run at current byte-level speed: I'm adding a generic external format function \FORMATBYTESTRING which will take an internal string (fat, thin) and produce the bytesequence that represents that string on the file. The code will run at current byte-optimized speeds, just searching for a different byte sequence. We don't have to decode the bytes into characters on the fly.

The pre-translation into a byte string doesn't work if we have to match against different byte sequences for the same character (an unstable format, like XCCS's runcoding versus 2bytes that can change in the middle of the file), or if different characters (which or without different byte sequences) are supposed to match the same character in the pattern.

This is what the CASEARRAY provides for: the byte-representations of A and a might both match character A. And if SKIP is given (say X), then an X in the search pattern would have to match a byte sequence that corresponds to any particular character. (Not to mention if the CASEARRAY says that the SKIP itself has alternatives: Skipping A might mean skipping A and a).

So if CASEARRAY or SKIP are specified, I think FILEPOS has to go upmarket. The file has to be read as a sequence of characters, not bytes, using the generic \INCCODE. That file code can be looked up in a list of precomputed SKIP character-codes, and also mapped through the CASEARRAY. And in those cases, I think FFILEPOS has to revert to FILEPOS.

This is all independent of whether the CASEARRAY only specifies mappings for ASCII characters, as it does now. That can be extended (hashtables or whatever) to do arbitrary mappings, and the searching logic wouldn't change.

from medley.

rmkaplan commented on August 21, 2024

I now have an implementation of FILEPOS/FFILEPOS that operates in terms of characters and not bytes, thus working with arbitrary external formats (and running at previous byte-speed in the common case of no skip and no case array).

However: I don't understand how the END parameter of FILEPOS is to be interpreted. I assumed that the START and END parameters delimited the range of bytes that a match must occur in. In particular, the search should fail if it requires matching a byte/character that is beyond the byte specified by END.

This is not the behavior of the original FILEPOS. Suppose the file FIVE consists of the characters "12345". (OLDFILEPOS "345" FIVE 0 5) returns 2, as you would expect. The file is left positioned at 2 so that the next READC will read 3. Surprising to me, you get the same result (2) for:
(OLDFILEPOS "345" FIVE 0 4)
(OLDFILEPOS "345" FIVE 0 3)
(OLDFILEPOS "345" FIVE 0 2)
It doesn't fail until END=1, the position before the first byte of the match.

So it seems that for the old FILEPOS, END is interpreted as "don't start a match" beyond END, as opposed to "don't finish a match" beyond END.

This is not what I expected, and I can't find any documentation of this, one way or the other.

Is the old behavior a bug or a feature? Maybe it's resolved in the virtual machine documentation--do we have that online anywhere? Otherwise: what should it be?

from medley.

masinter commented on August 21, 2024

published documents about Interlisp should be findable in our Zotero Interlisp bibliography collection
From Zotero you can find the original and the 1979 revision https://www.cs.utexas.edu/~moore/publications/interlisp-vm.pdf

from medley.

masinter commented on August 21, 2024

When questions about 'right' behavior are raised, especially when we don't have a lot of 'running' code, I think we should err on the side of backward compatibility with the current implementation.

I don't remember using Moore's document as an authority over current implementations.

from medley.

rmkaplan commented on August 21, 2024

Thanks for the doc, interesting. It partially describes the current implementation, in that END is in fact specified as a constraint on the start position (don’t allow a match that starts after END), contrary to what I have always intuited, based on analogies with COPYBYTES, SUBSTRING, etc. But I’ll make the new implementation be backward compatible. What the VM doesn’t describe accurately is the behavior when START or END are specified as negative numbers, or when END is leq START. It says cause an error, current behavior is to fold a negative END around the EOF, and to return NIL if START is negative. I think the only error that FILEPOS now produces (other than start and end not being numbers) is a FILENOTOPEN. It also doesn’t describe the casearray, but tthat’s just an extension.

…

On Jun 28, 2022, at 1:03 PM, Larry Masinter ***@***.***> wrote: When questions about 'right' behavior are raised, especially when we don't have a lot of 'running' code, I think we should err on the side of backward compatibility with the current implementation. I don't remember using Moore's document as an authority over current implementations. — Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQSTUJLG3ZEW6TNX2C2X6VLVRNLBJANCNFSM4P2SOT3Q>. You are receiving this because you were mentioned.

from medley.

rmkaplan commented on August 21, 2024

Finally (I hope), XCCS is still problematic because we typically use the run-coded representation of characters (so that in particular Ascii characters occupy only one byte). In that set up the byte sequence representing of a given character code on a file depends on the character set in play at that point in the file, and that depends on what character-set shifts (if any) have appeared earlier in the file. In essence, an XCCS runcoded file is not random access.

There is a shot at reasonable behavior if the search begins at the beginning of the file (either just after opening or with 0 specified as the START argument), then at least we will know the character set at each point and so can extract the proper character codes to compare with the search pattern. This would be slower code than the fast byte searching code that can be used with a stable byte-representation scheme like UTF-8 or the various ISO8859's.

However, if the search is to start at a position other than 0, then I think the best we can do is to assume that the stream's view of the character set at that start position is correct. FILEPOS (slow case) can be sure that the stream's character set information is accurate for the byte-position that it returns, so that future searches or manipulations can assume consistency between that byte position and the stream's charset.

If the stream's charset is not accurate at a given starting position, then the caller would have to all CHARSET to make it consistent...but only for XCCS files.

Is there a better strategy or a better contract with the caller?

(It would be good to begin the conversion of source files from XCCS to UTF-8/unicode. E.g. fix MAKEFILE so that it produces UTF-8 files by default, possibly forcing a MAKEFILE-NEW whenever it dumps a file that it has read in XCCS)

from medley.

rmkaplan commented on August 21, 2024

This is accomplished in prc #827.

The new code has a fast case that is about 20% faster than the original. The fast case is when no SKIP or CASEARRAY is specified, and it can then precompute the byte sequence that represents the characters of the internally code search pattern, and search for those bytes at high speed.

If a skip or case array is specified, then it runs at slow speed, since it has to decode each character's byte sequence at each position in the file. This is about 7 times slower than the previous byte-only search of the old code. But that code would return correct results only for 8-bit byte searches.

For an unstable encoding like run-coded XCCS (but not UTF-8, ISO8859), the results will be correct in the slow case provided that the stream's character set is correct at the starting byte position. In the fast case it will also be correct except in one particular situation: A non-ascii pattern will fail to match a substring in the file that is preceded by other characters in the same character set. Thus δξ it will match the pattern δξ where it appears on the file immediately after a non-Greek character (e.g. aδξ) but it will fail to match if it appears immediately after another Greek character (e.g αβδξ). That's because the fast case cannot know whether or not there will a character set shift immediately before a pattern occurrence.

As part of this PR, I reworked the EDITCALLERS algorithm to avoid its reliance on a case array, so it can run in the fast case (and perhaps miss out on some Greek or math symbols). It's about 40% faster than the original.

from medley.

masinter commented on August 21, 2024

It doesn't seem that there are cases that are still problematic in practice, and if there are, there is too much to read here. So marking this one as wontfix and closing.

from medley.

unicode and FILEPOS about medley HOT 28 CLOSED

Comments (28)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent