stenskjaer / samewords

Automatically annotate potentially ambiguous words in critical text editions made with LaTeX and reledmac.

License: MIT License

Python 76.48% TeX 23.52%

samewords's Issues

Web client

The web client would also have a non-GUI API.

This would:

Make demonstration and use much easier, as it places no requirements on the user's system.
Allow it to be added to an automatic pipeline of composable components, as envisaged in the SCTA system.

In our typographical tradition the abbreviated part in \lemma is usually denoted by dashes: sometimes en dashes, sometimes em dashes, sometimes with spaces on either side, sometimes without, but hardly ever by \ldots.
Would it be possible to add this to the code so that

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
\edtext{B}{\Afootnote{del.}}
\edtext{F}{\Afootnote{del.}}
\edtext{B C D E F}{\lemma{B--F}\Afootnote{b c d e f}}
\edtext{B}{\Afootnote{bb}}
\edtext{F}{\Afootnote{ff}}
\pend
\endnumbering


\end{document}

would be annotated to display the apparatus unambiguously?

If this is tricky, I can quite easily get by by redefining \ldots. In this case I'd suggest that you add to the readme that \ldots is required to get the \lemma annotated correctly ... it took me quite some time to realise what the problem was ;)
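To make the request concrete, here is a minimal sketch of how an abbreviated lemma could be split on dash-style ellipsis markers as well as on \ldots. The pattern and the function name `lemma_endpoints` are assumptions for illustration only; they are not taken from the samewords code base.

```python
import re
from typing import List

# Hypothetical ellipsis pattern: \ldots or \dots (optionally followed by {}),
# two or three hyphens, or a single en/em dash, with optional surrounding spaces.
ELLIPSIS = re.compile(r"\s*(?:\\l?dots(?:\{\})?|-{2,3}|[\u2013\u2014])\s*")


def lemma_endpoints(lemma: str) -> List[str]:
    """Return the words on either side of the first ellipsis marker."""
    parts = ELLIPSIS.split(lemma, maxsplit=1)
    return [part for part in parts if part]
```

A single hyphen is deliberately not treated as an ellipsis, so that hyphenated words such as "twenty-ninth" stay in one piece.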

thin space \,

A thin space (\,) inside \edtext makes the script abort with

Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 11, in <module>
    load_entry_point('samewords', 'console_scripts', 'samewords')()
  File "~/samewords-issue-24/samewords/cli.py", line 99, in main
    output_content = samewords.core.process_document(filename)
  File "~/samewords-issue-24/samewords/core.py", line 26, in process_document
    chunk = ''.join([run_annotation(par) for par in chunk_pars(chunk)])
  File "~/samewords-issue-24/samewords/core.py", line 26, in <listcomp>
    chunk = ''.join([run_annotation(par) for par in chunk_pars(chunk)])
  File "~/samewords-issue-24/samewords/core.py", line 12, in run_annotation
    words = matcher.annotate()
  File "~/samewords-issue-24/samewords/matcher.py", line 28, in annotate
    edtext_end = entry['data'][1]
IndexError: list index out of range

e.g.

\documentclass[a5paper]{scrartcl}

\usepackage[series={A},noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
5\,000 letters or 
\edtext{5\,000}{%
	\Afootnote{6\,000}}
words?
\pend
\endnumbering

\end{document}

If \, is replaced with \thinspace{} it works, so users can easily work around this if it is difficult to fix.
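The workaround can be automated as a pre-processing step before running samewords. The following is only a sketch: the helper name `replace_thin_spaces` is made up here, and it assumes that a \, directly preceded by another backslash (as in a line break \\ followed by a comma) should be left alone.

```python
import re

# Match "\," only when the backslash is not itself preceded by a backslash.
THIN_SPACE = re.compile(r"(?<!\\)\\,")


def replace_thin_spaces(tex: str) -> str:
    """Rewrite the short form \\, as \\thinspace{} before annotation."""
    return THIN_SPACE.sub(lambda match: r"\thinspace{}", tex)
```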

Still a problem minimal example

Thanks for the correction but adding the needed '\endnumbering' to edition.tex doesn't solve the problem. Both the web service and the script still return 'index out of range'.

Chris

Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/samewords/cli.py", line 116, in main
    print(samewords.core.process_document(filename, procedure))
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 26, in process_document
    return process_string(content, method=method)
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 32, in process_string
    chunked_content = chunk_doc(content)
  File "/usr/local/lib/python3.6/site-packages/samewords/document.py", line 49, in chunk_doc
    indices.append([indices[-1][-1], len(content)+1])
IndexError: list index out of range

markup in lemma not correct

Running

\documentclass{scrartcl}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
One %
\edtext{and two and %
\edtext{three}{%
	\Afootnote{tree}}
 and four and one and two and three %
\edtext{and}{%
	\Afootnote{or}}
 four}{
	\lemma{and–four}
	\Afootnote{del.}}
 and six.
\pend
\endnumbering

\end{document}

with 0.4.3 will not mark up "four" in the lemma:

\documentclass{scrartcl}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
One %
\edtext{\sameword[1]{and} two \sameword{and} %
\edtext{\sameword[2]{three}}{%
	\Afootnote{tree}}
 \sameword{and} four \sameword{and} one \sameword{and} two \sameword{and} \sameword{three} %
\edtext{\sameword[2]{and}}{%
	\Afootnote{or}}
 four}{
	\lemma{\sameword{and}–four}
	\Afootnote{del.}}
 \sameword{and} six.
\pend
\endnumbering

\end{document}

resulting in:

and¹–four] del.

while it should be

and¹–four²] del.

Test strings with escaped LaTeX expressions

Example: '{\\ \& \% \$ \# \_ \{ \} \~ \^}'

Double sameword annotation

This is lifted from #29.

\documentclass{scrartcl}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
TWO~dollars and
\edtext{TWO \edtext{cent}{%
	\Afootnote{dimes}}
and
even
%
\edtext{TWO}{%
	\Afootnote{4}}
 more}{%
	\lemma{TWO–more}%
	\Afootnote{del.}},
and
%
\edtext{TWO~%
\edtext{dollars}{%
	\Afootnote{cents}}
}{%
	\Afootnote{del.}}
some
more.
\pend
\endnumbering

\end{document}

The result contains some doubled \sameword{\sameword{ annotations and will not compile with reledmac:

./orig1-SWtestSW.tex:30: Undefined control sequence.
<argument> ... @\this@absline @\the \section@numR 
                                                  @R
l.30 \endnumbering

Removing the double \samewords makes it run through with a correct result.

On the other hand

1 TWO⁴ dollars² ] del.

is a bit strange. I'd expect to find

1 TWO dollars² ] del.

in an edition.
(But I keep wondering whether one could construct cases where this would be ambiguous. Can you come up with one?)

Considering this, your markup seems to make sense and reledmac's handling of the markup should be different.
What do you think?

macros with empty argument/braces

Macros with empty arguments/braces without space immediately after seem to puzzle samewords. E.g.:

\beginnumbering
\pstart
{word\anymacro{}}
\pend
\endnumbering

{word\anymacro{}}a

gives

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/samewords/tokenize.py", line 538, in _register_closing
    open_idx = self._stack_bracket[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/samewords/cli.py", line 116, in main
    print(samewords.core.process_document(filename, procedure))
  File "/usr/local/lib/python3.7/site-packages/samewords/core.py", line 26, in process_document
    return process_string(content, method=method)
  File "/usr/local/lib/python3.7/site-packages/samewords/core.py", line 38, in process_string
    for par in chunk_pars(chunk)])
  File "/usr/local/lib/python3.7/site-packages/samewords/core.py", line 38, in <listcomp>
    for par in chunk_pars(chunk)])
  File "/usr/local/lib/python3.7/site-packages/samewords/core.py", line 10, in run_annotation
    tokenization = Tokenizer(input_text)
  File "/usr/local/lib/python3.7/site-packages/samewords/tokenize.py", line 390, in __init__
    self.wordlist = self._wordlist()
  File "/usr/local/lib/python3.7/site-packages/samewords/tokenize.py", line 400, in _wordlist
    word, pos = self._tokenize(self.data, pos)
  File "/usr/local/lib/python3.7/site-packages/samewords/tokenize.py", line 513, in _tokenize
    self._register_closing(word)
  File "/usr/local/lib/python3.7/site-packages/samewords/tokenize.py", line 545, in _register_closing
    word.close_macro(0)
  File "/usr/local/lib/python3.7/site-packages/samewords/tokenize.py", line 191, in close_macro
    'The word "{}" does not have any open macros.'.format(self))
IndexError: The word "word" does not have any open macros.

But either

{word\anymacro }

{word\anymacro{} }

word\anymacro{}

work fine.

Compare search words as lower case?

Would it be better always (subject to customization) to compare words as lower-case instances? The words in the critical apparatus may appear in lower case, creating ambiguity that would not be caught if title-cased and lower-cased words are treated as different words.

On the other hand, since the lemma words serve as the form of the search word, I guess the problem lies not so much in the transformation of words between main text and lemma appearance as in comparing lemma words with other instances in the text.

Example 1 (true case lemma entries):

An example of an (un)ambiguous case.
1 an ] om. P

Lemma: an². This would not match when searching the context, but it would also not be ambiguous, as the two forms look different. This could be distinguished from the alternative:

An example of an (un)ambiguous case.
1 An ] om. P

But: The first example could still confuse a reader who expects any lemma word to be lower case.

Example 2 (always lower case lemma entries):

An example of an ambiguous case.
1 an ] om. P

With the practice of always lower casing the apparatus lemma (a decision that samewords should be agnostic to), this would be ambiguous.

Both examples here may lead to confusion.

Idea 1: Lower case context words

If we lower case the context words before comparison:

Matches will occur when lemma words are always lower-cased, regardless of whether the context word is lower-cased or title-cased.
Matches will not occur when the lemma is not lower-cased (unless, of course, the line contains the same word in title case more than once). But that will also not be ambiguous, as the lemma form is obviously title-cased.

Idea 2: Lower case both lemma and context before comparison

This way the annotation would be as explicit as possible. This might lead to some redundancy in annotation and disambiguation, but should not leave any room for doubt.

Unless of course:

An example of an (un)ambiguous case.
1 an¹ ] om. P

This could be interpreted as referring to the first lower-case instance, which would cause confusion too, as there is only one such instance.
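The difference between case-sensitive comparison and Idea 2 can be illustrated with a short sketch. The function names and the `ignore_case` switch are hypothetical and only serve to show the two behaviours side by side; tokenisation is reduced to a plain whitespace split.

```python
def words_match(lemma_word: str, context_word: str, ignore_case: bool = True) -> bool:
    """Compare a lemma word with a context word, optionally ignoring case."""
    if ignore_case:
        # casefold() is a more aggressive lower() intended for caseless matching.
        return lemma_word.casefold() == context_word.casefold()
    return lemma_word == context_word


def count_matches(lemma_word: str, context: list, ignore_case: bool = True) -> int:
    """Count how many context words would be treated as the same word."""
    return sum(words_match(lemma_word, word, ignore_case) for word in context)


context = "An example of an ambiguous case.".split()
caseless = count_matches("an", context)                      # counts "An" and "an"
exact = count_matches("an", context, ignore_case=False)      # counts only "an"
```

With caseless comparison both instances would be annotated, which is the redundant but explicit behaviour described under Idea 2.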

% and linebreaks again (vers. 0.2.7 branch "issue-19")

This is probably the word boundary issue of #12 again:

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
word % some word commented out
word
w%
%something commented out
%something else commented out
o% some letter commented out
r% another word just to see what will happen
d 	
word wo% check whether "o" or "ø"
rd
\edtext{w%	
%
ord}{\Afootnote{statement}} %A
w% "W" or "w"?
ord
word
\pend
\endnumbering


\end{document}

gives

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
word % some word commented out
word
w%
%something commented out
%something else commented out
o% some letter commented out
r% another word just to see what will happen
d 	
word wo% check whether "o" or "ø"
rd
\edtext{\sameword[1]{w%	
%
ord}}{\Afootnote{statement}} %A
\sameword{w% "W" or "w"?
ord}
word
\pend
\endnumbering


\end{document}

Also now spaces seem to make things work:

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
my phrase your statement my %
phrase
\edtext{my phrase}{\Afootnote{her sentence}} %A
my phrase my phrase your statement more text my phrase 
\edtext{my phrase}{\Afootnote{her sentence}} %A
your statement my phrase
\edtext{my 
phrase}{\Afootnote{her sentence}} %A
\pend
\endnumbering


\end{document}

is marked up correctly as:

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
\sameword{my phrase} your statement \sameword{my %
phrase}
\edtext{\sameword[1]{my phrase}}{\Afootnote{her sentence}} %A
\sameword{\sameword{my phrase}} my phrase your statement more text \sameword{\sameword{my phrase}} 
\edtext{\sameword[1]{my phrase}}{\Afootnote{her sentence}} %A
your statement \sameword{\sameword{my phrase}}
\edtext{\sameword[1]{my 
phrase}}{\Afootnote{her sentence}} %A
\pend
\endnumbering


\end{document}

(If supporting % inside words is unfeasible to implement, I'd think it is used rarely enough to exclude this possibility, provided you add this limitation to the readme.)

handling of brackets

Rather pondering than suggesting...

Processing

\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
Some
\edtext{text}{%
	\Afootnote{words.}} 
and some more te[xt but partly un]readable.
This will cause \edtext{problems}{%
	\Afootnote{difficulties}} later.
\pend
\endnumbering 

\end{document}

will give

\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
Some
\edtext{\sameword[1]{text}}{%
	\Afootnote{words.}} 
and some more \sameword{te[xt} but partly un]readable.
This will cause \edtext{problems}{%
	\Afootnote{difficulties}} later.
\pend
\endnumbering 

\end{document}

This is the expected behaviour, consistent with "[]" being treated as punctuation.

There are two problems with this, depending on the expected behaviour for numbering:

reledmac will not process a single "[" in a \sameword macro. The second run of the resulting code will break off. This should probably be changed in reledmac at some point.
In the meantime I could use \char"005B instead, but then "text" in the apparatus will not be numbered as reledmac considers text and te\char"005B{}xt to be two different words.
Some editors will prefer not to consider two words with critical brackets (e.g. "Juli[us]" and "Julius") to be the same word. Is there a way to remove characters from the list of punctuation-characters other than changing the settings.py file directly? Would this cause problems in other contexts?

Do you have any advice/suggestions on how to handle this?
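One way to picture the second point is to make the set of ignored characters a parameter of the comparison. This is a sketch only: the default set below is invented for the example and does not reproduce the list in settings.py, and the function names are hypothetical.

```python
# Hypothetical default; the actual punctuation list in samewords may differ.
DEFAULT_PUNCTUATION = frozenset("!\"#$&'()*+,-./:;<=>?@^_`|~[]")


def comparable_form(word: str, punctuation=DEFAULT_PUNCTUATION) -> str:
    """Drop every character that counts as punctuation before comparing."""
    return "".join(ch for ch in word if ch not in punctuation)


def is_same_word(first: str, second: str, punctuation=DEFAULT_PUNCTUATION) -> bool:
    """Two words are the same if they agree once punctuation is ignored."""
    return comparable_form(first, punctuation) == comparable_form(second, punctuation)


# An editor who wants critical brackets to matter removes them from the set.
STRICT_BRACKETS = DEFAULT_PUNCTUATION - {"[", "]"}
```

With the default set, "Juli[us]" and "Julius" compare as equal; with `STRICT_BRACKETS` they do not.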

Space in \index{} while matching

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
\edtext{A}{\Afootnote{a}}\index{A, A}
\pend
\endnumbering


\end{document}

breaks off with

Traceback (most recent call last):
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 363, in content
    symbol = search_string[position]
IndexError: string index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 11, in <module>
    load_entry_point('samewords', 'console_scripts', 'samewords')()
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/cli.py", line 43, in main
    print(samewords.core.process_document(filename))
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/core.py", line 14, in process_document
    updated_paragraphs = [annotate.critical_note_match_replace_samewords(par) for par in paragraphs]
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/core.py", line 14, in <listcomp>
    updated_paragraphs = [annotate.critical_note_match_replace_samewords(par) for par in paragraphs]
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 781, in critical_note_match_replace_samewords
    result = sub_processing(TextSegment(text))
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 768, in sub_processing
    context.before, context.after, search_word)
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 546, in replace_in_proximity
    right = replace_in_context_list(context_after_list, search_word)
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 540, in replace_in_context_list
    chunk = replace_in_string(word, chunk)
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 652, in replace_in_string
    updated_replacement = make_replacements(search_word_listed, replacement_string_listed)
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 629, in make_replacements
    match_in_replace_list = check_list_match(search_list, replace_list, return_list=[])
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 603, in check_list_match
    clean(replace_list[0])):
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 476, in clean
    full_macro = macro.complete_macro()
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 409, in complete_macro
    + '{' + Brackets(self.input_string, start=position).content + '}'
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 325, in __init__
    self.content = self.content(search_string, start, macro)
  File "TeX-testing/sameword-test/samewords-issue-9/samewords/annotate.py", line 375, in content
    raise ValueError("Unbalanced brackets. The provided string terminated before all "
ValueError: Unbalanced brackets. The provided string terminated before all brackets were closed.

Removing the space from the \index-macro will solve the problem, as will changing the macro to something else, e.g. \indexx. Also changing the content to something that wouldn't match, eg. \index{B, B}, solves the problem.

commands within \edtext

Commands inside the \edtext{…} like

\documentclass{article}
\usepackage{polyglossia,fontspec,xunicode}
\setmainlanguage{latin}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart

\edtext{Hákon\emph{ar} konungs}{\Afootnote{k\emph{on}gſ \msside{33b} hakon\emph{ar} Sk}}, 

\pend
\endnumbering

\end{document}

make samewords abort with

Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/samewords/cli.py", line 31, in main
    print(samewords.core.process_document(filename))
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 14, in process_document
    updated_paragraphs = [annotate.critical_note_match_replace_samewords(par) for par in paragraphs]
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 14, in <listcomp>
    updated_paragraphs = [annotate.critical_note_match_replace_samewords(par) for par in paragraphs]
  File "/usr/local/lib/python3.6/site-packages/samewords/annotate.py", line 714, in critical_note_match_replace_samewords
    result = sub_processing(TextSegment(text))
  File "/usr/local/lib/python3.6/site-packages/samewords/annotate.py", line 698, in sub_processing
    if search_in_proximity(search_word, context.before, context.after):
  File "/usr/local/lib/python3.6/site-packages/samewords/annotate.py", line 455, in search_in_proximity
    if re.search(r'\b' + search_word + r'\b', maintext_words):
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 401, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
sre_constants.error: bad escape \e at position 7

As a starting point I'd suggest that unknown commands could be compared as if they were text. If one word is, e.g., emphasised and the other is not, I suppose this would be enough differentiation when both occur in the apparatus, wouldn't it?

On the other hand, some other standard commands like \index give the same problem. In most real life cases typical candidates for the \sameword-mark will also have the same index markup. But it is easy to imagine cases when e.g. the same frequent name refers to two different persons, which will need different index-entries. If both occur in the apparatus, though, we'd need sameword-numbers anyway.

As it is very difficult to know what anybody will need in the next edition, I'd like to suggest that samewords will also compare commands as text (e.g. Hákon\emph{ar} and Hákonar will be treated as different words) but to allow for a negative list, so that I could ask for e.g. \index be ignored and Hákonar\index{Håkon I} and Hákonar\index{Håkon II} be tagged as \sameword{Hákonar}\index{Håkon I} and \sameword{Hákonar}\index{Håkon II}.
Would this be feasible?
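The traceback above ends in `bad escape \e` because the search word is concatenated into a regular expression unchanged, so `\emph` is read as a regex escape. A minimal sketch of the usual remedy, `re.escape`, follows; the variable names are illustrative, and whether samewords should match this way at all is a separate design question.

```python
import re

search_word = r"Hákon\emph{ar}"
maintext = r"Hákon\emph{ar} konungs"


def is_valid_pattern(pattern: str) -> bool:
    """Report whether the string compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Used as-is, the LaTeX macro is parsed as a (bad) regex escape sequence.
raw_is_valid = is_valid_pattern(search_word)

# re.escape() makes every special character literal, so the macro is matched as text.
escaped_is_valid = is_valid_pattern(re.escape(search_word))
found = re.search(re.escape(search_word), maintext) is not None
```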

Problems with Mac CR and Win CRLF linebreaks

When files are saved with Win CRLF or Mac CR linebreaks the output is garbled, containing only the beginning or end of the document.

From issue #6.
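A common way to avoid this class of problem is to normalise all line endings to LF before any further processing. The helper below is a sketch with a made-up name and is not claimed to be how samewords handles it.

```python
def normalize_newlines(text: str) -> str:
    """Convert Windows (CRLF) and old Mac (CR) line breaks to LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```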

Clean function

Make it possible to remove all sameword annotation material from text.
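A rough sketch of what such a clean step might look like: it unwraps every `\sameword{...}` or `\sameword[...]{...}` and keeps the argument, including nested cases. The function name `strip_samewords` is hypothetical, and the brace matching is deliberately simple (it does not account for braces inside % comments).

```python
import re

# Matches the macro name, an optional [..] argument and the opening brace.
SAMEWORD_OPEN = re.compile(r"\\sameword(?:\[[^\]]*\])?\{")


def strip_samewords(tex: str) -> str:
    """Remove all \\sameword wrappers while keeping their content."""
    out = []
    pos = 0
    while True:
        match = SAMEWORD_OPEN.search(tex, pos)
        if match is None:
            out.append(tex[pos:])
            return "".join(out)
        out.append(tex[pos:match.start()])
        # Walk forward to the brace that closes this \sameword argument.
        depth, i = 1, match.end()
        while i < len(tex) and depth:
            if tex[i] == "\\":
                i += 1  # skip the escaped character, e.g. \{ or \}
            elif tex[i] == "{":
                depth += 1
            elif tex[i] == "}":
                depth -= 1
            i += 1
        if depth:
            # Unbalanced braces: leave the remainder untouched.
            out.append(tex[match.start():])
            return "".join(out)
        # Nested annotations are unwrapped recursively.
        out.append(strip_samewords(tex[match.end():i - 1]))
        pos = i
```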

Installation problem

Hi,
with Ubuntu 18.4

sudo -H pip install samewords
Collecting samewords
  Downloading https://files.pythonhosted.org/packages/14/53/81de9c2452896a09252e26dbb92b98db81258cccbe3fb5fe828fad22eab6/samewords-0.5.3.tar.gz (43kB)
    100% |████████████████████████████████| 51kB 276kB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-MmD3VL/samewords/setup.py", line 4, in <module>
        from samewords import __version__
      File "samewords/__init__.py", line 11, in <module>
        import samewords.core
      File "samewords/core.py", line 9
        def run_annotation(input_text: str, method: str = 'annotate') -> str:
                                     ^
    SyntaxError: invalid syntax
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-MmD3VL/samewords/

Lemmas MUST have sameword annotation

This is a general bug covering ALL annotations.

Based on #23 and #13 it is clear that all \lemma{} macros need the \sameword{} annotation too.

missing some words

\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
word 
\edtext{and}{\Afootnote{C1–6.}}
word and
\pend
\endnumbering 

\end{document}

is annotated to

\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
word 
\edtext{\sameword[1]{and}}{\Afootnote{C1–6.}}
word \sameword{and}
\pend
\endnumbering 

\end{document}

instead of correct

\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
\sameword{word} 
\edtext{\sameword[1]{and}}{\Afootnote{C1–6.}}
\sameword{word} \sameword{and}
\pend
\endnumbering 

\end{document}

Infinite recursion bug in proximity match

This was reported by @floriandk.

Somehow the 30-word boundary for checking isn't in place any more:

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
test
thirtieth
twenty-ninth
twenty-eighth
twenty-seventh
twenty-sixth
twenty-fifth
twenty-fourth
twenty-third
twenty-second
twenty-first
twentieth
nineteenth
eighteenth
seventeenth
sixteenth
fifteenth
fourteenth
thirteenth
twelfth
eleventh
tenth
ninth
eighth
seventh
sixth
fifth
fourth
third
second
first
\edtext{test}{\Afootnote{check}}
first
second
third
fourth
fifth
sixth
seventh
eighth
ninth
tenth
eleventh
twelfth
thirteenth
fourteenth
fifteenth
sixteenth
seventeenth
eighteenth
nineteenth
twentieth
twenty-first
twenty-second
twenty-third
twenty-fourth
twenty-fifth
twenty-sixth
twenty-seventh
twenty-eighth
twenty-ninth
thirtieth
test
\pend
\endnumbering


\end{document}

will still mark up "test". Putting a few hundred words without a match before and after the \edtext command will make it break off:

RecursionError: maximum recursion depth exceeded in comparison
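Both symptoms (the ignored boundary and the recursion error) would be avoided by a search that slices a fixed window around the \edtext and scans it with a loop. The sketch below works on a plain list of words; the function names and the list-based representation are assumptions, and only the distance of 30 words comes from the report.

```python
def context_window(words: list, start: int, end: int, distance: int = 30):
    """Return up to `distance` words before and after words[start:end]."""
    before = words[max(0, start - distance):start]
    after = words[end:end + distance]
    return before, after


def has_conflict(words: list, start: int, end: int, distance: int = 30) -> bool:
    """Check iteratively whether the edtext content recurs within the window."""
    target = words[start:end]
    span = len(target)
    for context in context_window(words, start, end, distance):
        for i in range(len(context) - span + 1):
            if context[i:i + span] == target:
                return True
    return False
```

Because the scan is a plain loop over a bounded slice, its cost does not grow with the length of the paragraph and it cannot exhaust the recursion limit.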

Only level numbering when relevant

The level numbering ("\sameword[1]{word}") is only necessary when there is a lemma for that edtext element.

We could therefore just add it when it is required, for more clean annotations.

Tokenize correctly in languages that don't demarcate with whitespace

Examples of such languages are Sanskrit and Arabic.

There, an apparatus note can contain only a fragment of a word...

Running on already disambiguated file must not change the file

Right now, running on processed file changes \sameword[1]{content} to \sameword[1,1]{content}. Other changes might also occur.

TeX's non-breaking space ~ breaking samewords

Hi again -- it's me, your nemesis… ;)

I am now running samewords on my huge project, so I have dug up some more problems. I'd really appreciate if you could iron out the last few glitches.

First:

~ in \edtext isn't compiled:

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
2~dollars and
\edtext{2~\edtext{cent}{%
	\Afootnote{dimes}}
and
even
2
more}{%
	\lemma{2–more}%
	\Afootnote{del.}},
and
some
more.
\pend
\endnumbering

\end{document}

  File "/usr/local/bin/samewords", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/samewords/cli.py", line 107, in main
    print(samewords.core.process_document(filename, procedure))
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 32, in process_document
    for par in chunk_pars(chunk)])
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 32, in <listcomp>
    for par in chunk_pars(chunk)])
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 13, in run_annotation
    words = matcher.annotate()
  File "/usr/local/lib/python3.6/site-packages/samewords/matcher.py", line 31, in annotate
    edtext_end = entry['data'][1] + 1
IndexError: list index out of range

With \edtext{2 \edtext{cent} it runs through, but doesn't catch all occurrences of "2", not surprisingly, as I assume that the whole string "2~dollars" is handled as one word and thus not matched with the word "2" of the edtext.
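If the assumption is right that "2~dollars" is handled as one word, treating an unescaped ~ as a word separator would let "2" match. The following is an illustrative sketch with a hypothetical `split_words` helper; it ignores macros and braces entirely and leaves the accent macro \~ alone.

```python
import re

# Split on whitespace or on a tilde that is not part of the accent macro \~.
WORD_SEPARATOR = re.compile(r"(?:\s|(?<!\\)~)+")


def split_words(text: str) -> list:
    """Split text into words, treating the non-breaking space ~ as a boundary."""
    return [word for word in WORD_SEPARATOR.split(text) if word]
```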

context_distance not read from json file (vers. 0.2.7 branch "issue-19")

Running with --config-file the value for context_distance is not considered.
Modifying settings.py works though, so this is very low priority for me.

(Other settings from the same json-file like exclude_macros are considered as expected.)

Update function

When a user has changed their edition so that a reading is no longer matched by content phrases, the program should remove those sameword annotations again when run with a --update flag.

Problem with running samewords on Minimal example

Uploading the Minimal example edition.tex to the web service produces the error 'Error: list index out of range'

The same error is produced by running samewords locally on edition.tex with this trace:

samewords edition.tex
Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/samewords/cli.py", line 116, in main
    print(samewords.core.process_document(filename, procedure))
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 26, in process_document
    return process_string(content, method=method)
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 32, in process_string
    chunked_content = chunk_doc(content)
  File "/usr/local/lib/python3.6/site-packages/samewords/document.py", line 49, in chunk_doc
    indices.append([indices[-1][-1], len(content)+1])
IndexError: list index out of range

lemma (still/again?) handled incorrectly

As I suspect that you aren't notified of comments on closed issues and I cannot reopen them either, I put this here as a new issue even though I've already written this in #13 :

Is the rewritten code you mention included in the vers. 0.2.7 branch "issue-19"? Testing any ellipsis marker, with or without calling the json file, gives me an incorrect result, e.g.

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
\edtext{B}{\Afootnote{del.}}
\edtext{F}{\Afootnote{del.}}
\edtext{B C D E F}{\lemma{B--F}\Afootnote{b c d e f}}
\edtext{B}{\Afootnote{bb}}
\edtext{F}{\Afootnote{ff}}
\pend
\endnumbering


\end{document}

gives:

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
\edtext{\sameword[1]{B}}{\Afootnote{del.}}
\edtext{\sameword[1]{F}}{\Afootnote{del.}}
\edtext{\sameword{B} C D E \sameword{F}}{\lemma{B--F}\Afootnote{b c d e f}}
\edtext{\sameword[1]{B}}{\Afootnote{bb}}
\edtext{\sameword[1]{F}}{\Afootnote{ff}}
\pend
\endnumbering


\end{document}

instead of

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
\edtext{\sameword{B}}{\Afootnote{del.}}
\edtext{\sameword{F}}{\Afootnote{del.}}
\edtext{\sameword[1]{B} C D E \sameword[1]{F}}{\lemma{\sameword{B}--\sameword{F}}\Afootnote{b c d e f}}
\edtext{\sameword{B}}{\Afootnote{bb}}
\edtext{\sameword{F}}{\Afootnote{ff}}
\pend
\endnumbering


\end{document}

two occurrences within the same edtext

I am not sure whether this actually is a problem or might even be handled like this on purpose. But let me put it here anyway, just in case:

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
two dollars and
\edtext{two \edtext{cent}{%
	\Afootnote{dimes}}
and even two more}{%
	\lemma{two–more}%
	\Afootnote{del.}},
and two more.
\pend
\endnumbering

\end{document}

is converted to

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
\sameword{two} dollars and
\edtext{\sameword[1]{two} \edtext{cent}{%
	\Afootnote{dimes}}
and even two \sameword[1]{more}}{%
	\lemma{\sameword{two}–\sameword{more}}%
	\Afootnote{del.}},
and \sameword{two} \sameword{more}.
\pend
\endnumbering

\end{document}

This compiles to the correct end result in reledmac but I am wondering why "and even two " isn't tagged… Is that by design?

I haven't yet been able to construct an example where this exception results in a faulty end-result: When another \edtext{two}, e.g. in the next sentence, is added, also the earlier one will receive a \sameword{} command. Could there be other cases where this will not work?

False positives in edtext break compilation

causata est a sensibili \supplied{ut} ab agente est \sameword{in} sensu ut \sameword{in} subiecto,
et eodem modo \edtext{\sameword[1]{in} proposito: Intellectum \sameword[1]{in} actu et
\sameword[1]{in}tellectus dicuntur \sameword[1]{idem}}{\lemma{\sameword{in} \dots{} 
\sameword{idem}}\Bfootnote{\emph{om.} Aguin.}}, quia species intellecti \sameword{in}

The \sameword[1]{in}tellectus causes the compilation to break with the following error:

! Missing number, treated as zero.
<to be read again> 
                   }
l.96

The false positive problem is still not fixed, it seems.
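One way to avoid this kind of false positive is to accept a match only when it is not directly adjacent to other word characters. A minimal Python sketch of that idea follows; `find_whole_word` is a hypothetical helper for illustration, not a function from samewords, and it assumes plain text without nested macros:

```python
import re

def find_whole_word(word, text):
    """Return the start offsets where `word` occurs as a complete word.

    Hypothetical helper for illustration; not the samewords implementation.
    """
    # The lookarounds reject matches that are glued to other word
    # characters, so "in" is not found inside "intellectus".
    pattern = re.compile(r'(?<!\w)' + re.escape(word) + r'(?!\w)')
    return [m.start() for m in pattern.finditer(text)]

text = "in proposito: Intellectum in actu et intellectus dicuntur idem"
print(find_whole_word("in", text))
```

With this check, only the two free-standing occurrences of "in" would be candidates for \sameword; the prefix of "intellectus" would be left alone.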

strange behaviour on simple files

While trying to annotate a medium size file, I encountered "Error: [Errno 2] No such file or directory" on the web-app. Minimizing the file I've come to

\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
word 
\edtext{and}{\Afootnote{C1–6.}}
word and
\pend
\endnumbering 

\end{document}

which should be totally unproblematic but still gives the error. Other minimal files made from scratch do work on the web service.

Several other things are weird with this

The problematic file is annotated by samewords 0.5.0, which I still have installed locally, though the result is not completely correct:

\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
word 
\edtext{\sameword[1]{and}}{\Afootnote{C1–6.}}
word \sameword{and}
\pend
\endnumbering 

\end{document}

I can get the same (incomplete: I'll put this into a separate issue) result from the web service if I copy-paste the code to a new file.
If I go back to my initial code-snippet:

\beginnumbering
\pstart
 Þor%
\edtext{og}{%
	\Afootnote{÷~C\textsuperscript{1–6}.}} %
\pend
\endnumbering

it will also get processed on the webservice as copy-pasted but not as the original file.
But 0.5.0 will complain either way:

Traceback (most recent call last):
  File "somewhere/samewords-master/samewords/tokenize.py", line 522, in _register_closing
    open_idx = self._stack_bracket[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 11, in <module>
    load_entry_point('samewords', 'console_scripts', 'samewords')()
  File "somewhere/samewords-master/samewords/cli.py", line 116, in main
    print(samewords.core.process_document(filename, procedure))
  File "somewhere/samewords-master/samewords/core.py", line 32, in process_document
    for par in chunk_pars(chunk)])
  File "somewhere/samewords-master/samewords/core.py", line 32, in <listcomp>
    for par in chunk_pars(chunk)])
  File "somewhere/samewords-master/samewords/core.py", line 10, in run_annotation
    tokenization = Tokenizer(input_text)
  File "somewhere/samewords-master/samewords/tokenize.py", line 378, in __init__
    self.wordlist = self._wordlist()
  File "somewhere/samewords-master/samewords/tokenize.py", line 388, in _wordlist
    word, pos = self._tokenize(self.data, pos)
  File "somewhere/samewords-master/samewords/tokenize.py", line 497, in _tokenize
    self._register_closing(word)
  File "somewhere/samewords-master/samewords/tokenize.py", line 529, in _register_closing
    word.close_macro(0)
  File "somewhere/samewords-master/samewords/tokenize.py", line 179, in close_macro
    raise IndexError('The word does not have any open macros.')
IndexError: The word does not have any open macros.

This can be remedied for 0.5.0 by either adding a space after "Þor" or removing the \textsuperscript command. The webservice will still reject the file.

So my guess is that you have already taken care of this error in the up-to-date version, but I include the description here because I hope it might give you a clue about the main problem:

Could you point me to the kinds of properties in my file that can break samewords (including the web service), so I can avoid them? I attach the offending file here: VÓ-BSWtest.tex.zip

Do not require all apparatus notes to contain a \lemma{}

Create deployment pipeline for the package

When new versions of the package are created, the web service is not updated. This is mentioned in issue #40.

To address this, and a number of other inefficiencies, a deployment pipeline should be created to

run all tests before any deployment.
automatically prepare and publish new releases on pypi.
automatically update the webservice once new releases are created.

Error if docopt is not installed on installation

This is caused by docopt being imported in samewords.cli, which is imported into samewords.annotate. Another solution should be found for handling the command line arguments in annotate.

sensitive_context_match isn't false by default and has some problems

This paragraph won't get annotated without setting the sensitive_context_match parameter to false in a config file:

\pstart
A word with \edtext{a word}{\Afootnote{another D}} after it.
\pend

Moreover, once the paragraph is annotated, the resulting code is like this:

\pstart
\sameword{A} \sameword{word} with \edtext{\sameword[1]{a} \sameword[1]{word}}{\Afootnote{another D}} after it.
\pend
\pstart
\sameword{a} \sameword{word} with \edtext{\sameword[1]{a} \sameword[1]{word}}{\Afootnote{another D}} after it.
\pend

but, once compiled, the two critical notes are different:

6 a word2] another D
7 a2 word2 ] another D

And if you set the multiword parameter to true, the resulting code is this:

\pstart
\sameword{A word} with \edtext{\sameword[1]{a word}}{\Afootnote{another D}} after it.
\pend
\pstart
\sameword{a word} with \edtext{\sameword[1]{a word}}{\Afootnote{another D}} after it.
\pend

but, again, the two critical notes are different:

6 a word] another D
7 a word2] another D

Some Unicode blocks not supported

While trying some real life examples with branch issue-24 I noticed that some Unicode characters break compilation, i.e. it says "Starting conversion." and never comes to an end.

It is a bit curious which characters are affected. It seems to go by Unicode blocks and neither by frequency (e.g. typographical quotes and € don't work, Runes do) nor function.

\documentclass[a5paper]{scrartcl}

\usepackage{fontspec}
\setmainfont{Arial}

\usepackage[series={A},noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
Basic Latin: o 

Lat1: ô 

ExtA: ő

ExtB: ǫ

IPA: ɷ %all working

Spacing Mod:% ˚ %breaks

Comb Diacritics:%  oͦ %breaks

Greek: ώ %compiles

Cyrillic: ѻ %compiles

Cyr Supp: Ԛ %compiles

Runic: ᚮ  %compiles

IPA Ext: ᴏ ᴼ  %compiles

IPA Ext Suppl: ᶱ  %compiles

Comb Dia Suppl: % ᷕ %breaks 

Lat Ext A: ṓ %compiles

Greek Add: Ὧ %compiles

General Punct: %“⁐ %breaks!

Superscripts/Subscripts: ₒ %compiles

Currency: %€₰ %breaks

[...]

Lat Ext C: ⱬ %compiles

[...]

Supplemental Punct: %⸀ %breaks

[...]

Lat Ext D: ꝏ %compiles

[...]

PUA: % %breaks
\pend
\endnumbering

\end{document}

error when nesting without lemma

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
et \edtext{hic \edtext{et}{\Afootnote{÷ A}} hoc}{\Afootnote{ille et illud B}} et cetera
\pend
\endnumbering

\end{document}

aborts with

  File "/usr/local/bin/samewords", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/samewords/cli.py", line 31, in main
    print(samewords.core.process_document(filename))
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 14, in process_document
    updated_paragraphs = [annotate.critical_note_match_replace_samewords(par) for par in paragraphs]
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 14, in <listcomp>
    updated_paragraphs = [annotate.critical_note_match_replace_samewords(par) for par in paragraphs]
  File "/usr/local/lib/python3.6/site-packages/samewords/annotate.py", line 714, in critical_note_match_replace_samewords
    result = sub_processing(TextSegment(text))
  File "/usr/local/lib/python3.6/site-packages/samewords/annotate.py", line 698, in sub_processing
    if search_in_proximity(search_word, context.before, context.after):
  File "/usr/local/lib/python3.6/site-packages/samewords/annotate.py", line 455, in search_in_proximity
    if re.search(r'\b' + search_word + r'\b', maintext_words):
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/sre_parse.py", line 401, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
sre_constants.error: bad escape \e at position 6

et \edtext{hic \edtext{et}{\Afootnote{÷ A}} hoc}{\lemma{test}\Afootnote{ille et illud B}} et cetera

works, so this obviously has something to do with the first \edtext not having a \lemma, even though I am using version 0.1.3 installed via pip.
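The last frames of the traceback suggest that raw LaTeX ends up inside a regular expression: position 6 of the pattern `\b` + `hic \edtext…` is exactly the `\e`. The following Python sketch reproduces that under this assumption and shows `re.escape` as one possible remedy. The variable names follow the traceback, but the sample strings are made up for illustration:

```python
import re

search_word = r'hic \edtext{et}'           # raw LaTeX, as in the failing example
maintext_words = r'et hic \edtext{et} hoc'

# Unescaped, the backslash in "\edtext" is parsed as a regex escape,
# which raises re.error ("bad escape \e") on Python 3.6 and later.
try:
    re.search(r'\b' + search_word + r'\b', maintext_words)
except re.error as err:
    print('unescaped pattern failed:', err)

# re.escape() makes the LaTeX markup match literally. A trailing \b would
# not match after "}", so a negative lookahead is used instead.
match = re.search(r'\b' + re.escape(search_word) + r'(?!\w)', maintext_words)
print(match is not None)
```

Whether escaping alone is the right fix depends on what the search word is supposed to contain when there is no \lemma; the sketch only shows why the pattern fails to compile.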

handbook

Write a short presentation/a handbook to be published on http://geekographie.maieul.net

overlapping structures with xxref

Overlapping structures with \xxref confuse the script:

\documentclass{scrartcl}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
One %
\edtext{and two \edtext{}{\xxref{and-and-start}{and-and-end}\lemma{and–and}\Afootnote{overlapping}}\edlabel{and-and-start}and %
\edtext{three}{%
	\Afootnote{tree}}
 and four and one and two and three %
\edtext{and}{%
	\Afootnote{or}}
 four}{
	\lemma{and–four}
	\Afootnote{del.}}
 and\edlabel{and-and-end} six.
\pend
\endnumbering

\end{document}

\documentclass{scrartcl}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
One %
\edtext{\sameword{and} two \edtext{\sameword[2]{}}{\xxref{and-and-start}{and-and-end}\lemma{\sameword{and}–\sameword{and}}\Afootnote{overlapping}}\edlabel{and-and-start}and %
\edtext{\sameword[2]{three}}{%
	\Afootnote{tree}}
 \sameword{and} four \sameword{and} one \sameword{and} two \sameword{and} \sameword{three} %
\edtext{\sameword[2]{and}}{%
	\Afootnote{or}}
 four}{
	\lemma{and–four}
	\Afootnote{del.}}
 and\edlabel{and-and-end} six.
\pend
\endnumbering

\end{document}

which compiles to

1 and–four] del.
1 and¹–and] overlapping
1 three¹ ] tree
1 and⁶ ] or

Or (which I understand to be alternative correct usage of \xxref)

\documentclass{scrartcl}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
One %
\edtext{and two \edtext{and}{\xxref{and-and-start}{and-and-end}\lemma{and–and}\Afootnote{overlapping}}\edlabel{and-and-start} %
\edtext{three}{%
	\Afootnote{tree}}
 and four and one and two and three %
\edtext{and}{%
	\Afootnote{or}}
 four}{
	\lemma{and–four}
	\Afootnote{del.}}
 and\edlabel{and-and-end} six.
\pend
\endnumbering

\end{document}

\documentclass{scrartcl}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
One %
\edtext{\sameword{and} two \edtext{\sameword[2]{and}}{\xxref{and-and-start}{and-and-end}\lemma{\sameword{and}–\sameword{and}}\Afootnote{overlapping}}\edlabel{and-and-start} %
\edtext{\sameword[2]{three}}{%
	\Afootnote{tree}}
 \sameword{and} four \sameword{and} one \sameword{and} two \sameword{and} \sameword{three} %
\edtext{\sameword[2]{and}}{%
	\Afootnote{or}}
 four}{
	\lemma{and–four}
	\Afootnote{del.}}
 and\edlabel{and-and-end} six.
\pend
\endnumbering

\end{document}

1 and–four] del.
1 and²–and] overlapping
1 three¹ ] tree
1 and⁷ ] or

As far as I understand reledmac's handling of \sameword, it isn't possible to mark up the overlapping structure so that it is numbered automatically (is it?), but at least the application of regular \sameword tags shouldn't be broken.

Config option to indicate languages not matching the `\w` character class

Currently the assumption is that text consists of material matching the \w character class.

If a user's edition contains text outside that class, processing will be significantly slower. But we can't simply extend the fast matching (currently basically \w+) to cover all possible Unicode blocks, because that would make matching the few exceptional cases (\\{} and punctuation) much more demanding.

So it might be a good idea to make it possible to configure it to use one or more of the other languages. The full list of not included material is

['Adlam',
 'Aegean Numbers',
 'Ahom',
 'Alchemical Symbols',
 'Anatolian Hieroglyphs',
 'Ancient Greek Musical Notation',
 'Ancient Greek Numbers',
 'Ancient Symbols',
 'Arabic Extended-A',
 'Arabic Mathematical Alphabetic Symbols',
 'Arabic Presentation Forms-A',
 'Arabic Presentation Forms-B',
 'Armenian',
 'Arrows',
 'Avestan',
 'Balinese',
 'Bamum',
 'Bamum Supplement',
 'Basic Latin',
 'Bassa Vah',
 'Batak',
 'Bengali',
 'Bhaiksuki',
 'Block Elements',
 'Bopomofo',
 'Bopomofo Extended',
 'Box Drawing',
 'Brahmi',
 'Braille Patterns',
 'Buginese',
 'Buhid',
 'Byzantine Musical Symbols',
 'CJK Compatibility',
 'CJK Compatibility Forms',
 'CJK Compatibility Ideographs',
 'CJK Radicals Supplement',
 'CJK Strokes',
 'CJK Symbols and Punctuation',
 'CJK Unified Ideographs',
 'CJK Unified Ideographs Extension A',
 'Carian',
 'Caucasian Albanian',
 'Chakma',
 'Cham',
 'Cherokee',
 'Combining Diacritical Marks',
 'Combining Diacritical Marks Extended',
 'Combining Diacritical Marks Supplement',
 'Combining Diacritical Marks for Symbols',
 'Combining Half Marks',
 'Common Indic Number Forms',
 'Control Pictures',
 'Coptic',
 'Coptic Epact Numbers',
 'Counting Rod Numerals',
 'Cuneiform',
 'Cuneiform Numbers and Punctuation',
 'Currency Symbols',
 'Cyrillic Extended-A',
 'Cyrillic Extended-B',
 'Cyrillic Extended-C',
 'Devanagari Extended',
 'Dingbats',
 'Domino Tiles',
 'Duployan',
 'Early Dynastic Cuneiform',
 'Egyptian Hieroglyphs',
 'Elbasan',
 'Emoticons',
 'Enclosed Alphanumeric Supplement',
 'Enclosed CJK Letters and Months',
 'Enclosed Ideographic Supplement',
 'Ethiopic',
 'Ethiopic Extended',
 'Ethiopic Extended-A',
 'Ethiopic Supplement',
 'General Punctuation',
 'Geometric Shapes',
 'Geometric Shapes Extended',
 'Georgian Supplement',
 'Glagolitic',
 'Glagolitic Supplement',
 'Gothic',
 'Grantha',
 'Greek Extended',
 'Gujarati',
 'Gurmukhi',
 'Halfwidth and Fullwidth Forms',
 'Hangul Compatibility Jamo',
 'Hangul Jamo Extended-A',
 'Hangul Jamo Extended-B',
 'Hangul Syllables',
 'Hanunoo',
 'Hebrew',
 'High Private Use Surrogates',
 'High Surrogates',
 'Ideographic Description Characters',
 'Ideographic Symbols and Punctuation',
 'Javanese',
 'Kaithi',
 'Kana Extended-A',
 'Kana Supplement',
 'Kanbun',
 'Kangxi Radicals',
 'Kannada',
 'Kayah Li',
 'Kharoshthi',
 'Khmer',
 'Khmer Symbols',
 'Khojki',
 'Khudawadi',
 'Lao',
 'Latin Extended-E',
 'Letterlike Symbols',
 'Linear A',
 'Linear B Ideograms',
 'Linear B Syllabary',
 'Lisu',
 'Low Surrogates',
 'Lycian',
 'Lydian',
 'Mahajani',
 'Mahjong Tiles',
 'Mandaic',
 'Manichaean',
 'Marchen',
 'Masaram Gondi',
 'Mathematical Operators',
 'Meetei Mayek',
 'Meetei Mayek Extensions',
 'Mende Kikakui',
 'Miscellaneous Mathematical Symbols-A',
 'Miscellaneous Mathematical Symbols-B',
 'Miscellaneous Symbols',
 'Miscellaneous Symbols and Arrows',
 'Miscellaneous Symbols and Pictographs',
 'Miscellaneous Technical',
 'Modi',
 'Mongolian',
 'Mongolian Supplement',
 'Mro',
 'Multani',
 'Musical Symbols',
 'Myanmar',
 'Myanmar Extended-B',
 'NKo',
 'New Tai Lue',
 'Newa',
 'Number Forms',
 'Nushu',
 'Ogham',
 'Ol Chiki',
 'Old Italic',
 'Old Permic',
 'Old Persian',
 'Old South Arabian',
 'Old Turkic',
 'Optical Character Recognition',
 'Oriya',
 'Ornamental Dingbats',
 'Osage',
 'Osmanya',
 'Pau Cin Hau',
 'Phags-pa',
 'Phaistos Disc',
 'Phoenician',
 'Playing Cards',
 'Private Use Area',
 'Rejang',
 'Rumi Numeral Symbols',
 'Runic',
 'Samaritan',
 'Saurashtra',
 'Sharada',
 'Shorthand Format Controls',
 'Siddham',
 'Sinhala',
 'Sinhala Archaic Numbers',
 'Small Form Variants',
 'Sora Sompeng',
 'Soyombo',
 'Spacing Modifier Letters',
 'Specials',
 'Sundanese Supplement',
 'Superscripts and Subscripts',
 'Supplemental Arrows-A',
 'Supplemental Arrows-B',
 'Supplemental Arrows-C',
 'Supplemental Mathematical Operators',
 'Supplemental Punctuation',
 'Sutton SignWriting',
 'Syloti Nagri',
 'Syriac Supplement',
 'Tagalog',
 'Tagbanwa',
 'Tai Le',
 'Tai Tham',
 'Tai Viet',
 'Tai Xuan Jing Symbols',
 'Takri',
 'Tamil',
 'Tangut',
 'Tangut Components',
 'Telugu',
 'Thaana',
 'Thai',
 'Tibetan',
 'Tifinagh',
 'Tirhuta',
 'Transport and Map Symbols',
 'Ugaritic',
 'Unified Canadian Aboriginal Syllabics Extended',
 'Vai',
 'Variation Selectors',
 'Vedic Extensions',
 'Vertical Forms',
 'Yi Radicals',
 'Yi Syllables',
 'Yijing Hexagram Symbols',
 'Zanabazar Square']

This idea came from #25.
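To see which characters fall outside `\w`, one can test them directly against Python's `re` module. This is only a sketch of Python's own `\w` semantics; how the samewords tokenizer reacts to each character is not shown here, and `is_word_char` is a hypothetical helper:

```python
import re
import unicodedata

def is_word_char(ch):
    """True if Python's Unicode-aware \\w matches the single character."""
    return re.fullmatch(r'\w', ch) is not None

# Letters from several scripts, followed by a modifier symbol,
# a typographical quote and a currency sign.
for ch in ['o', 'ő', 'ώ', 'ᚮ', '˚', '“', '€']:
    print(repr(ch), unicodedata.category(ch), is_word_char(ch))
```

The general category printed in the middle column may be a more useful basis for a config option than block names, since a single block can mix letters with symbols and marks.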

2.3 version install broken?

I am really excited to see that you have come back to the development of samewords!

First thing I tried was to install the released version through the easy installation. The pip3 install samewords runs through w/o errors and samewords --help reacts correctly.

But pytest gives:

====================================================================== ERRORS =======================================================================
________________________________________________________ ERROR collecting test/test_core.py _________________________________________________________
test/test_core.py:6: in <module>
    class TestMainProcessing:
test/test_core.py:9: in TestMainProcessing
    './samewords/test/assets/da-49-l1q1-processed.tex')
document.py:21: in doc_content
    with open(filename, mode='r', encoding='utf-8') as f:
E   FileNotFoundError: [Errno 2] No such file or directory: './samewords/test/assets/da-49-l1q1-processed.tex'
______________________________________________________ ERROR collecting test/test_document.py _______________________________________________________
test/test_document.py:9: in <module>
    multi_begins = doc_content('./samewords/test/assets/multi_begins.tex')
document.py:21: in doc_content
    with open(filename, mode='r', encoding='utf-8') as f:
E   FileNotFoundError: [Errno 2] No such file or directory: './samewords/test/assets/multi_begins.tex'
_________________________________________________ ERROR collecting test/test_document_processing.py _________________________________________________
test/test_document_processing.py:10: in <module>
    class TestParagraphHandling:
test/test_document_processing.py:11: in TestParagraphHandling
    document = doc_content('./samewords/test/assets/da-49-l1q1.tex')
document.py:21: in doc_content
    with open(filename, mode='r', encoding='utf-8') as f:
E   FileNotFoundError: [Errno 2] No such file or directory: './samewords/test/assets/da-49-l1q1.tex'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 3 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Trying to run it on a testfile (eg from #15) nonetheless gives:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/samewords/tokenize.py", line 408, in _tokenize
    open_idx = self._stack_bracket[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/samewords", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/samewords/cli.py", line 79, in main
    print(samewords.core.process_document(filename))
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 26, in process_document
    chunk = ''.join([run_annotation(par) for par in chunk_pars(chunk)])
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 26, in <listcomp>
    chunk = ''.join([run_annotation(par) for par in chunk_pars(chunk)])
  File "/usr/local/lib/python3.6/site-packages/samewords/core.py", line 10, in run_annotation
    tokenization = Tokenizer(input_text)
  File "/usr/local/lib/python3.6/site-packages/samewords/tokenize.py", line 283, in __init__
    self.wordlist = self._wordlist()
  File "/usr/local/lib/python3.6/site-packages/samewords/tokenize.py", line 293, in _wordlist
    word, pos = self._tokenize(self.data, pos)
  File "/usr/local/lib/python3.6/site-packages/samewords/tokenize.py", line 415, in _tokenize
    word.close_macro(0)
  File "/usr/local/lib/python3.6/site-packages/samewords/tokenize.py", line 166, in close_macro
    raise IndexError('The word does not have any open macros.')
IndexError: The word does not have any open macros.

If I download the 2.0 or 2.3 zips here from GitHub, pytest passes (!), but the errors for a real file are the same as with the pip version, with small variations (file paths and line numbers).

I am using MacOS 10.12.6 with homebrew Python 3.6.4.

Make the proximate word distance configurable

It currently searches for word matches in 30 words on each side of the pivot word. This should be adjustable.
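A minimal sketch of what an adjustable distance could look like, assuming the text is already available as a list of words. `context_window` and its `distance` parameter are hypothetical names, not existing samewords settings:

```python
def context_window(words, pivot, distance=30):
    """Return the words before and after index `pivot`, at most `distance` each.

    Hypothetical helper; `distance` stands in for a value read from the config.
    """
    before = words[max(0, pivot - distance):pivot]
    after = words[pivot + 1:pivot + 1 + distance]
    return before, after

words = 'Leo aut ursus aut oryx aut ricinus'.split()
before, after = context_window(words, pivot=3, distance=2)
print(before, after)
```

Keeping 30 as the default would leave current behaviour unchanged for users who do not set the option.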

Linebreaks and tabs

On MacOS 10.12 linebreaks are not treated correctly.

A UTF8-file saved with Unix LF produces a LaTeX-file that will compile but \sameword is applied incorrectly at the linebreak:

Leo aut ursus aut oryx aut ricinus aut equus aut
lupus \edtext{aut}{\Afootnote{et}\Bfootnote{monotone\ldots}} canis aut felix aut asinus \edtext{aut}{\Bfootnote{et}} burricus.

becomes

Leo \sameword{aut} ursus \sameword{aut} oryx \sameword{aut} ricinus \sameword{aut} equus \sameword{aut
lupus} \edtext{\sameword[1]{aut}}{\Afootnote{et}\Bfootnote{monotone\ldots}} canis \sameword{aut} felix \sameword{aut} asinus \edtext{\sameword{aut}}{\Bfootnote{et}} burricus.

A similar problem occurs with tabs (shown here with <tab/>) in the source file:

Leo  aut ursus aut  oryx<tab/>aut ricinus aut<tab/>equus <tab/>aut lupus aut<tab/> canis aut felix aut asinus \edtext{aut}{\Bfootnote{et}} burricus.

becomes

Leo  \sameword{aut} ursus \sameword{aut}  \sameword{oryx<tab/>aut} ricinus \sameword{aut<tab/>equus} \sameword{<tab/>aut} lupus \sameword{aut<tab/>} canis \sameword{aut} felix \sameword{aut} asinus \edtext{\sameword[1]{aut}}{\Bfootnote{et}} burricus.

As LaTeX normalizes tabs and linebreaks to whitespace, I suppose samewords could, too, couldn't it?

I am aware that I could do some preprocessing. But I use linebreaks and tabs heavily to keep the reledmac-code human-readable, especially when I have to deal with many variants. So a built-in solution would be really appreciated.

For completeness' sake: When saved with Win CRLF or Mac CR linebreaks, the same problem occurs at the linebreak, but the preamble is also garbled to only:

 \documentclass[\pstart

I don't care much for which linebreak format I have to use, but it would be really good if one of them would work.
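For the matching step, the normalization could be as small as the following Python sketch. It assumes that blank lines must survive as paragraph breaks and that everything else can be collapsed to single spaces; `normalize_whitespace` is a hypothetical helper, not part of samewords:

```python
import re

def normalize_whitespace(text):
    """Collapse tabs, spaces and single line breaks into one space each."""
    # Unify CRLF and CR line endings to LF first.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Blank lines separate paragraphs in LaTeX, so keep those apart.
    paragraphs = re.split(r'\n[ \t]*\n+', text)
    # str.split() without arguments splits on any run of whitespace.
    return '\n\n'.join(' '.join(p.split()) for p in paragraphs)

print(normalize_whitespace('equus aut\nlupus\taut\r\ncanis'))
```

The normalized string would only be used for comparing words. Writing the annotated file back with the original tabs and linebreaks intact is a separate question that this sketch does not address.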

path to config-file added to output

When a config file is used, the filename (and the path, if any) specified after --config-file is always added twice at the beginning of the resulting file:

test-config.json
test-config.json
\documentclass{article}
\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}
...

This seems to be independent of the contents of the config-file.

Everything else is just fine and after removing the two lines the result compiles as expected.

samewords 0.5.3 with Python 3.7.5 (homebrew) on MacOS 10.12.6

successive identical words only marked once (vers. 0.2.7 branch "issue-19")

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
word
word
word
word
word
word
word
word
word
\edtext{word}{\Afootnote{statement}} %A
word
word
word
word
word
word
\edtext{word}{\Afootnote{statement}} %A
word
word
\pend
\endnumbering


\end{document}

results in

\documentclass[draft]{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
\sameword{word}
word
a
b
c
\edtext{\sameword[1]{word}}{\Afootnote{statement}} %A
\sameword{word}
word
\sameword{word}
word
\sameword{word}
word
\edtext{\sameword[1]{word}}{\Afootnote{statement}} %A
\sameword{word}
word
\pend
\endnumbering


\end{document}

I think I remember that this wasn't a problem earlier but I'm not completely sure.

handling %

\documentclass{article}

\usepackage[series={A},nofamiliar,noeledsec,noledgroup]{reledmac}

\begin{document}

\beginnumbering
\pstart
\edtext{A}{\Afootnote{a}}%A
\pend
\endnumbering


\end{document}

With or without a line break before the %, this results in \sameword{%A}, which obviously doesn't compile in LaTeX.
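One possible approach is to separate the comment from the rest of the line before any words are tagged. The Python sketch below is a simplification: it treats `\%` as a literal percent sign but does not handle an escaped backslash directly before `%`, and `split_comment` is a hypothetical helper, not a samewords function:

```python
import re

# A "%" starts a comment unless it is escaped as "\%".
COMMENT = re.compile(r'(?<!\\)%.*')

def split_comment(line):
    """Split one line into (content, comment); comment is '' if there is none."""
    match = COMMENT.search(line)
    if match is None:
        return line, ''
    return line[:match.start()], match.group(0)

print(split_comment(r'\edtext{A}{\Afootnote{a}}%A'))
```

Only the first element would then be passed on for annotation, and the comment could be appended again unchanged afterwards.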