
Tokenizer text recovery problem (somajo issue, open)

tsproisl commented on July 28, 2024

Comments (1)

tsproisl commented on July 28, 2024

It is currently not possible to perfectly reconstruct the input text from the output tokens as SoMaJo will normalize any whitespace to a single space and will discard things like control characters (see also issue #17).

How to best proceed from here depends on what you want to achieve. Do you want to be able to perfectly detokenize any text or do you want to address the particular tokenization error in your example, i.e. that colon and paren are erroneously merged into a single token? The former would require a lot more work than the latter.

Detokenization, alternative 1: SoMaJo could try to keep all the information that is necessary to reconstruct the original input. This might be feasible for whitespace. However, being able to do the same thing for some of the nasty characters that SoMaJo removes (control characters, soft hyphen, zero-width space, etc.) would require deeper changes.

Detokenization, alternative 2: You could solve the problem externally. The detokenize function from issue #17 almost solves the problem. It should be easy to capture the remaining differences between the detokenized text and the original input with some string alignment algorithm and to add the additional information to the tokens.
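To make the external approach a bit more concrete, here is a minimal sketch of the alignment step. It uses difflib.SequenceMatcher from the Python standard library as one possible stand-in for "some string alignment algorithm". The function names record_differences and restore are hypothetical, and the detokenized string is written by hand here rather than produced by the detokenize function from issue #17.

```python
import difflib


def record_differences(original, detokenized):
    """Collect the edits that turn `detokenized` back into `original`.

    Each entry is (start, end, replacement), with start/end indexing
    into `detokenized`. Hypothetical helper, not part of SoMaJo.
    """
    matcher = difflib.SequenceMatcher(None, detokenized, original, autojunk=False)
    return [
        (i1, i2, original[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def restore(detokenized, differences):
    """Apply the recorded edits right to left so earlier offsets stay valid."""
    text = detokenized
    for start, end, replacement in reversed(differences):
        text = text[:start] + replacement + text[end:]
    return text


# Hand-made example: extra spaces and a soft hyphen (U+00AD) in the input.
original = "Typ:   (vom Bieter\u00ad einzutragen)  Im"
detokenized = "Typ: (vom Bieter einzutragen) Im"

differences = record_differences(original, detokenized)
for start, end, replacement in differences:
    print(start, end, repr(replacement))
print(restore(detokenized, differences) == original)
```

The recorded offsets refer to positions in the detokenized text, so they could in principle be mapped onto the tokens that cover those positions. That mapping is not shown here.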

Addressing the tokenization error: Emoticons that contain an erroneous space should be quite rare. If you do not need to recognize them (for example because regular sequences of colon, space and paren are much more frequent in your data), you could try to deactivate that feature of the tokenizer. Unfortunately, there is no API for doing that, but a small hack can do the trick: You can set the regular expression that recognizes emoticons with a space to something that never matches, e.g. r"$^" (end of string followed by beginning of string). Here is how you could do that:

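import regex as re
import somajo

tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
# Hack: overwrite the private pattern for emoticons that contain a space
# with a pattern that never matches (there is no public API for this).
tokenizer._tokenizer.space_emoticon = re.compile(r"$^")
# Adjacent string literals are concatenated; runs of whitespace in the
# input are normalized to a single space by SoMaJo anyway.
paragraph = [
    "Angebotener Hersteller/Typ:   (vom Bieter einzutragen)  Im "
    "Einheitspreis sind alle erforderlichen "
    "Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."
]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)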

And here is the output:

Angebotener  -->  None
Hersteller  -->  None
/  -->  None
Typ  -->  None
:  -->  None
(  -->  None
vom  -->  None
Bieter  -->  None
einzutragen  -->  None
)  -->  None
Im  -->  None
Einheitspreis  -->  None
sind  -->  None
alle  -->  None
erforderlichen  -->  None
Schutzmaßnahmen  -->  None
bei  -->  None
Errichtung  -->  None
des  -->  None
Brandschutzes  -->  None
einzukalkulieren  -->  None
.  -->  None
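As a side note on the never-matching pattern: the following sketch checks how r"$^" behaves on a few sample strings. It uses the standard library re module instead of regex so that it runs without extra dependencies. The empty negative lookahead r"(?!)" is shown only as an alternative of my own that also rejects the empty string; it is not something SoMaJo requires.

```python
import re

end_then_start = re.compile(r"$^")   # pattern used in the hack above
empty_lookahead = re.compile(r"(?!)")  # alternative: fails at every position

samples = ["Typ: (vom Bieter einzutragen)", ": )", ":-)", ""]
for sample in samples:
    print(
        repr(sample),
        end_then_start.search(sample) is not None,
        empty_lookahead.search(sample) is not None,
    )
```

For non-empty text without a trailing newline, both patterns fail to match. The only difference in this sketch is the empty string, where r"$^" does match.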
