Comments (1)
It is currently not possible to perfectly reconstruct the input text from the output tokens as SoMaJo will normalize any whitespace to a single space and will discard things like control characters (see also issue #17).
How to best proceed from here depends on what you want to achieve. Do you want to be able to perfectly detokenize any text or do you want to address the particular tokenization error in your example, i.e. that colon and paren are erroneously merged into a single token? The former would require a lot more work than the latter.
Detokenization, alternative 1: SoMaJo could try to keep all the information that is necessary to reconstruct the original input. This might be feasible for whitespace. However, being able to do the same thing for some of the nasty characters that SoMaJo removes (control characters, soft hyphen, zero-width space, etc.) would require deeper changes.
Detokenization, alternative 2: You could solve the problem externally. The detokenize function from issue #17 almost solves the problem. It should be easy to capture the remaining differences between the detokenized text and the original input with some string alignment algorithm and to add the additional information to the tokens.
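A minimal sketch of the alignment step, assuming the standard library's difflib.SequenceMatcher is an acceptable stand-in for "some string alignment algorithm". The helper names diff_ops and restore and the two sample strings are hypothetical and not part of SoMaJo; in practice the detokenized string would come from the detokenize function in issue #17.

```python
from difflib import SequenceMatcher


def diff_ops(detokenized, original):
    """Collect the edits that turn the detokenized text into the original."""
    matcher = SequenceMatcher(None, detokenized, original, autojunk=False)
    return [
        (tag, i1, i2, original[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def restore(detokenized, ops):
    """Apply the recorded edits to get the original input back."""
    pieces, pos = [], 0
    for _tag, i1, i2, replacement in ops:
        pieces.append(detokenized[pos:i1])  # unchanged stretch
        pieces.append(replacement)          # text only present in the original
        pos = i2                            # skip what the edit replaces
    pieces.append(detokenized[pos:])
    return "".join(pieces)


# Hypothetical input with a doubled space and a soft hyphen (U+00AD),
# both of which would be lost during tokenization.
original = "Typ:  (vom\u00adBieter)"
detokenized = "Typ: (vomBieter)"

ops = diff_ops(detokenized, original)
print(ops)
print(restore(detokenized, ops) == original)
```

Each recorded edit carries character offsets into the detokenized text, so it could be attached to whichever token covers that offset. How to store it on the token objects is left open here.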
Addressing the tokenization error: Emoticons that contain an erroneous space should be quite rare. If you do not need to recognize them (for example because regular sequences of colon, space and paren are much more frequent in your data), you could try to deactivate that feature of the tokenizer. Unfortunately, there is no API for doing that, but a small hack can do the trick: You can set the regular expression that recognizes emoticons with a space to something that effectively never matches, e.g. r"$^" (end of string followed by beginning of string). Here is how you could do that:
import regex as re
import somajo

tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
# Hack: overwrite the private pattern for emoticons that contain a space
# with a regular expression that effectively never matches.
tokenizer._tokenizer.space_emoticon = re.compile(r"$^")

paragraph = [
    "Angebotener Hersteller/Typ: (vom Bieter einzutragen) Im "
    "Einheitspreis sind alle erforderlichen "
    "Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."
]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)
And here is the output:
Angebotener --> None
Hersteller --> None
/ --> None
Typ --> None
: --> None
( --> None
vom --> None
Bieter --> None
einzutragen --> None
) --> None
Im --> None
Einheitspreis --> None
sind --> None
alle --> None
erforderlichen --> None
Schutzmaßnahmen --> None
bei --> None
Errichtung --> None
des --> None
Brandschutzes --> None
einzukalkulieren --> None
. --> None
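One caveat about the pattern, shown here with the standard library's re module rather than regex (the assumption being that both behave the same for these patterns): r"$^" does match the empty string, because end and beginning coincide there. For non-empty input without a leading newline this makes no difference. If a pattern that cannot match anything is preferred, an empty negative lookahead, r"(?!)", is one option. Whether SoMaJo ever applies space_emoticon to an empty string is not something this sketch checks.

```python
import re

dollar_caret = re.compile(r"$^")
empty_lookahead = re.compile(r"(?!)")

# No match on ordinary text: "$" only holds at the end, "^" only at the start.
print(dollar_caret.search("Typ: (vom Bieter)"))

# On the empty string, end and start are the same position, so this matches.
print(dollar_caret.search("") is not None)

# The empty negative lookahead fails at every position, even on "".
print(empty_lookahead.search(""))
print(empty_lookahead.search("Typ: (vom Bieter)"))
```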