SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts. License: GNU General Public License v3.0

Introduction

echo 'Wow, superTool!;)' | somajo-tokenizer -c -
Wow
,
super
Tool
!
;)

SoMaJo is a rule-based tokenizer and sentence splitter that implements tokenization guidelines for German and English. It has a strong focus on web and social media texts (it was originally created as the winning submission to the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media) and is particularly well suited to tokenizing all kinds of written discourse, for example chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues. Of course, it also works on more formal texts.

Version 1 of the tokenizer is described in greater detail in Proisl and Uhrig (2016).

For part-of-speech tagging (in particular of German web and social media texts), we recommend SoMeWeTa:

somajo-tokenizer --split_sentences <file> | somewe-tagger --tag <model> -

Features

  • Rule-based tokenization and sentence-splitting:
    • EmpiriST 2015 tokenization guidelines for German
    • “New” Penn Treebank conventions for English (described, for example, in the guidelines for ETTB 2.0 (Mott et al., 2009) and CLEAR (Warner et al., 2012))
    • Optionally split camel-cased tokens
    • Optionally output token class information for each token, i.e. if it is a number, an emoticon, an abbreviation, etc.
    • Optionally output additional information for each token, e.g. if it was followed by whitespace or if it contained internal whitespace
    • Optionally split the tokenized text into sentences
    • Optionally determine the character offsets of the tokens in the input, allowing for stand-off tokenization (see the sketch after this list)
  • Text preprocessing/cleaning:
  • XML support:
    • Transparent processing of XML: Tokenize the textual content of an XML file while preserving the XML structure
    • Optionally delimit sentence boundaries by XML tags
    • Optionally prune tags, i.e. subtrees, from the XML before tokenization (for example to remove <script> and <style> tags from HTML input)
    • Optionally strip all tags from the output, effectively turning the XML into plain text
  • Parallelization: Optionally run multiple worker processes to speed up tokenization
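
For stand-off tokenization via character offsets, here is a minimal sketch. It assumes a SoMaJo version whose constructor accepts character_offsets=True and whose Token objects expose a character_offset attribute (check the API documentation for your installed version):

from somajo import SoMaJo

# assumption: with character_offsets=True, each token carries a
# (start, end) tuple in token.character_offset
tokenizer = SoMaJo("de_CMC", character_offsets=True)

text = "Wow, superTool!;)"
for sentence in tokenizer.tokenize_text([text]):
    for token in sentence:
        start, end = token.character_offset
        # the offsets point back into the original input
        print(token.text, repr(text[start:end]), (start, end))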

Installation

SoMaJo can be easily installed using pip (pip3 in some distributions):

pip install -U SoMaJo

Alternatively, you can download and decompress the latest release or clone the git repository:

git clone https://github.com/tsproisl/SoMaJo.git

In the new directory, run the following command:

pip install -U .

Usage

Using the somajo-tokenizer executable

You can use the tokenizer as a standalone program from the command line. General usage information is available via the -h option:

somajo-tokenizer -h
usage: somajo-tokenizer [-h] [-l {en_PTB,de_CMC}]
                        [-s {single_newlines,empty_lines}] [-x] [--tag TAG]
                        [--prune PRUNE] [--strip-tags] [-c]
                        [--split_sentences] [--sentence_tag SENTENCE_TAG] [-t]
                        [-e] [--parallel N] [-v]
                        FILE

A tokenizer and sentence splitter for German and English texts. Currently, two
tokenization guidelines are implemented: The EmpiriST guidelines for German
web and social media texts (de_CMC) and the "new" Penn Treebank conventions
for English texts (en_PTB).

positional arguments:
  FILE                  The input file (UTF-8-encoded) or "-" to read from
                        STDIN.

options:
  -h, --help            show this help message and exit
  -l {en_PTB,de_CMC}, --language {en_PTB,de_CMC}
                        Choose a language. Currently supported are German
                        EmpiriST-style tokenization (de_CMC) and English Penn-
                        Treebank-style tokenization (en_PTB). (Default: de_CMC)
  -s {single_newlines,empty_lines}, --paragraph_separator {single_newlines,empty_lines}
                        How are paragraphs separated in the input text? Will
                        be ignored if option -x/--xml is used. (Default:
                        empty_lines)
  -x, --xml             The input is an XML file. You can specify tags that
                        always constitute a sentence break (e.g. HTML p tags)
                        via the --tag option.
  --tag TAG             Start and end tags of this type constitute sentence
                        breaks, i.e. they do not occur in the middle of a
                        sentence. Can be used multiple times to specify
                        multiple tags, e.g. --tag p --tag br. Implies option
                        -x/--xml. (Default: --tag title --tag h1 --tag h2
                        --tag h3 --tag h4 --tag h5 --tag h6 --tag p --tag br
                        --tag hr --tag div --tag ol --tag ul --tag dl --tag
                        table)
  --prune PRUNE         Tags of this type will be removed from the input
                        before tokenization. Can be used multiple times to
                        specify multiple tags, e.g. --prune script --prune style.
                        Implies option -x/--xml. By default, no tags are
                        pruned.
  --strip-tags          Suppresses output of XML tags. Implies option
                        -x/--xml.
  -c, --split_camel_case
                        Split items written in camelCase (excluding
                        established names and terms).
  --split_sentences, --split-sentences
                        Also split the input into sentences.
  --sentence_tag SENTENCE_TAG, --sentence-tag SENTENCE_TAG
                        Tag name for sentence boundaries (e.g. --sentence_tag
                        s). If this option is specified, sentences will be
                        delimited by XML tags (e.g. <s>…</s>) instead of empty
                        lines. This option implies --split_sentences.
  -t, --token_classes   Output the token classes (number, XML tag,
                        abbreviation, etc.) in addition to the tokens.
  -e, --extra_info      Output additional information for each token:
                        SpaceAfter=No if the token was not followed by a space
                        and OriginalSpelling="…" if the token contained
                        whitespace.
  --character-offsets   Output character offsets in the input for each token.
  --parallel N          Run N worker processes (up to the number of CPUs) to
                        speed up tokenization.
  -v, --version         Output version information and exit.

Here are some common use cases:

  • To tokenize a text file according to the guidelines of the EmpiriST 2015 shared task:

    somajo-tokenizer -c <file>
    
    Show example
    echo 'der beste Betreuer? - >ProfSmith! : )' | somajo-tokenizer -c -
    der
    beste
    Betreuer
    ?
    ->
    Prof
    Smith
    !
    :)
    
  • If you do not want to split camel-cased tokens, simply drop the -c option:

    somajo-tokenizer <file>
    
    Show example
    echo 'der beste Betreuer? - >ProfSmith! : )' | somajo-tokenizer -
    der
    beste
    Betreuer
    ?
    ->
    ProfSmith
    !
    :)
    
  • Your input delimits paragraphs by single newlines instead of empty lines? Tell the tokenizer via the -s/--paragraph_separator option:

    somajo-tokenizer --paragraph_separator single_newlines <file>
    
  • In addition to tokenizing the input, SoMaJo can also split it into sentences:

    somajo-tokenizer --split-sentences <file>
    
    Show example
    echo 'Palim, Palim! Ich hätte gerne eine Flasche Pommes Frites.' | somajo-tokenizer --split-sentences -
    Palim
    ,
    Palim
    !
    
    Ich
    hätte
    gerne
    eine
    Flasche
    Pommes
    Frites
    .
    
    
  • To tokenize English text according to the “new” Penn Treebank conventions, explicitly specify the tokenization guideline using the -l/--language option:

    somajo-tokenizer -l en_PTB <file>
    
    Show example
    echo 'Dont you wanna come?' | somajo-tokenizer -l en_PTB -
    Do
    nt
    you
    wan
    na
    come
    ?
    
  • SoMaJo can also process XML files. Use the -x/--xml option to tell the tokenizer that your input is an XML file:

    somajo-tokenizer --xml <xml-file>
    
    Show example
    echo '<html><head><title>Weihnachten</title></head><body><p>Fr&#x00fc;her war mehr Lametta!</p></body></html>' | somajo-tokenizer --xml -
    <html>
    <head>
    <title>
    Weihnachten
    </title>
    </head>
    <body>
    <p>
    Früher
    war
    mehr
    Lametta
    !
    </p>
    </body>
    </html>
    
  • For XML input, you can use (multiple instances of) the --tag option to specify XML tags that are always sentence breaks, i.e. that can never occur in the middle of a sentence. See the help message for the default list of tags.

    somajo-tokenizer --xml --split_sentences --tag h1 --tag p --tag div <xml-file>
    
  • Via option -t/--token_classes, SoMaJo can output token class information for each token, i.e. if it is a number, an emoticon, an abbreviation, etc. Via option -e/--extra_info, additional information is available, e.g. if a token was followed by whitespace or if it contained internal whitespace.

    Show example
    echo 'der beste Betreuer? - >ProfSmith! : )' | somajo-tokenizer -c -e -t -
    der      regular
    beste    regular
    Betreuer regular    SpaceAfter=No
    ?        symbol
    ->       symbol     SpaceAfter=No, OriginalSpelling="- >"
    Prof     regular    SpaceAfter=No
    Smith    regular    SpaceAfter=No
    !        symbol
    :)       emoticon   OriginalSpelling=": )"
    
  • To speed up tokenization, you can specify the number of worker processes used via the --parallel option:

    somajo-tokenizer --parallel <number> <file>
    

Using the module

Take a look at the API documentation.

You can incorporate SoMaJo into your own Python projects. All you need to do is import somajo, create a SoMaJo object, and call one of its tokenizer functions: tokenize_text, tokenize_text_file, tokenize_xml or tokenize_xml_file. These functions return a generator that yields tokenized chunks of text. By default, these chunks are sentences. If you set split_sentences=False, the chunks are either paragraphs or chunks of XML. Every tokenized chunk of text is a list of Token objects.

Here is an example for tokenizing and sentence splitting two paragraphs:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC", split_camel_case=True)

# note that paragraphs are allowed to contain newlines
paragraphs = ["der beste Betreuer?\n-- ProfSmith! : )",
              "Was machst du morgen Abend?! Lust auf Film?;-)"]

sentences = tokenizer.tokenize_text(paragraphs)
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")
    print()

And here is an example for tokenizing and sentence splitting a whole file. The option paragraph_separator="single_newlines" states that paragraphs are delimited by single newlines instead of empty lines:

sentences = tokenizer.tokenize_text_file("Beispieldatei.txt", paragraph_separator="single_newlines")
for sentence in sentences:
    for token in sentence:
        print(token.text)
    print()

For processing XML data, use the tokenize_xml or tokenize_xml_file methods:

eos_tags = ["title", "h1", "p"]

# you can read from an open file object
sentences = tokenizer.tokenize_xml_file(file_object, eos_tags)
# or you can specify a file name
sentences = tokenizer.tokenize_xml_file("Beispieldatei.xml", eos_tags)
# or you can pass a string with XML data
sentences = tokenizer.tokenize_xml(xml_string, eos_tags)

for sentence in sentences:
    for token in sentence:
        print(token.text)
    print()
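
The CLI options --strip-tags and --prune have counterparts in the module API as well; a minimal sketch, assuming the keyword arguments are named strip_tags and prune_tags (check the API documentation for your installed version):

# assumption: strip_tags and prune_tags mirror --strip-tags and --prune
sentences = tokenizer.tokenize_xml(xml_string, eos_tags,
                                   strip_tags=True,
                                   prune_tags=["script", "style"])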

Evaluation

SoMaJo was the system with the highest average F₁ score in the EmpiriST 2015 shared task. The performance of the current version on the two test sets is summarized in the following table (training and test sets are available from the official website):

Corpus   Precision   Recall   F₁
CMC          99.71    99.56   99.64
Web          99.91    99.92   99.91

Tokenizing English text

SoMaJo can also tokenize English text. In general, we follow the “new” Penn Treebank conventions described, for example, in the guidelines for ETTB 2.0 (Mott et al., 2009) and CLEAR (Warner et al., 2012).

For tokenizing English text on the command line, specify the language via the -l or --language option:

somajo-tokenizer -l en_PTB <file>

From Python, you can pass language="en_PTB" to the SoMaJo constructor, e.g.:

from somajo import SoMaJo

tokenizer = SoMaJo(language="en_PTB")
paragraphs = ["That aint bad!:D"]
sentences = tokenizer.tokenize_text(paragraphs)
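
As in the German examples above, iterate over the generator to access the tokens:

for sentence in sentences:
    print(" ".join(token.text for token in sentence))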

Performance of the English tokenizer:

Corpus                 Precision   Recall   F₁
English Web Treebank       99.66    99.64   99.65

Development

Here are some brief notes to help you get started:

  • Preferably create a dedicated virtual environment.

  • Make sure you have pip ≥ 21.3.

  • Install the project in editable mode:

    pip install -U -e .
  • Install the development dependencies:

    pip install -r requirements_dev.txt
  • To run the tests:

    python3 -m unittest discover
  • To build the documentation:

    cd doc
    make markdown

    Note that the created markdown is not perfect and needs some manual postprocessing.

  • To build the distribution files:

    python3 -m build

References

If you use SoMaJo for academic research, please consider citing the following paper:

  • Proisl, Thomas, and Peter Uhrig. 2016. “SoMaJo: State-of-the-Art Tokenization for German Web and Social Media Texts.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, edited by Paul Cook, Stefan Evert, Roland Schäfer, and Egon Stemle, 57–62. Berlin: Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-2607.

    @InProceedings{Proisl_Uhrig_EmpiriST:2016,
      author    = {Proisl, Thomas and Uhrig, Peter},
      title     = {{SoMaJo}: {S}tate-of-the-art tokenization for {G}erman web and social media texts},
      year      = {2016},
      booktitle = {Proceedings of the 10th {W}eb as {C}orpus Workshop ({WAC-X}) and the {EmpiriST} Shared Task},
      editor    = {Cook, Paul and Evert, Stefan and Schäfer, Roland and Stemle, Egon},
      address   = {Berlin},
      publisher = {Association for Computational Linguistics},
      pages     = {57--62},
      doi       = {10.18653/v1/W16-2607},
      url       = {https://aclanthology.org/W16-2607},
    }


Issues

Quotation Marks

Hey,

totally enjoying your tool. Works almost perfectly. Thanks for all the work!

Just one thing. Maybe you already know these edge cases, but for me there is a problem with quotation marks:

Input: 
[0] Der Entwurf biete aber auch Chancen, wenn man ihn entsprechend verändere: "Das betrifft zum Beispiel die häusliche 1:1-Versorgung und das Modell der persönlichen Assistenz." Dann könne sich die Situation in der ausserklinischen Intensivpflege positiv verändern.
Output: 
[0] Der Entwurf biete aber auch Chancen, wenn man ihn entsprechend verändere: "Das betrifft zum Beispiel die häusliche 1:1-Versorgung und das Modell der persönlichen Assistenz.
[1] " Dann könne sich die Situation in der ausserklinischen Intensivpflege positiv verändern.

Sentence [1], starting with ", is clearly wrong; it should be the end of [0].

Failing unit test in 2.0.2

First of all, thanks for the great tokenizer! I just downloaded SoMaJo 2.0.2, but the unit tests seem to fail:

======================================================================
FAIL: test_xml_boundaries_01 (somajo.test.test_sentence_splitter.TestXMLBoundaries)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/source/somajo/test/test_sentence_splitter.py", line 125, in test_xml_boundaries_01
    self._equal_xml("<foo>Foo bar. Foo bar.</foo>", ["<foo> <s> Foo bar . </s>", "<s> Foo bar . </s> </foo>"])
  File "/build/source/somajo/test/test_sentence_splitter.py", line 27, in _equal_xml
    self.assertEqual(sentences, tokenized_sentences)
AssertionError: Lists differ: ['<foo> Foo bar .', 'Foo bar . </foo>'] != ['<foo> <s> Foo bar . </s>', '<s> Foo bar . </s> </foo>']

First differing element 0:
'<foo> Foo bar .'
'<foo> <s> Foo bar . </s>'

- ['<foo> Foo bar .', 'Foo bar . </foo>']
+ ['<foo> <s> Foo bar . </s>', '<s> Foo bar . </s> </foo>']
?         ++++         +++++    ++++         +++++


======================================================================
FAIL: test_xml_boundaries_02 (somajo.test.test_sentence_splitter.TestXMLBoundaries)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/source/somajo/test/test_sentence_splitter.py", line 128, in test_xml_boundaries_02
    self._equal_xml("<foo><i>Foo bar</i>. Foo <i>bar.</i></foo>", ["<foo> <s> <i> Foo bar </i> . </s>", "<s> Foo <i> bar . </i> </s> </foo>"])
  File "/build/source/somajo/test/test_sentence_splitter.py", line 27, in _equal_xml
    self.assertEqual(sentences, tokenized_sentences)
AssertionError: Lists differ: ['<foo> <i> Foo bar </i> .', 'Foo <i> bar . </i> </foo>'] != ['<foo> <s> <i> Foo bar </i> . </s>', '<s> Foo <i> bar . </i> </s> </foo>']

First differing element 0:
'<foo> <i> Foo bar </i> .'
'<foo> <s> <i> Foo bar </i> . </s>'

- ['<foo> <i> Foo bar </i> .', 'Foo <i> bar . </i> </foo>']
+ ['<foo> <s> <i> Foo bar </i> . </s>', '<s> Foo <i> bar . </i> </s> </foo>']
?         ++++                  +++++    ++++                  +++++


======================================================================
FAIL: test_xml_boundaries_03 (somajo.test.test_sentence_splitter.TestXMLBoundaries)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/source/somajo/test/test_sentence_splitter.py", line 131, in test_xml_boundaries_03
    self._equal_xml("<foo><i>Foo bar.</i> Foo bar.</foo>", ["<foo> <i> <s> Foo bar . </s> </i>", "<s> Foo bar . </s> </foo>"])
  File "/build/source/somajo/test/test_sentence_splitter.py", line 27, in _equal_xml
    self.assertEqual(sentences, tokenized_sentences)
AssertionError: Lists differ: ['<foo> <i> Foo bar . </i>', 'Foo bar . </foo>'] != ['<foo> <i> <s> Foo bar . </s> </i>', '<s> Foo bar . </s> </foo>']

First differing element 0:
'<foo> <i> Foo bar . </i>'
'<foo> <i> <s> Foo bar . </s> </i>'

- ['<foo> <i> Foo bar . </i>', 'Foo bar . </foo>']
+ ['<foo> <i> <s> Foo bar . </s> </i>', '<s> Foo bar . </s> </foo>']
?             ++++         +++++         ++++         +++++


======================================================================
FAIL: test_xml_boundaries_04 (somajo.test.test_sentence_splitter.TestXMLBoundaries)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/source/somajo/test/test_sentence_splitter.py", line 134, in test_xml_boundaries_04
    self._equal_xml("<foo>Foo <i>bar. Foo</i> bar.</foo>", ["<foo> <s> Foo <i> bar . </i> </s>", "<s> <i> Foo </i> bar . </s> </foo>"])
  File "/build/source/somajo/test/test_sentence_splitter.py", line 27, in _equal_xml
    self.assertEqual(sentences, tokenized_sentences)
AssertionError: Lists differ: ['<foo> Foo <i> bar .', 'Foo </i> bar . </foo>'] != ['<foo> <s> Foo <i> bar . </i> </s>', '<s> <i> Foo </i> bar . </s> </foo>']

First differing element 0:
'<foo> Foo <i> bar .'
'<foo> <s> Foo <i> bar . </i> </s>'

- ['<foo> Foo <i> bar .', 'Foo </i> bar . </foo>']
+ ['<foo> <s> Foo <i> bar . </i> </s>', '<s> <i> Foo </i> bar . </s> </foo>']

----------------------------------------------------------------------
Ran 407 tests in 30.426s

FAILED (failures=4)

Any ideas?

How to get a specific classification for a word (e.g. verb, noun, abbreviation, preposition)

Hi, I want to get a more specific classification for each word; however, I get 'regular' for almost all words when running:

from somajo import Tokenizer  # SoMaJo v1 API

tokenizer = Tokenizer(split_camel_case=True, token_classes=True, extra_info=False)
paragraph= "da können wir mit unserem osterwetter eigentlich ganz zufrieden sein ."
tokens = tokenizer.tokenize(paragraph)
print(tokens)

[('da', 'regular'), ('können', 'regular'), ('wir', 'regular'), ('mit', 'regular'), ('unserem', 'regular'), ('osterwetter', 'regular'), ('eigentlich', 'regular'), ('ganz', 'regular'), ('zufrieden', 'regular'), ('sein', 'regular'), ('.', 'symbol')]
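
For reference: the token classes mark special token types (number, URL, emoticon, XML tag, abbreviation, etc.), not parts of speech, so regular is the expected class for ordinary words; for part-of-speech tags, see SoMeWeTa above. A minimal sketch with the current API, using input that actually contains special tokens:

from somajo import SoMaJo

# ordinary words come out as "regular"; only special items such as
# numbers, emoticons or URLs get other token classes
tokenizer = SoMaJo("de_CMC")
sentences = tokenizer.tokenize_text(["Treffen am 24.12. um 18:00 :-) siehe www.example.com"])
for sentence in sentences:
    for token in sentence:
        print(token.text, token.token_class)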

Markdown link splitting bug.

Hi. I have this text: This is a Markdown link: [https://one_link.com](https://other_link.com).

And split it with SoMaJo:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")

paragraphs = ["This is a Markdown link: [https://one_link.com](https://other_link.com)."]

sentences = tokenizer.tokenize_text(paragraphs)
for sentence in sentences:
    for token in sentence:
        print("{}\t{}\t{}".format(token.text, token.token_class, token.extra_info))
    print()

Result is:

This	regular	
is	regular	
a	regular	
Markdown	regular	
link	regular	SpaceAfter=No
:	symbol	
[	symbol	SpaceAfter=No
https://one_link.com](https://other_link.com).	URL	

IMHO this shows a bug with the split of the MD link.

The "." should not be part of the link.
The brackets should also not be part of the link.
And it is not one link but two...

What do you think?

crashes machine when used with multiprocessing

I have tried using the tokenizer and sentence splitter with multiprocessing and mp.Pool, which crashes my PC. I have tried using fewer cores but get the same outcome. This does not seem to be a memory leak, because RAM usage remains consistent.

Is there any obvious reason SoMaJo would fail when used with multiprocessing?
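
One workaround worth trying is SoMaJo's built-in parallelization instead of an outer mp.Pool; a minimal sketch, assuming the tokenize functions accept a parallel keyword mirroring the CLI's --parallel option (an assumption; check the API documentation):

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
paragraphs = ["Erster Absatz.", "Zweiter Absatz."]
# assumption: parallel= mirrors the --parallel CLI option
sentences = tokenizer.tokenize_text(paragraphs, parallel=4)
for sentence in sentences:
    print([token.text for token in sentence])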

How to add own exceptions to the tokenizer?

Hello,

My corpus has some specific vocabulary that I want to handle with the tokenizer.

For example, I want the string E.ON from my corpus to be handled as a single token.

I have SoMaJo installed via pip and I tried to add this exception in my project's
MY_PROJECT_PATH/venv/lib/python3.7/site-packages/somajo/single_token_abbreviations_de.txt file like this:

...
# Lines starting with “#” are treated as comments and will be ignored.

# EV and charging stations specific tokens
E.ON

Forsch.frage
IT.NRW
...

Unfortunately, this does not work.

What would be the best way to handle this?

Do you think it would be useful to provide an additional option like somajo-tokenizer add --single-token-abbr-de-exceptions <exception-list.txt>?

Issue with Markdown style links.

Links in this format have an issue: "MD link <https://heise.de> example."

Code:

from somajo import SoMaJo

somajo = SoMaJo("de_CMC")  # assumed setup, as in the README examples above

text = "MD link <https://heise.de> example."
sentences = somajo.tokenize_text([text])
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")

Returns:

MD	regular	
link	regular	
<	symbol	SpaceAfter=No
https://heise.de/>	URL	
example	regular	SpaceAfter=No
.	symbol

Should return something like this:

MD	regular	
link	regular	
<	symbol	SpaceAfter=No
https://heise.de/	URL
>	symbol
example	regular	SpaceAfter=No
.	symbol

Full code: https://colab.research.google.com/drive/16-CKdzp20Gin02emrLVeHfFFir2veK8M?usp=sharing

Tokenizing consecutive punctuation marks

BE kem pertama dalam Bahasa Melayu. 350 Pax Pemimpin daripada Malaysia, Singapura, Brunei Dan Indonesia!!! Marilah kita membawa gelombang #BEInternational ke Pasaran Melayu!!!🔥🔥🔥🔥🔥

With the text above, I got this result:
[['BE', 'kem', 'pertama', 'dalam', 'Bahasa', 'Melayu', '.'], ['350', 'Pax', 'Pemimpin', 'daripada', 'Malaysia', ',', 'Singapura', ',', 'Brunei', 'Dan', 'Indonesia', '!'], ['!'], ['!'], ['Marilah', 'kita', 'membawa', 'gelombang', '#', 'BEInternational', 'ke', 'Pasaran', 'Melayu', '!'], ['!'], ['!'], ['🔥', '🔥', '🔥', '🔥', '🔥']]

I would like to get the "!!!" as a single token.
Thanks!

Tokenizer outputs single characters per line

Thank you for providing this tool!

I might be doing something wrong but the following code produces an output with every character on its own line. So, basically, each identified token is just one character.

When calling somajo-tokenizer from the command line, it works as expected.

from somajo import SoMaJo
tokenizer = SoMaJo("de_CMC")
sentences = tokenizer.tokenize_text('Das ist ein Test.')
for sentence in sentences:
    for token in sentence:
        print(token.text)
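
Judging from the API description above, tokenize_text expects an iterable of paragraphs, so a bare string is iterated character by character, which would explain the output. A sketch of the likely fix:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
# wrap the input in a list: tokenize_text expects paragraphs, not a string
sentences = tokenizer.tokenize_text(["Das ist ein Test."])
for sentence in sentences:
    for token in sentence:
        print(token.text)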

Other issue with Markdown style links.

Links in this format have an issue: "*[Neubau](https://www.some-link.com)*"

Code:

from somajo import SoMaJo

somajo = SoMaJo("de_CMC")  # assumed setup, as in the README examples above

text = "*[Neubau](https://www.some-link.com)*"
sentences = somajo.tokenize_text([text])
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")

Returns:

*	symbol	SpaceAfter=No
[	symbol	SpaceAfter=No
Neubau	regular	SpaceAfter=No
]	symbol	SpaceAfter=No
(	symbol	SpaceAfter=No
https://www.some-link.com)*	URL	

Should return something like this:

*	symbol	SpaceAfter=No
[	symbol	SpaceAfter=No
Neubau	regular	SpaceAfter=No
]	symbol	SpaceAfter=No
(	symbol	SpaceAfter=No
https://www.some-link.com	URL
)       symbol SpaceAfter=No
*	symbol SpaceAfter=No

Full code: https://colab.research.google.com/drive/16-CKdzp20Gin02emrLVeHfFFir2veK8M?usp=sharing

Sentence Splitter does not work

=== CODE ===
sentence_splitter = SentenceSplitter()
sentences = sentence_splitter.split(text)

=== RESULT ===

text = ' Der Text ist unter der Lizenz „Creative Commons Attribution/Share Alike“ verfügbar; Informationen zu den Urhebern und zum Lizenzstatus eingebundener Mediendateien (etwa Bilder oder Videos) können im Regelfall durch Anklicken dieser abgerufen werden. Möglicherweise unterliegen die Inhalte jeweils zusätzlichen Bedingungen. Durch die Nutzung dieser Website erklären Sie sich mit den Nutzungsbedingungen und der Datenschutzrichtlinie einverstanden. Wikipedia® ist eine eingetragene Marke der Wikimedia Foundation Inc.'

sentences = [' Der Text ist unter der Lizenz „Creative Commons Attribution/Share Alike“ verfügbar; Informationen zu den Urhebern und zum Lizenzstatus eingebundener Mediendateien (etwa Bilder oder Videos) können im Regelfall durch Anklicken dieser abgerufen werden. Möglicherweise unterliegen die Inhalte jeweils zusätzlichen Bedingungen. Durch die Nutzung dieser Website erklären Sie sich mit den Nutzungsbedingungen und der Datenschutzrichtlinie einverstanden. Wikipedia® ist eine eingetragene Marke der Wikimedia Foundation Inc.']
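
For reference, in the SoMaJo 2.x API described above, sentence splitting is built into the SoMaJo class (split_sentences=True is the default); a minimal sketch, reusing the text variable from above:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")  # split_sentences=True by default
for sentence in tokenizer.tokenize_text([text]):
    print(" ".join(token.text for token in sentence))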

Segmentation of sentences in lowercase

Can it be that whenever a sentence starts with a lowercase letter, it won't be recognized as a new sentence? For example, while "Das war es. Es funktioniert nicht." works fine, "Das war es. es funktioniert nicht." is returned as one sentence. Is that a bug or intended behavior?

False Positives with URLS

I just wanted to make you aware that frameworks such as 'VB.NET' or 'ASP.Net' are considered URLs after tokenization and are thus not split (which is probably good). This is also the case for some abbreviations such as 'L/S/R' and SAP versions such as R/3. Unfortunately, this can't be prevented by adding them to 'single_token_abbreviations_de.txt', since they are checked after URLs. (R/3 is even included in 'single_token_abbreviations_de.txt'.)

publish on conda-forge

Hi, it looks like you have created a great package! I have built a tool that uses SoMaJo, amongst others, and would like to make it available on conda-forge in addition to PyPI.
Since SoMaJo is not yet available on conda-forge or other conda channels, I wanted to ask if you would like to add it; otherwise, I will create a recipe and do so (the release will not be affiliated with a specific user).
Until then, I cannot add my package to conda-forge.

Tokenizer text recovery problem

I am trying to recover the original text, but it is not possible since token.original_spelling for a token such as : ( does not contain the original number of spaces.

Here is a motivating example:

import somajo
tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
paragraph = ["Angebotener Hersteller/Typ:   (vom Bieter einzutragen)  Im \
              Einheitspreis sind alle erforderlichen \
              Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)

This prints

Angebotener  -->  None
Hersteller  -->  None
/  -->  None
Typ  -->  None
:(  -->  : (
vom  -->  None
Bieter  -->  None
einzutragen  -->  None
)  -->  None
Im  -->  None
Einheitspreis  -->  None
sind  -->  None
alle  -->  None
erforderlichen  -->  None
Schutzmaßnahmen  -->  None
bei  -->  None
Errichtung  -->  None
des  -->  None
Brandschutzes  -->  None
einzukalkulieren  -->  None
.  -->  None

It would be great if this could somehow be resolved. Thanks!
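
A best-effort detokenization sketch: it relies on token.original_spelling (used in the code above) and a token.space_after attribute, which is an assumption based on the SpaceAfter=No extra info; as this issue points out, original_spelling normalizes runs of whitespace, so the exact original spacing cannot be recovered this way:

def detokenize(tokens):
    # reassemble a normalized surface form; the exact amount of
    # original whitespace is lost
    parts = []
    for token in tokens:
        parts.append(token.original_spelling or token.text)
        if token.space_after:  # assumption: mirrors SpaceAfter=No
            parts.append(" ")
    return "".join(parts).strip()

for sent in tokenizer.tokenize_text(paragraph):
    print(detokenize(sent))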

Thread safety

Is SoMaJo thread safe? Can I, e.g., use it with joblib to parallelize operation?

Phone number and sad emoji

Hey,

I found a small bug: the regex

self.space_emoticon = re.compile(r'([:;])[ ]+([()])')

can hit some German telephone number formats such as 'Tel: ( 0049)' or 'Tel: (+49)'.
In my code I simply fixed this with a negative lookahead, but I can't really test whether this breaks something elsewhere. So mine right now is:

self.space_emoticon = re.compile(r'([:;])[ ]+([()])(?! *[\+0])')
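
A quick check of the proposed lookahead: it blocks the substitution when the parenthesis is followed by an optional space and a + or 0, as in the phone numbers above, while plain smileys are still caught:

import re

space_emoticon = re.compile(r'([:;])[ ]+([()])(?! *[\+0])')
# 'Tel: ( 0049)' is left alone; the trailing ': (' still becomes ':('
print(space_emoticon.sub(r'\1\2', 'Tel: ( 0049) ist kaputt : ('))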

Btw. Thank you for this amazing tool. I use it really often. I really like that if there's something I don't understand, I can just climb down to the regex. Some more options would be nice though, but I guess one day I'll have to send a pull request ;)

SRX for sentence_splitter

Hi Thomas, I am hoping to use SoMaJo's sentence_splitter in Rust, and I am wondering if it would be possible to formulate it in terms of SRX rules. I would happily contribute to making that happen, but I wanted to check with you regarding feasibility before going forward.

Dates at the end of sentences

  • full stops at the end of dates are not correctly split from the date
  • steps to recreate the issue using somajo-2.2.4:
from somajo import SoMaJo
tokenizer = SoMaJo("de_CMC", split_camel_case=True)
paragraphs = ["Am Ende dieses Satzes steht 12.03.2023."]
sentences = tokenizer.tokenize_text(paragraphs)
for sentence in sentences:
    for token in sentence:
        print("{}\t{}\t{}".format(token.text, token.token_class, token.extra_info))

yields

Am	regular	
Ende	regular	
dieses	regular	
Satzes	regular	
steht	regular	
12.03.2023.	number	

Apostrophes replacing vowels

Hi,

I am working on (historical) poetry, where apostrophes are often used to replace vowels (to save a syllable).

Examples:

wär'
hingegeb'nen 
ward's
Bring's
Von Mutter-Lieb', von Schwester-Treu',
Und alle diese sel'gen Träume,

I would like to prevent SoMaJo from splitting these.
Is there an option for this in Python?

tokenizer = SoMaJo("de_CMC", split_camel_case=True)

Thanks a lot.
