tsproisl / someweta

A part-of-speech tagger with support for domain adaptation and external resources.

License: GNU General Public License v3.0

Languages: Python 71.38%, Shell 28.62%
Topics: english, french, german, part-of-speech-tagger, social-media

someweta's People

Contributors: ausgerechnet, ianroberts, richstone, tsproisl

someweta's Issues

Jupyter Notebook: Future Warning possible nested set

I use SoMeWeTa in Jupyter Notebook 5.7.4 with Python 3.7.1. I installed SoMeWeTa from within the notebook using

import sys
!{sys.executable} -m pip install -U SoMeWeTa

When I try to run the following test code, which I found under "Using the Module":

from someweta import ASPTagger

model = "german_web_social_media_2018-12-21.model"
sentences = [["Ein", "Satz", "ist", "eine", "Liste", "von", "Tokens", "."],
             ["Zeitfliegen", "mögen", "einen", "Pfeil", "."]]

# future versions will have sensible default values
asptagger = ASPTagger(beam_size=5, iterations=10)
asptagger.load(model)

The output contains multiple warnings that look like this:

/anaconda3/lib/python3.7/site-packages/someweta/tagger.py:30: FutureWarning: Possible nested set at position 2
  self.email = re.compile(r"^[[:alnum:].%+-]+(?:@| \[?at\]? )[[:alnum:].-]+(?:\.| \[?dot\]? )[[:alpha:]]{2,}$", re.IGNORECASE)
/anaconda3/lib/python3.7/site-packages/someweta/tagger.py:30: FutureWarning: Possible nested set at position 34
  self.email = re.compile(r"^[[:alnum:].%+-]+(?:@| \[?at\]? )[[:alnum:].-]+(?:\.| \[?dot\]? )[[:alpha:]]{2,}$", re.IGNORECASE)
/anaconda3/lib/python3.7/site-packages/someweta/tagger.py:30: FutureWarning: Possible nested set at position 66
  self.email = re.compile(r"^[[:alnum:].%+-]+(?:@| \[?at\]? )[[:alnum:].-]+(?:\.| \[?dot\]? )[[:alpha:]]{2,}$", re.IGNORECASE)

Actually, everything seems to work correctly: I tested the following code:

for sentence in sentences:
    tagged_sentence = asptagger.tag_sentence(sentence)
    print("\n".join(["\t".join(t) for t in tagged_sentence]), "\n", sep="")

which gave the following correct output:

Ein ART
Satz NN
ist VAFIN
eine ART
Liste NN
von APPR
Tokens NN
. $.

Zeitfliegen NN
mögen VMFIN
einen ART
Pfeil NN
. $.

It might be useful for other users to fix this (maybe by adding an explicit installation guide for Jupyter Notebook).
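The warnings come from POSIX character classes such as [[:alnum:]], which Python's re module does not support; re interprets the doubled opening bracket as a possible nested set. Until the expressions are rewritten, the warnings can be silenced in the notebook with a filter along these lines (a workaround sketch, not an official fix):

import warnings

# silence the "Possible nested set" warnings raised while SoMeWeTa
# compiles its regular expressions (they are cosmetic; tagging works)
warnings.filterwarnings("ignore", category=FutureWarning,
                        module="someweta")

from someweta import ASPTagger  # import after installing the filter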

Multiprocessing is broken (at least on MacOS arm64)

Using --parallel slows down processing dramatically: with --parallel 8, throughput drops from about 10k tokens/s to roughly 1.5k tokens/s. It is not clear whether this is a problem with the multiprocessing functionality itself or with something the SoMeWeTa engine does, but there appears to be massive synchronisation overhead.

macOS 12.5.1 arm64 (M1)
Anaconda Python 3.9.12 with the current SoMeWeTa from PyPI
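One possible contributing factor (an assumption, not verified against SoMeWeTa's internals): since Python 3.8 the default multiprocessing start method on macOS is "spawn", under which every worker process re-imports the module and receives its own pickled copy of shared state, so for a large model the start-up and synchronisation cost can dwarf the parallel speed-up.

import multiprocessing as mp

# on macOS with Python >= 3.8 this prints "spawn"; on Linux it prints
# "fork", which shares the already-loaded model copy-on-write instead
# of pickling it to every worker
print(mp.get_start_method())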

inaccurate action word recognition

SoMeWeTa uses the tagset STTS_IBK for tagging. One of the differences between STTS and STTS_IBK is the tag for action words (AKW), e.g. for German lach (Beißwenger, Bartz, Storrer and Westpfahl, 2015).
I tested the accuracy of AKW tagging with a small sample of tokens. As you can see from the attached results, the accuracy is about 33 %.

You can reproduce the wrong tagging with the following minimal working example containing 10 sample sentences:

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS-Tagger)
# TODO: update path to the language model
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = set(["post"])

# generate PoS tags
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

#test sentences from authentic German CMC-data
sentences = ["Also das schlägt ja wohl dem Fass den Boden aus! :haeh:",
             "das mehr oder weniger gute Dlc gabs noch gratis dazu.",
            "Aus der Liste: definitiv Brink, obwohls für kurze Zeit Spaß gemacht "
            "hat, aber im Nachhinein hab ichs doch sehr bereut.",
            "*schluchz, heul*",
            "endlich, und dann noch als standalone-addon *freu*",
            "Und immer schön mit den Holländer zocken, da gabs die besten Preise.",
            "Ich freu mich riesig und weiß was ich im Wintersemester "
            "jeden Tag machen werde!!",
            "alles oben in der liste gabs unter bf2 auch schon in einer form.",
            "Mit dem Account werden weitere Features im Online-Modus des FM11 "
            "freigeschaltet, bswp mehr Statistiken, mehr Aktionskarten, mögliche "
            "Fantasy-Liga, yadda, yadda."]

akws = []
for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    tagged_sentence = tagged_sentences[0]
    for word in tagged_sentence:
        # collect the (token, tag) pair if it is tagged as an action word
        if len(word) > 1 and word[1] == 'AKW':
            akws.append(word)
print("tagged as AKW:", akws)

The output list akws contains two correct action words ('heul' and 'freu'). 'Haeh' is an emoticon, 'gabs' and 'obwohls' are in fact contractions, and 'bswp' is used as an abbreviation for German 'beispielsweise'.

Is this serious enough to be considered an issue, or have I implemented something wrong? As far as I can see, this error type is not part of the error analysis in Table 4 of Proisl (2018, p. 668).

Cited sources:

  • Beißwenger, Michael / Bartz, Thomas / Storrer, Angelika / Westpfahl, Swantje (2015). Tagset and guidelines for the PoS tagging of language data from genres of computer-mediated communication / social media, 19.
  • Proisl, Thomas (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), 665–670. Miyazaki: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1106

Lemmatizer

As far as I can see, there is no lemmatizer built into SoMeWeTa. It would be more convenient to have a third column in the output that contains the lemma.
My output looks like this:

Geschlagene ADJA
3-Monate NN
war VAFIN

What I have in mind is something like this:

Geschlagene ADJA schlagen
3-Monate NN 3-Monate
war VAFIN sein

This feature is supported, e.g., by TextBlob (see Words Inflection and Lemmatization) and Stanford CoreNLP. Using the PoS tags from SoMeWeTa in common lemmatisers (see this overview of 7 lemmatisers) can, in my eyes, only be done with a mapping of the PoS tags, which is time-consuming and non-pythonic and probably leads to a loss of information because of missing corresponding tags in the destination tagset.
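A minimal sketch of that tag-mapping approach, with a hypothetical coarse tagset and a toy lookup standing in for a real external lemmatiser:

# hypothetical STTS -> coarse-tag mapping; many STTS tags have no
# clean counterpart, which is exactly where information gets lost
STTS_TO_COARSE = {"ADJA": "ADJ", "NN": "NOUN", "VAFIN": "VERB"}

def lemmatize(token, coarse_pos):
    # toy stand-in for an external lemmatiser
    toy_lexicon = {("Geschlagene", "ADJ"): "schlagen", ("war", "VERB"): "sein"}
    return toy_lexicon.get((token, coarse_pos), token)

def add_lemmas(tagged_sentence):
    for token, stts in tagged_sentence:
        coarse = STTS_TO_COARSE.get(stts)
        yield token, stts, lemmatize(token, coarse) if coarse else token

print(list(add_lemmas([("Geschlagene", "ADJA"), ("3-Monate", "NN"), ("war", "VAFIN")])))
# [('Geschlagene', 'ADJA', 'schlagen'), ('3-Monate', 'NN', '3-Monate'), ('war', 'VAFIN', 'sein')]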

Model loading is very memory hungry

Taking the spoken Italian model as an example, loading the model into memory (ASPTagger.load) causes the memory usage of the Python process to rise briefly to nearly 4 GB. Once the model is loaded, memory usage drops to a more reasonable 1.7 GB and remains there in the steady state.

The format used to store models on disk is gzip-compressed JSON, with the weight numbers stored as base85-encoded strings. This format is rather inefficient to load, since we must

  • load the entire JSON array-of-arrays into memory
  • duplicate the vocabulary list to turn it into a set
  • zip together the parallel lists of feature names and weights, and for each entry base-85 decode the weight and add the pair to a dict
  • then throw away the original lists of vocabulary, features and weights

If the feature-name/weight pairs were instead serialized together (either as a {"feature": "base85-weight", ...} object or as a transposed list of 2-element lists), it would be possible to parse the model file in a single streaming pass, eliminating the need to make multiple copies of potentially very large arrays in memory.
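For illustration, single-pass loading under the proposed object layout might look like this (a sketch only: the top-level "weights" key is part of the hypothetical reorganised format, not the current one, and it relies on the third-party streaming parser ijson):

import base64
import gzip

import ijson  # third-party streaming JSON parser
import numpy as np

weights = {}
with gzip.open("model.json.gz", "rb") as f:
    # stream (feature, base85-weight) pairs one at a time instead of
    # materialising the whole JSON document in memory
    for feature, w85 in ijson.kvitems(f, "weights"):
        weights[feature] = np.frombuffer(base64.b85decode(w85), np.float64)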

Question regarding multiprocessing

Hi,

I see that it is possible to pass the "parallel" argument to the command-line interface to use multiprocessing. Is the same possible via the Python API? I do not see a "parallel" argument in the ASPTagger constructor, and it is not obvious from cli.py how to replicate this and obtain the output.

Thank you in advance for the answer.
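For what it's worth, parallelisation can be done at the application level around the Python API; a rough sketch (each worker loads its own copy of the model, which multiplies memory usage accordingly):

import multiprocessing as mp

from someweta import ASPTagger

_tagger = None

def _init(model_path):
    # each worker process loads its own copy of the model
    global _tagger
    _tagger = ASPTagger()
    _tagger.load(model_path)

def _tag(sentence):
    return _tagger.tag_sentence(sentence)

if __name__ == "__main__":
    sentences = [["Ein", "Satz", "."], ["Noch", "ein", "Satz", "."]]
    with mp.Pool(2, initializer=_init,
                 initargs=("german_web_social_media_2018-12-21.model",)) as pool:
        for tagged in pool.map(_tag, sentences):
            print(tagged)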

Normalising tokens before tagging

Especially in social media data, one frequently encounters words and whole sentences that are rendered in a different typeface or font style purely through Unicode characters: 𝖋𝖗𝖊𝖎𝖍𝖊𝖎𝖙, 𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘, 𝘔𝘢𝘴𝘬𝘦𝘯𝘱𝘧𝘭𝘪𝘤𝘩𝘵 and the like.

As a rule, SoMeWeTa does not tag these tokens correctly, which could presumably be remedied by NFKC normalisation:

import unicodedata
unicodedata.normalize("NFKC", "𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘")
Out[2]: 'Impfausweis'

Since compatibility equivalence unfortunately affects more than just such cases, this should probably be optional.
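Until such an option exists, callers can apply the normalisation themselves before handing tokens to the tagger; a minimal sketch:

import unicodedata

def nfkc_tokens(sentence):
    # map compatibility variants (styled Unicode letters etc.) onto
    # their plain equivalents
    return [unicodedata.normalize("NFKC", token) for token in sentence]

print(nfkc_tokens(["𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘", "𝘔𝘢𝘴𝘬𝘦𝘯𝘱𝘧𝘭𝘪𝘤𝘩𝘵"]))
# ['Impfausweis', 'Maskenpflicht']; the normalised sentence can then be
# passed to ASPTagger.tag_sentence as usual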

Mistagging of homographic, sentence-initial verbs

As mentioned by Horbach et al. (2015, p. 44), sentence-initial verbs occur frequently in CMC data and are often mistagged as nouns by standard tools. I checked the behaviour of SoMeWeTa with the german_web_social_media_2018-12-21.model and noted that it does a really good job of recognising these kinds of verbs.
The example provided in Horbach et al. (2015, p. 44) is in fact a tricky one:

Blicke da auch nicht so richtig durch und habe Probleme damit

Blicke is homographic with the plural of the German noun 'der Blick' but is meant as the first person singular of the verb 'blicken' in the example.
In this case, SoMeWeTa, too, mistags it as a noun. This seems to be true for some of the rare cases of homographic sentence-initial verbs. (To be precise: they have to be homographic with a token of another part-of-speech subcategory.)

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS-Tagger)
# TODO: update path to your model here
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = set(["post"])

# generate PoS-Tags
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

# test sentences (introspectively constructed German examples)
sentences = ["Blicke da auch nicht durch.",
             "Check ich auch nicht.",
             "Schau mir das morgen an.",
             "Trank kurz den Tee fertig."]

for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    tagged_sentence = tagged_sentences[0]
    print(tagged_sentence)

If you run the above code the output will be:

  • incorrect for: [('Blicke', 'NN'), ('da', 'ADV'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('durch', 'PTKVZ'), ('.', '$.')]
  • correct for: [('Check', 'VVFIN'), ('ich', 'PPER'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('.', '$.')]
  • correct for: [('Schau', 'VVIMP'), ('mir', 'PPER'), ('das', 'ART'), ('morgen', 'NN'), ('an', 'PTKVZ'), ('.', '$.')]
  • correct for: [('Trank', 'VVFIN'), ('kurz', 'ADJD'), ('den', 'ART'), ('Tee', 'NN'), ('fertig', 'ADJD'), ('.', '$.')]

The homographs in my examples are the following nouns: 'der Check', 'die Schau' and 'der Trank'.
As you can see from the example above, only the example sentence of Horbach et al. seems to be affected; all other test sentences have been tagged correctly. I have not yet discerned a pattern behind the failure. As this is not part of the documented errors of SoMeWeTa (Proisl, 2018, p. 667), I considered it an issue.

Sources:

  • Horbach, Andrea / Thater, Stefan / Steffen, Diana / Fischer, Peter M. / Witt, Andreas / Pinkal, Manfred (2015). Internet Corpora: A Challenge for Linguistic Processing. Datenbank-Spektrum, 15(1), 41–47. https://doi.org/10.1007/s13222-014-0172-z
  • Proisl, Thomas (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), 665–670. Miyazaki: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1106

"fromstring deprecated" (warnings in Jupyter notebook)

I'm trying to use SoMeWeTa in a notebook and get this warning over and over:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/someweta/tagger.py:225: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  self.weights = {f: np.fromstring(base64.b85decode(w), np.float64) for f, w in zip(features, weights)}

The model (German social media) does not even load; I can't run this minimal example:

asptagger = ASPTagger(beam_size=5, iterations=10)
asptagger.load(pos_model)
print("got model!")

much less actually tag something:
asptagger.tag_sentence(["Ein", "Satz", "ist", "eine", "Liste", "von", "Tokens", "."])

This used to work before (i.e. last year) so I'm not sure what changed.
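The deprecation message itself names the fix: np.frombuffer is the drop-in replacement for the binary mode of np.fromstring. A self-contained round-trip of the decoding step (illustrative values, not actual model contents):

import base64

import numpy as np

# encode a weight vector the way the model file stores it, then decode
# it with frombuffer instead of the deprecated fromstring
encoded = base64.b85encode(np.array([1.0, 2.0], dtype=np.float64).tobytes())
weights = np.frombuffer(base64.b85decode(encoded), np.float64)
print(weights)  # [1. 2.]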

Sentence Option

When using SoMeWeTa in a pipeline with SoMaJo, a corresponding 'sentence tag' option would come in handy. Right now,

less some-file.xml.gz | somajo-tokenizer --xml --split_sentences --sentence_tag s --tag p --parallel 4 - | somewe-tagger --xml --parallel 4 --tag german_web_social_media_2020-05-28.model -

does not work, since SoMeWeTa expects empty lines as sentence breaks.

This can be fixed on the fly with something like

less some-file.xml.gz | somajo-tokenizer --xml --split_sentences --sentence_tag s --tag p --parallel 4 - | sed 's|</s>|</s>\n|' - | somewe-tagger --xml --parallel 4 --tag german_web_social_media_2020-05-28.model -

but this is rather hacky …

Please enhance this amazing tool!
