tsproisl / someweta

A part-of-speech tagger with support for domain adaptation and external resources.

License: GNU General Public License v3.0

Languages: Python 71.38%, Shell 28.62%
Topics: english, french, german, part-of-speech-tagger, social-media

someweta's People

Contributors: ausgerechnet, ianroberts, richstone, tsproisl

someweta's Issues

Jupyter Notebook: Future Warning possible nested set

I use SoMeWeTa in Jupyter Notebook 5.7.4 with Python 3.7.1. I installed SoMeWeTa from within the notebook using

import sys
!{sys.executable} -m pip install -U SoMeWeTa

When I try to run the following test code, which I found under "Using the Module":

from someweta import ASPTagger

model = "german_web_social_media_2018-12-21.model"
sentences = [["Ein", "Satz", "ist", "eine", "Liste", "von", "Tokens", "."],
             ["Zeitfliegen", "mögen", "einen", "Pfeil", "."]]

# future versions will have sensible default values
asptagger = ASPTagger(beam_size=5, iterations=10)
asptagger.load(model)

The output contains multiple warnings that look like this:

/anaconda3/lib/python3.7/site-packages/someweta/tagger.py:30: FutureWarning: Possible nested set at position 2
  self.email = re.compile(r"^[[:alnum:].%+-]+(?:@| \[?at\]? )[[:alnum:].-]+(?:\.| \[?dot\]? )[[:alpha:]]{2,}$", re.IGNORECASE)
/anaconda3/lib/python3.7/site-packages/someweta/tagger.py:30: FutureWarning: Possible nested set at position 34
  self.email = re.compile(r"^[[:alnum:].%+-]+(?:@| \[?at\]? )[[:alnum:].-]+(?:\.| \[?dot\]? )[[:alpha:]]{2,}$", re.IGNORECASE)
/anaconda3/lib/python3.7/site-packages/someweta/tagger.py:30: FutureWarning: Possible nested set at position 66
  self.email = re.compile(r"^[[:alnum:].%+-]+(?:@| \[?at\]? )[[:alnum:].-]+(?:\.| \[?dot\]? )[[:alpha:]]{2,}$", re.IGNORECASE)

Actually, everything seems to work correctly: I tested the following code:

for sentence in sentences:
    tagged_sentence = asptagger.tag_sentence(sentence)
    print("\n".join(["\t".join(t) for t in tagged_sentence]), "\n", sep="")

which gave the following correct output:

Ein ART
Satz NN
ist VAFIN
eine ART
Liste NN
von APPR
Tokens NN
. $.

Zeitfliegen NN
mögen VMFIN
einen ART
Pfeil NN
. $.

It might be useful for other users to fix this (maybe by adding an explicit installation guide for Jupyter Notebook).
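The warnings come from POSIX character classes such as [[:alnum:]], which Python's re module does not support; re interprets the doubled opening bracket as a possible nested set. Until the expressions are rewritten, the warnings can be silenced in the notebook with a filter along these lines (a workaround sketch, not an official fix):

import warnings

# silence the "Possible nested set" warnings raised while SoMeWeTa
# compiles its regular expressions (they are cosmetic; tagging works)
warnings.filterwarnings("ignore", category=FutureWarning,
                        module="someweta")

from someweta import ASPTagger  # import after installing the filter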

Multiprocessing is broken (at least on MacOS arm64)

Using --parallel slows down processing dramatically: with --parallel 8, throughput drops from about 10k tokens/s to roughly 1.5k tokens/s. It is not clear whether this is a problem with the multiprocessing functionality itself or with something the SoMeWeTa engine does, but there appears to be massive synchronisation overhead.

macOS 12.5.1 arm64 (M1)
Anaconda Python 3.9.12 with the current SoMeWeTa from PyPI
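One possible contributing factor (an assumption, not verified against SoMeWeTa's internals): since Python 3.8 the default multiprocessing start method on macOS is "spawn", under which every worker process re-imports the module and receives its own pickled copy of shared state, so for a large model the start-up and synchronisation cost can dwarf the parallel speed-up.

import multiprocessing as mp

# on macOS with Python >= 3.8 this prints "spawn"; on Linux it prints
# "fork", which shares the already-loaded model copy-on-write instead
# of pickling it to every worker
print(mp.get_start_method())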

inaccurate action word recognition

SoMeWeTa uses the tagset STTS_IBK for tagging. One of the differences between STTS and STTS_IBK is the tag for action words (AKW), e.g. for German lach (Beißwenger, Bartz, Storrer and Westpfahl, 2015).
I tested the accuracy of AKW tagging with a small sample of tokens. As you can see from the attached results, the accuracy is about 33 %.

You can reproduce the wrong tagging with the following minimal working example containing 10 sample sentences:

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS-Tagger)
# TODO: update path to the language model
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = set(["post"])

# generate PoS tags
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

#test sentences from authentic German CMC-data
sentences = ["Also das schlägt ja wohl dem Fass den Boden aus! :haeh:",
             "das mehr oder weniger gute Dlc gabs noch gratis dazu.",
            "Aus der Liste: definitiv Brink, obwohls für kurze Zeit Spaß gemacht "
            "hat, aber im Nachhinein hab ichs doch sehr bereut.",
            "*schluchz, heul*",
            "endlich, und dann noch als standalone-addon *freu*",
            "Und immer schön mit den Holländer zocken, da gabs die besten Preise.",
            "Ich freu mich riesig und weiß was ich im Wintersemester "
            "jeden Tag machen werde!!",
            "alles oben in der liste gabs unter bf2 auch schon in einer form.",
            "Mit dem Account werden weitere Features im Online-Modus des FM11 "
            "freigeschaltet, bswp mehr Statistiken, mehr Aktionskarten, mögliche "
            "Fantasy-Liga, yadda, yadda."]

akws = []
for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    tagged_sentence = tagged_sentences[0]
    for word in tagged_sentence:
        # collect the (token, tag) pair if it is tagged as an action word
        if len(word) > 1 and word[1] == 'AKW':
            akws.append(word)
print("tagged as AKW:", akws)

The output list akws contains two correct action words ('heul' and 'freu'). 'Haeh' is an emoticon, 'gabs' and 'obwohls' are in fact contractions, and 'bswp' is used as an abbreviation for German 'beispielsweise'.

Is this serious enough to be considered an issue, or have I implemented something wrong? As far as I can see, this error type is not part of the error analysis in Table 4 of Proisl (2018, p. 668).

Cited sources:

  • Beißwenger, Michael / Bartz, Thomas / Storrer, Angelika / Westpfahl, Swantje (2015). Tagset and guidelines for the PoS tagging of language data from genres of computer-mediated communication / social media, 19.
  • Proisl, Thomas (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), 665–670. Miyazaki: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1106

Lemmatizer

As far as I can see, there is no lemmatizer built into SoMeWeTa. It would be more convenient to have a third column in the output that contains the lemma.
My output looks like this:

Geschlagene ADJA
3-Monate NN
war VAFIN

What I have in mind is something like this:

Geschlagene ADJA schlagen
3-Monate NN 3-Monate
war VAFIN sein

This feature is supported, e.g., by TextBlob (see Words Inflection and Lemmatization) and Stanford CoreNLP. Using the PoS tags from SoMeWeTa in common lemmatisers (see this overview of 7 lemmatisers) can, in my eyes, only be done with a mapping of the PoS tags, which is time-consuming and non-pythonic and probably leads to a loss of information because of missing corresponding tags in the destination tagset.
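A minimal sketch of that tag-mapping approach, with a hypothetical coarse tagset and a toy lookup standing in for a real external lemmatiser:

# hypothetical STTS -> coarse-tag mapping; many STTS tags have no
# clean counterpart, which is exactly where information gets lost
STTS_TO_COARSE = {"ADJA": "ADJ", "NN": "NOUN", "VAFIN": "VERB"}

def lemmatize(token, coarse_pos):
    # toy stand-in for an external lemmatiser
    toy_lexicon = {("Geschlagene", "ADJ"): "schlagen", ("war", "VERB"): "sein"}
    return toy_lexicon.get((token, coarse_pos), token)

def add_lemmas(tagged_sentence):
    for token, stts in tagged_sentence:
        coarse = STTS_TO_COARSE.get(stts)
        yield token, stts, lemmatize(token, coarse) if coarse else token

print(list(add_lemmas([("Geschlagene", "ADJA"), ("3-Monate", "NN"), ("war", "VAFIN")])))
# [('Geschlagene', 'ADJA', 'schlagen'), ('3-Monate', 'NN', '3-Monate'), ('war', 'VAFIN', 'sein')]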

Model loading is very memory hungry

Taking the spoken Italian model as an example, loading the model into memory (ASPTagger.load) causes the memory usage of the Python process to rise briefly to nearly 4 GB. Once the model is loaded, memory usage drops to a more reasonable 1.7 GB and remains there in the steady state.

The format used to store models on disk is gzip-compressed JSON, with the weight numbers stored as base85-encoded strings. This format is rather inefficient to load, since we must

  • load the entire JSON array-of-arrays into memory
  • duplicate the vocabulary list to turn it into a set
  • zip together the parallel lists of feature names and weights, and for each entry base-85 decode the weight and add the pair to a dict
  • then throw away the original lists of vocabulary, features and weights

If the feature-name/weight pairs were instead serialized together (either as a {"feature": "base85-weight", ...} object or as a transposed list of 2-element lists), it would be possible to parse the model file in a single streaming pass, eliminating the need to make multiple copies of potentially very large arrays in memory.
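For illustration, single-pass loading under the proposed object layout might look like this (a sketch only: the top-level "weights" key is part of the hypothetical reorganised format, not the current one, and it relies on the third-party streaming parser ijson):

import base64
import gzip

import ijson  # third-party streaming JSON parser
import numpy as np

weights = {}
with gzip.open("model.json.gz", "rb") as f:
    # stream (feature, base85-weight) pairs one at a time instead of
    # materialising the whole JSON document in memory
    for feature, w85 in ijson.kvitems(f, "weights"):
        weights[feature] = np.frombuffer(base64.b85decode(w85), np.float64)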

Question regarding multiprocessing

Hi,

I see that it is possible to pass the "parallel" argument to the command-line interface to use multiprocessing. Is the same possible via the Python API? I do not see a "parallel" argument in the ASPTagger constructor, and it is not obvious from cli.py how to replicate this and obtain the output.

Thank you in advance for the answer.
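For what it's worth, parallelisation can be done at the application level around the Python API; a rough sketch (each worker loads its own copy of the model, which multiplies memory usage accordingly):

import multiprocessing as mp

from someweta import ASPTagger

_tagger = None

def _init(model_path):
    # each worker process loads its own copy of the model
    global _tagger
    _tagger = ASPTagger()
    _tagger.load(model_path)

def _tag(sentence):
    return _tagger.tag_sentence(sentence)

if __name__ == "__main__":
    sentences = [["Ein", "Satz", "."], ["Noch", "ein", "Satz", "."]]
    with mp.Pool(2, initializer=_init,
                 initargs=("german_web_social_media_2018-12-21.model",)) as pool:
        for tagged in pool.map(_tag, sentences):
            print(tagged)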

Normalising tokens before tagging

Especially in social media data, one frequently encounters words and whole sentences that are rendered in a different typeface or font style purely through Unicode characters: 𝖋𝖗𝖊𝖎𝖍𝖊𝖎𝖙, 𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘, 𝘔𝘢𝘴𝘬𝘦𝘯𝘱𝘧𝘭𝘪𝘤𝘩𝘵 and the like.

As a rule, SoMeWeTa does not tag these tokens correctly, which could presumably be remedied by NFKC normalisation:

import unicodedata
unicodedata.normalize("NFKC", "𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘")
Out[2]: 'Impfausweis'

Since compatibility equivalence unfortunately affects more than just such cases, this should probably be optional.
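Until such an option exists, callers can apply the normalisation themselves before handing tokens to the tagger; a minimal sketch:

import unicodedata

def nfkc_tokens(sentence):
    # map compatibility variants (styled Unicode letters etc.) onto
    # their plain equivalents
    return [unicodedata.normalize("NFKC", token) for token in sentence]

print(nfkc_tokens(["𝕴𝖒𝖕𝖋𝖆𝖚𝖘𝖜𝖊𝖎𝖘", "𝘔𝘢𝘴𝘬𝘦𝘯𝘱𝘧𝘭𝘪𝘤𝘩𝘵"]))
# ['Impfausweis', 'Maskenpflicht']; the normalised sentence can then be
# passed to ASPTagger.tag_sentence as usual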

Mistagging of homographic, sentence-initial verbs

As mentioned by Horbach et al. (2015, p. 44), sentence-initial verbs occur frequently in CMC data and are often mistagged as nouns by standard tools. I checked the behaviour of SoMeWeTa with the german_web_social_media_2018-12-21.model and noted that it does a really good job of recognising these kinds of verbs.
The example provided in Horbach et al. (2015, p. 44) is in fact a tricky one:

Blicke da auch nicht so richtig durch und habe Probleme damit

Blicke is homographic with the plural of the German noun 'der Blick' but is meant as the first person singular of the verb 'blicken' in the example.
In this case, SoMeWeTa, too, mistags it as a noun. This seems to be true for some of the rare cases of homographic sentence-initial verbs. (To be precise: they have to be homographic with a token of another part-of-speech subcategory.)

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS-Tagger)
# TODO: update path to your model here
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = set(["post"])

# generate PoS-Tags
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

# test sentences (introspectively constructed German examples)
sentences = ["Blicke da auch nicht durch.",
             "Check ich auch nicht.",
             "Schau mir das morgen an.",
             "Trank kurz den Tee fertig."]

for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    tagged_sentence = tagged_sentences[0]
    print(tagged_sentence)

If you run the above code the output will be:

  • incorrect for: [('Blicke', 'NN'), ('da', 'ADV'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('durch', 'PTKVZ'), ('.', '$.')]
  • correct for: [('Check', 'VVFIN'), ('ich', 'PPER'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('.', '$.')]
  • correct for: [('Schau', 'VVIMP'), ('mir', 'PPER'), ('das', 'ART'), ('morgen', 'NN'), ('an', 'PTKVZ'), ('.', '$.')]
  • correct for: [('Trank', 'VVFIN'), ('kurz', 'ADJD'), ('den', 'ART'), ('Tee', 'NN'), ('fertig', 'ADJD'), ('.', '$.')]

The homographs in my examples are the following nouns: 'der Check', 'die Schau' and 'der Trank'.
As you can see from the example above, only the example sentence of Horbach et al. seems to be affected; all other test sentences have been tagged correctly. I have not yet discerned a pattern behind the failure. As this is not part of the documented errors of SoMeWeTa (Proisl, 2018, p. 667), I considered it an issue.

Sources:

  • Horbach, Andrea / Thater, Stefan / Steffen, Diana / Fischer, Peter M. / Witt, Andreas / Pinkal, Manfred (2015). Internet Corpora: A Challenge for Linguistic Processing. Datenbank-Spektrum, 15(1), 41–47. https://doi.org/10.1007/s13222-014-0172-z
  • Proisl, Thomas (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), 665–670. Miyazaki: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1106

"fromstring deprecated" (warnings in Jupyter notebook)

I'm trying to use SoMeWeTa in a notebook and get this warning over and over:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/someweta/tagger.py:225: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  self.weights = {f: np.fromstring(base64.b85decode(w), np.float64) for f, w in zip(features, weights)}

The model (German social media) does not even load; I can't run this minimal example:

asptagger = ASPTagger(beam_size=5, iterations=10)
asptagger.load(pos_model)
print("got model!")

much less actually tag something:
asptagger.tag_sentence(["Ein", "Satz", "ist", "eine", "Liste", "von", "Tokens", "."])

This used to work before (i.e. last year) so I'm not sure what changed.
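The deprecation message itself names the fix: np.frombuffer is the drop-in replacement for the binary mode of np.fromstring. A self-contained round-trip of the decoding step (illustrative values, not actual model contents):

import base64

import numpy as np

# encode a weight vector the way the model file stores it, then decode
# it with frombuffer instead of the deprecated fromstring
encoded = base64.b85encode(np.array([1.0, 2.0], dtype=np.float64).tobytes())
weights = np.frombuffer(base64.b85decode(encoded), np.float64)
print(weights)  # [1. 2.]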

Sentence Option

When using SoMeWeTa in a pipeline with SoMaJo, a corresponding 'sentence tag' option would come in handy. Right now,

less some-file.xml.gz | somajo-tokenizer --xml --split_sentences --sentence_tag s --tag p --parallel 4 - | somewe-tagger --xml --parallel 4 --tag german_web_social_media_2020-05-28.model -

does not work, since SoMeWeTa expects empty lines as sentence breaks.

This can be fixed on the fly with something like

less some-file.xml.gz | somajo-tokenizer --xml --split_sentences --sentence_tag s --tag p --parallel 4 - | sed 's|</s>|</s>\n|' - | somewe-tagger --xml --parallel 4 --tag german_web_social_media_2020-05-28.model -

but this is rather hacky …

Please enhance this amazing tool!
