
gambolputty / german-nouns

A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.

License: Creative Commons Attribution Share Alike 4.0 International

corpus german-language german-nouns nouns parser wiktionary
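
A minimal usage sketch of the lookup module (the API as it is used in the issues below; actual output depends on the bundled Wiktionary data):

from german_nouns.lookup import Nouns

nouns = Nouns()

entries = nouns["Fahrrad"]  # list of matching entries (lemma, genus, flexion, ...)
parts = nouns.parse_compound("Haustür")  # split a compound word into its parts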

german-nouns's People

Contributors

gambolputty, salaheddineghamri


german-nouns's Issues

same thing but for wikipedia archives

Hello, I need to search for words with some specific syllables for my game, but Wiktionary is kind of empty compared to Wikipedia. Is there any way to make this, but for Wikipedia archives?

Automatically handle compound words

I have some code roughly like the following to automatically handle compound words that are not supported explicitly.

from german_nouns.lookup import Nouns

nouns = Nouns()


def lookup_with_compounds(word):
    result = nouns[word]
    if len(result) == 0:
        if "-" in word:
            # hyphenated compound: look up the last part, keep the prefix as-is
            words = word.split("-")
            word = words[-1]
            words = "-".join(words[0:-1]) + "-"
            lower = False
        else:
            # closed compound: let the module split it
            words = nouns.parse_compound(word)
            if len(words) < 2:
                return []

            word = words[-1]
            words = "".join(words[0:-1])
            lower = True

        result = nouns[word]
        if len(result) == 0:
            return []

        # re-attach the prefix to the lemma and all inflected forms
        for i in range(len(result)):
            lemma = result[i]["lemma"].lower() if lower else result[i]["lemma"]
            result[i]["lemma"] = words + lemma
            for flexion in result[i]["flexion"]:
                flexion_expanded = (
                    result[i]["flexion"][flexion].lower()
                    if lower
                    else result[i]["flexion"][flexion]
                )

                result[i]["flexion"][flexion] = words + flexion_expanded

    return result
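
Used, for example, like this (assuming "Haustür" itself is missing but its last part is in the data):

for entry in lookup_with_compounds("Haustür"):
    print(entry["lemma"], entry.get("genus"))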

Would you welcome a PR adding this natively, potentially behind a config variable?

add non-exact match strategy

In our testing we found more missing words than we can realistically add to Wiktionary (see also #8), so we implemented the following strategy, which at least allows us to detect the genus. Would this be interesting to add to your package?

from german_nouns.lookup import Nouns

# module-level dictionary instance used by the helpers below
german_nouns = Nouns()

primary_german_genus_endings = {
    "n": [
        "chen",
        "ett",
        "eau",
        "lein",
        "icht",
        "il",
        "ium",
        "it",
        "ma",
        "ment",
        "tel",
        "tum",
        "um",
    ],
    "f": [
        "in",
        "a",
        "ade",
        "age",
        "anz",
        "elle",
        "ette",
        "ere",
        "enz",
        "ei",
        "ine",
        "isse",
        "itis",
        "ive",
        "ie",
        "heit",
        "keit",
        "ik",
        "sion",
        "se",
        "sis",
        "tät",
        "ung",
        "ur",
        "schaft",
    ],
    "m": [
        "ant",
        "ast",
        "ich",
        "ist",
        "ig",
        "ling",
        "or",
        "us",
        "ismus",
        "är",
        "eur",
        "iker",
        "ps",
    ],
}

secondary_german_genus_endings = {
    # Three out of four words ending in -nis or -sal are neuter nouns
    "n": [
        "nis", "sal",
    ],
    # There are exceptions such as Postillion, which is masculine, while the
    # overwhelming majority of -ion words in German are feminine
    "f": [
        "ion",
    ],
    # More than half of the words ending in -er, -en or -el are masculine
    "m": [
        "er", "en", "el",
    ],
}

def determine_genus_from_ending(word, german_genus_endings):
    for genus in german_genus_endings:
        for ending in german_genus_endings[genus]:
            if word.endswith(ending):
                return {"genus": genus}

    return None


def german_noun_lookup(word):
    result = german_nouns[word]
    if not result:
        return None

    result = result[0]

    if "genus" in result:
        return result

    # nouns with more than one genus store it in "genus 1", "genus 2", ...
    if "genus 1" in result:
        result["genus"] = result["genus 1"]

        return result

    # words ending in -leute
    if word[-5:].lower() == "leute":
        result["genus"] = "f"

        return result

    genus_result = determine_genus_from_ending(word, primary_german_genus_endings)
    if genus_result is None:
        genus_result = determine_genus_from_ending(word, secondary_german_genus_endings)
        if genus_result is None:
            return None

    result["genus"] = genus_result["genus"]

    return result


def german_noun_analysis(word, genus_only=False):
    result = german_noun_lookup(word)
    if result is not None:
        return result

    if genus_only:
        result = determine_genus_from_ending(word, primary_german_genus_endings)

        if result is not None:
            return result

    # unknown word: look for a known noun at the tail of the word,
    # skipping the first 2 letters ...
    i = 2

    # ... and the last 2 letters
    while i < len(word) - 2:
        partial_word = word[i:]

        # avoid cases like 'Ende' at the end of 'Arbeitgebende'
        if partial_word == "ende":
            break

        result = german_noun_lookup(partial_word.capitalize())
        if result is None:
            i += 1
            continue

        result["lemma"] = word
        if not genus_only:
            # re-attach the unmatched prefix to all inflected forms
            word_prefix = word[0:i]
            for flexion in result["flexion"]:
                result["flexion"][flexion] = (
                    word_prefix + result["flexion"][flexion].lower()
                )

        return result

    if genus_only:
        result = determine_genus_from_ending(word, primary_german_genus_endings)

    return result
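
For example (words chosen for illustration; the genus comes either from Wiktionary or from the ending heuristics):

for w in ["Verlässlichkeit", "Arbeitsmarktsituation"]:
    analysis = german_noun_analysis(w, genus_only=True)
    print(w, analysis["genus"] if analysis else "unknown")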

UnicodeDecodeError on Windows with Python 3.9

I was testing the german-nouns module today with the example from the README.md and ran into the following error:

Traceback (most recent call last):
  File "c:\Users\...\Documents\...\....py", line 3, in <module>
    nouns = Nouns()
  File "C:\Users\...\AppData\Local\Programs\Python\Python39\lib\site-packages\german_nouns\lookup\__init__.py", line 23, in __init__
    data = list(csv.reader(open(CSV_FILE_PATH)))
  File "C:\Users\...\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 4372: character maps to <undefined>

After editing "C:\Users\...\AppData\Local\Programs\Python\Python39\lib\site-packages\german_nouns\lookup\__init__.py" at line 23, changing data = list(csv.reader(open(CSV_FILE_PATH))) to data = list(csv.reader(open(CSV_FILE_PATH, encoding='utf-8'))), it worked fine.

I'm using Python 3.9 on a Windows 10 x64 machine.
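
A slightly more robust variant of that fix would also close the file via a context manager, e.g.:

with open(CSV_FILE_PATH, encoding='utf-8') as f:
    data = list(csv.reader(f))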

why not use sqlite?

Looking at

def create_index(self) -> None:
    print('Creating index once')
    # create index
    col_index_to_skip = {1, 2, 3, 4, 5, 6}  # everything before "pos" and after the last "genus" column
    for row_idx, row in enumerate(self.data):
        for col_idx, word in enumerate(row):
            if col_idx in col_index_to_skip:
                continue
            if not word:
                continue
            word_low = word.lower()
            if row_idx not in self.index[word_low]:
                # use a list, because the order of the indexes must be preserved
                self.index[word_low].append(row_idx)
    # save index file
    output = ''
    for k, v in self.index.items():
        indexes = '\t'.join(str(x) for x in v)
        output += f'{k}\t{indexes}\n'
    with open(INDEX_FILE_PATH, 'w', encoding='utf-8') as f:
        f.write(output)

def load_index(self) -> None:
    with open(INDEX_FILE_PATH, encoding='utf8') as f:
        lines = [l.strip() for l in f.readlines()]
    for line in lines:
        split = line.split('\t')
        word_low = split[0]
        self.index[word_low] = [int(x) for x in split[1:] if x]

this logic screams for just writing import sqlite3 (which ships with the Python interpreter!), creating a table in memory (or on disk), maybe even with a column index, and writing some simple SQL queries.
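
For instance, a minimal sketch of the idea (table and column names are illustrative, not an actual schema from the package):

import sqlite3

# build the index into SQLite instead of a hand-rolled TSV file
con = sqlite3.connect("nouns_index.db")  # or ":memory:"
con.execute("CREATE TABLE IF NOT EXISTS word_index (word TEXT, row_idx INTEGER)")
con.execute("CREATE INDEX IF NOT EXISTS idx_word ON word_index (word)")

# (lowercased word, CSV row index) pairs, e.g. from create_index's inner loop
rows = [("haus", 123), ("haus", 456)]
con.executemany("INSERT INTO word_index VALUES (?, ?)", rows)
con.commit()

# ORDER BY rowid preserves insertion order, like the list in the current code
hits = [r[0] for r in con.execute(
    "SELECT row_idx FROM word_index WHERE word = ? ORDER BY rowid", ("haus",)
)]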

memory use

I am running into issues with memory use since integrating your solution into an API.

Now I am pondering ways to reduce memory use and increase scalability. One idea is to move all of these mappings into Redis. That way the data would be loaded into memory only once, regardless of how many workers I have. As long as Redis holds up and does not introduce too much latency, this could be a very memory-efficient solution.

I still need to implement this, but the question is: would you accept a PR that optionally uses Redis?
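
A rough sketch of the idea with redis-py (the key layout is hypothetical):

import json

import redis

r = redis.Redis()

def store_entries(word, entries):
    # one-time load: one key per lowercased word, shared by all workers
    r.set(f"german_nouns:{word.lower()}", json.dumps(entries))

def lookup(word):
    raw = r.get(f"german_nouns:{word.lower()}")
    return json.loads(raw) if raw else []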
