
gambolputty / german-nouns

A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.

License: Creative Commons Attribution Share Alike 4.0 International

corpus german-language german-nouns nouns parser wiktionary
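
A minimal usage sketch of the lookup module (the API as it is used in the issues below; actual output depends on the bundled Wiktionary data):

from german_nouns.lookup import Nouns

nouns = Nouns()

entries = nouns["Fahrrad"]  # list of matching entries (lemma, genus, flexion, ...)
parts = nouns.parse_compound("Haustür")  # split a compound word into its parts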

german-nouns's People

Contributors

gambolputty, salaheddineghamri


german-nouns's Issues

same thing but for wikipedia archives

Hello, I need to search for words with some specific syllables for my game, but Wiktionary is kind of empty compared to Wikipedia. Is there any way to make this, but for Wikipedia archives?

Automatically handle compound words

I have some code roughly like the following to automatically handle compound words that are not supported explicitly.

from german_nouns.lookup import Nouns

nouns = Nouns()


def lookup_with_compounds(word):
    result = nouns[word]
    if len(result) == 0:
        if "-" in word:
            # hyphenated compound: look up the last part, keep the prefix as-is
            words = word.split("-")
            word = words[-1]
            words = "-".join(words[0:-1]) + "-"
            lower = False
        else:
            # closed compound: let the module split it
            words = nouns.parse_compound(word)
            if len(words) < 2:
                return []

            word = words[-1]
            words = "".join(words[0:-1])
            lower = True

        result = nouns[word]
        if len(result) == 0:
            return []

        # re-attach the prefix to the lemma and all inflected forms
        for i in range(len(result)):
            lemma = result[i]["lemma"].lower() if lower else result[i]["lemma"]
            result[i]["lemma"] = words + lemma
            for flexion in result[i]["flexion"]:
                flexion_expanded = (
                    result[i]["flexion"][flexion].lower()
                    if lower
                    else result[i]["flexion"][flexion]
                )

                result[i]["flexion"][flexion] = words + flexion_expanded

    return result
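
Used, for example, like this (assuming "Haustür" itself is missing but its last part is in the data):

for entry in lookup_with_compounds("Haustür"):
    print(entry["lemma"], entry.get("genus"))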

Would you welcome a PR adding this natively, potentially behind a config variable?

add non-exact match strategy

In our testing we found more missing words than we can realistically add to Wiktionary (see also #8), so we implemented the following strategy, which at least allows us to detect the genus. Would this be interesting to add to your package?

from german_nouns.lookup import Nouns

# module-level dictionary instance used by the helpers below
german_nouns = Nouns()

primary_german_genus_endings = {
    "n": [
        "chen",
        "ett",
        "eau",
        "lein",
        "icht",
        "il",
        "ium",
        "it",
        "ma",
        "ment",
        "tel",
        "tum",
        "um",
    ],
    "f": [
        "in",
        "a",
        "ade",
        "age",
        "anz",
        "elle",
        "ette",
        "ere",
        "enz",
        "ei",
        "ine",
        "isse",
        "itis",
        "ive",
        "ie",
        "heit",
        "keit",
        "ik",
        "sion",
        "se",
        "sis",
        "tät",
        "ung",
        "ur",
        "schaft",
    ],
    "m": [
        "ant",
        "ast",
        "ich",
        "ist",
        "ig",
        "ling",
        "or",
        "us",
        "ismus",
        "är",
        "eur",
        "iker",
        "ps",
    ],
}

secondary_german_genus_endings = {
    # Three out of four words ending in -nis or -sal are neuter nouns
    "n": [
        "nis", "sal",
    ],
    # There are exceptions such as Postillion, which is masculine, while the
    # overwhelming majority of -ion words in German are feminine
    "f": [
        "ion",
    ],
    # More than half of the words ending in -er, -en or -el are masculine
    "m": [
        "er", "en", "el",
    ],
}

def determine_genus_from_ending(word, german_genus_endings):
    for genus in german_genus_endings:
        for ending in german_genus_endings[genus]:
            if word.endswith(ending):
                return {"genus": genus}

    return None


def german_noun_lookup(word):
    result = german_nouns[word]
    if not result:
        return None

    result = result[0]

    if "genus" in result:
        return result

    # nouns with more than one genus store it in "genus 1", "genus 2", ...
    if "genus 1" in result:
        result["genus"] = result["genus 1"]

        return result

    # words ending in -leute
    if word[-5:].lower() == "leute":
        result["genus"] = "f"

        return result

    genus_result = determine_genus_from_ending(word, primary_german_genus_endings)
    if genus_result is None:
        genus_result = determine_genus_from_ending(word, secondary_german_genus_endings)
        if genus_result is None:
            return None

    result["genus"] = genus_result["genus"]

    return result


def german_noun_analysis(word, genus_only=False):
    result = german_noun_lookup(word)
    if result is not None:
        return result

    if genus_only:
        result = determine_genus_from_ending(word, primary_german_genus_endings)

        if result is not None:
            return result

    # unknown word: look for a known noun at the tail of the word,
    # skipping the first 2 letters ...
    i = 2

    # ... and the last 2 letters
    while i < len(word) - 2:
        partial_word = word[i:]

        # avoid cases like 'Ende' at the end of 'Arbeitgebende'
        if partial_word == "ende":
            break

        result = german_noun_lookup(partial_word.capitalize())
        if result is None:
            i += 1
            continue

        result["lemma"] = word
        if not genus_only:
            # re-attach the unmatched prefix to all inflected forms
            word_prefix = word[0:i]
            for flexion in result["flexion"]:
                result["flexion"][flexion] = (
                    word_prefix + result["flexion"][flexion].lower()
                )

        return result

    if genus_only:
        result = determine_genus_from_ending(word, primary_german_genus_endings)

    return result
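
For example (words chosen for illustration; the genus comes either from Wiktionary or from the ending heuristics):

for w in ["Verlässlichkeit", "Arbeitsmarktsituation"]:
    analysis = german_noun_analysis(w, genus_only=True)
    print(w, analysis["genus"] if analysis else "unknown")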

UnicodeDecodeError on Windows with Python 3.9

I was testing the german-nouns module today with the example from the README.md and ran into the following error:

Traceback (most recent call last):
  File "c:\Users\...\Documents\...\....py", line 3, in <module>
    nouns = Nouns()
  File "C:\Users\...\AppData\Local\Programs\Python\Python39\lib\site-packages\german_nouns\lookup\__init__.py", line 23, in __init__
    data = list(csv.reader(open(CSV_FILE_PATH)))
  File "C:\Users\...\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 4372: character maps to <undefined>

After editing "C:\Users\...\AppData\Local\Programs\Python\Python39\lib\site-packages\german_nouns\lookup\__init__.py" at line 23, changing data = list(csv.reader(open(CSV_FILE_PATH))) to data = list(csv.reader(open(CSV_FILE_PATH, encoding='utf-8'))), it worked fine.

I'm using Python 3.9 on a Windows 10 x64 machine.
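
A slightly more robust variant of that fix would also close the file via a context manager, e.g.:

with open(CSV_FILE_PATH, encoding='utf-8') as f:
    data = list(csv.reader(f))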

why not use sqlite?

Looking at

def create_index(self) -> None:
    print('Creating index once')
    # create index
    col_index_to_skip = {1, 2, 3, 4, 5, 6}  # everything before "pos" and after the last "genus" column
    for row_idx, row in enumerate(self.data):
        for col_idx, word in enumerate(row):
            if col_idx in col_index_to_skip:
                continue
            if not word:
                continue
            word_low = word.lower()
            if row_idx not in self.index[word_low]:
                # use a list, because the order of the indexes must be preserved
                self.index[word_low].append(row_idx)
    # save index file
    output = ''
    for k, v in self.index.items():
        indexes = '\t'.join(str(x) for x in v)
        output += f'{k}\t{indexes}\n'
    with open(INDEX_FILE_PATH, 'w', encoding='utf-8') as f:
        f.write(output)

def load_index(self) -> None:
    with open(INDEX_FILE_PATH, encoding='utf8') as f:
        lines = [l.strip() for l in f.readlines()]
    for line in lines:
        split = line.split('\t')
        word_low = split[0]
        self.index[word_low] = [int(x) for x in split[1:] if x]

this logic screams for just writing import sqlite3 (which ships with the Python interpreter!), creating a table in memory (or on disk), maybe even with a column index, and writing some simple SQL queries.
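
For instance, a minimal sketch of the idea (table and column names are illustrative, not an actual schema from the package):

import sqlite3

# build the index into SQLite instead of a hand-rolled TSV file
con = sqlite3.connect("nouns_index.db")  # or ":memory:"
con.execute("CREATE TABLE IF NOT EXISTS word_index (word TEXT, row_idx INTEGER)")
con.execute("CREATE INDEX IF NOT EXISTS idx_word ON word_index (word)")

# (lowercased word, CSV row index) pairs, e.g. from create_index's inner loop
rows = [("haus", 123), ("haus", 456)]
con.executemany("INSERT INTO word_index VALUES (?, ?)", rows)
con.commit()

# ORDER BY rowid preserves insertion order, like the list in the current code
hits = [r[0] for r in con.execute(
    "SELECT row_idx FROM word_index WHERE word = ? ORDER BY rowid", ("haus",)
)]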

memory use

I am running into issues with memory use since integrating your solution into an API.

Now I am pondering ways to reduce memory use and increase scalability. One idea is to move all of these mappings into Redis. That way the data would be loaded into memory only once, regardless of how many workers I have. As long as Redis holds up and does not introduce too much latency, this could be a very memory-efficient solution.

I still need to implement this, but the question is: would you accept a PR that optionally uses Redis?
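
A rough sketch of the idea with redis-py (the key layout is hypothetical):

import json

import redis

r = redis.Redis()

def store_entries(word, entries):
    # one-time load: one key per lowercased word, shared by all workers
    r.set(f"german_nouns:{word.lower()}", json.dumps(entries))

def lookup(word):
    raw = r.get(f"german_nouns:{word.lower()}")
    return json.loads(raw) if raw else []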
