stscoundrel / scandinavian-dictionary-crosslinker

Finds shared entries in dictionary sitemaps, allowing crosslinking

Home Page: https://www.npmjs.com/package/scandinavian-dictionary-crosslinker

License: MIT License

Go 42.96% Rust 7.30% JavaScript 2.82% TypeScript 36.52% Nim 10.41%

crosslinks golang old-icelandic old-norse old-norwegian old-swedish rust typescript

Scandinavian dictionary crosslinker

Finds shared entries in the sitemaps of linguistically related dictionaries and builds a mapping of relations that lets the individual dictionaries crosslink to related sources. Having the same word in several dictionaries of different languages would usually not be very helpful, but the Scandinavian languages from the 8th to the 16th century are closely enough related for the entries to be useful as cross-references.

Parses sources from the following dictionary projects:

The parser finds over 1 000 entries that are present in all four dictionaries. There are also over 20 000 entries that appear in at least two different dictionaries, making them worth a crosslink.
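
The counting above can be illustrated with a small TypeScript sketch. It is not the project's Go implementation; the source names, the function names and the sample slugs are all assumptions made for illustration.

```typescript
// Hypothetical source labels; the real project may name them differently.
type SourceName = "old-norse" | "old-icelandic" | "old-norwegian" | "old-swedish";

// Build a slug -> set-of-sources index from per-dictionary slug lists.
function indexBySlug(slugsBySource: Record<SourceName, string[]>): Map<string, Set<SourceName>> {
  const index = new Map<string, Set<SourceName>>();
  for (const source of Object.keys(slugsBySource) as SourceName[]) {
    for (const slug of slugsBySource[source]) {
      const sources = index.get(slug) ?? new Set<SourceName>();
      sources.add(source);
      index.set(slug, sources);
    }
  }
  return index;
}

// Keep only slugs that appear in at least `minSources` dictionaries.
function sharedSlugs(index: Map<string, Set<SourceName>>, minSources: number): string[] {
  const shared: string[] = [];
  index.forEach((sources, slug) => {
    if (sources.size >= minSources) shared.push(slug);
  });
  return shared.sort();
}

// Made-up sample data, not taken from the real sitemaps.
const sampleSlugs: Record<SourceName, string[]> = {
  "old-norse": ["hestr", "konungr", "skip"],
  "old-icelandic": ["hestr", "konungr"],
  "old-norwegian": ["hestr", "skip"],
  "old-swedish": ["hestr"],
};
const sampleIndex = indexBySlug(sampleSlugs);

console.log(sharedSlugs(sampleIndex, 4)); // [ 'hestr' ]
console.log(sharedSlugs(sampleIndex, 2)); // [ 'hestr', 'konungr', 'skip' ]
```

Calling sharedSlugs with 4 corresponds to "present in all four dictionaries", and with 2 to "worth a crosslink".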

Install

yarn add scandinavian-dictionary-crosslinker

Download sitemaps

Run "cargo run" in the downloader folder. This downloads the latest XML sitemaps to the resources folder.

Generate crosslinks

Run "go run *.go" in the crosslinks folder. This generates the crosslinks JSON in the resources folder.

Minify output

Run "nimble build" and then "./minifier" in the minifier folder. This generates minified and gzipped JSON outputs.

Update data to NPM module

Run "go run main.go" in the root folder to update the JSON and readme in the NPM module.

scandinavian-dictionary-crosslinker's Issues

Manual links: various "brodhir" words

West Norse bróðir and East Norse brodhir seem to share compound words beyond their base forms. Pairs like:

fóstbróðir -> fosterbrodhir
leikbróðir -> lekbrodhir
sambróðir -> sambrodhir
svaribróðir -> stalbrodhir

Add to manual links

Output JSON - consider gzipping

With the added crosslinks, the output file has become huge again. JSON is not exactly the most efficient format for this, so consider:

Gzipping the whole thing. That should shrink it from 3+ MB to about 0.5 MB.
Reading the gzipped file back into JSON in the npm module. This probably introduces some overhead, but may still be more efficient overall.

Manual links: various "daughter" words

There are west -> east pairs like:

fóstrdóttir -> fosterdottir
stjópdóttir -> stiupdottir
guðdóttir -> gudhdottir

Add these to manual links; there is not much of a pattern here.

Manual links: various "mother" words

Some west -> east pairs to add to manual links:

móðir -> modhir
fóstrmóðir -> fostermodhir
stjópmóðir -> stiupmodhir
amma -> stormodhir

A separate grandfather word, the counterpart of amma/stormodhir:
afi -> storfadhir

NPM module public interface: add per language getters

Currently the only public function returns the full dataset. Add versions that do this per language, i.e.:

Client calls getOldIcelandicCrosslinks()
The library filters out links that point to old-icelandic, as there is no point in self-referencing.
Under the hood these probably call a common private function that takes a DictionarySource enum value. Offer one convenience function per enum option.

If we keep the current full-dataset method available too, this doesn't even need a major version bump.
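
A rough TypeScript sketch of how such getters could look. The enum values, the Crosslink shape, the stand-in getCrosslinks() and the example.com URLs are assumptions for illustration, not the module's actual API.

```typescript
// Assumed enum values; only the name DictionarySource comes from the issue text.
enum DictionarySource {
  OldNorse = "old-norse",
  OldIcelandic = "old-icelandic",
  OldNorwegian = "old-norwegian",
  OldSwedish = "old-swedish",
}

// Assumed record shape, based on the "url" and "source" keys of the JSON output.
interface Crosslink {
  url: string;
  source: DictionarySource;
}

type Crosslinks = Record<string, Crosslink[]>;

// Stand-in for the existing full-dataset function, with made-up data.
function getCrosslinks(): Crosslinks {
  return {
    hestr: [
      { url: "https://example.com/old-norse/hestr", source: DictionarySource.OldNorse },
      { url: "https://example.com/old-icelandic/hestr", source: DictionarySource.OldIcelandic },
    ],
  };
}

// Common private helper: drop links that point back to the requesting dictionary.
function getCrosslinksFor(source: DictionarySource): Crosslinks {
  const all = getCrosslinks();
  const result: Crosslinks = {};
  for (const slug of Object.keys(all)) {
    const others = (all[slug] ?? []).filter((link) => link.source !== source);
    // Skip slugs that have nothing left to link to.
    if (others.length > 0) result[slug] = others;
  }
  return result;
}

// One convenience function per enum option.
function getOldNorseCrosslinks(): Crosslinks {
  return getCrosslinksFor(DictionarySource.OldNorse);
}

function getOldIcelandicCrosslinks(): Crosslinks {
  return getCrosslinksFor(DictionarySource.OldIcelandic);
}

console.log(getOldIcelandicCrosslinks());
```

Since getCrosslinks() stays in place, existing consumers would keep working unchanged.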

Minified resource: one-line it

The current minification process left out one-lining the JSON. It seems that would shrink it from the current 5 MB to 2.6 MB.

Odd oversight, just minify it.
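
The project's minifier is written in Nim, so the TypeScript below only illustrates the idea: serializing without indentation gives a one-line document that parses to the same data. The sample entry and its URL are made up.

```typescript
// Made-up sample entry; the real output has 20K+ of these.
const sampleData = {
  hestr: [
    { url: "https://example.com/old-norse/hestr", source: "old-norse" },
    { url: "https://example.com/old-swedish/hestr", source: "old-swedish" },
  ],
};

// Pretty-printed output: indentation and newlines repeated for every entry.
const pretty = JSON.stringify(sampleData, null, 2);

// One-lined output: same content, no insignificant whitespace.
const oneLine = JSON.stringify(sampleData);

console.log(`pretty: ${pretty.length} chars, one-line: ${oneLine.length} chars`);
```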

Manual links: various "drengskapr" words

While the word is the same, the Cleasby & Vigfusson entry seems to be left out because of the dash in the word. The Swedish one also has different vowels. Also add related, common enough variants that appear on both sides of the west/east division.

drengskapr -> dreng-skapr
drengskapr -> drängskaper
drengiligr -> drängeliker
drengr -> dränger

Crosslinks: also parse dash-containing slugs without the dash as an alternative

Example: Blá-tönn appears in the Old Norse dictionary with a dash, but in the Icelandic and Norwegian dictionaries without one. Automatic crosslink detection fails to see that.

Try something like:

Gather links that contain a dash
Feed them to crosslinks
See if it produces reasonable results.

That should produce reasonable crosslinks for words like Blá-tönn and dreng-skapr.
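
One way to try this is sketched below in TypeScript (the actual crosslink generation is in Go). Entries are grouped by a dashless key, and every entry is then linked to the entries from other sources in its group, so dashed and dashless forms find each other in both directions. The SlugEntry shape, the function names and the slug spellings are assumptions.

```typescript
interface SlugEntry {
  slug: string;
  source: string;
}

// Matching key: the slug with all dashes removed.
function dashlessKey(slug: string): string {
  return slug.replace(/-/g, "");
}

// Group entries that share the same dashless key.
function groupByDashlessKey(entries: SlugEntry[]): Map<string, SlugEntry[]> {
  const groups = new Map<string, SlugEntry[]>();
  for (const entry of entries) {
    const key = dashlessKey(entry.slug);
    const group = groups.get(key) ?? [];
    group.push(entry);
    groups.set(key, group);
  }
  return groups;
}

// For every entry, list the entries from other sources in the same group.
// Result keys have the form "source:slug".
function dashlessCrosslinks(entries: SlugEntry[]): Map<string, SlugEntry[]> {
  const groups = groupByDashlessKey(entries);
  const links = new Map<string, SlugEntry[]>();
  for (const entry of entries) {
    const related = (groups.get(dashlessKey(entry.slug)) ?? []).filter(
      (other) => other.source !== entry.source,
    );
    if (related.length > 0) links.set(`${entry.source}:${entry.slug}`, related);
  }
  return links;
}

// Hypothetical slug spellings for Blá-tönn in three dictionaries.
const dashSample: SlugEntry[] = [
  { slug: "bla-tonn", source: "old-norse" },
  { slug: "blatonn", source: "old-icelandic" },
  { slug: "blatonn", source: "old-norwegian" },
];

console.log(dashlessCrosslinks(dashSample));
```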

Add Node.js module for listing alternatives by slug & language

Should produce both the autogenerated list and possible manual additions and overrides.

The automatic list should contain the simple cases where the slug matches exactly. Manual overrides can serve popular words that differ a lot (dvergr vs dvärger), or minor adjustments for popular SEO cases (hval vs hvalr).

Crosslink detection: consider taking simple sound changes into account

Simple example: the Old Norse word "hvalr" later has the descendant "hval", formed simply by dropping the final -r. This is a common pattern that may be easy to recognize, allowing links from West Norse to East Norse words (West Norse generally kept -r longer).

See if this kind of feature:

Produces a respectable number of crosslinks without too many false positives. Should there be too few, we may be better off just using manual overrides.
Has decent enough performance. A naive approach could easily end up doing a lot of comparisons, as there are some 120K+ entries in the dictionaries.
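
A minimal TypeScript sketch of the "drop the final -r" rule, as something to experiment with. Putting the East Norse slugs in a Set means each West Norse slug needs one lookup instead of a comparison against every entry. The function names, the minimum length check and the sample slugs are assumptions, and any hits would still need review for false positives.

```typescript
// Candidate East Norse form for a West Norse slug: drop a final "r".
// Very short slugs are skipped as an arbitrary guard against noise.
function withoutFinalR(slug: string): string | null {
  return slug.length > 2 && slug.endsWith("r") ? slug.slice(0, -1) : null;
}

// Pair each West Norse slug with an East Norse slug that matches after the rule.
function findFinalRLinks(west: string[], east: string[]): Array<[string, string]> {
  const eastSlugs = new Set(east);
  const pairs: Array<[string, string]> = [];
  for (const slug of west) {
    const candidate = withoutFinalR(slug);
    if (candidate !== null && eastSlugs.has(candidate)) {
      pairs.push([slug, candidate]);
    }
  }
  return pairs;
}

// Made-up sample slugs.
const westSample = ["hvalr", "fara", "dvergr"];
const eastSample = ["hval", "fara", "dvärg"];

console.log(findFinalRLinks(westSample, eastSample)); // [ [ 'hvalr', 'hval' ] ]
```

Note that dvergr finds nothing here, since "dverg" is not among the sample East Norse slugs; pairs like that would still be a job for manual overrides.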

Dashless detection: cross-inject links

It seems that the logic for dashless detection only works one way. It is probably only a matter of crosslinking the dashed versions to the dashless versions, instead of just adding the dashless version to the matrix.

Example:

Manual crosslinks: fix dreng-skapr

Currently this only produces a match for Old Swedish. See what is wrong.

Add Old Danish dictionary as a source

Add the Old Danish dictionary when it is released.

To the sitemap downloader
To the crosslink mappings

Minify crosslinks JSON

The script produces more crosslinks than expected, which is a good problem to have. However, the resulting JSON file is large, so it would do the planet some good to make it smaller. It is currently around 7 MB (EDIT: now almost 9 MB with added content), which is a lot for additional meta info.

Possible avenues:

Just basic JSON minify -> one-lining it should save some space when there are 20K+ entries.
Minify keys -> repeating the "url" and "source" keys 20K+ times does take space. They could just be "a" and "b", moving the key parsing to the TypeScript code.
Consider dropping urls from the master data. They tend to be long, and combined with the "source" key they do not really add any extra information, just convenience. Another possible structure would be to ship just the slug and a list of sources, which are predefined strings. The "truth" of the urls could still be hosted in the NPM module, so individual dictionary websites don't have to keep up-to-date info about that.

Minifying keys does add overhead on the Node.js side, so do consider whether it gives as much benefit here as it did in the Old Swedish dictionary. There we had many more keys, and almost all of them were longer than the ones we have here. The first and last avenues may be the most beneficial ones without adding too much burden to processing time.
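
For the key-minifying avenue, the extra work on the TypeScript side could be as small as the sketch below. The interfaces, function names and example.com URLs are assumptions; only the "url"/"source" and "a"/"b" key names come from the notes above.

```typescript
// Full record, as in the current output.
interface Crosslink {
  url: string;
  source: string;
}

// Same record with single-letter keys, as it would be stored in the JSON file.
interface MinifiedCrosslink {
  a: string; // url
  b: string; // source
}

// Used when writing the file.
function minifyKeys(link: Crosslink): MinifiedCrosslink {
  return { a: link.url, b: link.source };
}

// Used when the NPM module reads the file back.
function expandKeys(link: MinifiedCrosslink): Crosslink {
  return { url: link.a, source: link.b };
}

// Made-up sample links.
const fullLinks: Crosslink[] = [
  { url: "https://example.com/old-norse/hestr", source: "old-norse" },
  { url: "https://example.com/old-swedish/hestr", source: "old-swedish" },
];

const minifiedLinks = fullLinks.map(minifyKeys);
const restoredLinks = minifiedLinks.map(expandKeys);

const savedChars = JSON.stringify(fullLinks).length - JSON.stringify(minifiedLinks).length;
console.log(`saved ${savedChars} characters for ${fullLinks.length} links`);
```

The saving is a fixed handful of characters per link, so it is worth measuring against the expansion cost on the real dataset before committing to it.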

Add reader & parser for sitemap content

Will probably want to transform the sitemaps into a simple intermediate format for easier iteration and comparison. Should probably consider whether the core dictionary libraries are needed for name comparisons, or whether we just go by slugs. Going by slugs could result in false positives, as many Scandinavian letters would be slugified to similar outcomes.

Will probably want to skip all slugs that contain numbers -> with multiple entries for a word, the chance of matching them to the correct sibling drops dramatically.

Generating a simple JSON file may be the most efficient output, as the consumer will be Node.js anyway.
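
A sketch of such an intermediate format in TypeScript, for illustration only (the project's parser is written in Go). It assumes that the slug is the last path segment of each sitemap URL and pulls the loc values out with a regular expression rather than a full XML parser. The SitemapEntry shape, the function names and the sample URLs are all hypothetical.

```typescript
// Simple intermediate format for one sitemap URL.
interface SitemapEntry {
  slug: string;
  url: string;
  source: string;
}

// Extract the contents of all <loc> elements from sitemap XML.
function extractUrls(sitemapXml: string): string[] {
  const urls: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sitemapXml)) !== null) {
    const url = match[1];
    if (url) urls.push(url);
  }
  return urls;
}

// Assumption: the slug is the last non-empty path segment of the URL.
function slugFromUrl(url: string): string {
  const segments = url.split("/").filter((segment) => segment.length > 0);
  return segments[segments.length - 1] ?? "";
}

// Convert a sitemap to entries, skipping slugs that contain digits.
function toEntries(sitemapXml: string, source: string): SitemapEntry[] {
  return extractUrls(sitemapXml)
    .map((url) => ({ slug: slugFromUrl(url), url, source }))
    .filter((entry) => entry.slug.length > 0 && !/\d/.test(entry.slug));
}

// Made-up sitemap fragment.
const sitemapSample = `
<urlset>
  <url><loc>https://example.com/word/hestr</loc></url>
  <url><loc>https://example.com/word/hestr-2</loc></url>
  <url><loc>https://example.com/word/konungr/</loc></url>
</urlset>`;

console.log(JSON.stringify(toEntries(sitemapSample, "old-norse")));
```

With this sample, hestr-2 is dropped because of the digit, leaving hestr and konungr.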

Add manual crosslinks

There should be manual crosslinks, i.e. mappings of non-obvious connections like dvergr -> dvärg or simpler ones like hvalr -> hval.

Consider what is the best place for these: should they be baked into the original crosslinks parsed in Go, or just added after the fact in the NPM module? As the slugs differ, links may need to be "duplicated", like:

dvergr => [links, including dvärg]
dvärg => [links, including dvergr]

To keep the structure searchable by key/slug.
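
The duplication could be handled by a small helper that always registers a manual pair in both directions, sketched here in TypeScript. The ManualLink shape, the function names and the source labels are assumptions; where this logic should live (Go or the NPM module) is still the open question above.

```typescript
// Assumed minimal link shape: a slug and the dictionary it belongs to.
interface ManualLink {
  slug: string;
  source: string;
}

type ManualLinkMap = Record<string, ManualLink[]>;

// Register a manual pair under both slugs, so either one works as a lookup key.
function addManualLink(map: ManualLinkMap, first: ManualLink, second: ManualLink): void {
  addOneWay(map, first.slug, second);
  addOneWay(map, second.slug, first);
}

// Add `target` under `key`, skipping exact duplicates.
function addOneWay(map: ManualLinkMap, key: string, target: ManualLink): void {
  const links = map[key] ?? (map[key] = []);
  const exists = links.some(
    (link) => link.slug === target.slug && link.source === target.source,
  );
  if (!exists) links.push(target);
}

const manualLinks: ManualLinkMap = {};
addManualLink(
  manualLinks,
  { slug: "dvergr", source: "old-norse" },
  { slug: "dvärg", source: "old-swedish" },
);

console.log(JSON.stringify(manualLinks));
```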

Add downloader for sitemaps

Script or simple module for getting the latest sitemaps from a predefined list.