
stscoundrel / scandinavian-dictionary-crosslinker

Finds shared entries in dictionary sitemaps, allowing crosslinking

Home Page: https://www.npmjs.com/package/scandinavian-dictionary-crosslinker

License: MIT License

Go 42.96% Rust 7.30% JavaScript 2.82% TypeScript 36.52% Nim 10.41%
crosslinks golang old-icelandic old-norse old-norwegian old-swedish rust typescript

scandinavian-dictionary-crosslinker's Introduction

Scandinavian dictionary crosslinker

Finds shared entries in the sitemaps of linguistically related dictionaries and builds a mapping of relations that lets each dictionary crosslink to related sources. Having the same word in dictionaries of different languages is usually not very helpful, but the Scandinavian languages from the 8th to the 16th century are closely enough related for such links to be useful as cross-references.

Parses sources from the following dictionary projects:

The parser finds over 1 000 entries that are present in all four dictionaries. There are also over 20 000 entries that appear in at least two different dictionaries, making them worth a crosslink.
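
As a rough illustration of how such counts can be derived, the sketch below groups slugs by the sources they appear in and keeps those found in a minimum number of dictionaries. It is written in TypeScript for readability and is not the project's Go implementation; the type SlugsBySource, both function names and the sample data are hypothetical.

```typescript
// Hypothetical input shape: one list of slugs per dictionary source.
type SlugsBySource = Record<string, string[]>;

// Maps each slug to the set of sources it appears in.
function groupSourcesBySlug(input: SlugsBySource): Map<string, Set<string>> {
  const result = new Map<string, Set<string>>();
  for (const source of Object.keys(input)) {
    for (const slug of input[source]) {
      const sources = result.get(slug) ?? new Set<string>();
      sources.add(source);
      result.set(slug, sources);
    }
  }
  return result;
}

// Keeps only slugs found in at least `minSources` dictionaries.
function sharedSlugs(input: SlugsBySource, minSources: number): string[] {
  return Array.from(groupSourcesBySlug(input).entries())
    .filter(([, sources]) => sources.size >= minSources)
    .map(([slug]) => slug)
    .sort();
}

// Made-up sample data, not real sitemap content.
const sample: SlugsBySource = {
  "old-norse": ["hvalr", "dvergr", "land"],
  "old-icelandic": ["hvalr", "land"],
  "old-norwegian": ["land", "hvalr"],
  "old-swedish": ["land", "hval"],
};

console.log(sharedSlugs(sample, 2)); // [ 'hvalr', 'land' ]
console.log(sharedSlugs(sample, 4)); // [ 'land' ]
```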

Install

yarn add scandinavian-dictionary-crosslinker

Download sitemaps

Run cargo run in the downloader folder. This downloads the latest XML sitemaps to the resources folder.

Generate crosslinks

Run go run *.go in the crosslinks folder. This generates the crosslinks JSON in the resources folder.

Minify output

Run nimble build and ./minifier in the minifier folder. This generates minified and gzipped JSON outputs.

Update data to NPM module

Run go run main.go in the root folder to update the JSON and readme in the NPM module.

scandinavian-dictionary-crosslinker's People

Contributors

dependabot[bot], stscoundrel

scandinavian-dictionary-crosslinker's Issues

Manual links: various "brodhir" words

West's bróðir and East's brodhir seem to share compound words beyond their base forms. Pairs like:

fóstbróðir -> fosterbrodhir
leikbróðir -> lekbrodhir
sambróðir -> sambrodhir
svaribróðir -> stalbrodhir

Add to manual links

Output JSON - consider gzipping

With the added crosslinks, the output file has become huge again. JSON is not exactly the most efficient format for this, so consider:

  • Gzipping the whole thing. This should shrink it from 3+ MB to 0.5 MB.
  • Reading the gzipped file back to JSON in the npm module. This probably introduces some overhead, but may still be more efficient.
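
The gzip round trip proposed above could look roughly like this on the Node.js side. It is a minimal sketch using Node's built-in zlib module; the Crosslink shape, the function names and the example.com URLs are assumptions for illustration, not the module's actual data or API.

```typescript
import { gzipSync, gunzipSync } from "zlib";

// Hypothetical shape of the crosslinks data.
interface Crosslink {
  url: string;
  source: string;
}
type Crosslinks = Record<string, Crosslink[]>;

// Serializes the data to one-line JSON and gzips it.
function compress(data: Crosslinks): Buffer {
  return gzipSync(JSON.stringify(data));
}

// Gunzips the buffer and parses it back to the original structure.
function decompress(gzipped: Buffer): Crosslinks {
  return JSON.parse(gunzipSync(gzipped).toString("utf-8")) as Crosslinks;
}

// Made-up sample entry with placeholder URLs.
const data: Crosslinks = {
  hvalr: [
    { url: "https://example.com/old-norse/hvalr", source: "old-norse" },
    { url: "https://example.com/old-icelandic/hvalr", source: "old-icelandic" },
  ],
};

const gzipped = compress(data);
const restored = decompress(gzipped);
console.log(restored.hvalr.length); // 2
```

The decompression cost is paid once per load, so the overhead mentioned above depends on how often the module reads the file.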

Manual links: various "daughter" words

There are west -> east pairs like:

fóstrdóttir -> fosterdottir
stjópdóttir -> stiupdottir
guðdóttir -> gudhdottir

Add to manual links, not much of a pattern here.

Manual links: various "mother" words

Some west -> east pairs to add to manual links:

móðir -> modhir
fóstrmóðir -> fostermodhir
stjópmóðir -> stiupmodhir
amma -> stormodhir

Not a mother word, but the grandfather counterpart to amma/stormodhir:
afi -> storfadhir

NPM module public interface: add per language getters

Currently the only public function returns the full dataset. Add versions that do this per language, i.e.:

  • Client calls getOldIcelandicCrosslinks()
  • The library filters out links that point to old-icelandic -> no point in self referencing
  • Under the hood they probably call a common private function which deals with DictionarySource enum. Offer convenience function per enum option.

If we keep the current full dataset method available too, this doesn't even need a major version bump.
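
One possible shape for this interface is sketched below. Only getOldIcelandicCrosslinks and the DictionarySource enum are named in this issue; the enum values, the Crosslink type, getAllCrosslinks, getCrosslinksFor and the sample data are assumptions for illustration.

```typescript
// Enum values are assumed; the issue only names the enum itself.
enum DictionarySource {
  OldNorse = "old-norse",
  OldIcelandic = "old-icelandic",
  OldNorwegian = "old-norwegian",
  OldSwedish = "old-swedish",
}

interface Crosslink {
  url: string;
  source: DictionarySource;
}
type Crosslinks = Record<string, Crosslink[]>;

// Made-up sample dataset with placeholder URLs.
const allCrosslinks: Crosslinks = {
  hvalr: [
    { url: "https://example.com/old-norse/hvalr", source: DictionarySource.OldNorse },
    { url: "https://example.com/old-icelandic/hvalr", source: DictionarySource.OldIcelandic },
    { url: "https://example.com/old-norwegian/hvalr", source: DictionarySource.OldNorwegian },
  ],
  land: [
    { url: "https://example.com/old-icelandic/land", source: DictionarySource.OldIcelandic },
    { url: "https://example.com/old-swedish/land", source: DictionarySource.OldSwedish },
  ],
};

// Existing-style getter that returns the full dataset.
function getAllCrosslinks(): Crosslinks {
  return allCrosslinks;
}

// Common helper: drops links that point back to the caller's own dictionary.
function getCrosslinksFor(own: DictionarySource): Crosslinks {
  const result: Crosslinks = {};
  for (const slug of Object.keys(allCrosslinks)) {
    const links = allCrosslinks[slug].filter((link) => link.source !== own);
    if (links.length > 0) {
      result[slug] = links;
    }
  }
  return result;
}

// Convenience function per enum option.
const getOldIcelandicCrosslinks = (): Crosslinks =>
  getCrosslinksFor(DictionarySource.OldIcelandic);

console.log(getOldIcelandicCrosslinks().hvalr.map((link) => link.source));
```

Dropping slugs whose only links are self-references is a design choice of this sketch, not something the issue specifies.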

Minified resource: one-line it

The current minification process does not one-line the JSON. Doing so appears to shrink it from the current 5 MB to 2.6 MB.

Odd oversight, just minify it.

Manual links: various "drengskapr" words

While the word is the same, Cleasby & Vigfusson seems to be left out due to the dash in the word. The Swedish one also has different vowels. Also add related, common-enough variants that appear on both sides of the west and east division.

drengskapr -> dreng-skapr
drengskapr -> drängskaper
drengiligr -> drängeliker
drengr -> dränger

Crosslinks: parse dash containing slugs also without dash as alternative

Example: Blá-tönn appears in the Old Norse dictionary with a dash, but in the Icelandic and Norwegian dictionaries without a dash. Automatic crosslink detection fails to see that.

Try something like:

  • Gather links that contain dash
  • Feed them to crosslinks
  • See if it produces reasonable results.

That should produce reasonable crosslinks for words like Blá-tönn and dreng-skapr.
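
The steps above could be sketched by grouping entries under a dash-free key, so that dashed and undashed spellings land in the same group. This is an illustration in TypeScript, not the project's Go code; the Entry type, the function name and the sample entries are hypothetical.

```typescript
// Hypothetical intermediate entry: a slug and the dictionary it came from.
interface Entry {
  slug: string;
  source: string;
}

// Groups entries by their slug with all dashes removed.
function groupByDashlessSlug(entries: Entry[]): Map<string, Entry[]> {
  const groups = new Map<string, Entry[]>();
  for (const entry of entries) {
    const key = entry.slug.replace(/-/g, "");
    const group = groups.get(key) ?? [];
    group.push(entry);
    groups.set(key, group);
  }
  return groups;
}

// Made-up sample entries.
const entries: Entry[] = [
  { slug: "dreng-skapr", source: "old-norse" },
  { slug: "drengskapr", source: "old-icelandic" },
  { slug: "hvalr", source: "old-norse" },
];

const groups = groupByDashlessSlug(entries);
console.log(groups.get("drengskapr")); // both spellings end up in one group
```

Each entry keeps its original slug, so links can still point to the dashed form while being matched on the dash-free key.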

Add Node.js module for listing alternatives by slug & language

Should produce both the autogenerated list + possible manual additions and overrides.

The automatic list should contain simple cases where the slug matches exactly. Manual overrides can serve popular words that differ a lot (dvergr vs dvärger), or minor adjustments for popular SEO cases (hval vs hvalr).

Crosslink detection: consider taking simple sound changes into account

Simple example: the Old Norse word "hvalr" later has the descendant "hval", formed simply by dropping the final r. This is a common pattern that may be easy to recognize, allowing links from West Norse to East Norse words (West generally kept -r longer).

See if this kind of feature:

  • Produces a respectable number of crosslinks without too many false positives. Should there be too few, we may be better off just using manual overrides.
  • Has decent enough performance. A naive approach could easily end up doing a lot of comparisons, as there are some 120K+ entries in the dictionaries.
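
A minimal sketch of the r-dropping rule follows, under the assumption that both sides are available as plain slug lists. Looking candidates up in a Set avoids comparing every west slug against every east slug, which speaks to the performance concern above. The function name and sample data are hypothetical, and only the single "drop final r" rule is covered.

```typescript
// Finds west slugs whose r-less form exists among the east slugs.
function findRDropPairs(west: string[], east: string[]): Array<[string, string]> {
  const eastSet = new Set(east);
  const pairs: Array<[string, string]> = [];
  for (const slug of west) {
    // Skip very short slugs to limit accidental matches.
    if (slug.length > 2 && slug.endsWith("r")) {
      const candidate = slug.slice(0, -1);
      if (eastSet.has(candidate)) {
        pairs.push([slug, candidate]);
      }
    }
  }
  return pairs;
}

// Made-up sample data.
const west = ["hvalr", "dvergr", "land"];
const east = ["hval", "dvärger", "land"];

console.log(findRDropPairs(west, east)); // [ [ 'hvalr', 'hval' ] ]
```

In this sample dvergr finds no partner, which is the kind of case that would still need a manual override.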

Minify crosslinks JSON

The script produces more crosslinks than expected, which is a good problem to have. However, the produced JSON file is large, so it would do the planet some good to make it smaller. It is currently around 7 MB (EDIT: now almost 9 MB with added content), which is a lot for additional meta info.

Possible avenues:

  • Just basic JSON minify -> one-lining it should save some space when there are 20K+ entries
  • Minify keys -> repeating "url" and "source" keys 20k+ times does take space. Could just be "a" and "b", moving the key parsing to TypeScript code.
  • Consider dropping urls from master data. They tend to be long and combined with the "source" key, they do not really add any extra information, just convenience. Another possible structure would be to just ship slug & sources list, which are predefined strings. The "truth" of urls could still be hosted in the NPM module, so individual dictionary websites don't have to keep up-to-date info about that.

Minifying keys does add overhead on the Node.js side, so do consider whether it gives as much benefit here as it did in the Old Swedish dictionary. There we had many more keys and almost all of them were longer than the ones we have here. The first and last avenues may be the most beneficial ones without adding too much burden to processing time.
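
The first two avenues can be illustrated together: write the JSON on one line and shorten "url" and "source" to "a" and "b", then expand the keys again in TypeScript. The type and function names are hypothetical, and the sample uses placeholder URLs.

```typescript
interface Crosslink {
  url: string;
  source: string;
}
// Short-key form as stored on disk: a = url, b = source.
interface MinifiedCrosslink {
  a: string;
  b: string;
}

// Shortens keys and serializes without indentation (one line).
function minify(links: Record<string, Crosslink[]>): string {
  const out: Record<string, MinifiedCrosslink[]> = {};
  for (const slug of Object.keys(links)) {
    out[slug] = links[slug].map((link) => ({ a: link.url, b: link.source }));
  }
  return JSON.stringify(out);
}

// Parses the minified JSON and restores the long key names.
function expand(json: string): Record<string, Crosslink[]> {
  const parsed = JSON.parse(json) as Record<string, MinifiedCrosslink[]>;
  const out: Record<string, Crosslink[]> = {};
  for (const slug of Object.keys(parsed)) {
    out[slug] = parsed[slug].map((link) => ({ url: link.a, source: link.b }));
  }
  return out;
}

// Made-up sample entry.
const links: Record<string, Crosslink[]> = {
  hvalr: [
    { url: "https://example.com/old-norse/hvalr", source: "old-norse" },
    { url: "https://example.com/old-icelandic/hvalr", source: "old-icelandic" },
  ],
};

const pretty = JSON.stringify(links, null, 2);
const minified = minify(links);
console.log(pretty.length, minified.length);
```

The expand step is the Node.js overhead mentioned above: one pass over all entries each time the data is loaded.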

Add reader & parser for sitemap content

Will probably want to transform the sitemaps to a simple intermediate format for easier iteration and comparison. Should probably consider whether the core dictionary libraries are needed for name comparisons, or whether we just go by slugs. Going by slugs could result in false positives, as many Scandinavian letters would be slugified to similar outcomes.

Will probably want to skip all numbers in slugs -> with multiple words the chance of matching them to correct sibling drops dramatically.

Generating a simple JSON file may be the most efficient output, as the consumer will be Node.js anyway.
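
A rough sketch of such a reader is shown below. It pulls the loc URLs out of sitemap XML with a regular expression, takes the last path segment as the slug and skips slugs containing digits, which is one reading of the "skip all numbers" note above. It is not a full XML parser and not the project's Go code; the function name and the example.com URLs are assumptions.

```typescript
// Extracts slugs from <loc> elements, skipping slugs that contain digits.
function extractSlugs(sitemapXml: string): string[] {
  const slugs: string[] = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    const url = match[1];
    // The slug is assumed to be the last path segment of the URL.
    const slug = url.replace(/\/+$/, "").split("/").pop() ?? "";
    if (slug !== "" && !/\d/.test(slug)) {
      slugs.push(slug);
    }
  }
  return slugs;
}

// Made-up sitemap fragment with placeholder URLs.
const xml = `
<urlset>
  <url><loc>https://example.com/word/hvalr</loc></url>
  <url><loc>https://example.com/word/land-2</loc></url>
  <url><loc>https://example.com/word/dreng-skapr/</loc></url>
</urlset>`;

console.log(extractSlugs(xml)); // [ 'hvalr', 'dreng-skapr' ]
```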

Add manual crosslinks

There should be manual crosslinks, i.e. a mapping of non-obvious connections like dvergr -> dvärg, or simpler ones like hvalr -> hval.

Consider what is the best place for these: should they be baked into the original crosslinks parsed in Go, or just added after-the-fact in the NPM module. As the slugs differ, links may need to be "duplicated", like:

dvergr => [links, including dvärg]
dvärg => [links, including dvergr]

To keep the structure searchable by key/slug.
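
The duplication described above can be sketched as adding every manual pair in both directions, so that a lookup by either slug finds the other. The LinkMap type, the function names and the sample data are hypothetical, and links are simplified to plain slugs rather than full link objects.

```typescript
// Simplified structure: slug -> list of related slugs.
type LinkMap = Record<string, string[]>;

// Adds `to` under `from` unless it is already listed.
function addLink(map: LinkMap, from: string, to: string): void {
  if (!map[from]) {
    map[from] = [];
  }
  if (map[from].indexOf(to) === -1) {
    map[from].push(to);
  }
}

// Returns a copy of the base map with each manual pair linked both ways.
function withManualLinks(base: LinkMap, pairs: Array<[string, string]>): LinkMap {
  const result: LinkMap = {};
  for (const slug of Object.keys(base)) {
    result[slug] = base[slug].slice();
  }
  for (const [west, east] of pairs) {
    addLink(result, west, east);
    addLink(result, east, west);
  }
  return result;
}

// Pairs taken from the examples in this issue.
const manualPairs: Array<[string, string]> = [
  ["dvergr", "dvärg"],
  ["hvalr", "hval"],
];

// Made-up automatic links.
const automatic: LinkMap = { hvalr: ["hvalr"] };

const combined = withManualLinks(automatic, manualPairs);
console.log(combined["dvergr"], combined["dvärg"]);
```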
