Giter Site home page Giter Site logo

domain_wordlists's Introduction

What this?

This started because of feedback from Overleaf users that the spell-checkers (straight from Ubuntu) were unsatisfactory, in that many domain-specific terms were missing. Users feel that LaTeX is widely used by mathematicians and scientists, and it's odd that the spell-checkers didn't respect that.

And to everyone's surprise although there has been many word lists (not even wordnets or thesauri, mind you!) for various languages, not to mention an abundance of research on keyword extraction, there doesn't seem to be any easy-to-use word lists for specific domains. Glossaries exist on the Web, but they need to be scraped.

And scrape them I will, if only because I'm getting tired of "Overleaf doesn't think my research field (neuroscience) is a real word" feedback. (Disclaimer: I was amused that the spell-checker didn't recognise many terms in my thesis!)

However each glossary list Web is formatted differently (as if that should be a surprise after I've been trying to process machine-readable dictionaries...), including Wikipedia glossary lists, may even be formatted differently. If we're lucky most of them will be formatted as dt-dd definition lists; but some may be formatted as tables or just marked with <strong> for the terms. I had hoped that Wikidata gives some respite, but it appears that the tuples have not been set up for most of these glossaries (see e.g. https://www.wikidata.org/wiki/Q5571762). So for now each script is rather ad-hoc! And perhaps badly coded because my coding-fu is rusty!

Also I have not figured out the copyright issues for these things, especially if they're not from Wikipedia/Wikidata.

Domains

I started with neuroscience, because for some reason that field crops up a lot in Overleaf user feedback re the spell-checkers.

Translations

Theoretically speaking this is possible by leveraging Wikipedia/Wikidata queries, as I'd done a few times in the past. But I'll probably do that later. Much later.

Proper integration into hunspell?

Probably with https://github.com/blatinier/pyhunspell? I have no idea how to do this yet (or whether I'll do it).

Acknowledgments

MeSH® data for the medicine wordlist comes, without warranties, from NLM®. It is downloaded on the fly, trying to get the most recent version by year. NLM® is in no way connected to this project apart from being the source of the used data. MeSH® General Terms and Conditions

domain_wordlists's People

Contributors

crispycrafter avatar liantze avatar lionawurscht avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.