bobotig / ebook-reader-dict

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.

Home Page: http://www.tiger-222.fr/?d=2020/04/17/22/14/21-un-dictionnaire-alternatif-et-complet-pour-votre-liseuse

License: MIT License

Python 95.09% Shell 0.01% HTML 4.90%
dictionary dict python kobo ebook-reader french francais wiktionary english catalan

ebook-reader-dict's Introduction

eBook Reader Dictionaries

Users

Development

Set up a virtual environment:

$ python3.12 -m venv venv

# For Linux and Mac users
$ . venv/bin/activate

# For Windows users
$ . venv/Scripts/activate

Install, or update, dependencies:

$ python -m pip install -U pip
$ python -m pip install -r requirements-tests.txt

Run tests:

$ python -m pytest --doctest-modules wikidict tests

Run linters and quality checkers before submitting a pull request:

$ ./check.sh

Contributors ✨

Thanks go to these wonderful people (emoji key):

Mickaël Schoentgen 🐛 💻 📖 📆
Nicolas Froment 🐛 💻 📖 📆
Attilio 💻
chopinesque 💻
Saeed Rasooli 🚇
Matthias C. Hormann 💻
tjader 💻
Victor 💻
John Koll 🌍
Marta Malberti 🌍
Arsenii Chaplinskii 🌍
Ander Romero 🌍
Dongchen Yue | 岳东辰 🌍
Johan Larsson 💻
kyxap 📖

This project follows the all-contributors specification. Contributions of any kind welcome!

ebook-reader-dict's People

Contributors

allcontributors[bot], and4po, atti84it, bobotig, chopinesque, dependabot-preview[bot], dependabot[bot], frafra, g1r0, ilius, jolars, kyxap, lasconic, martamalb, moonbase59, sourcery-ai[bot], tjaderxyz, victornove, yue-dongchen

ebook-reader-dict's Issues

Correctly format italic and strong

Wikicode:

''this is italic''
'''this is strong'''

Output:

this is italic
this is strong

Expected:

<i>this is italic</i>
<b>this is strong</b>
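
A minimal sketch of one way to get the expected output, assuming a regex-based pass over the wikicode. The function name format_emphasis is hypothetical and is not part of the project; the five-quote form (bold italic) is not handled here.

```python
import re


def format_emphasis(wikicode: str) -> str:
    """Convert wikicode quote markers to HTML <b> and <i> tags."""
    # Handle the three-quote marker first, otherwise the two-quote
    # pattern would match inside it.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikicode)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return text


print(format_emphasis("''this is italic''"))
print(format_emphasis("'''this is strong'''"))
```

Doing the bold substitution before the italic one is the only ordering constraint in this sketch.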

Better handle the "term" template

Output:

guère \ɡɛʁ\

    1. ...
    2. ...
    3. (Term) Presque personne excepté ; presque rien excepté.
    4. ...
    5. (Term) Presque. — [Note] Placé après jamais ou plus ; placé devant nulle part, personne ou rien.

Wikicode:

# {{term|ne … guère que}} Presque personne excepté ; presque rien excepté.
# {{term|Avec un mot négatif}} Presque. — {{note}} ''Placé après ''[[jamais]]'' ou ''[[plus]]'' ; placé devant ''[[nulle part]]'', ''[[personne]]'' ou ''[[rien]].

Expected:

guère \ɡɛʁ\

    1. ...
    2. ...
    3. (Ne … guère que) Presque personne excepté ; presque rien excepté.
    4. ...
    5. (Avec un mot négatif) Presque. — [Note] Placé après jamais ou plus ; placé devant nulle part, personne ou rien.
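
As an illustration only, the expected rendering could be obtained by keeping the template argument instead of the template name. The helper render_term below is hypothetical and assumes the template takes a single positional argument.

```python
import re


def render_term(wikicode: str) -> str:
    """Replace {{term|label}} with "(Label)", keeping the label text."""

    def repl(match: re.Match) -> str:
        label = match.group(1).strip()
        # Uppercase only the first character: str.capitalize() would
        # also lowercase the rest of the label.
        return f"({label[:1].upper()}{label[1:]})"

    return re.sub(r"\{\{term\|([^{}|]+)\}\}", repl, wikicode)


print(render_term("# {{term|ne … guère que}} Presque personne excepté."))
```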

Add a way to handle templates output

Example 1

Wikicode:

Sorte de [[large]] ceinture de cuir, [[en vogue]] à la [[fin]] du XX{{e}} [[siècle]].

Current output:

Sorte de large ceinture de cuir, en vogue à la fin du XX[E] siècle.

Expected output:

Sorte de large ceinture de cuir, en vogue à la fin du XX<sup>ème</sup> siècle.

Example 2

Wikicode:

{{par ext}} ou {{figuré|fr}} Avec un [[sentiment]], une [[expression]] ou un [[ton]] de douleur.

Current output:

(Par Ext) ou (Figuré) Avec un sentiment, une expression ou un ton de douleur.

Expected output:

(Par extension) ou (Figuré) Avec un sentiment, une expression ou un ton de douleur.
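
One possible shape for this, sketched under assumptions: a lookup table maps a template name to its rendered text, with a generic fallback for unknown templates. The TEMPLATES table and render_templates function are hypothetical names, and the two entries simply mirror the expected outputs of the examples above.

```python
import re

# Hypothetical lookup table: template name -> rendered text.
TEMPLATES = {
    "e": "<sup>ème</sup>",
    "par ext": "(Par extension)",
}


def render_templates(wikicode: str) -> str:
    """Render {{name|args...}} templates using TEMPLATES, with a fallback."""

    def repl(match: re.Match) -> str:
        name = match.group(1).split("|")[0].strip()
        if name in TEMPLATES:
            return TEMPLATES[name]
        # Fallback: the template name in parentheses, first letter uppercased.
        return f"({name[:1].upper()}{name[1:]})"

    return re.sub(r"\{\{([^{}]+)\}\}", repl, wikicode)


print(render_templates("{{par ext}} ou {{figuré|fr}} Avec un sentiment."))
```

A table keeps per-template exceptions in one place, so adding a new special case does not require touching the parsing code.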

Check if getting rid of xmltodict speeds up the process

I initially used xmltodict to ease the parsing, but in the end it is not really useful: the parsing is very simple and only 3 checks have to be done.

The process takes too much time:

$ WIKI_DUMP=20200420 WIKI_LOCALE=fr python -m scripts --fetch-only                
>>> WIKI_DUMP is set to 20200420, regenerating dictionaries ...
>>> Processing data/fr/pages-20200420.xml ...
>>> Saved 1,669,770 words into data/fr/data.json
>>> Retrieval done!
Time: 0h:25m:50s

[FR] "capacité thermique" not well handled

https://fr.wiktionary.org/wiki/capacit%C3%A9_thermique

{{physique|fr}} {{thermodynamique|fr}} [[énergie|Énergie]] qu’il faut apporter à un corps pour augmenter sa [[température]] d’une unité de température, exprimée en [[joule par kelvin#fr|joule par kelvin]] ([[J·K-1|'''J·K{{e|-1}}''']]).

Output:

(Physique) (Thermodynamique) Énergie qu’il faut apporter à un corps pour augmenter sa température d’une unité de température, exprimée en joule par kelvin ().

Expected:

(Physique) (Thermodynamique) Énergie qu’il faut apporter à un corps pour augmenter sa température d’une unité de température, exprimée en joule par kelvin (J·K<sup>-1</sup>).

[SV] Release translation needed

The release description is currently in English: https://github.com/BoboTiG/ebook-reader-dict/releases/tag/sv

The sentences are here:

Words count: {words_count}
Wiktionary dump: {dump_date}
:arrow_right: Download: [dicthtml-{locale}.zip]({url})
---
Installation:
1. Copy the `dicthtml-{locale}.zip` file into the `.kobo/dict/` folder of the reader.
2. Restart the reader.
---
Caracteristics :
- Only definitions are included: there are no quote nor ethymology.
- Proper nouns are not included.
- Conjuged verbs are not included.
<sub>Updated on {creation_date}</sub>

cf @Emileeson :)

[FR] "marlotte" not well handled

Wiktionary page: https://fr.wiktionary.org/wiki/marlotte

Wikicode:

# [[grand|Grand]] [[manteau]] de [[femme]], [[long]] mais plus court que la jupe, complètement [[ouvert]] sur le [[devant]], à [[manche]]s [[bouffant]]es, [[tuyauté]] [[derrière]] de [[pli]]s [[symétrique]]s, qui fut [[à la mode]] en [[France]] sous {{w|François_Ier_de_France|François I<small><sup>er</sup></small>}}.

Output:

Grand manteau de femme, long mais plus court que la jupe, complètement ouvert sur le devant, à manches bouffantes, tuyauté derrière de plis symétriques, qui fut à la mode en France sous François I.

Expected:

Grand manteau de femme, long mais plus court que la jupe, complètement ouvert sur le devant, à manches bouffantes, tuyauté derrière de plis symétriques, qui fut à la mode en France sous François I<small><sup>er</sup></small>.

Remove more words from the dictionary

There are several words that are not taken into account by the Kobo but still present in the dictionary. Source: https://github.com/pettarin/penelope/blob/fce6dcfd899d3755ae3a5a3867d7d436105ada56/penelope/prefix_kobo.py#L31-L34

    def is_allowed(character):
        # all non-ascii (x > 127) are ok
        # all ASCII lowercase letters (97 <= x <= 122) are ok
        # everything else is not ok

Going further, there are HTML files generated with Unicode characters, and they will almost never be used.

Removing them would produce smaller files (even if size is not currently a problem) and speed up the update process.
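
A hedged sketch of the rule described in the comments quoted above, written from those comments only and not copied from penelope. The keep_word helper is hypothetical: applying the check to every character of a word is an assumption about how the filter would be used here.

```python
def is_allowed(character: str) -> bool:
    """Follow the quoted comments: non-ASCII and ASCII lowercase are ok."""
    code = ord(character)
    return code > 127 or 97 <= code <= 122


def keep_word(word: str) -> bool:
    """Hypothetical filter: keep a word only if all its characters pass."""
    return bool(word) and all(is_allowed(char) for char in word)


words = ["guère", "ski", "J·K-1", "4x4"]
print([word for word in words if keep_word(word)])
```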

Handle the <math> HTML tag

Wiktionary page: https://fr.wiktionary.org/wiki/octonion

Wikicode:

<math>x=x_0+x_1{\rm i}+x_2{\rm j}+x_3{\rm k}+x_4{\rm l}+x_5{\rm il}+x_6{\rm jl}+x_7{\rm kl}</math>

Output:

x=x_0+x_1{\rm i}+x_2{\rm j}+x_3{\rm k}+x_4{\rm l}+x_5{\rm il}+x_6{\rm jl}+x_7{\rm kl}

Expected:

Not sure whether we can display it, or how.

[CA] Release translation needed

The release description is currently in English: https://github.com/BoboTiG/ebook-reader-dict/releases/tag/ca

The sentences are here:

Words count: {words_count}
Wiktionary dump: {dump_date}
:arrow_right: Download: [dicthtml-{locale}.zip]({url})
---
Installation:
1. Copy the `dicthtml-{locale}.zip` file into the `.kobo/dict/` folder of the reader.
2. Restart the reader.
---
Caracteristics :
- Only definitions are included: there are no quote nor ethymology.
- Proper nouns are not included.
- Conjuged verbs are not included.
<sub>Updated on {creation_date}</sub>

[SV] Uttal sublists are not well handled

Wiktionary page: https://sv.wiktionary.org/wiki/sand

Wikicode:

*{{uttal|sv|enkel=sand|ipa=sand}}
#[[sten]] som blivit till små korn, antingen genom väder och vind eller på konstgjord väg
#:''Jag har aldrig sett så mycket '''sand''', sa turisten på besök i Saharaöknen.''
#{{tagg|geologi}} [[jordart]] med kornstorlek mellan 0,06 och 2 mm

Output:

sand \sand\ (.) 

  1. <b>uttal:</b> /sand/

Expected:

sand \sand\ (.) 

  1. <b>uttal:</b> /sand/
        a. sten som blivit till små korn, antingen genom väder och vind eller på konstgjord väg
        b. (geologi) jordart med kornstorlek mellan 0,06 och 2 mm

[FR] Filter out the "région" template

Output:

guère \ɡɛʁ\

    1. Presque pas ; presque rien. — [Note] Il ne s’emploie qu’avec la particule ne.
    2. De peu ; d'un rien.
    3. (Term) Presque personne excepté ; presque rien excepté.
    4. [Région] Tout au plus.
    5. (Term) Presque. — [Note] Placé après jamais ou plus ; placé devant nulle part, personne ou rien.

Wikicode:

# {{région}} [[tout au plus|Tout au plus]].

Find a simple way to handle nested lists

Wiktionary page: https://fr.wiktionary.org/wiki/languide

Output:

languide \lɑ̃.ɡid\ (mf.) 

  1. En parlant d’une personne :
  2. En parlant d’une chose :

Expected:

languide \lɑ̃.ɡid\ (mf.) 

  1. En parlant d’une personne :
      a. (Littéraire) Qui est dans un état habituel de langueur, qui dépérit, qui se trouve dans un état d’abattement, de grande faiblesse physique et psychologique.
      b. Qui exprime un sentiment de langueur.
  2. En parlant d’une chose :
      a. Qui manque de force, d’énergie.
      b. Qui évoque, engendre la langueur.

I am not sure right now if it is worth trying to handle that subtle case. There are not many words having that issue.
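
For what it is worth, a two-level version can stay small. The sketch below assumes that sub-definitions are written with a "## " prefix in the wikicode (standard wiki syntax for a nested numbered list, but not verified against this page) and that two levels are enough. The function name render_definitions is hypothetical.

```python
from string import ascii_lowercase


def render_definitions(lines):
    """Render '# ' lines as numbered items and '## ' lines as lettered sub-items."""
    output = []
    top = 0  # counter for first-level items
    sub = 0  # counter for sub-items of the current first-level item
    for line in lines:
        if line.startswith("## "):
            letter = ascii_lowercase[sub % 26]
            sub += 1
            output.append(f"      {letter}. {line[3:].strip()}")
        elif line.startswith("# "):
            top += 1
            sub = 0
            output.append(f"  {top}. {line[2:].strip()}")
        # Any other line (examples, quotes, ...) is ignored in this sketch.
    return output


sample = [
    "# En parlant d’une personne :",
    "## Qui exprime un sentiment de langueur.",
    "# En parlant d’une chose :",
    "## Qui manque de force, d’énergie.",
]
print("\n".join(render_definitions(sample)))
```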

Filter out the "pronl" template

Output:

singulariser \sɛ̃.ɡy.la.ʁi.ze\

    1. ...
    2. (Pronl) (Souvent Péjoratif) Se distinguer, se faire remarquer par quelque singularité, par des opinions, des actions, des manières singulières.

Wikicode:

# {{pronl|fr}} {{souvent péjoratif|fr}} Se [[distinguer]], se faire [[remarquer]] par quelque [[singularité]], par des [[opinion]]s, des [[action]]s, des manières [[singulières]].

Distributing the workload

A good improvement was made in #30, but it should be possible to do better.

One idea would be to distribute the workload between several threads or processors. The overall process is slow because the script only uses 1 thread, at 100% CPU.

The interesting code to improve is in scripts/get.py: look for the process() function.
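
A rough sketch of the idea using worker processes from the standard library, since the work is CPU-bound. Everything here is an assumption: process_page is a hypothetical stand-in for the real per-page work (it is not the project's process() function), and the chunk size is an arbitrary value.

```python
from concurrent.futures import ProcessPoolExecutor


def process_page(page):
    """Hypothetical CPU-bound step: (title, wikicode) -> (title, definitions)."""
    title, wikicode = page
    definitions = [
        line[2:] for line in wikicode.splitlines() if line.startswith("# ")
    ]
    return title, definitions


def process_all(pages, workers=None):
    """Process pages in several worker processes and return a dict of results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize sends pages to workers in batches, which limits the
        # inter-process communication overhead for many small items.
        return dict(pool.map(process_page, pages, chunksize=256))
```

On platforms that spawn worker processes (Windows, macOS), process_all() should be called from under an `if __name__ == "__main__":` guard.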

[FR] Word "ski" not well handled

Wiktionary page: https://fr.wiktionary.org/wiki/ski

Wikicode:

''Pluriel de'' '''ski'''. {{note}} Certains auteurs conservent le pluriel norvégien ''ski'' du {{étyl|no|fr|mot=ski}}.

Output:

Pluriel de ski. Note : Certains auteurs conservent le pluriel norvégien ski du.

Expected:

Pluriel de ski. Note : Certains auteurs conservent le pluriel norvégien ski du norvégien ski.

[FR] Plural of a word --> Definition of the word in the singular

Wiktionary page: https://fr.wiktionary.org/wiki/schmilblicks

Wikicode:

# ''Pluriel de ''[[schmilblick]].

Output:

1. Pluriel de schmilblick.

Expected:

schmilblick \ʃmil.blik\ m.
1. (À l’origine) Appareil invraisemblable ne servant à rien du tout.
2. (Après 1969) Chose, objet à deviner par des questions auxquelles on ne répond que par oui ou par non.
3. (Par extension) Quelque chose de difficile à décrire ou à cerner, un machin.
4. (Populaire) Sujet de discussion (dans l'optique de faire avancer celle-ci).

Hello,

The idea would be to display, for plural words, the definition of the singular form directly.

I think this can be done via the "words" file.
For the example, the line to write in the "words" file would be:

schmilblick, blicks

(I am not sure how this file behaves regarding word endings... to be tested.)

Best regards,
NicoR

PS: would it be possible to have the "words" file in plain text (before marisa-build) in this Git repository?
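
Independently of the "words" file, the redirection could also be sketched at the wikicode level. The code below is an assumption-laden illustration: it only recognizes the exact "Pluriel de" pattern shown in this issue, and the names singular_of, definitions_for and the entries mapping (word -> raw definition lines) are all hypothetical.

```python
import re

# Matches a definition line such as: # ''Pluriel de ''[[schmilblick]].
PLURAL_OF = re.compile(r"^#\s*''Pluriel de\s*''\s*\[\[([^\]|#]+)")


def singular_of(wikicode_line):
    """Return the singular word targeted by a "Pluriel de" line, else None."""
    match = PLURAL_OF.match(wikicode_line)
    return match.group(1) if match else None


def definitions_for(word, entries):
    """Return the singular's definitions when `word` is only a plural form."""
    lines = entries.get(word, [])
    if len(lines) == 1:
        target = singular_of(lines[0])
        if target in entries:
            return entries[target]
    return lines


entries = {
    "schmilblick": ["# Appareil invraisemblable ne servant à rien du tout."],
    "schmilblicks": ["# ''Pluriel de ''[[schmilblick]]."],
}
print(definitions_for("schmilblicks", entries))
```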

Change the dictionary source

I did not find out how to change the dictionary name:
kobo-libra-dicthtml-fr-name

Le Robert Micro © 2013 Dictionnaires Le Robert

What I tried so far:

  • Searching the word Robert or obert in the whole Kobo folder: no results.
  • Enabling logs for all categories for several hours + reboots: no more results.
  • Looking into the KoboReader.sqlite file: no results.
  • Checking whether the name is in dicthtml-$LOCALE.zip as a special header (or in the "comment" header of the ZIP file): no results.

If someone has more ideas to try, I would love to give them a try.

Note: there is the encrypted file BookReader.sqlite that I could not check as it is password protected.

Smaller Elements and less work done in useless tags

A good improvement was made in #30, but it should be possible to do better.

Parsing one "page" of the XML will result in a lot of unused data. In fact, we only need items 0 (the "title") and 3 (the "revision").

  • Item 0 is the word itself, stored in the title tag.
  • Item 3 is a subelement that also contains a lot of unused data. In that subelement, we only need subitem 0 (the "id", aka the current revision number) and one of items 6, 7 or 8. All three must be checked because the text tag (aka the wikicode) may be at any of those indexes; it is not always at the same position.

So we should find a way to reduce the memory footprint and, more importantly, the Element handling, by using a custom class. The fewer tags processed, the better.

The interesting code to improve is in scripts/get.py: look for the xml_iter_parse() function.
If such an improvement is found and implemented, the xml_parse_element() function will also need cleaning up.
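
A possible direction, sketched with the standard library only and without a custom Element class: look tags up by name instead of by index, and free each page once it has been handled. This is not the project's xml_iter_parse(); the function names are hypothetical, and taking the first id found inside revision as the revision number is an assumption about the dump layout.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, revision_id, wikicode) for each <page> in the dump."""
    title = rev_id = text = None
    in_revision = False
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            if name == "revision":
                in_revision = True
            continue
        # From here on this is an "end" event: elem is fully parsed.
        if name == "title":
            title = elem.text
        elif name == "id" and in_revision and rev_id is None:
            # Assumption: the first <id> inside <revision> is the revision number.
            rev_id = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "revision":
            in_revision = False
        elif name == "page":
            yield title, rev_id, text
            title = rev_id = text = None
            # Drop the page's children so they can be garbage collected.
            elem.clear()


# Tiny in-memory sample standing in for the real dump file.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>ski</title><ns>0</ns><id>1</id>"
    b"<revision><id>42</id><contributor><id>7</id></contributor>"
    b"<text>wikicode here</text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Matching on tag names avoids the "index 6, 7 or 8" check entirely, at the cost of one string operation per element.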

Better handle auto-updates

$ python -m scripts
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.2/x64/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.2/x64/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/runner/work/ebook-reader-dict/ebook-reader-dict/scripts/__main__.py", line 49, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/home/runner/work/ebook-reader-dict/ebook-reader-dict/scripts/__main__.py", line 43, in main
    if get.main() == 1:
  File "/home/runner/work/ebook-reader-dict/ebook-reader-dict/scripts/get.py", line 273, in main
    file = fetch_pages(snapshot)
  File "/home/runner/work/ebook-reader-dict/ebook-reader-dict/scripts/get.py", line 71, in fetch_pages
    req.raise_for_status()
  File "/opt/hostedtoolcache/Python/3.8.2/x64/lib/python3.8/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://dumps.wikimedia.org/frwiktionary/20200420/frwiktionary-20200420-pages-meta-current.xml.bz2
>>> Fetching pages-20200420.xml.bz2 

Here, the data was unavailable because the dump was not yet finished on the Wiktionary side.
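
One way to make this more robust, shown as a sketch only: treat a 404 as "dump not ready" and fall back to the previous dump date. The code uses urllib's HTTPError from the standard library rather than the requests exception seen in the traceback, and first_available and its fetch callable are hypothetical names.

```python
from urllib.error import HTTPError


def first_available(dump_dates, fetch):
    """Return (date, data) for the newest dump `fetch` can download, else None.

    `fetch` is any callable that takes a dump date such as "20200420"
    and raises HTTPError when the download fails.
    """
    # YYYYMMDD strings sort chronologically, so newest first is a reverse sort.
    for date in sorted(dump_dates, reverse=True):
        try:
            return date, fetch(date)
        except HTTPError as exc:
            if exc.code != 404:
                raise
            # 404: the dump is not published yet, try the previous one.
    return None
```

Any error other than a 404 is re-raised, so real failures still stop the update.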

Move to OOP approach

The code uses a lot of global envars, and most of them depend on the desired locale, which is also a global variable.

Moving to an OOP approach would make the code easier to read and to test across all supported locales.
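
As a starting point, the locale-dependent values could live in a small immutable object. The class below is a hypothetical sketch: its name and attributes are invented for illustration, and the URL and file path patterns are taken from the logs shown in the other issues.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LocaleConfig:
    """Hypothetical container for everything that depends on the locale."""

    locale: str  # e.g. "fr", replaces the WIKI_LOCALE global
    dump: str  # e.g. "20200420", replaces the WIKI_DUMP global

    @property
    def dump_url(self) -> str:
        wiki = f"{self.locale}wiktionary"
        return (
            f"https://dumps.wikimedia.org/{wiki}/{self.dump}/"
            f"{wiki}-{self.dump}-pages-meta-current.xml.bz2"
        )

    @property
    def pages_file(self) -> Path:
        return Path("data") / self.locale / f"pages-{self.dump}.xml"


config = LocaleConfig(locale="fr", dump="20200420")
print(config.dump_url)
```

Tests can then build one instance per locale instead of patching globals.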

Update the release commit

It would be interesting to automate the release commit in the auto-updates.yml workflow.
Doing that manually for WIKI_LOCALE="fr" is simple:

git tag -f -a "fr" -m "fr"
git push -f --tags

But we cannot really do that from the GitHub Action right now.
