
pyphen's Introduction

Pyphen is a pure Python module to hyphenate text using existing Hunspell hyphenation dictionaries.

This module is a fork of python-hyphenator, written by Wilbert Berendsen.

Many dictionaries are included in Pyphen. They come from the LibreOffice git repository and are distributed under GPL, LGPL and/or MPL. Dictionaries are not modified in this repository. See the dictionaries and LibreOffice's repository for more details.

https://git.libreoffice.org/dictionaries

Pyphen has been created and developed by Kozea (https://kozea.fr). Professional support, maintenance and community management are provided by CourtBouillon (https://www.courtbouillon.org).

Copyrights are retained by their contributors; no copyright assignment is required to contribute to Pyphen. Unless explicitly stated otherwise, any contribution intentionally submitted for inclusion is licensed under GPL 2.0+/LGPL 2.1+/MPL 1.1, without any additional terms or conditions. For full authorship information, see the version control history.

pyphen's People

Contributors

blkserene, grewn0uille, lize, mbr, robinwhittleton, xezpeleta


pyphen's Issues

Type annotations?

Hi there, it seems like the use of type annotations is becoming ever more popular in the community. Is there any reason Pyphen doesn't have them? I'd be happy to submit a PR 🙂

Support for more languages

Hi, just want to say thanks for the work so far, this is a really handy tool :)

In the README of this repo, there is a link to LibreOffice dictionaries. If I follow that link, I see dictionaries for a few languages that are not supported by Pyphen so far. Examples include Arabic, Turkish, etc.

I was wondering if there is any reason why those languages are not supported? I see here that you're pulling from a different repo to add dictionaries, so maybe you have some other criteria for including languages that I'm not seeing.

It would be great if we could support the languages from the link on the README page. I know someone was asking for Arabic already.

Thanks and keep up the good work :)

length 4

Dear

I followed your advice on using pyphen.
I have one more question.
It concerns words with 4 letters, for example "өвөл".
If I use the word exactly as written, the result is "өвөл".
If I put a space before it, as in " өвөл", the result is "ө-вөл", which is correct.

How can I get the desired result?

Thank you in advance.

Sincerely

Batnyam
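
A possible explanation, sketched under an assumption about Pyphen's internals: the Pyphen constructor accepts left and right arguments (both default to 2), the minimum number of characters kept before the first and after the last hyphen. Assuming split points are filtered as left <= i <= len(word) - right, a split after the first letter is dropped, while a leading space shifts the same split to index 2 and lets it through. The helper name and the split points below are illustrative only, not Pyphen's code.

```python
def keep_positions(positions, word, left=2, right=2):
    """Keep only split points that leave at least `left` characters before
    and `right` characters after the hyphen (assumed filtering rule)."""
    return [i for i in positions if left <= i <= len(word) - right]


# Hypothetical raw split point after the first letter of the word.
without_space = keep_positions([1], "өвөл")         # dropped: 1 < left
with_space = keep_positions([2], " өвөл")           # kept: the space shifts the index
relaxed = keep_positions([1], "өвөл", left=1)       # kept once left is lowered
```

If that assumption holds, constructing the hyphenator with left=1 should give "ө-вөл" without the leading space.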

I get the following error after running pip install PyHyphen~=4.0.3

Collecting PyHyphen~=4.0.3
Using cached PyHyphen-4.0.3.tar.gz (40 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: wheel>=0.36.0 in ./.venv/lib/python3.10/site-packages (from PyHyphen~=4.0.3) (0.42.0)
Requirement already satisfied: setuptools>=52.0 in ./.venv/lib/python3.10/site-packages (from PyHyphen~=4.0.3) (69.1.0)
Requirement already satisfied: appdirs>=1.4.0 in ./.venv/lib/python3.10/site-packages (from PyHyphen~=4.0.3) (1.4.4)
Requirement already satisfied: requests>=2.25 in ./.venv/lib/python3.10/site-packages (from PyHyphen~=4.0.3) (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in ./.venv/lib/python3.10/site-packages (from requests>=2.25->PyHyphen~=4.0.3) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in ./.venv/lib/python3.10/site-packages (from requests>=2.25->PyHyphen~=4.0.3) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./.venv/lib/python3.10/site-packages (from requests>=2.25->PyHyphen~=4.0.3) (2.2.0)
Requirement already satisfied: certifi>=2017.4.17 in ./.venv/lib/python3.10/site-packages (from requests>=2.25->PyHyphen~=4.0.3) (2024.2.2)
Building wheels for collected packages: PyHyphen
Building wheel for PyHyphen (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [31 lines of output]
No name configuration, performing automatic discovery
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-310
creating build/lib.linux-x86_64-cpython-310/hyphen
copying src/hyphen/hyphenator.py -> build/lib.linux-x86_64-cpython-310/hyphen
copying src/hyphen/dictools.py -> build/lib.linux-x86_64-cpython-310/hyphen
copying src/hyphen/__init__.py -> build/lib.linux-x86_64-cpython-310/hyphen
copying src/hyphen/textwrap2.py -> build/lib.linux-x86_64-cpython-310/hyphen
running egg_info
writing src/PyHyphen.egg-info/PKG-INFO
writing dependency_links to src/PyHyphen.egg-info/dependency_links.txt
writing requirements to src/PyHyphen.egg-info/requires.txt
writing top-level names to src/PyHyphen.egg-info/top_level.txt
reading manifest file 'src/PyHyphen.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE.txt'
writing manifest file 'src/PyHyphen.egg-info/SOURCES.txt'
running build_ext
building 'hyphen.hnj' extension
creating build/temp.linux-x86_64-cpython-310
creating build/temp.linux-x86_64-cpython-310/lib
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -Ilib -I/home/brumbrum/UltraSinger/.venv/include -I/usr/include/python3.10 -c lib/hnjalloc.c -o build/temp.linux-x86_64-cpython-310/lib/hnjalloc.o
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -Ilib -I/home/brumbrum/UltraSinger/.venv/include -I/usr/include/python3.10 -c lib/hnjmodule.c -o build/temp.linux-x86_64-cpython-310/lib/hnjmodule.o
lib/hnjmodule.c:3:10: fatal error: Python.h: No such file or directory
3 | #include "Python.h"
| ^~~~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for PyHyphen
Running setup.py clean for PyHyphen
Failed to build PyHyphen
ERROR: Could not build wheels for PyHyphen, which is required to install pyproject.toml-based projects

Surrounding Punctuation Affecting Hyphenation

A TLA which isn't normally hyphenated can become hyphenated if it's surrounded by punctuation, such as parentheses or quote marks:

>>> import pyphen
>>> dic = pyphen.Pyphen(lang='en_GB')
>>> dic.inserted('LST')
u'LST'
>>> dic.inserted('(LST)')
u'(L-ST)'
>>> dic.inserted('TLA')
u'TLA'
>>> dic.inserted('(TLA)')
u'(T-LA)'
>>> dic.inserted('"TLA"')
u'"T-LA"'

Please ignore surrounding punctuation for the purpose of determining hyphenation points, so that terms which aren't normally hyphenated don't look wrong simply because they've been put in brackets.

Thanks.
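
Until this is handled in the library, one possible workaround is to peel punctuation off both ends of a token, hyphenate only the core, and put the punctuation back. The sketch below is not part of Pyphen; hyphenate_ignoring_punctuation is a hypothetical helper, and hyphenate stands for any callable that takes a word and returns its hyphenated form (for example dic.inserted).

```python
import unicodedata


def _is_punctuation(char):
    # Unicode general categories starting with "P" cover brackets, quotes,
    # full stops, guillemets and similar marks.
    return unicodedata.category(char).startswith("P")


def hyphenate_ignoring_punctuation(token, hyphenate):
    """Hyphenate `token` while leaving leading/trailing punctuation untouched."""
    start, end = 0, len(token)
    while start < end and _is_punctuation(token[start]):
        start += 1
    while end > start and _is_punctuation(token[end - 1]):
        end -= 1
    core = token[start:end]
    # Only the core is passed to the hyphenator; an all-punctuation token
    # is returned unchanged.
    return token[:start] + (hyphenate(core) if core else "") + token[end:]
```

With a real dictionary this would be called as hyphenate_ignoring_punctuation('(TLA)', dic.inserted).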

Hyphenation error on German word "einen"

"einen" should return "ei-nen". I have also tried to add it to the dictionary manually by inserting the line
"ei1nen"
after the pattern
"ei1ne" (line 11739),
but the pattern is not used.

Here is the test code:

import pyphen
dic = pyphen.Pyphen(filename="my.dic")
dic.inserted("einen")

P.S.: "eine" is correctly inserted and results in "ei-ne".

User-specified hyphenation dictionaries

It looks as though Pyphen has full support for arbitrary user-specified hyphenation dictionaries via its filename parameter, but WeasyPrint isn't presently making use of that capability. It would be nice to have a command-line switch to specify this option, like with stylesheets, in case people want to do something a bit unusual. At present the only way to tweak hyphenation policy at this level is to manually tweak the contents of the Pyphen package, which is a bit dicey.

Split method?

I was surprised there's no method to split input text into a list of tokens. I ended up writing the following function:

def splitWord(word, splits):
    # Cut `word` at every index in `splits` (e.g. the result of
    # Pyphen.positions) without modifying the caller's list.
    syls = []
    firstidx = 0
    for lastidx in splits:
        syls.append(word[firstidx:lastidx])
        firstidx = lastidx
    # Whatever follows the last split point is the final piece.
    syls.append(word[firstidx:])
    return syls

With minor adjustments, it could be added to the library (get the split points with Pyphen.positions as input, then split on them).

Licensing

For now, this package is triple-licensed under GPL-2.0-or-later AND LGPL-2.1-or-later AND MPL-1.1. Especially the GPL part makes it hard to use this package in commercial solutions.

I therefore wanted to evaluate whether a separate package for each license type is feasible in some way, but this would require a compatible license for https://github.com/Kozea/Pyphen/blob/master/pyphen/__init__.py as well. For now, I would have to assume that it is GPL-2.0-or-later AND LGPL-2.1-or-later AND MPL-1.1 as well, which would not really allow for such a split?

Example: If I just need the DE and EN locale, I could use a stripped package under the terms of LGPL-2.1-or-later for DE and under the terms of a BSD-style license for EN, if the accompanying Python code uses a compatible license.

Exception when word not found in dict?

Hi, is it possible to raise an exception when the queried word is not in the current dictionary?
Right now pyphen may well return 1 for nonsense input, which is quite confusing.

ResourceWarning: unclosed file

Since 1795d65#diff-c99bbcfc0e7112a9ae7a56a9af9b33a1b503dafc3bbdaf6f2a753bd62fd2a6e4R122, Pyphen opens a dictionary without properly closing it at:

encoding = path.open('rb').readline().decode()

This results in a resource warning:

/Users/grzesiek/Code/anki/anki-word-hyphenator/deps/Pyphen/pyphen/__init__.py:118: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/grzesiek/Code/anki/anki-word-hyphenator/deps/Pyphen/pyphen/dictionaries/hyph_en_US.dic'>
  encoding = path.open('rb').readline().decode()
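
A minimal sketch of one way to avoid the warning: read the first line inside a with block so the file is closed as soon as the line has been read. read_first_line is a hypothetical helper name, the temporary file only stands in for a real .dic dictionary, and the strip() call is an addition for the example.

```python
import tempfile
from pathlib import Path


def read_first_line(path):
    """Return the decoded first line of `path`, closing the file afterwards."""
    with Path(path).open("rb") as handle:
        # The context manager closes the handle even if decoding fails.
        return handle.readline().decode().strip()


# Throwaway file standing in for a hyphenation dictionary.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "hyph_demo.dic"
    sample.write_bytes(b"UTF-8\nLEFTHYPHENMIN 2\n")
    encoding = read_first_line(sample)
```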

Hyphenation error on German word "Fortschritt"

import pyphen
dic = pyphen.Pyphen(lang='de_DE')
dic.inserted('Fortschritt')

results in: 'Fort-s-chritt'
The correct answer would be: 'Fort-schritt'

Although LibreOffice uses the same dictionary, the result seems to be correct there.

UnicodeDecodeErrors with non-ascii hyphen chars

chr(173) is the soft hyphen, also known as &shy;, which is used in HTML to mark possible hyphenation points.

import pyphen

dic = pyphen.Pyphen(lang='en_US')
dic.inserted('crocodile', hyphen=chr(173))

… results in UnicodeDecodeError: 'ascii' codec can't decode….

Several issues about the updating script

Hi, I've found several issues that could be fixed or improved about the script used to automatically update all hyphenation dictionaries.

  1. The hyphen in the file name of the Serbian dictionary hyph_sr-Latn.dic is not replaced by an underscore (the rename command does not work).
  2. READMEs of dictionaries were recently added in this commit, so the script should also update them accordingly.
  3. The locally cloned repo of LibreOffice's dictionaries should be removed automatically after updating (otherwise it needs to be removed by hand before committing the change).
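
The first point could be handled in Python rather than with a shell rename. This is only a sketch under the assumption that dictionary files are named hyph_*.dic; normalize_dictionary_names is a hypothetical helper, not part of the existing update script.

```python
from pathlib import Path


def normalize_dictionary_names(directory):
    """Rename hyph_*.dic files so that hyphens become underscores.

    Returns the new names of the files that were renamed.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(Path(directory).glob("hyph_*.dic")):
        target = path.with_name(path.name.replace("-", "_"))
        if target != path:
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

The cleanup in the third point could use shutil.rmtree on the cloned directory once the files have been copied.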

pytest-3.8: error: unrecognized arguments: --isort --flake8 --numprocesses=auto

pytest fails

  • pytest-3.8 --ignore=_build.python38 --ignore=_build.python39 --ignore=_build.python310 -v
    [ 32s] ERROR: usage: pytest-3.8 [options] [file_or_dir] [file_or_dir] [...]
    [ 32s] pytest-3.8: error: unrecognized arguments: --isort --flake8 --numprocesses=auto
    [ 32s] inifile: /home/abuild/rpmbuild/BUILD/Pyphen-0.13.0/pyproject.toml

(German) hyphenation derailed by punctuation characters

I found this strange behavior:

> dic = pyphen.Pyphen(lang='de')

> dic.inserted('begreifbar')
'be-greif-bar'

> dic.inserted('begreifbar.')
'be-greif-ba-r.'

> dic.inserted('begreifbar«.')
'be-greif-ba-r«.'

The first hyphenation is correct. The second and third have trailing punctuation characters (« is a common closing-quote in German printing), which leads to an additional incorrect hyphenation point being inserted.

I tried to use the local hunspell dictionary instead (/usr/share/hyphen/hyph_de_DE.dic), with the same result.

In this case, I could fix it by removing punctuation characters myself, but I'd still consider it to be a bug, possibly related to #24 and #26.

License question

I realized that this library is tri-licensed and that it's used in the WeasyPrint library.

Does it mean I cannot use WeasyPrint in a private repo for a commercial application?

Any comment is appreciated.

Sorting items obtained from importlib.resources.files() might not always be supported

Version: pyphen 0.14.0
Imported from: weasyprint==52.5

Other versions:

  • Nuitka==1.7.9
  • Nuitka-Python's sys.version: '3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)]'

Assuming that Pyphen is trying to support pre-built, pre-compiled, and/or shrink-wrapped Python-based applications (think PyInstaller, py2exe, or Nuitka), it tries to locate resource or data files using Python's stdlib resources.files functionality. This is of course a Good Thing™ in itself. However, Pyphen assumes that objects returned from its .iterdir() method can be compared using the < operator, since it calls sorted() on the .iterdir() result, but the Traversable ABC does not guarantee that objects returned by .iterdir() can be compared using the < operator.

When importing Pyphen, in a Nuitka-built program, the following error message is shown:

  File "[...]\standalone\build\manage.dist\pyphen\__init__.py", line 37, in <module pyphen>
TypeError: '<' not supported between instances of 'nuitka_resource_reader_files' and 'nuitka_resource_reader_files'

The following patch will solve this problem:

--- a/pyphen/__init__.py
+++ b/pyphen/__init__.py
@@ -30,11 +30,11 @@
 
 try:
     dictionaries = resources.files('pyphen.dictionaries')
 except (AttributeError, TypeError):
     # AttributeError with Python 3.7 and 3.8, TypeError with Python 3.9
     dictionaries = Path(__file__).parent / 'dictionaries'
 
-for path in sorted(dictionaries.iterdir()):
+for path in sorted(dictionaries.iterdir(), key=Path):
     if path.suffix == '.dic':
         name = path.name[5:-4]
         LANGUAGES[name] = path

Please note that I have also tried wrapping the resources.files() result in importlib.resources.as_file, but that didn't help. I don't know whether that should be attributed to Python or Nuitka. Probably the latter, since there's also another issue with Nuitka: it is not able to open the resulting path regardless.

As a final note: is this sorting of hyphenation file names needed anyways?
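
An alternative to key=Path, sketched here without Pyphen or Nuitka: sort on the name attribute, which the Traversable ABC does define, and test the file extension through the name as well. Entry is a stand-in class for a resource object that has no ordering, and build_language_map is a hypothetical helper; the [5:-4] slice mirrors the patch above.

```python
class Entry:
    """Stand-in for a Traversable-like object that does not support '<'."""

    def __init__(self, name):
        self.name = name


def build_language_map(entries):
    """Map language codes to entries, ordering by file name only."""
    languages = {}
    for entry in sorted(entries, key=lambda item: item.name):
        if entry.name.endswith(".dic"):
            # 'hyph_en_US.dic' -> 'en_US'
            languages[entry.name[5:-4]] = entry
    return languages


entries = [Entry("hyph_fr.dic"), Entry("README.md"), Entry("hyph_en_US.dic")]
languages = build_language_map(entries)
```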

Polish syllabification does not work: positions() returns wrong split points (the data is valid, but the result is not)

Judging by your code examples, this looks like a bug.
Syllabification does not work for this word, which is very common ("Mary" in English).

import pyphen
p = pyphen.Pyphen(lang='pl_PL')  # language code assumed
# CVCVCCV (C = consonant, V = vowel)

p.inserted('Maryśce', ' ')
# 'Ma-ry-ś-ce-' ? Syllabification does not work. Why is there a cut at the end?

list(p.iterate('Maryśce'))
# [('Maryśce', ''), ('Maryś', 'ce'), ('Mary', 'śce'), ('Ma', 'ryśce')]
# ('Maryśce', '')? Why is there a cut at the end?
# The rest is good. The two alternative cuts are both fine, though the first may be better: Ma-ry-śce or Ma-ryś-ce.

# ś is a consonant in Polish, so it is not a valid syllable on its own. Every syllable must contain a vowel.
p.positions('Maryśce')
# [2, 4, 5, 7]
# This is wrong: positions 4 and 5 isolate 'ś', which is not a valid syllable. I have no idea why 7 is there.
# 4 and 5 are alternatives; it is better to choose 4 and skip 5.

Only Ma-ry-śce or Ma-ryś-ce is valid.

Fuzzy language matching

en and en_US should probably use the same dictionary, but there are many more subtle cases. The "proper" way to do this is described in BCP47, although Pyphen can maybe get away with something simpler that works with the set of distributed dictionaries.

Babel is having a similar issue, maybe Pyphen can do what they do when they figure it out :) python-babel/babel#39
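
To make the "something simpler" concrete, here is a rough sketch of a fallback that drops trailing subtags and then falls back to any dictionary sharing the primary language. It is not the BCP 47 lookup algorithm, nor what Pyphen or Babel actually do; match_language is a hypothetical name and matching is case-sensitive for brevity.

```python
def match_language(requested, available):
    """Return the best key in `available` for `requested`, or None."""
    parts = requested.replace("-", "_").split("_")
    primary = parts[0]
    # Try the full tag first, then drop trailing subtags one at a time.
    while parts:
        candidate = "_".join(parts)
        if candidate in available:
            return candidate
        parts.pop()
    # Last resort: the first dictionary (alphabetically) for the same language.
    for key in sorted(available):
        if key.split("_")[0] == primary:
            return key
    return None
```

Picking the alphabetically first regional variant is arbitrary; a real implementation would probably want an explicit table of preferred defaults.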

Hungarian hyphenation is faulty in case of vowel-consonant-vowel-* words

Hello,
using latest pyphen (0.14.0), there seems to be an issue with the hyphenation of Hungarian words starting as vowel-consonant-vowel-*. E.g.: "alak" should be hyphenated as "a-lak" (currently not hyphenated by pyphen), or "alaktalan" as "a-lak-ta-lan" (incorrectly hyphenated as "alak-ta-lan" by pyphen).

I saw you suggested here to check with https://www.ushuaia.pl/hyphen/?ln=en (selecting language: Hungarian). The hyphenation of this type of word is also faulty there. I also checked these words in LibreOffice (7.3.7.2), and it has the same issue.

Notes:

  • have not checked previous releases, just started to try pyphen
  • could not find any other type of words incorrectly hyphenated, e.g. "láda", "labda", "drága" are all hyphenated correctly ("lá-da", "lab-da", "drá-ga") by these tools.

What should be used for cross-checking instead of the above is https://helyesiras.mta.hu/helyesiras/default/hyph# .
Note: MTA (mta.hu) is the National Academy of Science in Hungary.
The hyphenations I checked here were all correct, including the above words, too.

Thanks for looking into this!

Avoid use of __file__

Like Kozea/tinycss2#21. However, it will be a bit more difficult in this case, as it requires getting a reader from importlib.resources and using contents() instead of os.listdir().

'uni' considered one syllable

>>> dic = pyphen.Pyphen(lang='en_US')
>>> dic.inserted("universal")
'uni-ver-sal'
>>> dic.inserted("university")
'uni-ver-si-ty'
>>> dic.inserted("universe")
'uni-verse'
>>> dic.inserted("uni")
'uni'

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)

Pyphen is unable to handle characters outside the first 128 ASCII code points (related issue: #2). Even in English text, characters from the extended ASCII or Unicode sets occasionally appear, as in the example below:
Example:

p = Pyphen(lang='en_US')
p.inserted('She spent ¥130')

results in: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)

Can you please provide error handling for these cases as opposed to the program breaking?

Not working for Norwegian?

It would be nice to get a more explicit error when trying to instantiate for Norwegian (i.e. 'no'):

>>> dic = pyphen.Pyphen(lang='no')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/blabla/.venvs/blabla/lib/python3.8/site-packages/pyphen/__init__.py", line 218, in __init__
    filename = LANGUAGES[language_fallback(lang)]
KeyError: None

Something along the lines of: use nn or nb instead.
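
One way such a message could be produced is to catch the failed lookup and list the available language codes. This is a sketch only: dictionary_for is a hypothetical helper, not Pyphen's API, and the languages mapping in a real call would be pyphen.LANGUAGES.

```python
def dictionary_for(lang, languages):
    """Return the dictionary registered for `lang` or fail with a clear message."""
    if lang in languages:
        return languages[lang]
    available = ", ".join(sorted(languages))
    raise LookupError(
        f"no hyphenation dictionary for {lang!r}; available languages: {available}"
    )
```

The message could be refined further to suggest close matches, such as nb or nn for 'no'.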

error on lang='de'

import pyphen

if 'de' in pyphen.LANGUAGES:
    dic = pyphen.Pyphen(lang='de')

… results in IndexError: list index out of range.

It works with fr, though.

Why can't jonathan be hyphenated?

See the following example: jonathan is not hyphenated. Does anybody know what is wrong?

$ ./main1.py 
jonathan
$ cat ./main1.py 
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import pyphen
dic = pyphen.Pyphen(lang='en')
print(dic.inserted('jonathan'))

hyphen symbol at the beginning

Dear All

I am using pyphen for the Mongolian language.
I have an issue where the hyphen symbol appears at the beginning of words. Here is the result:

На|ран |Лув|сан ах о|лон |со|нин |хэв|лэ|нэ. |На|ран |сай|хан |ном |со|нин ав|лаа. Э|нэ |со|нин Ү|нэн |со|нин. |Хэр|лэн өөр о|лон |ном |со|нин ун|шив. |На|сан |ха|вар, |на|мар о|лон |со|нин ав|сан. Э|нэ |со|нин |хо|вор. |Ха|вар |мал өс|лөө. |Мо|лом |мал |мал|лав. |Ло|сол |ма|лаа ус|лав. |Ма|рал үх|рээ |ху|раа|лаа. Ө|вөө ү|хэр |ма|лаа |ху|раа|лаа. Э|мээ |мах |сүү а|вав. Ах |мал |мал|лав

As I am new to Python, I might have missed something.

Awaiting your kind support

install dictionaries in /usr/share/pyphen

Pyphen installs its dictionaries in /usr/lib64/python2.7/site-packages/pyphen/dictionaries/ (or similar). The dictionaries are more like shared objects and should go to /usr/share/…. There are many reasons for that (FHS, perhaps different filesystems, duplicate-finding); making life easier for maintainers is one of them. :-)

You can achieve that by adding the data_files argument to your setup.py. Please see Thot::setup.py for an example and code you can copy.

As data files are installed under sys.prefix (which is often /usr under GNU/Linux), the second argument of find_data_files starts with share/…. That is how you access them, too.

And: Thanks for making Pyphen!

Git symlinks are not working on Windows

The fact that some hyphenation dictionaries are actually sym-linked to other dictionaries breaks Pyphen on Windows when referring to those symlinks. For more detailed discussion about Git symlinks and Windows see http://stackoverflow.com/questions/5917249/git-symlinks-in-windows.

Repro sequence on a Win32 machine:

>>> import pyphen
>>> dic = pyphen.Pyphen(lang='en')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyphen.py", line 250, in __init__
    hdcache[filename] = HyphDict(filename)
  File "pyphen.py", line 181, in __init__
    self.maxlen = max(len(key) for key in self.patterns)
ValueError: max() arg is an empty sequence

P.S. It would also be nice if there were a more informative error message instead of the cryptic ValueError above.
