rspeer / python-ftfy Goto Github PK

Fixes mojibake and other glitches in Unicode text, after the fact.

License: Other

Python 62.72% Jupyter Notebook 37.28%

python-ftfy's Introduction

ftfy: fixes text for you

>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

The full documentation of ftfy is available at ftfy.readthedocs.org. The documentation covers a lot more than this README, so here are some links into it:

Testimonials

“My life is livable again!” — @planarrowspace
“A handy piece of magic” — @simonw
“Saved me a large amount of frustrating dev work” — @iancal
“ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
“I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow
“9.2/10” — pylint

What it does

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'

Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.

ftfy can fix multiple layers of mojibake simultaneously:

>>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."

It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:

>>> ftfy.fix_text("l’humanitÃ©")
"l'humanité"

ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:

>>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
'à perturber la réflexion'
>>> ftfy.fix_text('Ã perturber la rÃ©flexion')
'à perturber la réflexion'

ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:

>>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
>>> ftfy.fix_text('P&EACUTE;REZ')
'PÉREZ'

These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.

The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.

>>> ftfy.fix_text('IL Y MARQUÉ…')
'IL Y MARQUÉ…'

Installing

ftfy is a Python 3 package that can be installed using pip:

pip install ftfy

(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

Local development

ftfy is developed using poetry. Its setup.py is vestigial and is not the recommended way to install it.

Install Poetry, check out this repository, and run poetry install to install ftfy for local development, such as experimenting with the heuristic or running tests.

Who maintains ftfy?

I'm Robyn Speer, also known as Elia Robyn Lake. You can find me on GitHub or Cohost.

Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.

ftfy has a citable record on Zenodo. A citation of ftfy may look like this:

Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is::

@misc{speer-2019-ftfy,
  author       = {Robyn Speer},
  title        = {ftfy},
  note         = {Version 5.5},
  year         = 2019,
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.2591652},
  url          = {https://doi.org/10.5281/zenodo.2591652}
}

Important license clarifications

If you do not follow ftfy's license, you do not have a license to ftfy.

This sounds obvious and tautological, but there are people who think open source licenses mean that they can just do what they want, especially in the field of generative AI. It's a permissive license but you still have to follow it. The Apache license is the only thing that gives you permission to use and copy ftfy; otherwise, all rights are reserved.

If you use or distribute ftfy, you must follow the terms of the Apache license, including that you must attribute the author of ftfy (Robyn Speer) correctly.

You may not make a derived work of ftfy that obscures its authorship, such as by putting its code in an AI training dataset, including the code in AI training at runtime, or using a generative AI that copies code from such a dataset.

At my discretion, I may notify you of a license violation, and give you a chance to either remedy it or delete all copies of ftfy in your possession.

python-ftfy's People

Contributors

Stargazers

Watchers

Forkers

simongreenhill craigcitro mainka twophoned martin-lx airhorns jaytoday saturnisbig kazuar bussiere michft bag-of-projects afthill prodigeni thesnews jaredly ahurriyetoglu joranbeasley echel0n neilbryant aminorex devsinghsachan math4youbyusgroupillinois mittald niksonx wakermahmud vanms1989 fburkitt camr0n nickinthebox jeisc ptighe yuhuating loretoparisi shannonyu sandy4321 giserh rlugojr yetanothertimes nkhuyu saikswaroop libardo1 ranjeet-floyd rock999 bishoptylaor jsean662 rmax-contrib leezqcst nulledexceptions mpvyard dattatele marketing1by1 aishwaryapradhan chetanmehra priestd09 jalajthanaki dgreyling rcyrus punizione mehdimashayekhi michael-k hhy5277 sahwar refacktor abdessalam-aadel wxrui dafeima feconroses sethips lukeshuo kyuhwas motiteux devblissit dimibit robmsmt coldpressedlinkjuice databill86 johns1342 unoqualsiasi anonhas priyansh2 princessinconsuela hkrone74 stjordanis algofreak e7dal eai7 commanderasdasd volkansenturk2012 dophist barseghyanartur nageshlop mjhawkins owlwang nth-attempt techthiyanes georgi-petkov personx000 jairoandrescastaneda usamai000

python-ftfy's Issues

Command-line tool clears the file if the output file path is identical to the input file path

Usage example:
ftfy -e windows-1251 -o weirdfile.txt weirdfile.txt

Result: An empty file.

CLI tests fail on Windows

This was found while building conda packages on Windows (py35, py36). Full log here: https://ci.appveyor.com/project/conda-forge/ftfy-feedstock/build/1.0.4/job/d6yr2io539ouweqf

======================================================================
ERROR: test_cli.test_stdin
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\bld\ftfy_1494727628610\_t_env\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\bld\ftfy_1494727628610\test_tmp\tests\test_cli.py", line 53, in test_stdin
    output = get_command_output(['ftfy'], stdin=infile)
  File "C:\bld\ftfy_1494727628610\test_tmp\tests\test_cli.py", line 23, in get_command_output
    return subprocess.check_output(args, stdin=stdin, stderr=subprocess.STDOUT, timeout=5).decode('utf-8')
  File "C:\bld\ftfy_1494727628610\_t_env\lib\subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "C:\bld\ftfy_1494727628610\_t_env\lib\subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ftfy']' returned non-zero exit status 1.
======================================================================

Linux/OSX tests passes OK.

ftfy-4.4.2: Nix build from PyPi fails on Linux and OS X due to non-ASCII character '\xc5'

Please see GNU/Linux Build and OS X Build on Nix Travis CI.

The culprit file is tests/test-entities.py.

The source tar ball is fetched from the PyPi repository.

Support for broken Cyrillic text

It would be great if ftfy could fix cases like this:

>>> s = u'ÑÅÊÐÅÒ - Áåñïå÷íûé Åçäîê - 0:00'
>>> print s.encode('latin1').decode('Windows-1251')
СЕКРЕТ - Беспечный Ездок - 0:00

but it doesn't:

>>> print ftfy.fix_text_segment(s)
ÑÅÊÐÅÒ - Áåñïå÷íûé Åçäîê - 0:00

Source of mojibake: http://ru2.101.ru:8000/status.xsl

Feature: fix Korean mojibake

It would be great if ftfy could fix cases like this:

>>> s = u'¼Ò¸®¿¤ - »ç¶ûÇÏ´Â ÀÚ¿©'
>>> print s.encode('latin1').decode('euc_kr')
소리엘 - 사랑하는 자여

but it doesn't:

>>> print ftfy.fix_text_segment(s)
1⁄4Ò ̧®¿¤ - »ç¶ûÇÏ ́Â ÀÚ¿©

Source: http://media.yohan.net/7.html

Fix Failure

Input: Àëüÿíñ ðóññêèõ è ÈÀÏË âìåñòå áóäóò ó÷àñòâîâàòü â âûáîðàõ â Åâðîïàðëàìåíò
Expected Output: Альянс русских и ИАПЛ вместе будут участвовать в выборах в Европарламент

Result of requests incorrectly guessing the encoding of text as ISO-8859-1 instead of Windows-1251.

Import failure in Python 2.7

Unfortunately, Python 2.7 doesn't seem to like UNSAFE_3_3_RE (https://github.com/LuminosoInsight/python-ftfy/blob/master/ftfy/fixes.py#L396):

$ python2.7
Python 2.7.5 (default, Sep  2 2013, 20:52:43)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.compile(u'[\U00100000-\U0010ffff]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kcarnold/.virtualenvs/py27/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/Users/kcarnold/.virtualenvs/py27/lib/python2.7/re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Support for broken Greek text

It would be great if ftfy could fix cases like this:

>>> s = u'ÖÉËÁ ÌÅ - ÂÏÓÊÏÐÏÕËÏÓ - ×ÉÙÔÇÓ'
>>> print s.encode('latin-1').decode('iso-8859-7')
ΦΙΛΑ ΜΕ - ΒΟΣΚΟΠΟΥΛΟΣ - ΧΙΩΤΗΣ

but it doesn't:

>>> print ftfy.fix_text_encoding(s)
ÖÉËÁ ÌÅ - ÂÏÓÊÏÐÏÕËÏÓ - ×ÉÙÔÇÓ

False positives on all-caps text truncated by an ellipsis

There are about 30 accented capital letters that, when followed by the ellipsis character …, look like a different character encoded in UTF-8 and decoded as Windows-1252.

In Twitter testing, these are a large source of extant false positives (about 0.3 per megatweet).

One option is to add a penalty for turning an ellipsis into something else, but that would break the test saying that any Unicode character encoded as UTF-8 and decoded as Windows-1252 can be fixed on its own. Maybe that's okay.

The most ambiguous case is that the Swedish and Finnish letter Ä plus an ellipsis could be the Polish letter ą.

Feature: detect mixups between two single-byte encodings

There is apparently a fair amount of Spanish text out there that contains a mix-up between Windows-1252 and MacRoman before being encoded in UTF-8.

Because Latin-1 for Windows-1252 is the only single-byte mixup we detect, we assume that's what happened, and get text that looks like: "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal".

This is not a false positive, because the encoding is in fact incorrect (it's actually got the UTF-8 encoding of the wrong characters in it), and ftfy is trying to fix it. It's in fact using the same fix that any web browser would use. However, the resulting text makes no sense, because it's not the correct fix.

This mixup is apparently common enough that it would be worth fixing as another special case.

Feature: mixed encodings in a single line / custom encoding boundaries

I'm looking into using ftfy to help upstream data sources clean up their data. a common format is CSV files. These data sources often manage to mix up encoding in different columns in the same row (maybe they're from different databases or web forms).

ftfy will be even more useful for me if it were possible to fix mixed encodings on the same line, or alternatively to add an argument to control what is considered a segment.

Obviously I could just make some code to use ftfy functions in CSV fields myself to get the desired results, but ftfy is a well documented easy to use tool so I'd really like offload the analysis to the bad data providers themselves.

If it's agreed this could be a feature, I might be able to spend some time on it.

Handle Cocoa/Core Foundation Unicode in Python

This is an admittedly edge-case, but that fact makes it all the tougher to debug and fix. I was recently using subprocess and mdls to fiddle with file metadata on OS X. To my surprise, I got some odd Unicode and Python just wasn't handling it well. I went to StackOverflow, and is is frequently the case got some great advice. You can find the topic here. I will repeat the basics of the issue here tho for simplicity.

The issue regards reading Unicode text from the shell into Python. I have a test document with the following metadata attribute:

kMDItemAuthors = (
    "To\U0304ny\U0308 Sta\U030ark"
)

I see this when I run mdls -name kMDItemAuthors path/to/the/file. I am attempting to get this data into usable form within a Python script. However, I cannot get the Unicode represented text into actual Unicode in Python.

As the comments detail, the issue comes from different ways to represent Unicode in Python and Cocoa/Core Foundation. I have a function that decodes input text into clean, normalized Python Unicode. I have altered it following this issue to this:

def decode(text, encoding='utf-8', normalization='NFC'):
    """Return ``text`` as normalised unicode.

    :param text: string
    :type text: encoded or Unicode string. If ``text`` is already a
        Unicode string, it will only be normalised.
    :param encoding: The text encoding to use to decode ``text`` to
        Unicode.
    :type encoding: ``unicode`` or ``None``
    :param normalization: The nomalisation form to apply to ``text``.
    :type normalization: ``unicode`` or ``None``
    :returns: decoded and normalised ``unicode``

    """
    # convert string to Unicode
    if isinstance(text, basestring):
        if not isinstance(text, unicode):
            text = unicode(text, encoding)
    # decode Cocoa/CoreFoundation Unicode to Python Unicode
    if '\\U' in text:
        text = text.replace('\\U', '\\u').decode('unicode-escape')
    return unicodedata.normalize(normalization, text)

ftfy.fix_test() doesn't handle this odd Unicode, so adding the check that I added to my decode() function somewhere in your code is needed. Not certain where or how, but this seemed like a perfect little addition to ftfy.

Sentence which couldn't correctly covert

Hello,

I have a sentence which couldn't convert to utf-8 encoding correctly. Here it is:

yes itâ€™s true; for us, it doesnâ€™t get any better. the negative about the property, as others have mentioned, is that there is just â€œsomethingâ€� missing.

After applying ftfy.fix_text(), it shows:

Yes itâ€TMs true; For us, it doesnâ€TMt get any better. The negative about the property, as others have mentioned, is that there is just â€œsomethingâ€� missing.

There is a slightly change, but still not recognized.
Any comments on this?

CLI broken with Python 3: tries to decode STDIN as UTF-8 even when different encoding is specified

With ftfy installed as a Python 3 package on my Ubuntu 14.04 machine, this happens:

$ echo Æèçíü íà ìåñòå | iconv -f utf8 -t latin1 | ftfy -e sloppy-windows-1251
ftfy error:
This input couldn't be decoded as 'sloppy-windows-1251'. We got the following error:

    'utf-8' codec can't decode byte 0xc6 in position 0: invalid continuation byte

ftfy works best when its input is in a known encoding. You can use `ftfy -g`
to guess, if you're desperate. Otherwise, give the encoding name with the
`-e` option, such as `ftfy -e latin-1`.

With ftfy installed as a Python 2 package, it works as expected:

$ echo Æèçíü íà ìåñòå | iconv -f utf8 -t latin1 | ftfy -e sloppy-windows-1251
Жизнь на месте

BTW, awesome package!

Fail in fix

Text:

text = "a€¢ Strong telecom background in BSS area a€“ Preferably Order Management"
ftfy.fix_text(text)

When i am passing text to fix text it is returning

UnicodeError: Hey wait, this isn't Unicode.

Then i am converting text to unicode and passing to fix_text function

utext = unicode(text, encoding, errors=errors)

Output:

u'a\u20ac\xa2 Strong telecom background in BSS area a\u20ac\u201c Preferably Order Management'

In [10]: ftfy.fix_text(utext)
Out[10]: u'a\u20ac\xa2    Strong telecom background in BSS area a\u20ac" Preferably Order Management'

I deal with this kind of data alot.
How to solve this issue ?

Possibility to support Chinese codecs?

Based on this Stack Overflow question I looked into support for Chinese character encodings.

The GB* series of codecs are, like UTF-8, a variable width encoding. The example in the question reads:

Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€

which can be decoded using GB* encodings to varying degrees of success:

>>> print text.encode('windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('windows-1252').decode('big5', 'replace')
癡瞽�嘔阬Ｔdcx�瓣繡鬚疆��嘔氐�嘔刈鄞珍把�腕氐倦疇�Ｔ�

Unfortunately I do not know which one of these is closest to the original, but that doesn't matter all that much. What'd be needed is an analysis of how GB* encodings pushed through the CP1252 / Latin-1 sieve can be distinguished from UTF-8 Mojibakes and handled fix_one_step_and_explain().

Is supporting these codecs feasible?

Endless loop fixing this text

import requests
import ftfy
ftfy.fix_text(requests.get("http://www.vb.is/frettir/84846/").text)

Never comes back, KeyboardInterrupt gives us this position

textbreak = text.find(' ', pos, pos + MAXLEN)

I'll try to narrow it down to the offending text

PyPy: import failure

Hi there,

I get the following error when trying to import ftfy into a PyPy console:

Python 2.7.8 (10f1b29a2bd21f837090286174a9ca030b8680b2, Feb 05 2015, 17:51:14)
[PyPy 2.5.0 with GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>> import ftfy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/ftfy/__init__.py", line 15, in <module>
    from ftfy import fixes
  File "/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/ftfy/fixes.py", line 8, in <module>
    from ftfy.chardata import (possible_encoding,
  File "/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/ftfy/chardata.py", line 45, in <module>
    ENCODING_REGEXES = _build_regexes()
  File "/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/ftfy/chardata.py", line 36, in _build_regexes
    charlist = latin1table.encode('latin-1').decode(encoding)
  File "/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/ftfy/bad_codecs/__init__.py", line 72, in search_function
    from ftfy.bad_codecs.sloppy import CODECS
  File "/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/ftfy/bad_codecs/sloppy.py", line 156, in <module>
    CODECS[_new_name] = make_sloppy_codec(_encoding)
  File "/usr/local/Cellar/pypy/2.5.0/libexec/site-packages/ftfy/bad_codecs/sloppy.py", line 97, in make_sloppy_codec
    decoded_chars = all_bytes.decode(encoding, errors='replace')
  File "/usr/local/Cellar/pypy/2.5.0/libexec/lib-python/2.7/encodings/cp1250.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
TypeError: expected string, got bytearray object

Any ideas?

pip install -U ftfy on Windows XP 32-bit fails

Installing ftfy on Windows 32-bit under the Python Anaconda (2.1) distribution:

C:>pip install -U ftfy
Collecting ftfy
Using cached ftfy-3.4.0.tar.gz
C:\Anaconda\lib\distutils\dist.py:267: UserWarning: Unknown distribution opt
ion: 'entry_points'
warnings.warn(msg)
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: invalid command 'egg_info'
Complete output from command python setup.py egg_info:
C:\Anaconda\lib\distutils\dist.py:267: UserWarning: Unknown distribution opt
ion: 'entry_points'

  warnings.warn(msg)

usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]

   or: -c --help [cmd1 cmd2 ...]

   or: -c --help-commands

   or: -c cmd --help

error: invalid command 'egg_info'
----------------------------------------

←[31m Command "python setup.py egg_info" failed with error code 1 in c:\docum
e~~1\dinesh\locals~~1\temp\pip-build-tng72e\ftfy←[0m

documentation: But gb18030 is a UTF.

We also can’t handle the non-UTF encodings used for Chinese, Japanese, and Korean, such as shift-jis and gb18030. See issue #34 for why this is so hard.

But GB18030 is a UTF! It handles every code point but surrogates, which is good enough to meet this definition:

A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence.

I guess there will be some rewording needed. Perhaps just replace gb18030 with gbk.

PS: members of the GB family may be partially identifiable given the following features:

Most characters found in common GB strings fall inside the GB2312-1980 (EUC-CN) range. EUC G1 bytes don't overlap with ASCII byte range.
GB 18030's 4-byte sequences are highly visible to human in some otherwise GBK soup, as GBK does not make use of 0x30-0x39 for its 2-byte encoding.

On Python 2, the "fixed" text could contain surrogates

On Python 2, it's valid to decode a surrogate codepoint from UTF-8. On Python 3 (and in the Unicode standard), it's not.

If we see a string that coincidentally contains what looks like mojibake of a surrogate, but doesn't succeed in ftfy's built-in CESU-8 decoder, then ftfy on Python 2 will turn it into that surrogate, while Python 3 will see that something is wrong and leave the text alone.

This actually happened given this text (note the last three characters):

``toda produzida pronta pra assa aí´´

Even after fixing this discrepancy, this shows that there are some possible false positives when acute accents are used as quotation marks.

Unicode apostrophe not being converted to ASCII equivalent

I was trying to tidy up a google news corpus and ran into characters that I thought would be 'normalized' but weren't.

In particular, I thought ftfy would convert ʼ (U+02BC or 700) to its ASCII apostrophe equivalent like it does with other single quote/apostrophe like characters. However, I noticed that ftfy only converts in one Unicode range SINGLE_QUOTE_RE = re.compile('[\u2018-\u201b]') and misses many instances. Is there a reason for this?

FYI I wrote a little program to find what other self proclaimed apostrophe characters are out there and as you can see many aren't converted or are converted to U+02BC:

python demo_unidecode_ftfy.py | grep -i "apostrophe" | cut -d" " -f 1,2,3,4,8-

dec=39 unicode_chr=' ftfy_chr(s)=' ftfy_dec=Same unicode_name=APOSTROPHE
dec=329 unicode_chr=ŉ ftfy_chr(s)=ʼn ftfy_decs=700, 100 unicode_name=LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
dec=700 unicode_chr=ʼ ftfy_chr(s)=ʼ ftfy_dec=Same unicode_name=MODIFIER LETTER APOSTROPHE
dec=750 unicode_chr=ˮ ftfy_chr(s)=ˮ ftfy_dec=Same unicode_name=MODIFIER LETTER DOUBLE APOSTROPHE
dec=1370 unicode_chr=՚ ftfy_chr(s)=՚ ftfy_dec=Same unicode_name=ARMENIAN APOSTROPHE
dec=65287 unicode_chr=＇ ftfy_chr(s)=' ftfy_dec=39 unicode_name=FULLWIDTH APOSTROPHE

Along the same token here are the ones that ftfy converts:

python demo_unidecode_ftfy.py | grep "dec=39 " | cut -d" " -f 1,2,3,4,8-

dec=39 unicode_chr=' ftfy_chr(s)=' ftfy_dec=Same unicode_name=APOSTROPHE
dec=8216 unicode_chr=‘ ftfy_chr(s)=' ftfy_dec=39 unicode_name=LEFT SINGLE QUOTATION MARK
dec=8217 unicode_chr=’ ftfy_chr(s)=' ftfy_dec=39 unicode_name=RIGHT SINGLE QUOTATION MARK
dec=8218 unicode_chr=‚ ftfy_chr(s)=' ftfy_dec=39 unicode_name=SINGLE LOW-9 QUOTATION MARK
dec=8219 unicode_chr=‛ ftfy_chr(s)=' ftfy_dec=39 unicode_name=SINGLE HIGH-REVERSED-9 QUOTATION MARK
dec=65287 unicode_chr=＇ ftfy_chr(s)=' ftfy_dec=39 unicode_name=FULLWIDTH APOSTROPHE

#!/usr/bin/env python

from unidecode import unidecode
from unicodedata import name
import ftfy

for i in range(33, 65535):
    if i > 0xeffff:
        continue  # Characters in Private Use Area and above are ignored
    if 0xd800 <= i <= 0xdfff:
        continue
    u = chr(i)
    f = ftfy.fix_text(u, normalization='NFKC')
    a = unidecode(u)
    if a != '[?]' and len(u) != 0 and len(a) != 0 and len(f) != 0:
        new_char = ''
        if u != f:
            for c in list(f):
                new_char += "{}, ".format(ord(c))
            new_char = new_char[:-2]
        else:
            new_char = 'Same'
        try:
          print("dec={} unicode_chr={} ftfy_chr(s)={} ftfy_dec={} ascii_chr={} "
                "uni_len={} ascii_len={} unicode_name={}".format(i, u, f, new_char, a, len(u), len(a), name(u)))
        except ValueError:
          pass

Broken link in README

The careers page moved to http://www.luminoso.com/career.html , apparently.

Support cp437 and MacRoman

In the README file, I have wondered aloud whether anyone needs to handle text explosions that involve the cp437 or MacRoman encodings, given that they haven't been seriously used in over a decade. I have found examples of both.

The "20 Newsgroups" data set from 1993, widely used in machine learning, is in an ugly mish-mash of encodings. As many computer users in 1993 were using DOS, the cp437 encoding (used in American MS-DOS computers) is much more common than Latin-1 (mostly used on Unix at the time). There are occasional instances of MacRoman in there as well, especially in the comp.sys.mac.hardware group.

Although not everyone needs to read newsgroups from 1993, it would be good if machine learning tools could show the right output.

But are there ever more complicated mixups than mistaking one bytewise encoding for another? Yes. I just found this tweet in the wild. This text was encoded as UTF-8, decoded as if it were MacRoman, and posted to Twitter (which itself speaks UTF-8):

Le Schtroumpf Docteur conseille g√¢teaux et baies schtroumpfantes pour un r√©gime √©quilibr√©.

How the smurf did that even happen? But anyway, I can't fully claim to fix text encoding mistakes until I can detect what happened there and fix it.

Mac Classic line breaks get eaten

This technically works as documented, but it's probably not what you want. You probably want them to get turned into UNIX line breaks. I do, at least.

Weird Asian characters not fixed

I have a couple of filenames that are encoded and decoded on various operating systems without any respect to Unicode, and what I got is something like this:

U+9DED  鷭      [Lo] CJK UNIFIED IDEOGRAPH-9DED

that should be

U+00F8  ø       [Ll] LATIN SMALL LETTER O WITH STROKE
U+0073  s       [Ll] LATIN SMALL LETTER S

U+6DEA  淪      [Lo] CJK UNIFIED IDEOGRAPH-6DEA

↓

????????????????????????????? (maybe a base 10 Arabic digit)
U+005F  _       [Pc] LOW LINE

U+7FB9  羹      [Lo] CJK UNIFIED IDEOGRAPH-7FB9

↓

U+00FC  ü       [Ll] LATIN SMALL LETTER U WITH DIAERESIS

U+E15D  \ue15d  [Co] <unknown>

↓

U+00FC  ü       [Ll] LATIN SMALL LETTER U WITH DIAERESIS

Any ideas would be appreciated.

I'm especially interested in the ??? character. :)

Add processing of mere filenames

It is desirable to have a tool to fix mojibake filenames, using for example a --filenameonly flag, which can sometimes come in whole collections (made by archivers or otherwise coming from inferior operating systems) and are at least an aesthetical chagrin, therefore this should be implemented in addition to #79. As for the “output” argument, it would again be desirable to have a special value for it called maybe “tofilename” or another argument for renaming the files instead of expressing the conversion result in stdout.

Custom html entity defs

I ran into some text that uses non-standard html entities...

&Ccaron; &Sacute; &Scedil; &cacute; &ccaron; &nacute; &ncaron; &scedil; &slig; &zcaron;

...which are apparently supported in some browsers (all of these but &slig; resolve to a real character in Chrome 51):

Č Ś Ş ć č ń ň ş &slig; ž

Currently I'm using the following hack, which works, but, well, it's a hack:

from ftfy import fix_text
from ftfy.fixes import htmlentitydefs
nonstandard_entities = [
    ('ccaron', u'č'), ('Ccaron', u'Č'), ('cacute', u'ć'), ('Sacute', u'Ś'),
    ('Scedil', u'Ş'), ('nacute', u'ń'), ('ncaron', u'ň'), ('scedil', u'ş'),
    ('slig', u'ß'), ('zcaron', u'ž'),
]
for name, unicode_value in nonstandard_entities:
    htmlentitydefs.name2codepoint[name] = ord(unicode_value)
# and then later I use fix_text() as usual

I'd prefer to be able to pass in my own htmlentitydefs to fix_text(...), or perhaps just the name2codepoint mapping, since that's all that ftfy.fixes.unescape_html(text) uses.

What do you think? Too much argument bloat?

version 3 is way slow

The code in the "version3" branch currently runs about 8x slower than the released version. The code paths were supposed to be simpler, but a lot of it is new code that's not optimized.

3.1 branch crashes on stuff that looks like it might be modified UTF-8 but isn't

An example (in Windows-1251 decoded as Windows-1252): "èíâåíòàðèçàöèÿ ìóçåéíûõ ýêñïîíàòîâ áëàíê êíèãà ãèííåñà ñòîèìîñòü ðûñàêà".

The text is probably Russian spam, but the important part is that it contains 0xed bytes and crashes the newer branch of ftfy.

Add an encoding detector, to determine the first step in reading bytes

Right now ftfy is unable to process the 20 Newsgroups dataset, which is written in a combination of Latin-1, MacRoman, and cp437. Basically, that data starts out as bytes in an unknown encoding, and ftfy snarks at you if you give it bytes that haven't been decoded.

ftfy would be a more useful command-line tool, and a great way to clean up old text files, if it could heuristically determine which encoding to use to turn bytes into Unicode.

This is currently the purported job of the chardet package, but chardet's heuristics only work correctly on multi-byte encodings. ftfy eventually needs to come with a mini-chardet that can correctly guess single-byte encodings.

This is tricky, and it probably won't land in 3.0.

Add (recursive) processing of files in directories to the command-line tool

It would require fixing of #78 too, and a special value called maybe “selfsamefile” for the “--output” argument or another argument for having the selfsame files as output rather than the standard output.
We could differentiate whether files in subfolders are affected by an additional “--recursive” argument.

Actually, processing of whole folders is the crucial feature one desires from the command-line tool in ftfy, because one is hardly slower just opening and resaving files with medit or an other encoding aware editor for the purpose of fixing files ony by one.

Reasonable French gets decoded as unreasonable Armenian

Twitter testing has revealed a case where ftfy v3 fails at its mission: it decodes a reasonable French word with an unreasonable Armenian character.

If the text L’épisode (including a curly apostrophe) is interpreted as MacRoman, it becomes the UTF-8 encoding of LՎpisode. Վ is a capital letter, so the current heuristics do not penalize the result.

Some possible remedies:

Add a heuristic that penalizes mixing of Latin and non-Latin letters (which v2 has, but it would cost computation time to add v2's heuristic back in)
Add a penalty to using an encoding besides Latin-1 or Windows-1252 as the first step

2.6 compatibility

ftfy/ftfy/chardata.py in _build_charmaps()
62 charlist = [unichr(codept) for codept in sorted(charmap.keys())
63 if codept >= 0x80]
---> 64 regex = '^[\x00-\x7f{}]*$'.format(''.join(charlist))
65 charmaps[encoding] = charmap
66 encoding_regexes[encoding] = re.compile(regex)

ValueError: zero length field name in format

Fixable as:
regex = '^[\x00-\x7f{0}]*$'.format(''.join(charlist))

false positives in Ukrainian

Very particular Ukrainian words, such as ВІКІ (which is a translation of "WIKI" in all caps, and is entirely different from the ASCII letters "BIKI"), can be misinterpreted as Windows-1251 ~ UTF-8 mojibake. 'ВІКІ', for example, becomes '²ʲ'.

Windows-1251 should get at least the cost that MacRoman and cp437 have, especially because real Windows-1251 mojibake will tend to appear in large quantities. (They're writing in Cyrillic characters, none of which are ASCII.)

Wrongly-decoded sequences of more than 2^16 characters may not be fixed

If a string contains 2^16 characters without any white space, and some of these characters are decoded incorrectly, ftfy may fail to fix them. The characters will only be fixed if the boundary between the 65535th and 65536th character is also a boundary between correctly-decoded characters.

This is a case where ftfy does not work as its docstrings promise. Although I can hope that no user of ftfy is relying on its correctness in such a case, I have to assume that they are.

Some emojis are considered "weird"

Hello,
I cannot post the example here, but I can post the "explain unicode":
U+FE0F ️ [Mn] VARIATION SELECTOR-16
U+2764 ❤ [So] HEAVY BLACK HEART
I have a post with a sequence of these, and it receives a high weirdness score (as high as the number of hearts as above).

Vietnamese text doesn't fix correctly.

I use the fix_text method to this the text below, but sounds like it doesn't work.

Input: u'N\xe1"\x91i l\xe1\xba\xa1i tu\xe1\xba\xa7n tra, ch\xc3\xadnh quy\xe1"\x81n Trump x\xc3\xb3a tan nghi ng\xe1"\x9d \xe1"\x9f Bi\xe1"\x83n \xc4\x90\xc3\xb4ng '

Ouput: Ná"�i láº¡i tuáº§n tra, chÃnh quyá"�n Trump xÃ³a tan nghi ngá"� á"� Biá"�n Ä�Ã´ng

Expected output: Nối lại tuần tra, chính quyền Trump xóa tan nghi ngờ ở Biển Đông

import requests
from readability import Document
from ftfy import fix_text

def extract(url):
response = requests.get(url)
doc = Document(response.text)
return doc.title()

t = extract('http://vnexpress.net/tin-tuc/the-gioi/phan-tich/noi-lai-tuan-tra-chinh-quyen-trump-xoa-tan-nghi-ngo-o-bien-dong-3590033.html')
print(t)
u'N\xe1"\x91i l\xe1\xba\xa1i tu\xe1\xba\xa7n tra, ch\xc3\xadnh quy\xe1"\x81n Trump x\xc3\xb3a tan nghi ng\xe1"\x9d \xe1"\x9f Bi\xe1"\x83n \xc4\x90\xc3\xb4ng - VnExpress'

print(fix_text(t))
u'N\xe1"\x91i l\xe1\xba\xa1i tu\xe1\xba\xa7n tra, ch\xc3\xadnh quy\xe1"\x81n Trump x\xc3\xb3a tan nghi ng\xe1"\x9d \xe1"\x9f Bi\xe1"\x83n \xc4\x90\xc3\xb4ng - VnExpress'

Curly quotes get transformed into "normal" quotes

The following code

import ftfy
ftfy.fix_text('„')

gives '"' as output. Is this correct or should it keep „ instead? The same goes for “.

Feature: fix UTF-8 mixups with Windows-1250 or ISO-8859-2

We can currently fix mixups between UTF-8 and Windows-1252, or Windows-1251, but not Windows-1250 (the codepage used by some Eastern European versions of Windows).

If we handle this, we should also handle the similar but not quite compatible encoding of ISO-8859-2.

These are the most common single-byte encodings that aren't already handled in ftfy, based on https://w3techs.com/technologies/history_overview/character_encoding. Adding these encodings would probably be the most effective way to improve ftfy's coverage. (Issue #18 would improve it much more, as would dealing with CJK encodings, but those are much more difficult problems.)

how to install is missing in docs

I did not found any hints how to install ftfy. It took my a bit to realize you provide it via pip.

I would mention it in the README

Thanks for your work!

Explain the `restore_byte_a0` function a bit better

Why is it needed? How does a space get encoded to \xa0? (Would fit into the documentation, I guess.)

Cannot decode CESU-8 on Python 3

CESU-8 is a problematic encoding. Nobody ever intends to use CESU-8, but it's still used.

It looks like UTF-8, but first you surrogate-encode the astral characters like in UTF-16, then you encode those surrogates in UTF-8. It's the encoding you get if you implement a naive UTF-8 algorithm on a system that natively uses UTF-16.

Python 3 will raise an error if you try to handle CESU-8 as if it's UTF-8. This causes one of the test tweets in the "version3" branch to fail on Python 3, because it's represented in Windows-1252 on top of CESU-8. We need a way to properly decode this kind of text on Python 3, so that the module behaves identically on Python 2 and Python 3.

Feature request: Fix \u escaped unicode sequences

Feature request for something akin to your html entities fix.

I have a text provider who is supposed to be providing me with unicode strings. However, occasionally, I get strings with \u strings in them. For instance, a string might be l'h\u00f4tel (python equivalent u"l'h\\u00f4tel").

What I'd like is for fix_text to notice the \u<unicode char> and replace it with <unicode char>.

Hope that's sufficient detail to understand the issue, if not, let me know.

ftfy doesn't fix 16-bit surrogate codepoints, making stdout sad

One kind of Unicode brokenness that ftfy doesn't fix is the presence of 16-bit surrogate codepoints, the codepoints from U+D800 to U+DFFF, whether or not they're correctly paired.

Strings with these characters tend to cause encoding errors. This is particularly inconvenient in situations where it's not possible to set errors='replace' on the encoding, and one of these situations is writing to stdout on Python 3.

Vietnamese is not weird!

Hello,
I've been getting high weirdness scores for texts that seem benign, like:

Mấy thằng vệ sỉ này muốn xơi nó củng khó, thịt không ăn hoài không hết!
Google Translate tells me that this is Vietnamese.
It has weirdness 14.

bad character range

Updated ftfy via pip

import ftfy now fails

Python 2.7.2 (default, Oct 11 2012, 20:14:37) 
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ftfy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/atli/work/sling/env/lib/python2.7/site-packages/ftfy/__init__.py", line 10, in <module>
    from ftfy import fixes
  File "/Users/atli/work/sling/env/lib/python2.7/site-packages/ftfy/fixes.py", line 396, in <module>
    UNSAFE_3_3_RE = re.compile('[\U00100000-\U0010ffff]')
  File "/Users/atli/work/sling/env/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/Users/atli/work/sling/env/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Behavior change

Hello,

In last version, some examples are not working anymore

print(fix_text('\001\033[36;44mIm blue, da ba dee da doo\033[0m', normalization='NFKC'))

gives I'm blue, da ba dee da ba doo and not I'm blue, da ba dee da ba doo...


Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
In[2]: from ftfy import fix_text
In[3]: print(fix_text('\001\033[36;44mI&#x92;m blue, da ba dee da ba doo&#133;\033[0m', normalization='NFKC'))
I'm blue, da ba dee da ba doo

In[4]: import ftfy
In[5]: ftfy.__version__
Out[5]: '5.0'

Feature: Replace unicode specials block characters?

Great tool, very useful! Its working brilliantly for most text I throw at it; I've just got the occasional documents that appear to have some rogue sequences of unicode "special blocks" characters in them.

e.g.

>ftfy.explain_unicode(txt):
U+000A  \n      [Cc] <unknown>
U+FFFC         [So] OBJECT REPLACEMENT CHARACTER
U+FFFC         [So] OBJECT REPLACEMENT CHARACTER
U+FFFC         [So] OBJECT REPLACEMENT CHARACTER
U+0042  B       [Lu] LATIN CAPITAL LETTER B
U+0072  r       [Ll] LATIN SMALL LETTER R
U+0065  e       [Ll] LATIN SMALL LETTER E
U+0072  r       [Ll] LATIN SMALL LETTER R
U+0065  e       [Ll] LATIN SMALL LETTER E
U+0074  t       [Ll] LATIN SMALL LETTER T
U+006F  o       [Ll] LATIN SMALL LETTER O
U+006E  n       [Ll] LATIN SMALL LETTER N
U+0020          [Zs] SPACE
U+0047  G       [Lu] LATIN CAPITAL LETTER G
U+0072  r       [Ll] LATIN SMALL LETTER R
U+0065  e       [Ll] LATIN SMALL LETTER E
U+0065  e       [Ll] LATIN SMALL LETTER E
U+006E  n       [Ll] LATIN SMALL LETTER N

For my case, output would look best if the Object Replacement Characters could be converted to spaces. Same for other similar characters in the same unicode block.

Note: I'm not sure of the origin of this text, it looks like it has been copied&pasted from elsewhere, but suffice to say the characters are certainly not intended for the output. It would be nice if ftfy had an option to remove them directly (apologies if I've missed it).

Fails to fix if the transformation are preceded by "…"

I noticed that nothing is transformed after … character:

>>> ftfy.fix_text('said… â\x80\x9cOverall an')
'said… â\x80\x9cOverall an'
>>> ftfy.fix_text('said â\x80\x9cOverall an')
'said "Overall an'

rspeer / python-ftfy Goto Github PK

python-ftfy's Introduction

ftfy: fixes text for you

Testimonials

What it does

Installing

Local development

Who maintains ftfy?

Citing ftfy

Important license clarifications

python-ftfy's People

Contributors

Stargazers

Watchers

Forkers

python-ftfy's Issues

Recommend Projects

Recommend Topics

Recommend Org