ciumabok / django-wagtail-turkish-i

This project forked from originell/django-wagtail-turkish-i

A demonstration to showcase an issue in django's slugification with combining diacritics

License: MIT License

Python 55.63% CSS 9.75% HTML 32.38% Dockerfile 2.24%

Demo Project for "İstanbul" issue

This is a demonstration to showcase the slugification problem of "İstanbul".

As originally discussed in a thread in Wagtail's support channel.

Setup

This repo contains two projects, both created by following the official procedures:

  1. mysite_wagtail was created using Wagtail's Getting Started guide.
  2. mysite_django was created using Django's django-admin.py startproject and ./manage.py startapp.

$ pipenv install
# cd into either mysite_django or mysite_wagtail
$ cd mysite_django
$ ./manage.py test

In both projects, the tests fail. Why they fail and what the expected behaviour would be is explained in detail below.

The issue

When creating a new page with the title Hello İstanbul, the slugification with allow_unicode=True produces a slug that fails validation. A detailed analysis follows the steps to reproduce.

Wagtail: Steps to reproduce in the web interface

$ ./manage.py migrate
# Create a user to access the admin interface.
$ ./manage.py createsuperuser
# And off we go!
$ ./manage.py runserver
  1. Log into the admin with your superuser: http://localhost:8000/admin/.
  2. Open up Home in the explorer
  3. Click Add Child Page
  4. Set Hello İstanbul as title
  5. Press Save Draft
  6. The error message "The page could not be created due to validation errors" appears and the Promote tab has a badge with a 1 in it.
  7. Click on the Promote tab
  8. You will see an error message: Enter a valid 'slug' consisting of Unicode letters, numbers, underscores, or hyphens.

Expected behaviour

The lowercased version of the uppercase İ (latin capital letter i with dot above) should work as a slug.

Analysis

First we analyze the letter, then we try to reason why this fails with wagtail/django.

Letter

What kind of character is this?

In [1]: import unicodedata
In [2]: unicodedata.name('İ')
Out[2]: 'LATIN CAPITAL LETTER I WITH DOT ABOVE'

Also, we can see that this letter consists of only a single Unicode character (code point).

In [3]: [unicodedata.name(character) for character in 'İ']
Out[3]: ['LATIN CAPITAL LETTER I WITH DOT ABOVE']

Now, we let Python create the lowercase version

In [5]: li = 'İ'.lower()
In [6]: li
Out[6]: 'i̇'

Looks good. Let's see what it is called.

In [7]: unicodedata.name(li)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-70daade6eaaf> in <module>
----> 1 unicodedata.name(li)

TypeError: name() argument 1 must be a unicode character, not str

Uh oh. Wait... what? unicodedata.name() only accepts a single character, so this error means the lowercased string consists of more than one Unicode character.

In [8]: [unicodedata.name(character) for character in li]
Out[8]: ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

Indeed it does! So it is a "regular" small "i" with the combining diacritic "dot above".
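
The same check can be wrapped in a small helper for experimenting outside IPython. This is only a minimal sketch: the name codepoint_names is made up for illustration and does not appear in the repo.

```python
import unicodedata


def codepoint_names(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(character) for character in text]


upper = '\u0130'       # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
lower = upper.lower()  # lowercased by Python, no locale involved

# One code point before lowercasing, two afterwards.
print(len(upper), codepoint_names(upper))
print(len(lower), codepoint_names(lower))
```

Comparing len() before and after lowercasing is a quick way to spot characters that grow when their case changes.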

Django/Wagtail

So what is going on in Django? As far as I can tell, Wagtail does not have anything to do with this, as it simply calls django's slugify(). Furthermore it is possible to re-create the exact same error when using only Django, as is shown in mysite_django.

Now, let's go down the rabbit hole.

  1. Wagtail calls slugify("Hello İstanbul")
  2. Django slugifies the title
  3. Django's SlugField validates the slug
  4. If validation succeeds, Django commits it to the database

1. wagtail calls slugify()

We are going to skip this step, as it looks unsuspicious.

In mysite_django, we clone this behaviour in mysite_django/demo/models.py.

2. django slugifies the title

We start by taking a look at django's slugify.

def slugify(value, allow_unicode=False):
    """
    Convert to ASCII if 'allow_unicode' is False. Convert spaces to hyphens.
    Remove characters that aren't alphanumerics, underscores, or hyphens.
    Convert to lowercase. Also strip leading and trailing whitespace.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

First our string ("Hello İstanbul") is normalized, which turns each Unicode character into a defined "normalized" form (Python's docs on this explain it well).

We pass allow_unicode=True, so Django uses normalization form KC (NFKC):

# https://github.com/django/django/blob/master/django/utils/text.py#L400
# value is always "Hello İstanbul"
if allow_unicode:
    value = unicodedata.normalize('NFKC', value)

This does not change anything for us, which is as it should be, if I understand Python's docs on this correctly.

In [20]: unicodedata.normalize('NFKC', "Hello İstanbul")
Out[20]: 'Hello İstanbul'
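
For contrast, here is a small sketch of what the two normalization branches of the quoted slugify() do to our title. The variable names are made up for illustration. The point is that NFKC keeps İ as one composed character, while NFKD (the allow_unicode=False branch) splits it into a plain capital I plus a combining dot, which the ASCII encoding step then drops.

```python
import unicodedata

title = 'Hello \u0130stanbul'  # "Hello İstanbul"

composed = unicodedata.normalize('NFKC', title)
decomposed = unicodedata.normalize('NFKD', title)

# NFKC leaves the title untouched: İ is already a composed character.
print(composed == title)

# NFKD splits İ into two code points, so the string grows by one.
print(len(title), len(decomposed))
print([unicodedata.name(c) for c in decomposed[6:8]])

# The allow_unicode=False branch then discards everything non-ASCII.
ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
print(ascii_only)
```

So the non-Unicode branch never sees the problem, because the combining dot is thrown away before lowercasing.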

Now, the next step in django's slugification is this line:

# https://github.com/django/django/blob/master/django/utils/text.py#L404
value = re.sub(r'[^\w\s-]', '', value).strip().lower()

Splitting it up into three steps:

  1. Remove anything that is not a word character (Unicode letters, digits and the underscore), whitespace or a hyphen
  2. Strip leading and trailing whitespace
  3. Lowercase the entire string

Again, our value would be Hello İstanbul. First, we do what the regex does:

In [21]: re.sub(r'[^\w\s-]', '', "Hello İstanbul")
Out[21]: 'Hello İstanbul'

That is alright, because every character in our string is either a Unicode word character or a space:

In [22]: [(character, unicodedata.name(character), re.match(r'[\w\s]', character)) for character in "Hello İstanbul"]
Out[22]:
[('H', 'LATIN CAPITAL LETTER H', <re.Match object; span=(0, 1), match='H'>),
 ('e', 'LATIN SMALL LETTER E', <re.Match object; span=(0, 1), match='e'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('o', 'LATIN SMALL LETTER O', <re.Match object; span=(0, 1), match='o'>),
 (' ', 'SPACE', <re.Match object; span=(0, 1), match=' '>),
 ('İ',
  'LATIN CAPITAL LETTER I WITH DOT ABOVE',
  <re.Match object; span=(0, 1), match='İ'>),
 ('s', 'LATIN SMALL LETTER S', <re.Match object; span=(0, 1), match='s'>),
 ('t', 'LATIN SMALL LETTER T', <re.Match object; span=(0, 1), match='t'>),
 ('a', 'LATIN SMALL LETTER A', <re.Match object; span=(0, 1), match='a'>),
 ('n', 'LATIN SMALL LETTER N', <re.Match object; span=(0, 1), match='n'>),
 ('b', 'LATIN SMALL LETTER B', <re.Match object; span=(0, 1), match='b'>),
 ('u', 'LATIN SMALL LETTER U', <re.Match object; span=(0, 1), match='u'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>)]

14 characters, 14 matches. All right!

Then, Django calls strip, to free the string of extraneous whitespace:

In [24]: re.sub(r'[^\w\s-]', '', "Hello İstanbul").strip()
Out[24]: 'Hello İstanbul'

As expected, everything stays the same.

Now the last step in this very line is to lowercase the string:

In [26]: re.sub(r'[^\w\s-]', '', "Hello İstanbul").strip().lower()
Out[26]: 'hello i̇stanbul'

After that, the function replaces every run of whitespace and hyphens with a single dash:

In [34]: re.sub(r'[-\s]+', '-', 'hello i̇stanbul')
Out[34]: 'hello-i̇stanbul'

And that's that.
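
The steps above can be replayed end to end without Django. The sketch below copies the slugify() body quoted earlier into a standalone function; the name slugify_as_quoted is made up for illustration, and the behaviour of the slugify() shipped with any particular Django release may differ from this copy.

```python
import re
import unicodedata


def slugify_as_quoted(value, allow_unicode=False):
    """Standalone copy of the slugify() body quoted above."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Filter first, lowercase last: this ordering is what the analysis is about.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


slug = slugify_as_quoted('Hello \u0130stanbul', allow_unicode=True)

# ascii() makes the otherwise invisible combining dot show up as \u0307.
print(ascii(slug))
print(len(slug))
```

Printing with ascii() is handy here because the combining dot is easy to miss when the slug is rendered normally.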

3. Django's SlugField validates the value

This is where things "fail". The SlugField is given the result of the slugification: 'hello-i̇stanbul'.

Now the SlugField does its validation.

If we take a look at the SlugField validator being used (with allow_unicode = True), we end up with the slug_unicode_re regular expression, as defined in django/core/validators.py.

# allow for unicode words (\w), dashes (-). \Z is for marking the end of the string.
slug_unicode_re = _lazy_re_compile(r'^[-\w]+\Z')

From here, things go awry. That regular expression does not match the output generated by slugify("Hello İstanbul").

In [30]: type(slug_unicode_re.match('hello-i̇stanbul'))
Out[30]: NoneType

Why does it fail? Let's repeat the small analysis we did before and remember that the lowercase version of İ consists of two characters:

# Just a quick reminder:
In [10]: [unicodedata.name(character) for character in unicodedata.normalize('NFKC', 'İ'.lower())]
Out[10]: ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

# Now the gory details:
In [36]: [(character, unicodedata.name(character), slug_unicode_re.match(character)) for character in 'hello-i̇stanbul']
Out[36]:
[('h', 'LATIN SMALL LETTER H', <re.Match object; span=(0, 1), match='h'>),
 ('e', 'LATIN SMALL LETTER E', <re.Match object; span=(0, 1), match='e'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('o', 'LATIN SMALL LETTER O', <re.Match object; span=(0, 1), match='o'>),
 ('-', 'HYPHEN-MINUS', <re.Match object; span=(0, 1), match='-'>),
 ('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),

 # EUREKA!!
 ('̇', 'COMBINING DOT ABOVE', None),

 ('s', 'LATIN SMALL LETTER S', <re.Match object; span=(0, 1), match='s'>),
 ('t', 'LATIN SMALL LETTER T', <re.Match object; span=(0, 1), match='t'>),
 ('a', 'LATIN SMALL LETTER A', <re.Match object; span=(0, 1), match='a'>),
 ('n', 'LATIN SMALL LETTER N', <re.Match object; span=(0, 1), match='n'>),
 ('b', 'LATIN SMALL LETTER B', <re.Match object; span=(0, 1), match='b'>),
 ('u', 'LATIN SMALL LETTER U', <re.Match object; span=(0, 1), match='u'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>)]

The combining diacritic does not validate, which is 100% correct: it should not! However, that is exactly why the slugification process "fails" in this scenario: slugify() can be made to produce output that does not validate against slug_unicode_re.
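
To find the offending characters in any slug, the per-character check can be reduced to a filter. This is a sketch that compiles the same pattern with plain re.compile rather than Django's _lazy_re_compile helper. It also prints the Unicode general category of each rejected character, which for the combining dot is Mn (nonspacing mark), a category that Python's \w does not cover.

```python
import re
import unicodedata

# Same pattern as the slug_unicode_re quoted above, compiled directly.
slug_unicode_re = re.compile(r'^[-\w]+\Z')

slug = 'hello-i\u0307stanbul'  # output of the slugification walked through above

# Collect every character that the validator pattern rejects on its own.
rejected = [c for c in slug if slug_unicode_re.match(c) is None]
print([(unicodedata.name(c), unicodedata.category(c)) for c in rejected])

# The slug as a whole therefore fails as well.
print(slug_unicode_re.match(slug))
```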

Summary & Proposed Solution

Summing up: the culprit is the point at which slugify() does the lowercasing. It lowercases after the unwanted characters have been stripped. However, as demonstrated, lowercasing can itself create unwanted characters, "unwanted" meaning characters that the SlugField validator rejects.

There are two ways to solve this:

  1. Extend the slug_unicode_re to allow for combining diacritics. Or…
  2. …change when the lowercasing happens.

Personally I am leaning towards option 2.

So this line in django/utils/text.py's slugify:

value = re.sub(r'[^\w\s-]', '', value).strip().lower()

becomes

value = re.sub(r'[^\w\s-]', '', value.lower()).strip()

Repeating the proposed solution with our demonstration:

In [37]: re.sub(r'[^\w\s-]', '', 'Hello İstanbul'.lower()).strip()
# Diacritic is now removed, hence we have the lower case latin i
Out[37]: 'hello istanbul'

In [38]: re.sub(r'[-\s]+', '-', 'hello istanbul')
Out[38]: 'hello-istanbul'

In [39]: slug_unicode_re.match('hello-istanbul')
Out[39]: <re.Match object; span=(0, 14), match='hello-istanbul'>
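
Put together as a function, option 2 could look like the sketch below. The name slugify_lower_first is made up for illustration; it is the quoted slugify() body with only the one line changed, not a patch taken from Django itself.

```python
import re
import unicodedata

# Same pattern as the slug_unicode_re quoted above, compiled directly.
slug_unicode_re = re.compile(r'^[-\w]+\Z')


def slugify_lower_first(value, allow_unicode=False):
    """Quoted slugify() body with the lowercasing moved before the filter."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Lowercase first, so anything lower() introduces is filtered out too.
    value = re.sub(r'[^\w\s-]', '', value.lower()).strip()
    return re.sub(r'[-\s]+', '-', value)


slug = slugify_lower_first('Hello \u0130stanbul', allow_unicode=True)
print(slug)
print(slug_unicode_re.match(slug) is not None)
```

The trade-off is that the dot above is lost, so the slug contains a plain ASCII i.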

An even deeper dive (is Python doing this right?)

Thanks to Matt Westcott's suggestion that it could also be a bug in Python itself, I took a quick dive into the unicode plane myself.

Does this really mean that Python is doing something weird here by adding the Combining dot above?

In [8]: [unicodedata.name(character) for character in 'İ'.lower()]
Out[8]: ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

I'm in no way a unicode pro, so my assumption might be very naive and wrong.
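
One experiment that might help judge this is a round trip through lower() and upper(). The sketch below only observes what Python does; it does not verify anything against the Unicode casing rules. The variable names are made up for illustration.

```python
import unicodedata

upper = '\u0130'       # İ
lower = upper.lower()  # 'i' followed by COMBINING DOT ABOVE

# Uppercasing again gives 'I' plus the combining dot; NFC composes the
# pair back into the single character İ.
round_trip = unicodedata.normalize('NFC', lower.upper())
print(round_trip == upper)

# A plain 'i' without the extra dot would uppercase to a plain 'I'.
print('i'.upper() == 'I')
```

At least in this experiment, the combining dot is what lets the lowercase form keep the information that it came from İ rather than from I.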
