Demo Project for "İstanbul" issue

This is a demonstration to showcase the slugification problem of "İstanbul".

As originally discussed in a thread in Wagtail's support channel.

Setup

This repo contains 2 projects, both using official procedures:

was created using wagtail's Getting Started (mysite_wagtail)
was created using django's django-admin.py startproject and ./manage.py startapp (mysite_django).

$ pipenv install
# Cd into either mysite_django or mysite_wagtail
$ cd mysite_django
$ ./manage.py test

In both projects, the tests fail. Why they do and what the expected behaviour would be, is explained in detail below.

The issue

When wanting to create a new page with the title Hello İstanbul, the slugification with allow_unicode=True fails. Further analysis is below the Steps to reproduce.

Wagtail: Steps to reproduce in Webinterface

$ ./manage.py migrate
# Create a user to access the admin interface.
$ ./manage.py createsuperuser
# And off we go!
$ ./manage.py runserver

Log into the admin with your superuser: http://localhost:8000/admin/.
Open up Home in the explorer
Click Add Child Page
Set Hello İstanbul as title
Press Save Draft
The error message "The page could not be created due to validation errors" appears and the Promote tab has a badge with a 1 in it.
Click on the Promote tab
You will see an error message: Enter a valid 'slug' consisting of Unicode letters, numbers, underscores, or hyphens..

Expected behaviour

The lowercased version of the uppercase İ (latin capital letter i with dot above) should work as a slug.

Analysis

First we analyze the letter, then we try to reason why this fails with wagtail/django.

Letter

What kind of character is this?

In [1]: import unicodedata
In [2]: unicodedata.name('İ')
Out[2]: 'LATIN CAPITAL LETTER I WITH DOT ABOVE'

Also, we can see that this letter only consists of a single unicode character.

In [3]: [unicodedata.name(character) for character in 'İ']
Out[3]: ['LATIN CAPITAL LETTER I WITH DOT ABOVE']

Now, we let Python create the lowercase version

In [5]: li = 'İ'.lower()
In [6]: li
Out[6]: 'i̇'

Looks good. Let's see what it is called.

In [7]: unicodedata.name(li)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-70daade6eaaf> in <module>
----> 1 unicodedata.name(li)

TypeError: name() argument 1 must be a unicode character, not str

Uh oh. Wait,.. what? So. That means this string consists of two unicode characters.

In [8]: [unicodedata.name(character) for character in li]
Out[8]: ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

Indeed it does! So it is a "regular" small "i" with the combining diacritic "dot above".

Django/Wagtail

So what is going on in Django? As far as I can tell, Wagtail does not have anything to do with this, as it simply calls django's slugify(). Furthermore it is possible to re-create the exact same error when using only Django, as is shown in mysite_django.

Now, let's go down the rabbit hole.

Wagtail calls slugify("Hello İstanbul")
Django slugifies the title
Django's SlugField validates the slug
If validation succeeds, Django commits it to the database

1. wagtail calls slugify()

We are going to skip this. Looking very unsuspicious.

In mysite_django, we clone this behaviour in the mysite_django/demo/models.py.

2. django slugifies the title

We start by taking a look at django's slugify.

def slugify(value, allow_unicode=False):
    """
    Convert to ASCII if 'allow_unicode' is False. Convert spaces to hyphens.
    Remove characters that aren't alphanumerics, underscores, or hyphens.
    Convert to lowercase. Also strip leading and trailing whitespace.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

First our string ("Hello İstanbul") is normalized. So it turns each unicode character into a defined "normalized" form (python's docs on this explain it well).

We do allow_unicode, so Django uses the normalized C form:

# https://github.com/django/django/blob/master/django/utils/text.py#L400
# value is always "Hello İstanbul"
if allow_unicode:
    value = unicodedata.normalize('NFKC', value)

Doing this, does not change anything for us. As it should be, if I understand Python's docs on this correctly.

In [20]: unicodedata.normalize('NFKC', "Hello İstanbul")
Out[20]: 'Hello İstanbul'

Now, the next step in django's slugification is this line:

# https://github.com/django/django/blob/master/django/utils/text.py#L404
value = re.sub(r'[^\w\s-]', '', value).strip().lower()

Splitting it up into three steps:

Remove anything that is not a space, a (unicode) character or a hypen
Strip leading and trailing whitespace
Lowercase the entire string

Again, our value would be Hello İstanbul. First, we do what the regex does:

In [21]: re.sub(r'[^\w\s-]', '', "Hello İstanbul")
Out[21]: 'Hello İstanbul'

That is alright, because every character in our string is either a unicode character or a space:

In [22]: [(character, unicodedata.name(character), re.match('[\w\s]', character)) for character in "Hello İstanbul"]
Out[22]:
[('H', 'LATIN CAPITAL LETTER H', <re.Match object; span=(0, 1), match='H'>),
 ('e', 'LATIN SMALL LETTER E', <re.Match object; span=(0, 1), match='e'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('o', 'LATIN SMALL LETTER O', <re.Match object; span=(0, 1), match='o'>),
 (' ', 'SPACE', <re.Match object; span=(0, 1), match=' '>),
 ('İ',
  'LATIN CAPITAL LETTER I WITH DOT ABOVE',
  <re.Match object; span=(0, 1), match='İ'>),
 ('s', 'LATIN SMALL LETTER S', <re.Match object; span=(0, 1), match='s'>),
 ('t', 'LATIN SMALL LETTER T', <re.Match object; span=(0, 1), match='t'>),
 ('a', 'LATIN SMALL LETTER A', <re.Match object; span=(0, 1), match='a'>),
 ('n', 'LATIN SMALL LETTER N', <re.Match object; span=(0, 1), match='n'>),
 ('b', 'LATIN SMALL LETTER B', <re.Match object; span=(0, 1), match='b'>),
 ('u', 'LATIN SMALL LETTER U', <re.Match object; span=(0, 1), match='u'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>)]

14 characters, 14 matches. All right!

Then, Django calls strip, to free the string of extraneous whitespace:

In [24]: re.sub(r'[^\w\s-]', '', "Hello İstanbul").strip()
Out[24]: 'Hello İstanbul'

As expected, everything stays the same.

Now the last step in this very line is to lowercase the string:

In [26]: re.sub(r'[^\w\s-]', '', "Hello İstanbul").strip().lower()
Out[26]: 'hello i̇stanbul'

After that, the method replaces all spaces with a dash:

In [34]: re.sub(r'[-\s]+', '-', 'hello i̇stanbul')
Out[34]: 'hello-i̇stanbul'

And that's that.

3. Django's SlugField validates the value

This is where things "fail". The SlugField is given the result of the slugification: 'hello-i̇stanbul'.

Now the SlugField does it's validation.

If we take a look at the SlugField validator being used (with allow_unicode = True), we end up with the slug_unicode_re regular expression, as defined in django/core/validators.py.

# allow for unicode words (\w), dashes (-). \Z is for marking the end of the string.
slug_unicode_re = _lazy_re_compile(r'^[-\w]+\Z')

From here, things go awry. That regular expression does not match the output generated by slugify("Hello İstanbul").

In [30]: type(slug_unicode_re.match('hello-i̇stanbul'))
Out[30]: NoneType

Why does it fail? Let's repeat the small analysis we did before and remember that the lowercase version of İ consists of two characters:

# Just a quick reminder:
In [10]: [unicodedata.name(character) for character in unicodedata.normalize('NFKC', 'İ'.lower())]
Out[10]: ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

# Now the gory details:
In [36]: [(character, unicodedata.name(character), slug_unicode_re.match(character)) for character in 'hello-i̇stanbul']
Out[36]:
[('h', 'LATIN SMALL LETTER H', <re.Match object; span=(0, 1), match='h'>),
 ('e', 'LATIN SMALL LETTER E', <re.Match object; span=(0, 1), match='e'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>),
 ('o', 'LATIN SMALL LETTER O', <re.Match object; span=(0, 1), match='o'>),
 ('-', 'HYPHEN-MINUS', <re.Match object; span=(0, 1), match='-'>),
 ('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),

 # EUREKA!!
 ('̇', 'COMBINING DOT ABOVE', None),

 ('s', 'LATIN SMALL LETTER S', <re.Match object; span=(0, 1), match='s'>),
 ('t', 'LATIN SMALL LETTER T', <re.Match object; span=(0, 1), match='t'>),
 ('a', 'LATIN SMALL LETTER A', <re.Match object; span=(0, 1), match='a'>),
 ('n', 'LATIN SMALL LETTER N', <re.Match object; span=(0, 1), match='n'>),
 ('b', 'LATIN SMALL LETTER B', <re.Match object; span=(0, 1), match='b'>),
 ('u', 'LATIN SMALL LETTER U', <re.Match object; span=(0, 1), match='u'>),
 ('l', 'LATIN SMALL LETTER L', <re.Match object; span=(0, 1), match='l'>)]

The combining diacritic does not validate. Which is 100% correct. It should not! However, that is exactly why the slugification process "fails" in this scenario, because slugify() can be made to produce output that does not validate against slug_unicode_re.

Summary & Proposed Solution

Summing up: the culprit is the order in which slugify() executes the lowercasing. It does so, after cleaning unwanted characters away. However, as demonstrated, lowercasing can trigger the creation of unwanted characters. "Unwanted" meaning a character that can not be validated by the SlugField.

There are two ways to solve this:

Extend the slug_unicode_re to allow for combining diacritics. Or…
…change when the lowercasing happens.

Personally I am leaning towards option 2.

So this line in django/utils/text.py's slugify:

value = re.sub(r'[^\w\s-]', '', value).strip().lower()

becomes

value = re.sub(r'[^\w\s-]', '', value.lower()).strip()

Repeating the proposed solution with our demonstration:

In [37]: re.sub(r'[^\w\s-]', '', 'Hello İstanbul'.lower()).strip()
# Diacritic is now removed, hence we have the lower case latin i
Out[37]: 'hello istanbul'

In [38]: re.sub(r'[-\s]+', '-', 'hello istanbul')
Out[38]: 'hello-istanbul'

In [39]: slug_unicode_re.match('hello-istanbul')
Out[39]: <re.Match object; span=(0, 14), match='hello-istanbul'>

An even deeper dive in (Is Python doing this right?)

Thanks to Matt Westcott's suggestion that it could also be a bug in Python itself, I took a quick dive into the unicode plane myself.

The İ is the Latin Capital Letter I with Dot Above. It's codepoint is U+0130. According to the chart for the Latin Extended-A set, it's lowercase version is U+0069.
U+0069 lives in the C0 Controls and Basic Latin set. Lo and behold: it is the Latin small letter I. So a latin lowercase i.

Does this really mean that Python is doing something weird here by adding the Combining dot above?

In [8]: [unicodedata.name(character) for character in 'İ'.lower()]
Out[8]: ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

I'm in no way a unicode pro, so my assumption might be very naive and wrong.

ciumabok / django-wagtail-turkish-i Goto Github PK

django-wagtail-turkish-i's Introduction

Demo Project for "İstanbul" issue

Setup

The issue

Wagtail: Steps to reproduce in Webinterface

Expected behaviour

Analysis

Letter

Django/Wagtail

1. wagtail calls slugify()

2. django slugifies the title

3. Django's SlugField validates the value

Summary & Proposed Solution

An even deeper dive in (Is Python doing this right?)

django-wagtail-turkish-i's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent