scrapy / cssselect

CSS Selectors for Python

Home Page: https://cssselect.readthedocs.io/

License: Other

Python 100.00%

css python selectors hacktoberfest

cssselect's Issues

There are four different syntaxes for namespaces in element types and attribute names: ns|E, *|E, |E and just E. All four have a different meaning: http://www.w3.org/TR/css3-selectors/#typenmsp

cssselect currently parses *|E and E the same, and fails to parse |E.

Encoding Error using utf8

In the code, the alias in # coding: utf8 should be replaced with the actual encoding name: # -*- encoding: utf-8 -*-

Negation selector does not accept any selector as argument.

The following are valid CSS3 selectors which are rejected by cssselect:

:not(.foo, .bar)
:not(foo > bar)
:not(foo bar)
:not(:not(a))
:not(<any other selector>)

From the looks of it, disallowing nested selectors was an explicit choice, but the parser doesn't seem to like combinators such as > and , in the negation either, and just raises a syntax error.

cssselect 0.7: Test failures

All tests were passing in cssselect 0.6.1.
Some tests fail in cssselect 0.7.

$ PYTHONPATH="." python2.7 cssselect/tests.py -v
test_parse_errors (__main__.TestCssselect) ... ok
test_parser (__main__.TestCssselect) ... ok
test_pseudo_elements (__main__.TestCssselect) ... FAIL
test_quoting (__main__.TestCssselect) ... ok
test_select (__main__.TestCssselect) ... ok
test_select_shakespeare (__main__.TestCssselect) ... ok
test_series (__main__.TestCssselect) ... ok
test_specificity (__main__.TestCssselect) ... ok
test_tokenizer (__main__.TestCssselect) ... FAIL
test_translation (__main__.TestCssselect) ... ERROR
test_unicode (__main__.TestCssselect) ... ok
test_unicode_escapes (__main__.TestCssselect) ... ok

======================================================================
ERROR: test_translation (__main__.TestCssselect)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "cssselect/tests.py", line 378, in test_translation
    assert xpath(r'di\a0 v') == (
  File "cssselect/tests.py", line 297, in xpath
    return str(GenericTranslator().css_to_xpath(css, prefix=''))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 14: ordinal not in range(128)

======================================================================
FAIL: test_pseudo_elements (__main__.TestCssselect)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "cssselect/tests.py", line 175, in test_pseudo_elements
    assert parse_one('::before') == ('Element[*]', 'before')
  File "cssselect/tests.py", line 161, in parse_one
    result = parse_pseudo(css)
  File "cssselect/tests.py", line 155, in parse_pseudo
    assert pseudo is None or type(pseudo) is _unicode
AssertionError

======================================================================
FAIL: test_tokenizer (__main__.TestCssselect)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "cssselect/tests.py", line 63, in test_tokenizer
    "<EOF at 42>",
AssertionError

----------------------------------------------------------------------
Ran 12 tests in 0.155s

FAILED (failures=2, errors=1)

Doesn’t work on python 3.4

Hi, I’ve installed cssselect with pip and pip3 (on Ubuntu) and I can’t make it work with Python 3.4.

$ python 
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from cssselect import GenericTranslator, SelectorError
>>>

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from cssselect import GenericTranslator, SelectorError
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'GenericTranslator'
>>>

[feature-request] `:not()` to support generic selectors (not only "simple" ones)

The document (version 0.9.1) says:

:not() accepts a sequence of simple selectors, not just single simple selector. For example, :not(a.important[rel]) is allowed, even though the negation contains 3 simple selectors.

May I ask what is a simple selector? Can :not() support something like :not(a>b)?

>>> import cssselect
>>> cssselect.parse('a:not(p>a)')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/cssselect/parser.py", line 355, in parse
    return list(parse_selector_group(stream))
  File "/usr/lib/python3.4/site-packages/cssselect/parser.py", line 370, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "/usr/lib/python3.4/site-packages/cssselect/parser.py", line 378, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)
  File "/usr/lib/python3.4/site-packages/cssselect/parser.py", line 471, in parse_simple_selector
    raise SelectorSyntaxError("Expected ')', got %s" % (next,))
cssselect.parser.SelectorSyntaxError: Expected ')', got <DELIM '>' at 7>

cssselect can't work on firefox

Firefox is unlike other XPath implementations in that name() returns an upper-cased string. cssselect's translation of the nth-child selector (for example) uses "name() = 'foo'", which can never match because of this oddity. One workaround is to set HTMLTranslator.lower_case_element_names to False and write selectors like 'LI:nth-child(2)', which results in a working XPath for Firefox but won't work on any other XPath implementation.

I see two possible solutions:

Call lower-case() wherever name() is called.
Factor out the use of name() entirely, replacing [name() = 'foo'] with [self::foo].

Demonstration of the problem and solution here (the contrast between Chrome and Firefox is stark; IE behaves like Chrome):
http://fiddle.jshell.net/J7VrG/10/show/light/

Tokenizer corner cases

Now that the descendant selector bugs are fixed (unless I missed
something), the remaining issues that I see are:

1. The current tokenizer for Symbol uses something like the '\w' regex,
   while a CSS IDENT token can contain any non-ASCII character (including
   U+00A0 no-break space, for example), can have backslash-escapes, but
   cannot start with a digit.
2. Unicode white space (like U+00A0) counts as white space (either
   ignored or a descendant combinator) but should not (related to 1).
3. 2n+1 or similar strings (arguments to :nth-child()) are tokenized as
   Symbol objects, and are then accepted by the parser as element types,
   class names, IDs, etc.

I think that any valid (for CSS) selector that only uses ASCII without
backslash-escapes should be fine now, so maybe this is not really a
problem ...

Selector.__repr__ doesn't work as intended

a4b12ae#diff-adc3ae8f2cf8b1931771d84ca7af6275R82 - this does nothing because the pseudo_element variable is overwritten in the following if-else statements.

Non-ASCII pseudo-classes

Translating a selector with a non-ASCII pseudo-class causes UnicodeEncodeError on Python 2.x. This is because we are calling getattr() with a name based on the pseudo-class. No such pseudo-class exists, but such selectors should raise ExpressionError instead.

Drop Python 2.4 support

What do you think about dropping Python 2.4 support? It is true that Python 2.4 can still be used in some setups (like old Red Hat machines), but

Travis doesn't run Python 2.4 tests;
tox also can't run Python 2.4 tests

so there is no easy way to make sure cssselect works under Python 2.4, and this makes contributing to cssselect harder.

Exception on selectors without namespace

Problem

The CSS3 spec allows the namespace field to be left empty, which indicates
an element with no namespace attached. However, cssselect cannot handle
those selectors right now.

For example, suppose we have the following line:

GenericTranslator().css_to_xpath('|foo')

This causes the parser to raise an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cssselect/cssselect/xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "/cssselect/cssselect/parser.py", line 354, in parse
    return list(parse_selector_group(stream))
  File "/cssselect/cssselect/parser.py", line 367, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "/cssselect/cssselect/parser.py", line 375, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)
  File "/cssselect/cssselect/parser.py", line 475, in parse_simple_selector
    "Expected selector, got %s" % (peek,))
cssselect.parser.SelectorSyntaxError: Expected selector, got <DELIM '|' at 0>

Expected behaviour

cssselect should be able to handle a selector like |foo.

Note

Here is the related part from Selectors Level 3, §6.1.1:

ns|E
    elements with name E in namespace ns 
*|E
    elements with name E in any namespace, including those without a namespace 
|E
    elements with name E without a namespace 
E
    if no default namespace has been declared for selectors, this is equivalent to *|E.
    Otherwise it is equivalent to ns|E where ns is the default namespace.
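
The four forms quoted above can be told apart with very little logic. The following is a minimal sketch of that classification only, not cssselect's parser: split_type_selector and its default_namespace parameter are hypothetical names, and "*" is used here as an assumed marker for "any namespace".

```python
def split_type_selector(selector, default_namespace=None):
    """Split a type selector into (namespace, local_name).

    Hypothetical helper, not part of cssselect. The returned namespace is
    a prefix string, "*" for "any namespace", or None for "no namespace".
    """
    if "|" in selector:
        prefix, local_name = selector.split("|", 1)
        if prefix == "":
            # |E: elements named E without a namespace
            return (None, local_name)
        # ns|E or *|E: keep the prefix (or the wildcard) as written
        return (prefix, local_name)
    # Bare E: meaning depends on whether a default namespace was declared
    if default_namespace is None:
        return ("*", selector)
    return (default_namespace, selector)
```

With this shape, |E and *|E produce different results, which is the distinction the report asks for.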

*:first-of-type and friends are not implemented yet

From the docs:

*:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type. All of these work when you specify an element type, but not with *

Release cssselect with pseudo-elements improvements

This is a reminder asking to release 0.9 (or whatever version you prefer) with the recent pseudo-elements improvements.

I didn't want to bother you with this release until the work on Scrapy CSS selectors was ready to merge, but the unavailability on PyPI is inconvenient right now, as it makes Travis CI fail for the scrapy/scrapy#426 pull request.

Thanks!

[bug] parse() fails if :scope present in second element of selector list

As of v1.1.0, cssselect.parse() seems to have problems parsing the ":scope" pseudo-class if the input string is a selector list and ":scope" occurs in any clause besides the first one.

Ones that successfully parse as expected:

parse(":scope > th")
parse(":scope > th, td")
parse(":scope > th, table > td")

However, all of the following unexpectedly (at least to me) throw SelectorSyntaxError('Got immediate child pseudo-element ":scope" not at the start of a selector'):

parse("th, :scope > td")
parse("table > th, :scope > td")
parse(":scope > th, :scope > td")

(I'd submit a PR for this, but looking at the location of the error, I'm not familiar enough with the internals to suggest what the right thing to do is!).

element>element selector does not work relative to an element

In version 1.0.3 I get an exception when using cssselect on an element to select its direct children with
element > element (see https://www.w3schools.com/cssref/sel_element_gt.asp)

>>> from lxml import html
>>> html.fromstring('<html><body><div class="parent"><div class="child"><div class="child"></div></div></div></body></html>')
<Element html at 0x7feadf137d08>
>>> tree=html.fromstring('<html><body><div class="parent"><div class="child"><div class="child"></div></div></div></body></html>')
>>> tree.cssselect('div.parent')
[<Element div at 0x7feadf137e10>]
>>> tree.cssselect('div.parent')[0].cssselect('> .child')
*** SelectorSyntaxError: Expected selector, got <DELIM '>' at 0>

In version 0.9.1 the following worked without raising an exception; however, it leads to an unexpected result, since the second div.child is not a direct child of div.parent:

>>> tree=html.fromstring('<html><body><div class="parent"><div class="child"><div class="child"></div></div></div></body></html>')
# works but should return only one element
>>> tree.cssselect('div.parent')[0].cssselect('> .child')
[<Element div at 0x7fa6e973def0>, <Element div at 0x7fa6e973dfb0>]

The > combinator only works when the parent selector is given in the selector:

>>> tree.cssselect('div.parent > .child')
[<Element div at 0x7fa6e973de90>]

A) Is it a regression that element.cssselect('> .child') raises an exception on recent versions?

B) Is there a way to select a direct child given the parent element?
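
Regarding B), one possible approach is to query with a path that starts at the element itself. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the example is self-contained; the report itself uses lxml with cssselect), and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Same markup as in the report, parsed with the stdlib instead of lxml.
root = ET.fromstring(
    '<html><body><div class="parent"><div class="child">'
    '<div class="child"></div></div></div></body></html>'
)
parent = root.find(".//div[@class='parent']")

# "./" restricts the search to direct children of `parent`.
direct_children = parent.findall("./div[@class='child']")

# ".//" searches all descendants, which also finds the nested div.
all_descendants = parent.findall(".//div[@class='child']")
```

Here direct_children holds one element and all_descendants holds two. In cssselect versions that parse :scope (another report on this page indicates v1.1.0 does for a leading ":scope >"), a selector such as ':scope > .child' may express the same intent, though that is not verified here.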

Unable to parse selector with escaped characters.

To select an element with class "width-3:4" one must escape the ':' as per http://www.w3.org/International/questions/qa-escapes

However, this raises an error:

>>> GenericTranslator().css_to_xpath('.width-3\3a 4')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/data/devel/cssselect/cssselect/xpath.py", line 165, in css_to_xpath
    selectors = parse(css)
  File "/data/devel/cssselect/cssselect/parser.py", line 313, in parse
    return list(parse_selector_group(stream))
  File "/data/devel/cssselect/cssselect/parser.py", line 328, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "/data/devel/cssselect/cssselect/parser.py", line 336, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)
  File "/data/devel/cssselect/cssselect/parser.py", line 446, in parse_simple_selector
    "Expected selector, got %s" % (peek,))
SelectorSyntaxError: Expected selector, got <DELIM '' at 8>
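
One thing worth checking in the report above: the selector is written as a normal (non-raw) Python string, in which \3 is an octal escape, so the parser never sees a backslash at position 8. The sketch below shows the difference and how a CSS hex escape is meant to decode. unescape_css_ident is a hypothetical helper, not cssselect's tokenizer, and whether the installed cssselect version handles escapes at all is not something this example establishes.

```python
import re

# A backslash followed by 1-6 hex digits and one optional whitespace
# character, or a backslash followed by any single character.
_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})[ \t\r\n\f]?|\\(.)", re.DOTALL)


def unescape_css_ident(text):
    """Decode CSS backslash escapes (hypothetical helper)."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))  # hex escape -> code point
        return match.group(2)                    # escaped literal character
    return _ESCAPE.sub(replace, text)


plain = '.width-3\3a 4'   # Python turns \3 into the control character U+0003
raw = r'.width-3\3a 4'    # the raw string keeps the backslash for the CSS parser
```

With the raw string, the escape \3a followed by a space decodes to ":", giving the class name width-3:4.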

A nasty bug lies within the library

I have a crawler based on cssselect (1.0.0). My customer had been complaining for quite some time about not getting all the news he wanted.
I had a closer look and found something peculiar.

If you want to reproduce the error, simply go to this URL:
http://www.bbc.com/news/uk-england-40465494
and try this selector:
.story-body__inner > p

The purpose of this selector is to get the news body. When it gets to the following line:

Mr Paget-Brown resigned following sustained criticism of the council and an aborted meeting of its cabinet on Thursday, from which leaders had tried to ban members of the public and press.

it only gets

Mr Paget-Brown

You can take a look at this picture, which I got from the crawler:
http://imgur.com/a/4UcMb

I have tested my selector using FirePath in Firefox.

:contains("text") NOT :contains(text)

Took me a while to figure this out, but in jQuery the quotes are not required. Not sure if this is a bug or a feature :)

CSS :lang() and XPath lang()

XPath 1.0 actually has a lang() function! http://www.w3.org/TR/xpath/#function-lang
It would probably be a more efficient way to implement the :lang() selector, but would only work for XML, not HTML.

Test against Python 3.7

We should run tests against Python 3.7 and declare its support in setup.py.

Whitespace in series

The grammar for series (like 2n+1, as accepted by :nth-child() and friends) is

nth
  : S* [ ['-'|'+']? INTEGER? {N} [ S* ['-'|'+'] S* INTEGER ]? |
         ['-'|'+']? INTEGER | {O}{D}{D} | {E}{V}{E}{N} ] S*
  ;

Currently, any whitespace will be rejected by the parser, as it expects a single token followed by )
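
To illustrate where the grammar above permits whitespace, here is a minimal regex-based sketch that turns a series into an (a, b) pair. parse_series is a hypothetical standalone function, not the cssselect parser, which works on tokens rather than on the raw string.

```python
import re

_SERIES_RE = re.compile(
    r"""
    \s*
    (?:
        (?P<a>[-+]?\d*)n                        # An: optional sign and integer, then n
        (?:\s*(?P<sign>[-+])\s*(?P<b>\d+))?     # optional signed B, whitespace allowed
      |
        (?P<int>[-+]?\d+)                       # a plain integer
      |
        (?P<kw>odd|even)
    )
    \s*\Z
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_series(text):
    """Return (a, b) for an An+B series (hypothetical helper)."""
    match = _SERIES_RE.match(text)
    if match is None:
        raise ValueError("invalid series: %r" % (text,))
    keyword = match.group("kw")
    if keyword is not None:
        return (2, 1) if keyword.lower() == "odd" else (2, 0)
    if match.group("int") is not None:
        return (0, int(match.group("int")))
    a_text = match.group("a")
    if a_text in ("", "+"):
        a = 1
    elif a_text == "-":
        a = -1
    else:
        a = int(a_text)
    b = 0
    if match.group("b") is not None:
        b = int(match.group("b"))
        if match.group("sign") == "-":
            b = -b
    return (a, b)
```

Note that whitespace is accepted around the whole series and around the sign before B, but not between the integer and n, which follows the quoted grammar.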

Invalid exception in parser.tokenize

Here: https://github.com/SimonSapin/cssselect/blob/master/cssselect/parser.py#L687 next_pos is undefined so NameError will be raised instead of SelectorSyntaxError.

CSS selector finds nothing with invalid HTML

Since this example has invalid HTML, feel free to ignore this issue.

Anyway, here it is (simplified from http://www.weheart.co.uk/2013/02/18/alley-oop-design-exhibition/):

import cssselect
import lxml.html

d = lxml.html.document_fromstring('''
<!DOCTYPE html>
<html/>
<body></body>
''')

t = cssselect.HTMLTranslator()

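# Default prefix: searches relative to the context node
print(d.xpath(t.css_to_xpath('body')))
# Explicit '//' prefix: searches the whole document
print(d.xpath(t.css_to_xpath('body', prefix='//')))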

Just a bit unexpected that the first XPath query doesn't find anything.

Move docs to readthedocs

The procedure to build and upload docs to https://pythonhosted.org/cssselect/ used to be:

(pip install sphinx)
python setup.py build_sphinx
python setup.py upload_sphinx

But now you get:

$ python setup.py upload_sphinx
running upload_sphinx
Submitting documentation to https://upload.pypi.org/legacy/
Upload failed (410): Uploading documentation is no longer supported, we recommend using https://readthedocs.org/.

For 1.0.0, I had to manually create the zip file from the docs _build/html folder and upload it with PyPI's web interface.

Drop Python 3.1 support

What do you think about dropping Python 3.1 support?

I doubt anybody uses Python 3.1 in practice;
Travis can't run Python 3.1 tests;
tox also doesn't support Python 3.1 and can't run cssselect tests under Python 3.1.

cssselect didn't handle value in integer

As mentioned in gawel/pyquery#6,
cssselect fails to convert 'option[value=1]' properly.
I tested and modified parser.py:536, which only accepts 'IDENT' and 'STRING' tokens.
I added 'NUMBER' to it, and it does not seem to cause any problems.

Selector has problems when getting to paragraphs nested with href

This follows up on issue #77.

I still think the selector cannot handle valid selectors in some situations.

Reason for opening another issue: the last one was closed without the whole issue having been read thoroughly!

__str__ and __repr__ shouldn't return unicode in Python 2.x

Hi,

Just noticed that __str__ and __repr__ methods return unicode here: https://github.com/SimonSapin/cssselect/blob/master/cssselect/xpath.py#L41.

This is incorrect in Python 2.x because unfortunately these methods must return bytestrings in Python 2.x. Returning unicode will cause many kinds of problems with non-ascii text.

I haven't checked other __str__ and __repr__ methods in cssselect.

Web Scraping Youtube Playlist Information

I'm trying to extract information from YouTube, but when I try to parse an element:

from requests_html import HTMLSession

URL = "https://www.youtube.com/user/Urbanroosters/playlists"
with HTMLSession() as session:
    request = session.get(URL)

# The argument below is an HTML fragment, not a CSS selector
body = request.html.find('div id="items" class="style-scope ytd-grid-renderer"><ytd-grid-playlist-renderer class="style-scope ytd-grid-renderer" lockup=""')

the following error is displayed:

Traceback (most recent call last):

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "", line 7, in
    body = request.html.find('div id="items" class="style-scope ytd-grid-renderer"><ytd-grid-playlist-renderer class="style-scope ytd-grid-renderer" lockup=""')

  File "/Users/JT/opt/anaconda3/lib/python3.7/site-packages/requests_html.py", line 212, in find
    for found in self.pq(selector)

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/pyquery/pyquery.py", line 300, in __call__
    result = self._copy(*args, parent=self, **kwargs)

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/pyquery/pyquery.py", line 286, in _copy
    return self.__class__(*args, **kwargs)

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/pyquery/pyquery.py", line 271, in __init__
    xpath = self._css_to_xpath(selector)

  File "/Users/JGBT/opt/anaconda3/lib/python3.7/site-packages/pyquery/pyquery.py", line 282, in _css_to_xpath
    return self._translator.css_to_xpath(selector, prefix)

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/cssselect/xpath.py", line 192, in css_to_xpath
    for selector in parse(css))

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/cssselect/parser.py", line 415, in parse
    return list(parse_selector_group(stream))

  File "/Users/JGBT/opt/anaconda3/lib/python3.7/site-packages/cssselect/parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/cssselect/parser.py", line 454, in parse_selector
    next_selector, pseudo_element = parse_simple_selector(stream)

  File "/Users/JGTB/opt/anaconda3/lib/python3.7/site-packages/cssselect/parser.py", line 545, in parse_simple_selector
    "Expected selector, got %s" % (peek,))

  File "", line unknown
SelectorSyntaxError: Expected selector, got <DELIM '=' at 6>

HTML :enabled and :disabled are not quite conformant

These should match :enabled, but currently do not:

li elements that are children of menu elements, and that have a child element that defines a command, if the first such element's Disabled State facet is false (not disabled)

(Similarly for :disabled with Disabled State facet is true (disabled))

Form elements should be considered disabled

... if its disabled attribute is set, or if it is a descendant of a fieldset element whose disabled attribute is set and is not a descendant of that fieldset element's first legend element child, if any.

The last part was skipped, so the current implementation is:

... if its disabled attribute is set, or if it is a descendant of a fieldset element whose disabled attribute is set
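
The skipped part of the rule (the exemption for descendants of the fieldset's first legend child) can be sketched outside of XPath as follows. This is an illustration of the rule as quoted, using the standard library's ElementTree. is_disabled and parent_of are hypothetical names, and this is not how cssselect implements :disabled.

```python
import xml.etree.ElementTree as ET


def is_disabled(element, parent_of):
    """Apply the quoted rule to one form element (hypothetical helper).

    parent_of maps each element to its parent, since ElementTree
    elements do not keep a reference to their parent.
    """
    if "disabled" in element.attrib:
        return True
    child = element
    ancestor = parent_of.get(child)
    while ancestor is not None:
        if ancestor.tag == "fieldset" and "disabled" in ancestor.attrib:
            first_legend = ancestor.find("legend")  # first legend child, if any
            # Exempt only when the path to this fieldset goes through
            # its first legend child.
            if child is not first_legend:
                return True
        child = ancestor
        ancestor = parent_of.get(child)
    return False


root = ET.fromstring(
    "<form><fieldset disabled='disabled'>"
    "<legend><input name='a'/></legend>"
    "<input name='b'/>"
    "</fieldset></form>"
)
parent_of = {child: parent for parent in root.iter() for child in parent}
input_a, input_b = root.iter("input")
```

In this example input_a sits inside the first legend and is therefore not disabled, while input_b is.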

Google fonts @import url() raises SelectorSyntaxError in parse_simple_selector

Got this error: Expected selector, got <NUMBER '700' at 0>

I think it was caused by this line in my css:
@import url('https://fonts.googleapis.com/css?family=Lato:400,700,400italic|Signika:400,700');

I'm using premailer which uses your library.

The PyPI tarball doesn't include tests

I guess MANIFEST.in needs to be updated, like https://github.com/scrapy/parsel/blob/master/MANIFEST.in

Incorrect use of XPath name() function

The use of the name() function for matching tags breaks with documents that have a default namespace or multiple namespace prefixes mapping to the same namespace.

For example,

The CSS selector

h|p + h|p

becomes

descendant-or-self::h:p/following-sibling::*[name() = 'h:p' and (position() = 1)]

When this query is run on an XHTML document it will produce no matches, because the name() function returns "p". Similarly, if it is run on a document that defines the XHTML namespace with a prefix other than h, it will fail.

A possible solution is to have the css_to_xpath function take a namespaces argument that contains a mapping of prefixes to URIs and then use local-name() and namespace-uri() instead of name(). The argument can default to None, in which case it can use the present behavior, for backward compatibility.
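
A minimal sketch of the proposed idea, limited to building the predicate string: element_test and its parameters are hypothetical names chosen for this illustration, not an existing cssselect API, and string quoting of the values is ignored for brevity.

```python
def element_test(prefix, local_name, namespaces=None):
    """Build an XPath predicate that matches an element by name.

    Hypothetical helper. With namespaces=None it keeps the present
    name()-based behaviour; otherwise it compares the local name and
    the namespace URI, which does not depend on the document's prefixes.
    """
    if namespaces is None:
        qname = "%s:%s" % (prefix, local_name) if prefix else local_name
        return "name() = '%s'" % qname
    uri = namespaces[prefix]  # mapping of prefix -> namespace URI
    return "local-name() = '%s' and namespace-uri() = '%s'" % (local_name, uri)


xhtml = {"h": "http://www.w3.org/1999/xhtml"}
old_style = element_test("h", "p")
new_style = element_test("h", "p", namespaces=xhtml)
```

The second form matches the element whether the document declares XHTML as its default namespace or binds it to some other prefix.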

See http://lenzconsulting.com/namespaces-in-xslt/#perils_of_the_name_function for more details on the problems caused by using the name() function.

Support for relational pseudo-class :has()

CSS Selectors Level 4 (still in draft) introduces the :has() pseudo-class:

The relational pseudo-class, :has(), is a functional pseudo-class taking a relative selector list as an argument. It represents an element if any of the relative selectors, when absolutized and evaluated with the element as the :scope elements, would match at least one element.

For example, the following selector matches only <a> elements that contain an <img> child:
a:has(> img)
The following selector matches a <dt> element immediately followed by another <dt> element:
dt:has(+ dt)

Although no browser seems to support this yet, it looks like it is here to stay (I may be wrong).

It would be interesting to support this to get a bit more flexibility on predicates (e.g. testing children elements).

Project maintenance

I’m not really interested in cssselect anymore. I think the approach of "translating" selectors to XPath is fundamentally flawed (see #12 for example). I’ve started cssselect2, which implements Selectors "for real", but it’s blocked on deciding what kind of tree it works on.

I’ve also kind of moved on from Python; I mostly work with Rust nowadays.

Still, some people seem to be interested in cssselect. @redapple, @Dobz, @kmike, @bukzor, @sjp, @kovidgoyal, or anyone, would you be interested in maintaining it? I can give push access to this repository and to PyPI.

Some files aren't recorded into plist with --record option

The FreeBSD port is failing:

===> Checking for items in STAGEDIR missing from pkg-plist
Error: Orphaned: %%PYTHON_SITELIBDIR%%/cssselect/__init__.pyc
Error: Orphaned: %%PYTHON_SITELIBDIR%%/cssselect/parser.pyc
Error: Orphaned: %%PYTHON_SITELIBDIR%%/cssselect/xpath.pyc
===> Checking for items in pkg-plist which are not in STAGEDIR

These files aren't recorded by --record.

Version 0.9.1

Missing v0.9 tag

A note about v0.9 tag missing in public github repo:

[attr~='']

http://www.w3.org/TR/selectors/#attribute-selectors

[att~=val]
Represents an element with the att attribute whose value is a whitespace-separated list of words, one of which is exactly "val". If "val" contains whitespace, it will never represent anything (since the words are separated by spaces). Also if "val" is the empty string, it will never represent anything.

The empty-string or whitespace-only cases are not implemented. Similar issues for other attribute operators.
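
The quoted rule for [att~=val] can be written down directly. This is a sketch of the matching semantics only, assuming the CSS definition of whitespace (space, tab, line feed, carriage return, form feed). attribute_includes is a hypothetical helper, not part of cssselect.

```python
import re

# Whitespace as defined by CSS, which is narrower than Python's str.split().
_CSS_WHITESPACE = re.compile(r"[ \t\r\n\f]+")


def attribute_includes(attribute_value, val):
    """Return True if [att~=val] should match (hypothetical helper).

    attribute_value is the attribute's value, or None if it is absent.
    """
    if attribute_value is None:
        return False
    # An empty val, or a val containing whitespace, never matches.
    if val == "" or _CSS_WHITESPACE.search(val):
        return False
    return val in _CSS_WHITESPACE.split(attribute_value)
```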

:empty is true on white-space only

An element containing only whitespace text is not empty and should not match :empty, but it does in cssselect.
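
As a reference for the expected behaviour, here is a minimal check written against the standard library's ElementTree. is_empty is a hypothetical helper, and the sketch assumes the default parser, which discards comments.

```python
import xml.etree.ElementTree as ET


def is_empty(element):
    """True if the element has no child elements and no text at all.

    Whitespace-only text still counts as text, so such an element
    is not empty.
    """
    return len(element) == 0 and not element.text


truly_empty = ET.fromstring("<p></p>")
whitespace_only = ET.fromstring("<p> </p>")
with_child = ET.fromstring("<p><b/></p>")
```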

backslash in attribute

Support :nth-child(An+B of S)

The current CSS 4 draft has :nth-child(An+B [of S]? ), extending :nth-child(An+B)

The :nth-child(An+B [of S]? ) pseudo-class notation represents the An+Bth element that matches the selector list S among its inclusive siblings.
The CSS Syntax Module [CSS3SYN] defines the An+B notation. If S is omitted, it defaults to *.

By passing a selector argument, we can select the Nth element that matches that selector. For example, the following selector matches the first three “important” list items, denoted by the .important class:
:nth-child(-n+3 of li.important)

Example in docs wrong

Was just fiddling with the cssselect 0.2 package from PyPI. Noticed that the docs
claim a result that doesn't appear to be current/correct. In docs:

>>> from cssselect import css_to_xpath
>>> expression = css_to_xpath('div.content')
>>> expression
"descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]"

What I got:

u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]"

(Differences: the u prefix for a unicode string and the added "@class and" expression.)

ID selector syntax

The spec’ed syntax for ID selectors is # followed by an identifier, not any hash token. See the "type" flag on hash tokens in css-syntax.

:nth-child incorrect with + or ~

As noted in #4, the current implementation of :nth-child and related selectors is incorrect when used after a + or ~ combinator: the selector e ~ f:nth-child(3) is translated to XPath e/following-sibling::*[name() = 'f' and (position() = 3)] which is wrong: it finds the 3rd element after e, not the third child of its parent.

Test case:

diff --git a/cssselect/tests.py b/cssselect/tests.py
index 796537b..d1dc9fa 100755
--- a/cssselect/tests.py
+++ b/cssselect/tests.py
@@ -516,7 +516,8 @@ class TestCssselect(unittest.TestCase):
         assert pcss(':lang("EN")', '*:lang(en-US)', html_only=True) == [
             'second-li', 'li-div']
         assert pcss(':lang("e")', html_only=True) == []
-        assert pcss('li:nth-child(3)') == ['third-li']
+        assert pcss('li:nth-child(3)',
+                    '#first-li ~ :nth-child(3)') == ['third-li']
         assert pcss('li:nth-child(10)') == []
         assert pcss('li:nth-child(2n)', 'li:nth-child(even)',
                     'li:nth-child(2n+0)') == [

Support for "matches-any" pseudo-class :matches()

CSS Selectors Level 4 (draft) proposes the :matches() pseudo-class (apparently inspired by :any() already supported in some browsers)

Example from https://www.sitepoint.com/css-selectors-level-4-the-path-to-css4/

div:matches(.active, .visible, #main)

There can be a need for "or" relations in XPath predicates, and :matches() could allow that.

import lxml.html

html_fragment = lxml.html.fromstring("""
<div>
    <p>First</p>
    <p>Second</p>
    <p>Second Last</p>
    <p>Last</p>
</div>
""")

for element in html_fragment.cssselect("div > p:nth-last-child(1)"):
    print(element.text_content())

print()

for element in html_fragment.cssselect("div > p:nth-last-child(2n)"):
    print(element.text_content())

Output:

Second Last

Second
Last

Expected output:

Last

First
Second Last
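
The expected output above follows from counting sibling positions from the end. A minimal sketch of that counting, using the standard library's ElementTree instead of lxml (nth_last_child is a hypothetical helper, not part of cssselect or lxml):

```python
import xml.etree.ElementTree as ET


def nth_last_child(parent, a, b):
    """Children of parent matching :nth-last-child(an+b), in document order."""
    children = list(parent)
    total = len(children)
    matched = []
    for index, child in enumerate(children):
        position = total - index  # 1-based position counted from the end
        offset = position - b
        if a == 0:
            ok = offset == 0
        else:
            # position == a*n + b must hold for some integer n >= 0
            ok = offset % a == 0 and offset // a >= 0
        if ok:
            matched.append(child)
    return matched


div = ET.fromstring(
    "<div><p>First</p><p>Second</p><p>Second Last</p><p>Last</p></div>"
)
last = [p.text for p in nth_last_child(div, 0, 1)]   # :nth-last-child(1)
even = [p.text for p in nth_last_child(div, 2, 0)]   # :nth-last-child(2n)
```

Counted from the end, "Last" is at position 1, while "Second Last" and "First" are at positions 2 and 4.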