philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.

Home Page: https://hex.pm/packages/floki

License: MIT License

Languages: Elixir 98.62%, Erlang 1.34%, Shell 0.01%, HTML 0.03%
Topics: elixir, html-parser, css-selector, floki, html5ever, fast-html, myhtml, erlang, css-selectors, hacktoberfest

floki's Introduction

Floki is a simple HTML parser that enables search for nodes using CSS selectors.

Check the documentation 📙.

Usage

Take this HTML as an example:

<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <span class="headline">Enables search using CSS selectors</span>
    <a href="https://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
  <a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>

Here are some queries that you can perform (with return examples):

{:ok, document} = Floki.parse_document(html)

Floki.find(document, "p.headline")
# => [{"p", [{"class", "headline"}], ["Floki"]}]

document
|> Floki.find("p.headline")
|> Floki.raw_html
# => <p class="headline">Floki</p>

Each HTML node is represented by a tuple like:

{tag_name, attributes, children_nodes}

Example of node:

{"p", [{"class", "headline"}], ["Floki"]}

So even if the only child node is the element text, it is represented inside a list.
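
Because a node is a plain tuple, ordinary pattern matching is enough to take it apart. The following is a minimal sketch using only the Elixir standard library; the variable names are illustrative and not part of Floki:

```elixir
# A node in Floki's {tag_name, attributes, children_nodes} format.
headline_node = {"p", [{"class", "headline"}], ["Floki"]}

# Destructure the three elements with a pattern match.
{tag_name, attributes, children} = headline_node

# Attributes are {name, value} pairs; List.keyfind/3 looks one up by name.
{"class", class_value} = List.keyfind(attributes, "class", 0)

IO.inspect(tag_name)     # prints "p"
IO.inspect(class_value)  # prints "headline"
IO.inspect(children)     # prints ["Floki"]
```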

Installation

Add Floki to your mix.exs:

defp deps do
  [
    {:floki, "~> 0.36.0"}
  ]
end

After that, run mix deps.get.

If you are running on Livebook or a script, you can install with Mix.install/2:

Mix.install([
  {:floki, "~> 0.36.0"}
])

You can check the changelog for changes.

Dependencies

Floki needs the :leex module in order to compile. Normally this module is installed with Erlang in a complete installation.

If you get the "module :leex is not available" error message, you need to install the erlang-dev and erlang-parsetools packages in order to get the :leex module. The package names may differ depending on your OS.
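
One way to check up front whether :leex is present is to ask the runtime to load it. This is a small sketch using Code.ensure_loaded?/1 from the Elixir standard library; the variable name and the messages are illustrative:

```elixir
# :leex ships with Erlang's parsetools application. Code.ensure_loaded?/1
# returns true when the module can be loaded on this installation.
leex_available = Code.ensure_loaded?(:leex)

if leex_available do
  IO.puts(":leex is available, so Floki should be able to compile")
else
  IO.puts(":leex is missing, install the Erlang parsetools package for your OS")
end
```

It can be run with `elixir` as a script or pasted into `iex`.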

Alternative HTML parsers

By default Floki uses a patched version of mochiweb_html for parsing fragments due to its ease of installation (it's written in Erlang and has no outside dependencies).

However one might want to use an alternative parser due to the following concerns:

  • Performance - mochiweb_html can be up to 20 times slower than the alternatives on big HTML documents.
  • Correctness - In some cases mochiweb_html produces results that differ from what the HTML5 specification requires. For example, a correct parser would parse <title> <b> bold </b> text </title> as {"title", [], [" <b> bold </b> text "]}, since content inside <title> is treated as plain text, whereas mochiweb_html parses it as {"title", [], [{"b", [], [" bold "]}, " text "]}.

Floki supports the following alternative parsers:

  • fast_html - A wrapper for lexbor, a pure C HTML parser.
  • html5ever - A wrapper for html5ever written in Rust, developed as a part of the Servo project.

fast_html is generally faster, according to the benchmarks conducted by its developers.

You can perform a benchmark by running the following:

$ sh benchs/extract.sh
$ mix run benchs/parse_document.exs

Extracting the files is needed only once.

Using html5ever as the HTML parser

This dependency is written with a NIF using Rustler, but you don't need to install anything to compile it thanks to RustlerPrecompiled.

defp deps do
  [
    {:floki, "~> 0.36.0"},
    {:html5ever, "~> 0.15.0"}
  ]
end

Run mix deps.get and compile the project with mix compile to make sure it works.

Then you need to configure your app to use html5ever:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.Html5ever

Notice that you can pass the HTML parser as an option in parse_document/2 and parse_fragment/2.
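
As a sketch of the per-call form, the parser module can be given through the :html_parser option. The example uses Floki.HTMLParser.Mochiweb, the default parser, so that it runs without extra dependencies; replacing it with Floki.HTMLParser.Html5ever assumes the html5ever package has been installed as described above:

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

html = ~s(<p class="headline">Floki</p>)

# The :html_parser option selects the parser for this call only,
# regardless of what is set in config/config.exs.
{:ok, fragment} = Floki.parse_fragment(html, html_parser: Floki.HTMLParser.Mochiweb)

IO.inspect(fragment)
# => [{"p", [{"class", "headline"}], ["Floki"]}]
```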

Using fast_html as the HTML parser

A C compiler, GNU Make and CMake need to be installed on the system in order to compile lexbor.

First, add fast_html to your dependencies:

defp deps do
  [
    {:floki, "~> 0.36.0"},
    {:fast_html, "~> 2.0"}
  ]
end

Run mix deps.get and compile the project with mix compile to make sure it works.

Then you need to configure your app to use fast_html:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.FastHtml

More about Floki API

To parse an HTML document, try:

html = """
  <html>
  <body>
    <div class="example"></div>
  </body>
  </html>
"""

{:ok, document} = Floki.parse_document(html)
# => {:ok, [{"html", [], [{"body", [], [{"div", [{"class", "example"}], []}]}]}]}

To find elements with the class example, try:

Floki.find(document, ".example")
# => [{"div", [{"class", "example"}], []}]

To convert your node tree back to raw HTML (spaces are ignored):

document
|> Floki.find(".example")
|> Floki.raw_html
# =>  <div class="example"></div>

To fetch some attribute from elements, try:

Floki.attribute(document, ".example", "class")
# => ["example"]

You can get attributes from elements that you already have:

document
|> Floki.find(".example")
|> Floki.attribute("class")
# => ["example"]

If you want to get the text from an element, try:

document
|> Floki.find(".headline")
|> Floki.text

# => "Floki"

Supported selectors

Here you find all the CSS selectors supported in the current version:

Pattern                 Description
*                       any element
E                       an element of type E
E[foo]                  an E element with a "foo" attribute
E[foo="bar"]            an E element whose "foo" attribute value is exactly equal to "bar"
E[foo~="bar"]           an E element whose "foo" attribute value is a list of whitespace-separated values, one of which is exactly equal to "bar"
E[foo^="bar"]           an E element whose "foo" attribute value begins exactly with the string "bar"
E[foo$="bar"]           an E element whose "foo" attribute value ends exactly with the string "bar"
E[foo*="bar"]           an E element whose "foo" attribute value contains the substring "bar"
E[foo|="en"]            an E element whose "foo" attribute has a hyphen-separated list of values beginning (from the left) with "en"
E:nth-child(n)          an E element, the n-th child of its parent
E:nth-last-child(n)     an E element, the n-th child of its parent, counting from the last one
E:first-child           an E element, first child of its parent
E:last-child            an E element, last child of its parent
E:nth-of-type(n)        an E element, the n-th child of its type among its siblings
E:nth-last-of-type(n)   an E element, the n-th child of its type among its siblings, counting from the last one
E:first-of-type         an E element, first child of its type among its siblings
E:last-of-type          an E element, last child of its type among its siblings
E:checked               an E element (checkbox, radio, or option) that is checked
E:disabled              an E element (button, input, select, textarea, or option) that is disabled
E.warning               an E element whose class is "warning"
E#myid                  an E element with ID equal to "myid" (for ids containing periods, use #my\\.id or [id="my.id"])
E:not(s)                an E element that does not match simple selector s
:root                   the root node or nodes (in case of fragments) of the document. Most of the time this is the html tag
E F                     an F element descendant of an E element
E > F                   an F element child of an E element
E + F                   an F element immediately preceded by an E element
E ~ F                   an F element preceded by an E element

There are also some selectors based on non-standard specifications. They are:

Pattern                 Description
E:fl-contains('foo')    an E element that contains "foo" inside a text node
E:fl-icontains('foo')   an E element that contains "foo" inside a text node (case insensitive)
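
A few of the patterns above can be tried on a small tree. This sketch builds the tree by hand in Floki's tuple format (mirroring part of the usage example) so that the results do not depend on the parser; the variable names are illustrative:

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

# A hand-written tree in the {tag_name, attributes, children_nodes} format.
tree = [
  {"section", [{"id", "content"}],
   [
     {"p", [{"class", "headline"}], ["Floki"]},
     {"span", [{"class", "headline"}], ["Enables search using CSS selectors"]},
     {"a", [{"href", "https://github.com/philss/floki"}], ["Github page"]},
     {"span", [{"data-model", "user"}], ["philss"]}
   ]},
  {"a", [{"href", "https://hex.pm/packages/floki"}], ["Hex package"]}
]

# E[foo^="bar"]: attribute value begins with a given string.
hex_links = Floki.find(tree, ~s(a[href^="https://hex.pm"]))

# E > F combined with E:first-child.
first_paragraph = Floki.find(tree, "section > p:first-child")

# E:not(s): spans that do not carry the "headline" class.
plain_spans = Floki.find(tree, "span:not(.headline)")

IO.inspect(hex_links)
IO.inspect(first_paragraph)
IO.inspect(plain_spans)
```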

Suppressing log messages

Floki may log debug messages related to problems in the parsing of selectors, or parsing of the HTML tree. It also may log some "info" messages related to deprecated APIs. If you want to suppress these log messages, please consider setting the :compile_time_purge_matching option for :logger in your compile time configuration.

See https://hexdocs.pm/logger/Logger.html#module-compile-configuration for details.
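
A configuration along these lines follows the pattern shown in the Logger documentation. It is a sketch, assuming that matching on the :application metadata is what you want; the dependency will likely need to be recompiled (for example with mix deps.compile floki --force) before the setting takes effect:

```elixir
# in config/config.exs
import Config

# Purge, at compile time, Logger calls made from the :floki application.
config :logger,
  compile_time_purge_matching: [
    [application: :floki]
  ]
```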

Special thanks

License

Copyright (c) 2014 Philip Sampaio Silva

Floki is under MIT license. Check the LICENSE file for more details.

floki's People

Contributors

buhman, carlosfrodrigues, danhuynhdev, davydog187, dependabot-preview[bot], dependabot[bot], donaldducky, fcapovilla, francois2metz, glaucocustodio, gmile, grych, jjcarstens, josevalim, kianmeng, lowks, mmmries, nirev, philss, richmorin, samhamilton, steffende, sweetmnm, vict0rynox, vikeri, viniciusmuller, vittoriabitton, wlsf, wojtekmach, ypconstante

floki's Issues

How to filter out comments?

Hi,

I'd like to filter out comments, how do I do it? I tried

Floki.filter_out(doc, :comment)

But that doesn't work.
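
For reference, in the Floki release documented in the README above, Floki.filter_out/2 accepts :comment as a node type. A minimal sketch on a hand-built tree (the tree itself is illustrative):

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

# Comments are represented as {:comment, text} nodes in the tree.
tree = [{"div", [], [{:comment, " a comment "}, "visible text"]}]

without_comments = Floki.filter_out(tree, :comment)

IO.inspect(without_comments)
# => [{"div", [], ["visible text"]}]
```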

Improve search by id

Today the search by id is using a generic lookup algo that searches inside all the nodes.
The search by id can be optimized by stopping the lookup in the first match of element.

DOM tree to HTML

Hi guys, if I do Floki.find(html, "#wrapper") for instance I very likely will get a lot of sub DOM elements, can't I convert it to HTML? I just saw Floki.text, but I don't want just text.

Something like:
Floki.find(html, "#wrapper") |> Floki.raw_html
Result:
<div id="wrapper">... children DOM here ...</div>

A deep version of Floki.text/1?

First off, I'm a big fan of this library - I really like the approach it's taking.

I'm using it to simplify my assertions in Phoenix Controller tests, and I find that a version of Floki.text/1 that returned text deeply would be super useful for me, letting me write tests that are agnostic to styling choices in the template. In the example in this method's documentation, I'm proposing something that would work like so:

iex> Floki.text("<div><span>something else</span>hello world</div>", deep: true)
"something elsehello world"

Thoughts?

FunctionClauseError

I received an error when trying to parse the html of this page. Let me know if you need anything else.

http://theguysociety.com/success/kunal-desai-no-bull-on-how-to-take-over-the-stock-market/?utm_content=buffer4be54&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

** (FunctionClauseError) no function clause matching in Floki.Finder.traverse/4
lib/floki/finder.ex:33: Floki.Finder.traverse({:pi, "php echo date("Y"); "}, [" | ", {"a", [{"href", "http://crossatlanticebooks.com"}, {"target", "_blank"}], ["Cross Atlantic E-Books"]}, " , a division therein "], %Floki.Selector{attributes: [%Floki.AttributeSelector{attribute: "property", match_type: :equal, value: "og:image"}], classes: [], combinator: nil, id: nil, type: "meta"}, [])
lib/floki/finder.ex:40: Floki.Finder.traverse/4
lib/floki/finder.ex:40: Floki.Finder.traverse/4
lib/floki/finder.ex:44: Floki.Finder.traverse/4
lib/floki/finder.ex:40: Floki.Finder.traverse/4
lib/floki/finder.ex:23: Floki.Finder.find/2
(hue2) lib/hue2/tweet_info_2.ex:263: Hue2.TweetInfo2.get_source_data/1
(elixir) lib/enum.ex:1043: anonymous fn/3 in Enum.map/2
(elixir) lib/enum.ex:1387: Enum."-reduce/3-lists^foldl/2-0-"/3
(elixir) lib/enum.ex:1043: Enum.map/2
(hue2) lib/hue2/tweet_info_2.ex:98: Hue2.TweetInfo2.store/0

Compilation error: "attribute 'dialyzer' after function definitions"

I get this trying to mix deps.compile floki on my OS X machine:

/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:47: attribute 'dialyzer' after function definitions
/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:118: attribute 'dialyzer' after function definitions
/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:194: attribute 'dialyzer' after function definitions
/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:247: attribute 'dialyzer' after function definitions
==> floki
could not compile dependency floki, mix compile failed. You can recompile this dependency with `mix deps.compile floki` or update it with `mix deps.update floki`

Problem when parse a part of html tree without root node

Example:

<h2>Show user</h2>
<ul>
<li><strong>Name:</strong>Jim</li>
<li><strong>Age:</strong>25</li>
</ul>
<a href="/users">Back</a>

Floki.parse/1 returns only the first node {"h2", [], ["Show user"]}
It is, of course, :mochiweb_html.parse/1. But can we make it better on Floki side?

Floki.attribute shouldn't return a list

Is there a reason why Floki.attribute returns a list?

Right now I find myself constantly needing to run hd or List.first when retrieving attribute's value:

~s(<meta property="og:url" name="twitter:url" content="https://example.com">)
|> Floki.parse()
|> Floki.find(~s(meta[property="og:url"]))
|> Floki.attribute("content")
|> hd()

An attribute name is always unique, so it should be possible to have hd/1 in Floki.attribute itself?

According to the HTML5 CR, 8.1.2.3 Attributes:

There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other.

mochiweb_html breaks building a release when other dependencies include mochiweb

Hi @philss - thanks so much for Floki it's awesome...

I'm in a bit of a pickle with mochiweb_html conflicting with deps that use the mochiweb package.

I understand your reasoning only including the mochiweb_html parser rather than all of the other stuff with Floki, however when trying to build a release it causes the following error:

==> Failed to build release:

Duplicated modules: 
    mochiweb_html specified in mochiweb and mochiweb_html
    mochiweb_charref specified in mochiweb and mochiweb_html
    mochiutf8 specified in mochiweb and mochiweb_html
    mochinum specified in mochiweb and mochiweb_html

Given that releases are borked and I'd have to fork your library (as would anyone building a release that includes mochiweb in the deps) would you consider this optimisation less desirable and go over to using full mochiweb?

Thanks again!!!

Plus in find not work as expected

For this code:

<span>1</span>&nbsp;<a>2</a>

I run:

Floki.find(html, "span + a")

Expected results:

[{"a", [], ["2"]}]

Actual results:

[]

Note:
With this html:

<span>1</span> <a>2</a>

and this:

<span>1</span><a>2</a>

find method works as expected.

So the problem is that Floki treats the &nbsp; text node as a normal node, so the next node after the span element (for Floki) is a text node instead of the a element.

Note2:
As a workaround I tried:

Floki.find(html, "span + * + a") # any element between span and link

but I got:

** (MatchError) no match of right hand side value: " "
                lib/floki/finder.ex:137: Floki.Finder.traverse_sibling/3
                lib/floki/finder.ex:69: Floki.Finder.traverse/4
                lib/floki/finder.ex:73: Floki.Finder.traverse/4
                lib/floki/finder.ex:47: Floki.Finder.find_selectors/2

Cannot parse soap response

I've tried to parse a minimal soap response

iex(3)> a = """
...(3)> <?xml version="1.0" encoding="utf-8"?>
...(3)> <soap:Envelope>
...(3)>     <soap:Body>
...(3)>         <SomeResponse>foo</SomeResponse>
...(3)>     </soap:Body>
...(3)> </soap:Envelope>
...(3)> """
"<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<soap:Envelope>\n    <soap:Body>\n        <SomeResponse>foo</SomeResponse>\n    </soap:Body>\n</soap:Envelope>\n"
iex(4)> Floki.find(a, "SomeResponse")
[]

and floki seems to not be able to parse anything

Create a built in HTML parser

Floki needs a HTML parser built in, in order to remove the mochiweb dependency. This will enable more flexibility and better control of the parsing step.

The parser goals are:

  • support HTML5;
  • support HTML snippets;
  • be able to parse large files, like 15MB;
  • easy to traverse;
  • be a bit tolerant with errors, like missing closing tags.

Cannot find og: tags with Floki.find/2

html = """<html>
    <head>
      <title>Tag title - test</title>
      <meta property='og:image' content='http://i.imgur.com/qQNJVAY.png'>
      <meta property='twitter:image' content='http://i.imgur.com/O40KSbh.png'>
    </head>

  </html>
  """

Floki.find(html, "meta[property=og:image]")
[warn] Unknown token {:unknown, 1, ':'}. Ignoring.
[]

Hi. Basically I can't seem to use Floki.find/2 to get any properties with : in them (opengraph, twitter etc.) This bug seems to have popped up after v0.9.0. The above code works on 0.9.X, but breaks starting on 0.10.X

Thanks!

Add support for :not selector (and other pseudo-class selectors)

In my case, I'm attempting to use the pseudo-class :not selector as "div:not(.invisible)#error"

pry(1)> Floki.find(html, "div:not(.invisible)#error")
iex(15)> [warn] Unknown token '('. Ignoring.
[warn] Unknown token ')'. Ignoring.
[warn] Pseudo-class "not" is not implemented. Ignoring.
[]

I would loooovvee to have support for :not

Can't select sibling options

Test case:

iex> html = "<select><option value=\"foo\">Foo</option><option value=\"bar\">Bar</option><option value=\"barz\">Barz</option></select>"
iex> Floki.parse(html) |> Floki.find("select option + option")
** (ArgumentError) argument error
    :erlang.hd([])
    lib/floki/finder.ex:90: Floki.Finder.traverse_sibling/4
    lib/floki/finder.ex:60: Floki.Finder.traverse/4
    lib/floki/finder.ex:40: Floki.Finder.traverse/4
    lib/floki/finder.ex:56: Floki.Finder.traverse/4
    lib/floki/finder.ex:44: Floki.Finder.traverse/4
    lib/floki/finder.ex:23: Floki.Finder.find/2

Floki version: 0.6.1

Find with ":" token

I use Floki to parse XML. When I need to find "media:thumbnail" it returns me empty list.

Example:
iex(2)> Floki.find(response.body, "item") |> List.first
{"item", [],
[{"title", [], ["Test"]},
{"description", [], ["test"]},
{"media:thumbnail",
[{"url", "test-url"}], []},]}
iex(3)> Floki.find(Floki.find(response.body, "item") |> List.first, "media:thumbnail")
[]

Maybe I use Floki wrong? Thank you for help)

Floki.find returning unmatching nodes

I tried to update a project to the new 0.12.0 release, but ran into a bug. Here is a minimal reproduction of the error:

    html = """
      <html>
      <body>
      <div id="messageBox" class="legacyErrors"><div class="messageBox error"><h2 class="accessAid">Error Message</h2><p>There has been an error in your account.</p></div></div>
      <div id="main" class="legacyErrors"><p>Separate Error Message</p></div>
      </body>
      </html>
      """
    html |> Floki.parse |> Floki.find(".messageBox p") |> Floki.text
    # => "There has been an error in your account.Separate Error Message"

The .messageBox p selector should only return the first p node, but it actually returns both of the p nodes and the Floki.text call ends up returning the text from both of those nodes.

Floki removes blank text nodes without option to avoid this

Actual results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
[{"span", [], ["5"]}, {"span", [], ["="]}, {"span", [], ["5"]}]

Expected results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
# or
Floki.parse("<span>5</span> <span>=</span> <span>5</span>", remove_text_nodes_with_whitespaces: false)

[{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Note: this is really important in some cases. For example, please try parsing html generated from github markup (code samples).

Errors compiling floki 0.4

Created a new Elixir project (mix new floki_test), added {:floki, "~> 0.4"} to mix.exs, successfully ran mix deps.get, but when I run mix test on this blank project, Floki fails to compile and the project outputs:

➜ mix test
==> mochiweb (compile)
Compiled src/reloader.erl
*** SNIP ***
Compiled src/mochifmt.erl
Compiled src/mochiweb_charref.erl
/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:47: attribute 'dialyzer' after function definitions
/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:118: attribute 'dialyzer' after function definitions
/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:194: attribute 'dialyzer' after function definitions
/usr/local/Cellar/erlang/18.0.2/lib/erlang/lib/parsetools-2.1/include/leexinc.hrl:247: attribute 'dialyzer' after function definitions
==> floki
could not compile dependency floki, mix compile failed. You can recompile this dependency with `mix deps.compile floki` or update it with `mix deps.update floki`
==> floki_test
** (Mix) Encountered compilation errors.

I'm using mix/elixir 1.0.5. Above steps succeed with floki 0.3.3.

Add search for descendant elements

I want to be able to search for elements that are descendants of others without having to use pipes.

Ideal interface:

Floki.find(html_tree, ".class-a .class-b")

It should be able to work with html tags and ids.

find/2 gets stuck with non-HTML input

# These calls never return
Floki.find("", "a")
Floki.find("foobar", "a")
Floki.find("foobar<", "a")

# find/2 wants at least one tag
Floki.find("foobar<a>", "a")  # => [{"a", [], []}]

# To be precise, it's ok if input matches /<[a-zA-Z0-9]/
Floki.find("foobar<a", "a")   # => []

Remove embedded mochiweb code

This is low priority.

The idea is to remove mochiweb code from this codebase.

Solutions are:

  • use a hex package of mochiweb (currently not available);
  • create a new HTML parser in Elixir and use as hex dependency.

Unexpected "/floki" tag

Via #37 it looks like handling unclosed tags is a goal for the built-in HTML parser. But I was surprised that a tag named floki was added to the parsed HTML tree here. Is this expected behavior?

iex> "<div>hello <h1>world</h1><script>alert('wat');</div>" |> Floki.parse
{"div", [],
 ["hello ", {"h1", [], ["world"]},
  {"script", [], ["alert('wat');</div></floki>"]}]}

module :leex is not available

I get this error with floki 0.6.1
=> floki
could not compile dependency :floki, "mix compile" failed. You can recompile this dependency with "mix deps.compile floki", update it with "mix deps.update floki" or clean it with "mix deps.clean floki"
** (UndefinedFunctionError) undefined function: :leex.file/2 (module :leex is not available)
:leex.file('src/floki_selector_lexer.xrl', [scannerfile: 'src/floki_selector_lexer.erl', report: true])
(mix) lib/mix/compilers/erlang.ex:84: anonymous fn/3 in Mix.Compilers.Erlang.compile/3
(elixir) lib/enum.ex:1385: Enum."-reduce/3-lists^foldl/2-0-"/3
(mix) lib/mix/compilers/erlang.ex:83: Mix.Compilers.Erlang.compile/3
(elixir) lib/enum.ex:1043: anonymous fn/3 in Enum.map/2
(elixir) lib/enum.ex:1385: Enum."-reduce/3-lists^foldl/2-0-"/3
(elixir) lib/enum.ex:1043: Enum.map/2
(mix) lib/mix/tasks/compile.all.ex:19: anonymous fn/1 in Mix.Tasks.Compile.All.run/1

Attributes' names are forced lowercase

Floki forces attributes' names lowercase. Failing test case:

defmodule FlokiTest do
  use ExUnit.Case

  test "preserves case of attributes names" do
    xml = ~s(<root ID="1"/>)
    [{"root", [{attrName, "1"}], []}] = Floki.find(xml, "root")
    assert(attrName == "ID")
  end    
end

Add an optional separator to .text()

consider this case:
Floki.parse("<ul><li>text1</li><li>text2</li><ul>") |> Floki.text
returns
text1text2

If we give the consumer the possibility to add a separator
Floki.parse("<ul><li>text1</li><li>text2</li><ul>") |> Floki.text (separator: " ")
returns
text1 text2
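
For reference, in the Floki release documented in the README above, Floki.text/2 takes a :sep option that behaves like the separator proposed here. A minimal sketch on a hand-built tree:

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

tree = [{"ul", [], [{"li", [], ["text1"]}, {"li", [], ["text2"]}]}]

# Without a separator the text nodes are simply concatenated.
joined = Floki.text(tree)

# With :sep, the given string is placed between the text nodes.
separated = Floki.text(tree, sep: " ")

IO.inspect(joined)     # prints "text1text2"
IO.inspect(separated)  # prints "text1 text2"
```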

When 2 tags with only a space between them, space is removed

Example html:
<p>I want <a href="http://www.google.com">link to</a> <strong>article</strong></p>

When run through floki:
[{"p", [],
["I want ", {"a", [{"href", "http://www.google.com"}], ["link to"]},
{"strong", [], ["article"]}]}]

Maybe this is the intended result... which is perfectly fine. I was hoping to use floki for a weird use-case... basically run HTML through floki, parse through the floki response, removing tags/attributes that aren't allowed, and then re-create html. Essentially using floki to help me cleanse user inputted html to remove tags that aren't allowed and/or malicious javascript attributes (xss attacks).

Because of how I wanted to use it, the space between </a> and <strong> is actually important, otherwise "link to" and "article" will be touching each other.

Feel free to close the issue if it was intended.

Add a more robust representation of selectors

The current implementation does not allow searching for elements using a multi selector, like a.some-class or .some-class.another-class. This is because it is coupled to the idea of one search per selector at time. This means that I can't search a "tag" and a "class" in the same element at the same time.

We need to provide a basic infrastructure to be able to search using groups of selectors.
This could bring huge flexibility to Floki, since it would be possible to truly simulate the jQuery selector engine (in fact Sizzle, the query selector engine behind jQuery).

Examples of queries to support:

  • Floki.find("a.foo")
  • Floki.find(".foo.bar")
  • Floki.find(".foo[data-js=bar]")
  • Floki.find("a.foo[data='baz.html']")
  • Floki.find("a b.foo")

The following examples can be implemented as well, but since they depend on a more robust representation of the HTML tree, they should be implemented in the future:

  • Floki.find("a + b") (matches b immediately preceded by a)
  • Floki.find("a > b") (matches b that is a direct child of a)
  • Floki.find("a:first-child") (matches a when a is the first child of its parent)

This issue is related to #18

Using `filter_out` with attribute selector like "div[class]"

Suppose I want to filter out all divs that have a particular attribute. I can't see how to do that with filter_out.

Consider this HTML. It has one div with a class and one without:

html = "<body> <div class=\"has\">one</div>  <div>two</div> </body>"

I want to find all the divs without a class. I expected this to work, but it does not:

iex(44)> Floki.filter_out(html, "div[class]")
{"body", [], [{"div", [{"class", "has"}], ["one"]}, {"div", [], ["two"]}]}

More precisely, I want to work with a list produced by find, so a closer approximation to my real code would be this:

iex(40)> Floki.find(html, "div") |> Floki.filter_out("div[class]")
[{"div", [{"class", "has"}], ["one"]}, {"div", [], ["two"]}]

Note that filtering in the div with class works (using find):

iex(42)> Floki.find(html, "div") |> Floki.find("div[class]")
[{"div", [{"class", "has"}], ["one"]}]

So does filtering out all the divs, regardless of their attributes:

iex(45)> Floki.find(html, "div") |> Floki.filter_out("div")
[]

Am I doing something wrong?

How to find tag with namespace?

iex(1)> "<ns:tag>xxx</ns:tag>" |> Floki.find("ns:tag")
Unknown token ':'. Ignoring.
[]

The jQuery can do as follows:
$("ns\\:tag")

Refactor Finder to enable lookup for parent nodes and update tree

This refactor is needed in order to implement some features like search using pseudo-classes and update tree.

  • Represent parsed HTML as a tree using a Map
  • Refactor finders to work with this HTML tree
  • Add docs for APIs that may change
  • Remove old code

There is a proof of concept here, with all tests passing.

This refactor should not break APIs (as the name suggests). The only change that may occur is the Elixir version support.

Finding article Tag

Hi,

Floki.find(html, "article") returns nothing despite the fact that the html does contain an article tag. No other selector finds that tag either.

Thanks;

How to get the text node?

Let's say I have the following code

  <div id="msg"><span>sometext</span> Hello world</div>

How do I get the textnode 'Hello world'?

html
 |> Floki.find("#msg")
 |> Floki.WHAT_METHOD_CAN_I_CALL_TO_GET_"Hello world"?
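
For reference, in the Floki release documented in the README above, Floki.text/2 accepts deep: false, which returns only the text nodes that are direct children of the matched element. A minimal sketch on a hand-built tree that mirrors the markup in question:

```elixir
Mix.install([{:floki, "~> 0.36.0"}])

tree = [{"div", [{"id", "msg"}], [{"span", [], ["sometext"]}, " Hello world"]}]

# deep: false skips text that lives inside child elements such as <span>.
own_text =
  tree
  |> Floki.find("#msg")
  |> Floki.text(deep: false)

IO.inspect(String.trim(own_text))  # prints "Hello world"
```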

Add support for html5ever as an optional HTML parser

We made html5ever and Floki work together through ex_html5ever thanks to @hansihe and the Rustler team!

Using ex_html5ever can be very useful for those who need more accurate HTML parsing. It solves issues #50 and #75, for example.

Since ex_html5ever relies on a Rust NIF, it requires the user to have Rust installed.

Also, I ran some benchmarks that show the ex_html5ever version is faster for average to large HTML files. ⚡️

UPDATE: the html5ever Elixir NIF was renamed to "html5ever" on Hex.pm, and the repository is now https://github.com/hansihe/html5ever_elixir

Commented out html do not show up in find function

Floki is not finding elements in comments.

I have this src I fetched with Hound and I passed it to Floki.find(src, "span"). Floki only returns the span with the class='visible' like this [{"span", [{"class", "visible"}], [" hello "]}]. The commented out span is totally neglected.

Source:

<div class='parent'>
  <div class='child'>
    <span class='visible'> hello </span>
    <!-- <span class='commented'> hi </span> -->
  </div>
</div>

Is there a way to fetch commented out code? Or there's a temporary hack I can use?

You can reproduce this with:

Floki.find("<div class='parent'><div class='child'><span class='visible'> hello </span><!-- <span class='commented'> hi </span> --></div></div>", "span")

deep_text fails when handling comments

When I try to extract text from elements containing a comment-tag, the deep_text-extraction breaks. Instead of ignoring the comment, it throws a FunctionClauseError.

Here is a minimal example:

defmodule FlokiCase do
  def fail_deep_text_with_comment do
    """
    <a>foo</a>
    <!--bar-->
    <b>baz</b>
    """ |> Floki.parse |> Floki.text |> IO.puts
  end
end

I expected to have it returned "foobaz", but instead got this error output.
I hope this helps.

Supporting transforms for Floki nodes!

Hi! Thanks so much for Floki!

I have a need to transform Floki nodes. For example, to turn all relative links into absolute links:

html = "<p><a href='/foo'>click!</a></p>"
Floki.transform(html, "a", fn({"a", [href: x], xs}) -> {"a", URI.merge("https://base.com", x), xs} end)
# => [{"p", [], [{"a", [{"href", "https://base.com/foo"}], ["click!"]}]}]

Instead of requiring the implementation of a recursive search, do you think there is space in the Floki API for a transform function like above? I'd be up for cutting a PR myself. Just wondering if you have any thoughts.
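
To make the request concrete, the recursive walk that such a function would replace can be sketched with plain Elixir over Floki's tuple format. The LinkRewriter module and its absolutize/2 function are hypothetical names used only for illustration. Later Floki releases ship Floki.traverse_and_update/2, which covers this kind of transformation without hand-written recursion.

```elixir
defmodule LinkRewriter do
  @moduledoc "Hypothetical helper: resolves relative href values against a base URL."

  # Walk a list of nodes.
  def absolutize(nodes, base) when is_list(nodes) do
    Enum.map(nodes, &absolutize(&1, base))
  end

  # Rewrite the href of <a> elements, then recurse into their children.
  def absolutize({"a", attrs, children}, base) do
    attrs =
      Enum.map(attrs, fn
        {"href", href} -> {"href", base |> URI.merge(href) |> URI.to_string()}
        other -> other
      end)

    {"a", attrs, absolutize(children, base)}
  end

  # Any other element: keep it as is and recurse into its children.
  def absolutize({tag, attrs, children}, base) when is_binary(tag) do
    {tag, attrs, absolutize(children, base)}
  end

  # Text, comments and other non-element nodes pass through unchanged.
  def absolutize(other, _base), do: other
end

tree = [{"p", [], [{"a", [{"href", "/foo"}], ["click!"]}]}]
result = LinkRewriter.absolutize(tree, "https://base.com")

IO.inspect(result)
# => [{"p", [], [{"a", [{"href", "https://base.com/foo"}], ["click!"]}]}]
```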

Making Floki support XML files parsing.

I'm getting this error when parsing a XML file.

** (MatchError) no match of right hand side value: {"version", "1.0"}

Removing this line:

<?xml version="1.0" encoding="UTF-8"?>

From:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title></title>
    <link></link>
    <description></description>
    <language></language>
    <copyright></copyright>
    <lastBuildDate></lastBuildDate>
    <docs></docs>
    <image>
      <url></url>
      <title></title>
      <link></link>
      <width></width>
      <height></height>
      <description>
      </description>
    </image>
    <item>
      <title></title>
      <link></link>
      <description></description>
      <category></category>
      <pubDate></pubDate>
    </item>
    <item>
      <title></title>
      <link></link>
      <description></description>
      <category></category>
      <pubDate></pubDate>
    </item>
  </channel>
</rss>

It seems to work just fine.

Any thoughts on supporting this kind of file?

Cannot search with multiple selectors

With @html markup from test/floki_test.exs:

<html>
<head>
<title>Test</title>
</head>
<body>
  <div class="content">
    <a href="http://google.com" class="js-google js-cool">Google</a>
    <a href="http://elixir-lang.org" class="js-elixir js-cool">Elixir lang</a>
    <a href="http://java.com" class="js-java">Java</a>
  </div>
</body>
</html>

I am trying to search for an element with multiple classes (on the same element) with the selector .js-cool.js-elixir but Floki is not finding them. However, if I add a space between the classes, as if the second was a descendent of the first, .js-cool .js-elixir, it does return what I'm looking for but should not (as js-elixir is not descendent of js-cool).

Test demonstrating the error:

test "find elements with given multiple classes" do
  class_selector = ".js-cool.js-elixir"

  assert Floki.find(@html, class_selector) == [
    {"a", [
        {"href", "http://elixir-lang.org"},
        {"class", "js-elixir js-cool"}],
      ["Elixir lang"]}
  ]
end
