lazytiger / gumbo-query
C++ library that provides a jQuery-style API for the gumbo library
License: MIT License
I'm trying to parse HTML in UTF-8 format that contains Russian characters. As a result, I get ?????? instead of the expected characters.
Machine Linux (GCC 7.1) and Windows (GCC 6.3)
Sample Code
std::string page("<div><p>1</p><p>2</p><p>3</p><p>4</p><p>5</p><p>6</p></div>");
CDocument doc;
doc.parse(page.c_str());
CSelection c = doc.find("p:nth-child(odd)");
CNode node = c.nodeAt(0);
std::cout << c.nodeNum() << std::endl;
for (int i = 0; i < c.nodeNum(); i++)
{
CNode node = c.nodeAt(i);
std::cout << " " + node.text() << std::endl;
}
Expected Output
3
1
3
5
Actual output
2
3
5
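For reference, the expected output follows from the CSS rule that :nth-child positions are 1-based and "odd" means 2n+1. The following is a minimal standalone sketch of that rule only; matchesNth and selectNth are hypothetical helper names, not part of gumbo-query, and this is not how the library implements the selector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true when a 1-based sibling position matches the CSS pattern an+b
// for some integer n >= 0 (for example, "odd" is 2n+1 and "even" is 2n+0).
bool matchesNth(int a, int b, int position)
{
    if (a == 0)
        return position == b;
    const int diff = position - b;
    // diff must be an exact multiple of a, with a non-negative quotient n.
    return diff % a == 0 && diff / a >= 0;
}

// Collects the items whose 1-based position matches an+b.
std::vector<std::string> selectNth(const std::vector<std::string>& items, int a, int b)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < items.size(); ++i)
    {
        if (matchesNth(a, b, static_cast<int>(i) + 1))
            out.push_back(items[i]);
    }
    return out;
}
```

Calling selectNth with the six paragraph texts and (a, b) = (2, 1) selects the first, third and fifth items, which is the expected output listed above.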
Problem scenario:
Given a page like this:
std::string page = "<!DOCTYPE html><html lang=\"en\"><head><meta charset=\"UTF-8\"><title>Title</title></head><body><img src=\"file:///d:/test.png\" /><img src=\"file:///d:/test2.png\" /><img src=\"http://blablabla.png\" /><div><div><img src=\"http://asdfasfdsaf.com/asdf.png\" alt=\"alternative text\"/><img src=\"file:///d:/test3.png\" /></div><div1><div2><div3><img id=\"imgId2\" src=\"http://www.taobao.com\" /><pre><img src=\"file:///foo\" id=\"bar\" id=\"notexists\"/></pre></div3></div2></div1></div></body></html>";
Find all nodes in it whose tag is img:
auto nodes = doc.find("IMG");
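Then read the startPos and endPos of these nodes.
I found that startPos points to the end of the tag and endPos points to the start of the tag. Are the implementations of these two functions swapped?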
I'm assuming this wasn't intended, but would it be possible to create a way to set the text contents of a CNode? I'm in a situation where I need to update parts of a DOM on the fly, and I need such a feature.
If it were to be implemented, I'd imagine overloading .text() for a CNode to accept a std::string would work well and be similar to the jQuery function .html().
Selectors like a[src*="runoob"] are not supported. Could support for them be added?
Would it be possible to have .text() include the HTML inside a node as well as the text content?
For example:
std::string te = "<div><span>1</span>2</div>";
CDocument cdo;
cdo.parse(te);
// cdo.find("div").nodeAt(0).text() should be "<span>1</span>2" not "12"
Since this library is based heavily on the cascadia Go library, it's recommended to include the original author's copyright notice in your license file. Example:
Copyright (c) 2011 Andy Balholm. All rights reserved.
Copyright (c) 2015 baimashi.com.
Presently, your license doesn't reflect that your work is derived from another copyrighted work. Not trying to be nitpicky, just a suggestion. :)
References:
https://github.com/andybalholm/cascadia/blob/master/LICENSE
http://programmers.stackexchange.com/a/22261/22018
After some digging around, I found a way to collapse the unformatted whitespace (runs of '\r', '\v', '\f', '\n', '\t', ' ') in the strings this library returns when parsing HTML files. Text with multiple spaces and line breaks can be very annoying when, for example, you try to train an ML algorithm on data fetched with libcurl. The function 'reduce' below will transform a string like this:
You can modify the text in the box to the left any way you like, and ss
then click the "Show Page" button below the box to display the
result here. Go ahead and do this as often and as long as you like.
To something like this:
You can modify the text in the box to the left any way you like, and ss then click the "Show Page" button below the box to display the result here. Go ahead and do this as often and as long as you like.
The code:
std::string trim(
const std::string& str,
const std::string& whitespace = " \t \n \r \v \f"
){
const auto strBegin = str.find_first_not_of(whitespace);
if (strBegin == std::string::npos)
return ""; // no content
const auto strEnd = str.find_last_not_of(whitespace);
const auto strRange = strEnd - strBegin + 1;
return str.substr(strBegin, strRange);
}
std::string reduce(
const std::string& str,
const std::string& fill = " ",
const std::string& whitespace = " \t \n \r \v \f")
{
// trim first
auto result = trim(str, whitespace);
// replace sub ranges
auto beginSpace = result.find_first_of(whitespace);
while (beginSpace != std::string::npos)
{
const auto endSpace = result.find_first_not_of(whitespace, beginSpace);
const auto range = endSpace - beginSpace;
result.replace(beginSpace, range, fill);
const auto newStart = beginSpace + fill.length();
beginSpace = result.find_first_of(whitespace, newStart);
}
return result;
}
I got this from a reddit post, but it did not have an author.
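As an alternative sketch, the same trim-and-collapse behaviour can be done in a single pass with std::isspace, without repeated find/replace calls. collapseWhitespace is a hypothetical name and is not part of gumbo-query; it always uses a single space as the fill.

```cpp
#include <cctype>
#include <string>

// Trims leading/trailing whitespace and collapses every internal run of
// whitespace into one space, in a single pass over the input.
std::string collapseWhitespace(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    bool pendingSpace = false;
    for (unsigned char ch : input)
    {
        if (std::isspace(ch))
        {
            // Remember the gap, but only if some text has already been emitted.
            pendingSpace = !out.empty();
        }
        else
        {
            // Emit the remembered gap only once real text follows it.
            if (pendingSpace)
                out.push_back(' ');
            out.push_back(static_cast<char>(ch));
            pendingSpace = false;
        }
    }
    return out;
}
```

Iterating as unsigned char avoids undefined behaviour when std::isspace is given a negative char value, which matters for UTF-8 input.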
I built *.deb with
cmake ..
cmake --build .
sudo checkinstall
Then my application fails to compile with this error:
Document.h:19:10: fatal error: gumbo.h: No such file or directory
#include <gumbo.h>
And it's true. There is no gumbo.h in /usr/local/include/gq
Document.h
Node.h
Object.h
Parser.h
QueryUtil.h
Selection.h
Selector.h
How to resolve?
I hope gumbo-query will get more usage examples, so I prepared a few examples of my own :)
https://gist.github.com/derofim/517b60c637dc2d8e0f680610ffd8722f
I still don't know how to properly use things like nextSibling().
-- The CXX compiler identification is GNU 4.8.4
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at cmake/LibFindMacros.cmake:259 (message):
REQUIRED PACKAGE NOT FOUND
We only found some files of Gumbo, not all of them. Perhaps your
installation is incomplete or maybe we just didn't look in the right place?
This package is REQUIRED and you need to install it or adjust CMake
configuration in order to continue building gumbo_query.
Relevant CMake configuration variables:
Gumbo_INCLUDE_DIR=/usr/local/include
Gumbo_LIBRARY=<not found>
Gumbo_static_LIBRARY=/usr/local/lib/libgumbo.a
You may use CMake GUI, cmake -D or ccmake to modify the values. Delete
CMakeCache.txt to discard all values and force full re-detection if
necessary.
Call Stack (most recent call first):
cmake/FindGumbo.cmake:39 (libfind_process)
CMakeLists.txt:20 (find_package)
-- Configuring incomplete, errors occurred!
Hi,
Sigil, the open-source EPUB editor, has decided to adopt Google's gumbo parser to help with the HTML5 used in EPUB3. I found your project and have to do something similar to use it with Qt. Do you have a license for your code yet, so I know whether or not I could use it inside our GPL3 project?
Thanks,
Kevin
Could you add libgumbo.so or libgumbo.so.1 to cmake/FindGumbo.cmake? It would allow a successful build on Linux.
Patch example:
--- cmake/FindGumbo.cmake.orig 2015-08-06 20:37:13.000000000 -0300
+++ cmake/FindGumbo.cmake 2015-08-08 05:46:41.021517785 -0300
@@ -23,7 +23,7 @@
find_library(Gumbo_LIBRARY
It seems like there's no Unicode support, because CDocument::parse only accepts std::string, which doesn't seem Unicode friendly (at least under Windows).
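One point worth separating out: a std::string is a byte container, so it can carry UTF-8 encoded text unchanged even though it has no notion of characters. The sketch below only illustrates that general fact; countCodePoints is a hypothetical helper, and nothing here establishes how gumbo-query itself treats the bytes or why a console might print "??????".

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For a six-letter Russian word encoded as UTF-8, size() reports 12 bytes while countCodePoints reports 6, so byte-oriented code still transports the text intact as long as nothing re-encodes it.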
I am searching for a "script" node in the Facebook HTML source. The node looks like this:
<.s.c.r.i.p.t.>require("TimeSlice").guard(function() ... ///< The dots in "script" is for showing this line normally in issue page.
So I use this selector to find this node
CSelection c = doc.find("script:contains(require(\"TimeSlice\"))");
But the app crashed with the error "terminate called after throwing an instance of 'std::string'"; GDB says it crashes in the doc.find function.
If I use CSelection c = doc.find("script:contains(require)"), it works well, but those nodes are not the ones I want. So I think gumbo-query's "contains" filter does not support '(' in its argument.
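One way a selector parser could accept parentheses inside such an argument is to count nesting depth and skip over quoted runs. This is a hypothetical sketch of that idea, not gumbo-query's actual parsing code; readBalancedArgument is an invented name.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Given the text that follows "contains(", returns the argument up to the
// matching ')'. Nested parentheses are counted and quoted runs are skipped,
// so a ')' inside quotes or inside a nested pair does not end the argument.
std::string readBalancedArgument(const std::string& text)
{
    int depth = 1;
    char quote = '\0';
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char ch = text[i];
        if (quote != '\0')
        {
            if (ch == quote)
                quote = '\0';   // closing quote
        }
        else if (ch == '"' || ch == '\'')
        {
            quote = ch;         // opening quote
        }
        else if (ch == '(')
        {
            ++depth;
        }
        else if (ch == ')' && --depth == 0)
        {
            return text.substr(0, i);
        }
    }
    throw std::invalid_argument("unbalanced parentheses in selector argument");
}
```

Throwing a type derived from std::exception also gives callers something they can catch generically, unlike a thrown std::string.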
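Hello. For a certain purpose I need to create a CSelection object with new, like this: CSelection* sel = new CSelection(selection), which calls the copy constructor. At the end of the program I delete this pointer, but valgrind still reports a memory leak. When freeing a CSelection object, should I delete it directly or call the release() function? Or is this a potential memory leak? My English is not good, so I originally wrote this in Chinese. I hope you can answer, thanks!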
If I have a span with the id "that's", I can't seem to use .find() to get to it:
.find("span[id_='that's']") doesn't work (error)
An unhandled exception of type 'System.StackOverflowException' occurred in Parser.cpp
std::string ret = message + " at:";
for (unsigned int i = ds.size() - 1; i >= 0; i--)
{
ret.push_back(ds[i]);
}
Hi,
I want to get all nodes with a class containing (not matching exactly) a specific value.
I used 'find', but it returns only the nodes with class matching exactly the value.
For example, if I have a:
<div class="the-doc the-row'">
Hello world
</div>
This does not work:
CSelection c = doc.find("div[class=' the-row']"); for (int i = 0; i < c.nodeNum(); i++) { qDebug() << c.nodeAt(i).text().c_str(); }
This works:
CSelection c = doc.find("div[class='the-doc the-row']"); for (int i = 0; i < c.nodeNum(); i++) { qDebug() << c.nodeAt(i).text().c_str(); }
Thanks.
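In CSS, [class='x'] compares the whole attribute value, while matching one class among several is a whitespace-separated token comparison (the semantics of .x and [class~='x']). Whether gumbo-query supports those two forms is not confirmed in this thread, so the sketch below only illustrates the token comparison itself; hasClassToken is a hypothetical helper that could be applied to an attribute value by hand.

```cpp
#include <sstream>
#include <string>

// Returns true if `token` appears as one whitespace-separated word
// in the value of a class attribute.
bool hasClassToken(const std::string& classAttr, const std::string& token)
{
    std::istringstream words(classAttr);
    std::string word;
    while (words >> word)
    {
        if (word == token)
            return true;
    }
    return false;
}
```

With the value "the-doc the-row", the token "the-row" matches but the substring "row" does not, which is the distinction between token matching and substring matching.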
Just tried 3 different package installations of gumbo-parser
(Archlinux)
pacman -Sy gumbo-parser
yay -Sy gumbo-parser
yay -Sy gumbo-git
However, none of them provides a static library to find, not even gumbo-git, which built and installed the package from source. So the CMake configuration process always fails. Ideally, gumbo-query
should accept a package install that has a default configuration and has not been customized to produce a static library.
CMake Error at extern/gumbo-query/cmake/LibFindMacros.cmake:259 (message):
REQUIRED PACKAGE NOT FOUND
We only found some files of Gumbo, not all of them. Perhaps your
installation is incomplete or maybe we just didn't look in the right place?
This package is REQUIRED and you need to install it or adjust CMake
configuration in order to continue building nxm.
Relevant CMake configuration variables:
Gumbo_INCLUDE_DIR=/usr/include
Gumbo_LIBRARY=/usr/lib/libgumbo.so
Gumbo_static_LIBRARY=<not found>
You may use CMake GUI, cmake -D or ccmake to modify the values. Delete
CMakeCache.txt to discard all values and force full re-detection if
necessary.
CObject can throw inside destructor, but destructors should never throw. This can cause problems with stack unwinding.
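A common pattern is to keep the throwing operation in a regular member function and have the destructor call it inside a try/catch. The sketch below uses a hypothetical Resource class to show the pattern; it is not CObject's code.

```cpp
#include <stdexcept>

// Hypothetical class whose cleanup step may fail.
class Resource
{
public:
    explicit Resource(bool failOnClose) : failOnClose_(failOnClose), closed_(false) {}

    // The destructor never lets an exception escape.
    ~Resource() noexcept
    {
        try
        {
            close();
        }
        catch (const std::exception&)
        {
            // Swallow (or log) the failure; rethrowing here would call std::terminate.
        }
    }

    // Callers that care about failures call close() explicitly and handle the throw.
    void close()
    {
        if (closed_)
            return;
        closed_ = true;
        if (failOnClose_)
            throw std::runtime_error("close failed");
    }

private:
    bool failOnClose_;
    bool closed_;
};

// Returns true if destroying a failing Resource does not propagate an exception.
bool destroyDoesNotThrow()
{
    try
    {
        Resource r(true);
    }
    catch (...)
    {
        return false;
    }
    return true;
}
```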
In readme.md, "brew install gumbo-query" does not work, so I am using brew tap instead. There are many problems in gumbo-query.rb, as indicated here: Homebrew/legacy-homebrew#50276
class GumboQuery < Formula
homepage "https://github.com/lazytiger/gumbo-query"
url "https://github.com/lazytiger/gumbo-query", :using => :git
depends_on "cmake" => :build
def install
cd "build" do
system "cmake", "..", "-DCMAKE_INSTALL_PREFIX=#{prefix}"
system "make"
system "make", "install"
end
end
end
Hi,
do you plan to create a conan package?
Thanks,
Dario
The logic of skipping nodes should be the opposite
keithyipkw@0efee4b
void test_parser() {
std::string page("<h1><a>wrong link</a><a class=\"special\"\\>some link</a></h1>");
CDocument doc;
doc.parse(page.c_str());
CSelection c = doc.find("h1 a.special");
printf("Node: %s\n", c.nodeAt(0).text().c_str());
}
I've checked that each iteration of test_parser adds more and more allocated memory. To identify where the memory leaks, I tried valgrind:
==89424== 98,304 bytes in 1,024 blocks are definitely lost in loss record 77 of 77
==89424== at 0x66BB: malloc (in /usr/local/Cellar/valgrind/3.10.1/lib/valgrind/vgpreload_memcheck-amd64-darwin.so)
==89424== by 0x9A28D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==89424== by 0x1000061EB: CParser::parseClassSelector() (in ./parser)
==89424== by 0x100004CFC: CParser::parseSimpleSelectorSequence() (in ./parser)
==89424== by 0x100003C9C: CParser::parseSelector() (in ./parser)
==89424== by 0x100003664: CParser::parseSelectorGroup() (in ./parser)
==89424== by 0x1000035E3: CParser::create(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) (in ./parser)
==89424== by 0x10001503E: CSelection::find(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) (in ./parser)
==89424== by 0x100002935: CDocument::find(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) (in ./parser)
==89424== by 0x1000022B3: test_parser() (in ./parser)
==89424== by 0x1000025C2: main (in ./parser)
I think it happens here: https://github.com/lazytiger/gumbo-query/blob/master/src/Parser.cpp#L625
The result is never freed. But I can't work out the right place to delete this object, and it seems like it's not the only thing that needs deleting after the selection is done.
I followed the instructions and tried to install through brew:
Tuzis-MacBook:build tuzi$ brew install gumbo-paerser
Error: No available formula with the name "gumbo-paerser"
==> Searching for a previously deleted formula (in the last month)...
Warning: homebrew/core is shallow clone. To get complete history run:
git -C "$(brew --repo homebrew/core)" fetch --unshallow
Error: No previously deleted formula found.
==> Searching for similarly named formulae...
==> Searching local taps...
Error: No similarly named formulae found.
==> Searching taps...
==> Searching taps on GitHub...
Error: No formulae found in taps.
Thanks for your help
Hi,
I tried to install via homebrew today and it told me that the package wasn't found. I checked via Braumeister and same result. Only the parser package is available.
Thanks, Daniel
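Parser.cpp line 961: an unsigned int is never less than zero, so the loop condition i >= 0 is always true and the loop does not terminate correctly.
Suggested change: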
for (unsigned int i = 0; i < ds.size(); i++)
{
ret.push_back(ds[ds.size()-i-1]);
}
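Another option that avoids index arithmetic altogether is to append through reverse iterators. This sketch assumes ds is a std::string (its real type is not shown in the snippet), and appendReversed is a hypothetical name.

```cpp
#include <string>

// Builds message + " at:" followed by the characters of `ds` in reverse order.
// Reverse iterators sidestep the unsigned wrap-around of a counting-down index.
std::string appendReversed(const std::string& message, const std::string& ds)
{
    std::string ret = message + " at:";
    ret.append(ds.rbegin(), ds.rend());
    return ret;
}
```

An empty ds is handled naturally, because rbegin() equals rend() and nothing is appended.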
If I have the following code in a managed C++/CLI project:
for (int i = 0; i < 5000; i++)
{
std::string html = HttpRequest(......);
CDocument d;
d.parse(html);
CSelection c = d.find("#something");
if (c.nodeNum() != 0)
{ ,,,, }
}
Will this cause a memory leak, as variables d and c are not being destructed? What can I do to remedy this if that's the case?
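As far as the language goes, native C++ objects declared inside a loop body have automatic storage and are destroyed at the end of every iteration. The sketch below demonstrates this with a hypothetical Tracked class standing in for CDocument; it says nothing about allocations the library might leak internally, which other reports in this tracker discuss.

```cpp
// Hypothetical stand-in for a library object; counts how many instances are alive.
struct Tracked
{
    static int live;
    Tracked() { ++live; }
    Tracked(const Tracked&) { ++live; }
    ~Tracked() { --live; }
};

int Tracked::live = 0;

// Runs a loop that declares one object per iteration and returns the
// largest number of instances that were alive at the same time.
int maxLiveDuringLoop(int iterations)
{
    int maxLive = 0;
    for (int i = 0; i < iterations; ++i)
    {
        Tracked d;   // constructed here, destroyed at the closing brace below
        if (Tracked::live > maxLive)
            maxLive = Tracked::live;
    }
    return maxLive;
}
```

Only one instance is ever alive at a time, regardless of the iteration count, and none remain after the loop.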
std::string page("<h1><a>some link</a></h1>");
CDocument doc;
doc.parse(page.c_str());
CSelection c = doc.find("h1 a");
std::cout << c.nodeAt(0).text() << std::endl; // some link
VLD output:
WARNING: Visual Leak Detector detected memory leaks!
---------- Block 2228 at 0x009F5B88: 32 bytes ----------
Leak Hash: 0xCF2E3225, Count: 1, Total 32 bytes
Call Stack (TID 2592):
0x773B1020 (File and line number not available): ntdll.dll!RtlAllocateHeap
f:\dd\vctools\crt\crtw32\heap\malloc.c (58): Tests.exe!_heap_alloc_base
f:\dd\vctools\crt\crtw32\misc\dbgheap.c (431): Tests.exe!_heap_alloc_dbg_impl + 0x9 bytes
f:\dd\vctools\crt\crtw32\misc\dbgheap.c (239): Tests.exe!_nh_malloc_dbg_impl + 0x19 bytes
f:\dd\vctools\crt\crtw32\misc\dbgheap.c (302): Tests.exe!_nh_malloc_dbg + 0x1D bytes
f:\dd\vctools\crt\crtw32\misc\dbgmalloc.c (56): Tests.exe!malloc + 0x15 bytes
f:\dd\vctools\crt\crtw32\heap\new.cpp (59): Tests.exe!operator new + 0x9 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (640): Tests.exe!CParser::parseTypeSelector + 0x7 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (134): Tests.exe!CParser::parseSimpleSelectorSequence + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (58): Tests.exe!CParser::parseSelector + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (38): Tests.exe!CParser::parseSelectorGroup + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (33): Tests.exe!CParser::create + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\selection.cpp (37): Tests.exe!CSelection::find + 0x1C bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\document.cpp (42): Tests.exe!CDocument::find + 0x20 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\source\core\3rdpart\tests\gumbotest.cpp (16): Tests.exe!GumboTest_Simple_Test::TestBody + 0x27 bytes
....
Data:
80 3A 47 01 01 00 00 00 04 00 00 00 00 CD CD CD .:G..... ........
00 00 00 00 00 00 00 00 00 CD CD CD 0F 00 00 00 ........ ........
---------- Block 2233 at 0x009F5BE8: 32 bytes ----------
Leak Hash: 0x39E52DD7, Count: 1, Total 32 bytes
Call Stack (TID 2592):
0x773B1020 (File and line number not available): ntdll.dll!RtlAllocateHeap
f:\dd\vctools\crt\crtw32\heap\malloc.c (58): Tests.exe!_heap_alloc_base
f:\dd\vctools\crt\crtw32\misc\dbgheap.c (431): Tests.exe!_heap_alloc_dbg_impl + 0x9 bytes
f:\dd\vctools\crt\crtw32\misc\dbgheap.c (239): Tests.exe!_nh_malloc_dbg_impl + 0x19 bytes
f:\dd\vctools\crt\crtw32\misc\dbgheap.c (302): Tests.exe!_nh_malloc_dbg + 0x1D bytes
f:\dd\vctools\crt\crtw32\misc\dbgmalloc.c (56): Tests.exe!malloc + 0x15 bytes
f:\dd\vctools\crt\crtw32\heap\new.cpp (59): Tests.exe!operator new + 0x9 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (640): Tests.exe!CParser::parseTypeSelector + 0x7 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (134): Tests.exe!CParser::parseSimpleSelectorSequence + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (89): Tests.exe!CParser::parseSelector + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (38): Tests.exe!CParser::parseSelectorGroup + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\parser.cpp (33): Tests.exe!CParser::create + 0x8 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\selection.cpp (37): Tests.exe!CSelection::find + 0x1C bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\contrib\source\gumbo-query\src\document.cpp (42): Tests.exe!CDocument::find + 0x20 bytes
d:\develop\imageuploader-1.3.2-vs2013\image-uploader\source\core\3rdpart\tests\gumbotest.cpp (16): Tests.exe!GumboTest_Simple_Test::TestBody + 0x27 bytes
...
Data:
80 3A 47 01 01 00 00 00 04 00 00 00 00 CD CD CD .:G..... ........
00 00 00 00 00 00 00 00 00 CD CD CD 27 00 00 00 ........ ....'...
Visual Leak Detector detected 2 memory leaks (21900 bytes).
Largest number used: 82245 bytes.
Total allocations: 287293 bytes.
Visual Leak Detector is now exiting.
I have fixed this in CSelector* CParser::parseSelector():
CSelector* ret_old = ret; // <-----
CSelector* sel = parseSimpleSelectorSequence();
// ....
else if (combinator == '~')
{
ret = new CBinarySelector(ret, sel, true);
}
else
{
throw error("impossible");
}
ret_old->release(); // <---
sel->release(); // <---
But I'm not sure if this is the correct way to fix it.
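The reasoning behind such a fix can be sketched with a tiny intrusive reference-counting model. It assumes, as the proposed change implies but this thread does not confirm, that a combining selector retains both operands; RefCounted and Pair are hypothetical stand-ins for CObject and CBinarySelector.

```cpp
// Hypothetical intrusive reference count: objects start with one reference
// owned by whoever called new, and delete themselves when it drops to zero.
class RefCounted
{
public:
    static int alive;
    RefCounted() : refs_(1) { ++alive; }
    void retain() { ++refs_; }
    void release() { if (--refs_ == 0) delete this; }
protected:
    virtual ~RefCounted() { --alive; }
private:
    int refs_;
};

int RefCounted::alive = 0;

// Hypothetical combining node that takes its own reference to both operands.
class Pair : public RefCounted
{
public:
    Pair(RefCounted* a, RefCounted* b) : a_(a), b_(b) { a_->retain(); b_->retain(); }
protected:
    ~Pair() override { a_->release(); b_->release(); }
private:
    RefCounted* a_;
    RefCounted* b_;
};

// Builds a pair, drops it, and reports how many objects are still alive.
int buildAndDestroy(bool releaseLocals)
{
    RefCounted* left = new RefCounted();
    RefCounted* right = new RefCounted();
    RefCounted* pair = new Pair(left, right);
    if (releaseLocals)
    {
        // The creator gives up its own references once the pair holds its own.
        left->release();
        right->release();
    }
    pair->release();
    return RefCounted::alive;
}
```

With releaseLocals set to true everything is freed; with false, the two operands stay alive with one reference each, which is the shape of leak the reports above describe.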
So, going over the code for parsing selectors, there's the method parseEscape(...), which appears to be intended to convert attribute values that contain escaped characters and code points to literal values. However, from some quick tests embedding such values in HTML, parsing, then matching, it appears that gumbo_parser embeds these values directly, untouched. So if I'm right, unescaping these sequences in supplied selectors will actually cause selectors that should match to not match.
I'm not familiar with the Go html package, but from what searching I've done, it appears that this might have been a requirement for cascadia (since it worked against the Go html package), but is invalid for use with gumbo_parser. On that note, gumbo_parser does appear to convert numeric character references and named character references, but gumbo_query of course doesn't, so matching a selector with an id containing something like " " will fail.
Will try to post some tests to confirm, I'm just reading over all of this figuring it out.
References:
http://www.w3.org/International/questions/qa-escapes
https://mathiasbynens.be/notes/css-escapes
http://www.w3.org/TR/html5/syntax.html#named-character-references
─$ brew install gumbo-query
Updating Homebrew...
Error: No available formula with the name "gumbo-query"
==> Searching for a previously deleted formula (in the last month)...
Error: No previously deleted formula found.
==> Searching for similarly named formulae...
Error: No similarly named formulae found.
==> Searching taps...
==> Searching taps on GitHub...
Error: No formulae found in taps.
Hi,
How can I get the outer HTML (not just the inner text) of a specific tag?
For example, if I want to get the div tag with class="content" from this HTML source:
<html>
<head>
<title>My Title</title>
<meta content="">
<style></style>
</head>
<body>
<h3>First header</h3>
<p>text text text</p>
<div class="content">
<h3>My Text 0<a href="https://www.google.com">The site of google</a></h3>
</div>
<div>
My Text 1
<div class="cls">My Text 2
<h1>My Text 3</h1>
</div>
</div>
</body>
</html>
The result I want to extract:
<div class="content">
<h3>My Text 0<a href="https://www.google.com">The site of google</a></h3>
</div>
Thanks.
It is common practice for C++ libraries to use a namespace to avoid naming collisions.
For example, CDocument is a very common class name in C++ projects. It could be namespace GumboQuery { or something like this.
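A minimal sketch of the suggestion, using the namespace name proposed above and placeholder class bodies (the real classes are not reproduced here):

```cpp
#include <string>

// Library side: the class lives inside a namespace (name taken from the suggestion).
namespace GumboQuery
{
    class CDocument
    {
    public:
        std::string origin() const { return "library"; }
    };
}

// Application side: a class with the same unqualified name no longer collides.
class CDocument
{
public:
    std::string origin() const { return "application"; }
};
```

Existing users could keep their current spelling with a using-declaration, at the cost of reintroducing the short name into their own scope.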