Comments (6)
I think the tidy step is somewhat valuable (maybe we could even extend it to remove `img` without `src`), so I would add `source` and other void elements to the list. It is not like new HTML elements appear that often.
Perhaps the Medium page is an XHTML5 page? [Edit: Never mind, they do not close the `img`.] For an XML parser, self-termination should be equivalent to no content followed by a closing tag. I verified this with the validator on the following document:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test void element with closing tag</title>
</head>
<body>
<picture>
<source srcset="[…]" type="image/webp"></source>
<source data-testid="og" srcset="[…]"></source>
<img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation" />
</picture>
</body>
</html>
It needs to be uploaded as an XHTML file; direct input does not detect XML correctly.
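The equivalence claimed above can also be checked outside the validator. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, not Graby's own code; the two sample strings and the helper name `describe` are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical samples: the same <source>, once self-terminated and once
# written as an empty element followed by a closing tag.
SELF_TERMINATED = f'<picture xmlns="{XHTML_NS}"><source type="image/webp"/></picture>'
EXPLICITLY_CLOSED = f'<picture xmlns="{XHTML_NS}"><source type="image/webp"></source></picture>'


def describe(element):
    """Reduce an element to a comparable tuple: tag, attributes, text, children."""
    return (
        element.tag,
        dict(element.attrib),
        element.text or "",  # treat missing text and empty text alike
        [describe(child) for child in element],
    )


same_tree = describe(ET.fromstring(SELF_TERMINATED)) == describe(ET.fromstring(EXPLICITLY_CLOSED))
print(same_tree)
```

An XML parser builds the same tree for both spellings, so the difference only matters once the markup is handled by an HTML parser, where `</source>` is not expected.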
from graby.
> maybe we could even extend it to remove `img` without `src`
Removing `img` without a `src` attribute would break `picture` tags containing it, even if `source` tags are present; see wallabag/wallabag#6414 (comment)
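To make the concern concrete, here is a rough sketch of a guard that only flags `img` elements without `src` when they are not direct children of a `picture`. It is written in Python with `xml.etree.ElementTree` purely for illustration; the function name `removable_images` and the sample markup are assumptions, and nothing here reflects how Graby actually implements its tidy step.

```python
import xml.etree.ElementTree as ET


def removable_images(root):
    """Return <img> elements that lack src and are not direct children of <picture>."""
    found = []
    for parent in root.iter():
        for child in parent:
            if (
                child.tag == "img"
                and "src" not in child.attrib
                and parent.tag != "picture"  # the <img> in <picture> is the required fallback
            ):
                found.append(child)
    return found


# Hypothetical sample: only the first <img> should be flagged.
sample = ET.fromstring(
    '<div>'
    '<img alt="a"/>'
    '<picture><source srcset="x.webp"/><img alt="b"/></picture>'
    '<img alt="c" src="c.png"/>'
    '</div>'
)
flagged = [img.get("alt") for img in removable_images(sample)]
print(flagged)
```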
from graby.
I think the fastest fix is to add `source` to the list.
About adding more elements to the list, is that one enough? https://developer.mozilla.org/en-US/docs/Glossary/Void_element
Funny, I didn't see the `iframe` element in the list.
from graby.
Oh, I see what Kevin means now. The filter cleans up all elements without content, even when they are meaningful. This includes void elements but is not exclusive to them. For example, empty `td`s are needed for empty table cells because removing them would jumble the table. And while we could enumerate void elements quite easily, the non-void ones are much harder, since there is probably no official list of elements that are meaningful without content.
Kevin mentioned two options: continuing to extend the blacklist, or removing the filter completely. Alternatively, we could switch to a whitelist of elements to remove when empty (e.g. `p|span|font`). That would give us at least some tidiness. We could also log all other empty elements not in the whitelist, which would let us grow coverage over time.
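The whitelist idea could look roughly like the sketch below. It is Python with `xml.etree.ElementTree`, chosen only for illustration and not taken from Graby; `REMOVE_WHEN_EMPTY`, `is_empty`, and `prune_empty` are hypothetical names. Empty elements outside the whitelist are kept and reported back so the caller can log them.

```python
import xml.etree.ElementTree as ET

# Elements that are safe to drop when empty, taken from the example above.
REMOVE_WHEN_EMPTY = {"p", "span", "font"}


def is_empty(element):
    """An element is empty if it has no children and no non-whitespace text."""
    return len(element) == 0 and not (element.text or "").strip()


def prune_empty(root):
    """Remove empty whitelisted elements in a single pass.

    Returns the tag names of empty elements that were kept, for logging.
    """
    kept = []
    for parent in list(root.iter()):
        for child in list(parent):
            if not is_empty(child):
                continue
            if child.tag not in REMOVE_WHEN_EMPTY:
                kept.append(child.tag)
                continue
            # Preserve text that follows the removed element.
            if child.tail:
                index = list(parent).index(child)
                if index == 0:
                    parent.text = (parent.text or "") + child.tail
                else:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + child.tail
            parent.remove(child)
    return kept


doc = ET.fromstring(
    "<div><p></p><table><tr><td></td><td>a<span></span>b</td></tr></table></div>"
)
kept_tags = prune_empty(doc)
print(kept_tags)
```

The empty `td` survives and shows up in the returned list, while the empty `p` and `span` are dropped. Because this is a single pass, a parent that only becomes empty after its children are removed is left in place.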
But the choice of action depends on the goals of Graby – do we want to preserve content even when it might be a mess, or do we want a clean content model at the cost of it being potentially incomplete?
from graby.