Comments (7)
This is not possible for complex Turtle. But for a one-triple-per-line format with escaped \n, skipping the bad line and continuing would be nice.
Overall, it has taken me several weeks to get HDT working for DBpedia, i.e. merging several bz2 files into one .gz and compiling the develop branch.
Then I had to rezip everything again to let rapper filter out bad triples, which was just an extra, unnecessary step.
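To make the request concrete, here is a minimal sketch of line-level recovery for a one-statement-per-line format. It is not how serd or rapper work internally: the regular expression is a deliberately simplified shape check rather than the full N-Triples grammar, and the names `TRIPLE` and `good_statements` are made up for illustration.

```python
import re

# Simplified term patterns (assumption: no \uXXXX escapes inside IRIs,
# no N-Quads graph field). This is a shape check, not a real grammar.
IRI = r'<[^<>"{}|^`\\\x00-\x20]*>'
BNODE = r'_:[A-Za-z0-9_][A-Za-z0-9_.-]*'
LITERAL = (r'"(?:[^"\\\n\r]|\\.)*"'
           r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')
TRIPLE = re.compile(
    r'^\s*(?:%s|%s)\s*%s\s*(?:%s|%s|%s)\s*\.\s*(?:#.*)?$'
    % (IRI, BNODE, IRI, IRI, BNODE, LITERAL)
)


def good_statements(lines, skipped):
    """Yield lines that look like one triple; record the rest in `skipped`."""
    for lineno, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # blank lines and comments carry no statement
        if TRIPLE.match(text):
            yield text
        else:
            skipped.append(lineno)  # skip the bad line and keep going


sample = [
    '<http://example.org/s> <http://example.org/p> "ok" .',
    '<http://example.org/s> <http://example.org/p> "unterminated .',
    '# comment',
    '<http://example.org/s> <http://example.org/p> <http://example.org/o> .',
    '<http://example.org/s> <http://example.org/p> "line\\nbreak"@en .',
    '<http://example.org/bad iri> <http://example.org/p> "x" .',
]
skipped = []
good = list(good_statements(sample, skipped))
```

Because the function is a generator, it can sit between a decompressing reader and the next tool without holding the whole dump in memory; only the skipped line numbers accumulate.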
from serd.
@kurzum I can feel your pain having to deal with dirty data! Linked Data Quality is an important topic.
WRT parser fallback strategies, the ClioPatria parser collection has fallback strategies for N-Triples and N-Quads, but also for Turtle and RDF/XML (and maybe for RDFa as well, but I'm not sure). This feature is very important in LOD Laundromat, where the vast majority of datasets on the web turn out to be syntactically malformed.
from serd.
Hm. I'm not sure. I sympathise, but this strikes me as a never-ending rabbit hole... that said, I'll look into these heuristics. If something relatively simple to implement works well and isn't problematic I will give it a shot.
from serd.
There are already some lax parsing facilities in serd and an existing option for that, so statement-level recovery could latch on to that. Worth noting that anything that requires backtracking can never work in serd (which is strictly a streaming parser), though, so a "perfectly" malformed file could probably cause massive chunks of the file to be skipped. Some condensed real test cases would definitely help with this if you could provide any.
from serd.
@wouterbeek can point you to the dirty and clean LOD Laundromat files.
Skipping lines until a new well-formed statement appears covers 90% of use cases for N-Triples, N-Quads and maybe Turtle.
Btw, even in a streaming parser, keeping a window of the previous 5 lines enables some backtracking.
Not saying that it is worth the effort....
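One possible reading of the window idea, as a rough sketch only: keep the last few lines that failed on their own in a bounded buffer and, on each new failure, try rejoining them with an escaped newline (a literal broken by a raw line break is a common kind of dirt). The checker `looks_like_triple` is a crude hypothetical stand-in for a real parser, and `recover_with_window` is an illustrative name, not anything in serd.

```python
import re
from collections import deque


def looks_like_triple(text):
    # Crude stand-in for a real parser (assumption): plausible start,
    # plausible end, and an even number of unescaped double quotes.
    quotes = len(re.findall(r'(?<!\\)"', text))
    return text.startswith("<") and text.endswith(" .") and quotes % 2 == 0


def recover_with_window(lines, is_statement, window=5):
    """Yield statements, rejoining up to `window` broken lines if that helps."""
    pending = deque(maxlen=window)  # bounded, so memory use stays constant
    for line in lines:
        text = line.rstrip("\n")
        if is_statement(text):
            pending.clear()  # anything still pending is given up as garbage
            yield text
            continue
        pending.append(text)
        # Look back over the window, trying the longest join first.
        for k in range(len(pending), 1, -1):
            candidate = "\\n".join(list(pending)[-k:])
            if is_statement(candidate):
                pending.clear()
                yield candidate
                break


lines = [
    '<s> <p> "first" .',
    '<s> <p> "broken',
    'across lines" .',
    'garbage',
    '<s> <p> <o> .',
]
recovered = list(recover_with_window(lines, looks_like_triple))
```

The look-back is limited to `window` lines, so this stays a streaming approach; anything broken across more lines than that is simply dropped, as with plain line skipping.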
from serd.
Commit 8d954ab skips to the next line when lax parsing. This works pretty well for the line-based formats (it gets me through sketchy DBpedia dumps, anyway), but it will definitely fail horribly in many cases for abbreviated syntaxes. Doing that well will require more work, an actual test suite, and so on.
from serd.
Thank you for implementing the skip functionality for lax parsing mode.
from serd.