drobilla / serd

A lightweight C library for RDF syntax

Home Page: https://gitlab.com/drobilla/serd

License: ISC License

C 86.29% Python 7.48% Meson 6.23%
rdf turtle ntriples nquads semantic-web parser serializer

serd's Introduction

Serd

Serd is a lightweight C library for working with RDF data.

Serd can be used by high-performance or resource-limited applications to read or write Turtle, TriG, NTriples, and NQuads. The included serdi tool can be used to efficiently process RDF documents in scripts or on the command-line.

Features

  • Free: Serd is Free Software released under the extremely liberal ISC license.

  • Portable and Dependency-Free: Serd has no external dependencies other than the C standard library. It is known to compile with Clang, GCC, and MSVC, and is tested on GNU/Linux, FreeBSD, MacOS, and Windows.

  • Small: Serd is implemented in a few thousand lines of C. When optimized, it compiles to well under 100 KiB.

  • Fast and Lightweight: Serd can stream abbreviated Turtle, unlike many tools which must first build an internal model. This makes it particularly useful for writing very large data sets, since it can do so using only a small amount of memory. Serd is, to the author's knowledge, the fastest Turtle reader/writer by a wide margin (see Performance below).

  • Conformant and Well-Tested: Serd passes all tests in the Turtle and TriG test suites, correctly handles all "normal" examples in the URI specification, and includes many additional tests which were written manually or discovered with fuzz testing. The test suite is run continuously on many platforms, has 100% code coverage by line, and runs with zero memory errors or leaks. Code quality is continuously checked statically by clang-tidy, and dynamically by various clang sanitizers.

Performance

The benchmarks below compare serdi, rapper, and riot re-serialising Turtle data generated by sp2b on an AMD 1950x. Of the three, serdi is the fastest by a wide margin, and the only one that uses a constant amount of memory for all input sizes.

[Figures: throughput, time, and memory plots]

Documentation

Versioning

Serd uses strict semantic versioning, which reflects the ABI of the C library. The shared library name, include directory, and pkg-config file are all suffixed with the major version number to allow for parallel installation of several major versions (which distribution packages should preserve). To build against serd, use the pkg-config package serd-0:

pkg-config --cflags --libs serd-0

-- David Robillard

serd's People

Contributors

artob, drobilla, johannes-mueller, nick87720z, tartina, tpt

serd's Issues

Serd parses prefixed IRIs that contain illegal Unicode characters

Serd parses prefixed IRIs that contain illegal Unicode characters in their local name.

For example, the following Turtle snippet appears in an actual data file (notice that the dash in 2006–08 is the illegal Unicode character EN DASH, U+2013):

@prefix dbp: <http://dbpedia.org/property/> .
@prefix dbr: <http://dbpedia.org/resource/> .
dbr:Germany_at_the_2006–08_European_Nations_Cup dbp:stadium     dbr:Amsterdam .

Serdi parses this snippet, but it should raise an error:

serdi unicode.ttl
<http://dbpedia.org/resource/Germany_at_the_2006\u201308_European_Nations_Cup> <http://dbpedia.org/property/stadium> <http://dbpedia.org/resource/Amsterdam> .

Tested with Serd 0.28.0.

serd 0.30.8 build failure on mojave and catalina

👋 Trying to build the latest release, but I ran into a build issue. The error log is below:

serd 0.30.8 test failure on mojave
==> serdi -
dyld: lazy symbol binding failed: Symbol not found: _aligned_alloc
  Referenced from: /usr/local/Cellar/serd/0.30.8/lib/libserd-0.0.dylib
  Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: _aligned_alloc
  Referenced from: /usr/local/Cellar/serd/0.30.8/lib/libserd-0.0.dylib
  Expected in: /usr/lib/libSystem.B.dylib

serd 0.30.8 build failure on catalina
../src/serd_config.h:32:9: warning: 'SERD_VERSION' macro redefined [-Wmacro-redefined]
#define SERD_VERSION "0.30.7"
        ^
<command line>:2:9: note: previous definition is here
#define SERD_VERSION "0.30.8"
        ^
../src/system.c:59:10: error: implicit declaration of function 'aligned_alloc' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
  return aligned_alloc(alignment, size);
         ^
../src/system.c:59:10: warning: incompatible integer to pointer conversion returning 'int' from a function with result type 'void *' [-Wint-conversion]
  return aligned_alloc(alignment, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 warnings and 1 error generated.

In file included from ../src/serdi.c:19:
../src/serd_config.h:32:9: warning: 'SERD_VERSION' macro redefined [-Wmacro-redefined]
#define SERD_VERSION "0.30.7"
        ^
<command line>:2:9: note: previous definition is here
#define SERD_VERSION "0.30.8"
        ^
1 warning generated.

The full build log is here: https://github.com/Homebrew/homebrew-core/runs/1678765724
Relates to Homebrew/homebrew-core#68697
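
The missing symbol above is the C11 function aligned_alloc, which older macOS system libraries do not provide. One common way to avoid depending on it is to fall back to posix_memalign when a configure check does not find aligned_alloc. The following is only a sketch of that idea for POSIX systems, not serd's actual fix; the function name and the EXAMPLE_HAVE_ALIGNED_ALLOC macro are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#  define _POSIX_C_SOURCE 200809L /* Request the posix_memalign declaration */
#endif

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical feature macro: a build system would define this only when
   its configure check finds aligned_alloc. It is left undefined here, so
   the posix_memalign branch is the one that is compiled. */
/* #define EXAMPLE_HAVE_ALIGNED_ALLOC 1 */

/* Allocate `size` bytes aligned to `alignment`, which must be a power of
   two and a multiple of sizeof(void*). Returns NULL on failure. The result
   is released with free(). */
void*
example_aligned_alloc(size_t alignment, size_t size)
{
#ifdef EXAMPLE_HAVE_ALIGNED_ALLOC
  return aligned_alloc(alignment, size);
#else
  void* ptr = NULL;
  return posix_memalign(&ptr, alignment, size) == 0 ? ptr : NULL;
#endif
}
```

Selecting the branch at configure time keeps the call site identical on every platform, and the returned pointer can be passed to free() in both cases.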

Compile failure on OSX (gcc) due to deprecated attributes message

The optional message is a GNU extension that is only available with modern gcc.
Preferably, serd would fall back to __attribute__((deprecated)) with gcc-4.

Setting top to                                : /Users/ardour/src/stack/serd-0.30.7
Setting out to                                : /Users/ardour/src/stack/serd-0.30.7/build
Checking for 'clang' (C compiler)             : not found
Checking for 'gcc' (C compiler)               : gcc
Checking for flag '-std=c11'                  : no
Checking for flag '-std=c99'                  : yes
Checking for aligned_alloc                    : no
Checking for fileno                           : yes
Checking for posix_fadvise                    : no
Checking for posix_memalign                   : yes
Install prefix                                : /Users/ardour/gtk/inst
C Flags                                       : -DMAC_OS_X_VERSION_MAX_ALLOWED=1050 -mmacosx-version-min=10.5 -O3 -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -fshow-column -std=c99
Debuggable                                    : no
Build documentation                           : no
Build unit tests                              : no
Build shared library                          : yes
Build static library                          : no
Build utilities                               : yes
'configure' finished successfully (1.324s)
Build commands will be stored in build/compile_commands.json
Waf: Entering directory `/Users/ardour/src/stack/serd-0.30.7/build'
[ 1/15] Compiling src/reader.c
[ 2/15] Compiling src/byte_source.c
[ 3/15] Compiling src/node.c
[ 4/15] Compiling src/base64.c
In file included from ../src/byte_source.h:20,
                 from ../src/byte_source.c:17:
../include/serd/serd.h:367: error: wrong number of arguments specified for 'deprecated' attribute

In file included from ../src/byte_source.h:20,
                 from ../src/reader.h:20,
                 from ../src/reader.c:17:
../include/serd/serd.h:367: error: wrong number of arguments specified for 'deprecated' attribute

In file included from ../src/base64.h:20,
                 from ../src/base64.c:17:
../include/serd/serd.h:367: error: wrong number of arguments specified for 'deprecated' attribute

In file included from ../src/node.h:20,
                 from ../src/node.c:17:
../include/serd/serd.h:367: error: wrong number of arguments specified for 'deprecated' attribute

Waf: Leaving directory `/Users/ardour/src/stack/serd-0.30.7/build'
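
The fallback requested in this issue could look roughly like the sketch below. It assumes that the message argument became available in GCC 4.5 and that clang accepts it, and it uses a hypothetical macro name (EXAMPLE_DEPRECATED_BY) rather than whatever serd.h actually defines.

```c
/* Hypothetical macro, not serd's actual definition: attach a deprecation
   message only where the compiler is assumed to accept one. Clang is
   checked first because it reports itself as an old GCC version. */
#if defined(__clang__)
#  define EXAMPLE_DEPRECATED_BY(msg) __attribute__((deprecated(msg)))
#elif defined(__GNUC__) && \
  (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5))
#  define EXAMPLE_DEPRECATED_BY(msg) __attribute__((deprecated(msg)))
#elif defined(__GNUC__)
#  define EXAMPLE_DEPRECATED_BY(msg) __attribute__((deprecated))
#else
#  define EXAMPLE_DEPRECATED_BY(msg)
#endif

/* Replacement function that callers are expected to migrate to. */
int
example_new_function(int x)
{
  return x * 2;
}

/* Old entry point: calling it produces a deprecation warning, which
   includes the message where the compiler supports that. */
EXAMPLE_DEPRECATED_BY("Use example_new_function")
int
example_old_function(int x);

int
example_old_function(int x)
{
  return example_new_function(x);
}
```

With this arrangement, an old gcc still sees a plain deprecated attribute and compiles the header, while newer compilers show the migration hint.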

Build error

alex@alextone:~/github/serd$ ./waf configure --prefix=/usr
Traceback (most recent call last):
  File "./waf", line 5, in <module>
    from waflib import Context, Scripting
ImportError: No module named waflib

Reports syntax error on blank node statements

$ echo '[ <urn:test:p> <urn:test:o1> ].' | serdi -
_:b1 <urn:test:p> <urn:test:o1> .
error: (stdin):1:32: syntax error

However, parsing is correct and seems to continue:

$ echo '[ <urn:test:p> <urn:test:o1> ]. <urn:test:a> <urn:test:b> <urn:test:c>.' | serdi -
_:b1 <urn:test:p> <urn:test:o1> .
error: (stdin):1:32: syntax error
<urn:test:a> <urn:test:b> <urn:test:c> .

Error parsing 'a' without whitespace

The following Turtle file is not parsed, possibly because the a appears with no whitespace surrounding it:

[a<>].
[a[]].
[a()].
<>a<>.
<>a[].
<>a().

This is the error message that I get:

$ serdi -i turtle shortest-triple.ttl 
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <> .
error: shortest-triple.ttl:2:2: bad verb
error: shortest-triple.ttl:2:2: expected `]', not `['
error: shortest-triple.ttl:2:2: bad verb
error: shortest-triple.ttl:2:4: bad verb
error: shortest-triple.ttl:2:4: bad subject

How to apply a base URI?

I'm trying to add support for a base URI flag in HDT (rdfhdt/hdt-cpp#131).

Firstly, I have a SerdEnv to which I can set and get a base URI:

const SerdNode* envBase = serd_env_get_base_uri(env, nullptr);
SerdURI base_uri;

Secondly, I have a SerdNode* term which is a relative IRI, but I cannot obtain the corresponding absolute IRI:

SerdNode base {serd_node_new_uri_from_string(envBase->buf, nullptr, &base_uri)};
SerdNode iri {serd_node_new_uri_from_string(term->buf, &base_uri, nullptr)};

When I print iri.buf it is still the same as term->buf. I must be doing something wrong...

Parsing erroneous literals containing double quotes

The following is illegal in Turtle-based languages:

<a:a> <a:a> """^^<b:b> .
<c:c> <c:c> """^^<b:b> .

But Serd 0.29.2 parses this, resulting in an output that is also illegal Turtle:

$ serdi test.ttl
<a:a> <a:a> "^^<b:b> .\n<c:c> <c:c> "^^<b:b> .

[master/0.30.16] Static build (-Dstatic=true) fails with link error: attempted static link of dynamic object `libserd-0.so.0.31.0'

Hi, happy new year!

This was brought to my attention and I just verified the issue locally. Here's how to reproduce:

# cd "$(mktemp -d)"

# git clone --depth 1 https://github.com/drobilla/serd
Cloning into 'serd'...
remote: Enumerating objects: 1145, done.
remote: Counting objects: 100% (1145/1145), done.
remote: Compressing objects: 100% (760/760), done.
remote: Total 1145 (delta 239), reused 1027 (delta 219), pack-reused 0
Receiving objects: 100% (1145/1145), 424.59 KiB | 3.27 MiB/s, done.
Resolving deltas: 100% (239/239), done.

# cd serd/

# meson -Dstatic=true ./build && ninja -C ./build
The Meson build system
Version: 1.0.0
Source dir: /tmp/tmp.TS6RWN6Ff5/serd
Build dir: /tmp/tmp.TS6RWN6Ff5/serd/build
Build type: native build
Project name: serd
Project version: 0.31.0
C compiler for the host machine: cc (gcc 11.3.1 "cc (Gentoo 11.3.1_p20221223 p5) 11.3.1 20221223")
C linker for the host machine: cc ld.bfd 2.39
Host machine cpu family: x86_64
Host machine cpu: x86_64
Library m found: YES
Program doxygen found: YES (/usr/bin/doxygen)
Program sphinxygen found: NO
Program sphinx-build found: YES (/usr/bin/sphinx-build)
Program mandoc found: NO
Downloading sphinxygen source from https://download.drobilla.net/sphinxygen-1.0.0.tar.gz
Downloading: 100%|█████████████████████████████████████| 13894/13894 [00:00<00:00, 101804.17bytes/s]

Executing subproject sphinxygen 

sphinxygen| Project name: sphinxygen
sphinxygen| Project version: 0.0.1
sphinxygen| Program python3 (argparse, textwrap, xml.etree.ElementTree) found: YES (/usr/bin/python3) modules: argparse, textwrap, xml.etree.ElementTree
sphinxygen| Program sphinxygen found: YES (/tmp/tmp.TS6RWN6Ff5/serd/subprojects/sphinxygen-1.0.0/src/sphinxygen/sphinxygen.py)
sphinxygen| Build targets in project: 10
sphinxygen| Subproject sphinxygen finished.

Program sphinxygen found: YES (overridden)
Program python3 (sphinx_lv2_theme) found: NO
doc/c/meson.build:13: WARNING: Missing sphinx_lv2_theme module, falling back to alabaster
Configuring conf.py using configuration
Program doxygen found: YES (/usr/bin/doxygen)
Configuring Doxyfile using configuration
Program stylelint found: NO
Build targets in project: 14
NOTICE: Future-deprecated features used:
 * 0.64.0: {'copy arg in configure_file'}

serd 0.31.0

    API Documentation: YES
    Tests            : YES
    Tools            : YES
    Install prefix   : /usr/local
    Headers          : /usr/local/include
    Libraries        : /usr/local/lib64
    Executables      : /usr/local/bin
    Man pages        : /usr/local/share/man

  Subprojects
    sphinxygen       : YES

  User defined options
    static           : true

Found ninja-1.11.1 at /usr/bin/ninja
WARNING: Running the setup command as `meson [options]` instead of `meson setup [options]` is ambiguous and deprecated.
ninja: Entering directory `./build'
[24/34] Linking target serdi
FAILED: serdi 
cc  -o serdi serdi.p/src_serdi.c.o -Wl,--as-needed -Wl,--no-undefined -Wl,-O1 '-Wl,-rpath,$ORIGIN/' -Wl,-rpath-link,/tmp/tmp.TS6RWN6Ff5/serd/build/ -Wl,--start-group libserd-0.so.0.31.0 -Wl,--end-group -static
/usr/lib/gcc/x86_64-pc-linux-gnu/11/../../../../x86_64-pc-linux-gnu/bin/ld: attempted static link of dynamic object `libserd-0.so.0.31.0'
collect2: error: ld returned 1 exit status
[29/34] Generating doc/c/singlehtml with a custom command
ninja: build stopped: subcommand failed.

Any ideas for a fix?

Thanks and best, Sebastian

Digits in URI scheme result in parsing errors

The following is not parsed correctly by Serd 0.26.0 when run as serdi -i ntriples <FILE>:

<a:b> <a:b> <a2:b>

This gives the following error:

syntax does not support relative IRIs
Catch exception load: Error parsing input.

But there are no relative IRIs in the above content. Specifically, the digit (2) in the third IRI's scheme component causes the problem. According to URI/IRI grammars, non-initial characters of scheme components are allowed to be decimal digits.

The same content does parse correctly with serdi -i turtle <FILE>.
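
The scheme rule referred to above (RFC 3986: a letter followed by any number of letters, digits, "+", "-", or ".") can be sketched as a small check. This is an illustration with hypothetical function names, not serd's reader code, and it assumes ASCII input in the default "C" locale.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Return the length of the scheme (excluding the ':') if `uri` starts with
   a valid RFC 3986 scheme followed by ':', otherwise return 0. */
size_t
example_scheme_length(const char* uri)
{
  if (!isalpha((unsigned char)uri[0])) {
    return 0; /* A scheme must start with a letter */
  }

  size_t i = 1;
  while (isalnum((unsigned char)uri[i]) || uri[i] == '+' || uri[i] == '-' ||
         uri[i] == '.') {
    ++i; /* Digits are allowed after the first character */
  }

  return uri[i] == ':' ? i : 0;
}

/* A reference with a scheme is absolute rather than relative. */
bool
example_has_scheme(const char* uri)
{
  return example_scheme_length(uri) > 0;
}
```

Under this check, a2:b has the two-character scheme "a2" and so is not a relative IRI, whereas 2a:b has no valid scheme.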

Does serdi support named pipe input/output?

Sample bash program:

mkfifo in
mkfifo out
(serdi -i "Nquads" -o "NTriples" in > out) &
write_to in &    # write_to is any program capable of writing to a named pipe
read_from out &  # read_from is any program capable of reading from a named pipe

Add streaming support for .gz and .bz2 format input / output files

I'm finding very large turtle/triple datasets may be best kept in compressed form, if only to not be throttled by disk I/O max speeds while reading/writing them with a blazing fast library. Two common compressions I'm running into with triples store data are gz and bzip2. Would you consider adding the ability to stream out and back into compressed forms? There are libraries for Python to make it easy for both formats, so I'm hoping the same might be true here?

Resolution for base URIs with empty path

When the base URI's path is empty (e.g., https://a.org), resolution of URIs gives incorrect results. E.g., relative URI b resolves to absolute URI https://a.orgb, whereas https://a.org/b was expected.
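
The expected result follows from the path merging rule in RFC 3986 (section 5.2.3): if the base has an authority and an empty path, the merged path is "/" followed by the reference. A minimal sketch of that rule is shown below. The function name and signature are hypothetical, and dot-segment removal is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Merge a relative-path reference with a base path (RFC 3986, 5.2.3).
   Writes the merged path to `out` and returns false if it does not fit. */
bool
example_merge_paths(bool        base_has_authority,
                    const char* base_path,
                    const char* ref_path,
                    char*       out,
                    size_t      out_size)
{
  int n = 0;
  if (base_has_authority && base_path[0] == '\0') {
    /* Empty base path with an authority: the result is rooted at "/" */
    n = snprintf(out, out_size, "/%s", ref_path);
  } else {
    /* Otherwise keep the base path up to and including its last '/' */
    const char* last_slash = strrchr(base_path, '/');
    const int   prefix_len =
      last_slash ? (int)(last_slash - base_path + 1) : 0;

    n = snprintf(out, out_size, "%.*s%s", prefix_len, base_path, ref_path);
  }

  return n >= 0 && (size_t)n < out_size;
}
```

For the base https://a.org (authority a.org, empty path) and the reference b, this yields the path /b, which gives https://a.org/b once the scheme and authority are prepended.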

serd_env_expand and SERD_ERR_BAD_CURIE

serd_env_expand will return SERD_ERR_NOT_FOUND if the prefix does not exist / was not found, and yet the enum doc for SERD_ERR_BAD_CURIE says: "Invalid CURIE (e.g. prefix does not exist)". Not a huge deal, but a bit redundant/confusing? Just intuitively, after reading the docs, I would expect serd_env_expand to return the 'bad curie' enum value instead of the 'not found' one.

Write canonical NTriples 1.1 by default

(Edited) The output does not appear to be UTF-8; is this a bug? I thought UTF-8 would be the default, given that there is an option to "Write ASCII output if possible".

Example:

source triple from dbpedia/article-templates_lang=en_nested.ttl
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested.ttl
article-templates_lang=en_nested.ttl: UTF-8 Unicode text

serdi output:
<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested-serdi.nt
article-templates_lang=en_nested-serdi.nt: ASCII text, with very long lines

apache jena riot output:
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested.ttl.bz2-riot.nt
article-templates_lang=en_nested.ttl.bz2-riot.nt: UTF-8 Unicode text

Spec Reference:
https://www.w3.org/TR/n-triples/#canonical-ntriples

Note: At first I thought maybe this was a BOM related rendering/display issue, but file would reveal if there is a BOM, and the same tools were used to find and display the examples above...
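
The difference between the two outputs above comes down to how a single non-ASCII character is written: as raw UTF-8 bytes, or as an ASCII \u or \U escape. The sketch below illustrates both choices. It is not serd's writer code, the function name is hypothetical, and it does not reject surrogate code points.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write code point `c` to `out`, which must hold at least 11 bytes, either
   as raw UTF-8 or, if `ascii` is set and `c` is outside ASCII, as a \uXXXX
   or \UXXXXXXXX escape. Returns the number of bytes written before the
   terminating null, or 0 if `c` is beyond U+10FFFF. */
size_t
example_write_char(uint32_t c, bool ascii, char* out)
{
  out[0] = '\0';
  if (c > 0x10FFFF) {
    return 0;
  }

  if (c < 0x80) {
    out[0] = (char)c;
    out[1] = '\0';
    return 1;
  }

  if (ascii) {
    return (size_t)(c <= 0xFFFF
                      ? snprintf(out, 11, "\\u%04X", (unsigned)c)
                      : snprintf(out, 11, "\\U%08X", (unsigned)c));
  }

  size_t n = 0;
  if (c < 0x800) {
    out[n++] = (char)(0xC0 | (c >> 6));
  } else if (c < 0x10000) {
    out[n++] = (char)(0xE0 | (c >> 12));
    out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
  } else {
    out[n++] = (char)(0xF0 | (c >> 18));
    out[n++] = (char)(0x80 | ((c >> 12) & 0x3F));
    out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
  }
  out[n++] = (char)(0x80 | (c & 0x3F)); /* Final continuation byte */
  out[n]   = '\0';
  return n;
}
```

For U+00E9 (é), the ASCII mode produces the six characters \u00E9 seen in the serdi output, while the UTF-8 mode produces the two bytes C3 A9 seen in the riot output.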

Unable to parse triple-quoted literal

The following triple is valid according to rule 25 in the Turtle grammar:

[]a""""""".

But Serdi emits the following error when parsing:

_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "" .
error: /home/wouter/tmp/test.ttl:1:10: missing ';' or '.'

This was observed with the last commit on the master branch, but may appear in earlier commits/versions as well.

Version >= 0.30.16 writes faulty syntax for tuples in TTL files

To reproduce:

  1. jalv.gtk3 http://lsp-plug.in/plugins/lv2/impulse_responses_mono
  2. Load an IR file.
  3. Create a new preset.
  4. Close jalv and restart with above command.
  5. Try to load preset. It will not load correctly, i.e. the IR file will not be loaded.

If you examine the <name>.ttl file in the preset bundle, you'll find something like this at the end of the file:

...
state:state [
		<http://lsp-plug.in/plugins/lv2/impulse_responses_mono/KVT> [
			a atom:Tuple ;
			rdf:value ()
		]<http://lsp-plug.in/plugins/lv2/impulse_responses_mono/ports#ifn> <my-ir.wav> .

... which is not valid syntax. It should be

...
state:state [
	<http://lsp-plug.in/plugins/lv2/impulse_responses_mono/KVT> [
		a atom:Tuple ;
		rdf:value ()
	] ;
	<http://lsp-plug.in/plugins/lv2/impulse_responses_mono/ports#ifn> <my-ir.wav>
] .

I bisected this by going back through older versions of serd one by one. I use Arch (Manjaro), which currently has serd 0.32.0, so I tested 0.30.16 and 0.30.14. The latter works fine, while 0.30.16 exhibits the same faulty behavior as 0.32.0.

This of course affects not only jalv, but all hosts that use the system serd library to write presets or plugin state in their projects (e.g. Ardour from the Arch packages).

Segmentation fault when parsing a specific file

I have a large (~13GB) GNU zipped TriG file. When I parse it with Serdi, I consistently get a segmentation fault after a couple of seconds. To reproduce:

$ curl -L 'https://demo.triply.cc/wouter/bag/assets/bag.trig.gz' | zcat | serdi - > /dev/null

Colliding generated blank nodes during TriG import

The following TriG import creates a blank node b1 which collides with the already existing one:

> serdi -i trig -s "_:b1 <http://example.org/p> [] ."
_:b1 <http://example.org/p> _:b1 .

The Turtle importer correctly avoids collision:

> serdi -i turtle -s "_:b1 <http://example.org/p> [] ."
_:B1 <http://example.org/p> _:b1 .

Tested with the branches master and serd1.

Add robustness against syntax errors?

Based on a remark by @kurzum in an hdt-cpp issue, would it make sense to add robustness against syntax errors to Serd?

Whenever a syntax error is identified, Serd could use some heuristics in order to determine where the next triple may start, and try to continue parsing from there.
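
For line-based syntaxes such as NTriples and NQuads, the simplest such heuristic is to resume at the start of the next line. The sketch below shows only that one idea, with a hypothetical function name. It would not be reliable for Turtle or TriG, where statements and long literals can span lines.

```c
#include <stddef.h>

/* Given the offset at which a syntax error was detected in `buf`, return
   the offset of the start of the next line, or `len` if there is none. */
size_t
example_skip_to_next_line(const char* buf, size_t len, size_t error_offset)
{
  size_t i = error_offset;
  while (i < len && buf[i] != '\n') {
    ++i; /* Discard the rest of the broken statement */
  }

  return i < len ? i + 1 : len;
}
```

A reader using this would report the error, move its cursor to the returned offset, and then try to read the next statement from there.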

Why does lax mode not skip this line?

I'm trying out line skipping in lax parsing mode, but I do not yet understand how it works. For the following example the serdi command line tool gives the same result with and without lax mode:

<x:x> <x:x> <x:x> .
<y:y> <y:y> <y:z}> .
<z:z> <z:z> <z:z> .

$ serdi input.nt
<x:x> <x:x> <x:x> .
error: input.nt:2:127: invalid IRI character `{'
$ serdi -l input.nt
<x:x> <x:x> <x:x> .
error: input.nt:2:17: invalid IRI character `}'

Tested with serdi 0.29.2.

Unclear semantics of Serdi flag `-i`

I originally understood the semantics of the Serdi -i flag to be that the specified grammar must be used (and that -- as a consequence -- every violation of that grammar would throw a warning). However, it seems possible to parse some Turtle files that are not N-Triples using the N-Triples grammar, without emitting a warning. For example, the following

$ serdi -i ntriples test.ttl

parses the following data file using the Turtle (and not the indicated N-Triples) grammar:

@prefix foaf:  <http://xmlns.com/foaf/0.1/> .

_:a  foaf:name   "Johnny Lee Outlaw" .
_:a  foaf:mbox   <mailto:[email protected]> .
_:b  foaf:name   "Peter Goodguy" .
_:b  foaf:mbox   <mailto:[email protected]> .
_:c  foaf:mbox   <mailto:[email protected]> .

pkg-config file should contain -DSERD_STATIC on static build

With --default-library=static, using the resulting static libraries results in a bunch of linker errors, because serd.h adds __declspec(dllimport), which changes the symbol mangling, and obviously tries to import it from a DLL.

When building a static library, -DSERD_STATIC needs to be part of the cflags of the resulting pkg-config file.

This same issue exists on the entire lilv dependency chain, including lilv itself. All with their respective own macro. Not sure if I should open issues there individually?

Cannot find shared object file

I like to pull and compile the latest and greatest Serd version; I compile it with ./waf configure, ./waf, and sudo ./waf install. Directly after completing this sequence of commands, Serdi works fine. But after a while I get the following error:

$ serdi
serdi: error while loading shared libraries: libserd-0.so.0: cannot open shared object file: No such file or directory

The library file is located in what seems to be a normal location:

$ find /usr/ -name libserd-0.so.0
/usr/local/lib/libserd-0.so.0

Recompiling fixes the issue, but only temporarily. I'm using Fedora 27.

Any ideas on what might cause this?

Cannot parse a valid TriG document

For a file test.trig with the following content:

prefix : <https://example.com/>
:g {
  :x :p :y.
  :x :p :y; :q :z.
}

run the following command:

$ serdi -i trig test.trig 
<https://example.com/x> <https://example.com/p> <https://example.com/y> <https://example.com/g> .
<https://example.com/x> <https://example.com/p> <https://example.com/y> <https://example.com/g> .
error: /home/wbeek/tmp/test.trig:4:10: bad subject

Filter specific graph in nquads

First, thanks for serdi, very nice & fast library!

I often use it in RDF creation pipelines, and one job I do on a regular basis is to convert quads to triples. Sometimes the graph does not matter, but in other cases it would be nice to be able to specify which graph I want to have in the output and throw away the rest (or vice versa).

I currently do this with pipe filters, but I would feel more comfortable if I could pass that as a parameter to serdi.

To what extent are IRIs parsed?

Serd currently accepts IRI terms that do not follow the RFC 3987 grammar. E.g., the following input data contains a subject term that is not a valid IRI:

<_:x> <x:x> <x:x> .

The Turtle grammar seems to allow for almost arbitrary strings in the IRI production:

'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>' 

However, this is not the entire story, because the Turtle standard also requires that relative IRIs be made absolute relative to a base URI. So a Turtle parser should at least be able to distinguish relative from absolute IRIs, which implies that it should at least implement the IRI grammar to some extent.

So my question is: to what extent does Serd intend to parse IRIs? Is it feasible to require full RFC 3987 compliance, or should we aim for compliance WRT a well-defined subset of that grammar?
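
To make the gap concrete, the IRIREF production quoted above can be sketched as a predicate on single code points. The function name is hypothetical, and UCHAR escapes are assumed to be handled elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if code point `c` may appear unescaped between '<' and '>'
   according to the Turtle IRIREF production quoted above. */
bool
example_is_iriref_char(uint32_t c)
{
  if (c <= 0x20) {
    return false; /* Control characters and space are excluded */
  }

  switch (c) {
  case '<':
  case '>':
  case '"':
  case '{':
  case '}':
  case '|':
  case '^':
  case '`':
  case '\\':
    return false;
  default:
    return true;
  }
}
```

Every character of _:x passes this check, so a reader that enforces only this production accepts <_:x>. Rejecting it requires at least the scheme rule of the IRI grammar, since "_" cannot start a scheme.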

Add canonical parsing of lexical forms

A datatype defines its own functions, mapping the lexical form to the value space, and mapping values from the value space to the lexical space. Oftentimes, there is a canonical variant of the latter, which allows values to be serialized in one, canonical way. It would be great if Serd would allow literals to be parsed in a way that guarantees that the lexical forms of some common datatypes are written down in a canonical way.

Here are some examples of non-canonical lexical forms and their canonical variants:

"01"^^xsd:int ⇒ "1"^^xsd:int
"0.0001"^^xsd:double ⇒ "1.0E-4"^^xsd:double
"1"^^xsd:boolean ⇒ "true"^^xsd:boolean

Canonical lexical forms are preferred over non-canonical ones, because (1) they allow value equality to be determined simply by comparing lexical form equality, and (2) they allow the relative ordering of many values (i.e., < and >) to be determined based on a simple comparison of their lexical forms.

There are many use cases in which these two properties have a profound impact. Two use cases of only the former property are (1) SPARQL filter expressions that check for literal value equality, and (2) graph navigation where the nodes "1"^^xsd:boolean and "true"^^xsd:boolean should be treated as the same node.
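
As a rough illustration of what such canonicalisation involves, the sketch below handles the boolean and integer cases from the examples. The function names are hypothetical and none of this is an existing Serd API. It skips whitespace handling and range checks (for example, the 32-bit limit of xsd:int).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Map an xsd:boolean lexical form to its canonical form ("true" or
   "false"), or return NULL if the form is not a valid boolean. */
const char*
example_canonical_boolean(const char* lexical)
{
  if (!strcmp(lexical, "true") || !strcmp(lexical, "1")) {
    return "true";
  }
  if (!strcmp(lexical, "false") || !strcmp(lexical, "0")) {
    return "false";
  }
  return NULL;
}

/* Write the canonical form of an integer lexical form to `out`: no '+'
   sign, no leading zeros, and no negative zero. Returns false if the input
   is not a valid integer or the result does not fit. */
bool
example_canonical_integer(const char* lexical, char* out, size_t out_size)
{
  const char* s        = lexical;
  bool        negative = false;
  if (*s == '+' || *s == '-') {
    negative = (*s == '-');
    ++s;
  }

  if (*s == '\0') {
    return false; /* A sign alone, or an empty string, has no digits */
  }
  for (const char* d = s; *d; ++d) {
    if (*d < '0' || *d > '9') {
      return false;
    }
  }

  while (s[0] == '0' && s[1] != '\0') {
    ++s; /* Strip leading zeros but keep the last digit */
  }
  if (s[0] == '0') {
    negative = false; /* "-0" canonicalises to "0" */
  }

  const int n = snprintf(out, out_size, "%s%s", negative ? "-" : "", s);
  return n >= 0 && (size_t)n < out_size;
}
```

Once both sides have been canonicalised this way, a plain strcmp on the lexical forms is enough to test value equality for these datatypes.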

Bug: serd_reader_read_chunk does not support NQuads

The following code does not properly read NQuads (Graph is always null):

auto* r = serd_reader_new(...);
serd_reader_start_source_stream(...);
while (serd_reader_read_chunk(...) ...) {...};

(Reading a file directly via serd_reader_read_file_handle works, but reading via a simple adaptor from a C++ std::istream to serd_reader_read_chunk does not.)

Possible cause:
serd_reader_read_chunk calls read_statement, which calls read_n3_statement directly, without checking the reader's syntax mode.

Backslash escaped IRI local names are not parsed

I'm posting here an issue related to an issue opened by @wonkydonky in the hdt-cpp project.

IIUC it is allowed to use backslash escaping in IRI local names, which would allow dots to appear as the first and/or last character of a local name (something that is not allowed without backslash escaping).

A simple example to illustrate this:

$ serdi -s "prefix : <x:x> :\. : : ."
error: (string):1:20: bad subject
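
For reference, the Turtle grammar's PN_LOCAL_ESC production allows a backslash before any of the characters _~.-!$&'()*+,;=/?#@% in a local name. The sketch below unescapes a local name on that basis. The function names are hypothetical, this is not serd's reader code, and the rest of the PN_LOCAL rules are not validated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `c` may follow a backslash in a Turtle local name. */
bool
example_is_local_escapable(char c)
{
  return c != '\0' && strchr("_~.-!$&'()*+,;=/?#@%", c) != NULL;
}

/* Copy `local` to `out` with the backslash of each valid escape removed.
   Returns false on an invalid escape or if the result does not fit. */
bool
example_unescape_local(const char* local, char* out, size_t out_size)
{
  size_t n = 0;
  for (size_t i = 0; local[i]; ++i) {
    char c = local[i];
    if (c == '\\') {
      c = local[++i]; /* Take the escaped character instead */
      if (!example_is_local_escapable(c)) {
        return false; /* Also catches a trailing backslash */
      }
    }

    if (n + 1 >= out_size) {
      return false; /* No room for this character and the terminator */
    }
    out[n++] = c;
  }

  if (out_size == 0) {
    return false;
  }
  out[n] = '\0';
  return true;
}
```

Applied to the example above, the local name \. unescapes to a single dot, so with the prefix mapped to x:x the subject would expand to the IRI x:x followed by that dot.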

Parsing from a string in python

Is there an example of how to parse from a string in python?

world.load works fine from a file, but if I put the same content in a string and use a StringIO / BytesIO it fails.
