BibTeX Parser Experiments

Experiments to determine the new BibTeX parser formula. The result may be applied to other formats as well in the future.

Above is some pseudo-grammar describing the stages of parsing. The reason I distinguish between parsing the file and parsing values is best demonstrated in the "entry value with mid-command concatenation" test case:

Input:

@book{a,
  title = "foo \copy" # "right{} bar"
}

Output:

[{
  type: 'book',
  id: 'a',
  properties: {
    title: 'foo © bar'
  }
}]
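
To illustrate why the order of the two stages matters, here is a minimal sketch. It is not the actual Citation.js implementation: the names COMMANDS, concatenate and parseValue are made up for this example, and the command table only contains what the test case needs. It compares parsing values after concatenation with parsing each part on its own.

```javascript
// Hypothetical command table; a real parser would know many more commands.
const COMMANDS = new Map([['copyright', '©']]);

// Stage 1 (file level): join the parts of a concatenated field value.
// Each part is the raw content of one quoted or braced string.
function concatenate(parts) {
  return parts.join('');
}

// Stage 2 (value level): replace known commands in the text.
// An optional empty group "{}" directly after the command is consumed too.
function parseValue(text) {
  return text.replace(/\\([a-zA-Z]+)(\{\})?/g, (match, name) =>
    COMMANDS.has(name) ? COMMANDS.get(name) : match
  );
}

// The two parts from the test case ("\\" is a single backslash in JavaScript).
const parts = ['foo \\copy', 'right{} bar'];

// Parsing values after concatenation sees the complete command \copyright{}.
const joinedFirst = parseValue(concatenate(parts));

// Parsing each part before concatenation only ever sees \copy, which is unknown.
const parsedFirst = concatenate(parts.map(parseValue));

console.log(joinedFirst); // foo © bar
console.log(parsedFirst); // foo \copyright{} bar
```

Only the first order produces the expected output, which is why the file-level stage has to finish concatenation before any value is parsed.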

Participants

Note: citationjs-idea (Citation.js Idea #1) and citationjs-nearley are skipped in some tables, as they are only in here for historic reasons.

Citation.js (old)

At the time of starting these experiments, the TokenStack class was utilized, together with a simple RegExp that tokenizes commands.

Citation.js Ideas

The idea was to explore tokenization without introducing a formal grammar, as formal grammars introduce extra build steps, runtime dependencies and large swaths of generated code. However, used as I was to the syntax of PEG.js and nearley.js, I made some unnecessarily complicated features like consumeAnyRule(), and some weird loops in the rules. This was partly due to bad tokenization.

Idea #2, currently active in Citation.js, has new tokens, a simpler Grammar class and simplified rules. It also has more features, such as support for more commands and diacritics, and for more ways to write them.

Citation.js with nearley

In parallel to reworking the idea, I used the tokenizer in a nearley.js grammar, which failed miserably. This is probably the result of bad grammar-writing on my part, and not a reflection of the capabilities of nearley.js. However, an additional downside of this route is that it introduces an extra build step (nearleyc) and a runtime dependency (nearley itself).

astrocite

The astrocite-bibtex package by @dsifford uses PEG.js. It is capable of returning an AST.

fiduswriter

Fiduswriter's biblatex-csl-converter seems to perform very poorly on the larger test file (3345 entries). However, it does return lossless values, although I am not a fan of the (lack of) difference between arrays representing single and multiple values:

[
  { type: 'text', text: 'foo' }
]
// vs
[
  { literal: [{ type: 'text', text: 'foo' }] },
  { literal: [{ type: 'text', text: 'bar' }] }
]

This makes it necessary to distinguish the two cases with checks such as

'literal' in value[0]
// or more properly
value.every(part => 'literal' in part)
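
A small helper can hide that check from the rest of the code. The sketch below only assumes the two shapes shown above; the function names isMultiValue and toValueList are made up for this example, and the actual biblatex-csl-converter output may contain other shapes that it does not handle.

```javascript
// True when `value` is a list of { literal: [...] } wrappers (multiple values)
// rather than a flat list of text parts (a single value).
function isMultiValue(value) {
  return value.length > 0 && value.every(part => 'literal' in part);
}

// Normalizes both shapes to a list of values, each being a list of text parts.
function toValueList(value) {
  return isMultiValue(value) ? value.map(part => part.literal) : [value];
}

const single = [{ type: 'text', text: 'foo' }];
const multiple = [
  { literal: [{ type: 'text', text: 'foo' }] },
  { literal: [{ type: 'text', text: 'bar' }] }
];

console.log(toValueList(single).length); // 1
console.log(toValueList(multiple).length); // 2
```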

Zotero

Zotero Translators are relatively hard to use stand-alone, as they depend on a Zotero framework in the global scope. They immediately convert to Zotero API JSON while parsing the syntax. Not shown in the performance table is that Zotero requires initialization, not counting the time it takes to import files, and that this takes relatively long.

Better BibTeX for Zotero (BBT)

Using @retorquere/bibtex-parser, this performs very well. It is capable of returning an AST. I have not had a chance to test out all the parser features for literal/text/name values yet.

JabRef

JabRef is reference management software with Bib(La)TeX as the internal representation, so one can assume their support for parsing it to be pretty good. However, I have not found a way to export their internal representation in the level of detail required for passing the syntax tests. Similarly, as their program only partly supports a CLI (no stdio) and is written in Java, a performance comparison would not be very fair. For now, it is possible to test syntax features by uncommenting the jabref entry in test/feature.js and running npm run features -- parser jabref.

API Features

	citationjs-old	citationjs	astrocite	fiduswriter	zotero	bbt
Sync/Async	sync	sync	sync	both	async	both
AST output	✘	✘	✓	✘	✘	✓
Lossless schema¹	✓	✓	✓	✓	✘	✓
Lossless values	✘	✘	✘	✓	✘	✓
Error recovery	✓	✘	✘	✓	✘	✓

¹ specifically the schema used to represent data entries and not value syntax (commands, formatting), and disregarding AST

Syntax Features

Empty cells indicate a choice to follow either natbib or biblatex for certain behavior; which one becomes clear from context. If both cells are empty, this may be an error, but that should be indicated by a different test fixture. Auto-generated by npm run features -- fixtures; see also the fixture file.

	citationjs-old	citationjs	astrocite	fiduswriter	zotero	bbt
entry with lowercase type	✓	✓	✓	✓	✓	✓
entry with mixed-case type	✓	✓	✓	✓	✓	✓
entry with uppercase type	✓	✓	✓	✓	✓	✓
entry with parentheses	✘	✓	✓	✓	✓	✓
entry with spacing	✓	✓	✘	✓	✓	✓
entry with trailing comma	✓	✓	✓	✓	✓	✓
string key with colon	✘	✓	✘	✓	✘	✓
entry key with colon	✓	✓	✓	✘	✘	✓
entry value with annotation
entry label with number	✓	✓	✓	✓	✓	✓
entry label with colon	✓	✓	✓	✓	✓	✓
entry label with double quotes	✘	✓	✓	✓	✓	✓
entry value of quoted string	✓	✓	✓	✓	✓	✓
entry value of braced string	✓	✓	✓	✓	✓	✓
entry value of number	✘¹	✓	✓	✘¹	✘¹	✓
entry value with mid-and concatenation	✘²	✓	✘²	✘²	✘²	✓
entry value with mid-command concatenation	✘²	✓	✘²	✘²	✘²	✘²
entry value with sentence-casing (real title)	✘¹	✓	✘¹	✘¹	✘¹	✓
entry value with sentence-casing (artificial title)	✘¹	✓	✘¹	✘¹	✘¹	✘¹
entry value with sentence-casing (markup)	✘¹	✓	✘¹	✘¹	✘¹	✘¹
entry value with sentence-casing (env markup)	✘¹	✓	✘¹	✘¹	✘¹	✘¹
entry value with markup	✘¹	✓	✘¹	✘¹	✘¹	✓
entry value with envs	✘¹	✓	✘¹	✘¹	✘¹	✘¹
entry value with env overrides	✘¹	✓	✘¹	✘¹	✘¹	✓
entry value with literal names	✘	✓	✘	✘	✘	✓
entry value with truncated names	✘¹	✘¹	✘¹	✘¹	✘¹	✘¹
entry value with extended names (biblatex)		✓				✓
entry value with verbatim fields	✓	✓				✓
entry value with uri fields		✓
entry value with pre-encoded uri fields	✓	✓	✓			✓
entry value with diacritics	✘	✓	✘	✘	✘	✓
entry value with escapes	✘	✓	✘	✘	✘	✘
entry value with sub/superscript	✘	✓	✘	✘	✘	✓
entry value with multi-argument commands	✘	✘	✘	✘	✘	✓
entry value with verbatim-argument commands	✘	✘	✘	✘	✘	✓
entry value with unbracketed-argument commands	✘¹	✘¹	✘¹	✘¹	✘¹	✓
TODO
string with lowercase type	✘	✓	✓	✓	✓	✓
string with mixed-case type	✘	✓	✓	✓	✓	✓
string with uppercase type	✘	✓	✓	✓	✓	✓
string with parentheses	✘	✓	✓	✓	✓	✓
string value with string	✘	✓	✓	✓	✓	✓
string value with concatenated string	✘	✓	✘	✓	✓	✓
preamble with quoted string	✘	✓	✓	✓	✓	✓
preamble with string	✘	✓	✘	✓	✓	✓
preamble with concatenated string	✘	✓	✘	✓	✓	✓
comment before entry	✘	✓	✓	✓	✓	✓
comment around entry (natbib)		✓
comment around entry (biblatex)			✓	✓	✓	✓

¹ undefined representation, actual support may vary
² very unlikely to matter

Performance

Data from npm test, as run on Travis CI.

	Init	Time (single entry)	Time (3345 entries)
citationjs-old	0.795ms ± 2.2%	0.842ms ± 1.6%	1.97e+3ms ± 5.5%
citationjs	4.42ms ± 2.3%	0.455ms ± 1.5%	1.20e+3ms ± 11.7%
astrocite	1.73ms ± 2.2%	0.665ms ± 2.3%	2.17e+3ms ± 7.0%
fiduswriter	17.7ms ± 12.5%	7.52ms ± 24.7%	1.16e+5ms ± 4.0%
zotero	4.16ms ± 12.1%	2.13ms ± 1.3%	1.00e+4ms ± 0.8%
bbt	21.7ms ± 23.4%	1.55ms ± 6.5%	2.27e+4ms ± 5.4%

@comment behavior in BibTeX vs BibLaTeX

biblatex and natbib treat @comment entries differently. However, neither treats them the way I implemented it based on a summary I found, namely ignoring everything from @comment to the end of the line (source). biblatex seems to treat @comment as a regular entry: it expects an opening brace and ignores everything, including other entries, until the matching end brace. natbib seems to just ignore the @comment text itself, so an entry that starts with @comment instead of @book does nothing. Unlike biblatex, it does not require @comment to be an entry. natbib behavior also differs from the aforementioned summary in that it still counts entries that start on the same line as @comment, while the summary states "that everything from the @Comment and to the end of line is ignored".

Examples:

@comment {
  @misc{label,
    author = "name",
    title = "displays with natbib, not with biblatex",
    year = 2019
  }
}

@comment{ @misc{label2,
    author = "name",
    title = "displays with natbib, not with biblatex",
    year = 2019
  }
}

@comment @misc{label3,
    author = "name",
    title = "errors (fatally) with biblatex, displays with natbib",
    year = 2019
  }

@comment {} @misc{label4,
    author = "name",
    title = "displays with biblatex, natbib",
    year = 2019
  }
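
The two observed behaviors can be approximated in a few lines. This is a rough sketch based only on the observations above, not on the actual biblatex or natbib code; the function names are made up, and escaped braces or parenthesized entries are not handled.

```javascript
// natbib-like: only the "@comment" marker itself is dropped, so any entry
// that follows it (or is wrapped by it) is still parsed.
function stripCommentMarkers(input) {
  return input.replace(/@comment/gi, '');
}

// biblatex-like: @comment must be followed by a braced body, and everything
// up to the matching closing brace is ignored.
function stripCommentEntries(input) {
  const marker = /@comment/gi;
  let output = '';
  let position = 0;
  let match;

  while ((match = marker.exec(input)) !== null) {
    output += input.slice(position, match.index);

    // Skip whitespace between "@comment" and the opening brace.
    let index = marker.lastIndex;
    while (/\s/.test(input[index] || '')) index++;
    if (input[index] !== '{') {
      throw new SyntaxError('expected "{" after @comment');
    }

    // Find the matching closing brace by counting nesting depth.
    let depth = 0;
    do {
      if (index >= input.length) throw new SyntaxError('unclosed @comment');
      if (input[index] === '{') depth++;
      else if (input[index] === '}') depth--;
      index++;
    } while (depth > 0);

    // Continue searching after the comment body.
    position = index;
    marker.lastIndex = index;
  }

  return output + input.slice(position);
}

console.log(stripCommentEntries('@comment {} @misc{label4, year = 2019}'));
// " @misc{label4, year = 2019}"
```

With this sketch, the first two examples yield no entries in the biblatex-like variant, the third throws, and the fourth keeps the @misc entry, which matches the titles in the examples.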
