citation-js / bibtex-parser-experiments

Experiments to determine a new BibTeX parser formula for Citation.js -- to be applied to other formats as well

Home Page: https://travis-ci.com/citation-js/bibtex-parser-experiments/builds

License: MIT License

JavaScript 94.66% Nearley 5.10% Shell 0.24%
bibtex parser csl-json grammar nearleyjs pegjs zotero

bibtex-parser-experiments's Introduction

BibTeX Parser Experiments

Experiments to determine the new BibTeX parser formula. The result may be applied to other formats as well in the future.

Parsing stages of BibTeX

Above is some pseudo-grammar describing the stages of parsing. The reason I distinguish between parsing the file and parsing values is best demonstrated in the "entry value with mid-command concatenation" test case:

Input:

@book{a,
  title = "foo \copy" # "right{} bar"
}

Output:

[{
  type: 'book',
  id: 'a',
  properties: {
    title: 'foo © bar'
  }
}]
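
To illustrate why the two stages are kept apart, here is a minimal sketch (not the actual Citation.js implementation). The function names `concatenateParts` and `parseValue` and the one-entry command table are assumptions made for this example. The point is that the command `\copyright` only becomes visible after the file-level concatenation has been resolved.

```javascript
// Stage 1 (file level): join the parts of a `#`-concatenated field value.
// Hypothetical helper: only quoted strings are handled, and a `#` inside
// quotes or braces would be split incorrectly.
function concatenateParts (rawValue) {
  return rawValue
    .split('#')
    .map(part => part.trim().replace(/^"|"$/g, ''))
    .join('')
}

// Stage 2 (value level): replace known commands in the joined text.
// The command table is a tiny illustrative subset.
const COMMANDS = new Map([['copyright', '©']])

function parseValue (text) {
  return text.replace(/\\([a-zA-Z]+)(?:\{\})?/g, (match, name) =>
    COMMANDS.has(name) ? COMMANDS.get(name) : match
  )
}

const raw = '"foo \\copy" # "right{} bar"'

// Parsing commands per part finds only the unknown command `\copy`.
console.log(parseValue('foo \\copy')) // unchanged
// Parsing commands after concatenation finds `\copyright{}`.
console.log(parseValue(concatenateParts(raw))) // → 'foo © bar'
```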

Participants

Note: citationjs-idea (Citation.js Idea #1) and citationjs-nearley are skipped in some tables, as they are only in here for historic reasons.

Citation.js (old)

At the time of starting these experiments, the TokenStack class was utilized, together with a simple RegExp that tokenizes commands.

Citation.js Ideas

The idea was to explore tokenization without introducing a formal grammar, as formal grammars introduce extra build steps, runtime dependencies and large swaths of generated code. However, used as I was to the syntax of PEG.js and nearley.js, I made some unnecessarily complicated features like consumeAnyRule(), and some weird loops in the rules. This was partly due to bad tokenization.

Idea #2, currently active in Citation.js, has new tokens, a simpler Grammar class and simplified rules. It also has more features: it supports more commands and diacritics, and more ways to write them.

Citation.js with nearley

In parallel to reworking the idea, I used the tokenizer in a nearley.js grammar, which failed miserably. This is probably the result of bad grammar-writing on my part, and not a reflection of the capabilities of nearley.js. However, an additional downside of this route is that it introduces an extra build step (nearleyc) and a runtime dependency — nearley itself.

astrocite

The astrocite-bibtex package by @dsifford uses PEG.js. It is capable of returning an AST.

fiduswriter

Fiduswriter's biblatex-csl-converter seems to perform very poorly on the larger file. However, it does return lossless values, although I am not a fan of the (lack of) difference between arrays representing single and multiple values:

[
  { type: 'text', text: 'foo' }
]
// vs
[
  { literal: [{ type: 'text', text: 'foo' }] },
  { literal: [{ type: 'text', text: 'bar' }] }
]

This makes tests such as the following necessary:

'literal' in value[0]
// or more properly
value.every(part => 'literal' in part)
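
As a sketch of how a consumer might hide this distinction, the helper below normalizes both shapes to a list of values. The names `isMultiValue` and `toList` are hypothetical and not part of biblatex-csl-converter; the sample data is taken from the two shapes shown above.

```javascript
// Hypothetical helper: true if every part wraps its content in `literal`,
// which is how multiple values are represented in the output shown above.
function isMultiValue (value) {
  return value.length > 0 && value.every(part => 'literal' in part)
}

// Normalize both shapes to an array of values, where each value is an
// array of text nodes.
function toList (value) {
  return isMultiValue(value) ? value.map(part => part.literal) : [value]
}

const single = [{ type: 'text', text: 'foo' }]
const multiple = [
  { literal: [{ type: 'text', text: 'foo' }] },
  { literal: [{ type: 'text', text: 'bar' }] }
]

console.log(toList(single).length) // → 1
console.log(toList(multiple).length) // → 2
```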

Zotero

Zotero Translators are relatively hard to use stand-alone, as they depend on a Zotero framework in the global scope. The translator converts to Zotero API JSON immediately while parsing the syntax. Not shown in the performance table is that Zotero requires initialization, which takes relatively long even without counting the time it takes to import files.

Better BibTeX for Zotero (BBT)

Using @retorquere/bibtex-parser, this performs very well. It is capable of returning an AST. I have not had a chance to test out all the parser features for literal/text/name values yet.

JabRef

JabRef is reference management software with Bib(La)TeX as the internal representation, so one can assume their support for parsing it to be pretty good. However, I have not found a way to export their internal representation in the level of detail required for passing the syntax tests. Similarly, as their program only partly supports a CLI (no stdio) and is written in Java, a performance comparison would not be very fair. For now, it is possible to test syntax features by uncommenting the jabref entry in test/feature.js and running npm run features -- parser jabref.

API Features

citationjs-old citationjs astrocite fiduswriter zotero bbt
Sync/Async sync sync sync both async both
AST output
Lossless schema¹
Lossless values
Error recovery

¹ specifically the schema used to represent data entries and not value syntax (commands, formatting), and disregarding AST

Syntax Features

Empty cells indicate a choice to follow either natbib or biblatex for certain behavior; which one becomes clear from context. If both cells are empty, this may be an error, but that should be indicated by a different test fixture. Auto-generated by npm run features -- fixtures; see also the fixture file.

citationjs-old citationjs astrocite fiduswriter zotero bbt
entry with lowercase type
entry with mixed-case type
entry with uppercase type
entry with parentheses
entry with spacing
entry with trailing comma
string key with colon
entry key with colon
entry value with annotation
entry label with number
entry label with colon
entry label with double quotes
entry value of quoted string
entry value of braced string
entry value of number ✘¹ ✘¹ ✘¹
entry value with mid-and concatenation ✘² ✘² ✘² ✘²
entry value with mid-command concatenation ✘² ✘² ✘² ✘² ✘²
entry value with sentence-casing (real title) ✘¹ ✘¹ ✘¹ ✘¹
entry value with sentence-casing (artificial title) ✘¹ ✘¹ ✘¹ ✘¹ ✘¹
entry value with sentence-casing (markup) ✘¹ ✘¹ ✘¹ ✘¹ ✘¹
entry value with sentence-casing (env markup) ✘¹ ✘¹ ✘¹ ✘¹ ✘¹
entry value with markup ✘¹ ✘¹ ✘¹ ✘¹
entry value with envs ✘¹ ✘¹ ✘¹ ✘¹ ✘¹
entry value with env overrides ✘¹ ✘¹ ✘¹ ✘¹
entry value with literal names
entry value with truncated names ✘¹ ✘¹ ✘¹ ✘¹ ✘¹ ✘¹
entry value with extended names (biblatex)
entry value with verbatim fields
entry value with uri fields
entry value with pre-encoded uri fields
entry value with diacritics
entry value with escapes
entry value with sub/superscript
entry value with multi-argument commands
entry value with verbatim-argument commands
entry value with unbracketed-argument commands ✘¹ ✘¹ ✘¹ ✘¹ ✘¹
TODO
string with lowercase type
string with mixed-case type
string with uppercase type
string with parentheses
string value with string
string value with concatenated string
preamble with quoted string
preamble with string
preamble with concatenated string
comment before entry
comment around entry (natbib)
comment around entry (biblatex)

¹ undefined representation, actual support may vary
² very unlikely to matter

Performance

Data from npm test, as run on Travis CI.

Parser          Init             Time (single entry)  Time (3345 entries)
citationjs-old  0.795ms ± 2.2%   0.842ms ± 1.6%       1.97e+3ms ± 5.5%
citationjs      4.42ms ± 2.3%    0.455ms ± 1.5%       1.20e+3ms ± 11.7%
astrocite       1.73ms ± 2.2%    0.665ms ± 2.3%       2.17e+3ms ± 7.0%
fiduswriter     17.7ms ± 12.5%   7.52ms ± 24.7%       1.16e+5ms ± 4.0%
zotero          4.16ms ± 12.1%   2.13ms ± 1.3%        1.00e+4ms ± 0.8%
bbt             21.7ms ± 23.4%   1.55ms ± 6.5%        2.27e+4ms ± 5.4%

Contributors

larsgw, retorquere, zuphilip

bibtex-parser-experiments's Issues

Close down biblatex-csl-converter

Hey, I just discovered this chart. I have been participating in the maintenance of biblatex-csl-converter over the past few years. Based on your chart it looks like Idea (reworked) gives the same output quality as biblatex-csl-converter. Does that mean that it can be used as a drop-in replacement and that it covers all the same features? If that is the case, is there any reason why I would continue to maintain biblatex-csl-converter?

BBT parser

WRT the issues reported here on the BBT parser:

  • "does not seem to support all diacritics (errors on {\r a})": this is fixed now.
  • "does not seem to support chained concatenations (a # b # c)": I can't replicate this.

The sample below imports in BBT since 5.1.154

@string{j = {a space between this }}
@string{a = { string a}}
@string{b = { string b}}
@string{c = { string c}}
@article{key,
    author  = "Author",
    title   = "{\r a}Title" # a # b # c,
    year    = 1990,
    journal = j # "and this"
}

but the concat part of the title imported before that too.
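
For reference, the chained concatenation in this sample can be resolved as in the rough sketch below. It assumes the value has already been split into parts; the part shape (`{ type, text }` and `{ type, name }`) and the function name `resolveConcatenation` are made up for this example and are not the BBT parser's API.

```javascript
// @string definitions from the sample above.
const strings = new Map([
  ['j', 'a space between this '],
  ['a', ' string a'],
  ['b', ' string b'],
  ['c', ' string c']
])

// Resolve a list of concatenated parts: literal parts contribute their
// text, variable parts are looked up in the @string table.
function resolveConcatenation (parts, definitions) {
  return parts.map(part => {
    if (part.type === 'text') return part.text
    if (!definitions.has(part.name)) {
      throw new Error(`Undefined @string: ${part.name}`)
    }
    return definitions.get(part.name)
  }).join('')
}

// title = "{\r a}Title" # a # b # c
const title = resolveConcatenation([
  { type: 'text', text: '{\\r a}Title' },
  { type: 'variable', name: 'a' },
  { type: 'variable', name: 'b' },
  { type: 'variable', name: 'c' }
], strings)

// journal = j # "and this"
const journal = resolveConcatenation([
  { type: 'variable', name: 'j' },
  { type: 'text', text: 'and this' }
], strings)

console.log(title) // → '{\r a}Title string a string b string c'
console.log(journal) // → 'a space between this and this'
```

The diacritic `{\r a}` is deliberately left untouched here; converting it would be a separate step on the resolved value.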

Zotero parser comparison

The Zotero parser isn't really compared in the same way as the others: the comparison starts a full-blown translation server, which does a lot more. The Zotero translators aren't really meant to be run standalone, but I've created a haphazard test runner for the bibtex translator here, and that runs on long.bib in 7s.

Sharing Bib(La)TeX test cases?

Do you want to share the Bib(La)TeX test cases as well? Or are they already available online somewhere and I just couldn't find them?

I wanted to analyze Zotero's behavior on them and use them for possible improvements of its BibTeX import translator (e.g. supporting electronic as it seems to be an alias for online).

Progress on the active parser ("citationjs")

2020-09-15: update below

One big problem is the question of what should be parsed when parsing syntax, and what should be parsed when mapping to CSL. Consider also that Bib.TXT should be able to use the same mapping.

  • diacritics: when parsing syntax, as Bib.TXT and some BibTeX supports utf8
  • other known symbol commands and ligatures: when parsing syntax
  • except, fields tagged as verbatim or url in the specification should not have commands parsed, and then the syntax parser has to know about all the different fields.
  • although field data is available, URL escaping should be handled when mapping since Bib.TXT should probably have that behavior too
  • name field parsing: should probably be when mapping
  • list field parsing (splitting on " and "): should probably be when mapping
  • markup: should be done when mapping, as markup differs between formats
  • crossref: should be done when mapping
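
As an example of a step that could live in the mapping stage, here is a minimal sketch of list field splitting. It splits on " and " only outside braces, so that a braced literal name stays intact. The function name `splitList` is hypothetical, treating any run of whitespace around "and" as a separator is an assumption, and escaped braces (`\{`) are not handled.

```javascript
// Split a list field value on " and " at brace depth 0.
function splitList (value) {
  const separator = /\s+and\s+/y // sticky: only matches at lastIndex
  const items = []
  let depth = 0
  let start = 0
  let i = 0

  while (i < value.length) {
    const char = value[i]
    if (char === '{') {
      depth++
    } else if (char === '}') {
      depth--
    } else if (depth === 0) {
      separator.lastIndex = i
      const match = separator.exec(value)
      if (match) {
        // Close the current item and continue after the separator.
        items.push(value.slice(start, i))
        i += match[0].length
        start = i
        continue
      }
    }
    i++
  }

  items.push(value.slice(start))
  return items
}

console.log(splitList('Doe, John and {Barnes and Noble} and Smith, Jane'))
// → ['Doe, John', '{Barnes and Noble}', 'Smith, Jane']
```

Because this works on the plain field value, the same function could in principle be shared with a Bib.TXT mapping.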

Less crucial things, maybe:

  • let people extend the constants (i.e., add commands, diacritics, ligatures)
  • (only) warn for mis-matched entry brackets

npm install fails

When running npm install on this repo (on Mac), I get:

> [email protected] install /Users/emile/github/bibtex-parser-experiments/node_modules/chokidarAt2/node_modules/fsevents
> node install.js

internal/modules/cjs/loader.js:1033
  throw err;
  ^

Error: Cannot find module 'nan'
Require stack:
- /Users/emile/github/bibtex-parser-experiments/node_modules/chokidarAt2/node_modules/fsevents/[eval]
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:1030:15)
    at Function.Module._load (internal/modules/cjs/loader.js:899:27)
    at Module.require (internal/modules/cjs/loader.js:1090:19)
    at require (internal/modules/cjs/helpers.js:75:18)
    at [eval]:1:1
    at Script.runInThisContext (vm.js:131:18)
    at Object.runInThisContext (vm.js:295:38)
    at Object.<anonymous> ([eval]-wrapper:10:26)
    at Module._compile (internal/modules/cjs/loader.js:1201:30)
    at evalScript (internal/process/execution.js:98:25) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [
    '/Users/emile/github/bibtex-parser-experiments/node_modules/chokidarAt2/node_modules/fsevents/[eval]'
  ]
}
gyp: Call to 'node -e "require('nan')"' returned exit status 1 while in binding.gyp. while trying to load binding.gyp
gyp ERR! configure error 
gyp ERR! stack Error: `gyp` failed with exit code: 1
gyp ERR! stack     at ChildProcess.onCpExit (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:351:16)
gyp ERR! stack     at ChildProcess.emit (events.js:314:20)
gyp ERR! stack     at Process.ChildProcess._handle.onexit (internal/child_process.js:276:12)
gyp ERR! System Darwin 19.6.0
gyp ERR! command "/usr/local/Cellar/node/14.5.0/bin/node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /Users/emile/github/bibtex-parser-experiments/node_modules/chokidarAt2/node_modules/fsevents
gyp ERR! node -v v14.5.0
gyp ERR! node-gyp -v v5.1.0
gyp ERR! not ok 

Argument commands

The citationjs parser needs to allow for more kinds of commands, mostly argument commands. Arguments seem to always be treated the same way: a command takes in either a braced block or the first character of the following text. Exceptions are math blocks: \url takes in the dollar sign verbatim while \emph does not.
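
The general rule can be sketched as follows. This is an illustration under stated assumptions, not the citationjs implementation: the function name `readArgument` is hypothetical, skipping whitespace before the argument is an assumption, and the math block exceptions and the case where the argument is itself a command are not covered.

```javascript
// Read one command argument starting at `index`: either a balanced braced
// block (returned without the outer braces) or a single character.
// Returns the argument and the index just past it.
function readArgument (text, index) {
  // Assumption: whitespace between the command and its argument is skipped.
  while (index < text.length && /\s/.test(text[index])) index++

  if (text[index] !== '{') {
    return { argument: text[index] ?? '', end: Math.min(index + 1, text.length) }
  }

  let depth = 0
  for (let i = index; i < text.length; i++) {
    if (text[i] === '{') {
      depth++
    } else if (text[i] === '}' && --depth === 0) {
      return { argument: text.slice(index + 1, i), end: i + 1 }
    }
  }
  throw new SyntaxError('Unbalanced braces in argument')
}

console.log(readArgument('{abc} d', 0)) // → { argument: 'abc', end: 5 }
console.log(readArgument('abc', 0)) // → { argument: 'a', end: 1 }
console.log(readArgument(' {a{b}c}', 0)) // → { argument: 'a{b}c', end: 8 }
```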

@comment behavior in BibTeX vs BibLaTeX

biblatex and natbib treat @comment entries differently. However, neither treats them the way I implemented them based on a summary I found, namely by ignoring everything from @comment to the end of the line (source). biblatex seems to treat @comment as a regular entry (e.g. expecting opening braces and ignoring everything, including other entries, until the end brace), while natbib seems to just ignore the @comment text itself, so an entry that starts with @comment instead of @book does nothing. However, natbib does not require @comment to be an entry, as biblatex does. Also, natbib behavior differs from the aforementioned summary in that it still counts entries that start on the same line as @comment, while the summary states "that everything from the @Comment and to the end of line is ignored".

Examples:

@comment {
  @misc{label,
    author = "name",
    title = "displays with natbib, not with biblatex",
    year = 2019
  }
}

@comment{ @misc{label2,
    author = "name",
    title = "displays with natbib, not with biblatex",
    year = 2019
  }
}

@comment @misc{label3,
    author = "name",
    title = "errors (fatally) with biblatex, displays with natbib",
    year = 2019
  }

@comment {} @misc{label4,
    author = "name",
    title = "displays with biblatex, natbib",
    year = 2019
  }
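
The three behaviors can be compared with the rough model below. It is only a sketch of the observations described above, not of how biblatex or natbib are actually implemented. The function names `stripComments` and `entryLabels` and the mode names are invented for this example, and braces inside quoted values are not treated specially.

```javascript
// Remove @comment according to one of three observed or documented behaviors.
function stripComments (input, mode) {
  if (mode === 'line') {
    // The summary: everything from @comment to the end of the line is ignored.
    return input.replace(/@comment[^\n]*/gi, '')
  }
  if (mode === 'natbib') {
    // Only the @comment text itself is ignored.
    return input.replace(/@comment/gi, '')
  }
  if (mode === 'biblatex') {
    // @comment is a regular entry: a balanced braced block is required.
    const pattern = /@comment\s*/gi
    let output = ''
    let index = 0
    let match
    while ((match = pattern.exec(input))) {
      output += input.slice(index, match.index)
      let i = pattern.lastIndex
      if (input[i] !== '{') throw new SyntaxError('Expected "{" after @comment')
      let depth = 0
      do {
        if (input[i] === '{') depth++
        else if (input[i] === '}') depth--
        i++
      } while (depth > 0 && i < input.length)
      index = pattern.lastIndex = i
    }
    return output + input.slice(index)
  }
  throw new Error(`Unknown mode: ${mode}`)
}

// Labels of the @misc entries that remain visible.
function entryLabels (text) {
  return Array.from(text.matchAll(/@misc\s*\{\s*([^,\s]+)/g), m => m[1])
}

const sameLine = '@comment{ @misc{label2,\n  year = 2019\n  }\n}'
const emptyBlock = '@comment {} @misc{label4,\n  year = 2019\n}'

console.log(entryLabels(stripComments(sameLine, 'natbib'))) // → ['label2']
console.log(entryLabels(stripComments(sameLine, 'biblatex'))) // → []
console.log(entryLabels(stripComments(emptyBlock, 'biblatex'))) // → ['label4']
```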
