Giter Site home page Giter Site logo

dblp-to-csv's Introduction

dblp-to-csv

Convert a DBLP (Computer Science Bibliography) XML file to CSV format.

Usage

For each element in the XML file, so article, book, phdthesis, etc, this tool will generate an output file. Each element output file contains only the necessary columns for that element, meaning each column will be non-empty on at least one row. When multiple similar attribute tags are encountered on an element, e.g. multiple authors of an article, then those values will be contained within an array "[item1|item2|...|itemN]". When calling the tool, pass as parameters the input XML file, the DTD file and the desired output file name format; e.g. output.csv will generate output_article.csv, output_book.csv, etc.

Optional type annotated headers

Optionally, one can use --annotate to enable type annotation. Per element, this will create an extra header file containing a single line with an annotated header. This annotated header is of the format name:type per column or name:type[] for columns that contain at least one array entry. The type can be integer, float, boolean or string.

Commandline options

usage: XMLToCSV.py [-h] [--annotate] [--neo4j]
                   [--relations RELATIONS [RELATIONS ...]]
                   xml_filename dtd_filename outputfile

Parse the DBLP XML file and convert it to CSV

positional arguments:
  xml_filename          The XML file that will be parsed
  dtd_filename          The DTD file used to parse the XML file
  outputfile            The output CSV file

optional arguments:
  -h, --help            show this help message and exit
  --annotate            Write a separate annotated header with type
                        information
  --neo4j               Headers become more Neo4J-specific and a neo4j-import
                        shell script is generated for easy importing. Implies
                        --annotate.
  --relations RELATIONS [RELATIONS ...]
                        The element attributes that will be treated as
                        elements, and for which a relation to the parent
                        element will be created. For example, in order to turn
                        the author attribute of the article element into an
                        element with a relation, use "author:authors". The
                        part after the colon is used as the name of the
                        relation.

Example

chmod +x XMLToCSV.py
./XMLToCSV.py --annotate --neo4j dblp.xml dblp.dtd output.csv --relations author:authored_by journal:published_in publisher:published_by school:submitted_at editor:edited_by cite:has_citation series:is_part_of

This command will parse the DBLP XML file dblp.xml using the DTD file dblp.dtd. Because the --annotate option is used, type annotations will be generated as well. The --relations option is given, so the element attributes author, journal, publisher, school and editor will be treated as nodes and relations will be created between these nodes and the elements that contained them. The string output.csv is used as a pattern, so generated files will be named output_article.csv etc. The type annotations will be stored in similarly named files, for example output_article_header.csv. Since we also passed the --neo4j option, the type annotations will be Neo4j compatible, and the tool generates a shell script called neo4j_import.sh that can be run to import the generated CSV files into a Neo4j graph database using the neo4j-admin import bulk importer tool.

Requirements

  • Python 3.7+ (use python3 --version to confirm)
  • lxml (pip3 install lxml)
  • Or: pip3 install -r requirements.txt

Links

dblp-to-csv's People

Contributors

thomhurks avatar dependabot[bot] avatar

Stargazers

Robin Weil avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.