Giter Site home page Giter Site logo

cre-dev / xml2db Goto Github PK

View Code? Open in Web Editor NEW
5.0 1.0 3.0 809 KB

A Python package to load complex XML files into a relational database

Home Page: https://cre-dev.github.io/xml2db/

License: MIT License

Python 100.00%
data-engineering data-loader database mssql postgresql python relational-databases sqlalchemy xml xmlschema xsd elt etl lxml mysql

xml2db's Introduction

Loading complex XML files to a relational database

xml2db is a Python package which allows parsing and loading XML files into a relational database. It is designed to handle complex XML files which cannot be denormalized to flat tables. It works out of the box, without any custom mapping rules.

It can be used within an Extract, Load, Transform data pipeline pattern as it allows loading XML files into a relational data model which is very close from the source data, yet easy to work with.

Starting from an XSD schema which represents a given XML structure, xml2db builds a data model, i.e. a set of database tables linked to each other by foreign keys relationships. Then, it allows parsing and loading XML files into the database, and getting them back from the database into XML format if needed.

Loading XML files into a relational database with xml2db can be as simple as:

from xml2db import DataModel

# Create a data model of tables with relations based on the XSD file
data_model = DataModel(
    xsd_file="path/to/file.xsd", 
    connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb",
)
# Parse an XML file based on this XSD
document = data_model.parse_xml(
    xml_file="path/to/file.xml"
)
# Insert the document content into the database
document.insert_into_target_tables()

The resulting data model will adhere closely to the XSD schema. However, xml2db will perform a few systematic simplifications aimed at limiting the complexity of the resulting data model and the storage footprint. The resulting data model can be configured, but the above code will work out of the box, with reasonable defaults.

The raw data loaded into the database can then be processed if need be, using for instance DBT, SQL views or stored procedures aimed at extracting, correcting and formatting the data into more user-friendly tables.

This package uses sqlalchemy to interact with the database, so it should work with different database backends. Automated integration tests run against PostgreSQL, MySQL and MS SQL Server. xml2db does not work with SQLite. You may have to install additional packages to connect to your database (e.g. psycopg2 for PostgreSQL, pymysql for MySQL or pyodbc for MS SQL Server).

Please read the package documentation website for all the details!

Installation

The package can be installed, preferably in a virtual environment, using pip:

pip install xml2db

Testing

Running the tests requires installing additional development dependencies, after cloning the repo, with:

pip install -e .[tests,docs]

Run all tests with the following command:

python -m pytest

Integration tests require write access to a PostgreSQL or MS SQL Server database; the connection string is provided as an environment variable DB_STRING. If you want to run only conversion tests that do not require a database you can run:

pytest -m "not dbtest"

Contributing

xml2db is developed and used at the French energy regulation authority (CRE) to process complex XML data.

Contributions are welcome, as well as bug reports, starting on the project's issue page.

xml2db's People

Contributors

cre-os avatar martinv13 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.