Giter Site home page Giter Site logo

catmandu-dbi's Introduction

Build Status Coverage Status Cypress.io tests

LibreCat - an institutional repository

LibreCat is a new institutional repository system developed by LibreCat Group which has as its key features:

  • institutional repository
  • publication list manager for reseachers
  • institutional research data archive.

The development started in 2013 in Bielefeld and was made available on GitHub from the start. Since 2015 the code is in production at Bielefeld. In 2016 Ghent University started using the cataloging backend in production.

Features

  • Cataloging of many record types: Book, Book (Editor), Book Chapter, Book Review, Conference Abstract, Conference (Editor), Conference Paper, Dissertation, Encyclopedia Article, Journal Article, Special Issue, Newspaper Article, Preprint, Report, Translation, Translation (Section), Working Paper, Thesis, Research Data, Project, Award, Research Group
  • Drag and drop upload of full-text publications
  • Copycat from DOI, PubMED, ArXiv and Web of Science
  • Google Scholar indexation support
  • Citation styles configurable from Zotero Style Repository
  • Full MathJAX Latex support to add mathematical formulas in abstracts and titles
  • Pluggable authentication modules
  • Delegate input and management to user others
  • Multilingual support
  • ElasticSearch indexing
  • Pluggable file store backend
  • Command line support using 'Catmandu'
  • OAI-PMH and SRU
  • REST / content negotiation
  • Signposting
  • The LibreCat is open source and shipped with the same license as the Perl language: http://dev.perl.org/licenses/

Install

See our Wiki at: https://github.com/LibreCat/LibreCat/wiki

catmandu-dbi's People

Contributors

nichtich avatar nicolasfranck avatar nics avatar phochste avatar vpeil avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

jorol

catmandu-dbi's Issues

DBI iterator starts fast, but slows down afterwards

The reason for this is the generator in Catmandu::Store::DBI::Iterator,
that fetches the records in batches: for every batch a query is executed,
using "order by id", "offset" and "limit". This means that the underlying
dbi system has to jump further every time it is accessed.

Better to cache the cursor

Encoding issues getting data from a MySQL database

foobar_2018-10-31.txt

Problem

When I try to export data from a MySQL database with UTF-8 encoding, I encounter encoding issues. The value "Venetië, contépotlood" will be output as "Venetië, contépotlood"

How to reproduce.

  1. Import the SQL file I've attached into a MySQL database called "foobar"
  2. Run this command:
catmandu convert DBI --dsn dbi:mysql:foobar --user root --password root --query "SELECT * FROM foobar"
  1. The output in bash will look like this:
[{"title":"Venetië, contépotlood","id":1}]

More background.

  • I'm running MySQL 5.7.21 on OSX High Sierra
  • I'm using Perl 5.26.0 (via plenv)
  • I'm running Catmandu 1.10
  • I'm using Catmandu::DBI 0.0702
  • Column / Table information: UTF-8 Unicode (utf8) with collation utf8_general_ci.

I've tried setting the --encoding flag but that doesn't do anything.

programatically

I'm actually trying to import data from MySQL into SQLite using DBI programmatically in Perl. I've tried setting binmode manually and what-not to no avail. I've also tried setting the encoding flag with something like :encoding(utf8) but that doesn't work either.

Which leaves me doing this wonky code:

$importer->each(sub {
  my $item = shift;
  $item->{title} = decode('UTF-8', $item->{title});
  $exporter->add($item);

I don't think explicitly converting fields is a good practice. I've understood that Perl has something called 'internal encoding' as opposed to external encoding/decoding of the IO. My guess is that somewhere, data is encoded twice in UTF-8 which ends up causing issues.

Any ideas?

custom mapped text fields do not support utf-8

The reason for this is that the database encoding is set to sql_ascii by default.
Insertion fails with "wide character ..".

This does not happen for the field "data" where the type is "bytea", and the data
is accepted as-is.

Configure standard ORDER on `id` in generators

The current Catmandu::Store::DBI generator orders all results on id which requires many CPU resources and slows down long listing considerably. For default UUID id generators this sorting of ids is not of any use.

Error in documentation or bug?

The example doesn't work:

catmandu import JSON to DBI --data_source SQLite:mydb.sqlite < data.json

Works with

catmandu import JSON to DBI --data_source DBI:SQLite:mydb.sqlite < data.json

The documentation for the constuctor tells it should work without DBI

Create a new Catmandu::Store::DBI store using a DBI $data_source. The prefix "DBI:" is added automatically if needed.

Tested with Catmandu-DBI-0.0504

Error if database does not exist (mysql-driver)

Here's my catmandu.yml (tested with the mysql driver)

store:
  default:
    package: DBI 
    options:
      data_source: "DBI:mysql:database=test"
      username: whoami
      password: secret

The command

$ echo '{"hello" : "world"}' | catmandu import

dies with no database exists. Is this behaviour intentional? In case of Elasticsearch the index gets created if needed. Should we add here a "create database $name"?

DBD::mysql::db begin_work failed

We are seeing the following error message every morning in a worker using publication->add:

"DBD::mysql::db begin_work failed: Turning off AutoCommit failed at [....]/LibreCat/local/lib/perl5/Catmandu/Store/DBI.pm line 117"

Restarting the worker will fix the problem, but only until the next morning. The worker itself keeps running and working fine. Only writing to the db fails.

Google led me to this bug report: perl5-dbi/DBD-mysql#202 which, unfortunately, does not have a solution yet.

Any idea on how to fix this? (As a temporary fix we are restarting workers nightly via cron.)

Support custom mapping types

Currently the data mapping only allows string, integer,binary,datetime but doesn't check for other types like int4, int2, bool. Make it an option to use an exact data type when speficied. E.g.

bags:
        data:
                mapping:
                    _id:
                        type: 'varchar(255)'
                        required: 1
                        index: 1
                        column: handle
                        unique: 1

select fails when key is not a mapped field

TODO for branch feature-mapping:

$bag->select(title => "Nicolas")

will fail if "title" is not included as a mapped field, because
it will issue an sql query on that non-existant field.

Solution: check mapping for given field, and if it does not exist,
then do full scan. Document this slow behaviour.

Make it configurable to set the encoding of the database

In a previous version of Catmandu::DBI the default encoding of the SQLite databases was ascii now it is standaard uncode. This has the effect that old databases can't be read with Catmandu tools when UTF-8 encoded data was stored in the database.

Oops! malformed UTF-8 character in JSON string, at character offset 138 (before "\x{fffd}enri Constan...") at /opt/librecat/local/lib/perl5/Catmandu/Serializer/json.pm line 16.

UTF8 issues with _id in SQLite3

Problem

I'm trying to import a CSV file with UTF8 data in a SQLite store using the DBI module. The _id field has values which contain non-ASCII characters (é, à, ë,...) because these are names. When I'm trying to convert the SQLite database back again to JSON, CSV,... the characters in the _id field are garbled:

So this Abbate, Nicolò dell' becomes this Abbate, Nicolò dell'

Steps to reproduce

Given this CSV file called "creator.csv"

_id,id,rkd_uri,viaf_uri,wikidata_uri,viaf_alternate
"Abbate, Nicolò dell'",20001109,https://rkd.nl/explore/artists/112,http://viaf.org/viaf/88804962,,"Nicolo Abate,Niccolo,Niccolò dell'Abate,Niccolò Dell'Abbate,Nicolo Dell'Abbate,Նիկոլո դել Աբատե,Niccolo Abbate,Niccolò Dell'Abate,Niccolò dell' Abbate,Abbate,Nicolo dell'Abate,Messer Niccolo,Niccolino dell'Abate,Nicolo dell'Abbate,Nicolò Abbati,Niccolò del Abbate,尼科洛·德尔·阿巴特,Nicolas Labbé,Nicolò �dell� Abbate,Nicolò dell’Abbate,Niccolo �dell'� Abate,Niccolò dell' Abate,Niccolò Abate,Niccolino Dell' Abbate,Niccolò dell' Abatti,Niccolò dell' Abate,Messer Niccolò,Nicolo dell' Abate,Nicolò Dell'Abate,Abati,Messer Niccolino,Nicolò dell' Abate,Nicolò dell'Abbate,Niccolo dell' Abbate,Nicolas Labbé,Niccolò Pisano de Brusis,Nicolas de Batty,Nicolas de Labaty,Нікколо дель Аббате,ニコロ・デッラバーテ,Abbate Nïkolo delʹ,Messer Niccolò,Nicolò dell' Abate,Никколо дель Аббате,Niccolò Pisano di Bartolomeo,Nicolò dell Abbate,Niccolo dell’Abbate,Nicolo Dell'Abate,Niccolò Dell' Abate,Nicolò Dell'Abate,Nicolò dell'Abate,Niccolò Dell'Abbate,Niccolò dell'Abate,أبيات نيقولا دل,Dell'Abate,Nicolò Abbati,Niccolò dell'Abbate,Niccolo dell'Abate,Nicolo del Abati,Nikolo del Abate,ابباتە نىيكولو دەل,Nicolo del Abbati,Niccolò dell' Abbate,Nicolo de Labbaty,Niccolo di Bartolomeo Pisano,Niccolo dell' Abate,Nicola,Nicolo �del� Abbati,Nicolò dell Abbate,Niccolò,Niccolò Abbati,Nicolaio,Niccolo Dell'Abbate,Dell'Abbate,Niccolò Pisano,Nicolo dell Abate,Niccolo �dell'� Abbate,Nicolas Abbati,Niccolò Abbati,Николо дел Абате,1470 Niccolò,Аббате Николо дель,Niccolino,Abate,Niccolò,Niccolo' dell' Abate,Niccolò Abatti,Nicolo �del� Abati"

I'm import this into a SQLite database using DB like this:

catmandu import CSV to DBI --data_source dbi:SQLite:/tmp/creators.sqlite < creators.csv

And then I go back to JSON (or any other format) with this:

catmandu export DBI --data_source dbi:SQLite:/tmp/creators.sqlite to JSON --pretty 1

Which yields this output:

[{
   "wikidata_uri" : "",
   "_id" : "Abbate, Nicolò dell'",
   "viaf_alternate" : "Nicolo Abate,Niccolo,Niccolò dell'Abate,Niccolò Dell'Abbate,Nicolo Dell'Abbate,Նիկոլո դել Աբատե,Niccolo Abbate,Niccolò Dell'Abate,Niccolò dell' Abbate,Abbate,Nicolo dell'Abate,Messer Niccolo,Niccolino dell'Abate,Nicolo dell'Abbate,Nicolò Abbati,Niccolò del Abbate,尼科洛·德尔·阿巴特,Nicolas Labbé,Nicolò �dell� Abbate,Nicolò dell’Abbate,Niccolo �dell'� Abate,Niccolò dell' Abate,Niccolò Abate,Niccolino Dell' Abbate,Niccolò dell' Abatti,Niccolò dell' Abate,Messer Niccolò,Nicolo dell' Abate,Nicolò Dell'Abate,Abati,Messer Niccolino,Nicolò dell' Abate,Nicolò dell'Abbate,Niccolo dell' Abbate,Nicolas Labbé,Niccolò Pisano de Brusis,Nicolas de Batty,Nicolas de Labaty,Нікколо дель Аббате,ニコロ・デッラバーテ,Abbate Nïkolo delʹ,Messer Niccolò,Nicolò dell' Abate,Никколо дель Аббате,Niccolò Pisano di Bartolomeo,Nicolò dell Abbate,Niccolo dell’Abbate,Nicolo Dell'Abate,Niccolò Dell' Abate,Nicolò Dell'Abate,Nicolò dell'Abate,Niccolò Dell'Abbate,Niccolò dell'Abate,أبيات نيقولا دل,Dell'Abate,Nicolò Abbati,Niccolò dell'Abbate,Niccolo dell'Abate,Nicolo del Abati,Nikolo del Abate,ابباتە نىيكولو دەل,Nicolo del Abbati,Niccolò dell' Abbate,Nicolo de Labbaty,Niccolo di Bartolomeo Pisano,Niccolo dell' Abate,Nicola,Nicolo �del� Abbati,Nicolò dell Abbate,Niccolò,Niccolò Abbati,Nicolaio,Niccolo Dell'Abbate,Dell'Abbate,Niccolò Pisano,Nicolo dell Abate,Niccolo �dell'� Abbate,Nicolas Abbati,Niccolò Abbati,Николо дел Абате,1470 Niccolò,Аббате Николо дель,Niccolino,Abate,Niccolò,Niccolo' dell' Abate,Niccolò Abatti,Nicolo �del� Abati",
   "viaf_uri" : "http://viaf.org/viaf/88804962",
   "id" : "20001109",
   "rkd_uri" : "https://rkd.nl/explore/artists/112"
}]

Notice how the data in the other fields are all okay, but the data in the _id field is not.

Probable causes

The problem is that the _id field is a TEXT type, while all other fields are stored as a binary blob in SQLite. So, that explains why it works perfectly for the other fields like "viaf_alternate". It fails for the _id field because somewhere, somehow the UTF-8 conversion is not done correctly.

On the CLI, inspecting the SQLite database with sqlite3 yields this:

$ sqlite3 /tmp/creators.sqlite
SQLite version 3.16.0 2016-11-04 19:09:39
Enter ".help" for usage hints.
sqlite> select * from data;
{"wikidata_uri":"","rkd_uri":"https://rkd.nl/explore/artists/112","id":"20001109","viaf_alternate":"Nicolo Abate,Niccolo,Niccolò dell'Abate,Niccolò Dell'Abbate,Nicolo Dell'Abbate,Նիկոլո դել Աբատե,Niccolo Abbate,Niccolò Dell'Abate,Niccolò dell' Abbate,Abbate,Nicolo dell'Abate,Messer Niccolo,Niccolino dell'Abate,Nicolo dell'Abbate,Nicolò Abbati,Niccolò del Abbate,尼科洛·德尔·阿巴特,Nicolas Labbé,Nicolò �dell� Abbate,Nicolò dell’Abbate,Niccolo �dell'� Abate,Niccolò dell' Abate,Niccolò Abate,Niccolino Dell' Abbate,Niccolò dell' Abatti,Niccolò dell' Abate,Messer Niccolò,Nicolo dell' Abate,Nicolò Dell'Abate,Abati,Messer Niccolino,Nicolò dell' Abate,Nicolò dell'Abbate,Niccolo dell' Abbate,Nicolas Labbé,Niccolò Pisano de Brusis,Nicolas de Batty,Nicolas de Labaty,Нікколо дель Аббате,ニコロ・デッラバーテ,Abbate Nïkolo delʹ,Messer Niccolò,Nicolò dell' Abate,Никколо дель Аббате,Niccolò Pisano di Bartolomeo,Nicolò dell Abbate,Niccolo dell’Abbate,Nicolo Dell'Abate,Niccolò Dell' Abate,Nicolò Dell'Abate,Nicolò dell'Abate,Niccolò Dell'Abbate,Niccolò dell'Abate,أبيات نيقولا دل,Dell'Abate,Nicolò Abbati,Niccolò dell'Abbate,Niccolo dell'Abate,Nicolo del Abati,Nikolo del Abate,ابباتە نىيكولو دەل,Nicolo del Abbati,Niccolò dell' Abbate,Nicolo de Labbaty,Niccolo di Bartolomeo Pisano,Niccolo dell' Abate,Nicola,Nicolo �del� Abbati,Nicolò dell Abbate,Niccolò,Niccolò Abbati,Nicolaio,Niccolo Dell'Abbate,Dell'Abbate,Niccolò Pisano,Nicolo dell Abate,Niccolo �dell'� Abbate,Nicolas Abbati,Niccolò Abbati,Николо дел Абате,1470 Niccolò,Аббате Николо дель,Niccolino,Abate,Niccolò,Niccolo' dell' Abate,Niccolò Abatti,Nicolo �del� Abati","viaf_uri":"http://viaf.org/viaf/88804962"}|Abbate, Nicolò dell'
sqlite> .schema data
CREATE TABLE "data"("data" BLOB,"id" TEXT UNIQUE NOT NULL);

So, if you look at the output, you'll notice that the characters in the _id field are stored as not-encoded. It's just that when Catmandu::DBI retrieves them, somewhere it gets converted / garbled.

Any ideas what could go wrong here?

Support a basic search on mapped fields

If I have a store like:

my $store = Catmandu::Store::DBI->new(
    data_source => 'DBI:mysql:database=test',
    bags => {
        # books table
        books => {
            mapping => {
                # these keys will be directly mapped to columns
                # all other keys will be serialized in the data column
                title => {type => 'string', required => 1, column => 'book_title'},
                isbn => {type => 'string', unique => 1},
                authors => {type => 'string', array => 1}
            }
        }
    }
);

I would like to do a :

my $hits = $store->bags('books')->search("title = 'DNA'");

This documentation mentions a reference to Catmandu::Searchable anyway.

Optionally add versioning

One should optionally enable a second table with revisions:

create table if not exists {$name}_revisions (
    rev_id varchar(255) not null,   # foreign key to $name.id
    rev_data longblob,              # null if deleted but recreated in $name
    rev_timestamp timestamp not null
)

Before modification or deletion, a record is first copied to the revisions table, so the current version is not included in this optional table. The combination of rev_id and rev_timestamp should be unique

When other stores provide versioning too, one could abstract related actions, such as revert and (get/list)revisions for easy access.

Support custom ORDER field

In the current implementation the DBI Iterator is ordered on _id. Make this order configurable (and maybe also ignore order in some cases)

Mapping to datetime columns fails for MySQL

I have a mapping:

mapping:
            date_updated:
                type: datetime
                required: 1

This generates an error in MySQL 5.5 installations:

BD::mysql::db do failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '(3) NOT NULL,`id` VARCHAR(255) BINARY UNIQUE NOT NULL,`date_created` DATETIME(3)' at line 1 at /opt/librecat/local/lib/perl5/Catmandu/Store/DBI/Handler/MySQL.pm line 58.

I guess DATETTIME(3) isnt valid for MySQL

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.