csv2db

The CSV to database command line loader

Home Page: https://csv2db.github.io

License: Apache License 2.0


csv2db's Introduction

                  ___       ____  
  ___________   _|__ \ ____/ / /_ 
 / ___/ ___/ | / /_/ // __  / __ \
/ /__(__  )| |/ / __// /_/ / /_/ /
\___/____/ |___/____/\____/_____/ 
                                  

The CSV to database command line loader.

csv2db reads CSV files and loads them into a database. Rather than having to go through the CSV data first and find out what columns and data types are present in the CSV files, csv2db will read the header in each CSV file and automatically load the data into the columns of the same name in the target table. The case of the header column names will be preserved as present in the CSV file. Spaces in the header column names are automatically replaced with _ characters; for example, the column station id in the CSV file will be interpreted as the station_id column in the table.

This approach allows you to get data into the database first and worry about the data cleansing part later, which is usually much easier once the data is in the database rather than in the CSV files.

csv2db is capable of scanning all CSV file headers at once and generating a CREATE TABLE statement with all the column names present. This is particularly useful if the format of the CSV files has changed over time or if you want to load different CSV file types into the same database table.
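
For example, a single generate run can scan all matching file headers at once via a wildcard (a hedged sketch; the file names are illustrative):

$ ./csv2db generate -f "data/*-tripdata.csv" -t trips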

Usage

$ ./csv2db -h
usage: csv2db [-h] {generate,gen,load,lo} ...

The CSV to database command line loader.
Version: 1.6.1
(c) Gerald Venzl

positional arguments:
  {generate,gen,load,lo}
    generate (gen)      Prints a CREATE TABLE SQL statement to create the
                        table and columns based on the header row of the CSV
                        file(s).
    load (lo)           Loads the data from the CSV file(s) into the database.

options:
  -h, --help            show this help message and exit
$ ./csv2db generate -h
usage: csv2db generate [-h] [-f FILE] [-e ENCODING] [-v] [--debug]
                       [-o {oracle,mysql,postgres,sqlserver,db2}] [-t TABLE]
                       [-c COLUMN_TYPE] [-s SEPARATOR] [-q QUOTE]
                       [--case-insensitive-identifiers] [--quote-identifiers]

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  The file to read the header from, by default all
                        *.csv.zip files
  -e ENCODING, --encoding ENCODING
                        The file encoding to be used to read the file, see htt
                        ps://docs.python.org/3/library/codecs.html#standard-
                        encodings for a list of all allowed encodings.
  -v, --verbose         Verbose output.
  --debug               Debug output.
  -o {oracle,mysql,postgres,sqlserver,db2}, --dbtype {oracle,mysql,postgres,sqlserver,db2}
                        The database type.
  -t TABLE, --table TABLE
                        The table name to use.
  -c COLUMN_TYPE, --column-type COLUMN_TYPE
                        The column type to use for the table generation.
  -s SEPARATOR, --separator SEPARATOR
                        The columns separator character(s).
  -q QUOTE, --quote QUOTE
                        The quote character on which a string won't be split.
  --case-insensitive-identifiers
                        If set, all identifiers will be upper-cased.
  --quote-identifiers   If set, all table and column identifiers will be
                        quoted.
$ ./csv2db load -h
usage: csv2db load [-h] [-f FILE] [-e ENCODING] [-v] [--debug] -t TABLE
                   [-o {oracle,mysql,postgres,sqlserver,db2}] -u USER
                   [-p PASSWORD] [-m HOST] [-n PORT] [-d DBNAME] [-b BATCH]
                   [-s SEPARATOR] [-q QUOTE] [-a] [--truncate] [-i] [-l]
                   [--case-insensitive-identifiers] [--quote-identifiers]

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  The file to load, by default all *.csv.zip files
  -e ENCODING, --encoding ENCODING
                        The file encoding to be used to read the file, see htt
                        ps://docs.python.org/3/library/codecs.html#standard-
                        encodings for a list of all allowed encodings.
  -v, --verbose         Verbose output.
  --debug               Debug output.
  -t TABLE, --table TABLE
                        The table name to use.
  -o {oracle,mysql,postgres,sqlserver,db2}, --dbtype {oracle,mysql,postgres,sqlserver,db2}
                        The database type.
  -u USER, --user USER  The database user to load data into.
  -p PASSWORD, --password PASSWORD
                        The database schema password. csv2db will prompt for
                        the password if the parameter is missing which is a
                        more secure method of providing a password.
  -m HOST, --host HOST  The host name on which the database is running on.
  -n PORT, --port PORT  The port on which the database is listening. If not
                        passed on the default port will be used (Oracle: 1521,
                        MySQL: 3306, PostgreSQL: 5432, SQL Server: 1433, DB2:
                        50000).
  -d DBNAME, --dbname DBNAME
                        The name of the database.
  -b BATCH, --batch BATCH
                        How many rows should be loaded at once.
  -s SEPARATOR, --separator SEPARATOR
                        The columns separator character(s).
  -q QUOTE, --quote QUOTE
                        The quote character on which a string won't be split.
  -a, --directpath      Execute a direct path INSERT load operation (Oracle
                        only).
  --truncate            Truncate/empty table before loading.
  -i, --ignore          Ignore erroneous/invalid lines in files and continue
                        the load.
  -l, --log             Log erroneous/invalid lines in *.bad file of the same
                        name as the input file (this implies the --ignore
                        option).
  --case-insensitive-identifiers
                        If set, all identifiers will be upper-cased.
  --quote-identifiers   If set, all table and column identifiers will be
                        quoted.

How to use csv2db

Loading CSV files into the database

csv2db can load plain-text CSV files as well as compressed CSV files in .zip or .gz format without having to uncompress them first.

$ ./csv2db load -f test/resources/201811-citibike-tripdata.csv -t citibikes -u csv_data -p csv_data -d ORCLPDB1

Loading file test/resources/201811-citibike-tripdata.csv
File loaded.

$ ./csv2db load -f test/resources/201811-citibike-tripdata.csv.gz -t citibikes -u csv_data -p csv_data -d ORCLPDB1

Loading file test/resources/201811-citibike-tripdata.csv.gz
File loaded.

The csv2db --verbose option provides verbose output.

$ ./csv2db load -f test/resources/201811-citibike-tripdata.csv -t citibikes -u csv_data -p csv_data -d ORCLPDB1 --verbose
Finding file(s).
Found 1 file(s).
Establishing database connection.

Loading file test/resources/201811-citibike-tripdata.csv
16 rows loaded.
File loaded.

Closing database connection.

csv2db can load multiple files at once, either by using wildcard characters (e.g. data*.csv.zip) or by passing the folder containing the CSV files.

Note: Strings including wildcard characters have to be enclosed in "".

$ ./csv2db load -f "test/resources/201811-citibike-tripdata.*" -t citibikes -u csv_data -p csv_data -d ORCLPDB1 --verbose
Finding file(s).
Found 3 file(s).
Establishing database connection.

Loading file test/resources/201811-citibike-tripdata.csv
16 rows loaded.
File loaded.


Loading file test/resources/201811-citibike-tripdata.csv.gz
10 rows loaded.
File loaded.


Loading file test/resources/201811-citibike-tripdata.csv.zip
10 rows loaded.
File loaded.

Closing database connection.
$ ./csv2db load -f test/resources -t citibikes -u csv_data -p csv_data -d ORCLPDB1 --verbose
Finding file(s).
Found 3 file(s).
Establishing database connection.

Loading file test/resources/201811-citibike-tripdata.csv
16 rows loaded.
File loaded.


Loading file test/resources/201811-citibike-tripdata.csv.gz
10 rows loaded.
File loaded.


Loading file test/resources/201811-citibike-tripdata.csv.zip
10 rows loaded.
File loaded.

Closing database connection.

csv2db will load all values as strings. You can either load all data into a staging table with all columns being strings as well, or rely on implicit data type conversion on the database side.

Create a staging table

csv2db can generate the SQL statement for a staging table for your data using the generate command:

$ ./csv2db generate -f test/resources/201811-citibike-tripdata.csv
CREATE TABLE <TABLE NAME>
(
  TRIPDURATION VARCHAR(1000),
  STARTTIME VARCHAR(1000),
  STOPTIME VARCHAR(1000),
  START_STATION_ID VARCHAR(1000),
  START_STATION_NAME VARCHAR(1000),
  START_STATION_LATITUDE VARCHAR(1000),
  START_STATION_LONGITUDE VARCHAR(1000),
  END_STATION_ID VARCHAR(1000),
  END_STATION_NAME VARCHAR(1000),
  END_STATION_LATITUDE VARCHAR(1000),
  END_STATION_LONGITUDE VARCHAR(1000),
  BIKEID VARCHAR(1000),
  USERTYPE VARCHAR(1000),
  BIRTH_YEAR VARCHAR(1000),
  GENDER VARCHAR(1000)
);

By default, you will have to fill in the table name. You can also specify the table name via the -t option:

$ ./csv2db generate -f test/resources/201811-citibike-tripdata.csv -t STAGING
CREATE TABLE STAGING
(
  TRIPDURATION VARCHAR(1000),
  STARTTIME VARCHAR(1000),
  STOPTIME VARCHAR(1000),
  START_STATION_ID VARCHAR(1000),
  START_STATION_NAME VARCHAR(1000),
  START_STATION_LATITUDE VARCHAR(1000),
  START_STATION_LONGITUDE VARCHAR(1000),
  END_STATION_ID VARCHAR(1000),
  END_STATION_NAME VARCHAR(1000),
  END_STATION_LATITUDE VARCHAR(1000),
  END_STATION_LONGITUDE VARCHAR(1000),
  BIKEID VARCHAR(1000),
  USERTYPE VARCHAR(1000),
  BIRTH_YEAR VARCHAR(1000),
  GENDER VARCHAR(1000)
);

csv2db will use VARCHAR(1000) as the default data type for all columns of the staging table. If you wish to use a different data type, you can specify it via the -c option:

$ ./csv2db generate -f test/resources/201811-citibike-tripdata.csv -t STAGING -c CLOB
CREATE TABLE STAGING
(
  TRIPDURATION CLOB,
  STARTTIME CLOB,
  STOPTIME CLOB,
  START_STATION_ID CLOB,
  START_STATION_NAME CLOB,
  START_STATION_LATITUDE CLOB,
  START_STATION_LONGITUDE CLOB,
  END_STATION_ID CLOB,
  END_STATION_NAME CLOB,
  END_STATION_LATITUDE CLOB,
  END_STATION_LONGITUDE CLOB,
  BIKEID CLOB,
  USERTYPE CLOB,
  BIRTH_YEAR CLOB,
  GENDER CLOB
);

The idea is to have a staging table that data can be loaded into first, and to figure out the correct data types for each column afterwards.
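
Once the data is in the staging table, it can be moved into a properly typed table with explicit conversions, for example (a hedged sketch; the target table, column list, and conversion syntax are illustrative and vary by database):

INSERT INTO trips (tripduration, starttime)
  SELECT CAST(tripduration AS INTEGER),
         CAST(starttime AS TIMESTAMP)
    FROM STAGING;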

Installation

You can install csv2db either by installing it as a Python package, which will automatically install all dependencies except the Db2 driver (as that driver is still in beta):

$ python3 -m pip install csv2db
$ csv2db

or by cloning this Git repository:

$ git clone https://github.com/csv2db/csv2db

or by downloading one of the releases:

$ LOCATION=$(curl -s https://api.github.com/repos/csv2db/csv2db/releases/latest | grep "tag_name" | awk '{print "https://github.com/csv2db/csv2db/archive/" substr($2, 2, length($2)-3) ".zip"}') ; curl -L -o csv2db.zip $LOCATION
$ unzip csv2db.zip
$ cd csv2db*
$ ./csv2db

In order for csv2db to work, the appropriate database driver or drivers need to be installed. This installation happens automatically when installing csv2db as a Python package (pip install csv2db). The following drivers are used, all of which are available on pypi.org:

  • Oracle: oracledb
  • MySQL: mysql-connector-python
  • PostgreSQL: psycopg
  • SQL Server: pymssql
  • Db2: ibm-db

You can install any of these drivers via pip:

$ python3 -m pip install oracledb
$ python3 -m pip install mysql-connector-python
$ python3 -m pip install psycopg[binary]
$ python3 -m pip install pymssql
$ python3 -m pip install ibm-db

For more instructions on how to install the driver(s) in your environment, please see the documentation of the individual driver or refer to the csv2db Installation Guide.

NOTE: You only have to install the driver for the database(s) that you want to load data into.

Miscellaneous

What csv2db is and, more importantly, what it is not!

Since the very inception of csv2db, it has been a core principle not to become an ETL tool with all the bells and whistles. There are already many very good ETL tools out there, and the world doesn't need yet another one. Instead, csv2db should aid users as a simple command-line tool for getting rows from a delimited file into a database table, and no more. Following that core design goal, csv2db will most likely never provide many database-specific options or parameters, nor will it deal with explicit data type conversion, character set conversion, or the globalization support that some databases offer. Users who require any of these features should look to one of the many existing ETL tools.

Simply put, csv2db does not do much more than take rows from a delimited file and execute INSERT INTO statements with the values of those rows. It is there to help users get the contents of a file into a database table quickly, where the data can then be further processed.

Using the csv2db Docker image

csv2db is also offered as a Docker image, making the usage of csv2db quick and easy without requiring any installation.

To run csv2db, simply invoke docker run or podman run, for example:

# quick test the image
podman run --rm csv2db --help

To load data, bind the folder containing the input files as a Docker volume:

podman run --rm -v <input files folder>:/input/ csv2db load -f /input/<input file(s)> ...

For example:

podman run --rm -v $HOME/input_files:/input csv2db \
  load -f /input/201811-citibike-tripdata.csv -t citibikes \
  -u db_user -p db_pass -d my_db

Exit codes

csv2db returns the following exit codes:

Exit code           Value  Meaning
SUCCESS             0      Successful execution of the program.
GENERIC_ERROR       1      A generic error occurred.
ARGUMENT_ERROR      2      An argument is either missing or incorrect.
DATABASE_ERROR      3      A database error occurred.
DATA_LOADING_ERROR  4      An error occurred during loading of data. csv2db will continue to process other files, if any.
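
A shell script can react to these exit codes, for example (a minimal sketch; file and connection parameters are illustrative):

./csv2db load -l -f /input/data.csv -t staging -u csv_data -d ORCLPDB1
rc=$?
if [ "$rc" -eq 4 ]; then
    # DATA_LOADING_ERROR: some rows were rejected, -l wrote them to *.bad files
    echo "Some rows were not loaded, check the *.bad files." >&2
elif [ "$rc" -ne 0 ]; then
    exit "$rc"
fi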

$NO_COLOR support

csv2db is capable of color-coded output and will produce it by default (except on Windows).
Debug output is yellow.
Error output is red.
This can be deactivated by setting the $NO_COLOR environment variable. For more details on $NO_COLOR see https://no-color.org/
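
For example (connection parameters are illustrative):

$ NO_COLOR=1 ./csv2db load -f data.csv -t staging -u csv_data -d ORCLPDB1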

LICENSE

Copyright 2024 Gerald Venzl

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

csv2db's People

Contributors

aberezin, gvenzl


csv2db's Issues

Provide a parameter for Date/Time format

Provide a parameter to pass on the date/time format of CSV date/time column data.
This will avoid (potentially silent) conversion errors when the user's client uses the wrong NLS_DATE_FORMAT setting for whatever reason.

Exit program when no files found

When not a single file has been identified, the program continues further, including connecting to and disconnecting from the database for the load command, or generating an empty CREATE TABLE statement for the generate command. This is pointless, as no files have been identified and the program can already exit.

Option: truncate table before load

Include a new option for truncating a table before the load.
This is useful when you want to load files into a staging table on a regular, scripted, or scheduled basis.

Don't print color codes on non-color capable terminals

Right now csv2db prints color codes regardless of whether the terminal is color capable or not:

C:\Users\gvenzl\Downloads\csv2db-1.4.0>csv2db lo -f test -t bla -u u -p p -v --debug
←[33mDEBUG: 2019-11-10 18:18:39.460232: Table name: bla
←[0m←[33mDEBUG: 2019-11-10 18:18:39.460232: Column separator: ,
←[0m←[33mDEBUG: 2019-11-10 18:18:39.460232: Column escape character: "
←[0mFinding file(s).
Found 0 file(s).

Add return code to indicate that data loading had an issue

If the data loading of a file was unsuccessful, csv2db will gracefully stop loading the file and, if present, move on to the next. The exit code will be that of a normal execution, as it was a normal execution and csv2db handled the faulty data gracefully. However, it might make sense to have a separate error code to indicate that not all data has been loaded successfully, so that shell scripts can react accordingly, notify the user, etc.

Error generating statement: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Picking up issue when files are "UTF-16LE" encoded:

:~/csv2db# file -i testimportdb.csv
testimportdb.csv: text/plain; charset=utf-16le

~/csv2db# ./csv2db generate --debug -s ' ' -f testimportdb.csv

Error generating statement: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

A quick fix is to convert the file; then it executes successfully and generates the table.

:~/csv2db# iconv -f UTF-16LE -t UTF-8 testimportdb.csv -o testimportdbdecode.csv

:~/csv2db# python3 --version
Python 3.8.5

:~/csv2db# ./csv2db -h
usage: csv2db [-h] {generate,gen,load,lo} ...

The CSV to database command line loader.
Version: 1.5.1

Provide mechanism for quoted identifiers

Databases have a list of reserved words that cannot be used in standard SQL as they confuse the parser about what the user actually means, e.g. an identifier or a keyword. For example, having a column called from would make it hard for the parser of the SELECT statement SELECT id, txt, tms, from FROM table to know whether the first from is supposed to be interpreted as a column name or whether it's simply a syntactical error.
Hence databases, by default, don't allow the use of reserved words as arbitrary identifier names.

However, if a user nonetheless wishes to use such a reserved word, databases provide a means of quoting identifiers, telling the parser to interpret whatever is written within the quotes literally (for most databases that also means case-sensitively).
In this case, the SELECT statement would look like SELECT id, txt, tms, "from" FROM table.
Here the " quotes tell the parser not to interpret anything within the quotation marks, but to take it literally.

As there is a legitimate use of quotation marks for identifiers, csv2db shall have a flag to quote all identifiers if the user wishes to do so.

Are tabs as separator with no quotes around strings possible

CSV is fine, but TSV is finer ;-)

I prefer using tabs as separators, as tabs are most likely not included in strings. Additionally, when using tabs as separators one doesn't need quotes around strings, which makes some processes easier.

Is it or will it be possible to use csv2db that way?

Greetings to lovely SF

Wolfgang

Extra parameter for TNSNAMES.ORA use scenario (Oracle OCI instances connection)

I suggest adding an extra parameter --tnsname TNSNAME to accommodate the use of an Oracle tnsnames.ora file. This will simplify connections to OCI cloud instances, for example, or any other instances already described as aliases in the Oracle tnsnames.ora file. Otherwise there is no option to use TLS connections to Oracle with csv2db.

I made the necessary changes to the source code (attached) to accommodate this option. If it is specified and not empty, it overrides host, port, and dbname, and only user and password are used. Works OK with my OCI Oracle instance.

Kind regards,
csv2db_src.zip

Enhancement request - direct path load for Oracle

Allow for direct path load in Oracle which would:

  • bump up the batch size to (say) 8k, i.e. always larger than what could fit in a single block
  • add an APPEND_VALUES hint to the INSERT statement

(Unsure if the cx_oracle driver supports this)

IBM Db2 LUW support

Add support for IBM Db2 LUW.
I can test it on IBM Db2 LUW version 10.5, 11.1 on Linux.

Be more precise with status output

Right now, when the loading of a file is finished, successfully or not, the output states "Done". This was intended to tell the user that the file has been processed; however, it can be misunderstood to mean that the file has been loaded successfully.

Instead, an explicit "File loaded." should be printed when the file is actually loaded successfully, and a "Skipping file." otherwise.

Introduce logging of bad rows into error file

This is a follow-on enhancement of ER #43.
It may make sense that, if errors are to be skipped, these erroneous rows are logged into an error file so that the user can go back later and see which rows didn't get loaded.

Command `generate`: Fail, when header field is empty

What happened:

Consider a CSV like

Ingredient code,SR description,Nutrient code,Nutrient value,Nutrient value source,Derivation code,SR AddMod year,Start date,End date,,,
1001,"Butter, salted",203,0.85,SR Legacy,,1976,1/1/2017 0:00,12/31/2018 0:00,,,
1001,"Butter, salted",204,81.11,SR Legacy,,1976,1/1/2017 0:00,12/31/2018 0:00,,,

The last three header fields don't contain a name.

If you run

$ csv2db generate --file fndds_ingredient_nutrient_value.csv --table fndds_ingredient_nutrient_value --quote-identifiers --dbtype mysql

You will get an output like

CREATE TABLE `fndds_ingredient_nutrient_value`
(
 `INGREDIENT_CODE` VARCHAR(1000),
 `SR_DESCRIPTION` VARCHAR(1000),
 `NUTRIENT_CODE` VARCHAR(1000),
 `NUTRIENT_VALUE` VARCHAR(1000),
 `NUTRIENT_VALUE_SOURCE` VARCHAR(1000),
 `DERIVATION_CODE` VARCHAR(1000),
 `SR_ADDMOD_YEAR` VARCHAR(1000),
 `START_DATE` VARCHAR(1000),
 `END_DATE` VARCHAR(1000),
 `` VARCHAR(1000)
);

What you expected to happen:

I would expect csv2db to fail with exit code > 0.
Why?
Because

  • I would consider database fields without a name as not a valid database field (how to query it?)
  • It might drop data because the last three fields are summed up into one field
  • I am not even sure if the load command would work

Anything else we need to know?:

The --quote-identifiers can be ignored for this bug.
This is WIP in #51.

@gvenzl Please let me know what you think about it. If you agree with it, I would apply a PR.

Postgres fast load

Hi Gerald,

First off this is a great tool. Thanks very much.

My use case is to upload CSV files into an AWS Postgres DB for data analytics (Windows environment). Your tool works great; however, it is a bit slow when dealing with my data volumes. I get files with a million rows that need to be uploaded to Postgres on a daily basis.

For testing, I loaded 20000 rows and it took 2-3 mins, while using pgsql COPY load it took 2 seconds.

I was wondering if there is a way to tap into using pgsql for a faster upload?

I am assuming your process, in order to be generic, uses INSERT statements behind the scenes?

Also I was wondering if we can add a few more options to "csv2db" that I think other users may find useful - and hopefully you see value in it.

Additional useful options for "csv2db load"

  1. Truncate table and load (optional)
  2. Append to target table (default)

Additional useful option for "csv2db generate"
3. Create table in the target database (the default for now, i.e. to print the CREATE TABLE script)

Let me know what you think.

Thanks
Sunil

Check for mandatory command line parameters

Right now it is not checked whether mandatory parameters have been passed on.
The tool should check for all mandatory parameters and produce errors if they haven't been passed on.

Direct path loading hint at wrong position

The direct path loading hint APPEND_VALUES introduced with ER #8 is at the wrong position in the INSERT statement: behind the INTO keyword instead of immediately following the INSERT keyword.
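
For illustration, Oracle expects the hint immediately after the INSERT keyword (table and column are illustrative):

INSERT /*+ APPEND_VALUES */ INTO trips (tripduration) VALUES (:1);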

MySQL: Support for `LOAD DATA INFILE`

What would you like to be added:

Row inserts are done via (bulk) INSERTs right now.
For bigger imports, MySQL provides a LOAD DATA statement.
See 13.2.7 LOAD DATA Statement in the MySQL docs.

Adding support for this command would be cool.

Why is this needed:

This would speed up the import process of bigger CSV datasets quite significantly.

Anything else we need to know?:

Dedicated speed tests haven't been done yet.

This is similar to the PostgreSQL COPY variant mentioned in #44 and #47.
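
For illustration, such a load might look like this (a hedged sketch; file name, table, and options are illustrative):

LOAD DATA LOCAL INFILE '/input/data.csv'
INTO TABLE staging
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES;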

"Smart mode" for schema creation

What would you like to be added:

When csv2db creates a schema, every column is created as VARCHAR(1000).
This default type can be changed by using

  -c COLUMN_TYPE, --column-type COLUMN_TYPE
                        The column type to use for the table generation.

A smart mode would be quite useful.
When this mode is enabled, csv2db would scan the content of the CSV file first and try to determine the right type per column.

It doesn't need to be super smart.
Maybe beginning with three different types:

  • strings by VARCHAR(1000)
  • integers via a column that supports integer 64 bit
  • floats via a column that supports decimals

A CSV like this

id,name,unit_name,nutrient_nbr,rank
1001,Solids,G,201,200
1002,Nitrogen,G,202,500
1003,Protein,G,203,600
1004,Total lipid (fat),G,204,800
1005,"Carbohydrate, by difference",G,205,1110

would lead to something like

CREATE TABLE `nutrient`
(
 `ID` int,
 `NAME` VARCHAR(1000),
 `UNIT_NAME` VARCHAR(1000),
 `NUTRIENT_NBR` int,
 `RANK` int
);

Above is pseudocode; database types for the particular database systems need to be determined.

Why is this needed:

  • The --column-type can only change the type for all columns, not for specific ones
  • Queries on the imported data might be more efficient (especially for numbers)
  • We would leverage the functionality of the database "a bit more"

Anything else we need to know?:

  • We assume CSVs that follow a valid structure
  • Database types for the particular database systems need to be determined
  • A current workaround would be to ALTER the schema after the import

Feedback:

Let me know what you think about this.
If this would be a welcome feature, I might be able to spend some time on it.

Provide failing data record as debug output

When a record is faulty and produces an error on loading, there is no easy way to narrow down which record caused the error. The only way today is to set the batch size to 1 and wait for it. However, it could be beneficial if the program automatically determined the faulty record and printed it out. This probably makes sense only when running in verbose or debug mode.

Record list not cleared on error while loading multiple files

If multiple files are loaded at once and there is a file with a faulty record, the record list is not cleared before moving on to the next file. This is because the exception is handled too far up, and hence the clearing of the record list no longer occurred. Subsequently, on the next file, csv2db fails again as it tries to reload the same faulty batch.

Upsert support

I am planning on using this tool to insert some data from CSV into a PSQL12 database. I was wondering if there is any UPSERT support in the current version? Can we set any actions for CONFLICT situations?

Thanks
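
(For reference, the PostgreSQL upsert syntax the question refers to looks like this; table and columns are illustrative:)

INSERT INTO trips (id, name) VALUES (1, 'test')
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;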

Wrong exit code passed through on error

Exit code 0 is always passed through, even on error. Errors should raise exit codes other than 0, as 0 is generally regarded as successful execution.

Convert into a Python package

Hi, just thinking that there would be a number of advantages to converting this to a Python package:

  • You get automatic executable installation, i.e. the user can pip install csv2db and then execute csv2db load in any directory without needing to know where the executable is (./csv2db)
  • You get a Python API. This allows people to run csv2db more smoothly from Python without needing to use subprocess etc.

Change default data type to VARCHAR(1000)

VARCHAR(1000) is supported by all supported databases, making it the obvious default for the CREATE TABLE statement.

Oracle interprets VARCHAR correctly as VARCHAR2.
MySQL cannot go over 65k bytes altogether in a row, which means VARCHAR(4000) only supports up to 16 columns, often by far not enough. VARCHAR(1000), on the other hand, supports up to 65 columns, which can be seen as a reasonable default.

How to execute the unit tests?

Is there a good way to execute all unit tests?

The only way I found was

cd test/python
export PYTHONPATH=../../src/python
python3 functional_tests.py

but this feels wrong and executes only the functional_tests test suite.

Verbose "row loaded" output printed unnecessarily

$ wc -l 201812-citibike-tripdata.csv
   10001 201812-citibike-tripdata.csv
$ csv2db load -f 201812-citibike-tripdata.csv -u data -p secret -d data_science -t trips -v
Finding file(s).
Establishing database connection.

Loading file 201812-citibike-tripdata.csv
10000 rows loaded
0 rows loaded
Done

Closing database connection.
$

The line 0 rows loaded should not be printed.

Postgres: execute_batch still significantly slower than copy command

As discovered as part of #44, the execute_batch facility in psycopg2 used for Postgres still leaves a lot of room for improvement. The native Postgres COPY command is about 55x faster at loading the same amount of data locally compared to batch inserts using execute_batch.

Using execute_batch – 1m51s:

$ wc -l 201701-citibike-tripdata.csv
  726677 201701-citibike-tripdata.csv

$ ls -lh 201701-citibike-tripdata.csv
   117M Jun 13 12:35 201701-citibike-tripdata.csv

$ time csv2db load -f 201701-citibike-tripdata.csv -o postgres -u test -d test -t test

Loading file 201701-citibike-tripdata.csv
File loaded.

real	1m51.615s
user	0m17.277s
sys	0m1.015s

Using COPY – 2.7s:

test=> \timing
Timing is on.
test=> \copy test from '/tests/201701-citibike-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY 726676
Time: 2755.240 ms (00:02.755)

Python source not found in bash wrapper when not in csv2db home dir

When csv2db is not invoked from the home directory but via $PATH, the script doesn't find the Python source files:

gvenzl-mac:scratchpad gvenzl$ ls -al
total 48
drwxr-xr-x   4 gvenzl  staff    128 Feb  9 16:39 .
drwxr-xr-x+ 89 gvenzl  staff   2848 Feb  9 16:44 ..
drwxr-xr-x@  9 gvenzl  staff    288 Feb  9 16:44 csv2db-1.1.0
-rw-r--r--@  1 gvenzl  staff  20927 Feb  9 16:39 csv2db-1.1.0.zip
gvenzl-mac:scratchpad gvenzl$ export PATH=$PATH:$PWD/csv2db-1.1.0
gvenzl-mac:scratchpad gvenzl$ csv2db -h
/Users/gvenzl/scratchpad/csv2db-1.1.0/csv2db: line 25: ./src/python/csv2db.py: No such file or directory

Do not implicitly uppercase identifiers

So far, csv2db has uppercased all identifiers (table name, column names, data type for columns).
However, most common databases treat SQL identifiers as case-insensitive anyway, unless identifier quotes (") are used.
Hence it makes little sense for csv2db to uppercase these identifiers.
More so because this can lead to issues with databases that do not treat SQL identifiers as case-insensitive by default.
Hence, by default, csv2db should preserve the case of the column names as provided in the CSV file(s) and preserve the case of the table name and column type as typed by the user.

For cases where case insensitivity is desired, csv2db shall provide a new parameter for the user to set to get such behavior.

Set autocommit explicitly to false

Autocommit is turned on by default in some drivers, for example for Db2.
Although it is unfortunate that autocommit even exists as a feature, it cannot be relied upon that it will always be off by default. Hence it is better to explicitly turn it off at connection establishment.
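
A minimal sketch of what this could look like with a driver whose connection object exposes an autocommit attribute, such as oracledb (connection parameters are illustrative; ibm-db uses a different API):

import oracledb

# Explicitly disable autocommit right after connecting, rather than
# relying on the driver's default.
conn = oracledb.connect(user="csv_data", password="secret", dsn="localhost/ORCLPDB1")
conn.autocommit = False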
