yehoshuadimarsky / bcpandas

High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!!

License: MIT License

Python 100.00%
pandas bcp sql-server python

bcpandas's Introduction

bcpandas


High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!! (pronounced BEE-CEE-Pandas)

Quickstart

In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: from bcpandas import SqlCreds, to_sql

In [2]: creds = SqlCreds(
   ...:     'my_server',
   ...:     'my_db',
   ...:     'my_username',
   ...:     'my_password'
   ...: )

In [3]: df = pd.DataFrame(
   ...:         data=np.ndarray(shape=(10, 6), dtype=int),
   ...:         columns=[f"col_{x}" for x in range(6)]
   ...:     )

In [4]: df
Out[4]:
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770

In [5]: to_sql(df, 'my_test_table', creds, index=False, if_exists='replace')

In [6]: df2 = pd.read_sql_table(table_name='my_test_table', con=creds.engine)

In [7]: df2
Out[7]:
     col_0    col_1    col_2    col_3    col_4    col_5
0  4128860  6029375  3801155  5570652  6619251  7536754
1  4849756  7536751  4456552  7143529  7471201  7012467
2  6029433  6881357  6881390  7274595  6553710  3342433
3  6619228  7733358  6029427  6488162  6357104  6553710
4  7536737  7077980  6422633  7536732  7602281  2949221
5  6357104  7012451  6750305  7536741  7340124  7274610
6  7340141  6226036  7274612  7077999  6881387  6029428
7  6619243  6226041  6881378  6553710  7209065  6029415
8  6881378  6553710  7209065  7536743  7274588  6619248
9  6226030  7209065  6619231  6881380  7274612  3014770

IMPORTANT - Read vs. Write

The big speedup benefit of bcpandas is in the to_sql function, as the benchmarks below show. However, the bcpandas read_sql function actually performed more slowly than the pandas equivalent, so it was deprecated in v5.0 and removed in v6.0+. To read data from SQL into pandas, use the native pandas methods pd.read_sql_table or pd.read_sql_query.

Benchmarks

See figures below. All code is in the /benchmarks directory. To run the benchmarks, from the root directory of this repository, run python benchmarks/benchmark.py main --help and fill in the command line options that are presented.

Running this will output

  1. PNG image of the graph
  2. JSON file of the benchmark data
  3. JSON file with the environment details of the machine that was used to generate it

to_sql

I didn't bother including the pandas non-multiinsert version here because it just takes way too long.

to_sql benchmark graph

Why not just use the new pandas method='multi'?

  1. Because it is still much slower
  2. Because you are forced to set the chunksize parameter to a very small number for it to work - generally a bit less than 2100/<number of columns>. This is because SQL Server can only accept up to 2100 parameters in a query. See here and here for more discussion on this, and the recommendation to use a bulk insert tool such as BCP. It seems that SQL Server's regular INSERT statement simply wasn't designed to support huge amounts of data.
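To make the chunksize constraint concrete, here is a small helper (hypothetical, not part of bcpandas or pandas) that computes the largest usable chunksize for method='multi':

```python
def max_multi_chunksize(n_columns: int, param_limit: int = 2100) -> int:
    """Largest chunksize usable with to_sql(method='multi') against SQL Server.

    Each row in a multi-values INSERT binds one parameter per column, and
    SQL Server caps a single statement at 2100 parameters, so the rows per
    chunk are bounded by param_limit // n_columns (one parameter of headroom
    is left, since some drivers reserve one).
    """
    return (param_limit - 1) // n_columns

# For the 6-column quickstart dataframe, chunks are capped at 349 rows.
print(max_multi_chunksize(6))
```

This is why bulk tools like BCP, which stream a flat file instead of binding parameters, scale so much better for wide or long frames.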

read_sql

As you can see, pandas native clearly wins here.

read_sql benchmark graph

Requirements

Database

Any version of Microsoft SQL Server. It can be installed on-prem, in the cloud, on a VM, or as one of the Azure offerings.

Python User

  • BCP Utility
  • Microsoft ODBC Driver 11, 13, 13.1, or 17 for SQL Server. (Microsoft Docs) See the pyodbc docs for more on different driver versions.
  • Python >= 3.8
  • pandas >= 1.5
  • sqlalchemy >= 1.4
  • pyodbc as the supported DBAPI

Installation

Source   Command
PyPI     pip install bcpandas
Conda    conda install -c conda-forge bcpandas

Usage

  1. Create creds (see next section)
  2. Replace any df.to_sql(...) in your code with bcpandas.to_sql(df, ...)

That's it!

Credential/Connection object

Bcpandas requires a bcpandas.SqlCreds object, together with a sqlalchemy.Engine, in order to work. The user has two options when constructing it.

  1. Create the bcpandas SqlCreds object with just the minimum attributes needed (server, database, username, password), and bcpandas will create a full Engine object from this. It will use pyodbc, sqlalchemy, and the Microsoft ODBC Driver for SQL Server, and will store it in the .engine attribute.

    In [1]: from bcpandas import SqlCreds
    
    In [2]: creds = SqlCreds('my_server', 'my_db', 'my_username', 'my_password')
    
    In [3]: creds.engine
    Out[3]: Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password)
  2. Pass a full Engine object to the bcpandas SqlCreds object, and bcpandas will attempt to parse out the server, database, username, and password to pass to the command line utilities. If a DSN is used, this will fail.

    (continuing example above)

    In [4]: creds2 = SqlCreds.from_engine(creds.engine)
    
    In [5]: creds2.engine
    Out[5]: Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password)
    
    In [6]: creds2
    Out[6]: SqlCreds(server='my_server', database='my_db', username='my_username', with_krb_auth=False, engine=Engine(mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 17 for SQL Server};Server=tcp:my_server,1433;Database=my_db;UID=my_username;PWD=my_password), password=[REDACTED])

Recommended Usage

Feature                                             Pandas native   BCPandas
Super speed                                                         ✅
Good for simple data types like numbers and dates   ✅              ✅
Handle messy string data                            ✅

Built with the help of https://www.tablesgenerator.com/markdown_tables# and https://gist.github.com/rxaviers/7360908

Known Issues

Here are some caveats and limitations of bcpandas.

  • Bcpandas has been tested with all ASCII characters 32-127. Unicode characters beyond that range have not been tested.
  • An empty string ("") in the dataframe becomes NULL in the SQL database instead of remaining an empty string.
  • Because bcpandas first outputs to CSV, it needs to use several specific characters to create the CSV, including a delimiter and a quote character. Bcpandas attempts to use characters that are not present in the dataframe, going through the possible delimiters and quote characters specified in constants.py. If all the candidate characters are present in the dataframe and bcpandas cannot find both a delimiter and a quote character to use, it will throw an error.
    • The BCP utility does not ignore delimiter characters when they are surrounded by quotes, unlike standard CSV parsers - see here in the Microsoft docs.
  • If there is a NaN/Null in the last column of the dataframe, it will throw an error. This is due to a BCP issue; see my issue with Microsoft about this here. Based on the tests, though, this does not seem to come up in practice.
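The character-selection logic described above can be sketched as follows. The candidate lists mirror the error messages shown in Troubleshooting, but the function name and structure here are illustrative, not bcpandas internals:

```python
# Candidate characters, mirroring the ones bcpandas lists in constants.py.
POSSIBLE_DELIMITERS = [",", "|", "\t"]
POSSIBLE_QUOTECHARS = ['"', "'", "`", "~"]

def pick_unused_char(candidates, cells):
    """Return the first candidate character absent from every cell.

    If the data exhausts all candidates, raise - bcpandas raises a
    BCPandasValueError in the analogous situation.
    """
    for ch in candidates:
        if not any(ch in str(cell) for cell in cells):
            return ch
    raise ValueError("data contains all of the possible characters")

cells = ["plain text", "a, b, c", 'he said "hi"']
print(pick_unused_char(POSSIBLE_DELIMITERS, cells))  # "|" is unused
print(pick_unused_char(POSSIBLE_QUOTECHARS, cells))  # "'" is unused
```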

Troubleshooting

All quote characters appear in the data

If you encounter the error:

bcpandas.constants.BCPandasValueError: Data contains all of the possible quote characters ('"', "'", '`', '~'),
cannot use BCP to import it. Replace one of the possible quote characters in
your data, or use another method besides bcpandas.

and still want to use bcpandas, you will need to pick one quote character and remove all instances of it from the dataframe. Note that you are modifying your data; take care that replacing, e.g., the ~ character will not have undesired consequences.

In this case we are looking to remove ~, replacing it with -. Hunt for its presence in a column:

my_df['some_text_column'].str.contains('~').sum()

If that returns a value greater than zero, you can perform replacement in that column like this:

my_df['some_text_column'] = my_df['some_text_column'].str.replace('~','-')

Then use the first command to confirm that the number of occurrences is now 0.
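Putting the check-replace-recheck cycle together on a toy dataframe (the column name is illustrative):

```python
import pandas as pd

my_df = pd.DataFrame({"some_text_column": ["clean row", "has a ~ tilde"]})

# 1. Count occurrences of the offending character.
before = my_df["some_text_column"].str.contains("~").sum()
print(before)  # 1

# 2. Replace it, then confirm nothing remains.
my_df["some_text_column"] = my_df["some_text_column"].str.replace("~", "-")
after = my_df["some_text_column"].str.contains("~").sum()
print(after)   # 0
```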

All delimiter characters appear in the data

Very similar to above, but with the error message:

bcpandas.constants.BCPandasValueError: Data contains all of the possible delimiter characters (',', '|', '\t'),
cannot use BCP to import it. Replace one of the possible delimiter characters in
your data, or use another method besides bcpandas.

Approach this as described above for quote characters. If you target the | character for replacement, note that it is a regex metacharacter and must be escaped with a backslash (or treated literally with regex=False). So the relevant commands would be (here the pipe | is replaced with a forward slash):

my_df['some_text_column'] = my_df['some_text_column'].str.replace(r'\|', '/', regex=True)
my_df['some_text_column'].str.contains(r'\|').sum()

Write to database fails due to spaces in columns

If you get this error message when writing to the database:

Error = [Microsoft][ODBC Driver 17 for SQL Server]Incorrect host-column number found in BCP format-file

Try replacing any space characters in your column names, with a command like my_df.columns = my_df.columns.str.replace(' ','_') (source).

Background

Writing data from pandas DataFrames to a SQL database is very slow using the built-in to_sql method, even with the newly introduced execute_many option. For Microsoft SQL Server, a far far faster method is to use the BCP utility provided by Microsoft. This utility is a command line tool that transfers data to/from the database and flat text files.

This package is a wrapper for seamlessly using the bcp utility from Python using a pandas DataFrame. Despite the IO hits, the fastest option by far is saving the data to a CSV file in the file system and using the bcp utility to transfer the CSV file to SQL Server. Best of all, you don't need to know anything about using BCP at all!

Existing Solutions

Much credit is due to bcpy for the original idea and for some of the code that was adapted and changed.

bcpy

bcpy has several flaws:

  • No support for reading from SQL, only writing to SQL
  • A convoluted, overly class-based internal design
  • Scope a bit too broad - it deals with pandas as well as flat files

This repository aims to fix and improve on bcpy and the above issues by making the design choices described earlier.

Design and Scope

The only scope of bcpandas is to read and write between a pandas DataFrame and a Microsoft SQL Server database. That's it. We do not concern ourselves with reading existing flat files to/from SQL - that introduces way too much complexity in trying to parse and decode the various parts of the file, like delimiters, quote characters, and line endings. Instead, to read/write an existing flat file, just import it via pandas into a DataFrame, and then use bcpandas.

The big benefit of this is that we get to precisely control all the finicky parts of the text file, both when we write it locally and when we hand it to the BCP utility. This lets us set library-wide defaults (maybe configurable in the future) and work with those.

For now, we are using the non-XML BCP format file type. In the future, XML format files may be added.
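For reference, a non-XML format file is a small fixed-layout text file: a BCP version number, a column count, and one row per field. Below is a hedged sketch for a hypothetical two-column character-mode table; the exact version number, terminators, and collation depend on your BCP release and settings:

```
14.0
2
1    SQLCHAR    0    0    ","       1    col_0    ""
2    SQLCHAR    0    0    "\r\n"    2    col_1    ""
```

The per-field columns are: host field order, host data type, prefix length, host data length, terminator, server column order, server column name, and collation.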

Testing

Testing Requirements

  • Docker Desktop installed, with either the Linux or Windows runtime (doesn't matter which)
  • pytest
  • hypothesis
  • pytest-cov (coverage.py)

What Is Tested?

We take testing very seriously here. In order to rely on a library like this in production, it MUST be ruthlessly tested, which thankfully it is. Here is a partial list of what has been tested so far. Pull Requests welcome!

  • Data types: All ASCII characters 32-127 (using the Hypothesis library, see below). Unicode characters beyond that range have not been tested.
  • numpy.NaN, None
  • numpy.inf (fails, as expected)
  • Empty dataframe (nothing happens, database not modified)
  • Duplicate column names (raises error)
  • Database columns that are missing from the dataframe, are out of order, or both (passes)
  • Extra dataframe columns that aren't in database, when if_exists="append" specified (fails)

Testing Implementation

  • Testing uses pytest.
  • To test for all possible data types, we use the hypothesis library, instead of trying to come up with every single case on our own.
  • pytest-cov (which uses coverage.py under the hood) is used to measure code coverage. This is then uploaded to codecov.io as part of the CI/CD process (see below).
  • In order to spin up a local SQL Server during testing, we use Docker. Specifically, we run one of the images that Microsoft provides with SQL Server already fully installed; all we have to do is run a container from the image. Here are the links to the Linux versions and the Windows versions - Express and Developer.
    • When running the tests, we can specify a specific Docker image to use, by invoking the custom command line option called --mssql-docker-image. For example:
      pytest bcpandas/tests --mssql-docker-image mcr.microsoft.com/mssql/server:2019-latest
  • Instead of using the subprocess library to control Docker manually, we use the elegant docker-py library which works very nicely. A DockerDB Python class is defined in bcpandas/tests/utils.py and it wraps up all the Docker commands and functionality needed to use SQL Server into one class. This class is used in conftest.py in the core bcpandas tests, and in the benchmarks/ directory for both the benchmarks code as well as the legacy tests for read_sql.

CI/CD

GitHub Actions is used for CI/CD, although it is still somewhat of a work in progress.

Contributing

Please, all contributions are very welcome!

I will attempt to use the pandas docstring style as detailed here.


bcpandas's People

Contributors

borchero, dependabot[bot], deschman, gordthompson, jacobshaw42, jasonni0, kevinushey, meitham, pre-commit-ci[bot], ratserx, salazander, sfirke, simonstride, tsummerill, windiana42, yehoshuadimarsky


bcpandas's Issues

Use engine as-is in SqlCreds.from_engine

I was having similar issues as #23 where the connection string I used to create the SQLAlchemy engine was not in the format from_engine() expected.

However, is there a reason why from_engine() doesn't use the provided engine instance, like so?

def from_engine(cls, engine: sa.engine.base.Engine) -> "SqlCreds":
    sql_creds = cls.__new__(cls)
    sql_creds.engine = engine
    return sql_creds

This prevents any assumptions from being broken, and other than for __repr__(), it doesn't appear the other class attributes are ever used.

BCP Command Failed - Error

Hi,

I'm testing this on CentOS with Python 3.6 and the latest pandas version available in my repo.
BCP Version => Version: 17.7.0001.1

Reading via creds.engine works, so the connection seems to be ok, but writing via to_sql just leads to the following Error:
raise BCPandasException(f"Bcp command failed with exit code {ret_code}")
bcpandas.constants.BCPandasException: Bcp command failed with exit code 1

Any ideas?

to_sql(df, 'TEST', creds, index=False, if_exists='replace')

Installing bcpandas for use in Shiny App using R Package Reticulate

Hello!

Apologies if this isn't exactly the correct forum for this, but "at my wits' end" would be an understatement.

My colleague has written a great function using bcpandas - the problem is installing it so that it's recognised and can be used by the R Reticulate package

After a fresh install of miniconda, the packages we need to run his code aren't available, so I install them...

conda_install(envname = "C:\\Users\\xxxxx\\AppData\\Local\\r-miniconda\\envs\\r-reticulate", packages="pandas")
conda_install(envname = "C:\\Users\\xxxxx\\AppData\\Local\\r-miniconda\\envs\\r-reticulate", packages="pyodbc")
conda_install(envname = "C:\\Users\\xxxxx\\AppData\\Local\\r-miniconda\\envs\\r-reticulate", packages="bcpandas")

After installation, however, bcpandas is not available even though I can see it has been installed into the site-packages folder.

> py_module_available("numpy")
[1] TRUE
> py_module_available("glob")
[1] TRUE
> py_module_available("os")
[1] TRUE
> py_module_available("numpy") 
[1] TRUE
> py_module_available("subprocess")
[1] TRUE
> py_module_available("pyodbc") 
[1] TRUE
> py_module_available("pandas")
[1] TRUE
> py_module_available("bcpandas")
[1] FALSE

Do you know why this might be the case?

InterfaceError

I get sqlalchemy.exc.InterfaceError / pyodbc.InterfaceError 28000.

The same data config works fine when I used a direct sqlalchemy connection with pandas to_sql.
I'm connecting to a specific environment (host/environment).
What could this be?
Thanks

ps:
output is:
"sqlalchemy.exc.InterfaceError: (pyodbc.InterfaceError) ('28000', "[28000] [Microsoft] [ODBC 17 for SQL Server][SQL Server] Login failed for user 'USERNAME' (18456) (SQLDriverConnect); [28000] [Microsoft] [ODBC 17 for SQL Server][SQL Server] Login failed for user 'USERNAME'. (18456)"

  • user is modified here, but I do connect OK when using SQL Management Studio, sqlalchemy, pandas, and pyodbc.

Parsing issues: "bcp: unknown option u"

Good afternoon,

I have tried implementing BCP in a project in order to speed up the upload speed of an SQL table when re-creating it from scratch.

However, there seems to be an issue in the execution of the BCP command generated by bcp.to_sql().

Please consider the log output below:

[2022-09-23T10:50:36.018Z] Executing BCP command now... 
[2022-09-23T10:50:36.018Z] BCP command is: ['bcp', 'dbo.db', 'in', '/var/folders/2v/75hdky1j1xv5mw9hmtz1l2140000gq/T/8EZugLrdXVvi4mCZk4SxP', '-S', 'dbs.database.windows.net', '-d', 'comments', '-q', '-U', 'robot', '-P', '[REDACTED]', '-f', '/var/folders/2v/75hdky1j1xv5mw9hmtz1l2140000gq/T/kP38gPDN1tQnyYB6iIxuO']
[2022-09-23T10:50:36.082Z] bcp:  unknown option u
*followed by a listing of allowed options for bcp*

What could be the reason for this? I suspect that there is something amiss with how the engine details are handed to the bcp shell invocation, but I could also be completely wrong.

Best regards,
Malte

suppress the "1000 rows sent to SQL Server."

Is there any way to have an option to suppress all those "1000 rows sent to SQL Server. Total Sent: .." messages? With a few million records the logging gets muddled, especially since I can't intercept them cleanly.

I adore this project so much.

bcppath error

Hi - not sure if I'm having trouble setting up SqlCreds correctly, or if this is a different problem. The to_sql call is failing on the line
bcp_path = bcp_path with the error: FileNotFoundError: [WinError 2] The system cannot find the file specified.

I'm using Jupyter notebook - the section highlighted right before it skips to the runcmd to show the error output is:

[librarypath]\bcpandas\main.py in to_sql(df, table_name, creds, sql_type, schema, index, if_exists, batch_size, debug, bcp_path, dtype, process_dest_table, print_output, delimiter, quotechar)
456 schema=schema,
457 batch_size=batch_size,
--> 458 bcp_path=bcp_path,
459 )

Any advice? Thank you.

Caveat: handling of empty fields in bcp/mssql

Hi,

if you have a dataframe with empty values and you want to keep those empty values in the database, consider using bcp's option "-k":
-k Specifies that empty columns should retain a null value during the operation, rather than have any default values for the columns inserted. For more information, see Keep Nulls or Use Default Values During Bulk Import (SQL Server).

Modified BCP command:

bcp_command = [
    "bcp" if bcp_path is None else quote_this(str(bcp_path)),
    sql_item_string,
    direc,
    flat_file,
    "-S",
    creds.server,
    "-d",
    creds.database,
    "-q",  # Executes the SET QUOTED_IDENTIFIERS ON statement, needed for Azure SQL DW
    "-k",   # keep null values
] + auth

HTH

latin1 issue, fixed

(image)

Hi! I was having some trouble writing into SQL Server with latin1; I made this fix in the tmp CSV creation code.

file: main.py,
line added: 409

I think this should be a param. If you like, I can work this one out and make a pull request.

Can't connect to SQL Server

Thank you very much for the work you put into this project @yehoshuadimarsky !

Sadly I seem to be unable to use it though.
I have a server instance with the following pattern: "SERVERNAME\INSTANCENAME"
With this I'm able to connect via SQLAlchemy with this connection-string pattern:
sa.create_engine(f'mssql+pyodbc://{argx.user}:{argx.password}@{argx.server}/{argx.database}?driver=ODBC Driver 17 for SQL Server')
However, supplying the same variables to SqlCreds as
SqlCreds(server=argx.server,database=argx.database,username=argx.user,password=argx.password) results in a "can't find server" error.
I also tried to use the "from_engine" method, but it failed due to a parsing issue.
I'd rather avoid making modifications to the package's main function as mentioned in #21

Any help would be very appreciated!

Improvement: Identifying Collation for format file

You can use the following SQL to identify collation information for a database, column, etc. This could be important to those of us coming from a Unix background and turn off the default case insensitive collations. :)

COLLATION_SQL = """
DECLARE @DefaultCollation varchar(256);
DECLARE @Database varchar(1000);
DECLARE @Table varchar(1000);
SET @Database = '{}';
SET @Table = '{}';
SELECT @DefaultCollation = CONVERT (varchar(256), DATABASEPROPERTYEX(@Database, 'collation'));

SELECT t.name as tableName,
c.name as columnName,
IIF(collation_name is NULL, @DefaultCollation, collation_name) as collationName
FROM [{}].sys.columns c
INNER JOIN [{}].sys.tables t
ON c.object_id = t.object_id
AND t.name = @Table;
"""

Reference: https://docs.microsoft.com/en-us/sql/relational-databases/collations/view-collation-information?view=sql-server-ver15

I tried to use this tool as is and ended up implementing my own version of it to have more control over how the data is handled within an Airflow operator. I'll try to suggest improvements and potentially submit PRs. My time is pretty limited for open source contributions lately.

odbc_kwargs need ';' as separator from password for sqlalchemy engine URL

When passing odbc_kwargs["encrypt"]="no" to bcpandas.SqlCreds, it is appended to the connect string after the password without a separator. A workaround that creates a working sqlalchemy engine but does not work with the bcp executable: SqlCreds(password=password+";", odbc_kwargs=dict(encrypt="no"))
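For illustration, ODBC connection strings are ';'-separated key=value pairs, so any extra keyword has to be joined with a ';'. A hypothetical sketch of correct string-building (not bcpandas's actual code):

```python
from urllib.parse import quote_plus

parts = {
    "Driver": "{ODBC Driver 17 for SQL Server}",
    "Server": "tcp:my_server,1433",
    "Database": "my_db",
    "UID": "my_username",
    "PWD": "my_password",
    "Encrypt": "no",   # an odbc_kwargs-style extra; needs a ';' before it
}
# Joining with ';' guarantees each pair, including the extras, is separated.
odbc_str = ";".join(f"{k}={v}" for k, v in parts.items())
engine_url = "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_str)
print(odbc_str)
```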

Re: Adding Another Maintainer

Hi all, I see that there's been an uptick in PRs and issues in the past few months, I'm grateful people are finding this library useful and are engaging and using it. But to be honest, I have kind of stopped maintaining it, because I don't have a use case for this anymore. When I created it, it was something I needed at the time, but after changing jobs and circumstances, I'm not really maintaining this anymore.

That said: I'm open to adding one or more people to this repo with admin privileges, to become a maintainer along with me. If you are interested, please reply in the comments! Thanks all.

Option to set logging to false

Right now when using to_sql it writes this extensive logging of each action. Is there an option to set logging to False?

Example:

Starting copy...
1000 rows sent to SQL Server. Total sent: 1000
1000 rows sent to SQL Server. Total sent: 2000
...
1000 rows sent to SQL Server. Total sent: 100000
1000 rows sent to SQL Server. Total sent: 101000

101727 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total     : 54239  Average : (1875.5 rows per sec.)
56.49252915382385

FutureWarning line_terminator

hey @yehoshuadimarsky -

This is a super awesome package and has totally saved me a ton of time and headaches, well done to everyone who contributed!

Just a heads up, there is a FutureWarning:

./miniconda3/envs/nvm/lib/python3.10/site-packages/bcpandas/main.py:399: FutureWarning: the 'line_terminator' keyword is deprecated, use 'lineterminator' instead.
  df.replace({True: 1, False: 0}).to_csv(

Someone may want to fix this.

Support more possible delimiter and quote characters

Hi Josh, I'm not sure if this is possible, but can BCP and Pandas to_csv support more delimiter and quote characters than just the ones you've specified so far? I'm possibly trying to push the limits of bcpandas too far by writing DataFrames with big json blob elements (I know you don't recommend using it with messy text data) but I didn't know if there was still potential to improve it in that regard.

Happy to try creating a PR if you think there's still potential improvements here.

Fail on Windows Server 2019 Ver 1809.

It works well on Windows 10 Enterprise (ok, no issue), but I cannot run it on Windows Server 2019.
The error message is below.
I really appreciate your support in solving it on Windows Server.


File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\site-packages\bcpandas\main.py", line 447, in to_sql
bcp(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\site-packages\bcpandas\utils.py", line 114, in bcp
ret_code = run_cmd(bcp_command, print_output=print_output)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\site-packages\bcpandas\utils.py", line 234, in run_cmd
proc = Popen(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 951, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1420, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

Issue using Windows Auth

Hi Josh,

Awesome library - thanks! Currently working great (and runs really fast!) with SQL Auth but cannot create SqlCreds with blank username & password. The intended behaviour is to default to a Windows Auth connection - but this does not quite work:

from bcpandas import SqlCreds

creds = SqlCreds(server=".", database="Sandbox")

...fails with error: AttributeError: 'SqlCreds' object has no attribute 'username'.

I've written a fix and associated Unit Tests - happy to contribute a PR if that's ok with you?

Many thanks again,

Simon

Additional parameters to bcp engine

It would be great if additional parameters could be passed to bcp during to_sql, such as -h "TABLOCK", as well as the ability to import in native format (-n).

These add a considerable performance boost.

latest version doesn't seem to be available via pip

Hey there, I'm new to Python so bear with me.

ISSUE:
PyPI only shows me up to version 0.2.2 when using pip to install bcpandas...

Is this because I'm on an unsupported version of Python? I'm using 3.9.6. What can I do to get the latest version from PyPI?

Support for SQLAlchemy 2.0

We are moving our project to SQLAlchemy 2.0, and this is one of the last libraries that doesn't support it.
Are there any issues with 2.0? And would you support a PR upgrading SQLAlchemy to 2.0?
If so I would be happy to contribute and fix any issues if they exist.

Enable running unit tests via docker-compose

Out of the box, the current database tests don't run for me (Linux), and I am not sure I want to configure my machine so that this mechanism works. Here is the error message:

E   requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.42/containers/create?name=bcpandas-mssql-container
        http_error_msg = ('409 Client Error: Conflict for url: '
 'http+docker://localhost/v1.42/containers/create?name=bcpandas-mssql-container')
        reason     = 'Conflict'
        self       = <Response [409]>

What I would prefer is that, to run the database tests, I can just issue a docker-compose up command. Launching containers via docker-compose is also nicely supported by GitHub Actions. But this might be just my personal opinion.
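For reference, a minimal sketch of what such a docker-compose file could look like. The image tag and SA password are assumptions, not the project's actual test setup; the container name matches the one in the error above:

```yaml
services:
  mssql:
    container_name: bcpandas-mssql-container
    image: mcr.microsoft.com/mssql/server:2019-latest
    environment:
      ACCEPT_EULA: "Y"
      SA_PASSWORD: "MyStrongPassw0rd!"
    ports:
      - "1433:1433"
```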

Can BCPandas support Service Principal logins?

The code states "Only supports SQL based logins, not Active Directory or Azure AD." The project I'm working on connects to an Azure SQL Server, and the connection is through a service principal. I was hoping that once Windows/AD support is added, service principal creds would be included there.

How to connect MS SQL Server from a Jupyter Notebook

Attachment of my SQL Server connection >>>

What code should I write in my Jupyter Notebook to connect to my SQL Server there?

My database name is: SQL_Databse

Below are the steps I tried:

#1:

# installing the library for SQL (first time)

!pip install ipython-sql

#2:
import pyodbc

#3:

cnxn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};Server=VIVEK\SQLEXPRESS01;Database=SQL_Databse;User ID=?;Password=?')

Help me with the correct code to connect to my SQL Server.

Note: the ? marks denote the things I am confused about.

Help me find the solution.
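One hedged sketch of an answer, assuming the goal is Windows authentication (so no username/password is needed): Trusted_Connection=yes replaces the UID/PWD parts of the connection string. The server and database names below are taken from the question and may need adjusting:

```python
def build_conn_str(server, database):
    # Windows authentication: Trusted_Connection=yes replaces UID/PWD
    return (
        "DRIVER={SQL Server Native Client 11.0};"
        f"Server={server};Database={database};Trusted_Connection=yes;"
    )

conn_str = build_conn_str(r"VIVEK\SQLEXPRESS01", "SQL_Databse")
# pyodbc.connect(conn_str) would then open the connection
```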

full engine config

It would be great to have the port and the full engine configuration available.

How can I patch/edit this? Thanks.

Allow the user to specify a delimiter and quote character for perf

Firstly, thank you for an excellent package - this has saved me lots of time and is a wonderful implementation.

A suggestion - I am inserting multiple GBs of numerical data. The data is formatted in a known way, so I know which characters would be suitable as quote chars and delimiters. If I could specify the characters, I'd save lots of execution time.

For example, when benchmarking a ~150MB DataFrame with ~1.25M rows, total execution takes 106 seconds, with the primary time being spent like so (non-exhaustive, just the longest sections):

  • get_delimiter - 18.6s
  • get_quotechar - 17s
  • replace_bools_and_insert - 63s

This would be very similar to #22. A concern was mentioned about wanting to check whether the user's delimiter is in the data set; perhaps this could be dealt with by a "check_delim_is_present" option/flag/argument that, when present, avoids the safe path in pursuit of performance.

If you're happy with this change, I would be happy to raise an MR with an implementation / tests, just let me know.
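The safety concern could be kept cheap: instead of searching for a safe character, just verify that the user-supplied one is absent from the data. A rough sketch of such a check (the function name is hypothetical, not bcpandas API):

```python
import pandas as pd

def delim_in_data(df, delim):
    # True if the candidate delimiter occurs anywhere in the frame's string form
    as_str = df.astype(str)
    return bool(
        as_str.apply(lambda col: col.str.contains(delim, regex=False).any()).any()
    )

df = pd.DataFrame({"a": ["x,y", "z"], "b": [1, 2]})
safe_to_use_pipe = not delim_in_data(df, "|")
```

Scanning like this is a single vectorized pass, rather than one pass per candidate character.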

Can't import into database

Hi,

I am trying to use bcpandas to load a CSV into a database (MSSQL). I use a trusted connection when I connect to the database. Unfortunately, I get this error when I try to bulk-load my data into my database:

Error = [Microsoft][ODBC Driver 17 for SQL Server]Incorrect host-column number found in BCP format-file

Do you know what kind of problem can trigger this?

Thank you,
David

special characters in bcp auth

If I have special characters in my username/password, the pyodbc connection will work appropriately, but the bcp command will get an invalid username/password.

I believe the special characters are being interpreted by the shell when running bcp. For example, if my username is user$name and my password is pas$word, the resulting credentials passed to SQL Server would be username: user, password: pas

Columns in MSSQL are case insensitive

Append failed with the error message "Column(s) detected in the dataframe that are not in the database", but the columns are in the database. Example: dataframe col1 <-> MSSQL COL1. The default collation in MSSQL is case-insensitive, so col1 is equal to COL1 and the check shouldn't fail.
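A sketch of what a case-insensitive version of that check could look like (the function name is illustrative; bcpandas' actual check lives elsewhere):

```python
def missing_columns(df_cols, db_cols):
    # MSSQL's default collation is case-insensitive, so compare lowercased names
    db_lower = {c.lower() for c in db_cols}
    return [c for c in df_cols if c.lower() not in db_lower]

problems = missing_columns(["col1", "extra"], ["COL1", "COL2"])
```

Here only "extra" would be reported as missing, while "col1" matches "COL1".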

Implement equivalent of Pandas to_sql dtype parameter

Thanks very much for this package, it definitely fills a need I have.

The only thing I'm missing is the equivalent of the pandas dtype parameter for passing a dictionary to set the column types.

Maybe an easy implementation would be to call the pandas to_sql method with an empty frame and the supplied dtype if the table doesn't already exist - if you think it's worth adding.
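The suggested workaround can be sketched with plain pandas: write a zero-row slice of the frame first (which carries the schema and the dtype mapping), then let bcpandas append the data. The to_sql calls are commented out since they need a live server, and my_dtype_dict is a placeholder:

```python
import pandas as pd

df = pd.DataFrame({"col_0": [1, 2], "col_1": ["a", "b"]})

# Zero rows, same columns and dtypes - enough for pandas to create the table
empty = df.head(0)

# empty.to_sql("my_table", creds.engine, dtype=my_dtype_dict, if_exists="replace")
# to_sql(df, "my_table", creds, index=False, if_exists="append")
```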

BCP Copy failed with Invalid character value for cast specification

Hi,

while using to_sql(), e.g.
bcpandas.to_sql(loco_tracks_df, tempTableName, creds, index=False, if_exists="replace")
unfortunately BCP fails with the following error message:

Starting copy...
SQLState = 22005, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]Invalid character value for cast specification
0 rows copied.

However, if I change
bcpandas.to_sql(loco_tracks_df, tempTableName, creds, index=False, if_exists="replace") to
bcpandas.to_sql(loco_tracks_df, tempTableName, creds, index=False, if_exists="append")

The error message changes to
Starting copy...
SQLState = 22001, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation.

Is there a way to get more information from the BCP tool, so I can determine which cast fails?

BR Johannes

Performance improvements to bool replacement

Current behaviour is to run df.replace({True: 1, False: 0}) on all DataFrames passed; this can have quite a noticeable performance impact on large DataFrames.

For example, on a 1.25M row, 8 column, 150MB DataFrame, running the replace takes roughly 15 seconds.

Some options to avoid this that seem sensible:

  • Only run replace on bool columns (or perhaps object columns too?)
  • Allow the user to specify, via an argument or flag, whether the replacement should go ahead
  • Allow the user to specify which columns should be replaced, defaulting to the current behaviour

I think the first point is probably the most pragmatic, but happy to raise an MR with whatever change you see fit - this change would save quite a lot of time when loading large data sets into the database.
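The first option could look roughly like this (a sketch, not the library's actual code path):

```python
import pandas as pd

def replace_bools(df):
    # Cast only the bool columns to int; other columns are left untouched
    bool_cols = df.select_dtypes(include="bool").columns
    if len(bool_cols) > 0:
        df = df.copy()
        df[bool_cols] = df[bool_cols].astype(int)
    return df

out = replace_bools(pd.DataFrame({"flag": [True, False], "x": [1.5, 2.5]}))
```

Since select_dtypes inspects dtypes rather than values, a frame with no bool columns passes through with no copy at all.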

Document API

Document the API for this module, preferably in the readme.

If you agree that the documentation should be added to the readme, I am willing to write it up and make a pull request.

Error: incorrect host-column number found in BCP format-file

Just tried to install / use bcpandas on a colleague's laptop (Windows 10) and got this strange error; could not make anything of it:

SQLState = S1000, NativeError = 0
Error = [Microsoft][ODBC Driver 17 for SQL Server]Incorrect host-column number found in BCP format-file

Pass dtype in to_sql

Hi, thanks a lot for this package, I just found it and figured I'd give it a try, and it's very nice!

I do have a question: is it possible somehow to pass a dtype along to to_sql()? I have my own dtype dict which I can pass with pandas' to_sql(), but I can't find that option here. I looked into the bcpandas to_sql() function; it seems easy to pass it through, first to to_sql(), then to _create_table(), and then to pandas' own SQLTable() constructor. Am I missing something?

SQLAlchemy >= 1.3.18 breaks tests of `sql_creds`

Using sqlalchemy >=1.3.18 leads to test failures. Need to figure out what's causing this and fix it. Also, this is preventing us from supporting Python 3.9+ because this version of sqlalchemy doesn't support it.

Answer is somewhere in here https://docs.sqlalchemy.org/en/14/changelog/changelog_13.html#change-1.3.18.

Example

pytest -x -k test_sql_creds bcpandas/tests

Output:

============================================================ test session starts =============================================================
platform darwin -- Python 3.8.8, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/REDACTED/miniconda3/envs/bcpandas-env/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/REDACTED/Documents/GitHub/bcpandas/.hypothesis/examples')
rootdir: /Users/REDACTED/Documents/GitHub/bcpandas, configfile: setup.cfg
plugins: cov-2.11.1, hypothesis-4.54.2
collected 52 items / 40 deselected / 12 selected

bcpandas/tests/test_sqlcreds.py::test_sql_creds_for_username_password FAILED                                                           [  8%]

================================================================== FAILURES ==================================================================
____________________________________________________ test_sql_creds_for_username_password ____________________________________________________
bcpandas/tests/test_sqlcreds.py:50: in test_sql_creds_for_username_password
    assert str(creds.engine.url) == (
E   AssertionError: assert 'mssql+pyodbc...test_password' == 'mssql+pyodbc...test_password'
E     - mssql+pyodbc:///?odbc_connect=Driver={ODBC Driver 99 for SQL Server};Server=tcp:test_server,1433;Database=test_database;UID=test_user;PWD=test_password
E     ?                                     ^^    ^      ^  ^   ^   ^      ^^      ^   ^           ^    ^        ^             ^   ^         ^   ^
E     + mssql+pyodbc:///?odbc_connect=Driver%3D%7BODBC+Driver+99+for+SQL+Server%7D%3BServer%3Dtcp%3Atest_server%2C1433%3BDatabase%3Dtest_database%3BUID%3Dtest_user%3BPWD%3Dtest_password
E     ?                                     ^^^^^^    ^      ^  ^   ^   ^      ^^^^^^      ^^^   ^^^          ...
E
E     ...Full output truncated (1 line hidden), use '-vv' to show
        creds      = SqlCreds(server='test_server', database='test_database', port=1433, driver='{ODBC Driver 99 for SQL Server}', username...er%3Dtcp%3Atest_server%2C1433%3BDatabase%3Dtest_database%3BUID%3Dtest_user%3BPWD%3Dtest_password), password=[REDACTED])

---------- coverage: platform darwin, python 3.8.8-final-0 -----------
Coverage HTML written to dir htmlcov

=========================================================== Hypothesis Statistics ============================================================
========================================================== short test summary info ===========================================================
FAILED bcpandas/tests/test_sqlcreds.py::test_sql_creds_for_username_password - AssertionError: assert 'mssql+pyodbc...test_password' == 'ms...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
====================================================== 1 failed, 40 deselected in 0.35s ======================================================
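The diff in the assertion is consistent with the odbc_connect query string now being percent-encoded. A minimal illustration of the encoding involved, using only the standard library (whether SQLAlchemy started applying it internally at 1.3.18 is the thing to confirm):

```python
from urllib.parse import quote_plus

raw = "Driver={ODBC Driver 99 for SQL Server};Server=tcp:test_server,1433"
encoded = quote_plus(raw)
# '=' becomes %3D, '{' becomes %7B, ';' becomes %3B, spaces become '+'
```

The encoded form matches the right-hand side of the failing assertion, so the fix is likely to compare against the quoted expectation (or to unquote before comparing).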

Add documentation or functionality to address breaking default behavior of Microsoft ODBC Driver 18 for SQL Server

The standard configuration with the minimal arguments works with Microsoft ODBC Driver for SQL Server versions <=17 but not with version 18, which has breaking security changes. In my case the connection fails due to a self-signed certificate. I want to tell bcpandas to trust the server's certificate.

It would be great to have this addressed, either by modifying SqlCreds or simply updating the docs to show an example of how to pass values to SqlCreds such that it trusts the server certificate.
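Assuming the odbc_kwargs pass-through mentioned in the next issue accepts arbitrary keywords, the Driver 18 settings are just key/value pairs destined for the connection string. A sketch of the relevant keywords (whether SqlCreds forwards them this way is exactly the open question here):

```python
# Keywords that relax ODBC Driver 18's new security defaults; the assumption
# is that odbc_kwargs would forward them verbatim into the connection string
odbc_kwargs = {"Encrypt": "yes", "TrustServerCertificate": "yes"}

extras = ";".join(f"{k}={v}" for k, v in odbc_kwargs.items())
```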

Hand over odbc_kwargs["Encrypt"]="yes" as -Ym, and "no" as -Yo, to the bcp command-line tool

Spinning up a SQL Server database in a Linux container is a nice way to test code meant for SQL Server databases. However, to use bcpandas with such a database, we need to pass ?Encrypt=no in the sqlalchemy connect string. This can be handed over to bcpandas.SqlCreds as odbc_kwargs["Encrypt"]="no". When running the bcp executable, this needs to be translated into adding the argument -Yo.

`quote_this` prevents password from being redacted on non-Windows OS

On a non-Windows OS, the quote_this function calls shlex.quote(this) on the provided password. Later, the bcp_command_log is modified to redact the password, but because the password has been wrapped in single quotes, this does not work.

bcp_command_log = [c if c != creds.password else "[REDACTED]" for c in bcp_command]

A likely fix is to reassign the quoted value back to the creds object after quoting the values. This section is included below. Another option would be to handle the quote_this logic directly in the creds object.

    if creds.with_krb_auth:
        auth = ["-T"]
    else:
        auth = ["-U", quote_this(creds.username), "-P", quote_this(creds.password)]
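One way to sketch the fix without touching the creds object: redact both the raw and the shell-quoted form of the password when building the log line (the function name is illustrative):

```python
import shlex

def redact_command(bcp_command, password):
    # Match both the raw password and its shlex-quoted form
    secrets = {password, shlex.quote(password)}
    return ["[REDACTED]" if c in secrets else c for c in bcp_command]

cmd = ["bcp", "-U", "user", "-P", shlex.quote("pas$word")]
safe = redact_command(cmd, "pas$word")
```

Reassigning the quoted value back to the creds object, as suggested above, would make the existing one-line comparison work too; the sketch here just keeps quoting and logging independent.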

Error: OSError: [WinError 6] The handle is invalid

Thanks for your fix the other day! (#15) bcpandas (v1.0.1) now imports fine via the reticulate package into our R app, and our function starts to run.

However, we get an error when using the to_sql function, once we reach line 158...

to_sql(dfgroups[dfg], tablename, sql_config, index=True, if_exists='append')

The error message (below) shows the same 'handle is invalid' message as we received the other day, so I'm wondering if these issues could be linked?

Error: OSError: [WinError 6] The handle is invalid
Detailed traceback:
  File "", line 158, in F_Upload_P
  File "C:\Users\woodfieldc\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\bcpandas\main.py", line 348, in to_sql
    bcp_path=bcp_path,
  File "C:\Users\woodfieldc\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\bcpandas\utils.py", line 114, in bcp
    ret_code = run_cmd(bcp_command)
  File "C:\Users\woodfieldc\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\bcpandas\utils.py", line 231, in run_cmd
    proc = Popen(cmd, stdout=PIPE, stderr=PIPE, encoding="utf-8", errors="utf-8", shell=with_shell,)
  File "C:\Users\woodfieldc\AppData\Local\r-miniconda\envs\r-reticulate\lib\subprocess.py", line 687, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "C:\Users\woodfieldc\AppData\Local\r-miniconda\envs\r-reticulate\lib\subprocess.py", line 910, in _get_handles
    p2cread = _winapi.GetStdHandle(_winapi.STD_INPUT_HANDLE)

sqlserver connection failed

creds = SqlCreds(server='PC-20200521CWPJ\MSSQLSERVER2',database='qq',username='sa',password ='111111')

sqlalchemy.exc.InterfaceError: (pyodbc.InterfaceError) ('28000', '[28000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Login failed for user 'sa'. (18456) (SQLDriverConnect); [28000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Cannot open database "qq" requested by the login. The login failed. (4060); [28000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Login failed for user 'sa'. (18456); [28000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Cannot open database "qq" requested by the login. The login failed. (4060)')
(Background on this error at: http://sqlalche.me/e/14/rvf5)

Roadmap

I have been following this project for a while and wanted to thank you for implementing this idea. Would you be able to list the roadmap for this project?
