
smartbulkcopy's Introduction

page_type: sample
languages:
  - tsql
  - sql
  - csharp
products:
  - azure-sql-database
  - sql-server
  - azure-sql-managed-instance
  - azure-sqlserver-vm
  - azure
  - dotnet
description: Smart, High-Speed, Bulk Copy tool to move data from one Azure SQL or SQL Server database to another
urlFragment: smart-bulk-copy

Smart Bulk Copy


Latest Stable Version: 1.9.9

Smart, High-Speed, Bulk Copy tool to move data from one Azure SQL / SQL Server database to another. Smartly uses logical or physical partitions to maximize transfer speed using parallel copy tasks.

It can also be used to efficiently and quickly move data between two instances of SQL Server running in different cloud providers, or to move data from on-premises to the cloud.

Smart Bulk Copy is also available as a Docker Image. To run Smart Bulk Copy via docker, you have to map a volume where the desired .config file can be found. For example (on Windows):

docker run -it -v c:\work\_git\smart-bulk-copy\client\configs:/app/client/configs yorek/smartbulkcopy:latest /app/client/configs/smartbulkcopy.config.json

You can also run Smart Bulk Copy using Azure Container Instances. Use the provided azure-deploy.sh script to create an ACI and execute Smart Bulk Copy. Make sure you have created the smartbulkcopy.config.json file in the client/configs folder before running the .sh script.

How it works

Smart Bulk Copy uses the Bulk Copy API with parallel tasks. A source table is split into partitions, and each partition is copied in parallel with the others, up to a defined maximum, in order to use all the available network bandwidth and all the available cloud or server resources and thus minimize load times.

Partitioned Source Tables

When a source table is partitioned, Smart Bulk Copy uses the physical partitions to execute several queries like the following:

SELECT * FROM <sourceTable> WHERE $partition.<partitionFunction>(<partitionColumn>) = <n>

Queries are executed in parallel to load data, also in parallel, into the destination table. The TABLOCK option is used - when possible and needed - on the table to allow fully parallelizable bulk inserts. The ORDER option is also used when possible to minimize sort operations on the destination table, for example when inserting into a table with an existing clustered rowstore index.
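
The partition numbers plugged into <n> come from the source table's metadata. As a rough sketch (the table name dbo.MyPartitionedTable is a placeholder, not something from this repository), they can be listed with a query like:

-- List the physical partitions of the source table by looking at its heap
-- (index_id = 0) or clustered index (index_id = 1)
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.MyPartitionedTable')
  AND p.index_id IN (0, 1)
ORDER BY p.partition_number;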

Non-Partitioned Source Tables

If a source table is not partitioned, then Smart Bulk Copy will use the %%PhysLoc%% virtual column to logically partition the table into non-overlapping partitions that can be safely read in parallel. %%PhysLoc%% is not documented, but more info is available here:

Where is a record really located?

If the configuration file specifies a value greater than 1 for logical-partitions, the following query will be used to read the logical partitions in parallel:

SELECT * FROM <sourceTable> WHERE ABS(CAST(%%PhysLoc%% AS BIGINT)) % <logical-partitions-count> = <n>
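
For example, with logical-partitions set to 8, each of the 8 parallel readers runs the same query with a different remainder value (a sketch; the table name is a placeholder):

SELECT * FROM [dbo].[MyLargeTable] WHERE ABS(CAST(%%PhysLoc%% AS BIGINT)) % 8 = 0
-- ...and likewise with = 1 up to = 7 for the other seven readers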

PLEASE NOTE that the physical position of a row may change at any time if there is any activity on the database (updates, index reorganizations, etc.), so it is recommended to use this approach only in the following cases (a quick check is sketched after this list):

  1. You're absolutely sure there is no activity of any kind on the source database, or
  2. You're using a database snapshot as the source database, or
  3. You're using a database set to READ_ONLY mode
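
As a quick sanity check (a sketch; the database name is a placeholder), you can verify that the source database is a snapshot or is read-only by querying sys.databases:

-- source_database_id is not NULL for a database snapshot;
-- is_read_only = 1 when the database is in READ_ONLY mode
SELECT name, is_read_only, source_database_id
FROM sys.databases
WHERE name = N'MySourceDatabase';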

Heaps, Clustered Rowstores, Clustered Columnstores

Starting from version 1.7, Smart Bulk Copy will smartly copy tables with no clustered index (heaps) and tables with a clustered index (rowstore or columnstore, it doesn't matter).

A couple of notes for tables with a Clustered Columnstore index:

  • Smart Bulk Copy will always use a Batch Size of at least 102400 rows, no matter what is specified in the configuration, as per columnstore best practices. If you have a columnstore index it is generally recommended to increase the value to 1048576 in order to maximize compression and reduce the number of rowgroups (a query to inspect the resulting rowgroups is sketched after this list).
  • When copying a Columnstore table, you may see very low values (<20 MB/sec) for the "Log Flush Speed". This is correct and expected, as Columnstore data is heavily compressed and thus the log generation rate (which is what the Log Flush Speed measures) is much lower than with Rowstore tables.
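
To check how well the rowgroups came out after the copy, a query along these lines can be run on the destination database (a sketch; the table name is a placeholder, and the DMV requires SQL Server 2016+ or Azure SQL):

-- Inspect rowgroup sizes and trim reasons after the bulk copy
SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id,
       state_desc,
       total_rows,
       trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.MyColumnstoreTable')
ORDER BY row_group_id;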

How to use it

Download or clone the repository, make sure you have .NET Core 3.1 installed and then build Smart Bulk Copy:

dotnet build

Then, in the client folder, create a smartbulkcopy.config file from the provided client/config/smartbulkcopy.config.template. If you want to start right away, just provide the source and destination connection strings and leave all the other options as they are. Make sure the source database is a database snapshot:

Create a Database Snapshot
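
On SQL Server, a snapshot can be created with T-SQL along these lines (a sketch; the database, logical file, and file path names are placeholders, and one ON clause is needed per data file):

-- Create a read-only, point-in-time snapshot of the source database
CREATE DATABASE SourceDb_Snapshot
ON ( NAME = SourceDb_Data, FILENAME = 'C:\Data\SourceDb_Snapshot.ss' )
AS SNAPSHOT OF SourceDb;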

Or that the database is set to be in Read-Only mode:

Setting the database to READ_ONLY
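
For example (a sketch; the database name is a placeholder):

-- Put the source database in read-only mode, rolling back any active transactions
ALTER DATABASE SourceDb SET READ_ONLY WITH ROLLBACK IMMEDIATE;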

Then just run:

cd client
dotnet run

and Smart Bulk Copy will start copying data from the source database to the destination database. Please keep in mind that all destination tables will be truncated by default. This means that foreign key constraints must be dropped in the destination database before copying. Read more about TRUNCATE TABLE restrictions here: TRUNCATE TABLE
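
Since TRUNCATE TABLE fails on tables referenced by a foreign key, one way to prepare the destination database (a sketch, not something the tool does for you) is to generate the DROP CONSTRAINT statements and run them manually before the copy:

-- Generate a DROP statement for every foreign key in the destination database;
-- review the output, save it so you can recreate the keys later, then run it
SELECT 'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(fk.parent_object_id))
     + '.' + QUOTENAME(OBJECT_NAME(fk.parent_object_id))
     + ' DROP CONSTRAINT ' + QUOTENAME(fk.name) + ';'
FROM sys.foreign_keys AS fk;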

Configuration Notes

Here's how you can change the Smart Bulk Copy configuration to better suit your needs. Everything is conveniently found in the smartbulkcopy.config file. Aside from the obvious source and destination connection strings, here are the configuration options you can use:

Tables to copy

tables: an array of string values that contains the two-part names of the tables you want to copy. For example:

"tables": ["dbo.Table1", "Sales.Customers"]

An asterisk * will be expanded to all tables available in the source database:

"tables": ["*"]

You can use a schema to limit the wildcard scope:

"tables": ["dbo.*"]

From version 1.7.1 you can also specify tables to be included and excluded:

"tables": {
    "include": ["dbo.*"],
    "exclude": ["dbo.ORDERS"]
}

Configuration Options

Smart Bulk Copy is highly configurable. Read more in the dedicated document: Smart Bulk Copy Configuration Options

Notes on Azure SQL

Azure SQL transaction log throughput is governed, as described in Transaction Log Rate Governance, and it tops out at 96 MB/sec of log flushing. Smart Bulk Copy will report the detected log flush speed every 5 seconds so that you can check whether you can increase the number of parallel tasks to go faster or you're already at the limit. Please remember that 96 MB/sec is only reached on higher SKUs, so if you're already using 7 parallel tasks and you're not seeing something close to 96 MB/sec, check that (a monitoring query is sketched at the end of this section):

  1. You have enough network bandwidth (this should not be a problem if you're copying data from cloud to cloud)
  2. You're not using a very low SKU (like P1 or lower, or just 2 vCores). In this case, move to a higher SKU for the duration of the bulk load.

There are a couple of exceptions to what was just described:

  • Azure SQL Hyperscale always provides 100 MB/sec of maximum log throughput, no matter the number of vCores. Of course, if you're using a small number of cores on Hyperscale, other factors (for example, sorting when inserting into a table with indexes) could come into play and prevent you from reaching the mentioned 100 MB/sec.
  • M-series can do up to 256 MB/sec of log throughput.
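
To see how close the destination database is to its log rate cap while the copy is running, you can query the Azure SQL resource stats DMV (a sketch; it returns roughly one row every 15 seconds for the last hour):

-- Recent resource usage of the destination Azure SQL database
SELECT end_time, avg_cpu_percent, avg_log_write_percent, avg_data_io_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;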

Observed Performances

Tests have been run using the LINEITEM table of the TPC-H 10GB test database. The uncompressed table size is around 8.8 GB, with 59,986,052 rows. The source database was on a SQL Server 2017 VM running on Azure and the target was Azure SQL Hyperscale Gen8 8 vCores. Smart Bulk Copy was running on the same virtual machine that hosted the source database. Both the VM and the Azure SQL database were in the same region. Configuration settings used:

"tasks": 7,
"logical-partitions": "auto",
"batch-size": 100000

Here are the results of the tests:

Table type                          Copy time (sec)
HEAP                                135
HEAP, PARTITIONED                   111
CLUSTERED ROWSTORE                  505
CLUSTERED ROWSTORE, PARTITIONED     207
CLUSTERED COLUMNSTORE               315
CLUSTERED COLUMNSTORE, PARTITIONED  196

Questions and Answers

As the document was getting longer and longer, it has been moved here: Smart Bulk Copy FAQ

Tests

This tool has been successfully tested against the following sample databases:

  • TPC-H
  • TPC-E
  • AdventureWorks2012
  • AdventureWorks2014
  • AdventureWorksDW2012
  • AdventureWorksDW2014

Note that foreign keys and views were dropped from the target database before starting the bulk copy.

Smart Bulk Copy has been tested with the following SQL Server engine versions:

  • Azure SQL Database
  • Azure SQL Managed Instance
  • SQL Server 2019
  • SQL Server 2017
  • SQL Server 2016
  • SQL Server 2014

smartbulkcopy's People

Contributors: benderdominik, blakhani-msft, daxianji007, gewarren, osamuhirayama, yorek


smartbulkcopy's Issues

Cannot bulk load. The bulk data stream was incorrectly specified as sorted or the data violates a uniqueness constraint imposed by the target table.

I consistently cannot get this to work on one of my large partitioned tables. I created the target schema (partition function and scheme) by scripting the source, so it's identical.

However, when running the copy, I get the following error:
Cannot bulk load. The bulk data stream was incorrectly specified as sorted or the data violates a uniqueness constraint imposed by the target table. Sort order incorrect for the following two rows: primary key of first row: (2016-11-06, 8406347, 20), primary key of second row: (2016-11-02, 8406348, 1)

Everything looks good from the INFO logs:

Source and destination tables have compatible partitioning logic. Parallel load available
Partition By: Date
Order By: TransactionId,CategoryId
Source and destination clustered rowstore index have same ordering. Enabling ORDER hint.
Parallel load will use 18 physical partition(s)
Analysis result: usePartioning=True, partitionType=Physical, orderHintType=ClusteredIndex

Any ideas ?

.NET Core 3.1 is out of support.

I am getting the below warning while running the dotnet build command.

warning NETSDK1138: The target framework 'netcoreapp3.1' is out of support and will not receive security updates in the future. Please refer to https://aka.ms/dotnet-core-support for more information about the support policy. [/root/smartbulkcopy/client/SmartBulkCopy.csproj]

The .NET version needs to be updated to .NET 8.0.

SmartBulkCopy was fast initially but slowed down for the last heavy tables

I am trying to move ~1 TB of data from one SQL Server to another, and it has many tables, including 10 heavy tables (more than 700 GB in size). I set the config file with 2 threads initially, then later 8 and then 12 threads. I observed that SmartBulkCopy (SBC) ran quickly at first, with good network speed and all threads in use, but performance slowly degraded once most of the small and medium sized tables were finished and only ~5 heavy tables were left; network speed dropped too. In the end it took ~4 hours to migrate ~1 TB of data. Why is this happening and how can the process be sped up?
What are your recommendations for migrating heavy tables?
How can we ensure the parallel threads use the available network bandwidth?

FYI, we disabled foreign keys and dropped non-clustered indexes to speed things up.

Add support for a read-only database option

Besides reading from a snapshot, as another option it should be possible to check that the source database is in read-only mode. If it is not, there could be two outcomes:

  1. stop the application
  2. set the database in read-only mode

Add staging support

If the target table already exists, create a copy of the table, without any data, into a dedicated staging schema (configurable via the config file), with no indexes or constraints.

Copying Large Tables Generates Blocking

I am using Smart Bulk Copy to copy a 900 GB database from Azure SQL DB to Managed Instance. The MI has 40 vCores and the destination database has a 257 GB transaction log. I am running Smart Bulk Copy with 32 tasks and a batch size of 100000. When the process started I was getting ~32 MB/s log transfer. It's now copying 2 tables that are 25 GB and 276 GB used. The log transfer rate is hovering around 1 MB/s. When I check blocking, the 276 GB table has 22 bulk copy sessions running and is waiting on ASYNC_NETWORK_IO. The 25 GB table has 10 bulk copy sessions running and 1 is blocking the other 9 with LCK_M_X. Is there anything I can do to reduce blocking and speed up the transfer rate?

Supporting VM Identity or Azure Authentication for Always Encrypted Database scenario

I am currently migrating Azure SQL Databases to Azure SQL Managed Instances for various reasons; other techniques are too slow to migrate a database.

But how can I access Always Encrypted data when only connection strings are supported and I cannot use the identity of the VM on which I'm running?

Or maybe add the possibility of specifying an AppId / AppSecret in the config file to support that kind of scenario.

Add an option to copy identity seed

It should be possible to ask Smart Bulk Copy to update the target table so that IDENTITY will generate the same values as the source table once the copy is finished.

Step-by-step guide

Can you please provide a step-by-step guide to migrate Azure SQL Elastic Pool databases / Azure SQL VM to Azure SQL Managed Instance using Smart Bulk Copy? I find the given doc quite confusing and complex, as I am not from a development background.

Improve TABLOCK usage

Right now TABLOCK is always used, but if the destination table has indexes, this will cause massive lock contention.
The absence of indexes should be checked before copying to make sure TABLOCK can be used effectively.
If indexes are detected, the user should be warned that the best practice is to drop them before copying.

Create logical partition by size

Aside from supporting the option to specify how many logical partitions should be created, also add support for specifying the size you want the logical partitions to have: the number of partitions will be calculated automatically.
For example, an 800 GB table with 10 GB logical partitions will generate 80 partitions. This is much better for VLDBs, as it will be more efficient in case a connection needs to be recovered after a disconnection.

NuGet Package

Would it be possible to create a NuGet package for this code? We want to integrate it into our existing processing server stack, so Docker is out of the question for the time being :/

Seems like a great utility, and you guys have thought more about this problem than I could -- I'm hesitant to copy/paste the code and not be able to receive updates if bugs come up, etc.

If you are open to contributions, I would be happy to personally assist in this effort

Add support for computed columns

If computed columns are present, an error is generated:

2019/09/20 17:44:06.317|ERROR> Task 7: The column "IsFinalized" cannot be modified because it is either a computed column or is the result of a UNION operator.

Use Clustered Index for logical partitioning

If a source table has a clustered index, that should be used for logical partitioning, as a much cheaper alternative to %%PhysLoc%%, which cannot be pushed into the Storage Engine.
