gianlucaborello / cassandradump

A data exporting tool for Cassandra inspired by mysqldump, with some additional slice and dice capabilities

License: GNU General Public License v2.0

Python 84.22% Shell 7.63% Makefile 8.15%
python cassandra nosql

cassandradump's Introduction

cassandradump

https://travis-ci.org/gianlucaborello/cassandradump.svg?branch=master

Description

A data exporting tool for Cassandra inspired by mysqldump, with some additional slice and dice capabilities.

Disclaimer: most of the time, you really shouldn't be using this. It's fragile, non-scalable, inefficient, and verbose. Cassandra already offers excellent exporting/importing tools:

  • Snapshots
  • CQL's COPY FROM/TO
  • sstable2json

However, especially during development, I frequently need to:

  • Quickly take a snapshot of an entire keyspace, and import it just as quickly without copying too many files around or losing too much time
  • Take a very small subset of a massive production database (according to some CQL-like filtering) and import it quickly into my development environment

If these use cases sound familiar, this tool might be useful for you.

It's still missing many major Cassandra features that I don't use daily, so feel free to open an issue pointing them out (or send a pull request) if you need something.

Usage

The help should already contain some useful information:

usage: cassandradump.py [-h] [--cf CF] [--export-file EXPORT_FILE]
                        [--filter FILTER] [--host HOST] [--port PORT]
                        [--import-file IMPORT_FILE] [--keyspace KEYSPACE]
                        [--exclude-cf EXCLUDE_CF] [--no-create] [--no-insert]
                        [--password PASSWORD]
                        [--protocol-version PROTOCOL_VERSION] [--quiet]
                        [--sync] [--username USERNAME] [--ssl]
                        [--certfile CERTFILE]

A data exporting tool for Cassandra inspired from mysqldump, with some added
slice and dice capabilities.

optional arguments:
  -h, --help            show this help message and exit
  --cf CF               export a column family. The name must include the
                        keyspace, e.g. "system.schema_columns". Can be
                        specified multiple times
  --export-file EXPORT_FILE
                        export data to the specified file
  --filter FILTER       export a slice of a column family according to a CQL
                        filter. This takes essentially a typical SELECT query
                        stripped of the initial "SELECT ... FROM" part (e.g.
                        "system.schema_columns where keyspace_name
                        ='OpsCenter'", and exports only that data. Can be
                        specified multiple times
  --host HOST           the address of a Cassandra node in the cluster
                        (localhost if omitted)
  --port PORT           the port of a Cassandra node in the cluster
                        (9042 if omitted)
  --import-file IMPORT_FILE
                        import data from the specified file
  --keyspace KEYSPACE   export a keyspace along with all its column families.
                        Can be specified multiple times
  --exclude-cf EXCLUDE_CF
                        when using --keyspace, specify column family to
                        exclude. Can be specified multiple times
  --no-create           don't generate create (and drop) statements
  --no-insert           don't generate insert statements
  --password PASSWORD   set password for authentication (only if
                        protocol-version is set)
  --protocol-version PROTOCOL_VERSION
                        set auth_provider version (required for
                        authentication)
  --quiet               quiet progress logging
  --sync                import data in synchronous mode (default asynchronous)
  --username USERNAME   set username for auth (only if protocol-version is
                        set)
  --ssl                 enable ssl connection to Cassandra cluster.  Must also
                        set --certfile.
  --certfile CERTFILE   ca cert file for SSL.  Assumes --ssl.

In its simplest invocation, it exports data and schemas for all keyspaces:

$ python cassandradump.py --export-file dump.cql
Exporting all keyspaces
Exporting schema for keyspace OpsCenter
Exporting schema for column family OpsCenter.events_timeline
Exporting data for column family OpsCenter.events_timeline
Exporting schema for column family OpsCenter.settings
Exporting data for column family OpsCenter.settings
Exporting schema for column family OpsCenter.rollups60
Exporting data for column family OpsCenter.rollups60
...
$ cat dump.cql
DROP KEYSPACE IF EXISTS "OpsCenter";
CREATE KEYSPACE "OpsCenter" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;
DROP TABLE IF EXISTS "OpsCenter"."events_timeline";
CREATE TABLE "OpsCenter".events_timeline (key text, column1 bigint, value blob, PRIMARY KEY (key, column1)) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (column1 ASC) AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' AND comment = '{"info": "OpsCenter management data.", "version": [5, 1, 0]}' AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '8'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.25 AND speculative_retry = 'NONE';
INSERT INTO "OpsCenter"."events_timeline" (key, column1, value) VALUES ('201501', 1419841027332869, 0x)
INSERT INTO "OpsCenter"."events_timeline" (key, column1, value) VALUES ('201501', 1419841027352525, 0x)
INSERT INTO "OpsCenter"."events_timeline" (key, column1, value) VALUES ('201501', 1419928979070954, 0x)
...

The resulting dump file can be used directly with cqlsh -f, or imported through the tool itself with --import-file, which issues statements asynchronously by default and is therefore considerably faster.
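
For example, either of these loads the dump produced above (the file name is whatever was chosen at export time):

$ cqlsh -f dump.cql
$ python cassandradump.py --import-file dump.cql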

Using --keyspace, it's possible to filter for a specific set of keyspaces:

$ python cassandradump.py --keyspace system --export-file dump.cql
Exporting schema for keyspace system
Exporting schema for column family system.peers
Exporting data for column family system.peers
Exporting schema for column family system.range_xfers
Exporting data for column family system.range_xfers
Exporting schema for column family system.schema_columns
Exporting data for column family system.schema_columns
...
$ cat dump.cql
DROP KEYSPACE IF EXISTS "system";
CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}  AND durable_writes = true;
DROP TABLE IF EXISTS "system"."peers";
CREATE TABLE system.peers (peer inet PRIMARY KEY, data_center text, host_id uuid, preferred_ip inet, rack text, release_version text, rpc_address inet, schema_version uuid, tokens set<text>) WITH bloom_filter_fp_chance = 0.01 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' AND comment = 'known peers in the cluster' AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 0 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 3600000 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99.0PERCENTILE';
...

Using --cf, it's possible to filter for a specific set of column families:

$ python cassandradump.py --cf OpsCenter.rollups7200 --no-create --export-file dump.cql
Exporting data for column family OpsCenter.rollups7200
$ cat dump.cql
INSERT INTO "OpsCenter"."rollups7200" (key, column1, value) VALUES ('127.0.0.1-foo', 718946047, 0x000000000000000000000000)
INSERT INTO "OpsCenter"."rollups7200" (key, column1, value) VALUES ('127.0.0.1-foo', 718953247, 0x000000000000000000000000)
INSERT INTO "OpsCenter"."rollups7200" (key, column1, value) VALUES ('127.0.0.1-foo', 718960447, 0x000000000000000000000000)
INSERT INTO "OpsCenter"."rollups7200" (key, column1, value) VALUES ('127.0.0.1-foo', 718967647, 0x000000000000000000000000)
INSERT INTO "OpsCenter"."rollups7200" (key, column1, value) VALUES ('127.0.0.1-foo', 719032447, 0x40073fc200000000437bc000)
...

Using --no-insert and --no-create, it's possible to control which CQL statements are included in the dump, as in the schema-only example below.
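
For instance, a schema-only dump of a keyspace (the keyspace name here is just an illustration):

$ python cassandradump.py --keyspace mykeyspace --no-insert --export-file schema.cql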

Most of the time, the column families in a production scenario are huge, and you might want just a little slice of them. With --filter, it's possible to specify a set of CQL filters, and only the data that satisfies those filters will be included in the dump:

$ python cassandradump.py --filter "system.schema_columns WHERE keyspace_name='OpsCenter'" --export-file dump.cql
Exporting data for filter "system.schema_columns where keyspace_name ='OpsCenter'"
$ cat dump.cql
INSERT INTO "system"."schema_columns" (keyspace_name, columnfamily_name, column_name, component_index, index_name, index_options, index_type, type, validator) VALUES ('OpsCenter', 'backup_reports', 'backup_id', 1, NULL, 'null', NULL, 'clustering_key', 'org.apache.cassandra.db.marshal.UTF8Type')
INSERT INTO "system"."schema_columns" (keyspace_name, columnfamily_name, column_name, component_index, index_name, index_options, index_type, type, validator) VALUES ('OpsCenter', 'backup_reports', 'deleted_at', 4, NULL, 'null', NULL, 'regular', 'org.apache.cassandra.db.marshal.TimestampType')

cassandradump's People

Contributors

cleytonbonamigo, cybernet, gianlucaborello, ijeremic, keanor, leiyangyou, rawmind, zaibacu

cassandradump's Issues

Ignore system keyspaces

I've found that when you don't pass a keyspace, this will try to export all keyspaces, including system_schema, system_auth, system, etc., which can't be overwritten, as I get errors like:

system_schema keyspace is not user-modifiable. or Cannot CREATE <keyspace system_auth>

I am not familiar with Cassandra, but even if we found a way to overwrite system keyspaces, it doesn't sound like something we should be doing.
A flag like --ignore-system, when used without --keyspace, should be able to copy all keyspaces except the system ones; a sketch of such a filter follows.
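
A minimal sketch of how that filter might look, assuming the keyspace names come from the driver's cluster metadata (the names and structure are illustrative, not the project's actual code):

SYSTEM_KEYSPACES = set(['system', 'system_schema', 'system_auth',
                        'system_distributed', 'system_traces'])

def user_keyspaces(cluster):
    # Cluster.metadata.keyspaces maps keyspace name -> KeyspaceMetadata;
    # skip the Cassandra-managed keyspaces, which cannot be dropped or recreated
    return [name for name in cluster.metadata.keyspaces
            if name not in SYSTEM_KEYSPACES]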

Syntax error

File "cassandradump.py", line 63
for has_counter, columns in itertools.groupby(tableval.columns.iteritems(), lambda (k, v): v.data_type.typename == 'counter')
^
SyntaxError: invalid syntax
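
This is a Python 2 construct: tuple parameter unpacking in lambdas was removed in Python 3 (PEP 3113), so the script only runs under Python 2 as written. Under Python 3 the pair would have to be unpacked inside the lambda body, along these lines (a sketch, not the project's actual fix):

for has_counter, columns in itertools.groupby(
        tableval.columns.items(),  # iteritems() is also gone in Python 3
        lambda kv: kv[1].data_type.typename == 'counter'):
    ...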

Does not work with database containing unicode data.

Traceback (most recent call last):
  File "cassandradump.py", line 247, in <module>
    main()
  File "cassandradump.py", line 241, in main
    export_data(session)
  File "cassandradump.py", line 142, in export_data
    table_to_cqlfile(session, keyname, tablename, None, tableval, f)
  File "cassandradump.py", line 44, in table_to_cqlfile
    filep.write('INSERT INTO "' + keyspace + '"."' + tablename + '" (' + ', '.join(row.keys()) + ') VALUES (' + ', '.join(values) + ')\n')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 206: ordinal not in range(128)
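
The failure comes from Python 2's implicit ASCII decoding when a byte string is concatenated with unicode column data. A sketch of one possible fix, assuming the offending values are UTF-8 encoded byte strings (the helper name is illustrative):

def to_unicode(value):
    # Decode byte strings explicitly so concatenation with unicode
    # text doesn't fall back to the ASCII codec on non-ASCII bytes
    if isinstance(value, str):
        return value.decode('utf-8')
    return value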

AttributeError: 'ColumnMetadata' object has no attribute 'data_type'

Hi,
I get an error while exporting this table with Cassandra 2.1:

CREATE TABLE slots (
    type text,
    host text,
    count int,
    PRIMARY KEY (type,host)
 );

Exporting all keyspaces
Exporting schema for keyspace engine
Exporting data for column family engine.slots
Traceback (most recent call last):
  File "cassandradump.py", line 351, in <module>
    main()
  File "cassandradump.py", line 345, in main
    export_data(session)
  File "cassandradump.py", line 225, in export_data
    table_to_cqlfile(session, keyname, tablename, None, tableval, f)
  File "cassandradump.py", line 94, in table_to_cqlfile
    value_encoders = make_value_encoders(tableval)
  File "cassandradump.py", line 58, in make_value_encoders
    return dict((to_utf8(k), make_value_encoder(v.data_type.typename)) for k, v in tableval.columns.iteritems())
  File "cassandradump.py", line 58, in <genexpr>
    return dict((to_utf8(k), make_value_encoder(v.data_type.typename)) for k, v in tableval.columns.iteritems())
AttributeError: 'ColumnMetadata' object has no attribute 'data_type'

Connection Failed

I got a connection problem when executing:
python dumper.py --export-file fxm_test.cql --host 172.31.5.30

I'm running Cassandra 2.1.5 and Python 2.7.6
cqlsh 5.0.1 | Cassandra 2.1.5 | CQL spec 3.2.0 | Native protocol v3

Traceback (most recent call last):
  File "dumper.py", line 356, in <module>
    main()
  File "dumper.py", line 345, in main
    session = setup_cluster()
  File "dumper.py", line 300, in setup_cluster
    session = cluster.connect()
  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 839, in connect
    self.control_connection.connect()
  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 2075, in connect
    self._set_new_connection(self._reconnect_internal())
  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 2110, in _reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'172.31.5.30': ConnectionException(u'Failed to initialize new connection to 172.31.5.30: code=0000 [Server error] message="io.netty.handler.codec.DecoderException: org.apache.cassandra.transport.ProtocolException: Invalid or unsupported protocol version: 4"',)})

The Cassandra service is running, but I'm not sure why the script couldn't connect. Any idea?
Thanks!
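
The cqlsh banner shows the node speaking native protocol v3, while the error shows the driver attempting v4. Given the --protocol-version flag documented above, a plausible workaround (untested here) is forcing v3:

$ python dumper.py --export-file fxm_test.cql --host 172.31.5.30 --protocol-version 3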

Exception on import large database

Hello, when I tried to import a large database, 600k+ lines, it threw an exception:

Traceback (most recent call last):
  File "../cassandradump/cassandradump.py", line 271, in <module>
    main()
  File "../cassandradump/cassandradump.py", line 263, in main
    import_data(session)
  File "../cassandradump/cassandradump.py", line 65, in import_data
    session.execute(line)
  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1405, in execute
    result = future.result(timeout)
  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 2976, in result
    raise self._final_exception
cassandra.protocol.SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 0:-1 no viable alternative at input ''">

The script imported most of the data, but the last table, with 500k lines, got only 11k imported.

Thanks anyway.

Export incorrectly quotes Maps

Export of table definitions creates:

    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'

Should be:

    AND caching = {'keys':'ALL', 'rows_per_partition':'NONE'}

It only seems to affect the caching line; all other maps are generated correctly.

Anything I can do to speed up a keyspace dump?

Hello,

I have been running cassandradump on our local test clusters to move keyspaces around.

However, on larger databases on multi-datacenter clusters, I can see export jobs running for days until they export all the data.

I'm talking about a keyspace with replication factor 3, spread across 4 nodes, around 200GB of data total across all nodes.

While this is all fine, considering the amount of data involved, I barely see any load on the machines holding the relevant pieces of data. No high CPU or I/O usage, or any abnormal behavior really.

With that in mind, I was wondering if there's anything I can tune to further improve the export speed for these larger databases.

Any suggestions appreciated!

Thanks,

schema dump ";;"

python ~/cassandradump/cassandradump.py --no-insert --export-file ./table2.cql --cf=keyspace1.table2

... AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';;

The generated file ends with a doubled ";".

When I try to import the schema:

cqlsh localhost < ./table2.cql
:24:SyntaxException: <Error from server: code=2000 [Syntax error in CQL query] message="line 1:0 no viable alternative at input ';' ([;])">
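
Until the duplicated terminator is fixed, a simple workaround (assuming GNU sed; BSD sed needs -i '') is collapsing the trailing ";;":

$ sed -i 's/;;$/;/' table2.cql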

There is no version specified for cassandra-driver in requires

When you install it, it pulls in cassandra-driver==3.0.0c1 as a requirement, which happens to cause errors when doing exports:

Traceback (most recent call last):
  File "/usr/local/bin/cassandradump", line 9, in <module>
    load_entry_point('cassandradump==0.0.1', 'console_scripts', 'cassandradump')()
  File "/usr/local/lib/python2.7/dist-packages/cassandradump.py", line 350, in main
    export_data(session)
  File "/usr/local/lib/python2.7/dist-packages/cassandradump.py", line 230, in export_data
    table_to_cqlfile(session, keyname, tablename, None, tableval, f)
  File "/usr/local/lib/python2.7/dist-packages/cassandradump.py", line 94, in table_to_cqlfile
    value_encoders = make_value_encoders(tableval)
  File "/usr/local/lib/python2.7/dist-packages/cassandradump.py", line 58, in make_value_encoders
    return dict((to_utf8(k), make_value_encoder(v.data_type.typename)) for k, v in tableval.columns.iteritems())
  File "/usr/local/lib/python2.7/dist-packages/cassandradump.py", line 58, in <genexpr>
    return dict((to_utf8(k), make_value_encoder(v.data_type.typename)) for k, v in tableval.columns.iteritems())
AttributeError: 'ColumnMetadata' object has no attribute 'data_type'

Manually reverting cassandra-driver to 2.6.0 solves the problem, for example:
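
$ pip install cassandra-driver==2.6.0

(A version constraint such as cassandra-driver<3.0 in the package's install_requires would be the durable fix, though the actual requirements list isn't shown here.)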
