mysqlonrocksdb / mysql-5.6
This project is forked from facebook/mysql-5.6, Facebook's branch of the Oracle MySQL v5.6 database.
License: GNU General Public License v2.0
This is a continuation of issue #16.
Let's work with the same dataset:
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;
create table t2 (
pk int not null,
a int not null,
b int not null,
primary key(pk),
key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;
Try this:
MySQL [test]> select max(a) from t2 where a < 2;
+--------+
| max(a) |
+--------+
| 2 |
+--------+
This is wrong: max(a) over rows with a < 2 cannot be 2; the expected answer is 1. For comparison, the same query without the index:
MySQL [test]> create table t3 as select * from t2;
Query OK, 1000 rows affected (0.58 sec)
Records: 1000 Duplicates: 0 Warnings: 0
MySQL [test]> select max(a) from t3 where a < 2;
+--------+
| max(a) |
+--------+
| 1 |
+--------+
Currently the RocksDB-SE data dictionary is populated at startup with table -> index mappings, but those indexes don't have mappings to their associated column families, so the get_cf call returns null.
Filling in the column families relies on MySQL opening the table at some point. For fast drop table, we need access to these column families even if the table hasn't been opened yet.
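A minimal sketch of one possible direction, with all names hypothetical: persist an index id -> CF name mapping in the dictionary at CREATE TABLE time, so get_cf() can resolve handles without the table ever having been opened.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <rocksdb/db.h>

// Hypothetical CF manager hook that opens or creates a handle by name.
rocksdb::ColumnFamilyHandle* cf_manager_get_or_create(const std::string& name);

// Sketch: mapping persisted in the data dictionary at CREATE TABLE time
// and loaded at startup, so no table open is needed to resolve a CF.
static std::unordered_map<uint32_t, std::string> index_id_to_cf_name;

rocksdb::ColumnFamilyHandle* get_cf_for_index(uint32_t index_id) {
  auto it = index_id_to_cf_name.find(index_id);
  if (it == index_id_to_cf_name.end())
    return nullptr;                             // unknown index: current behavior
  return cf_manager_get_or_create(it->second);  // works before any table open
}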
Now that we have CONSISTENT SNAPSHOT in RocksDB, it's time to make mysqldump work for RocksDB.
At least two features are needed.
This is a continuation of issue #16.
Run this:
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;
create table t2 (
pk int not null,
a int not null,
b int not null,
primary key(pk),
key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;
The problem query:
MySQL [test]> explain select * from t2 force index (a) where a>=0;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | t2 | range | a | a | 4 | NULL | 10 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.01 sec)
MySQL [test]> select * from t2 force index (a) where a>=0;
Empty set (0.01 sec)
The correct result is:
MySQL [test]> select * from t2 ignore index (a) where a>=0;
+-----+----+-----+
| pk | a | b |
+-----+----+-----+
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 2 | 0 | 2 |
| 3 | 0 | 3 |
| 4 | 0 | 4 |
| 5 | 0 | 5 |
.. (total 1000 rows)
Currently, index scans for DATETIME, TIMESTAMP, and DOUBLE are not supported.
Testcase:
create table t31 (pk int auto_increment primary key, key1 double, key(key1)) engine=rocksdb;
insert into t31 values (),(),(),(),(),(),(),();
explain select key1 from t31 where key1=1.234;
create table t32 (pk int auto_increment primary key, key1 datetime, key(key1)) engine=rocksdb;
insert into t32 values (),(),(),(),(),(),(),();
explain select key1 from t32 where key1='2015-01-01 00:11:12';
create table t33 (pk int auto_increment primary key, key1 timestamp, key(key1)) engine=rocksdb;
insert into t33 values (),(),(),(),(),(),(),();
explain select key1 from t33 where key1='2015-01-01 00:11:12';
This task is about supporting them.
DATETIME/TIMESTAMP use Field_temporal_with_date_and_timef::make_sort_key, which just does memcpy().
DOUBLE uses change_double_for_sort(); we will need to code a reverse function, sketched below.
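A minimal sketch of the reverse transform. This assumes the standard mem-comparable encoding for doubles (an 8-byte big-endian image where positive values get the sign bit set and negative values have all bits inverted, so memcmp() order matches numeric order); it is not the actual MyRocks patch.
#include <cstdint>
#include <cstring>

// Undo change_double_for_sort(): recover the original IEEE-754 bits
// from the 8-byte mem-comparable key image.
static double mem_comparable_to_double(const unsigned char key[8]) {
  uint64_t bits = 0;
  for (int i = 0; i < 8; i++)
    bits = (bits << 8) | key[i];   // bytes were stored big-endian
  if (bits & (1ULL << 63))
    bits &= ~(1ULL << 63);         // was positive (or zero): clear sign bit
  else
    bits = ~bits;                  // was negative: invert all bits back
  double d;
  memcpy(&d, &bits, sizeof d);     // assumes IEEE-754 host doubles
  return d;
}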
From @mdcallag:
For background see http://dev.mysql.com/doc/refman/5.6/en/show-engine.html
We need to change what is in SHOW ENGINE ROCKSDB STATUS. Right now it shows the list of live SST files, which is huge with leveled compaction. For now I prefer to have it list the output from compaction stats. That probably needs to use one of:
db->GetProperty("rocksdb.stats", ...
db->GetProperty("rocksdb.cfstats", ...
*************************** 1. row ***************************
Type: ROCKSDB
Name: live_files
Status: cf=default name=/4908814.sst size=97853952
cf=default name=/4908812.sst size=97879865
cf=default name=/4908807.sst size=97833748
cf=default name=/4905498.sst size=1865749
cf=default name=/4905500.sst size=2670668
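For illustration, a hedged sketch of swapping the status text over to the compaction-stats property; the surrounding handlerton plumbing (stat_print and friends) is assumed, not shown.
#include <string>
#include <rocksdb/db.h>

// Sketch: replace the live_files dump with the human-readable compaction
// stats RocksDB already aggregates. "rocksdb.stats" covers the whole DB;
// "rocksdb.cfstats" is the per-column-family variant.
bool rocksdb_show_status_text(rocksdb::DB* rdb, std::string* out) {
  std::string stats;
  if (!rdb->GetProperty("rocksdb.stats", &stats))
    return false;     // property not supported by this RocksDB version
  out->assign(stats); // handed to stat_print() as the Status: text
  return true;
}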
From @mdcallag:
This requires more discussion but many of us are in favor of it.
It might be time to use the binlog as the source of truth, to avoid the complexity and inefficiency of keeping RocksDB and the binlog synchronized via internal XA. There are two modes for this. The first mode is durable, in which case fsync is done after writing the binlog and the RocksDB WAL. The other mode is non-durable, in which case fsync might only be done once per second and we rely on lossless semisync to recover. Binlog as source of truth might have been discussed on a MariaDB mailing list many years ago - https://lists.launchpad.net/maria-developers/msg01998.html
Some details are at http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html
The new protocol will be:
When lossless semisync is used, we skip steps 2 and 4. When lossless semisync is not used, we do step 2 and skip 3. Step 4 is optional. Recovery in this case is done by:
When running in non-durable mode, on a crash one of the following is true (the relation describes which one has more commits):
From @mdcallag:
How do we do DROP TABLE and DROP INDEX? Using a column family per index makes this easy (just drop the CF), but I assume we must support N tables or indexes per column family, and then we need a compaction filter to eventually remove deleted rows, along with an option to trigger manual compaction after dropping a large table or index to reclaim disk space faster.
Assuming each key for RocksDB has the table/index ID as the prefix, do we track IDs for valid objects, or IDs for dropped objects that have key-value pairs that must be removed? If we track valid IDs, then compaction filters must always do a bloom filter check or hash table lookup to confirm that the IDs in the merged rows are valid, and this can be a big CPU overhead. If we track IDs for dropped objects, then we have a new state management problem: this state must be persisted to survive mysqld restart, and we must also figure out when an ID can be removed from this set. But the benefit is that we only need the compaction filter when that set of IDs is not empty. As an optimization we can avoid the compaction filter checks for merges between L0 and L1, because those must be as fast as possible.
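A minimal sketch of the "track dropped IDs" option, assuming keys are prefixed with a big-endian 4-byte index ID; the persistence of the dropped set is out of scope here.
#include <cstdint>
#include <string>
#include <unordered_set>
#include <rocksdb/compaction_filter.h>

// Returning true from Filter() removes the key-value pair during compaction.
class DroppedIndexFilter : public rocksdb::CompactionFilter {
 public:
  explicit DroppedIndexFilter(const std::unordered_set<uint32_t>* dropped)
      : dropped_(dropped) {}

  bool Filter(int /*level*/, const rocksdb::Slice& key,
              const rocksdb::Slice& /*existing_value*/,
              std::string* /*new_value*/,
              bool* /*value_changed*/) const override {
    if (key.size() < 4) return false;
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(key.data());
    uint32_t index_id = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                        (uint32_t(p[2]) << 8) | uint32_t(p[3]);
    return dropped_->count(index_id) != 0;  // drop rows of dropped indexes
  }

  const char* Name() const override { return "DroppedIndexFilter"; }

 private:
  const std::unordered_set<uint32_t>* dropped_;
};
This keeps the common path cheap, as the text notes: when the dropped set is empty, the filter can be skipped entirely.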
Here is the list of tests which fail:
(tests prefixed with # are fixed)
funcs_1.is_columns_mysql : This test fails on RocksDB SE
funcs_1.is_tables_mysql : This test fails on RocksDB SE
innodb.innodb_bug59641 : This test fails on RocksDB SE
innodb.innodb-index-online-fk : This test fails on RocksDB SE
innodb.innodb-system-table-view : This test fails on RocksDB SE
innodb.innodb-tablespace : This test fails on RocksDB SE
innodb_stress.innodb_stress_blob_zipdebug_zlib : This test fails on RocksDB SE
innodb_stress.innodb_stress_mix : This test fails on RocksDB SE
innodb_zip.innodb_16k : This test fails on RocksDB SE
main.bootstrap : This test fails on RocksDB SE
main.connect : This test fails on RocksDB SE
main.hll : This test fails on RocksDB SE
main.innodb_report_age_of_evicted_pages : This test fails on RocksDB SE
main.mysqlbinlog_gtid : This test fails on RocksDB SE
main.mysqld--help-notwin-profiling : This test fails on RocksDB SE
main.mysqld--help-notwin : This test fails on RocksDB SE
main.openssl_1 : This test fails on RocksDB SE
main.plugin_auth_qa_1 : This test fails on RocksDB SE
main.plugin_auth_sha256_server_default_tls : This test fails on RocksDB SE
main.plugin_auth_sha256_tls : This test fails on RocksDB SE
main.rocksdb : This test fails on RocksDB SE
main.ssl_8k_key : This test fails on RocksDB SE
main.ssl_cipher : This test fails on RocksDB SE
main.ssl_compress : This test fails on RocksDB SE
main.ssl_connections_count : This test fails on RocksDB SE
main.ssl_connect : This test fails on RocksDB SE
main.ssl : This test fails on RocksDB SE
main.temp_table_cleanup : This test fails on RocksDB SE
perfschema.aggregate : This test fails on RocksDB SE
perfschema.hostcache_ipv4_auth_plugin : This test fails on RocksDB SE
perfschema.hostcache_ipv6_auth_plugin : This test fails on RocksDB SE
perfschema.no_threads : This test fails on RocksDB SE
perfschema.pfs_upgrade_event : This test fails on RocksDB SE
perfschema.pfs_upgrade_func : This test fails on RocksDB SE
perfschema.pfs_upgrade_proc : This test fails on RocksDB SE
perfschema.pfs_upgrade_table : This test fails on RocksDB SE
perfschema.pfs_upgrade_view : This test fails on RocksDB SE
perfschema.start_server_disable_idle : This test fails on RocksDB SE
perfschema.start_server_disable_stages : This test fails on RocksDB SE
perfschema.start_server_disable_statements : This test fails on RocksDB SE
perfschema.start_server_disable_waits : This test fails on RocksDB SE
perfschema.start_server_innodb : This test fails on RocksDB SE
perfschema.start_server_no_account : This test fails on RocksDB SE
perfschema.start_server_no_cond_class : This test fails on RocksDB SE
perfschema.start_server_no_cond_inst : This test fails on RocksDB SE
perfschema.start_server_no_file_class : This test fails on RocksDB SE
perfschema.start_server_no_file_inst : This test fails on RocksDB SE
perfschema.start_server_no_host : This test fails on RocksDB SE
perfschema.start_server_no_mutex_class : This test fails on RocksDB SE
perfschema.start_server_no_mutex_inst : This test fails on RocksDB SE
perfschema.start_server_no_rwlock_class : This test fails on RocksDB SE
perfschema.start_server_no_rwlock_inst : This test fails on RocksDB SE
perfschema.start_server_no_setup_actors : This test fails on RocksDB SE
perfschema.start_server_no_setup_objects : This test fails on RocksDB SE
perfschema.start_server_no_socket_class : This test fails on RocksDB SE
perfschema.start_server_no_socket_inst : This test fails on RocksDB SE
perfschema.start_server_no_stage_class : This test fails on RocksDB SE
perfschema.start_server_no_stages_history_long : This test fails on RocksDB SE
perfschema.start_server_no_stages_history : This test fails on RocksDB SE
perfschema.start_server_no_statement_class : This test fails on RocksDB SE
perfschema.start_server_no_statements_history_long : This test fails on RocksDB SE
perfschema.start_server_no_statements_history : This test fails on RocksDB SE
perfschema.start_server_no_table_hdl : This test fails on RocksDB SE
perfschema.start_server_no_table_inst : This test fails on RocksDB SE
perfschema.start_server_nothing : This test fails on RocksDB SE
perfschema.start_server_no_thread_class : This test fails on RocksDB SE
perfschema.start_server_no_thread_inst : This test fails on RocksDB SE
perfschema.start_server_no_user : This test fails on RocksDB SE
perfschema.start_server_no_waits_history_long : This test fails on RocksDB SE
perfschema.start_server_no_waits_history : This test fails on RocksDB SE
perfschema.start_server_off : This test fails on RocksDB SE
perfschema.start_server_on : This test fails on RocksDB SE
rpl.rpl_alter_repository : This test fails on RocksDB SE
rpl.rpl_change_master_crash_safe : This test fails on RocksDB SE
rpl.rpl_dynamic_ssl : This test fails on RocksDB SE
rpl.rpl_gtid_crash_safe : This test fails on RocksDB SE
rpl.rpl_heartbeat_ssl : This test fails on RocksDB SE
rpl.rpl_innodb_bug68220 : This test fails on RocksDB SE
rpl.rpl_master_connection : This test fails on RocksDB SE
rpl.rpl_row_crash_safe : This test fails on RocksDB SE
rpl.rpl_ssl1 : This test fails on RocksDB SE
rpl.rpl_ssl : This test fails on RocksDB SE
rpl.rpl_stm_mixed_mts_rec_crash_safe_small : This test fails on RocksDB SE
sys_vars.all_vars : This test fails on RocksDB SE
Since we need to manage more dictionary information, it is time to define a better dictionary data model. This has to be done before rolling out widely.
Candidate dictionary model after discussion:
side notes:
We need to support Index Condition Pushdown in RocksDB-SE.
http://bugs.mysql.com/bug.php?id=75248 affects RocksDB more than InnoDB because range scans are costlier in RocksDB. MariaDB has already fixed the problem, and we have already identified a one-line diff that fixes it. We need to decide between 1) waiting until Oracle fixes the bug and cloning the fix into WebScaleSQL, or 2) fixing WebScaleSQL directly.
Currently, index-only scans are supported only for binary collations.
For other collations (e.g. case-insensitive _ci collations), index-only scans are not supported. The reason is that it is not possible to restore the original column value from its mem-comparable key. For example, in latin1_general_ci the values 'foo', 'Foo', and 'FOO' all have the mem-comparable form 'FOO'.
A possible solution could work like this:
See also:
Support START TRANSACTION WITH CONSISTENT SNAPSHOT, and support START TRANSACTION WITH CONSISTENT ROCKSDB SNAPSHOT (an fb-mysql extension to print the binary log state).
From @mdcallag:
This should avoid doing a table copy when adding an index. See http://dev.mysql.com/doc/refman/5.5/en/innodb-create-index.html
How to repeat:
CREATE TABLE linktable (
  id1 bigint(20) unsigned NOT NULL DEFAULT '0',
  id2 bigint(20) unsigned NOT NULL DEFAULT '0',
  link_type bigint(20) unsigned NOT NULL DEFAULT '0',
  visibility tinyint(3) NOT NULL DEFAULT '0',
  data varchar(255) COLLATE latin1_bin NOT NULL DEFAULT '',
  time bigint(20) unsigned NOT NULL DEFAULT '0',
  version int(11) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (link_type, id1, id2) COMMENT 'rev:cf_primary',
  KEY id1_type (id1, link_type, visibility, time, id2, version, data)
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;
insert into linktable values (1,1,1,1,1,1,1),(2,2,2,2,2,2,2),(3,3,3,3,3,3,3),(4,4,4,4,4,4,4),(5,5,5,5,5,5,5),(6,6,6,6,6,6,6),(7,7,7,7,7,7,7),(8,8,8,8,8,8,8),(9,9,9,9,9,9,9),(10,10,10,10,10,10,10),(11,11,11,11,11,11,11),(12,12,12,12,12,12,12),(13,13,13,13,13,13,13),(14,14,14,14,14,14,14),(15,15,15,15,15,15,15),(16,16,16,16,16,16,16),(17,17,17,17,17,17,17),(18,18,18,18,18,18,18),(19,19,19,19,19,19,19),(20,20,20,20,20,20,20);
mysql> select * from linktable force index(PRIMARY) where link_type=1;
Empty set (0.00 sec)
mysql> select * from linktable force index(id1_type) where link_type=1;
+-----+-----+-----------+------------+------+------+---------+
| id1 | id2 | link_type | visibility | data | time | version |
+-----+-----+-----------+------------+------+------+---------+
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
+-----+-----+-----------+------------+------+------+---------+
1 row in set (0.00 sec)
mysql> update linktable set version=100 where link_type=1;
Query OK, 0 rows affected (0.00 sec)
Rows matched: 0 Changed: 0 Warnings: 0
mysql> update linktable set version=100 where id1=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
It doesn't work with a reverse cf on the secondary index, either:
CREATE TABLE linktable (
  id1 bigint(20) unsigned NOT NULL DEFAULT '0',
  id2 bigint(20) unsigned NOT NULL DEFAULT '0',
  link_type bigint(20) unsigned NOT NULL DEFAULT '0',
  visibility tinyint(3) NOT NULL DEFAULT '0',
  data varchar(255) COLLATE latin1_bin NOT NULL DEFAULT '',
  time bigint(20) unsigned NOT NULL DEFAULT '0',
  version int(11) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (link_type, id1, id2),
  KEY id1_type (id1, link_type, visibility, time, id2, version, data) COMMENT 'rev:cf_id1_type'
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;
insert into linktable values (1,1,1,1,1,1,1),(2,2,2,2,2,2,2),(3,3,3,3,3,3,3),(4,4,4,4,4,4,4),(5,5,5,5,5,5,5),(6,6,6,6,6,6,6),(7,7,7,7,7,7,7),(8,8,8,8,8,8,8),(9,9,9,9,9,9,9),(10,10,10,10,10,10,10),(11,11,11,11,11,11,11),(12,12,12,12,12,12,12),(13,13,13,13,13,13,13),(14,14,14,14,14,14,14),(15,15,15,15,15,15,15),(16,16,16,16,16,16,16),(17,17,17,17,17,17,17),(18,18,18,18,18,18,18),(19,19,19,19,19,19,19),(20,20,20,20,20,20,20);
mysql> select * from linktable force index(id1_type) where id1=1;
Empty set (0.00 sec)
mysql> select * from linktable force index(PRIMARY) where id1=1;
+-----+-----+-----------+------------+------+------+---------+
| id1 | id2 | link_type | visibility | data | time | version |
+-----+-----+-----------+------------+------+------+---------+
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
+-----+-----+-----------+------------+------+------+---------+
1 row in set (0.00 sec)
mysql> update linktable set version=100 where id1=10;
Query OK, 0 rows affected (0.00 sec)
Rows matched: 0 Changed: 0 Warnings: 0
mysql> update linktable set version=100 where link_type=1;
Query OK, 1 row affected (0.03 sec)
Rows matched: 1 Changed: 1 Warnings: 0
This is a continuation of issue #16. It is the last part.
Let's work with the same dataset:
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;
create table t2 (
pk int not null,
a int not null,
b int not null,
primary key(pk),
key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;
Try this:
MySQL [j5]> select * from t2 where a between 98 and 2000 order by a desc;
Empty set (4.83 sec)
The correct result should be:
MySQL [j5]> select * from t2 use index() where a between 98 and 2000 order by a desc;
+-----+----+-----+
| pk | a | b |
+-----+----+-----+
| 999 | 99 | 999 |
| 991 | 99 | 991 |
...
| 980 | 98 | 980 |
+-----+----+-----+
20 rows in set (0.00 sec)
This function modifies ddl_hash and index_num_to_keydef.
However, these changes need to be reversed if dict_manager.commit fails.
See the discussion here: https://reviews.facebook.net/D35925#inline-259167
create table r1 (id1 int, id2 int, value int, primary key (id1, id2)) engine=rocksdb;
insert into r1 values (1,1,1), (2,2,2), (3,3,3), (4,4,4), (5,5,5);
T1:
begin;
update r1 set value=value+1 where id1=2 and id2=2;
T2:
begin;
select * from r1 where id1=2 and id2=2 lock in share mode;
=> not blocked. Expected: T2 blocks, because T1 holds an exclusive lock on that row.
Mongo on RocksDB code: mongodb-partners/mongo@b7fb7c4
From @yoshinorim:
I just noticed that primary keys are stored twice in RocksDB SE, as both key and value, even though they do not need to be stored in the value.
How to repeat:
A 10-column primary key:
create table r1 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10)
) engine=rocksdb charset=latin1 collate=latin1_bin;
A 1-column primary key:
create table r2 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1)
) engine=rocksdb charset=latin1 collate=latin1_bin;
Insert 2 million rows into these tables, with values sequentially increasing from 1 to 2 million (all columns have the same value within a row). The data can be generated like this:
for (my $i = 1; $i <= 2000000; $i++) {
  for (my $j = 0; $j < 12; $j++) {
    if ($j < 11) {
      print "$i,";    # first 11 columns of the row
    } else {
      print "$i\n";   # last column ends the row
    }
  }
}
Then make compaction happen.
Size in KB:
rocksdb, id1-id10 pk: 353764
rocksdb, id1 pk:      212984
innodb,  id1-id10 pk: 264241
innodb,  id1 pk:      260046
So long primary keys substantially increase on-disk space in RocksDB, while InnoDB is barely affected.
I also checked via gdb (setting a breakpoint at rocksdb::WriteBatch::Put()) and confirmed it.
insert into r1 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$1 = 84
(gdb) p value.size_
$2 = 96
insert into r2 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$3 = 12
(gdb) p value.size_
$4 = 96
I expected value.size_ to be 12 and 84, respectively. Since the key columns are stored in the key and again in the value, the duplicate fields can be removed from the value, as sketched below.
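A minimal sketch of the idea using the server's Field metadata; pack_field_into_value and the call site are hypothetical, and this assumes the table has an explicit primary key.
// Sketch: when building the value slice, skip columns that are already
// encoded in the primary key; they can be decoded back from the key slice.
static void pack_value_skipping_pk(TABLE* table, String* value_buf) {
  for (uint i = 0; i < table->s->fields; i++) {
    Field* field = table->field[i];
    // part_of_key is the server's per-field bitmap of covering indexes.
    if (field->part_of_key.is_set(table->s->primary_key))
      continue;  // PK column: stored in the key, don't repeat in the value
    pack_field_into_value(field, value_buf);  // hypothetical helper
  }
}
Note the caveat from the index-only-scan issue above: columns whose mem-comparable form is not reversible would still have to stay in the value.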
From @mdcallag:
I want the following for RocksDB. This is similar to innodb_flush_log_at_trx_commit, which uses an integer set to 0, 1 or 2.
We need a) and b) before doing c), so this can be done in two diffs. Doing c) requires a background thread; see srv_master_thread in the InnoDB source code for tips on doing that, covering both thread start and thread stop on db shutdown.
Additionally, it would be good to have a per-session variable that can override these values. For the per-session variable, see bulk_load in ha_rocksdb.cc at https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/rocksdb/ha_rocksdb.cc#L128
Whether to sync the RocksDB WAL is specified by the RocksDB user in WriteOptions::sync; a sketch of the 0/1/2 mapping follows.
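A minimal sketch of that mapping, assuming a hypothetical rocksdb_flush_log_at_trx_commit variable; the once-per-second sync for modes 0 and 2 would come from the background thread mentioned above.
#include <rocksdb/options.h>

// Sketch: translate the integer setting into per-commit WriteOptions.
rocksdb::WriteOptions write_opts_for_commit(int flush_log_at_trx_commit) {
  rocksdb::WriteOptions opts;
  switch (flush_log_at_trx_commit) {
    case 0:                    // no WAL write at commit;
      opts.disableWAL = true;  // background thread persists periodically
      break;
    case 1:                    // write and fsync the WAL at every commit
      opts.sync = true;
      break;
    case 2:                    // write the WAL at commit, but let the
    default:                   // background thread fsync about once/second
      opts.sync = false;
      break;
  }
  return opts;
}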
Compaction Filter V2 had a performance problem building its vectors. We should switch back to Compaction Filter V1.
If an SST file doesn't have persisted properties, we should use average values for that file, not ignore it as we currently do.
From @mdcallag:
We currently provide read committed. We need to define and implement repeatable read. See https://github.com/MariaDB/webscalesql-5.6/wiki/Cursor-Isolation
This is a placeholder for now.
Currently the data dictionary is mapped to the "default" CF and shared with regular rows. A dedicated CF for the data dictionary will make things easier.
ha_rocksdb::index_read_map() calls io_perf_start() at function start, but there are exits from the function that do not call io_perf_end_and_record().
One example scenario:
CREATE TABLE t1 (pk int, a int, PRIMARY KEY (pk), KEY(a) ) ENGINE=ROCKSDB;
Insert some rows into t1; suppose the row with a=10 doesn't exist and the row with a=11 does exist.
Then, run the query:
select * from t1 where a in (10,11);
And observe:
Breakpoint 6, ha_rocksdb::io_perf_start (this=0x7fffa8009df0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look-2/storage/rocksdb/ha_rocksdb.cc:2047
...
Breakpoint 6, ha_rocksdb::io_perf_start (this=0x7fffa8009df0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look-2/storage/rocksdb/ha_rocksdb.cc:2047
...
Breakpoint 7, ha_rocksdb::io_perf_end_and_record (this=0x7fffa8009df0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look-2/storage/rocksdb/ha_rocksdb.cc:2056
The first hit is from the index lookup for a=10; the other two are from the lookup on a=11.
Note that the second hit of breakpoint 6 wipes out the data collected since the first invocation, so the a=10 lookup is never recorded.
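A minimal sketch of an exit-safe fix using a scope guard; the guard class is hypothetical, while io_perf_start()/io_perf_end_and_record() are the members named above.
// The destructor runs on every exit path of the enclosing function, so
// perf data is recorded exactly once per index_read_map() invocation.
class Rdb_io_perf_guard {
 public:
  explicit Rdb_io_perf_guard(ha_rocksdb* h) : h_(h) { h_->io_perf_start(); }
  ~Rdb_io_perf_guard() { h_->io_perf_end_and_record(); }
 private:
  ha_rocksdb* h_;
};

// Usage at the top of ha_rocksdb::index_read_map():
//   Rdb_io_perf_guard guard(this);
// then delete the manual io_perf_start()/io_perf_end_and_record() calls.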
From @mdcallag:
This includes per-column-family options and configuration options.
This is about implementing a read-free slave, similar to what TokuDB has recently implemented.
The idea is that one can process RBR events without making Get() calls or scans: the events carry sufficient information to issue Put/Delete calls directly.
RocksDB SE will support RBR only, so the RBR restriction is not a problem.
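A minimal sketch of the idea for a write-rows event, with hypothetical packing already done by the caller; the point is that the apply path issues only writes, never a Get() or a scan.
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Sketch: an RBR write-row image already contains every column, so the
// slave can construct the key/value slices and write blindly.
rocksdb::Status apply_write_row_read_free(rocksdb::DB* db,
                                          const rocksdb::Slice& key,
                                          const rocksdb::Slice& value) {
  rocksdb::WriteBatch batch;
  batch.Put(key, value);  // no prior Get()/Seek() needed
  // Delete_rows events are symmetric: batch.Delete(key), again read-free.
  return db->Write(rocksdb::WriteOptions(), &batch);
}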
Currently the crash-safe slave doesn't work on the RocksDB storage engine. This is because the Xid binlog event is not written at transaction commit in RocksDB; a Query event is written instead. This needs to be changed. To make the Xid event be written at RocksDB commit, we need to add a handler::prepare() method; an empty definition is fine, as sketched below.
In addition to that, the slave_relay_log_info, slave_gtid_info, and slave_worker_info tables need to be changed to RocksDB.
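A minimal sketch under the 5.6 handlerton interface (the registration pattern follows other engines; treat it as an assumption). An empty prepare is enough because its presence is what makes the server log an Xid event at commit.
// Sketch: an empty prepare makes RocksDB participate in internal XA, so
// binlog commit records an Xid event instead of a Query event.
static int rocksdb_prepare(handlerton* hton, THD* thd, bool prepare_tx) {
  return 0;  // nothing to persist at prepare time; presence is what matters
}

// In the plugin init function:
//   rocksdb_hton->prepare = rocksdb_prepare;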
From @yoshinorim:
We need an online physical backup tool, similar to xtrabackup for InnoDB. It should meet at least the following features/conditions:
Streaming backup (copying directly to a remote host, since the local host may not have enough space for backups)
Can take a consistent binlog position or GTID (so that we can create slave instances easily)
Backup can finish within reasonable time (close to the disk read, transfer, and write time of the target data volume, plus the time to apply logs)
Not too much write performance drop while taking backups
Not too much read performance drop while taking backups
Nice to have:
Incremental backup
Partial backups (per CF)
If there are many tombstone rows, Seek()/Prev()/Next() may take (potentially much) longer, because they have to scan over the tombstones. This is a RocksDB-internal issue, and MyRocks is affected by the limitation. We need to find ways to fix it, either in RocksDB itself or via workarounds on the MyRocks side. The latter would likely mean an automatic, adaptive compaction thread driven by tombstone statistics, as sketched below.
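A minimal sketch of that workaround direction, reusing the rocksdb::perf_context counters shown elsewhere in this tracker (newer RocksDB exposes them via get_perf_context()); the threshold and the full-range compaction are illustrative only.
#include <rocksdb/db.h>
#include <rocksdb/perf_context.h>

// Sketch: if iterators skip far more internal (deleted) entries than they
// return, schedule a manual compaction to reclaim the tombstones.
void maybe_compact_for_tombstones(rocksdb::DB* db, uint64_t rows_returned) {
  const uint64_t skipped =
      rocksdb::perf_context.internal_delete_skipped_count;
  if (rows_returned > 0 && skipped / rows_returned > 100)  // ratio is a guess
    db->CompactRange(nullptr, nullptr);  // old-style full-range compaction
}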
How to repeat:
In my environment, mysqld crashed at ha_rocksdb::position().
#0 ha_rocksdb::position (this=0x7f235d432910, record=0x7f235d503410 "\371\001")
at /data/users/yoshinori/mysql/5.6/storage/rocksdb/ha_rocksdb.cc:3152
#1 0x0000000000636b86 in handler::rnd_pos_by_record (this=0x7f235d432910, record=0x7f235d503410 "\371\001")
at /data/users/yoshinori/mysql/5.6/sql/handler.h:2412
#2 0x00000000008ebfcb in Rows_log_event::do_index_scan_and_update (this=this@entry=0x7f235d48bf00, rli=rli@entry=
0x7f2363ba6000) at /data/users/yoshinori/mysql/5.6/sql/log_event.cc:10825
#3 0x00000000008eb18e in Rows_log_event::do_apply_event (this=0x7f235d48bf00, rli=0x7f2363ba6000)
at /data/users/yoshinori/mysql/5.6/sql/log_event.cc:11709
#4 0x00000000008ee2b7 in Log_event::apply_event (this=this@entry=0x7f235d48bf00, rli=rli@entry=0x7f2363ba6000)
at /data/users/yoshinori/mysql/5.6/sql/log_event.cc:3345
#5 0x0000000000918d8c in apply_event_and_update_pos (ptr_ev=ptr_ev@entry=0x7f2386eac670, thd=thd@entry=
0x7f235d429000, rli=rli@entry=0x7f2363ba6000) at /data/users/yoshinori/mysql/5.6/sql/rpl_slave.cc:3575
#6 0x000000000091bf89 in exec_relay_log_event (rli=0x7f2363ba6000, thd=0x7f235d429000)
at /data/users/yoshinori/mysql/5.6/sql/rpl_slave.cc:4122
last_rowkey was null.
(gdb) p ref
$1 = (uchar *) 0x7f235d503738 ""
(gdb) p last_rowkey
$2 = {Ptr = 0x0, str_length = 0, Alloced_length = 0, alloced = false, str_charset = 0x15d1a60 <my_charset_bin>}
(gdb) p ref_length
$3 = 8
Currently, ha_rocksdb::update_row() updates all indexes, even those that do not have a single changed column. We should update only the indexes whose columns were changed.
A question: are there any RocksDB status counters that could be used to observe this?
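A minimal sketch of the skip test, using the server's write_set bitmap; a complete fix would also compare the old and new row images, since a column can be assigned in SET without actually changing value.
// Sketch: true if secondary index `keynr` references any column that the
// UPDATE statement writes to; if false, its entry can be left untouched.
static bool index_may_have_changed(TABLE* table, uint keynr) {
  KEY* key = &table->key_info[keynr];
  for (uint i = 0; i < key->actual_key_parts; i++) {
    Field* field = key->key_part[i].field;
    // write_set marks columns assigned by this UPDATE statement.
    if (bitmap_is_set(table->write_set, field->field_index))
      return true;  // the index references an updated column
  }
  return false;  // safe to leave this index's entry as-is
}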
The rocksdb.delete test case failed because deletes were not visible within the transaction. Transactions are currently started with GetSnapshot().
The initial issue was at:
https://github.com/mariadb-corporation/webscalesql-5.6-rocksdb/issues/18
This issue is very similar to issue #36 ("Range in form tbl.key >= const doesn't work in reverse column family"). They both come from a set of problems discovered when working on issue #16.
create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;
create table t2 (
pk int not null,
a int not null,
b int not null,
primary key(pk),
key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;
The problem query is very similar too:
MySQL [test]> explain select * from t2 force index (a) where a>0;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | t2 | range | a | a | 4 | NULL | 10 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
MySQL [test]> select * from t2 force index (a) where a>0;
Empty set (0.01 sec)
From @mdcallag:
I assume we will put many indexes in one column family, as a column family per index doesn't scale to a large number of indexes. When that is done we need a method for specifying the index columns that will be used in the column family. This assumes that we can use bloom filters on a prefix of the index key, and that the format of an index key within RocksDB is:
...
In that case the options are:
Note that
From @yoshinorim:
I'm trying to compile on Ubuntu 14.04 LTS (gcc version Ubuntu 4.8.2-19ubuntu1), using this line:
cmake . -DCMAKE_BUILD_TYPE=Debug -DWITH_SSL:STRING=system \
-DWITH_ZLIB:STRING=system -DMYSQL_MAINTAINER_MODE=1
and I'm getting two compile errors:
[ 30%] Building CXX object storage/innobase/CMakeFiles/innobase.dir/dict/dict0mem.cc.o
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/storage/innobase/dict/dict0mem.cc: In function 'void dict_mem_index_add_field(dict_index_t*, const char*, ulint)':
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/storage/innobase/dict/dict0mem.cc:678:63: error: passing NULL to non-pointer argument 5 of 'void dict_mem_index_add_field(dict_index_t*, const char*, ulint, char**, uint, unsigned int)' [-Werror=conversion-null]
dict_mem_index_add_field(index, name, prefix_len, 0, NULL, 0);
^
cc1plus: all warnings being treated as errors
make[2]: *** [storage/innobase/CMakeFiles/innobase.dir/dict/dict0mem.cc.o] Error 1
make[2]: Target `storage/innobase/CMakeFiles/innobase.dir/build' not remade because of errors.
make[1]: *** [storage/innobase/CMakeFiles/innobase.dir/all] Error 2
and
Scanning dependencies of target mysqlbinlog
[ 39%] Building CXX object client/CMakeFiles/mysqlbinlog.dir/mysqlbinlog.cc.o
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/client/mysqlbinlog.cc: In function 'Exit_status dump_remote_log_entries(PRINT_EVENT_INFO*, const char*)':
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/client/mysqlbinlog.cc:2409:31: error: format not a string literal and no format arguments [-Werror=format-security]
sprintf(cur_logname, logname);
^
cc1plus: all warnings being treated as errors
make[2]: *** [client/CMakeFiles/mysqlbinlog.dir/mysqlbinlog.cc.o] Error 1
make[2]: Target `client/CMakeFiles/mysqlbinlog.dir/build' not remade because of errors.
From @mdcallag:
I think this depends on basic support for IS.table_statistics getting ported to WebScaleSQL.
From @mdcallag:
We can get an estimate for the row count ('Rows:' below is always 10,000 today) via the RocksDB property "rocksdb.estimate-num-keys"; a sketch follows the output below.
Data_length and index_length are best defined when not using an index-organized table. For an index-organized table like InnoDB, the data_length column includes the size of the PK index (including all columns) and the index_length column has the size of all secondary indexes. We can treat RocksDB in the same way. However, we aren't using file-per-table or file-per-index for RocksDB, so we need to determine whether we can get estimates on index sizes.
mysql -e 'show table status\G' iok1
*************************** 1. row ***************************
Name: ai
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 1000
Avg_row_length: 0
Data_length: 0
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: NULL
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
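A hedged sketch of the row-count part: GetIntProperty is the RocksDB call for integer-valued properties (its availability in the pinned RocksDB version is an assumption), and the result would feed stats.records in ha_rocksdb::info().
#include <cstdint>
#include <rocksdb/db.h>

// Sketch: replace the hardcoded row count with RocksDB's own estimate,
// which covers memtables plus SST files.
uint64_t estimated_rows(rocksdb::DB* db) {
  uint64_t n = 0;
  if (!db->GetIntProperty("rocksdb.estimate-num-keys", &n))
    return 0;  // property unavailable: fall back to the old behavior
  return n;
}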
Originally from discussion on facebook.com:
create table r0 (id int primary key, value int) engine=rocksdb;
insert into r0 values (1,1),(2,2),(3,3);
explain select * from r0 ORDER BY id;
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| 1 | SIMPLE | r0 | ALL | NULL | NULL | NULL | NULL | 1000 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
Actually, the records are read in PK order, so the filesort is redundant.
Properties of a RocksDB-SE table are very similar to those of an InnoDB table with a primary key. With an InnoDB table, the above example gives:
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| 1 | SIMPLE | r0 | index | NULL | PRIMARY | 4 | NULL | 3 | NULL |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
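InnoDB gets this plan because it reports a clustered primary key to the optimizer. A minimal sketch of the analogous declaration for RocksDB-SE, treating the override as an assumption rather than the actual patch:
// Sketch: rows are stored in PK order, so telling the optimizer the PK is
// clustered lets it use an ordered scan on PRIMARY instead of a filesort.
bool ha_rocksdb::primary_key_is_clustered() {
  return true;  // like InnoDB: the PK index holds the full rows, in order
}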
(This is a part of the patch for issue #16.)
Ranges that start with t.key > const are not handled efficiently. One can observe this as follows:
create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
create table t1 (
pk int primary key,
a int,
col1 int,
key(a)
) engine=rocksdb;
insert into t1 select a,a,123 from one_k;
insert into t1
select 1000 + A.a + 1000*B.a, 100, 1234
from one_k A, one_k B where B.a<100;
explain select * from t1 where a>100 and a < 105;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 10 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
Now, set two breakpoints in ha_rocksdb::index_read_map:
run
select * from t1 where a>100 and a < 105;
and see
Breakpoint 6, ha_rocksdb::index_read_map (this=0x7fffa40af3f0, buf=0x7fffa40b7f50 "\377", key=0x7fffa41c8948 "", keypart_map=1, find_flag=HA_READ_AFTER_KEY) at /home/psergey/dev-git/mysql-5.6-rocksdb-look/storage/rocksdb/ha_rocksdb.cc:2631
(gdb) p rocksdb::perf_context.find_next_user_entry_time
$45 = 654578691779
(gdb) p rocksdb::perf_context.internal_key_skipped_count
$46 = 707050
(gdb) c
Continuing.
Breakpoint 7, ha_rocksdb::index_read_map (this=0x7fffa40af3f0, buf=0x7fffa40b7f50 "\371e", key=0x7fffa41c8948 "", keypart_map=1, find_flag=HA_READ_AFTER_KEY) at /home/psergey/dev-git/mysql-5.6-rocksdb-look/storage/rocksdb/ha_rocksdb.cc:2935
(gdb) p rocksdb::perf_context.find_next_user_entry_time
$47 = 654591615526
(gdb) p rocksdb::perf_context.internal_key_skipped_count
$48 = 807051
(gdb) p $48 - $46
$49 = 100001
So we have enumerated 100K rows in order to return 5 rows.
From @mdcallag:
Example output is listed below. There is work in progress to get a "show processlist"-like feature into RocksDB so we can track per-thread states. Output from that can be displayed by SHOW ENGINE ROCKSDB TRANSACTION STATUS.
The other request is to get the age of the per-transaction or per-statement snapshots and iterators in use by a given connection. There is work in progress in RocksDB to track that. But given that the GetSnapshot call is done by our code (on the MySQL side), we can also read the clock when that is done and track the snapshot age ourselves.
Long-open snapshots block purge for InnoDB, which can be bad. Long-open snapshots for RocksDB mean that old versions of some keys cannot be dropped, which creates more CPU overhead during compaction. Long-open iterators mean that files replaced by compaction won't be deleted until the iterators that reference them are closed.
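A minimal sketch of the MySQL-side age tracking mentioned above; the struct and helper names are illustrative.
#include <ctime>
#include <rocksdb/db.h>

// Sketch: remember when each snapshot was taken so SHOW ENGINE ROCKSDB
// TRANSACTION STATUS can print its age as (now - taken_at).
struct Rdb_snapshot_info {
  const rocksdb::Snapshot* snapshot;
  time_t taken_at;
};

Rdb_snapshot_info take_snapshot(rocksdb::DB* db) {
  Rdb_snapshot_info info;
  info.snapshot = db->GetSnapshot();  // the call our code already makes
  info.taken_at = time(nullptr);      // read the clock at the same moment
  return info;
}
The InnoDB report below shows the kind of per-connection output we want to match: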
mysql -h udb5784.prn2 -P3307 -A -e 'show engine innodb transaction status\G'
*************************** 1. row ***************************
Type: InnoDB
Name:
Status:
2014-12-06 08:16:56 7f9722cc4700 INNODB TRANSACTION MONITOR OUTPUT
TRANSACTIONS
Trx id counter 168494781347
Purge done for trx's n:o < 168494781347 undo n:o < 0 state: running but idle
History list length 1675
Lock stats: 0 deadlocks, 0 lock wait timeouts
Commits: 67286553 all, 67126992 with undo
Rollback: 619397 total, 37201404 partial
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 17872, OS thread handle 0x7f9722cc4700, query id 178298354 2401:db00:11:2a:face:0:3b:0 admin3 cleaning up
show engine innodb transaction status
---TRANSACTION 168494781345, not started
MySQL thread id 12, OS thread handle 0x7f9725378700, query id 178298352 Slave has read all relay log; waiting for the slave I/O thread to update it
END OF INNODB TRANSACTION MONITOR OUTPUT