mysqlonrocksdb / mysql-5.6

This project forked from facebook/mysql-5.6

Facebook's branch of the Oracle MySQL v5.6 database

License: GNU General Public License v2.0

Shell 1.11% CMake 0.68% PHP 0.41% C 32.26% C++ 54.99% Perl 3.72% R 0.04% Objective-C 0.64% Pascal 0.11% Groff 0.36% Python 1.39% SQLPL 0.05% Perl 6 0.01% Visual Basic 0.01% DTrace 0.02% Makefile 0.05% PLpgSQL 0.14% Java 3.87% HTML 0.10% GAP 0.08%

mysql-5.6's People

Contributors

agopi, alfranio, bjornmu, bkandasa, blaudden, frazerclement, gkodinov, glebshchepa, gurusami, harinvadodaria, jhauglid, jtolmer, kboortz, marcalff, maykov, nacarvalho, nryeng, phulakun, roylyseng, santoshbanda, satya461, snarkedi, spetrunia, steaphangreene, stewartsmith, thirunarayanan, tianx, vaintroub, vasild, yoshinorim


mysql-5.6's Issues

index_read_map(HA_READ_BEFORE_KEY) does not work in reverse column family

This is a continuation of issue #16.

Let's work with the same dataset:

create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;

create table t2 (
  pk int not null,
  a  int not null,
  b  int not null,
  primary key(pk),
  key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;

Try this:

MySQL [test]> select max(a) from t2 where a < 2;
+--------+
| max(a) |
+--------+
|      2 |
+--------+

This is apparently wrong. For comparison, the same without the index:

MySQL [test]> create table t3 as select * from t2;
Query OK, 1000 rows affected (0.58 sec)
Records: 1000  Duplicates: 0  Warnings: 0

MySQL [test]> select max(a) from t3 where a < 2;
+--------+
| max(a) |
+--------+
|      1 |
+--------+

Populate table index -> column family mappings at start-up

Currently the RocksDB-SE data dict is populated at startup with table -> index mappings, but those indices don't have mappings to their associated column families. The get_cf call returns null.

This relies on MySQL opening the table at some point in order to fill in the column families. For fast drop table, we need access to these column families even if the table hasn't been opened yet.

mysqldump supporting RocksDB

Now that we have CONSISTENT SNAPSHOT support in RocksDB, it's time to make mysqldump work with RocksDB.
At least two features are needed:

  • START TRANSACTION WITH CONSISTENT ROCKSDB SNAPSHOT instead of INNODB SNAPSHOT. This can be done either by passing an option (i.e. --rocksdb) or by detecting whether there are any RocksDB tables.
  • An option (--order-by-primary-desc) to take the dump in descending primary key order. This is efficient if the primary key is in a reverse-ordered column family.

Range in form tbl.key >= const doesn't work in reverse column family

This is a continuation of issue #16.

Run this

create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;

create table t2 (
  pk int not null,
  a  int not null,
  b  int not null,
  primary key(pk),
  key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;

The problem query:

MySQL [test]> explain select * from t2 force index (a) where a>=0;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t2    | range | a             | a    | 4       | NULL |   10 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.01 sec)

MySQL [test]> select * from t2 force index (a) where a>=0;
Empty set (0.01 sec)

The correct result is

MySQL [test]> select * from t2 ignore index (a) where a>=0;
+-----+----+-----+
| pk  | a  | b   |
+-----+----+-----+
|   0 |  0 |   0 |
|   1 |  0 |   1 |
|   2 |  0 |   2 |
|   3 |  0 |   3 |
|   4 |  0 |   4 |
|   5 |  0 |   5 |
.. (total 1000 rows)

Support index-only scans for DATETIME, TIMESTAMP, and DOUBLE

Currently, index-only scans for DATETIME, TIMESTAMP, and DOUBLE columns are not supported.

Testcase:

create table t31 (pk int auto_increment primary key, key1 double, key(key1)) engine=rocksdb;
insert into t31 values (),(),(),(),(),(),(),();
explain select key1 from t31 where key1=1.234;

create table t32 (pk int auto_increment primary key, key1 datetime, key(key1))engine=rocksdb;
insert into t32 values (),(),(),(),(),(),(),();
explain select key1 from t32 where key1='2015-01-01 00:11:12';

create table t33 (pk int auto_increment primary key, key1 timestamp, key(key1))engine=rocksdb;
insert into t33 values (),(),(),(),(),(),(),();
explain select key1 from t33 where key1='2015-01-01 00:11:12';

This task is about supporting them.
DATETIME/TIMESTAMP use Field_temporal_with_date_and_timef::make_sort_key, which just does memcpy(), so their values can be restored from the mem-comparable form as-is.
DOUBLE uses change_double_for_sort(); we will need to code a reverse function.
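
For reference, a minimal sketch of a reversible mem-comparable encoding for DOUBLE using the standard IEEE-754 bit trick. This is an illustration only; it is not necessarily the exact transform change_double_for_sort() performs.

#include <cstdint>
#include <cstring>

// Encode a double into 8 bytes whose memcmp() order matches numeric order:
// flip all bits for negatives, flip only the sign bit for non-negatives,
// then store big-endian.
void double_to_memcmp(double d, unsigned char out[8]) {
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof(bits));
  if (bits & (1ULL << 63))
    bits = ~bits;                 // negative: invert everything to reverse the order
  else
    bits |= (1ULL << 63);         // non-negative: set sign bit so it sorts above negatives
  for (int i = 7; i >= 0; i--) {  // big-endian so a byte-wise compare works
    out[i] = bits & 0xff;
    bits >>= 8;
  }
}

// The inverse: recover the original double from the mem-comparable form.
// Having such an inverse is what makes index-only scans possible.
double memcmp_to_double(const unsigned char in[8]) {
  uint64_t bits = 0;
  for (int i = 0; i < 8; i++)
    bits = (bits << 8) | in[i];
  if (bits & (1ULL << 63))
    bits &= ~(1ULL << 63);        // was non-negative: clear the sign bit
  else
    bits = ~bits;                 // was negative: flip everything back
  double d;
  std::memcpy(&d, &bits, sizeof(d));
  return d;
}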

change SHOW ENGINE ROCKSDB STATUS

From @mdcallag:

For background see http://dev.mysql.com/doc/refman/5.6/en/show-engine.html

We need to change what is in SHOW ENGINE ROCKSDB STATUS. Right now it lists the live SST files, which is a huge list with leveled compaction. For now I prefer to have it show the output from compaction stats. That probably needs to use one of:
db->GetProperty("rocksdb.stats", ...
db->GetProperty("rocksdb.cfstats", ...

*************************** 1. row ***************************
Type: ROCKSDB
Name: live_files
Status: cf=default name=/4908814.sst size=97853952
cf=default name=/4908812.sst size=97879865
cf=default name=/4908807.sst size=97833748
cf=default name=/4905498.sst size=1865749
cf=default name=/4905500.sst size=2670668
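
A minimal sketch of how the status text could be assembled from the GetProperty calls above, assuming an open rocksdb::DB* and the column family handles are available; this is an illustration, not the actual ha_rocksdb code.

#include <string>
#include <vector>
#include "rocksdb/db.h"

std::string build_rocksdb_status(rocksdb::DB* db,
                                 const std::vector<rocksdb::ColumnFamilyHandle*>& cfs) {
  std::string out;
  std::string stats;
  if (db->GetProperty("rocksdb.stats", &stats))          // DB-wide compaction stats
    out += stats;
  for (auto* cf : cfs) {
    std::string cf_stats;
    if (db->GetProperty(cf, "rocksdb.cfstats", &cf_stats))  // per-column-family stats
      out += cf_stats;
  }
  return out;
}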

support binlog + rocksdb group commit

From @mdcallag:

This requires more discussion but many of us are in favor of it.

It might be time to use the binlog as the source of truth to avoid the complexity and inefficiency of keeping RocksDB and the binlog synchronized via internal XA. There are two modes for this. The first mode is durable in which case fsync is done after writing the binlog and RocksDB WAL. The other mode is non-durable in which case fsync might only be done once per second and we rely on lossless semisync to recover. Binlog as source of truth might have been discussed on a MariaDB mail list many years ago - https://lists.launchpad.net/maria-developers/msg01998.html

Some details are at http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html

The new protocol will be:

  1. write binlog
  2. optionally sync binlog
  3. optionally wait for semisync ack
  4. commit rocksdb - this also persists the GTID of the most recent commit within RocksDB, and makes the transaction's changes visible to others
  5. optionally sync rocksdb WAL

When lossless semisync is used we skip steps 2 and 5. When lossless semisync is not used we do step 2 and skip step 3. Step 5 is optional. Recovery in this case is done by:

  1. query RocksDB to determine GTID of last commit it has
  2. extract/replay transactions from binlog >= GTID from previous step

When running in non-durable mode, after a crash one of the following is true (the relation describes which one has more commits):

  1. rocksdb > binlog
  2. binlog > rocksdb
  3. rocksdb == binlog

If you know which state the server is in, then you can reach state 3. If in state 1, append events to the binlog without running them on the engine. If in state 2, replay events to the engine without recording them in the binlog. If in state 3, do nothing. Both RocksDB and the binlog can tell us the last GTID they contain, and we can compare that with the binlog archived via lossless semisync to determine the state.

Use a compaction filter to eventually remove data after DROP TABLE and DROP INDEX

From @mdcallag:

How do we do DROP TABLE and DROP INDEX? Using a column family per index makes this easy (just drop the CF), but I assume we must support N tables or indexes per column family, so we need a compaction filter to eventually remove deleted rows, along with an option to trigger manual compaction after dropping a large table or index to reclaim disk space faster.

Assuming each key in RocksDB has the table/index ID as its prefix, do we track IDs for valid objects, or IDs for dropped objects that still have key-value pairs to be removed? If we track valid IDs, then compaction filters must always do a bloom filter check or hash table lookup to confirm that the IDs in the merged rows are valid, and this can be a big CPU overhead. If we track IDs for dropped objects, then we have a new state management problem: this state must be persisted to survive mysqld restart, and we must also figure out when an ID can be removed from the set. But the benefit is that we only need the compaction filter when that set of IDs is not empty. As an optimization we can avoid the compaction filter checks for merges between L0 and L1, because that path must be as fast as possible.
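
A rough sketch of the "track dropped IDs" variant using RocksDB's CompactionFilter interface. The 4-byte big-endian index-id prefix and the externally persisted set of dropped ids are assumptions for illustration, not the MyRocks format.

#include <cstdint>
#include <set>
#include <string>
#include "rocksdb/compaction_filter.h"
#include "rocksdb/slice.h"

// Keys whose index-id prefix is in the dropped set are removed during compaction.
class DroppedIndexFilter : public rocksdb::CompactionFilter {
 public:
  explicit DroppedIndexFilter(const std::set<uint32_t>* dropped)
      : dropped_(dropped) {}

  bool Filter(int /*level*/, const rocksdb::Slice& key,
              const rocksdb::Slice& /*existing_value*/,
              std::string* /*new_value*/, bool* /*value_changed*/) const override {
    if (key.size() < 4) return false;  // keep malformed/short keys
    const unsigned char* p = reinterpret_cast<const unsigned char*>(key.data());
    uint32_t index_id = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                        (uint32_t(p[2]) << 8) | uint32_t(p[3]);
    return dropped_->count(index_id) != 0;  // true => drop this key-value pair
  }

  const char* Name() const override { return "DroppedIndexFilter"; }

 private:
  const std::set<uint32_t>* dropped_;  // ids of dropped tables/indexes, persisted elsewhere
};

The filter would only need to be consulted while the dropped-id set is non-empty, which matches the trade-off described above.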

Fix test cases which fail with RocksDB SE

Here is the list of tests which fail:
(tests prefixed with # are fixed)

funcs_1.is_columns_mysql : This test fails on RocksDB SE
funcs_1.is_tables_mysql : This test fails on RocksDB SE
innodb.innodb_bug59641 : This test fails on RocksDB SE
innodb.innodb-index-online-fk : This test fails on RocksDB SE
innodb.innodb-system-table-view : This test fails on RocksDB SE
innodb.innodb-tablespace : This test fails on RocksDB SE
innodb_stress.innodb_stress_blob_zipdebug_zlib : This test fails on RocksDB SE
innodb_stress.innodb_stress_mix : This test fails on RocksDB SE
innodb_zip.innodb_16k : This test fails on RocksDB SE
main.bootstrap : This test fails on RocksDB SE
main.connect : This test fails on RocksDB SE
main.hll : This test fails on RocksDB SE
main.innodb_report_age_of_evicted_pages : This test fails on RocksDB SE
main.innodb_snapshot_nobinlog : This test fails on RocksDB SE
main.innodb_snapshot_noinnodb : This test fails on RocksDB SE
main.myisam-blob : This test fails on RocksDB SE
main.mysqlbinlog_gtid : This test fails on RocksDB SE
main.mysqlcheck : This test fails on RocksDB SE
main.mysqld--help-notwin-profiling : This test fails on RocksDB SE
main.mysqld--help-notwin : This test fails on RocksDB SE
main.mysql_embedded : This test fails on RocksDB SE
main.openssl_1 : This test fails on RocksDB SE
main.plugin_auth_qa_1 : This test fails on RocksDB SE
main.plugin_auth_sha256_server_default_tls : This test fails on RocksDB SE
main.plugin_auth_sha256_tls : This test fails on RocksDB SE
main.rocksdb : This test fails on RocksDB SE
main.ssl_8k_key : This test fails on RocksDB SE
main.ssl_cipher : This test fails on RocksDB SE
main.ssl_compress : This test fails on RocksDB SE
main.ssl_connections_count : This test fails on RocksDB SE
main.ssl_connect : This test fails on RocksDB SE
main.ssl : This test fails on RocksDB SE
main.temp_table_cleanup : This test fails on RocksDB SE
main.warnings : This test fails on RocksDB SE
perfschema.aggregate : This test fails on RocksDB SE
perfschema.hostcache_ipv4_auth_plugin : This test fails on RocksDB SE
perfschema.hostcache_ipv6_auth_plugin : This test fails on RocksDB SE
perfschema.no_threads : This test fails on RocksDB SE
perfschema.pfs_upgrade_event : This test fails on RocksDB SE
perfschema.pfs_upgrade_func : This test fails on RocksDB SE
perfschema.pfs_upgrade_proc : This test fails on RocksDB SE
perfschema.pfs_upgrade_table : This test fails on RocksDB SE
perfschema.pfs_upgrade_view : This test fails on RocksDB SE
perfschema.start_server_disable_idle : This test fails on RocksDB SE
perfschema.start_server_disable_stages : This test fails on RocksDB SE
perfschema.start_server_disable_statements : This test fails on RocksDB SE
perfschema.start_server_disable_waits : This test fails on RocksDB SE
perfschema.start_server_innodb : This test fails on RocksDB SE
perfschema.start_server_no_account : This test fails on RocksDB SE
perfschema.start_server_no_cond_class : This test fails on RocksDB SE
perfschema.start_server_no_cond_inst : This test fails on RocksDB SE
perfschema.start_server_no_file_class : This test fails on RocksDB SE
perfschema.start_server_no_file_inst : This test fails on RocksDB SE
perfschema.start_server_no_host : This test fails on RocksDB SE
perfschema.start_server_no_mutex_class : This test fails on RocksDB SE
perfschema.start_server_no_mutex_inst : This test fails on RocksDB SE
perfschema.start_server_no_rwlock_class : This test fails on RocksDB SE
perfschema.start_server_no_rwlock_inst : This test fails on RocksDB SE
perfschema.start_server_no_setup_actors : This test fails on RocksDB SE
perfschema.start_server_no_setup_objects : This test fails on RocksDB SE
perfschema.start_server_no_socket_class : This test fails on RocksDB SE
perfschema.start_server_no_socket_inst : This test fails on RocksDB SE
perfschema.start_server_no_stage_class : This test fails on RocksDB SE
perfschema.start_server_no_stages_history_long : This test fails on RocksDB SE
perfschema.start_server_no_stages_history : This test fails on RocksDB SE
perfschema.start_server_no_statement_class : This test fails on RocksDB SE
perfschema.start_server_no_statements_history_long : This test fails on RocksDB SE
perfschema.start_server_no_statements_history : This test fails on RocksDB SE
perfschema.start_server_no_table_hdl : This test fails on RocksDB SE
perfschema.start_server_no_table_inst : This test fails on RocksDB SE
perfschema.start_server_nothing : This test fails on RocksDB SE
perfschema.start_server_no_thread_class : This test fails on RocksDB SE
perfschema.start_server_no_thread_inst : This test fails on RocksDB SE
perfschema.start_server_no_user : This test fails on RocksDB SE
perfschema.start_server_no_waits_history_long : This test fails on RocksDB SE
perfschema.start_server_no_waits_history : This test fails on RocksDB SE
perfschema.start_server_off : This test fails on RocksDB SE
perfschema.start_server_on : This test fails on RocksDB SE
rpl.rpl_alter_repository : This test fails on RocksDB SE
rpl.rpl_change_master_crash_safe : This test fails on RocksDB SE
rpl.rpl_dynamic_ssl : This test fails on RocksDB SE
rpl.rpl_gtid_crash_safe : This test fails on RocksDB SE
rpl.rpl_heartbeat_ssl : This test fails on RocksDB SE
rpl.rpl_innodb_bug68220 : This test fails on RocksDB SE
rpl.rpl_master_connection : This test fails on RocksDB SE
rpl.rpl_row_crash_safe : This test fails on RocksDB SE
rpl.rpl_ssl1 : This test fails on RocksDB SE
rpl.rpl_ssl : This test fails on RocksDB SE
rpl.rpl_stm_mixed_mts_rec_crash_safe_small : This test fails on RocksDB SE
sys_vars.all_vars : This test fails on RocksDB SE

New Data Dictionary and atomic dictionary operations

Since we need to manage more dictionary information, it is time to define a better dictionary data model. This has to be done before rolling out widely.

Candidate dictionary model after discussion:

  • Table Name => internal index ids
    key: RDBSE_KEYDEF::DDL_ENTRY_INDEX_START_NUMBER(0x1) + dbname.tablename
    value: version, {index_id}*n_indexes_of_the_table
  • Internal index id => CF id
    key: RDBSE_KEYDEF::INDEX_CF_MAPPING(0x2) + index_id
    value: version, cf_id
  • CF id => cf flags
    key: RDBSE_KEYDEF::CF_DEFINITION(0x3) + cf_id
    value: version, {is_reverse_cf, is_auto_cf}
  • Ongoing drop index entry
    key: RDBSE_KEYDEF::DDL_DROP_INDEX_ONGOING(0x4) + index_id
    value: version
  • Binlog entry (updated at commit)
    key: RDBSE_KEYDEF::BINLOG_INFO_INDEX_NUMBER (0x5)
    value: version, {binlog_name,binlog_pos,binlog_gtid}
  • table_stats (same as innodb)
    key: RDBSE_KEYDEF::TABLE_STATISTICS(0x6) + db_name.tablename
    value: version, {n_rows, clustered_index_size, sum_of_other_index_sizes, last_update}
  • index_stats (same as innodb)
    key: RDBSE_KEYDEF::INDEX_STATISTICS(0x7) + index_id
    value: version, {stat_value, sample_size, last_update, stat_description}

side notes:

  • We agreed to have a dedicated column family for the data dictionary.
  • The new data dictionary is not compatible with the current one. This is fine as long as we're in the alpha stage.
  • Add a version number (possibly 2 bytes) to make further format changes easier.
  • Implementation-wise, I think extending Table_ddl_manager would be fine.
  • DDL operations should be atomic. For example, when adding a new index with a new column family, it is necessary to call Put() three times. They have to be atomic -- using WriteBatch() to include all Puts (see the sketch after this list).
  • Currently the dictionary is cached in ddl_hash. I think table_name -> index_id -> cf_id may be managed via a mysql hash as well, in unnormalized format, for performance reasons.
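
A minimal sketch of the atomicity point above: putting all dictionary records for a new index into a single WriteBatch so they commit together. The key/value encoding parameters are placeholders, not the actual MyRocks helpers.

#include <string>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

rocksdb::Status add_index_dictionary_entries(
    rocksdb::DB* db, rocksdb::ColumnFamilyHandle* dict_cf,
    const std::string& table_key,    // 0x1 + dbname.tablename
    const std::string& table_value,  // version, index ids
    const std::string& index_key,    // 0x2 + index_id
    const std::string& index_value,  // version, cf_id
    const std::string& cf_key,       // 0x3 + cf_id
    const std::string& cf_value) {   // version, cf flags
  rocksdb::WriteBatch batch;
  batch.Put(dict_cf, table_key, table_value);
  batch.Put(dict_cf, index_key, index_value);
  batch.Put(dict_cf, cf_key, cf_value);
  rocksdb::WriteOptions opts;
  opts.sync = true;                  // dictionary changes should be durable
  return db->Write(opts, &batch);    // all three Puts commit atomically
}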

Support index-only scans for collations other than _bin

Currently, index-only scans are supported for

  • numeric columns
  • varchar columns with BINARY, latin1_bin, utf8_bin collation.

For other collations (e.g. case-insensitive _ci collations), index-only scans are not supported. The reason is that it is not possible to restore the original column value from its mem-comparable key. For example, in latin1_general_ci 'foo', 'Foo', and 'FOO' all have the mem-comparable form 'FOO'.

A possible solution could work like this:

  1. In addition to the value -> mem_comparable_form function, develop a value -> (mem_comparable_form, restore_data) function. This is easy for some charsets.
  2. Store restore_data in the value part of the RocksDB key-value pair. We used to store information about VARCHAR field lengths there, but now the value part is unused. (See the sketch below.)
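
A toy sketch of the idea for a simple single-byte case-insensitive collation. The helpers are hypothetical; real collations such as utf8_general_ci would need a more involved restore_data format.

#include <cctype>
#include <string>
#include <utility>

// mem-comparable form = upper-cased string; restore_data = one bit per
// character recording whether the original character was lower case.
std::pair<std::string, std::string> pack_ci(const std::string& val) {
  std::string mem_cmp(val.size(), '\0');
  std::string restore((val.size() + 7) / 8, '\0');
  for (size_t i = 0; i < val.size(); i++) {
    unsigned char c = val[i];
    mem_cmp[i] = static_cast<char>(std::toupper(c));
    if (std::islower(c))
      restore[i / 8] |= 1 << (i % 8);  // remember the original case
  }
  return {mem_cmp, restore};           // (index key part, value part)
}

// Rebuild the original column value from the index key plus restore_data,
// which is exactly what an index-only scan needs to do.
std::string unpack_ci(const std::string& mem_cmp, const std::string& restore) {
  std::string val = mem_cmp;
  for (size_t i = 0; i < val.size(); i++)
    if (restore[i / 8] & (1 << (i % 8)))
      val[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(val[i])));
  return val;
}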

See also:

Point lookup by reverse column family doesn't work

How to repeat:
CREATE TABLE linktable (
id1 bigint(20) unsigned NOT NULL DEFAULT '0',
id2 bigint(20) unsigned NOT NULL DEFAULT '0',
link_type bigint(20) unsigned NOT NULL DEFAULT '0',
visibility tinyint(3) NOT NULL DEFAULT '0',
data varchar(255) COLLATE latin1_bin NOT NULL DEFAULT '',
time bigint(20) unsigned NOT NULL DEFAULT '0',
version int(11) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (link_type,id1,id2) COMMENT 'rev:cf_primary',
KEY id1_type (id1,link_type,visibility,time,id2,version,data)
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;
insert into linktable values (1,1,1,1,1,1,1),(2,2,2,2,2,2,2),(3,3,3,3,3,3,3),(4,4,4,4,4,4,4),(5,5,5,5,5,5,5),(6,6,6,6,6,6,6),(7,7,7,7,7,7,7),(8,8,8,8,8,8,8),(9,9,9,9,9,9,9),(10,10,10,10,10,10,10),(11,11,11,11,11,11,11),(12,12,12,12,12,12,12),(13,13,13,13,13,13,13),(14,14,14,14,14,14,14),(15,15,15,15,15,15,15),(16,16,16,16,16,16,16),(17,17,17,17,17,17,17),(18,18,18,18,18,18,18),(19,19,19,19,19,19,19),(20,20,20,20,20,20,20);

mysql> select * from linktable force index(PRIMARY) where link_type=1;
Empty set (0.00 sec)

mysql> select * from linktable force index(id1_type) where link_type=1;
+-----+-----+-----------+------------+------+------+---------+
| id1 | id2 | link_type | visibility | data | time | version |
+-----+-----+-----------+------------+------+------+---------+
|   1 |   1 |         1 |          1 | 1    |    1 |       1 |
+-----+-----+-----------+------------+------+------+---------+
1 row in set (0.00 sec)

mysql> update linktable set version=100 where link_type=1;
Query OK, 0 rows affected (0.00 sec)
Rows matched: 0 Changed: 0 Warnings: 0

mysql> update linktable set version=100 where id1=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

It doesn't work with reverse cf on secondary index, either:
CREATE TABLE linktable (
id1 bigint(20) unsigned NOT NULL DEFAULT '0',
id2 bigint(20) unsigned NOT NULL DEFAULT '0',
link_type bigint(20) unsigned NOT NULL DEFAULT '0',
visibility tinyint(3) NOT NULL DEFAULT '0',
data varchar(255) COLLATE latin1_bin NOT NULL DEFAULT '',
time bigint(20) unsigned NOT NULL DEFAULT '0',
version int(11) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (link_type,id1,id2),
KEY id1_type (id1,link_type,visibility,time,id2,version,data) COMMENT 'rev:cf_id1_type'
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;
insert into linktable values (1,1,1,1,1,1,1),(2,2,2,2,2,2,2),(3,3,3,3,3,3,3),(4,4,4,4,4,4,4),(5,5,5,5,5,5,5),(6,6,6,6,6,6,6),(7,7,7,7,7,7,7),(8,8,8,8,8,8,8),(9,9,9,9,9,9,9),(10,10,10,10,10,10,10),(11,11,11,11,11,11,11),(12,12,12,12,12,12,12),(13,13,13,13,13,13,13),(14,14,14,14,14,14,14),(15,15,15,15,15,15,15),(16,16,16,16,16,16,16),(17,17,17,17,17,17,17),(18,18,18,18,18,18,18),(19,19,19,19,19,19,19),(20,20,20,20,20,20,20);

mysql> select * from linktable force index(id1_type) where id1=1;
Empty set (0.00 sec)

mysql> select * from linktable force index(PRIMARY) where id1=1;
+-----+-----+-----------+------------+------+------+---------+
| id1 | id2 | link_type | visibility | data | time | version |
+-----+-----+-----------+------------+------+------+---------+
|   1 |   1 |         1 |          1 | 1    |    1 |       1 |
+-----+-----+-----------+------------+------+------+---------+
1 row in set (0.00 sec)

mysql> update linktable set version=100 where id1=10;
Query OK, 0 rows affected (0.00 sec)
Rows matched: 0 Changed: 0 Warnings: 0

mysql> update linktable set version=100 where link_type=1;
Query OK, 1 row affected (0.03 sec)
Rows matched: 1 Changed: 1 Warnings: 0

index_read_map(HA_READ_PREFIX_LAST) does not work in reverse CF

This is a continuation of issue #16. It is the last part.

Let's work with the same dataset:

create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;

create table t2 (
  pk int not null,
  a  int not null,
  b  int not null,
  primary key(pk),
  key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;

Try this:

MySQL [j5]>  select * from t2 where a between 98 and 2000 order by a desc; 
Empty set (4.83 sec)

The correct result should be:

MySQL [j5]>  select * from t2 use index() where a between 98 and 2000 order by a desc; 
+-----+----+-----+
| pk  | a  | b   |
+-----+----+-----+
| 999 | 99 | 999 |
| 991 | 99 | 991 |
...
| 980 | 98 | 980 |
+-----+----+-----+
20 rows in set (0.00 sec)

SELECT ... LOCK IN SHARE MODE doesn't work

create table r1 (id1 int, id2 int, value int, primary key (id1, id2)) engine=rocksdb;
insert into r1 values (1,1,1), (2,2,2), (3,3,3), (4,4,4), (5,5,5);

T1:
begin;
update r1 set value=value+1 where id1=2 and id2=2;

T2:
begin;
select * from r1 where id1=2 and id2=2 lock in share mode;
=> not blocked

Doubling space by primary key

From @yoshinorim:

I just noticed that primary keys are stored twice in RocksDB SE, as both key and value, even though they do not need to be stored in the value.

How to repeat:

10 column primary key

create table r1 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10)
) engine=rocksdb charset=latin1 collate=latin1_bin;

1 column primary key

create table r2 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1)
) engine=rocksdb charset=latin1 collate=latin1_bin;

Insert 2 million rows into each of these tables, with values sequentially increasing from 1 to 2 million (all columns have the same value within a row). The data can be generated like this:
for (my $i = 1; $i <= 2000000; $i++) {
  for (my $j = 0; $j < 12; $j++) {
    if ($j < 11) {
      print "$i,";
    } else {
      print "$i\n";
    }
  }
}

Then make compaction happen.

Size in KB:

RocksDB, id1-id10 pk: 353764
RocksDB, id1 pk: 212984
InnoDB, id1-id10 pk: 264241
InnoDB, id1 pk: 260046

So basically long primary keys increase on-disk space.

I also checked via gdb (setting a breakpoint at rocksdb::WriteBatch::Put()) and confirmed it.
insert into r1 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$1 = 84
(gdb) p value.size_
$2 = 96

insert into r2 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$3 = 12
(gdb) p value.size_
$4 = 96

I expected value.size_ to be 12 and 84, respectively. Since the fields are stored in both the key and the value, the duplicated fields could be removed from the value.

Add option to enable/disable fsync on commit

From @mdcallag:

I want the following for RocksDB. This is similar to innodb_flush_log_at_trx_commit which uses an integer set to 0, 1 or 2

  1. A global option (rocksdb_sync_on_commit) to either
    a) disable fsync-on-commit
    b) enable fsync-on-commit
    c) do an fsync for the WAL once per second in the background.

We need a) and b) before doing c) so this can be done in two diffs. Doing c) requires a background thread. See srv_master_thread in innodb source code for tips on doing that -- both thread start and thread stop on db shutdown.

Additionally it would be good to have a per-session variable that can override these values. For the per-session variable see bulk_load in ha_rocksdb.cc at https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/rocksdb/ha_rocksdb.cc#L128

Whether to sync the RocksDB WAL is specified by the RocksDB API user via WriteOptions::sync.
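
A sketch of how the proposed modes could map onto the RocksDB write path. The enum and function are assumptions for illustration, not the actual option plumbing.

#include "rocksdb/options.h"

// a) fsync-on-commit disabled   -> WriteOptions::sync = false
// b) fsync-on-commit enabled    -> WriteOptions::sync = true
// c) background once-per-second -> sync = false here, plus the background
//    thread described above that periodically syncs the WAL.
enum class SyncMode { kDisabled, kEnabled, kBackground };

rocksdb::WriteOptions commit_write_options(SyncMode mode) {
  rocksdb::WriteOptions opts;
  opts.sync = (mode == SyncMode::kEnabled);  // fsync the WAL as part of this commit's Write()
  return opts;
}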

io_perf_start/io_perf_end_and_record calls are not matched

ha_rocksdb::index_read_map() calls io_perf_start() at function start, but then there are exits from the function that do not call io_perf_end_and_record().

One example scenario:

CREATE TABLE t1 (pk int, a int, PRIMARY KEY (pk), KEY(a) ) ENGINE=ROCKSDB; 

Insert some rows into t1. Suppose the row with a=10 doesn't exist and the row with a=11 does.

Then, run the query:

select * from t1 where a in (10,11);

And observe:

  Breakpoint 6, ha_rocksdb::io_perf_start (this=0x7fffa8009df0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look-2/storage/rocksdb/ha_rocksdb.cc:2047
... 
  Breakpoint 6, ha_rocksdb::io_perf_start (this=0x7fffa8009df0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look-2/storage/rocksdb/ha_rocksdb.cc:2047
...
  Breakpoint 7, ha_rocksdb::io_perf_end_and_record (this=0x7fffa8009df0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look-2/storage/rocksdb/ha_rocksdb.cc:2056

The first hit is from the index lookup for a=10.
The other two are from the lookup for a=11.
Note that the second call of breakpoint 6 wipes out the data collected since the first invocation.
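
One possible way to fix this (a sketch, not the actual patch) is a small RAII guard so that every return path from index_read_map() ends the measurement exactly once:

#include <functional>
#include <utility>

// Runs the supplied callback when the scope is left, on any return path.
class Scope_guard {
 public:
  explicit Scope_guard(std::function<void()> on_exit) : on_exit_(std::move(on_exit)) {}
  ~Scope_guard() { on_exit_(); }
  Scope_guard(const Scope_guard&) = delete;
  Scope_guard& operator=(const Scope_guard&) = delete;
 private:
  std::function<void()> on_exit_;
};

// Inside ha_rocksdb::index_read_map() it could be used like this:
//
//   io_perf_start();
//   Scope_guard guard([this] { io_perf_end_and_record(); });
//   ...  // any early return now records the perf counters exactly once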

Implement Read Free Replication

This is about implementing a read-free slave, similar to what TokuDB has recently implemented.

The idea is that one can process RBR events without issuing Get() calls or scans; the events carry sufficient information to issue Put/Delete calls directly.

RocksDB SE will support RBR only, so the RBR restriction is not a problem.

Make crash safe slave work

Currently crash safe slave doesn't work with the RocksDB storage engine. This is because the Xid binlog event is not written at transaction commit for RocksDB; a Query event is written instead. This needs to be changed. To make the Xid event be written at RocksDB commit, a handler::prepare() method needs to be added (an empty definition is fine).

In addition, the slave_relay_log_info, slave_gtid_info, and slave_worker_info tables need to be changed to RocksDB.

Create an online physical backup tool (similar to xtrabackup)

From @yoshinorim:

We need an online physical backup tool, similar to xtrabackup for InnoDB. It should meet at least the following requirements:

  • Streaming backup (copying directly to a remote host, since the local host may not have enough space for backups)
  • Can record a consistent binlog position or GTID (so that we can create slave instances easily)
  • Backup can finish within a reasonable time (close to the disk read, transfer, and write time of the target data volume, plus applying logs)
  • No excessive write performance drop while taking backups
  • No excessive read performance drop while taking backups

Nice to have:

  • Incremental backup
  • Partial backups (per CF)

Fix slow seek/prev/next after bulk delete

If there are many tombstone rows, Seek()/Prev()/Next() may take (potentially much) longer, because they have to scan over many tombstones. This is a RocksDB-internal issue, and MyRocks is affected by the limitation. We need to find ways to fix it, either by fixing RocksDB itself or by adding workarounds on the MyRocks side. The latter would likely mean an automatic, adaptive compaction thread driven by tombstone row statistics.

slave crash on update with row based binary logging

How to repeat:

  1. Create regular master/slave instances.
  2. Run the following statements on the master. The slave then crashes on the update.
    create table x (id int primary key, value int, value2 int, index(value)) engine=rocksdb;
    insert into x values (1,1,1);
    insert into x values (2,1,1);
    insert into x values (3,1,1);
    insert into x values (4,1,1);
    insert into x values (5,1,1);
    update x set value2=100 where id=1;

On my environment, mysqld crashed at ha_rocksdb::position().
#0  ha_rocksdb::position (this=0x7f235d432910, record=0x7f235d503410 "\371\001") at /data/users/yoshinori/mysql/5.6/storage/rocksdb/ha_rocksdb.cc:3152
#1  0x0000000000636b86 in handler::rnd_pos_by_record (this=0x7f235d432910, record=0x7f235d503410 "\371\001") at /data/users/yoshinori/mysql/5.6/sql/handler.h:2412
#2  0x00000000008ebfcb in Rows_log_event::do_index_scan_and_update (this=this@entry=0x7f235d48bf00, rli=rli@entry=0x7f2363ba6000) at /data/users/yoshinori/mysql/5.6/sql/log_event.cc:10825
#3  0x00000000008eb18e in Rows_log_event::do_apply_event (this=0x7f235d48bf00, rli=0x7f2363ba6000) at /data/users/yoshinori/mysql/5.6/sql/log_event.cc:11709
#4  0x00000000008ee2b7 in Log_event::apply_event (this=this@entry=0x7f235d48bf00, rli=rli@entry=0x7f2363ba6000) at /data/users/yoshinori/mysql/5.6/sql/log_event.cc:3345
#5  0x0000000000918d8c in apply_event_and_update_pos (ptr_ev=ptr_ev@entry=0x7f2386eac670, thd=thd@entry=0x7f235d429000, rli=rli@entry=0x7f2363ba6000) at /data/users/yoshinori/mysql/5.6/sql/rpl_slave.cc:3575
#6  0x000000000091bf89 in exec_relay_log_event (rli=0x7f2363ba6000, thd=0x7f235d429000) at /data/users/yoshinori/mysql/5.6/sql/rpl_slave.cc:4122

last_rowkey was null.
(gdb) p ref
$1 = (uchar *) 0x7f235d503738 ""
(gdb) p last_rowkey
$2 = {Ptr = 0x0, str_length = 0, Alloced_length = 0, alloced = false, str_charset = 0x15d1a60 <my_charset_bin>}
(gdb) p ref_length
$3 = 8

Don't update indexes if index values have not changed

Currently, ha_rocksdb::update_row() updates all indexes, even those for which no indexed column has changed. We should only update indexes whose values have actually changed.

A question: are there any rocksdb status counters that could be used to observe this?
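
A sketch of the check this issue asks for; the packing helper named in the usage comment is a hypothetical stand-in for whatever MyRocks uses to build secondary index keys.

#include <cstddef>
#include <cstring>

// Return true when the packed secondary index entry differs between the old
// and new row images; only then does update_row() need to touch the index.
bool index_entry_changed(const unsigned char* old_packed, std::size_t old_len,
                         const unsigned char* new_packed, std::size_t new_len) {
  if (old_len != new_len) return true;
  return std::memcmp(old_packed, new_packed, old_len) != 0;
}

// Pseudo-usage inside ha_rocksdb::update_row(), for each secondary index i
// (pack_index_key() is a hypothetical helper, not the real function name):
//
//   std::size_t old_len = pack_index_key(i, old_data, old_key_buf);
//   std::size_t new_len = pack_index_key(i, new_data, new_key_buf);
//   if (!index_entry_changed(old_key_buf, old_len, new_key_buf, new_len))
//     continue;  // index value unchanged, skip the Delete + Put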

Range in form tbl.key > const doesn't work in reverse column family

This issue is very similar to issue #36 ("Range in form tbl.key >= const doesn't work in reverse column family"). They both come from a set of problems discovered when working on issue #16.

create table t0 (a int);
insert into t0 values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
create table t1(a int);
insert into t1 select A.a + B.a* 10 + C.a * 100 from t0 A, t0 B, t0 C;

create table t2 (
  pk int not null,
  a  int not null,
  b  int not null,
  primary key(pk),
  key(a) comment 'rev:cf1'
) engine=rocksdb;
insert into t2 select A.a, FLOOR(A.a/10), A.a from t1 A;

The problem query is very similar too:

MySQL [test]> explain select * from t2 force index (a) where a>0;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t2    | range | a             | a    | 4       | NULL |   10 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

MySQL [test]> select * from t2 force index (a) where a>0;
Empty set (0.01 sec)

Bloom filters and column families

From @mdcallag:

I assume we will put many indexes in one column family, as a column family per index doesn't scale to a large number of indexes. When that is done we need a method for specifying the index columns that will be used in the column family. This assumes that we can use bloom filters on a prefix of the index key and that the format of an index key within RocksDB is:
...

In that case the options are:

  1. no bloom filter
  2. bloom filter on index ID
  3. bloom filter on index ID, index column 1, ..., index column N-1>
  4. bloom filter on index ID, index column 1, ..., index column N>

Note that:

  • 1) and 3) above are on a prefix of the visible index columns
  • 2), 3), and 4) require all indexes in the column family to have a common prefix in columns

This requires much more discussion.

Compile failure on Ubuntu 14.04 LTS with -DMYSQL_MAINTAINER_MODE=1

I'm trying to compile on Ubuntu 14.04 LTS (gcc version Ubuntu 4.8.2-19ubuntu1), using this line:

cmake . -DCMAKE_BUILD_TYPE=Debug -DWITH_SSL:STRING=system \
-DWITH_ZLIB:STRING=system -DMYSQL_MAINTAINER_MODE=1

and I'm getting two compile errors:

[ 30%] Building CXX object storage/innobase/CMakeFiles/innobase.dir/dict/dict0mem.cc.o
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/storage/innobase/dict/dict0mem.cc: In function 'void dict_mem_index_add_field(dict_index_t*, const char*, ulint)':
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/storage/innobase/dict/dict0mem.cc:678:63: error: passing NULL to non-pointer argument 5 of 'void dict_mem_index_add_field(dict_index_t*, const char*, ulint, char**, uint, unsigned int)' [-Werror=conversion-null]
   dict_mem_index_add_field(index, name, prefix_len, 0, NULL, 0);
                                                               ^
cc1plus: all warnings being treated as errors
make[2]: *** [storage/innobase/CMakeFiles/innobase.dir/dict/dict0mem.cc.o] Error 1
make[2]: Target `storage/innobase/CMakeFiles/innobase.dir/build' not remade because of errors.
make[1]: *** [storage/innobase/CMakeFiles/innobase.dir/all] Error 2

and

Scanning dependencies of target mysqlbinlog
[ 39%] Building CXX object client/CMakeFiles/mysqlbinlog.dir/mysqlbinlog.cc.o
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/client/mysqlbinlog.cc: In function 'Exit_status dump_remote_log_entries(PRINT_EVENT_INFO*, const char*)':
/home/psergey/dev-git/mysql-5.6-rocksdb-fixcompile/client/mysqlbinlog.cc:2409:31: error: format not a string literal and no format arguments [-Werror=format-security]
   sprintf(cur_logname, logname);
                               ^
cc1plus: all warnings being treated as errors
make[2]: *** [client/CMakeFiles/mysqlbinlog.dir/mysqlbinlog.cc.o] Error 1
make[2]: Target `client/CMakeFiles/mysqlbinlog.dir/build' not remade because of errors.

get correct row counts into SHOW TABLE STATUS

From @mdcallag:

We can get an estimate for the row count ('Rows:' below is always 10,000 today) via the RocksDB property "rocksdb.estimate-num-keys".

Data_length and index_length are best defined when not using an index-organized table. For an index-organized table like InnoDB, the data_length column includes the size of the PK index (including all columns) and the index_length column has the size of all secondary indexes. We can treat RocksDB in the same way. However, we aren't using file-per-table or file-per-index for RocksDB so we need to determine whether we can get estimates on index sizes.

mysql -e 'show table status\G' iok1

*************************** 1. row ***************************
Name: ai
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 1000
Avg_row_length: 0
Data_length: 0
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: NULL
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
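
A sketch of the RocksDB call for the row estimate mentioned above, assuming a RocksDB version that provides GetIntProperty. Mapping the per-column-family estimate onto individual tables (and onto Data_length/Index_length) is the part that still needs design.

#include <cstdint>
#include "rocksdb/db.h"

uint64_t estimated_num_keys(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  uint64_t n = 0;
  if (!db->GetIntProperty(cf, "rocksdb.estimate-num-keys", &n))
    n = 0;  // property unavailable; caller falls back to a default
  return n;
}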

SELECT ... FROM rocksdb_table ORDER BY primary_key uses sorting

Originally from discussion on facebook.com:

create table r0 (id int primary key, value int) engine=rocksdb;
insert into r0 values (1,1),(2,2),(3,3);
explain select * from r0 ORDER BY id;
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
|  1 | SIMPLE      | r0    | ALL  | NULL          | NULL | NULL    | NULL | 1000 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+

Actually, the records are read in the PK order, so filesort is redundant.

The properties of a RocksDB-SE table are very similar to those of an InnoDB table with a primary key. With an InnoDB table and the above example, one gets:

+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
|  1 | SIMPLE      | r0    | index | NULL          | PRIMARY | 4       | NULL |    3 | NULL  |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+

Ranges on t.key > const are not handled efficiently

(This is part of the patch for issue #16.)

Ranges that start with t.key > const are not handled efficiently. One can observe this as follows:

create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table t1 (
  pk int primary key,
  a int,
  col1 int,
  key(a)
) engine=rocksdb; 

insert into t1 select a,a,123 from one_k;

insert into t1 
select 1000 + A.a + 1000*B.a, 100, 1234 
from one_k A, one_k B where B.a<100;
explain  select * from t1 where a>100 and a < 105;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t1    | range | a             | a    | 5       | NULL |   10 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+

Now, set two breakpoints in ha_rocksdb::index_read_map:

  • a breakpoint at io_perf_start() call
  • a breakpoint at the last io_perf_end_and_record() call.

run

select * from t1 where a>100 and a < 105;

and see

  Breakpoint 6, ha_rocksdb::index_read_map (this=0x7fffa40af3f0, buf=0x7fffa40b7f50 "\377", key=0x7fffa41c8948 "", keypart_map=1, find_flag=HA_READ_AFTER_KEY) at /home/psergey/dev-git/mysql-5.6-rocksdb-look/storage/rocksdb/ha_rocksdb.cc:2631
(gdb) p rocksdb::perf_context.find_next_user_entry_time
  $45 = 654578691779
(gdb) p rocksdb::perf_context.internal_key_skipped_count
  $46 = 707050
(gdb) c
  Continuing.

  Breakpoint 7, ha_rocksdb::index_read_map (this=0x7fffa40af3f0, buf=0x7fffa40b7f50 "\371e", key=0x7fffa41c8948 "", keypart_map=1, find_flag=HA_READ_AFTER_KEY) at /home/psergey/dev-git/mysql-5.6-rocksdb-look/storage/rocksdb/ha_rocksdb.cc:2935
(gdb) p rocksdb::perf_context.find_next_user_entry_time
  $47 = 654591615526
(gdb) p rocksdb::perf_context.internal_key_skipped_count
  $48 = 807051
(gdb) p $48 - $46
  $49 = 100001

So we have enumerated 100K rows in order to return 5 rows.

Get something like SHOW ENGINE INNODB TRANSACTION STATUS, but for RocksDB

From @mdcallag:

Example output is listed below. There is work-in-progress to get a "show processlist" like feature into RocksDB so we can track per-thread states. Output from that can be displayed during SHOW ENGINE ROCKSDB TRANSACTION STATUS.

The other request is to get the age of the per-transaction or per-statement snapshots and iterators in use by a given connection. There is work-in-progress in RocksDB to track that. But given that the GetSnapshot call is done by our code (on the MySQL side), we can also read the clock when that is done and track the snapshot age.

Long-open snapshots block purge for InnoDB which can be bad. Long open snapshots for RocksDB mean that old versions of some keys cannot be dropped and creates more CPU overhead during compaction. Long open iterators mean that files replaced by compaction won't be deleted until the iterators that reference them are closed.
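
A sketch of the "read the clock at GetSnapshot" idea; the struct and where it lives per connection are assumptions, not existing MyRocks code.

#include <chrono>
#include <cstdint>
#include "rocksdb/db.h"

// Remember when the snapshot was taken so the transaction status output can
// print its age.
struct Rdb_snapshot_info {
  const rocksdb::Snapshot* snapshot = nullptr;
  std::chrono::steady_clock::time_point taken_at;
};

Rdb_snapshot_info acquire_snapshot(rocksdb::DB* db) {
  Rdb_snapshot_info info;
  info.snapshot = db->GetSnapshot();                 // done on the MySQL side
  info.taken_at = std::chrono::steady_clock::now();  // record when it was taken
  return info;
}

// Age in seconds, for display in SHOW ENGINE ROCKSDB TRANSACTION STATUS.
int64_t snapshot_age_seconds(const Rdb_snapshot_info& info) {
  using namespace std::chrono;
  return duration_cast<seconds>(steady_clock::now() - info.taken_at).count();
}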

mysql -h udb5784.prn2 -P3307 -A -e 'show engine innodb transaction status\G'
*************************** 1. row ***************************
Type: InnoDB
Name:

Status:

2014-12-06 08:16:56 7f9722cc4700 INNODB TRANSACTION MONITOR OUTPUT

TRANSACTIONS

Trx id counter 168494781347
Purge done for trx's n:o < 168494781347 undo n:o < 0 state: running but idle
History list length 1675
Lock stats: 0 deadlocks, 0 lock wait timeouts
Commits: 67286553 all, 67126992 with undo
Rollback: 619397 total, 37201404 partial
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 17872, OS thread handle 0x7f9722cc4700, query id 178298354 2401:db00:11:2a:face:0:3b:0 admin3 cleaning up
show engine innodb transaction status
---TRANSACTION 168494781345, not started

MySQL thread id 12, OS thread handle 0x7f9725378700, query id 178298352 Slave has read all relay log; waiting for the slave I/O thread to update it

END OF INNODB TRANSACTION MONITOR OUTPUT
