
shopify / ghostferry


The swiss army knife of live data migrations

Home Page: https://shopify.github.io/ghostferry

License: MIT License

Makefile 0.32% Go 57.16% TLA 4.82% HTML 1.05% CSS 0.18% Shell 0.04% Ruby 13.97% Nix 0.48% Jupyter Notebook 21.98%

ghostferry's Introduction

Ghostferry

Ghostferry is a library that enables you to selectively copy data from one MySQL instance to another with a minimal amount of downtime.

It is inspired by GitHub's gh-ost. However, instead of copying data from and to the same database, Ghostferry copies data from one database to another and has the ability to copy data only partially.

There is an example application called ghostferry-copydb included (under the copydb directory) that demonstrates this library by copying an entire database from one machine to another.

Talk to us on IRC at irc.freenode.net #ghostferry.

Overview of How it Works

An overview of Ghostferry's high-level design is expressed in the TLA+ specification, under the tlaplus directory. It may be good to consult it, as it provides a concise definition. However, the specification might not be entirely correct, as proofs remain elusive.

At a high level, Ghostferry is broken into several components that enable it to copy data. This is documented at https://shopify.github.io/ghostferry/main/technicaloverview.html

Development Setup

Installation

For Internal Contributors

dev up

For External Contributors

  • Have Docker installed
  • Clone the repo
  • docker-compose up -d
  • nix-shell

Testing

Run all tests

  • make test

Run example copydb usage

  • make copydb && ghostferry-copydb -verbose examples/copydb/conf.json
  • For a more detailed tutorial, see the documentation.

Ruby Integration Tests

Kindly take note of the following options:

  • DEBUG=1: To see more detailed debug output from Ghostferry live, as opposed to only when the test fails. This is helpful for debugging hanging tests.

Example:

DEBUG=1 ruby test/main.rb -v -n "TrivialIntegrationTests#test_logged_query_omits_columns"

ghostferry's People

Contributors

coding-chimp, daniellaniyo, dazworrall, driv3r, everpcpc, fjordan, floriecai, fw42, hemeraone, hkdsun, imjching-shopify, insom, jahfer, karanthukral, kevin-johnson-shopify, kgalieva, kirs, kolbitsch-lastline, lstp, manan007224, paarthmadan, pawandubey, pushrax, shivnagarajan, shuhaowu, tclh123, tiwilliam, tufanbarisyildirim, xliang6, yevsafronov


ghostferry's Issues

Ghostferry interrupt/resume meta issue

Several steps are required to get to a resumable Ghostferry that can handle schema changes. This issue outlines the required steps in order.

To achieve this goal, we removed the IterativeVerifier and replaced it with the InlineVerifier. The InlineVerifier is conceptually simpler than the IterativeVerifier. The interrupt/resume code is less complicated than it would have been for the IterativeVerifier.

Phase 1: Interrupt/Resume Basic Ghostferry

  1. Ghostferry copy interruption/resume: DONE
  • This PR hooks up the StateTracker to a signal handler and allows the normal copy phase of Ghostferry to be interrupted and resumed.
  • This code will be changed slightly in subsequent PRs when we hook up the IterativeVerifier.
  2. Ruby integration tests for interrupt/resume: Done
  • This PR adds the Ruby integration framework to allow interrupt/resume to be properly tested.
  • The framework compiles a specialized Ghostferry that can be paused by the Ruby framework code via a Unix socket.
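
For illustration only, a minimal Go sketch of the signal-handler idea; dumpState and the JSON fields are placeholders, not Ghostferry's actual StateTracker API:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// dumpState stands in for serializing whatever copy/binlog positions the
// state tracker holds; the real field names in Ghostferry may differ.
func dumpState() {
	fmt.Fprintln(os.Stdout, `{"LastSuccessfulBinlogPos": "...", "LastSuccessfulPrimaryKeys": {}}`)
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	done := make(chan struct{})
	go func() {
		<-sigs      // wait for an interrupt
		dumpState() // emit resumable state before exiting
		close(done)
	}()

	// ... the normal copy phase would run here ...
	<-done
}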

Phase 2: InlineVerifier

These are completed in #120, #121, #122, #123, #124, #127, #129.

After these are done, we should try to integrate Ghostferry changes into any downstream consumers to make sure it works correctly.

Phase 3: Handle schema changes

These are pretty tentative.

  1. Implement error handling such that errors caused by a schema change are retried/stalled.

  2. Implement schema change detection via binlog on both the source and the target.

  • When a schema change is detected, save the state and kill the process.
  3. Modify the Reconciler for tables whose schemas changed.
  • Delete those tables and start fresh.

Foreign key constraint is incorrectly formed

When creating a table with a foreign key on the target server, if the referenced table has not been created yet, an error occurs.

cannot create table, this may leave the target database in an insane state error="Error 1005: Can't create table publisher.View (errno: 150 \"Foreign key constraint is incorrectly formed\")" table=publisher.View error: failed to create databases and tables: Error 1005: Can't create table publisher.View (errno: 150 "Foreign key constraint is incorrectly formed")

Example of a CREATE query for the View table:

CREATE TABLE View (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
platform_id int(10) unsigned NOT NULL,
user_id int(11) NOT NULL,
object_id int(10) unsigned NOT NULL,
object_type tinyint(4) NOT NULL,
timestamp timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (id),
KEY view_platform_id_foreign (platform_id),
CONSTRAINT view_platform_id_foreign FOREIGN KEY (platform_id) REFERENCES Platform (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

In this case, ghostferry-copydb hasn't created the Platform table yet.

We have a fix for this too. Just putting this issue here for future reference.

AutomaticCutover needs some refactor

AutomaticCutover exists to prevent Ghostferry from finishing automatically, allowing the cutover steps to be run manually by a human operator. However, the way it is done (blocking the onFinishedIterations callback for as long as AutomaticCutover == false, which blocks WaitUntilRowCopyIsComplete) feels like a hack. A cleaner strategy might be a separate wait-for-manual-cutover step, or something like that.
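
As a rough sketch of what such a separate wait-for-manual-cutover step could look like (the names below are hypothetical and not part of the current Ghostferry API), a gate based on a channel avoids blocking the row-copy completion path:

package cutover

import "sync"

// cutoverGate is a hypothetical sketch: instead of blocking the
// onFinishedIterations callback while AutomaticCutover == false, the ferry
// would finish row copy normally and then explicitly wait on this gate
// before starting the cutover steps.
type cutoverGate struct {
	allow chan struct{}
	once  sync.Once
}

func newCutoverGate(automatic bool) *cutoverGate {
	g := &cutoverGate{allow: make(chan struct{})}
	if automatic {
		g.AllowCutover() // automatic cutover: the gate never blocks
	}
	return g
}

// AllowCutover is what a human operator would trigger, e.g. via the control
// server. Calling it more than once is harmless.
func (g *cutoverGate) AllowCutover() {
	g.once.Do(func() { close(g.allow) })
}

// WaitForCutover blocks until cutover has been allowed.
func (g *cutoverGate) WaitForCutover() {
	<-g.allow
}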

Support iteration on non-PK unique integer column

Hi again,

I got this error after fixing other things and wondered why it's not supported. What is the reason, or is there a planned feature for it?

Anyway, we will try to fix it, but wanted to hear from your side first.

Data corruption in binary and varbinary columns

When retrieving updates via the binlog, the replication module strips trailing 0-bytes from the data in row updates. This affects the data inserted via insert/update statements as well as the "where" clause for update/delete statements.
As a result, the target DB starts to diverge in the data that it stores.

IMO this is a bug in the upstream library we use, and I have opened a ticket with them:

go-mysql-org/go-mysql#477

However, existing users of the library may break if the library is changed, which is why I think we may need to work around the bug in ghostferry directly.

To do this, we need to update go-mysql to parse the BINARY/VAR_BINARY table column definitions (it currently assumes they are simple string columns) and extend the input array before using the data in calls like BinlogUpdateEvent.AsSQLString.
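
For illustration, a sketch of the kind of workaround described above for fixed-length BINARY(n) columns; this is not code from go-mysql, and the column length would have to come from the parsed table definition:

package binlogutil

// padBinary restores trailing 0x00 bytes that may have been stripped from a
// fixed-length BINARY(n) value by the binlog parser, by right-padding the
// value back to the declared column length.
func padBinary(value []byte, columnLength int) []byte {
	if len(value) >= columnLength {
		return value
	}
	padded := make([]byte, columnLength)
	copy(padded, value) // the remaining bytes stay zero
	return padded
}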

Corruption error due to InlineVerification in BatchWriter that goes away after being retried.

Symptom

InlineVerification in BatchWriter could log an error message such as the following:

level=error msg="failed to write batch to target, 1 of 5 max retries" error="row fingerprints for pks [XXXXXX] on schema.table do not match" tag=batch_writer

This error may show up multiple times and then resolves itself. The move continues on as normal and passes the final verification.

Possible cause

This could be caused by a harmless race condition between the BinlogWriter and the InlineVerifier. Since the InlineVerifier is not covered by TLA+, this race condition was not discovered until after implementation.

To understand the race condition, it is important to understand the operations that occur within the BatchWriter in pseudocode:

for i = 1..retries:
  (On source) BEGIN
  (On source) SELECT ... FOR UPDATE
  for j = 1..retries:
    (On target) BEGIN
    (On target) INSERT IGNORE ...
    (On target) SELECT MD5(... # Checksum occurs here
    (On target) COMMIT/ROLLBACK
  (On source) COMMIT/ROLLBACK

The error above is emitted at the SELECT MD5(...) line, which causes the inner loop to retry. The second time it reaches the SELECT MD5(...) line, the verification succeeds, causing the code to break out of both the inner and outer loops and move on to the next batch of rows.

To understand the race, we can take a look at the state matrix of the source and target database row as the steps in the above pseudocode are executed. Specifically, we need to look at how the actions are serialized. We denote the database row values with v1 and v2, where v1 represents one set of values and v2 represents a different set of values. We define a "payload" as either the argument an action is given (e.g. INSERT v1 INTO ...) or the result an action receives (e.g. SELECT FROM ... -> v1). Let's also assume that the error occurred once and resolved itself afterwards:

Step Actor Action Payload Source Target
1 DataIterator BEGIN N/A ?? ??
2 DataIterator SELECT FOR UPDATE v2 ?? ??
3 BatchWriter BEGIN N/A ?? ??
4 BatchWriter INSERT IGNORE v2 ?? ??
5 BatchWriter SELECT MD5() v2 ?? ??
6 BatchWriter ROLLBACK N/A ?? ??
7 BatchWriter BEGIN N/A ?? ??
8 BatchWriter INSERT IGNORE v2 ?? ??
9 BatchWriter SELECT MD5() v2 v2 v2
10 BatchWriter COMMIT N/A ?? ??
11 DataIterator ROLLBACK N/A ?? ??

At step 9, we know the checksum succeeds. Thus payload == source == target at this step. Since the source row (and therefore its checksum) was obtained in step 2, we know the payload is always v2.

Filling out the rest of the states according to the transactional guarantees provided by MySQL gives us the following (the exercise of filling it out is left to the reader):

Step Actor Action Payload Source Target
1 DataIterator BEGIN N/A v2 v1
2 DataIterator SELECT FOR UPDATE v2 v2 v1
3 BatchWriter BEGIN N/A v2 v1
4 BatchWriter INSERT IGNORE v2 v2 v1
5 BatchWriter SELECT MD5() v2 v2 v1
6 BatchWriter ROLLBACK N/A v2 v1
7 BatchWriter BEGIN N/A v2 v2
8 BatchWriter INSERT IGNORE v2 v2 v2
9 BatchWriter SELECT MD5() v2 v2 v2
10 BatchWriter COMMIT N/A v2 v2
11 DataIterator ROLLBACK N/A v2 v2

The only way such states could be obtained is by inserting some additional actions before step 1 and between steps 6 and 7:

Step Actor Action Payload Source Target
0a Application INSERT (SOURCE) v1 v1 nil
0b Application UPDATE (SOURCE) v2 v2 nil
0c BinlogWriter INSERT v1 v2 v1
1 DataIterator BEGIN N/A v2 v1
2 DataIterator SELECT FOR UPDATE v2 v2 v1
3 BatchWriter BEGIN N/A v2 v1
4 BatchWriter INSERT IGNORE v2 v2 v1
5 BatchWriter SELECT MD5() v2 v2 v1
6 BatchWriter ROLLBACK N/A v2 v1
6a BinlogWriter UPDATE v2 v2 v2
7 BatchWriter BEGIN N/A v2 v2
8 BatchWriter INSERT IGNORE v2 v2 v2
9 BatchWriter SELECT MD5() v2 v2 v2
10 BatchWriter COMMIT N/A v2 v2
11 DataIterator ROLLBACK N/A v2 v2

This is relatively harmless, and fixing it would require more effort than it is worth for now.

Support arbitrary primary key types

Originally by @pushrax:

Right now only integer types are supported. gh-ost handles this by using interface{} and letting the type propagate through untouched.

It's also possible to support composite keys in this way.

Comment by @shuhaowu:

Any thoughts on having a PK type? This would be more explicit than an interface{}. Additionally, it may allow us to quickly refactor the code with the new type without having to change the underlying type from uint64 to something else right now.
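
As a purely hypothetical sketch (none of these names exist in Ghostferry today), such a PK type could wrap uint64 for now while hiding the representation from callers:

package pagination

// PaginationKey is a hypothetical sketch of an explicit key type. Today it
// simply wraps uint64, but since callers go through this type, switching the
// underlying representation later (interface{}, composite keys, strings)
// only touches this file.
type PaginationKey struct {
	value uint64
}

func NewPaginationKey(v uint64) PaginationKey { return PaginationKey{value: v} }

// Compare returns -1, 0, or 1, mirroring how cursors decide whether a row
// has already been copied.
func (k PaginationKey) Compare(other PaginationKey) int {
	switch {
	case k.value < other.value:
		return -1
	case k.value > other.value:
		return 1
	default:
		return 0
	}
}

// SQLValue is what would be passed to database/sql query arguments.
func (k PaginationKey) SQLValue() interface{} { return k.value }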

IterativeVerifier improvements

Some potential things that we need to look at for the IterativeVerifier:

  • Verify that deleted rows on the source are not orphaned on the target. This could have an impact on copydb.
  • Make the threshold for the iterative verifier's reverification configurable/tunable. Right now it's based on the number of elements in the reverify store: if it is smaller than 1000 or larger than when we started reverifying, stop reverifying. If the last measured reverify time is smaller than the max downtime configuration, fail.

copydb: State export on stdout makes resuming difficult, if tool invocation fails

Exporting state on stdout works really well, but it has an annoying gotcha: if the tool invocation fails (e.g., a configuration error, DB connectivity error, or something similar), the output on stdout will not contain a valid resume state, as we never entered the main loop of ghostferry-copydb. As a result, the caller must be aware of any startup failure and has to take special precautions to not overwrite the statefile generated by a previous tool invocation.

For example: if we run ghostferry-copydb in a loop, where the previous iteration's stdout is written to disk and passed as CLI argument to the next iteration, any iteration where tool invocation fails will inevitably overwrite (and thus lose) the state.

One way to handle this situation would be to configure ghostferry to dump state to a file, and to only write the updated state once we are in a condition where this makes sense - we have successfully started, loaded the original state (if any), are able to export the state, etc. At the same time, we could import the state on copydb start (if that file already exists).

Note that this could also be handled outside the ghostferry library or ghostferry-copydb by putting more intelligence into a caller. But, by embedding more careful handling in the library/tool itself, we can provide users with safe default handling and avoid different solutions to a shared problem.
Clearly, whatever change we make in the library/tool must allow for backwards compatibility, maintaining the original behavior if that is what a caller expects.

Attempt to Retry Cutover Locks

Currently, Ghostferry will only attempt to acquire the cutover lock a single time before failing. We want to modify Ghostferry to retry acquiring the cutover lock, first using a static value, and then eventually using a dynamic value sent back that tells Ghostferry for how long to wait before attempting to acquire the lock again.

Resume Ghostferry after interruption

In order to handle schema changes on the source and the target databases of the same application, we opted for a method where we pause Ghostferry upon the beginning of a schema change and resume it after the schema change has completed on both the source and the target database.

In addition to being useful for schema changes, having a resumable Ghostferry is useful in general. An example would be if the target/source database becomes temporarily unavailable. As of right now, the data on the target must be cleaned up and then we have to restart Ghostferry from the beginning.

Resume via Reconciliation

The main issue with Ghostferry being interrupted is that binlogs are no longer being streamed from the source to the target. If the binlogs are not streamed, the target database is then no longer up to date and the data may not be valid. This issue is not exclusive to Ghostferry and also affects regular MySQL replication. The solution there involves starting replication at some user specified binlog position. Implementing the same within Ghostferry will be difficult and inefficient:

  • Implementing the ability to follow the binlogs through an ALTER TABLE event will be difficult to accomplish.
  • A row may change multiple times during the time that Ghostferry is down. Updating the entire row over multiple times will be inefficient.

Instead of replaying the binlogs as is, a different method can be employed to keep the target up to date (a Go sketch follows the list below):

  1. Resume Ghostferry with:
    1. a known good binlog position that has been streamed to the target,
    2. a known good cursor (PK) position where all rows with PK <= this position have been copied to the target, and
    3. a copy of the table schema cache valid at the known good binlog position.
  2. Loop through the binlogs from the known good position to the current position from SHOW MASTER STATUS. For each binlog event encountered, get the associated primary key and for that row:
    1. delete the row from the target database if it exists;
    2. copy the current row from the source database to the target database if the row has already been copied, which is determined by comparing the row's PK with the known good cursor position.
    3. NOTE: this step will from here on be referred to as the reconciliation step.
  3. After the process is complete, we can simply start Ghostferry as normal.
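
A rough Go sketch of the reconciliation step under the assumptions above; the event struct and the illustrative table (with id and data columns) are placeholders, not Ghostferry's actual types:

package reconcile

import "database/sql"

// binlogRowChange is an illustrative stand-in for a parsed binlog event that
// affects a single row of a single table.
type binlogRowChange struct {
	Table string
	PK    uint64
}

// reconcileRow applies the reconciliation step for one binlog event: delete
// the row on the target if it exists, then recopy it from the source only if
// we know it had already been copied (PK <= knownGoodCursorPos).
func reconcileRow(source, target *sql.DB, ev binlogRowChange, knownGoodCursorPos uint64) error {
	// 1. Delete the row from the target database if it exists.
	// (Table names are concatenated here only because this is a sketch.)
	if _, err := target.Exec("DELETE FROM "+ev.Table+" WHERE id = ?", ev.PK); err != nil {
		return err
	}

	if ev.PK > knownGoodCursorPos {
		// Not yet copied by the previous run; the normal copy phase will
		// pick this row up after reconciliation, so nothing more to do.
		return nil
	}

	// 2. Recopy the current row from the source to the target.
	var data []byte
	err := source.QueryRow("SELECT data FROM "+ev.Table+" WHERE id = ?", ev.PK).Scan(&data)
	if err == sql.ErrNoRows {
		return nil // deleted on the source in the meantime; nothing to copy
	}
	if err != nil {
		return err
	}
	_, err = target.Exec("INSERT INTO "+ev.Table+" (id, data) VALUES (?, ?)", ev.PK, data)
	return err
}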

Safety of the Reconciliation Step

To analyze the safety of the reconciliation step, we must first make the following assumptions:

  1. The known good binlog/cursor positions given have been copied by a previous Ghostferry run.
  2. The known good binlog/cursor position can be an underestimate of the actual good binlog/cursor position.
    1. In other words: it's possible that we overcopied rows/binlogs in a previous run but didn't manage to save the position because the process was shut down prematurely.

We can then analyze the situation where the known good binlog position is the same as the actual good binlog position (no overcopying of binlogs occurred):

  1. If a row that is known to have been copied is modified (pk <= knownGoodCursorPos):
    1. Suppose there are 4 versions of this row due to modifications: (v1) the state of the row at the time of the interruption, (v2) the state after a modification while Ghostferry is down, (v3) the state after a modification that occurred before we reached this row during reconciliation but after SHOW MASTER STATUS, and (v4) a modification that occurs after the reconciliation process is done but before Ghostferry finishes.
    2. When we encounter the binlog entry that performs v1 -> v2, the row on the source is at v3 and the row on the target is at v1.
    3. The target row is deleted and recopied from the source, updating it to v3.
    4. After the reconciliation process is complete, the binlog streamer encounters the event v2 -> v3. This event will not be executed on the target in the current Ghostferry implementation as is.
    5. When the BinlogStreamer finally encounters the event v3 -> v4, it will be executed.
  2. If a row that has been copied but is not known to have been copied is modified (knownGoodCursorPos < pk <= actualGoodCursorPos):
    1. the row is deleted from the target if it exists;
    2. after the reconciliation step, Ghostferry will resume from knownGoodCursorPos and thus will copy the row over as part of its normal run. All the safety properties of regular Ghostferry apply.
  3. If a row that has not been copied is modified (pk > actualGoodCursorPos):
    1. same as case 2.

We can now simply extend this to cases when the known good binlog position is an underestimate of the actual position: if we reconcile a binlog entry that has already been streamed to the target, it will simply be deleted from the target and recopied. Thus it does not pose a problem.

The safety of this reconciliation is also verified with an (unreviewed) TLA+ model.

Safety of Interruption

As demonstrated above, as long as Ghostferry saves at worst an underestimate of the binlog/cursor position, the reconciliation process is safe.

The current code only increments the last successful binlog/cursor position after the binlog/row is successfully streamed/copied. This means that if we were to panic the process at any time and get the saved values out, those values are at worst an underestimate, unless there's something about Go that we don't quite understand (?).

Handling Schema Changes with Reconciliation

If a schema change occurs on either the source or the target, we must interrupt Ghostferry and only resume it at some future point. We can asynchronously detect schema changes on either the source or the target and abort the process. If an error occurs elsewhere within Ghostferry because of the schema change but on a different thread, we stall that code until we can positively identify a schema change, OR we abort if some timeout has been reached and we still cannot positively identify the schema change.

We assume that:

  • The source and target databases are for the same application. This means they must eventually have a consistent schema; we only resume Ghostferry when that consistent schema is reached.

Once resumed:

  • For each table that was in progress AND has a schema change applied during the interruption:
    • delete all application records of this table from the target database.
    • set the known good cursor position to 0 so it gets recopied.
  • For all other tables, the regular reconciliation process must apply, as otherwise we might lose changes that have occurred during the interruption to those tables.
    • Note we need to do this not just for finished tables, but also for those that haven't started, as during the previous run a row could have been INSERTed by the binlog streamer into a table that hasn't started its copy process. If such a row is updated during the downtime and we don't do something about it, the data will be corrupted.

How to handle begin/commit events in the binlog

Originally by @pushrax:

It seems that even with RBR the binlog gets BEGIN/COMMIT events (via XID_EVENT) that are required for perfect replication while servicing queries in a consistent way from the replica. These events allow the replica to commit changes to multiple tables atomically. This might matter for our use-case, if the data we are writing to the target is also being selected. My understanding is that this only affects consistency of queries in the replica/target during the copying. We already tolerate inconsistent data during copying, so maybe it's completely fine to ignore these events.

Comment by @shuhaowu:

Note that it is not a problem if the target that we are copying data to is not being queried by anything other than Ghostferry itself. This means the vast majority of use cases are safe (this also applies to GitHub's gh-ost).

failed to read current binlog position

Hello,

I have an old MariaDB server and I wanted to use this tool for migration. The source MariaDB server was 10.0.x, and binlog_row_image was not supported in that version. That's why I upgraded it to 10.2.22, and this variable is now working fine.

After granting necessary permissions to the migration user for both source and target servers, I got this error:

ERRO[0000] failed to read current binlog position        error="sql: expected 4 destination arguments in Scan, not 5" tag=binlog_streamer
error: failed to start ferry: sql: expected 4 destination arguments in Scan, not 5

I checked your code and followed the tutorial, but I couldn't find where exactly the binlog position is read.

Can you please advise how to fix it?

Thanks in advance.

Adjust DataIterationBatchSize when we fail to copy the batch

We transfer 200 records at a time by default (DataIterationBatchSize). If the records are large, we will run into maximum limits and fail. We will retry, but always with the same number of records, and fail the same way.

The suggestion is that we treat DataIterationBatchSize as a default value and back off exponentially if we fail because of the maximum size limits. For example, use DataIterationBatchSize/2 on the second try, DataIterationBatchSize/4 on the third, and 1 on the last try.
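
A minimal sketch of the suggested backoff (the function and parameter names are illustrative):

package batching

// batchSizeForAttempt halves DataIterationBatchSize on each retry caused by a
// size-related failure, bottoming out at a single record on the last attempt.
func batchSizeForAttempt(defaultBatchSize, attempt, maxAttempts int) int {
	if attempt >= maxAttempts-1 {
		return 1 // last try: one record at a time
	}
	size := defaultBatchSize >> uint(attempt) // e.g. 200, 100, 50, ...
	if size < 1 {
		size = 1
	}
	return size
}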

Some TLA+ suggestions

I ran across your spec and thought I'd give some tips on using TLA+ here!

Multiple Initial Starting States

SourceTable = InitialTable means you have to create a separate model for each different initial state. What we could instead do is say

CONSTANT TableCapacity
InitialTables == [1..TableCapacity -> PossibleRecords]

(*--algorithm
variables SourceTable \in InitialTables;

This sets the initial SourceTable to some possible InitialTable, meaning that a single model can now explore every single table of size TableCapacity. This also has a performance bonus, since by wrapping everything in a single model TLC can skip checking symmetric states.

If you don't need to account for gaps in your table, i.e. NoRecordHere isn't part of your spec, then we can replace our InitialTables with

InitialTables == UNION {[1..tc -> Records]: tc \in 0..TableCapacity}

This now generates all tables of size TableCapacity or less. This means instead of writing <<r0, r1, NoRecordHere>> we write <<r0, r1>>.

ASSUME

If you do need NoRecordHere, instead of defining it as something TLC can't check, try doing this:

CONSTANT NoRecord
ASSUME NoRecord \notin Records

That signals the intent more clearly, and also doesn't require you to override the module. Instead you just make NoRecord a model value.

Indenting PlusCal Labels

You currently have this:

binlog_loop: while (...) {
      binlog_read:   if (...) {
                       ...
      binlog_write:    ...
      binlog_upkey:    ...
    };
}

It's hard to tell if the labels are sublabels or side-by-side or what. Instead, try this:

binlog_loop: 
  while (...) {
    binlog_read:   
      if (...) {
        ...
        binlog_write: 
          ...
        binlog_upkey:    
          ...
    };
}

Now it's clear that binlog_write and binlog_upkey are siblings under binlog_read.

Hope these tips help!

panic: sync: negative WaitGroup counter in iterative_verifier

...
INFO[0271] starting iterative verification in the background  tag=iterative_verifier
INFO[0271] served http request                           method=POST path=/api/actions/verify tag=control_server time="44.141µs"
INFO[0271] starting verification during cutover          tag=iterative_verifier
DEBU[0271] reverifying                                   batches=2 tag=iterative_verifier
DEBU[0271] received pk batch to reverify                 len(pks)=1 table=t1 tag=iterative_verifier
DEBU[0271] received pk batch to reverify                 len(pks)=1 table=t2 tag=iterative_verifier
INFO[0271] cutover verification complete                 tag=iterative_verifier
panic: sync: negative WaitGroup counter

goroutine 6547 [running]:
sync.(*WaitGroup).Add(0xc00029c650, 0xffffffffffffffff)
        /usr/lib/go/src/sync/waitgroup.go:74 +0x137
sync.(*WaitGroup).Done(0xc00029c650)
        /usr/lib/go/src/sync/waitgroup.go:99 +0x34
github.com/Shopify/ghostferry.(*IterativeVerifier).StartInBackground.func1(0xc000314000)
        /home/user1/go/src/github.com/Shopify/ghostferry/iterative_verifier.go:288 +0x117
created by github.com/Shopify/ghostferry.(*IterativeVerifier).StartInBackground
        /home/user1/go/src/github.com/Shopify/ghostferry/iterative_verifier.go:279 +0x1f6

InlineVerifier cleanup tracking

  • Remove the IterativeVerifier from the code base.
  • Remove the "hack" of f.inlineVerifier on Ferry and make it more integrated into Ferry directly.
    • Possibly review the Verifier interface to see if it needs to be extended or removed.
  • Write some sort of test for the race condition within InlineVerifier.verifyAllEventsInStore(). Specifically, if a row is added to the BinlogVerifyStore between the .Batches() call and the .RemoveVerifiedBatches() call, that row must remain present after .RemoveVerifiedBatches().
    • This is somewhat non-trivial as there's no easy way to mock methods and the race condition is entirely contained within verifyAllEventsInStore, a private method.
  • A possible way to test this is to add the same row multiple times and see if the counter has increased proportionally. Then, run verifyAllEventsInStore once and see if the counter only decreases by 1, or something along those lines.
  • Disallow interrupt after FlushBinlog.
  • SELECT query logs with the InlineVerifier will be massive because of the repeated MD5 COALESCE: #136
  • Port more IterativeVerifier tests to the InlineVerifier: #137
  • InlineVerifier is likely broken with the ControlServer and thus copydb.
  • ColumnCompressionConfig is only used by the InlineVerifier. We should either move it into the InlineVerifierConfig or make the InlineVerifierConfig, and thus the InlineVerifier, a first-class member of Ghostferry.

Resuming ghostferry-copydb after interrupt fails due to missed TableMapEvent

The interrupt-resume feature as described in

https://shopify.github.io/ghostferry/master/copydbinterruptresume.html

works well, as long as the interrupt does not happen while a batch of RowsEvent events is being processed.

A MySQL replication event for changing data always starts with a TableMapEvent (describing the table to be changed), followed by one or more RowsEvent events (containing the data to be changed). If multiple consecutive RowsEvent events are sent for the same table, the TableMapEvent is typically skipped (after being sent once).

Thus, if the interrupt happens after receiving the TableMapEvent but before receiving the last RowsEvent, ghostferry will try to resume from the last processed RowsEvent, causing the replication/BinlogSyncer syncer to crash with an "invalid table id <table-ID>, no corresponding table map event" exception.

This is due to the following code:

var ok bool
e.Table, ok = e.tables[e.TableID]
if !ok {
	if len(e.tables) > 0 {
		return errors.Errorf("invalid table id %d, no corresponding table map event", e.TableID)
	} else {
		return errMissingTableMapEvent(errors.Errorf("invalid table id %d, no corresponding table map event", e.TableID))
	}
}

Note that this is not a bug in the replication module, because ghostferry points the syncer at a resume position after the TableMapEvent, and the code cannot satisfy this (without ignoring the fact that it doesn't know the table ID).

Changing the replication library to ignore unknown table IDs is tempting but very tricky: we could cache the table IDs previously seen in ghostferry and "fill the gap" this way. Unfortunately, the parsing of the RowsEvent data relies on the table schema. And, conceptually, it also seems quite unclean.

Thus, IMO the correct approach is to make sure the syncer is not pointed at a location from which it cannot resume.

Ghostferry today only stores the resume position via its lastStreamedBinlogPosition property. In my variant of ghostferry (see #26), I have changed the BinlogStreamer class to keep track of an additional position: the lastResumeBinlogPosition. The idea is to analyze each event received from the syncer and check whether it is a RowsEvent. If not, the event position is considered "safe to resume from" and the property is updated; otherwise it is kept at its current value.
Thus, the resume position is a tuple of "last received/processed event" and "last position we can resume from".

Then, when resuming, we do not resume from lastStreamedBinlogPosition, but instead from lastResumeBinlogPosition. Any event received in BinlogStreamer.Run is first checked to see whether its position is before lastStreamedBinlogPosition, and if so, it is skipped. Otherwise it is queued to the eventListeners.
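
A sketch of the idea using the go-mysql replication types; the field names follow the description above and assume pos is the start offset of the event, so they may differ from the eventual implementation:

package binlogposition

import (
	"github.com/go-mysql-org/go-mysql/mysql"
	"github.com/go-mysql-org/go-mysql/replication"
)

// resumeTracker keeps two positions: the position of the last streamed event,
// and the last position that is safe to resume from. We assume pos is the
// start offset of ev, so resuming from it replays ev itself (including a
// TableMapEvent), which avoids the "no corresponding table map event" crash.
type resumeTracker struct {
	lastStreamedBinlogPosition mysql.Position
	lastResumeBinlogPosition   mysql.Position
}

func (t *resumeTracker) observe(ev *replication.BinlogEvent, pos mysql.Position) {
	t.lastStreamedBinlogPosition = pos
	if _, isRowsEvent := ev.Event.(*replication.RowsEvent); !isRowsEvent {
		// Per the description above: any non-RowsEvent position is treated
		// as resumable; RowsEvent positions keep the previous safe value.
		t.lastResumeBinlogPosition = pos
	}
}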

Happy to share my changes as a pull request - unfortunately it's currently somewhat entangled with my changes on ghostferry-replicatedb (see the other ticket). So before I untangle these two unrelated changes, I wanted to hear your opinions on the above, and whether I'm overlooking a simpler way of doing this.

Iterative verifier multiple reverification before cutover

We can tie the metric of how many rows we are verifying to the amount of binlogs we are streaming to figure out how many times we should reverify before we go to cutover. With some sort of basic control algorithm, we could potentially reduce the downtime due to verification to essentially 0.

If we implement this, we need to be able to configure the maximum number of reverifications.

Running Ghostferry with the source database being a replica

Running Ghostferry with the source database as a replica is subject to a race condition: when we stop the binlog streamer (and start the cutover stage), pending writes on the source database's master might not yet have propagated to the binlogs of the replicas. Since Ghostferry has no idea about these upstream servers, it could miss writes and thus cause data corruption.

I'm not quite sure yet whether we should integrate something that checks that the upstream binlog position matches the replica's binlog position directly into ghostferry.Ferry. However, we can make it an API that we provide as part of the library and integrate it into copydb.

@BoGs @pushrax @hkdsun

Note this is an issue with the master as well if sync_binlog != 1. In these cases, we recommend that you call FLUSH BINARY LOGS, which I assume will flush all pending writes to disk as it closes the current binary log file and opens a new one with a separate file name. We've never tested this scenario, to my knowledge.

Tests hang on mysql2 connection and command out of sync

The DataWriter was causing some flakiness in tests (and almost certainly on Travis) where:

  1. It would hang on DataWriter#write_data, causing the DataWriter thread to stall indefinitely and the tests to hang forever.
  2. A "Commands out of sync" exception would immediately fail the tests.

This was worked around via 56a3100 as I don't have too much time to look into it atm.

The first problem resembles brianmario/mysql2#896, where the author didn't report a reason, but reported a similar workaround to the above.

The second problem resembles brianmario/mysql2#956, although that issue is closed with the release 0.5.1. We currently use 0.5.2.

I suspect the root cause of this issue is somewhere inside mysql2, but I don't know where. The connections are not being shared across multiple threads, so that shouldn't be an issue. We are writing to the database as fast as we can in a loop; perhaps this contributes to the problem, as slowing it down with a new connection on every loop iteration seems to have helped.

Investigate the possibility of removing the IterativeVerifier

After some discussion with @pushrax and @hkdsun, we think it may be possible that all the IterativeVerifier does is check if a type, encoding, or truncation issue (henceforth referred to as encoding issues) caused data corruption to occur during the BatchWriter and BinlogWriter. Examples of this are: #23 and go-mysql-org/go-mysql#205. It is unclear what other issues the IterativeVerifier can catch. This is especially true because the IterativeVerifier uses the same BinlogStreamer and data iteration as the Ferry: if there are any bugs within those structs, they would equally affect the IterativeVerifier.

If the assumptions above are indeed true, that is, the IterativeVerifier exists purely to catch encoding issues due to the need to translate data from MySQL to Golang and back to MySQL, then the verification of data correctness could instead be done directly after the writes occur. We could then eliminate the IterativeVerifier, and the overall architecture of Ghostferry would be much simpler. A simpler architecture also means the likelihood of bugs is lower, so it's an appealing idea. It would also make implementing interrupt/resume much easier.

We should dig a little bit further into whether or not the IterativeVerifier only ever catches encoding issues. If so, we should decide if we want to get rid of the IterativeVerifier in favour of inline verification.

Things to investigate

  • Better reasoning/analysis on what the IterativeVerifier actually covers.
  • Find examples of IterativeVerifiers failing a verification and investigate the causes.
  • Understand what "checking directly after writes" means: what does it catch and what does it miss?
  • Decide if we are OK with removing the IterativeVerifier in favour of checking inline.

Some more optional thoughts

The original design of Ghostferry only included a CHECKSUM TABLE verifier. This verifier is implemented effectively outside of Ghostferry. It ensures the entire Ghostferry algorithm is implemented correctly (assuming CHECKSUM TABLE itself is implemented correctly). The IterativeVerifier was conceived as a drop-in replacement for CHECKSUM TABLE so we can verify partial table moves. This means the original mental model for this "drop-in" IterativeVerifier is that it would "ensure the entire Ghostferry algorithm is implemented correctly". It's unclear whether there was ever any direct challenge to this mental model.

So instead of getting rid of the IterativeVerifier, we could make the current IterativeVerifier match the mental model above more closely. However, it's equally unclear how to do this without reimplementing the binlog streaming and data iteration.

Support tables with foreign key constraints

Ghostferry currently documents that it cannot work with foreign-key constraints (FKCs).

Intuitively this makes sense for two reasons:

  • data is inserted in per-table batches, and it is not guaranteed that rows in one table are not inserted before the rows in other tables that they refer to via FKCs, and
  • binlog writing happens in parallel with batch processing, so we may be inserting "new" data that refers to rows that the batch-writer has not yet processed.

However: MySQL allows disabling foreign-key constraint checks on a per-session basis, and it does not re-validate existing data when the checks are re-enabled. As a result, we may temporarily disable constraint enforcement until the database is back in a consistent state. The only issue that does arise is that tables must be created in an order that satisfies their inter-dependencies.

The golang sql mysql driver even allows disabling constraints on a DB connection using a simple configuration change, making support even easier.
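
For illustration, a sketch of disabling the checks for a single session with database/sql and the standard MySQL driver; the helper name is made up, and this is not how ghostferry-copydb currently does it:

package fkchecks

import (
	"context"
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // assumed driver for the target DB
)

// withoutFKChecks runs fn on a single connection whose session has
// FOREIGN_KEY_CHECKS disabled, so tables and rows can be created in any
// order. SET SESSION only affects this one connection, not the whole pool.
func withoutFKChecks(ctx context.Context, db *sql.DB, fn func(conn *sql.Conn) error) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "SET SESSION FOREIGN_KEY_CHECKS = 0"); err != nil {
		return err
	}
	return fn(conn)
}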

I have a working version of the above table creation change and was curious if you guys think it's a useful addition to the ghostferry-copydb tool. I understand it's somewhat hack-ish, but it could be useful in many scenarios where ghostferry cannot be used today.

Also: am I overlooking something with my assumption that disabling FKCs during the copy process is not a problem?

ShardingErrorHandler can be merged with PanicErrorHandler

All ShardingErrorHandler does is send a webhook. We can definitely merge this into the default but optional behaviour of core Ghostferry.

We should also make it so the error callback can be called independently of a panic, so it can be invoked during the Initialize/Start phase of Ghostferry.

Minimum example of a Ghostferry based application

It would be nice to have, in the repository, a commented example of a minimal Ghostferry-based application, as the existing copydb/sharding applications are quite complex. This would allow beginners to quickly gain an understanding of what Ghostferry-based applications should look like.

Having such an application would also benefit existing developers in testing POCs, as changing copydb/sharding could be a complicated matter.

I made a version of this when I was testing some new features locally without depending on copydb and whatnot: https://gist.github.com/shuhaowu/5c48465040bda9d4143363f06c600c59. Anyone can try to convert this into a better example.

One thing we might also want to consider is to refactor copydb a little bit to make it more like a library so it can be customized more easily, as it exposes several "nice" features (like config file/filter building) that'll likely have to be reimplemented if a standalone app is to be created.

Weird errors in packets.go (mysql-driver layer, don't think it actually causes issue in production)

There are some really weird errors when running Ghostferry for an extended amount of time. Errors like the following show up in the logs when I run Ghostferry to move a large amount of data:

11786:[mysql] 2017/08/29 17:02:37 connection.go:67: invalid connection
12772:[mysql] 2017/08/29 17:15:40 packets.go:130: write tcp [REDACTED]->[REDACTED]: write: broken pipe
12773:[mysql] 2017/08/29 17:15:40 packets.go:130: write tcp [REDACTED]->[REDACTED]: write: broken pipe
12774:[mysql] 2017/08/29 17:15:40 connection.go:97: write tcp [REDACTED]->[REDACTED]: write: broken pipe
13756:[mysql] 2017/08/29 17:28:17 packets.go:66: unexpected EOF
13757:[mysql] 2017/08/29 17:28:17 packets.go:412: busy buffer
13760:[mysql] 2017/08/29 17:28:17 connection.go:67: invalid connection
15713:[mysql] 2017/08/29 17:55:37 packets.go:33: unexpected EOF
15718:[mysql] 2017/08/29 17:55:40 connection.go:67: invalid connection
16444:[mysql] 2017/08/29 18:12:27 packets.go:66: unexpected EOF
16445:[mysql] 2017/08/29 18:12:27 packets.go:412: busy buffer
16448:[mysql] 2017/08/29 18:12:27 connection.go:67: invalid connection
17208:[mysql] 2017/08/29 18:27:57 packets.go:33: unexpected EOF
17211:[mysql] 2017/08/29 18:27:57 connection.go:67: invalid connection
17295:[mysql] 2017/08/29 18:29:37 packets.go:66: unexpected EOF
17296:[mysql] 2017/08/29 18:29:37 packets.go:412: busy buffer
17299:[mysql] 2017/08/29 18:29:37 connection.go:67: invalid connection

No issues seem to arise from this, as I think the underlying driver just reconnects, but I cannot be certain. Upstream has a lot of bug reports complaining of similar behaviour. The comments seem to be all over the place, so I don't really know what to make of them (go-sql-driver/mysql#582, go-sql-driver/mysql#674, go-sql-driver/mysql#673, just for a few).

Also, I just learned that go-sql-driver/mysql#302 was merged about two weeks ago, which supposedly fixed an issue with potentially sending duplicate queries to MySQL. However, another bug seems to be present after that: go-sql-driver/mysql#657.

Also, I noticed while tcpdumping that the errors are caused by connection resets from the MySQL server, rather than from the client (ghostferry). This may be a red herring, however, as I only looked at a few cases and I don't have the logs from that anymore.

I haven't run a detailed investigation into this or seen any case where it causes issues during a real run.

Flaky Tests

In ghostferry.tests.TestIgnoresTables:

--- FAIL: TestIgnoresTables (1.70s)
	Error Trace:	iterative_verifier_integration_test.go:139
			integration_test_case.go:158
			integration_test_case.go:105
			integration_test_case.go:42
			integration_test_case.go:33
			iterative_verifier_integration_test.go:153
	Error:		Should be true

Some deadlocking issues:

ERRO[0006] failed to write events to target, 1 of 5 max retries  error="exec query (445 bytes): Error 1213: Deadlock found when trying to get lock; try restarting transaction" tag=binlog_writer
ERRO[0006] failed to write events to target, 2 of 5 max retries  error="exec query (445 bytes): Error 1213: Deadlock found when trying to get lock; try restarting transaction" tag=binlog_writer
ERRO[0006] failed to write events to target, 3 of 5 max retries  error="exec query (445 bytes): Error 1213: Deadlock found when trying to get lock; try restarting transaction" tag=binlog_writer
ERRO[0006] failed to write events to target, 4 of 5 max retries  error="exec query (445 bytes): Error 1213: Deadlock found when trying to get lock; try restarting transaction" tag=binlog_writer
ERRO[0006] failed to write events to target after 5 attempts, retry limit exceeded  error="exec query (445 bytes): Error 1213: Deadlock found when trying to get lock; try restarting transaction" tag=binlog_writer
ERRO[0006] fatal error detected, state dump coming in stdout  errfrom=binlog_writer error="exec query (445 bytes): Error 1213: Deadlock found when trying to get lock; try restarting transaction" tag=error_handler
{
  "CompletedTables": {},
  "LastSuccessfulBinlogPos": {
    "Name": "mysql-bin.000003",
    "Pos": 852531
  },
  "LastSuccessfulPrimaryKeys": {
    "gftest.table1": 459
  }
}
panic: fatal error detected, see logs for details

goroutine 1624 [running]:
github.com/Shopify/ghostferry.(*PanicErrorHandler).Fatal(0xc422171090, 0x8d27a3, 0xd, 0xb502a0, 0xc422fa6140)
	/home/ubuntu/.go_project/src/github.com/Shopify/ghostferry/error_handler.go:44 +0x70a
github.com/Shopify/ghostferry.(*BinlogWriter).Run(0xc420220af0)
	/home/ubuntu/.go_project/src/github.com/Shopify/ghostferry/binlog_writer.go:60 +0x37d
github.com/Shopify/ghostferry.(*Ferry).Run.func4(0xc422e91ba0, 0xc420492b40)
	/home/ubuntu/.go_project/src/github.com/Shopify/ghostferry/ferry.go:288 +0x55
created by github.com/Shopify/ghostferry.(*Ferry).Run
	/home/ubuntu/.go_project/src/github.com/Shopify/ghostferry/ferry.go:286 +0x249
exit status 2
FAIL	github.com/Shopify/ghostferry/test	6.349s

Found mistakes in tutorial docs

Please see #24 for fixes.

GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON `abc`.* TO 'ghostferry'@'%';

is an invalid query (you cannot grant REPLICATION SLAVE, REPLICATION CLIENT on a schema)

The SELECT * FROM abc.table1 WHERE id = 351; validation query needs to run on the target DB, not the source DB.

Support copying tables without "paging" primary keys

Some of the databases we want to copy using ghostferry-copydb are simple key-value pairs (2 string columns). This is currently not supported by ghostferry, as we need indexed columns (typically a primary key) to paginate by.

Given the nature of tables without such indices, it might be useful to be able to copy such tables "in one big batch". Clearly this only applies in very specific scenarios and has strong limitations if these tables have many rows (e.g., we must lock the entire table for the copy, and cannot resume a copy), but in certain scenarios it might be useful.

I have a working version of the above copy algorithm that I'm happy to send for review - but I'm not sure if you guys think this is generally useful. Please let me know and I can send a PR.

Also note that I agree it's probably better to extend the pagination mechanism to support arbitrary tables. However, this seems to be more work than is reasonably possible in the near future, and it would not solve the issue for tables that don't have proper indices (although that is indeed a weird corner case).

Support for changing schemas

The most common type of failure in Shopify's internal usage of Ghostferry is errors caused by changing schemas. These can manifest themselves in many different ways, including the DataIterator/BinlogWriter inserting data into deleted columns, tables disappearing on the target database, and other incompatibilities between the source and target databases.

This is especially problematic for very time consuming ferry runs as it increases the likelihood of this class of failures.

We need to think about how this can be remedied. Starting with a limited scope seems like a good idea; for example, start by supporting the addition/removal of columns or tables.

cc @Shopify/pods

binlog_streamer get panic: ev.Header.Timestamp = 0

DEBU[8766] found 200 rows                                args="[1089172653]" sql="SELECT `id` FROM `db1`.`table1` WHERE `id` > ? ORDER BY `id` LIMIT 200" table=db1.table1 tag=cursor
DEBU[8766] found 200 rows                                args="[1089172906]" sql="SELECT `id` FROM `db1`.`table1` WHERE `id` > ? ORDER BY `id` LIMIT 200" table=db1.table1 tag=cursor
PANI[8766] logpos: 28594 0 *replication.GenericEvent     tag=binlog_streamer
INFO[8766] exiting binlog streamer                       tag=binlog_streamer
[2018/11/16 15:26:51] [info] binlogsyncer.go:163 syncer is closing...
DEBU[8766] found 200 rows                                args="[1089173124]" sql="SELECT `id` FROM `db1`.`table1` WHERE `id` > ? ORDER BY `id` LIMIT 200" table=db1.table1 tag=cursor
DEBU[8766] found 200 rows                                args="[1089173356]" sql="SELECT `id` FROM `db1`.`table1` WHERE `id` > ? ORDER BY `id` LIMIT 200" table=db1.table1 tag=cursor
DEBU[8767] found 200 rows                                args="[1089173568]" sql="SELECT `id` FROM `db1`.`table1` WHERE `id` > ? ORDER BY `id` LIMIT 200" table=db1.table1 tag=cursor
[2018/11/16 15:26:51] [error] binlogstreamer.go:57 close sync with err: sync is been closing...
[2018/11/16 15:26:51] [info] binlogsyncer.go:178 syncer is closed
panic: (*logrus.Entry) (0x8ccd40,0xc000be3b30)
goroutine 42 [running]:
github.com/Shopify/ghostferry/vendor/github.com/sirupsen/logrus.Entry.log(0xc00013a1e0, 0xc0001de4e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/user1/go/src/github.com/Shopify/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:128 +0x5a8
github.com/Shopify/ghostferry/vendor/github.com/sirupsen/logrus.(*Entry).Panic(0xc00013b090, 0xc000959ca8, 0x1, 0x1)
        /home/user1/go/src/github.com/Shopify/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:173 +0xb2
github.com/Shopify/ghostferry/vendor/github.com/sirupsen/logrus.(*Entry).Panicf(0xc00013b090, 0x8e4a89, 0x10, 0xc0000bdd30, 0x3, 0x3)
        /home/user1/go/src/github.com/Shopify/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:221 +0xed
github.com/Shopify/ghostferry.(*BinlogStreamer).updateLastStreamedPosAndTime(0xc000105ba0, 0xc000c442a0)
        /home/user1/go/src/github.com/Shopify/ghostferry/binlog_streamer.go:211 +0x140
github.com/Shopify/ghostferry.(*BinlogStreamer).Run(0xc000105ba0)
        /home/user1/go/src/github.com/Shopify/ghostferry/binlog_streamer.go:169 +0x356
github.com/Shopify/ghostferry.(*Ferry).Run.func4(0xc000298610, 0xc00015c0c0)
        /home/user1/go/src/github.com/Shopify/ghostferry/ferry.go:339 +0x55
created by github.com/Shopify/ghostferry.(*Ferry).Run
        /home/user1/go/src/github.com/Shopify/ghostferry/ferry.go:336 +0x243

Race condition in tests causing intermittent failures

In interrupt_resume_test.rb, we send SIGTERM to Ghostferry when ROW_COPY_COMPLETED is sent. However, the status handler doesn't wait until Ghostferry exits before returning. This means that Ghostferry is free to continue executing code, possibly moving an additional batch before quitting. Moving an additional batch results in a failed test occasionally.

To fix this, the send_signal call needs to be changed to something similar to .kill.

BinlogStreamer could block forever due to unconfigurable caughtupThreshold

Originally by @hkdsun:

See https://github.com/Shopify/ghostferry/blob/be4f15d/ferry.go#L254-L258

In the situation where the source database is under very heavy load, such that ghostferry is always caughtupThreshold behind on writing data to the target, we could stay in that loop for a potentially really long time.

An idea is to have the function accept a configurable deadline. Once the deadline is reached, either abort the run or force the application to stop writes on the source (i.e. forcefully initiate the cutover phase).
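
A sketch of what a configurable deadline around that wait could look like; the function and parameter names are stand-ins, not the real hook:

package cutoverwait

import (
	"errors"
	"time"
)

// waitUntilCaughtUpOrDeadline polls isAlmostCaughtUp until it reports true or
// the deadline expires, at which point the caller can abort the run (or force
// the cutover, depending on policy).
func waitUntilCaughtUpOrDeadline(isAlmostCaughtUp func() bool, deadline time.Duration) error {
	timeout := time.After(deadline)
	tick := time.NewTicker(500 * time.Millisecond)
	defer tick.Stop()

	for {
		select {
		case <-timeout:
			return errors.New("binlog streamer did not catch up before the deadline")
		case <-tick.C:
			if isAlmostCaughtUp() {
				return nil
			}
		}
	}
}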

Comment by @shuhaowu:

I would like to add, we could also add an API to dynamically force a cutover to occur (i.e. make IsAlmostCaughtUp return true always), perhaps triggerable via the ControlServer.

Comment by @fw42:

I think I'd prefer to abort the move rather than to force the cutover, at least as the default behaviour. If the source database has so many writes that we can't catch up, that probably means that it's very active right now. Locking the source database would be very disruptive at that moment.

Investigate bundling the ControlServer UI HTML/CSS

Right now the HTML/CSS associated with the control server is expected to be at a predefined location by the Debian package. With a go build version of Ghostferry, one has to specify this location as a config option. This is not really ideal. Perhaps there are better alternatives here.

Ensure max packet size is respected

Originally by @pushrax:

The Go library will return an error (ErrPktTooLarge) if a packet is written that exceeds what it thinks is the max packet size.

By default that's 4 MiB. The default values are set in NewConfig(). However, we're not calling NewConfig(); we're building the config object ourselves. This means that in this code MaxAllowedPacket is 0, and the driver loads the limit from the server. I tested this locally, and it indeed always loads the limit from the server. Thus, at the moment, exceeding the real max packet size will always return a nice Go error.
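
For reference, a sketch contrasting the two approaches with go-sql-driver/mysql (connection values are placeholders): building the Config by hand leaves MaxAllowedPacket at 0 so the limit is loaded from the server, while NewConfig() applies the 4 MiB default:

package dsnexample

import "github.com/go-sql-driver/mysql"

// buildDSNs contrasts building the driver config by hand with using the
// driver defaults from NewConfig().
func buildDSNs() (byHand string, withDefaults string) {
	// Building the struct directly leaves MaxAllowedPacket at 0, which makes
	// the driver load the real limit from the server (@@max_allowed_packet).
	manual := mysql.Config{
		User:   "ghostferry",
		Net:    "tcp",
		Addr:   "127.0.0.1:3306",
		DBName: "gftest",
	}

	// NewConfig applies the driver defaults instead, including a 4 MiB
	// MaxAllowedPacket, which may be lower than what the server allows.
	defaulted := mysql.NewConfig()
	defaulted.User = "ghostferry"
	defaulted.Net = "tcp"
	defaulted.Addr = "127.0.0.1:3306"
	defaulted.DBName = "gftest"

	return manual.FormatDSN(), defaulted.FormatDSN()
}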

Possible issue with waiting to catch up to master position's connection getting closed?

As we merged #35, I moved the code that makes the initializeWaitUntilReplicaIsCaughtUpToMasterConnection call in ShardingFerry from after "Waiting for binlog to catchup" to Initialize, which is quite a big change. The reason for doing so is that Ferry could then use the WaitUntilReplicaIsCaughtUpToMaster struct to sanity-check the configuration during the Initialize phase.

A comment on the PR says that the reason for the previous placement is that pt-kill will kill idle connections. If we connect to the master too early (during Initialize), the connection might no longer be available. I was under the impression that the Go driver handles this transparently.

Since we never resolved that comment, I'm moving the discussion out here.

cc: @hkdsun @pushrax

Ghostferry documentation

We need some documentation for this project. Specifically:

  • General introduction on Ghostferry and copydb (in the README.md file)
  • API documentation on how to use Ghostferry as a library.
  • Document the general workflow for Ghostferry and the key concepts (what's cutover, why is it important, what properties do I need to achieve on the source and the target)
  • Document how to verify the data. Trade offs between iterative and checksum table verifier.
  • Document a FAQ page, and a list of gotchas/limitations.

And possibly:

  • Document the work flow for running ghostferry-copydb and its configuration options.
  • Document how to use the TLA+ model (possibly with some failure cases to illustrate gotchas).

For the API documentation we can use godoc. We can document the rest via GitHub Pages with Sphinx.

ghostferry-sharding will be undocumented for now as it is not a stable general purpose tool at this moment, but that may change in the future as it is well built and tested.

Provide a generic way of checking whether a database is the active writer

Currently, we rely on the @@read_only variable to check whether our connection to the database is indeed to an active writer or if it's actually a read replica. This is done specifically before we use the WaitUntilReplicaIsCaughtUpToMaster struct to enter the cutover phase.

Perhaps not everybody's failover strategy is compatible with this behaviour. We could provide a configuration option on the WaitUntilReplicaIsCaughtUpToMaster struct so that the user can provide us with a query that checks this condition.
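
A sketch of what such a configurable check could look like; the custom-query parameter is hypothetical, and the fallback mirrors the current @@read_only behaviour:

package writercheck

import (
	"database/sql"
	"fmt"
)

// isActiveWriter runs a user-supplied query that must return a single boolean
// column. If no query is configured, it falls back to checking @@read_only,
// which is what Ghostferry relies on today.
func isActiveWriter(db *sql.DB, customQuery string) (bool, error) {
	query := customQuery
	if query == "" {
		query = "SELECT NOT @@read_only"
	}

	var active bool
	if err := db.QueryRow(query).Scan(&active); err != nil {
		return false, fmt.Errorf("active writer check failed: %w", err)
	}
	return active, nil
}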

RowData.GetUint64() crashes on uint64 value

dml_events.go contains a helper method for converting MySQL values to uint64:

func (r RowData) GetUint64(colIdx int) (res uint64, err error) {

which documents that MySQL never returns uint64 values. In experiments, however, I see that for certain values this is not the case, and invoking the method with RowData that already contains the expected value type crashes.

I will try to reproduce the issue using a unit test. Unfortunately, I lost the stack trace where this was happening, but I know that the crash happened on a row of the following table:

CREATE TABLE `mytable` (
  `mytable_id` bigint(19) unsigned NOT NULL AUTO_INCREMENT,
...
  PRIMARY KEY (`entry_info_id`),
...
) ENGINE=InnoDB AUTO_INCREMENT=13871426091 DEFAULT CHARSET=utf8

and a mytable_id value of 13871229209.
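
For illustration, a sketch of a more defensive conversion that also accepts values that are already uint64; this is not the current dml_events.go code:

package rowdata

import (
	"fmt"
	"strconv"
)

// toUint64 converts the value types the driver and binlog parser are known to
// hand back for unsigned integer columns. Values from unsigned BIGINT columns
// can arrive as uint64 directly instead of int64.
func toUint64(value interface{}) (uint64, error) {
	switch v := value.(type) {
	case uint64:
		return v, nil
	case int64:
		if v < 0 {
			return 0, fmt.Errorf("negative value %d cannot be converted to uint64", v)
		}
		return uint64(v), nil
	case []byte:
		return strconv.ParseUint(string(v), 10, 64)
	case string:
		return strconv.ParseUint(v, 10, 64)
	default:
		return 0, fmt.Errorf("unexpected type %T for uint64 column", value)
	}
}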
