pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)

License: Apache License 2.0

Makefile 0.19% Go 87.40% Shell 10.59% Dockerfile 0.06% Python 0.61% JavaScript 0.21% HTML 0.01% TypeScript 0.80% PLpgSQL 0.12% Less 0.01% Smarty 0.01%
Topics: cdc, tidb, mysql, kafka, ticdc, dm

tiflow's Introduction

TiFlow


Introduction

TiFlow is a unified data replication platform for TiDB that consists of two main components: TiDB Data Migration (DM) and TiCDC.

  • DM enables full data migration and incremental data replication from MySQL or MariaDB to TiDB.
  • TiCDC replicates change data to various downstream systems, such as MySQL protocol-compatible databases and Kafka.

For more details, see DM README and TiCDC README.

License

TiFlow is under the Apache 2.0 license. See the LICENSE file for details.

tiflow's People

Contributors

3aceshowhand, amyangfei, asddongmen, ben1009, buchuitoudegou, charlescheung96, csuzhangxc, d3hunter, dependabot[bot], ehco1996, glorv, gmhdbjd, hanfei1991, hi-rustin, hicqu, hongyunyan, ianthereal, july2993, lance6716, lichunzhu, liuzix, okjiang, overvenus, sdojjy, shafreeck, sleepymole, suzaku, wangxiangustc, wizardxiao, zhaoxinyu


tiflow's Issues

EventFeed may miss some regions when region is splitting

The function partialRegionFeed (in cdc/kv/client.go) accepts region info as a parameter and may reload region info from the region cache before sending the request. So it's possible that the region has changed after a split. As a result, calling regionCache.LocateKey may return a smaller region that doesn't cover the range held by the parameter regionInfo, so the remaining part of the range will be missed.
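A minimal sketch of a guard against this, assuming hypothetical Span/RegionInfo types and locate/send callbacks rather than the real client.go code: after locating a region, feed only the covered part of the span and loop on the remainder.

package kvclientsketch

import "bytes"

// Span and RegionInfo are hypothetical stand-ins for the real types in
// cdc/kv and the TiKV region cache.
type Span struct{ Start, End []byte }

type RegionInfo struct{ StartKey, EndKey []byte }

// feedSpan locates the region containing span.Start and, if that region no
// longer covers the whole span (for example because it has just split),
// keeps feeding the uncovered remainder instead of silently dropping it.
func feedSpan(span Span, locate func(key []byte) RegionInfo, send func(Span, RegionInfo)) {
	for bytes.Compare(span.Start, span.End) < 0 {
		region := locate(span.Start)
		end := span.End
		// An empty EndKey means the region extends to +inf.
		if len(region.EndKey) > 0 && bytes.Compare(region.EndKey, span.End) < 0 {
			end = region.EndKey
		}
		send(Span{Start: span.Start, End: end}, region)
		span.Start = end // continue with the part the region did not cover
	}
}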

resolved ts does not advance after one TiKV crashes

  1. we start a sysbench task (sysbench oltp_insert --tables=1 --threads=200) and a CDC changefeed
  2. the replication works well until, at about 2020/03/10 04:28:40.560 -04:00, one TiKV is killed because of OOM. After that CDC doesn't receive any region data anymore (except for region_id=8).

Note: we use a special TiKV config to test frequent region splits:

[coprocessor]

region-max-keys = 3000
region-split-keys = 2500

Some open questions:

  1. CDC has pulled almost all changefeed events in time, so why does TiKV consume so much memory?
  2. The regions don't seem well balanced; one store (172.16.5.210) does most of the work.
  3. CDC fails to continue replicating after one TiKV crashes.

cdc log: http://139.219.11.38:8000/cDBuP/cdc.log.tar.gz

(screenshots: Screen Shot 2020-03-10 at 16 52 04, Screen Shot 2020-03-10 at 16 55 54)

Out of order problem occurs between puller and kv client

Feature Request

In the current kv client, we process the kv event Entries in a *cdcpb.Event_Entries_ one by one:

put entry to eventCh -> put a sorter item to sorter -> 
  put entry to eventCh -> put a sorter item to sorter -> 
    put entry to eventCh -> put a sorter item to sorter -> ...

So we will send multiple commit event entries with the same commit ts to the puller, plus one resolved ts event generated by the sorter, all of them carrying the same commit ts:

kv -> resolve -> kv -> kv

In fact this does not exactly match our design.

Describe alternatives you've considered:

We can remove the sorter mechanism and forward events based entirely on the resolved ts from TiKV.
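A minimal sketch of that alternative with a simplified Event type (the real types are *cdcpb.Event_Entries_ plus resolved-ts events): commit entries are held until a resolved ts covers them, so the downstream never sees kv -> resolved -> kv with the same commit ts.

package forwardsketch

// Event is a simplified stand-in for the kv client's output.
type Event struct {
	CommitTs uint64
	Resolved bool // true for a resolved-ts event, false for a kv entry
}

// forward drains in and guarantees that a resolved-ts event for ts T is only
// emitted after every buffered kv entry with CommitTs <= T has been emitted.
func forward(in <-chan Event, out chan<- Event) {
	pending := make([]Event, 0, 64)
	for ev := range in {
		if !ev.Resolved {
			pending = append(pending, ev)
			continue
		}
		// Flush every buffered entry covered by this resolved ts first.
		rest := pending[:0]
		for _, e := range pending {
			if e.CommitTs <= ev.CommitTs {
				out <- e
			} else {
				rest = append(rest, e)
			}
		}
		pending = rest
		out <- ev // only now is the resolved-ts event safe to emit
	}
	close(out)
}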

make a better processor aliveness check and ensure RTO < 30s

Feature Request

Is your feature request related to a problem? Please describe:

We have a simple processor aliveness check: basically we check whether either resolvedTs or checkpointTs has been updated within one minute, which doesn't meet the requirement of RTO < 30s.

In some tests we found the kv client may block, or for other reasons the resolvedTs and checkpointTs can't be updated in time, which means a stale replication status doesn't always mean the processor is abnormal.

Describe the feature you'd like:

Design a better aliveness check strategy which satisfies:

  • it can detect an abnormal processor (in many scenarios, such as a network partition or a capture node crash) and rebalance the tables belonging to it to other processors in less than 30s (see the sketch below).
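A minimal in-process sketch of a heartbeat-based check that is independent of resolvedTs/checkpointTs progress (in real TiCDC this would more likely be built on etcd leases; the names below are hypothetical):

package alivesketch

import (
	"sync"
	"time"
)

// heartbeats tracks the last time each processor reported itself alive,
// independently of whether its resolvedTs/checkpointTs are advancing.
type heartbeats struct {
	mu   sync.Mutex
	last map[string]time.Time
}

func newHeartbeats() *heartbeats {
	return &heartbeats{last: make(map[string]time.Time)}
}

// Beat is called periodically by every processor (e.g. every second).
func (h *heartbeats) Beat(processorID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.last[processorID] = time.Now()
}

// Dead returns the processors whose heartbeat is older than timeout; with a
// 10s timeout the owner can rebalance their tables well within RTO < 30s.
func (h *heartbeats) Dead(timeout time.Duration) []string {
	h.mu.Lock()
	defer h.mu.Unlock()
	var dead []string
	for id, t := range h.last {
		if time.Since(t) > timeout {
			dead = append(dead, id)
		}
	}
	return dead
}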

Support cache only mode in changefeed

Feature Request

Is your feature request related to a problem? Please describe:

CDC has a common use case as follows:

  • We use br or dumpling/lightning to import the full data to the downstream; the full dump has a checkpoint ts.
  • After the full data is imported, we start a CDC incremental replication task from the checkpoint ts.

However, the restore/import procedure can take a very long time, and TiKV doesn't keep a long enough GC interval for us to catch up from the checkpoint ts, so the use case often looks like this:

  • We start a CDC changefeed that captures data changes from the upstream and caches them in CDC without replicating to the downstream.
  • Restore or import the data as usual and get the checkpoint ts.
  • After the full data set is restored/imported, notify CDC to start replicating from the checkpoint ts.

Describe the feature you'd like:

We need CDC to provide a new work mode that:

  • receives changed data from TiKV and caches it in memory only (if we use Kafka or another message queue as the sink, we don't need to cache data in memory);
  • receives a notification and starts replicating from a given TSO (see the sketch below).
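A minimal sketch of such a cache-only sink, with hypothetical names and without the memory bounding a real implementation would need:

package cachesketch

import "sync"

// RowChange is a simplified stand-in for a captured row change.
type RowChange struct {
	CommitTs uint64
	SQL      string
}

// cachingSink buffers captured changes in memory until Start(tso) is called,
// then replays everything with CommitTs > tso and switches to pass-through.
type cachingSink struct {
	mu      sync.Mutex
	started bool
	startTs uint64
	buf     []RowChange
	emit    func(RowChange) // the real downstream sink
}

func (s *cachingSink) Write(r RowChange) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.started {
		s.buf = append(s.buf, r) // cache-only mode
		return
	}
	if r.CommitTs > s.startTs {
		s.emit(r)
	}
}

// Start is the "notification": begin replicating from the given checkpoint TSO.
func (s *cachingSink) Start(tso uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.started, s.startTs = true, tso
	for _, r := range s.buf {
		if r.CommitTs > tso {
			s.emit(r)
		}
	}
	s.buf = nil
}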

confusing warn log at startup

[2020/02/18 19:05:47.839 +08:00] [WARN] [disk.go:56] ["Mkdir temporary file error"] [tmpDir=/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/tidb-server-tidb-server] [error="mkdir /var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/tidb-server-tidb-server: file exists"]
➜  ticdc git:(ana) ✗ fd disk.go ./vendor
vendor/github.com/pingcap/tidb/util/chunk/disk.go
vendor/github.com/shirou/gopsutil/disk/disk.go

This is caused by tidb/util/chunk/disk.go creating a temporary dir in init() (we start multiple instances that embed the tidb code):

func init() {
    err := os.RemoveAll(tmpDir) // clean the uncleared temp file during the last run.
    if err != nil {
        log.Warn("Remove temporary file error", zap.String("tmpDir", tmpDir), zap.Error(err))
    }
    err = os.Mkdir(tmpDir, 0755)
    if err != nil {
        log.Warn("Mkdir temporary file error", zap.String("tmpDir", tmpDir), zap.Error(err))
    }
}
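One way to avoid the collision (a sketch of a possible workaround, not the actual TiDB fix) is a per-process temporary directory, e.g. via os.MkdirTemp (ioutil.TempDir on Go versions before 1.16):

package disksketch

import (
	"log"
	"os"
)

var tmpDir string

func init() {
	// Instead of a fixed path shared by every process, create a unique
	// per-process directory so "mkdir ...: file exists" cannot happen.
	dir, err := os.MkdirTemp("", "tidb-server-chunk-*")
	if err != nil {
		log.Printf("create temporary dir error: %v", err)
		return
	}
	tmpDir = dir
}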

Slow replication speed with large scale of regions

Feature Request

Is your feature request related to a problem? Please describe:

  • set tikv config
[coprocessor]
region-max-keys = 6000
region-split-keys = 5000
  • prepare data
sysbench oltp_write_only --create_secondary=off --rand-seed=$RANDOM --tables=1 --table-size=10000000 prepare
  • start a CDC server and create a changefeed
  • run sysbench
sysbench oltp_write_only --create_secondary=off --rand-seed=$RANDOM --tables=1 --table-size=10000000 run

Describe the feature you'd like:

  • we found the replication speed is quite slow
  • and the puller entry buffer fills up very fast

(screenshots: Screen Shot 2020-02-24 at 16 39 13, Screen Shot 2020-02-24 at 16 37 58, Screen Shot 2020-02-24 at 16 37 37)

Besides, we found the span_frontier takes too much CPU:

(screenshot: Screen Shot 2020-02-24 at 16 42 23)

profile file: http://139.219.11.38:8000/oitOt/pprof.cdc.samples.cpu.005.pb.gz

error: mkdir /tmp/tidb-server-cdc: file exists

1. Version Info

./resources/bin/br version
Release Version:
Git Commit Hash: 719cac031a89dff89e8c8d3f2c10d988bf401617
Git Branch: master
UTC Build Time:  2019-12-09 03:32:23
Race Enabled:  false

2. Reproduce steps

../go-tpc/bin/go-tpc --time=400m tpch --host 172.16.5.86 -P 4000 -T 1 --sf=1 prepare
mysql -h 172.16.5.86 -uroot -P4000 -e 'drop database if exists tmp_db'
mysql -h 172.16.5.86 -uroot -P4000 -e 'create database tmp_db'

./resources/bin/cdc server --pd-endpoints http://172.16.5.86:2379 > cdc_server.log
./resources/bin/cdc cli --pd-addr http://172.16.5.86:2379 --start-ts 1 --sink-uri 'root@tcp(127.0.0.1:3306)/test'

kill -9 $(pgrep cdc)
./resources/bin/cdc server --pd-endpoints http://172.16.5.86:2379 > cdc_server.log
./resources/bin/cdc cli --pd-addr http://172.16.5.86:2379 --start-ts 1 --sink-uri 'root@tcp(127.0.0.1:3306)/test'

3. Log

$ cat cdc_server.log

[2019/12/12 13:49:29.296 +08:00] [WARN] [disk.go:56] ["Mkdir temporary file error"] [tmpDir=/tmp/tidb-server-cdc] [error="mkdir /tmp/tidb-server-cdc: file exists"]

Potential goroutine leaks when tables are removed

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

We have a processor for each changefeed task; a processor is essentially a goroutine. When all the tables handled by a processor have been removed, the processor should stop.

Reproduce steps:

  • set up the test environment
  • create database foo, table foo.user, and insert one entry.
  • create a change feed replication job
  • ensure the processor has been started from the ticdc's log
  • drop database foo from upstream
  • the processor never stops, even though the table has been removed and there is nothing left to do.

Solution:

  • The task status should be removed if there is no table left.
  2. What did you expect to see?

The processor exits when all its tables are dropped.

  3. What did you see instead?

The processor keeps running.
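To illustrate the expected behaviour from the solution above, a minimal sketch with hypothetical runProcessor/tableCount/removeTaskStatus names (not the actual processor code):

package procsketch

import (
	"context"
	"time"
)

// runProcessor polls its table set and returns (cleaning up its task status)
// as soon as every table has been removed, instead of running forever.
func runProcessor(ctx context.Context, tableCount func() int, removeTaskStatus func()) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if tableCount() == 0 {
				removeTaskStatus() // "task status should be removed if there is no table left"
				return
			}
		}
	}
}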

  4. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      master
      
    • TiCDC version (execute cdc version):

      master
      

kv client failed to fetch some region data

At about 2020/03/08 05:27:06.141 -04:00 the replication stops advancing, because one table's (sbtest3) resolved ts is not advancing.

CDC uses #308 version, TiKV uses 5kbpers/tikv@1765a5b version

➜  curl -s http://172.16.5.113:10080/tables/cdc_bench/sbtest3/regions |grep region_id
   "region_id": 176,
   "region_id": 192,
   "region_id": 204,
   "region_id": 160,

cdc log: http://139.219.11.38:8000/KJTrQ/issue_321_cdc.log.tar.gz
tikv log: http://139.219.11.38:8000/NK0s0/tikv.log.tar.gz

some abnormal behavior:

  • region_id=160, last registered at
[2020/03/08 05:26:56.099 -04:00] [INFO] [endpoint.rs:242] ["cdc register region"] [region_id=160]

but last resolved ts in TiKV is

[2020/03/08 05:29:05.462 -04:00] [INFO] [delegate.rs:279] ["resolved ts updated"] [resolved_ts=415146900050149856] [region_id=160]
  • region_id=176, not registered again after being deregistered
[2020/03/08 05:26:56.099 -04:00] [INFO] [endpoint.rs:242] ["cdc register region"] [region_id=176]
[2020/03/08 05:26:56.104 -04:00] [INFO] [endpoint.rs:169] ["cdc deregister region"] [error="Some(Request(message: \"peer is not leader for region 176, leader may None\" not_leader { region_id: 176 }))"] [conn_id=Some(ConnID(6))] [downstream_id=Some(DownstreamID(10))] [region_id=176]

Replace Loader with Lightning

Description

Replace Loader with Lightning

Background

In DM, the loader.Loader struct implements loading data from mydumper output files into TiDB. Since v3.0.3, Lightning supports the TiDB backend, which enables Lightning to do the same thing.

Proposal

We consider the TiDB backend mode of Lightning the better implementation of the two, because:

  1. It has more complete parsing support and can succeed in cases where loader.Loader would fail;
  2. It has finer-grained concurrency control, allowing users to configure read and write concurrency separately.

So we propose to replace Loader with Lightning.

Success Criteria

  • Reimplement loader.Loader (which is an implementation of the Unit interface) with Lightning;
  • Maintain backward-compatibility of config files like task.yaml

Category

Enhancement

Value

Enhance the ability to load data.

Score

1500

SIG slack channel

sig-migrate

Mentor

csuzhangxc

TODO list

  • Refactor Lightning to make the loader path usable as a library in DM
  • A basic replacement of loader that implements the essential methods of the Unit interface (e.g. Init, Process) and leaves the other methods as dummy implementations (see the sketch after this list)
  • Implement Close
  • Implement Pause and Resume
  • Implement Status, Error, Type, IsFreshTask
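A rough skeleton of what such a replacement unit could look like, assuming simplified method signatures (the real DM Unit interface takes more arguments):

package lightloadsketch

import "context"

// Unit is a simplified stand-in for DM's unit interface.
type Unit interface {
	Init(ctx context.Context) error
	Process(ctx context.Context) error
	Close()
	Pause()
	Resume(ctx context.Context) error
	IsFreshTask(ctx context.Context) (bool, error)
}

// lightningLoader is the skeleton of a loader Unit backed by Lightning's
// TiDB backend. Everything but Init/Process starts as a dummy implementation,
// matching the incremental plan in the TODO list above.
type lightningLoader struct {
	// in a real implementation: lightning config, checkpoint DB, etc.
}

var _ Unit = (*lightningLoader)(nil)

func (l *lightningLoader) Init(ctx context.Context) error {
	// build the Lightning task config from the DM task.yaml here
	return nil
}

func (l *lightningLoader) Process(ctx context.Context) error {
	// run Lightning with the TiDB backend against the mydumper output dir
	return nil
}

func (l *lightningLoader) Close()                                         {}
func (l *lightningLoader) Pause()                                         {}
func (l *lightningLoader) Resume(ctx context.Context) error               { return nil }
func (l *lightningLoader) IsFreshTask(ctx context.Context) (bool, error)  { return true, nil }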

Documentation

Project

N/A

Failed to replicate when pk is a generated column

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
create table t (a int, b int as (a + 1) stored primary key);
insert into t(a) values (1),(2), (3);
update t set a = 10 where a = 1;
  2. What did you expect to see?
mysql [email protected]:aa> select * from t;
+----+----+
| a  | b  |
+----+----+
| 2  | 3  |
| 3  | 4  |
| 10 | 11 |
+----+----+
  3. What did you see instead?
mysql [email protected]:aa> select * from t;
+---+---+
| a | b |
+---+---+
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
+---+---+
3 rows in set

Support NATS / Liftbridge

Description

We can't run Kafka as we just don't like the legacy of Java.

It would be awesome if you also supported Liftbridge. It has a gRPC interface as well as a NATS interface.
It's basically like Kafka, but written in Go.

https://github.com/liftbridge-io/liftbridge

Value Score

Optional value: 1~5

  • 3

Workload Estimation

1 Point for 1 Person/Work Day

  • 20 Points

Question: how to know when the changefeed task is completed?

I'm going to benchmark the performance of CDC. I populate some data into the upstream TiDB cluster and, after it's finished, I run cdc cli and get some info like:

create changefeed detail &{SinkURI:root@tcp(127.0.0.1:3306)/test Opts:map[] CreateTime:2019-12-09 21:11:09.219849281 -0500 EST m=+0.017868850 StartTs:413114580074496000 TargetTs:0 Info:<nil>}

The sync task is running on the cdc server side, so how do I know when the data is synced?

Support batch stream in kv client

Feature Request

Is your feature request related to a problem? Please describe:

The original TiKV EventFeed API will be changed to a duplex stream. ref:
https://docs.google.com/document/d/1SN3ztOXy2QTlCS1Qp9dUWTBfxowx-nIGpkuCw2ccULM/edit

Describe the feature you'd like:

The kv client has a clear input and output:

  • input: a key range
  • output: kv change log from all regions located in the given kv range

Things that need to be done:

  • We need to refine our kv client in CDC to support the new API.
  • Better error handling in the kv client.

fail to run tests with Go 1.14

make test

which bin/failpoint-ctl >/dev/null 2>&1 || CGO_ENABLED=0 GO111MODULE=on go build  -trimpath -o bin/failpoint-ctl github.com/pingcap/failpoint/failpoint-ctl
mkdir -p "/tmp/tidb_cdc_test"
$(echo $(for p in $(go list ./...| grep -vE 'vendor|proto|ticdc\/tests'); do echo ${p#"github.com/pingcap/ticdc/"}|grep -v "github.com/pingcap/ticdc"; done) | xargs bin/failpoint-ctl enable >/dev/null)
ok  	github.com/pingcap/ticdc	0.070s	coverage: 100.0% of statements
{"level":"info","ts":"2020-03-03T17:40:20.472+0800","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:50545"]}
{"level":"info","ts":"2020-03-03T17:40:20.473+0800","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["http://localhost:50546"]}
{"level":"info","ts":"2020-03-03T17:40:20.474+0800","caller":"embed/etcd.go:299","msg":"starting an etcd server","etcd-version":"3.4.3","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.14","go-os":"darwin","go-arch":"amd64","max-cpu-set":12,"max-cpu-available":12,"member-initialized":false,"name":"default","data-dir":"/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/check-2797722976074707835/0","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/check-2797722976074707835/0/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:50545"],"advertise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://localhost:50546"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"default=http://localhost:2380","initial-cluster-state":"new","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2020-03-03T17:40:20.566+0800","caller":"etcdserver/backend.go:79","msg":"opened backend db","path":"/var/folders/nw/c0ncybdd6gj2f5w5tmqvk9y40000gn/T/check-2797722976074707835/0/member/snap/db","took":"90.734245ms"}
fatal error: checkptr: unsafe pointer conversion

goroutine 157 [running]:
runtime.throw(0x6c41cad, 0x23)
	/usr/local/go/src/runtime/panic.go:1112 +0x72 fp=0xc0007d97a0 sp=0xc0007d9770 pc=0x40379f2
runtime.checkptrAlignment(0xc0001d9370, 0x6a84ee0, 0x1)
	/usr/local/go/src/runtime/checkptr.go:18 +0xb7 fp=0xc0007d97d0 sp=0xc0007d97a0 pc=0x4009617
go.etcd.io/bbolt.(*Bucket).write(0xc0007d9948, 0x0, 0x0, 0x0)
	/Users/huangjiahao/go/pkg/mod/go.etcd.io/[email protected]/bucket.go:624 +0x15c fp=0xc0007d9838 sp=0xc0007d97d0 pc=0x59c87bc
go.etcd.io/bbolt.(*Bucket).CreateBucket(0xc000200018, 0x858a188, 0x7, 0x7, 0xc0007d9ba8, 0x5a862c

Open TiCDC protocol

Open TiCDC protocol project

Overview

  • Design an open TiCDC protocol to make it easy for other applications to access TiCDC data.
  • Support Kafka and Pulsar as downstream data targets for TiCDC.

Background

TiCDC (TiDB Change Data Capture) is a new distributed incremental replication tool for the TiDB ecosystem. TiCDC is still in development, but it already works properly in the experimental environment. When the TiCDC cluster starts, an owner is elected and the other nodes are called processors. The processors pull change key-value logs from TiKV, assemble the logs into transactions, and output them to the downstream data target. The owner watches the replication progress of the processors and coordinates them to ensure the transaction order.

Problem Statement

  • The current design of TiCDC targets MySQL protocol downstream data targets. It cannot accept any non-MySQL protocol data target, so it's hard for other applications to access TiCDC data.

Success Criteria

  • Under the premise of ensuring correctness, TiCDC can output to Kafka/Pulsar.
  • A library exists that can consume from Kafka/Pulsar and parse the open TiCDC protocol.
  • Guarantee transaction integrity of the downstream data even if the TiCDC cluster is broken.

TODO list

  • Design an open TiCDC protocol.
  • Implement TiCDC output using open TiCDC protocol.
  • Implement an open TiCDC protocol parser library and consumer client (see the sketch after this list).
  • Support output data to Kafka.
  • Support output data to Pulsar.
  • Correctness and stability test.
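Purely as an illustration of what the parser-library item above might expose to consumers, a sketch with a hypothetical JSON envelope (the field names here are made up and are not the actual open TiCDC protocol):

package protosketch

import "encoding/json"

// rowEvent is a hypothetical envelope for one change event; the actual open
// TiCDC protocol defines its own (key, value) layout, this only illustrates
// what a consumer-side parser library might expose.
type rowEvent struct {
	CommitTs uint64                 `json:"commit_ts"`
	Schema   string                 `json:"schema"`
	Table    string                 `json:"table"`
	Type     string                 `json:"type"` // "insert" | "update" | "delete" | "resolved"
	Columns  map[string]interface{} `json:"columns,omitempty"`
}

// decodeEvent parses one message payload pulled from Kafka/Pulsar.
func decodeEvent(payload []byte) (rowEvent, error) {
	var ev rowEvent
	err := json.Unmarshal(payload, &ev)
	return ev, err
}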

Difficulty

Easy

Score

2100

Mentor(s)

@leoppro

Recommended Skills

  • Go language
  • Kafka/Pulsar operation
  • Basic understanding of TiKV

References

TiCDC distributed design(Chinese version)

TiCDC HA design(Chinese version)

TiCDC github repo

UCP: Fix DDL incompatible problem between upstream and downstream

Description

Currently TiCDC may replicate DDLs that are not compatible between the upstream and the downstream. If a DDL fails to execute in the downstream, the replication task is stopped.

The compatibility issues exist between a higher-version TiDB and a lower-version TiDB, or between TiDB and MySQL.

To improve user experience, we can:

  1. Add an option to automatically filter out an incompatible DDL if the DDL is ignorable (see the sketch below).
  2. In some cases, rewrite the DDL and ensure it can be executed in the downstream.
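A sketch of what option 1 could look like, assuming a hypothetical ignore-pattern config (not an existing TiCDC option):

package ddlsketch

import "regexp"

// ddlFilter skips DDLs matching user-configured patterns when the
// (hypothetical) ignore-incompatible-ddl option is enabled, instead of
// stopping the changefeed when the downstream rejects them.
type ddlFilter struct {
	ignore []*regexp.Regexp
}

func newDDLFilter(patterns []string) (*ddlFilter, error) {
	f := &ddlFilter{}
	for _, p := range patterns {
		re, err := regexp.Compile("(?i)" + p)
		if err != nil {
			return nil, err
		}
		f.ignore = append(f.ignore, re)
	}
	return f, nil
}

// ShouldIgnore reports whether a failed DDL can be dropped rather than
// stopping the replication task.
func (f *ddlFilter) ShouldIgnore(query string) bool {
	for _, re := range f.ignore {
		if re.MatchString(query) {
			return true
		}
	}
	return false
}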

Score

  • 600

Mentor(s)

Recommended Skills

  • Go programming

cdc server produced 5Gi logs in ten minutes

Version Info

/ # ./cdc version
Release Version:
Git Commit Hash: 80ee381230d5b7a3181464ad874f9a54c9220184
Git Branch: master
UTC Build Time: 2019-12-13 09:54:47
Go Version: go version go1.13.4 linux/amd64 

sh-4.2# ./tidb-server -V
Release Version: v4.0.0-alpha-516-g5466a3c31
Git Commit Hash: 5466a3c31bf4b93fb3a2c595dd6aeac46aca7b8e
Git Branch: HEAD
UTC Build Time: 2019-12-02 09:22:52
GoVersion: go version go1.13.4 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

sh-4.2# ./tikv-server -V
TiKV
Release Version:   4.0.0-alpha
Git Commit Hash:   56dc6d63ade182289c4ab1e37996746040bc07d6
Git Commit Branch: cdc
UTC Build Time:    2019-11-06 03:29:57
Rust Version:      rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)

/ # ./pd-server -V
Release Version: v4.0.0-alpha-200-gf7f643c61
Git Commit Hash: f7f643c6138cc5240d954bfa1a560e3b14bfdc6e
Git Branch: HEAD
UTC Build Time:  2019-12-13 11:42:23

Description

We used test-infra to test cdc and found that the cdc server produced 5 GiB of logs in ten minutes.

log:

[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
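One possible mitigation, sketched with hypothetical names, is to de-duplicate or rate-limit repeated identical errors before they hit the log:

package logsketch

import (
	"log"
	"sync"
	"time"
)

// errorLimiter suppresses repeated identical error messages, logging at most
// one occurrence (plus a repeat counter) per key per interval. This is one
// way to keep a flood of "RPC error: context canceled" from producing
// gigabytes of logs.
type errorLimiter struct {
	mu       sync.Mutex
	interval time.Duration
	last     map[string]time.Time
	dropped  map[string]int
}

func newErrorLimiter(interval time.Duration) *errorLimiter {
	return &errorLimiter{
		interval: interval,
		last:     make(map[string]time.Time),
		dropped:  make(map[string]int),
	}
}

func (l *errorLimiter) Error(key string, err error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Since(l.last[key]) < l.interval {
		l.dropped[key]++
		return
	}
	log.Printf("[ERROR] %s: %v (suppressed %d repeats)", key, err, l.dropped[key])
	l.last[key] = time.Now()
	l.dropped[key] = 0
}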

endless log 'region not found on incremental scan' of tikv

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.

    • sh tests/run.sh --debug
    • sysbench loads some data at the upstream.

  2. What did you expect to see?
    Replication works normally.

  3. What did you see instead?
    Replication stops after some time (no more data at the downstream).

The tikv.log keeps printing the following, even after stopping the cdc server (which means we can be sure there are no more requests to tikv). Endless retry in tikv?

[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.256 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]
[2020/02/19 11:48:13.257 +08:00] [WARN] [endpoint.rs:255] ["region not found on incremental scan"] [region_id=48]

version of tikv: ad59724513ab83461c54c1996f89235301a036d7

the 'region not found ...' log lines are filtered out of the attached tikv log
cdc.tar.gz
issue268.tar.gz

TiKV EventFeed should support receiving data from more than 1024 regions

Feature Request

Is your feature request related to a problem? Please describe:

  • update merge-schedule-limit = 0 in pd.toml to disable region merge
  • prepare data
sysbench --config-file=config oltp_insert --rand-seed=$RANDOM --tables=1 --table-size=8000000 prepare
  • split table into 3997 regions
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (0) and (1100000) regions 1000"
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (1100000) and (2200000) regions 1000"
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (2200000) and (3300000) regions 1000"
mysql -h 172.16.5.113 -u root -P 4000 -e "split table cdc_bench.sbtest1 between (3300000) and (4400000) regions 1000"
  • start a CDC server and create a changefeed
  • we found that not all region data was received in CDC, only about 3080 regions

(screenshot: Screen Shot 2020-02-25 at 17 49 32)

  • in the TiKV log of each node, only 1024 regions are registered:
➜  grep "cdc register region" tikv.log|wc -l 
1024

Describe the feature you'd like:

TiKV EventFeed supports receiving data from more than 1024 regions.

refine replication forward model in CDC

In CDC we have the following replication model

1. the kv client receives data
2. the kv client sends the data to the puller via an event chan
3. the puller adds the data to a buffer, sorts it and re-constructs transactions
4. the puller sends transactions to tableInfo (managed in a processor) via a txn chan
5. the processor pulls all txns from the txn chan of each tableInfo (with txn ts no greater than the CDC GlobalResolvedTs)

In the large-number-of-regions test, we found the replication blocked: the buffer in step 3 and the chan in step 4 were full, and no data was pulled in step 5. This may also be part of the reason for slow replication and low throughput. We should have a better data forwarding model, for the following reasons:

  1. table-level sorting and data flow should not block the kv client from pulling messages from TiKV
  2. if we use an MQ such as Kafka as the sink, we don't even need the transaction reconstruction in step 3 (the sort in step 3 is enough).

We can separate this refactor into multiple small changes, including:

  1. implement a new memory cache mechanism that always receives messages from TiKV and buffers them in memory if they are not consumed fast enough (see the sketch below)
  2. limit the memory usage of optimization 1
  3. forward messages to the sink module regardless of the global resolved ts, and only consider the global resolved ts when we actually replicate data to the target.
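A sketch of the memory cache in change 1, with a byte quota covering change 2 (hypothetical names, not the actual CDC code):

package buffersketch

import "sync"

// memBuffer decouples the kv client (producer) from the sorter/sink
// (consumer): instead of blocking on a small fixed-size channel, the
// producer appends to an in-memory queue guarded by a byte quota.
type memBuffer struct {
	mu    sync.Mutex
	cond  *sync.Cond
	queue [][]byte
	used  int
	quota int
}

func newMemBuffer(quota int) *memBuffer {
	b := &memBuffer{quota: quota}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// Add blocks only when the memory quota is exhausted, not merely because the
// consumer is momentarily slow.
func (b *memBuffer) Add(msg []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.used+len(msg) > b.quota {
		b.cond.Wait()
	}
	b.queue = append(b.queue, msg)
	b.used += len(msg)
	b.cond.Broadcast()
}

// Next blocks until a message is available and returns it.
func (b *memBuffer) Next() []byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.queue) == 0 {
		b.cond.Wait()
	}
	msg := b.queue[0]
	b.queue = b.queue[1:]
	b.used -= len(msg)
	b.cond.Broadcast()
	return msg
}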

Add a straightforward replication status query command

Feature Request

Is your feature request related to a problem? Please describe:

Currently a user can't detect whether a replication task is running normally.

Describe the feature you'd like:

TiCDC should collect replication status and provide a convenient way to query it; the status may include:

  • health status of each capture
  • replication lag, including resolved ts and checkpoint ts
  • replication speed
  • more replication details, such as table assignment

Make region cache used in kv client more robust

Feature Request

Is your feature request related to a problem? Please describe:

  • first we set up a TiDB cluster and a CDC replication job, with the following TiKV config, which aims to cause lots of region splits during the benchmark:
[coprocessor]
region-max-keys = 1200
region-split-keys = 1000
  • start a sysbench test, check whether CDC replication works well

Describe the feature you'd like:

Currently CDC reuses the region cache library from TiDB and is able to handle normal region splits. But in the above benchmark scenario it always fails.
We need to dig into this problem and make the kv client more robust.

kv client panic

test on pr #308

fatal error: sync: RUnlock of unlocked RWMutex

goroutine 508 [running]:
runtime.throw(0x1e130ef, 0x21)
	runtime/panic.go:774 +0x72 fp=0xc0017637f0 sp=0xc0017637c0 pc=0x42f612
sync.throw(0x1e130ef, 0x21)
	runtime/panic.go:760 +0x35 fp=0xc001763810 sp=0xc0017637f0 pc=0x42f595
sync.(*RWMutex).rUnlockSlow(0xc0008b64a0, 0xc0bfffffff)
	sync/rwmutex.go:80 +0x3f fp=0xc001763838 sp=0xc001763810 pc=0x46f42f
sync.(*RWMutex).RUnlock(...)
	sync/rwmutex.go:70
github.com/pingcap/ticdc/cdc/kv.(*CDCClient).receiveFromStream(0xc0006365d0, 0x2128360, 0xc000944000, 0xc0001cef90, 0xc000051fe0, 0x12, 0x5, 0x2145bc0, 0xc0007405a0, 0xc000445860, ...)
	github.com/pingcap/ticdc@/cdc/kv/client.go:563 +0x3a4 fp=0xc001763ec8 sp=0xc001763838 pc=0x14391b4
github.com/pingcap/ticdc/cdc/kv.(*CDCClient).dispatchRequest.func1(0xc0008ad768, 0x0)
	github.com/pingcap/ticdc@/cdc/kv/client.go:290 +0xc2 fp=0xc001763f58 sp=0xc001763ec8 pc=0x1443612
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc0001cef90, 0xc00016ad20)
	golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x64 fp=0xc001763fd0 sp=0xc001763f58 pc=0xe0da34
runtime.goexit()
	runtime/asm_amd64.s:1357 +0x1 fp=0xc001763fd8 sp=0xc001763fd0 pc=0x45f131
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66

full stdout log: http://139.219.11.38:8000/suZMH/20200306_1258_cdc_stdout.log
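For context, "sync: RUnlock of unlocked RWMutex" means RUnlock was called more times than RLock on the same mutex, typically on an error path. A minimal sketch of the bug pattern and the usual fix (hypothetical types, not the actual client.go code):

package muxsketch

import "sync"

type streams struct {
	mu sync.RWMutex
	m  map[uint64]string
}

// Buggy: when the id is missing we unlock inside the branch but still fall
// through to the unconditional RUnlock below -- the second call is exactly
// the "sync: RUnlock of unlocked RWMutex" fatal error.
func (s *streams) getBuggy(id uint64) string {
	s.mu.RLock()
	v, ok := s.m[id]
	if !ok {
		s.mu.RUnlock()
		// forgot to return here
	}
	s.mu.RUnlock()
	return v
}

// Fixed: defer guarantees the lock is released exactly once on every path.
func (s *streams) get(id uint64) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.m[id]
}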

changefeed restriction

When a user creates a new changefeed, TiCDC should do some pre-verification of the changefeed config, including:

  • The changefeed StartTs must be larger than tikv_gc_safe_point.
  • The data flow must be unique, which means TiCDC can't have more than one changefeed replicating the same data to the same sink.
    • Here we should define what a unique data flow is, maybe (schema, table, target sink)?
    • We also need to set a boundary for each definition, for example:
      • If there are three downstream TiDB addresses that share the same TiKV store, how can TiCDC recognize that?
      • The table-filter in the changefeed config can vary; is it possible to check table conflicts accurately?
    • What should TiCDC do when it detects a data flow conflict?

Naming

If this tool can be used with TiKV alone in the future, then the name tidb-cdc is not accurate.

create database fails if the schema `test` does not exist in the downstream

[2020/03/07 10:53:45.345 -05:00] [INFO] [mysql.go:97] ["execute DDL failed, but error can be ignored"] [query="create database cdc_bench"] [error="Error 1049: Unknown database 'test'"] [errorVerbose="Error 1049: Unknown database 'test'
github.com/pingcap/errors.AddStack
        github.com/pingcap/[email protected]/errors.go:174
github.com/pingcap/errors.Trace
        github.com/pingcap/[email protected]/juju_adaptor.go:15
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDDL
        github.com/pingcap/ticdc@/cdc/sink/mysql.go:109
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDDLWithMaxRetries.func1
        github.com/pingcap/ticdc@/cdc/sink/mysql.go:95
github.com/pingcap/ticdc/pkg/retry.Run.func1
        github.com/pingcap/ticdc@/pkg/retry/retry.go:31
github.com/cenkalti/backoff.RetryNotify
        github.com/cenkalti/[email protected]+incompatible/retry.go:37
github.com/cenkalti/backoff.Retry
        github.com/cenkalti/[email protected]+incompatible/retry.go:24
github.com/pingcap/ticdc/pkg/retry.Run
        github.com/pingcap/ticdc@/pkg/retry/retry.go:30
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).execDDLWithMaxRetries
        github.com/pingcap/ticdc@/cdc/sink/mysql.go:94
github.com/pingcap/ticdc/cdc/sink.(*mysqlSink).EmitDDLEvent
        github.com/pingcap/ticdc@/cdc/sink/mysql.go:89
github.com/pingcap/ticdc/cdc.(*changeFeed).handleDDL
        github.com/pingcap/ticdc@/cdc/owner.go:900
github.com/pingcap/ticdc/cdc.(*ownerImpl).handleDDL
        github.com/pingcap/ticdc@/cdc/owner.go:811
github.com/pingcap/ticdc/cdc.(*ownerImpl).run
        github.com/pingcap/ticdc@/cdc/owner.go:1118
github.com/pingcap/ticdc/cdc.(*ownerImpl).Run
        github.com/pingcap/ticdc@/cdc/owner.go:1076
github.com/pingcap/ticdc/cdc.(*Capture).Start.func1
        github.com/pingcap/ticdc@/cdc/capture.go:150
golang.org/x/sync/errgroup.(*Group).Go.func1
        golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
        runtime/asm_amd64.s:1357"]

Refine command line help output

Feature Request

Is your feature request related to a problem? Please describe:

The output from cdc -h is not straightforward and does not give a newbie an easy way to start using CDC quickly.

Describe the feature you'd like:

  • Provide straightforward help content.
  • Provide a brief, clear instruction in the help output for starting a CDC service.

Capture info is deleted when owner changed multiple times

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

start a CDC server

  2. What did you expect to see?

cdc cli capture list returns the server

  3. What did you see instead?
[2020/03/18 19:44:09.797 +08:00] [INFO] [root.go:47] ["init log"] [file=ticdc_1.log] [level=debug]
[2020/03/18 19:44:09.797 +08:00] [INFO] [version.go:34] ["Welcome to Change Data Capture (CDC)"] [release-version=v4.0.0-beta.2] [git-hash=63b1db95df26ef914bc1f1dc29ddfa4936100ff8] [git-branch=master] [utc-build-time="2020-03-13 09:45:32"] [go-version="go version go1.13 linux/amd64"]
[2020/03/18 19:44:09.797 +08:00] [INFO] [server.go:76] ["creating CDC server"] [pd-addr=http://hw-dt-wms-warp1-tidb01:2379] [status-host=127.0.0.1] [status-port=8301]
[2020/03/18 19:44:09.804 +08:00] [INFO] [capture.go:96] ["creating capture"] [capture-id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 19:44:09.805 +08:00] [INFO] [client.go:134] ["[pd] create pd client with endpoints"] [pd-address="[http://hw-dt-wms-warp1-tidb01:2379]"]
[2020/03/18 19:44:09.812 +08:00] [INFO] [base_client.go:226] ["[pd] update member urls"] [old-urls="[http://hw-dt-wms-warp1-tidb01:2379]"] [new-urls="[http://10.232.0.109:2379,http://10.232.0.166:2379,http://10.232.0.212:2379]"]
[2020/03/18 19:44:09.812 +08:00] [INFO] [base_client.go:242] ["[pd] switch leader"] [new-leader=http://10.232.0.212:2379] [old-leader=]
[2020/03/18 19:44:09.812 +08:00] [INFO] [base_client.go:92] ["[pd] init cluster id"] [cluster-id=6804742633952162675]
[2020/03/18 19:44:09.812 +08:00] [INFO] [http_status.go:54] ["status http server is running"] [addr=127.0.0.1:8301]
[2020/03/18 19:44:09.819 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 19:44:09.819 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 19:44:09.819 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ea25e]
[2020/03/18 19:44:10.317 +08:00] [INFO] [owner.go:1263] ["start to watch processors"]
[2020/03/18 19:44:10.318 +08:00] [INFO] [owner.go:1213] ["monitoring processors"] [key=/tidb/cdc/processor/info] [rev=93442]
[2020/03/18 21:49:11.764 +08:00] [DEBUG] [manager.go:274] ["lost owner role, send retire notification"]
[2020/03/18 21:49:11.764 +08:00] [WARN] [manager.go:229] ["lost owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:11.764 +08:00] [INFO] [manager.go:187] ["etcd session is done, creates a new one"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:11.765 +08:00] [ERROR] [owner.go:1272] ["watch processor failed"] []
[2020/03/18 21:49:13.448 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:13.448 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:49:13.448 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ee261]
[2020/03/18 21:49:13.733 +08:00] [INFO] [owner.go:1263] ["start to watch processors"]
[2020/03/18 21:49:13.734 +08:00] [INFO] [owner.go:1213] ["monitoring processors"] [key=/tidb/cdc/processor/info] [rev=97243]
[2020/03/18 21:51:42.785 +08:00] [DEBUG] [manager.go:274] ["lost owner role, send retire notification"]
[2020/03/18 21:51:42.785 +08:00] [WARN] [manager.go:229] ["lost owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:51:42.785 +08:00] [INFO] [manager.go:187] ["etcd session is done, creates a new one"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:51:42.785 +08:00] [ERROR] [owner.go:1272] ["watch processor failed"] []
[2020/03/18 21:51:55.899 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:06.900 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:17.901 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:28.901 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: request timed out"]
[2020/03/18 21:52:30.285 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:52:30.286 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:52:30.286 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ee387]
[2020/03/18 21:52:30.410 +08:00] [INFO] [owner.go:1263] ["start to watch processors"]
[2020/03/18 21:52:30.411 +08:00] [INFO] [owner.go:1213] ["monitoring processors"] [key=/tidb/cdc/processor/info] [rev=97424]
[2020/03/18 21:54:06.330 +08:00] [INFO] [manager.go:301] ["watch failed, owner is deleted"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.330 +08:00] [DEBUG] [manager.go:274] ["lost owner role, send retire notification"]
[2020/03/18 21:54:06.330 +08:00] [WARN] [manager.go:229] ["lost owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.330 +08:00] [ERROR] [owner.go:1272] ["watch processor failed"] []
[2020/03/18 21:54:06.331 +08:00] [ERROR] [manager.go:215] ["failed to campaign"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: requested lease not found"]
[2020/03/18 21:54:06.333 +08:00] [INFO] [manager.go:207] ["etcd session encounters the error of lease not found, closes it"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462] [error="etcdserver: requested lease not found"]
[2020/03/18 21:54:06.333 +08:00] [INFO] [manager.go:187] ["etcd session is done, creates a new one"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.340 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.340 +08:00] [INFO] [manager.go:223] ["campaign to be owner"] [id=9fa3ce40-00e8-4153-873c-4dfcbf4b3462]
[2020/03/18 21:54:06.340 +08:00] [DEBUG] [manager.go:269] ["watch owner key"] [key=/tidb/cdc/capture/owner/6aab70e2c63ee446]

(WeCom screenshot attached: 企业微信截图_d73234a8-2379-49ab-bc71-3b4b9bb83d08)

`sh tests/run.sh --debug` succeeds but the server actually fails to start

This happens very easily.

Everything looks fine:

...
| tikv_gc_life_time     | 10m0s                                                                                           | All versions within life time will not be collected by GC, at least 10m, in Go format. |
| tikv_gc_last_run_time | 20200218-18:51:14 +0800                                                                         | The time when last GC starts. (DO NOT EDIT)                                            |
| tikv_gc_safe_point    | 20200218-18:41:14 +0800                                                                         | All versions after safe point can be accessed. (DO NOT EDIT)                           |
+-----------------------+-------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
create changefeed ID: 04ab9a26-a5fb-42a4-a89f-7853ebf725e4 detail {"sink-uri":"root@tcp(127.0.0.1:3306)/","opts":{},"create-time":"2020-02-18T18:51:18.754343+08:00","start-ts":414717857956102145,"target-ts":0,"admin-job-type":0,"config":{"filter-case-sensitive":false,"filter-rules":null,"ignore-txn-commit-ts":null}}
You may now debug from another terminal. Press [ENTER] to exit.

Log of the cdc server:

Error: run server: create change feed 04ab9a26-a5fb-42a4-a89f-7853ebf725e4: create schema store failed: [tikv:9001]PD server timeout
Usage:
  cdc server [flags]

Flags:
  -h, --help                  help for server
      --pd-endpoints string   endpoints of PD, separated by comma (default "http://127.0.0.1:2379")
      --status-addr string    bind address for http status server (default "127.0.0.1:8300")

Global Flags:
      --log-file string    log file path (default "cdc.log")
      --log-level string   log level (etc: debug|info|warn|error) (default "debug")

run server: create change feed 04ab9a26-a5fb-42a4-a89f-7853ebf725e4: create schema store failed: [tikv:9001]PD server timeout
 +08:00] [INFO] [client.go:134] ["[pd] create pd client with endpoints"] [pd-address="[http://127.0.0.1:2379]"]
[2020/02/18 18:51:17.738 +08:00] [INFO] [base_client.go:242] ["[pd] switch leader"] [new-leader=http://127.0.0.1:2379] [old-leader=]
[2020/02/18 18:51:17.738 +08:00] [INFO] [base_client.go:92] ["[pd] init cluster id"] [cluster-id=6794737329153784617]
[2020/02/18 18:51:17.738 +08:00] [INFO] [http_status.go:54] ["status http server is running"] [addr=0.0.0.0:8300]
[2020/02/18 18:51:17.771 +08:00] [INFO] [manager.go:253] ["get owner"] [ownerID=8db4a77e-bf7d-4a20-bc13-de2975abc096]

Error 1298: Unknown or incorrect time zone: 'UTC'

topology:
Upstream TiDB: 172.16.6.206
Downstream: MySQL 5.6.46

1. Version Info

[tidb@localhost tidb-ansible]$ /data1/tidb/deploy/bin/tidb-server -V
Release Version: v4.0.0-alpha-516-g5466a3c31
Git Commit Hash: 5466a3c31bf4b93fb3a2c595dd6aeac46aca7b8e
Git Branch: master
UTC Build Time: 2019-10-14 03:55:02
GoVersion: go version go1.13 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

[tidb@localhost tidb-ansible]$ /data1/tidb/deploy/bin/tikv-server -V
TiKV
Release Version:   4.0.0-alpha
Git Commit Hash:   56dc6d63ade182289c4ab1e37996746040bc07d6
Git Commit Branch: cdc
UTC Build Time:    2019-11-06 03:29:57
Rust Version:      rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)

[tidb@localhost tidb-ansible]$ /data1/tidb/deploy/bin/pd-server -V
Release Version: v4.0.0-alpha-191-g7811255c
Git Commit Hash: 7811255c7345503ed5f44afb981bbf9712fd25c6
Git Branch: master
UTC Build Time:  2019-12-06 05:07:33

2. Reproduce steps

mysql -h 127.0.0.1 -P 4000 -u root

CREATE table test.simple1(id int primary key, val int);
CREATE table test.simple2(id int primary key, val int);

## start_ts=$(($(date +%s%N | cut -b1-13)<<18)) => 413114580074496000

INSERT INTO test.simple1(id, val) VALUES (1, 1);
INSERT INTO test.simple1(id, val) VALUES (2, 2);
INSERT INTO test.simple1(id, val) VALUES (3, 3);
UPDATE test.simple1 set val = 22 where id = 2;
DELETE from test.simple1 where id = 3

mysql -h 127.0.0.1 -P 3306 -u root -e 'create database test'

nohup /home/tidb/cdc server --pd-endpoints http://172.16.6.206:2379 &

/home/tidb/cdc cli --pd-addr http://172.16.6.206:2379 --start-ts=413114580074496000 --sink-uri 'root@tcp(127.0.0.1:3306)/test'

3. Expected and Got

(screenshot of the expected and actual results attached)

cdc.log:

[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE TABLE if not exists mysql.stats_top_n (\n\t\ttable_id bigint(64) NOT NULL,\n\t\tis_index tinyint(2) NOT NULL,\n\t\thist_id bigint(64) NOT NULL,\n\t\tvalue longblob,\n\t\tcount bigint(64) UNSIGNED NOT NULL,\n\t\tindex tbl(table_id, is_index, hist_id)\n\t);"] [job="ID:38, Type:create table, State:synced, SchemaState:public, SchemaID:3, TableID:37, RowCount:0, ArgLen:0, start time: 2019-12-09 21:14:55.003 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=mysql.stats_top_n] [id=37]
[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE TABLE IF NOT EXISTS mysql.expr_pushdown_blacklist (\n\t\tname char(100) NOT NULL\n\t);"] [job="ID:40, Type:create table, State:synced, SchemaState:public, SchemaID:3, TableID:39, RowCount:0, ArgLen:0, start time: 2019-12-09 21:14:55.103 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.843 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=mysql.expr_pushdown_blacklist] [id=39]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE TABLE IF NOT EXISTS mysql.opt_rule_blacklist (\n\t\tname char(100) NOT NULL\n\t);"] [job="ID:42, Type:create table, State:synced, SchemaState:public, SchemaID:3, TableID:41, RowCount:0, ArgLen:0, start time: 2019-12-09 21:14:55.153 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=mysql.opt_rule_blacklist] [id=41]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE table test.simple1(id int primary key, val int)"] [job="ID:44, Type:create table, State:synced, SchemaState:public, SchemaID:1, TableID:43, RowCount:0, ArgLen:0, start time: 2019-12-09 21:17:06.003 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=test.simple1] [id=43]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:302] ["handle job: "] ["sql query"="CREATE table test.simple2(id int primary key, val int)"] [job="ID:46, Type:create table, State:synced, SchemaState:public, SchemaID:1, TableID:45, RowCount:0, ArgLen:0, start time: 2019-12-09 21:17:12.253 -0500 EST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2019/12/09 21:20:10.844 -05:00] [DEBUG] [storage.go:221] ["create table success"] [name=test.simple2] [id=45]
[2019/12/09 21:20:10.845 -05:00] [DEBUG] [client.go:228] ["singleEventFeed quit"]
[2019/12/09 21:20:10.845 -05:00] [INFO] [processor.go:353] ["Checkpoint worker exited"]
[2019/12/09 21:20:10.845 -05:00] [INFO] [client.go:235] ["EventFeed disconnected"] [span="{\"Start\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABs\",\"End\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABt\"}"] [checkpoint=413124368270098433] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).singleEventFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:408\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:227\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:30\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:215\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).EventFeed.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:188\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2019/12/09 21:20:10.845 -05:00] [INFO] [scheduler.go:313] ["stop to run processor"] ["changefeed id"=245b6079-015f-4707-9f18-78bca094b6cf]
[2019/12/09 21:20:10.846 -05:00] [DEBUG] [client.go:228] ["singleEventFeed quit"]
[2019/12/09 21:20:10.846 -05:00] [ERROR] [server.go:80] ["run server"] [error="Error 1298: Unknown or incorrect time zone: 'UTC'\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/ticdc/cdc/sink.(*mysqlSink).Emit\n\tgithub.com/pingcap/ticdc@/cdc/sink/mysql.go:141\ngithub.com/pingcap/ticdc/cdc.(*processor).syncResolved\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:587\ngithub.com/pingcap/ticdc/cdc.(*processor).Run.func3\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:283\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2019/12/09 21:20:10.846 -05:00] [INFO] [client.go:235] ["EventFeed disconnected"] [span="{\"Start\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABs\",\"End\":\"bURETEpvYkxp/3N0AAAAAAAA+QAAAAAAAABt\"}"] [checkpoint=413124368270098433] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).singleEventFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:408\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:227\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/[email protected]+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:30\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).partialRegionFeed\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:215\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).EventFeed.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:188\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2019/12/09 21:20:10.846 -05:00] [DEBUG] [capture_info.go:128] ["watchC from etcd close normally"]
[2019/12/09 21:20:10.846 -05:00] [INFO] [owner.go:372] ["handleWatchCapture quit"]
[2019/12/09 21:20:10.846 -05:00] [DEBUG] [etcd.go:205] ["update subchangefeed info success"] ["changefeed id"=6cdfb9e6-e0ec-4933-bd77-b269946cd685] ["capture id"=a3d0a077-497e-4b4a-a7c3-cb186e9e110d] [modRevision=232] [info="{\"checkpoint-ts\":0,\"resolved-ts\":413124368270098433,\"table-infos\":[{\"id\":45,\"start-ts\":413124328229699584}],\"table-p-lock\":null,\"table-c-lock\":null}"]
[2019/12/09 21:20:10.846 -05:00] [INFO] [processor.go:330] ["Local resolved worker exited"]

failed to start cdc with multiple pd address

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
./cdc cli --pd-addr=172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329 --sink-uri="mysql://root:[email protected]:13307/" --start-ts 0
Error: [pd] failed to get cluster id
  2. What did you expect to see?

start a changefeed

  3. What did you see instead?
[pd] failed to get cluster id"] [url=http://172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329] [error="error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329: too many colons in address\" target:172.16.5.83:2329,172.16.5.84:2329,172.16.5.89:2329 status:TRANSIENT_FAILURE"] 
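The "too many colons in address" error suggests the comma-separated --pd-addr value is being dialed as a single address. A sketch of the kind of parsing that would avoid this (hypothetical helper, not the actual cli code):

package pdsketch

import (
	"fmt"
	"strings"
)

// parsePDAddrs splits the --pd-addr flag into individual endpoints before
// they are handed to the PD client; passing the raw comma-separated string
// as one URL is what produces "too many colons in address".
func parsePDAddrs(flag string) ([]string, error) {
	var addrs []string
	for _, a := range strings.Split(flag, ",") {
		a = strings.TrimSpace(a)
		if a == "" {
			continue
		}
		if !strings.Contains(a, "://") {
			a = "http://" + a // assume plain host:port means http
		}
		addrs = append(addrs, a)
	}
	if len(addrs) == 0 {
		return nil, fmt.Errorf("no PD endpoint found in %q", flag)
	}
	return addrs, nil
}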

Set up capture to use mysqlSink by default

Our first milestone is to make CDC work for syncing to MySQL/TiDB, which corresponds to the mysqlSink in our code.

We should update the way Capture interacts with Sink so that the necessary information can be passed in to create a mysqlSink.

UCP: Refine log and stdout of cdc cli tool

Description

Currently, TiCDC cli outputs logs to cdc.log by default, which is not reasonable for a command-line tool and makes it hard for users to debug problems. However, if we log to stdout directly, we may get a lot of noisy logs.

To improve user experience when using cdc cli, we can:

  1. By default, output logs to /tmp/cdc.log,
  2. print necessary information to stdout, including error messages, running results, etc.

Score

  • 300

Mentor(s)

Recommended Skills

  • Go programming

Add test for alter pk in upstream

Feature Request

Is your feature request related to a problem? Please describe:

  • set alter-pk =true
  • alter table sbtest_pk add primary key(id);

check downstream status

Describe the feature you'd like:

Describe alternatives you've considered:

[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:445] ["handle job: "] ["sql query"="alter table sbtest_pk add primary key(id)"] [job="ID:3484, Type:add primary key, State:synced, SchemaState:public, SchemaID:3413, TableID:3482, RowCount:0, ArgLen:0, start time: 2020-03-11 23:37:14.406 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:415220638759256066"]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:445] ["handle job: "] ["sql query"="drop table sbtest_pk"] [job="ID:3485, Type:drop table, State:synced, SchemaState:none, SchemaID:3413, TableID:3482, RowCount:0, ArgLen:0, start time: 2020-03-11 23:45:02.856 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:367] ["drop table success"] [name=sbtest_pk] [id=3482]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:445] ["handle job: "] ["sql query"="CREATE TABLE `sbtest_pk` (   `id` int(11) NOT NULL,   `k` int(11) NOT NULL DEFAULT '0',   `c` char(120) NOT NULL DEFAULT '',   `pad` char(60) NOT NULL DEFAULT '' )"] [job="ID:3487, Type:create table, State:synced, SchemaState:public, SchemaID:3413, TableID:3486, RowCount:0, ArgLen:0, start time: 2020-03-11 23:45:08.256 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2020/03/11 23:47:04.772 +08:00] [DEBUG] [schema_storage.go:383] ["create table success"] [name=cdc_sbtest.sbtest_pk] [id=3486]

run server: schema 68 not found

topology:
Upstream TiDB: 172.16.5.86
Downstream: MySQL 5.7.28

1. Version Info

/data1/deploy1/bin/tidb-server -V
Release Version: v4.0.0-alpha-1148-g5da10ffec
Git Commit Hash: 5da10ffecc280136b2041801b23034c557e41751
Git Branch: HEAD
UTC Build Time: 2019-12-12 03:12:21
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306

/data1/deploy1/bin/tikv-server -V
TiKV
Release Version:   4.0.0-alpha
Git Commit Hash:   38579ea3e2ed08dc5bd724b2c0cda82b4588c42f
Git Commit Branch: master
UTC Build Time:    2019-12-09 04:37:17
Rust Version:      rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)

/data1/deploy1/bin/tikv-server -V
TiKV
Release Version:   4.0.0-alpha
Git Commit Hash:   38579ea3e2ed08dc5bd724b2c0cda82b4588c42f
Git Commit Branch: master
UTC Build Time:    2019-12-09 04:37:17
Rust Version:      rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)
[tidb@localhost tidb-ansible]$ /data1/deploy1/bin/pd-server -V
Release Version: v4.0.0-alpha-197-gbd7b3f46
Git Commit Hash: bd7b3f46eef5dfb8241bcdcea27c68454b2f1f1c
Git Branch: master
UTC Build Time:  2019-12-12 02:16:14

2. Reproduce steps

../go-tpc/bin/go-tpc --time=400m tpch --host 172.16.5.86 -P 4000 -T 1 --sf=1 prepare // load data
// get ts
+ mysql -h 172.16.5.86 -uroot -P4000 -e 'drop database if exists tmp_db' // create table
+ mysql -h 172.16.5.86 -uroot -P4000 -e 'create database tmp_db'

./resources/bin/cdc server --pd-endpoints http://172.16.5.86:2379 > cdc_server.log

3. Expected and Got

$ cat cdc_server.log

Error: run server: schema 68 not found
Usage:
  cdc server [flags]

Flags:
  -h, --help                  help for server
      --pd-endpoints string   endpoints of PD, separated by comma (default "http://127.0.0.1:2379")
      --status-addr string    bind address for http status server (default "127.0.0.1:8300")

Global Flags:
      --log-file string    log file path (default "cdc.log")
      --log-level string   log level (etc: debug|info|warn|error) (default "debug")

run server: schema 68 not found

cdc.log
ddl_history.log

Some tables failed to sync to the downstream mysql database

Version Info

/ # ./cdc version
Release Version:
Git Commit Hash: 80ee381230d5b7a3181464ad874f9a54c9220184
Git Branch: master
UTC Build Time: 2019-12-13 09:54:47
Go Version: go version go1.13.4 linux/amd64 

sh-4.2# ./tidb-server -V
Release Version: v4.0.0-alpha-516-g5466a3c31
Git Commit Hash: 5466a3c31bf4b93fb3a2c595dd6aeac46aca7b8e
Git Branch: HEAD
UTC Build Time: 2019-12-02 09:22:52
GoVersion: go version go1.13.4 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

sh-4.2# ./tikv-server -V
TiKV
Release Version:   4.0.0-alpha
Git Commit Hash:   56dc6d63ade182289c4ab1e37996746040bc07d6
Git Commit Branch: cdc
UTC Build Time:    2019-11-06 03:29:57
Rust Version:      rustc 1.39.0-nightly (c6e9c76c5 2019-09-04)

/ # ./pd-server -V
Release Version: v4.0.0-alpha-200-gf7f643c61
Git Commit Hash: f7f643c6138cc5240d954bfa1a560e3b14bfdc6e
Git Branch: HEAD
UTC Build Time:  2019-12-13 11:42:23

Description

We used test-infra to test cdc and found that some tables failed to sync to MySQL, with some error logs in the cdc log.

tidb tables: 
+----------------+
| Tables_in_test |
+----------------+
| amfev          |
| cmqwqm         |
| dcnlvf         |
| dofvv          |
| eacnnohz       |
| iyotimi        |
| mnrbu          |
| mvtabkee       |
| nqjftinj       |
| phfxkijuy      |
| pklmxor        |
| sflzhfns       |
| sfrpmvpa       |
| sxdho          |
| t1576252933    |
| wzgzdkdwq      |
| xlfmlygpi      |
| zpmlqvdcr      |
+----------------+
18 rows in set (0.00 sec)

mysql tables: 
+----------------+
| Tables_in_test |
+----------------+
| amfev          |
| cmqwqm         |
| dcnlvf         |
| dofvv          |
| eacnnohz       |
| mvtabkee       |
| phfxkijuy      |
| pklmxor        |
| sflzhfns       |
| sfrpmvpa       |
| sxdho          |
| wzgzdkdwq      |
| xlfmlygpi      |
| zpmlqvdcr      |
+----------------+
14 rows in set (0.00 sec)

cdc log:

[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]
[2019/12/13 16:05:14.537 +00:00] [ERROR] [client.go:371] ["RPC error"] [error="rpc error: code = Canceled desc = context canceled"]

the full log: http://139.219.11.38:8000/qTl2Q/cdc-2019-12-13T16-05-14.537.log

Support TLS and online reload with new certs

Description

Support TLS and online reloading of new certs.

Category

  • Security

Task list

  • Enable TLS for PD client and TiKV Client #347
  • Enable TLS for HTTP client #347
  • Enable TLS for MySQL sink #742
  • Enable TLS for Kafka sink #764
  • Support reloading certs online. #347 (see the sketch after this list)
  • Validate Common Name #747
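A sketch of one way to support online cert reloading on the client side, using Go's tls.Config.GetClientCertificate so rotated certs are picked up on the next handshake (hypothetical helper, not the actual TiCDC code):

package tlssketch

import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

// newReloadingTLSConfig builds a client tls.Config that re-reads the cert and
// key from disk on every handshake, so rotated certs are picked up without
// restarting the server. CA reloading is left out for brevity.
func newReloadingTLSConfig(caPath, certPath, keyPath string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cfg := &tls.Config{
		RootCAs: pool,
		// Called for every new connection: loading the files here means a
		// cert rotated on disk is used by the next handshake automatically.
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certPath, keyPath)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}
	return cfg, nil
}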

Value

Value Description

  • Improves security.
  • Allows TiCDC to be used on public cloud platforms.

Value Score

Optional value: 1~5

  • 4

Workload Estimation

1 Point for 1 Person/Work Day

  • 5 Points
