archived-data-bus's People

Contributors

amareshwari, inders, raju-bairishetti, rajubairishetti

archived-data-bus's Issues

Databus stalled use cases

LocalStreamService, MergedStreamService, and MirrorStreamService use the following COMMIT protocol to guarantee 0% data loss:

COMMIT Protocol

  1. Producer writes the data getting published in a META file.
  2. Producer publishes the data.
  3. Consumer reads the list of data to PULL from the META file published in [1].
  4. Consumer pulls the data and publishes it in an atomic fashion.
  5. Consumer deletes/commits the META file from [1].
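
To make the steps concrete, here is a minimal sketch of both sides of this protocol against the Hadoop FileSystem API; the class name, method shapes, and path layout are illustrative assumptions, not the actual Databus code:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CommitProtocolSketch {

      // [1] + [2] Producer side: record every path about to be published
      // in a META file first, then publish the data itself via rename.
      static void produce(FileSystem fs, Path metaFile, Path[] dataPaths,
                          Path publishDir) throws Exception {
        try (FSDataOutputStream out = fs.create(metaFile)) {       // [1]
          for (Path p : dataPaths) {
            out.writeBytes(new Path(publishDir, p.getName()) + "\n");
          }
        }
        for (Path p : dataPaths) {                                 // [2]
          fs.rename(p, new Path(publishDir, p.getName()));
        }
      }

      // [3]-[5] Consumer side: read the list of paths to PULL from the META
      // file, move each one atomically, then commit by deleting the META file.
      static void consume(FileSystem fs, Path metaFile, Path destDir)
          throws Exception {
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(metaFile)))) {           // [3]
          String line;
          while ((line = in.readLine()) != null) {
            Path src = new Path(line);
            fs.rename(src, new Path(destDir, src.getName()));      // [4]
          }
        }
        fs.delete(metaFile, false);                                // [5]
      }
    }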

When can Databus get stalled?

  1. Producer dies after [1] and before [2].
  2. Producers in DATABUS are highly available through ZooKeeper.
  3. Another producer starts and follows the COMMIT protocol described above to publish more data.
  4. All consumers in DATABUS (MergedStreamService, LocalStreamService) use DISTCP (augmented to support -preserveSourcePath - https://github.com/InMobi/DistCpV2-0.20.203) with the -f option.
  5. DISTCP with the -f option errors out if one of the paths in the file list isn't present at SOURCE.

Side effect - Databus (MergedStreamService, MirrorStreamService) can stop consuming data, thereby resulting in DATABUS getting stalled.

Current solution - Have operational monitoring on the /databus/system/mirrors and /databus/system/consumers directories for each instance of Databus. If the number of files in these directories exceeds a threshold (10), stop Databus, then cat these files into a single file in the respective directory, keeping only the entries whose files still exist on HDFS and skipping the rest. Then start DATABUS.
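
A sketch of that compaction step, assuming a plain FileSystem client; the merged-file name is an illustrative choice, while the threshold (10) and the skip-if-missing rule come from the issue text:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StalledMetaCompactor {
      static final int THRESHOLD = 10;  // from the issue text

      // Merge the META files under dir (/databus/system/consumers or
      // /databus/system/mirrors) into a single file, keeping only entries
      // that still exist on HDFS. Run this with Databus stopped.
      static void compact(FileSystem fs, Path dir) throws Exception {
        FileStatus[] metas = fs.listStatus(dir);
        if (metas == null || metas.length <= THRESHOLD) {
          return;  // below threshold: leave Databus running
        }
        Path merged = new Path(dir, "merged-" + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(merged)) {
          for (FileStatus meta : metas) {
            try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(meta.getPath())))) {
              String line;
              while ((line = in.readLine()) != null) {
                if (fs.exists(new Path(line))) {  // skip paths gone at SOURCE
                  out.writeBytes(line + "\n");
                }
              }
            }
            fs.delete(meta.getPath(), false);  // replaced by the merged file
          }
        }
      }
    }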

DataConsumer tmpPath cleanup should happen on each run

Issue1 -

  1. Today the tmpPath for DataConsumer is created once and is not cleaned up once an iteration is over.
  2. If some files are left in it due to some issue (i.e. the process didn't succeed), the next run gets more files than expected, and the commit doesn't happen going forward because there is a check that the number of input files to be moved equals the number of files compressed.
  3. Fix - clean up the tmpOutPath on each run of DataConsumer (see the sketch after this list).
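
The fix amounts to a recursive delete before each iteration; a minimal sketch, assuming the path is passed in by the caller (names are illustrative):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TmpPathCleanup {
      // Wipe and recreate the tmp output path at the start of every
      // DataConsumer run so stale files from a failed iteration cannot
      // break the "#input files == #compressed files" commit check.
      static void cleanupTmpOutPath(FileSystem fs, Path tmpOutPath)
          throws Exception {
        fs.delete(tmpOutPath, true);  // recursive; a no-op if the path is absent
        fs.mkdirs(tmpOutPath);
      }
    }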

Issue2 -

  1. Today, when the remote puller starts and has to pull multiple files from a remote cluster, it creates one file in the remotePath which holds the list of all paths to be pulled. If distcp fails after this, that file is not cleaned up, and the next distcp run will consider it as input too, thereby failing repeatedly because of duplicate file names.

Fix - this file should be created on the destination cluster where the data is being pulled, with a timestamp-based name inside tmp (sketched below).
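
A sketch of the proposed fix, writing the distcp -f file list on the destination cluster under a timestamp-based name; the tmp directory and naming pattern are assumptions:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PullListSketch {
      // Write the distcp -f file list on the DESTINATION cluster under a
      // tmp dir, named by timestamp, so a list left behind by a failed run
      // never collides with or pollutes the next run's input.
      static Path writeFileList(FileSystem destFs, Path tmpDir,
                                String[] srcPaths) throws Exception {
        String ts = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss")
            .format(new Date());
        Path listFile = new Path(tmpDir, "distcp-filelist-" + ts);
        try (FSDataOutputStream out = destFs.create(listFile)) {
          for (String p : srcPaths) {
            out.writeBytes(p + "\n");
          }
        }
        return listFile;
      }
    }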

Avoid minimal data replay in the corner case of catastrophic failure of DATABUS

LocalStreamService, MergedStreamService, and MirrorStreamService use the following COMMIT protocol to guarantee 0% data loss:

COMMIT Protocol

  1. Producer writes the data getting published in a META file.
  2. Producer publishes the data.
  3. Consumer reads the list of data to PULL from the META file published in [1].
  4. Consumer pulls the data and publishes it in an atomic fashion.
  5. Consumer deletes/commits the META file from [1].

Catastrophic scenarios causing partial DATA at the consumers (MergedStream, MirrorStream)

  1. Consumer does [4] and dies before [5].
  2. Databus is highly available through ZooKeeper, so another instance starts and follows the same COMMIT protocol.
  3. Since some data was published in step 1 above but not committed, that data will be replayed.

Proposed Solution/Design

A transaction in DATABUS consists of moving multiple paths, and DATABUS can fail in between, which makes atomicity tricky.

  1. The COMMIT protocol can be enhanced to use ZK to remember the data being published, per thread (see the sketch after this list):
    (i) if any path exists in ZK which is already copied to DEST, skip it and delete it from ZK
    (ii) record the source and destination path in the ZK node for this thread
    (iii) do the move
    (iv) delete the path from the ZK node
  2. Now, when DATABUS starts another instance through ZK because the first one failed, it does the following:
    (i) get the data to be pulled from the consumers
    (ii) follow the COMMIT protocol described above in [1]
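
A sketch of step [1] using the raw ZooKeeper client; the znode layout, the per-thread node, and the tab-separated payload are assumptions made for illustration:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkCommitSketch {
      // Enhanced COMMIT step [1]: remember each (src -> dst) move under a
      // per-thread znode before doing it, and delete the record afterwards.
      // threadNode (e.g. /databus/commit/thread-0) is an assumed layout.
      static void movePath(ZooKeeper zk, FileSystem fs, String threadNode,
                           Path src, Path dst) throws Exception {
        String znode = threadNode + "/" + src.getName();
        // 1.(i) path recorded in ZK and already copied to DEST: skip + clean up
        if (zk.exists(znode, false) != null && fs.exists(dst)) {
          zk.delete(znode, -1);
          return;
        }
        // 1.(ii) record the source and destination path in the ZK node
        if (zk.exists(znode, false) == null) {
          zk.create(znode, (src + "\t" + dst).getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        fs.rename(src, dst);   // 1.(iii) do the move
        zk.delete(znode, -1);  // 1.(iv) delete the path from the ZK node
      }
    }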

Scenarios when DATABUS fails

Fails after 1.(i) - no issues.
Fails after 1.(ii) - no issues; the next run will see this path isn't copied yet and will copy it (no replay).
Fails after 1.(iii) - the next run will see this PATH is already copied, so it skips the PATH and deletes it from ZK (avoids data replay).
Fails after 1.(iv) - no issues.

Distcp related fixes in Remote Copier of Databus

  1. Remote Copier, which pulls data from multiple sources, uses distcp. It is designed so that the remote copier runs on box1 and pulls changes from box2 to box3. To support this, distcp needs to be refactored to take the hadoop conf from the caller; in this case we will pass box3 as the destination and box2 as the source.
  2. Refactor distcp to avoid System.exit(1) in its main (see the sketch after this list).
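
A sketch of invoking distcp in-process with a caller-supplied Configuration instead of going through main(), assuming the DistCpV2-era DistCp(Configuration, DistCpOptions) API; the box2/box3 URIs are placeholders from the scenario above:

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class RemoteCopierSketch {
      // Run distcp in-process with a Configuration supplied by the caller,
      // instead of via its main() (which calls System.exit on failure).
      // destConf points at box3, the destination cluster; host names and
      // paths below are placeholders.
      static void pull(Configuration destConf) throws Exception {
        DistCpOptions options = new DistCpOptions(
            Collections.singletonList(
                new Path("hdfs://box2:8020/databus/streams")),
            new Path("hdfs://box3:8020/databus/streams"));
        DistCp distcp = new DistCp(destConf, options);
        distcp.execute();  // throws on failure instead of exiting the JVM
      }
    }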

File Name consistency across data hops

  1. Currently the data moves from /logs/ to /intermediateDir/<path in YY/MM/DD/HR/MN>, from where it's moved to the final destination. The intermediate hop, along with a DONE file generated in the intermediate directory, is required to make the movement atomic. The directory structure should be the same across data hops to ease debuggability.

DataMover is resilient to failures while moving data to intermediate directories. It has a cleanupMode which it enters on every run to fix any issues from the last run.
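
A sketch of the atomic hop with the DONE marker, keeping the same YY/MM/DD/HR/MN structure on both sides; the method and helper names are assumptions:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AtomicHopSketch {
      // Move a minute's data through the intermediate directory so the final
      // publish is a single rename; a DONE marker signals the hop completed.
      // The /logs and /intermediateDir roots come from the issue text; the
      // method shape is an assumption.
      static void publishMinute(FileSystem fs, Path minuteData,
                                String yyMmDdHrMn, Path intermediateRoot,
                                Path finalRoot) throws Exception {
        Path staging = new Path(intermediateRoot, yyMmDdHrMn);
        fs.mkdirs(staging.getParent());
        fs.rename(minuteData, staging);                // hop into intermediate
        fs.create(new Path(staging, "DONE")).close();  // mark hop complete
        Path finalDir = new Path(finalRoot, yyMmDdHrMn);  // same structure
        fs.mkdirs(finalDir.getParent());
        fs.rename(staging, finalDir);                  // atomic final publish
      }
    }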

tmpPath for DataConsumer should be cleaned up before every run

If the previous run of DataConsumer wasn't successful, there is a possibility that the tmpPath (/databus/system/tmp//jobIN) might have some old data which wasn't committed. A new run should clean up this directory to avoid a mismatch in the amount of data moved and any duplicate replays which may happen because of it.
