archived-data-bus's People

Contributors

amareshwari, inders, raju-bairishetti, rajubairishetti

archived-data-bus's Issues

Databus stalled use cases

LocalStreamService, MergedStreamService, and MirrorStreamService use the following COMMIT protocol to guarantee 0% data loss:

COMMIT Protocol

  1. Producer writes the data getting published in a META file.
  2. Producer publishes the data.
  3. Consumer reads the list of data to PULL from the META file published in [1].
  4. Consumer pulls the data and publishes it in an atomic fashion.
  5. Consumer deletes/commits the META file from [1].
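
To make the steps concrete, here is a minimal sketch of both sides of this protocol against the Hadoop FileSystem API; the class name, method shapes, and path layout are illustrative assumptions, not the actual Databus code:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CommitProtocolSketch {

      // [1] + [2] Producer side: record every path about to be published
      // in a META file first, then publish the data itself via rename.
      static void produce(FileSystem fs, Path metaFile, Path[] dataPaths,
                          Path publishDir) throws Exception {
        try (FSDataOutputStream out = fs.create(metaFile)) {       // [1]
          for (Path p : dataPaths) {
            out.writeBytes(new Path(publishDir, p.getName()) + "\n");
          }
        }
        for (Path p : dataPaths) {                                 // [2]
          fs.rename(p, new Path(publishDir, p.getName()));
        }
      }

      // [3]-[5] Consumer side: read the list of paths to PULL from the META
      // file, move each one atomically, then commit by deleting the META file.
      static void consume(FileSystem fs, Path metaFile, Path destDir)
          throws Exception {
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(metaFile)))) {           // [3]
          String line;
          while ((line = in.readLine()) != null) {
            Path src = new Path(line);
            fs.rename(src, new Path(destDir, src.getName()));      // [4]
          }
        }
        fs.delete(metaFile, false);                                // [5]
      }
    }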

When can Databus get stalled?

  1. Producer dies after [1] and before [2].
  2. Producers in DATABUS are highly available through ZooKeeper.
  3. Another producer starts and follows the COMMIT protocol described above to publish more data.
  4. All consumers in DATABUS (MergedStreamService, LocalStreamService) use DISTCP (augmented to support -preserveSourcePath - https://github.com/InMobi/DistCpV2-0.20.203) with the -f option.
  5. DISTCP with the -f option errors out if one of the paths in the file list isn't present at SOURCE.

Side effect - Databus (MergedStreamService, MirrorStreamService) can stop consuming data, thereby resulting in DATABUS getting stalled.

Current solution - Have operational monitoring on the /databus/system/mirrors and /databus/system/consumers directories for each instance of Databus. If the number of files in these directories exceeds a threshold (10), stop Databus, then cat these files into a single file in the respective directory, keeping only the entries whose files still exist on HDFS and skipping the rest. Then start DATABUS.
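
A sketch of that compaction step, assuming a plain FileSystem client; the merged-file name is an illustrative choice, while the threshold (10) and the skip-if-missing rule come from the issue text:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StalledMetaCompactor {
      static final int THRESHOLD = 10;  // from the issue text

      // Merge the META files under dir (/databus/system/consumers or
      // /databus/system/mirrors) into a single file, keeping only entries
      // that still exist on HDFS. Run this with Databus stopped.
      static void compact(FileSystem fs, Path dir) throws Exception {
        FileStatus[] metas = fs.listStatus(dir);
        if (metas == null || metas.length <= THRESHOLD) {
          return;  // below threshold: leave Databus running
        }
        Path merged = new Path(dir, "merged-" + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(merged)) {
          for (FileStatus meta : metas) {
            try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(meta.getPath())))) {
              String line;
              while ((line = in.readLine()) != null) {
                if (fs.exists(new Path(line))) {  // skip paths gone at SOURCE
                  out.writeBytes(line + "\n");
                }
              }
            }
            fs.delete(meta.getPath(), false);  // replaced by the merged file
          }
        }
      }
    }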

DataConsumer tmpPath cleanup should happen on each run

Issue1 -

  1. Today the tmpPath for DataConsumer is created once and is not cleaned up once an iteration is over.
  2. If some files are left in it due to some issue (i.e. the process didn't succeed), the next run gets more files than expected, and the commit doesn't happen going forward because there is a check that the number of input files to be moved equals the number of files compressed.
  3. Fix - clean up the tmpOutPath on each run of DataConsumer (see the sketch after this list).
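
The fix amounts to a recursive delete before each iteration; a minimal sketch, assuming the path is passed in by the caller (names are illustrative):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TmpPathCleanup {
      // Wipe and recreate the tmp output path at the start of every
      // DataConsumer run so stale files from a failed iteration cannot
      // break the "#input files == #compressed files" commit check.
      static void cleanupTmpOutPath(FileSystem fs, Path tmpOutPath)
          throws Exception {
        fs.delete(tmpOutPath, true);  // recursive; a no-op if the path is absent
        fs.mkdirs(tmpOutPath);
      }
    }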

Issue2 -

  1. Today, when the remote puller starts and has to pull multiple files from a remote cluster, it creates one file in the remotePath which holds the list of all paths to be pulled. If distcp fails after this, that file is not cleaned up, and the next distcp run will consider it as input too, thereby failing repeatedly because of duplicate file names.

Fix - this file should be created on the destination cluster where the data is being pulled, with a timestamp-based name inside tmp (sketched below).
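
A sketch of the proposed fix, writing the distcp -f file list on the destination cluster under a timestamp-based name; the tmp directory and naming pattern are assumptions:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PullListSketch {
      // Write the distcp -f file list on the DESTINATION cluster under a
      // tmp dir, named by timestamp, so a list left behind by a failed run
      // never collides with or pollutes the next run's input.
      static Path writeFileList(FileSystem destFs, Path tmpDir,
                                String[] srcPaths) throws Exception {
        String ts = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss")
            .format(new Date());
        Path listFile = new Path(tmpDir, "distcp-filelist-" + ts);
        try (FSDataOutputStream out = destFs.create(listFile)) {
          for (String p : srcPaths) {
            out.writeBytes(p + "\n");
          }
        }
        return listFile;
      }
    }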

Avoid minimal data replay in the corner case of catastrophic failure of DATABUS

LocalStreamService, MergedStreamService, and MirrorStreamService use the following COMMIT protocol to guarantee 0% data loss:

COMMIT Protocol

  1. Producer writes the data getting published in a META file.
  2. Producer publishes the data.
  3. Consumer reads the list of data to PULL from the META file published in [1].
  4. Consumer pulls the data and publishes it in an atomic fashion.
  5. Consumer deletes/commits the META file from [1].

Catastrophic scenarios causing partial DATA at the consumers (MergedStream, MirrorStream)

  1. Consumer does [4] and dies before [5].
  2. Databus is highly available through ZooKeeper, so another instance starts and follows the same COMMIT protocol.
  3. Since some data was published in step 1 above but not committed, that data will be replayed.

Proposed Solution/Design

A transaction in DATABUS consists of moving multiple paths, and DATABUS can fail in between, which makes atomicity tricky.

  1. The COMMIT protocol can be enhanced to use ZK to remember the data being published, per thread (see the sketch after this list):
    (i) if any path exists in ZK which is already copied to DEST, skip it and delete it from ZK
    (ii) record the source and destination path in the ZK node for this thread
    (iii) do the move
    (iv) delete the path from the ZK node
  2. Now, when DATABUS starts another instance through ZK because the first one failed, it does the following:
    (i) get the data to be pulled from the consumers
    (ii) follow the COMMIT protocol described above in [1]
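
A sketch of step [1] using the raw ZooKeeper client; the znode layout, the per-thread node, and the tab-separated payload are assumptions made for illustration:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkCommitSketch {
      // Enhanced COMMIT step [1]: remember each (src -> dst) move under a
      // per-thread znode before doing it, and delete the record afterwards.
      // threadNode (e.g. /databus/commit/thread-0) is an assumed layout.
      static void movePath(ZooKeeper zk, FileSystem fs, String threadNode,
                           Path src, Path dst) throws Exception {
        String znode = threadNode + "/" + src.getName();
        // 1.(i) path recorded in ZK and already copied to DEST: skip + clean up
        if (zk.exists(znode, false) != null && fs.exists(dst)) {
          zk.delete(znode, -1);
          return;
        }
        // 1.(ii) record the source and destination path in the ZK node
        if (zk.exists(znode, false) == null) {
          zk.create(znode, (src + "\t" + dst).getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        fs.rename(src, dst);   // 1.(iii) do the move
        zk.delete(znode, -1);  // 1.(iv) delete the path from the ZK node
      }
    }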

Scenarios when DATABUS fails

Fails after 1.(i) - no issues.
Fails after 1.(ii) - no issues; the next run will see this path isn't copied yet and will copy it (no replay).
Fails after 1.(iii) - the next run will see this PATH is already copied, so it skips the PATH and deletes it from ZK (avoids data replay).
Fails after 1.(iv) - no issues.

Distcp related fixes in Remote Copier of Databus

  1. Remote Copier, which pulls data from multiple sources, uses distcp. It is designed so that the remote copier runs on box1 and pulls changes from box2 to box3. To support this, distcp needs to be refactored to take the hadoop conf from the caller; in this case we will pass box3 as the destination and box2 as the source.
  2. Refactor distcp to avoid System.exit(1) in its main (see the sketch after this list).
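
A sketch of invoking distcp in-process with a caller-supplied Configuration instead of going through main(), assuming the DistCpV2-era DistCp(Configuration, DistCpOptions) API; the box2/box3 URIs are placeholders from the scenario above:

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class RemoteCopierSketch {
      // Run distcp in-process with a Configuration supplied by the caller,
      // instead of via its main() (which calls System.exit on failure).
      // destConf points at box3, the destination cluster; host names and
      // paths below are placeholders.
      static void pull(Configuration destConf) throws Exception {
        DistCpOptions options = new DistCpOptions(
            Collections.singletonList(
                new Path("hdfs://box2:8020/databus/streams")),
            new Path("hdfs://box3:8020/databus/streams"));
        DistCp distcp = new DistCp(destConf, options);
        distcp.execute();  // throws on failure instead of exiting the JVM
      }
    }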

File Name consistency across data hops

  1. Currently the data moves from /logs/ to /intermediateDir/<path in YY/MM/DD/HR/MN>, from where it's moved to the final destination. The intermediate hop, along with a DONE file generated in the intermediate directory, is required to make the movement atomic. The directory structure should be the same across data hops to ease debuggability.

DataMover is resilient to failures while moving data to intermediate directories. It has a cleanupMode which it enters on every run to fix any issues from the last run.
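
A sketch of the atomic hop with the DONE marker, keeping the same YY/MM/DD/HR/MN structure on both sides; the method and helper names are assumptions:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AtomicHopSketch {
      // Move a minute's data through the intermediate directory so the final
      // publish is a single rename; a DONE marker signals the hop completed.
      // The /logs and /intermediateDir roots come from the issue text; the
      // method shape is an assumption.
      static void publishMinute(FileSystem fs, Path minuteData,
                                String yyMmDdHrMn, Path intermediateRoot,
                                Path finalRoot) throws Exception {
        Path staging = new Path(intermediateRoot, yyMmDdHrMn);
        fs.mkdirs(staging.getParent());
        fs.rename(minuteData, staging);                // hop into intermediate
        fs.create(new Path(staging, "DONE")).close();  // mark hop complete
        Path finalDir = new Path(finalRoot, yyMmDdHrMn);  // same structure
        fs.mkdirs(finalDir.getParent());
        fs.rename(staging, finalDir);                  // atomic final publish
      }
    }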

tmpPath for DataConsumer should be cleaned up before every run

If the previous run of DataConsumer wasn't successful, there is a possibility that the tmpPath (/databus/system/tmp//jobIN) might have some old data which wasn't committed. A new run should clean up this directory to avoid a mismatch in the amount of data moved and any duplicate replays which may happen because of it.
