vkuznet / transfer2go

Distributed, loosely coupled agent-based transfer system

License: MIT License

Languages: Go 83.05%, Shell 5.88%, JavaScript 3.88%, CSS 3.79%, HTML 2.63%, Makefile 0.77%
Topics: transfer-request, agent-request, tfc, phedex, priority, cms

transfer2go's Introduction

transfer2go


Go implementation of CMS PhEDEx: distributed, loosely coupled agents for transferring CMS data.

Description

The CMS experiment at the LHC proton-proton collider developed the PhEDEx (Physics Experiment Data Export) service as a reliable and scalable data management system to meet the experiment's requirements in Run I/II. Designed in 2004 and written mainly in Perl, it is still used today to manage multi-PB transfer loads per week across a complex topology of dozens of Grid-compliant computing centres.

Its architecture, instead of having a central brain making global decisions on all CMS replica allocations, has a data management layer composed of a set of loosely coupled and stateless software agents, each managing highly specific parts of the replication operations at each site in the distribution network, which communicate asynchronously through a central blackboard architecture. The system is resilient and robust against a large variety of possible failure modes: it was designed by assuming that a transfer will fail (thus implementing fail-over tactics) and by staying completely agnostic about the lower-level file transfer mechanism (thus focusing on full dataset management and detailed transfer bookkeeping). The collision and derived data that enabled the Higgs boson discovery by the ATLAS and CMS experiments at the LHC were transferred across the CMS worldwide domain using this toolkit.

The aim of this project is to extend the basic PhEDEx functionality to address the upcoming challenges of the exabyte HL-LHC era via an implementation in the modern Go programming language.

The motivation for this effort is manifold:

  • eliminate the central blackboard system and the need to rely on an ORACLE backend, via the fully distributed nature of the agents, self-discovery, and task delegation;
  • even though the current system works well, it suffers from a lack of support and expertise in the Perl programming language; we would like to leverage a modern language such as Go for its native concurrency support and dependency-free deployment;
  • the data volume expected in the HL-LHC era will grow significantly, to the exabyte level, and we need to explore an elastic approach to handle a variety of opportunistic resources;
  • extend file access and transfer patterns to event streaming, individual objects, etc.;
  • implement support for user-produced data in addition to data centrally produced and managed by the system;
  • take advantage of the built-in concurrency model of the Go language and explore the scalability boundaries.

transfer2go's People

Contributors

rishiloyola, vkuznet

transfer2go's Issues

Add CLI interface to view, approve, delete requests in main agent

We need both CLI and web interfaces to approve or delete requests in the main agent. The initial web UI exists; now we need the CLI counterpart. It should do the following:

# view requests awaiting approval
transfer2go -agent=main-agent -requests

This should provide the list of requests and their states, e.g.

Request: Id, Source: src, Destination: dst, Data: /a/b/c, Status: pending
Request: Id, Source: src, Destination: dst, Data: /a/b/c, Status: in-transfer

Then, we need the ability to approve a given request:

transfer2go -agent=main-agent -requestID=id -action=approve
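
A minimal sketch of the client-side flag handling, assuming the flags shown above and hypothetical /requests and /action endpoints on the main agent (here addressed by URL rather than name):

package main

import (
	"flag"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
)

func main() {
	agent := flag.String("agent", "http://localhost:8989", "main agent URL")
	requests := flag.Bool("requests", false, "list requests awaiting approval")
	requestID := flag.String("requestID", "", "request id to act upon")
	action := flag.String("action", "", "action to perform: approve or delete")
	flag.Parse()

	switch {
	case *requests:
		// list pending requests; the /requests endpoint is an assumption
		resp, err := http.Get(*agent + "/requests")
		if err != nil {
			log.Fatal(err)
		}
		defer resp.Body.Close()
		io.Copy(os.Stdout, resp.Body)
	case *requestID != "" && *action != "":
		// approve or delete a single request; the /action endpoint is an assumption
		v := url.Values{"requestID": {*requestID}, "action": {*action}}
		resp, err := http.PostForm(*agent+"/action", v)
		if err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		log.Println("status:", resp.Status)
	default:
		flag.Usage()
	}
}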

Add APIs for pull model

The pull model needs an approval API on the agent. Here is the list of APIs:

  • transfer request
  • delete request
  • approve request
  • list all requests
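
A minimal sketch of how these endpoints might be registered on an agent's HTTP mux; the paths and handler names are assumptions, not the project's actual routes:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// the handlers below are stubs; real implementations would consult the
// catalog and the request queue
func transferHandler(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "transfer request") }
func deleteHandler(w http.ResponseWriter, r *http.Request)   { fmt.Fprintln(w, "delete request") }
func approveHandler(w http.ResponseWriter, r *http.Request)  { fmt.Fprintln(w, "approve request") }
func listHandler(w http.ResponseWriter, r *http.Request)     { fmt.Fprintln(w, "list requests") }

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/request/transfer", transferHandler) // submit a transfer request
	mux.HandleFunc("/request/delete", deleteHandler)     // delete a pending request
	mux.HandleFunc("/request/approve", approveHandler)   // approve a pending request
	mux.HandleFunc("/request/list", listHandler)         // list all requests
	log.Fatal(http.ListenAndServe(":8989", mux))
}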

Add decorators to Transfer and TransferRequest CLI APIs

We need to add a few decorators to the Transfer and TransferRequest CLI APIs to handle the policies of the system, e.g. only certain users (admins) will be able to approve/delete requests, while collaborators will be able to submit transfer requests.
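
In Go such decorators map naturally onto handler wrappers. A minimal sketch, assuming a hypothetical userRole lookup (a real implementation would inspect an auth token or certificate):

package main

import (
	"fmt"
	"log"
	"net/http"
)

// userRole is a hypothetical role lookup based on a request header
func userRole(r *http.Request) string {
	return r.Header.Get("X-Role")
}

// adminOnly decorates a handler so that only admins may call it;
// collaborators and anonymous users get 403
func adminOnly(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if userRole(r) != "admin" {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next(w, r)
	}
}

func approveHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "request approved")
}

func main() {
	// approve/delete are admin-only; submitting transfer requests would
	// stay open to collaborators
	http.HandleFunc("/request/approve", adminOnly(approveHandler))
	log.Fatal(http.ListenAndServe(":8989", nil))
}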

Add Catalog interface

I think we should add a Catalog interface and convert the existing implementation (core/catalog.go) to adopt it. This would allow us to easily switch to another back-end implementation. For instance, right now we use a SQL database, but we may want to use a document-oriented DB (like CouchDB or MongoDB) or a key-value store (e.g. dgraph's Badger, https://github.com/dgraph-io/badger).
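
A minimal sketch of what such an interface could look like; the method set and the Record type are assumptions based on typical catalog operations, not the actual core/catalog.go API:

package core

// Record is a hypothetical catalog entry mapping a logical file name
// to its physical location and metadata
type Record struct {
	Lfn   string // logical file name
	Pfn   string // physical file name
	Bytes int64  // file size
	Hash  string // file hash
}

// Catalog abstracts the catalog back-end so the SQL implementation can
// be swapped for CouchDB, MongoDB, Badger, etc. without touching callers
type Catalog interface {
	Add(rec Record) error
	Remove(lfn string) error
	Lookup(lfn string) ([]Record, error)
	Records() ([]Record, error)
}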

Switch to pull model

Currently we have implemented a push approach:

The way it's done now in my prototype is the following: a client sends a request to an agent to transfer dataset /a/b/c to site X. The agent first checks if it has this dataset; if so, it initiates the transfer by pushing the data from itself to site X. If that agent does not have the dataset, it broadcasts the request to all known agents. The agent which has it replies, and the request is delegated to that agent. That agent then pushes the data from itself to site X.

This has some flaws, e.g. a site can go down, undergo maintenance, or run out of disk space; therefore we need to explore, develop, and eventually switch to a pull model.

Sites today have complete control over the agent that puts data into their site. This is a design choice that was made to put the responsibility for transfers onto the site ops team: the site can turn off its agent when it has problems with storage, throttle it if there are issues, or stop it if it loses a disk and thus runs out of space, or runs out of space for some other reason. In the pull model, the request lands at the site that wants the data, and that site fetches it from the original site. In the scenario above, we would redirect the request to the agent sitting at site X, and it would download dataset /a/b/c from whichever site holds a copy.
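
A minimal sketch of the pull-model routing described above; knownAgents, hasData, and pullFrom are hypothetical stubs standing in for agent discovery, catalog lookup, and the actual pull transfer:

package main

import (
	"fmt"
	"log"
)

func knownAgents() []string { return []string{"siteA", "siteB"} }

func hasData(agent, dataset string) bool { return agent == "siteB" }

// pullFrom asks the destination agent to download the dataset from the
// source agent at its own pace, keeping the site in control of ingress
func pullFrom(dst, src, dataset string) error {
	fmt.Printf("%s pulls %s from %s\n", dst, dataset, src)
	return nil
}

// routeRequest finds an agent holding a copy of the dataset and
// delegates the pull to the destination agent
func routeRequest(dataset, dst string) error {
	for _, agent := range knownAgents() {
		if hasData(agent, dataset) {
			return pullFrom(dst, agent, dataset)
		}
	}
	return fmt.Errorf("no agent holds %s", dataset)
}

func main() {
	if err := routeRequest("/a/b/c", "siteX"); err != nil {
		log.Fatal(err)
	}
}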

Switch to logrus logger

I would suggest that we use the logrus logger, with colored output for stdout and JSON output for persistent storage.
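
A minimal sketch of such a setup: logrus writes colored text to stdout, and a small custom hook mirrors every entry as JSON into a file (the hook and the file name are assumptions):

package main

import (
	"os"

	log "github.com/sirupsen/logrus"
)

// jsonFileHook duplicates every log entry as JSON into a file, while
// the logger's main formatter keeps colored text on stdout
type jsonFileHook struct {
	file      *os.File
	formatter log.Formatter
}

func (h *jsonFileHook) Levels() []log.Level { return log.AllLevels }

func (h *jsonFileHook) Fire(entry *log.Entry) error {
	line, err := h.formatter.Format(entry)
	if err != nil {
		return err
	}
	_, err = h.file.Write(line)
	return err
}

func main() {
	// colored, human-readable output on stdout
	log.SetFormatter(&log.TextFormatter{ForceColors: true})
	log.SetOutput(os.Stdout)

	f, err := os.OpenFile("transfer2go.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		log.Fatal(err)
	}
	// JSON copy of every entry for persistent storage
	log.AddHook(&jsonFileHook{file: f, formatter: &log.JSONFormatter{}})

	log.WithField("agent", "main").Info("logger initialized")
}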

Implement central catalog

PhEDEx uses a central catalog, while transfer2go does not. We need feasibility studies to determine whether we can use a distributed model at scale, e.g. what happens if we need to update the catalog, or if the catalog at one site becomes corrupted, etc.

Code Refactor

TODOs:

  • Make a generic catalog interface
  • Notify clients if a transfer fails
  • Add middleware tools in the pull model
  • Improve the HTML interface
  • Make HTML pages configurable for the push and pull models
  • For a GET /pending request, fetch data from the priority queue
  • Resolve the bug in jobPool registration
  • Figure out a way to monitor the destination's free space (it can run out of space while pulling data)
  • Make changes in the client to register requests with the main agent
  • Use io.Pipe() to upload the data (client's register method); see the sketch after this list
  • Implement a unique id generator
  • Allow cross-origin access
  • Find a better way to distinguish the main-agent from other agents
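
A minimal sketch of the io.Pipe() upload mentioned above: a goroutine feeds a multipart writer into one end of the pipe while http.Post streams the other end, so the file is never buffered whole in memory (the /upload endpoint is an assumption):

package main

import (
	"io"
	"log"
	"mime/multipart"
	"net/http"
	"os"
)

func uploadFile(agent, fname string) error {
	pr, pw := io.Pipe()
	mw := multipart.NewWriter(pw)

	go func() {
		// mw.Close writes the trailing boundary, then pw.Close signals EOF
		defer pw.Close()
		defer mw.Close()
		part, err := mw.CreateFormFile("file", fname)
		if err != nil {
			pw.CloseWithError(err)
			return
		}
		f, err := os.Open(fname)
		if err != nil {
			pw.CloseWithError(err)
			return
		}
		defer f.Close()
		if _, err := io.Copy(part, f); err != nil {
			pw.CloseWithError(err)
		}
	}()

	resp, err := http.Post(agent+"/upload", mw.FormDataContentType(), pr)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	if err := uploadFile("http://localhost:8989", "data.root"); err != nil {
		log.Fatal(err)
	}
}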

Imitate transfer failures

Capture the outcome of the transfers:

The API can be re-used when we implement the policy of "what to do with
failed transfers". The table should capture all details about a failed
request so that it can be re-initiated at a later time.
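
A minimal sketch of what such a table and its Go counterpart might look like; the column set is an assumption, and the schema uses SQLite syntax since the catalog currently sits in a SQL database:

package core

// schema for a hypothetical failed_transfers table; enough detail is
// kept to re-initiate the request later
const failedTransfersSchema = `
CREATE TABLE IF NOT EXISTS failed_transfers (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    request_id TEXT,
    lfn        TEXT,
    src        TEXT,
    dst        TEXT,
    reason     TEXT,
    retries    INTEGER DEFAULT 0,
    failed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)`

// FailedTransfer mirrors one row of the failed_transfers table
type FailedTransfer struct {
	RequestID string
	Lfn       string
	Src, Dst  string
	Reason    string
	Retries   int
}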

Extend file access to event streaming and data slicing

So far the transfers are file based. We need R&D on ROOT I/O event streaming and data slicing in the context of the CWP (Community White Paper):

In current paradigms, physicists consider all events at an equivalent level of detail, in the format offering the highest level of detail needed in an analysis. However, not every event considered in an analysis requires the same level of detail. One way to improve I/O throughput is to design event tiers of different sizes: all events can be considered at a lighter-weight tier, while events of interest can be accessed through a more information-rich tier. Machine learning and similar highly iterative techniques also pose new challenges in data management and access. Data access that is column-based instead of row-based can be of enormous benefit to many analysis techniques: for instance, instead of reading events in sequence and searching for one particular type of object in each event, the data management system could return all objects of a certain type in a dataset. Data queries that use histogram indexing are another feature that may or may not provide performance increases in analysis.

Go-based ROOT I/O: https://godoc.org/go-hep.org/x/hep/rootio
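
A minimal sketch of what a column-oriented access interface might look like on the agent side; it is entirely hypothetical, and a real implementation would sit on top of go-hep's rootio:

package core

// ColumnReader is a hypothetical abstraction for column-based access:
// instead of streaming whole events, a client asks for all values of
// one object type (column) over a slice of events
type ColumnReader interface {
	// Column returns the values of the named column for events in the
	// half-open range [begin, end)
	Column(dataset, column string, begin, end int64) ([]float64, error)
	// NEvents reports how many events the dataset holds
	NEvents(dataset string) (int64, error)
}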

Add delegation of main-agent responsibility to another agent

In the pull model, where we have a central main-agent, we need the ability to delegate its responsibility to another agent or process. I think we have everything in place and only need to perform a test:

  • start main-agent
  • start source/destination agents
  • perform transfer request
  • stop main-agent
  • start the main-agent again (maybe on a different port) and perform the aforementioned steps again

Efficient hashing

For each file to transfer we need to obtain its hash. So far we read files end-to-end to obtain the hash, which has an impact on RAM utilization. Study whether this can be avoided, or find a better way to obtain a reliable hash while minimizing the RAM impact, e.g. seek to multiple places in the file and hash selected chunks of the data.
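
A minimal sketch of such chunked hashing: it hashes a fixed number of chunks spread evenly across the file, plus the file size, so RAM usage is bounded by the chunk size. The chunk count and size are tunable assumptions, and the resulting hash is weaker than a full-file digest:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

func sampledHash(fname string, nChunks int, chunkSize int64) (string, error) {
	f, err := os.Open(fname)
	if err != nil {
		return "", err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return "", err
	}
	size := fi.Size()

	h := sha256.New()
	fmt.Fprintf(h, "%d", size) // mix the file size into the digest
	buf := make([]byte, chunkSize)
	for i := 0; i < nChunks; i++ {
		// seek to evenly spaced offsets and hash one chunk at each
		off := size * int64(i) / int64(nChunks)
		n, err := f.ReadAt(buf, off)
		if err != nil && err != io.EOF {
			return "", err
		}
		h.Write(buf[:n])
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	hash, err := sampledHash("data.root", 8, 1<<20) // 8 chunks of 1 MB each
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(hash)
}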

Need unit/integration tests

Here is a list of the tests required (a minimal example is sketched after the list):

  • TFC methods
    • database read/write
    • LFN look-up
  • server APIs
    • reset + protocol registration
    • status
    • agents plus agent registration
  • request methods
    • upload method, upload a local file to the agent
    • transfer method, transfer an LFN from agent A to agent B
      • the test can be done with the local /bin/cp command
      • test with external tools srmcp, xrdcp, etc.
  • various decorators
  • imitate transfer failures
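
A minimal sketch of one such test, reusing the hypothetical Record/Catalog shapes from the Catalog interface issue above, with an in-memory stand-in so the sketch is self-contained (the real TFC API may differ):

package core

import "testing"

// Record is repeated here so the sketch compiles on its own
type Record struct {
	Lfn, Pfn string
	Bytes    int64
}

// memCatalog is a tiny in-memory stand-in for the real catalog
type memCatalog map[string][]Record

func (m memCatalog) Add(rec Record) error { m[rec.Lfn] = append(m[rec.Lfn], rec); return nil }

func (m memCatalog) Lookup(lfn string) ([]Record, error) { return m[lfn], nil }

// TestLFNLookup checks that a registered LFN can be looked up again
func TestLFNLookup(t *testing.T) {
	c := memCatalog{}
	rec := Record{Lfn: "/a/b/c/file.root", Pfn: "/store/file.root", Bytes: 42}
	if err := c.Add(rec); err != nil {
		t.Fatalf("Add: %v", err)
	}
	recs, err := c.Lookup(rec.Lfn)
	if err != nil {
		t.Fatalf("Lookup: %v", err)
	}
	if len(recs) != 1 || recs[0].Pfn != rec.Pfn {
		t.Errorf("unexpected records: %+v", recs)
	}
}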

Adjust web UI

The web UI needs some adjustments:

  • make tabs and highlight the one with the current view; so far we have colorful tabs, but instead we need something like GitHub page tabs, with the tab of the current view highlighted
  • a summary page showing stats, e.g. total number of requests, total transfers, etc.
  • a page showing agent info: one page per agent, with stats for each individual agent

Test to avoid duplicate replica transfers

We need a test to check that a request is acknowledged only once for a given data transfer. For example, a dataset /a/b/c may reside at multiple sites; the router should pick only one and tell the other agents to drop the request.
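
A minimal sketch of the routing policy under test; replicaHolders stands in for the broadcast replies, and the drop/delegate steps are printed rather than sent:

package main

import (
	"fmt"
	"log"
)

// replicaHolders would come from broadcast replies; stubbed here
func replicaHolders(dataset string) []string {
	return []string{"siteA", "siteB", "siteC"}
}

// dispatch picks exactly one holder for the transfer and tells the
// remaining agents to drop the request, so the data is transferred once
func dispatch(dataset, requestID string) error {
	holders := replicaHolders(dataset)
	if len(holders) == 0 {
		return fmt.Errorf("no replica of %s found", dataset)
	}
	chosen := holders[0] // simplest policy: first responder wins
	for _, agent := range holders[1:] {
		fmt.Printf("telling %s to drop request %s\n", agent, requestID)
	}
	fmt.Printf("delegating request %s to %s\n", requestID, chosen)
	return nil
}

func main() {
	if err := dispatch("/a/b/c", "req-123"); err != nil {
		log.Fatal(err)
	}
}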
