mirror-clone's People

Contributors

alissa-tung, photonquantum, skyzh

mirror-clone's Issues

MIME in S3 backend

Currently, we do not set the MIME type (Content-Type) on objects uploaded to S3. As a result, HTML files are offered as downloads instead of being rendered in the browser.
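
One possible fix, sketched here in Python for illustration only, is to guess the Content-Type from the object key at upload time and fall back to a generic binary type. The function name content_type_for is hypothetical, and how the value is passed to the S3 client is left out because it depends on the backend's API.

```python
import mimetypes

# Generic fallback for keys whose type cannot be guessed.
DEFAULT_CONTENT_TYPE = "application/octet-stream"


def content_type_for(key: str) -> str:
    """Guess a Content-Type for an object key such as 'simple/index.html'."""
    guessed, _encoding = mimetypes.guess_type(key)
    return guessed or DEFAULT_CONTENT_TYPE
```

With this in place, an index page would be uploaded as text/html, while unknown extensions keep the current download behaviour.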

support custom config

Now every task has its own configuration, and the mirror-clone program has its global configuration. I suggest the following config scheme.

The final config used by mirror-clone is composed of three parts: the default config built into the program, the configuration provided in TOML format, and the command-line arguments.

For example, suppose we have the following config.toml:

[global]
io-limit = 16 # only a total of 16 concurrent downloads are allowed
cpu-limit = 16 # only 16 concurrent CPU-bound tasks are allowed
io-thread-pool = 4
cpu-thread-pool = 4

[global.log]
log-format = "json"
log-level = "warning"

[opam]
use-cache = false

[conda]

[[conda.repo]]
name = "anaconda/pkgs/main/win-64"
url = "balahbalah"

The basic usage of mirror-clone is mirror-clone <task> <base_dir> <config>

With the config above, we can call mirror-clone with the following arguments.

mirror-clone --config config.toml conda /data/conda --all-repos # clone all repos specified in config
mirror-clone --config config.toml conda /data/conda --repos=anaconda/pkgs/main/win-64,anaconda/pkgs/main/linux-64 # use pre-defined repo in config
mirror-clone --config config.toml conda /data/conda/pkgs/main/win-64 --url=mirrors.sjtug.sjtu.edu.cn/anaconda/pkgs/main/win-64

Command-line arguments take precedence. For example, we could override use-cache in opam.

mirror-clone --config config.toml opam --use-cache=true # override use-cache = false from config.toml

If cpu-thread-pool is specified in neither config.toml nor the command-line arguments, mirror-clone falls back to the default value built into the program.
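
The precedence described above (defaults, then config.toml, then command-line arguments) can be sketched as a layered merge. This is a minimal Python illustration, not the program's implementation: the DEFAULTS values are made up, the file config is written as a plain dict instead of being parsed from TOML, and the names merge and resolve are hypothetical.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested tables merge key by key."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical built-in defaults; the real default values are not stated here.
DEFAULTS = {
    "global": {"cpu-thread-pool": 2, "io-thread-pool": 2},
    "opam": {"use-cache": True},
}


def resolve(file_config: dict, cli_overrides: dict) -> dict:
    """Lowest to highest precedence: defaults, config file, command line."""
    return merge(merge(DEFAULTS, file_config), cli_overrides)


# config.toml sets io-thread-pool and disables the opam cache ...
file_config = {"global": {"io-thread-pool": 4}, "opam": {"use-cache": False}}
# ... and --use-cache=true on the command line turns it back on.
cli_overrides = {"opam": {"use-cache": True}}
final = resolve(file_config, cli_overrides)
```

Here final keeps io-thread-pool from the file, takes use-cache from the command line, and falls back to the default for cpu-thread-pool.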

mirror-clone roadmap

The ultimate goal of mirror-clone is to provide an easy-to-use abstraction layer for developers who want to clone a software repo to their own local registry.

Developers will need to implement two interfaces, SourceFS and TargetFS, in order to clone a registry.

SourceFS

SourceFS generally refers to the source software registry, for example crates.io, opam, or conda. It provides the following functionality:

  • snapshot provides a list of the files currently in the software registry.
    • For OPAM, taking a snapshot involves downloading repo and index.tar.gz and parsing the information in them.
    • For conda, this involves downloading repodata.json and generating the file list.
    • For crates.io, this involves scanning the crates.io-index repo and generating the file list.
  • entry provides a way to download a file from the source filesystem.
    • For most mirroring tasks, this means finding the URL and checksum that correspond to a file.
    • Index files, for example index.tar.gz, should also be included.
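
As a rough illustration of the two operations listed above, the interface could look like the following Python sketch. The Entry type, the method signatures, and the toy StaticSource class are assumptions made for this example; a real source would build its file list by parsing the registry's index.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """Where to download a file from, plus an optional checksum."""
    url: str
    checksum: Optional[str] = None


class SourceFS(ABC):
    @abstractmethod
    def snapshot(self) -> List[str]:
        """Return the paths of all files currently in the registry."""

    @abstractmethod
    def entry(self, path: str) -> Entry:
        """Return the URL and checksum for one path from the snapshot."""


class StaticSource(SourceFS):
    """Toy source with a fixed file list (hypothetical, for illustration)."""

    def __init__(self, base_url: str, files: Dict[str, Optional[str]]):
        self.base_url = base_url.rstrip("/")
        self.files = files  # path -> checksum (None if unknown)

    def snapshot(self) -> List[str]:
        return sorted(self.files)

    def entry(self, path: str) -> Entry:
        return Entry(url=f"{self.base_url}/{path}", checksum=self.files[path])
```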

TargetFS

TargetFS generally refers to a local filesystem. It could also be an object storage, or a key-value database.

TargetFS should be able to:

  • list files
  • read file
  • write file
  • get metadata of a file
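
The four capabilities above map naturally onto a small interface. The Python sketch below is an assumption about its shape, not the project's API: the method names, the Metadata type, and the in-memory MemoryTarget are hypothetical, and a real target would be backed by a local directory, an object store, or a key-value database.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Metadata:
    size: int  # size in bytes; a real target might also record a modification time


class TargetFS(ABC):
    @abstractmethod
    def list_files(self) -> List[str]: ...

    @abstractmethod
    def read_file(self, path: str) -> bytes: ...

    @abstractmethod
    def write_file(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def metadata(self, path: str) -> Metadata: ...


class MemoryTarget(TargetFS):
    """In-memory target, useful only for tests and illustration."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def list_files(self) -> List[str]:
        return sorted(self._files)

    def read_file(self, path: str) -> bytes:
        return self._files[path]

    def write_file(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def metadata(self, path: str) -> Metadata:
        return Metadata(size=len(self._files[path]))
```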

Mirror-Clone

mirror-clone provides utilities for mirroring a repo.

tmpfs

tmpfs stores files temporarily. When taking a snapshot, the source filesystem may download some index files. They can be saved to tmpfs and served directly when entry is called.
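
A minimal sketch of such a temporary store, assuming a per-run scratch directory on local disk. The class name TmpFS and its save/load/close methods are hypothetical; file names are hashed so that paths containing slashes need no subdirectories.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


class TmpFS:
    """Scratch space for index files downloaded while taking a snapshot."""

    def __init__(self) -> None:
        self._dir = tempfile.TemporaryDirectory(prefix="mirror-clone-")
        self._root = Path(self._dir.name)

    def _path(self, name: str) -> Path:
        # Hash the logical name so 'a/b/index.tar.gz' maps to one flat file.
        return self._root / hashlib.sha256(name.encode("utf-8")).hexdigest()

    def save(self, name: str, data: bytes) -> None:
        self._path(name).write_bytes(data)

    def load(self, name: str) -> Optional[bytes]:
        """Return the saved bytes, or None if nothing was saved under name."""
        path = self._path(name)
        return path.read_bytes() if path.exists() else None

    def close(self) -> None:
        self._dir.cleanup()
```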

downloader

downloader helps download a file from a given URL.

transferrer

Transferrer transfers a file from source filesystem to target filesystem. It will automatically retry failed requests.
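
The retry behaviour could look roughly like the Python sketch below. The retry count, the exponential backoff, and the choice to retry only on OSError (which covers connection errors) are assumptions for illustration; fetch and store stand in for the downloader and the target filesystem.

```python
import time
from typing import Callable


def transfer_with_retry(
    fetch: Callable[[], bytes],
    store: Callable[[bytes], None],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Fetch a file and store it, retrying failed attempts with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            store(fetch())
            return
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so that the backoff can be tested without actually waiting.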

comparator

Given an entry on source filesystem and target filesystem, a comparator decides whether a file requires re-transferring.
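
One way to express such a decision, as a hedged Python sketch: prefer checksums when both sides have one, fall back to size, and re-transfer when nothing can be compared. The FileInfo type and this particular ordering are assumptions, not the project's rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    size: Optional[int] = None
    checksum: Optional[str] = None


def needs_transfer(source: FileInfo, target: Optional[FileInfo]) -> bool:
    """Decide whether a file must be (re-)transferred to the target."""
    if target is None:
        return True  # file is missing on the target
    if source.checksum is not None and target.checksum is not None:
        return source.checksum != target.checksum
    if source.size is not None and target.size is not None:
        return source.size != target.size
    return True  # nothing comparable: transfer again to be safe
```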

buffer layer

The buffer layer sits between the transferrer and the target filesystem.

Transaction Buffer provides a transaction-commit interface. Downloads routinely fail because of network issues, so the buffer layer commits a file to the target filesystem only after that file has been downloaded successfully (or waits until all files have been downloaded).
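
The staging idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the target filesystem is modelled as a plain dict, and the names stage, discard, and commit are hypothetical.

```python
from typing import Dict, List


class TransactionBuffer:
    """Holds downloaded files until they are explicitly committed."""

    def __init__(self, target: Dict[str, bytes]) -> None:
        self._target = target  # stand-in for the target filesystem
        self._staged: Dict[str, bytes] = {}

    def stage(self, path: str, data: bytes) -> None:
        """Record a successfully downloaded file without touching the target."""
        self._staged[path] = data

    def discard(self, path: str) -> None:
        """Drop a staged file, e.g. after a failed verification."""
        self._staged.pop(path, None)

    def commit(self) -> List[str]:
        """Write all staged files to the target and return their paths."""
        committed = sorted(self._staged)
        self._target.update(self._staged)
        self._staged.clear()
        return committed
```

Calling commit after every file gives per-file commits; calling it once at the end gives the "wait until all files have been downloaded" behaviour.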

Fuse Buffer ensures that a file is never downloaded twice by fusing it. It also records file metadata in a single cache file to speed up listing all files in the target filesystem.
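
The exact meaning of "fusing" is not spelled out here, so the following Python sketch rests on one interpretation: once a path has been recorded, it is never downloaded again, and all metadata lives in a single JSON cache file. The class layout, the JSON format, and the method names are assumptions.

```python
import json
from pathlib import Path
from typing import Dict, List


class FuseBuffer:
    """Tracks transferred files in a single metadata cache file."""

    def __init__(self, cache_file: Path) -> None:
        self._cache_file = cache_file
        self._meta: Dict[str, dict] = {}
        if cache_file.exists():
            self._meta = json.loads(cache_file.read_text())

    def should_download(self, path: str) -> bool:
        """A path already recorded in the cache is not downloaded again."""
        return path not in self._meta

    def record(self, path: str, size: int) -> None:
        self._meta[path] = {"size": size}

    def list_files(self) -> List[str]:
        """List files from the cache without walking the target filesystem."""
        return sorted(self._meta)

    def flush(self) -> None:
        """Persist the cache so the next run can reuse it."""
        self._cache_file.write_text(json.dumps(self._meta, sort_keys=True))
```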

rsync support

  • snapshot includes modification time and size
  • fetch data over HTTP and decode the response headers to verify time and size
  • store metadata in S3
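
The header check in the second item could be sketched as follows, assuming the server sends the standard Content-Length and Last-Modified headers and that the header mapping uses exactly that capitalization. The function names are hypothetical; only Python standard library calls are used.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Tuple


def decode_headers(headers: Mapping[str, str]) -> Tuple[int, datetime]:
    """Extract size and modification time from HTTP response headers."""
    size = int(headers["Content-Length"])
    mtime = parsedate_to_datetime(headers["Last-Modified"])
    return size, mtime


def matches_snapshot(
    headers: Mapping[str, str], expected_size: int, expected_mtime: datetime
) -> bool:
    """Check fetched headers against the size and time from the snapshot."""
    size, mtime = decode_headers(headers)
    return size == expected_size and mtime == expected_mtime
```

The expected_mtime passed in should be timezone-aware (UTC), since a Last-Modified value ending in GMT decodes to an aware datetime.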
