
Lightweight Google Cloud Storage sync Rust Client with better performance than gsutil rsync

Home Page: https://docs.rs/gcs-rsync/

License: MIT License


gcs-rsync's Introduction

gcs-rsync


Lightweight and efficient Rust gcs rsync for Google Cloud Storage.

gcs-rsync is faster than gsutil rsync according to the following benchmarks.

There is no hard limit around 32K objects and no special configuration is needed to compute the sync state.

This crate can be used as a library or as a CLI tool. The object-management API (download, upload, delete, ...) can also be used independently.

How to install as crate

Cargo.toml

[dependencies]
gcs-rsync = "0.4"

How to install as cli tool

cargo install --example gcs-rsync gcs-rsync

~/.cargo/bin/gcs-rsync

How to run with docker

Mirror local folder to gcs

docker run --rm -it -v ${GOOGLE_APPLICATION_CREDENTIALS}:/creds.json:ro -v <YourFolderToUpload>:/source:ro superbeeeeeee/gcs-rsync -r -m /source gs://<YourBucket>/<YourFolderToUpload>/

Mirror gcs to folder

docker run --rm -it -v ${GOOGLE_APPLICATION_CREDENTIALS}:/creds.json:ro -v <YourFolderToDownloadTo>:/dest superbeeeeeee/gcs-rsync -r -m gs://<YourBucket>/<YourFolderToUpload>/ /dest

Mirror partial gcs with prefix to folder

docker run --rm -it -v ${GOOGLE_APPLICATION_CREDENTIALS}:/creds.json:ro -v <YourFolderToDownloadTo>:/dest superbeeeeeee/gcs-rsync -r -m gs://<YourBucket>/<YourFolderToUpload>/<YourPrefix> /dest

Include or Exclude files using glob pattern

CLI gcs-rsync

Use -i (include glob pattern) and -x (exclude glob pattern); both flags can be repeated.

The following example recursively includes any json or toml file, except any test.json or test.toml:

docker run --rm -it -v ${GOOGLE_APPLICATION_CREDENTIALS}:/creds.json:ro -v <YourFolderToDownloadTo>:/dest superbeeeeeee/gcs-rsync -r -m -i "**/*.json" -i "**/*.toml" -x "**/test.json" -x "**/test.toml" gs://<YourBucket>/<YourFolderToUpload>/ /dest

Library

The with_includes and with_excludes client builders set the include and exclude glob patterns.
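The filtering rule itself is simple: a file is synced when it matches at least one include pattern and no exclude pattern, with excludes taking precedence. The sketch below illustrates that rule with the standard library only; the `matches` closure is a hypothetical stand-in for real glob matching, which the crate performs on the full patterns.

```rust
// Sketch of include/exclude semantics: sync a path when it matches at least
// one include (or there are none) and no exclude. The `matches` closure is a
// hypothetical approximation of glob matching, for illustration only.
fn should_sync(path: &str, includes: &[&str], excludes: &[&str]) -> bool {
    let matches = |pattern: &str| match pattern.strip_prefix("**/") {
        // "**/*.json" -> any path ending with ".json"
        Some(rest) if rest.starts_with("*.") => path.ends_with(&rest[1..]),
        // "**/test.json" -> any path whose file name is "test.json"
        Some(rest) => path.rsplit('/').next() == Some(rest),
        None => path == pattern,
    };
    let included = includes.is_empty() || includes.iter().any(|p| matches(*p));
    let excluded = excludes.iter().any(|p| matches(*p));
    included && !excluded
}

fn main() {
    let includes = ["**/*.json", "**/*.toml"];
    let excludes = ["**/test.json", "**/test.toml"];
    assert!(should_sync("a/b/config.json", &includes, &excludes));
    assert!(!should_sync("a/b/test.json", &includes, &excludes)); // excluded
    assert!(!should_sync("a/b/readme.md", &includes, &excludes)); // not included
    println!("filtering semantics verified");
}
```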

Benchmark

Important note about gsutil: the gsutil ls command does not list all object items by default; it lists prefixes instead, and adding the -r flag slows gsutil down. The ls performance is therefore very different from that of the rsync implementation.

new files only (first time sync)

  • gcs-rsync: 2.2s/7MB
  • gsutil: 9.93s/47MB

winner: gcs-rsync

gcs-rsync sync bench

rm -rf ~/Documents/test4 && cargo build --release --examples && /usr/bin/time -lp -- ./target/release/examples/bucket_to_folder_sync
real         2.20
user         0.13
sys          0.21
             7606272  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                1915  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                 394  messages sent
                1255  messages received
                   0  signals received
                  54  voluntary context switches
                5814  involuntary context switches
           636241324  instructions retired
           989595729  cycles elapsed
             3895296  peak memory footprint

gsutil sync bench

rm -rf ~/Documents/gsutil_test4 && mkdir ~/Documents/gsutil_test4 && /usr/bin/time -lp --  gsutil -m -q rsync -r gs://dev-bucket/sync_test4/ ~/Documents/gsutil_test4/
Operation completed over 215 objects/50.3 KiB.
real         9.93
user         8.12
sys          2.35
            47108096  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              196391  page reclaims
                   1  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
               36089  messages sent
               87309  messages received
                   5  signals received
               38401  voluntary context switches
               51924  involuntary context switches
            12986389  instructions retired
            12032672  cycles elapsed
              593920  peak memory footprint

no change (second time sync)

  • gcs-rsync: 0.78s/8MB
  • gsutil: 2.18s/47MB

winner: gcs-rsync (it checks size and mtime before falling back to crc32c, as gsutil does)
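The fast path behind this result can be sketched as follows: compare cheap metadata (size and mtime) first, and only compute the expensive crc32c checksum when the metadata is inconclusive. The struct and function here are illustrative, not the crate's actual internals.

```rust
use std::time::SystemTime;

// Illustrative sync-state entry; in a real sync the checksum would be
// computed lazily, only when the metadata check below is inconclusive.
struct Entry {
    size: u64,
    mtime: SystemTime,
    crc32c: u32,
}

fn needs_sync(local: &Entry, remote: &Entry) -> bool {
    // Different size: must sync, no checksum needed.
    if local.size != remote.size {
        return true;
    }
    // Same size and mtime: assume unchanged, skip the checksum entirely.
    if local.mtime == remote.mtime {
        return false;
    }
    // Same size, different mtime: fall back to the content checksum.
    local.crc32c != remote.crc32c
}

fn main() {
    let t = SystemTime::UNIX_EPOCH;
    let local = Entry { size: 10, mtime: t, crc32c: 1 };
    let remote = Entry { size: 10, mtime: t, crc32c: 2 };
    // Matching metadata short-circuits: no sync, checksum never consulted.
    assert!(!needs_sync(&local, &remote));
    let bigger = Entry { size: 11, mtime: t, crc32c: 1 };
    assert!(needs_sync(&bigger, &remote));
    println!("metadata fast path ok");
}
```

This is why the second sync is so cheap: for an unchanged tree, every file takes the metadata short-circuit.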

gcs-rsync sync bench

cargo build --release --examples && /usr/bin/time -lp -- ./target/release/examples/bucket_to_folder_sync
real         1.79
user         0.13
sys          0.12
             7864320  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                1980  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                 397  messages sent
                1247  messages received
                   0  signals received
                  42  voluntary context switches
                4948  involuntary context switches
           435013936  instructions retired
           704782682  cycles elapsed
             4141056  peak memory footprint

gsutil sync bench

/usr/bin/time -lp --  gsutil -m -q rsync -r gs://test-bucket/sync_test4/ ~/Documents/gsutil_test4/
real         2.18
user         1.37
sys          0.66
            46899200  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              100108  page reclaims
                1732  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                6311  messages sent
               12752  messages received
                   4  signals received
                6145  voluntary context switches
               14219  involuntary context switches
            13133297  instructions retired
            13313536  cycles elapsed
              602112  peak memory footprint

gsutil rsync config

gsutil -m -q rsync -r -d ./your-dir gs://your-bucket
/usr/bin/time -lp --  gsutil -m -q rsync -r gs://dev-bucket/sync_test4/ ~/Documents/gsutil_test4/

About authentication

All default authentication functions use the GOOGLE_APPLICATION_CREDENTIALS environment variable as the default configuration, as the official Google client libraries do in other languages (Go, .NET).

The from and from_file functions provide custom integration modes.

For more info about OAuth2, see the related README in the oauth2 mod.
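As a sketch of the default resolution order (the function name and messages below are illustrative, not the crate's API), the default mode reads the credentials file path from the environment, and the custom mode is the fallback when the variable is absent:

```rust
use std::env;

// Illustrative lookup mirroring what the default credential helpers rely on:
// the GOOGLE_APPLICATION_CREDENTIALS env var pointing at a credentials JSON.
fn credentials_path() -> Result<String, env::VarError> {
    env::var("GOOGLE_APPLICATION_CREDENTIALS")
}

fn main() {
    match credentials_path() {
        Ok(path) => println!("default mode: using credentials file {}", path),
        Err(_) => println!("GOOGLE_APPLICATION_CREDENTIALS not set: use from/from_file instead"),
    }
}
```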

How to run tests

Unit tests

cargo test --lib

Integration tests + Unit tests

TEST_SERVICE_ACCOUNT=<PathToAServiceAccount> TEST_BUCKET=<BUCKET> TEST_PREFIX=<PREFIX> cargo test --no-fail-fast

Examples

Upload object

Library

use std::path::Path;

use gcs_rsync::storage::{credentials, Object, ObjectClient, StorageResult};
use tokio_util::codec::{BytesCodec, FramedRead};

#[tokio::main]
async fn main() -> StorageResult<()> {
    let args = std::env::args().collect::<Vec<_>>();
    let bucket = args[1].as_str();
    let prefix = args[2].to_owned();
    let file_path = args[3].to_owned();

    let auc = Box::new(credentials::authorizeduser::default().await?);
    let object_client = ObjectClient::new(auc).await?;

    let file_path = Path::new(&file_path);
    let name = file_path.file_name().unwrap().to_string_lossy();

    let file = tokio::fs::File::open(file_path).await.unwrap();
    let stream = FramedRead::new(file, BytesCodec::new());

    let name = format!("{}/{}", prefix, name);
    let object = Object::new(bucket, name.as_str())?;
    object_client.upload(&object, stream).await.unwrap();
    println!("object {} uploaded", &object);
    Ok(())
}

CLI

cargo run --release --example upload_object "<YourBucket>" "<YourPrefix>" "<YourFilePath>"

Download object

Library

use std::path::Path;

use futures::TryStreamExt;
use gcs_rsync::storage::{credentials, Object, ObjectClient, StorageResult};
use tokio::{
    fs::File,
    io::{AsyncWriteExt, BufWriter},
};

#[tokio::main]
async fn main() -> StorageResult<()> {
    let args = std::env::args().collect::<Vec<_>>();
    let bucket = args[1].as_str();
    let name = args[2].as_str();
    let output_path = args[3].to_owned();

    let auc = Box::new(credentials::authorizeduser::default().await?);
    let object_client = ObjectClient::new(auc).await?;

    let file_name = Path::new(&name).file_name().unwrap().to_string_lossy();
    let file_path = format!("{}/{}", output_path, file_name);

    let object = Object::new(bucket, name)?;
    let mut stream = object_client.download(&object).await.unwrap();

    let file = File::create(&file_path).await.unwrap();
    let mut buf_writer = BufWriter::new(file);

    while let Some(data) = stream.try_next().await.unwrap() {
        buf_writer.write_all(&data).await.unwrap();
    }

    buf_writer.flush().await.unwrap();
    println!("object {} downloaded to {:?}", &object, file_path);
    Ok(())
}

CLI

cargo run --release --example download_object "<YourBucket>" "<YourObjectName>" "<YourAbsoluteExistingDirectory>"

Download public object

Library

use std::path::Path;

use futures::TryStreamExt;
use gcs_rsync::storage::{Object, ObjectClient, StorageResult};
use tokio::{
    fs::File,
    io::{AsyncWriteExt, BufWriter},
};

#[tokio::main]
async fn main() -> StorageResult<()> {
    let bucket = "gcs-rsync-dev-public";
    let name = "hello.txt";

    let object_client = ObjectClient::no_auth();

    let file_name = Path::new(&name).file_name().unwrap().to_string_lossy();
    let file_path = file_name.to_string();

    let object = Object::new(bucket, name)?;
    let mut stream = object_client.download(&object).await.unwrap();

    let file = File::create(&file_path).await.unwrap();
    let mut buf_writer = BufWriter::new(file);

    while let Some(data) = stream.try_next().await.unwrap() {
        buf_writer.write_all(&data).await.unwrap();
    }

    buf_writer.flush().await.unwrap();
    println!("object {} downloaded to {:?}", &object, file_path);
    Ok(())
}

CLI

cargo run --release --example download_public_object "<YourBucket>" "<YourObjectName>" "<YourAbsoluteExistingDirectory>"

Delete object

Library

use gcs_rsync::storage::{credentials, Object, ObjectClient, StorageResult};

#[tokio::main]
async fn main() -> StorageResult<()> {
    let args = std::env::args().collect::<Vec<_>>();
    let bucket = args[1].as_str();
    let name = args[2].as_str();
    let object = Object::new(bucket, name)?;

    let auc = Box::new(credentials::authorizeduser::default().await?);
    let object_client = ObjectClient::new(auc).await?;

    object_client.delete(&object).await?;
    println!("object {} deleted", &object);
    Ok(())
}

CLI

cargo run --release --example delete_object "<YourBucket>" "<YourPrefix>/<YourFileName>"

List objects

Library

use futures::TryStreamExt;
use gcs_rsync::storage::{credentials, ObjectClient, ObjectsListRequest, StorageResult};

#[tokio::main]
async fn main() -> StorageResult<()> {
    let args = std::env::args().collect::<Vec<_>>();
    let bucket = args[1].as_str();
    let prefix = args[2].to_owned();

    let auc = Box::new(credentials::authorizeduser::default().await?);
    let object_client = ObjectClient::new(auc).await?;

    let objects_list_request = ObjectsListRequest {
        prefix: Some(prefix),
        fields: Some("items(name),nextPageToken".to_owned()),
        ..Default::default()
    };

    object_client
        .list(bucket, &objects_list_request)
        .await
        .try_for_each(|x| {
            println!("{}", x.name.unwrap());
            futures::future::ok(())
        })
        .await?;

    Ok(())
}

CLI

cargo run --release --example list_objects "<YourBucket>" "<YourPrefix>"

List objects with default service account

Library

use futures::TryStreamExt;
use gcs_rsync::storage::{credentials, ObjectClient, ObjectsListRequest, StorageResult};

#[tokio::main]
async fn main() -> StorageResult<()> {
    let args = std::env::args().collect::<Vec<_>>();
    let bucket = args[1].as_str();
    let prefix = args[2].to_owned();

    let auc = Box::new(
        credentials::serviceaccount::default(
            "https://www.googleapis.com/auth/devstorage.full_control",
        )
        .await?,
    );
    let object_client = ObjectClient::new(auc).await?;

    let objects_list_request = ObjectsListRequest {
        prefix: Some(prefix),
        fields: Some("items(name),nextPageToken".to_owned()),
        ..Default::default()
    };

    object_client
        .list(bucket, &objects_list_request)
        .await
        .try_for_each(|x| {
            println!("{}", x.name.unwrap());
            futures::future::ok(())
        })
        .await?;

    Ok(())
}

CLI

GOOGLE_APPLICATION_CREDENTIALS=<PathToJson> cargo r --release --example list_objects_service_account "<YourBucket>" "<YourPrefix>"

List lots of (>32K) objects

Listing a bucket containing more than 60K objects:

time cargo run --release --example list_objects "<YourBucket>" "<YourPrefixHavingMoreThan60K>" | wc -l
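There is no 32K-object cap because listing simply follows nextPageToken until the server stops returning one. The sketch below simulates that loop with in-memory pages standing in for the GCS objects.list API; only the token-following shape mirrors the real API.

```rust
// Simulated page of an objects.list response: items plus an optional token.
// Here the token is just the index of the next page; None means done.
struct Page {
    items: Vec<String>,
    next_page_token: Option<usize>,
}

// Follow next_page_token until it is exhausted, accumulating all items.
// This is why the object count is unbounded: each request fetches one page.
fn list_all(pages: &[Page]) -> Vec<String> {
    let mut out = Vec::new();
    let mut cursor = Some(0);
    while let Some(i) = cursor {
        let page = &pages[i];
        out.extend(page.items.iter().cloned());
        cursor = page.next_page_token;
    }
    out
}

fn main() {
    let pages = [
        Page { items: vec!["a".into(), "b".into()], next_page_token: Some(1) },
        Page { items: vec!["c".into()], next_page_token: None },
    ];
    assert_eq!(list_all(&pages), ["a", "b", "c"]);
    println!("listed {} objects across {} pages", 3, pages.len());
}
```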

Profiling

Humans are terrible at guessing about performance.

export CARGO_PROFILE_RELEASE_DEBUG=true
sudo -- cargo flamegraph --example list_objects "<YourBucket>" "<YourPrefixHavingMoreThan60K>"
cargo build --release --examples && /usr/bin/time -lp -- ./target/release/examples/list_objects "<YourBucket>" "<YourPrefixHavingMoreThan60K>"

Native binary build (statically linked)

Using the docker image rust:alpine3.14:

apk add --no-cache musl-dev pkgconfig openssl-dev

LDFLAGS="-static -L/usr/local/musl/lib" LD_LIBRARY_PATH=/usr/local/musl/lib:$LD_LIBRARY_PATH CFLAGS="-I/usr/local/musl/include" PKG_CONFIG_PATH=/usr/local/musl/lib/pkgconfig cargo build --release --target=x86_64-unknown-linux-musl --example bucket_to_folder_sync

gcs-rsync's People

Contributors

cboudereau


gcs-rsync's Issues

update publish crates github action

update the crate publish github action to remove those warnings

The following actions uses node12 which is deprecated and will be forced to run on node16....

when using WLIF getting error : "missing field `client_id`"

I'm running gcs-rsync from an EC2 instance which authenticates with GCP via workload identity federation.

How would I set the GOOGLE_CLIENT_ID or client_id?

gcs-rsync -u gs://bucket-name/file.test /tmp

Error: StorageError(GcsTokenError(HttpError(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("metadata.google.internal")), port: None, path: "/computeMetadata/v1/instance/service-accounts/default/token", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("dns error", Custom { kind: Uncategorized, error: "failed to lookup address information: Name or service not known" })) })))

gcs-rsync -u gs://bucket-name/file.test /tmp

Error: StorageError(GcsTokenError(EnvVarError { key: "GOOGLE_APPLICATION_CREDENTIALS", error: NotPresent }))
[ec2-user@ip-10-50-5-146 ~]$ export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/credentials.json

gcs-rsync -u gs://bucket-name/file.test /tmp

Error: StorageError(GcsTokenError(DeserializationError { expected_type: "gcs_rsync::gcp::oauth2::token::AuthorizedUserCredentials", error: Error("missing field `client_id`", line: 14, column: 1) }))

cat credentials.json

{
  "type": "external_account",
  "audience": "//iam.googleapis.com/projects/<my-project-id-number>/locations/global/workloadIdentityPools/amzn/providers/aws-provider",
  "subject_token_type": "urn:ietf:params:aws:token-type:aws4_request",
  "token_url": "https://sts.googleapis.com/v1/token",
  "credential_source": {
    "environment_id": "aws1",
    "region_url": "http://169.254.169.254/latest/meta-data/placement/availability-zone",
    "url": "http://169.254.169.254/latest/meta-data/iam/security-credentials",
    "regional_cred_verification_url": "https://sts.{region}.amazonaws.com?Action=GetCallerIdentity&Version=2011-06-15",
    "imdsv2_session_token_url": "http://169.254.169.254/latest/api/token"
  },
  "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/[email protected]:generateAccessToken"
}

rsync from a list of files

Hi,

I have a huge dataset and I only want to sync some of the files.
The traditional rsync allows for that (using --files-from), but gsutil rsync does not have this functionality.

Does your implementation allow such a thing? If not, can this be implemented?
