Comments (3)
Thanks for investigating the Cloud Dataflow Shuffle, Nima!
Update: using Cloud Dataflow Shuffle actually solves the issue of handling a massive number of records when merging. However, it can still be expensive since it attempts to merge data from all chromosomes at once. We can optimize this further by partitioning the data by chromosome first. Saman is going to work on partitioning data for sharded BQ export, and this also fits nicely there (we can "kill two birds with one stone" :p). Reassigning to Saman.
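To illustrate the partition-by-chromosome idea, here is a minimal pure-Python sketch. It is not the project's actual merge code: the record layout (`reference_name`, `start`, `reference_bases`, `alternate_bases`, `calls`) and all function names are assumptions made for this example. The point is only that each merge step sees a single chromosome's records instead of the whole dataset.

```python
from collections import defaultdict


def partition_by_chromosome(variants):
    """Group variant records by chromosome (hypothetical 'reference_name' field)."""
    partitions = defaultdict(list)
    for variant in variants:
        partitions[variant["reference_name"]].append(variant)
    return partitions


def merge_partition(variants):
    """Merge records within one chromosome that share position and alleles.

    Records with the same (start, reference_bases, alternate_bases) are
    combined by concatenating their calls. This merge key is an assumption
    for illustration only.
    """
    merged = {}
    for variant in variants:
        key = (
            variant["start"],
            variant["reference_bases"],
            tuple(variant["alternate_bases"]),
        )
        if key not in merged:
            # Copy the record and its calls so the input is not mutated.
            merged[key] = {**variant, "calls": list(variant["calls"])}
        else:
            merged[key]["calls"].extend(variant["calls"])
    return [merged[key] for key in sorted(merged)]


def merge_variants(variants):
    """Partition by chromosome first, then merge each partition independently."""
    partitions = partition_by_chromosome(variants)
    return {chrom: merge_partition(records) for chrom, records in partitions.items()}
```

Because each partition is merged independently, the per-chromosome merges could run on separate workers, which is the property that makes this cheaper than one global merge.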
from gcp-variant-transforms.
Saman, now that Cloud Dataflow Shuffle pricing is updated, could you please try with --experiments=shuffle_mode=service with the platinum1000 test to see what cost/performance you get? I think we can close this issue once that experiment is done.
Closing this bug. Summary:
- Using the shuffle service is actually more expensive than our native partition-based merge (turned on with the --optimize_for_large_inputs flag).
- Using a large number of small workers is much more efficient than using a small number of large workers.
- Using SSDs can further reduce cost if a large enough worker pool is not available.
More details are provided in the handling large inputs doc.