Comments (3)
One is cah.
Get http://3080.rom1504.fr/cah/cah_dataframe_unique/, then run:
img2dataset --url_list part_one.parquet --input_format "parquet" --url_col "URL" --caption_col "TEXT" --output_format webdataset --output_folder the_part_one_wds --thread_count 256 --image_size 256
running...
from img2dataset.
img2dataset --url_list part_one.parquet --input_format "parquet" --url_col "URL" --caption_col "TEXT" --output_format webdataset --output_folder the_part_one_wds --thread_count 128 --image_size 256
took 18h to get 18M samples, stored at http://3080.rom1504.fr/cah/cah_wds_part_one/
Average speed: 20MB/s.
I had to reduce the thread count due to a slow memory leak.
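As a rough sanity check on the figures above, the numbers can be turned into per-second and per-sample rates. This is only back-of-the-envelope arithmetic: it assumes 20MB/s is the average over the full 18h, that MB means 10^6 bytes, and that all transferred bytes belong to the 18M stored samples. The function name `run_stats` is made up for this sketch.

```python
def run_stats(samples, hours, mb_per_s):
    """Derive rough rates from a run's totals (assumes MB = 10**6 bytes)."""
    seconds = hours * 3600
    samples_per_s = samples / seconds
    total_bytes = mb_per_s * 1e6 * seconds
    bytes_per_sample = total_bytes / samples
    return samples_per_s, total_bytes, bytes_per_sample


samples_per_s, total_bytes, bytes_per_sample = run_stats(18_000_000, 18, 20)
print(f"{samples_per_s:.0f} samples/s")
print(f"{total_bytes / 1e12:.2f} TB transferred")
print(f"{bytes_per_sample / 1e3:.0f} kB per sample")
```

Under those assumptions this works out to roughly 278 samples/s, about 1.3 TB transferred in total, and about 72 kB per sample.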
Learned from this experiment:
- need to increase the concurrency a lot more to increase download speed
- need to decrease the number of processes to decrease RAM usage
- removing memory leaks would be great
Will try doing that using asks + trio / raw multithreading.
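The "raw multithreading" idea (many concurrent downloads inside few processes) could look roughly like the sketch below. It is not img2dataset's implementation: it uses only the Python standard library rather than asks + trio, and `fetch`, `download_all`, and their parameters are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, fetch_fn=fetch, max_workers=256):
    """Fetch urls with up to max_workers threads inside a single process.

    Returns (results, errors): url -> bytes and url -> exception.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every URL; the pool caps how many run at the same time.
        futures = {pool.submit(fetch_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed URL should not stop the run
                errors[url] = exc
    return results, errors
```

Calling `download_all(list_of_urls, max_workers=256)` keeps the concurrency in threads, so the process count (and the per-process RAM overhead) can stay low. For a run of this size the results would need to be written out as each download completes instead of being held in a dict; that part is left out to keep the sketch short.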
done