Huggingface Dataset Builder file to create subsets about birdset HOT 11 CLOSED

lurauch commented on August 18, 2024

Huggingface Dataset Builder file to create subsets

from birdset.

Comments (11)

Moritz-Wirth commented on August 18, 2024 1

Created an extra issue for the train/test split.
Found a solution for providing the ebird_codes in a separate file.
Created a notebook to pack the files into evenly sized .tar files (tar is recommended by HF).
Streaming the dataset is now also possible.

HF repo: https://huggingface.co/datasets/DBD-research-group/gadme_v1_1/tree/main
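
The packing step mentioned above can be sketched roughly as follows. This is not the notebook from the repo, only a minimal illustration using the standard library; the function names, the `max_bytes` budget, and the flat archive layout are assumptions made for the example.

```python
import tarfile
from pathlib import Path


def plan_shards(files, max_bytes):
    """Greedily group files into shards whose total size stays under max_bytes."""
    shards, current, current_size = [], [], 0
    for path in sorted(files):
        size = Path(path).stat().st_size
        # Start a new shard when adding this file would exceed the budget.
        if current and current_size + size > max_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        shards.append(current)
    return shards


def write_shards(files, out_dir, max_bytes, prefix="shard"):
    """Write each planned shard as an uncompressed .tar and return the tar paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    tar_paths = []
    for i, shard in enumerate(plan_shards(files, max_bytes)):
        tar_path = out_dir / f"{prefix}_{i:04d}.tar"
        with tarfile.open(tar_path, "w") as tar:
            for path in shard:
                # Store only the file name so the archive has a flat layout
                # (an assumption; nested layouts would keep relative paths).
                tar.add(path, arcname=Path(path).name)
        tar_paths.append(tar_path)
    return tar_paths
```

A single file larger than `max_bytes` still gets a shard of its own, so the shards are only approximately even in size.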

lurauch commented on August 18, 2024

lurauch commented on August 18, 2024

git cloning is not recommended: https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http
Pushing with git lfs leads to errors.
Use the HfApi client programmatically instead.
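
A minimal sketch of pushing files over HTTP with `HfApi` instead of git. The helper names, the `data` prefix, and the split into a planning step and an upload step are assumptions for illustration; the upload itself needs network access and a valid token, and the repo id is whatever dataset repo you are allowed to write to.

```python
from pathlib import Path


def plan_uploads(local_dir, repo_prefix="data"):
    """List (local_path, path_in_repo) pairs for every file under local_dir."""
    local_dir = Path(local_dir)
    pairs = []
    for path in sorted(local_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(local_dir).as_posix()
            pairs.append((str(path), f"{repo_prefix}/{rel}"))
    return pairs


def push_uploads(pairs, repo_id):
    """Upload each planned file over HTTP with HfApi (needs network and a token)."""
    # Imported here so the planning step works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    for local_path, path_in_repo in pairs:
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="dataset",
        )
```

Keeping the plan separate makes it possible to inspect which paths would end up in the repo before anything is sent.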

lurauch commented on August 18, 2024

Repo

https://huggingface.co/datasets/DBD-research-group/gadme_v1

lurauch commented on August 18, 2024

@Moritz-Wirth
@reheinrich

Lessons Learned

The loading script file has to have the same name as the repo! Otherwise, loading falls back to a default loader, which does not work properly.
If you want to debug the loading script, I would strongly suggest cloning the Hugging Face repository. Then you can point load_dataset directly at the loading script and debug it accordingly:

ds = load_dataset("gadme_clone_repo/gadme_cloned_repo.py", "sapsucker_woods")
or:
ds = load_dataset("gadme_clone_repo", "sapsucker_woods")

Cloning and pushing the repo (and the files) with git is not recommended. Use the HfApi to push to the repo instead: https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http. There is an example notebook in the dataset/data/huggingface_hub directory in the GADME repo.
The download_manager (https://huggingface.co/docs/datasets/v2.14.4/en/package_reference/builder_classes#datasets.DownloadManager) loads the files that are stored locally in the cache (metadata + audio files in this case).
I zipped all the files (this has to be sharded in parquet files, I guess)

metadata.zip (consists of train.csv and test.csv)
files.zip (all .ogg files)

The files are downloaded and cached in the split generator and then passed to the _generate_examples method.
This method loads the cached metadata file and then iterates over every row, using the file names from the metadata.
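
The row iteration described in the last two points can be sketched with the standard library alone. This is not the actual loading script: it only mirrors the logic of reading a split's CSV out of the metadata archive and yielding one example per row. The column names `file_name` and `ebird_code` and the `audio_dir` argument are hypothetical.

```python
import csv
import io
import zipfile
from pathlib import Path


def generate_examples(metadata_zip, audio_dir, split="train"):
    """Yield (index, example) pairs for one split, one per metadata row."""
    with zipfile.ZipFile(metadata_zip) as zf:
        # The archive is assumed to contain train.csv and test.csv.
        with zf.open(f"{split}.csv") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for idx, row in enumerate(csv.DictReader(text)):
                # "file_name" and "ebird_code" are hypothetical column names.
                yield idx, {
                    "audio": str(Path(audio_dir) / row["file_name"]),
                    "ebird_code": row["ebird_code"],
                }
```

In a real builder, the paths passed in here would be the cached locations returned by the download manager in the split generator.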

Repo Structure

Example

├── data
│   ├── sapsucker woods
│   │   ├── meta.zip
│   │   └── files.zip
│   └── amazon basin
│       ├── meta.zip
│       └── files.zip
└── gadme.py

TODOS

Add all features from the main metadata.csv file (right now, only a fraction of the features are added as an example).
Add all subtasks and create train and test data (should be an issue by itself)
Right now, the ebird_codes (the label names for the ClassLabel features) are added manually in a list. Unfortunately, the .txt file with the codes is loaded before the .info method is called. Maybe there is a more sophisticated workaround. Otherwise, we would have to provide a list for every sub-task. I looked at some example code; it always provided a list (to be fair, the examples did not have as many labels as we have).
Use a better approach to save the data. Maybe the .zip files work, and we do not need to put in more effort (we should ask the HF people). Otherwise, we have to change to the shard and parquet format.
I have not yet tested the workflow I suggested in Notion (.encode_column -> .map -> .set_transform).
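
As a rough illustration of the label-list item above, the codes could be read from a text file instead of being hard-coded. This sketch assumes one code per line and a file that is already available locally when the list is needed, which is exactly the sticking point described above; the function names are hypothetical.

```python
from pathlib import Path


def read_label_names(codes_file):
    """Return the non-empty, stripped lines of codes_file in file order."""
    lines = Path(codes_file).read_text(encoding="utf-8").splitlines()
    names = [line.strip() for line in lines if line.strip()]
    if len(set(names)) != len(names):
        raise ValueError("duplicate label names in codes file")
    return names


def encode_labels(codes, names):
    """Map each code to its integer id, i.e. its position in names."""
    index = {name: i for i, name in enumerate(names)}
    return [index[code] for code in codes]
```

The resulting list is the kind of value that `datasets.ClassLabel(names=...)` accepts, so one file per sub-task could replace one hand-written list per sub-task.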

lurauch commented on August 18, 2024

@Moritz-Wirth

I have assigned you now. Please read the lessons learned, extract some issues from them, and then close this issue when you are done. Thanks! :)

lurauch commented on August 18, 2024

@Moritz-Wirth What's the status here?

lurauch commented on August 18, 2024

Thanks! Looks good :)

lurauch commented on August 18, 2024

@Moritz-Wirth everything done, has to be pushed

lurauch commented on August 18, 2024

add "no_call" class to classes.py (@ stefan)
add only zenodo task to dataset builder
fix multiclass 5s
add fsl dataset

raphaelschwinger commented on August 18, 2024

@lurauch this is done, right?
