plantnet / gbif-dl
GBIF classification dataloaders
Home Page: https://plantnet.github.io/gbif-dl/
License: MIT License
Hi, when constructing our own datasets it would be handy to have the gbifID returned for each occurrence, in case we want to look it up manually on the GBIF website or API.
Currently, the dwca/doi URL generator does not support the same sampling functionality as the query generator. There might be a way to add at least some balancing here as well.
When drawing a subset of occurrences from the GBIF API, they may be sorted by some internal database ordering. Given that a subset should be taken from a random distribution of samples for best performance and generalization, it would be ideal if samples could be drawn randomly from the API.
Currently this is not supported, and we might want to raise the issue on either of the following trackers.
Currently most projects use GitHub workflows, so that seems the best option.
In the GitHub free tier this can only be enabled for public projects, so this will have to wait a bit.
The following query produces an unhandled JSONDecodeError inside gbif-dl:
import gbif_dl

data_generator = gbif_dl.api.generate_urls(
    queries={'familyKey': [9456], 'offset': 83941, 'pageLimit': 1},
    label="speciesKey",
)
for i in data_generator:
    print(i)
I estimate that I hit this roughly 1 in 100 times I create a URL generator.
Using offset in the query requires a change to gbif-dl/generators/api.py: add offset: int = 0 as a parameter of the gbif_query_generator function and replace offset = 0 with offset = offset near the top of the function body. I'm doing this so that I can get random samples from GBIF by setting the offset to a random number within the range of the number of occurrences for the query, which I get from a direct API call, as I don't think gbif-dl has a method to tell you this info?
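For reference, a rough sketch of that workflow. It assumes the proposed offset parameter gets threaded through generate_urls down to gbif_query_generator (it is not part of the released package); the count comes from a direct call to the public GBIF occurrence search API with limit=0:
import random
import requests
import gbif_dl

# Total number of occurrences matching the query; limit=0 returns only the count.
count = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"familyKey": 9456, "limit": 0},
).json()["count"]

# Start paging at a random position (relies on the proposed `offset` parameter).
data_generator = gbif_dl.api.generate_urls(
    queries={"familyKey": [9456]},
    label="speciesKey",
    offset=random.randint(0, max(count - 1, 0)),
)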
Thank you very much for sharing this utility.
We're heading into relatively new territory for the GBIF community, where images will increasingly be used in ML to build models. It is currently unclear how publishers will feel about this when learning that their photos are being used in this way. Even if it could be argued that copyright isn't being infringed upon since photos are not being redistributed (the equivalent of "not infringing copyright by creating an index allowing you to find Shakespeare works containing Romeo"), it may result in images being withheld in the future.
I suggest it would be wise to firstly acknowledge this situation in the README and presentations. Secondly, it would be good to promote a citation practice that acknowledges the sources of training data - e.g. the built models are given a DOI that links (cites) the source dataset DOIs. Thirdly, it might be prudent to restrict this to only CC0 and CC-BY licensed images.
At GBIF we are interested to see open discussion around these aspects.
Thanks!
It may be possible using the facets of the GBIF API?
Example: https://api.gbif.org/v1/occurrence/search?[...]&facet=taxonKey&limit=0&facetLimit=100
Currently, there is a minimum set of fields required by the downloader. These include the following:
from typing import TypedDict

class MediaData(TypedDict):
    """Media dict representation received from api or dwca generators"""
    url: str
    basename: str
    label: str
    content_type: str
    suffice: str
We might want to ensure that generators always emit at least these fields before passing items to the downloader; otherwise an error should be raised.
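A minimal sketch of such a guard (the helper name is hypothetical; the field list mirrors MediaData above):
REQUIRED_FIELDS = ("url", "basename", "label", "content_type", "suffice")

def validate_media_item(item: dict) -> None:
    # Raise early if a generator yields an item the downloader cannot handle.
    missing = [field for field in REQUIRED_FIELDS if field not in item]
    if missing:
        raise ValueError(f"media item is missing required fields: {missing}")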
As we already have support for torch.datasets, we should also add support for tf.data pipelines to cover a broader range of deep learning frameworks.
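A minimal sketch of what that could look like, assuming images have already been downloaded into per-label sub-folders (the ImageFolder-style layout mentioned further down this page):
import tensorflow as tf

def make_tf_dataset(root: str, image_size=(224, 224), batch_size=32):
    # Infers labels from the per-class sub-directories under `root`.
    return tf.keras.utils.image_dataset_from_directory(
        root,
        image_size=image_size,
        batch_size=batch_size,
    )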
Before saving a file, an optional integrity check should be run. The user should be able to pass a function that reads and verifies the raw byte string.
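For example, a possible callback that takes the downloaded bytes and verifies they decode as an image (the exact signature expected by the downloader's is_valid_file argument is an assumption here):
import io
from PIL import Image

def is_valid_image(data: bytes) -> bool:
    # Reject anything that Pillow cannot parse as an image.
    try:
        Image.open(io.BytesIO(data)).verify()
        return True
    except Exception:
        return False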
We need a license; are we okay with MIT?
Currently the debugging experience of the io.download_single function is not ideal, since it runs inside an async task. If an exception is raised inside that function the tasks are not stopped, which makes it hard to debug.
I added a watchdog decorator, but it doesn't work in all cases yet (see lines 5 to 17 in 3f2707a).
When trying to download, I get this error.
Code:
import gbif_dl

data_generator = gbif_dl.api.generate_urls(
    queries=query_per_species,
    label="speciesKey",
    nb_samples=100,
    weighted_streams=True,
)
stats = gbif_dl.io.download(data_generator, root="/content/gdrive/MyDrive/efc-images/plantnet")
Error:
0 Files [00:00, ? Files/s]Exception in thread Thread-13:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/utils.py", line 44, in run
self.result = runners.run(self.func(*self.args, **self.kwargs))
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/runners.py", line 104, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/io.py", line 242, in download_from_asyncgen
async for batch in chnk:
File "/usr/local/lib/python3.7/dist-packages/aiostream/stream/transform.py", line 87, in chunks
async for first in streamer:
File "/usr/local/lib/python3.7/dist-packages/aiostream/stream/create.py", line 34, in from_iterable
for item in it:
File "/usr/local/lib/python3.7/dist-packages/pescador/core.py", line 204, in iterate
for n, obj in enumerate(active_streamer.stream):
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/generators/api.py", line 43, in gbif_query_generator
resp = pygbif.occurrences.search(
AttributeError: module 'pygbif' has no attribute 'occurrences'
AttributeError Traceback (most recent call last)
in ()
----> 1 stats = gbif_dl.io.download(data_generator, root="/content/gdrive/MyDrive/efc-images/plantnet")
2 print(stats)
1 frames
/usr/local/lib/python3.7/dist-packages/gbif_dl/io.py in download(items, root, tcp_connections, nb_workers, batch_size, retries, verbose, overwrite, is_valid_file, proxy, random_subsets)
316 is_valid_file=is_valid_file,
317 proxy=proxy,
--> 318 random_subsets=random_subsets,
319 )
/usr/local/lib/python3.7/dist-packages/gbif_dl/utils.py in run_async(func, *args, **kwargs)
58 thread.start()
59 thread.join()
---> 60 return thread.result
61 else:
62 return runners.run(func(*args, **kwargs))
AttributeError: 'RunThread' object has no attribute 'result'
It seems that it crashes with joblib (1.3.1).
Downgrading to 0.10.0 seemed to solve the issue (pescadores/pescador#26)
Sample code:
import gbif_dl
data_generator = gbif_dl.dwca.generate_urls(
"10.15468/dl.pcxfa5", dwca_root_path="dwcas", label="speciesKey"
)
stats = gbif_dl.io.download(data_generator, root="my_dataset", retries=1000000)
Error:
Traceback (most recent call last):
File "test.py", line 1, in <module>
import gbif_dl
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/gbif_dl/__init__.py", line 15, in <module>
from .generators import api
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/gbif_dl/generators/api.py", line 9, in <module>
import pescador
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/pescador/__init__.py", line 8, in <module>
from .zmq_stream import *
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/pescador/zmq_stream.py", line 33, in <module>
from joblib._parallel_backends import SafeFunction
ImportError: cannot import name 'SafeFunction' from 'joblib._parallel_backends' (/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/joblib/_parallel_backends.py)
By now the error is something like:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Given a GBIF download ID, we want to get the original query, modify it, and rerun it using the gbif_dl.api module.
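The public GBIF API exposes the metadata of an existing download, including the predicate it was built from. A small sketch of the retrieval step; translating the predicate tree back into the flat queries dict used by gbif_dl.api.generate_urls is exactly what this issue would add:
import requests

def get_download_predicate(download_key: str) -> dict:
    # download_key is the GBIF download identifier behind the download DOI.
    resp = requests.get(f"https://api.gbif.org/v1/occurrence/download/{download_key}")
    resp.raise_for_status()
    return resp.json()["request"]["predicate"]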
The current function-based design of the generators makes it difficult to extend the package towards a more modular API in the future.
E.g., the generators currently expose a generate_urls function that returns an iterable or generator. This could be enhanced by creating a class interface, e.g. users create a DWCAGenerator() that can be used directly within the downloader.
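A rough sketch of that interface; all names are illustrative, and it simply delegates to the existing functional generator:
import gbif_dl

class DWCAGenerator:
    def __init__(self, doi: str, dwca_root_path: str = "dwcas", label: str = "speciesKey"):
        self.doi = doi
        self.dwca_root_path = dwca_root_path
        self.label = label

    def __iter__(self):
        # Delegate to the current functional API for now.
        return iter(
            gbif_dl.dwca.generate_urls(
                self.doi, dwca_root_path=self.dwca_root_path, label=self.label
            )
        )

# The instance itself could then be handed to the downloader:
# stats = gbif_dl.io.download(DWCAGenerator("10.15468/dl.pcxfa5"), root="my_dataset")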
Unit tests for downloading DWCA archives are slow. Ideally we should include a tiny DWCA archive downloaded from GBIF in the package manifest and use that for the unit tests.
Most modules lack proper unit testing.
The download module should just work for a given plain list of URLs.
Currently each URL would need to be a dict like the following:
urls = [
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'basename': 'e75239cd029162c81f16a6d6afb1057d2437bcc8',
'label': '3189866'
},
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'basename': 'e75239cd029162c81f16a6d6afb1057d2437bcc8',
'label': '3189866'
},
{
'url': 'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321',
'basename': 'e04a36f124b875a16b5393a8fdef36846ada8e35',
'label': '3189866'
}
]
Thus, changes should be made so that basename and label can be omitted:
urls = [
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
},
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
},
{
'url': 'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321',
}
]
but also
urls = [
'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321'
]
should work
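One possible way to support both forms is to normalize bare URL strings into minimal dicts inside the downloader; the hash used for the basename below is an assumption, not necessarily what gbif-dl does internally:
import hashlib
from typing import Union

def normalize_item(item: Union[str, dict]) -> dict:
    # Wrap a bare URL string in the minimal dict the downloader expects,
    # deriving a stable file basename from a hash of the URL.
    if isinstance(item, str):
        return {"url": item, "basename": hashlib.sha1(item.encode()).hexdigest()}
    return item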
The names of downloaded image files are not the same as those in the identifier column of the multimedia.txt file in the DWCA archive. How can I map a downloaded image back to its metadata?
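A best-effort workaround, assuming the downloader names files with a hex digest of the media URL (as the example dicts above suggest); it tries both sha1 and md5 and keeps whichever matches a file on disk:
import csv
import hashlib
import os

def map_files_to_rows(multimedia_txt: str, image_root: str) -> dict:
    # Collect the basenames (without extension) of all downloaded files.
    on_disk = {
        os.path.splitext(name)[0]
        for _, _, files in os.walk(image_root)
        for name in files
    }
    mapping = {}
    with open(multimedia_txt, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            url = row.get("identifier", "")
            for digest in (
                hashlib.sha1(url.encode()).hexdigest(),
                hashlib.md5(url.encode()).hexdigest(),
            ):
                if digest in on_disk:
                    mapping[digest] = row  # downloaded basename -> metadata row
                    break
    return mapping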
Currently, downloaded DWCA zip files can be stored temporarily. However, there is no automatic way to delete these files.
To reduce load on the servers, we should, by default, introduce a minimal wait/sleep of maybe 0.5 s per request.
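A minimal sketch of such a politeness delay inside an async worker (an aiohttp-style session is assumed; this is not the actual gbif-dl internals):
import asyncio

async def polite_get(session, url: str, delay: float = 0.5):
    # Fixed pause before every request to keep the load on GBIF servers low.
    await asyncio.sleep(delay)
    async with session.get(url) as resp:
        return await resp.read()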
data_generator = gbif_dl.api.generate_urls(
    queries=queries,
    label="speciesKey",
    nb_samples=-1,
    split_streams_by=["datasetKey", "speciesKey"],
)
The split_streams_by line gives the following error:
File "test.py", line 19, in
split_streams_by=["datasetKey", "speciesKey"]
File "/srv/gbif-dl/gbif_dl/generators/api.py", line 171, in generate_urls
for x, y in subset_streams.items():
AttributeError: 'NoneType' object has no attribute 'items'
Add functionality to select the list of species that have GBIF occurrences in a given area, and then download all images of those species regardless of where they were recorded.
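A possible two-step sketch using the public GBIF API facets: first list the species recorded in an area (here a country code), then query images for those species without any area filter:
import requests
import gbif_dl

# Step 1: species keys with occurrences in the area of interest.
resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"country": "FR", "limit": 0, "facet": "speciesKey", "facetLimit": 100},
).json()
species_keys = [facet_count["name"] for facet_count in resp["facets"][0]["counts"]]

# Step 2: download media for those species, wherever they were recorded.
data_generator = gbif_dl.api.generate_urls(
    queries={"speciesKey": species_keys},
    label="speciesKey",
)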
Both the dataset and the downloader are designed for multi-class classification tasks. That means a single label is used to store the data in a hierarchical folder structure, as used by torchvision.datasets.ImageFolder.
Given that we also want to support #9, it can make sense to switch to the webdataset format, where each file is accompanied by a single label file (e.g. json), resulting in a flat folder without any hierarchy, e.g.:
e39871fd9fd74f55.jpg
e39871fd9fd74f55.json
f18b91585c4d3f3e.jpg
f18b91585c4d3f3e.json
ede6e66b2fb59aab.jpg
ede6e66b2fb59aab.json
That would allow us to save all GBIF metadata in that JSON file which, in turn, enables more diverse tasks such as unsupervised learning.
Of course, we could also offer support for both?
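A small sketch of the flat layout (helper name and arguments are illustrative): image and metadata sidecar share the same stem, with no per-class folders:
import json
from pathlib import Path

def save_flat(root: str, basename: str, image_bytes: bytes, metadata: dict, suffix: str = ".jpg"):
    out = Path(root)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{basename}{suffix}").write_bytes(image_bytes)       # e.g. e39871fd9fd74f55.jpg
    (out / f"{basename}.json").write_text(json.dumps(metadata))  # e.g. e39871fd9fd74f55.json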
PyTorch is currently not included in the standard requirements; therefore a check should be implemented when importing gbif_dl.dataloaders.torch. If torch is not installed, a warning should be raised that hints at installing PyTorch separately or using pip install gbif-dl['torch'].
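One possible guard at the top of gbif_dl/dataloaders/torch.py (raising an informative error rather than a bare warning is one variant of what the issue asks for):
try:
    import torch
    from torch.utils.data import Dataset
except ImportError as err:
    raise ImportError(
        "PyTorch is required for gbif_dl.dataloaders.torch; install it "
        "separately or use `pip install gbif-dl[torch]`."
    ) from err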
Using webdataset is a great way to speed up the training pipeline and also makes it convenient to share and download archives of datasets (e.g. by uploading to Zenodo).
Addressing this issue should involve writing tar files using the gbif_dl.io method.
Proposal: first check the file header for the type. This means we do not have to provide the mimetype in the dict anymore, since this operation won't take much time. It also means we would have to guess the extension from the mimetype within the downloader.
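A sketch of guessing the suffix from the first bytes of the file, covering the common image formats with plain magic-byte checks (no external dependency):
def guess_suffix(data: bytes) -> str:
    if data[:3] == b"\xff\xd8\xff":
        return ".jpg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return ".png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    return ""  # unknown: fall back to whatever the generator provided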
Users might have a very long list of URLs and the download might fail, so they would have to start again.
Either:
1. we implement a persistent queue that saves the queue status on disk, or
2. we check for existing files and, in case they exist, do not download them again, to reduce the possibility of duplicates.
Option 1 is tricky to implement for all use cases, since the download function takes not just lists but also Generators and AsyncGenerators. In the case of lists, I would propose using a library such as persist-queue. Therefore I would propose to just implement option 2, as sketched below.
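A sketch of option 2; the root/<label>/<basename> layout below mirrors the example dicts earlier on this page and is an assumption about where files end up:
import os

def already_downloaded(item: dict, root: str) -> bool:
    target_dir = os.path.join(root, str(item.get("label", "")))
    if not os.path.isdir(target_dir):
        return False
    stem = item["basename"]
    # Any extension counts, since the suffix may be guessed at download time.
    return any(name.startswith(stem) for name in os.listdir(target_dir))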
During the process of downloading files, no statistics of any kind are collected. For the user it would be useful to know:
Currently the torch dataset parses the full dataset. Ideally we should provide some helper functions to offer stratified splits.
Until #11 is addressed, we can just add a function using data.random_split, but since this is just one line of code, I don't think it brings much additional value.
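A sketch of a stratified helper, assuming an ImageFolder-style dataset that exposes targets (one label per sample):
import random
from collections import defaultdict
from torch.utils.data import Subset

def stratified_split(dataset, train_fraction: float = 0.8, seed: int = 0):
    # Group sample indices per class label.
    by_class = defaultdict(list)
    for index, label in enumerate(dataset.targets):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return Subset(dataset, train_idx), Subset(dataset, val_idx)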
Using code taken from the README, passing "9456" as a speciesKey returns no results, even though occurrences with images exist - https://www.gbif.org/occurrence/gallery?media_type=StillImage&taxon_key=9456.
Even with a more specific key, "2437489", there are still no examples returned.
At the moment this is a single value, so it can only collect one. I would be interested in collecting examples of both.
With async in place, errors are hard to trace; a single-threaded option could help for debugging.
The DOI-based DWCA downloader doesn't require many parameters, which is why it would be ideally suited to a CLI interface.
Since we already have type hinting enabled, we can use typer to automatically create the CLI.
gbif-dl -dwca 10.15468/dl.vnm42s
gbif-dl file_list.txt
cat ... | gbif-dl
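A minimal typer sketch of the DOI-based command (command name, flags, and defaults are illustrative):
import typer
import gbif_dl

app = typer.Typer()

@app.command()
def dwca(doi: str, root: str = "my_dataset", dwca_root_path: str = "dwcas", label: str = "speciesKey"):
    # Download all media referenced by the DWCA archive behind a download DOI.
    generator = gbif_dl.dwca.generate_urls(doi, dwca_root_path=dwca_root_path, label=label)
    gbif_dl.io.download(generator, root=root)

if __name__ == "__main__":
    app()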