plantnet / gbif-dl
GBIF classification dataloaders
Home Page: https://plantnet.github.io/gbif-dl/
License: MIT License
Hi, when constructing our own datasets it would be handy to have the gbifID returned for each occurrence, in case we want to look it up manually on the GBIF website or API.
Currently, the dwca/doi URL generator does not support the same sampling functionality as the query generator. There might be a way to add at least some balancing here as well.
When drawing a subset of occurrences from the GBIF API, they may be sorted by some internal database ordering. Given that a subset should be taken from a random distribution of samples for best performance and generalization, it would be ideal if samples could be drawn randomly from the API.
Currently this is not supported, and we might want to raise the issue on either of the following trackers.
Currently most projects use GitHub workflows, so that seems the best option.
In the GitHub free tier this can only be enabled for public projects, so this will have to wait a bit.
The following query produces an unhandled JSONDecodeError inside gbif-dl:
import gbif_dl

data_generator = gbif_dl.api.generate_urls(
    queries={'familyKey': [9456], 'offset': 83941, 'pageLimit': 1},
    label="speciesKey",
)
for i in data_generator:
    print(i)
I estimate that I hit this roughly 1 in 100 times I create a URL generator.
Using offset in the query requires a change to gbif-dl/generators/api.py: add offset: int = 0 as a parameter of the gbif_query_generator function and replace offset = 0 with offset = offset near the top of the function body. I'm doing this so that I can get random samples from GBIF by setting the offset to a random number within the range of the number of occurrences for the query, which I get from a direct API call, as I don't think gbif-dl has a method to tell you this info?
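For reference, a rough sketch of that workflow. It assumes the proposed offset parameter gets threaded through generate_urls down to gbif_query_generator (it is not part of the released package); the count comes from a direct call to the public GBIF occurrence search API with limit=0:
import random
import requests
import gbif_dl

# Total number of occurrences matching the query; limit=0 returns only the count.
count = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"familyKey": 9456, "limit": 0},
).json()["count"]

# Start paging at a random position (relies on the proposed `offset` parameter).
data_generator = gbif_dl.api.generate_urls(
    queries={"familyKey": [9456]},
    label="speciesKey",
    offset=random.randint(0, max(count - 1, 0)),
)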
Thank you very much for sharing this utility.
We're heading into relatively new territory for the GBIF community, where images will increasingly be used in ML to build models. It is currently unclear how publishers will feel about this when learning that their photos are being used in this way. Even if it could be argued that copyright isn't being infringed upon since photos are not being redistributed (the equivalent of "not infringing copyright by creating an index allowing you to find Shakespeare works containing Romeo"), it may result in images being withheld in the future.
I suggest it would be wise to firstly acknowledge this situation in the README and presentations. Secondly, it would be good to promote a citation practice that acknowledges the sources of training data - e.g. the built models are given a DOI that links (cites) the source dataset DOIs. Thirdly, it might be prudent to restrict this to only CC0 and CC-BY licensed images.
At GBIF we are interested to see open discussion around these aspects.
Thanks!
It may be possible using the facets of the GBIF API?
Example: https://api.gbif.org/v1/occurrence/search?[...]&facet=taxonKey&limit=0&facetLimit=100
Currently, there is a minimum set of fields required by the downloader. These include the following:
from typing import TypedDict

class MediaData(TypedDict):
    """Media dict representation received from api or dwca generators"""
    url: str
    basename: str
    label: str
    content_type: str
    suffice: str
We might want to ensure that generators always emit at least these fields before passing items to the downloader; otherwise an error should be raised.
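A minimal sketch of such a guard (the helper name is hypothetical; the field list mirrors MediaData above):
REQUIRED_FIELDS = ("url", "basename", "label", "content_type", "suffice")

def validate_media_item(item: dict) -> None:
    # Raise early if a generator yields an item the downloader cannot handle.
    missing = [field for field in REQUIRED_FIELDS if field not in item]
    if missing:
        raise ValueError(f"media item is missing required fields: {missing}")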
As we already have support for torch.datasets, we should also add support for tf.data pipelines to cover a broader range of deep learning frameworks.
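A minimal sketch of what that could look like, assuming images have already been downloaded into per-label sub-folders (the ImageFolder-style layout mentioned further down this page):
import tensorflow as tf

def make_tf_dataset(root: str, image_size=(224, 224), batch_size=32):
    # Infers labels from the per-class sub-directories under `root`.
    return tf.keras.utils.image_dataset_from_directory(
        root,
        image_size=image_size,
        batch_size=batch_size,
    )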
Before saving a file, an optional integrity check should be run. The user should be able to pass a function that reads and verifies the raw byte string.
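For example, a possible callback that takes the downloaded bytes and verifies they decode as an image (the exact signature expected by the downloader's is_valid_file argument is an assumption here):
import io
from PIL import Image

def is_valid_image(data: bytes) -> bool:
    # Reject anything that Pillow cannot parse as an image.
    try:
        Image.open(io.BytesIO(data)).verify()
        return True
    except Exception:
        return False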
We need a license; are we okay with MIT?
Currently the debugging experience of the io.download_single function is not ideal, since it runs inside an async task. If an exception is raised inside that function the tasks are not stopped, which makes it hard to debug.
I added a watchdog decorator, but it doesn't work in all cases yet (see lines 5 to 17 in 3f2707a).
When trying to download, I get this error.
Code:
import gbif_dl

data_generator = gbif_dl.api.generate_urls(
    queries=query_per_species,
    label="speciesKey",
    nb_samples=100,
    weighted_streams=True,
)
stats = gbif_dl.io.download(data_generator, root="/content/gdrive/MyDrive/efc-images/plantnet")
Error:
0 Files [00:00, ? Files/s]Exception in thread Thread-13:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/utils.py", line 44, in run
self.result = runners.run(self.func(*self.args, **self.kwargs))
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/runners.py", line 104, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/io.py", line 242, in download_from_asyncgen
async for batch in chnk:
File "/usr/local/lib/python3.7/dist-packages/aiostream/stream/transform.py", line 87, in chunks
async for first in streamer:
File "/usr/local/lib/python3.7/dist-packages/aiostream/stream/create.py", line 34, in from_iterable
for item in it:
File "/usr/local/lib/python3.7/dist-packages/pescador/core.py", line 204, in iterate
for n, obj in enumerate(active_streamer.stream):
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/generators/api.py", line 43, in gbif_query_generator
resp = pygbif.occurrences.search(
AttributeError: module 'pygbif' has no attribute 'occurrences'
AttributeError Traceback (most recent call last)
in ()
----> 1 stats = gbif_dl.io.download(data_generator, root="/content/gdrive/MyDrive/efc-images/plantnet")
2 print(stats)
1 frames
/usr/local/lib/python3.7/dist-packages/gbif_dl/io.py in download(items, root, tcp_connections, nb_workers, batch_size, retries, verbose, overwrite, is_valid_file, proxy, random_subsets)
316 is_valid_file=is_valid_file,
317 proxy=proxy,
--> 318 random_subsets=random_subsets,
319 )
/usr/local/lib/python3.7/dist-packages/gbif_dl/utils.py in run_async(func, *args, **kwargs)
58 thread.start()
59 thread.join()
---> 60 return thread.result
61 else:
62 return runners.run(func(*args, **kwargs))
AttributeError: 'RunThread' object has no attribute 'result'
It seems that it crashes with joblib (1.3.1).
Downgrading to 0.10.0 seemed to solve the issue (pescadores/pescador#26)
Sample code:
import gbif_dl
data_generator = gbif_dl.dwca.generate_urls(
"10.15468/dl.pcxfa5", dwca_root_path="dwcas", label="speciesKey"
)
stats = gbif_dl.io.download(data_generator, root="my_dataset", retries=1000000)
Error:
Traceback (most recent call last):
File "test.py", line 1, in <module>
import gbif_dl
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/gbif_dl/__init__.py", line 15, in <module>
from .generators import api
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/gbif_dl/generators/api.py", line 9, in <module>
import pescador
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/pescador/__init__.py", line 8, in <module>
from .zmq_stream import *
File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/pescador/zmq_stream.py", line 33, in <module>
from joblib._parallel_backends import SafeFunction
ImportError: cannot import name 'SafeFunction' from 'joblib._parallel_backends' (/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/joblib/_parallel_backends.py)
By now the error is something like:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Given a GBIF download ID, we want to get the original query, modify it, and rerun it using the gbif_dl.api module.
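The public GBIF API exposes the metadata of an existing download, including the predicate it was built from. A small sketch of the retrieval step; translating the predicate tree back into the flat queries dict used by gbif_dl.api.generate_urls is exactly what this issue would add:
import requests

def get_download_predicate(download_key: str) -> dict:
    # download_key is the GBIF download identifier behind the download DOI.
    resp = requests.get(f"https://api.gbif.org/v1/occurrence/download/{download_key}")
    resp.raise_for_status()
    return resp.json()["request"]["predicate"]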
The current function-based design of the generators makes it difficult to extend the package towards a more modular API in the future.
E.g., the generators currently expose a generate_urls function that returns an iterable or generator. This could be enhanced by creating a class interface, e.g. users create a DWCAGenerator() that can be used directly within the downloader.
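A rough sketch of that interface; all names are illustrative, and it simply delegates to the existing functional generator:
import gbif_dl

class DWCAGenerator:
    def __init__(self, doi: str, dwca_root_path: str = "dwcas", label: str = "speciesKey"):
        self.doi = doi
        self.dwca_root_path = dwca_root_path
        self.label = label

    def __iter__(self):
        # Delegate to the current functional API for now.
        return iter(
            gbif_dl.dwca.generate_urls(
                self.doi, dwca_root_path=self.dwca_root_path, label=self.label
            )
        )

# The instance itself could then be handed to the downloader:
# stats = gbif_dl.io.download(DWCAGenerator("10.15468/dl.pcxfa5"), root="my_dataset")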
Unit tests for downloading DWCA archives are slow. Ideally we should include a tiny DWCA archive downloaded from GBIF in the package manifest and use that for the unit tests.
Most modules lack proper unit testing.
The download module should just work for a given plain list of URLs.
Currently each URL would need to be a dict like the following:
urls = [
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'basename': 'e75239cd029162c81f16a6d6afb1057d2437bcc8',
'label': '3189866'
},
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'basename': 'e75239cd029162c81f16a6d6afb1057d2437bcc8',
'label': '3189866'
},
{
'url': 'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321',
'basename': 'e04a36f124b875a16b5393a8fdef36846ada8e35',
'label': '3189866'
}
]
Thus, changes should be made so that basename and label can be omitted:
urls = [
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
},
{
'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
},
{
'url': 'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321',
}
]
but also
urls = [
'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321'
]
should work
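One possible way to support both forms is to normalize bare URL strings into minimal dicts inside the downloader; the hash used for the basename below is an assumption, not necessarily what gbif-dl does internally:
import hashlib
from typing import Union

def normalize_item(item: Union[str, dict]) -> dict:
    # Wrap a bare URL string in the minimal dict the downloader expects,
    # deriving a stable file basename from a hash of the URL.
    if isinstance(item, str):
        return {"url": item, "basename": hashlib.sha1(item.encode()).hexdigest()}
    return item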
The names of downloaded image files are not the same as those in the identifier column of the multimedia.txt file in the DWCA archive. How can I map a downloaded image back to its metadata?
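A best-effort workaround, assuming the downloader names files with a hex digest of the media URL (as the example dicts above suggest); it tries both sha1 and md5 and keeps whichever matches a file on disk:
import csv
import hashlib
import os

def map_files_to_rows(multimedia_txt: str, image_root: str) -> dict:
    # Collect the basenames (without extension) of all downloaded files.
    on_disk = {
        os.path.splitext(name)[0]
        for _, _, files in os.walk(image_root)
        for name in files
    }
    mapping = {}
    with open(multimedia_txt, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            url = row.get("identifier", "")
            for digest in (
                hashlib.sha1(url.encode()).hexdigest(),
                hashlib.md5(url.encode()).hexdigest(),
            ):
                if digest in on_disk:
                    mapping[digest] = row  # downloaded basename -> metadata row
                    break
    return mapping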
Currently, downloaded DWCA zip files can be stored temporarily. However, there is no automatic way to delete these files.
To reduce load on the servers, we should, by default, introduce a minimal wait/sleep of maybe 0.5 s per request.
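A minimal sketch of such a politeness delay inside an async worker (an aiohttp-style session is assumed; this is not the actual gbif-dl internals):
import asyncio

async def polite_get(session, url: str, delay: float = 0.5):
    # Fixed pause before every request to keep the load on GBIF servers low.
    await asyncio.sleep(delay)
    async with session.get(url) as resp:
        return await resp.read()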
data_generator = gbif_dl.api.generate_urls(
    queries=queries,
    label="speciesKey",
    nb_samples=-1,
    split_streams_by=["datasetKey", "speciesKey"],
)
The split_streams_by line gives the following error:
File "test.py", line 19, in
split_streams_by=["datasetKey", "speciesKey"]
File "/srv/gbif-dl/gbif_dl/generators/api.py", line 171, in generate_urls
for x, y in subset_streams.items():
AttributeError: 'NoneType' object has no attribute 'items'
Add functionality to select the list of species that have GBIF occurrences in a given area, and then download all images of those species regardless of where they were recorded.
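A possible two-step sketch using the public GBIF API facets: first list the species recorded in an area (here a country code), then query images for those species without any area filter:
import requests
import gbif_dl

# Step 1: species keys with occurrences in the area of interest.
resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"country": "FR", "limit": 0, "facet": "speciesKey", "facetLimit": 100},
).json()
species_keys = [facet_count["name"] for facet_count in resp["facets"][0]["counts"]]

# Step 2: download media for those species, wherever they were recorded.
data_generator = gbif_dl.api.generate_urls(
    queries={"speciesKey": species_keys},
    label="speciesKey",
)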
Both the dataset and the downloader are designed for multi-class classification tasks. That means a single label is used to store the data in a hierarchical folder structure, as used by torchvision.datasets.ImageFolder.
Given that we also want to support #9, it can make sense to switch to the webdataset format, where each file is accompanied by a single label file (e.g. json), resulting in a flat folder without any hierarchy, e.g.:
e39871fd9fd74f55.jpg
e39871fd9fd74f55.json
f18b91585c4d3f3e.jpg
f18b91585c4d3f3e.json
ede6e66b2fb59aab.jpg
ede6e66b2fb59aab.json
That would allow us to save all GBIF metadata in that JSON file which, in turn, enables more diverse tasks such as unsupervised learning.
Of course, we could also offer support for both?
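A small sketch of the flat layout (helper name and arguments are illustrative): image and metadata sidecar share the same stem, with no per-class folders:
import json
from pathlib import Path

def save_flat(root: str, basename: str, image_bytes: bytes, metadata: dict, suffix: str = ".jpg"):
    out = Path(root)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{basename}{suffix}").write_bytes(image_bytes)       # e.g. e39871fd9fd74f55.jpg
    (out / f"{basename}.json").write_text(json.dumps(metadata))  # e.g. e39871fd9fd74f55.json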
PyTorch is currently not included in the standard requirements; therefore a check should be implemented when importing gbif_dl.dataloaders.torch. If torch is not installed, a warning should be raised that hints at installing PyTorch separately or using pip install gbif-dl['torch'].
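One possible guard at the top of gbif_dl/dataloaders/torch.py (raising an informative error rather than a bare warning is one variant of what the issue asks for):
try:
    import torch
    from torch.utils.data import Dataset
except ImportError as err:
    raise ImportError(
        "PyTorch is required for gbif_dl.dataloaders.torch; install it "
        "separately or use `pip install gbif-dl[torch]`."
    ) from err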
Using webdataset is a great way to speed up the training pipeline and also makes it convenient to share and download archives of datasets (e.g. by uploading to Zenodo).
Addressing this issue should involve writing tar files using the gbif_dl.io method.
Proposal: first check the file header for the type. This means we do not have to provide the mimetype in the dict anymore, since this operation won't take much time. It also means we would have to guess the extension from the mimetype within the downloader.
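A sketch of guessing the suffix from the first bytes of the file, covering the common image formats with plain magic-byte checks (no external dependency):
def guess_suffix(data: bytes) -> str:
    if data[:3] == b"\xff\xd8\xff":
        return ".jpg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return ".png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return ".gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    return ""  # unknown: fall back to whatever the generator provided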
Users might have a very long list of URLs and the download might fail, so they would have to start again.
Either:
1. we implement a persistent queue that saves the queue status on disk, or
2. we check for existing files and, in case they exist, do not download them again, to reduce the possibility of duplicates.
Option 1 is tricky to implement for all use cases, since the download function takes not just lists but also Generators and AsyncGenerators. In the case of lists, I would propose using a library such as persist-queue. Therefore I would propose to just implement option 2, as sketched below.
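A sketch of option 2; the root/<label>/<basename> layout below mirrors the example dicts earlier on this page and is an assumption about where files end up:
import os

def already_downloaded(item: dict, root: str) -> bool:
    target_dir = os.path.join(root, str(item.get("label", "")))
    if not os.path.isdir(target_dir):
        return False
    stem = item["basename"]
    # Any extension counts, since the suffix may be guessed at download time.
    return any(name.startswith(stem) for name in os.listdir(target_dir))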
During the process of downloading files, no statistics of any kind are collected. For the user it would be useful to know:
Currently the torch dataset parses the full dataset. Ideally we should provide some helper functions to offer stratified splits.
Until #11 is addressed, we can just add a function using data.random_split, but since this is just one line of code, I don't think it brings much additional value.
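A sketch of a stratified helper, assuming an ImageFolder-style dataset that exposes targets (one label per sample):
import random
from collections import defaultdict
from torch.utils.data import Subset

def stratified_split(dataset, train_fraction: float = 0.8, seed: int = 0):
    # Group sample indices per class label.
    by_class = defaultdict(list)
    for index, label in enumerate(dataset.targets):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return Subset(dataset, train_idx), Subset(dataset, val_idx)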
Using code taken from the README, passing "9456" as a speciesKey returns no results, even though occurrences with images exist - https://www.gbif.org/occurrence/gallery?media_type=StillImage&taxon_key=9456.
Even with a more specific key, "2437489", there are still no examples returned.
At the moment this is a single value, so it can only collect one. I would be interested in collecting examples of both.
With async in place, errors are hard to trace; a single-threaded option could help for debugging.
The DOI-based DWCA downloader doesn't require many parameters, which is why it would be ideally suited to a CLI interface.
Since we already have type hinting enabled, we can use typer to automatically create the CLI.
gbif-dl -dwca 10.15468/dl.vnm42s
gbif-dl file_list.txt
cat ... | gbif-dl
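A minimal typer sketch of the DOI-based command (command name, flags, and defaults are illustrative):
import typer
import gbif_dl

app = typer.Typer()

@app.command()
def dwca(doi: str, root: str = "my_dataset", dwca_root_path: str = "dwcas", label: str = "speciesKey"):
    # Download all media referenced by the DWCA archive behind a download DOI.
    generator = gbif_dl.dwca.generate_urls(doi, dwca_root_path=dwca_root_path, label=label)
    gbif_dl.io.download(generator, root=root)

if __name__ == "__main__":
    app()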