
gbif-dl's Introduction

PlantNet

Developed by Tom August

last built 2019-07-23

This R package interfaces with the PlantNet image classification API. The API is designed to receive images and return species classifications. You can find out more about the PlantNet project here: https://plantnet.org. To use the API you need to have registered an account and generated an API key; both can be done here: https://my.plantnet.org/.

Install the package

To install the development version of this package from GitHub, use this code.

# Install using the devtools package
devtools::install_github(repo = 'BiologicalRecordsCentre/plantnet')
library(plantnet)

Using the package to classify images

The images that you want to classify need to have URLs. If the images you have are not online you will need to put them online and then copy all of the URLs. The other information you need is your API key. You can create this after registering here: https://my.plantnet.org/.

Single image

With these we can do a single image classification like this. Here we use a photo of a lavender flower from Wikimedia Commons.


# Get your key from https://my.plantnet.org/
key <- "YOUR_SUPER_SECRET_KEY"
# Get the URL for your image
imageURL <- 'https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Single_lavendar_flower02.jpg/800px-Single_lavendar_flower02.jpg'
classifications <- identify(key, imageURL)
classifications
##       score     latin_name                 common_name         
##  [1,] 35.92154  "Lavandula dentata"        "French lavender"   
##  [2,] 31.78688  "Lavandula angustifolia"   "Lavender"          
##  [3,] 18.06659  "Lavandula stoechas"       "Topped lavender"   
##  [4,] 6.717049  "Lavandula latifolia"      "Broadleaf lavender"
##  [5,] 1.665378  "Perovskia atriplicifolia" "Russian-sage"      
##  [6,] 1.022022  "Lavandula pinnata"        NA                  
##  [7,] 0.5533172 "Vitex agnus-castus"       "Chasteberry"       
##  [8,] 0.5320209 "Salvia farinacea"         "Mealy sage"        
##  [9,] 0.485067  "Lavandula multifida"      "Fern-leaf lavender"
## [10,] 0.3812005 "Lavandula canariensis"    NA                  
## [11,] 0.2864826 "Nepeta tuberosa"          NA                  
## [12,] 0.1719182 "Lavandula minutolii"      NA                  
## [13,] 0.1427725 "Perovskia abrotanoides"   "Russian Sage"

You can see that the table returned has three columns. The first gives you a score: the likelihood that the species in this row is the correct classification for the image you provided. As you can see, in this instance there is not much between the top two species, so we could not be confident which species it is. The second column gives the Latin names as binomials, and the final column gives the most commonly used common name for the species.

If you want more information, such as a list of all of the common names for each classification, or the full Latin name including authors, you can access all of this information by using simplify = FALSE.

classifications <- identify(key, imageURL, simplify = FALSE)
str(classifications,1)
## List of 4
##  $ query              :List of 3
##  $ language           : chr "en"
##  $ preferedReferential: chr "florefrance"
##  $ results            :List of 13

The top elements give some information about the call we made to the API, but results contains what we are usually after: one entry for each species classification, i.e. each row in the table we saw above. Let's look at one.

classifications$results[[1]]
## $score
## [1] 35.92154
## 
## $species
## $species$scientificNameWithoutAuthor
## [1] "Lavandula dentata"
## 
## $species$scientificNameAuthorship
## [1] "L."
## 
## $species$genus
## $species$genus$scientificNameWithoutAuthor
## [1] "Lavandula"
## 
## $species$genus$scientificNameAuthorship
## [1] "L."
## 
## 
## $species$family
## $species$family$scientificNameWithoutAuthor
## [1] "Lamiaceae"
## 
## $species$family$scientificNameAuthorship
## [1] ""
## 
## 
## $species$commonNames
## $species$commonNames[[1]]
## [1] "French lavender"
## 
## $species$commonNames[[2]]
## [1] "Spanish Lavender"

Here we have information on the Latin name, authors, and at the end, a list of the common names.

Multiple images

You can get a better identification if you provide more than one image of the plant, and of multiple organs of the plant. The organs that PlantNet considers are: leaf, flower, fruit, bark. You can also take images classed as habit (the overall form of the plant), or other, but you can only have an image labelled as one of these if you also have an image labelled as one of the primary organs (i.e. leaf, flower, fruit, bark).

In this example we are going to use three images of Quercus robur from the Encyclopedia of Life.




# We can search using up to five images
# Here are three pictures of Quercus robur
imageURL1 <- 'https://content.eol.org/data/media/55/2c/a8/509.1003460.jpg'
imageURL2 <- 'https://content.eol.org/data/media/89/88/4c/549.BI-image-16054.jpg'
imageURL3 <- 'https://content.eol.org/data/media/8a/77/9b/549.BI-image-76488.jpg'

identify(key, imageURL = c(imageURL1, imageURL2, imageURL3))
##      score    latin_name           common_name      
## [1,] 68.28053 "Quercus robur"      "Pedunculate oak"
## [2,] 28.3852  "Terminalia catappa" "Indian-almond"  
## [3,] 19.02534 "Quercus petraea"    "Sessile oak"

In this case all three images have been used to arrive at one classification. Here we get a few species suggested but the top species is correct, and has a significantly higher score than the second species. In this example we have not told the API what the organs are. The API can use this information to help give a better classification.

# This time I specify the organs in each image
identify(key,
         imageURL = c(imageURL1, imageURL2, imageURL3),
         organs = c('habit','bark','fruit'))
##      score    latin_name           common_name      
## [1,] 68.28053 "Quercus robur"      "Pedunculate oak"
## [2,] 19.02534 "Quercus petraea"    "Sessile oak"    
## [3,] 18.92347 "Terminalia catappa" "Indian-almond"

Notice that now the API knows which organ each image shows, we get slightly different results. The top species is the same, but we can have higher confidence that this is the correct answer because the gap between the scores of the first and second species is larger than before.

gbif-dl's People

Contributors

alexisjoly, faroit, jclombar

gbif-dl's Issues

Add support for tensorflow datasets

As we already have support for torch datasets, we should also add support for tf.data pipelines to cover a broader range of deep learning frameworks.
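A minimal sketch of what such an entry point could look like, assuming images have already been downloaded into the per-label folder layout the current downloader produces (the function name and defaults below are not part of gbif-dl):

import tensorflow as tf

def get_tf_dataset(root="my_dataset", image_size=(224, 224), batch_size=32):
    # image_dataset_from_directory infers labels from the per-label subfolders
    # written by the multi-class downloader
    return tf.keras.utils.image_dataset_from_directory(
        root,
        image_size=image_size,
        batch_size=batch_size,
    )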

Implement download statistics

During the process of downloading files, no statistics of any kind are collected. For the user it would be useful to know (a rough sketch of such a summary follows this list):

  • how many files were downloaded successfully
  • how many files failed to download, and from which hosts
  • a configurable threshold for the percentage of files that must be downloaded successfully in order to return a success state
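A minimal sketch of the kind of summary the downloader could return, assuming a plain dict is acceptable (all field names here are illustrative, not part of the current gbif-dl API):

from collections import Counter

# illustrative stats object: counts kept by the downloader while it runs
stats = {
    "successful": 0,               # files downloaded without error
    "failed_per_host": Counter(),  # e.g. Counter({"bs.plantnet.org": 3})
}

def run_succeeded(stats, threshold=0.95):
    # returns True if at least `threshold` of all attempted files succeeded
    total = stats["successful"] + sum(stats["failed_per_host"].values())
    return total > 0 and stats["successful"] / total >= threshold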

Support multi-label classification

Both the dataset and the downloader are designed for multi-class classification tasks. That means a single label is used to store the data in a hierarchical folder structure, as used in torchvision.datasets.ImageFolder.

Given that we also want to support #9, it can make sense to switch to the webdataset format, where each file is accompanied by a single label file (e.g. JSON), resulting in a flat folder without any hierarchy.

e.g.:

e39871fd9fd74f55.jpg
e39871fd9fd74f55.json
f18b91585c4d3f3e.jpg
f18b91585c4d3f3e.json
ede6e66b2fb59aab.jpg
ede6e66b2fb59aab.json

That would allow us to save all GBIF metadata in that JSON, which in turn enables more diverse tasks such as unsupervised learning.
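A rough sketch of writing one sample in that flat layout, assuming the metadata dict holds whatever GBIF returns for the occurrence (the helper name and arguments are illustrative):

import json
import pathlib

def save_sample(key, image_bytes, metadata, root="my_dataset"):
    # writes <key>.jpg next to <key>.json in a flat folder, as in the listing above
    out = pathlib.Path(root)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{key}.jpg").write_bytes(image_bytes)
    (out / f"{key}.json").write_text(json.dumps(metadata))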

Of course, we could also offer support for both?

Add support for webdataset

Using webdataset is a great way to speed up the training pipeline and also makes it convenient to share and download archives of datasets (e.g. by uploading to Zenodo).

Addressing this issue should involve:

  • a method to write webdataset tar files using the gbif_dl.io method (see the sketch after this list).
  • a torch dataset class/pipeline to parse the dataset.
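A hedged sketch of both pieces using the webdataset library (sample keys, metadata fields and the decode chain are assumptions):

import webdataset as wds

# writing: one dict per sample; dict keys become file extensions inside the tar
image_bytes = open("e39871fd9fd74f55.jpg", "rb").read()  # placeholder image
with wds.TarWriter("dataset.tar") as sink:
    sink.write({
        "__key__": "e39871fd9fd74f55",
        "jpg": image_bytes,
        "json": {"speciesKey": "3189866"},  # any gbif metadata
    })

# reading: an iterable pipeline that torch's DataLoader can consume directly
dataset = wds.WebDataset("dataset.tar").decode("pil").to_tuple("jpg", "json")
for image, metadata in dataset:
    print(metadata)
    break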

Issue when using split streams by

data_generator = gbif_dl.api.generate_urls(
    queries=queries,
    label="speciesKey",
    nb_samples=-1,
    split_streams_by=["datasetKey", "speciesKey"],
)

The line on the split streams is giving the following error:

File "test.py", line 19, in
split_streams_by=["datasetKey", "speciesKey"]
File "/srv/gbif-dl/gbif_dl/generators/api.py", line 171, in generate_urls
for x, y in subset_streams.items():
AttributeError: 'NoneType' object has no attribute 'items'
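Without knowing the surrounding code, one defensive workaround inside generate_urls could be to treat a missing facet response as an empty mapping; this is purely a sketch of where the None seems to come from:

# sketch for gbif_dl/generators/api.py around line 171
subset_streams = subset_streams or {}  # avoid calling .items() on None
for x, y in subset_streams.items():
    ...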

Add resume functionality for the download

Users might have a very long list of urls and the download might fail partway through, so they would have to start again.
Either

  1. we implement a persistent queue that saves the queue status on disk, or

  2. we check for existing files and, if they exist, do not download them again, which also reduces the possibility of duplicates.

Option 1 is tricky to implement for all use cases, since the download function takes not just lists but also Generators and AsyncGenerators. In the case of lists, I would propose using a library such as persist-queue.

Therefore I would propose to just implement option 2, sketched below.
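A minimal sketch of option 2, assuming MediaData-style dicts with basename and label keys and the <root>/<label>/<basename> layout used for multi-class downloads (the suffix handling is an assumption):

import os

def already_downloaded(item, root="my_dataset", suffix=".jpg"):
    # skip items whose target file already exists on disk
    path = os.path.join(root, str(item.get("label", "")), item["basename"] + suffix)
    return os.path.exists(path)

# media_items: any list of MediaData dicts about to be handed to the downloader
pending = [item for item in media_items if not already_downloaded(item)]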

Create CLI interface

The DOI-based DWCA downloader doesn't require a lot of parameters, which is why it would be ideally suited to a CLI interface.

Since we already have type hinting enabled, we can use typer to automatically create the CLI.

  • add dwca interface
  • add file-based io download

usage examples

gbif-dl -dwca 10.15468/dl.vnm42s
gbif-dl file_list.txt
cat ... | gbif-dl
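A hedged sketch of the dwca entry point using typer (command and option names are assumptions; the generate_urls arguments follow the examples elsewhere in this tracker):

import typer
import gbif_dl

app = typer.Typer()

@app.command()
def dwca(doi: str, root: str = "my_dataset"):
    """Download all media referenced by a GBIF DWCA/DOI download."""
    urls = gbif_dl.dwca.generate_urls(doi, dwca_root_path="dwcas", label="speciesKey")
    gbif_dl.io.download(urls, root=root)

if __name__ == "__main__":
    app()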

Shuffle responses on the server

When drawing a subset of occurrences from the GBIF API, they might be sorted by some internal database query. Given that a subset should be taken from a random distribution of samples for best performance and generalization, it would be ideal if samples could be drawn randomly from the API.

Currently this is not supported, and we might want to raise this issue on either of the following trackers.

Discussion: Citation and licensing considerations

Thank you very much for sharing this utility.

We're heading into relatively new territory for the GBIF community where images will increasingly be used in ML to build models. It's currently unclear how publishers will feel about this when learning that their photos are used in this way. Even if it could be argued that copyright isn't being infringed upon since photos are not being redistributed (the equivalent of "not infringing copyright by creating an index allowing you to find Shakespeare works containing Romeo") it may result in images being withheld in the future.

I suggest it would be wise to firstly acknowledge this situation in the README and presentations. Secondly, it would be good to promote a citation practice that acknowledges the sources of training data - e.g. the built models are given a DOI that links (cites) the source dataset DOIs. Thirdly, it might be prudent to restrict this to only CC0 and CC-BY licensed images.

At GBIF we are interested to see open discussion around these aspects.

Thanks!

AttributeError: module 'pygbif' has no attribute 'occurrences'

When trying to download, I get this error.
Code:

import gbif_dl

data_generator = gbif_dl.api.generate_urls(
    queries=query_per_species,
    label="speciesKey",
    nb_samples=100,
    weighted_streams=True,
)
stats = gbif_dl.io.download(data_generator, root="/content/gdrive/MyDrive/efc-images/plantnet")

Error:
0 Files [00:00, ? Files/s]Exception in thread Thread-13:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/utils.py", line 44, in run
self.result = runners.run(self.func(*self.args, **self.kwargs))
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/runners.py", line 104, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/io.py", line 242, in download_from_asyncgen
async for batch in chnk:
File "/usr/local/lib/python3.7/dist-packages/aiostream/stream/transform.py", line 87, in chunks
async for first in streamer:
File "/usr/local/lib/python3.7/dist-packages/aiostream/stream/create.py", line 34, in from_iterable
for item in it:
File "/usr/local/lib/python3.7/dist-packages/pescador/core.py", line 204, in iterate
for n, obj in enumerate(active_streamer.stream):
File "/usr/local/lib/python3.7/dist-packages/gbif_dl/generators/api.py", line 43, in gbif_query_generator
resp = pygbif.occurrences.search(
AttributeError: module 'pygbif' has no attribute 'occurrences'


AttributeError Traceback (most recent call last)

in ()
----> 1 stats = gbif_dl.io.download(data_generator, root="/content/gdrive/MyDrive/efc-images/plantnet")
2 print(stats)

1 frames

/usr/local/lib/python3.7/dist-packages/gbif_dl/io.py in download(items, root, tcp_connections, nb_workers, batch_size, retries, verbose, overwrite, is_valid_file, proxy, random_subsets)
316 is_valid_file=is_valid_file,
317 proxy=proxy,
--> 318 random_subsets=random_subsets,
319 )

/usr/local/lib/python3.7/dist-packages/gbif_dl/utils.py in run_async(func, *args, **kwargs)
58 thread.start()
59 thread.join()
---> 60 return thread.result
61 else:
62 return runners.run(func(*args, **kwargs))

AttributeError: 'RunThread' object has no attribute 'result'

Add DWCA test archive

Unit tests for downloading DWCA archives are slow. Ideally we should include a tiny DWCA archive downloaded from GBIF in the package manifest and use that for the unit tests.

Move generators and dataloaders from functional to class based design

The current functional design of the generators makes it difficult to enhance the package in the future to support a modular API.

E.g. the generators currently have a generate_urls function that returns an iterable or generator.
This could be enhanced by creating a class interface, e.g. users create a DWCAGenerator() that can be used directly within the downloader.
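A rough sketch of what that could look like; only the DWCAGenerator name comes from this issue, everything else is an assumption:

import gbif_dl

class DWCAGenerator:
    """Iterable wrapper that can be passed straight to the downloader."""

    def __init__(self, doi, dwca_root_path="dwcas", label="speciesKey"):
        self.doi = doi
        self.dwca_root_path = dwca_root_path
        self.label = label

    def __iter__(self):
        # yields the same MediaData dicts as the current generate_urls()
        yield from gbif_dl.dwca.generate_urls(
            self.doi, dwca_root_path=self.dwca_root_path, label=self.label
        )

# usage: gbif_dl.io.download(DWCAGenerator("10.15468/dl.pcxfa5"), root="my_dataset")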

Add docs

Add simple doc pages using pdoc3. This should also be integrated into #5, and can probably only be enabled after the project is public.

A readme explaining how to create the docs should be included in the pdocs folder.

frequent JSONDecodeErrors

The following query produces an unhandled JSONDecodeError inside gbif-dl

data_generator = gbif_dl.api.generate_urls(
    queries={'familyKey': [9456], 'offset': 83941, 'pageLimit': 1},
    label="speciesKey",
)

for i in data_generator:
    print(i)

I estimate that I get this about 1 in 100 times I create a URL generator.

Using offset in the query requires a change to gbif-dl/generators/api.py: add offset: int = 0 as a function parameter to gbif_query_generator and replace offset = 0 with offset = offset near the top of the function body. I'm doing this so that I can get random samples from GBIF by setting the offset to a random number within the range of the number of occurrences for the query, which I get from a direct API call, as I don't think gbif-dl has a method to tell you this?
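For reference, a sketch of the random-offset workaround described above, assuming the patched generator accepts offset through the query dict as in the snippet at the top of this issue (the occurrence count has to come from a separate GBIF API call):

import random

import gbif_dl

total_occurrences = 100000  # placeholder: obtain from a direct GBIF occurrence count call
data_generator = gbif_dl.api.generate_urls(
    queries={
        'familyKey': [9456],
        'offset': random.randrange(total_occurrences),
        'pageLimit': 1,
    },
    label="speciesKey",
)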

Enable continuous integration

Currently most projects use GitHub workflows, so that seems the best option.
In the GitHub free tier, this can only be enabled for public projects, so this will have to wait a bit.

Return gbifID in url generator

Hi, when constructing our own datasets it would be handy to have the gbifID for each occurrence returned, in case we want to find it manually on the GBIF website or API.

Validate MediaData

Currently, there is a minimum set of fields required by the downloader. These include the following:

class MediaData(TypedDict):
    """ Media dict representation received from api or dwca generators"""
    url: str
    basename: str
    label: str
    content_type: str
    suffice: str

We should make sure that generators always pass at least these fields to the downloader; otherwise an error should be raised.
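A minimal runtime check, sketched under the assumption that generators yield plain dicts (TypedDict fields are not enforced at runtime); the field names are taken from the definition above:

REQUIRED_FIELDS = {"url", "basename", "label", "content_type", "suffice"}

def validate_media(item: dict) -> None:
    # raise early instead of failing later inside the downloader
    missing = REQUIRED_FIELDS - item.keys()
    if missing:
        raise ValueError(f"MediaData is missing required fields: {sorted(missing)}")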

Add check for torch requirement

pytorch is currently not included in the standard requirements, therefore a check should be implemented when importing gbif_dl.dataloaders.torch. If torch is not installed, a warning should be raised that hints at installing pytorch separately or using pip install gbif-dl['torch'].
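One possible guard at the top of gbif_dl.dataloaders.torch, sketched here (whether to warn or raise, and the exact wording, are open choices):

import warnings

try:
    import torch
except ImportError:
    torch = None
    warnings.warn(
        "pytorch is not installed; install it separately or use "
        "`pip install gbif-dl[torch]` to use gbif_dl.dataloaders.torch"
    )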

Improve debugging experience

Currently the debugging experience of the io.download_single function is not ideal, since it runs inside an async task. If an exception is raised inside that function, the other tasks are not stopped, which makes it hard to debug.

I added a watchdog decorator, but it doesn't work in all cases yet:

import asyncio
import functools

def watchdog(afunc):
    """Stops all tasks if there is an error"""
    @functools.wraps(afunc)
    async def run(*args, **kwargs):
        try:
            await afunc(*args, **kwargs)
        except asyncio.CancelledError:
            return
        except Exception as err:
            print(f'exception {err}')
            asyncio.get_event_loop().stop()
    return run

Add train/validation/split

Currently the torch dataset parses the full dataset. Ideally we should provide some helper functions to offer stratified splits.

Until #11 is addressed, we could just add a function using data.random_split, but since this is just one line of code, I don't think it brings much additional value.
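For reference, the one-liner mentioned above using torch.utils.data.random_split (the dataset variable and the 80/20 split are placeholders):

from torch.utils.data import random_split

# dataset: any torch Dataset, e.g. the one produced by gbif_dl.dataloaders.torch
n_val = int(0.2 * len(dataset))
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])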

make downloader usable for simple downloading tasks

The download module should just work for a given plain list of urls.
Currently each url needs to be a dict of the following form:

urls = [
    {
        'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
        'basename': 'e75239cd029162c81f16a6d6afb1057d2437bcc8',
        'label': '3189866'
    },
    {
        'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
        'basename': 'e75239cd029162c81f16a6d6afb1057d2437bcc8',
        'label': '3189866'
    },
    {
        'url': 'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321',
        'basename': 'e04a36f124b875a16b5393a8fdef36846ada8e35',
        'label': '3189866'
    }
]

Thus, changes should be made so that basename and label can be omitted:

urls = [
    {
        'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
    },
    {
        'url': 'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
    },
    {
        'url': 'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321',
    }
]

but also

urls = [
    'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
    'https://bs.plantnet.org/image/o/6d5ed1f1769b4818ed5a234670dba742bf5b28a5',
    'https://bs.plantnet.org/image/o/f32365ec997bdf06b57adcfca6a49c6d9602b321'
]

should work
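One way to support all three forms would be to normalize bare URL strings into the dict form first; a hedged sketch (the helper name and the sha1-based basename are assumptions chosen to mimic the basenames in the examples above):

import hashlib

def normalize(item):
    # leave dicts untouched, wrap bare URL strings into a minimal MediaData-like dict
    if isinstance(item, str):
        return {
            "url": item,
            "basename": hashlib.sha1(item.encode()).hexdigest(),
            "label": None,
        }
    return item

items = [normalize(u) for u in urls]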

Add extensive sampling to dwca generator

Currently, the dwca/doi url generator does not support the same sampling functionality as the query-based generator. There might be a way to also add at least some balancing here.

ImportError: cannot import name 'SafeFunction' from 'joblib.parallel'

It seems that it crashes with joblib (1.3.1).
Downgrading to 0.10.0 seemed to solve the issue (pescadores/pescador#26)

Sample code:

import gbif_dl

data_generator = gbif_dl.dwca.generate_urls(
    "10.15468/dl.pcxfa5", dwca_root_path="dwcas", label="speciesKey"
)
stats = gbif_dl.io.download(data_generator, root="my_dataset", retries=1000000)

Error:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import gbif_dl
  File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/gbif_dl/__init__.py", line 15, in <module>
    from .generators import api
  File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/gbif_dl/generators/api.py", line 9, in <module>
    import pescador
  File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/pescador/__init__.py", line 8, in <module>
    from .zmq_stream import *
  File "/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/pescador/zmq_stream.py", line 33, in <module>
    from joblib._parallel_backends import SafeFunction
ImportError: cannot import name 'SafeFunction' from 'joblib._parallel_backends' (/home/rtcalumby/miniconda3/envs/py38/lib/python3.8/site-packages/joblib/_parallel_backends.py)

Make project public

For testing #4, #5 and #15 we would need to make the repo public, since the plantnet org only has a GitHub free account.
