ami-system / on_device_classifier

This repository contains the code to create on-device machine learning models for species classification.

License: MIT License

Python 3.83% Shell 0.60% Jupyter Notebook 95.57%
biodiversity on-device-deep-learning ami-system

on_device_classifier's Introduction

Species Classifier Models

This repo creates PyTorch species classification models based on GBIF images (see the gbif_download_standalone repo for information and code for downloading images).

The models are built using PyTorch. The user needs to run the following scripts in sequence to train a model:

Training the Models for a Given Region

The easiest way to run this pipeline is to use the {region}_model.sh files.

To run the pipeline for a given species list with Slurm (e.g., on Baskerville), submit the corresponding script: for example, sbatch costarica_model.sh, which will write its output to cr_train.out.

Scripts

The pipeline consists of 4 scripts:

01_create_dataset_split.py

This creates training, validation and testing splits of the data downloaded from GBIF.

python 01_create_dataset_split.py \
    --data_dir /bask/homes/f/fspo1218/amber/data/gbif_download_standalone/gbif_images/ \
    --write_dir /bask/homes/f/fspo1218/amber/data/gbif_costarica/ \
    --species_list /bask/homes/f/fspo1218/amber/projects/gbif_download_standalone/species_checklists/costarica-moths-keys-nodup.csv \
    --train_ratio 0.75 \
    --val_ratio 0.10 \
    --test_ratio 0.15 \
    --filename 01_costarica_data

The script accepts the following arguments:

  • --data_dir: Path to the root directory containing the GBIF data. Required.
  • --write_dir: Path to the directory for saving the split files. Required.
  • --train_ratio: Proportion of data for training. Required.
  • --val_ratio: Proportion of data for validation. Required.
  • --test_ratio: Proportion of data for testing. Required.
  • --filename: Initial name for the split files. Required.
  • --species_list: Path to the species list. Required.
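
The three ratio arguments behave like a standard shuffled split. A minimal sketch of the idea (this is not the script's actual implementation, which may additionally stratify by species):

```python
import random

def split_dataset(items, train_ratio=0.75, val_ratio=0.10, test_ratio=0.15, seed=42):
    """Shuffle items, then split them into train/val/test by the given ratios."""
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(len(items) * train_ratio)
    n_val = int(len(items) * val_ratio)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_dataset(range(100))
# 75, 10, and 15 items respectively
```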

02_calculate_taxa_statistics.py

This calculates information and statistics regarding the taxonomy to be used for model training.

python 02_calculate_taxa_statistics.py \
    --species_list /bask/homes/f/fspo1218/amber/projects/gbif_download_standalone/species_checklists/costarica-moths-keys-nodup.csv \
    --write_dir /bask/homes/f/fspo1218/amber/data/gbif_costarica/ \
    --numeric_labels_filename 01_costarica_data_numeric_labels \
    --taxon_hierarchy_filename 01_costarica_data_taxon_hierarchy \
    --training_points_filename 01_costarica_data_count_training_points \
    --train_split_file /bask/homes/f/fspo1218/amber/data/gbif_costarica/01_costarica_data-train-split.csv

The script accepts the following arguments:

  • --species_list: Path to the species list. Required.
  • --write_dir: Path to the directory for saving the information. Required.
  • --numeric_labels_filename: Filename for numeric labels file. Required.
  • --taxon_hierarchy_filename: Filename for taxon hierarchy file. Required.
  • --training_points_filename: Filename for storing the count of training points. Required.
  • --train_split_file: Path to the training split file. Required.
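
The numeric-labels file maps each taxon name to an integer class index. A hedged sketch of what such a mapping might look like (the function name, field order, and layout here are assumptions, not the script's actual code):

```python
def build_numeric_labels(species_rows):
    """Assign a stable integer label to each unique family, genus, and species.

    species_rows: (family, genus, species) tuples, e.g. parsed from the
    species checklist CSV.
    """
    labels = {"family": {}, "genus": {}, "species": {}}
    for family, genus, species in species_rows:
        for rank, name in (("family", family), ("genus", genus), ("species", species)):
            # assign the next free index the first time a name is seen
            labels[rank].setdefault(name, len(labels[rank]))
    return labels

rows = [("Noctuidae", "Noctua", "Noctua pronuba"),
        ("Noctuidae", "Cryptocala", "Cryptocala acadiensis")]
labels = build_numeric_labels(rows)
# 1 family, 2 genera, 2 species
```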

After this is done, you need to manually add the number of families, genera, and species to the ./configs/01_uk_macro_data_config.json file.

03_create_webdataset.py

Creates a webdataset from the raw image data. It needs to be run separately for each of the train, validation, and test sets, so we loop through each set:

for VARIABLE in 'train' 'val' 'test'
do
    echo '--' $VARIABLE
    mkdir -p /bask/homes/f/fspo1218/amber/data/gbif_costarica/$VARIABLE
    python 03_create_webdataset.py \
        --dataset_dir /bask/homes/f/fspo1218/amber/data/gbif_download_standalone/gbif_images/ \
        --dataset_filepath /bask/homes/f/fspo1218/amber/data/gbif_costarica/01_costarica_data-$VARIABLE-split.csv \
        --label_filepath /bask/homes/f/fspo1218/amber/data/gbif_costarica/01_costarica_data_numeric_labels.json \
        --image_resize 500 \
        --max_shard_size 100000000 \
        --webdataset_pattern "/bask/homes/f/fspo1218/amber/data/gbif_costarica/$VARIABLE/$VARIABLE-500-%06d.tar"
done
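
The --webdataset_pattern value is a printf-style template: the webdataset writer numbers the shards sequentially and substitutes the zero-padded index for %06d. A quick illustration (with a shortened path for readability):

```python
# %06d is replaced with the shard index, zero-padded to six digits.
pattern = "val/val-500-%06d.tar"  # shortened; the real pattern uses the absolute path
shards = [pattern % i for i in range(3)]
# ['val/val-500-000000.tar', 'val/val-500-000001.tar', 'val/val-500-000002.tar']
```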

04_train_model.py

Trains the PyTorch model. This step requires the use of wandb (Weights & Biases): the user needs to create an account, log in to the platform, set up a project, and pass the entity (username) and project name into the config file. The script can then be run with nohup:

nohup sh -c 'python 04_train_model.py  \
    --train_webdataset_url "$train_url" \
    --val_webdataset_url "$val_url" \
    --test_webdataset_url "$test_url" \
    --config_file ./configs/01_costarica_data_config.json \
    --dataloader_num_workers 6 \
    --random_seed 42' &

The script accepts the following arguments:

  • --train_webdataset_url: path to webdataset tar files for training
  • --val_webdataset_url: path to webdataset tar files for validation
  • --test_webdataset_url: path to webdataset tar files for testing
  • --config_file: path to configuration file containing training information
  • --dataloader_num_workers: number of CPUs available
  • --random_seed: random seed for reproducible experiments

For setting up the config file: the total numbers of families, genera, and species are printed at the end of 02_calculate_taxa_statistics.py, so you can use this information to fill in lines 5-7 of the config.
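
The update can also be scripted rather than edited by hand. A sketch, assuming the config is plain JSON; the key names and counts below are placeholders, so match them to the keys your config file actually uses:

```python
import json

# Placeholder values: substitute the counts printed by 02_calculate_taxa_statistics.py.
taxa_counts = {"num_families": 21, "num_genera": 1543, "num_species": 3152}

# In practice you would json.load() ./configs/01_costarica_data_config.json here;
# a minimal in-memory stand-in keeps the example self-contained.
config = {"wandb_entity": "your-username", "wandb_project": "species-classifier"}
config.update(taxa_counts)
print(json.dumps(config, indent=2))
```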

on_device_classifier's People

Contributors

katrionagoldmann, levanbokeria

on_device_classifier's Issues

Some ideas from IAAD discussion

Just recording these somewhere (please edit/add to the list):

  • Consider close crops/close ups of important features or partial moth images as another augmentation (c.f. whales)
  • 'Open-world' classification - common for individual-level problems. Methods that produce an embedding of the image in a latent space can be convenient; visualisation of this space could also be useful
  • https://github.com/facebookresearch/detr
  • Megadetector: For detecting animals from camera traps. Used by WildLabs
  • Networks used for facial recognition
  • Grad-CAM: For explainable AI - to see which parts of the image the model focuses on

Run models on baskerville with GPU/CUDA

Running the models on Baskerville crashes when requesting GPUs, despite torch.cuda.is_available() returning True and the set-up selecting multiple GPUs.

Using:

#SBATCH --gpus-per-task 3
#SBATCH --tasks-per-node 1
#SBATCH --nodes 1 

This returns the error RuntimeError: CUDA error: no kernel image is available for execution on the device, which typically means the installed PyTorch build was not compiled for the GPU's compute architecture.

Incorporate Crop details into modelling

I have observed a number of potential misclassifications which could be avoided if we incorporated features such as moth size, shape, etc.

For example, Cryptocala acadiensis and Noctua pronuba are both in the Noctuidae family and both have orange and black hindwings, but they are very different in size. Pixelflow could potentially be used in this case.

GBIF returning genus results for some rank='species'

GBIF results are called with data = species_api.name_backbone(name=name, strict=True, rank="SPECIES")

However, some of the calls return 'rank': 'GENUS' results.

Examples:

data = species_api.name_backbone(name='Macaria notata', strict=True, rank="SPECIES")

gives:

{'usageKey': 3256294, 'scientificName': 'Macaria Curtis, 1826', 
    'canonicalName': 'Macaria', 'rank': 'GENUS', 'status': 'ACCEPTED', 
    'confidence': 98, 'matchType': 'HIGHERRANK', 'kingdom': 'Animalia', 
    'phylum': 'Arthropoda', 'order': 'Lepidoptera', 'family': 'Geometridae', 
    'genus': 'Macaria', 'kingdomKey': 1, 'phylumKey': 54, 'classKey': 216, 
    'orderKey': 797, 'familyKey': 6950, 'genusKey': 3256294, 
    'synonym': False, 'class': 'Insecta'}

Not sure why, since https://api.gbif.org/v1/species?name=Macaria%20notata returns the species as the first result.
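
Until the root cause is found, a defensive check on the returned rank would catch these fallback matches. A minimal sketch using the (abridged) genus-level response above; the function name is hypothetical:

```python
def is_species_match(result):
    """True only when the GBIF backbone match actually resolved to species rank."""
    return result.get("rank") == "SPECIES" and result.get("matchType") != "HIGHERRANK"

# Abridged copy of the genus-level response quoted above.
macaria = {"usageKey": 3256294, "scientificName": "Macaria Curtis, 1826",
           "rank": "GENUS", "status": "ACCEPTED", "matchType": "HIGHERRANK"}
assert not is_species_match(macaria)  # flagged for manual review / retry
```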

Fix file naming error

The genus name and family name are mixed up in the 02a_fetch_gbif_metamorphic_data.py file.
