
Open-source photometric classification: https://supernnova.readthedocs.io

License: MIT License

Python 99.84% Shell 0.16%
cosmology supernova deep-learning arxiv reproducible-science recurrent-neural-networks pandas python pytorch bayesian-neural-networks

supernnova's Introduction


Read the documentation

For the main branch: https://supernnova.readthedocs.io

The paper branch differs slightly from master. See "changelog_paper_to_new_branch" or build the docs for that branch.

Installation

Clone this repository (preferred)

git clone https://github.com/supernnova/supernnova.git

or install the pip package (check that the release matches the version you need)

pip install supernnova

Read the paper

Links to the publication: MNRAS, arXiv. All results quoted in these publications were produced using the paper branch, which is frozen for reproducibility.

Please include the full citation if you use this material in your research: A Möller and T de Boissière, MNRAS, Volume 491, Issue 3, January 2020, Pages 4277–4293.

Table of contents

  1. Repository overview
  2. Getting Started
    0. With Poetry (new releases)
    1. With Conda
    2. With Docker
  3. Usage
  4. Reproduce paper
  5. Pipeline Description
  6. Running tests
  7. Build the docs

Repository overview

├── supernnova              --> main module
    ├──data                 --> scripts to create the processed database
    ├──visualization        --> data plotting scripts
    ├──training             --> training scripts
    ├──validation           --> validation scripts
    ├──utils                --> utilities used throughout the module
├── tests                   --> unit tests to check data processing
├── sandbox                 --> WIP scripts

Getting started

With Conda

cd env

# Create conda environment
conda create --name <env> --file <conda_file_of_your_choice>

# Activate conda environment
source activate <env>

With Docker

cd env

# Build docker images
make cpu  # cpu image
make gpu  # gpu image (requires NVIDIA Drivers + nvidia-docker)

# Launch docker container
python launch_docker.py  # add --use_gpu to run a GPU-based container

For more details, see the full setup instructions in the documentation.

Usage

When cloning this repository:

# Create data
python run.py --data  --dump_dir tests/dump --raw_dir tests/raw --fits_dir tests/fits

# Train a baseline RNN
python run.py --train_rnn --dump_dir tests/dump

# Train a variational dropout RNN
python run.py --train_rnn --model variational --dump_dir tests/dump

# Train a Bayes By Backprop RNN
python run.py --train_rnn --model bayesian --dump_dir tests/dump

# Train a RandomForest
python run.py --train_rf --dump_dir tests/dump

When using pip, a full example is available at https://supernnova.readthedocs.io:

# Python
import supernnova.conf as conf
from supernnova.data import make_dataset

# get config args
args = conf.get_args()

# create database
args.data = True                    # conf: making new dataset
args.dump_dir = "tests/dump"        # conf: where the dataset will be saved
args.raw_dir = "tests/raw"          # conf: where raw photometry files are saved
args.fits_dir = "tests/fits"        # conf: where SALT2 fits are saved
settings = conf.get_settings(args)  # conf: set settings
make_dataset.make_dataset(settings) # make dataset

Reproduce paper results

Please switch to the paper branch, then run:

python run_paper.py

General pipeline description

  • Parse raw data in FITS format
  • Create processed database in HDF5 format
  • Train Recurrent Neural Networks (RNN) or Random Forests (RF) to classify photometric lightcurves
  • Validate on test set

Running tests with py.test

PYTHONPATH=$PWD:$PYTHONPATH pytest -W ignore --cov supernnova tests

Build docs

cd docs && make clean && make html && cd ..
firefox docs/_build/html/index.html

supernnova's People

Contributors

anaismoller, dependabot[bot], fjammes, gbpoole, jhu-s, julienpeloton, tdeboissiere


supernnova's Issues

SNANA FLT changed for BAND

SNANA format:
Rick Kessler "changed the name of the FLT column in the data to BAND … SNANA codes accept either FLT or BAND for back-compatibility, and non-SNANA codes will need to do the same. The other ‘either’ option is REDSHIFT_FINAL or REDSHIFT_CMB."
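A minimal sketch of how a non-SNANA reader could accept either column name, following the aliases quoted above. The helper name `normalize_snana_columns` and the sample data are illustrative, not part of the SuperNNova codebase.

```python
import pandas as pd

# Map new-style SNANA column names back to the legacy names, per the quote above.
ALIASES = {"BAND": "FLT", "REDSHIFT_CMB": "REDSHIFT_FINAL"}

def normalize_snana_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename new-style SNANA columns to the legacy names, if present."""
    renames = {new: old for new, old in ALIASES.items()
               if new in df.columns and old not in df.columns}
    return df.rename(columns=renames)

# Illustrative data: a file using the new column names
df = pd.DataFrame({"BAND": ["g", "r"], "REDSHIFT_CMB": [0.1, 0.2]})
df = normalize_snana_columns(df)
print(list(df.columns))  # ['FLT', 'REDSHIFT_FINAL']
```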

'features' doesn't exist

Hi Anais and team,

Things are no longer working smoothly on midway. Got an exception about 'features' does not exist. If you have any ideas on what might be behind it I'm all ears.

Log is located at /scratch/midway2/rkessler/PIPPIN_OUTPUT/DJB_SPEC/3_CLAS_old/SNNVANILLATRAIN_TRAIN_SPEC_FIT/output.log on midway2.

Full log is below:

[Data processing] 15s
Traceback (most recent call last):
  File "run.py", line 204, in <module>
    raise e
  File "run.py", line 43, in <module>
    make_dataset.make_dataset(settings)
  File "/project2/rkessler/PRODUCTS/miniconda/envs/snn_gpu/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/data/make_dataset.py", line 749, in make_dataset
    data_utils.save_to_HDF5(settings, df)
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/utils/data_utils.py", line 576, in save_to_HDF5
    list_training_features + ["FLT"]
AssertionError
/project2/rkessler/PRODUCTS/miniconda/envs/snn_gpu/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Traceback (most recent call last):
  File "run.py", line 28, in <module>
    settings = conf.get_settings()
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/conf.py", line 364, in get_settings
    settings = experiment_settings.ExperimentSettings(args)
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/utils/experiment_settings.py", line 57, in __init__
    self.set_feature_lists()
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/utils/experiment_settings.py", line 190, in set_feature_lists
    self.all_features = hf["features"][:].astype(str)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/project2/rkessler/PRODUCTS/miniconda/envs/snn_gpu/lib/python3.6/site-packages/h5py/_hl/group.py", line 177, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'features' doesn't exist)"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 200, in <module>
    settings = conf.get_settings()
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/conf.py", line 364, in get_settings
    settings = experiment_settings.ExperimentSettings(args)
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/utils/experiment_settings.py", line 57, in __init__
    self.set_feature_lists()
  File "/project2/rkessler/PRODUCTS/classifiers/supernnova/supernnova/utils/experiment_settings.py", line 190, in set_feature_lists
    self.all_features = hf["features"][:].astype(str)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/project2/rkessler/PRODUCTS/miniconda/envs/snn_gpu/lib/python3.6/site-packages/h5py/_hl/group.py", line 177, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'features' doesn't exist)"

Note that, in addition, this exception is uncaught: it does not produce a done file marked FAILURE as it is supposed to, and instead the application terminates.
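One way to turn the uncaught KeyError above into a clear, catchable failure would be to check for the dataset before reading it. This is a hedged sketch, not the project's actual loading code; the function name `load_features` and the error message are illustrative.

```python
import h5py

def load_features(hdf5_path):
    """Read the 'features' dataset, failing with a clear message if absent."""
    with h5py.File(hdf5_path, "r") as hf:
        if "features" not in hf:
            raise RuntimeError(
                f"{hdf5_path} has no 'features' dataset; was the database "
                "created with a compatible SuperNNova version?"
            )
        return hf["features"][:].astype(str)
```

A caller could then catch `RuntimeError` and write the FAILURE done file before exiting.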

Pip update

Pip version is not in sync with recent SNN changes.

The settings are not the same, and this causes crashes.

Cyclic speeding

In train_cyclic, data is loaded twice: once in get_lr and once afterwards. Optimise to load it only once.

TypeError in method process_single_csv

Line 473 of make_dataset.py reads:

for c_ in [2, list(set(len(settings.sntypes.keys())))]:

This is currently throwing the following error:

TypeError: 'int' object is not iterable

This will always happen, because len always returns an int, and set() cannot iterate over an int.
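A hedged sketch of the likely intent behind that line: iterate over the number of target classes, binary first, then full multi-class. Here `sntypes` is an illustrative stand-in for `settings.sntypes`; whether this matches the author's intended fix is an assumption.

```python
# Illustrative stand-in for settings.sntypes (SNANA type code -> class name)
sntypes = {"101": "Ia", "120": "II", "132": "Ibc"}

# Buggy original: [2, list(set(len(sntypes.keys())))]
#   -> TypeError: 'int' object is not iterable, since set() gets an int.
# Likely intent: binary classification plus one class per distinct type.
nb_class_options = [2, len(set(sntypes.keys()))]
print(nb_class_options)  # [2, 3]
```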

problem reading sntypes

It is the missing-types problem and error we discussed when reading from different SNANA sims. Please help!

Let me know if you need more files...

Unusual results for PS1 Data

As discussed on Slack: when fitting PS1 data (HEAD and PHOT files in $PS1_ROOT/lcmerge/PS1_PS1MD_cen_SIGCLIP_FITS/) we get some weird results. Only 450 SNe in the sample are classified as Ia, and this seems to be entirely random. A couple of spectroscopically confirmed SNe Ia are being listed as 100% chance CC, so something's up. I generated some SNN example plots using the following command:

/scratch/midway2/rkessler/PIPPIN_OUTPUT/PANOPTICON-DATA/3_CLAS/old-PS1/job_plot.slurm

and in general the relevant directory is here:
/scratch/midway2/rkessler/PIPPIN_OUTPUT/PANOPTICON-DATA/3_CLAS/SNNTEST_PS_DATAPS1_SNNTEST_PS_PS1+MVCC

AttributeError: 'ExperimentSettings' object has no attribute 'sntype_var'

Using the branch elasticc, the code used to process ZTF fails with error:

File "fink_science/snn/processor.py", line 167, in snn_ia
    ids, pred_probs = classify_lcs(pdf, model, 'cpu')
  File "/home/libs/miniconda/lib/python3.7/site-packages/supernnova/validation/validate_onthefly.py", line 99, in classify_lcs
    df = format_data(df, settings)
  File "/home/libs/miniconda/lib/python3.7/site-packages/supernnova/validation/validate_onthefly.py", line 59, in format_data
    df = pivot_dataframe_single_from_df(df, settings)
  File "/home/libs/miniconda/lib/python3.7/site-packages/supernnova/data/make_dataset.py", line 722, in pivot_dataframe_single_from_df
    + class_columns
AttributeError: 'ExperimentSettings' object has no attribute 'sntype_var'

Any ideas?
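One possible workaround while the branches diverge: read the attribute with a default so that older pickled settings objects, which predate `sntype_var`, do not raise. This is only a sketch; the default value "SNTYPE" is an assumption based on the usual SNANA header key, not a confirmed fix.

```python
# Minimal stand-in for the real ExperimentSettings class, for illustration.
class ExperimentSettings:
    pass

settings = ExperimentSettings()  # simulates an old object missing the attribute

# getattr with a default avoids the AttributeError; "SNTYPE" is an assumed default.
sntype_var = getattr(settings, "sntype_var", "SNTYPE")
print(sntype_var)  # SNTYPE
```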

Database

TypeError: No conversion path for dtype: dtype('<U120')

Raised in data_types_training in make_dataset.
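The "no conversion path" TypeError typically means h5py was asked to store a numpy unicode dtype (such as '<U120'), which HDF5 cannot serialize directly. A hedged sketch of the usual remedy, with illustrative data:

```python
import numpy as np

# Numpy unicode arrays (kind 'U') have no HDF5 conversion path in h5py.
values = np.array(["Ia", "II"], dtype="<U120")

# Encoding to fixed-width bytes (kind 'S') before writing avoids the TypeError;
# h5py's variable-length string dtype is an alternative.
as_bytes = values.astype("S")
print(as_bytes.dtype.kind)  # S
```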

access above path exists

To avoid access issues in cluster deployments, we should add:

import os
os.access(path, os.R_OK)

A recent issue involved supernnova/utils/data_utils.py:
supernnova/utils/data_utils.py
Path(f"{settings.fits_dir}/FITOPT000.FITRES").exists()

but this could happen with other checks too.
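A sketch of the proposed check wrapped into a reusable helper; the function name and error message are illustrative, not existing SuperNNova code.

```python
import os
from pathlib import Path

def check_readable(path):
    """Raise a clear error if a required file is missing or unreadable."""
    p = Path(path)
    # os.access distinguishes permission problems from plain existence checks
    if not (p.exists() and os.access(p, os.R_OK)):
        raise PermissionError(
            f"Cannot read {p}; check that the file exists and that "
            "cluster permissions allow read access."
        )
    return p
```

Such a helper could be called before each `Path(...).exists()` style check, e.g. on the FITOPT000.FITRES path above.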

PS1 classification issues

Morning!

After you fixed the plotting issue from yesterday, I went and reclassified the PS1 data. We were seeing only 450 SNe pass as SNN-classified type Ia, and most of the spectroscopically confirmed Ia were marked as 100% CC by SNN. This hasn't changed since the update pushed yesterday, and we are a bit lost. (to be clear - I did not re-train a model, just re-used the same model on the data again after the update).

I've sent some pictures on slack that I hope demonstrate the problem, and the relevant directories are here:

This is where the PS1 training set was generated and trained:
/scratch/midway2/rkessler/PIPPIN_OUTPUT/BP-PS1-CLASS/
This is where it was used to fit the PS1 data:
/scratch/midway2/rkessler/PIPPIN_OUTPUT/PANOPTICON-DATA/
And tested on simulated PS1 data here:
/scratch/midway2/rkessler/PIPPIN_OUTPUT/BP-PS1-CLASSTEST/

I remade the light-curve plots, they live here: /scratch/midway2/rkessler/PIPPIN_OUTPUT/PANOPTICON-DATA/3_CLAS/SNNTEST_PS_DATAPS1_SNNTEST_PS_PS1+MVCC/dump/lightcurves/SNNTEST_PS_PS1+MVCC/early_prediction

They look good now!

Function pivot_dataframe_batch is not handling concurrent exceptions (fails silently)

In make_dataset.py, lines 696-701 read:

    for chunk_idx in tqdm(list_chunks, desc="Pivoting dataframes", ncols=100):
        parallel_fn = partial(pivot_dataframe_single, settings=settings)
        # Process each file in the chunk in parallel
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            start, end = chunk_idx[0], chunk_idx[-1] + 1
            executor.map(parallel_fn, list_files[start:end])

The iterator holding the results of the map call is never consumed afterwards. As a side effect, if any of the workers fails, the corresponding exception is never raised, so the failure is silent. (See https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map )
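A sketch of the fix: materialize the map iterator so any worker exception is re-raised in the parent. Demonstrated here with ThreadPoolExecutor to keep the example self-contained; `Executor.map` has the same semantics for ProcessPoolExecutor. `pivot_one` is an illustrative stand-in for `pivot_dataframe_single`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def pivot_one(fname, settings=None):
    """Stand-in for pivot_dataframe_single: fails on some inputs."""
    if fname.startswith("bad"):
        raise ValueError(f"cannot pivot {fname}")
    return f"pivoted_{fname}"

parallel_fn = partial(pivot_one, settings=None)
with ThreadPoolExecutor(max_workers=2) as executor:
    # list() consumes the iterator; a failure in any worker is re-raised here
    # instead of being dropped silently.
    results = list(executor.map(parallel_fn, ["a.csv", "b.csv"]))
print(results)  # ['pivoted_a.csv', 'pivoted_b.csv']
```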
