nemi's Introduction

NEMI

About NEMI

The Native Emergent Manifold Interrogation (NEMI; submitted to JAMES) is a method to determine regions of interest in large or highly complex and nonlinear data.

Citation: Sonnewald, M., in review. A hierarchical ensemble manifold methodology for new knowledge on spatial data: An application to ocean physics. Journal of Advances in Modeling Earth Systems. Available: ESSOAr.

Short description/abstract:

Algorithms to determine regions of interest in large or highly complex and nonlinear data are becoming increasingly important. Novel methodologies from computer science and dynamical systems are well placed as analysis tools, but are underdeveloped for applications within the Earth sciences, and many produce misleading results. I present a novel and general workflow, the Native Emergent Manifold Interrogation (NEMI) method, which is easy to use and widely applicable. NEMI is able to quantify and leverage the highly complex latent space presented by the noisy, nonlinear and unbalanced data common in the Earth sciences. NEMI uses dynamical systems and probability theory to strengthen associations within the data, simplifying covariance structures, with a manifold, or Riemannian, methodology that uses domain-specific charting of the underlying space. On the manifold, an agglomerative clustering methodology is applied to isolate the now observable areas of interest. The construction of the manifold introduces a stochastic component, which is beneficial to the analysis as it enables latent space regularization. NEMI uses an ensemble methodology to quantify the sensitivity of the results to noise. The areas of interest, or clusters, are sorted within individual ensemble members and co-located across the set. A metric such as a majority vote, entropy, or similar then quantifies whether a data point within the original data belongs to a certain cluster. NEMI is clustering-method agnostic, but the use of an agglomerative methodology and sorting in the described case study allows a filtering, or nesting, of clusters to tailor the result to a desired application.
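The final step described above, a majority vote or entropy across the ensemble, can be sketched in a few lines of NumPy. This is a minimal illustration and not the package's implementation: the function name `vote_and_entropy` is hypothetical, and it assumes the labels have already been sorted and co-located so that label k refers to the same cluster in every ensemble member.

```python
import numpy as np


def vote_and_entropy(ensemble_labels, n_clusters):
    """Majority vote and normalized entropy for each data point.

    ensemble_labels: integer array of shape (n_members, n_samples).
    Assumption: label k means the same cluster in every member.
    """
    labels = np.asarray(ensemble_labels)
    n_members = labels.shape[0]
    # counts[k, j] = number of members that assign sample j to cluster k
    counts = np.stack([(labels == k).sum(axis=0) for k in range(n_clusters)])
    majority = counts.argmax(axis=0)  # ties resolve to the lowest label
    p = counts / n_members
    # Shannon entropy per sample, treating 0 * log(0) as 0
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -terms.sum(axis=0)
    if n_clusters > 1:
        entropy = entropy / np.log(n_clusters)  # scale to [0, 1]
    return majority, entropy


# Three ensemble members, four data points, three clusters
demo_labels = np.array([[0, 0, 1, 2],
                        [0, 1, 1, 2],
                        [0, 2, 1, 0]])
majority, entropy = vote_and_entropy(demo_labels, n_clusters=3)
```

An entropy of 0 means every member agrees on the cluster for that point, while a value of 1 means the members are spread evenly over all clusters, so the entropy can serve as a per-point uncertainty map alongside the voted label.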

NEMI workflow

Figure: Sketch of NEMI workflow. Part 1 (top row) illustrates moving from the data in its raw form, through initial symbolic renditioning, manifold transformation and clustering. Part 2 (bottom row) shows the ensembling, agglomerative utility ranking and native (field specific) utility ranking within each ensemble member. Finally, the cluster for each location is determined by looking across the ensemble. (Top left image of model adapted from encyclopedie-environnement.org).
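To make the ranking step within each ensemble member concrete, the sketch below relabels clusters so that label 0 is always the largest cluster. Ranking by cluster size is an assumption chosen for illustration; the utility rankings named in the caption may use a different, field-specific criterion. The function name `sort_labels_by_size` is hypothetical.

```python
import numpy as np


def sort_labels_by_size(labels):
    """Relabel clusters so 0 is the largest, 1 the next largest, and so on.

    Assumption: cluster size stands in for the ranking criterion.
    """
    labels = np.asarray(labels)
    unique, counts = np.unique(labels, return_counts=True)
    # Stable sort on negative counts: ties keep the original label order
    order = np.argsort(-counts, kind="stable")
    new_ids = np.empty(len(unique), dtype=int)
    new_ids[order] = np.arange(len(unique))
    # Position of each original label within `unique`
    positions = np.searchsorted(unique, labels)
    return new_ids[positions]


raw = np.array([1, 0, 0, 1, 1, 2, 2, 2, 2])
ranked = sort_labels_by_size(raw)  # cluster 2 (four points) becomes label 0
```

Applying the same ranking rule to every member gives labels that are comparable across the ensemble, which is what a later vote or entropy calculation relies on.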

Plain language summary:

Within the Earth sciences, data are increasingly becoming unmanageably large, noisy and nonlinear. Most methods in common use employ highly restrictive assumptions regarding the underlying statistics of the data and may even offer misleading results. To enable and accelerate scientific discovery, I drew on tools from computer science, statistics and dynamical systems theory to develop the Native Emergent Manifold Interrogation (NEMI) method. NEMI is intended for wide use within the Earth sciences and is applied to an oceanographic example here. Using domain-specific theory, a manifold representation of the data, clustering and sophisticated ensembling, NEMI is able to highlight particularly interesting areas within the data. In the paper, I stress the underlying philosophy and appreciation of methods to facilitate understanding of data mining: a tool to gain new knowledge.

What is new with NEMI:

NEMI is a generalisation of the methodology in Sonnewald et al. (2020) that targeted plankton ecosystems, in that it is designed to scale to larger datasets and is agnostic to the source of the data. Scaling is one of the true bottlenecks in data mining for scientific applications. NEMI is generalised to work with any data, where the particular example application used here is geospatial data. I have used an explicitly hierarchical approach, making NEMI less parametric (fewer parameters to tune and less danger of noise interference) and intuitively useful both for global applications (for example the whole Earth in the present example) and for more local ones (for example a basin or more regional assessment). Another novelty in NEMI is the lack of a fixed field-specific benchmark criterion (used in Sonnewald et al. (2020)), where I have generalised so that a field-agnostic option is available. Lastly, NEMI invites the use of a range of uncertainty quantification options in the final cluster evaluation.

Requirements

Python 3.7 or greater

We also recommend installing in a virtual environment. For more information, see the documentation for a tool such as Mamba.

Quick start guide

Install with pip install nemi-learn. Given an array X with dimensions (n_samples, n_features), these Python commands will run the NEMI workflow and bring up a plot:

from nemi import NEMI
nemi = NEMI()
nemi.run(X)
nemi.plot('clusters')
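For gridded data, the array X has to be assembled before the commands above can be run. The sketch below shows one way to do that with NumPy, assuming each feature is a 2-D field on the same grid and that missing values (for example land points) are stored as NaN. The helper names `fields_to_samples` and `labels_to_grid` are hypothetical, and the standardization step is a common preprocessing choice and not a documented requirement of the package.

```python
import numpy as np


def fields_to_samples(fields):
    """Stack 2-D fields of equal shape into X with shape (n_samples, n_features).

    Returns X and a boolean mask marking which grid points were kept.
    """
    stacked = np.stack([np.asarray(f, dtype=float).ravel() for f in fields], axis=1)
    valid = np.isfinite(stacked).all(axis=1)  # drop points with any missing value
    X = stacked[valid]
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    X = (X - X.mean(axis=0)) / std  # assumption: standardize each feature
    return X, valid


def labels_to_grid(labels, valid, grid_shape):
    """Place per-sample labels back on the original grid (NaN where masked)."""
    grid = np.full(valid.shape, np.nan)
    grid[valid] = labels
    return grid.reshape(grid_shape)


# Synthetic stand-ins for two gridded fields on a 4 x 5 grid
rng = np.random.default_rng(0)
field_a = rng.normal(size=(4, 5))
field_b = rng.normal(size=(4, 5))
field_a[0, 0] = np.nan  # one masked grid point

X, valid = fields_to_samples([field_a, field_b])
# X now has shape (19, 2) and can be passed to nemi.run(X) as shown above.
```

Keeping the `valid` mask makes it possible to map any per-sample result back to the grid with `labels_to_grid` for plotting on a map.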

Installation from source

If you wish to install from the source code, follow the steps below. This allows you to, for example, customize the embedding or clustering steps in the pipeline.

  1. Clone the repository

  2. (optional) Create and activate your virtual environment

  3. Navigate to the root of the repository and install:

    pip install -e .
    

    Alternatively, you can opt for a full installation to run tests and examples:

    pip install -e ".[full]"
    

nemi's People

Contributors

krosenfeld, maikejulie

nemi's Issues

Possibility of a parallel code?

Hello! Thanks for the nice package!

I was wondering whether the NEMI ensembles could be run in parallel. My idea is sketched below. For some reason assess_overlap is not working as expected, which I am looking into. I would like to hear your thoughts.

import multiprocessing

import numpy as np


def nemi(feature):
    # embedding ...
    # clustering ...
    # sorting ...
    return embedding, clusters


# n_ens (the number of ensemble members) is assumed to be defined earlier.
ncpus = multiprocessing.cpu_count()
with multiprocessing.Pool(processes=ncpus) as pool:
    results = pool.map(nemi, [feature] * n_ens)

# Each result is an (embedding, clusters) pair.
embeddings = np.stack([result[0] for result in results])
ensembles = np.vstack([result[1] for result in results])


def assess_overlap(ensembles, max_clusters=None):
    n_ens = len(ensembles)  # number of ensemble members (rows of ensembles)
    base_id = 0
    base_labels = ensembles[base_id]
    compare_ids = [i for i in range(n_ens) if i != base_id]

    num_clusters = int(np.max(base_labels) + 1)

    ...


clusters = assess_overlap(ensembles, max_clusters=None)
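One thing worth checking in a parallel setup like this is the random state of each worker. With fork-based multiprocessing, workers inherit a copy of the parent's global NumPy random state, so stochastic steps that draw from it without reseeding can produce identical ensemble members. The sketch below gives every member its own seed via `numpy.random.SeedSequence.spawn`. The embedding and clustering inside `run_member` are deliberately trivial stand-ins, not the NEMI steps, and the names `run_member` and `run_ensemble` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def run_member(args):
    """One ensemble member, using stand-ins for embedding and clustering."""
    X, seed = args
    rng = np.random.default_rng(seed)
    # Stand-in for the stochastic embedding: a random 2-D projection
    embedding = X @ rng.normal(size=(X.shape[1], 2))
    # Stand-in for clustering: split on the sign of the first coordinate
    labels = (embedding[:, 0] > 0).astype(int)
    return embedding, labels


def run_ensemble(X, n_ens, base_seed=0, executor_cls=ThreadPoolExecutor):
    """Run n_ens members concurrently, each with an independent seed."""
    # spawn() derives statistically independent child seeds from one base seed
    seeds = np.random.SeedSequence(base_seed).spawn(n_ens)
    with executor_cls() as pool:
        # map() returns results in the same order as the inputs
        results = list(pool.map(run_member, [(X, s) for s in seeds]))
    embeddings = np.stack([r[0] for r in results])  # (n_ens, n_samples, 2)
    ensembles = np.stack([r[1] for r in results])   # (n_ens, n_samples)
    return embeddings, ensembles


X = np.random.default_rng(1).normal(size=(50, 3))
embeddings, ensembles = run_ensemble(X, n_ens=4)
```

A thread pool is used here so the example runs anywhere. Since `run_member` is a module-level function, `concurrent.futures.ProcessPoolExecutor` can be passed as `executor_cls` for CPU-bound work; on platforms that start workers with spawn, the call then needs to sit under an `if __name__ == "__main__":` guard.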

Add examples

Create an examples folder and provide at least one example as a notebook or script.
