
Jagular

Out-of-core pre-processing of big-ish electrophysiology data, including spike detection and alignment.

Installation

The easiest way to install jagular is to use pip. From the terminal, run:

pip install jagular

Alternatively, you can install the latest version of jagular by running the following commands:

git clone https://github.com/kemerelab/jagular.git
cd jagular
python setup.py [install, develop]

where the develop argument should be used if you want to modify the code.

What is Jagular used for?

We perform long (multiple-day to multiple-week) chronic in-vivo electrophysiology recordings, resulting in many terabytes of data per experiment. These long recording periods pose significant challenges to almost every part of a typical analysis pipeline, including filtering, spike detection, and alignment.

For spike detection, for example, we need to filter the wideband data prior to doing some form of threshold-crossing detection. But if we have terabytes of data, even a simple step such as filtering becomes tricky, since it has to be done out-of-core (the data does not fit into memory). In addition, there can be substantial drift on the electrodes over such a long period of time, so an adaptive threshold-crossing approach is more appropriate.

Jagular makes these out-of-core tasks easier by providing a simple interface to read well-defined chunks from multiple files in a seamless manner. These chunks can then be processed in a more manageable way. Jagular also has complete built-in support for the full (filtering)-(spike-detection)-(waveform-alignment) part of the analysis process, which works out-of-core and deals elegantly with electrode drift.
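
To make the chunked workflow concrete, here is a minimal sketch of seamless multi-file chunking (illustrative only; this is not jagular's actual API). It reads interleaved int16 samples and stitches chunks across file boundaries:

import numpy as np

def read_chunks(filenames, n_channels, chunk_size=2**20, dtype=np.int16):
    """Yield fixed-size chunks of interleaved channel data, stitched
    seamlessly across file boundaries, so downstream steps only ever
    hold one chunk in memory."""
    leftover = np.empty((0,), dtype=dtype)
    samples_per_chunk = chunk_size * n_channels
    for fname in filenames:
        with open(fname, 'rb') as f:
            while True:
                raw = np.fromfile(f, dtype=dtype,
                                  count=samples_per_chunk - leftover.size)
                buf = np.concatenate((leftover, raw))
                if buf.size < samples_per_chunk:
                    leftover = buf        # carry the remainder into the next file
                    break
                leftover = np.empty((0,), dtype=dtype)
                yield buf.reshape(-1, n_channels).T   # shape (n_channels, chunk_size)
    if leftover.size:                     # final partial chunk (assumes whole samples)
        yield leftover.reshape(-1, n_channels).T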

Where

download https://pypi.python.org/pypi/jagular
inspiration https://www.youtube.com/watch?v=WLCDAPNTpaM
docs coming soon!
code https://github.com/kemerelab/jagular

License

Jagular is distributed under the MIT license. See the LICENSE file for details.


jagular's Issues

implement: channel write generalization

Currently the channel write functionality in utils is a little restrictive. In particular, we need to support different formats:

from struct import Struct

# re-estimate the number of packets from the timestamp array
num_packets = len(ts)
my_ch_struct = Struct('<%dh' % num_packets)  # little-endian int16 channel samples
my_ts_struct = Struct('<%dI' % num_packets)  # little-endian uint32; ts_dtype should affect this!!!!
ts_packed = my_ts_struct.pack(*ts)

but other generalizations are necessary, too.
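
One possible generalization (a sketch; the format map and helper are illustrative, not the eventual API) is to derive the struct format character from the requested dtype instead of hard-coding it:

from struct import Struct

# illustrative subset of dtype name -> struct format character
_FMT = {'int16': 'h', 'uint16': 'H', 'int32': 'i', 'uint32': 'I',
        'int64': 'q', 'uint64': 'Q'}

def pack_samples(values, dtype='uint32', byteorder='<'):
    """Pack a sequence of samples using an arbitrary integer dtype."""
    fmt = Struct('%s%d%s' % (byteorder, len(values), _FMT[dtype]))
    return fmt.pack(*values)

ts = [0, 1, 2, 3]                               # example timestamps
ts_packed = pack_samples(ts, dtype='uint32')    # reproduces the current behavior
ch_packed = pack_samples([-5, 12], dtype='int16')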

refactor: within-epoch operations

Many operations need to be restricted to within-epoch data. These include filtering, spike detection and snippet extraction, resampling, __repr__, and so on.

Currently, I re-implement this functionality each time I need it. It would be great to come up with some abstract interface or service that I can invoke to make the code more readable, compact, and maintainable.

I don't have a good / clean way of doing this yet.

implement: downsampling

With an appropriate anti-aliasing filter applied first.

Should probably re-use the same underlying functions as jag.resample.
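
For reference, scipy bundles the anti-aliasing filter with the rate reduction; a sketch of the in-core operation (the out-of-core version would apply this per chunk/epoch):

import numpy as np
from scipy.signal import decimate

fs = 30000                     # wideband sampling rate (Hz)
x = np.random.randn(10 * fs)   # 10 s of synthetic wideband data
# 30 kHz -> 3 kHz: decimate by 10 with a zero-phase FIR anti-aliasing filter
x_3k = decimate(x, q=10, ftype='fir', zero_phase=True)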

Incompatibility between native Python and numpy types

In filtering.py, certain indices are computed by taking the difference between a native Python int and an np.uint64. Under NumPy's type-promotion rules, this mixed operation yields a float, which cannot be used for indexing. The result should be cast to an appropriate integer type (native or numpy) before indexing with it.
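
A minimal demonstration of the failure mode (under NumPy 1.x promotion rules):

import numpy as np

i = np.uint64(5)
print((10 - i).dtype)    # float64: mixing a Python int with np.uint64 promotes to float
a = np.arange(20)
# a[10 - i]              # IndexError: float indices are not allowed
idx = int(10 - i)        # cast back to an integer before indexing
print(a[idx])            # 5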

change assumption of integral timestamps

This assumption is prevalent in check_timestamps, has_duplicates, and so on, and is unnecessary for the most part. Contiguous-segment determination can get messy and tricky without integral timestamps, but it has, for the most part, already been converted to the more general form.
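
A sketch of what the more general form can look like, allowing floating-point timestamps by comparing the diffs against the expected step with a tolerance (names and tolerance handling are illustrative, not jagular's implementation):

import numpy as np

def get_contiguous_segments(ts, step=None, rtol=1e-3):
    """Return (start, stop) bounds of contiguous runs of timestamps."""
    ts = np.asarray(ts, dtype=float)
    if step is None:
        step = np.median(np.diff(ts))             # nominal sampling step
    breaks = np.nonzero(np.abs(np.diff(ts) - step) > rtol * step)[0]
    starts = np.insert(breaks + 1, 0, 0)
    stops = np.append(breaks + 1, len(ts))
    return [(ts[a], ts[b - 1] + step) for a, b in zip(starts, stops)]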

implement: waveform alignment

We plan to use the analytic solution to the peak-fit parabola as the peak location, and we plan to use jag.resample to resample the entire waveform using Fourier methods.
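
The analytic peak of a parabola through the three samples around a local extremum is standard; a minimal sketch (sample-domain only, before any jag.resample refinement):

import numpy as np

def parabolic_peak(y, i):
    """Sub-sample peak location from a parabola fit through
    y[i-1], y[i], y[i+1], where i is the sample-level peak index."""
    denom = y[i - 1] - 2.0 * y[i] + y[i + 1]
    delta = 0.5 * (y[i - 1] - y[i + 1]) / denom        # vertex offset, in (-1, 1)
    peak_value = y[i] - 0.25 * (y[i - 1] - y[i + 1]) * delta
    return i + delta, peak_value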

Write merged SD and rec file data directly to individual channels

In our current workflow, we take a .rec file along with an SD card file to create a merged .rec file. This intermediate step can be problematic for the following reasons:

  1. The merged rec file contains all neural data channels. This is inefficient, especially if we don't record from all channels. The same can be said for the DIOs.
  2. To get single-channel data out, we go from packed data (SD file) to packed data (merged rec file) to individual channel data. It is arguably more time-consuming to do it this way when we could eliminate the intermediate merged file and go directly from packed data in the SD file to individual channel data.

Finally, the timestamps should be in absolute time to allow for pain-free comparison across days (or even within a single day if there were multiple recordings taken in a day). See #22

Absolute time in merge rec

This issue is more relevant for the data format of input files to jagular but is documented here for bookkeeping purposes. A jagular file map will extract the first and last timestamps of the passed-in merged rec files, but this is a problem when the files span multiple days, as the timestamps recorded in .rec files restart every time a new rec file is opened. A solution is to store timestamps as 64-bit absolute time in merged rec files so that the timestamps are unambiguous.

implement: waveform extraction

We need to extract a larger snippet than what we really want, so that we can align the snippets without edge effects.
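
A sketch of the padding idea (the snippet geometry is illustrative): grab extra samples on each side, align, then trim back to the desired window.

n_before, n_after, pad = 10, 22, 8      # illustrative snippet geometry (samples)

def extract_padded_snippet(data, peak_idx):
    """Extract a snippet with `pad` extra samples per side so sub-sample
    alignment (shifting/resampling) does not suffer from edge effects."""
    start = peak_idx - n_before - pad
    stop = peak_idx + n_after + pad
    if start < 0 or stop > len(data):
        return None                     # too close to an epoch boundary; skip
    return data[start:stop]

# after alignment, trim the pad: snippet_aligned[pad:-pad]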

extend filter configuration for fast out-of-core processing

Currently filter epochs are computed internally, and the API is not properly exposed to allow a user to specify in-core or out-of-core, etc.

In addition, it may be desirable to pass in desired epochs (with index=True) directly to the filter object, so that they won't have to be recomputed every time we want to filter. This approach can also make it simpler to customize the epoch calculation to suit the application needs.

Checking if container is sorted

The numpy diff() function returns an array of the same type as its input. This means we have to be careful when checking whether a generic container is sorted if that container has an unsigned type, since the differences wrap around instead of going negative. Various jagular functions currently rely on diff(), including get_contiguous_segments(), and may have to be modified.
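
A quick demonstration of the pitfall, and a diff()-free check:

import numpy as np

a = np.array([3, 1, 2], dtype=np.uint64)   # clearly not sorted
print(np.diff(a))                          # [18446744073709551614  1]: wrapped around
print(np.all(np.diff(a) >= 0))             # True -- the naive check is fooled
print(bool(np.all(a[:-1] <= a[1:])))       # False -- comparison-based check is safe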

Different number of timestamps and channel data samples during raw data extraction

In utils.py, sometimes the raw data extraction fails. The error indicates that the number of timestamps to write to disk is different from the number of samples in the channel data.

For example, for the file install_07-09-2017_0200_0400_sd09_merge.rec, the error message was:

pack expected 2097334 items for packing (got 2097340)

The failed disk write happened on both the dev branch (commit c2424a5) and the master branch (commit 5d88362).

implement: spike detection

MAD-based spike detection.

We should think carefully about lock-out periods, peeling, and adaptive detection.

For adaptive detection, we could use the block size, and do detection on a per-block level. However, we will again need to have some padding to ensure that we don't cut some spikes in half, and that we don't detect the same spikes multiple times.
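
A minimal sketch of the MAD-based threshold (the 0.6745 scale factor assumes Gaussian background noise; recomputing the threshold per block is one way to make detection adaptive):

import numpy as np

def mad_threshold(x, k=5.0):
    """Robust detection threshold: k times the MAD-based noise estimate."""
    sigma = np.median(np.abs(x - np.median(x))) / 0.6745
    return k * sigma

def detect_crossings(x, thresh):
    """Indices where the signal first crosses below -thresh (negative spikes)."""
    below = x < -thresh
    return np.nonzero(below[1:] & ~below[:-1])[0] + 1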

Filtering failure while trying to index

See error and stack trace below:


IndexError Traceback (most recent call last)
<ipython-input-...> in <module>()
4 fs=30000,
5 fl=600,
----> 6 fh=6000)
7 filt_data_ch0 = np.fromfile('/home/jchu/data/install/long-recording/experiments/07-09-2017/filt-ch.00.raw',
8 np.int16)

~/code/jagular/jagular/filtering.py in filtfilt_mmap(timestamps, finname, foutname, fs, fl, fh, gpass, gstop, dtype, ftype, buffer_len, overlap_len, max_len, **kwargs)
109 overlap_len=overlap_len,
110 max_len=max_len,
--> 111 **kwargs)
112 return y
113

~/code/jagular/jagular/filtering.py in filtfilt_within_epochs_mmap(timestamps, finname, foutname, dtype, sos, buffer_len, overlap_len, max_len, filter_epochs, **kwargs)
178 assume_sorted=assume_sorted,
179 step=step,
--> 180 index=True)
181
182 for (start, stop) in filter_epochs:

~/code/jagular/jagular/utils.py in _get_contiguous_segments_fast(data, step, assume_sorted, index, inclusive)
277 starts = np.insert(breaks+1, 0, 0)
278 stops = np.append(breaks, len(data)-1)
--> 279 bdries = np.vstack((data[starts], data[stops] + step)).T
280 if index:
281 if inclusive:

IndexError: arrays used as indices must be of integer (or boolean) type

benchmark: channel writes

How does it scale with # channels written out? Does the unpacking take most of the time, or the writing back to disk?

How about the effect of the block size?
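
A rough way to start answering these questions is to time the pack and write steps separately per block (a sketch with synthetic data; block size and output path are arbitrary):

import time
from struct import Struct

n = 2**22                                      # samples per block (tune this)
values = tuple(i % 32768 for i in range(n))    # synthetic int16 channel data
st = Struct('<%dh' % n)

t0 = time.perf_counter()
packed = st.pack(*values)
t1 = time.perf_counter()
with open('/tmp/ch.raw', 'wb') as f:
    f.write(packed)
t2 = time.perf_counter()
print('pack: %.3f s, write: %.3f s' % (t1 - t0, t2 - t1))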

implement: scriptable interface

Either change the API to make it scriptable, or at least write an example script to do the following:

  • extract all or subset of channels
  • spike detection
    • filter to spike band
    • extract epoch-aware padded snippets
    • align snippets
    • save aligned snippets, as well as aligned spike times
      • peeling!?
  • downsample to 3 kHz (.eeg)
  • downsample to 1 kHz
    • LFP filter (.lfp)
    • ripple filter (.ripple)
  • downsample to 120 Hz
    • theta filter (.theta)

Number of timestamps and neural samples not the same in particular file

@jchutrue please post the code snippet below!

Context: one of the 7/10 .dat files (2200--2400) seems problematic. Using the default block size, I think it "worked", but created lots of small epochs (while using max_gap_size=300), which is unexpected for this file. When we used block_size=2**24, we got an error, saying that the number of expected channel data packets to pack was 0 (which implies an empty new_ts), but that approx 2 million samples were passed for packing.

What's going on?
Is jagular missing some special case? Is the data corrupt? Both?

...

Use `logging` and only warn on incomplete packet read, not incomplete block read

I don't actually think that

print("Read {} complete packets but requested {} packets".format(num_samples_read, block_size))

is an interesting occurrence. This is almost always invisible to the user, since the JagularFileMap gets the remaining packets from the next file. So this is only mildly interesting in the last file, but even then, not really.

What IS interesting to log and/or catch, is if we ever read an incomplete packet, not an incomplete block.

At any rate, we should avoid littering our code with print statements, and should use warnings at worst, or have an optional verbose mode, but logging is highly preferred.
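
For example (a sketch of the proposed change; the function and its incomplete-packet condition are hypothetical):

import logging

logger = logging.getLogger(__name__)

def report_read(num_samples_read, block_size, trailing_bytes):
    # incomplete *block* reads are expected at file boundaries; keep them quiet
    logger.debug("Read %d complete packets but requested %d packets",
                 num_samples_read, block_size)
    # an incomplete *packet*, however, suggests truncation or corruption
    if trailing_bytes:
        logger.warning("Incomplete packet read: %d trailing bytes", trailing_bytes)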

data: multi-file gap data

The timestamps could not be fully sanitized; not clear why yet. This needs to be investigated before this example dataset can be finalized.
