
isoverse / isoreader


Read IRMS (Isotope Ratio Mass Spectrometry) data files into R

Home Page: http://isoreader.isoverse.org

License: GNU General Public License v2.0

R 99.43% Makefile 0.13% TeX 0.44%
Topics: r, data, isotopes, geochemistry, ecology


isoreader's People

Contributors

japhir, romainfrancois, sebkopf


isoreader's Issues

implement number with units class

to deal with units in raw data and vendor data tables more effectively, use the new data-frame-compatible vector class system introduced by the vctrs package to implement an isotope data type with units (iso_double_with_units) that can carry its units along
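
A minimal sketch of how such a class could be built on top of vctrs (illustrative only, not the actual isoreader implementation):

library(vctrs)

# sketch constructor: a double vector that carries its units in an attribute
iso_double_with_units <- function(x = double(), units = "undefined units") {
  x <- vec_cast(x, double())
  new_vctr(x, units = units, class = "iso_double_with_units")
}

# show the units in the type header when printing
vec_ptype_full.iso_double_with_units <- function(x, ...) {
  paste0("double in ", attr(x, "units"))
}

format.iso_double_with_units <- function(x, ...) format(vec_data(x), ...)

iso_double_with_units(c(1.2, 4.9), units = "mV")
#> <double in mV[2]>
#> [1] 1.2 4.9

Arithmetic, casting, and coercion rules (vec_ptype2/vec_cast methods) would still need to be defined so that columns with mismatched units don't combine silently.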

saving file paths in isoread

a few improvements to the way iso_files store and use file paths, to make re-reading more portable between server and local file systems (and of course between different operating systems):

  • include an option to save only the relative file paths (relative to the working directory) if users pass in file paths that include the working directory; make this the default behavior so file system information is not stored in $file_info$file_path and it is easier to re-read files even after moving directories around (see the sketch after this list)

  • perhaps also store information about whether the file path is absolute or relative

  • note: this should probably go hand in hand with functions to replace parts of the file path for entire iso_file collections
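
A minimal sketch of the relative-path idea using the fs package (iso_shorten_paths is a hypothetical helper for illustration, not an existing isoreader function):

library(fs)

iso_shorten_paths <- function(paths, root = getwd()) {
  # store paths relative to the working directory whenever they fall inside
  # it, so collections survive moves between machines and file systems
  ifelse(path_has_parent(paths, root), path_rel(paths, start = root), paths)
}

iso_shorten_paths(c(
  file.path(getwd(), "data", "run1.did"),  # inside the wd -> made relative
  "/other/volume/run2.did"                 # outside the wd -> left absolute
))
#> [1] "data/run1.did" "/other/volume/run2.did"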

allow easy recoding of data columns

in order to co-process data from different file formats, it would be very useful to have a "rename" function that recodes multiple columns into one (making sure that each source column is non-NA in only one set of files).

Pseudo-syntax suggestion:

  • iso_recode(name = c(Id1, info), preparation = c(Comment, Preparation))
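
A minimal sketch of the recoding idea with plain dplyr (iso_recode itself does not exist; coalesce() takes the first non-NA value across the candidate columns):

library(dplyr)

# toy file info aggregated from two file formats that populate different columns
file_info <- tibble(
  Id1         = c("sample A", NA),
  info        = c(NA, "sample B"),
  Comment     = c("acid", NA),
  Preparation = c(NA, "acid")
)

file_info %>%
  mutate(
    name        = coalesce(Id1, info),
    preparation = coalesce(Comment, Preparation)
  ) %>%
  select(name, preparation)
#> # A tibble: 2 x 2
#>   name     preparation
#>   <chr>    <chr>
#> 1 sample A acid
#> 2 sample B acid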

Iso_reader has an invalid format error

When running isoreader on a set of cached data, it will sometimes give the error below. Restarting R is normally enough to fix it. This error was produced on a Windows 10 computer.

Info: preparing to read 2313 data files (all will be cached), setting up 12 parallel processes...
Progress: [--------------------------------------------------------------------------------------------------------------------------------------------------------] 0/2313 ( 0%) 0s
213 parsing failures.
row col expected actual file
143 X1 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 X3 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 X3 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 -- 3 columns 5 columns 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
201 -- 3 columns 1 columns 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
... ... .................. ......... ............................................................................
See problems(...) for more details.
Error in sprintf("Info (process %d): ", X2) :
invalid format '%d'; use format %s for character objects
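
For reference, the final sprintf() failure is easy to reproduce on its own: %d requires a number, and the process id apparently arrives as character here (toy values, not the actual internals):

sprintf("Info (process %d): ", "2")
#> Error in sprintf("Info (process %d): ", "2") :
#>   invalid format '%d'; use format %s for character objects
sprintf("Info (process %s): ", "2")  # the fix the error message suggests
#> [1] "Info (process 2): "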

provide an easy way to copy vignettes

vignettes don't usually ship with devtools::install_github() unless explicitly compiled (which takes a long time), so this may not be an option until the package is on CRAN. But it would be quite useful to have a way to copy all vignettes into the current working directory for users to play with, something along these lines (pseudocode):

iso_copy_reader_vignettes <- function(target_folder = ".", overwrite_existing = FALSE, remove_vignette_header = TRUE) {
  ...
}
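
A minimal sketch of what the body might do, assuming installed vignette sources live in the package's doc/ directory (header removal omitted for brevity):

iso_copy_reader_vignettes <- function(target_folder = ".", overwrite_existing = FALSE) {
  # installed vignette sources typically end up in <pkg>/doc
  doc_dir <- system.file("doc", package = "isoreader")
  if (doc_dir == "") stop("no installed vignettes found for isoreader")
  vignette_files <- list.files(doc_dir, pattern = "\\.Rmd$", full.names = TRUE)
  copied <- file.copy(vignette_files, target_folder, overwrite = overwrite_existing)
  invisible(vignette_files[copied])
}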

support for reading scan files

What's the progress on reading in .scn files? On the old isoread repository, this was the one thing we could get into R using your nice collection of scripts! ;-).

I'm willing to help develop, but I first have to understand how isoreader operates under the hood and make the helper functions available to play around with (I guess I can copy your repo, set all the functions to @export, and then I should be able to play around with them on my machine?).

consider implementing units in raw_data

raw data could be cast as double_with_units, the same way the vendor data table is.

pros:

  • convenient for interacting with the data, less cluttered data access, can manage units and scaling independently of column names
  • current vs. voltage measurements would probably be fine thanks to the v44 and i44 prefixes

cons:

  • unclear how to reconcile raw data between different vendor data types if the units don't match (e.g. if one file has v44 in mV and another has v44 in V) - force scaling first before aggregation?
  • several other processing pipelines already use the implicit unit columns from the raw data and would require a call to iso_make_units_implicit(prefix = ".", suffix = "") to continue working as before

please chime in if you feel strongly about this possibility
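
For illustration, here is a rough sketch of what making units implicit could look like on a plain tibble (make_units_implicit and the named units vector are assumptions for this example; the real function would read the units off the column classes):

library(dplyr)

make_units_implicit <- function(df, units, prefix = ".", suffix = "") {
  # fold each column's units back into its name, e.g. v44 -> v44.mV
  rename_with(df, ~ paste0(.x, prefix, units[.x], suffix),
              .cols = all_of(names(units)))
}

raw <- tibble(time = c(0.2, 0.4), v44 = c(1.2, 1.3), v45 = c(0.9, 1.0))
make_units_implicit(raw, units = c(v44 = "mV", v45 = "mV"))
#> columns are now time, v44.mV, v45.mV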

allow re-read of .rda collections

  • error/warn if not all original source files are available anymore (i.e. a re-read is impossible); see the sketch below

  • the main parameter for the re-read should be the .rda file path, and the collection is then re-saved to that same location

  • allow a global overwrite of read_options; the default is to use the same read options recorded in the individual files
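
A minimal sketch of the availability check before re-reading a collection (check_source_files is a hypothetical helper):

check_source_files <- function(file_paths) {
  # refuse to re-read if any of the original source files are gone
  missing <- !file.exists(file_paths)
  if (any(missing))
    stop("cannot re-read collection, missing source file(s): ",
         paste(file_paths[missing], collapse = ", "), call. = FALSE)
  invisible(file_paths)
}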

cannot read non-dual(?) did file

I guess our new mass spec setup for LIDI measurements (60 cycles, first the reference gas, then the sample gas) causes the did files to have a different structure from the typical dual-inlet system. Or at least, there is something going on with the "info" capture.

Attached is an example measurement: link to ETH-1 did file

Reading a few did files (without the cache, so you get the real errors):

did  <- iso_read_dual_inlet("raw files/mat253plus/standards/", read_cache = FALSE)
Info: preparing to read 8 data file(s)...
Info: reading and caching file 1/8 'raw files/mat253plus/standards//180515_28_studname_1_ETH-1.did' with '.did' reader...
Warning: caught error - cannot process measurement info - 'info' capture failed: raw data does not match requested data pattern (([ -~]\x00){15}): 41 00 63 00 69 00 64 00 3a 00 20 00 37 00 30 00 2e 00 31 00 20 00 5b 00 b0 00 43 00 5d 00
Warning: caught error - cannot process vendor computed data table - unequal number of column headers (10) and data entries (0) found (pos 948896)
Info: reading and caching file 2/8 'raw files/mat253plus/standards//180515_28_studname_2_ETH-1.did' with '.did' reader...
Warning: caught error - cannot process measurement info - 'info' capture failed: raw data does not match requested data pattern (([ -~]\x00){16}): 20 00 41 00 63 00 69 00 64 00 3a 00 20 00 36 00 39 00 2e 00 39 00 20 00 5b 00 b0 00 43 00 ...
Warning: caught error - cannot process vendor computed data table - unequal number of column headers (10) and data entries (0) found (pos 949042)

Reloading the cache also throws some errors:

> did  <- iso_read_dual_inlet("raw files/mat253plus/standards/")
Info: preparing to read 8 data file(s)...
Info: reading file 1/8 'raw files/mat253plus/standards//180515_28_studname_1_ETH-1.did' from cache...
Info: reading file 2/8 'raw files/mat253plus/standards//180515_28_studname_2_ETH-1.did' from cache...
Info: reading file 3/8 'raw files/mat253plus/standards//180515_28_studname_3_ETH-3.did' from cache...
Info: reading file 4/8 'raw files/mat253plus/standards//180515_28_studname_4_ETH-3.did' from cache...
Info: reading file 5/8 'raw files/mat253plus/standards//180515_28_studname_5_ETH-2.did' from cache...
Info: reading file 6/8 'raw files/mat253plus/standards//180515_28_studname_6_ETH-2.did' from cache...
Info: reading file 7/8 'raw files/mat253plus/standards//180515_28_studname_7_ETH-3.did' from cache...
Info: reading file 8/8 'raw files/mat253plus/standards//180515_28_studname_8_ETH-3.did' from cache...
Info: encountered 17 problems in total.
# A tibble: 17 x 4
   file_id                      type  func                            details  
   <chr>                        <chr> <chr>                           <chr>    
 1 180515_28_studname_1_ETH-1.did error extract_isodat_measurement_info "cannot …
 2 180515_28_studname_1_ETH-1.did error extract_did_vendor_data_table   cannot p…
 3 180515_28_studname_2_ETH-1.did error extract_isodat_measurement_info "cannot …
 4 180515_28_studname_2_ETH-1.did error extract_did_vendor_data_table   cannot p…
 5 180515_28_studname_3_ETH-3.did error extract_isodat_measurement_info "cannot …
 6 180515_28_studname_3_ETH-3.did error extract_did_vendor_data_table   cannot p…
 7 180515_28_studname_4_ETH-3.did error extract_isodat_measurement_info "cannot …
 8 180515_28_studname_4_ETH-3.did error extract_did_raw_voltage_data    cannot l…
 9 180515_28_studname_4_ETH-3.did error extract_did_vendor_data_table   cannot p…
10 180515_28_studname_5_ETH-2.did error extract_isodat_measurement_info "cannot …
11 180515_28_studname_5_ETH-2.did error extract_did_vendor_data_table   cannot p…
12 180515_28_studname_6_ETH-2.did error extract_isodat_measurement_info "cannot …
13 180515_28_studname_6_ETH-2.did error extract_did_vendor_data_table   cannot p…
14 180515_28_studname_7_ETH-3.did error extract_isodat_measurement_info "cannot …
15 180515_28_studname_7_ETH-3.did error extract_did_vendor_data_table   cannot p…
16 180515_28_studname_8_ETH-3.did error extract_isodat_measurement_info "cannot …
17 180515_28_studname_8_ETH-3.did error extract_did_vendor_data_table   cannot p…

improve read speed

binary parsing takes a long time, but it should be possible to speed it up by the following means:

  • consider moving key processing steps into C++
  • consider parallelizing processing tasks that can run in parallel

updates on .caf reader

Hi Sebastian,

Finally coming back to taking a look at this (see sebkopf/isoread#21).
Thanks for your elaborate reply in the other thread. I'm sorry I haven't taken the time to look at it before.

The caf reader works really well! It's slow, but I've just read in 564 data files from some MSc-student projects as a test, with only 80 problems in total (mostly relating to failed measurements due to drop tests or similar). I don't know how to display the errors again, and the printed tibble isn't big enough to show the full extent. Here's a snippet (... indicates I omitted some successful imports):

Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 109531)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 93743)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 93989)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 109535)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
Info: reading and caching file 563/564 '~/Downloads/megaraw/180522_Stds/180523_Std_ETH-3_17.caf' with '.caf' reader...
Info: reading and caching file 564/564 '~/Downloads/megaraw/180522_Stds/180523_Std_ETH-3_18.caf' with '.caf' reader...
Info: encountered 80 problems in total.
# A tibble: 80 x 4
   file_id                             type  func       details                
   <chr>                               <chr> <chr>      <chr>                  
 1 180306_studname_1_1_ETH-1.caf  error extract_c… cannot identify measur…
 2 180306_studname_1_1_ETH-1.caf  error extract_c… cannot process vendor …
 3 180306_studname_20_2_ETH-3.caf error extract_c… cannot identify measur…
 4 180306_studname_20_2_ETH-3.caf error extract_c… cannot process vendor …
...

Regarding your question 2, I think the caf file that I uploaded earlier should be fine!

consider implementing iso_file_lists as vctr classes

the new vctrs package provides efficient implementations of complex type vectors and would solve several issues with S3 method dispatch for different types of iso files objects (e.g. continuous flow vs. dual inlet).

the caveat is that individual iso_file objects are not as easily defined in a way that stays compatible with different file format imports.

gotta think through carefully whether this makes sense

consider a data_table or peak_table df that is carried along with the isofiles

this is mostly about continuous flow file calculations and could benefit from first having a clearer separation of classes between dual inlet and continuous flow (maybe simply dual_inlet and continuous_flow plus dual_inlet_list and continuous_flow_list), since the two do behave quite differently for more complex operations (iso_calculate_deltas could then also be dual-inlet only and throw a "doesn't make sense" error for continuous flow).

anyways, to simplify isoprocessor calculations that may benefit from staying in isofile space for plotting and other operations, it could be useful to introduce a data_table field that could be modified (e.g. if a calibrated peak table data set should be added back into the isofiles at a later point) via iso_set_data/peak_table, and that could adopt the vendor data table to begin with using iso_set_peak_table_from_vendor_data_table (or something along those lines), renaming fields to names compatible with isoprocessor defaults (rt.s, rt_start.s, rt_end.s, etc., and likewise the areas and backgrounds).

this whole peak_table machinery could simply be part of isoprocessor instead of isoreader, to keep the separation between collecting data and using data crystal clear

the big question to solve is whether it makes more sense to keep the raw_data and peak_table together permanently once iso_set_data_table is called, to simplify background and area calculations and plotting in continuous flow files, or to keep them separate and only combine them as needed.

I would advocate for keeping them separate and combining only as needed, even if that makes those operations a bit slower. I.e. an iso_calculate_peak_area combines the peak table and chromatogram, calculates areas, then spits the peak table back out into its own field and leaves the raw data unchanged. this would probably keep things from getting too messy regarding where which data is stored.

the rough overview of functions would be:

  • iso_detect_peaks: peak detection that operates on the raw data and generates the peak_table from scratch (not implemented for now)
  • iso_set_peak_table_from_vendor_data_table: start with vendor data table
  • iso_set_peak_table: set peak table from external information (or after calibration)
  • iso_mutate_peak_table: add new columns with useful information
  • iso_map_peaks: modify the peak mapping function to work within the isofiles (to keep plotting easy)
  • iso_integrate_peaks: combine raw data with peak data to calculate peak areas and max peak heights and store the result in the peak_data field (see the sketch after this list) - this would need to simultaneously calculate backgrounds as well to pave the way for more complex background-calculation procedures than just a fixed background (i.e. a background area makes sense, a background height would be at the apex rather than some sort of average)
  • iso_calculate_peak_ratios: calculate ratios in the peaks table, shouldn't need the raw data in any way
  • iso_calculate_peak_deltas: calculate peak deltas based on an expression that identifies the reference peaks to use, a delta value for the reference peaks (again an expression, so it can be a fixed value or a column introduced from other information, by default 0 - somehow figure out how to take the standard values into consideration too), and a method for extrapolating raw ratios from the ref peaks ("linear", "bracket", "average", etc.)
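
A rough sketch of the iso_integrate_peaks idea for a single file, assuming a chromatogram with time.s and v44.mV columns, trapezoidal integration, and no background correction (column names follow the isoprocessor defaults mentioned above; everything else is illustrative):

library(dplyr)
library(purrr)

integrate_peak_areas <- function(raw_data, peak_table) {
  peak_table %>%
    mutate(
      area44.mVs = map2_dbl(rt_start.s, rt_end.s, function(start, end) {
        # slice the chromatogram to the peak's retention time window
        trace <- filter(raw_data, time.s >= start, time.s <= end)
        # trapezoidal rule over the selected slice
        sum(diff(trace$time.s) *
              (head(trace$v44.mV, -1) + tail(trace$v44.mV, -1)) / 2)
      })
    )
}

# toy data: a gaussian peak centered at 5 s
chrom <- tibble(time.s = seq(0, 10, 0.1),
                v44.mV = dnorm(seq(0, 10, 0.1), 5, 1) * 1000)
peaks <- tibble(rt.s = 5, rt_start.s = 2, rt_end.s = 8)
integrate_peak_areas(chrom, peaks)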

keep file order in parallel processing

parallel processing can jumble the order of the files that were read. this is not usually a problem but can lead to confusion for users when file info is aggregated seemingly out of order.
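
A minimal sketch of restoring the request order after a parallel read (toy structure, not isoreader internals):

paths <- c("a.did", "b.did", "c.did")

# simulate results coming back from parallel workers in a jumbled order
results <- lapply(sample(paths), function(path) list(file_path = path))

# reorder by matching the stored file paths against the request order
results <- results[match(paths, vapply(results, `[[`, character(1), "file_path"))]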

iso_read_scn

implement reader for isodat's scn file format (see #25 for similar request)

iso_read_scan_iarc

it should be possible to pull out scan information from Elementar scan iarc files

error reading CNS dxf files

the following error occurs when trying to read the newer CNS dxf files
invalid multibyte string, element 1

implement mutate and filter S3 generics for iso_files

both should operate on the file info; this will be easier than any derived iso_... functions because most users are (usually) already familiar with dplyr

  • filter (should then only spit out the filtered subset of the iso_files object) --> replaces iso_filter_files
  • mutate (should modifyList the resulting changes...)
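
A minimal sketch of the filter method, assuming an iso_file_list is a list of iso_file objects that each carry a one-row $file_info tibble (dplyr::filter() is already an S3 generic, so a method on the list class suffices):

library(dplyr)

filter.iso_file_list <- function(.data, ...) {
  # keep a file if its file_info row survives the filter expressions
  keep <- vapply(
    unclass(.data),
    function(file) nrow(dplyr::filter(file$file_info, ...)) > 0,
    logical(1)
  )
  # spit out only the filtered subset, preserving the class
  structure(unclass(.data)[keep], class = class(.data))
}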

Windows 10 reading problem

Running Windows 10 with everything updated (R, RStudio, and packages) as of 2020/02/13, isoreader reports 5 errors for every file. This is true for all files, including ones that worked in the past. I attached a few files, but any files should work.

error | extract_isodat_sequence_line_info
error | extract_isodat_measurement_info
error | extract_did_raw_voltage_data
error | extract_isodat_reference_values
error | extract_did_vendor_data_table

37956__all$ETH2_12_195.zip

question about installation

Hello, thanks for this fantastic package! It works great when I use it on a networked machine, where I can install it with devtools::install_github(). When I try to install it onto an older (Win XP) computer that cannot be on a network (long story) using the .zip file I downloaded from this repo, the command does not fail or error, but the package does not get installed. It seems like the issue might be that installing the software requires concomitant installation of other software dependencies, and these are thwarted by the fact that the computer isn't on the network? Any idea if this is true?

did file does not read in.

a did file with a large number of cycles gives an "extract_did_vendor_data_table" func error:
"cannot process vendor computed data table - unequal number of column headers (7) and data entries (0) found (pos 544533)"

16085__all$zero_0.zip

iarc reader C stack error

iso_get_reader_example("continuous_flow_example.iarc") %>%
  iso_read_continuous_flow()

has recently started throwing the following type of error:

HDF5-DIAG: Error detected in HDF5 (1.8.19) thread 0:
  #000: H5L.c line 1181 in H5Literate(): link iteration failed
    major: Symbol table
    minor: Iteration failed
  #001: H5Gint.c line 833 in H5G_iterate(): unable to register group
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 880 in H5I_register(): invalid type
    major: Object atom
    minor: Unable to find ID group information
Warning: caught error - C stack usage  7969760 is too close to the limit    

fully implement bgrd_data field

  • background support in isodat files (should be part of read_raw_data)
  • background scaling included in iso_convert_signals!
  • include background in exports

peak detect params

This is not an issue with isoreader, but rather an idea for new functionality, which you may or may not consider worth the time. It would be awesome if isoreader allowed the user to re-evaluate a dxf trace with new peak detection parameters. In my experience, changing the peak detection parameters in Isodat can have important effects on delta values, but it is cumbersome to re-evaluate whole datasets in Isodat while varying peak detect params. If isoreader could evaluate the peaks, it would be easy for the user to test a range of peak detect params and see what results in the best data quality. I think a lot of people choose peak detect params arbitrarily just because it is hard to test them. This could help. Just a thought. I know you don't have time to re-write all of Isodat, though.

iso_read_feather

consider whether it should be possible to read exported feather files back in, although some information may not be recoverable from the purely 2D structure of feather files

Detect duplicate file names with different `file_datetime`s?

In our database there are several files with the same name (in different subdirectories). These occur because, for instance, a run stopped due to some error (e.g. no acid drop recorded); in that case the .did file is still saved but contains no data except for the diagnostics (line, row, error message, etc.). Because the material is still there, we can restart the run from that particular aliquot, resulting in a new folder with some of the same file names.
In our case, over half a year or so this has resulted in 267 files with duplicate names. Would it be possible for isoreader to not stop when importing different files with the same name, if they have different timestamps/file_paths?
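
A small sketch of the proposed check on a toy file info table: flag names that recur with different file_datetime values instead of aborting the read (column names follow isoreader's file info; the data are made up):

library(dplyr)

file_info <- tibble(
  file_id       = c("ETH-1_01.did", "ETH-1_01.did", "ETH-2_05.did"),
  file_datetime = as.POSIXct(c("2018-05-15 10:00", "2018-11-20 09:30",
                               "2018-05-15 11:00")),
  file_path     = c("run1/ETH-1_01.did", "run2/ETH-1_01.did", "run1/ETH-2_05.did")
)

# duplicate names are only a problem if their timestamps also differ
file_info %>%
  group_by(file_id) %>%
  filter(n() > 1, n_distinct(file_datetime) > 1) %>%
  ungroup()
#> flags the two distinct ETH-1_01.did entries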
