
isoverse / isoreader


Read IRMS (Isotope Ratio Mass Spectrometry) data files into R

Home Page: http://isoreader.isoverse.org

License: GNU General Public License v2.0

R 99.43% Makefile 0.13% TeX 0.44%
Topics: r, data, isotopes, geochemistry, ecology


isoreader's People

Contributors

japhir, romainfrancois, sebkopf


isoreader's Issues

implement number with units class

to deal with units in raw data and vendor data tables more effectively, use the new data-frame-compatible vector class system introduced by the vctrs package to implement an isotope data type with units (iso_double_with_units) that can carry its units along
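
A minimal sketch of how such a class could be built on top of vctrs (illustrative only, not the actual isoreader implementation):

library(vctrs)

# sketch constructor: a double vector that carries its units in an attribute
iso_double_with_units <- function(x = double(), units = "undefined units") {
  x <- vec_cast(x, double())
  new_vctr(x, units = units, class = "iso_double_with_units")
}

# show the units in the type header when printing
vec_ptype_full.iso_double_with_units <- function(x, ...) {
  paste0("double in ", attr(x, "units"))
}

format.iso_double_with_units <- function(x, ...) format(vec_data(x), ...)

iso_double_with_units(c(1.2, 4.9), units = "mV")
#> <double in mV[2]>
#> [1] 1.2 4.9

Arithmetic, casting, and coercion rules (vec_ptype2/vec_cast methods) would still need to be defined so that columns with mismatched units don't combine silently.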

saving file paths in isoread

a few improvements to the way iso_files store and use file paths, to make re-reading more portable between server and local file systems (and of course between different operating systems):

  • include an option to save only the relative file paths (relative to the working directory) if users pass in file paths that include the working directory; make this the default behavior so file system information is not stored in $file_info$file_path and it is easier to re-read files even after moving directories around (see the sketch after this list)

  • perhaps also store information about whether the file path is absolute or relative

  • note: this should probably go hand in hand with functions to replace parts of the file path for entire iso_file collections
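
A minimal sketch of the relative-path idea using the fs package (iso_shorten_paths is a hypothetical helper for illustration, not an existing isoreader function):

library(fs)

iso_shorten_paths <- function(paths, root = getwd()) {
  # store paths relative to the working directory whenever they fall inside
  # it, so collections survive moves between machines and file systems
  ifelse(path_has_parent(paths, root), path_rel(paths, start = root), paths)
}

iso_shorten_paths(c(
  file.path(getwd(), "data", "run1.did"),  # inside the wd -> made relative
  "/other/volume/run2.did"                 # outside the wd -> left absolute
))
#> [1] "data/run1.did" "/other/volume/run2.did"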

allow easy recoding of data columns

in order to co-process data from different file formats, it would be very useful to have a "rename" function that recodes multiple columns into one (making sure that each source column is non-NA in only one set of files).

Pseudo-syntax suggestion:

  • iso_recode(name = c(Id1, info), preparation = c(Comment, Preparation))
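
A minimal sketch of the recoding idea with plain dplyr (iso_recode itself does not exist; coalesce() takes the first non-NA value across the candidate columns):

library(dplyr)

# toy file info aggregated from two file formats that populate different columns
file_info <- tibble(
  Id1         = c("sample A", NA),
  info        = c(NA, "sample B"),
  Comment     = c("acid", NA),
  Preparation = c(NA, "acid")
)

file_info %>%
  mutate(
    name        = coalesce(Id1, info),
    preparation = coalesce(Comment, Preparation)
  ) %>%
  select(name, preparation)
#> # A tibble: 2 x 2
#>   name     preparation
#>   <chr>    <chr>
#> 1 sample A acid
#> 2 sample B acid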

Iso_reader has an invalid format error

When running isoreader on a set of cached data, it will sometimes give the error below. Restarting R is normally enough to fix it. This error was produced on a Windows 10 computer.

Info: preparing to read 2313 data files (all will be cached), setting up 12 parallel processes...
Progress: [--------------------------------------------------------------------------------------------------------------------------------------------------------] 0/2313 ( 0%) 0s
213 parsing failures.
row col expected actual file
143 X1 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 X3 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 X3 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 -- 3 columns 5 columns 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
201 -- 3 columns 1 columns 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
... ... .................. ......... ............................................................................
See problems(...) for more details.
Error in sprintf("Info (process %d): ", X2) :
invalid format '%d'; use format %s for character objects
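
For reference, the final sprintf() failure is easy to reproduce on its own: %d requires a number, and the process id apparently arrives as character here (toy values, not the actual internals):

sprintf("Info (process %d): ", "2")
#> Error in sprintf("Info (process %d): ", "2") :
#>   invalid format '%d'; use format %s for character objects
sprintf("Info (process %s): ", "2")  # the fix the error message suggests
#> [1] "Info (process 2): "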

provide an easy way to copy vignettes

vignettes don't usually ship with devtools::install_github() unless explicitly compiled (which takes a long time), so this may not be an option until the package is on CRAN. But it would be quite useful to have a way to copy all vignettes into the current working directory for users to play with, something along these lines (pseudocode):

iso_copy_reader_vignettes <- function(target_folder = ".", overwrite_existing = FALSE, remove_vignette_header = TRUE) {
  ...
}
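
A minimal sketch of what the body might do, assuming installed vignette sources live in the package's doc/ directory (header removal omitted for brevity):

iso_copy_reader_vignettes <- function(target_folder = ".", overwrite_existing = FALSE) {
  # installed vignette sources typically end up in <pkg>/doc
  doc_dir <- system.file("doc", package = "isoreader")
  if (doc_dir == "") stop("no installed vignettes found for isoreader")
  vignette_files <- list.files(doc_dir, pattern = "\\.Rmd$", full.names = TRUE)
  copied <- file.copy(vignette_files, target_folder, overwrite = overwrite_existing)
  invisible(vignette_files[copied])
}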

support for reading scan files

What's the progress on reading in .scn files? On the old isoread repository, this was the one thing we could get into R using your nice collection of scripts! ;-).

I'm willing to help develop, but I first have to understand how isoreader operates under the hood and make the helper functions available to play around with (I guess I can copy your repo, set all the functions to @export, and then I should be able to play around with them on my machine?).

consider implementing units in raw_data

raw data could be cast as double_with_units, the same way the vendor data table is.

pros:

  • convenient for interacting with the data, less cluttered data access, can manage units and scaling independently of column names
  • current vs. voltage measurements would probably be fine thanks to the v44 and i44 prefixes

cons:

  • unclear how to reconcile raw data between different vendor data types if the units don't match (e.g. if one file has v44 in mV and another has v44 in V) - force scaling first before aggregation?
  • several other processing pipelines already use the implicit unit columns from the raw data and would require a call to iso_make_units_implicit(prefix = ".", suffix = "") to continue working as before

please chime in if you feel strongly about this possibility
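
For illustration, here is a rough sketch of what making units implicit could look like on a plain tibble (make_units_implicit and the named units vector are assumptions for this example; the real function would read the units off the column classes):

library(dplyr)

make_units_implicit <- function(df, units, prefix = ".", suffix = "") {
  # fold each column's units back into its name, e.g. v44 -> v44.mV
  rename_with(df, ~ paste0(.x, prefix, units[.x], suffix),
              .cols = all_of(names(units)))
}

raw <- tibble(time = c(0.2, 0.4), v44 = c(1.2, 1.3), v45 = c(0.9, 1.0))
make_units_implicit(raw, units = c(v44 = "mV", v45 = "mV"))
#> columns are now time, v44.mV, v45.mV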

allow re-read of .rda collections

  • error/warn if not all original source files are available anymore (i.e. a re-read is impossible); see the sketch below

  • the main parameter for the re-read should be the .rda file path, and the collection is then re-saved to that same location

  • allow a global overwrite of read_options; the default is to use the same read options recorded in the individual files
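
A minimal sketch of the availability check before re-reading a collection (check_source_files is a hypothetical helper):

check_source_files <- function(file_paths) {
  # refuse to re-read if any of the original source files are gone
  missing <- !file.exists(file_paths)
  if (any(missing))
    stop("cannot re-read collection, missing source file(s): ",
         paste(file_paths[missing], collapse = ", "), call. = FALSE)
  invisible(file_paths)
}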

cannot read non-dual(?) did file

I guess our new mass spec setup for LIDI measurements (60 cycles, first the reference gas, then the sample gas) causes the did files to have a different structure from the typical dual-inlet system. Or at least, there is something going on with the "info" capture.

Attached is an example measurement: link to ETH-1 did file

Reading a few did files (without the cache, so you get the real errors):

did  <- iso_read_dual_inlet("raw files/mat253plus/standards/", read_cache = FALSE)
Info: preparing to read 8 data file(s)...
Info: reading and caching file 1/8 'raw files/mat253plus/standards//180515_28_studname_1_ETH-1.did' with '.did' reader...
Warning: caught error - cannot process measurement info - 'info' capture failed: raw data does not match requested data pattern (([ -~]\x00){15}): 41 00 63 00 69 00 64 00 3a 00 20 00 37 00 30 00 2e 00 31 00 20 00 5b 00 b0 00 43 00 5d 00
Warning: caught error - cannot process vendor computed data table - unequal number of column headers (10) and data entries (0) found (pos 948896)
Info: reading and caching file 2/8 'raw files/mat253plus/standards//180515_28_studname_2_ETH-1.did' with '.did' reader...
Warning: caught error - cannot process measurement info - 'info' capture failed: raw data does not match requested data pattern (([ -~]\x00){16}): 20 00 41 00 63 00 69 00 64 00 3a 00 20 00 36 00 39 00 2e 00 39 00 20 00 5b 00 b0 00 43 00 ...
Warning: caught error - cannot process vendor computed data table - unequal number of column headers (10) and data entries (0) found (pos 949042)

Reloading the cache also throws some errors:

> did  <- iso_read_dual_inlet("raw files/mat253plus/standards/")
Info: preparing to read 8 data file(s)...
Info: reading file 1/8 'raw files/mat253plus/standards//180515_28_studname_1_ETH-1.did' from cache...
Info: reading file 2/8 'raw files/mat253plus/standards//180515_28_studname_2_ETH-1.did' from cache...
Info: reading file 3/8 'raw files/mat253plus/standards//180515_28_studname_3_ETH-3.did' from cache...
Info: reading file 4/8 'raw files/mat253plus/standards//180515_28_studname_4_ETH-3.did' from cache...
Info: reading file 5/8 'raw files/mat253plus/standards//180515_28_studname_5_ETH-2.did' from cache...
Info: reading file 6/8 'raw files/mat253plus/standards//180515_28_studname_6_ETH-2.did' from cache...
Info: reading file 7/8 'raw files/mat253plus/standards//180515_28_studname_7_ETH-3.did' from cache...
Info: reading file 8/8 'raw files/mat253plus/standards//180515_28_studname_8_ETH-3.did' from cache...
Info: encountered 17 problems in total.
# A tibble: 17 x 4
   file_id                      type  func                            details  
   <chr>                        <chr> <chr>                           <chr>    
 1 180515_28_studname_1_ETH-1.did error extract_isodat_measurement_info "cannot …
 2 180515_28_studname_1_ETH-1.did error extract_did_vendor_data_table   cannot p…
 3 180515_28_studname_2_ETH-1.did error extract_isodat_measurement_info "cannot …
 4 180515_28_studname_2_ETH-1.did error extract_did_vendor_data_table   cannot p…
 5 180515_28_studname_3_ETH-3.did error extract_isodat_measurement_info "cannot …
 6 180515_28_studname_3_ETH-3.did error extract_did_vendor_data_table   cannot p…
 7 180515_28_studname_4_ETH-3.did error extract_isodat_measurement_info "cannot …
 8 180515_28_studname_4_ETH-3.did error extract_did_raw_voltage_data    cannot l…
 9 180515_28_studname_4_ETH-3.did error extract_did_vendor_data_table   cannot p…
10 180515_28_studname_5_ETH-2.did error extract_isodat_measurement_info "cannot …
11 180515_28_studname_5_ETH-2.did error extract_did_vendor_data_table   cannot p…
12 180515_28_studname_6_ETH-2.did error extract_isodat_measurement_info "cannot …
13 180515_28_studname_6_ETH-2.did error extract_did_vendor_data_table   cannot p…
14 180515_28_studname_7_ETH-3.did error extract_isodat_measurement_info "cannot …
15 180515_28_studname_7_ETH-3.did error extract_did_vendor_data_table   cannot p…
16 180515_28_studname_8_ETH-3.did error extract_isodat_measurement_info "cannot …
17 180515_28_studname_8_ETH-3.did error extract_did_vendor_data_table   cannot p…

improve read speed

binary parsing takes a long time, but it should be possible to speed it up by the following means:

  • consider moving key processing steps into C++
  • consider parallelizing processing tasks that can run in parallel

updates on .caf reader

Hi Sebastian,

Finally coming back to taking a look at this (see sebkopf/isoread#21).
Thanks for your elaborate reply in the other thread. I'm sorry I haven't taken the time to look at it before.

The caf reader works really well! It's slow, but I've just read in 564 data files from some MSc-student projects as a test, with only 80 problems in total (mostly relating to failed measurements due to drop tests or similar). I don't know how to display the errors again, and the printed tibble isn't big enough to show the full extent. Here's a snippet (... indicates I omitted some successful imports):

Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 109531)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 93743)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 93989)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 109535)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
Info: reading and caching file 563/564 '~/Downloads/megaraw/180522_Stds/180523_Std_ETH-3_17.caf' with '.caf' reader...
Info: reading and caching file 564/564 '~/Downloads/megaraw/180522_Stds/180523_Std_ETH-3_18.caf' with '.caf' reader...
Info: encountered 80 problems in total.
# A tibble: 80 x 4
   file_id                             type  func       details                
   <chr>                               <chr> <chr>      <chr>                  
 1 180306_studname_1_1_ETH-1.caf  error extract_c… cannot identify measur…
 2 180306_studname_1_1_ETH-1.caf  error extract_c… cannot process vendor …
 3 180306_studname_20_2_ETH-3.caf error extract_c… cannot identify measur…
 4 180306_studname_20_2_ETH-3.caf error extract_c… cannot process vendor …
...

Regarding your question 2, I think the caf file that I uploaded earlier should be fine!

consider implementing iso_file_lists as vctr classes

the new vctrs package provides efficient implementations of complex type vectors and would solve several issues with S3 method dispatch for different types of iso files objects (e.g. continuous flow vs. dual inlet).

the caveat is that individual iso_file objects are not as easily defined in a way that stays compatible with different file format imports.

gotta think through carefully whether this makes sense

consider a data_table or peak_table df that is carried along with the isofiles

this is mostly about continuous flow file calculations and could benefit from first having a clearer separation of classes between dual inlet and continuous flow (maybe simply dual_inlet and continuous_flow plus dual_inlet_list and continuous_flow_list), since the two do behave quite differently for more complex operations (iso_calculate_deltas could then also be dual-inlet only and throw a "doesn't make sense" error for continuous flow).

anyways, to simplify isoprocessor calculations that may benefit from staying in isofile space for plotting and other operations, it could be useful to introduce a data_table field that could be modified (e.g. if a calibrated peak table data set should be added back into the isofiles at a later point) via iso_set_data/peak_table, and that could adopt the vendor data table to begin with using iso_set_peak_table_from_vendor_data_table (or something along those lines), renaming fields to names compatible with isoprocessor defaults (rt.s, rt_start.s, rt_end.s, etc., and likewise the areas and backgrounds).

this whole peak_table machinery could simply be part of isoprocessor instead of isoreader, to keep the separation between collecting data and using data crystal clear

the big question to solve is whether it makes more sense to keep the raw_data and peak_table together permanently once iso_set_data_table is called, to simplify background and area calculations and plotting in continuous flow files, or to keep them separate and only combine them as needed.

I would advocate for keeping them separate and combining only as needed, even if that makes those operations a bit slower. I.e. an iso_calculate_peak_area combines the peak table and chromatogram, calculates areas, then spits the peak table back out into its own field and leaves the raw data unchanged. this would probably keep things from getting too messy regarding where which data is stored.

the rough overview of functions would be:

  • iso_detect_peaks: peak detection that operates on the raw data and generates the peak_table from scratch (not implemented for now)
  • iso_set_peak_table_from_vendor_data_table: start with vendor data table
  • iso_set_peak_table: set peak table from external information (or after calibration)
  • iso_mutate_peak_table: add new columns with useful information
  • iso_map_peaks: modify the peak mapping function to work within the isofiles (to keep plotting easy)
  • iso_integrate_peaks: combine raw data with peak data to calculate peak areas and max peak heights and store the result in the peak_data field (see the sketch after this list) - this would need to simultaneously calculate backgrounds as well to pave the way for more complex background-calculation procedures than just a fixed background (i.e. a background area makes sense, a background height would be at the apex rather than some sort of average)
  • iso_calculate_peak_ratios: calculate ratios in the peaks table, shouldn't need the raw data in any way
  • iso_calculate_peak_deltas: calculate peak deltas based on an expression that identifies the reference peaks to use, a delta value for the reference peaks (again an expression, so it can be a fixed value or a column introduced from other information, by default 0 - somehow figure out how to take the standard values into consideration too), and a method for extrapolating raw ratios from the ref peaks ("linear", "bracket", "average", etc.)
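
A rough sketch of the iso_integrate_peaks idea for a single file, assuming a chromatogram with time.s and v44.mV columns, trapezoidal integration, and no background correction (column names follow the isoprocessor defaults mentioned above; everything else is illustrative):

library(dplyr)
library(purrr)

integrate_peak_areas <- function(raw_data, peak_table) {
  peak_table %>%
    mutate(
      area44.mVs = map2_dbl(rt_start.s, rt_end.s, function(start, end) {
        # slice the chromatogram to the peak's retention time window
        trace <- filter(raw_data, time.s >= start, time.s <= end)
        # trapezoidal rule over the selected slice
        sum(diff(trace$time.s) *
              (head(trace$v44.mV, -1) + tail(trace$v44.mV, -1)) / 2)
      })
    )
}

# toy data: a gaussian peak centered at 5 s
chrom <- tibble(time.s = seq(0, 10, 0.1),
                v44.mV = dnorm(seq(0, 10, 0.1), 5, 1) * 1000)
peaks <- tibble(rt.s = 5, rt_start.s = 2, rt_end.s = 8)
integrate_peak_areas(chrom, peaks)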

keep file order in parallel processing

parallel processing can jumble the order of the files that were read. this is not usually a problem but can lead to confusion for users when file info is aggregated seemingly out of order.
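
A minimal sketch of restoring the request order after a parallel read (toy structure, not isoreader internals):

paths <- c("a.did", "b.did", "c.did")

# simulate results coming back from parallel workers in a jumbled order
results <- lapply(sample(paths), function(path) list(file_path = path))

# reorder by matching the stored file paths against the request order
results <- results[match(paths, vapply(results, `[[`, character(1), "file_path"))]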

iso_read_scn

implement reader for isodat's scn file format (see #25 for similar request)

iso_read_scan_iarc

it should be possible to pull out scan information from Elementar scan iarc files

error reading CNS dxf files

the following error occurs when trying to read the newer CNS dxf files
invalid multibyte string, element 1

implement mutate and filter S3 generics for iso_files

both should operate on the file info; this will be easier than any derived iso_... functions because most users are (usually) already familiar with dplyr

  • filter (should then only spit out the filtered subset of the iso_files object) --> replaces iso_filter_files
  • mutate (should modifyList the resulting changes...)
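
A minimal sketch of the filter method, assuming an iso_file_list is a list of iso_file objects that each carry a one-row $file_info tibble (dplyr::filter() is already an S3 generic, so a method on the list class suffices):

library(dplyr)

filter.iso_file_list <- function(.data, ...) {
  # keep a file if its file_info row survives the filter expressions
  keep <- vapply(
    unclass(.data),
    function(file) nrow(dplyr::filter(file$file_info, ...)) > 0,
    logical(1)
  )
  # spit out only the filtered subset, preserving the class
  structure(unclass(.data)[keep], class = class(.data))
}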

Windows 10 reading problem

Running Windows 10 with everything updated (R, RStudio, and packages) as of 2020/02/13, isoreader reports 5 errors for every file. This is true for all files, including ones that worked in the past. I attached a few files, but any files should work.

error | extract_isodat_sequence_line_info
error | extract_isodat_measurement_info
error | extract_did_raw_voltage_data
error | extract_isodat_reference_values
error | extract_did_vendor_data_table

37956__all$ETH2_12_195.zip

question about installation

Hello, thanks for this fantastic package! It works great when I use it on a networked machine, where I can install it with devtools::install_github(). When I try to install it onto an older (Win XP) computer that cannot be on a network (long story) using the .zip file I downloaded from this repo, the command does not fail or error, but the package does not get installed. It seems like the issue might be that installing the software requires concomitant installation of other software dependencies, and these are thwarted by the fact that the computer isn't on the network? Any idea if this is true?

did file does not read in.

a did file with a large number of cycles gives an "extract_did_vendor_data_table" func error:
"cannot process vendor computed data table - unequal number of column headers (7) and data entries (0) found (pos 544533)"

16085__all$zero_0.zip

iarc reader C stack error

iso_get_reader_example("continuous_flow_example.iarc") %>%
  iso_read_continuous_flow()

has recently started throwing the following type of error:

HDF5-DIAG: Error detected in HDF5 (1.8.19) thread 0:
  #000: H5L.c line 1181 in H5Literate(): link iteration failed
    major: Symbol table
    minor: Iteration failed
  #001: H5Gint.c line 833 in H5G_iterate(): unable to register group
    major: Object atom
    minor: Unable to register new atom
  #002: H5I.c line 880 in H5I_register(): invalid type
    major: Object atom
    minor: Unable to find ID group information
Warning: caught error - C stack usage  7969760 is too close to the limit    

fully implement bgrd_data field

  • background support in isodat files (should be part of read_raw_data)
  • background scaling included in iso_convert_signals!
  • include background in exports

peak detect params

This is not an issue with isoreader, but rather an idea for new functionality, which you may or may not consider worth the time. It would be awesome if isoreader allowed the user to re-evaluate a dxf trace with new peak detection parameters. In my experience, changing the peak detection parameters in Isodat can have important effects on delta values, but it is cumbersome to re-evaluate whole datasets in Isodat while varying peak detect params. If isoreader could evaluate the peaks, it would be easy for the user to test a range of peak detect params and see what results in the best data quality. I think a lot of people choose peak detect params arbitrarily just because it is hard to test them. This could help. Just a thought. I know you don't have time to re-write all of Isodat, though.

iso_read_feather

consider whether it should be possible to read exported feather files back in, although some information may not be recoverable from the purely 2D structure of feather files

Detect duplicate file names with different `file_datetime`s?

In our database there are several files with the same name (in different subdirectories). These occur because, for instance, a run stopped due to some error (e.g. no acid drop recorded); in that case the .did file is still saved but contains no data except for the diagnostics (line, row, error message, etc.). Because the material is still there, we can restart the run from that particular aliquot, resulting in a new folder with some of the same file names.
In our case, over half a year or so this has resulted in 267 files with duplicate names. Would it be possible for isoreader to not stop when importing different files with the same name, if they have different timestamps/file_paths?
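
A small sketch of the proposed check on a toy file info table: flag names that recur with different file_datetime values instead of aborting the read (column names follow isoreader's file info; the data are made up):

library(dplyr)

file_info <- tibble(
  file_id       = c("ETH-1_01.did", "ETH-1_01.did", "ETH-2_05.did"),
  file_datetime = as.POSIXct(c("2018-05-15 10:00", "2018-11-20 09:30",
                               "2018-05-15 11:00")),
  file_path     = c("run1/ETH-1_01.did", "run2/ETH-1_01.did", "run1/ETH-2_05.did")
)

# duplicate names are only a problem if their timestamps also differ
file_info %>%
  group_by(file_id) %>%
  filter(n() > 1, n_distinct(file_datetime) > 1) %>%
  ungroup()
#> flags the two distinct ETH-1_01.did entries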
