isoverse / isoreader
Read IRMS (Isotope Ratio Mass Spectrometry) data files into R
Home Page: http://isoreader.isoverse.org
License: GNU General Public License v2.0
for simplicity in vignettes
to deal with units in raw data and vendor data tables more effectively, use the new data-frame-compatible vector class system introduced by the vctrs package to implement an isotope data type with units (iso_double_with_units) that can carry units
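A minimal sketch of what such a units-carrying vector looks like in practice, assuming the exported iso_double_with_units() constructor and iso_get_units() accessor (check the current API before relying on these):
library(isoreader)
library(dplyr)
# a numeric vector that carries its units along (assumed constructor/accessor names)
v44 <- iso_double_with_units(c(200.5, 312.2, 288.9), units = "mV")
iso_get_units(v44)  # "mV"
# because it is vctrs-based it can live inside a regular tibble column
df <- tibble(cycle = 1:3, v44 = v44)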
when using the original columns (i.e. nA and mV), the panels do not plot properly
a few improvements to the way iso_files store and use file paths to make re-reading more portable between server and local file systems (and of course between different operating systems):
include an option to save only relative file paths (relative to the working directory) if users pass in file paths that include the working directory - make this the default behavior so that file system information is not stored in $file_info$file_path and it is easier to re-read files even after moving directories around (a rough sketch follows after these notes)
perhaps also store information about whether the file path is absolute or relative
note: this should probably go hand in hand with functions to replace parts of the file path for entire iso_file collections
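A rough sketch of the relative-path idea using the fs package (an illustration only; the actual implementation may differ):
library(fs)
# hypothetical helper: store paths relative to the working directory when possible
make_path_portable <- function(path, wd = getwd()) {
  # only relativize paths that actually sit inside the working directory
  if (startsWith(path_abs(path), path_abs(wd))) path_rel(path, start = wd) else path
}
make_path_portable("/data/project/run1/file.dxf", wd = "/data/project")
# "run1/file.dxf"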
all exported functions should be prefixed with iso_...
in order to co-process data from different file formats, it would be very useful to have a "rename" function that recodes multiple columns into one (making sure that, for any given file, only one of the source columns actually holds data and the rest are NA).
Pseudo-syntax suggestion (a possible implementation sketch follows):
iso_recode(name = c(Id1, info), preparation = c(Comment, Preparation))
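One way such a function could work under the hood is dplyr::coalesce(), which takes the first non-missing value across the source columns; an illustration only (iso_recode() does not exist, and Id1/info/Comment/Preparation are just the example columns from the pseudo-syntax above):
library(dplyr)
# merge format-specific columns into one shared column, taking the first
# non-missing value per file, then drop the originals
file_info <- isoreader::iso_get_file_info(iso_files) %>%
  mutate(
    name = coalesce(Id1, info),
    preparation = coalesce(Comment, Preparation)
  ) %>%
  select(-Id1, -info, -Comment, -Preparation)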
this will make it possible to apply facets and other parameters on top of the base chromatogram
When running isoreader on a set of cached data it will sometimes give the below error. Restarting R is normally enough to fix it. This error was produced on a Windows 10 computer.
Info: preparing to read 2313 data files (all will be cached), setting up 12 parallel processes...
Progress: [--------------------------------------------------------------------------------------------------------------------------------------------------------] 0/2313 ( 0%) 0s
213 parsing failures.
row col expected actual file
143 X1 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 X3 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 X3 delimiter or quote i 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
184 -- 3 columns 5 columns 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
201 -- 3 columns 1 columns 'C:\Users\CUBESSIL\AppData\Local\Temp\RtmpAJgEKL\file32dc60eccf6.log'
... ... .................. ......... ............................................................................
See problems(...) for more details.
Error in sprintf("Info (process %d): ", X2) :
invalid format '%d'; use format %s for character objects
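The last error suggests a character value is being passed into a %d format; a minimal illustration of that failure mode (not the actual isoreader code):
sprintf("Info (process %d): ", "3")  # fails with: invalid format '%d'; use format %s for character objects
sprintf("Info (process %s): ", "3")  # works whether the process id is numeric or character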
vignettes don't usually ship with devtools::install_github() unless explicitly compiled (which takes a long time), so this may not be an option until the package is on CRAN, but it would be quite useful to have a way to copy all vignettes into the current working directory for users to play with, something along these lines (pseudocode):
iso_copy_reader_vignettes <- function(target_folder = ".", overwrite_existing = FALSE, remove_vignette_header = TRUE) {
  # installed vignette sources typically live in the package's "doc" folder
  vignettes <- list.files(system.file("doc", package = "isoreader"), pattern = "\\.Rmd$", full.names = TRUE)
  file.copy(vignettes, target_folder, overwrite = overwrite_existing)
  # (stripping the vignette header could be added as an additional step)
}
maybe this should just be functionality available from isoviewer instead? but it should be possible to use inside an Rmd file
What's the progress on reading in .scn files? On the old isoread repository, this was the one thing we could get into R using your nice collection of scripts! ;-).
I'm willing to help develop, but I first have to understand how isoreader operates under the hood and make the helper functions available to play around with (I guess I can clone your repo, set all the functions to @export, and then I should be able to play around with them on my machine?).
raw data could be cast as double_with_units, the same way the vendor data table is.
pros: column names could reduce to just the v44 and i44 prefixes, with the units carried by the values themselves
cons: aggregating raw data across files becomes ambiguous when units differ (v44 in mV for one and v44 in V for another) - forced scaling first before aggregation? iso_make_units_implicit(prefix = ".", suffix = "") would also need to continue working as used
please chime in if you feel strongly about this possibility
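For reference, a small sketch of going back from explicit to implicit units, assuming iso_make_units_implicit() moves units out of the value class and into the column names (check the current documentation for the exact behavior and defaults):
library(isoreader)
# assumed: a tibble with a units-carrying column
df <- dplyr::tibble(v44 = iso_double_with_units(c(200, 250), units = "mV"))
iso_make_units_implicit(df, prefix = ".", suffix = "")
# expected to yield a plain numeric column named v44.mV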
file attached, peak data table loads correctly
13375__30min He purge 180713 rep2_40ArN2 no inj no dilute DBN 180713 700s 2 refs_met.dxf.zip
errors/warnings if not all original source files are available anymore (i.e. reload impossible)
main parameter for reload should be the .rda filepath, and then it is re-saved to this location
allow global overwrite of read_options; the default is to use the same read options used in the individual files
I guess our new mass spec setup for LIDI measurements (60 cycles, first the reference gas, then the sample gas) causes the did files to have a different structure than those from a typical dual-inlet system. Or at least, there is something going on with the "info" capture.
Attached is an example measurement: link to ETH-1 did file
Reading a few did files (without the cache, so you get the real errors):
did <- iso_read_dual_inlet("raw files/mat253plus/standards/", read_cache = FALSE)
Info: preparing to read 8 data file(s)...
Info: reading and caching file 1/8 'raw files/mat253plus/standards//180515_28_studname_1_ETH-1.did' with '.did' reader...
Warning: caught error - cannot process measurement info - 'info' capture failed: raw data does not match requested data pattern (([ -~]\x00){15}): 41 00 63 00 69 00 64 00 3a 00 20 00 37 00 30 00 2e 00 31 00 20 00 5b 00 b0 00 43 00 5d 00
Warning: caught error - cannot process vendor computed data table - unequal number of column headers (10) and data entries (0) found (pos 948896)
Info: reading and caching file 2/8 'raw files/mat253plus/standards//180515_28_studname_2_ETH-1.did' with '.did' reader...
Warning: caught error - cannot process measurement info - 'info' capture failed: raw data does not match requested data pattern (([ -~]\x00){16}): 20 00 41 00 63 00 69 00 64 00 3a 00 20 00 36 00 39 00 2e 00 39 00 20 00 5b 00 b0 00 43 00 ...
Warning: caught error - cannot process vendor computed data table - unequal number of column headers (10) and data entries (0) found (pos 949042)
Reloading the cache also throws some errors:
> did <- iso_read_dual_inlet("raw files/mat253plus/standards/")
Info: preparing to read 8 data file(s)...
Info: reading file 1/8 'raw files/mat253plus/standards//180515_28_studname_1_ETH-1.did' from cache...
Info: reading file 2/8 'raw files/mat253plus/standards//180515_28_studname_2_ETH-1.did' from cache...
Info: reading file 3/8 'raw files/mat253plus/standards//180515_28_studname_3_ETH-3.did' from cache...
Info: reading file 4/8 'raw files/mat253plus/standards//180515_28_studname_4_ETH-3.did' from cache...
Info: reading file 5/8 'raw files/mat253plus/standards//180515_28_studname_5_ETH-2.did' from cache...
Info: reading file 6/8 'raw files/mat253plus/standards//180515_28_studname_6_ETH-2.did' from cache...
Info: reading file 7/8 'raw files/mat253plus/standards//180515_28_studname_7_ETH-3.did' from cache...
Info: reading file 8/8 'raw files/mat253plus/standards//180515_28_studname_8_ETH-3.did' from cache...
Info: encountered 17 problems in total.
# A tibble: 17 x 4
file_id type func details
<chr> <chr> <chr> <chr>
1 180515_28_studname_1_ETH-1.did error extract_isodat_measurement_info "cannot …
2 180515_28_studname_1_ETH-1.did error extract_did_vendor_data_table cannot p…
3 180515_28_studname_2_ETH-1.did error extract_isodat_measurement_info "cannot …
4 180515_28_studname_2_ETH-1.did error extract_did_vendor_data_table cannot p…
5 180515_28_studname_3_ETH-3.did error extract_isodat_measurement_info "cannot …
6 180515_28_studname_3_ETH-3.did error extract_did_vendor_data_table cannot p…
7 180515_28_studname_4_ETH-3.did error extract_isodat_measurement_info "cannot …
8 180515_28_studname_4_ETH-3.did error extract_did_raw_voltage_data cannot l…
9 180515_28_studname_4_ETH-3.did error extract_did_vendor_data_table cannot p…
10 180515_28_studname_5_ETH-2.did error extract_isodat_measurement_info "cannot …
11 180515_28_studname_5_ETH-2.did error extract_did_vendor_data_table cannot p…
12 180515_28_studname_6_ETH-2.did error extract_isodat_measurement_info "cannot …
13 180515_28_studname_6_ETH-2.did error extract_did_vendor_data_table cannot p…
14 180515_28_studname_7_ETH-3.did error extract_isodat_measurement_info "cannot …
15 180515_28_studname_7_ETH-3.did error extract_did_vendor_data_table cannot p…
16 180515_28_studname_8_ETH-3.did error extract_isodat_measurement_info "cannot …
17 180515_28_studname_8_ETH-3.did error extract_did_vendor_data_table cannot p…
it would be useful to know not just how many errors occurred but also in how many files errors were encountered
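A quick way to get that file-level count from the problems tibble (using the exported iso_get_problems() accessor and the file_id/type columns shown above):
library(dplyr)
probs <- isoreader::iso_get_problems(did)
probs %>% filter(type == "error") %>% summarize(n_files = n_distinct(file_id))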
binary parsing takes a long time, but it should be possible to improve its speed by the following means:
Hi Sebastian,
Finally coming back to taking a look at this (see sebkopf/isoread#21).
Thanks for your elaborate reply in the other thread. I'm sorry I haven't taken the time to look at it before.
The caf reader works really well! It's slow, but I've just read in 564 data files from some MSc-student projects as a test, with only 80 problems in total (mostly relating to failed measurements due to drop tests or similar). I don't know how to display the errors again, and the printed tibble isn't big enough to show the full extent. Here's a snippet (... indicates I omitted some successful imports):
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 109531)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 93743)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 93989)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
...
Warning: caught error - cannot identify measured masses - block 'CResultData' not found after position 1 (pos 109535)
Warning: caught error - cannot process vendor data table - block 'CResultData' not found after position 1 (pos 1)
Info: reading and caching file 563/564 '~/Downloads/megaraw/180522_Stds/180523_Std_ETH-3_17.caf' with '.caf' reader...
Info: reading and caching file 564/564 '~/Downloads/megaraw/180522_Stds/180523_Std_ETH-3_18.caf' with '.caf' reader...
Info: encountered 80 problems in total.
# A tibble: 80 x 4
file_id type func details
<chr> <chr> <chr> <chr>
1 180306_studname_1_1_ETH-1.caf error extract_c… cannot identify measur…
2 180306_studname_1_1_ETH-1.caf error extract_c… cannot process vendor …
3 180306_studname_20_2_ETH-3.caf error extract_c… cannot identify measur…
4 180306_studname_20_2_ETH-3.caf error extract_c… cannot process vendor …
...
Regarding your question 2, I think the caf file that I uploaded earlier should be fine!
potentially with Google Drive and/or Dropbox
for Google Drive use: http://googledrive.tidyverse.org/
the new vctrs package provides efficient implementations of complex type vectors and would solve several issues with S3 method dispatch for different types of iso files objects (e.g. continuous flow vs. dual inlet).
the caveat is that individual iso_file objects are not as easily defined in a way that stays compatible with different file format imports.
gotta think through carefully whether it makes sense
this is mostly about continuous flow file calculations and could benefit from first having a clearer separation of classes between dual_inlet and continuous_flow (maybe simply with dual_inlet and continuous_flow plus dual_inlet_list and continuous_flow_list) since they do behave quite differently for more complex operations (iso_calculate_deltas can then also be for dual inlet only and throw an error "doesn't make sense" for continuous flow).
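A minimal S3 dispatch sketch of that idea (hypothetical method names, not the actual isoreader/isoprocessor implementation):
iso_calculate_deltas <- function(iso_files, ...) UseMethod("iso_calculate_deltas")
iso_calculate_deltas.dual_inlet <- function(iso_files, ...) {
  # dual inlet specific delta calculation would go here
  iso_files
}
iso_calculate_deltas.continuous_flow <- function(iso_files, ...) {
  stop("iso_calculate_deltas() doesn't make sense for continuous flow files", call. = FALSE)
}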
anyways, to simplify isoprocessor calculations that may benefit from staying in isofile space for plotting and other operations, it could be useful to introduce a data_table field that could be modified (e.g. if a calibrated peak table data set should be added back into the isofiles at a later point) via iso_set_data/peak_table, and that could adopt the vendor data table to begin with using iso_set_peak_table_from_vendor_data_table (or something along those lines) and rename fields to names that are compatible with isoprocessor defaults (rt.s, rt_start.s, rt_end.s, etc., and likewise the areas and backgrounds).
this whole business of setting a peak_table could simply be part of isoprocessor instead of isoreader, to keep the separation of collecting data and using data crystal clear.
the big question to solve is whether it makes more sense to keep the raw_data and peak_table together permanently once iso_set_data_table is called, to simplify background and area calculations and plotting in continuous flow files, or to keep them separate and only combine them as needed.
I would advocate for keeping them separate and only combining as needed, even if that makes those operations a bit slower. I.e. an iso_calculate_peak_area combines the peak table and chromatogram, calculates areas, then spits the peak table back out into its own field and leaves the raw data unchanged. this would probably help keep things from getting too messy in terms of where what data is stored.
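A rough dplyr sketch of the "combine only when needed" approach, using hypothetical raw_data/peak_table tibbles and column names (time.s, v44.mV, peak) alongside the rt_start.s/rt_end.s names mentioned above; not an actual isoprocessor implementation:
library(dplyr)
# join the chromatogram onto the peak table only for the duration of the calculation,
# then keep just the augmented peak table and leave the raw data untouched
peak_areas <- raw_data %>%
  inner_join(peak_table, by = "file_id") %>%
  filter(time.s >= rt_start.s, time.s <= rt_end.s) %>%
  group_by(file_id, peak) %>%
  arrange(time.s, .by_group = TRUE) %>%
  summarize(area44.mVs = sum(v44.mV * c(0, diff(time.s))), .groups = "drop")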
the rough overview of functions would be:
- iso_detect_peaks: peak detection that operates on the raw data and generates the peak_table from scratch (not implemented for now)
- iso_set_peak_table_from_vendor_data_table: start with the vendor data table
- iso_set_peak_table: set the peak table from external information (or after calibration)
- iso_mutate_peak_table: add new columns with useful information
- iso_map_peaks: modify the peak mapping function to work within the isofiles (to keep plotting easy)
- iso_integrate_peaks: combine raw data with peak data to calculate peak areas and max peak heights and store the result in the peak_data field - would need to simultaneously calculate backgrounds as well to pave the way for more complex background calculation procedures than just a fixed background (i.e. background area makes sense, background height would be at the apex rather than some sort of average)
- iso_calculate_peak_ratios: calculate ratios in the peak table, shouldn't need the raw data in any way
- iso_calculate_peak_deltas: calculate peak deltas based on an expression that identifies the reference peaks to use, a delta value for the reference peaks (again an expression so it can be a fixed value or a column introduced from other information, by default 0 - somehow figure out how to do this taking the standard values into consideration too), and a method for how to extrapolate raw ratios from the ref peaks ("linear", "bracket", "average", etc.)
I want to play around with the raw intensities of the sample and reference gas of the dual inlet measurements, but iso_get_raw_data doesn't seem to return the ref gas and sample gas separately. Am I missing something?
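Regarding the last question: in dual inlet raw data the standard and sample cycles are typically distinguished by a type column in the iso_get_raw_data() output (worth verifying on your own files); if that holds, they can be separated directly:
library(dplyr)
raw <- isoreader::iso_get_raw_data(did)
sample_gas <- raw %>% filter(type == "sample")
ref_gas <- raw %>% filter(type == "standard")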
parallel processing can jumble the order of the files that were read. this is not usually a problem but can lead to confusion for users when file info is aggregated seemingly out of order.
implement reader for isodat's scn file format (see #25 for similar request)
function to easily merge in metadata but figure out how this fits with the iso_add_metadata function from isoprocessor!
it is frequently used with isoreader already, so it should be part of this package rather than isoprocessor
see isoverse/clumpedr#13 for details
it should be possible to pull out scan information from elementar scan iarc files
the following error occurs when trying to read the newer CNS dxf files
invalid multibyte string, element 1
both should operate on file info; this will be easier than any derived iso_... functions because of users' familiarity with dplyr (usually). related: iso_filter_files, modifyList (for applying the resulting changes...)
Running Windows 10 with everything updated (R, RStudio, and packages) as of 2020/02/13. It looks to produce 5 errors for every file. This is true for all files, including ones that worked in the past. I attached a few files, but any files should work.
error | extract_isodat_sequence_line_info
error | extract_isodat_measurement_info
error | extract_did_raw_voltage_data
error | extract_isodat_reference_values
error | extract_did_vendor_data_table
Hello, thanks for this fantastic package! It works great (when I use it on a machine on a network, where I can run devtools::install_github() to install). When I try to install it onto an older (Win XP) computer that cannot be on a network (long story) using the .zip file I downloaded from this repo, the command does not fail or error, but the package does not get installed. It seems like the issue might be that installing the package requires concomitant installation of other package dependencies, and these are thwarted by the fact that the computer isn't on the network? Any idea if this is true?
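If missing dependencies are indeed the culprit, one way to see what would need to be copied over manually (run on the networked machine; purely a diagnostic sketch):
# list all recursive dependencies of isoreader so their zips can be downloaded too
db <- available.packages(repos = "https://cloud.r-project.org")
deps <- tools::package_dependencies("isoreader", db = db, recursive = TRUE)
deps$isoreader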
A did file with a large number of cycles gives an "extract_did_vendor_data_table" func error:
"cannot process vendor computed data table - unequal number of column headers (7) and data entries (0) found (pos 544533)"
iso_get_reader_example("continuous_flow_example.iarc") %>%
iso_read_continuous_flow()
has recently started throwing the following type of error:
HDF5-DIAG: Error detected in HDF5 (1.8.19) thread 0:
#000: H5L.c line 1181 in H5Literate(): link iteration failed
major: Symbol table
minor: Iteration failed
#001: H5Gint.c line 833 in H5G_iterate(): unable to register group
major: Object atom
minor: Unable to register new atom
#002: H5I.c line 880 in H5I_register(): invalid type
major: Object atom
minor: Unable to find ID group information
Warning: caught error - C stack usage 7969760 is too close to the limit
read_raw_data, iso_convert_signals
On dev, when I try to get the raw data from ~8k files, the function never completes! It gives the info message, but then never finishes and produces no errors. I have to interrupt with Ctrl+C. When I first run iso_filter_files() to reduce the total number a bit, it doesn't hang.
reported by Brett Hill
clearer separation between data reader and processor
This is not an issue with isoreader, but rather an idea for new functionality, which you may or may not consider worth the time. I was just thinking that it would be awesome if isoreader allowed the user to re-evaluate a dxf trace with new peak detection parameters. In my experience, changing the peak detection parameters in Isodat can have important effects on delta values, but it is cumbersome to re-evaluate whole datasets in Isodat while varying peak detection params. If isoreader could evaluate the peaks, it would be easy for the user to test a range of peak detection params and see what results in the best data quality. I think a lot of people arbitrarily choose peak detection params just because it is hard to test them. This could help. Just a thought. I know you don't have time to re-write all of Isodat though.
users may not want all their cores used so that other software can remain usable; provide control over this via a parameter
consider whether it should be possible to read exported feather files back in, although some information may not be recoverable from the purely 2D structure of feather files
In our database there are several files with the same name (in different subdirectories). These occur because, for instance, a run has stopped due to some error (e.g. no acid drop recorded). In this case the .did file is still saved but contains no data except for the diagnostics (line, row, error message, etc.). Because the material is still there, we can restart the run from that particular aliquot, resulting in a new folder with some of the same file names.
In our case, over half a year or so this has resulted in 267 files with duplicate names. Would it be possible for isoreader to not stop when importing different files with the same name, as long as they have different timestamps/file_paths?
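As a stopgap, duplicate file names can be spotted (and, for example, excluded or handled separately) before reading; a small base R sketch assuming the raw files live under a "data" folder:
paths <- list.files("data", pattern = "\\.did$", recursive = TRUE, full.names = TRUE)
dupes <- paths[duplicated(basename(paths)) | duplicated(basename(paths), fromLast = TRUE)]
dupes  # decide which of these to pass on to iso_read_dual_inlet()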