palaeoverse / palaeoverse

palaeoverse: an R package developed by palaeobiologists, for palaeobiologists

Home Page: https://palaeoverse.palaeoverse.org/

License: GNU General Public License v3.0

R 100.00%

biodiversity fossil palaeobiology r-package paleobiology

palaeoverse's Issues

REVIEW: time_bins()

Great Job Lewis! I did some tests, it all seems to work except for the binning at large bin sizes (see below). These are the changes I propose:

remove equal argument: size is sufficient. Set the default of size to FALSE; if the user specifies a number, then use equal-length time bins
set plot default to FALSE: I don't think the user usually wants to plot these intervals when generating time bins.
The first bin can be substantially shorter than subsequent bins if long bin sizes are used. Change to generate more uniform bin sizes. Example: time_bins(interval = c("Albian", "Meghalayan"), equal = TRUE, size = 100)
add a column called something like "level" to indicate that the bins were grouped on stage, epoch, ... level
allow negative ages as arguments to assign
Input allows epochs, periods, etc., but GTS2020$interval_name returns only stages. Should return table with stage, epoch, ... column
return intervals of time bins as a list of vectors composed of individual interval names, rather than strings with multiple intervals for easier handling in data analysis
description unclear: In what way is the mean minimised? I thought the mean durations should be as equal as possible

time_bins fails with defaults for GTS2012

While playing around with other things, I realized that time_bins(scale = "GTS2012") (i.e., using the old timescale with all other defaults) currently fails. This is because the default value for interval is c("Fortunian", "Meghalayan"), but the "Meghalayan" stage didn't exist in 2012. To fix/enhance this, I propose that we make the interval argument optional (with a default value of NULL) and if not specified, all intervals of the desired rank are returned from the specified timescale.
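
A minimal sketch of the proposed NULL default, assuming a toy timescale table rather than the package's real GTS data. The helper name select_intervals() and the column names are hypothetical and only illustrate the idea: with no interval supplied, every interval in the chosen timescale is returned, and an unknown name gives an informative error instead of a silent failure.

```r
# Toy timescale table; column names are assumptions for illustration only.
toy_scale <- data.frame(
  interval_name = c("Chibanian", "Calabrian", "Gelasian", "Piacenzian"),
  rank = "stage",
  max_ma = c(0.774, 1.80, 2.58, 3.60),
  min_ma = c(0.129, 0.774, 1.80, 2.58),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every interval when `interval` is NULL,
# otherwise everything spanned by the named interval(s).
select_intervals <- function(scale_df, interval = NULL) {
  if (is.null(interval)) {
    return(scale_df)
  }
  missing_names <- setdiff(interval, scale_df$interval_name)
  if (length(missing_names) > 0) {
    stop("Interval(s) not in this timescale: ",
         paste(missing_names, collapse = ", "))
  }
  picked <- scale_df[scale_df$interval_name %in% interval, ]
  keep <- scale_df$max_ma <= max(picked$max_ma) &
    scale_df$min_ma >= min(picked$min_ma)
  scale_df[keep, ]
}

all_bins <- select_intervals(toy_scale)
some_bins <- select_intervals(toy_scale, c("Gelasian", "Chibanian"))
```

With this pattern the default no longer depends on a stage name that may be absent from an older timescale.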

User-defined timescale in time_bins()

It would be great to give the user more flexibility over which timescale is used in time_bins() (as in axis_geo()). This could include incorporating other timescales through deeptime::getTimeScale() or letting the user supply their own dataframe. I imagine this would be most useful for the equal-length binning aspect of time_bins().

Non-ASCII characters in built-in datasets?

CRAN checks note that we have some non-ASCII characters in our package. I'm guessing they are within our built-in datasets (maybe characters with accents?)? It's not super urgent, but we should convert them to ASCII at some point.
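
One way to track these down is sketched below, assuming the datasets are ordinary data frames with character columns. The helper names is_non_ascii() and find_non_ascii() are hypothetical. Base R also ships tools::showNonASCII(), which prints the offending elements of a character vector.

```r
# Flag strings containing any byte outside the ASCII range (> 127).
is_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: report which character columns of a data frame
# contain at least one non-ASCII value.
find_non_ascii <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  vapply(df[is_chr], function(col) any(is_non_ascii(col)), logical(1))
}

# The accented name is written with a \u escape so this snippet is pure ASCII.
x <- c("Sof\u00eda", "Lucas", "Maastrichtian")
flags <- is_non_ascii(x)
x[flags]
```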

tax_range_time example

The tax_range_time example uses orders, but does not exclude "NO_ORDER_SPECIFIED", resulting in weird groupings. This should be updated to use genera or exclude non-specified occurrences.

look_up can't assign stages to pre-Phanerozoic intervals and fails if those are present

Describe the bug
The look_up function tries to look up intervals from the GTS tables and assign stages. This fails for intervals older than the Phanerozoic, as older stages are not defined in the GTS tables, resulting in an error.

To reproduce
look_up(reefs,
        int_key = FALSE,
        early_interval = "interval",
        late_interval = "interval",
        assign_with_GTS = "GTS2020",
        return_unassigned = FALSE)

Expected behavior
Stages should not be assigned to those intervals.

Resolution for the next release
Add a line of code specifying that pre-Phanerozoic intervals will not be looked up in the GTS tables.

User-defined probability function for bin_time()

Right now bin_time() uses a uniform distribution to assign point estimates for ages. However, I can see situations where users would prefer a normal distribution, logistic distribution, etc. It would be quite cool (although possibly complicated) to allow for this type of flexibility.
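
As a rough illustration of what a non-uniform option could look like, the sketch below draws point ages from a normal distribution truncated to the occurrence's age range, using inverse-CDF sampling in base R. The function name sample_age() and its arguments are hypothetical and are not part of bin_time().

```r
# Hypothetical helper: draw `n` ages between min_ma and max_ma from a normal
# distribution truncated to that range (inverse-CDF sampling).
sample_age <- function(n, min_ma, max_ma,
                       mu = (min_ma + max_ma) / 2,
                       sigma = (max_ma - min_ma) / 4) {
  stopifnot(max_ma > min_ma, sigma > 0)
  # Probability mass of the untruncated normal below each age limit.
  lower <- pnorm(min_ma, mean = mu, sd = sigma)
  upper <- pnorm(max_ma, mean = mu, sd = sigma)
  # Uniform draws restricted to [lower, upper] map back into the age range.
  u <- runif(n, min = lower, max = upper)
  qnorm(u, mean = mu, sd = sigma)
}

set.seed(1)
ages <- sample_age(1000, min_ma = 66, max_ma = 72.1)
range(ages)
```

The same pattern works for any distribution that has a cumulative function and a quantile function (e.g. plogis()/qlogis() for the logistic distribution).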

Exhaustive equal-length bin algorithm

It would be nice to have a way to ensure that we have the MOST equal-length bins possible. This could include some sort of algorithm that checks many different sets of bins and then compares the standard deviations of their durations. I'm not sure about the return on investment here, both in terms of developer time and computation time within the function, but I think it might be useful for more statistically-inclined users.
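
A brute-force version of this idea can be sketched in base R, assuming the input is a vector of consecutive interval durations and the number of bins is fixed in advance. The helper best_equal_bins() is hypothetical. It tries every way of placing the breaks and keeps the grouping with the smallest standard deviation of bin durations.

```r
# Hypothetical helper: group consecutive intervals into `n_bins` bins so that
# the standard deviation of bin durations is as small as possible.
best_equal_bins <- function(durations, n_bins) {
  n <- length(durations)
  stopifnot(n_bins >= 2, n_bins <= n)
  # Each candidate is a set of n_bins - 1 break positions; a break at
  # position b falls between interval b and interval b + 1.
  breaks <- combn(n - 1, n_bins - 1, simplify = FALSE)
  score <- vapply(breaks, function(b) {
    groups <- findInterval(seq_len(n), b + 1)
    sd(tapply(durations, groups, sum))
  }, numeric(1))
  best <- breaks[[which.min(score)]]
  list(
    bin = findInterval(seq_len(n), best + 1) + 1,
    sd = min(score),
    n_candidates = length(breaks)
  )
}

# Six made-up interval durations (Myr) grouped into three bins.
res <- best_equal_bins(c(2, 3, 5, 1, 4, 5), n_bins = 3)
res$bin
res$n_candidates
```

The number of candidates is choose(n - 1, n_bins - 1), which grows very quickly, so this exhaustive search is only practical for short timescales or few bins. That is the computation-time cost raised in the issue.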

Homonym retention in tax_unique()

Is your feature request related to a problem? Please describe.
tax_unique() currently verifies repetition based on one taxonomic level at a time. This means that, for example, genera with the same name but in different orders, would currently be collapsed into a single genus.

Describe the solution you'd like
The 'if' statements which cross-check taxon names need to be made more nuanced to allow checking across multiple taxon levels.
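
The difference between single-level and multi-level checking can be shown with a toy table. The taxon names below are made up for illustration, and this is only a sketch of the idea, not how tax_unique() is implemented.

```r
# Made-up names: the genus "Aus" occurs in two different orders (a homonym).
occ <- data.frame(
  order = c("Xida", "Xida", "Yida", "Yida"),
  genus = c("Aus", "Bus", "Aus", "Aus"),
  stringsAsFactors = FALSE
)

# Checking one level at a time collapses the homonyms into a single genus.
n_single_level <- length(unique(occ$genus))

# Checking the combination of levels keeps them apart.
unique_taxa <- unique(occ[, c("order", "genus")])
n_multi_level <- nrow(unique_taxa)
```

Here the single-level count is 2 and the multi-level count is 3, because "Aus" in "Xida" and "Aus" in "Yida" are retained as separate taxa.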

Unnecessary columns generated in palaeorotate

rot_age, rot_lng and rot_lat columns are generated by default in the palaeorotate function. However, these are not used by the "point" method and should therefore only be generated if the "grid" method is specified. Small bug, but annoying.

tax_unique() output which lists occurrences

Is your feature request related to a problem? Please describe.
tax_unique() currently provides a list of 'unique' taxon names, and does not utilise occurrence information.

Describe the solution you'd like
An alternative output format which provides occurrence information as nested lists within each 'unique' taxon might be useful.

lat_bins() occurrence binning and variable latitudinal range

Currently, if occurrences fall on boundaries in lat_bins() they are assigned to the higher bin number. It could be useful to add functionality to allow occurrences to be binned into both bins.
Currently, lat_bins() only covers the complete latitudinal range (i.e. -90 to 90 latitude). It could be useful to add functionality to allow user input ranges (e.g. -60 to 60).
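
Both requests can be sketched together in base R. The helpers make_lat_bins() and which_bins() are hypothetical, as is the bin numbering (1 = southernmost), which may differ from lat_bins().

```r
# Hypothetical helper: build latitudinal bins over a user-supplied range.
make_lat_bins <- function(size = 10, min_lat = -90, max_lat = 90) {
  stopifnot(min_lat < max_lat, (max_lat - min_lat) %% size == 0)
  edges <- seq(min_lat, max_lat, by = size)
  data.frame(
    bin = seq_len(length(edges) - 1),
    min = head(edges, -1),
    max = tail(edges, -1),
    mid = head(edges, -1) + size / 2
  )
}

# Return every bin that contains `lat`; a latitude on a shared boundary
# matches both neighbouring bins.
which_bins <- function(lat, bins) {
  bins$bin[lat >= bins$min & lat <= bins$max]
}

bins <- make_lat_bins(size = 20, min_lat = -60, max_lat = 60)
on_boundary <- which_bins(20, bins)
inside <- which_bins(25, bins)
```

A latitude of 20 falls on the edge between the fourth and fifth bins and is returned for both, whereas 25 is returned for the fifth bin only. The divisibility check assumes a whole-number bin size.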

tax_check()

A function for checking the taxonomy of occurrence data. This might not be necessary with all of the taxonomic packages out there (e.g., taxize) and the fossilbrush package which seems to be devoted specifically to this problem.

Allow custom time bins for tax_expand_time

Right now tax_expand_time() only supports GTS 2012 and GTS 2020, but we should allow it to support a custom user time bin dataframe like bin_time() does.

AUDIT: lat_bins()

The lat_bins() function is now ready for auditing. Sofía and Lucas could you please now audit the code/documentation and check it is behaving as expected?

Please also document any tests you throw at it and bugs you find. This will help with developing the automated testing for later.

For now, I would suggest creating your own branch and proofing the code there. However, if you prefer we can also meet online to discuss any issues that need resolving.

Thank you!!!

Sofía's audit complete
Lucas' audit complete

Multimodel binding issue in palaeorotate

When calling multiple GPMs at once, there seems to be a binding issue with the palaeocoordinates. Note, this is not an issue if palaeorotations are generated iteratively and must be related to chunk handling. This should be resolved quickly.

Rotation (MULLER2019)

Documentation for the MULLER2019 model has been updated on GPlates Web Service stating that the MULLER2019 model covers 0–250 Ma (as the paper also states). However, the API service allows points to be reconstructed up to 540 Ma for this model. A little digging required...

This should be addressed for v1.1.1.

As a side note, the MULLER2022 model is also now available and perhaps should be incorporated down the line.

Range plot for stratigraphic sections

Is your feature request related to a problem? Please describe.
This problem was raised by Meghan Jenkinson (via X): an R-based way to plot stratigraphic range and occurrence data for an individual section.

Describe the solution you'd like
An extension of tax_range_time which plots ranges across beds within a section, including points indicating specific sampled levels, and ideally with open points for uncertain identifications.

Describe alternatives you've considered
These figures are common and are typically made by hand for manuscripts, but it should be possible to generate them automatically from the input data.

Additional context
I already have some code from Alex Dunhill; I will be working on a full draft over the next couple of weeks.

axis_geo() is backwards on base R phylogenies

library(paleotree)
data(RaiaCopesRule)
plot(ceratopsianTreeRaia)
axis_geo()

AUDIT: palaeorotate()

The palaeorotate() function is now at a place I am happy with. Bethany, Emma, and Chris could you please now audit the code/documentation and check it is behaving as expected?

Please also document any tests you throw at it and bugs you find. This will help with developing the automated testing for later.

Three might seem overkill for checking this function, but I think this could perhaps be one of the most used functions. As such, I would like to ensure I haven't made a mistake somewhere.

For now, I would suggest creating your own branch and proofing the code there. However, if you prefer we can also meet online to discuss any issues that need resolving.

Thank you!!!

Bethany's audit complete
Emma's audit complete
Chris' audit complete

Reconstruction files

Currently, the reconstruction files are based on a 1° x 1° spatial grid. This should perhaps be updated to use a discrete equal-area grid. As implemented, points at high latitudes will be linked to reconstruction files at a higher geographic resolution than those at low latitudes.
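
The size of the effect can be estimated with a short calculation, assuming a spherical Earth with a mean radius of 6371 km (a simplification). The function cell_area() is a hypothetical helper for illustration.

```r
# Area (km^2) of a res x res degree cell on a sphere, given the latitude of
# its southern edge. Assumes a spherical Earth with a 6371 km mean radius.
cell_area <- function(lat_south, res = 1, radius = 6371) {
  lat1 <- lat_south * pi / 180
  lat2 <- (lat_south + res) * pi / 180
  radius^2 * (res * pi / 180) * (sin(lat2) - sin(lat1))
}

area_equator <- cell_area(0)   # cell touching the equator
area_mid <- cell_area(60)      # mid-to-high latitude cell
area_pole <- cell_area(89)     # cell touching the pole

# How many times larger an equatorial cell is than a polar one.
area_equator / area_pole
```

Cell area shrinks steadily towards the poles, which is why a fixed-degree grid gives a finer effective resolution at high latitudes.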

bin_spatial() antimeridian wrapping

Currently edge polygons are warped when they wrap around the antimeridian (-180/180). It would be nice to have better functionality for handling this. This is a known issue when representing a spheroid in 2D and is generally only a problem for visualisation purposes. However, it should be resolved at some point.

axis_geo() development

Functionality of the function

bin_spatial()

A function to bin occurrence data into spatial bins. I know @LewisAJones and @bethany-j-allen have their own ways of doing this. Perhaps it would also be useful to incorporate other methods, despite potential reservations (e.g., rectangular bins, the Close et al. MST method, etc).

Check function and argument names for consistency

We should go through the package and check all of the function and argument names for consistency before the CRAN release

Multi-model call bug in palaeorotate

Multi-model call in palaeorotate for the "point" method does not return all requested model coordinates, only the last called model.

DEVELOPMENT: time_binning()

Develop time_binning() function

Functionality:

Automated label scaling for axis_geo()

In deeptime::coord_geo(), there is the option to use ggfittext to scale the labels such that they don't overlap with one another and fit within their boxes. I'm not aware of something similar for base R, but we could try to emulate it for axis_geo(). @KEichenseer looked into it during development but couldn't figure out a solution, so it might not even be possible in base R?

Revdep check failure test-axis_geo.R:24

With the devel version of sf and terra, GDAL 3.6.0, this:
00check.log
testthat.Rout.zip
The test makes no sense to me, and setup-data.R also fails:

> library(palaeoverse)
> require(divDyn, quietly = TRUE)
>   stages <- deeptime::stages
>   periods <- deeptime::periods
>   data(corals)
Warning message:
In data(corals) : data set 'corals' not found
>   corals_stages_clean <- subset(corals, stage != "")
Error in subset(corals, stage != "") : object 'corals' not found
>   coral_div <- aggregate(cbind(n = genus) ~ stage,
+                          data = corals_stages_clean,
+                          FUN = function(x) length(x))
Error in eval(m$data, parent.frame()) : 
  object 'corals_stages_clean' not found

probably explaining why object 'coral_div' not found. In my case, there is no package called 'divDyn', but you do not condition on success. Either rewrite the test framework to respect settings where _R_CHECK_FORCE_SUGGESTS_=FALSE, or elevate packages needed for testing.
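
A minimal sketch of conditioning on a suggested package, using only base R. The wrapper with_suggested() is hypothetical. Inside testthat tests, testthat::skip_if_not_installed() serves the same purpose.

```r
# Hypothetical helper: evaluate `code` only when the suggested package `pkg`
# is installed; otherwise skip with a message and return NULL.
with_suggested <- function(pkg, code) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Skipping: package '", pkg, "' is not installed")
    return(invisible(NULL))
  }
  code
}

# 'stats' ships with R, so this block runs.
med <- with_suggested("stats", stats::median(c(1, 3, 10)))

# A deliberately non-existent package name, so this block is skipped.
skipped <- with_suggested("notARealPackage123", stop("never evaluated"))
```

Because R evaluates arguments lazily, the code passed to the wrapper is never run when the package is missing, so setup code that depends on a suggested package does not error.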

tax_unique() should allow arbitrary higher levels of taxonomy

Right now, tax_unique() only allows for family, class, and order, but I imagine lots of datasets out there have other higher taxonomic levels (e.g., subfamily). I think we should leave genus/species/binomial/name the way it is, but collapse the higher level arguments to a single argument that takes a vector, then just loop over those similar to how we do now with order/class/family. Since you would use indet. for any of those, we should theoretically be able to take any arbitrary set of ordered higher levels of taxonomy.

AUDIT: time_bins()

The time_bins() function is now ready for auditing. Kilian and Alessandro could you please now audit the code/documentation and check it is behaving as expected?

Please also document any tests you throw at it and bugs you find. This will help with developing the automated testing for later.

For now, I would suggest creating your own branch and proofing the code there. However, if you prefer we can also meet online to discuss any issues that need resolving.

Thank you!!!

Kilian's audit complete
Alessandro's audit complete

Make all links more accessible

Copied from this guide:

Use concise and meaningful text for links
Do not capitalize all letters in links
Avoid using URLs for link text
Do not use the word "link" as part of the link text
Do not use tooltips/screentips to add additional information

sp retirement

Loading the package now results in the following message:

The legacy packages maptools, rgdal, and rgeos, underpinning this package
will retire shortly. Please refer to R-spatial evolution reports on
https://r-spatial.org/r/2023/05/15/evolution4.html for details.
This package is now running under evolution status 0

My best guess is that this is due to our dependency on geosphere, which uses sp. It doesn't look like geosphere has any intention of moving to sf, so maybe we need to find another package to use?

More details: https://r-spatial.org/r/2023/05/15/evolution4.html

Edit: Looks like this functionality is built into sf: https://cran.r-project.org/web/packages/sf/vignettes/sf7.html

Quantify ghost ranges

Is your feature request related to a problem? Please describe.
Quantifying ghost ranges is useful for thinking about the sampling completeness of a dataset during exploration.

Describe the solution you'd like
This could include two facets:

Using occurrence data, quantifying gaps between the oldest and youngest samples of each taxon, and summarising this across the whole dataset - a table would be useful but some sort of graphical output would also be nice
Using occurrence data and a phylogeny, quantifying the proportion of branch lengths represented by samples - would have to come with a bunch of warnings that the completeness only applies to the sampled, not the "true" phylogeny, but would be really useful as e.g. an indicator of the prior to use in a Bayesian analysis which needs a sampling estimate
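
The first facet could be sketched as follows, assuming occurrences have already been assigned to numbered time bins. The data and the helper range_gaps() are hypothetical.

```r
# Toy occurrences: taxon names and the time bin each was found in.
occ <- data.frame(
  taxon = c("A", "A", "A", "B", "B", "C"),
  bin = c(1, 2, 5, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per taxon, count the bins inside the observed range
# (first to last occurrence) that contain no occurrences.
range_gaps <- function(occ) {
  per_taxon <- lapply(split(occ$bin, occ$taxon), function(b) {
    span <- seq(min(b), max(b))
    c(first = min(b), last = max(b),
      n_bins = length(span), n_gap = length(setdiff(span, b)))
  })
  out <- as.data.frame(do.call(rbind, per_taxon))
  out$prop_sampled <- 1 - out$n_gap / out$n_bins
  out
}

gaps <- range_gaps(occ)
gaps
mean(gaps$prop_sampled)  # one possible dataset-level summary
```

Taxon "A" ranges through bins 1 to 5 but is missing from bins 3 and 4, so 60% of its range is sampled. The per-taxon table could feed the tabular output and the proportions a graphical one.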

Describe alternatives you've considered
I'm not sure if anything like this already exists - perhaps there might be something in a package more geared towards biostratigraphy?

Additional context
None.

DEVELOPMENT: lat_plot()

Develop function for generating quick (palaeo-)latitudinal plots. The basis of the function is almost there but it still requires:

Further documentation
Error handling
Streamlining of code

If you have any other suggestions/additions Sofía, please feel free to go ahead and implement them.

Palaeorotate function

Tasks to be completed for the palaeorotate function

Function documentation
Submit for auditing

tax_unique()

A function for removing duplicate taxa from a dataset. I believe this is already performed by the fossilbrush package, but I'll leave it up to the assignees to determine if there are any gaps that still need to be filled.

look_up()

The original idea of this function was to provide a conversion table for several regional and international time scales (similar to GeoWhen). This seems like a pretty lofty goal, and maybe focusing on a smaller number of time scales would be more tractable (for now). Alternatively, using interval ages to correlate across time scales instead of (bio)stratigraphic correlations might be more manageable and easily updatable?

Greater plot customisability in tax_range_time

Is your feature request related to a problem? Please describe.
Kateryn Pino on our Google Group requested additional customisability of plots in tax_range_time. This might be desirable for some users to refine plots for publications.

Describe the solution you'd like
Basically, we should allow greater customisability. I guess we could pass standard plot arguments using ... - it seems it would be the cleanest way.
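
A sketch of the ... approach in base graphics is given below. The function plot_ranges() is a hypothetical stand-in, not the tax_range_time() implementation. Merging user arguments over the defaults with modifyList() avoids "formal argument matched by multiple actual arguments" errors when a user overrides something the function also sets.

```r
# Hypothetical stand-in for a range plot: named arguments supplied in `...`
# override the defaults and are forwarded to plot().
plot_ranges <- function(taxa, max_ma, min_ma, ...) {
  n <- length(taxa)
  defaults <- list(
    x = NA,
    xlim = c(max(max_ma), min(min_ma)),  # oldest ages on the left
    ylim = c(0.5, n + 0.5),
    xlab = "Time (Ma)", ylab = "", yaxt = "n"
  )
  args <- modifyList(defaults, list(...))
  do.call("plot", args)
  segments(x0 = max_ma, y0 = seq_len(n), x1 = min_ma, y1 = seq_len(n))
  axis(side = 2, at = seq_len(n), labels = taxa, las = 1)
  invisible(args)
}

grDevices::pdf(NULL)  # null device so the sketch runs without a display
used <- plot_ranges(
  taxa = c("Taxon A", "Taxon B"),
  max_ma = c(100, 80), min_ma = c(70, 66),
  main = "Custom title", xlab = "Age (Ma)"
)
invisible(grDevices::dev.off())
```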

Describe alternatives you've considered
An alternative would be to set up a number of expected customisable arguments (e.g. colour, title, etc), but this seems unnecessary and a lot more work.

Additional context
Nothing to add.

@KEichenseer @ChristopherDavidDean happy to review if I put a PR together?

Atdabanian/Botomian in interval_key incorrect

Describe the bug
For the interval name Atdabanian/Botomian in interval_key the early stage and late stage are incorrect. Currently, the early stage is "Stage 3" and the late stage "Stage 2". This should be the opposite.

To Reproduce

interval_key[which(interval_key$interval_name == "Atdabanian/Botomian"), ]

DEVELOPMENT: palaeorotate() function

I think the palaeorotate() function is more or less there now!

Lucas, could you take a look through the code and documentation? Given your past 6 months of work, you're probably in the best position to give this an initial check before we pass it onto the team for formal review! We also need to think about the comparison (using all PBDB data) between rotations generated via the function, and actually using GPlates (essentially, point rotations vs. grid rotations). Although, I think this is something more for the eventual manuscript.

11-development-palaeorotate

Merci!

Update CONTRIBUTING.md

According to this app found by @willgearty, our CONTRIBUTING.md is not up to scratch. This should be updated going forward.

geo_check/clean

It would be great to have a function that takes interval names, checks them for spelling mistakes (similar to tax_check), and cleans them up. For example, interval names are sometimes provided as time1/time2 or time1–time2 in a single column, or contain information that the user might want to discard (e.g. early Maastrichtian might be reduced to Maastrichtian).
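
A rough sketch of the cleaning and spell-checking steps in base R follows. The helpers clean_intervals() and closest_name() are hypothetical, and the list of valid names is a toy example rather than a real timescale.

```r
# Hypothetical helper: split combined interval names and drop informal
# qualifiers such as "early" or "late".
clean_intervals <- function(x) {
  # Treat an en dash like a slash, then split on the slash.
  x <- gsub("\u2013", "/", x, fixed = TRUE)
  parts <- strsplit(x, "/", fixed = TRUE)
  lapply(parts, function(p) {
    p <- trimws(p)
    sub("^(early|middle|late|lower|upper)\\s+", "", p, ignore.case = TRUE)
  })
}

# Hypothetical helper: suggest the valid name with the smallest edit distance.
closest_name <- function(x, valid) {
  valid[which.min(utils::adist(x, valid))]
}

cleaned <- clean_intervals(c("early Maastrichtian", "Atdabanian/Botomian",
                             "Albian\u2013Cenomanian"))
suggestion <- closest_name("Maastrichian", c("Maastrichtian", "Campanian"))
```

Formal names that begin with a qualifier (e.g. Late Cretaceous) would need to be protected before stripping, which this sketch does not attempt.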

tax_range()

A function for calculating the temporal, latitudinal, and/or spatial range of taxa (based on occurrence data).

tax_spat_expand

I can envision a function that is like tax_time_expand but for latitudinal bins, although I'm not sure range-through is a great assumption for spatial studies. (Also, both functions should probably have their names changed to tax_expand_{spat/time} to mirror the verb construction of our other function names.)

Vignette

We should add a vignette to showcase an example workflow using all/most of the functions in the package.

Build package website

At some point, we should establish a website for the package (after we get the first version up and running). This can be done very easily using pkgdown (https://pkgdown.r-lib.org/articles/pkgdown.html). An example of how this can look: https://tidyverse.tidyverse.org/.

dateline issues with bin_space plot

The plot that is output by bin_space appears to have some dateline problems.

Here is the output of the current documented example:

# Get internal data
data("reefs")

# Reduce data for plotting
occdf <- reefs[1:250, ]

# Bin data using a hexagonal equal-area grid
ex1 <- bin_space(occdf = occdf, spacing = 500, plot = TRUE)

# Bin data using a hexagonal equal-area grid and sub-grid
ex2 <- bin_space(occdf = occdf, spacing = 1000, sub_grid = 250, plot = TRUE)

phylo_check()

Is your feature request related to a problem? Please describe.
For checking the tip names in a phylogeny against a list of taxa and trimming the phylogeny if desired.

Describe the solution you'd like
A table cross-matching a list of taxon names with the list of tip names, with an additional option to trim the phylogeny.
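
The cross-matching table could be as simple as the sketch below, which works on plain character vectors. The helper match_tips() is hypothetical. For an ape "phylo" object the tip names would come from tree$tip.label, and trimming could then be done with ape::drop.tip().

```r
# Hypothetical helper: cross-match a taxon list against phylogeny tip labels.
match_tips <- function(taxa, tips) {
  all_names <- sort(union(taxa, tips))
  data.frame(
    name = all_names,
    in_list = all_names %in% taxa,
    in_tree = all_names %in% tips,
    stringsAsFactors = FALSE
  )
}

taxa <- c("Triceratops", "Styracosaurus", "Centrosaurus")
tips <- c("Triceratops", "Centrosaurus", "Protoceratops")
tab <- match_tips(taxa, tips)

# Tips absent from the taxon list are candidates for trimming.
to_drop <- tab$name[tab$in_tree & !tab$in_list]
```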

Describe alternatives you've considered
I don't know of other functions that do this - it might be available in e.g. paleotree - but I think it would make a nice complement to the rest of the functions here anyway.

Additional context
None

Summarise abundance in occurrence datasets

Is your feature request related to a problem? Please describe.
Understanding the abundance distribution within a set of occurrences is a fundamental way of exploring a dataset, but we currently don't have any functions that do this.

Describe the solution you'd like
It would be great to have a function that summarises the abundance distribution in an occurrence dataset and outputs:

a table (of taxa and their (relative) abundances)
a graph (of the shape of the ranked abundance distribution)
a string (vector of abundances suitable for input into other analysis functions, e.g. iNEXT, vegan)
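
The table and vector outputs listed above can be sketched in a few lines of base R, assuming one row per occurrence and made-up taxon names. The ranked abundance graph could then be drawn from the same vector with plot().

```r
# One entry per occurrence (made-up taxon names).
occ_genera <- c("A", "B", "A", "C", "A", "B", "D")

# Count occurrences per taxon, most abundant first.
counts <- sort(table(occ_genera), decreasing = TRUE)

# Table of taxa with absolute and relative abundances.
abund_table <- data.frame(
  taxon = names(counts),
  n = as.integer(counts),
  rel = as.numeric(counts) / sum(counts),
  stringsAsFactors = FALSE
)

# Plain abundance vector, ranked from most to least abundant.
abund_vec <- as.integer(counts)
```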

Describe alternatives you've considered
iNEXT has functions to create strings but it would be nice to have a more flexible function specifically designed to do this from a wider range of input formats.

Additional context
It would be great to expand this to model fitting on the distribution, but this is straying into analysis rather than data exploration.
