moodymudskipper / inops
Infix Operators for Detection, Subsetting and Replacement
License: GNU General Public License v3.0
It should be quick enough to implement, and should be provided for all set/range/regex variants.
It should be as simple as:
`%#in{}%` <- function(x, set) sum(x %in{}% set, na.rm = TRUE)
where I leave the na.rm choice to your judgment, as I'm not sure I'll use those much.
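As a rough illustration of the counting operator suggested above, here is a sketch built on base %in% rather than on %in{}%, so that it runs without the package. The name %#in% is hypothetical, and na.rm = TRUE is kept only as a placeholder because base %in% never returns NA.

```r
# Hypothetical counting operator: number of elements of x found in set.
# Built on base `%in%` here; the proposal would wrap `%in{}%` instead.
`%#in%` <- function(x, set) sum(x %in% set, na.rm = TRUE)

x <- c(1, 2, NA, 5, 5)
n_hits <- x %#in% c(1, 5)   # counts the 1 and both 5s
```

The same one-line wrapper pattern would apply to the range and regex variants, each summing the logical vector returned by the matching detection operator.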
@KKPMW
NA %in{}% NA is NA, consistent with ==, but this might benefit from an example as it's another difference from %in%.
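A small example of that difference might look like the following. The first two lines are plain base R; in_keep_na is a hypothetical stand-in that only handles NA on the left-hand side, not the package's actual %in{}%.

```r
# Base R behaviour: `==` propagates NA, `%in%` does not.
na_eq <- NA == NA      # NA
na_in <- NA %in% NA    # TRUE

# Hypothetical NA-preserving variant: an NA on the left stays NA.
in_keep_na <- function(x, set) {
  res <- x %in% set
  res[is.na(x)] <- NA
  res
}

res <- in_keep_na(c(1, NA, 3), c(3, NA))   # FALSE NA TRUE
```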
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# pure characters: everything works as it should
"a" == "a" # TRUE
"a" == list("a") # TRUE
list("a") == "a" # TRUE
list("a") == list("a") # fails
list("a") %in% list("a") # TRUE
"a" %in{}% "a" # TRUE
"a" %in{}% list("a") # TRUE
list("a") %in{}% "a" # TRUE
list("a") %in{}% list("a") # TRUE
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# factors on the right: we observe inconsistencies
"a" == factor("a") # TRUE
"a" == list(factor("a")) # FALSE
list("a") == factor("a") # TRUE
list("a") == list(factor("a")) # fails
list("a") %in% list(factor("a")) # FALSE
"a" %in{}% factor("a") # TRUE
"a" %in{}% list(factor("a")) # TRUE, should be FALSE for consistency
list("a") %in{}% factor("a") # NA + warning, should be TRUE
list("a") %in{}% list(factor("a")) # NA + warning, should be FALSE ?
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# factors on the left: everything works as it should
factor("a") == "a" # TRUE
factor("a") == list("a") # TRUE
list(factor("a")) == "a" # FALSE
list(factor("a")) == list("a") # fails
list(factor("a")) %in% list("a") # FALSE
factor("a") %in{}% "a" # TRUE
factor("a") %in{}% list("a") # TRUE
list(factor("a")) %in{}% "a" # FALSE
list(factor("a")) %in{}% list("a") # FALSE
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# factors on both sides
factor("a") == factor("a") # TRUE
factor("a") == list(factor("a")) # FALSE
list(factor("a")) == factor("a") # FALSE
list(factor("a")) == list(factor("a")) # fails
list(factor("a")) %in% list(factor("a")) # TRUE
factor("a") %in{}% factor("a") # TRUE
factor("a") %in{}% list(factor("a")) # TRUE, should be FALSE for consistency
list(factor("a")) %in{}% factor("a") # FALSE
list(factor("a")) %in{}% list(factor("a")) # FALSE, should be TRUE
to do
Hello @moodymudskipper , long time no see.
I've been using the package and recently stumbled upon a scenario where a particular operator might be handy. What I wanted to do is select, within a loop, first the top 10% of values, then the top 10%-20%, then the top 20%-30%, etc.
Do you think we should add a separate operator for this, or do you think this is better covered by other functions like cut?
Just posting for opinions.
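For comparison, here is a sketch of how the banded selection could be done today with base quantile() and cut(). The data and variable names are made up for illustration.

```r
set.seed(1)
x <- runif(100)

# Decile breakpoints, then a band index from 1 (lowest 10%) to 10 (top 10%).
breaks <- quantile(x, probs = seq(0, 1, by = 0.1))
band <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Walk down from the top band: top 10%, then top 10%-20%, then top 20%-30%.
selected <- list()
for (k in 10:8) {
  selected[[as.character(k)]] <- x[band == k]
}
```

A dedicated operator would mainly save the step of computing the band index by hand.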
README file is still missing. Should include all the operators and the rationale for each of them, as well as examples.
When attaching our package, the following text is displayed :
library(inops)
Attaching package: ‘inops’
The following object is masked from ‘package:base’:
<<-
This happens because our package does redefine this operator. There is no need to worry, however: it doesn't affect base or packaged code in any way (these won't ever "see" inops::<<-).
Moreover, if you use <<- in its binary form x <<- y you will not find any difference. The operator has been redefined so that the syntax x < y <- value can be supported. Just as class(x) <- value calls the function class<- and x == y <- value calls the function ==<-, x < y <- value calls <<-. You can try it in a session where inops is not attached:
x <- 1
y <- 2
x < y <- 3
#> Error in x < y <- 3: incorrect number of arguments to "<<-"
Created on 2019-11-22 by the reprex package (v0.3.0)
Our package simply implements the 3 parameter usage and leaves the original binary usage unaffected.
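The mechanism can be inspected without masking anything. The sketch below looks at how R parses the expression, then defines a replacement function for == to show the same dispatch. The body of ==<- is an assumed illustration, not the package's code.

```r
# How R parses `x < y <- 3`: an assignment whose target is the call `x < y`.
e <- quote(x < y <- 3)
assign_fun <- as.character(e[[1]])        # "<-"
target_fun <- as.character(e[[2]][[1]])   # "<"

# R looks up the replacement function named after the target's function plus "<-".
replacement_name <- paste0(target_fun, "<-")   # "<<-"

# The same mechanism with `==` (assumed sketch of a replacement function).
`==<-` <- function(e1, e2, value) replace(e1, e1 == e2, value)

x <- c(1, 2, 3, 2)
x == 2 <- 0   # evaluated as x <- `==<-`(x, 2, value = 0)
```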
Now for those worried about the performance cost, here is a benchmark, where I ran each instruction a million times :
bench::mark(iterations= 10^6,
base::`<<-`(x,1),
inops::`<<-`(x,1)
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base::`<<-`(x, 1) 3.7us 4.3us 192856. 0B 22.0
#> 2 inops::`<<-`(x, 1) 7.4us 8.3us 110235. 34.8KB 24.6
Created on 2019-11-22 by the reprex package (v0.3.0)
inops::`<<-` has an overhead of about 4us. This means that if you call it one million times it'll take about 8 seconds rather than 4.
Given that this overhead will disappear as soon as you package your function, and that most experienced users will recommend staying as far away from this function as possible, we believe there's no cause for worry.
Note that we won't ever try to hide this warning, but we might find a way to make it more understandable to all.
I was working with examples in the help files and stumbled upon this by accident:
x <- c(1, NA, 2)
x %out[]% 2 <- x
Error in x[list] <- values :
NAs are not allowed in subscripted assignments
This is of course an unintended usage, but the message is a bit unintuitive, all the more so because if x didn't have missing values it would work:
x <- c(1, 1.5, 2)
x %out[]% 2 <- x
Warning message:
In x[list] <- values :
number of items to replace is not a multiple of replacement length
x
[1] 1.0 1.5 2.0
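The error comes from base subassignment, which tolerates NA in the index only when the replacement has length one. A minimal base R sketch of the behaviour, with a possible guard using which() (an assumption about how the operators could handle it, not what the package does):

```r
x <- c(1, NA, 2)
idx <- x > 1                 # FALSE NA TRUE

# Length-one replacement: the NA index is silently skipped.
a <- x
a[idx] <- 0

# Longer replacement: base R refuses because of the NA in the index.
b <- x
err <- tryCatch({ b[idx] <- c(10, 20, 30); NULL }, error = function(e) e)

# One possible guard: drop NA positions with which() before assigning.
safe <- x
safe[which(idx)] <- 0
```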
All talk about naming conventions can happen here.
I chose %subset***% to extract the matching subset; maybe there's something better.
I like %!in% better than %out% because:
%in% is explicit
!=
%!subset[]%
I picked rangeops as a package name because comparisons separate ranges of values as well, so it seemed to make sense; I'm open to any alternative. My previous package mmassign was more about having all kinds of assignment operators, but it's probably better to be more focused.
It detects invalid syntax such as >=(e1, e2) <- value.
checking Rd \usage sections ... WARNING
Bad \usage lines found in documentation object 'comparison_ops':
>=(e1, e2) <- value
>(e1, e2) <- value
<=(e1, e2) <- value
<(e1, e2) <- value
==(e1, e2) <- value
!=(e1, e2) <- value
Bad \usage lines found in documentation object 'element_wise_in':
%in{}%(x, table) <- value
%!in{}%(x, table) <- value
Bad \usage lines found in documentation object 'in_variants':
%in%(x, table) <- value
%!in%(x, table) <- value
Bad \usage lines found in documentation object 'like_variants':
%like%(x, pattern) <- value
%!like%(x, pattern) <- value
Bad \usage lines found in documentation object 'range_ops':
%in()%(x, range) <- value
%!in()%(x, range) <- value
%in(]%(x, range) <- value
%!in(]%(x, range) <- value
%in[)%(x, range) <- value
%!in[)%(x, range) <- value
%in[]%(x, range) <- value
%!in[]%(x, range) <- value
Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
The code of in_regex should be reviewed and look more like %in{}%; the rhs should be just as flexible.
"a" %in~% c("^b","^a") should definitely return TRUE.
"ab" %in~% quote(a) seems absurd but should be supported for consistency.
Replace ops are built around x <- replace(x, cond, value), with specific cases for factors and for a condition that is never met.
I'm questioning both special cases.
First case was to counter this :
x <- factor(letters[1:3])
x[x == "b"] <- "Z"
#> Warning in `[<-.factor`(`*tmp*`, x == "b", value = "Z"): invalid factor level,
#> NA generated
x
#> [1] a <NA> c
#> Levels: a b c
so x == "b" <- "Z" would replace the levels smoothly.
I think it might have been a bad idea, as we could simply do levels(x) == "b" <- "Z" in that case, with more consistency and less ambiguity in some corner cases.
The special case for when the condition is never met was designed so that the following wouldn't alter x:
x <- 1:3
x < 0 <- "a"
But now I think it's good that x[x < 0] <- "a" coerces to character even if the condition is never met, for the sake of type stability. Our assignment functions would also be more intuitive if they were all really simple shortcuts, with cond(x) <- value equivalent to x[cond(x)] <- value.
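Both points can be checked in base R alone. The sketch below shows the coercion when nothing matches, and the levels() route for relabelling a factor; variable names are illustrative.

```r
# Base subassignment coerces even when no element matches the condition.
x <- 1:3
x[x < 0] <- "a"
x_class <- class(x)   # "character"

# Relabelling a factor level through levels() avoids the invalid-level warning.
f <- factor(letters[1:3])
levels(f)[levels(f) == "b"] <- "Z"
f_values <- as.character(f)   # "a" "Z" "c"
```

With the special cases removed, a replacement operator would reduce to the first pattern and inherit its coercion rules unchanged.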
@KKPMW do you see a problem with simplifying those?
Do we agree that they should be consistent?
If so, we still have some work to do on them:
> test_chr <- c(names(iris), NA)
> test_num <- c(1:5,NA)
> test_list1 <- c(as.list(c(1:5,NA)),
+ as.list(c(letters[1:5], NA)),
+ as.list(factor(c(letters[1:5], NA))))
> test_list2 <- list(1:3, c(4:5,NA) ,
+ letters[1:3], c(letters[4:5], NA),
+ factor(c(letters[1:5], NA)))
> test_df <- data.frame(
+ a = 1:3, b = c(4:5,NA), c = letters[1:3], d = c(letters[4:5], NA),
+ e = factor(letters[1:3]), f = factor(c(letters[4:5], NA)),stringsAsFactors = FALSE)
>
> test_chr == 3
[1] FALSE FALSE FALSE FALSE FALSE NA
> test_list1 == 3
[1] FALSE FALSE TRUE FALSE FALSE NA NA NA NA NA NA NA FALSE FALSE TRUE FALSE FALSE NA
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
3: NAs introduced by coercion
4: NAs introduced by coercion
5: NAs introduced by coercion
> test_list2 == 3
Error: (list) object cannot be coerced to type 'double'
> test_df == 3
a b c d e f
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE NA FALSE NA FALSE NA
>
>
> test_chr %in{}% 3
[1] FALSE FALSE FALSE FALSE FALSE NA
> test_list1 %in{}% 3
[1] FALSE FALSE TRUE FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE NA
> test_list2 %in{}% 3
[[1]]
[1] FALSE FALSE TRUE
[[2]]
[1] FALSE FALSE FALSE
[[3]]
[1] FALSE FALSE FALSE
[[4]]
[1] FALSE FALSE FALSE
[[5]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> test_df %in{}% 3
a b c d e f
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE NA FALSE NA FALSE NA
>
>
> test_chr %in% 3
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> test_list1 %in% 3
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
> test_list2 %in% 3
[1] FALSE FALSE FALSE FALSE FALSE
> test_df %in% 3
[1] FALSE FALSE FALSE FALSE FALSE FALSE
this deserves its own issue.
I don't have a lot of ideas so far, apart from the fact I don't like the current one :).
The package checks, replaces, subsets, etc, and it uses infix operators for all operations.
infixer wasn't a bad name.
operators is taken, ops is catchy but not very specific.
could be a less technical but more evocative name, a tool, an animal, a sound...
could be an acronym made from what it does, wouldn't be very evocative but would be short enough to remember easily.
is not done yet
A lot of copy and paste so far.
Better use a few general functions that other functions wrap
This issue again :) But I think I have a proposal that might actually work, unless I am missing something.
In short, replacing multiple values at once would be a nice step, especially when:
numerics are replaced with characters
x <- 1:9
x %in[)% c(1,4) <- "low"
x %in[)% c(4,7) <- "middle" # no longer a numeric vector...
regex replacements overlap with new checks
x <- c("house", "home", "bus", "boat", "car")
x %in~% c("^h") <- "building"
x %in~% c("^b", "^c") <- "transportation" # now building will be replaced too
counts will change and overlap
x <- c("a", "b", "b", "c", "c", "c")
x %in#% 1:2 <- "rare"
x %in#% 3:4 <- "common" # now rare will be replaced with common
To overcome these, a syntax allowing multiple values to be replaced at once would help. My suggestion, which I'd like some feedback on, is to do this if the rhs is a list:
x <- 1:9
x %in[)% c(1,4,7,10) <- list("low", "medium", "high")
Of course it's a bit tricky how this would work with %out%; it might be non-intuitive, but it would probably be included for consistency.
@moodymudskipper what do you think?
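To make the proposal concrete, here is a sketch of the semantics for the left-closed range case, written as a plain function on top of base findInterval(). The name replace_ranges and its argument layout are hypothetical; the point is only that all bins are computed from the original x before anything is replaced.

```r
# Hypothetical helper: replace values falling in [breaks[i], breaks[i + 1])
# with values[[i]], computing every bin from the original x in one pass.
replace_ranges <- function(x, breaks, values) {
  stopifnot(length(values) == length(breaks) - 1)
  bin <- findInterval(x, breaks)   # 0 below the first break, length(breaks) at or above the last
  hit <- !is.na(bin) & bin >= 1 & bin < length(breaks)
  out <- x
  out[hit] <- unlist(values)[bin[hit]]
  out
}

x <- 1:9
res <- replace_ranges(x, c(1, 4, 7, 10), list("low", "medium", "high"))
```

Because the bins come from the untouched input, the earlier problem of a first replacement changing the type or content seen by the second one does not arise.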
I've bookmarked this thread a few months ago : https://twitter.com/dgkeyes/status/1164625536406212610
From it I made a rough checklist of what makes a package better or makes it look its best.
It can apply to any package and I'll probably use it for my other stuff, shoot if you have more ideas.
Maybe there are some things here that apply to inops :
Package design :
Documentation + online presence:
Advertising package (similar to "online" presence above, but with a timestamp):
github Badges :
Activity indicators :
Things that are outside of our direct control :
@KKPMW
Thanks,
If there are references describing the methods in your package, please
add these in the description field of your DESCRIPTION file in the form
authors (year) doi:...
authors (year) arXiv:...
authors (year, ISBN:...)
or if those are not available: https:...
with no space after 'doi:', 'arXiv:', 'https:' and angle brackets for
auto-linking.
Please fix and resubmit, and document what was changed in the submission
comments.
Best,
Jelena Saf
@KKPMW FYI : r-lib/conflicted#48
Maybe we should get a checklist for what functionality to finish before first version release to CRAN? Might also create a milestone for this.
library(rangeops)
#>
#> Attaching package: 'rangeops'
#> The following object is masked from 'package:base':
#>
#> <<-
x <- quote(a)
# works
x == x
#> [1] TRUE
# doesn't
x %in% x
#> Error in match(x, table, nomatch = 0L): 'match' requires vector arguments
# doesn't either because it's based on `%in%`
x %in{}% x
#> Error in match(x, table, nomatch = 0L): 'match' requires vector arguments
# But shouldn't it work regarding our consistency principles with `==` ?
# the following seems to fix it :
`%in{}%` <- function(x, table) {
  if (is.atomic(x)) {
    res <- x %in% table
  } else {
    res <- lapply(x, function(a, b)
      if (is.language(a)) any(a == b) else a %in% b, table)
  }
  attributes(res) <- attributes(x)
  if (!is.language(x)) res[is.na(x)] <- NA
  simplify2array(res)
}
y <- quote(b)
x %in{}% x
#> [1] TRUE
x %in{}% y
#> [1] FALSE
x %in{}% c(x,y)
#> [1] TRUE
c(x,y) %in{}% x
#> [1] TRUE FALSE
c(x,y) %in{}% y
#> [1] FALSE TRUE
c(x,y) %in{}% c(x,y)
#> [1] TRUE TRUE
Created on 2019-09-27 by the reprex package (v0.3.0)
While adding examples for replacement operators, I badly wanted to do this:
cars <- rownames(mtcars)
cars %in~% "^Mazda" <- toupper(cars)
Instead of a more elaborate
cars %in~% "^Mazda" <- toupper(cars %[in~% "^Mazda")
But on the other hand - this might be a bit confusing. @moodymudskipper what do you think?
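For reference, the base R form that both spellings abbreviate is sketched below, using grepl() for the regex match. It makes explicit that the right-hand side has to be subset with the same condition as the left.

```r
cars <- rownames(mtcars)

# Match once, then use the same logical index on both sides of the assignment.
is_mazda <- grepl("^Mazda", cars)
cars[is_mazda] <- toupper(cars[is_mazda])
```

The shorter form in the question would require the operator to subset a full-length right-hand side implicitly, which is where the potential confusion comes from.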
@KKPMW said :
What I would like to see in this package:
in[] in{} in(), etc
out[] (or !in[]) variants
variants that work with tables (so replace values that occur some n of times)
variants that help with "cut" so maybe x %#cut% 5 <- LETTERS[1:5] would cut x into 5 intervals and name them A-E.
@KKPMW Our package functions produce NAs, which might not always be desirable for all situations:
c(1:4) %in{}% c(4,NA,3)
#> [1] NA NA TRUE TRUE
c(1:4) %[in{}% c(4,NA,3)
#> [1] NA NA 3 4
Do you think it would stay within the scope of the package to add helpers to remove NAs or replace them with FALSE (or, by symmetry though less usefully, with TRUE)?
c(1:4) %in{}% na_rm(c(4,NA,3))
#> [1] FALSE FALSE TRUE TRUE
c(1:4) %[in{}% na_rm(c(4,NA,3))
#> [1] 3 4
na_false(c(1:4) %in{}% c(4,NA,3))
#> [1] FALSE FALSE TRUE TRUE
@KKPMW said:
The only thing I am a bit unsure about is the formula notation. The stop thing can probably be written as stopifnot(any(vec >= 1000)). And <- looks like an assignment operator, so it's a bit weird to assign a stop. On the other hand, multiplying all the matching entries by some constant (or applying any other function to them) is very convenient. I am just wondering if we could do something like x %f==% 2 <- list('*', 0.7) instead. Though the long form is x[x==2] <- x[x==2]*0.7, which almost looks nicer...
%out% is not consistent with ==, so it needs to be on another help page.
There could be %[in% and %[out% too.
Replace and subset ops can be on the same help page, as they're all short forms for replace(x, x %op% y, value) and x[x %op% y] (no check, no tweak), so it's intuitive and consistent.
y <- iris[1:2]; iris %in% y <- NULL will remove matched columns, for example. Our current ops can't do that.