`%#in%` family about inops HOT 18 CLOSED

moodymudskipper avatar moodymudskipper commented on June 26, 2024
`%#in%` family

from inops.

Comments (18)

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024 1

Hmm my thought as always - it should be consistent as much as possible with all the other operations... For now I think we do not allow assigning new levels to factors. I am not sure yet if this is a good or bad idea.

An argument can be made that we are preventing silly users from making mistakes for themselves, while at the same time forcing more sophisticated users to do several additional steps (like adding new levels)?

Maybe for now let's leave the current behaviour we have for factors and not worry about it. Adding a level once should not be a big deal. If one is using factors anyway - he/she will probably want to control the levels themselves.
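For reference, here is a small base R sketch of what "adding a level once" looks like in practice. It does not use inops at all; the variable names are made up for illustration, and the `droplevels()` call at the end is an optional extra, not something proposed in this thread.

```r
# Base R only: assigning a value that is not an existing level gives NA
# (plus a warning), so the level has to be added first.
z   <- factor(c("a", "a", "b", "c"))
idx <- z %in% c("b", "c")

# Without adding the level: the assigned elements become NA.
z_bad <- z
suppressWarnings(z_bad[idx] <- "Other")

# Adding the level once, then assigning, works as intended.
z_ok <- z
levels(z_ok) <- c(levels(z_ok), "Other")
z_ok[idx] <- "Other"
z_ok <- droplevels(z_ok)  # optional: drop the now unused levels "b" and "c"
```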

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024 1

Also, I think we should add %out#% variants, to allow dropping items that occur only a few times without requiring x %in#% 6:999999.

moodymudskipper avatar moodymudskipper commented on June 26, 2024 1

The problem with using %in% is that matrices and data frames don't behave right (at least not as expected compared to the other functions of the package). I think we're better off using %in{}% and having a special case for a length 0 rhs (should return a vector or matrix of FALSE; it now always returns a vector, before it failed), and also a special case for a length 0 lhs (it failed before and fails now, but should return logical(0), like integer(0) == 1).

Which reminds me that we didn't test n>2 dimensional arrays, but I think our code so far should handle them properly.
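The base R behaviour behind these points can be checked with a short sketch. `in_keep_dim()` is a hypothetical helper written only for this illustration; it is not the package's %in{}% implementation, just one way the described special cases could look.

```r
m <- matrix(1:6, nrow = 2)

# %in% returns a plain vector and drops the dim attribute ...
in_res <- m %in% c(1, 4)
# ... whereas == keeps the matrix shape.
eq_res <- m == 1

# A zero-length left-hand side gives logical(0) with ==.
empty_lhs <- integer(0) == 1

# Hypothetical helper: apply %in% and restore the shape of x, if it has one.
in_keep_dim <- function(x, table) {
  res <- x %in% table
  if (!is.null(dim(x))) dim(res) <- dim(x)
  res
}

in_keep_dim(m, integer(0))           # 2 x 3 matrix of FALSE
in_keep_dim(integer(0), 1)           # logical(0)
in_keep_dim(array(1:24, 2:4), 1:3)   # keeps all three dimensions
```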

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

I see I am bad at explaining, so the purpose of %#in{}% is still not clear enough... It is not for getting the count of elements within the set, but rather for working on the counts of unique elements within x. Think about it like this: how do we assign groups that occurred fewer than 5 times to the group "Other"?

flights$tailnum %#in[]% c(0,5) <- "Other"

So a few issues here:

  1. The function body should be something like:
    function(x, counts) ave(seq_len(length(x)), x, FUN=length) %in{}% counts
  2. What do we do with == and < and similar? Do we add %#<% and the like?
    2b. same question applies to %[==% - do we need those?
  3. Do we add subsetting operators? %[#in{}%?
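The `ave()` idea from point 1 can be tried in plain base R. This is only a sketch: `count_per_element()` is a made-up name, and plain `%in%` stands in for the package's %in{}% so the snippet runs without inops.

```r
# Hypothetical helper: for every element of x, how often does its value occur in x?
count_per_element <- function(x) {
  ave(seq_along(x), x, FUN = length)
}

x <- c("a", "a", "b", "c", "c", "c")
n <- count_per_element(x)   # 2 2 1 3 3 3

# Recode every value that occurs once or twice as "Other".
rare <- n %in% 1:2
x[rare] <- "Other"
x
```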

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

@moodymudskipper pinging for your opinion.

moodymudskipper avatar moodymudskipper commented on June 26, 2024

I was thinking about 2b this morning, I think we do, I thought I had written those actually. %[==% and %[!=% wouldn't be so important as %[in{}% and %[out{}% would work the same (only less efficiently), but %[>% and %[<% are quite useful.

I remember now our talks about the # variants, I see I came back to my initial misunderstanding, sorry.

Would this work as follows ?

library(inops)
#> 
#> Attaching package: 'inops'
#> The following object is masked from 'package:base':
#> 
#>     <<-

`%#in[]%` <- function(x, range){
  if(is.data.frame(x))
    set <- names(table(as.matrix(x)) %[in[]% range)
  else
    set <- names(table(x) %[in[]% range)
    
  x %in{}% set
}

`%#[in[]%` <- function(x, range){
  if(is.data.frame(x))
    set <- names(table(as.matrix(x)) %[in[]% range)
  else
    set <- names(table(x) %[in[]% range)
  x %[in{}% set
}

`%#in[]%<-` <- function(x, range, value){
  if(is.data.frame(x))
    set <- names(table(as.matrix(x)) %[in[]% range)
  else
    set <- names(table(x) %[in[]% range)
  x %in{}% set <- value
  x
}

x <- c(1,1:5,5,5)
x %#in[]% c(2,3)
#> [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
x %#[in[]% c(2,3)
#> [1] 1 1 5 5 5
x %#in[]% c(2,3) <- NA
x
#> [1] NA NA  2  3  4 NA NA NA

y <- data.frame(a = 1:4,b = c(1, 5, 5, 5))
y %#in[]% c(2,3)
#>          a    b
#> [1,]  TRUE TRUE
#> [2,] FALSE TRUE
#> [3,] FALSE TRUE
#> [4,] FALSE TRUE
y %#[in[]% c(2,3)
#> [1] 1 1 5 5 5
y %#in[]% c(2,3) <- NA
y
#>    a  b
#> 1 NA NA
#> 2  2 NA
#> 3  3 NA
#> 4  4 NA

z <- factor(c("a",letters[1:5],"e", "e"))
z %#in[]% c(2,3)
#> [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
z %#[in[]% c(2,3)
#> [1] a a e e e
#> Levels: a b c d e
z %#in[]% c(2,3) <- NA
z
#> [1] <NA> <NA> b    c    d    <NA> <NA> <NA>
#> Levels: a b c d e

Created on 2019-11-04 by the reprex package (v0.3.0)

I changed the df to a matrix as this is what == does, and is a way to get a one dimensional table().

In the case of z we might need to change the levels as I suppose your main use case is to group outliers or low frequency/high frequency values, and factors might be common in those cases.
We need to decide whether we add an NA level if we assign NA (I suppose we wouldn't?).

In that case, variants %#<% etc would be desirable indeed.

A few remarks :

  • it almost doubles the number of our functions, which is not necessarily a big issue
  • it adds one more dimension to our operators: we now have suffix, prefix, and #. I think we did a good job describing these dimensions quite clearly, and this should not muddy the waters, so we need a good vocabulary and good integration in the readme.
  • it is a bit more complex than our other operators, which might mean corner cases I haven't identified yet (one would be that levels will be converted to numeric by as.matrix()).
  • there is no %#in~%, which is not a problem in itself, but the doc should not be confusing in that regard when describing this "third dimension".

It seems to be quite a useful functionality to you and I'm ok to incorporate it if you ponder these points and think it is worth it.

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

Hmmm, a few ideas:

  1. I am sure things like %#in~% will never be useful. So we will not get a full set of operators.

     So maybe we can get away with using %in#% instead? As rhs we would simply specify the wanted numbers of occurrences, like 1:5 for up to 5?

  2. I am really unsure how to behave with data.frames... I would never call this on a data.frame. But the behaviour cannot be too smart and must be consistent with the other operators. So I think turning it into a matrix is all right.

  3. Isn't the names(table()) implementation slower compared to ave? I didn't check yet.
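One way to look into the speed question is a quick side-by-side in base R. The two function names, the test data and the count range below are invented for this sketch, and plain `%in%` replaces %in{}%; timings depend on the data and the machine, so no result is claimed here.

```r
set.seed(1)
x      <- sample(letters, 1e4, replace = TRUE)
counts <- 350:400   # arbitrary range of occurrence counts to select

# Variant 1: tabulate the values, keep the names whose count matches.
via_table <- function(x, counts) {
  tb <- table(x)
  x %in% names(tb[tb %in% counts])
}

# Variant 2: compute the count for every element directly with ave().
via_ave <- function(x, counts) {
  ave(seq_along(x), x, FUN = length) %in% counts
}

# Both variants should select the same elements.
same <- identical(via_table(x, counts), via_ave(x, counts))

# Rough timing only; a dedicated benchmarking package would be more reliable.
system.time(for (i in 1:20) via_table(x, counts))
system.time(for (i in 1:20) via_ave(x, counts))
```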

moodymudskipper avatar moodymudskipper commented on June 26, 2024

@KKPMW I updated my post above while you were replying

moodymudskipper avatar moodymudskipper commented on June 26, 2024

I believe the %in#% you're suggesting is what would be %#<=% in my more general case above. This might work and be less confusing.

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

You mean %in#% might work and be less confusing? If so - I agree :)

Also, in my particular cases I am often working with biological data that has "technical replicates". I then have to select only the samples that have an exact number (like 3) of technical replicates, and analyse them separately. So sample_id %in#% 3 would be exactly it.
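To make the replicate use case concrete, here is the same selection written in base R, without any of the proposed operators. The sample ids are invented for illustration.

```r
# Invented example data: s1 and s3 have three technical replicates each.
sample_id <- c("s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3", "s4")

# Count how often each id occurs and keep the ids that occur exactly 3 times.
tb       <- table(sample_id)
complete <- names(tb)[as.vector(tb) == 3]

# Logical mask over the original vector; this is what sample_id %in#% 3 is meant to return.
keep <- sample_id %in% complete
sample_id[keep]
```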

moodymudskipper avatar moodymudskipper commented on June 26, 2024

So something like this ?

library(inops)
#> 
#> Attaching package: 'inops'
#> The following object is masked from 'package:base':
#> 
#>     <<-

`%in#%` <- function(x, threshold){
  if(is.data.frame(x)){
    tb <- table(as.matrix(x))
  } else{
    tb <- table(x)
  }
  set <- names(tb[tb <= threshold])
  x %in{}% set
}

`%[in#%` <- function(x, threshold){
  if(is.data.frame(x)){
    tb <- table(as.matrix(x))
  } else{
    tb <- table(x)
  }
  set <- names(tb[tb <= threshold])
  x %[in{}% set
}

`%in#%<-` <- function(x, threshold, value){
  if(is.data.frame(x)){
    tb <- table(as.matrix(x))
  } else{
    tb <- table(x)
  }
  set <- names(tb[tb <= threshold])
  x %in{}% set <- value
  x
}

x <- c(1,1:5,5,5)
x %in#% 2
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
x %[in#% 2
#> [1] 1 1 2 3 4
x %in#% 2 <- NA
x
#> [1] NA NA NA NA NA  5  5  5

y <- data.frame(a = 1:4,b = c(1, 5, 5, 5))
y %in#% 2
#>         a     b
#> [1,] TRUE  TRUE
#> [2,] TRUE FALSE
#> [3,] TRUE FALSE
#> [4,] TRUE FALSE
y %[in#% 2
#> [1] 1 2 3 4 1
y %in#% 2 <- NA
y
#>    a  b
#> 1 NA NA
#> 2 NA  5
#> 3 NA  5
#> 4 NA  5

z <- factor(c("a",letters[1:5],"e", "e"))
z %in#% 2
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
z %[in#% 2
#> [1] a a b c d
#> Levels: a b c d e
z %in#% 2 <- NA
z
#> [1] <NA> <NA> <NA> <NA> <NA> e    e    e   
#> Levels: a b c d e

Created on 2019-11-04 by the reprex package (v0.3.0)

moodymudskipper avatar moodymudskipper commented on June 26, 2024

Oh, maybe you rather meant :

library(inops)
#> 
#> Attaching package: 'inops'
#> The following object is masked from 'package:base':
#> 
#>     <<-

`%in#%` <- function(x, counts){
  if(is.data.frame(x)){
    tb <- table(as.matrix(x))
  } else{
    tb <- table(x)
  }
  set <- names(tb[tb %in% counts])
  x %in{}% set
}

`%[in#%` <- function(x, counts){
  if(is.data.frame(x)){
    tb <- table(as.matrix(x))
  } else{
    tb <- table(x)
  }
  set <- names(tb[tb %in% counts])
  x %[in{}% set
}

`%in#%<-` <- function(x, counts, value){
  if(is.data.frame(x)){
    tb <- table(as.matrix(x))
  } else{
    tb <- table(x)
  }
  set <- names(tb[tb %in% counts])
  x %in{}% set <- value
  x
}

x <- c(1,1:5,5,5)
x %in#% 2
#> [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x %[in#% 2
#> [1] 1 1
x %in#% 2 <- NA
x
#> [1] NA NA  2  3  4  5  5  5

x2 <- c(1,1:5,5,5)
x2 %in#% 2:3
#> [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
x2 %[in#% 2:3
#> [1] 1 1 5 5 5
x2 %in#% 2:3 <- NA
x2
#> [1] NA NA  2  3  4 NA NA NA

y <- data.frame(a = 1:4,b = c(1, 5, 5, 5))
y %in#% 2
#>          a     b
#> [1,]  TRUE  TRUE
#> [2,] FALSE FALSE
#> [3,] FALSE FALSE
#> [4,] FALSE FALSE
y %[in#% 2
#> [1] 1 1
y %in#% 2 <- NA
y
#>    a  b
#> 1 NA NA
#> 2  2  5
#> 3  3  5
#> 4  4  5

z <- factor(c("a",letters[1:5],"e", "e"))
z %in#% 2
#> [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
z %[in#% 2
#> [1] a a
#> Levels: a b c d e
z %in#% 2 <- NA
z
#> [1] <NA> <NA> b    c    d    e    e    e   
#> Levels: a b c d e

Created on 2019-11-04 by the reprex package (v0.3.0)

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

Yup! I had in mind this last one, as it covers both points:

  1. Flexible enough to be used for various different situations
  2. Introduces only one additional operator type.

But I do wonder if we could ever have some kind of rounding issues doing %in% on numeric.
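On the rounding worry, a small base R check may help. It assumes the counts come from `table()` as in the functions above. The counts themselves are stored as integers, so the risk is more on the rhs side, when the requested counts are the result of floating-point arithmetic. Rounding the rhs, as in the last line, is only one possible guard and not something decided in this thread.

```r
x  <- c(1, 1:5, 5, 5)
tb <- table(x)

typeof(tb)        # "integer": table() stores counts as integers
tb %in% c(2, 3)   # whole-number doubles match the integer counts exactly

# A computed rhs can miss: (0.1 + 0.2) * 10 is not exactly 3 in floating point.
computed <- (0.1 + 0.2) * 10
3L %in% computed          # FALSE
3L %in% round(computed)   # TRUE
```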

moodymudskipper avatar moodymudskipper commented on June 26, 2024

Good, I'm totally fine with this one: it doesn't add much complexity to the package, and I think it's easy enough to understand. As for edge cases and optimization (table() / ave() / tabulate(), how to deal with factors...), we need experiments and unit tests. But agreeing on the concept and naming is the main part.

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

Agree, both with the message and with the discussed naming convention.

Should I add your proposed functions to the codebase as a first iteration?

moodymudskipper avatar moodymudskipper commented on June 26, 2024

Yes, but please think about the desired behavior with factors and share your thoughts

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

First iteration in #30

karoliskoncevicius avatar karoliskoncevicius commented on June 26, 2024

You are correct. I missed that.

Tried to fix it in #31
