Deal with significant digits (skimr issue, CLOSED)

ropensci commented on August 17, 2024
Deal with significant digits

from skimr.

Comments (27)

hadley commented on August 17, 2024

FWIW I think you should do this while printing, not earlier: you're changing how you display the values, not the values themselves.

I think you should continue to display non-significant digits, as printing them as 0 will confuse a lot of people. I think showing 3 digits should be adequate for most cases.
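
A minimal sketch of formatting at print time, assuming base R's format() stands in for whatever formatter skimr ends up using (the vector x is made up for illustration). Only the character representation is shortened to three significant digits; the stored numbers are untouched, and digits before the decimal point are never replaced by 0.

```r
# Hypothetical summary values, as they might be stored in a skim object.
x <- c(5.843333, 0.8280661, 1234.5678)

# Format each value separately for display, asking for 3 significant digits.
shown <- vapply(x, function(v) format(v, digits = 3), character(1))
shown
#> [1] "5.84"  "0.828" "1235"

# The stored values keep their full precision.
x[1] == 5.843333
#> [1] TRUE
```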

hadley commented on August 17, 2024

See also https://github.com/hadley/colformat. colformat should handle this for you.

elinw commented on August 17, 2024

@benmarwick Actually lucid really solves the whole thing. The first time I read that page I missed that it works in the console.

hadley commented on August 17, 2024

colformat does everything that lucid does and adds colours.

elinw commented on August 17, 2024

I've pushed up a branch that uses Cformat. I think that's the only generic workable solution: a user would write a custom function to use any of the other options that make sense in their context. In thinking about this I read the links here and I think they are as good as any. I think it's important to remember that the idea of skimr is to allow users to "skim" a data set, as opposed to diving in deeply. This also does not change the stored value; it is just about display.
#125

elinw commented on August 17, 2024

[Three screenshots, taken 2017-07-16 at 4:20 to 4:21 pm; images not preserved]

hadley commented on August 17, 2024

In the min column in your previous screenshots there is both 1 and 0.1, and the decimal points aren't aligned.

elinw commented on August 17, 2024

Yes, that looks like a good approach and is consistent with what we are doing elsewhere. Marking as a question because I think we need to decide a few things, such as where to handle this and whether we should look at integers again rather than just treating them as numeric.

elinw commented on August 17, 2024

Here are a few issues related to this.

  • Do we handle this in the skim object or in the print? To be consistent with our approaches to other issues we should handle it in the skim object.
  • Do we want to use a greyed-out approach to indicate false precision, or do we want to not display false precision at all?
  • How is it possible to determine the appropriate number of significant digits to report for an arbitrary numeric variable? All of the methods seem extremely clunky. Also, lots of data sets have integers classed as numeric.
  • Why is it that in that screenshot there are different numbers of digits for different percentiles?
  • Is there any way to stop .000s from being appended?
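
To make the third question concrete, here is one (admittedly clunky) sketch of estimating how many decimal places a variable was recorded with. decimal_places() is a hypothetical helper, not part of skimr; it goes through the character representation, ignores NA handling, and will be slow on large vectors.

```r
# Hypothetical helper: count the digits after the decimal point for each
# value, based on its shortest representation with up to 15 significant digits.
decimal_places <- function(x) {
  s <- vapply(
    x,
    function(v) format(v, scientific = FALSE, digits = 15),
    character(1)
  )
  has_point <- grepl(".", s, fixed = TRUE)
  # Strip everything up to and including the decimal point, then count.
  ifelse(has_point, nchar(sub("^[^.]*\\.", "", s)), 0L)
}

max(decimal_places(c(5.1, 4.9, 4.7)))      # one decimal place
#> [1] 1
max(decimal_places(c(2.62, 2.875, 3.44)))  # up to three decimal places
#> [1] 3
max(decimal_places(c(4, 3, 5)))            # whole numbers stored as numeric
#> [1] 0
```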

benmarwick commented on August 17, 2024

There's quite a nice discussion of this in the vignette for the lucid pkg, here's an excerpt:

One recommendation for improving the display of tables of numbers is to round numbers to 2 (Wainer 1997) or 3 (Feinberg and Wainer 2011) digits for the following reasons:

  • We cannot comprehend more than three digits very easily.
  • We seldom care about accuracy of more than three digits.
  • We can rarely justify more than three digits of accuracy statistically.

An alternative to significant digits is the concept of effective digits (Ehrenberg 1977, Kozak et al. (2011)), which considers the amount of variation in the data.

Tufte is not very prescriptive with his statement that "The number of significant digits depends on data underlying the calculations" [1]

Gelman makes it clear that four significant digits is too many.

So the proposal for three significant digits in regular type and the rest in a lighter colour (in the rare event that those are meaningful to the user) seems consistent with other advice.

elinw commented on August 17, 2024

I would never print them as 0; that makes no sense at all. Trailing 0s after the decimal place are significant digits.

Actually I think he is very prescriptive. In iris$Sepal.Length you should show 1 decimal, 2 significant digits, because that is the number in the data you are calculating from. In mtcars$wt you should show 3 decimals, 4 significant digits, because that is what the original data have. In mtcars$gear you should have no decimals since those are integers. For freeny$market.potential, 4 decimals, 6 significant digits is right.

You could perhaps say this is just a skim of the data set so you won’t show freeny$market.potential to its full significant digits.

The worst offenders are the 5 number summary values, where tons of 0s are slapped on, actually changing the meaning of the minimum and maximum values. The original functions don't do that; it is only when they are put into the same column with other numbers that, I guess, they are all coerced to show the same number of digits.

hadley commented on August 17, 2024

I don't think we should be showing different numbers of digits in different columns (except where necessary to pad to the same length). But I suspect the discussion is getting confused because we're using the same words to mean different things 😟

elinw commented on August 17, 2024

Yes, I was thinking that too. I think we should focus on the display and not the storage for the moment. @michaelquinn32 I also wonder what you think about the issue of changing the data representation for print. You were arguing against that in another thread.

Since the columns range from "n" to "mean" to "hist" I don't know how it would make sense to insist on the same number of digits in all columns. Clearly n is going to be an integer and I really think that min and max, as two obvious examples, should be the actual values not padded with extra 0s. I also think it's a problem that the values returned from the functions are being changed by the processing.

I think the actual problem is that to really do it correctly you need different numbers of digits in the same column, and I don't know if that is possible. For example, printing a <- c(3, 3.5, 3.54) automatically gives all of them two digits to the right of the decimal.
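
The behaviour described above can be reproduced in base R. This is only an illustration of how a numeric vector is formatted as a whole versus one value at a time; it is not what skimr does internally.

```r
a <- c(3, 3.5, 3.54)

# Formatting the vector as a whole uses one common number of decimals,
# which is also what print(a) shows.
format(a)
#> [1] "3.00" "3.50" "3.54"

# Formatting a single value on its own adds no trailing zeros.
format(3)
#> [1] "3"
```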

hadley commented on August 17, 2024

I'm not sure what you mean by extra zeroes. colformat doesn't add any?

elinw commented on August 17, 2024

There are at least two issues going on, which is one reason this is confusing.

  1. Inheriting from .Internal(print.default) all numeric values are printed using the digits option. Unless overridden in the specific call this will pad any numeric with 0s on the right until they have the specified number of digits (but not always all the way to digits). This is really the most offensive piece of the problem because it is implying by default that values have up to 7 significant digits. From what I can tell R figures out how many decimal places the longest value in a column has and makes each displayed value match up, meaning that the decimals align. This is regardless of how the values are actually stored in the object. So as I was pointing out, values that are obviously integers or that are stored with 1 digit after the decimal have wrong levels of precision displayed when printing in the form of numerous extra 0s. You can store 2 and it will display it as 2.0000000 in the right situation.

  2. For calculated values such as the mean, standard deviation and quantiles (in skim()'s default numeric handling, but also elsewhere), R will always display using the digits option, regardless of whether that is correct in terms of significant digits (it is false precision to display more decimal places than your data were measured to). These long numbers then cause all the other numbers displayed in the same column to get the maximum padding of right-hand 0s.
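
A small sketch of both effects using base R's format(), which follows the same digits logic as printing. The digits argument is passed explicitly here so the result does not depend on session options; 7 is the default value of getOption("digits").

```r
# A whole number stored next to a calculated value with many decimals.
x <- c(2, 1/3)

# With 7 significant digits the calculated value needs 7 decimals,
# so the 2 is padded with zeros to match.
format(x, digits = 7)
#> [1] "2.0000000" "0.3333333"

# Lowering digits shortens the padding but does not remove it.
format(x, digits = 3)
#> [1] "2.000" "0.333"
```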

I haven't spent enough time trying to learn colformat (it took me a while just to understand that print was not printing what is in the object), but my sense is that like R base it is giving a predefined number of decimal digits rather than displaying what is in the object (problem 1) or trying to figure out the correct number to display for calculated values (much harder problem 2).

From looking around on StackOverflow etc., it seems like the solution to problem 1 is to apply as.character() to the value column of the skim object as part of the numeric handling in skim_print. This does seem to work perfectly, except that the tibble print says <chr>. It fixes all the integers and removes the misleading trailing 0s from calculated values. However, it has other strange effects.
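
A rough sketch of the as.character() idea on a made-up data frame (the stats object and its columns are hypothetical, not the real skim object). Each value is converted on its own, so no zeros are appended, but the resulting column is character rather than numeric.

```r
# Hypothetical long-format summary with a numeric value column.
stats <- data.frame(
  stat  = c("min", "median", "mean"),
  value = c(2, 3.5, 3.173333)
)

# Convert for display only; the numeric column is left in place.
stats$shown <- as.character(stats$value)
stats$shown
#> [1] "2"        "3.5"      "3.173333"

# The display column is no longer numeric.
class(stats$shown)
#> [1] "character"
```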

There doesn't seem to be a common way to solve problem 2. It's do-able but slow, at least from what I have seen so far.

All of this also relates to the other issue about print.

elinw commented on August 17, 2024

One way to cut down the scope of the problem would be to put integers into a separate tibble in skim_print(). That would eliminate some really egregious 0 padding.

elinw commented on August 17, 2024

Colors are nice. I agree the column outputs are the same otherwise. But actually I think one difference, as I have been experimenting, is that colformat applies to individual columns and gives an error for a list or a whole tibble/data frame, so this works: colformat::colformat(xp[[1]]$mean), where xp is a printed skim_df, while lucid::lucid(xp) deals with the whole thing, at least so far.

elinw commented on August 17, 2024

One idea I have is that we might want to add another column, "formatted", to the skim object that would contain the correctly formatted values. Right now the inline histogram, for example, is being stuck into level, but it is not a level. If we did that, we could also do things like correctly format the Date and POSIXct values, which are also frustratingly displayed right now. In that case we could use colformat in a more refined way, and potentially someone could also write a custom formatter for their data type.

elinw commented on August 17, 2024

#122
@hadley @haozhu233 @benmarwick
I think I came up with a workable approach that will let you decide on handling using skim_with(). So you could use colformat, lucid or whatever else you choose. For my stats 101 students and other similar users I still want to make a default for mean and sd which are the main offenders in terms of false precision but that is a much more focused problem.

hadley commented on August 17, 2024

I think this loses the important property that the length of the number should be proportional to the order of magnitude.

elinw commented on August 17, 2024

Can you give me an example from a sample data set where you think this would be a problem? I'm not sure what you mean by the length.

[Screenshot, taken 2017-07-18 at 10:42 pm; image not preserved]

elinw commented on August 17, 2024

Correct! That's the point. They are different because they are different variables, and min itself is the actual minimum value in the data. Some variables are recorded with decimals and some are not, i.e. some have more digits than others (and some are integers). That is a large part of the problem with transposing the data, because print() always adds as many 0s as are needed to make the values equal width and therefore have straight columns. R (and I guess S) could have chosen to pad with blank spaces when print was written, but it didn't. Personally, since the idea is to understand what is in a data set, I find that, besides being the correct practice for display, getting the correct information about how the data actually are is helpful. I personally would not be using the console display in any publication, though.

hadley commented on August 17, 2024

I think this is a bad idea because it means you can no longer easily skim the column - you can't easily compare numbers because they're not aligned.

elinw commented on August 17, 2024

You shouldn't be comparing the number of cylinders and the miles per gallon anyway. Or in a survey data set the mean income, mean score on a 1 to 10 rating of satisfaction with your city, and the mean number of children. On the other hand, all of the numeric Iris data will line up, though since I'm not a botanist I'm not sure if you normally want to compare the median length and median width. If you have a set of variables you specifically want to compare like that you should just skim those columns or use a function that is not designed to handle multiple data types.

For better or worse print.skim() is giving you something similar to print.data.frame() but displayed horizontally, and divided up by data type. It defaults to an opinionated set of statistics, but that's all it really is. A more visually organized and compact version of summary.data.frame() with more statistics.

hadley commented on August 17, 2024

I don't think your reasoning is correct, but it's your package.

elinw commented on August 17, 2024

#129

elinw commented on August 17, 2024

Merged with a solution; certainly open to further improvements in new issue reports but let's start a new discussion if desired.
