Comments (3)
@OlivierBlanvillain, we should close this issue since it is solved by #65, right?
from frameless.
I agree with this issue. I have two related observations here. Let me show them with an example:
val e = TypedDataset.create[(Int, String, Long)]( (1,"a",2L) :: (2, "b", 4L) :: (2, "b", 1L) :: Nil )
// Summing an Int column fails:
e.select(sum(e('_1)))
<console>:25: error: could not find implicit value for parameter summable: frameless.functions.Summable[Int]
e.select(sum(e('_1)))
// Spark's behavior when summing an Int column is to widen the result to bigint (a 64-bit Long):
scala> e.dataset.select(org.apache.spark.sql.functions.sum($"_1"))
res21: org.apache.spark.sql.DataFrame = [sum(_1): bigint]
First, I think we need to add proper implicits for the three types mentioned in the title (Int, Short, Byte). Second, I think we should stay faithful to the widening that Spark does. If you add up a lot of Ints, it is better to get the result as a wider type. The idea is that Spark is used for "Big Data". Often, to save disk space, you persist small numeric values as Short or Int. However, when it is time to sum billions of rows, you want to collapse those values into a type that you can be reasonably sure will not overflow. This is why Spark returns bigint (a 64-bit integer, Long in Scala) when you sum an integer column. It might feel strange for a strictly typed system, but it makes perfect sense for the application.
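To make the overflow concern concrete, here is a minimal sketch in plain Scala (no Spark or frameless involved). The object name `OverflowDemo` is made up for illustration.

```scala
object OverflowDemo {
  // Two Int values whose exact sum does not fit in 32 bits.
  val values: List[Int] = List(Int.MaxValue, 1)

  // Summing in Int wraps around silently to a negative number.
  val narrowSum: Int = values.sum

  // Widening each element to Long before summing keeps the exact result.
  val wideSum: Long = values.map(_.toLong).sum
}
```

Here `narrowSum` ends up as `Int.MinValue`, while `wideSum` holds the exact total, which is the reason for preferring a wider result type.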
I agree with your comment, and I think we should simply fix this so that the return type of sum matches Spark's behaviour.
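One way the widened return type could be encoded is a type class that maps each input type to its sum type. The following is only a rough sketch under that assumption: `WideningSum`, its `Aux` alias and the `sum` helper are hypothetical names for illustration, not the `frameless.functions.Summable` type from the error above and not how #65 actually solved it.

```scala
// Hypothetical type class: maps an input numeric type In to the
// widened type Out that its sum should have.
trait WideningSum[In] {
  type Out
  def zero: Out
  def add(acc: Out, next: In): Out
}

object WideningSum {
  // Aux alias exposes the dependent Out type to callers.
  type Aux[In, Out0] = WideningSum[In] { type Out = Out0 }

  // Builds an instance that accumulates into a Long.
  private def toLong[In](widen: In => Long): Aux[In, Long] =
    new WideningSum[In] {
      type Out = Long
      val zero: Long = 0L
      def add(acc: Long, next: In): Long = acc + widen(next)
    }

  // Integer types all sum to Long, mirroring Spark's bigint result.
  implicit val byteSum: Aux[Byte, Long]   = toLong[Byte](_.toLong)
  implicit val shortSum: Aux[Short, Long] = toLong[Short](_.toLong)
  implicit val intSum: Aux[Int, Long]     = toLong[Int](_.toLong)
  implicit val longSum: Aux[Long, Long]   = toLong[Long](identity)

  // The result type depends on which instance is resolved for In.
  def sum[In](xs: Seq[In])(implicit w: WideningSum[In]): w.Out =
    xs.foldLeft(w.zero)(w.add)
}
```

With this in scope, `WideningSum.sum(Seq(Int.MaxValue, 1))` is typed as Long at compile time, so the widening is visible in the types rather than hidden at runtime.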