Comments (4)
It works after #99 was rebased on latest master.
from frameless.
Is it relevant? I guess everything works out of the box right now.
from frameless.
@kanterov I think we can close this. Let me run some thoughts by you; let me know if I am describing what we are doing correctly.
- In Spark, single entries (Rows) are represented (encoded) as InternalRow. So Dataset[Vector[Int]] is represented with InternalRow(Array[Int]). That is, there is an encoder that takes you from Vector[Int] => Array[Int].
- We have a udf with this type signature: Vector[Int] => String
- When we need to apply the udf to the Dataset, we cannot really apply it as is. We need to somehow adapt it to work with Array[Int] as input and potentially with UTF8String as output. Therefore the adapted udf needs to be Array[Int] => UTF8String.
- Essentially we need to wrap our udf in the following: (a: Array[Int]) => UTF8Codec.encode(udf(VectorCodec.decode(a))). Now we have a function with the correct signature Array[Int] => UTF8String.
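The wrapping step described above can be sketched in plain Scala. This is only an illustration, not how frameless or Spark implement it: the Codec trait, the adapt helper, and the Utf8Bytes type (a stand-in for Spark's UTF8String so the snippet runs without Spark on the classpath) are all hypothetical. Only the names VectorCodec and UTF8Codec are borrowed from the comment.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for Spark's internal UTF8String (assumption, not the real class).
final case class Utf8Bytes(bytes: Vector[Byte])

// Hypothetical codec: A is the user-facing type, R the internal representation.
trait Codec[A, R] {
  def encode(a: A): R
  def decode(r: R): A
}

object VectorCodec extends Codec[Vector[Int], Array[Int]] {
  def encode(a: Vector[Int]): Array[Int] = a.toArray
  def decode(r: Array[Int]): Vector[Int] = r.toVector
}

object UTF8Codec extends Codec[String, Utf8Bytes] {
  def encode(a: String): Utf8Bytes =
    Utf8Bytes(a.getBytes(StandardCharsets.UTF_8).toVector)
  def decode(r: Utf8Bytes): String =
    new String(r.bytes.toArray, StandardCharsets.UTF_8)
}

object UdfAdapter {
  // Lift a user-level function A => B into one over the internal
  // representations RA => RB: decode the input, apply f, encode the result.
  def adapt[A, RA, B, RB](f: A => B)(in: Codec[A, RA], out: Codec[B, RB]): RA => RB =
    ra => out.encode(f(in.decode(ra)))
}

object Example {
  // The user-level udf from the comment: Vector[Int] => String.
  val udf: Vector[Int] => String = _.mkString(",")

  // The adapted function has the internal signature Array[Int] => Utf8Bytes.
  val adapted: Array[Int] => Utf8Bytes =
    UdfAdapter.adapt(udf)(VectorCodec, UTF8Codec)
}
```

In this sketch every call to Example.adapted pays for one decode and one encode, which is the conversion cost the thread is discussing.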
from frameless.
Overall, if you want your UDFs to run fast, it's better to have your case classes use types that are closer to what Spark uses internally. That is,
case class Person(ids: Array[Int], name: UTF8String)
would in principle perform faster than
case class Person(ids: Vector[Int], name: String)
at least in terms of UDFs/UDAFs?
(This way, you save yourself a lot of encoding/decoding conversions.)
from frameless.