Comments (7)
Hi @RajatSablok,
When you convert a Spark DataFrame to an H2OFrame, the whole dataset is kept in H2O's in-memory storage. Data is compressed column by column and distributed across the H2O nodes (the Spark executors, when using the Sparkling Water internal backend).
from sparkling-water.
Thanks for the reply @mn-mikke, is it possible to store the data in secondary memory (hard disk) instead of it being stored in RAM?
Nope.
In Spark we have the option to change the storage location. Do we have that sort of support in Sparkling Water? @mn-mikke
The reason I am still confused is this: suppose we have a 1 TB file that needs to be processed. Will we have to create a 4 TB instance for that? This is my understanding based on the information given here.
Yes, if you need all the columns of the dataset for training a model, you will have to create a cluster with 4 TB of memory (80 x 50 GB, for example). Usually, the dataset contains extra columns that are unimportant for model training. You can use Spark to reduce the number of columns and thus the memory requirements, or you can train the algorithm on a random sample of the data.
Note: The H2O memory engine is designed to be very fast for machine learning algorithms. Everything is kept in memory because ML algorithms iterate over the dataset many times. Storing data on disk would slow training significantly. The Spark engine comes from a different world of big data processing, where a dataset is walked through only once or very few times.
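The sizing arithmetic in this reply can be sketched in a few lines of plain Python. This is only a rough estimate, not an H2O or Sparkling Water API: the function names are hypothetical, the factor of 4 simply mirrors the 1 TB to 4 TB figure used in this thread, and `keep_fraction` assumes memory shrinks linearly when columns are dropped or rows are sampled, which real data may not do.

```python
import math


def h2o_memory_gb(dataset_gb, overhead_factor=4.0, keep_fraction=1.0):
    """Estimate the cluster memory (in GB) needed to hold a dataset.

    overhead_factor=4.0 is the rule of thumb quoted in this thread
    (1 TB of data -> 4 TB of memory); it is an assumption, not a guarantee.
    keep_fraction is the share of the data that remains after dropping
    columns or sampling rows (assumed to scale memory linearly).
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    return dataset_gb * keep_fraction * overhead_factor


def nodes_needed(total_memory_gb, node_memory_gb):
    """Number of equally sized nodes needed to provide total_memory_gb."""
    return math.ceil(total_memory_gb / node_memory_gb)


# Full 1 TB dataset (taken as 1000 GB) on 50 GB nodes, as in the reply.
full = h2o_memory_gb(1000)
print(full, nodes_needed(full, 50))  # 4000.0 80

# The same dataset after keeping roughly 10% of it (sampling or fewer columns).
reduced = h2o_memory_gb(1000, keep_fraction=0.1)
print(reduced, nodes_needed(reduced, 50))
```

The second call shows why trimming columns or sampling in Spark before the conversion matters: the memory estimate scales with whatever actually reaches the H2OFrame.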
Got it, thanks for the help. Appreciate it
Hi @mn-mikke, I had a follow-up question based on this:
"you can train algorithm on a random sample of data"
Can we train the model in batches of data?
Example: If we have a 1 TB dataset, can we load and train the model 10 times, 100 GB at a time?