Comments (11)
@alansaid Within the next two days no problem.
I also noticed that there are now slight modifications to the DataModel structure in the master branch.
Edit:
I'm now able to run the Netflix dataset ML-10M on a computer with 16 GB without any problem. I also discarded the timestamp, but that is optional.
from rival.
Hi André,
that situation can happen. So far, we have used the framework with datasets of different nature, but most of them were not too large.
We are aware of this fact and that the current datamodels are not very generic (check issues #83 and #103). Probably the best solution at the moment would be to implement your own DataModel or rely on other frameworks already optimised for large datasets (such as RankSys [ http://ranksys.org/ ]).
Thank you for the interest, and let us know whatever works for you, so we can work on that in the future.
Alex
from rival.
Hi all,
Thank you @abellogin for mentioning our framework. Indeed, by generalising your DataModel it would be easy for you to add more efficient implementations (or even to plug in RankSys' ones).
Just to give you some perspective, I have measured the memory footprint of various implementations of RankSys' PreferenceData interface (equivalent to RiVal's DataModel) using the ML10M dataset. First, I created an implementation equivalent to the current DataModel (RiValPreferenceData). Then, I used the two publicly available implementations in RankSys 0.3 (SimplePreferenceData and SimpleFastPreferenceData). Finally, I am including the results of our latest RecSys'15 poster, which applies state-of-the-art compression techniques and whose implementations will be published (hopefully soon) in RankSys 0.4.
The results are the following:
Data model | Memory |
---|---|
RiValPreferenceData | 1,969.2 MB |
SimplePreferenceData | 1,261.3 MB |
SimpleFastPreferenceData | 810.7 MB |
compression - none | 165.5 MB |
compression - FOR | 42.3 MB |
As you can see, there is ample room for improvement with respect to having two nested maps. I am planning to publish these and other observations in a blog post once RankSys 0.4 has been released.
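To illustrate why moving away from two nested maps can save memory, here is a minimal sketch that stores ratings in flat primitive arrays (a CSR-style layout) instead of boxed map entries. The class `CompactPreferences` and its methods are hypothetical names made up for this example; this is not the actual RankSys or RiVal implementation, and it assumes users and items have already been mapped to dense 0-based indices.

```java
import java.util.Arrays;

/** Hypothetical sketch: ratings kept in three primitive arrays instead of nested maps. */
public final class CompactPreferences {
    private final int[] offsets;   // entries of user u live in [offsets[u], offsets[u + 1])
    private final int[] items;     // item index of each entry
    private final float[] ratings; // rating of each entry, parallel to items

    /** Builds the arrays from (user, item, rating) triples given in any order. */
    public CompactPreferences(int numUsers, int[] users, int[] itemIdx, float[] values) {
        int n = users.length;
        offsets = new int[numUsers + 1];
        for (int u : users) {
            offsets[u + 1]++;                 // count ratings per user
        }
        for (int u = 0; u < numUsers; u++) {
            offsets[u + 1] += offsets[u];     // prefix sums give each user's start position
        }
        items = new int[n];
        ratings = new float[n];
        int[] next = Arrays.copyOf(offsets, numUsers); // next free slot per user
        for (int k = 0; k < n; k++) {
            int pos = next[users[k]]++;
            items[pos] = itemIdx[k];
            ratings[pos] = values[k];
        }
    }

    /** Number of ratings stored for the given user. */
    public int numRatings(int user) {
        return offsets[user + 1] - offsets[user];
    }

    /** Rating of (user, item), or NaN if absent; linear scan over the user's entries. */
    public float getRating(int user, int item) {
        for (int pos = offsets[user]; pos < offsets[user + 1]; pos++) {
            if (items[pos] == item) {
                return ratings[pos];
            }
        }
        return Float.NaN;
    }
}
```

Each rating here costs one `int` plus one `float`, with a single extra `int` per user, whereas nested maps pay for boxed keys, boxed values and map entry objects. Compression schemes such as the FOR row in the table go further than this sketch does.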
Just one more thing. RankSys 0.4 will be released under a much more relaxed license. That should allow its usage in other projects without requiring them to be licensed under the GPL (as is currently the case).
from rival.
Thank you both @abellogin and @saulvargas !
I'll have a look at RankSys, I think it fits my needs!
I will also have a look at your work "Analyzing Compression Techniques for In-Memory Collaborative Filtering" and hopefully will use it for mine.
from rival.
Hi @abellogin,
If you want, I can send you an adaptation of the CrossValidationSplitter<U, I> that I developed. I named it CrossValidationSplitterIterative<U, I>. Instead of caching 5 folds in memory, I compute one fold at a time and write it promptly to the file (test or train respectively).
Let me know if you want it.
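The general idea of writing one fold at a time can be sketched as follows. This is not the CrossValidationSplitterIterative code itself, which is not shown in this thread; the class `StreamingFoldWriter`, its parameters and the per-rating random assignment are assumptions made for illustration. The input file is streamed line by line, and each line goes straight to the train or test file of the current fold, so no fold is ever held in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

/** Hypothetical sketch: writes the train/test files of a single fold in one streaming pass. */
public final class StreamingFoldWriter {

    /**
     * Reads the ratings file line by line and writes each line either to the
     * test file (if it is assigned to the given fold) or to the train file.
     */
    public static void writeFold(Path ratings, Path train, Path test,
                                 int fold, int numFolds, long seed) throws IOException {
        // Re-creating the generator with the same seed yields the same
        // line-to-fold assignment in every pass, so the folds stay disjoint.
        Random random = new Random(seed);
        try (BufferedReader in = Files.newBufferedReader(ratings);
             BufferedWriter trainOut = Files.newBufferedWriter(train);
             BufferedWriter testOut = Files.newBufferedWriter(test)) {
            String line;
            while ((line = in.readLine()) != null) {
                BufferedWriter out = random.nextInt(numFolds) == fold ? testOut : trainOut;
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

Calling `writeFold` once per fold costs one pass over the input for each fold, trading extra I/O for constant memory. Note that this sketch assigns individual ratings to folds; a per-user split would need a different assignment rule.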
from rival.
Sure @afcarvalho1991, you can do a pull request or upload it somewhere and paste the URL here. I think it can be useful to have an intermediate class that does not handle everything in memory for the n-fold case.
Thank you!
Alex
from rival.
Related to #60 since strategies take a lot of memory when loading from file.
One solution would be to skip the recommendation step as it is right now, where all the recommendations are generated and dumped into a file. Instead of this, only the recommendations needed for a strategy would be generated.
Also related to #83.
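A minimal sketch of what generating only the recommendations needed by a strategy could look like, instead of dumping every score to a file first. The class `OnDemandScoring`, the `scoreCandidates` method and the scorer function are hypothetical names chosen for this example and do not exist in RiVal; the sketch assumes a strategy can provide the set of candidate items for a user.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch: score only the items a strategy asks for, one user at a time. */
public final class OnDemandScoring {

    /**
     * Returns the predicted score for each candidate item of one user.
     * The scorer stands in for whatever recommender is being evaluated.
     */
    public static Map<Long, Double> scoreCandidates(long user, Set<Long> candidates,
                                                    ToDoubleBiFunction<Long, Long> scorer) {
        Map<Long, Double> scores = new LinkedHashMap<>();
        for (Long item : candidates) {
            scores.put(item, scorer.applyAsDouble(user, item));
        }
        return scores;
    }
}
```

With this shape, memory use is bounded by the candidate set of a single user rather than by the full user-item score matrix.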
from rival.
@afcarvalho1991 can you please do a pull request with your code and we'll see if we can merge it?
from rival.
That is great news! I assume it is because of the latest changes, which make use of the RankSys data representation /cc @saulvargas
from rival.
Hello, I pulled the latest version of the master branch. I now have a local branch with my modifications. How can I create a pull request? Can you help me?
My contribution is the implementation of a CrossValidationSplitter (Iterative) and a working test, which was adapted from the CrossValidatedMahoutKNNRecommenderEvaluator to create the CrossValidatedIterativeMahoutKNNRecommenderEvaluator.
Also, I would like to inform you that I am unable to execute the CrossValidatedMahoutKNNRecommenderEvaluator. Perhaps you need to review this class; there seems to be some sort of problem with the timestamps table.
Thank you,
André
from rival.
Hi André,
I guess it depends on whether you forked the repository and started making changes there (the preferred case) or simply cloned the repository, so that your changes are on the base branch.
Find here more info about this.
Alex
PS: I will check CrossValidatedMahoutKNNRecommenderEvaluator ASAP, thanks for noticing.
from rival.