Comments (6)

Vika-F avatar Vika-F commented on June 1, 2024

@fleapapa
A1: Yes, you need to split the dataset into non-overlapping blocks before using the distributed version of implicit ALS.
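
As an illustration of such a split, here is a minimal sketch in plain C++ of one way to cut n row (or column) indices into contiguous, non-overlapping blocks. The helper `partitionIndices` and the `BlockRange` struct are hypothetical names, not DAAL functions, and the ceiling-division scheme is an assumption; it happens to be consistent with the block sizes logged further down in this thread (5438 rows in 4 blocks gives 1360, 1360, 1360, 1358).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of indices owned by one block.
struct BlockRange {
    std::size_t begin;
    std::size_t end;
};

// Split n indices into nBlocks contiguous, non-overlapping ranges.
// Each block holds ceil(n / nBlocks) indices; trailing blocks get what remains.
// Hypothetical helper for illustration only, not part of DAAL.
std::vector<BlockRange> partitionIndices(std::size_t n, std::size_t nBlocks) {
    std::vector<BlockRange> ranges;
    if (nBlocks == 0) return ranges;
    const std::size_t blockSize = (n + nBlocks - 1) / nBlocks;  // ceiling division
    for (std::size_t b = 0; b < nBlocks; ++b) {
        const std::size_t begin = std::min(b * blockSize, n);
        const std::size_t end = std::min(begin + blockSize, n);
        ranges.push_back({begin, end});
    }
    return ranges;
}
```

Applying the same helper to the rows and to the columns of the ratings table would give the two partitions (by users and by items) that the distributed algorithm works with.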

A2: With Intel DAAL you can verify the trained model by computing, for example, the RMSE on the training data set itself, that is, the RMSE between the training ratings and the model's predictions.
I have attached an example that shows the flow of the computations; see the testModelQuality() function in it.
To align the computations with MLlib, you can also provide a test data set for the RMSE computation. To do this, replace transposedDataTable[0], …, transposedDataTable[nBlocks - 1] with the numeric tables that contain the test ratings in CSR format.
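
For readers without the attachment, the RMSE idea can be sketched in plain C++ as follows. This is not the code from the attached example and does not use the DAAL API: `CsrMatrix` and `rmseOverStoredRatings` are hypothetical names, the indices are assumed to be zero-based, and the error is assumed to be taken only over the ratings that are actually stored in the CSR table.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal CSR container with zero-based indices (an assumption of this sketch).
struct CsrMatrix {
    std::size_t nRows;
    std::size_t nCols;
    std::vector<float> values;            // stored ratings
    std::vector<std::size_t> colIndices;  // column of each stored rating
    std::vector<std::size_t> rowOffsets;  // nRows + 1 entries
};

// RMSE between the stored ratings and a dense, row-major prediction matrix
// of shape nRows x nCols. Entries absent from the CSR table are ignored.
double rmseOverStoredRatings(const CsrMatrix& ratings,
                             const std::vector<float>& predictions) {
    double sumSquares = 0.0;
    std::size_t count = 0;
    for (std::size_t row = 0; row < ratings.nRows; ++row) {
        for (std::size_t k = ratings.rowOffsets[row]; k < ratings.rowOffsets[row + 1]; ++k) {
            const std::size_t col = ratings.colIndices[k];
            const double diff = static_cast<double>(ratings.values[k]) -
                                static_cast<double>(predictions[row * ratings.nCols + col]);
            sumSquares += diff * diff;
            ++count;
        }
    }
    return count == 0 ? 0.0 : std::sqrt(sumSquares / static_cast<double>(count));
}
```

If the CSR arrays you pass in are one-based, subtract one from each offset and column index before indexing. Passing test ratings instead of training ratings gives the held-out RMSE that is comparable with a train/test evaluation.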

impl_als_csr_distr_verify.zip

Best regards,
Victoriya

from onedal.

fleapapa avatar fleapapa commented on June 1, 2024

Thanks for the example code! It's very helpful.

Regarding

To align the computations with MLLib you may also provide test data set for RMSE computation. To do this, please replace transposedDataTable[0], …, transposedDataTable[nBlocks - 1] with the numeric tables that contain test ratings in CSR format.

why are test ratings placed in transposedData instead of both data and transposedData?

fleapapa avatar fleapapa commented on June 1, 2024

Hi Victoriya,

In your example code, I found two undefined member functions:

    size_t *colIndices = sparseBlock.getBlockColumnIndicesSharedPtr().get();
    size_t *rowOffsets = sparseBlock.getBlockRowIndicesSharedPtr().get();

The two functions, getBlockColumnIndicesSharedPtr and getBlockRowIndicesSharedPtr, seem to be available only in the 2018 Beta. I am using the 2017 release. May I just copy the latest header file from https://github.com/01org/daal/blob/92f4dde5a1e2d7f132111588f4513cc7c4578052/include/data_management/data/csr_numeric_table.h without any negative impact on my application?

Vika-F avatar Vika-F commented on June 1, 2024

@fleapapa

why are test ratings placed in transposedData instead of both data and transposedData?

Both the data and transposedData arrays define the same distributed numeric table. In the data array the table is split by rows (users), and in the transposedData array the table is split by columns (items). The code I have provided uses transposedData as the ground truth in the testModelQuality() function. That is why you need only transposedData to test the quality of the trained model.
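
To make the relationship between the two views concrete, here is a minimal sketch in plain C++ of transposing a CSR matrix, assuming (as the name suggests) that the transposedData blocks hold the items-by-users transpose of the users-by-items ratings table. `CsrMatrix` and `transposeCsr` are hypothetical names with zero-based indices; this is not how DAAL builds its tables.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR container with zero-based indices (an assumption of this sketch).
struct CsrMatrix {
    std::size_t nRows;
    std::size_t nCols;
    std::vector<float> values;
    std::vector<std::size_t> colIndices;
    std::vector<std::size_t> rowOffsets;  // nRows + 1 entries
};

// Build the CSR representation of the transpose: same ratings, rows and columns swapped.
CsrMatrix transposeCsr(const CsrMatrix& a) {
    const std::size_t nnz = a.values.size();
    CsrMatrix t{a.nCols, a.nRows,
                std::vector<float>(nnz),
                std::vector<std::size_t>(nnz),
                std::vector<std::size_t>(a.nCols + 1, 0)};
    // Count the stored entries in each column of a (each row of t).
    for (std::size_t c : a.colIndices) ++t.rowOffsets[c + 1];
    // Prefix sum turns the counts into row offsets of t.
    for (std::size_t i = 0; i < a.nCols; ++i) t.rowOffsets[i + 1] += t.rowOffsets[i];
    // Scatter every entry of a into its slot in t.
    std::vector<std::size_t> next(t.rowOffsets.begin(), t.rowOffsets.end() - 1);
    for (std::size_t row = 0; row < a.nRows; ++row) {
        for (std::size_t k = a.rowOffsets[row]; k < a.rowOffsets[row + 1]; ++k) {
            const std::size_t dst = next[a.colIndices[k]]++;
            t.values[dst] = a.values[k];
            t.colIndices[dst] = row;
        }
    }
    return t;
}
```

Both representations contain exactly the same ratings, so either one can serve as the ground truth, provided the predictions are laid out in the matching orientation.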

May i just copy the latest header file

It would be better not to copy a header file, but to modify the example to make it work with DAAL 2017. Please replace those two lines of code with the following code:

    size_t *colIndices = sparseBlock.getBlockColumnIndicesPtr();
    size_t *rowOffsets = sparseBlock.getBlockRowIndicesPtr();

Best regards,
Victoriya

fleapapa avatar fleapapa commented on June 1, 2024

Victoriya,

Thanks for the replacement code. It works :)

However, afterward my app crashed with an error on a call to free(). (I don't call free() myself.) If I comment out testModelQuality (and thus mergePredictions too), there is no crash.

I'm investigating the crash, and it is most likely caused by an incorrect shape of the matrix 'predictions'. I added some logging, which shows the following:

predictedRatings[0][0]: 1360, 2500
predictedRatings[0][1]: 1360, 2500
predictedRatings[0][2]: 1360, 2500
predictedRatings[0][3]: 1358, 2500
predictedRatings[1][0]: 1360, 2500
predictedRatings[1][1]: 1360, 2500
predictedRatings[1][2]: 1360, 2500
predictedRatings[1][3]: 1358, 2500
predictedRatings[2][0]: 1360, 2500
predictedRatings[2][1]: 1360, 2500
predictedRatings[2][2]: 1360, 2500
predictedRatings[2][3]: 1358, 2500
predictedRatings[3][0]: 1360, 2499
predictedRatings[3][1]: 1360, 2499
predictedRatings[3][2]: 1360, 2499
predictedRatings[3][3]: 1358, 2499

while 'predictions' is allocated with a shape of (5438, 9998). I don't know why it is not 9999, because my input matrix has a shape of (5438, 9999).

However, even if I manually change the statement

    HomogenNumericTable<float> predictions(nItems, nUsers, NumericTable::doAllocate);

to

    HomogenNumericTable<float> predictions(9999, nUsers, NumericTable::doAllocate);

the code still crashes.

By the way, the final RMSE is 0.79, which is unreasonably high (with Spark ML, it is only 0.11 :).

Most likely the incorrect shape of the matrices is the culprit behind these issues. I'm hunting it down...

fleapapa avatar fleapapa commented on June 1, 2024

Victoriya,

After making the following two changes to your example code, I finally got it working:

  1. Use dataTables[] instead of transposedDataTables[] in testModelQuality()
  2. Modify mergePredictions() according to the shapes of predictedRatingsMaster[][]
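
The second change can be illustrated with a minimal sketch in plain C++ that merges a grid of dense blocks into one matrix, deriving every offset from the actual block shapes rather than from a fixed block size. This is not the code from the attachment: `DenseBlock` and `mergeBlocks` are hypothetical names, and `blocks[r][c]` is assumed to be the block for row block r and column block c. In the log above, the first index of predictedRatings appears to select the column block and the second the row block, so the indices may need to be swapped.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Dense row-major matrix; used both for the blocks and for the merged result.
struct DenseBlock {
    std::size_t nRows;
    std::size_t nCols;
    std::vector<float> data;
};

// Merge blocks[r][c] (row block r, column block c) into one dense matrix.
// Throws if the shapes of the blocks do not line up.
DenseBlock mergeBlocks(const std::vector<std::vector<DenseBlock>>& blocks) {
    if (blocks.empty() || blocks[0].empty()) return DenseBlock{0, 0, {}};
    std::size_t totalRows = 0, totalCols = 0;
    for (const auto& rowOfBlocks : blocks) {
        if (rowOfBlocks.size() != blocks[0].size())
            throw std::invalid_argument("ragged grid of blocks");
        totalRows += rowOfBlocks[0].nRows;
    }
    for (const auto& b : blocks[0]) totalCols += b.nCols;

    DenseBlock merged{totalRows, totalCols, std::vector<float>(totalRows * totalCols, 0.0f)};
    std::size_t rowStart = 0;
    for (std::size_t r = 0; r < blocks.size(); ++r) {
        std::size_t colStart = 0;
        for (std::size_t c = 0; c < blocks[r].size(); ++c) {
            const DenseBlock& b = blocks[r][c];
            if (b.nRows != blocks[r][0].nRows || b.nCols != blocks[0][c].nCols ||
                b.data.size() != b.nRows * b.nCols)
                throw std::invalid_argument("inconsistent block shape");
            // Copy the block into its place, using offsets accumulated from real shapes.
            for (std::size_t i = 0; i < b.nRows; ++i)
                for (std::size_t j = 0; j < b.nCols; ++j)
                    merged.data[(rowStart + i) * totalCols + colStart + j] = b.data[i * b.nCols + j];
            colStart += b.nCols;
        }
        rowStart += blocks[r][0].nRows;
    }
    return merged;
}
```

Accumulating offsets this way handles a smaller last block (for example 1358 rows after three blocks of 1360) without writing past the end of the merged buffer, which is the kind of overrun that can surface later as a crash inside free().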

The RMSE is reduced only to 0.32 (from the previous 0.79) and is still higher than the one obtained with Spark ML, but I am very happy because my app works and doesn't crash now. And it's much faster than PySpark!

I will port the 'model selection' Python code I used with Spark to my DAAL ALS C++ app and see whether tuning some hyperparameters can bring the RMSE close to 0.11 :)

I have attached my changed code, FYI.
my-daal-als-changes.txt

I'm closing this issue.

Many thanks again!
