The output of the many compute_filter_similarity cele


Persist in progress match data (anonlink-entity-service issue, closed)

hardbyte commented on May 30, 2024

Persist in progress match data

from anonlink-entity-service.

Comments (3)

hardbyte commented on May 30, 2024

Writing a result file for every task doesn't seem like the right approach either: many of them will be tiny or empty, and it couples the number and size of the matching tasks to the number and size of the result files that will need to be accessed in the next steps.

Some type of buffered queue might be a nice solution: when the buffer reaches a certain size (e.g. 500MiB) we write a binary file to disk.
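To make the buffering idea concrete, here is a minimal sketch. It assumes each match is an `(index_a, index_b, score)` triple packed into a fixed-size record; the record layout, the `BufferedResultWriter` class and the local-directory output are all hypothetical and not taken from anonlink-entity-service.

```python
import os
import struct

# Hypothetical record layout: two uint32 indices and a float64 score (16 bytes).
RECORD = struct.Struct("<IId")


class BufferedResultWriter:
    """Accumulate packed records in memory and spill to a new file
    whenever the buffer reaches max_bytes."""

    def __init__(self, directory, max_bytes=500 * 1024 * 1024):
        self.directory = directory
        self.max_bytes = max_bytes
        self.buffer = bytearray()
        self.filenames = []

    def add(self, index_a, index_b, score):
        self.buffer += RECORD.pack(index_a, index_b, score)
        if len(self.buffer) >= self.max_bytes:
            self.flush()

    def flush(self):
        """Write any buffered records to a new file; no-op if the buffer is empty."""
        if not self.buffer:
            return
        name = f"partial-result-part-{len(self.filenames) + 1}"
        path = os.path.join(self.directory, name)
        with open(path, "wb") as f:
            f.write(self.buffer)
        self.filenames.append(path)
        self.buffer = bytearray()
```

The caller would need to call `flush()` once at the end so the last, partially filled buffer is also written out.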


hardbyte commented on May 30, 2024

Until resolved, this issue limits us to mappings where the sparse similarity score matrix fits in memory.

Regarding the buffered queue idea, it might be worth reading up on Kombu, which Celery uses under the hood.

On further reflection, I think writing a binary result file for each task is okay, as long as it is followed up by a task that aggregates these files together (until each aggregate reaches a minimum size).

The tasks (let's say there are 10k) each compare their chunks from A and B. Each task uploads a file into the object store (say between 0 and 10KiB) and outputs a simple filename and match count. The reduce task first combines these results into larger files (say 500MiB each), then saves an array of the final filenames to the database.

['partial-result-part-1', 'partial-result-part-2', ...]
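The reduce step described above could look roughly like the sketch below. It uses the local filesystem as a stand-in for the object store, and the function name `aggregate_partial_results` and its parameters are hypothetical; saving the returned list to the database is left out.

```python
import os


def aggregate_partial_results(task_outputs, out_dir, target_bytes=500 * 1024 * 1024):
    """Concatenate small per-task files into larger parts.

    task_outputs: iterable of (filename, match_count) pairs, one per task.
    Returns the list of aggregated filenames, in order.
    """
    final_files = []
    current = None  # currently open aggregate file, if any
    written = 0
    try:
        for filename, match_count in task_outputs:
            if match_count == 0:
                continue  # task found no matches, nothing to copy
            if current is None:
                name = f"partial-result-part-{len(final_files) + 1}"
                path = os.path.join(out_dir, name)
                current = open(path, "wb")
                final_files.append(path)
                written = 0
            with open(filename, "rb") as src:
                data = src.read()  # per-task files are assumed to be small
            current.write(data)
            written += len(data)
            if written >= target_bytes:
                current.close()
                current = None
    finally:
        if current is not None:
            current.close()
    return final_files
```

Plain concatenation only works if the per-task files have no header and all use the same fixed-size record format, which is an assumption here.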

When the similarity score results are requested as JSON, we can create one output stream from these multiple partial result files. Any following tasks can also pull the data from this one stream (e.g. greedy matching for the mapping view type).
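A single stream over several partial result files can be sketched with a generator. As before, the 16-byte `(index_a, index_b, score)` record layout and the name `stream_similarity_scores` are assumptions for illustration, and local files stand in for the object store.

```python
import struct

# Hypothetical record layout: two uint32 indices and a float64 score (16 bytes).
RECORD = struct.Struct("<IId")


def stream_similarity_scores(filenames, chunk_records=4096):
    """Yield (index_a, index_b, score) tuples from each file in turn,
    reading a bounded number of records at a time."""
    for filename in filenames:
        with open(filename, "rb") as f:
            while True:
                chunk = f.read(RECORD.size * chunk_records)
                if not chunk:
                    break
                # Assumes every file holds a whole number of records.
                yield from RECORD.iter_unpack(chunk)
```

Because this is a generator, a JSON serializer or a later task such as greedy matching can consume it without ever holding the full similarity matrix in memory.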


hardbyte commented on May 30, 2024

This was added as part of v1.8.1 in PR #213
