Comments (3)
Writing a result file for every task doesn't seem like the right approach either: many of them will be tiny or empty, and it couples the number and size of the matching tasks to the number and size of the result files that need to be accessed in the next steps.
Some type of buffered queue might be a nice solution: once it reaches a certain size (e.g. 500MiB), we write a binary file to disk.
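A minimal sketch of that buffering idea, assuming each result is an `(index_a, index_b, score)` triple packed into a fixed 16-byte record. The class name `BufferedResultWriter`, the record layout, and the use of a local directory are all hypothetical and not taken from the service's code:

```python
import os
import struct


class BufferedResultWriter:
    """Accumulate similarity records in memory and flush them to a
    binary file once the buffer reaches max_bytes (hypothetical helper)."""

    # Assumed record layout: two unsigned 32-bit indices and a 64-bit float.
    RECORD = struct.Struct("<IId")

    def __init__(self, directory, max_bytes=500 * 1024 * 1024):
        self.directory = directory
        self.max_bytes = max_bytes
        self.buffer = bytearray()
        self.filenames = []

    def add(self, index_a, index_b, score):
        """Append one record; flush when the size threshold is reached."""
        self.buffer += self.RECORD.pack(index_a, index_b, score)
        if len(self.buffer) >= self.max_bytes:
            self.flush()

    def flush(self):
        """Write any buffered records to a new file and reset the buffer."""
        if not self.buffer:
            return
        name = f"partial-result-part-{len(self.filenames) + 1}"
        path = os.path.join(self.directory, name)
        with open(path, "wb") as f:
            f.write(self.buffer)
        self.filenames.append(path)
        self.buffer = bytearray()
```

The caller would need a final `flush()` after the last record, so that a partially filled buffer is not lost.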
from anonlink-entity-service.
Until resolved, this issue limits us to mappings where the sparse similarity score matrix fits in memory.
Regarding the buffered queue idea, it might be worth reading up on Kombu, which Celery uses under the hood.
On further reflection, I think writing a binary result file for each task is okay, as long as it is followed by a task that aggregates these files together (up to a minimum size).
The tasks (let's say there are 10k) each compare their chunks from A and B. Each task uploads a file to the object store (say between 0 and 10KiB) and outputs a simple filename and match count. The reduce task first combines these results into larger files (each, say, 500MiB), then saves an array of the final filenames to the database.
['partial-result-part-1', 'partial-result-part-2', ...]
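The reduce step described above could look roughly like the following. This is only a sketch: it uses the local filesystem as a stand-in for the object store, and the function name `aggregate_partial_results` and its parameters are hypothetical:

```python
import os


def aggregate_partial_results(task_results, out_dir, min_size=500 * 1024 * 1024):
    """Concatenate small per-task files into larger files.

    task_results: iterable of (filename, match_count) pairs, as output
    by the comparison tasks. Returns the list of aggregated filenames.
    """
    final_files = []
    out = None
    written = 0
    try:
        for filename, match_count in task_results:
            if match_count == 0:
                continue  # nothing to copy from an empty result
            if out is None:
                name = f"partial-result-part-{len(final_files) + 1}"
                path = os.path.join(out_dir, name)
                out = open(path, "wb")
                final_files.append(path)
                written = 0
            # Per-task files are assumed small enough to read in one go.
            with open(filename, "rb") as src:
                data = src.read()
            out.write(data)
            written += len(data)
            if written >= min_size:
                # This output has reached the minimum size; start a new one.
                out.close()
                out = None
    finally:
        if out is not None:
            out.close()
    return final_files
```

The returned list is what would be saved to the database. The last file may be smaller than `min_size`, since it holds whatever is left over.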
When the similarity score results are requested as JSON, we can create one output stream from these multiple partial result files. Any following tasks can also pull the data from this one stream (e.g. greedy matching for the mapping view type).
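One way to present several partial files as a single stream is a pair of generators, sketched below. The 16-byte record layout, the function names, and the `similarity_scores` key are assumptions made for illustration, not the service's actual format:

```python
import json
import struct

# Assumed record layout: two unsigned 32-bit indices and a 64-bit float.
RECORD = struct.Struct("<IId")


def iter_records(filenames):
    """Yield (index_a, index_b, score) tuples from each file in turn."""
    for filename in filenames:
        with open(filename, "rb") as f:
            while True:
                raw = f.read(RECORD.size)
                if len(raw) < RECORD.size:
                    break  # end of file (or a truncated trailing record)
                yield RECORD.unpack(raw)


def stream_json(filenames):
    """Yield pieces of a JSON document without loading all records at once."""
    yield '{"similarity_scores": ['
    first = True
    for index_a, index_b, score in iter_records(filenames):
        prefix = "" if first else ", "
        yield prefix + json.dumps([index_a, index_b, score])
        first = False
    yield "]}"
```

A web response could be built from `stream_json(filenames)`, while a follow-up task such as greedy matching could consume `iter_records(filenames)` directly, so neither needs the full score matrix in memory.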
This was added as part of v1.8.1 in PR #213