I've been benchmarking the performance of match-identities…

Performance dropped critically (identity-matching issue, closed)

warenlg commented on September 25, 2024

Performance dropped critically

from identity-matching.

Comments (9)

irinakhismatullina commented on September 25, 2024

I've looked through the commit; at first sight, the only change that could affect performance is this one.

Here, instead of selecting DISTINCT commit signatures, they select all of them to count the stats. If no deduplication was added after that, and all unique commit signatures are now considered as identities, this may lead to such an effect...
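
To make the suspected effect concrete, here is a minimal sketch using an in-memory SQLite database. The `commits` table, its column names, and the sample rows are all assumptions for illustration, not the project's actual schema or query. It only shows how many more rows reach the matching step once DISTINCT is dropped.

```python
import sqlite3

# Hypothetical schema and data; the real table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (author_name TEXT, author_email TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [
        ("Alice", "alice@example.com"),
        ("Alice", "alice@example.com"),
        ("Alice", "alice@example.com"),
        ("Bob", "bob@example.com"),
        ("Bob", "bob@example.com"),
    ],
)

# Without DISTINCT: one row per commit, so each signature repeats.
all_rows = conn.execute(
    "SELECT author_name, author_email FROM commits"
).fetchall()

# With DISTINCT: one row per unique (name, email) signature.
distinct_rows = conn.execute(
    "SELECT DISTINCT author_name, author_email FROM commits"
).fetchall()

print(len(all_rows))       # 5
print(len(distinct_rows))  # 2
```

On a real repository history the gap between the two counts would be far larger, since prolific authors have thousands of commits with the same signature.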

irinakhismatullina commented on September 25, 2024

It is easy to check: if performance improves after adding this single DISTINCT here, this is it and we know what to fix :)

@warenlg Is it easy for you to try it in your experiment setup?

UPD: Oh, in the current version it would be useless, since we started extracting dates. Could you check it at the first breaking revision instead?

warenlg commented on September 25, 2024

Yes, this brings back the good old performance, but indeed DISTINCT would be useless today, since we collect commit_date and, above all, commit_hash.
We have to check whether commit_hash is really useful, store commit_date more efficiently (for example, all dates in the same row instead of duplicating name and email each time), and also check whether there is a hidden quadratic complexity somewhere.

irinakhismatullina commented on September 25, 2024

Awesome!

Do we really collect commit_hash? Haven't seen it at all.

As for commit_date, nothing complex is required: the dates can be discarded right after the stats calculation (which always finishes before reducing), so deduplication can be added easily and the performance will be shiny.
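
A minimal sketch of that ordering, in plain Python: compute the stats over every row first, then drop the dates and deduplicate the signatures before any matching happens. The row layout, the function name `compute_stats_then_dedup`, and the choice of a per-signature commit count as the "stats" are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical rows as a query might return them: (name, email, commit_date).
rows = [
    ("Alice", "alice@example.com", "2019-01-03"),
    ("Alice", "alice@example.com", "2019-02-11"),
    ("Bob", "bob@example.com", "2019-01-20"),
    ("Alice", "alice@example.com", "2019-03-07"),
]


def compute_stats_then_dedup(rows):
    """Count commits per signature, then discard dates and duplicates."""
    # Step 1: the stats need every row, duplicates included.
    commit_counts = Counter((name, email) for name, email, _date in rows)
    # Step 2: dates are no longer needed; dict.fromkeys removes
    # duplicates while keeping first-seen order.
    signatures = list(
        dict.fromkeys((name, email) for name, email, _date in rows)
    )
    return commit_counts, signatures


commit_counts, signatures = compute_stats_then_dedup(rows)
print(signatures)
print(commit_counts[("Alice", "alice@example.com")])
```

Only `signatures` would then be passed on to the matching step, so its cost depends on the number of unique signatures rather than the number of commits.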

warenlg commented on September 25, 2024

Yes, we do, but I'm not sure why yet: f1c6749

vmarkovtsev commented on September 25, 2024

Will a GROUP BY work?

vmarkovtsev commented on September 25, 2024

Yeah, we collect hashes because we need them for the GitHub API.

vmarkovtsev commented on September 25, 2024

I fixed this with the GROUP BY as I suggested.
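
For readers who want to see the general shape of such a fix, here is a hedged sketch of a GROUP BY that yields one row per signature while still aggregating the per-commit data. It is not the actual query from the fix. The schema, the column aliases, and the idea of keeping a single sample hash per signature are assumptions.

```python
import sqlite3

# Hypothetical schema and data; not the project's real table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    "author_name TEXT, author_email TEXT, commit_hash TEXT, commit_date TEXT)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("Alice", "alice@example.com", "a1", "2019-01-03"),
        ("Alice", "alice@example.com", "a2", "2019-02-11"),
        ("Bob", "bob@example.com", "b1", "2019-01-20"),
    ],
)

# One output row per (name, email); dates and hashes are folded into
# aggregates instead of multiplying the number of rows.
grouped = conn.execute(
    """
    SELECT author_name,
           author_email,
           COUNT(*)         AS commits,
           MIN(commit_date) AS first_commit,
           MAX(commit_date) AS last_commit,
           MIN(commit_hash) AS sample_hash
    FROM commits
    GROUP BY author_name, author_email
    ORDER BY author_name
    """
).fetchall()

for row in grouped:
    print(row)
```

Compared with DISTINCT, grouping keeps the signature unique even though commit_date and commit_hash differ on every row, which is the case DISTINCT could no longer handle.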

warenlg commented on September 25, 2024

Just tested the performance again, and indeed it runs way faster now.
