Comments (9)

irinakhismatullina commented on September 22, 2024

I've looked through the commit. At first sight, the only thing that could be affecting performance is this.

Here, instead of selecting DISTINCT commit signatures, they select all of them to compute the stats. If no deduplication was added after that, and all commit signatures are now considered identities, this could lead to such an effect...

from identity-matching.

irinakhismatullina commented on September 22, 2024

It is easy to check: if performance improves after adding this single DISTINCT here, this is it and we know what to fix :)

@warenlg Is it easy for you to try this in your experiment setup?

UPD: Oh, in the current version it would be useless now, as we have started extracting dates. Is it possible for you to check it at that first breaking revision?


warenlg commented on September 22, 2024

Yes, this brings back the good old performance, but indeed DISTINCT would be useless today since we collect the commit_date and, above all, the commit_hash.
We have to check whether commit_hash is really useful, store commit_date more efficiently (for example, all dates in the same row instead of duplicating name and email each time), and also check whether there is a hidden quadratic complexity somewhere.
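
To illustrate why DISTINCT stops helping once commit_date and commit_hash are selected, here is a minimal sketch using an in-memory SQLite database. The table name, column names, and sample rows are assumptions made up for the example; they are not taken from the actual identity-matching query.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    "author_name TEXT, author_email TEXT, commit_date TEXT, commit_hash TEXT)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("Alice", "alice@example.com", "2019-01-01", "a1"),
        ("Alice", "alice@example.com", "2019-02-01", "a2"),
        ("Alice", "alice@example.com", "2019-03-01", "a3"),
        ("Bob", "bob@example.com", "2019-01-15", "b1"),
    ],
)


def count_rows(query):
    """Return the number of rows a query produces."""
    return len(conn.execute(query).fetchall())


# DISTINCT over the signature alone collapses repeated authors.
signatures_only = count_rows(
    "SELECT DISTINCT author_name, author_email FROM commits"
)

# Every commit has its own hash, so DISTINCT no longer removes anything.
with_date_and_hash = count_rows(
    "SELECT DISTINCT author_name, author_email, commit_date, commit_hash "
    "FROM commits"
)

print(signatures_only, with_date_and_hash)
```

With the per-commit columns included, the number of rows grows with the number of commits rather than with the number of distinct signatures.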


irinakhismatullina commented on September 22, 2024

Awesome!

Do we really collect commit_hash? I haven't seen it at all.

As for commit_date, nothing complex is required: the dates can be discarded right after the stats calculation (which always finishes before reducing), so deduplication can be added easily and the performance will be shiny.
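
A rough sketch of that idea, assuming the raw rows arrive as (name, email, commit_date) tuples with ISO-formatted date strings. The function name, the stats fields, and the sample data are hypothetical and only show the shape of "compute stats first, then keep one entry per signature":

```python
def collapse_signatures(rows):
    """Aggregate per-signature stats, dropping the individual dates.

    rows: iterable of (name, email, commit_date) tuples, where commit_date
    is an ISO 8601 string, so plain string comparison orders dates correctly.
    Returns a dict keyed by (name, email), one entry per unique signature.
    """
    stats = {}
    for name, email, commit_date in rows:
        key = (name, email)
        entry = stats.get(key)
        if entry is None:
            stats[key] = {
                "commits": 1,
                "first": commit_date,
                "last": commit_date,
            }
        else:
            entry["commits"] += 1
            entry["first"] = min(entry["first"], commit_date)
            entry["last"] = max(entry["last"], commit_date)
    return stats


# Hypothetical sample data.
raw_rows = [
    ("Alice", "alice@example.com", "2019-02-01"),
    ("Alice", "alice@example.com", "2019-01-01"),
    ("Bob", "bob@example.com", "2019-01-15"),
    ("Alice", "alice@example.com", "2019-03-01"),
]
deduplicated = collapse_signatures(raw_rows)
print(len(raw_rows), "rows ->", len(deduplicated), "signatures")
```

Only the aggregated entries would need to be passed on to the later reducing step, so its input size depends on the number of signatures, not the number of commits.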


warenlg commented on September 22, 2024

Yes, we do, but I'm not sure why yet: f1c6749


vmarkovtsev commented on September 22, 2024

Will a GROUP BY work?


vmarkovtsev commented on September 22, 2024

Yeah, we collect hashes because we need them for the GitHub API.


vmarkovtsev commented on September 22, 2024

I fixed this with a GROUP BY, as I suggested.
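
The actual query is not shown in this thread, so the following is only a sketch of what a GROUP BY over commit signatures might look like, again on a made-up SQLite table. The schema, the aggregate columns, and the choice of MIN(commit_hash) as a single representative hash per signature are all assumptions for illustration.

```python
import sqlite3

# Hypothetical schema and data, not the project's real table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    "author_name TEXT, author_email TEXT, commit_date TEXT, commit_hash TEXT)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("Alice", "alice@example.com", "2019-01-01", "a1"),
        ("Alice", "alice@example.com", "2019-02-01", "a2"),
        ("Alice", "alice@example.com", "2019-03-01", "a3"),
        ("Bob", "bob@example.com", "2019-01-15", "b1"),
    ],
)

# One output row per signature: the stats are aggregated in SQL, and a single
# hash is kept per signature (assumed here to be enough for a later lookup).
grouped = conn.execute(
    "SELECT author_name, author_email, COUNT(*), "
    "MIN(commit_date), MAX(commit_date), MIN(commit_hash) "
    "FROM commits "
    "GROUP BY author_name, author_email "
    "ORDER BY author_name, author_email"
).fetchall()

for row in grouped:
    print(row)
```

Compared with a plain SELECT of every commit, the result has one row per (name, email) pair while still carrying the date statistics and a hash.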


warenlg commented on September 22, 2024

Just tested the performance again, and indeed it runs way faster now.

