Comments (9)
I've looked through the commit; at first sight, the only thing that could affect this is this.
Here, instead of selecting DISTINCT commit signatures, they select all of them to count the stats. If no deduplication was added after that, and all unique commit signatures are now considered as identities, this may lead to such an effect...
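To illustrate the concern, here is a minimal sketch using an in-memory SQLite table. The table name, column names, and data are assumptions made up for the example; the project's real query and schema may differ. It only shows how many rows reach the matcher with and without DISTINCT.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (author_name TEXT, author_email TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [
        ("Alice", "alice@example.com"),
        ("Alice", "alice@example.com"),
        ("Alice", "alice@example.com"),
        ("Bob", "bob@example.com"),
    ],
)

# Without DISTINCT: one row per commit.
all_rows = conn.execute(
    "SELECT author_name, author_email FROM commits"
).fetchall()

# With DISTINCT: one row per unique signature.
distinct_rows = conn.execute(
    "SELECT DISTINCT author_name, author_email FROM commits"
).fetchall()

print(len(all_rows), len(distinct_rows))
```

If every row were treated as a separate identity, the input to matching would grow with the number of commits rather than with the number of unique signatures.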
from identity-matching.
It is easy to check: if performance improves after adding this single DISTINCT here, this is it and we know what to fix :)
@warenlg Is it easy for you to try it in your experiment setup?
UPD: Oh, in the current version it will be useless now, as we have started extracting dates. Is it possible for you to check it at the first breaking revision?
from identity-matching.
Yes, this brings back the good old performance, but indeed DISTINCT would be useless today since we collect the commit_date and, above all, the commit_hash.
We have to check whether commit_hash is really useful, store the commit_date more efficiently (for example, all dates in the same row instead of duplicating the name and email each time), and also check whether there is a hidden quadratic complexity somewhere.
from identity-matching.
Awesome!
Do we really collect commit_hash? I haven't seen it at all.
As for commit_date, nothing complex is required: the dates can be discarded right after the stats calculation (which is always finished before reducing), so deduplication can be added easily and the performance will be shiny.
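A rough sketch of that idea in plain Python follows. The row layout and the particular stats (commit count, first and last date) are assumptions for the example, not the project's actual stats; the point is only that the per-commit dates are consumed once and a single record per signature survives.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw rows: (name, email, commit_date), one per commit.
rows = [
    ("Alice", "alice@example.com", date(2019, 1, 5)),
    ("Alice", "alice@example.com", date(2019, 3, 1)),
    ("Bob", "bob@example.com", date(2019, 2, 10)),
]


def collapse(rows):
    """Compute per-signature stats, then keep one record per signature."""
    dates = defaultdict(list)
    for name, email, day in rows:
        dates[(name, email)].append(day)
    # The individual dates are only needed here; only aggregates are kept.
    return {
        sig: {"commits": len(days), "first": min(days), "last": max(days)}
        for sig, days in dates.items()
    }


signatures = collapse(rows)
print(len(signatures))
```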
from identity-matching.
Yes, we do, but I'm not sure why yet: f1c6749
from identity-matching.
Will a GROUP BY work?
from identity-matching.
Yeah we collect hashes because we need them for GitHub API.
from identity-matching.
I fixed this with the GROUP BY as I suggested.
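For readers who want a picture of what a GROUP BY over signatures could look like, here is a minimal sketch against an in-memory SQLite table. The schema, the aggregates, and the choice of keeping one sample hash per signature are assumptions for illustration; the actual fix in the repository is not shown in this thread.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    "author_name TEXT, author_email TEXT, commit_date TEXT, commit_hash TEXT)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("Alice", "alice@example.com", "2019-01-05", "aaa111"),
        ("Alice", "alice@example.com", "2019-03-01", "bbb222"),
        ("Bob", "bob@example.com", "2019-02-10", "ccc333"),
    ],
)

# One output row per signature; dates and hashes are aggregated
# instead of multiplying the rows.
query = """
SELECT author_name,
       author_email,
       COUNT(*)         AS commits,
       MIN(commit_date) AS first_commit,
       MAX(commit_date) AS last_commit,
       MIN(commit_hash) AS sample_hash
FROM commits
GROUP BY author_name, author_email
ORDER BY author_name
"""
grouped = conn.execute(query).fetchall()
for row in grouped:
    print(row)
```

Unlike DISTINCT, this keeps working when per-commit columns such as commit_date and commit_hash are selected, because they are reduced to aggregates per signature.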
from identity-matching.
Just tested the performance again, and indeed it runs way faster now.
from identity-matching.