One way to remove bots that were not excluded by the blacklist is to set a

Study how the quality depends on the hard identity size limit (identity-matching, closed)

src-d commented on June 20, 2024

Study how the quality depends on the hard identity size limit

warenlg commented on June 20, 2024

I achieved this task following these steps:

Collect repository_id, commit_hash, commit_author_name, and commit_author_email from 22 open source stacks using gitbase and a Python client.
Iterate through the 2019-05-01 GHTorrent dump and map every commit hash to the GHTorrent id of its author.
For each org, create a CSV file with repository_id, author_email, author_name, author_id.
For each org, create 10 identity matching tables in Parquet format by running match-identities with the --cache option pointing to the previous CSV file with the author_id column dropped, using 10 different values for the MaxIdentities parameter: [1, 5, 10, 20, 30, 40, 50, 100, 200, 500].
For each org and each generated identity table (so 22x10), build 2 identity graphs: (1) from the GHTorrent identity mapping, (2) from our own identity matching.
Compute precision and recall using the following definitions of false positives and false negatives, where pred_graph comes from our identity matching and ght_graph from the GHTorrent mapping.
- FP = set(pred_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
- FN = set(ght_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
Plot the precision and recall curves depending on MaxIdentities for each org.
Conclude that MaxIdentities=20 is a good trade-off between precision and recall.
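
The precision and recall step can be sketched roughly as below. This is only an illustration, not the script used for the study: it assumes an identity graph has one node per author signature (here just an email) and one undirected edge between every pair of signatures assigned to the same identity. The helper names `identity_edges` and `precision_recall`, the toy data, and the choice to return 1.0 for an empty edge set are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def identity_edges(assignment):
    """Return undirected edges linking signatures that share an identity.

    `assignment` maps a signature (here an email string) to an identity id.
    This graph construction is an assumption, not taken from the thread.
    """
    groups = defaultdict(list)
    for signature, identity in assignment.items():
        groups[identity].append(signature)
    edges = set()
    for members in groups.values():
        # Sort so that each undirected edge has one canonical representation.
        edges.update(combinations(sorted(members), 2))
    return edges


def precision_recall(pred_edges, ght_edges):
    """Compute precision and recall from the FP/FN definitions above."""
    tp = pred_edges & ght_edges
    fp = pred_edges - tp  # predicted links absent from GHTorrent
    fn = ght_edges - tp   # GHTorrent links that were missed
    # Returning 1.0 for an empty edge set is a convention chosen here.
    precision = len(tp) / (len(tp) + len(fp)) if pred_edges else 1.0
    recall = len(tp) / (len(tp) + len(fn)) if ght_edges else 1.0
    return precision, recall


# Hypothetical toy data: GHTorrent says a, b, c are the same person.
ght = {"a@x.org": 1, "b@x.org": 1, "c@x.org": 1, "d@x.org": 2}
pred = {"a@x.org": 10, "b@x.org": 10, "c@x.org": 11, "d@x.org": 11}

precision, recall = precision_recall(identity_edges(pred), identity_edges(ght))
print(precision, recall)
```

On this toy input the predicted graph has two edges, one of which is also in the GHTorrent graph, and the GHTorrent graph has three edges, so precision is 0.5 and recall is 1/3. Repeating the computation for each MaxIdentities value and each org would give the points of the curves described above.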