Now we have just a number that joins all person's identities together. But we should a

Research the detection of the primary name for an identified person (identity-matching, closed)

src-d commented on September 22, 2024

Research the detection of the primary name for an identified person

from identity-matching.

Comments (8)

vmarkovtsev commented on September 22, 2024

The primary user names are included in GHTorrent.
@EgorBu Can you please attach the best identity-matching results on GitHub here?

vmarkovtsev commented on September 22, 2024

I collected the dataset of full names: /user/internal-datasets/full_names

Random strategy

46% accuracy if identities without a correct name are counted, ~82% if they are excluded.

Biggest number of commits strategy

56% accuracy if identities without a correct name are counted, 99.9% if they are excluded.

99.9% is high enough that there is no need to dig deeper. We should count the commits for each name, sort the names by that count, and take the top one. Problem solved.
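
A minimal sketch of the two strategies, assuming each identity comes with the list of author names seen on its commits. The function names and the sample data are hypothetical and not taken from the identity-matching code, and the random baseline here picks among distinct names, which is also an assumption:

```python
import random
from collections import Counter


def primary_name_by_commits(commit_names):
    """Return the name attached to the most commits of one identity."""
    counts = Counter(commit_names)
    # most_common() sorts by count, descending; ties keep first-seen order.
    return counts.most_common(1)[0][0] if counts else None


def primary_name_random(commit_names, rng=random):
    """Baseline: pick one of the distinct names uniformly at random."""
    names = sorted(set(commit_names))
    return rng.choice(names) if names else None


# Hypothetical identity: one author name per commit.
commits = ["Jane Doe", "jdoe", "Jane Doe", "Jane Doe", "jane"]
print(primary_name_by_commits(commits))  # prints "Jane Doe"
```

The commit-count version needs only one pass over the commits of an identity, so it is cheap to run on the full dataset.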

EgorBu commented on September 22, 2024

I will launch the extraction of identities and dump the results here.

Update:
An old result is here: /user/egorbu/idmatching/cache/res_name_threshold_5_email_threshold_28_data_aggregated_deduplicated.pkl. It was produced on the initial dataset (not the full one).
I will launch the extraction on the full dataset.
