src-d / identity-matching


source{d} extension to match Git signatures to real people.

License: GNU General Public License v3.0

Go 65.97% Makefile 2.78% Python 28.37% Dockerfile 0.93% Shell 1.18% TSQL 0.76%

identity-matching personal-info

identity-matching's Introduction

Identity Matching source{d} Extension

Match different identities of the same person using 🤖. Extension for source{d}.

Overview • How To Use • Science • Contributions • License

Overview

People use different e-mails and names (aka identities) when they commit their work to Git. E-mails can be corporate, personal, or special ones such as users.noreply.github.com. Names can include a surname or not, contain typos, or be missing altogether. Thus, to get precise information about a developer, we need to gather all of their identities and separate them from other people's identities. That's what we call Identity Matching.

How To Use

Right now no pre-built binaries are available. Please refer to the How to build section to build an executable from the source code.

Run match-identities --help to see all the parameters that you can configure.

There are two use cases supported for match-identities.

With gitbase
Without gitbase

In both cases, the output identity table is saved as a Parquet file. Read more in the Output format section.

Use with gitbase

match-identities is designed to be used with gitbase. First, make sure you have a gitbase instance running with all the repositories you are going to analyze. Please refer to the gitbase documentation for more information.

Usage example:

match-identities --output matched_identities.parquet

The credentials can be configured with the --host, --port, --user and --password flags.

For example, the following SQL gitbase query will return the identities of each commit author:

SELECT DISTINCT repository_id, commit_author_name, commit_author_email
FROM commits;

If you want to cache the gitbase output, you can use the --cache flag. After the identities are fetched from gitbase, the matching process runs. Read the Science section to learn more.

Use without gitbase

If you run match-identities with the --cache option enabled, you get a CSV file with the cached gitbase output. Moreover, if you already have a list of identities, you can run match-identities without gitbase at all: create a CSV file with the columns repo, email and name, then pass it to the --cache parameter.

Usage example:

match-identities \
    --cache path/to/csv/file.csv \
    --output matched_identities.parquet

Output format

Once the algorithm finishes merging identities, you get a table with 4 columns:

id (int64) -- unique identifier of the person with the corresponding identity.
email (utf8) -- e-mail of the identity.
name (utf8) -- name of the identity.
repo (utf8) -- repository of the commit.

The columns email, name and repo may contain empty values, which mean "no constraint". For example, consider this output identity table:

id,email,name,repo
1,[email protected],"",""
1,"",alice,""
2,[email protected],"",""
2,"",bob,""
2,[email protected],"",""
2,"",no-name,bob/bobs-project

There are two developers. Let's name them Alice (with id 1) and Bob (with id 2). When we analyze a commit with [email protected] as author email, then the author is Alice. The repository and author name are ignored since the author email is the most reliable way to define an identity. On the other hand, when we analyze a commit with alice as an author name, then the author is Alice for whatever combination of email and repository. Same for Bob, although he uses two different email addresses [email protected] and [email protected]. If we come across a commit with the no-name author name in bob/bobs-project repository then it is Bob's.
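The lookup rules described above can be sketched in Python. This is only an illustration, not code from the project: the function name resolve_author, the in-memory row layout and the example.com addresses are assumptions made for this example.

```python
from typing import Optional

# Hypothetical identity table rows: (id, email, name, repo).
# An empty string means "no constraint" on that column.
IDENTITIES = [
    (1, "alice@example.com", "", ""),
    (1, "", "alice", ""),
    (2, "bob@example.com", "", ""),
    (2, "", "bob", ""),
    (2, "bob@work.example.com", "", ""),
    (2, "", "no-name", "bob/bobs-project"),
]


def resolve_author(email: str, name: str, repo: str) -> Optional[int]:
    """Return the person id for a commit signature, or None if unknown."""
    # The email is checked first because it is the most reliable signal.
    for pid, row_email, _, _ in IDENTITIES:
        if row_email and row_email == email:
            return pid
    # Otherwise fall back to the name, honouring a repository constraint if any.
    for pid, row_email, row_name, row_repo in IDENTITIES:
        if row_email or not row_name:
            continue
        if row_name == name and row_repo in ("", repo):
            return pid
    return None
```

With this table, a commit signed as no-name resolves to Bob only inside bob/bobs-project; in any other repository it stays unmatched.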

Convert parquet to CSV

You can convert the output Parquet file to CSV using the Python script in the research directory:

python3 ./research/parquet2csv.py matched_identities.parquet

The result will be saved as matched_identities.csv. Please note that pyspark must be installed.

External matching option

If the organization uses GitHub, GitLab or Bitbucket, it is possible to use their API to match identities by email. In that case, 2 columns are added and filled for every email in the table: the external id provider and the external id itself.

How to build

git clone https://github.com/src-d/identity-matching
cd identity-matching
make build

You'll see two directories with Linux and macOS binaries inside the build directory.

Science

There are two stages to match identities. The first is the precomputation which is run once on the whole dataset and remains unchanged during the subsequent steps. The second is the matching itself.

Precomputation:
1. Gather 2 lists of the most popular names and emails (by frequencies) on the whole dataset.
2. Gather 2 lists of emails and names that will be ignored (aka blacklists) on the whole dataset. They are non-human identities and usually related to CI, bots, etc.
Analysis:
1. Gather the list of triplets {email, name, repository} from all the commits using gitbase.
2. Remove any triplet whose name or email belongs to the blacklists.
3. Merge identities with the same e-mail if it doesn't belong to the list of popular emails created in Precomputation step 1.
4. Merge identities with the same name if it doesn't belong to the list of popular names created in Precomputation step 1. When the name belongs to this list, we replace it with the tuple (name, repository).
5. Save the resulting identity table in the desired output format.
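
The analysis steps above can be sketched with a union-find structure. This is a minimal illustration under stated assumptions, not the project's implementation (which is written in Go): the function name match_identities and its parameters are hypothetical, and popular emails are simply never used as a merge key because the text does not say what else happens to them.

```python
from collections import defaultdict


def match_identities(triplets, popular_emails=frozenset(), popular_names=frozenset(),
                     blacklisted_emails=frozenset(), blacklisted_names=frozenset()):
    """Group (email, name, repo) triplets into people; returns a list of sets."""
    # Step 2: drop non-human identities.
    kept = [t for t in triplets
            if t[0] not in blacklisted_emails and t[1] not in blacklisted_names]
    parent = list(range(len(kept)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}  # merge key -> index of the first triplet carrying it
    for i, (email, name, repo) in enumerate(kept):
        keys = []
        # Step 3: merge on the email unless it is a popular one.
        if email and email not in popular_emails:
            keys.append(("email", email))
        # Step 4: merge on the name; a popular name only merges within one repository.
        if name:
            keys.append(("name", name, repo) if name in popular_names else ("name", name))
        for key in keys:
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    groups = defaultdict(set)
    for i, triplet in enumerate(kept):
        groups[find(i)].add(triplet)
    return list(groups.values())
```

Each returned set corresponds to one person; assigning integer ids to the sets and writing them out would be step 5.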

There is a Design Document (or a Blueprint, or whatever else you are used to calling project documentation) which goes into more detail: link.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

GPL 3.0, see LICENSE. Y u no Apache/MIT? Read here.

identity-matching's People

Contributors

Stargazers

Watchers

Forkers

vmarkovtsev warenlg smola se7entyse7en irinakhismatullina jeroenherczeg ronhab allenfernzz gagliardetto neomatrix369 isabella232 rfesi syllogy

identity-matching's Issues

Release v1

The code was battle-tested in the demo and worked flawlessly.

I suggest using the Chromium-like versioning schema here because we are building an application, not a library. So our versions would be v1, v2, ..., v100500.

@smola to approve this versioning proposal
@zurk to release v1

gitbase spark connector support

Now we can work with pure gitbase only. We should add support for https://github.com/src-d/gitbase-spark-connector-enterprise

Find a way to distinguish regular users from bots

We can take a rule-based approach as a baseline: the email contains the word "bot" or "no-reply". However, there are emails like [email protected] that are hard to detect this way, so some ML should be applied to find them. Commit time-series features can be used.
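
A minimal sketch of such a rule-based baseline follows. The function name looks_like_bot is hypothetical, and treating "noreply" (without the hyphen) as a marker is an extra assumption that is not part of the issue text.

```python
import re


def looks_like_bot(email: str) -> bool:
    """Rule-based baseline: flag emails containing the word 'bot' or a no-reply marker."""
    lowered = email.lower()
    # Split on non-alphanumerics so that "bot" must be a whole word:
    # "ci-bot@..." matches, "abbott@..." does not.
    tokens = re.split(r"[^a-z0-9]+", lowered)
    has_bot_word = "bot" in tokens
    # Assumption: "noreply" is accepted alongside the "no-reply" spelling.
    has_no_reply = "no-reply" in lowered or "noreply" in lowered
    return has_bot_word or has_no_reply
```

Note that the no-reply rule also flags humans who commit with users.noreply.github.com addresses, which is one reason this can only serve as a benchmark.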

Assume the output format is parquet when the output path points to a parquet file

Right now the default output format is postgres, so a command like the following (taken from the README, https://github.com/src-d/eee-identity-matching#use-without-gitbase):

match-identities \
    --cache path/to/csv/file.csv \
    --output matched_identities.parquet

will write the output to postgres even though the output path points to a parquet file.
I think we should set the output format to parquet when such an output path is provided; otherwise the arguments are redundant.

Determine the primary work email for an identified person

Related issue: Determine a primary name for an identified person #28

The motivation is the same as in #28: we want to know the primary email and not just an id.
It looks like different approaches will work for name and email, so there is a separate issue for each of them.

Plan:

Get the data for proper validation. We may use the GitHub API, but many users do not make their email publicly available.
Test simple heuristics like
- take the most frequent email, excluding blacklisted hosts such as gmail, inbox, etc. (those are more likely to be personal emails).
- ???
If simple heuristics do not perform well, train something like GBDT.
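
The first heuristic in the plan could look like the sketch below. The host list PERSONAL_HOSTS and the function name primary_work_email are assumptions for illustration only; the real blacklist would have to come from data.

```python
from collections import Counter

# Hypothetical list of free-mail hosts that are treated as "personal".
PERSONAL_HOSTS = {"gmail.com", "inbox.ru"}


def primary_work_email(commit_emails):
    """Return the most frequent email whose host is not in PERSONAL_HOSTS, or None."""
    counts = Counter(
        e.lower() for e in commit_emails
        if "@" in e and e.lower().rsplit("@", 1)[1] not in PERSONAL_HOSTS
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The input is one email per commit of an already matched person, so frequency here means "number of commits signed with that address".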

Include the external identifiers into the result

@warenlg said that it would really help if we saved the GitHub (GitLab, BitBucket, etc.) ids into the parquet file. We should add two more columns:

External id provider, one of github, gitlab, bitbucket.
External id

These two columns should be filled for every email in the table.

Add README.md

Debug the bot detection pipeline

A full pipeline has been proposed by @EgorBu to identify bots from user identities but some bugs seem to have sneaked into the code since the precision and recall have dropped.

So first, let's recover good accuracy, then try different sampling strategies, and finally try to simplify the pipeline a bit by removing unnecessary steps/features.

Save and load the bot detection model from modelforge

Once the bot detection model has been trained and has reached good performance, we have to save it in the ASDF format and upload it to modelforge. The corresponding script was submitted in PR #73. The tree includes:

The booster XGBoost model
The dict of parameters used to train the model
The BPE model

Make the project open-source

As the mgmt decided, eee-identity-matching becomes open-source.

Measure quality on several organizations

To better understand the quality of the approach, let's measure performance in a near-real-life scenario:

collect a list of organizations of different scales
apply eee-identity-matching to collect identities based on the current approach
evaluate it using ground truth data from ghtorrent

Print version at startup

Because @EgorBu had a hard time detecting the cached Docker image build.

Extract commit date for stats filtering

We need to add date extraction to allow calculating email/name stats over a given time period. This will help to avoid noise.

Alter Docker image so it dumps output to a defined folder

Later preprocessing steps in sourced-demo rely on the Parquet files generated by this. Right now, if you use the Dockerfile, they are dumped to /, which is not convenient.

The simplest solution would be to dump them to a fixed directory (e.g. /data); adding WORKDIR /data to the Dockerfile before running identities2sql.sh would probably work. If you want to get fancy, you can get the path from an environment variable.

This would allow us to bind-mount this folder in both identity-matching's container and preprocessing.

Depends on https://travis-ci.com/src-d/sourced-demo/builds/132103418 passing, though.

Request the external API only once

If the email is missing and it is not possible to find users for a given email, we query the API each time we see this name.

You can see it in the logs, where the same email shows up in repeated warning messages:

WARN[0843] unable to find users for email: [email protected]
WARN[0843] no matches for person :matthew kocher||[email protected].
...
WARN[0944] unable to find users for email: [email protected]
WARN[0944] no matches for person :matthew kocher||[email protected].
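
One way to avoid the repeated requests is to cache misses as well as hits (negative caching). The sketch below is hypothetical: the class CachingMatcher and the lookup callable stand in for whatever wraps the external API, and nothing here is taken from the project's code.

```python
class CachingMatcher:
    """Wraps a lookup function and remembers misses as well as hits."""

    _MISSING = object()  # sentinel: distinguishes "never asked" from "asked, not found"

    def __init__(self, lookup):
        self._lookup = lookup  # callable: email -> user id, or None when not found
        self._cache = {}

    def match(self, email):
        cached = self._cache.get(email, self._MISSING)
        if cached is not self._MISSING:
            return cached  # may be None, i.e. a remembered miss
        result = self._lookup(email)
        self._cache[email] = result  # store misses too, so the API is queried once
        return result
```

The sentinel matters: checking "email in cache" with a plain None default would treat every remembered miss as uncached and query the API again.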

Detect the primary name of an identified person

As #28 suggests, we need to count commits per name for each identity with more than one name and perform the detection.

We have to generate the second Parquet file:

Identity index
Primary name

Please note: the primary name should not be lower-cased. The easiest thing to do is to capitalize the first letter in each word.
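
A minimal sketch of this detection, assuming the input is one name per commit for a single identity. The function name primary_name is hypothetical.

```python
from collections import Counter


def primary_name(names_per_commit):
    """Pick the most frequent name and capitalize the first letter of each word."""
    counts = Counter(n for n in names_per_commit if n)
    if not counts:
        return ""
    name = counts.most_common(1)[0][0]
    # Only the first letter of each word is upper-cased; the rest is left as is.
    return " ".join(w[:1].upper() + w[1:] for w in name.split())
```

For example, an identity with two commits by "alice smith" and one by "asmith" gets the primary name "Alice Smith".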

Replace the blacklists and popular lists with values autogenerated from non-Go files

There are a few lists in the blacklist.go file that should be replaced with proper values, such as blacklistedEmails, ignoredDomains, etc.

Proper values should be taken from some kind of data files (like CSV), and the corresponding Go files should be generated from them.

Use the GitHub/GitLab/Bitbucket API if applicable.

There is a todo in code that should be addressed:
https://github.com/src-d/eee-identity-matching/blob/c4128f7bee0d8f0da124b307cdbba83568acfb3b/matching.go#L27
In a few words, we should use an external API to match a person's identities if applicable.

Approach:

Gather ids (like the GitHub user id) for all the emails we collected and merge the corresponding identities.
If there are emails left without an id, we run the regular matching process, but only for those entries.

Add csv output format

Currently parquet is the only output format, but sometimes CSV is more convenient. So we should add a --csv option to the CLI for that.

Bad precision and recall (~60%) on IBM and intel open source stacks

Following up on #17 and #30, where the performance of the identity merging algorithm was evaluated on 22 different open source stacks. We noticed particularly bad performance on 2 organizations, IBM and intel, with ~60% precision and recall.

This needs to be investigated because nearly all the other organizations are above 90% precision and recall, and we should be able to promise an acceptable score (at least 90%) on all organizations.

Research the detection of the primary name for an identified person

Right now we only have a number that joins all of a person's identities together. But we should also set the name of the person.

one thing I've learned from the current demo and identity matching is that we need a simple way for the algorithm to determine a primary name

Plan:

Get the data for proper validation. We may use the GitHub API. We should also check the data quality, because what people set in their GitHub profile and what they use in commits can differ. We cannot predict something that does not appear in commits.
Test simple heuristics like
- take the most frequent name
- take the longest name
- ???
If simple heuristics do not perform well, train something like GBDT.

panic: json: unsupported value: NaN

Using current master (bbaf008) from observability demo repository, and the organization carlosms-test-org, it fails with this log:

identity-matching_1      | time="2019-10-03T08:54:47Z" level=info msg="Using caching for external matching" cachePath=cache-external.csv
identity-matching_1      | time="2019-10-03T08:54:47Z" level=info msg="Dumping CachedMatcher cache."
identity-matching_1      | time="2019-10-03T08:54:47Z" level=info msg="looking for people in commits"
identity-matching_1      | time="2019-10-03T08:54:47Z" level=info msg="not cached in cache-raw.csv, loading from the database"
identity-matching_1      | time="2019-10-03T08:54:47Z" level=info msg="caching the result to cache-raw.csv"
identity-matching_1      | time="2019-10-03T08:54:47Z" level=info msg="found people" elapsed=9.72824ms people=1
identity-matching_1      | time="2019-10-03T08:54:47Z" level=info msg="reducing people"
identity-matching_1      | time="2019-10-03T08:54:48Z" level=warning msg="unable to find users for email: [email protected]"
identity-matching_1      | time="2019-10-03T08:54:48Z" level=warning msg="no matches for person :carlos martin||[email protected]"
identity-matching_1      | time="2019-10-03T08:54:48Z" level=info msg="reduced people" elapsed=889.145294ms people=1
identity-matching_1      | time="2019-10-03T08:54:48Z" level=info msg="primary names are set" elapsed="2.417µs"
identity-matching_1      | time="2019-10-03T08:54:48Z" level=info msg="primary emails are set" elapsed="1.465µs"
identity-matching_1      | time="2019-10-03T08:54:48Z" level=info msg="storing people"
identity-matching_1      | time="2019-10-03T08:54:48Z" level=info msg="stored people" elapsed=5.782553ms path=identities
identity-matching_1      | panic: json: unsupported value: NaN
identity-matching_1      | 
identity-matching_1      | goroutine 1 [running]:
identity-matching_1      | github.com/src-d/identity-matching/reporter.Write()
identity-matching_1      | 	/go/src/identity-matching/reporter/reporter.go:38 +0x10b
identity-matching_1      | main.main()
identity-matching_1      | 	/go/src/identity-matching/cmd/match-identities/main.go:111 +0x120a

Use more efficient API for GitHub

GitHub support currently uses the search API, but it could use other APIs with higher quotas. For example, given a commit, it could use the get commit API: Get a single commit.

If I got it right, search is limited to 30 requests per minute (1800 per hour) and getting a commit should be 5000 per hour.

v3.1.0 doesn't seem to finish after running on writeas org

+ match-identities --output identities --host gitbase --port 3306 --user root --password  --external github --api-url  --token fb49750df6df11bdeb8a8483cdd2475a15025e55 --max-identities 20 --months 12 --min-count 5
INFO[0000] Using cache for external matching             cachePath=cache-external.csv
INFO[0000] Dumping CachedMatcher cache                  
INFO[0000] looking for people in commits                
INFO[0000] signatures are not cached in cache-raw.csv, loading them from the database 
INFO[0000] writing the signatures cache to cache-raw.csv 
INFO[0000] found people                                  elapsed=259.214864ms people=1962
INFO[0000] reducing people                              
WARN[0002] unable to find users by commit for email: [email protected] 
WARN[0002] no matches for person :michael demetriou||[email protected] 
WARN[0003] no matches for person :michael demetriou||[email protected] 
WARN[0003] no matches for person :michael demetriou||[email protected] 
WARN[0003] no matches for person :michael demetriou||[email protected] 
WARN[0004] unable to find users by commit for email: [email protected] 
WARN[0004] no matches for person :brad koehn||[email protected] 
WARN[0005] no matches for person :michael demetriou||[email protected] 
WARN[0005] no matches for person :michael demetriou||[email protected] 
WARN[0005] no matches for person :michael demetriou||[email protected] 
WARN[0005] no matches for person :michael demetriou||[email protected] 
WARN[0005] no matches for person :michael demetriou||[email protected] 
WARN[0005] no matches for person :michael demetriou||[email protected] 
WARN[0005] no matches for person :michael demetriou||[email protected] 
WARN[0006] no matches for person :michael demetriou||[email protected] 
WARN[0006] no matches for person :michael demetriou||[email protected] 
WARN[0006] no matches for person :michael demetriou||[email protected] 
WARN[0006] no matches for person :michael demetriou||[email protected] 
WARN[0006] no matches for person :michael demetriou||[email protected] 
WARN[0006] no matches for person :michael demetriou||[email protected] 
WARN[0006] unable to find users by commit for email: [email protected] 
WARN[0006] no matches for person :ben overmyer||[email protected] 
INFO[0006] Dumping CachedMatcher cache

(dlv) grs -t
  Goroutine 1 - User: /usr/local/go/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4400f7)
	 0  0x00000000004309e0 in runtime.gopark
	     at /usr/local/go/src/runtime/proc.go:305
	 1  0x0000000000440390 in runtime.goparkunlock
	     at /usr/local/go/src/runtime/proc.go:310
	 2  0x0000000000440390 in runtime.semacquire1
	     at /usr/local/go/src/runtime/sema.go:144
	 3  0x00000000004400f7 in sync.runtime_SemacquireMutex
	     at /usr/local/go/src/runtime/sema.go:71
	 4  0x000000000046804c in sync.(*Mutex).lockSlow
	     at /usr/local/go/src/sync/mutex.go:138
	 5  0x0000000000469667 in sync.(*Mutex).Lock
	     at /usr/local/go/src/sync/mutex.go:81
	 6  0x0000000000469667 in sync.(*RWMutex).Lock
	     at /usr/local/go/src/sync/rwmutex.go:98
	 7  0x0000000000702199 in github.com/src-d/identity-matching/external.(*safeUserCache).LoadFromDisk
	     at /go/src/external/cache.go:162
	 8  0x00000000007028f4 in github.com/src-d/identity-matching/external.safeUserCache.DumpOnDisk
	     at /go/src/external/cache.go:201
	 9  0x0000000000701b54 in github.com/src-d/identity-matching/external.CachedMatcher.DumpCache
	     at /go/src/external/cache.go:77
	10  0x0000000000701b54 in github.com/src-d/identity-matching/external.(*CachedMatcher).MatchByCommit
	     at /go/src/external/cache.go:124
	(truncated)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x0000000000430897 in runtime.goparkunlock
	    at /usr/local/go/src/runtime/proc.go:310
	2  0x0000000000430897 in runtime.forcegchelper
	    at /usr/local/go/src/runtime/proc.go:253
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x0000000000419893 in runtime.goparkunlock
	    at /usr/local/go/src/runtime/proc.go:310
	2  0x0000000000419893 in runtime.runfinq
	    at /usr/local/go/src/runtime/mfinal.go:175
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 4 - User: /usr/local/go/src/runtime/sigqueue.go:147 os/signal.signal_recv (0x44422c)
	0  0x000000000045d863 in runtime.futex
	    at /usr/local/go/src/runtime/sys_linux_amd64.s:536
	1  0x000000000042c146 in runtime.futexsleep
	    at /usr/local/go/src/runtime/os_linux.go:44
	2  0x000000000040c086 in runtime.notetsleep_internal
	    at /usr/local/go/src/runtime/lock_futex.go:174
	3  0x000000000040c28c in runtime.notetsleepg
	    at /usr/local/go/src/runtime/lock_futex.go:228
	4  0x000000000044422c in os/signal.signal_recv
	    at /usr/local/go/src/runtime/sigqueue.go:147
	5  0x00000000004c5e82 in os/signal.loop
	    at /usr/local/go/src/os/signal/signal_unix.go:23
	6  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 5 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000043f82b in runtime.selectgo
	    at /usr/local/go/src/runtime/select.go:313
	2  0x0000000000458fb8 in runtime.ensureSigM.func1
	    at /usr/local/go/src/runtime/signal_unix.go:549
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 6 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000044c70b in runtime.goparkunlock
	    at /usr/local/go/src/runtime/proc.go:310
	2  0x000000000044c70b in runtime.timerproc
	    at /usr/local/go/src/runtime/time.go:303
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 18 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x00000000004237b1 in runtime.goparkunlock
	    at /usr/local/go/src/runtime/proc.go:310
	2  0x00000000004237b1 in runtime.bgsweep
	    at /usr/local/go/src/runtime/mgcsweep.go:89
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 19 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x0000000000423079 in runtime.goparkunlock
	    at /usr/local/go/src/runtime/proc.go:310
	2  0x0000000000423079 in runtime.bgscavenge
	    at /usr/local/go/src/runtime/mgcscavenge.go:332
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 20 - User: /go/src/cmd/match-identities/main.go:45 main.main.func1 (0x78e9d4)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x0000000000407498 in runtime.goparkunlock
	    at /usr/local/go/src/runtime/proc.go:310
	2  0x0000000000407498 in runtime.chanrecv
	    at /usr/local/go/src/runtime/chan.go:524
	3  0x000000000040715b in runtime.chanrecv1
	    at /usr/local/go/src/runtime/chan.go:406
	4  0x000000000078e9d4 in main.main.func1
	    at /go/src/cmd/match-identities/main.go:45
	5  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 21 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000041d0ef in runtime.gcBgMarkWorker
	    at /usr/local/go/src/runtime/mgc.go:1837
	2  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 22 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000041d0ef in runtime.gcBgMarkWorker
	    at /usr/local/go/src/runtime/mgc.go:1837
	2  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 23 - User: /usr/local/go/src/runtime/lock_futex.go:228 runtime.notetsleepg (0x40c274)
	0  0x000000000045d863 in runtime.futex
	    at /usr/local/go/src/runtime/sys_linux_amd64.s:536
	1  0x000000000042c1c4 in runtime.futexsleep
	    at /usr/local/go/src/runtime/os_linux.go:50
	2  0x000000000040c12e in runtime.notetsleep_internal
	    at /usr/local/go/src/runtime/lock_futex.go:193
	3  0x000000000040c28c in runtime.notetsleepg
	    at /usr/local/go/src/runtime/lock_futex.go:228
	4  0x000000000044c781 in runtime.timerproc
	    at /usr/local/go/src/runtime/time.go:311
	5  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 34 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000041d0ef in runtime.gcBgMarkWorker
	    at /usr/local/go/src/runtime/mgc.go:1837
	2  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 35 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000041d0ef in runtime.gcBgMarkWorker
	    at /usr/local/go/src/runtime/mgc.go:1837
	2  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 36 - User: /usr/local/go/src/runtime/proc.go:305 runtime.gopark (0x4309e0)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000044c70b in runtime.goparkunlock
	    at /usr/local/go/src/runtime/proc.go:310
	2  0x000000000044c70b in runtime.timerproc
	    at /usr/local/go/src/runtime/time.go:303
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 50 - User: /usr/local/go/src/runtime/lock_futex.go:228 runtime.notetsleepg (0x40c274)
	0  0x000000000045d863 in runtime.futex
	    at /usr/local/go/src/runtime/sys_linux_amd64.s:536
	1  0x000000000042c1c4 in runtime.futexsleep
	    at /usr/local/go/src/runtime/os_linux.go:50
	2  0x000000000040c12e in runtime.notetsleep_internal
	    at /usr/local/go/src/runtime/lock_futex.go:193
	3  0x000000000040c28c in runtime.notetsleepg
	    at /usr/local/go/src/runtime/lock_futex.go:228
	4  0x000000000044c781 in runtime.timerproc
	    at /usr/local/go/src/runtime/time.go:311
	5  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 51 - User: /usr/local/go/src/database/sql/sql.go:1052 database/sql.(*DB).connectionOpener (0x599bf8)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000043f82b in runtime.selectgo
	    at /usr/local/go/src/runtime/select.go:313
	2  0x0000000000599bf8 in database/sql.(*DB).connectionOpener
	    at /usr/local/go/src/database/sql/sql.go:1052
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
  Goroutine 52 - User: /usr/local/go/src/database/sql/sql.go:1065 database/sql.(*DB).connectionResetter (0x599d2b)
	0  0x00000000004309e0 in runtime.gopark
	    at /usr/local/go/src/runtime/proc.go:305
	1  0x000000000043f82b in runtime.selectgo
	    at /usr/local/go/src/runtime/select.go:313
	2  0x0000000000599d2b in database/sql.(*DB).connectionResetter
	    at /usr/local/go/src/database/sql/sql.go:1065
	3  0x000000000045b931 in runtime.goexit
	    at /usr/local/go/src/runtime/asm_amd64.s:1357
[18 goroutines]
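
Goroutine 1 in the dump above is blocked in (*RWMutex).Lock inside LoadFromDisk, which was called from DumpOnDisk. One possible explanation, and this is an assumption since the source of cache.go is not shown here, is that DumpOnDisk already holds the same lock: Go mutexes are not reentrant, so the second acquisition never returns. The Python sketch below reproduces that pattern with threading.Lock, which is also non-reentrant, and shows the usual fix of splitting out a helper that expects the lock to be held. The class and method names are hypothetical.

```python
import threading


class UserCache:
    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like Go's sync.RWMutex
        self._data = {}

    def _load_unlocked(self):
        """Caller must already hold self._lock."""
        return dict(self._data)

    def load(self):
        with self._lock:
            return self._load_unlocked()

    def dump_deadlocking(self, timeout=0.1):
        """Mimics the suspected bug: re-acquiring a lock that is already held."""
        with self._lock:
            # A plain acquire() here would block forever; the timeout exists
            # only so that this demonstration terminates.
            acquired = self._lock.acquire(timeout=timeout)
            if acquired:
                self._lock.release()
            return acquired

    def dump_fixed(self):
        """Takes the lock once and calls the helper that does not lock."""
        with self._lock:
            return self._load_unlocked()
```

In dump_deadlocking the inner acquire times out and returns False, which corresponds to the goroutine parked in sync.runtime_SemacquireMutex.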

README misses the Travis badge

https://docs.travis-ci.com/user/status-images/

Build status images for private repositories include a security token.

Performance dropped critically

I've been benchmarking the performance of match-identities and found that I can no longer process some orgs that I used to process pretty quickly 2 months ago. For example, digitalocean used to take me 3 sec; now it hangs forever. The step that hangs is reducing identities.

The drop was introduced between v2.0.0 and v3.0.0. After going through the commit history, I found that it started at this commit, when the primary name feature was added.

The list of Git signatures to process now includes duplicates (to compute frequencies), plus the commit_date and the commit hash. As a result, we went from:

1907 to 45287 signatures for src-d, and from 1 sec to 60 sec (CPU time) to process
57K to 1.77M signatures for digitalocean, and from 3 sec to forever (CPU time) to process

If we want to keep this feature, we might want to optimize this a bit.
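
One possible optimization, sketched below under assumptions, is to collapse duplicate signatures into a count and a latest commit date before the reduction step, so that the expensive part sees each distinct signature once. The row layout and the function name collapse_signatures are hypothetical, and dates are assumed to be ISO-formatted strings so that string comparison orders them correctly.

```python
from collections import Counter


def collapse_signatures(rows):
    """rows: iterable of (repo, email, name, commit_date, commit_hash).

    Returns {(repo, email, name): (count, last_commit_date)}.
    """
    counts = Counter()
    latest = {}
    for repo, email, name, commit_date, _commit_hash in rows:
        key = (repo, email, name)
        counts[key] += 1
        # Keep only the most recent date per signature.
        if key not in latest or commit_date > latest[key]:
            latest[key] = commit_date
    return {key: (counts[key], latest[key]) for key in counts}
```

The frequencies needed for the primary name are preserved in the count, while the number of items passed on stays at the number of distinct signatures.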

Add progress bar on fetching people by SQL

We've run the app in the eng-demo environment. It worked, but it was hard for me and @warenlg to understand whether it had hung or not. We need to add a progress bar while we are fetching the results of the query.

Specifically, at https://github.com/src-d/eee-identity-matching/blob/master/people.go#L323 we should print a counter with fmt.Fprintf(os.Stderr, "%d\r", index + 1).

Gather quality metrics

Much like we already do in style-analyzer, we need to gather and report the quality metrics:

How many blacklisted names found. Emails, too.
How many raw persons found. Averages.
Graph metrics.
Result metrics: count, averages, precision, recall, f1, etc.
Pretty much everything else we can measure.

We should output them as a single-line JSON to stdout.
We need to ensure that the regular logs are written to stderr.
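
A minimal sketch of that stream separation, written in Python for illustration only. The metric names and the functions build_report and emit_report are assumptions; the zero-denominator guard is a design choice of this sketch so that the JSON never contains NaN.

```python
import json
import logging
import sys

# Regular logs go to stderr so that stdout carries only the metrics line.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger("quality-metrics-sketch")


def safe_ratio(numerator, denominator):
    """Return 0.0 instead of dividing by zero."""
    return numerator / denominator if denominator else 0.0


def build_report(true_pos, false_pos, false_neg, blacklisted_names, blacklisted_emails):
    precision = safe_ratio(true_pos, true_pos + false_pos)
    recall = safe_ratio(true_pos, true_pos + false_neg)
    f1 = safe_ratio(2 * precision * recall, precision + recall)
    return {
        "blacklisted_names": blacklisted_names,
        "blacklisted_emails": blacklisted_emails,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def emit_report(report, stream=sys.stdout):
    log.info("writing quality metrics")
    # json.dumps without indent produces a single line.
    stream.write(json.dumps(report, sort_keys=True) + "\n")
```

A consumer can then read stdout line by line and parse each line as JSON without having to filter out log messages.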

Incremental operation support

EPIC: src-d/backlog#1435

identity-matching already caches the raw input data. In order to be able to run continuously, we still miss the following:

Running loop.
Generating diff to the previous results.
Running src-d/parquetry as a service.
Supporting diffs in src-d/parquetry: this means supporting removing records.

Study how the quality depends on the hard identity size limit

One way to remove bots that were not excluded by the blacklist is to set a hard size threshold. That is, if the number of unique names in an identity is bigger than, say, 100, we do something with it, e.g. drop it completely or split it.

This issue is about plotting the dependency of our quality metrics on the size threshold when we drop.
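
The "drop" variant and the threshold sweep could be sketched as follows. The function names and the dictionary layout are hypothetical, and the sweep reports only the number of surviving identities; the real study would compute precision and recall at each threshold instead.

```python
def drop_oversized(identities, max_names=100):
    """identities: {person_id: set of names}. Drop people with more than max_names names."""
    return {pid: names for pid, names in identities.items() if len(names) <= max_names}


def sweep_thresholds(identities, thresholds):
    """Return [(threshold, number of surviving identities)], ready for plotting."""
    return [(t, len(drop_oversized(identities, t))) for t in thresholds]
```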

Update Design Document

There were some changes in the algorithm and the input/output formats that should be reflected in the Design Document.

Add another output format: Postgres

Santi said that if we were able to write the results directly to Postgres then it would greatly simplify the deployment. So we should ask for

output format - "parquet" or "postgres"
postgres host
postgres port
postgres user
postgres password
postgres db
postgres table

in the command line and write our results there.

The list of popular names is too large

Currently when running match-identities, we use a pre-compiled list of popular names.

This list is very large (55659 names), and it includes names that are obviously not popular: emanuele caprioli, ludovic menthiller, thomas flahault, bryce cuthriell, etc.

It looks like hyperopt has been run on a huge dataset that is not representative of the real use case, whereas the design document says that the use case of identity-matching should be one organization with fewer than 10k devs and repos. Thus, it looks like we have to adjust the threshold and recompile the list.

Consider committer data, not just commit author

A commit has a commit author and committer. They may be the same or not. It seems we are currently using only author data and not committer.