Comments (4)
It turns out the identity graphs of Intel and IBM were pretty big: 80k and 11k edges respectively. Reducing the proportion of popular names decreased the number of false positives and false negatives, since popular identities tend to be the ones with problems. That's why, by increasing the popularity threshold from 5 to 100, we improved our precision and recall from ~62% to 94% for both organizations.
from identity-matching.
We cannot increase the popularity threshold too much, though; otherwise we start losing precision at some point.
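For anyone who wants to reproduce this kind of threshold sweep, here is a minimal sketch in Go. It is not the project's actual code: `popularNames`, `precisionRecall`, the `>=` comparison, and the name counts are all hypothetical and made up for illustration. It assumes a name counts as "popular" once its occurrence count reaches the threshold, so raising the threshold shrinks the popular set.

```go
package main

import "fmt"

// popularNames returns the names whose occurrence count reaches the threshold.
// Assumption of this sketch: "popular" means count >= threshold.
func popularNames(counts map[string]int, threshold int) map[string]bool {
	popular := make(map[string]bool)
	for name, n := range counts {
		if n >= threshold {
			popular[name] = true
		}
	}
	return popular
}

// precisionRecall computes precision = TP/(TP+FP) and recall = TP/(TP+FN).
// A ratio with a zero denominator is reported as 0.
func precisionRecall(tp, fp, fn int) (precision, recall float64) {
	if tp+fp > 0 {
		precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		recall = float64(tp) / float64(tp+fn)
	}
	return precision, recall
}

func main() {
	// Made-up counts, not taken from the real popular names list.
	counts := map[string]int{"name-a": 120, "name-b": 40, "name-c": 3}

	for _, threshold := range []int{5, 100} {
		popular := popularNames(counts, threshold)
		fmt.Printf("threshold=%d popular=%d\n", threshold, len(popular))
	}

	// Made-up confusion counts, only to show the arithmetic.
	p, r := precisionRecall(8, 2, 2)
	fmt.Printf("precision=%.2f recall=%.2f\n", p, r)
}
```

Running the matching pipeline once per candidate threshold and feeding the resulting true/false positive counts into `precisionRecall` would show where precision starts to drop.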
Great data analysis, Waren 👍 Let's use your recommended threshold of 100 and update the CSVs/embedded Go code.
Thanks, I just opened the PR this morning: #59
Now closed.