ghuser.io's database
License: MIT License
We use S3 to expose our DB over HTTP. Our DB is made of many JSON files; for 3k users we will have about 260k files, which we refresh daily.
1000 PUTs to S3 cost $0.005, so this upload will cost us about $40/month. On top of that come storage costs and the transfer costs incurred by users fetching the data, but those are low.
We could, for example, replace these $40 with about $5 by running a t2.nano that exposes the files over HTTP.
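A quick back-of-the-envelope check of that figure, using the numbers above (260k files, refreshed once a day, $0.005 per 1000 PUTs):

```javascript
// Rough monthly S3 PUT cost for refreshing ~260k files once a day.
const FILES = 260000;
const DAYS_PER_MONTH = 30;
const USD_PER_1000_PUTS = 0.005;

const putsPerMonth = FILES * DAYS_PER_MONTH; // 7,800,000 PUTs
const monthlyCost = (putsPerMonth / 1000) * USD_PER_1000_PUTS;
console.log(monthlyCost.toFixed(2)); // ≈ 39 USD/month
```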
We are now consistently hitting:
✔ Fetched NixOS/nixpkgs's contributors
⠙ Fetching NixOS/nixpkgs's pull requests...Error while fetching https://api.github.com/repos/NixOS/nixpkgs/pulls?state=all&page=1&per_page=100
API rate limit state:
{ core: { limit: 5000, remaining: 4996, reset: 1537441414 },
search: { limit: 30, remaining: 30, reset: 1537437875 },
graphql: { limit: 5000, remaining: 5000, reset: 1537441415 } }
Promise {
<rejected> Body {
url:
'https://********:********@api.github.com/repos/NixOS/nixpkgs/pulls?state=all&page=1&per_page=10$&client_id=********&client_secret=********',
status: 500,
statusText: 'Internal Server Error',
headers: Headers { _headers: [Object] },
ok: false,
body:
PassThrough {
_readableState: [ReadableState],
readable: true,
_events: [Object],
_eventsCount: 1,
_maxListeners: undefined,
_writableState: [WritableState],
writable: false,
allowHalfOpen: true,
_transformState: [Object] },
bodyUsed: false,
size: 0,
timeout: 0,
_raw: [],
_abort: false } }
although our code retries a few times on 500.
We need per-month contribution counts so that the front end can show a timeline.
e.g. at https://github.com/ghuser-io/db/blob/a9c116e/data/repos/reframejs/reframe.json#L49
"contributors": {
"AurelienLourot": 11,
"tdfranklin": 22,
"brillout": 2038,
"yuxal": 4
},
should become something like
"contributors_2": {
"AurelienLourot": {
"2018-01": 1,
"2018-03": 5,
"2018-04": 5,
}
"tdfranklin": {
"2018-02": 1,
"2018-04": 10,
"2018-05": 11,
},
...
},
Note: it makes sense to work on this and #6 at the same time.
@brillout would you rather have the data per week instead? Weeks and months aren't aligned, and I think you might have a harder time on the frontend if the data is per week.
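A minimal sketch of how the crawler could build these monthly buckets from commit timestamps (the author names and dates below are made up, and the exact field names will depend on which API we fetch commits from):

```javascript
// Group ISO commit timestamps into "YYYY-MM" buckets, per contributor.
function bucketByMonth(commitDates) {
  const buckets = {};
  for (const date of commitDates) {
    const month = date.slice(0, 7); // "2018-03-02T09:30:00Z" -> "2018-03"
    buckets[month] = (buckets[month] || 0) + 1;
  }
  return buckets;
}

// Hypothetical input: commit dates per author.
const commitsByAuthor = {
  AurelienLourot: ['2018-01-03T10:00:00Z', '2018-03-02T09:30:00Z', '2018-03-15T12:00:00Z'],
};
const contributors2 = {};
for (const [author, dates] of Object.entries(commitsByAuthor)) {
  contributors2[author] = bucketByMonth(dates);
}
console.log(contributors2); // { AurelienLourot: { '2018-01': 1, '2018-03': 2 } }
```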
Although we have a mechanism in place to avoid hitting the rate limit for our API key, it seems that we're now hitting another rate limit (this time probably specific to my user):
✔ FabioBaroni/awesome-exploit-development hasn't changed
✖ Fetching Fachschaft07/skriptinat0r7's contributors...
{ message: 'API rate limit exceeded for user ID ********.',
documentation_url: 'https://developer.github.com/v3/#rate-limiting' }
Promise { <rejected> 403 }
/home/ubuntu/db/impl/scriptUtils.js:9
throw e;
^
403
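The rate-limit state printed in these logs includes a reset field (an epoch in seconds). One option, sketched here with a hypothetical helper, is to sleep until that timestamp instead of throwing on the 403:

```javascript
// How long to wait before retrying, given the "reset" epoch (in seconds)
// reported by GitHub's rate-limit state.
function msUntilReset(resetEpochSeconds, nowMs = Date.now()) {
  const waitMs = resetEpochSeconds * 1000 - nowMs;
  return Math.max(waitMs, 0) + 1000; // add 1s of slack after the reset
}

// e.g. await new Promise(resolve => setTimeout(resolve, msUntilReset(reset)));
```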
In this repo we now have various JS, Bash, and Python scripts, all calling each other. It has become hard to maintain; a more uniform CLI would be nice.
Better to address this after the scaling issues, which are causing many CLI changes at the moment.
e.g.
✔ tortkis/mew-unread hasn't changed
⠹ Fetching torvalds/linux's contributors... [commit page 7818]
This file containing all orgs we know about will be about 5.6 MB once we serve 3k users. In the current design, the front end has to download this large file for each profile.
GitHub's REST and GraphQL APIs have separate rate limits, so a good move might be to start using both.
Here's an example of how to query all commits of a giant repo using GraphQL in our code base: https://github.com/ghuser-io/db/blob/aure-graphql/fetchRepos.js#L50
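For reference, a plausible shape of such a paginated query (this is a sketch, not a copy of fetchRepos.js; the field names are from GitHub's GraphQL v4 schema):

```javascript
// Paginated commit-history query: fetch 100 commits at a time, following
// pageInfo.endCursor until hasNextPage is false.
const query = `
  query($owner: String!, $name: String!, $cursor: String) {
    repository(owner: $owner, name: $name) {
      defaultBranchRef {
        target {
          ... on Commit {
            history(first: 100, after: $cursor) {
              pageInfo { endCursor hasNextPage }
              nodes { committedDate author { user { login } } }
            }
          }
        }
      }
    }
  }`;

// Each response tells us whether to keep paging:
function nextCursor(response) {
  const history = response.data.repository.defaultBranchRef.target.history;
  return history.pageInfo.hasNextPage ? history.pageInfo.endCursor : null;
}
```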
It seems that, now that the DB contains 28k repos, fetchRepos.js often runs out of memory at random places on machines with 0.5 GB of RAM.
✔ volrath/treepy.el hasn't changed
⠦ Fetching volumio/Volumio2's contributors...Killed
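One mitigation (just a sketch, not what the code does today) would be to keep only a small batch of repos in memory at a time:

```javascript
// Yield fixed-size slices so we never hold all 28k repos at once.
function* batches(items, size) {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Hypothetical usage:
// for (const batch of batches(allRepos, 100)) { await fetchBatch(batch); }
```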
e.g. https://github.com/StefanescuCristian/hammerhead :
Maybe we should have some default values for time and earned stars for such repos?
In order to be able to crawl more data per day despite GitHub's API rate limit, we could look into making use of these sources, if not too expensive:
git clone-ing repos.

data/ is growing.
VFS for Git (VFS4G) is a tool that could eventually allow us to get the commit history of huge repos.
Both GitLab and GitHub are currently implementing support for VFS4G.
(GitHub: https://stackoverflow.com/questions/37684028/how-to-clone-fetch-a-repo-getting-only-the-history and GitLab: https://gitlab.com/gitlab-org/gitlab-ce/issues/27895)
Also, the GitHub API gives us the disk usage of a repo. This means we can git clone small repos and get their commit history.
✖ Fetching StefanescuCristian/ubuntu-bfsq's contributors...
{ message:
'The history or contributor list is too large to list contributors for this repository via the API.',
documentation_url: 'https://developer.github.com/v3/repos/#list-contributors' }
Promise { <rejected> 403 }
/home/ubuntu/db/impl/scriptUtils.js:9
throw e;
^
403
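repository.diskUsage in the GraphQL API is reported in kilobytes, so the crawler could gate cloning on it. A sketch; the cutoff below is an arbitrary placeholder, not a measured limit:

```javascript
// Only clone repos whose reported disk usage is small enough.
const MAX_CLONE_KB = 100 * 1024; // arbitrary cutoff: ~100 MB

function shouldClone(diskUsageKb) {
  return diskUsageKb <= MAX_CLONE_KB;
}
```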
With the current solution (JSON files) we can't parallelize much, because if two crawlers happen to edit the same file at the same time, the file might end up corrupted.
Maybe we can find a database (relational or not) that is not too expensive?
After ghuser-io/ghuser.io#158 the following fields can be removed from the data, as they are not used by the front end anymore (double-check again when picking this up):
popularity
maturity
activity
*total_score*
e.g. https://rawgit.com/lynxaegon/ghuser.io.settings/master/ghuser.io.json
403 Forbidden
RawGit will soon shut down and is no longer serving new repos. Please visit https://rawgit.com for more details.
Fetched lynxaegon's popular forks
Fetching lynxaegon's settings...
Promise { <rejected> 403 }
Error: Promise rejected with value: 403
In the dataset, commits are included in events called PushEvent (https://developer.github.com/v3/activity/events/types/#pushevent).
Among others, a PushEvent contains:
Key | Type | Description |
---|---|---|
commits | array | An array of commit objects describing the pushed commits. (The array includes a maximum of 20 commits. If necessary, you can use the Commits API to fetch additional commits. This limit is applied to timeline events only and isn't applied to webhook deliveries.) |
commits[][sha] | string | The SHA of the commit. |
commits[][message] | string | The commit message. |
commits[][author] | object | The git author of the commit. |
commits[][author][name] | string | The git author's name. |
commits[][author][email] | string | The git author's email address. |
commits[][url] | url | URL that points to the commit API resource. |
commits[][distinct] | boolean | Whether this commit is distinct from any that have been pushed before. |
There are several limitations:
We could still use the dataset to get a list of repos per user. I expect this list of repos to be mostly exhaustive as:
We can also use the dataset for repos whose first PushEvent has fewer than 20 commits: in that case we can be confident that the repo started public. Then missing a couple of commits is probably OK: the approximate commit stats would likely still be good enough to categorize users as "maintainer"/"gold contrib"/"silver contrib"/"bronze contrib" and to show a contribution timeline.
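A sketch of how commit counts could be derived from those PushEvents. The layout follows the table above (each event carries a commits array); matching authors by email is an assumption, since the embedded git author isn't linked to a GitHub login:

```javascript
// Count distinct commits authored by one email across PushEvents.
// Caveat from the table above: only up to 20 commits are embedded per push,
// so large pushes are undercounted.
function countCommits(pushEvents, authorEmail) {
  let count = 0;
  for (const event of pushEvents) {
    for (const commit of event.commits) {
      if (commit.distinct && commit.author.email === authorEmail) count++;
    }
  }
  return count;
}
```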