db's Issues

S3 turns out to be too expensive

We use S3 to expose our DB over HTTP. Our DB is made of many JSON files; for 3k users we will have about 260k files, which we refresh daily.

1000 PUTs to S3 cost $0.005. Refreshing 260k files daily means roughly 7.8M PUTs per month, so this upload will cost us about $40/month. On top of that come the cost of storage and the cost incurred by users fetching the data, but these are low.

We could, for example, replace these $40 with only $5 by running a t2.nano that exposes the files over HTTP.
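As a back-of-the-envelope check (a sketch only, restating the arithmetic above):

  // Rough S3 upload cost estimate, using the numbers from this issue.
  const filesPerDay = 260000;      // JSON files refreshed daily
  const putPricePer1000 = 0.005;   // USD per 1000 PUT requests
  const daysPerMonth = 30;

  const putsPerMonth = filesPerDay * daysPerMonth;                 // 7.8M PUTs
  const monthlyPutCost = (putsPerMonth / 1000) * putPricePer1000;  // ~39 USD
  console.log(`~$${monthlyPutCost.toFixed(0)}/month in PUT requests alone`);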

HTTP 500 when fetching pull requests

Now consistently hitting

✔ Fetched NixOS/nixpkgs's contributors
⠙ Fetching NixOS/nixpkgs's pull requests...Error while fetching https://api.github.com/repos/NixOS/nixpkgs/pulls?state=all&page=1&per_page=$
00
API rate limit state:
{ core: { limit: 5000, remaining: 4996, reset: 1537441414 },
  search: { limit: 30, remaining: 30, reset: 1537437875 },
  graphql: { limit: 5000, remaining: 5000, reset: 1537441415 } }
Promise {
  <rejected> Body {
    url:
     'https://********:********@api.github.com/repos/NixOS/nixpkgs/pulls?state=all&page=1&per_page=10$&client_id=********&client_secret=********',
    status: 500,
    statusText: 'Internal Server Error',
    headers: Headers { _headers: [Object] },
    ok: false,
    body:
     PassThrough {
       _readableState: [ReadableState],
       readable: true,
       _events: [Object],
       _eventsCount: 1,
       _maxListeners: undefined,
       _writableState: [WritableState],
       writable: false,
       allowHalfOpen: true,
       _transformState: [Object] },
    bodyUsed: false,
    size: 0,
    timeout: 0,
    _raw: [],
    _abort: false } }

although our code retries a few times on 500.
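For context, retry-on-500 logic of the kind mentioned above could look like this (a minimal sketch, assuming node-fetch; fetchWithRetry and the constants are illustrative names, not this repo's actual helpers):

  const fetch = require('node-fetch');

  // Retry a few times on HTTP 5xx before giving up.
  async function fetchWithRetry(url, attempts = 3, delayMs = 2000) {
    for (let i = 0; i < attempts; ++i) {
      const response = await fetch(url);
      if (response.status < 500) {
        return response; // success, or a non-retriable client error
      }
      if (i < attempts - 1) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
    throw new Error(`Still failing after ${attempts} attempts: ${url}`);
  }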

DB should contain the number of commits per user per month for each repo

So that the front end can show a timeline.

e.g. at https://github.com/ghuser-io/db/blob/a9c116e/data/repos/reframejs/reframe.json#L49

  "contributors": {
    "AurelienLourot": 11,
    "tdfranklin": 22,
    "brillout": 2038,
    "yuxal": 4
  },

should become something like

  "contributors_2": {
    "AurelienLourot": {
      "2018-01": 1,
      "2018-03": 5,
      "2018-04": 5,
    }
    "tdfranklin": {
      "2018-02": 1,
      "2018-04": 10,
      "2018-05": 11,
    },
    ...
  },

Note: it makes sense to work on this and #6 at the same time.

@brillout would you rather have the data per week instead? Weeks and months aren't aligned and I think you might have a harder time on the frontend if the data is per week.
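For illustration, the monthly bucketing could be as simple as this (a sketch only; it assumes we already have, per repo, a list of commits with the author's login and date, which is not how the crawler currently stores things):

  // Bucket commits into { login: { 'YYYY-MM': count } }.
  // `commits` is assumed to look like [{ login: 'AurelienLourot', date: '2018-01-15T12:34:56Z' }, ...].
  function commitsPerUserPerMonth(commits) {
    const result = {};
    for (const { login, date } of commits) {
      const month = date.slice(0, 7); // '2018-01-15T...' -> '2018-01'
      result[login] = result[login] || {};
      result[login][month] = (result[login][month] || 0) + 1;
    }
    return result;
  }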

API rate limit exceeded for user ID

Although we have a mechanism in place to avoid hitting the rate limit for our API key, it seems that we're now hitting another rate limit (this time probably specific to my user):

✔ FabioBaroni/awesome-exploit-development hasn't changed
✖ Fetching Fachschaft07/skriptinat0r7's contributors...
{ message: 'API rate limit exceeded for user ID ********.',
  documentation_url: 'https://developer.github.com/v3/#rate-limiting' }
Promise { <rejected> 403 }

/home/ubuntu/db/impl/scriptUtils.js:9
      throw e;
      ^
403
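One possible mitigation (a hedged sketch, not what the crawler does today) is to query https://api.github.com/rate_limit before each batch and sleep until the core limit resets:

  const fetch = require('node-fetch');

  // Sleep until the GitHub core rate limit resets, if it is exhausted.
  // Authentication headers (the user's token) would need to be added to check the per-user limit.
  async function waitForRateLimit() {
    const response = await fetch('https://api.github.com/rate_limit');
    const { resources } = await response.json();
    if (resources.core.remaining === 0) {
      const waitMs = Math.max(resources.core.reset * 1000 - Date.now(), 0);
      console.log(`Rate limit exhausted, sleeping ${Math.ceil(waitMs / 1000)}s...`);
      await new Promise(resolve => setTimeout(resolve, waitMs));
    }
  }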

Improve CLI

In this repo we now have various JS, Bash, and Python scripts, all calling each other. It has become hard to maintain; a more uniform CLI would be nice.

Better to address this after the scaling issues, which are producing many CLI changes at the moment.

orgs.json is getting big

This file, which contains all the orgs we know about, will be about 5.6 MB once we serve 3k users. In the current design the front end has to download this entire file for every profile.

fetchRepos.js runs out of memory

It seems that, now that the DB contains 28k repos, fetchRepos.js often runs out of memory at random places on machines with 0.5 GB of RAM.

✔ volrath/treepy.el hasn't changed
⠦ Fetching volumio/Volumio2's contributors...Killed

Use a real database

With the current solution (JSON files) we can't parallelize much, because if two crawlers happen to edit the same file at the same time, the file might end up corrupted.
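Writing to a temp file and renaming it into place would at least keep a single file from ever being half-written (a sketch of that mitigation, not something the crawlers do today); it still wouldn't serialize read-modify-write cycles across crawlers, which is why a real database looks attractive:

  const fs = require('fs');

  // Atomic-ish JSON write: write to a temp file, then rename it into place.
  // rename() is atomic on the same filesystem, so readers never see a half-written file.
  // This does NOT prevent two crawlers from overwriting each other's changes.
  function writeJsonAtomically(path, data) {
    const tmpPath = `${path}.tmp.${process.pid}`;
    fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2));
    fs.renameSync(tmpPath, path);
  }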

Maybe we can find a database (relational or not) which is not too expensive?

Remove obsolete data

After ghuser-io/ghuser.io#158 the following fields can be removed from the data, as they are not used by the front end anymore (double-check again when picking this up):

  • popularity
  • maturity
  • activity
  • *total_score*

Leverage gharchive.org dataset

In the dataset, commits are included in the events called PushEvent (https://developer.github.com/v3/activity/events/types/#pushevent).

Among other things, a PushEvent contains:

Key                      | Type    | Description
commits                  | array   | An array of commit objects describing the pushed commits. (The array includes a maximum of 20 commits. If necessary, you can use the Commits API to fetch additional commits. This limit is applied to timeline events only and isn't applied to webhook deliveries.)
commits[][sha]           | string  | The SHA of the commit.
commits[][message]       | string  | The commit message.
commits[][author]        | object  | The git author of the commit.
commits[][author][name]  | string  | The git author's name.
commits[][author][email] | string  | The git author's email address.
commits[][url]           | url     | URL that points to the commit API resource.
commits[][distinct]      | boolean | Whether this commit is distinct from any that have been pushed before.

There are several limitations:

  • A PushEvent contains a maximum of 20 commits. This means that any commit above this limit is simply missing from the dataset. Most PushEvents don't hit that limit and contain all the commits (something like 99%). But the problem is initial pushes, which can have several thousand commits. (E.g. a private repo moving to GitHub would have a first PushEvent with a high number of commits.) Missing out on these commits is not okay. We could use the GitHub API for such initial PushEvents that have 20 commits (and potentially thousands of truncated commits). Missing out on subsequent PushEvent commits is probably ok.
  • Commit dates are missing. But we do have the push date, so we could take the push date as a coarse approximation of the commit date (assuming that most of the time a git push happens within the same approximate time frame as its commits). But we shouldn't make this approximation for an initial PushEvent that has 20 commits (and potentially thousands of truncated commits). See the sketch after this list.
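A rough sketch of how the dataset could be consumed under these constraints (assumptions: the gharchive.org hourly dumps are already downloaded and decompressed into newline-delimited JSON, and mapping commits[][author][email] to GitHub logins is treated as a separate problem):

  const fs = require('fs');
  const readline = require('readline');

  // Count commits per git author email per month from one gharchive dump file,
  // using the push date as an approximation of the commit date and skipping
  // pushes whose commits array may be truncated (20 entries).
  async function countCommits(dumpPath, counts = {}) {
    const lines = readline.createInterface({ input: fs.createReadStream(dumpPath) });
    for await (const line of lines) {
      const event = JSON.parse(line);
      if (event.type !== 'PushEvent') continue;
      const commits = event.payload.commits || [];
      if (commits.length >= 20) continue; // possibly truncated initial push: use the GitHub API instead
      const month = event.created_at.slice(0, 7); // push date, e.g. '2018-04'
      for (const commit of commits) {
        const author = commit.author.email;
        counts[author] = counts[author] || {};
        counts[author][month] = (counts[author][month] || 0) + 1;
      }
    }
    return counts;
  }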

We could still use the dataset to get a list of repos per user. I expect this list of repos to be mostly exhaustive because:

  • I expect most repos to start public. (We can easily get stats on how many start private vs. how many start public, by checking whether the first PushEvent has more than 20 commits.)
  • If you contributed to a private repo, chances are not that low that you'll also contribute to it after it goes open source.
  • Small contributions (only a couple of commits) are very unlikely to be missing. (Small contribs most likely only happen in public repos, and it's very unlikely to miss a small contribution because of a truncated subsequent PushEvent commit array.)

We can also use the dataset for repos whose first PushEvent has fewer than 20 commits. If the first PushEvent has fewer than 20 commits, we can be confident that the repo started public. Then missing out on a couple of commits is probably ok: the approximate commit stats would likely be good enough to categorize users as "maintainer"/"gold contrib"/"silver contrib"/"bronze contrib" and to show a contribution timeline.
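That heuristic itself is tiny (a sketch; it assumes we can pick out the earliest PushEvent we have for a repo, and that payload.size reports the total number of commits in the push even when the commits array is truncated at 20):

  // Decide whether a repo most likely started public.
  // `firstPushEvent` is the earliest PushEvent we have for the repo in the dataset.
  function startedPublic(firstPushEvent) {
    // A first push with 20+ commits suggests pre-existing (possibly private) history
    // being imported; fewer than 20 means we are seeing the repo's real beginning.
    return firstPushEvent.payload.size < 20;
  }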
