db's Issues

S3 turns out to be too expensive

We use S3 to expose our DB over HTTP. Our DB is made of many JSON files; for 3k users we will have about 260k files, which we refresh daily.

1000 PUTs to S3 cost $0.005. Refreshing 260k files daily means roughly 7.8M PUTs per month, so this upload will cost us about $40/month. On top of that come the cost of storage and the cost incurred by users fetching the data, but these are low.

We could, for example, replace these $40 with only $5 by running a t2.nano that exposes the files over HTTP.
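As a back-of-the-envelope check (a sketch only, restating the arithmetic above):

  // Rough S3 upload cost estimate, using the numbers from this issue.
  const filesPerDay = 260000;      // JSON files refreshed daily
  const putPricePer1000 = 0.005;   // USD per 1000 PUT requests
  const daysPerMonth = 30;

  const putsPerMonth = filesPerDay * daysPerMonth;                 // 7.8M PUTs
  const monthlyPutCost = (putsPerMonth / 1000) * putPricePer1000;  // ~39 USD
  console.log(`~$${monthlyPutCost.toFixed(0)}/month in PUT requests alone`);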

HTTP 500 when fetching pull requests

Now consistently hitting

✔ Fetched NixOS/nixpkgs's contributors
⠙ Fetching NixOS/nixpkgs's pull requests...Error while fetching https://api.github.com/repos/NixOS/nixpkgs/pulls?state=all&page=1&per_page=$
00
API rate limit state:
{ core: { limit: 5000, remaining: 4996, reset: 1537441414 },
  search: { limit: 30, remaining: 30, reset: 1537437875 },
  graphql: { limit: 5000, remaining: 5000, reset: 1537441415 } }
Promise {
  <rejected> Body {
    url:
     'https://********:********@api.github.com/repos/NixOS/nixpkgs/pulls?state=all&page=1&per_page=10$&client_id=********&client_secret=********',
    status: 500,
    statusText: 'Internal Server Error',
    headers: Headers { _headers: [Object] },
    ok: false,
    body:
     PassThrough {
       _readableState: [ReadableState],
       readable: true,
       _events: [Object],
       _eventsCount: 1,
       _maxListeners: undefined,
       _writableState: [WritableState],
       writable: false,
       allowHalfOpen: true,
       _transformState: [Object] },
    bodyUsed: false,
    size: 0,
    timeout: 0,
    _raw: [],
    _abort: false } }

although our code retries a few times on 500.
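For context, retry-on-500 logic of the kind mentioned above could look like this (a minimal sketch, assuming node-fetch; fetchWithRetry and the constants are illustrative names, not this repo's actual helpers):

  const fetch = require('node-fetch');

  // Retry a few times on HTTP 5xx before giving up.
  async function fetchWithRetry(url, attempts = 3, delayMs = 2000) {
    for (let i = 0; i < attempts; ++i) {
      const response = await fetch(url);
      if (response.status < 500) {
        return response; // success, or a non-retriable client error
      }
      if (i < attempts - 1) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
    throw new Error(`Still failing after ${attempts} attempts: ${url}`);
  }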

DB should contain the number of commits per user per month for each repo

So that the front end can show a timeline.

e.g. at https://github.com/ghuser-io/db/blob/a9c116e/data/repos/reframejs/reframe.json#L49

  "contributors": {
    "AurelienLourot": 11,
    "tdfranklin": 22,
    "brillout": 2038,
    "yuxal": 4
  },

should become something like

  "contributors_2": {
    "AurelienLourot": {
      "2018-01": 1,
      "2018-03": 5,
      "2018-04": 5,
    }
    "tdfranklin": {
      "2018-02": 1,
      "2018-04": 10,
      "2018-05": 11,
    },
    ...
  },

Note: it makes sense to work on this and #6 at the same time.

@brillout would you rather have the data per week instead? Weeks and months aren't aligned and I think you might have a harder time on the frontend if the data is per week.
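For illustration, the monthly bucketing could be as simple as this (a sketch only; it assumes we already have, per repo, a list of commits with the author's login and date, which is not how the crawler currently stores things):

  // Bucket commits into { login: { 'YYYY-MM': count } }.
  // `commits` is assumed to look like [{ login: 'AurelienLourot', date: '2018-01-15T12:34:56Z' }, ...].
  function commitsPerUserPerMonth(commits) {
    const result = {};
    for (const { login, date } of commits) {
      const month = date.slice(0, 7); // '2018-01-15T...' -> '2018-01'
      result[login] = result[login] || {};
      result[login][month] = (result[login][month] || 0) + 1;
    }
    return result;
  }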

API rate limit exceeded for user ID

Although we have a mechanism in place to avoid hitting the rate limit for our API key, it seems that we're now hitting another rate limit (this time probably specific to my user):

✔ FabioBaroni/awesome-exploit-development hasn't changed
✖ Fetching Fachschaft07/skriptinat0r7's contributors...
{ message: 'API rate limit exceeded for user ID ********.',
  documentation_url: 'https://developer.github.com/v3/#rate-limiting' }
Promise { <rejected> 403 }

/home/ubuntu/db/impl/scriptUtils.js:9
      throw e;
      ^
403
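One possible mitigation (a hedged sketch, not what the crawler does today) is to query https://api.github.com/rate_limit before each batch and sleep until the core limit resets:

  const fetch = require('node-fetch');

  // Sleep until the GitHub core rate limit resets, if it is exhausted.
  // Authentication headers (the user's token) would need to be added to check the per-user limit.
  async function waitForRateLimit() {
    const response = await fetch('https://api.github.com/rate_limit');
    const { resources } = await response.json();
    if (resources.core.remaining === 0) {
      const waitMs = Math.max(resources.core.reset * 1000 - Date.now(), 0);
      console.log(`Rate limit exhausted, sleeping ${Math.ceil(waitMs / 1000)}s...`);
      await new Promise(resolve => setTimeout(resolve, waitMs));
    }
  }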

Improve CLI

In this repo we now have various JS, Bash, and Python scripts, all calling each other. It has become hard to maintain; a more uniform CLI would be nice.

Better to address this after the scaling issues, which are producing many CLI changes at the moment.

orgs.json is getting big

This file, which contains all the orgs we know about, will be about 5.6 MB once we serve 3k users. In the current design the front end has to download this entire file for every profile.

fetchRepos.js runs out of memory

It seems that, now that the DB contains 28k repos, fetchRepos.js often runs out of memory at random places on machines with 0.5 GB of RAM.

✔ volrath/treepy.el hasn't changed
⠦ Fetching volumio/Volumio2's contributors...Killed

Use a real database

With the current solution (JSON files) we can't parallelize much, because if two crawlers happen to edit the same file at the same time, the file might end up corrupted.
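Writing to a temp file and renaming it into place would at least keep a single file from ever being half-written (a sketch of that mitigation, not something the crawlers do today); it still wouldn't serialize read-modify-write cycles across crawlers, which is why a real database looks attractive:

  const fs = require('fs');

  // Atomic-ish JSON write: write to a temp file, then rename it into place.
  // rename() is atomic on the same filesystem, so readers never see a half-written file.
  // This does NOT prevent two crawlers from overwriting each other's changes.
  function writeJsonAtomically(path, data) {
    const tmpPath = `${path}.tmp.${process.pid}`;
    fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2));
    fs.renameSync(tmpPath, path);
  }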

Maybe we can find a database (relational or not) which is not too expensive?

Remove obsolete data

After ghuser-io/ghuser.io#158 the following fields can be removed from the data, as they are not used by the front end anymore (double-check again when picking this up):

  • popularity
  • maturity
  • activity
  • *total_score*

Leverage gharchive.org dataset

In the dataset, commits are included in the events called PushEvent (https://developer.github.com/v3/activity/events/types/#pushevent).

Among other things, a PushEvent contains:

Key                      | Type    | Description
commits                  | array   | An array of commit objects describing the pushed commits. (The array includes a maximum of 20 commits. If necessary, you can use the Commits API to fetch additional commits. This limit is applied to timeline events only and isn't applied to webhook deliveries.)
commits[][sha]           | string  | The SHA of the commit.
commits[][message]       | string  | The commit message.
commits[][author]        | object  | The git author of the commit.
commits[][author][name]  | string  | The git author's name.
commits[][author][email] | string  | The git author's email address.
commits[][url]           | url     | URL that points to the commit API resource.
commits[][distinct]      | boolean | Whether this commit is distinct from any that have been pushed before.

There are several limitations:

  • A PushEvent contains a maximum of 20 commits. This means that any commit above this limit is simply missing from the dataset. Most PushEvents don't hit that limit and contain all the commits (something like 99%). But the problem is initial pushes, which can have several thousand commits. (E.g. a private repo moving to GitHub would have a first PushEvent with a high number of commits.) Missing out on these commits is not okay. We could use the GitHub API for such initial PushEvents that have 20 commits (and potentially thousands of truncated commits). Missing out on subsequent PushEvent commits is probably ok.
  • Commit dates are missing. But we do have the push date, so we could take the push date as a coarse approximation of the commit date (assuming that most of the time a git push happens within the same approximate time frame as its commits). But we shouldn't make this approximation for an initial PushEvent that has 20 commits (and potentially thousands of truncated commits). See the sketch after this list.
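A rough sketch of how the dataset could be consumed under these constraints (assumptions: the gharchive.org hourly dumps are already downloaded and decompressed into newline-delimited JSON, and mapping commits[][author][email] to GitHub logins is treated as a separate problem):

  const fs = require('fs');
  const readline = require('readline');

  // Count commits per git author email per month from one gharchive dump file,
  // using the push date as an approximation of the commit date and skipping
  // pushes whose commits array may be truncated (20 entries).
  async function countCommits(dumpPath, counts = {}) {
    const lines = readline.createInterface({ input: fs.createReadStream(dumpPath) });
    for await (const line of lines) {
      const event = JSON.parse(line);
      if (event.type !== 'PushEvent') continue;
      const commits = event.payload.commits || [];
      if (commits.length >= 20) continue; // possibly truncated initial push: use the GitHub API instead
      const month = event.created_at.slice(0, 7); // push date, e.g. '2018-04'
      for (const commit of commits) {
        const author = commit.author.email;
        counts[author] = counts[author] || {};
        counts[author][month] = (counts[author][month] || 0) + 1;
      }
    }
    return counts;
  }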

We could still use the dataset to get a list of repos per user. I expect this list of repos to be mostly exhaustive because:

  • I expect most repos to start public. (We can easily get stats on how many start private vs. how many start public, by checking whether the first PushEvent has more than 20 commits.)
  • If you contributed to a private repo, chances are not that low that you'll also contribute to it after it goes open source.
  • Small contributions (only a couple of commits) are very unlikely to be missing. (Small contribs most likely only happen in public repos, and it's very unlikely to miss a small contribution because of a truncated subsequent PushEvent commit array.)

We can also use the dataset for repos whose first PushEvent has fewer than 20 commits. If the first PushEvent has fewer than 20 commits, we can be confident that the repo started public. Then missing out on a couple of commits is probably ok: the approximate commit stats would likely be good enough to categorize users as "maintainer"/"gold contrib"/"silver contrib"/"bronze contrib" and to show a contribution timeline.
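That heuristic itself is tiny (a sketch; it assumes we can pick out the earliest PushEvent we have for a repo, and that payload.size reports the total number of commits in the push even when the commits array is truncated at 20):

  // Decide whether a repo most likely started public.
  // `firstPushEvent` is the earliest PushEvent we have for the repo in the dataset.
  function startedPublic(firstPushEvent) {
    // A first push with 20+ commits suggests pre-existing (possibly private) history
    // being imported; fewer than 20 means we are seeing the repo's real beginning.
    return firstPushEvent.payload.size < 20;
  }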
