meltanolabs / tap-github
A Singer tap for extracting data from Github. Powered by the Meltano SDK for Singer Taps: https://sdk.meltano.com
License: Apache License 2.0
Singer's GitHub tap adds an `_sdc_repository` field to streams that don't get one from the API, as an easy-to-use identifier linking records to their parent repo.
See https://github.com/singer-io/tap-github/blob/master/tap_github/__init__.py#L927
Should we do something similar with the Meltano taps?
At the moment, `repo` and `org` keys get added, so we could decide to leverage that instead.
One caveat: some streams actually have fields called `repo` and `org` that could be incompatible (see my upcoming Events stream PR).
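For reference, a minimal sketch of the Singer-style approach, assuming a `post_process` hook on a REST base class (class and attribute names here are illustrative, not the tap's actual code); a prefixed field sidesteps the collision problem entirely:

```python
from typing import Optional

from singer_sdk.streams import RESTStream


class GitHubStream(RESTStream):
    """Sketch: inject a prefixed identifier that cannot collide with API fields."""

    url_base = "https://api.github.com"

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        if context:
            # e.g. context == {"org": "MeltanoLabs", "repo": "tap-github"}
            row["_sdc_repository"] = f"{context['org']}/{context['repo']}"
        return row
```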
I might be misunderstanding how the schema definition works for a stream, but this bothers me.
With the following schema (from `issue_comments` in 71b07b7):
```python
schema = th.PropertiesList(
    th.Property("id", th.IntegerType),
    th.Property("node_id", th.StringType),
    th.Property("repo", th.StringType),
    th.Property("org", th.StringType),
    th.Property("issue_url", th.IntegerType),
    th.Property("updated_at", th.DateTimeType),
    th.Property("created_at", th.DateTimeType),
    th.Property("author_association", th.StringType),
    th.Property("body", th.StringType),
    th.Property(
        "user",
        th.ObjectType(
            th.Property("login", th.StringType),
            th.Property("id", th.IntegerType),
            th.Property("node_id", th.StringType),
            th.Property("avatar_url", th.StringType),
            th.Property("gravatar_id", th.StringType),
            th.Property("html_url", th.StringType),
            th.Property("type", th.StringType),
            th.Property("site_admin", th.BooleanType),
        ),
    ),
).to_dict()
```
I am seeing the following records:
```json
{
  "type": "RECORD",
  "stream": "issue_comments",
  "record": {
    "issue_url": "https://api.github.com/repos/singer-io/tap-facebook/issues/157",
    "id": 895733451,
    "node_id": "IC_kwDOBRvyIM41Y87L",
    "user": {
      "login": "lscottallen004",
      "id": 83940128,
      "node_id": "MDQ6VXNlcjgzOTQwMTI4",
      "avatar_url": "https://avatars.githubusercontent.com/u/83940128?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/lscottallen004",
      "html_url": "https://github.com/lscottallen004",
      "followers_url": "https://api.github.com/users/lscottallen004/followers",
      "following_url": "https://api.github.com/users/lscottallen004/following{/other_user}",
      "gists_url": "https://api.github.com/users/lscottallen004/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/lscottallen004/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/lscottallen004/subscriptions",
      "organizations_url": "https://api.github.com/users/lscottallen004/orgs",
      "repos_url": "https://api.github.com/users/lscottallen004/repos",
      "events_url": "https://api.github.com/users/lscottallen004/events{/privacy}",
      "received_events_url": "https://api.github.com/users/lscottallen004/received_events",
      "type": "User",
      "site_admin": false
    },
    "created_at": "2021-08-10T05:09:01Z",
    "updated_at": "2021-08-10T05:09:01Z",
    "author_association": "NONE",
    "body": "> \n\n- > > > > > > ~~> >~~~~~~~~> @iterati _@iterati _@iterati _[]()[]()_@hz-lschick ____>~~ []",
    "org": "singer-io",
    "repo": "tap-facebook"
  },
  "time_extracted": "2021-09-10T23:07:19.736119Z"
}
```
A number of the nested `*_url` fields are present in the record, although they are excluded from the schema definition. It looks like a call to https://gitlab.com/meltano/sdk/-/blob/main/singer_sdk/helpers/_singer.py#L23 from `pop_deselected_record_properties` causes the fields to be included because `user` is included, and somehow the details of the nested object do not appear in the selection mask.
I suspect this is a bug in the SDK, but I might have misunderstood how the code is supposed to work 🤔
Python 3.6 is reaching its EOL on December 23rd, so this is probably a good time to drop support for it.
Any objections to a short PR dropping 3.6 from CI and `pyproject.toml`?
Source: https://endoflife.date/python
Currently, this was built around the `searches` config option, which allows GitHub repo searches to drive dynamic repo selection.
I suggest we make three vectors for selection, each of which would be mutually exclusive during setting validation (a sketch of the check follows this list):
- `searches` - (existing) finds repos based on the search API
- `repositories` - (hi-pri) takes a list of one or more repositories, such as `['MeltanoLabs/tap-github']`
- `organizations` - (low-pri) takes a list of one or more organizations, such as `['MeltanoLabs']`
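A minimal sketch of the mutual-exclusion check; the helper name and the hook point (tap initialization vs. stream discovery) are assumptions:

```python
VALID_QUERY_OPTIONS = {"searches", "repositories", "organizations"}


def validate_selection_config(config: dict) -> str:
    """Return the single selection option present in config, or raise."""
    present = VALID_QUERY_OPTIONS & set(config)
    if len(present) != 1:
        raise ValueError(
            f"Config must contain exactly one of {sorted(VALID_QUERY_OPTIONS)}, "
            f"got {sorted(present) or 'none'}."
        )
    return present.pop()
```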
In sdk v0.4.9, a new function was added through this MR which improves the sdk-level handling of rate limits.
`tap-github`'s handling still leads to failures in some cases: with multiple tokens passed in config and enough remaining quota overall, the tap can still stop running with an error like:
name=backoff level=ERROR message=Giving up _request(...) after 5 tries (singer_sdk.exceptions.RetriableAPIError: 403 Client Error: b'{"message":"API rate limit exceeded for user ID xxxxxxx.","documentation_url":"https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}' (Reason: Forbidden) for path: /repos/camicroscope/caMicroscope/languages)
The key part of this message (in our production logs) is that the user ID remains the same across the 5 retries, which is not expected.
It looks like the tap's authenticator correctly rotates the token, but the backoff-decorated request continues retrying with the same token for a given request: the authenticator passes the token to the request in `prepare_request`, but the request headers are not updated with the new token until the next record.
While this works most of the time, if the buffer is less than the number of retries, or some other process is depleting quotas in parallel, it is possible to perform the exact same query 5 times in a row without actually changing the token.
The new `backoff_handler` should be able to fix this (a sketch follows below).
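A hedged sketch of what that fix could look like, assuming the `backoff` library's standard callback payload (where `args[0]` is the prepared request, since the SDK decorates the bound `_request` method) and a hypothetical `get_next_auth_token()` rotation hook on the authenticator:

```python
import requests


class GitHubRestStream:  # sketch: the real base class lives in tap_github/client.py
    def backoff_handler(self, details: dict) -> None:
        """Rotate the auth token between retries so an exhausted one is not reused."""
        # `details` is the backoff library's callback payload; for the SDK's
        # decorated `_request`, args[0] is the prepared request being retried.
        prepared_request: requests.PreparedRequest = details["args"][0]
        self.authenticator.get_next_auth_token()  # hypothetical rotation hook
        # Re-stamp the headers that were frozen at prepare_request time.
        prepared_request.headers.update(self.authenticator.auth_headers)
```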
Currently we have a REST base stream implemented for our tap:
https://github.com/MeltanoLabs/tap-github/blob/main/tap_github/client.py#L10
However, it would be greatly beneficial if we also considered making a base stream for the GitHub GraphQL API. The GraphQL API uses a 'credits' system, which makes it much more efficient at collecting granular data than REST, where each resource, no matter how small, takes up an entire request.
Streams that can greatly benefit from the GraphQL API:
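A minimal sketch of what the proposed base class could look like, built on the SDK's `GraphQLStream`; all names here are illustrative:

```python
from singer_sdk.streams import GraphQLStream


class GitHubGraphQLStream(GraphQLStream):
    """Sketch of a shared base class for GraphQL-backed streams."""

    url_base = "https://api.github.com"
    path = "/graphql"  # all GraphQL queries POST to the same endpoint

    @property
    def query(self) -> str:
        """Each subclass supplies its GraphQL document here."""
        raise NotImplementedError
```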
Currently the PR stream is missing a bunch of key fields:
- `additions`
- `deletions`
- `review_comments` (int)
- `changed_files` (int)
- etc.
The current API route we use, https://docs.github.com/en/rest/reference/pulls#list-pull-requests, isn't sufficient to get all these missing fields. We would need to use the singular PR route https://docs.github.com/en/rest/reference/pulls#get-a-pull-request to fetch everything. This may be difficult, since it would greatly increase the number of requests made to the API. We could maybe cut this down by making a custom query with absolutely everything with #48.
As I write this issue, https://github.com/mikelove/DESeq2 does not have a readme, which results in https://api.github.com/repos/mikelove/DESeq2/readme returning a 404.
I will implement a small fix along the lines of what is described in #16 for the community profile stream.
I am unsure of what situation it reflects, but I have been seeing the following error on `issue_events` where there doesn't seem to be any linked issue, and we get an `AttributeError: 'NoneType' object has no attribute 'pop'`.
To start, we could simply add a check for the issue object and set null `issue_number` and `issue_url` if the object does not exist (see the sketch after the traceback). Thoughts?
```
time=2021-09-27 15:42:20 name=tap-github level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 1.158103, 'tags': {'endpoint': '/repos/{org}/{repo}/issues/events', 'http_status_code': 200, 'status': 'succeeded', 'context': {'org': 'facebook', 'repo': 'react'}}}
Traceback (most recent call last):
  .../...
  File "...tract/.venv/lib/python3.9/site-packages/tap_github/streams.py", line 529, in post_process
    row["issue_number"] = int(row["issue"].pop("number"))
AttributeError: 'NoneType' object has no attribute 'pop'
```
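A minimal sketch of the guard described above, assuming the tap's `post_process` signature; field names other than `number` are guesses:

```python
from typing import Optional


class IssueEventsStream:  # sketch: the real class subclasses the tap's REST base
    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        issue = row.get("issue")
        if issue is None:
            # No linked issue: emit explicit nulls instead of crashing.
            row["issue_number"] = None
            row["issue_url"] = None
        else:
            row["issue_number"] = int(issue.pop("number"))
            row["issue_url"] = issue.get("url")
        return row
```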
I see `api_url_base` is retrieved from the config in the code here, but it's not listed in the config options in the tap. I'm not positive, but I think it would be ignored right now if it were set in the config.json; is that right?
Is this purposeful? If not, I can create a PR to get it added to the config!
The test logs are very verbose, making them cumbersome to analyze on the GitHub website, which keeps freezing.
We should find a way to reduce the log level to make it more digestible.
When an issue arises in a stream, it would be quite useful to catch it and add context information (stream, repo, ...).
This would make it a lot easier to replicate errors locally and fix them.
cc @laurentS
We use the `dependents` count for a repository, which is currently fetched by grabbing the HTML page for the project (e.g. https://github.com/facebook/react/network/dependents) and parsing the HTML. As I write this ticket, the link above returns `7,878,702 Repositories` (and likewise for packages), and we grab these numbers.
Unfortunately, this info does not seem to exist anywhere in either the REST or GraphQL APIs.
@aaronsteers Would you have any objection to me adding a request for that page to the `repositories` stream, resulting in an extra field? Possibly behind some config option, as it is fairly download heavy (the page above weighs 187kB).
Maybe in the `post_process` method (a sketch follows below)? Ideally, the data will eventually be available in one of the APIs, and this can then be dropped.
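A hedged sketch of the scraping idea, as a standalone helper that `post_process` could call; the function name and the regex against GitHub's markup are assumptions and will break if the page changes:

```python
import re
from typing import Optional

import requests


def fetch_dependents_count(org: str, repo: str) -> Optional[int]:
    """Scrape the dependents count from the HTML network/dependents page."""
    url = f"https://github.com/{org}/{repo}/network/dependents"
    html = requests.get(url, timeout=30).text
    # The count appears in the page as e.g. "7,878,702 Repositories".
    match = re.search(r"([\d,]+)\s+Repositories", html)
    return int(match.group(1).replace(",", "")) if match else None
```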
A few schemas, including the owner schema, the project schema, and others, are duplicated across the streams.
Let's extract some of them and refactor our schema definitions to:
@ericboucher @laurentS @edgarrmondragon
This is a fast-developing tap! We've expanded now to 9 streams and we've got several more on the way.
What can our plan be for efficient management of releases? Kicking off this issue for discussion.
I would love to introduce auto-changelog entries and auto-publish to PyPI if those were feasible. If not auto-generated, then we would need a process to efficiently manage the changelog and release processes.
I am interested in pulling CLOSED issues and MERGED PRs as well as OPEN ones. The current behavior for the `issues` stream is to pull all OPEN issues (and PRs).
I think that by default, the `issues` stream could fetch ALL issues that have been updated since our start date; and we could have an option to only select OPEN issues.
Somewhat linked: it could be an opportunity to clearly separate issues and PRs. The issues stream would pull only issues, and a pull_request stream would handle pull requests.
What do you think?
Investigate a bit more why the issues/comments stream is raising the following error: `ValueError: invalid literal for int() with base 16: b''`
Possible solutions to explore, e.g. patching urllib3: psf/requests#4248
```
2022-06-16T12:16:57.756152Z [info ] INFO Load t_issues_f4627bf93f834fa8b9b9682a78c94961 by FULL_TABLE (truncate) cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
2022-06-16T12:16:57.756693Z [info ] INFO loading t_issues_f4627bf93f834fa8b9b9682a78c94961 to BigQuery cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
2022-06-16T12:16:58.179806Z [info ] ERROR failed to load table t_issues_f4627bf93f834fa8b9b9682a78c94961 from file: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/pulumi-devrel/jobs?uploadType=resumable: Field assignee is type RECORD but has no schema cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
2022-06-16T12:17:01.383314Z [info ] CRITICAL 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/pulumi-devrel/jobs?uploadType=resumable: Field assignee is type RECORD but has no schema cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
```
My config looks like:
```yaml
environments:
- name: production
  config:
    plugins:
      extractors:
      - name: tap-github
        config:
          repository: rawkode/rawkode
          start_date: "2017-01-01T00:00:00Z"
          access_token: <REDACTED>
        select:
        - commits.id
        - commits.commit.author.date
        - commits.commit.author.name
        - commits.commit.message.*
        - "*.*"
      loaders:
      - name: target-bigquery
        config:
          project_id: pulumi-devrel
          dataset_id: github
          credentials_path: client.json
```
It is possible to "corrupt" data coming out of the tap by inputting incorrectly cased values in the config file. While it does not directly affect the tap itself, it can lead to problems downstream (we've just hit this problem).
I am not sure what the correct solution is, so I'm opening this ticket for discussion and awareness.
In config, set `"repositories": ["meltanolabs/tap-github"]` (lowercase `m` and `l` on purpose) and run the tap deselecting the `repositories` stream, keeping at least one child stream, `issues` for instance.
The tap will run fine, since the GitHub API will handle the case inconsistencies in the request and return data containing `MeltanoLabs` instead of `meltanolabs`.
But because of these lines https://github.com/MeltanoLabs/tap-github/blob/0851e80578bbb17f69faa4c6b2220d30ed1802c5/tap_github/repository_streams.py#L103-107 the output from the tap will contain `org=meltanolabs` but `full_name=MeltanoLabs/tap-github` (note the difference in upper/lowercase chars). If you use the `org` and `repo` columns to join records from different streams together, you can end up with inconsistent values which are hard to reconcile.
The obvious solution to this problem is to avoid incorrectly cased input in the config, but this is fairly fragile. Another option would be to change the way the child streams get their `org`/`repo` values, by parsing the `url` field for instance (which is received from GitHub and matches the "official" repo name casing), but this seems fairly costly performance-wise.
Are there any best practices around this in other taps?
We could use https://github.com/github/rest-api-description/blob/main/descriptions-next/api.github.com/api.github.com.json to validate our requests and make sure that we are following the api specs.
I believe as of now, we are still limited by the no-auth rate limits, which are lower than authenticated rates.
We may get better rate limits from using the built-in `GITHUB_TOKEN` secret. I haven't read deeply enough to know if there are any specific limitations on this token which would block us, but it seems worth exploring.
`tap-github --about` requires a valid config file, which kind of defeats the purpose:
```
ericboucher@Erics-MBP website % tap-github --about
time=2022-04-12 15:36:20 name=tap-github level=INFO message=Skipping parse of env var settings...
time=2022-04-12 15:36:20 name=tap-github level=INFO message=Config validation passed with 0 errors and 0 warnings.
time=2022-04-12 15:36:20 name=root level=INFO message=Operator '__else__=None' was not found. Unmapped streams will be included in output.
Traceback (most recent call last):
  File "/opt/homebrew/bin/tap-github", line 8, in <module>
    sys.exit(cli())
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 467, in cli
    tap = cls(  # type: ignore  # Ignore 'type not callable'
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 92, in __init__
    self.input_catalog or self._singer_catalog
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 235, in _singer_catalog
    for stream in self.streams.values()
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 118, in streams
    for stream in self.load_streams():
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 267, in load_streams
    for stream in self.discover_streams():
  File "/opt/homebrew/lib/python3.9/site-packages/tap_github/tap.py", line 65, in discover_streams
    raise ValueError(
ValueError: This tap requires one and only one of the following path options: {'organizations', 'user_ids', 'user_usernames', 'repositories', 'searches'}.
```
Let's make sure anyone can run this command after installing tap-github.
As I was implementing a GithubStream for `community_profile`, I stumbled upon the following issue:
Some repos actually return a 404 when we request their `community_profile`. It is quite lucky that this got caught early, thanks to the main test repo being one of these edge cases. Though I still do not understand why.
As a workaround, we could allow `tolerated_http_errors` and handle them differently. This is what I have implemented here (a sketch follows below). But I am open to other ideas 😄
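For illustration, a minimal sketch of the `tolerated_http_errors` idea on top of the SDK's `RESTStream` hooks (`validate_response` / `parse_response`); the path and class names are assumptions, not the actual implementation:

```python
import requests
from singer_sdk.streams import RESTStream


class CommunityProfileStream(RESTStream):
    """Sketch: treat listed status codes as 'no data' rather than failures."""

    url_base = "https://api.github.com"
    path = "/repos/{org}/{repo}/community/profile"
    tolerated_http_errors = [404]

    def validate_response(self, response: requests.Response) -> None:
        if response.status_code in self.tolerated_http_errors:
            return  # skip the SDK's raise-on-error handling
        super().validate_response(response)

    def parse_response(self, response: requests.Response):
        if response.status_code in self.tolerated_http_errors:
            return  # emit no records for tolerated errors
        yield from super().parse_response(response)
```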
Open question:
`/repos/{org}/{repo}/contributors` and `/repos/{org}/{repo}/stats/contributors` both return HTTP status 204 when called for an empty repository. Example Postman request:
As a result, the Contributors, Anonymous Contributors, and Stats Contributors streams each raise a `JSONDecodeError` when attempting to parse the empty response:
```
time=2022-05-28 23:34:30 name=tap-github level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.435061, 'tags': {'endpoint': '/search/repositories', 'http_status_code': 200, 'status': 'succeeded', 'context': {'search_name': 'empty-repo', 'search_query': 'user:epapineau example-empty-repo'}}}
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Beginning full_table sync of 'contributors' with context: {'org': 'epapineau', 'repo': 'example-empty-repo', 'repo_id': 497421330}...
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Tap has custom mapper. Using 1 provided map(s).
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Found 1 'GITHUB_TOKEN' environment variables for authentication.
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Tap will run with 1 auth tokens
time=2022-05-28 23:34:31 name=tap-github level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.313254, 'tags': {'endpoint': '/repos/{org}/{repo}/contributors', 'http_status_code': 204, 'status': 'succeeded', 'context': {'org': 'epapineau', 'repo': 'example-empty-repo', 'repo_id': 497421330}}}
Traceback (most recent call last):
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "venv/bin/tap-github", line 8, in <module>
    sys.exit(cli())
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 499, in cli
    tap.sync_all()
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 379, in sync_all
    stream.sync()
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 1020, in sync
    self._sync_records(context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 962, in _sync_records
    self._sync_children(child_context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 1025, in _sync_children
    child_stream.sync(context=child_context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 1020, in sync
    self._sync_records(context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 946, in _sync_records
    for record_result in self.get_records(current_context):
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/rest.py", line 424, in get_records
    for record in self.request_records(context):
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/rest.py", line 323, in request_records
    yield from self.parse_response(resp)
  File "/Users/elizepapineau/dev/tap-github/tap_github/client.py", line 226, in parse_response
    resp_json = response.json()
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/requests/models.py", line 917, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: [Errno Expecting value] : 0
```
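A minimal sketch of a guard, assuming we short-circuit `parse_response` before the `.json()` call in `tap_github/client.py`; the list/object handling below is illustrative:

```python
from typing import Iterable

import requests


class GitHubRestStream:  # sketch: mirrors the base class in tap_github/client.py
    def parse_response(self, response: requests.Response) -> Iterable[dict]:
        # An empty repo yields HTTP 204 with no body; .json() would raise.
        if response.status_code == 204 or not response.content:
            return  # nothing to emit
        resp_json = response.json()
        # Endpoints return either a list of records or a single object.
        if isinstance(resp_json, list):
            yield from resp_json
        else:
            yield resp_json
```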
In #93 I decided to run the tap through a simple target, `target-sqlite`, to see if the new streams I added worked.
By doing this, I instantly found mistyped fields in the `Events` stream.
Example error:
```
CRITICAL '20730626430' is not of type 'integer', 'null'
CRITICAL
CRITICAL Failed validating 'type' in schema['properties']['id']:
CRITICAL     {'type': ['integer', 'null']}
CRITICAL
CRITICAL On instance['id']:
CRITICAL     '20730626430'
```
I think this would be a great testing method to ensure our schemas are correctly typed.
Possibly a new CI step that runs `poetry run tap-github --config c.json --catalog ca.json | poetry run target-sqlite -c sq.json` and fails if the exit code is non-zero.
Thoughts on implementing this?
The upgrade to `singer-sdk >= 0.3.14` breaks the way this tap works with `tolerated_http_errors`. The logic for this should be modified to allow using the latest sdk.
We would like the ability to fetch repo languages and LOC. The repo object has one main language, but a full list is available under the `languages_url`, with the respective size of each language (note: the API reports bytes of code rather than LOC): https://api.github.com/repos/facebook/react/languages. A sketch of such a stream follows below.
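The endpoint returns a single `{language: bytes_of_code}` object, so `parse_response` can flatten it into one record per language. All names below are illustrative; the real class would subclass the tap's REST base:

```python
from typing import Iterable

import requests
from singer_sdk import typing as th


class LanguagesStream:  # sketch: the real class would subclass the tap's REST base
    name = "languages"
    path = "/repos/{org}/{repo}/languages"
    primary_keys = ["repo", "org", "language_name"]

    schema = th.PropertiesList(
        th.Property("repo", th.StringType),
        th.Property("org", th.StringType),
        th.Property("language_name", th.StringType),
        th.Property("bytes", th.IntegerType),
    ).to_dict()

    def parse_response(self, response: requests.Response) -> Iterable[dict]:
        # e.g. {"JavaScript": 4268926, "HTML": 115, ...} -> one record each
        for language, num_bytes in response.json().items():
            yield {"language_name": language, "bytes": num_bytes}
```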
Effectively, this tap is already a lot more advanced than https://github.com/singer-io/tap-github, which is the tap currently proposed as the default GitHub tap on meltano.com.
Let's add the missing endpoints from https://github.com/singer-io/tap-github and reach out to the relevant people at Meltano to upgrade the default tap to this one.
Missing endpoints:
I'm trying to load this tap-github into target-bigquery and get the following error:
CRITICAL Duplicate field(s) in stream issues: & reactions.-1 are read as REACTIONS._1 by BigQuery
The `pull_requests` stream has the same problem. What's the best way to deal with it? Having the special chars +/- in the schema name causes trouble; should this be fixed at the extractor level, or should I look for a solution with this particular loader?
The GitHub REST API makes it impossible to traverse the stargazers endpoint for large repos and limits how many pages we can query.
For example, querying `https://api.github.com/repositories/10270250/stargazers?page=500&per_page=100` will result in the following error:
```json
{
  "message": "In order to keep the API fast for everyone, pagination is limited for this resource. Check the rel=last link relation in the Link response header to see how far back you can traverse.",
  "documentation_url": "https://docs.github.com/v3/#pagination"
}
```
And contrary to what the message suggests, querying the "last" link still gives an error...
To make things worse, the API does NOT allow any of the following params for this endpoint:
A solution is to paginate through a Graphql version instead:
query {
repository(name: "react" owner: "facebook") {
stargazers(first: 100 orderBy: {field: STARRED_AT direction: DESC}) {
edges {
node {
id
login
}
starredAt
}
}
}
}
I just noticed after pushing a couple of commits to #71 that each CI job is run twice, once for the PR, once for the push.
There must be some setting on GitHub Actions to avoid this; even if we're caching results, it seems super wasteful.
Let's create a stream to get useful data not readily available from the API.
Any other ideas @laurentS?
We can piggyback on the work done with #127.
Thanks to the changes introduced by #112, we now have a more reliable column to use for child streams.
Note: this is potentially a breaking change and might alter the DB index structure.
The GitHub API allows for conditional requests that do not count towards rate-limiting.
It could be interesting to leverage these and avoid making requests when they are not necessary (a sketch follows below).
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#conditional-requests
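A minimal sketch of a conditional request, assuming we cache the `ETag` from a previous response; a 304 reply has no body and does not count against the rate limit:

```python
from typing import Any, Optional, Tuple

import requests


def fetch_if_changed(
    url: str, etag: Optional[str], token: str
) -> Tuple[Optional[Any], Optional[str]]:
    """Issue a conditional GET; returns (payload_or_None, new_etag)."""
    headers = {"Authorization": f"token {token}"}
    if etag:
        headers["If-None-Match"] = etag
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None, etag  # unchanged since last sync: skip processing
    return resp.json(), resp.headers.get("ETag")
```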
Add a simple pre-commit hook for `mypy` and `black` to avoid pushing code with errors that are quick and easy to fix.
Add a test when loading tokens to make sure that they are valid and have not expired.
Expired tokens return an "Unauthorized" error, which is a bit cryptic to interpret mid-stream.
Instead, I think we should validate tokens during the initialization phase (a sketch follows below). Other ideas?
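A minimal sketch of upfront validation against the `/rate_limit` endpoint (which itself does not count against the quota); the function names are illustrative:

```python
from typing import List

import requests


def is_valid_token(token: str) -> bool:
    """A 200 means the token works; a 401 means it is expired or revoked."""
    resp = requests.get(
        "https://api.github.com/rate_limit",
        headers={"Authorization": f"token {token}"},
        timeout=30,
    )
    return resp.status_code == 200


def filter_valid_tokens(tokens: List[str]) -> List[str]:
    """Drop dead tokens at init time instead of failing mid-stream."""
    valid = [t for t in tokens if is_valid_token(t)]
    if not valid:
        raise RuntimeError("No valid auth tokens available.")
    return valid
```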
In their docs, GitHub mention that long-running requests (>10 sec) are interrupted and return nothing. We hit such a case, and it seems that GitHub returns a 502 Bad Gateway HTTP status code.
In our case, the issue comments stream gave such errors.
To see the error, run `curl -v -o /dev/null "https://api.github.com/repos/bitcoin/bitcoin/issues/comments?per_page=100&sort=updated&direction=asc&since=2019-11-25T15:20:23"` (this discards the output, which is irrelevant, but shows the headers and the 502 error). Curl helpfully shows a timer which stops at 10 seconds, confirming this is a server-side timeout.
When the tap is running with multiple repos listed in its config, it chokes on such errors, returns prematurely, and never finishes the list of tasks it's supposed to do. In my testing, retrying the same query led to the same result consistently. Lowering the `per_page` param seemed to get the data (in the case above, I had to go down to 10).
The tap should be able to get around such errors in a cleaner way:
- lower the `per_page` value for a while (say, until it has completed the current stream/repo)
- `MAX_PER_PAGE` is set at 1000, and should be 100 max, according to the docs
The basic implementation done in #78 does not allow for pagination. If we are to use the GraphQL endpoint to get more data, we will need to implement pagination in the client (a sketch follows below).
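A hedged sketch of cursor pagination for a GraphQL stream, using the SDK's `get_next_page_token` / `prepare_request_payload` hooks; the JSON path to `pageInfo` depends on each query's shape, and the query itself must declare an `$after: String` variable:

```python
from typing import Any, Optional

import requests


class StargazersGraphQLStream:  # sketch: the real class would subclass GraphQLStream
    query: str  # must declare ($after: String) and pass it to the connection

    def get_next_page_token(
        self, response: requests.Response, previous_token: Optional[Any]
    ) -> Optional[str]:
        # Read the cursor out of the connection's pageInfo block.
        page_info = response.json()["data"]["repository"]["stargazers"]["pageInfo"]
        return page_info["endCursor"] if page_info["hasNextPage"] else None

    def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> dict:
        # Feed the cursor back in as the `after` variable on the next request.
        return {
            "query": self.query,
            "variables": {"after": next_page_token},
        }
```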
I've run into an annoying bug with target-postgres (datamill variant) where it crashes because a unicode char \x000 is sent to postgresql, causing it to reject the data.
The reason I'm reporting this here is that the data comes from this tap, and I'm not entirely sure what to do about it.
To reproduce:
- run `tap-github` to fetch this issue: cockroachdb/cockroach#68703 (you can get to the bug faster by hardcoding the issue number in the `IssuesStream` url, so as to only get this one issue)
- the issue body contains `\u0000`, which is decoded by python requests as `\x00` (this might be the first time I write an issue report which might cause the bug to happen again 🤯 )
- target-postgres crashes with `psycopg2.errors.CharacterNotInRepertoire: invalid byte sequence for encoding "UTF8": 0x00`
On the one hand, I feel like the target should clean the data before sending it to postgres, but the same reasoning can apply to the tap. I'm not sure what the best practice is here. The interface is JSON, and I found no evidence that `\x00` is illegal in JSON, as long as it's escaped.
Any thoughts?
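If we decide the tap should own the scrubbing, here is a minimal sketch of a recursive cleaner that `post_process` could apply; the name and placement are assumptions:

```python
from typing import Any


def remove_null_bytes(value: Any) -> Any:
    """Recursively strip \\x00 from all strings in a record."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, dict):
        return {key: remove_null_bytes(val) for key, val in value.items()}
    if isinstance(value, list):
        return [remove_null_bytes(item) for item in value]
    return value
```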
Currently, we have to bypass these tests because they were failing frequently. Instead, we should create separate tests.
For example, we could mock API responses completely.
@edgarrmondragon had an initial working implementation of requests caching using the `requests-cache` library: https://requests-cache.readthedocs.io/en/stable/
Used here: https://github.com/MeltanoLabs/tap-stackexchange/blob/main/tap_stackexchange/client.py#L7
And we have an SDK discussion on this here: https://gitlab.com/meltano/sdk/-/issues/237
@laurentS, @edgarrmondragon, and @ericboucher - If ever there were a good use case for this, I think this repo is a great one.
What do you all think of this approach?
- Use the `requests-cache` library to store http responses.
- Run the CI jobs on a single Python version (`3.9`, perhaps), with the main one running prior to the others and storing its artifacts in a GitHub CI cache.
At most, this means one "main" job needs to send http requests - but in practice, even the main job should not need to make the full set of requests.
Optionally, a scheduled job could kill the cache once a week and refresh from scratch. (A sketch of the caching setup follows below.)
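A minimal sketch of the caching setup itself, using `requests-cache`'s documented `install_cache` entry point; the cache name and expiry below are placeholders:

```python
import requests_cache

# Transparently cache every request made via the `requests` library.
requests_cache.install_cache(
    "tap_github_tests",          # -> tap_github_tests.sqlite on disk
    backend="sqlite",
    expire_after=7 * 24 * 3600,  # refresh weekly, matching the note above
)
```

The resulting SQLite file is what the CI jobs would share as a cached artifact, so only cache misses ever hit the GitHub API.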
Similar to https://github.com/MeltanoLabs/meltano-map-transform/blob/main/.github/dependabot.yml, in order to keep dependencies up to date.
Following the same logic as what is described in #110 and the linked PR, the state bookmarks should use stable IDs for their context, as opposed to the current `org` / `repo`:
If a repo is renamed between 2 runs of the tap, the later run would ignore the bookmark, as the repo name would not match anymore. While not a disaster, it would essentially force a full resync of the history of the stream, which is a waste of time/energy/api quotas.
Example of a state bookmark right now:
```
{
  "context": { "org": "scipy", "repo": "scipy" }, // this should use a stable id
  "replication_key": "updated_at",
  "replication_key_value": "2022-05-06T14:29:18Z",
  "replication_key_signpost": "2022-06-22T09:07:14.047100+00:00",
  "starting_replication_value": "2022-05-06T14:29:18Z",
  "progress_markers": {
    "Note": "Progress is not resumable if interrupted.",
    "replication_key": "updated_at",
    "replication_key_value": "2022-06-22T07:55:32Z"
  }
},
```
Following up on #112, we should:
Some initial thoughts on refactoring
Technically, `user{i}` could be `item{i}`, and we could extract the `TempStream` as a `ContextStream` to be shared between the users, repos, and orgs streams.
Some pseudo code of what I am thinking about:
```
ContextStream(request_records, request_body) -> [request_body_responses]

repo:
    request_records = [{repo, org}]
    request_body = "{nameWithOwner, databaseId, id: nodeId }"

users:
    request_records = [{login}]
    request_body = "{login, databaseId, id: nodeId }"
```
GitHub Apps have higher rate limiting. We can take some inspiration from singer-io/tap-github#163
Tests keep hitting rate limiting issues, which suggests that our request-cache implementation is not working as expected during CI runs. It is working fine locally, which makes it hard to debug...
We should resolve this problem to help with our CI, making it faster and less frustrating.
Currently `issue_comments` are fetched from `/repos/{owner}/{repo}/issues/{issue_number}/comments` as a child stream of `issues`. This allows fetching only comments for a single issue.
Our use case is more suited to fetching the same data at the repo level, using the `/repos/{owner}/{repo}/issues/comments` API endpoint. It returns the same data for each comment, according to the docs.
The main value for us in using the second solution is that we get more data for the same API usage quota: the typical issue will have far fewer than 100 comments on it, so we get comments for multiple issues in one call. We do not run this as a child stream, instead we feed it a list of repos for which we want data (as in #5), and sync only this stream. This allows us better control over API usage limits.
Is there a way to have both streams coexist in the tap so that either can be used if needed, but the default setup would not double fetch data? (What I mean is that `issue_comments_by_repo` could be a child stream of `repositories`, and `issue_comments_by_issue` a child of `issues`, itself a child of `repositories`, so if fetching the whole catalog, we would fetch some data twice.)
I think there are other streams which have a similar situation, so this is probably a slightly more general question than just this stream (in a similar fashion, `repositories` can be obtained through the `search` endpoint or the `repos` endpoint, although the data returned is slightly different). Another example is `issue_events`, which can be fetched by issue or by repo.
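A hedged sketch of how the two streams could coexist; the base and parent class names below are taken from the tap but may differ, and schemas are omitted:

```python
from tap_github.client import GitHubRestStream
from tap_github.repository_streams import IssuesStream, RepositoryStream


class IssueCommentsByRepoStream(GitHubRestStream):
    """One request per page of repo comments: cheap on API quota."""

    name = "issue_comments_by_repo"
    path = "/repos/{org}/{repo}/issues/comments"
    parent_stream_type = RepositoryStream


class IssueCommentsByIssueStream(GitHubRestStream):
    """One request per issue: precise, but quota-hungry."""

    name = "issue_comments_by_issue"
    path = "/repos/{org}/{repo}/issues/{issue_number}/comments"
    parent_stream_type = IssuesStream
```

With both defined, a catalog that deselects one of them avoids the double fetch while keeping the other available.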
While developing, I've run a few times into GitHub's API rate limits (with my auth_token set!), which gets pretty annoying, as the tap does not work at all then.
This might become an issue for CI as well when the number of streams increases (the limit without an auth token is 60 requests/hour per IP).
So I am wondering if we should set up some sort of mocking of API calls, for instance using pytest's monkeypatching fixture, or some other mechanism.
Pros:
Cons:
Error loading teams data through Stitch (schema error). After closer evaluation, we realised this was because the parent's returned value had changed from a string to an object (see below, plus the list_teams schema: https://docs.github.com/en/rest/teams/teams#list-teams).
Syncing the GitHub tap through Stitch using Meltano (tap-github -> Meltano -> Stitch -> Warehouse) obtains the error:
CRITICAL Error persisting data to Stitch: 400: {'error': 'Record 0 for table teams did not conform to schema:\n#/parent: #: no subschema matched out of the total 2 subschemas\n#/parent: expected type: String, found: JSONObject\n#/parent: expected: null, found: JSONObject\n'}
In `organization_stream.py` -> `TeamsStream`, change
```python
schema = th.PropertiesList(
    # Parent Keys
    th.Property("org", th.StringType),
    # Rest
    th.Property("id", th.IntegerType),
    th.Property("node_id", th.StringType),
    th.Property("url", th.StringType),
    th.Property("html_url", th.StringType),
    th.Property("name", th.StringType),
    th.Property("slug", th.StringType),
    th.Property("description", th.StringType),
    th.Property("privacy", th.StringType),
    th.Property("permission", th.StringType),
    th.Property("members_url", th.StringType),
    th.Property("repositories_url", th.StringType),
    th.Property("parent", th.StringType),
).to_dict()
```
to
```python
schema = th.PropertiesList(
    # Parent Keys
    th.Property("org", th.StringType),
    # Rest
    th.Property("id", th.IntegerType),
    th.Property("node_id", th.StringType),
    th.Property("url", th.StringType),
    th.Property("html_url", th.StringType),
    th.Property("name", th.StringType),
    th.Property("slug", th.StringType),
    th.Property("description", th.StringType),
    th.Property("privacy", th.StringType),
    th.Property("permission", th.StringType),
    th.Property("members_url", th.StringType),
    th.Property("repositories_url", th.StringType),
    th.Property("parent", th.ObjectType(
        th.Property("id", th.IntegerType),
        th.Property("node_id", th.StringType),
        th.Property("url", th.StringType),
        th.Property("html_url", th.StringType),
        th.Property("name", th.StringType),
        th.Property("slug", th.StringType),
        th.Property("description", th.StringType),
        th.Property("privacy", th.StringType),
        th.Property("permission", th.StringType),
        th.Property("members_url", th.StringType),
        th.Property("repositories_url", th.StringType),
    )),
).to_dict()
```
Hope this helps!
In our logs, we regularly get the following error:
```
requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
```
which is due to:
```
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
```
itself coming from
```
.../.venv/lib/python3.9/site-packages/urllib3/response.py", line 704, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
```
The underlying error is on the server-side according to psf/requests#5583, however I can't find any info elsewhere on the web. It may be a transient error on github's side, so opening this ticket to keep a record of the problem.