meltanolabs / tap-github
A Singer tap for extracting data from Github. Powered by the Meltano SDK for Singer Taps: https://sdk.meltano.com
License: Apache License 2.0
Singer's GitHub tap adds an `_sdc_repository` field to streams that don't get one from the API, as an easy-to-use identifier linking records to their parent repo.
See https://github.com/singer-io/tap-github/blob/master/tap_github/__init__.py#L927
Should we do something similar with the Meltano taps?
At the moment, `repo` and `org` keys get added, so we could decide to leverage that instead.
One caveat: some streams actually have fields called `repo` and `org` that could be incompatible (see my upcoming Events stream PR).
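For reference, a minimal sketch of the Singer-style approach, assuming a `post_process` hook on a REST base class (class and attribute names here are illustrative, not the tap's actual code); a prefixed field sidesteps the collision problem entirely:

```python
from typing import Optional

from singer_sdk.streams import RESTStream


class GitHubStream(RESTStream):
    """Sketch: inject a prefixed identifier that cannot collide with API fields."""

    url_base = "https://api.github.com"

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        if context:
            # e.g. context == {"org": "MeltanoLabs", "repo": "tap-github"}
            row["_sdc_repository"] = f"{context['org']}/{context['repo']}"
        return row
```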
I might be misunderstanding how the schema definition works for a stream, but this bothers me.
With the following schema (from `issue_comments` in 71b07b7):
```python
schema = th.PropertiesList(
    th.Property("id", th.IntegerType),
    th.Property("node_id", th.StringType),
    th.Property("repo", th.StringType),
    th.Property("org", th.StringType),
    th.Property("issue_url", th.IntegerType),
    th.Property("updated_at", th.DateTimeType),
    th.Property("created_at", th.DateTimeType),
    th.Property("author_association", th.StringType),
    th.Property("body", th.StringType),
    th.Property(
        "user",
        th.ObjectType(
            th.Property("login", th.StringType),
            th.Property("id", th.IntegerType),
            th.Property("node_id", th.StringType),
            th.Property("avatar_url", th.StringType),
            th.Property("gravatar_id", th.StringType),
            th.Property("html_url", th.StringType),
            th.Property("type", th.StringType),
            th.Property("site_admin", th.BooleanType),
        ),
    ),
).to_dict()
```
I am seeing the following records:
```json
{
  "type": "RECORD",
  "stream": "issue_comments",
  "record": {
    "issue_url": "https://api.github.com/repos/singer-io/tap-facebook/issues/157",
    "id": 895733451,
    "node_id": "IC_kwDOBRvyIM41Y87L",
    "user": {
      "login": "lscottallen004",
      "id": 83940128,
      "node_id": "MDQ6VXNlcjgzOTQwMTI4",
      "avatar_url": "https://avatars.githubusercontent.com/u/83940128?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/lscottallen004",
      "html_url": "https://github.com/lscottallen004",
      "followers_url": "https://api.github.com/users/lscottallen004/followers",
      "following_url": "https://api.github.com/users/lscottallen004/following{/other_user}",
      "gists_url": "https://api.github.com/users/lscottallen004/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/lscottallen004/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/lscottallen004/subscriptions",
      "organizations_url": "https://api.github.com/users/lscottallen004/orgs",
      "repos_url": "https://api.github.com/users/lscottallen004/repos",
      "events_url": "https://api.github.com/users/lscottallen004/events{/privacy}",
      "received_events_url": "https://api.github.com/users/lscottallen004/received_events",
      "type": "User",
      "site_admin": false
    },
    "created_at": "2021-08-10T05:09:01Z",
    "updated_at": "2021-08-10T05:09:01Z",
    "author_association": "NONE",
    "body": "> \n\n- > > > > > > ~~> >~~~~~~~~> @iterati _@iterati _@iterati _[]()[]()_@hz-lschick ____>~~ []",
    "org": "singer-io",
    "repo": "tap-facebook"
  },
  "time_extracted": "2021-09-10T23:07:19.736119Z"
}
```
A number of the nested `*_url` fields are present in the record, although they are excluded from the schema definition. It looks like a call to https://gitlab.com/meltano/sdk/-/blob/main/singer_sdk/helpers/_singer.py#L23 from `pop_deselected_record_properties` causes the fields to be included because `user` is included, and somehow the details of the nested object do not appear in the selection mask.
I suspect this is a bug in the SDK, but I might have misunderstood how the code is supposed to work 🤔
Python 3.6 is reaching its EOL on December 23rd, so this is probably a good time to drop support for it.
Any objections to a short PR dropping 3.6 from CI and `pyproject.toml`?
Source: https://endoflife.date/python
Currently, this was built around the `searches` config option, which allows GitHub repo searches to drive dynamic repo selection.
I suggest we make three vectors for selection, each of which would be mutually exclusive during setting validation (a sketch of the check follows this list):
- `searches` - (existing) finds repos based on the search API
- `repositories` - (hi-pri) takes a list of one or more repositories, such as `['MeltanoLabs/tap-github']`
- `organizations` - (low-pri) takes a list of one or more organizations, such as `['MeltanoLabs']`
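A minimal sketch of the mutual-exclusion check; the helper name and the hook point (tap initialization vs. stream discovery) are assumptions:

```python
VALID_QUERY_OPTIONS = {"searches", "repositories", "organizations"}


def validate_selection_config(config: dict) -> str:
    """Return the single selection option present in config, or raise."""
    present = VALID_QUERY_OPTIONS & set(config)
    if len(present) != 1:
        raise ValueError(
            f"Config must contain exactly one of {sorted(VALID_QUERY_OPTIONS)}, "
            f"got {sorted(present) or 'none'}."
        )
    return present.pop()
```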
In sdk v0.4.9, a new function was added through this MR which improves the sdk-level handling of rate limits.
`tap-github`'s handling still leads to failures in some cases: with multiple tokens passed in config and enough remaining quota overall, the tap can still stop running with an error like:
name=backoff level=ERROR message=Giving up _request(...) after 5 tries (singer_sdk.exceptions.RetriableAPIError: 403 Client Error: b'{"message":"API rate limit exceeded for user ID xxxxxxx.","documentation_url":"https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"}' (Reason: Forbidden) for path: /repos/camicroscope/caMicroscope/languages)
The key part of this message (in our production logs) is that the user ID remains the same across the 5 retries, which is not expected.
It looks like the tap's authenticator correctly rotates the token, but the backoff-decorated request continues retrying with the same token for a given request: the authenticator passes the token to the request in `prepare_request`, but the request headers are not updated with the new token until the next record.
While this works most of the time, if the buffer is less than the number of retries, or some other process is depleting quotas in parallel, it is possible to perform the exact same query 5 times in a row without actually changing the token.
The new `backoff_handler` should be able to fix this (a sketch follows below).
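A hedged sketch of what that fix could look like, assuming the `backoff` library's standard callback payload (where `args[0]` is the prepared request, since the SDK decorates the bound `_request` method) and a hypothetical `get_next_auth_token()` rotation hook on the authenticator:

```python
import requests


class GitHubRestStream:  # sketch: the real base class lives in tap_github/client.py
    def backoff_handler(self, details: dict) -> None:
        """Rotate the auth token between retries so an exhausted one is not reused."""
        # `details` is the backoff library's callback payload; for the SDK's
        # decorated `_request`, args[0] is the prepared request being retried.
        prepared_request: requests.PreparedRequest = details["args"][0]
        self.authenticator.get_next_auth_token()  # hypothetical rotation hook
        # Re-stamp the headers that were frozen at prepare_request time.
        prepared_request.headers.update(self.authenticator.auth_headers)
```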
Currently we have a REST base stream implemented for our tap:
https://github.com/MeltanoLabs/tap-github/blob/main/tap_github/client.py#L10
However, it would be greatly beneficial if we also considered making a base stream for the GitHub GraphQL API. The GraphQL API uses a 'credits' system, which makes it much more efficient at collecting granular data than REST, where each resource, no matter how small, takes up an entire request.
Streams that can greatly benefit from the GraphQL API:
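A minimal sketch of what the proposed base class could look like, built on the SDK's `GraphQLStream`; all names here are illustrative:

```python
from singer_sdk.streams import GraphQLStream


class GitHubGraphQLStream(GraphQLStream):
    """Sketch of a shared base class for GraphQL-backed streams."""

    url_base = "https://api.github.com"
    path = "/graphql"  # all GraphQL queries POST to the same endpoint

    @property
    def query(self) -> str:
        """Each subclass supplies its GraphQL document here."""
        raise NotImplementedError
```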
Currently the PR stream is missing a bunch of key fields:
- `additions`
- `deletions`
- `review_comments` (int)
- `changed_files` (int)
- etc.
The current API route we use, https://docs.github.com/en/rest/reference/pulls#list-pull-requests, isn't sufficient to get all these missing fields. We would need to use the singular PR route https://docs.github.com/en/rest/reference/pulls#get-a-pull-request to fetch everything. This may be difficult, since it would greatly increase the number of requests made to the API. We could maybe cut this down by making a custom query with absolutely everything with #48.
As I write this issue, https://github.com/mikelove/DESeq2 does not have a readme, which results in https://api.github.com/repos/mikelove/DESeq2/readme returning a 404.
I will implement a small fix along the lines of what is described in #16 for the community profile stream.
I am unsure of what situation it reflects, but I have been seeing the following error on `issue_events` where there doesn't seem to be any linked issue, and we get an `AttributeError: 'NoneType' object has no attribute 'pop'`.
To start, we could simply add a check for the issue object and set null `issue_number` and `issue_url` if the object does not exist (see the sketch after the traceback). Thoughts?
```
time=2021-09-27 15:42:20 name=tap-github level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 1.158103, 'tags': {'endpoint': '/repos/{org}/{repo}/issues/events', 'http_status_code': 200, 'status': 'succeeded', 'context': {'org': 'facebook', 'repo': 'react'}}}
Traceback (most recent call last):
  .../...
  File "...tract/.venv/lib/python3.9/site-packages/tap_github/streams.py", line 529, in post_process
    row["issue_number"] = int(row["issue"].pop("number"))
AttributeError: 'NoneType' object has no attribute 'pop'
```
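A minimal sketch of the guard described above, assuming the tap's `post_process` signature; field names other than `number` are guesses:

```python
from typing import Optional


class IssueEventsStream:  # sketch: the real class subclasses the tap's REST base
    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        issue = row.get("issue")
        if issue is None:
            # No linked issue: emit explicit nulls instead of crashing.
            row["issue_number"] = None
            row["issue_url"] = None
        else:
            row["issue_number"] = int(issue.pop("number"))
            row["issue_url"] = issue.get("url")
        return row
```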
I see `api_url_base` is retrieved from the config in the code here, but it's not listed in the config options in the tap. I'm not positive, but I think it would be ignored right now if it were set in the config.json; is that right?
Is this purposeful? If not, I can create a PR to get it added to the config!
The test logs are very verbose, making them cumbersome to analyze on the GitHub website, which keeps freezing.
We should find a way to reduce the log level to make it more digestible.
When an issue arises in a stream, it would be quite useful to catch it and add context information (stream, repo, ...).
This would make it a lot easier to replicate errors locally and fix them.
cc @laurentS
We use the `dependents` count for a repository, which is currently fetched by grabbing the HTML page for the project (e.g. https://github.com/facebook/react/network/dependents) and parsing the HTML. As I write this ticket, the link above returns `7,878,702 Repositories` (and likewise for packages), and we grab these numbers.
Unfortunately, this info does not seem to exist anywhere in either the REST or GraphQL APIs.
@aaronsteers Would you have any objection to me adding a request for that page to the `repositories` stream, resulting in an extra field? Possibly behind some config option, as it is fairly download heavy (the page above weighs 187kB).
Maybe in the `post_process` method (a sketch follows below)? Ideally, the data will eventually be available in one of the APIs, and this can then be dropped.
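A hedged sketch of the scraping idea, as a standalone helper that `post_process` could call; the function name and the regex against GitHub's markup are assumptions and will break if the page changes:

```python
import re
from typing import Optional

import requests


def fetch_dependents_count(org: str, repo: str) -> Optional[int]:
    """Scrape the dependents count from the HTML network/dependents page."""
    url = f"https://github.com/{org}/{repo}/network/dependents"
    html = requests.get(url, timeout=30).text
    # The count appears in the page as e.g. "7,878,702 Repositories".
    match = re.search(r"([\d,]+)\s+Repositories", html)
    return int(match.group(1).replace(",", "")) if match else None
```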
A few schemas, including the owner schema, the project schema, and others, are duplicated across the streams.
Let's extract some of them and refactor our schema definitions to:
@ericboucher @laurentS @edgarrmondragon
This is a fast-developing tap! We've expanded now to 9 streams and we've got several more on the way.
What can our plan be for efficient management of releases? Kicking off this issue for discussion.
I would love to introduce auto-changelog entries and auto-publish to PyPI if those were feasible. If not auto-generated, then we would need a process to efficiently manage the changelog and release processes.
I am interested in pulling CLOSED issues and MERGED PRs as well as OPEN ones. The current behavior for the `issues` stream is to pull all OPEN issues (and PRs).
I think that by default, the `issues` stream could fetch ALL issues that have been updated since our start date; and we could have an option to only select OPEN issues.
Somewhat linked: it could be an opportunity to clearly separate issues and PRs. The issues stream would pull only issues, and a pull_request stream would handle pull requests.
What do you think?
Investigate a bit more why the issues/comments stream is raising the following error: `ValueError: invalid literal for int() with base 16: b''`
Possible solutions to explore, e.g. patching urllib3: psf/requests#4248
```
2022-06-16T12:16:57.756152Z [info ] INFO Load t_issues_f4627bf93f834fa8b9b9682a78c94961 by FULL_TABLE (truncate) cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
2022-06-16T12:16:57.756693Z [info ] INFO loading t_issues_f4627bf93f834fa8b9b9682a78c94961 to BigQuery cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
2022-06-16T12:16:58.179806Z [info ] ERROR failed to load table t_issues_f4627bf93f834fa8b9b9682a78c94961 from file: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/pulumi-devrel/jobs?uploadType=resumable: Field assignee is type RECORD but has no schema cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
2022-06-16T12:17:01.383314Z [info ] CRITICAL 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/pulumi-devrel/jobs?uploadType=resumable: Field assignee is type RECORD but has no schema cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
```
My config looks like:
```yaml
environments:
- name: production
  config:
    plugins:
      extractors:
      - name: tap-github
        config:
          repository: rawkode/rawkode
          start_date: "2017-01-01T00:00:00Z"
          access_token: <REDACTED>
        select:
        - commits.id
        - commits.commit.author.date
        - commits.commit.author.name
        - commits.commit.message.*
        - "*.*"
      loaders:
      - name: target-bigquery
        config:
          project_id: pulumi-devrel
          dataset_id: github
          credentials_path: client.json
```
It is possible to "corrupt" data coming out of the tap by inputting incorrectly cased values in the config file. While it does not directly affect the tap itself, it can lead to problems downstream (we've just hit this problem).
I am not sure what the correct solution is, so I'm opening this ticket for discussion and awareness.
In config, set `"repositories": ["meltanolabs/tap-github"]` (lowercase `m` and `l` on purpose) and run the tap deselecting the `repositories` stream, keeping at least one child stream, `issues` for instance.
The tap will run fine, since the GitHub API will handle the case inconsistencies in the request and return data containing `MeltanoLabs` instead of `meltanolabs`.
But because of these lines https://github.com/MeltanoLabs/tap-github/blob/0851e80578bbb17f69faa4c6b2220d30ed1802c5/tap_github/repository_streams.py#L103-107 the output from the tap will contain `org=meltanolabs` but `full_name=MeltanoLabs/tap-github` (note the difference in upper/lowercase chars). If you use the `org` and `repo` columns to join records from different streams together, you can end up with inconsistent values which are hard to reconcile.
The obvious solution to this problem is to avoid incorrectly cased input in the config, but this is fairly fragile. Another option would be to change the way the child streams get their `org`/`repo` values, by parsing the `url` field for instance (which is received from GitHub and matches the "official" repo name casing), but this seems fairly costly performance-wise.
Are there any best practices around this in other taps?
We could use https://github.com/github/rest-api-description/blob/main/descriptions-next/api.github.com/api.github.com.json to validate our requests and make sure that we are following the api specs.
I believe as of now, we are still limited by the no-auth rate limits, which are lower than authenticated rates.
We may get better rate limits from using the built-in `GITHUB_TOKEN` secret. I haven't read deeply enough to know if there are any specific limitations on this token which would block us, but it seems worth exploring.
`tap-github --about` requires a valid config file, which kind of defeats the purpose:
```
ericboucher@Erics-MBP website % tap-github --about
time=2022-04-12 15:36:20 name=tap-github level=INFO message=Skipping parse of env var settings...
time=2022-04-12 15:36:20 name=tap-github level=INFO message=Config validation passed with 0 errors and 0 warnings.
time=2022-04-12 15:36:20 name=root level=INFO message=Operator '__else__=None' was not found. Unmapped streams will be included in output.
Traceback (most recent call last):
  File "/opt/homebrew/bin/tap-github", line 8, in <module>
    sys.exit(cli())
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/homebrew/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 467, in cli
    tap = cls(  # type: ignore  # Ignore 'type not callable'
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 92, in __init__
    self.input_catalog or self._singer_catalog
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 235, in _singer_catalog
    for stream in self.streams.values()
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 118, in streams
    for stream in self.load_streams():
  File "/opt/homebrew/lib/python3.9/site-packages/singer_sdk/tap_base.py", line 267, in load_streams
    for stream in self.discover_streams():
  File "/opt/homebrew/lib/python3.9/site-packages/tap_github/tap.py", line 65, in discover_streams
    raise ValueError(
ValueError: This tap requires one and only one of the following path options: {'organizations', 'user_ids', 'user_usernames', 'repositories', 'searches'}.
```
Let's make sure anyone can run this command after installing tap-github.
As I was implementing a GithubStream for `community_profile`, I stumbled upon the following issue:
Some repos actually return a 404 when we request their `community_profile`. It is quite lucky that this got caught early, thanks to the main test repo being one of these edge cases. Though I still do not understand why.
As a workaround, we could allow `tolerated_http_errors` and handle them differently. This is what I have implemented here (a sketch follows below). But I am open to other ideas 😄
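For illustration, a minimal sketch of the `tolerated_http_errors` idea on top of the SDK's `RESTStream` hooks (`validate_response` / `parse_response`); the path and class names are assumptions, not the actual implementation:

```python
import requests
from singer_sdk.streams import RESTStream


class CommunityProfileStream(RESTStream):
    """Sketch: treat listed status codes as 'no data' rather than failures."""

    url_base = "https://api.github.com"
    path = "/repos/{org}/{repo}/community/profile"
    tolerated_http_errors = [404]

    def validate_response(self, response: requests.Response) -> None:
        if response.status_code in self.tolerated_http_errors:
            return  # skip the SDK's raise-on-error handling
        super().validate_response(response)

    def parse_response(self, response: requests.Response):
        if response.status_code in self.tolerated_http_errors:
            return  # emit no records for tolerated errors
        yield from super().parse_response(response)
```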
Open question:
`/repos/{org}/{repo}/contributors` and `/repos/{org}/{repo}/stats/contributors` both return HTTP status 204 when called for an empty repository. Example Postman request:
As a result, the Contributors, Anonymous Contributors, and Stats Contributors streams each raise a `JSONDecodeError` when attempting to parse the empty response:
```
time=2022-05-28 23:34:30 name=tap-github level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.435061, 'tags': {'endpoint': '/search/repositories', 'http_status_code': 200, 'status': 'succeeded', 'context': {'search_name': 'empty-repo', 'search_query': 'user:epapineau example-empty-repo'}}}
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Beginning full_table sync of 'contributors' with context: {'org': 'epapineau', 'repo': 'example-empty-repo', 'repo_id': 497421330}...
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Tap has custom mapper. Using 1 provided map(s).
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Found 1 'GITHUB_TOKEN' environment variables for authentication.
time=2022-05-28 23:34:30 name=tap-github level=INFO message=Tap will run with 1 auth tokens
time=2022-05-28 23:34:31 name=tap-github level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.313254, 'tags': {'endpoint': '/repos/{org}/{repo}/contributors', 'http_status_code': 204, 'status': 'succeeded', 'context': {'org': 'epapineau', 'repo': 'example-empty-repo', 'repo_id': 497421330}}}
Traceback (most recent call last):
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "venv/bin/tap-github", line 8, in <module>
    sys.exit(cli())
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 499, in cli
    tap.sync_all()
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/tap_base.py", line 379, in sync_all
    stream.sync()
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 1020, in sync
    self._sync_records(context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 962, in _sync_records
    self._sync_children(child_context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 1025, in _sync_children
    child_stream.sync(context=child_context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 1020, in sync
    self._sync_records(context)
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/core.py", line 946, in _sync_records
    for record_result in self.get_records(current_context):
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/rest.py", line 424, in get_records
    for record in self.request_records(context):
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/singer_sdk/streams/rest.py", line 323, in request_records
    yield from self.parse_response(resp)
  File "/Users/elizepapineau/dev/tap-github/tap_github/client.py", line 226, in parse_response
    resp_json = response.json()
  File "/Users/elizepapineau/dev/tap-github/venv/lib/python3.8/site-packages/requests/models.py", line 917, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: [Errno Expecting value] : 0
```
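A minimal sketch of a guard, assuming we short-circuit `parse_response` before the `.json()` call in `tap_github/client.py`; the list/object handling below is illustrative:

```python
from typing import Iterable

import requests


class GitHubRestStream:  # sketch: mirrors the base class in tap_github/client.py
    def parse_response(self, response: requests.Response) -> Iterable[dict]:
        # An empty repo yields HTTP 204 with no body; .json() would raise.
        if response.status_code == 204 or not response.content:
            return  # nothing to emit
        resp_json = response.json()
        # Endpoints return either a list of records or a single object.
        if isinstance(resp_json, list):
            yield from resp_json
        else:
            yield resp_json
```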
In #93 I decided to run the tap through a simple target, `target-sqlite`, to see if the new streams I added worked.
By doing this, I instantly found mistyped fields in the `Events` stream.
Example error:
```
CRITICAL '20730626430' is not of type 'integer', 'null'
CRITICAL
CRITICAL Failed validating 'type' in schema['properties']['id']:
CRITICAL     {'type': ['integer', 'null']}
CRITICAL
CRITICAL On instance['id']:
CRITICAL     '20730626430'
```
I think this would be a great testing method to ensure our schemas are correctly typed.
Possibly a new CI step that runs `poetry run tap-github --config c.json --catalog ca.json | poetry run target-sqlite -c sq.json` and fails if the exit code is non-zero.
Thoughts on implementing this?
The upgrade to `singer-sdk >= 0.3.14` breaks the way this tap works with `tolerated_http_errors`. The logic for this should be modified to allow using the latest sdk.
We would like the ability to fetch repo languages and LOC. The repo object has one main language, but a full list is available under the `languages_url`, with the respective size of each language (note: the API reports bytes of code rather than LOC): https://api.github.com/repos/facebook/react/languages. A sketch of such a stream follows below.
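The endpoint returns a single `{language: bytes_of_code}` object, so `parse_response` can flatten it into one record per language. All names below are illustrative; the real class would subclass the tap's REST base:

```python
from typing import Iterable

import requests
from singer_sdk import typing as th


class LanguagesStream:  # sketch: the real class would subclass the tap's REST base
    name = "languages"
    path = "/repos/{org}/{repo}/languages"
    primary_keys = ["repo", "org", "language_name"]

    schema = th.PropertiesList(
        th.Property("repo", th.StringType),
        th.Property("org", th.StringType),
        th.Property("language_name", th.StringType),
        th.Property("bytes", th.IntegerType),
    ).to_dict()

    def parse_response(self, response: requests.Response) -> Iterable[dict]:
        # e.g. {"JavaScript": 4268926, "HTML": 115, ...} -> one record each
        for language, num_bytes in response.json().items():
            yield {"language_name": language, "bytes": num_bytes}
```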
Effectively, this tap is already a lot more advanced than https://github.com/singer-io/tap-github, which is the tap currently proposed as the default GitHub tap on meltano.com.
Let's add the missing endpoints from https://github.com/singer-io/tap-github and reach out to the relevant people at Meltano to upgrade the default tap to this one.
Missing endpoints:
I'm trying to load this tap-github into target-bigquery and get the following error:
CRITICAL Duplicate field(s) in stream issues: & reactions.-1 are read as REACTIONS._1 by BigQuery
The `pull_requests` stream has the same problem. What's the best way to deal with it? Having the special chars +/- in the schema name causes trouble; should this be fixed at the extractor level, or should I look for a solution with this particular loader?
The GitHub REST API makes it impossible to traverse the stargazers endpoint for large repos and limits how many pages we can query.
For example, querying `https://api.github.com/repositories/10270250/stargazers?page=500&per_page=100` will result in the following error:
```json
{
  "message": "In order to keep the API fast for everyone, pagination is limited for this resource. Check the rel=last link relation in the Link response header to see how far back you can traverse.",
  "documentation_url": "https://docs.github.com/v3/#pagination"
}
```
And contrary to what the message suggests, querying the "last" link still gives an error...
To make things worse, the API does NOT allow any of the following params for this endpoint:
A solution is to paginate through a Graphql version instead:
query {
repository(name: "react" owner: "facebook") {
stargazers(first: 100 orderBy: {field: STARRED_AT direction: DESC}) {
edges {
node {
id
login
}
starredAt
}
}
}
}
I just noticed after pushing a couple of commits to #71 that each CI job is run twice, once for the PR, once for the push.
There must be some setting on GitHub Actions to avoid this; even if we're caching results, it seems super wasteful.
Let's create a stream to get useful data not readily available from the API.
Any other ideas @laurentS?
We can piggyback on the work done with #127.
Thanks to the changes introduced by #112, we now have a more reliable column to use for child streams.
Note: this is potentially a breaking change and might alter the DB index structure.
The GitHub API allows for conditional requests that do not count towards rate-limiting.
It could be interesting to leverage these and avoid making requests when they are not necessary (a sketch follows below).
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#conditional-requests
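A minimal sketch of a conditional request, assuming we cache the `ETag` from a previous response; a 304 reply has no body and does not count against the rate limit:

```python
from typing import Any, Optional, Tuple

import requests


def fetch_if_changed(
    url: str, etag: Optional[str], token: str
) -> Tuple[Optional[Any], Optional[str]]:
    """Issue a conditional GET; returns (payload_or_None, new_etag)."""
    headers = {"Authorization": f"token {token}"}
    if etag:
        headers["If-None-Match"] = etag
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None, etag  # unchanged since last sync: skip processing
    return resp.json(), resp.headers.get("ETag")
```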
Add a simple pre-commit hook for `mypy` and `black` to avoid pushing code with errors that are quick and easy to fix.
Add a test when loading tokens to make sure that they are valid and have not expired.
Expired tokens return an "Unauthorized" error, which is a bit cryptic to interpret mid-stream.
Instead, I think we should validate tokens during the initialization phase (a sketch follows below). Other ideas?
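A minimal sketch of upfront validation against the `/rate_limit` endpoint (which itself does not count against the quota); the function names are illustrative:

```python
from typing import List

import requests


def is_valid_token(token: str) -> bool:
    """A 200 means the token works; a 401 means it is expired or revoked."""
    resp = requests.get(
        "https://api.github.com/rate_limit",
        headers={"Authorization": f"token {token}"},
        timeout=30,
    )
    return resp.status_code == 200


def filter_valid_tokens(tokens: List[str]) -> List[str]:
    """Drop dead tokens at init time instead of failing mid-stream."""
    valid = [t for t in tokens if is_valid_token(t)]
    if not valid:
        raise RuntimeError("No valid auth tokens available.")
    return valid
```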
In their docs, GitHub mention that long-running requests (>10 sec) are interrupted and return nothing. We hit such a case, and it seems that GitHub returns a 502 Bad Gateway HTTP status code.
In our case, the issue comments stream gave such errors.
To see the error, run `curl -v -o /dev/null "https://api.github.com/repos/bitcoin/bitcoin/issues/comments?per_page=100&sort=updated&direction=asc&since=2019-11-25T15:20:23"` (this discards the output, which is irrelevant, but shows the headers and the 502 error). Curl helpfully shows a timer which stops at 10 seconds, confirming this is a server-side timeout.
When the tap is running with multiple repos listed in its config, it chokes on such errors, returns prematurely, and never finishes the list of tasks it's supposed to do. In my testing, retrying the same query led to the same result consistently. Lowering the `per_page` param seemed to get the data (in the case above, I had to go down to 10).
The tap should be able to get around such errors in a cleaner way:
- lower the `per_page` value for a while (say, until it has completed the current stream/repo)
- `MAX_PER_PAGE` is set at 1000, and should be 100 max, according to the docs
The basic implementation done in #78 does not allow for pagination. If we are to use the GraphQL endpoint to get more data, we will need to implement pagination in the client (a sketch follows below).
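A hedged sketch of cursor pagination for a GraphQL stream, using the SDK's `get_next_page_token` / `prepare_request_payload` hooks; the JSON path to `pageInfo` depends on each query's shape, and the query itself must declare an `$after: String` variable:

```python
from typing import Any, Optional

import requests


class StargazersGraphQLStream:  # sketch: the real class would subclass GraphQLStream
    query: str  # must declare ($after: String) and pass it to the connection

    def get_next_page_token(
        self, response: requests.Response, previous_token: Optional[Any]
    ) -> Optional[str]:
        # Read the cursor out of the connection's pageInfo block.
        page_info = response.json()["data"]["repository"]["stargazers"]["pageInfo"]
        return page_info["endCursor"] if page_info["hasNextPage"] else None

    def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> dict:
        # Feed the cursor back in as the `after` variable on the next request.
        return {
            "query": self.query,
            "variables": {"after": next_page_token},
        }
```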
I've run into an annoying bug with target-postgres (datamill variant) where it crashes because a unicode char \x000 is sent to postgresql, causing it to reject the data.
The reason I'm reporting this here is that the data comes from this tap, and I'm not entirely sure what to do about it.
To reproduce:
- run `tap-github` to fetch this issue: cockroachdb/cockroach#68703 (you can get to the bug faster by hardcoding the issue number in the `IssuesStream` url, so as to only get this one issue)
- the issue body contains `\u0000`, which is decoded by python requests as `\x00` (this might be the first time I write an issue report which might cause the bug to happen again 🤯 )
- target-postgres crashes with `psycopg2.errors.CharacterNotInRepertoire: invalid byte sequence for encoding "UTF8": 0x00`
On the one hand, I feel like the target should clean the data before sending it to postgres, but the same reasoning can apply to the tap. I'm not sure what the best practice is here. The interface is JSON, and I found no evidence that `\x00` is illegal in JSON, as long as it's escaped.
Any thoughts?
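If we decide the tap should own the scrubbing, here is a minimal sketch of a recursive cleaner that `post_process` could apply; the name and placement are assumptions:

```python
from typing import Any


def remove_null_bytes(value: Any) -> Any:
    """Recursively strip \\x00 from all strings in a record."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, dict):
        return {key: remove_null_bytes(val) for key, val in value.items()}
    if isinstance(value, list):
        return [remove_null_bytes(item) for item in value]
    return value
```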
Currently, we have to bypass these tests because they were failing frequently. Instead, we should create separate tests.
For example, we could mock API responses completely.
@edgarrmondragon had an initial working implementation of requests caching using the `requests-cache` library: https://requests-cache.readthedocs.io/en/stable/
Used here: https://github.com/MeltanoLabs/tap-stackexchange/blob/main/tap_stackexchange/client.py#L7
And we have an SDK discussion on this here: https://gitlab.com/meltano/sdk/-/issues/237
@laurentS, @edgarrmondragon, and @ericboucher - If ever there were a good use case for this, I think this repo is a great one.
What do you all think of this approach?
- Use the `requests-cache` library to store http responses.
- Run the CI jobs on a single Python version (`3.9`, perhaps), with the main one running prior to the others and storing its artifacts in a GitHub CI cache.
At most, this means one "main" job needs to send http requests - but in practice, even the main job should not need to make the full set of requests.
Optionally, a scheduled job could kill the cache once a week and refresh from scratch. (A sketch of the caching setup follows below.)
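A minimal sketch of the caching setup itself, using `requests-cache`'s documented `install_cache` entry point; the cache name and expiry below are placeholders:

```python
import requests_cache

# Transparently cache every request made via the `requests` library.
requests_cache.install_cache(
    "tap_github_tests",          # -> tap_github_tests.sqlite on disk
    backend="sqlite",
    expire_after=7 * 24 * 3600,  # refresh weekly, matching the note above
)
```

The resulting SQLite file is what the CI jobs would share as a cached artifact, so only cache misses ever hit the GitHub API.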
Similar to https://github.com/MeltanoLabs/meltano-map-transform/blob/main/.github/dependabot.yml, in order to keep dependencies up to date.
Following the same logic as what is described in #110 and the linked PR, the state bookmarks should use stable IDs for their context, as opposed to the current `org` / `repo`:
If a repo is renamed between 2 runs of the tap, the later run would ignore the bookmark, as the repo name would not match anymore. While not a disaster, it would essentially force a full resync of the history of the stream, which is a waste of time/energy/api quotas.
Example of a state bookmark right now:
```
{
  "context": { "org": "scipy", "repo": "scipy" }, // this should use a stable id
  "replication_key": "updated_at",
  "replication_key_value": "2022-05-06T14:29:18Z",
  "replication_key_signpost": "2022-06-22T09:07:14.047100+00:00",
  "starting_replication_value": "2022-05-06T14:29:18Z",
  "progress_markers": {
    "Note": "Progress is not resumable if interrupted.",
    "replication_key": "updated_at",
    "replication_key_value": "2022-06-22T07:55:32Z"
  }
},
```
Following up on #112, we should:
Some initial thoughts on refactoring
Technically, `user{i}` could be `item{i}`, and we could extract the `TempStream` as a `ContextStream` to be shared between the users, repos, and orgs streams.
Some pseudo code of what I am thinking about:
```
ContextStream(request_records, request_body) -> [request_body_responses]

repo:
    request_records = [{repo, org}]
    request_body = "{nameWithOwner, databaseId, id: nodeId }"

users:
    request_records = [{login}]
    request_body = "{login, databaseId, id: nodeId }"
```
GitHub Apps have higher rate limiting. We can take some inspiration from singer-io/tap-github#163
Tests keep hitting rate limiting issues, which suggests that our request-cache implementation is not working as expected during CI runs. It is working fine locally, which makes it hard to debug...
We should resolve this problem to help with our CI, making it faster and less frustrating.
Currently `issue_comments` are fetched from `/repos/{owner}/{repo}/issues/{issue_number}/comments` as a child stream of `issues`. This allows fetching only comments for a single issue.
Our use case is more suited to fetching the same data at the repo level, using the `/repos/{owner}/{repo}/issues/comments` API endpoint. It returns the same data for each comment, according to the docs.
The main value for us in using the second solution is that we get more data for the same API usage quota: the typical issue will have far fewer than 100 comments on it, so we get comments for multiple issues in one call. We do not run this as a child stream, instead we feed it a list of repos for which we want data (as in #5), and sync only this stream. This allows us better control over API usage limits.
Is there a way to have both streams coexist in the tap so that either can be used if needed, but the default setup would not double fetch data? (What I mean is that `issue_comments_by_repo` could be a child stream of `repositories`, and `issue_comments_by_issue` a child of `issues`, itself a child of `repositories`, so if fetching the whole catalog, we would fetch some data twice.)
I think there are other streams which have a similar situation, so this is probably a slightly more general question than just this stream (in a similar fashion, `repositories` can be obtained through the `search` endpoint or the `repos` endpoint, although the data returned is slightly different). Another example is `issue_events`, which can be fetched by issue or by repo.
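A hedged sketch of how the two streams could coexist; the base and parent class names below are taken from the tap but may differ, and schemas are omitted:

```python
from tap_github.client import GitHubRestStream
from tap_github.repository_streams import IssuesStream, RepositoryStream


class IssueCommentsByRepoStream(GitHubRestStream):
    """One request per page of repo comments: cheap on API quota."""

    name = "issue_comments_by_repo"
    path = "/repos/{org}/{repo}/issues/comments"
    parent_stream_type = RepositoryStream


class IssueCommentsByIssueStream(GitHubRestStream):
    """One request per issue: precise, but quota-hungry."""

    name = "issue_comments_by_issue"
    path = "/repos/{org}/{repo}/issues/{issue_number}/comments"
    parent_stream_type = IssuesStream
```

With both defined, a catalog that deselects one of them avoids the double fetch while keeping the other available.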
While developing, I've run a few times into GitHub's API rate limits (with my auth_token set!), which gets pretty annoying, as the tap does not work at all then.
This might become an issue for CI as well when the number of streams increases (the limit without an auth token is 60 requests/hour per IP).
So I am wondering if we should set up some sort of mocking of API calls, for instance using pytest's monkeypatching fixture, or some other mechanism.
Pros:
Cons:
Error loading teams data through Stitch (schema error). After closer evaluation, we realised this was because the parent's returned value had changed from a string to an object (see below, plus the list_teams schema: https://docs.github.com/en/rest/teams/teams#list-teams).
Syncing the GitHub tap through Stitch using Meltano (tap-github -> Meltano -> Stitch -> Warehouse) obtains the error:
CRITICAL Error persisting data to Stitch: 400: {'error': 'Record 0 for table teams did not conform to schema:\n#/parent: #: no subschema matched out of the total 2 subschemas\n#/parent: expected type: String, found: JSONObject\n#/parent: expected: null, found: JSONObject\n'}
In `organization_stream.py` -> `TeamsStream`, change
```python
schema = th.PropertiesList(
    # Parent Keys
    th.Property("org", th.StringType),
    # Rest
    th.Property("id", th.IntegerType),
    th.Property("node_id", th.StringType),
    th.Property("url", th.StringType),
    th.Property("html_url", th.StringType),
    th.Property("name", th.StringType),
    th.Property("slug", th.StringType),
    th.Property("description", th.StringType),
    th.Property("privacy", th.StringType),
    th.Property("permission", th.StringType),
    th.Property("members_url", th.StringType),
    th.Property("repositories_url", th.StringType),
    th.Property("parent", th.StringType),
).to_dict()
```
to
```python
schema = th.PropertiesList(
    # Parent Keys
    th.Property("org", th.StringType),
    # Rest
    th.Property("id", th.IntegerType),
    th.Property("node_id", th.StringType),
    th.Property("url", th.StringType),
    th.Property("html_url", th.StringType),
    th.Property("name", th.StringType),
    th.Property("slug", th.StringType),
    th.Property("description", th.StringType),
    th.Property("privacy", th.StringType),
    th.Property("permission", th.StringType),
    th.Property("members_url", th.StringType),
    th.Property("repositories_url", th.StringType),
    th.Property("parent", th.ObjectType(
        th.Property("id", th.IntegerType),
        th.Property("node_id", th.StringType),
        th.Property("url", th.StringType),
        th.Property("html_url", th.StringType),
        th.Property("name", th.StringType),
        th.Property("slug", th.StringType),
        th.Property("description", th.StringType),
        th.Property("privacy", th.StringType),
        th.Property("permission", th.StringType),
        th.Property("members_url", th.StringType),
        th.Property("repositories_url", th.StringType),
    )),
).to_dict()
```
Hope this helps!
In our logs, we regularly get the following error:
```
requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
```
which is due to:
```
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
```
itself coming from
```
.../.venv/lib/python3.9/site-packages/urllib3/response.py", line 704, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
```
The underlying error is on the server-side according to psf/requests#5583, however I can't find any info elsewhere on the web. It may be a transient error on github's side, so opening this ticket to keep a record of the problem.