
ghcrawler's Introduction


GHCrawler

GHCrawler is a robust GitHub API crawler that walks a queue of GitHub entities transitively retrieving and storing their contents. GHCrawler is primarily intended for people trying to track sets of orgs and repos. For example, the Microsoft Open Source Programs Office uses this to track 1000s of repos in which Microsoft is involved. In short, GHCrawler is great for:

  • Retrieving all GitHub entities related to an org, repo, user, team, ...
  • Efficiently storing the retrieved entities
  • Keeping the stored data up to date when used in conjunction with a GitHub webhook to track events

GHCrawler focuses on successively retrieving and walking GitHub API resources supplied on a (set of) queues. Each resource is fetched, processed, plumbed for more resources to fetch and ultimately stored. Discovered resources are themselves queued for further processing. The crawler is careful to not repeatedly fetch the same resource. It makes heavy use of etags, Redis, client-side rate limiting, and GitHub token pooling and rotation to optimize use of your API tokens and not beat up the GitHub API.

The crawler can be configured to use a variety of different queuing technologies (e.g., AMQP 1.0 and AMQP 0.9 compatible queues like Azure ServiceBus and Rabbit MQ, respectively), and storage systems (e.g., Azure Blob and MongoDB). You can create your own infrastructure plugins to use different technologies.

Documentation

This page is essentially the Quick Start Guide for using the crawler. Detailed and complete documentation is maintained in

  • This project's wiki for documentation on the crawler itself
  • The Dashboard repo, for information on the browser-based crawler management dashboard
  • The Command line repo, for details of controlling the crawler from the command line

Running in-memory

The easiest way to try out GHCrawler is to run it in memory. You can get up and running in a couple of minutes. This approach does not scale and is not persistent, but it's dead simple.

  1. Clone the Microsoft/ghcrawler repo.
  2. Run npm install in the cloned repo directory to install the prerequisites.
  3. Set the CRAWLER_GITHUB_TOKENS environment var to a semi-colon delimited list of GitHub API tokens for rate-limiting and permissions. For example, set CRAWLER_GITHUB_TOKENS=432b345acd23.
  4. Run the crawler using node bin/www.js.
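
For reference, the whole sequence looks roughly like the following, assuming a POSIX shell (on Windows, use set rather than export to define the environment variable):

  git clone https://github.com/Microsoft/ghcrawler.git
  cd ghcrawler
  npm install
  # one or more GitHub API tokens, semi-colon delimited
  export CRAWLER_GITHUB_TOKENS="432b345acd23"
  node bin/www.js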

Once the service is up and running, you should see some crawler related messages in the console output every few seconds. You can control the crawler either using the cc command line tool or a browser-based dashboard, both of which are described below. Note that since you are running in memory, if you kill the crawler process, all work will be lost. This mode is great for playing around with the crawler or testing.

Running Crawler-In-A-Box (CIABatta)

If you want to persist the data gathered and create some insights dashboards in a small to medium production system, you can run GHCrawler in Docker with Mongo, Rabbit, and Redis infrastructure using the Crawler-in-a-box (CIABatta) approach. This setup also includes Metabase for building browser-based insights and gives you a browser-based control panel for observing and controlling the crawler service.

NOTE This is an evolving solution and the steps for running will be simplified with published, ready-to-use images on Docker Hub. For now, follow these steps:

  1. Clone the Microsoft/ghcrawler and Microsoft/ghcrawler-dashboard repos.
  2. Set the CRAWLER_GITHUB_TOKENS environment var to a semi-colon delimited list of GitHub API tokens for rate-limiting and permissions. For example, export CRAWLER_GITHUB_TOKENS=432b345acd23.
  3. In a command prompt go to ghcrawler/docker and run docker-compose up.
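
For reference, the same steps as a shell session, assuming a POSIX shell and that the repos are cloned side by side:

  git clone https://github.com/Microsoft/ghcrawler.git
  git clone https://github.com/Microsoft/ghcrawler-dashboard.git
  export CRAWLER_GITHUB_TOKENS="432b345acd23"
  cd ghcrawler/docker
  docker-compose up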

Once the containers are up and running, you should see some crawler related messages in the container's console output every few seconds. You can control the crawler either using the cc command line tool or the browser-based dashboard, both of which are described below.

Check out the related GHCrawler wiki page for more information on running in Docker.

Deploying native

For ultimate flexibility, the crawler and associated bits can be run directly on VMs or as an app service. This structure typically uses cloud-based infrastructure for queuing, storage and Redis. For example, this project comes with adapters for Azure Service Bus queuing and Azure Blob storage. The APIs on these adapters are very slim, so it is easy for you to implement (and contribute) more.
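
To give a feel for what writing an adapter involves, here is a hypothetical sketch of a document store provider. The method names (connect, upsert, get, etag, close) are illustrative assumptions only; consult the existing providers in the repo and the wiki for the real interface.

  // Hypothetical document store adapter -- the method names are assumptions,
  // not the actual interface; see the existing storage providers for that.
  class MyDocStore {
    constructor(options) {
      this.options = options;
    }
    connect() {
      // open the connection to the backing store
      return Promise.resolve(this);
    }
    upsert(document) {
      // insert or update a document, keyed by its metadata urn/url
      return Promise.resolve(document);
    }
    get(type, key) {
      // fetch a previously stored document, or null if not present
      return Promise.resolve(null);
    }
    etag(type, key) {
      // return the stored etag so the crawler can skip unchanged resources
      return Promise.resolve(null);
    }
    close() {
      return Promise.resolve();
    }
  }

  module.exports = options => new MyDocStore(options);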

Setting up this operating mode is a bit more involved and is not fully documented. The wiki pages on Configuration contain much of the raw info needed.

Event tracking

The crawler can hook and track GitHub events by listening to webhooks. To set this up:

  1. Create a webhook on your GitHub orgs or repos and point it at the running crawler. When events are enabled, the webhook should point to
  https://<crawler machine>:3000/webhook
  2. Set the crawler to handle webhook events by modifying the queuing.events property in the Runtime configuration or by setting the CRAWLER_EVENT_PROVIDER Infrastructure setting to the value webhook. In both cases changing the value requires a restart. Note that you can turn off events by setting the value to none.

If you are using signature validation, you must set the Infrastructure setting CRAWLER_WEBHOOK_SECRET to the value you configured in the GitHub webhook.
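
With the environment-variable approach, the relevant settings might look like this (the values are placeholders):

  export CRAWLER_EVENT_PROVIDER=webhook
  export CRAWLER_WEBHOOK_SECRET=<the secret configured on the GitHub webhook>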

Controlling the crawler

Given a running crawler service (see above), you can control it using either a simple command line app or a browser-based dashboard.

cc command line

The crawler-cli (aka cc) can run interactively or as a single command processor and enables a number of basic operations. For now the crawler-cli is not published as an npm package. Instead, clone its repo, run npm install, and run the command line using

node bin/cc -i [-s <server url>]

The app's built-in help has general usage info and more details can be found in the project's readme. A typical command sequence, shown in the snippet below, starts cc in interactive mode talking to the crawler on http://localhost:3000 (the default if -s is not specified), configures the crawler with a public and an admin GitHub token, and then queues and starts the processing of the repo called contoso-d/test.

> node bin/cc -i
http://localhost:3000> tokens 43984b2dead7o4ene097efd97#public 972bbdfe09dead704en82309#admin
http://localhost:3000> queue contoso-d/test
http://localhost:3000> start
http://localhost:3000> exit
>

Browser dashboard

The crawler dashboard gives you live feedback on what the crawler is doing as well as better control over the crawler's queues and configuration. Some configurations (e.g., Docker) include and start the dashboard for free. If you need to deploy the dashboard explicitly, clone the Microsoft/ghcrawler-dashboard repo and follow the instructions in the README found there.

Once the dashboard service is up and running, point your browser at the dashboard endpoint (http://localhost:4000 by default). Detailed information is included in the dashboard README.

Note that the dashboard does not report queue message rates (top right graph) when used with the memory-based crawler service as that mechanism requires Redis to record activity.

Working with the code

Node version

>=6

Build

npm install

Unit test

npm test

Integration test

npm run integration

Run

node ./bin/www.js

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Known issues

It is clearly early days for the crawler so there are a number of things left to do. Most of the concrete work items are captured in repo issues. Broadly speaking there are several types of work:

  • Documentation -- The crawler code itself is relatively straightforward but not all of the architecture, control and extensibility points are called out.
  • Completeness -- There are a few functional gaps in certain scenarios that need to be addressed.
  • Docker configuration -- Several items remain to make the Docker configuration production-ready.
  • Analysis and insights -- Metabase is supplied in the Docker configuration but relatively little has been done with analyzing the harvested data.


ghcrawler's Issues

Refresh the doc

After a ton of refactoring and new features/functions, the doc needs an update.

add pulse data

We have various stats for queues flowing out but not much on crawler loops etc. The idea is that we generate a regular pulse of stats around how many calls, ... We need to reconcile this with the logging and App Insights infrastructure.

deadletter oddities

This morning the deadletter graph in the dashboard looked as pictured below. Since I was not taking any action (and I assume you were not), it is hard to see how the deadletters could ever go down.
This may be dashboard related but I suspect something is amiss with the server-side computation?
[screenshot: deadletter graph]

Just now I used the command line to get the count and it said 260, while the dashboard itself was reporting 180 and storage explorer was saying 173 (though it was acting a little off so I restarted it and it reported 262). Ask three mechanisms and get four answers... maybe some funny business on the Azure side?

mongodb: _metadata.links.self.href missing index

 docker exec -ti docker_mongo_1 mongotop 5
2018-06-07T05:21:41.610+0000	connected to: 127.0.0.1

                            ns     total    read     write    2018-06-07T05:21:46Z
              ghcrawler.commit    4009ms     7ms    4002ms                        
...
{
	"op" : "update",
	"ns" : "ghcrawler.commit",
	"command" : {
		"q" : {
			"_metadata.links.self.href" : "urn:repo:19816070:commit:492e7b081168f1922ef6409ebba77dbf30638185"
		},
		"u" : {
....
	"millis" : 153,
	"planSummary" : "COLLSCAN",
	"execStats" : {
		"stage" : "UPDATE",
		"nReturned" : 0,
		"executionTimeMillisEstimate" : 150,
		"works" : 117269,
		"advanced" : 0,
		"needTime" : 117268,
		"needYield" : 0,
		"saveState" : 916,
		"restoreState" : 916,
		"isEOF" : 1,
		"invalidates" : 0,
		"nMatched" : 0,
		"nWouldModify" : 0,
		"nInvalidateSkips" : 0,
		"wouldInsert" : true,
		"fastmodinsert" : true,
		"inputStage" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"_metadata.links.self.href" : {
					"$eq" : "urn:repo:19816070:commit:492e7b081168f1922ef6409ebba77dbf30638185"
				}
			},
			"nReturned" : 0,
			"executionTimeMillisEstimate" : 150,
			"works" : 117268,
			"advanced" : 0,
			"needTime" : 117267,
			"needYield" : 0,
			"saveState" : 916,
			"restoreState" : 916,
			"isEOF" : 1,
			"invalidates" : 0,
			"direction" : "forward",
			"docsExamined" : 117266
		}
	},
	"ts" : ISODate("2018-06-07T05:22:13.836Z"),
	"client" : "172.18.0.5",
	"allUsers" : [ ],

Solution:

> db.commit.createIndex( { "_metadata.links.self.href":  "hashed" } )
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 2,
	"numIndexesAfter" : 3,
	"ok" : 1
}

After:

                            ns    total    read    write    2018-06-07T05:40:28Z
              ghcrawler.commit      8ms     8ms      0ms      

Get some measure of current progress and completion

Is there any way to get a sense of how much work the crawler has left to do, and how much it's done? The dashboard shows which requests are queued, but getting a sense of "we've done 250 requests and there are 8000 still to go" would be very useful. I appreciate that this might not actually be a knowable figure -- it's possible that all we know is what's currently queued, and each of those queued requests might spawn another million queued requests once they've been fetched and processed. However, at the moment it's a very shot-in-the-dark affair; it's very hard to get a sense of how long one should wait before there'll be data available in Mongo to process, and whether the data in there is roughly OK or wildly incomplete.

Add tracking of team maintainers

Today we get a list of team members but cannot tell the difference between the normal members and the maintainers. That is an interesting dimension on which to report and differentiate.

mongodb: commit - unindexed _metadata.url

Crawling a largish tree like MariaDB generated a rising CPU profile on the server for the mongodb process:

[screenshot: mongodb CPU graph]

mongodb process was niced just before the fix below.

A look at the profiler shows a collection scan (COLLSCAN) looking for a _metadata.url:

use ghcrawler
db.setProfilingLevel(1)
db.system.profile.find().pretty()
{
	"op" : "query",
	"ns" : "ghcrawler.commit",
	"command" : {
		"find" : "commit",
		"filter" : {
			"_metadata.url" : "https://api.github.com/repos/MariaDB/server/commits/6e791795a2d7319e32a65a4d8a2cb6ed54cfc5c6"
		},
		"limit" : 1,
		"batchSize" : 1,
		"singleBatch" : true,
		"$db" : "ghcrawler"
	},
	"keysExamined" : 0,
	"docsExamined" : 108972,
	"cursorExhausted" : true,
	"numYield" : 857,
	"locks" : {
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(1716)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(858)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(858)
			}
		}
	},
	"nreturned" : 0,
	"responseLength" : 104,
	"protocol" : "op_query",
	"millis" : 168,
	"planSummary" : "COLLSCAN",
	"execStats" : {
		"stage" : "LIMIT",
		"nReturned" : 0,
		"executionTimeMillisEstimate" : 149,
		"works" : 108974,
		"advanced" : 0,
		"needTime" : 108973,
		"needYield" : 0,
		"saveState" : 857,
		"restoreState" : 857,
		"isEOF" : 1,
		"invalidates" : 0,
		"limitAmount" : 1,
		"inputStage" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"_metadata.url" : {
					"$eq" : "https://api.github.com/repos/MariaDB/server/commits/6e791795a2d7319e32a65a4d8a2cb6ed54cfc5c6"
				}
			},
			"nReturned" : 0,
			"executionTimeMillisEstimate" : 149,
			"works" : 108974,
			"advanced" : 0,
			"needTime" : 108973,
			"needYield" : 0,
			"saveState" : 857,
			"restoreState" : 857,
			"isEOF" : 1,
			"invalidates" : 0,
			"direction" : "forward",
			"docsExamined" : 108972
		}
	},
	"ts" : ISODate("2018-06-07T03:54:57.791Z"),
	"client" : "172.18.0.5",
	"allUsers" : [ ],
	"user" : ""
}

Adding an index on this caused a rather quick drop in CPU, as shown at the end of the graph:

 db.commit.createIndex( { "_metadata.url":  "hashed" } )

If this could be the default, that would be most useful.

PullRequestEvent does not queue issues

On GitHub, pull requests are also issues. In fact, some data, such as labels, appears only on issues.
That's why, when a PullRequestEvent comes in, an issue has to be queued in addition to the pull request.

orgList is not remapped to lowercase when changing via PATCH route

The problem lies here:

ghcrawler/lib/crawler.js

Lines 30 to 37 in 0ac1107

_reconfigure(current, changes) {
  // ensure the orgList is always lowercase
  const orgList = changes.find(patch => patch.path === '/orgList');
  if (orgList) {
    debug('orgList changed');
    this.options.orgList = orgList.value.map(element => element.toLowerCase());
  }
}

The path is not only /orgList

const orgList = changes.find(patch => patch.path === '/orgList');

When reassigning this.options.orgList, this.options.orgList.map(element => element.toLowerCase()) should be used and not orgList.value.map(element => element.toLowerCase()).

this.options.orgList = orgList.value.map(element => element.toLowerCase());

blob.count() failure is not logged

If you run the crawler with Azure Blob configured as the deadletter store but using a SAS token that does not have List permissions, and hit the deadletter store count endpoint, the blob list appears not to fail (no error), but the result is null and is then accessed unprotected, resulting in the stack trace below. Note that this was in the jm/apiCrawler branch but the code is the same as master.

TypeError: Cannot read property 'entries' of null
at /opt/service/node_modules/ghcrawler/providers/storage/storageDocStore.js:174:32
at finalCallback (/opt/service/node_modules/azure-storage/lib/services/blob/blobservice.js:5814:7)
at /opt/service/node_modules/azure-storage/lib/common/filters/retrypolicyfilter.js:178:13

Add support for traversing Releases

The repo traversal currently does not gather Releases and while the ReleaseEvent is harvested, it also does not fetch the actual Release document.

the data is not open?

I am a student and I want to use GitHub data to analyze which factors are most important to the success of a project.

Add support for crawling a GitHub User’s various contributions, independent of a specific org or repo

Our goal is to track contributions by our employees to any open-source project on GitHub. So we'll need to look at each employee’s commits, pull requests, issues, etc. We can do this through the User’s Events.

I have some questions about how to do this:

  1. Is there anything in the current constraints of ghcrawler that will make this an exceptionally difficult task?

  2. How do I say "traverse the Events for a given User"? Where is an example of code doing something similar?

    • Based on this discussion #94 I thought it would be in the GitHub processor. Inside of that file, my understanding is that this code in user(), this._addCollection(request, 'repos', 'repo'), should tell it to look at a user's repos and add those repos to the mongodb repo collection. But currently, as far as I can tell, it processes the user but doesn't even hit the repo function. Because I care most about events right now, I also tried this._addCollection(request, 'events', 'null') and this._addCollection(request, 'events', 'events'), but neither seemed to do anything.
  3. Will this require an advanced traversal policy? I think that I can use the default traversal policy for now and refine it with an advanced one later to grab fewer things from user, if desired, like using graphQL to do a query. Is that right?

Limit fetched data to only some collections

It would be useful to only fetch some data from Github; for example, if I only need issues, pull_requests, and repos, to not have to fetch commits or issue_comments so as to reduce the amount I need to hit Github. Is this possible? The docs on policies seem to suggest that it might be doable, but I don't think I understand well enough how to do it; perhaps a doc clarification? (Or alternatively saying "you can't" would be OK here too.)

Allow defaulting the preferred queue based on information in the request

Issue #72 is a prerequisite for this.

In the configuration for a crawler you should be able to specify patterns that will set the preferred queue for a request based on information in the request. The two scenarios we have so far are:

  1. Changing the priority based on the type of request
  2. Changing the priority based on the URL of the request

For example, you could imagine a configuration that looks like this:

priorityMappings: [
 {"url": "contoso", "preferredQueue": "later"},
 {"type": "Repo", "preferredQueue": "soon"},
 {"type": "Repo", "url": "(hello|world)", "preferredQueue": "immediate"}
]

When a request is being queued, if it doesn't already specify a preferred queue in the request, then the priority mappings would be evaluated in order, and the first one where all of the criteria match would define the preferred queue for that request. URL matching must support regular expressions.
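
A sketch of the evaluation logic described above, as a proposal rather than existing code (the request shape and property names are assumptions):

  // Returns the preferred queue for a request, or null if nothing applies.
  // Mappings are evaluated in order; the first one whose criteria all match
  // wins. URL criteria are treated as regular expressions.
  function getPreferredQueue(priorityMappings, request) {
    if (request.preferredQueue) {
      return request.preferredQueue;
    }
    for (const mapping of priorityMappings) {
      const typeMatches = !mapping.type || mapping.type === request.type;
      const urlMatches = !mapping.url || new RegExp(mapping.url).test(request.url);
      if (typeMatches && urlMatches) {
        return mapping.preferredQueue;
      }
    }
    return null;
  }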

CAIB: turn on persistence

Mongo, Rabbit and Redis need persistence turned on in the container world (or at least an option to do so).

  • Consider not turning it on for Redis. In the Mongo case we are not storing the url and urn mapping in Redis. The queue contents are tracked in Redis but if we lose that, it is not fatal.

review crawler error handling

From time to time there is evidence that some errors can leak out of the crawler loop. In particular, it seems possible for something marked as "being processed" to not get unmarked when it is done. This causes subsequent loops to get a request and think that it is already being processed (i.e., a collision).

Do a deep review and update tests to validate that all reject and throw cases are being handled in the loop.

Process New Commits

When new commits are pushed, a PushEvent is generated with a list of commits.
Currently the commits are not queued which results in missing commits:

  PushEvent(request) {
    let [document] = this._addEventBasics(request);
    // TODO figure out what to do with the commits
    return document;
  }

See update_events function for implementation details.

Mark deleted entities

When we detect that something is deleted (e.g., some delete event), we need to mark the entity as deleted. Likely this means reprocessing the entity and adding a "deletedAt" property to the metadata.

  • be sure to factor that into subsequent data pipelines
  • decide if we need to filter deleted things at some point

Dashboard fails if queue does not exist

  • start up the crawler without events
  • start up the dashboard
  • notice that on the crawler machine the stack below shows up every 5 seconds

The fundamental issue here is that QueueSet.getQueue() throws if the queue is not known. It could return null, but then we'd have to make sure all the callers check. Alternatively, the queues route could try/catch and return a 404; a sketch of that alternative appears after the stack trace below.

trackException:
Error: Queue not found: events
    at QueueSet.getQueue (C:\git\ghcrawler\lib\queueSet.js:86:13)
    at CrawlerService.getQueueInfo (C:\git\ghcrawler\lib\crawlerService.js:84:39)
    at c:\git\ospo-ghcrawler\routes\queues.js:20:37
    at next (native)
    at Function.continuer (c:\git\ospo-ghcrawler\node_modules\q\q.js:1278:45)
    at Q.spawn (c:\git\ospo-ghcrawler\node_modules\q\q.js:1305:16)
    at module.exports (c:\git\ospo-ghcrawler\middleware\promiseWrap.js:9:5)
    at Layer.handle [as handle_request] (c:\git\ospo-ghcrawler\node_modules\express\lib\router\layer.js:95:5)
    at next (c:\git\ospo-ghcrawler\node_modules\express\lib\router\route.js:131:13)
    at validate (c:\git\ospo-ghcrawler\middleware\auth.js:16:12)
    at Layer.handle [as handle_request] (c:\git\ospo-ghcrawler\node_modules\express\lib\router\layer.js:95:5)
    at next (c:\git\ospo-ghcrawler\node_modules\express\lib\router\route.js:131:13)
    at Route.dispatch (c:\git\ospo-ghcrawler\node_modules\express\lib\router\route.js:112:3)
    at Layer.handle [as handle_request] (c:\git\ospo-ghcrawler\node_modules\express\lib\router\layer.js:95:5)
    at c:\git\ospo-ghcrawler\node_modules\express\lib\router\index.js:277:22
    at param (c:\git\ospo-ghcrawler\node_modules\express\lib\router\index.js:349:14)
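
A sketch of the try/catch alternative mentioned above, ignoring the promise wrapping used in the real routes (the handler shape here is hypothetical, not the actual routes/queues.js code):

  const express = require('express');
  const router = express.Router();

  // crawlerService is assumed to be injected, as in the real routes module.
  // Translate the 'Queue not found' throw from QueueSet.getQueue() into a
  // 404 instead of letting it escape to the logger every poll. If
  // getQueueInfo returns a promise, the same idea applies with .catch().
  module.exports = crawlerService => {
    router.get('/:name', (request, response) => {
      try {
        const info = crawlerService.getQueueInfo(request.params.name);
        response.json(info);
      } catch (error) {
        response.status(404).send(`Queue not found: ${request.params.name}`);
      }
    });
    return router;
  };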

SyntaxError: Block-scoped declarations (let, const, function, class) not yet supported outside strict mode

When running the default command node bin/www.js, I got this error:

zhangysh1995@ubuntu-zhangyushao:~/Tools/ghcrawler$ node bin/www.js /home/zhangysh1995/Tools/ghcrawler/bin/www.js:12
let port = normalizePort(config.get('CRAWLER_SERVICE_PORT') || process.env.PORT || '3000');
^^^

SyntaxError: Block-scoped declarations (let, const, function, class) not yet supported outside strict mode
    at exports.runInThisContext (vm.js:53:16)
    at Module._compile (module.js:374:25)
    at Object.Module._extensions..js (module.js:417:10)
    at Module.load (module.js:344:32)
    at Function.Module._load (module.js:301:12)
    at Function.Module.runMain (module.js:442:10)
    at startup (node.js:136:18)
    at node.js:966:3

Versions:

  • OS: Ubuntu 16.04
  • npm: 3.5.2
  • node: 4.7.0

I'm quite new to node and have difficulty reading the original source code. Could you please provide simple usage examples or sample configuration files?

Update Mongo store apis

There were a number of changes to the way stores are used (e.g., for deadletter we added a number of APIs) and the Mongo store needs to be updated. Currently Crawler-in-a-Box is broken as it uses Mongo.

docker/{mongo,common-services}.yml missing volumes for persistent services

As far as I can tell, the mongodb, redis and rabbitmq docker compose images would benefit from persistent volumes to avoid a large (re)initialisation time if they are restarted.

Seems easy to add:
https://docs.docker.com/compose/compose-file/#volumes

Volumes required:

mongodb: /data/db /data/configdb
ref: https://github.com/docker-library/mongo/blob/master/3.6/Dockerfile#L85

redis: /data
ref: https://github.com/docker-library/redis/blob/master/4.0/Dockerfile#L70

rabbitmq: /var/lib/rabbitmq
ref: https://github.com/docker-library/rabbitmq/blob/master/3.7/debian/Dockerfile#L132

metabase: /var/opt/metabase/
from: metabase/Dockerfile
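
A sketch of what the compose additions might look like, using named volumes and the paths from the Dockerfiles referenced above (the service names and layout are assumptions, not the final configuration):

  services:
    mongo:
      volumes:
        - mongo-data:/data/db
        - mongo-config:/data/configdb
    redis:
      volumes:
        - redis-data:/data
    rabbitmq:
      volumes:
        - rabbitmq-data:/var/lib/rabbitmq
    metabase:
      volumes:
        - metabase-data:/var/opt/metabase

  volumes:
    mongo-data:
    mongo-config:
    redis-data:
    rabbitmq-data:
    metabase-data: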

add the ability to flush config on startup

Currently on startup we look to see if there is a known config from the config provider and, if there is, we use it. That's great for production scenarios but in testing it's a pain, as we typically have config in redis but you want to tweak it for local execution.

Add an env var or command line param for the crawler startup that forces the recreation of the config. This would literally be a set of 5 calls to the config provider telling it to delete the config for the 5 subsystems. From there the normal lazy init code will repopulate the config.

Update telemetry to use Custom Events

The current logging all flows into App Insights as track trace calls. It would be better to have (at least some of) them show up as Track Event calls. This would allow for easier filtering and alerting.
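
A sketch of what that could look like with the applicationinsights Node SDK (this assumes the 1.x object-style API; the event name and properties are made up for illustration):

  const appInsights = require('applicationinsights');
  appInsights.setup(process.env.APPINSIGHTS_INSTRUMENTATIONKEY).start();
  const client = appInsights.defaultClient;

  // Instead of client.trackTrace({ message: 'processed repo' }), emit a
  // structured event that can be filtered and alerted on.
  client.trackEvent({
    name: 'requestProcessed',
    properties: { type: 'repo', outcome: 'succeeded' }
  });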

what are the permissions needed for accessing traffic data?

I want to get all the data available on a repo, which includes traffic data and issues and their comments.
I created a personal token under my account and provided all access except write access.

However I do not get any traffic info when I submit a request for one of the repos under my account.

I do not have any major traffic data but was at least expecting a couple of entries.

Also, how do we store the data in a database or files?

Add support for Statuses

The current code traverses statuses but does not really do anything with them. They can be interesting indicators of project and contributor activity.

Authentication deprecation from GitHub

I've been using GHCrawler for about a year with great success - thanks for the awesome tool :)

Today, I got a deprecation notice from Github when the crawler fired up to do some work:

Hi @GregSutcliffe,

On February 4th, 2020 at 14:02 (UTC) your personal access token (<redacted>) using ghrequestor was used as part of a query parameter to access an endpoint through the GitHub API:

https://api.github.com/organizations/44586252

Please use the Authorization HTTP header instead, as using the `access_token` query parameter is deprecated.

Depending on your API usage, we'll be sending you this email reminder once every 3 days for each token and User-Agent used in API calls made on your behalf.
Just one URL that was accessed with a token and User-Agent combination will be listed in the email reminder, not all.

Visit https://developer.github.com/changes/2019-11-05-deprecated-passwords-and-authorizations-api/#authenticating-using-query-parameters for more information.

Thanks,
The GitHub Team

I'm guessing that GHCrawler needs to be updated to handle the new authentication system. I'm happy to help test fixes, or even take a bash at it myself if you can point me in the right direction code-wise. Thanks!
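
For reference, the change GitHub is asking for is to send the token in an Authorization header instead of an access_token query parameter. A minimal sketch with Node's built-in https module (the token source is a placeholder and the URL is the one from the notice above):

  const https = require('https');

  // Deprecated: https://api.github.com/organizations/44586252?access_token=TOKEN
  // Preferred: send the token in the Authorization header instead.
  const options = {
    hostname: 'api.github.com',
    path: '/organizations/44586252',
    headers: {
      'Authorization': 'token ' + process.env.GITHUB_TOKEN,
      'User-Agent': 'ghcrawler'
    }
  };
  https.get(options, response => {
    let body = '';
    response.on('data', chunk => (body += chunk));
    response.on('end', () => console.log(response.statusCode, body.length));
  });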
