
crissyfield / repo-lookout

🔓 A large-scale security scanner, to find source code repositories that have been inadvertently exposed to the public and report them to the domain’s technical contact.

Home Page: https://www.repo-lookout.org

git security web-scanner vulnerability-scanner

repo-lookout's Introduction

Repo Lookout: Find publicly exposed source code repositories

Repo Lookout is a large-scale security scanner, with a single purpose: Find source code repositories that have been inadvertently exposed to the public and report them to the domain’s technical contact.

Accidentally exposed source code repositories often contain highly sensitive information that can be used for downstream attacks, such as data leakage and ransomware extortion. While the problem has been known and extensively documented for years, our findings show that it is still prevalent.

Our goal is to combat this vulnerability by automatically detecting and reporting instances.

More information at: https://www.repo-lookout.org

What is this repository for?

This repository is used as a public issue tracker and to store additional information, such as mitigations for various server software.

At this point, the repository does not contain the source code for the actual crawler software.

repo-lookout's People

Contributors

tja


repo-lookout's Issues

Cool Project!

Hello there,

I must say, your project is really impressive! I was wondering if there is an opportunity for me to contribute to the source code. I would be thrilled to get involved.

Enable one to configure the target email on a repo and/or declare it intentionally public

I applaud the intention behind repo-lookout, but I'm afraid it's currently spamming me about a repo that's (AFAIK) intentionally public, hosted on one of my servers.

I'd be OK with being able to put something in the repo to set the email that you alert, so that it could be directed to the person who owns the data, so they can then decide if they want to restrict access (I'm pretty sure they don't want to, but I think it's reasonable to let them decide that).

I'd also be OK with adding some method for declaring that I'd like you to ignore that repo, while still scanning other repos that might appear in future.

As it stands, I guess I'll redirect these specific mails to the repo-owner, since that's trivial for me to do, but I suspect that others might not find this so easy.

Support "unsubscribe" functionality for emails

While we initially supported unsubscribing from email notifications, we removed the feature because it made our emails more likely to end up in spam. However, unsubscribing has been requested several times in the last few weeks, so we should bring it back!

Mailgun (our email provider) supports email suppression (via unsubscribe or spam reports). Leaving this entirely to Mailgun would be sufficient to stop sending emails to suppressed addresses; however, it would also reduce our delivery rate unnecessarily. Instead, we should filter our "list of best email addresses" before handing the email to Mailgun: if the first N addresses on the list are suppressed, we send the report to the (N+1)-th address.

In order for this to work, we need to keep a copy of the list of suppressed email addresses. This can be achieved by hooking into Mailgun's event stream and suppression API and keeping track in our own database.
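
A minimal sketch (in Go) of the selection logic described above; the candidate ordering and the suppressed set are assumptions, with the set kept in sync via Mailgun's event stream and suppression API:

// pickRecipient returns the first candidate address that is not on the
// locally mirrored suppression list, or "" if every candidate is suppressed.
func pickRecipient(candidates []string, suppressed map[string]bool) string {
    for _, addr := range candidates {
        if !suppressed[addr] {
            return addr
        }
    }
    return ""
}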

Improve GitHub Public Hashes

As implemented in #4, we fetch the event stream from the GH Archive project and parse events of type PushEvent to get SHA commit hashes. There are two issues with this approach:

  1. As described in GitHub's documentation, a PushEvent contains a maximum of 20 commits; any additional commits have to be fetched via the Commits API.

  2. Repositories created privately and then made public are simply ignored, because there is no PushEvent involved, just a PublicEvent with no payload.

We should improve our current GH Archive parser to detect PushEvents with more than 20 commits, as well as to understand the PublicEvent type. In both cases, the Commits API should be used to retrieve all public commits.
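
A rough Go sketch of the improved decision logic; the event fields follow the GH Archive JSON, and fetchAllCommits is a hypothetical stand-in for a Commits API client:

type event struct {
    Type string `json:"type"`
    Repo struct {
        Name string `json:"name"`
    } `json:"repo"`
    Payload struct {
        Size    int `json:"size"` // total number of commits in the push
        Commits []struct {
            SHA string `json:"sha"`
        } `json:"commits"` // capped at 20 by GitHub
    } `json:"payload"`
}

// fetchAllCommits is a hypothetical stand-in for a client that pages
// through GitHub's Commits API for the given repository.
func fetchAllCommits(repo string) ([]string, error) { return nil, nil }

// commitHashes returns the public commit hashes for one archive event.
func commitHashes(ev event) ([]string, error) {
    switch {
    case ev.Type == "PushEvent" && ev.Payload.Size <= len(ev.Payload.Commits):
        // The inline commit list is complete.
        shas := make([]string, 0, len(ev.Payload.Commits))
        for _, c := range ev.Payload.Commits {
            shas = append(shas, c.SHA)
        }
        return shas, nil
    case ev.Type == "PushEvent", ev.Type == "PublicEvent":
        // A truncated push (more than 20 commits) or a repository that
        // was just made public: fall back to the Commits API.
        return fetchAllCommits(ev.Repo.Name)
    default:
        return nil, nil
    }
}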

Emails are sent too late

In the current workflow, we start with a quick heuristic to see if a Git repository might be exposed, followed by a more reliable but slower Git repository scan to read the last 5 commit history entries, and finally the actual sending of an email report.

Because we send emails slower than we discover Git repositories, the backlog between the second and third steps has been growing. As of today (May 25th), we are still sending out emails with information collected over two months ago (March 23rd). Of course, some of this information may be outdated and the Git repository may no longer be exposed.

To avoid sending out emails and bothering people unnecessarily, we need to change the workflow logic: let the backlog grow between the first and second steps (which is fine), and keep the backlog between the second and third steps small.
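
One way to express that restriction, sketched in Go; the Finding type and the 24-hour cutoff are illustrative assumptions, not decided values:

import "time"

// Finding is a hypothetical record of a confirmed exposed repository.
type Finding struct {
    Host      string
    ScannedAt time.Time
}

// staleAfter is an illustrative cutoff, not a decided value.
const staleAfter = 24 * time.Hour

// triage splits confirmed findings into those fresh enough to report by
// email and those that should be re-scanned before bothering anyone.
func triage(findings []Finding, now time.Time) (report, rescan []Finding) {
    for _, f := range findings {
        if now.Sub(f.ScannedAt) <= staleAfter {
            report = append(report, f)
        } else {
            rescan = append(rescan, f)
        }
    }
    return report, rescan
}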

Support "security.txt"

Moving this issue from our internal issue tracker to GitHub, as requests for this have come up:

Currently, Repo Lookout tries to find a contact email address by 1) checking the WHOIS record, 2) scraping the website for relevant emails (i.e. emails using a domain related to the hostname in question), 3) guessing email addresses if there's an MX record, and 4) extracting email addresses from the Git repository itself.

This system should be extended to support the security.txt standard as well.
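
A minimal Go sketch of the lookup; RFC 9116 places the file under /.well-known/, and only the Contact field is parsed here:

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

// contactFromSecurityTxt fetches https://<host>/.well-known/security.txt
// and returns the first "Contact: mailto:" address it finds.
func contactFromSecurityTxt(host string) (string, error) {
    resp, err := http.Get("https://" + host + "/.well-known/security.txt")
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("no security.txt: %s", resp.Status)
    }
    sc := bufio.NewScanner(resp.Body)
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        if !strings.HasPrefix(strings.ToLower(line), "contact:") {
            continue
        }
        value := strings.TrimSpace(line[len("contact:"):])
        if strings.HasPrefix(value, "mailto:") {
            return strings.TrimPrefix(value, "mailto:"), nil
        }
    }
    if err := sc.Err(); err != nil {
        return "", err
    }
    return "", fmt.Errorf("no mailto contact in security.txt")
}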

Wiki Apache Suggestion: RedirectMatch 404 /\.git

Suggesting to also use the following line in the Apache section of the wiki:
RedirectMatch 404 /\.git
This way it doesn't even show that Git is being used. I would suggest keeping the directory match as well, to prevent .git from being listed if an exposed directory index is enabled.
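
For reference, the combined Apache configuration might then look like this (a sketch, not necessarily the wiki's exact wording):

# Respond with 404 rather than 403, so the response does not even
# confirm that a version control directory exists.
RedirectMatch 404 /\.git

# Keep denying access as well, in case .git would otherwise show up in
# an exposed directory index.
<DirectoryMatch "/\.git">
    Require all denied
</DirectoryMatch>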

GitHub Archive dump updates error out with JSON parsing error

The latest GitHub archive dump updates fail with the following error:

invalid character '\x00' looking for beginning of value

Investigation shows that the dump contains a block of null bytes (0x00). One theory is that the dump was "cleaned up" in a post-processing step, by "wiping out" an existing record.
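
One possible workaround, sketched in Go: wrap the dump reader so NUL bytes are dropped before they reach encoding/json (the error above is what encoding/json reports when it hits 0x00):

import (
    "encoding/json"
    "io"
)

// nulFilter wraps a reader and silently drops 0x00 bytes, so the JSON
// decoder never sees the wiped-out blocks in the dump.
type nulFilter struct{ r io.Reader }

func (f nulFilter) Read(p []byte) (int, error) {
    n, err := f.r.Read(p)
    w := 0
    for _, b := range p[:n] {
        if b != 0x00 {
            p[w] = b
            w++
        }
    }
    // A production version should loop until it has at least one byte,
    // since returning (0, nil) is discouraged for io.Reader.
    return w, err
}

func newDumpDecoder(r io.Reader) *json.Decoder {
    return json.NewDecoder(nulFilter{r: r})
}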

Improve detection of legit public Git repositories

To avoid false positives in repository detection (and thus false alert emails), we currently check the latest Git commit hash of the exposed repository against an internal dataset of "known public Git commit hashes". However, this dataset only contains 4 million commit hashes from the most popular or active GitHub repositories, which turns out to be far too small.

To reduce false positives here, we need to expand the dataset significantly!

One way to do this is to use the GH Archive project, which captures the public GitHub timeline, archives it, and makes it easily accessible for further analysis. In our case, we should read the dumps, look for PushEvent entries, extract the Git commit hashes, and add them to our "public Git commits" dataset. New commits should be imported at least once a day, and the dataset should contain commit hashes going back to 2015.
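
A Go sketch of the dump parsing, assuming GH Archive's format of one JSON event per line in a gzipped file:

import (
    "bufio"
    "compress/gzip"
    "encoding/json"
    "io"
)

// pushEventSHAs extracts all commit hashes from a single gzipped GH
// Archive dump (e.g. 2015-01-01-15.json.gz, one JSON event per line).
func pushEventSHAs(dump io.Reader) ([]string, error) {
    gz, err := gzip.NewReader(dump)
    if err != nil {
        return nil, err
    }
    defer gz.Close()

    var shas []string
    sc := bufio.NewScanner(gz)
    sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // single events can be large
    for sc.Scan() {
        var ev struct {
            Type    string `json:"type"`
            Payload struct {
                Commits []struct {
                    SHA string `json:"sha"`
                } `json:"commits"`
            } `json:"payload"`
        }
        if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
            continue // skip malformed lines instead of aborting the dump
        }
        if ev.Type != "PushEvent" {
            continue
        }
        for _, c := range ev.Payload.Commits {
            shas = append(shas, c.SHA)
        }
    }
    return shas, sc.Err()
}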

Add CommonCrawl 2019 Archives

Automatic downloading of URL sources


Adding URLs from the Tranco list, CrUX dumps and CommonCrawl archives is currently a manual process. In an effort to "automate all the things", all current sources (and more?) should be checked periodically and added to the database whenever new data is available.

Note that e.g. CommonCrawl archives are too large to be added as a whole within a day and will most likely need to be broken up.

Apache config already contains git instructions

Hi,

First of all, thank you for this project without which I would never have found out about exposed .git directories, as I've been living under the assumption that web servers protect hidden paths by default. 😅

Now, the wiki suggests adding lines to httpd.conf, but what I noticed before reading it is that /etc/apache2/conf-enabled/security.conf already contains the following:

#
# Forbid access to version control directories
#
# If you use version control systems in your document root, you should
# probably deny access to their directories.
#
# Examples:
#
#RedirectMatch 404 /\.git
#RedirectMatch 404 /\.svn

So all I had to do was uncomment the line.

Additionally, when reading the email, I saw no indication that you were already providing a solution, which is why I looked for one myself, ensured that it works, then browsed the website to read about the story of this project, and finally stumbled on this.

Therefore, I would like to suggest adding a link to the wiki in the email.

On a side note:

At this point, the repository does not contain the source code for the actual crawler software.

Will it ever ?

Thanks
