francescostl / site-sonar

A browser extension which silently crowd-sources ad performance as you browse. Let's put an end to bad ads.

Home Page: http://site-sonar.com

License: Mozilla Public License 2.0

JavaScript 97.62% CSS 0.32% HTML 2.07%

site-sonar's Introduction

Site Sonar Header Image

Site Sonar

A project aimed at identifying the ad networks with the fastest and slowest performing ads on the internet, through crowd-sourced, easy-to-understand, and openly accessible benchmarking data. Inspired by Lightbeam, the Site Sonar browser extension (hosted in this repository) locates and benchmarks ad content silently while you browse. The resulting data is then sent to Site-Sonar's servers, where it is aggregated and displayed on our public dashboard (repo).


Installing Site-Sonar

For Firefox

Clone the repository by running:

git clone https://github.com/FrancescoSTL/Site-Sonar.git

Download and install Node.js

Once you've cloned the repo and installed Node.js, you can start Site-Sonar by running:

  1. npm install
  2. npm run bundle
With web-ext

If you're using web-ext, you'll need to do so with a pre-release version of Firefox for now, as web-ext is only supported in Firefox 49 or higher.

  1. Install web-ext if you haven't already
  2. web-ext run --firefox-binary=/Path/to/your/FirefoxDeveloperEdition/or/FirefoxBeta/or/FirefoxNightly.app

OR

Without web-ext

  1. Go to about:debugging
  2. Click "Load Temporary Add-on"
  3. Select any file in your locally downloaded version of Site-Sonar

For Chrome

  1. Clone the repository by running:
     git clone -b Chrome-and-Opera-Version https://github.com/FrancescoSTL/Site-Sonar.git
  2. Download and install Node.js
  3. Go to chrome://extensions
  4. Click "Load Unpacked Extension"
  5. Navigate to the folder where you downloaded Site-Sonar
  6. Click "Select"

For Opera

  1. Clone the repository by running:
     git clone -b Chrome-and-Opera-Version https://github.com/FrancescoSTL/Site-Sonar.git
  2. Download and install Node.js
  3. Go to opera://extensions
  4. Click "Load Unpacked Extension"
  5. Navigate to the folder where you downloaded Site-Sonar
  6. Click "Select"

Data Site-Sonar Collects

Using Disconnect's Blacklist of ad domains, Site-Sonar will benchmark and collect the following information about each ad asset in your browser:

  1. assetCompleteTime Integer Amount of time (in milliseconds) that the network took to respond to the HTTP request for the asset. This is calculated as the time difference between the onSendHeaders and onHeadersReceived events (see the sketch after this list).

  2. originUrl String The URL from which the HTTP request originated. In many cases this will be the hostUrl; however, sometimes ads will trigger their own HTTP requests. For example, check out the real-world data we pulled in Site-Sonar Issue #17.

  3. hostUrl String The top-level host URL from which the HTTP request originated. For example, if you have 3 tabs open and one request originates from the first tab (let's say, youtube.com), the top-level host would always be that tab's URL (youtube.com).

  4. adNetworkUrl String The host URL of the ad asset.

  5. assetType String Can be any value of webRequest.ResourceType.

  6. fileSize Integer File size in octets (8-bit bytes).

  7. timeStamp Integer Time when the asset was requested, in milliseconds since the epoch.

  8. method String Either "GET" or "POST".

  9. statusCode Integer Standard HTTP status code returned by the server. Ex: 200, 404, 301, etc.

  10. adNetwork String The ad network to which the asset belongs.
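
For illustration, here is a minimal sketch of how the assetCompleteTime measurement above can be wired up with the WebExtensions webRequest API. The listener shapes are the standard API; pendingRequests and recordBenchmark are hypothetical names, not Site-Sonar's actual code.

    // Minimal sketch: time the gap between onSendHeaders and onHeadersReceived.
    // Assumes the "webRequest" permission; recordBenchmark is hypothetical.
    const pendingRequests = new Map();

    chrome.webRequest.onSendHeaders.addListener((details) => {
      // Remember when the request headers went out.
      pendingRequests.set(details.requestId, details.timeStamp);
    }, { urls: ["<all_urls>"] });

    chrome.webRequest.onHeadersReceived.addListener((details) => {
      const sentAt = pendingRequests.get(details.requestId);
      if (sentAt === undefined) return;
      pendingRequests.delete(details.requestId);

      recordBenchmark({
        assetCompleteTime: Math.round(details.timeStamp - sentAt), // ms
        adNetworkUrl: new URL(details.url).hostname,
        assetType: details.type,      // a webRequest.ResourceType value
        timeStamp: details.timeStamp, // ms since the epoch
        method: details.method,       // "GET" or "POST"
        statusCode: details.statusCode,
      });
    }, { urls: ["<all_urls>"] });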

Privacy Policy

Site-Sonar Privacy Summary

Site-Sonar is a browser extension, currently supported in Firefox, Chrome, and Opera, which silently collects data about how ads are performing in your browser. That data is then sent to Site-Sonar's servers for aggregation (unless you opt out), keeping ad networks accountable through publicly accessible performance information.

What you should know

  1. Upon installing Site-Sonar, data will be collected locally and stored in your browser. Unless you opt out, every 2 minutes, that data will be sent to Site-Sonar servers for aggregation and display on our public dashboard.
  2. By default, data collected by Site-Sonar is sent to us.
  3. You can choose to opt out of sending any data to us.
  4. If you do contribute Site-Sonar data to us, your browser will send us your data in a manner which we believe minimizes your risk of being re-identified (you can see a list of the kind of data involved here). We will post your data along with data from others in an aggregated and open database. Opening this data can help users and researchers make more informed decisions based on the collective information.
  5. Uninstalling Site-Sonar prevents collection of any further Site-Sonar data and will delete the data stored locally in your browser.

FAQ

Will Site-Sonar track my browsing history?

Sort of. Once installed, Site-Sonar collects the host URL of any website you browse that hosts ad content. Read more in our Privacy Policy or Summary of Data Collection.

How can I contribute?

Check out our installation instructions and then head to our Github Issues page for either the Site-Sonar web extension (this repo), or the Site-Sonar Dashboard.

Who are you?

A group of humans interested in making the internet a better place through a pragmatic approach to problems on the web.


How can we contact you?

Visit our Contact Page.

site-sonar's People

Contributors

francescostl, purukaushik


site-sonar's Issues

Explore URL Mapping/ Grouping

After #8 was merged, a new situation was introduced where ads request other ads. See lax1-ib.adnxs.com below, where the columns are Website Origin, Requested Ad Host, and Response Time (in that order, from left to right):
(Screenshot of request log, 2016-08-02.)

Above, you can see that cnn.com is requesting an ad from lax1-ib.adnxs.com. That ad presumably proceeds to call js.moatads.com (note that the above picture is in reverse chronological order, so the last item was requested first and vice versa).

In order to mitigate this, we'd need to group all requests from a page to the original host. We can do this by grabbing the URL of the requested ad's tab (using details.tabId), as in the sketch below. This also allows us to map data in the same way Lightbeam does, which could be interesting.
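
A hedged sketch of that lookup, assuming the "tabs" permission; getTopLevelHost is an illustrative name, not the add-on's actual helper.

    // Resolve the top-level host of the tab a request came from.
    function getTopLevelHost(details, callback) {
      if (details.tabId < 0) {
        callback(null); // request isn't associated with any tab
        return;
      }
      chrome.tabs.get(details.tabId, (tab) => {
        callback(tab && tab.url ? new URL(tab.url).hostname : null);
      });
    }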

Upgrade Local Dashboard

Looking to upgrade the local dashboard with metrics we can get for free (without expensive processing or space) like: total memory used on ads, total network time taken to load ads, total number of ads recorded, total number of batches sent, etc.

Develop as Firefox Add-on to Track All HTTP Requests

The current method for collecting potential HTTP requests and testing their speed seems flawed: we are not loading the on-page JavaScript when we make the requests, so it is likely that we are missing requests because of it. The original reasoning behind the current method was that browser automation is expensive, especially when crawling hundreds (if not thousands) of pages.

Batch Ad Benchmarks per Page

It would be interesting to batch data per-page so that we can determine how bad each page-load is on average. This would give us the best shot at ranking sites by perceived performance due to ads on-page.

Record Files of Size 0 as Null

Right now, we're getting 204 No Content responses which resolve with a content-length (file size) of 0, which throws off our results and muddles the db. We want to be able to ignore all of these files in our Mongo queries on the dashboard side. In order to do this, we'd like to set them explicitly to null before storage (see the sketch below) so we don't have to worry about adding extra server-side logic.
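
A minimal sketch of that normalization, assuming the size is read from the response headers; parseContentLength is an illustrative name.

    // Normalize zero-byte responses to null before storage, so the dashboard's
    // Mongo queries can simply filter on { fileSize: { $ne: null } }.
    function parseContentLength(responseHeaders) {
      const header = (responseHeaders || []).find(
        (h) => h.name.toLowerCase() === "content-length"
      );
      const size = header ? parseInt(header.value, 10) : NaN;
      // Treat 0 (e.g. from "204 No Content") and unparsable values as null.
      return size > 0 ? size : null;
    }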

Add Site Profiler Tool

I think it would be nice to add a website profiler where users can record performance for a particular amount of time.

Compounding Requests = Memory Errors & Unmanageable Load Times

Currently we are making the HTTP requests asynchronously, which, across the tens of thousands of requests triggered in a short amount of time, causes the process to run out of memory, fail, or stall forever.

This will be invalidated if we use Selenium as noted in #1
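
In the meantime, one standard mitigation is to bound concurrency with a small worker pool rather than firing every request at once. A sketch; runLimited and the limit of 20 are illustrative, not existing project code.

    // Run an array of promise-returning tasks with at most `limit` in flight.
    async function runLimited(tasks, limit = 20) {
      const results = [];
      let next = 0;
      async function worker() {
        while (next < tasks.length) {
          const i = next++; // safe: assignment happens synchronously
          results[i] = await tasks[i]();
        }
      }
      await Promise.all(Array.from({ length: limit }, worker));
      return results;
    }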

Get Tab Url

Currently we are treating originUrl as our tab's url (the website that the request came from at the lowest level) but an ad can actually be the origin of another ad. For this reason, we should get the top level host url of the tab which the request came from.

Redirect Dashboard Overview

We should give the overview tab a message when there is no ad data, so people know that they might need to disable their ad blocker, or that there simply isn't ad data yet. We should also redirect to the profiler tool when profiling is in progress.

Export Data Tab Takes Long to Load

Once the add-on has been installed for a while, the export data tab takes a long time to load, as there is a ton of JSON data on the page. We should consider removing the export data feature.

The "benchmarks" question

My friend asked me a question about load times -

if you are collecting data over different data connections (ranging from dial-up to high-speed Wi-Fi), how can it be standardized? How can it be called a "benchmark"? Load times may vary from geographical region to region for the same site with the same ad network.

I was thinking maybe we should collect a page-load time vs. ad-load time metric?

Fix ParseURI() to Accept HTTP

Currently, our parseURI function (mindlessly copied from Stack Overflow) only parses https URLs. We probably don't need all of the extra filtering junk in there either.
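
A sketch of the simpler direction, using the standard URL constructor (which accepts any scheme, including http) in place of the copied regex; parseUri and its return shape here are illustrative.

    // Parse an absolute URL of any scheme; return null on invalid input.
    function parseUri(raw) {
      try {
        const url = new URL(raw);
        return { protocol: url.protocol, host: url.hostname, path: url.pathname };
      } catch (e) {
        return null; // not a valid absolute URL
      }
    }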

Data not sending after X minutes in browser

It seems as though the timers we've set up don't persist between sessions: when I leave my computer and it locks, the add-on stops posting data to the server and I have to reinstall the add-on. This still needs to be verified.

Add Multi-tab Top-Level-Host tracking for Disconnect Checker

Currently it seems as though our add-on isn't checking whether incoming requests are an allowed resource or property of the top-level hosts in each tab. We are currently getting spammed (ironically) by Facebook Messenger requests which aren't truly ads. I'm thinking this is likely because the request is coming out of a tab which isn't active. Not certain.

Grab More Useful Data

We're aware that currently, asset load time, ad host URL, and origin URL don't give us a whole lot to report on. The issue is trying to gather information that is both useful and privacy respecting. We don't want to hit a situation where logs allow someone to be uniquely identified.

Some thoughts for useful metrics we can gather:

  1. Unique page-visit ID for each group of assets in one page visit
    This will allow us to determine page performance by host to some degree of accuracy. One variable which may throw off our data here is the amount of time spent on a page. If a user lets only 1 asset of a potential 300 load, while 2 other users had all 300 assets load, that throws off our average by quite a lot, which is why it would be useful to grab the next data point. (See the sketch after this list.)
  2. Time spent on page
    If we collect time spent on a page when we are grouping requests by page visit, we will be able to parse out page visits which were too short to grab a majority of the requests on said page.
  3. (DONE) Ad Network
    This determination can also happen server-side, so I'm unsure if we should be doing it in the extension. That said, we've got the list handy already in the extension.

More TBD
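
For point 1, a minimal sketch of minting a fresh visit ID per top-level navigation. It assumes the "webNavigation" permission and a modern crypto.randomUUID; visitIds is an illustrative name.

    // Map from tabId to the ID of the current page visit in that tab.
    const visitIds = new Map();

    chrome.webNavigation.onCommitted.addListener((details) => {
      if (details.frameId === 0) {
        // New top-level navigation: start a fresh visit ID for this tab,
        // so every asset recorded afterwards can be tagged with it.
        visitIds.set(details.tabId, crypto.randomUUID());
      }
    });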

.gitignore js/web-crawler.bundle.js

Ideally this file is built by npm install, so it should be .gitignored so we don't have to worry about committing it in pull requests.

Add Privacy Policy

To comply with AMO and any other entity we'd like to release this on eventually (Chrome?), we need to develop some sort of privacy policy outlining what we're collecting, how it is being transferred, and how it is being used.

Anonymize HTTP Header Info

We'd like to anonymize certain data sent in the HTTP headers which could be used to identify users in the event of a security issue on our server. Below is an example of the data sent between the client and our server. We should likely anonymize the User-Agent and Accept-Language headers by setting one standard agent + accept to send.

(Screenshot of HTTP headers, 2016-08-13.)
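
A sketch of one way to do this with a blocking webRequest listener scoped to uploads to our own server. The endpoint URL and replacement values are illustrative; the webRequest layer is used because User-Agent can't be overridden from XHR itself.

    // Replace identifying headers with one fixed value on our own uploads only.
    // Assumes the "webRequest" and "webRequestBlocking" permissions.
    chrome.webRequest.onBeforeSendHeaders.addListener((details) => {
      for (const header of details.requestHeaders) {
        const name = header.name.toLowerCase();
        if (name === "user-agent") header.value = "SiteSonar/1.0";
        if (name === "accept-language") header.value = "en-US";
      }
      return { requestHeaders: details.requestHeaders };
    }, { urls: ["http://site-sonar.com/*"] }, ["blocking", "requestHeaders"]);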

Fix xhr Memory Leaks

Currently our XMLHttpRequest is defined globally. This should be descoped to prevent memory leaks across multiple requests.
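
A sketch of the descoped version; sendBatch is an illustrative name.

    // Create the XHR per call instead of at module scope, so each request
    // object can be garbage-collected once it completes.
    function sendBatch(url, payload) {
      const xhr = new XMLHttpRequest(); // local, not global
      xhr.open("POST", url);
      xhr.setRequestHeader("Content-Type", "application/json");
      xhr.send(JSON.stringify(payload));
    }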

Collect Ad Asset Type

We'd like to collect the type of ad asset for each request. Examples include analytics, advertising, content, and social.

Implement Data Encryption

After determining what info we can collect while still remaining privacy-respecting, we need to encrypt that data. In this pursuit, we will salt+hash info before sending it to the db.

(Lower priority) We may also need to send individual salts from the db to the client each time the first "write" is requested, in order to verify that no one is mucking up our db results. The likelihood of someone caring enough to send us bad data is low, but this is something to consider at a minimum.
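
A sketch of the salt+hash step using the Web Crypto API; SHA-256 is an assumed choice and saltedHash is an illustrative name.

    // Hash a value with a salt before upload; returns a hex digest string.
    async function saltedHash(value, salt) {
      const data = new TextEncoder().encode(salt + value);
      const digest = await crypto.subtle.digest("SHA-256", data);
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }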

Implement Automated Navigation

We'd like to automatically navigate between pages so the user can add this add-on and let it do its thing while benchmarking.

In order to complete this, I'll need to identify when a page is truly "loaded", i.e., when all our potential ad scripts have been requested and received. This is not as simple as calling onLoad.

My current methodology is, at the point where onLoad would be triggered (utilizing webRequest.onCompleted), to note the number of blocked requests we've sent out and wait for them to be received back (see the sketch below). The issue is that even this triggers far too soon. Websites like cnn.com, which normally have hundreds of requests, will only actually log two. Further, sites like ksdk.com, which can often have upwards of 1000, reduce to a mere 10 blocked requests that we can track.
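
A sketch of that counting approach; all names are illustrative, and the 2-second quiet period is an arbitrary guess.

    // Count tracked requests in flight; consider the page settled only after
    // the counter has stayed at zero for a quiet period.
    let inFlight = 0;
    let settleTimer = null;

    function requestStarted() {
      inFlight++;
      clearTimeout(settleTimer);
    }

    function requestFinished(onSettled) {
      inFlight--;
      if (inFlight === 0) {
        // Wait a bit: late ad scripts often fire follow-up requests.
        settleTimer = setTimeout(onSettled, 2000);
      }
    }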

More to come

JSON within JSON in Export Data

When the JSONString builds in export.js, it adds the assets object twice (same data) nested within itself. Not sure why this is happening, but we've got a suspicion it has to do with async.

Determine whether Ad is loaded on Active Tab

Determine whether the ad asset is being loaded in the active tab or side-loaded in inactive tabs. This will allow us to aggregate assets accordingly, to see which ads are invasive and which are not.

Mitigate when data is too large to send (>15mb)

The dashboard is now updated so that it accepts data up to 15 MB. We should probably check client-side to make sure the string isn't larger than that, because if it is, users will have a growing Map which could increase in size indefinitely. While the likelihood that a user gets 15 MB worth of asset logs in 2 minutes is astronomically low (that would require loading >36,000 ad-asset records), we should probably make sure it isn't happening.

Reference to finding: #43
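
A sketch of the client-side guard; MAX_BYTES mirrors the 15 MB server limit and safeToSend is an illustrative name.

    // Refuse to send batches whose serialized size exceeds the server limit.
    const MAX_BYTES = 15 * 1024 * 1024;

    function safeToSend(batch) {
      return new Blob([JSON.stringify(batch)]).size <= MAX_BYTES;
    }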

Explore Building Local User Reports

We are considering adding a local page, accessible to users, which will report on the data that they've contributed. In the interest of not simply rebuilding Lightbeam with group-reporting capabilities, we should likely keep this dashboard fairly light and easy to understand. Although it would certainly be cool to have all the features Lightbeam has, that seems to be outside the scope of this project.

Grab only N% of Asset Benchmarks

To protect against too much PII being shared, and to mitigate a potential storage-space issue on our server, we'd like to grab only ~10% of all completed asset benchmarks (see the sketch below).
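
The sampling itself can be tiny; SAMPLE_RATE and shouldRecord are illustrative names.

    // Record roughly 10% of completed benchmarks, chosen independently at random.
    const SAMPLE_RATE = 0.1;

    function shouldRecord() {
      return Math.random() < SAMPLE_RATE;
    }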

Log Results to DB

Now that we've got a functioning web extension, we need to log our results somewhere other than the console. Ideally, we will hook this up to a database so every user can send their results to one place.

Remove "www." from all URLs

Right now, www.facebook.com and facebook.com are being treated as two different URLs.

Example with chris.com and www.chris.com
(Screenshot, 2016-08-08.)
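
A sketch of the normalization; normalizeHost is an illustrative name.

    // Collapse www.example.com and example.com into one key.
    function normalizeHost(hostname) {
      return hostname.replace(/^www\./i, "");
    }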

Remove Website Subdomains

To mitigate a PII leak, we'd like to remove subdomains which may contain personally identifiable info in some cases.
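
A naive sketch of the reduction; keeping only the last two labels mishandles multi-part suffixes like .co.uk, so a real implementation should consult the Public Suffix List (registrableDomain is an illustrative name).

    // Reduce a hostname to its last two labels, e.g. mail.example.com -> example.com.
    function registrableDomain(hostname) {
      return hostname.split(".").slice(-2).join(".");
    }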
