francescostl / site-sonar

A browser extension which silently crowd-sources ad performance as you browse. Let's put an end to bad ads.

Home Page: http://site-sonar.com

License: Mozilla Public License 2.0

JavaScript 97.62% CSS 0.32% HTML 2.07%

site-sonar's Introduction

Site Sonar Header Image

Site Sonar

A project aimed at identifying the ad networks with the fastest and slowest performing ads on the internet, through crowd-sourced, easy-to-understand, and openly accessible benchmarking data. Inspired by Lightbeam, the Site Sonar browser extension (hosted in this repository) locates and benchmarks ad content silently while you browse. The resulting data is then sent to Site-Sonar's servers, where it is aggregated and displayed on our public dashboard (repo).


Installing Site-Sonar

For Firefox

Clone the repository by running:

git clone https://github.com/FrancescoSTL/Site-Sonar.git

Download and install Node.js

Once you've cloned the repo and installed Node.js, you can start Site-Sonar by running:

  1. npm install
  2. npm run bundle
With web-ext

If you're using web-ext, you'll need to do so with a pre-release version of Firefox for now, as web-ext is only supported in Firefox 49 or higher.

  1. Install web-ext if you haven't already
  2. web-ext run --firefox-binary=/Path/to/your/FirefoxDeveloperEdition/or/FirefoxBeta/or/FirefoxNightly.app

OR

Without web-ext

  1. Go to about:debugging
  2. Click "Load Temporary Add-on"
  3. Select any file in your locally downloaded version of Site-Sonar

For Chrome

  1. Clone the repository by running:
     git clone -b Chrome-and-Opera-Version https://github.com/FrancescoSTL/Site-Sonar.git
  2. Download and install Node.js
  3. Go to chrome://extensions
  4. Click "Load Unpacked Extension"
  5. Navigate to the folder where you downloaded Site-Sonar
  6. Click "Select"

For Opera

  1. Clone the repository by running:
     git clone -b Chrome-and-Opera-Version https://github.com/FrancescoSTL/Site-Sonar.git
  2. Download and install Node.js
  3. Go to opera://extensions
  4. Click "Load Unpacked Extension"
  5. Navigate to the folder where you downloaded Site-Sonar
  6. Click "Select"

Data Site-Sonar Collects

Using Disconnect's Blacklist of ad domains, Site-Sonar will benchmark and collect the following information about each ad asset in your browser:

  1. assetCompleteTime Integer Amount of time (in milliseconds) that the network took to respond to the HTTP request for the asset. This is calculated as the time difference between the onSendHeaders and onHeadersReceived events (see the sketch after this list).

  2. originUrl String The URL from which the HTTP request originated. In many cases this will be the hostUrl; however, sometimes ads will trigger their own HTTP requests. For example, check out the real-world data we pulled in Site-Sonar Issue #17.

  3. hostUrl String The top-level host URL from which the HTTP request originated. For example, if you have 3 tabs open and one request originates from the first tab (let's say, youtube.com), the top-level host would always be that tab's URL (youtube.com).

  4. adNetworkUrl String The host URL of the ad asset.

  5. assetType String Can be any value of webRequest.ResourceType.

  6. fileSize Integer File size in octets (8-bit bytes).

  7. timeStamp Integer Time when the asset was requested, in milliseconds since the epoch.

  8. method String Either "GET" or "POST".

  9. statusCode Integer Standard HTTP status code returned by the server. Ex: 200, 404, 301, etc.

  10. adNetwork String The ad network to which the asset belongs.
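
For illustration, here is a minimal sketch of how the assetCompleteTime measurement above can be wired up with the WebExtensions webRequest API. The listener shapes are the standard API; pendingRequests and recordBenchmark are hypothetical names, not Site-Sonar's actual code.

    // Minimal sketch: time the gap between onSendHeaders and onHeadersReceived.
    // Assumes the "webRequest" permission; recordBenchmark is hypothetical.
    const pendingRequests = new Map();

    chrome.webRequest.onSendHeaders.addListener((details) => {
      // Remember when the request headers went out.
      pendingRequests.set(details.requestId, details.timeStamp);
    }, { urls: ["<all_urls>"] });

    chrome.webRequest.onHeadersReceived.addListener((details) => {
      const sentAt = pendingRequests.get(details.requestId);
      if (sentAt === undefined) return;
      pendingRequests.delete(details.requestId);

      recordBenchmark({
        assetCompleteTime: Math.round(details.timeStamp - sentAt), // ms
        adNetworkUrl: new URL(details.url).hostname,
        assetType: details.type,      // a webRequest.ResourceType value
        timeStamp: details.timeStamp, // ms since the epoch
        method: details.method,       // "GET" or "POST"
        statusCode: details.statusCode,
      });
    }, { urls: ["<all_urls>"] });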

Privacy Policy

Site-Sonar Privacy Summary

Site-Sonar is a browser extension, currently supported in Firefox, Chrome, and Opera, which silently collects data about how ads are performing in your browser. That data is then sent to Site-Sonar's servers for aggregation (unless you opt out), keeping ad networks accountable through publicly accessible performance information.

What you should know

  1. Upon installing Site-Sonar, data will be collected locally and stored in your browser. Unless you opt out, every 2 minutes, that data will be sent to Site-Sonar servers for aggregation and display on our public dashboard.
  2. By default, data collected by Site-Sonar is sent to us.
  3. You can choose to opt out of sending any data to us.
  4. If you do contribute Site-Sonar data to us, your browser will send us your data in a manner which we believe minimizes your risk of being re-identified (you can see a list of the kind of data involved here). We will post your data along with data from others in an aggregated and open database. Opening this data can help users and researchers make more informed decisions based on the collective information.
  5. Uninstalling Site-Sonar prevents collection of any further Site-Sonar data and will delete the data stored locally in your browser.

FAQ

Will Site-Sonar track my browsing history?

Sort of. Once installed, Site-Sonar collects the host URL of any website you browse that hosts ad content. Read more in our Privacy Policy or Summary of Data Collection.

How can I contribute?

Check out our installation instructions and then head to our Github Issues page for either the Site-Sonar web extension (this repo), or the Site-Sonar Dashboard.

Who are you?

A group of humans interested in making the internet a better place through a pragmatic approach to problems on the web.


How can we contact you?

Visit our Contact Page.

site-sonar's People

Contributors

francescostl, purukaushik


site-sonar's Issues

Explore URL Mapping/ Grouping

After #8 was merged, a new situation was introduced where ads request other ads. See lax1-ib.adnxs.com below, where the columns are Website Origin, Requested Ad Host, and Response Time (in that order, from left to right):
(Screenshot of request log, 2016-08-02.)

Above, you can see that cnn.com is requesting an ad from lax1-ib.adnxs.com. That ad presumably proceeds to call js.moatads.com (note that the above picture is in reverse chronological order, so the last item was requested first and vice versa).

In order to mitigate this, we'd need to group all requests from a page to the original host. We can do this by grabbing the URL of the requested ad's tab (using details.tabId), as in the sketch below. This also allows us to map data in the same way Lightbeam does, which could be interesting.
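
A hedged sketch of that lookup, assuming the "tabs" permission; getTopLevelHost is an illustrative name, not the add-on's actual helper.

    // Resolve the top-level host of the tab a request came from.
    function getTopLevelHost(details, callback) {
      if (details.tabId < 0) {
        callback(null); // request isn't associated with any tab
        return;
      }
      chrome.tabs.get(details.tabId, (tab) => {
        callback(tab && tab.url ? new URL(tab.url).hostname : null);
      });
    }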

Upgrade Local Dashboard

Looking to upgrade the local dashboard with metrics we can get for free (without expensive processing or space) like: total memory used on ads, total network time taken to load ads, total number of ads recorded, total number of batches sent, etc.

Develop as Firefox Add-on to Track All HTTP Requests

The current method for collecting potential HTTP requests and testing their speed seems flawed: we are not loading the on-page JavaScript when we make the requests, so it is likely that we are missing requests because of it. The original reasoning behind the current method was that browser automation is expensive, especially when crawling hundreds (if not thousands) of pages.

Batch Ad Benchmarks per Page

It would be interesting to batch data per-page so that we can determine how bad each page-load is on average. This would give us the best shot at ranking sites by perceived performance due to ads on-page.

Record Files of Size 0 as Null

Right now, we're getting 204 No Content responses which resolve with a content-length (file size) of 0, which throws off our results and muddles the db. We want to be able to ignore all of these files in our Mongo queries on the dashboard side. In order to do this, we'd like to set them explicitly to null before storage (see the sketch below) so we don't have to worry about adding extra server-side logic.
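
A minimal sketch of that normalization, assuming the size is read from the response headers; parseContentLength is an illustrative name.

    // Normalize zero-byte responses to null before storage, so the dashboard's
    // Mongo queries can simply filter on { fileSize: { $ne: null } }.
    function parseContentLength(responseHeaders) {
      const header = (responseHeaders || []).find(
        (h) => h.name.toLowerCase() === "content-length"
      );
      const size = header ? parseInt(header.value, 10) : NaN;
      // Treat 0 (e.g. from "204 No Content") and unparsable values as null.
      return size > 0 ? size : null;
    }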

Add Site Profiler Tool

I think it would be nice to add a website profiler where users can record performance for a particular amount of time.

Compounding Requests = Memory Errors & Unmanageable Load Times

Currently we are making the HTTP requests asynchronously, which, across the tens of thousands of requests triggered in a short amount of time, causes the process to run out of memory, fail, or stall forever.

This will be invalidated if we use Selenium as noted in #1
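
In the meantime, one standard mitigation is to bound concurrency with a small worker pool rather than firing every request at once. A sketch; runLimited and the limit of 20 are illustrative, not existing project code.

    // Run an array of promise-returning tasks with at most `limit` in flight.
    async function runLimited(tasks, limit = 20) {
      const results = [];
      let next = 0;
      async function worker() {
        while (next < tasks.length) {
          const i = next++; // safe: assignment happens synchronously
          results[i] = await tasks[i]();
        }
      }
      await Promise.all(Array.from({ length: limit }, worker));
      return results;
    }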

Get Tab Url

Currently we are treating originUrl as our tab's url (the website that the request came from at the lowest level) but an ad can actually be the origin of another ad. For this reason, we should get the top level host url of the tab which the request came from.

Redirect Dashboard Overview

We should give the overview tab a message when there is no ad data, so people know that they might need to disable their ad blocker, or that there simply isn't ad data yet. We should also redirect to the profiler tool when profiling is in progress.

Export Data Tab Takes Long to Load

Once the add-on has been installed for a while, the export data tab takes a long time to load, as there is a ton of JSON data on the page. We should consider removing the export data feature.

The "benchmarks" question

My friend asked me a question about load times -

if you are collecting data over different data connections (ranging from dial-up to high-speed Wi-Fi), how can it be standardized? How can it be called a "benchmark"? Load times may vary from geographical region to region for the same site with the same ad network.

I was thinking maybe we should collect a page-load time vs. ad-load time metric?

Fix ParseURI() to Accept HTTP

Currently, our parseURI function (mindlessly copied from Stack Overflow) only parses https URLs. We probably don't need all of the extra filtering junk in there either.
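
A sketch of the simpler direction, using the standard URL constructor (which accepts any scheme, including http) in place of the copied regex; parseUri and its return shape here are illustrative.

    // Parse an absolute URL of any scheme; return null on invalid input.
    function parseUri(raw) {
      try {
        const url = new URL(raw);
        return { protocol: url.protocol, host: url.hostname, path: url.pathname };
      } catch (e) {
        return null; // not a valid absolute URL
      }
    }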

Data not sending after X minutes in browser

It seems as though the timers we've set up don't persist between sessions: when I leave my computer and it locks, the add-on stops posting data to the server and I have to reinstall the add-on. This still needs to be verified.

Add Multi-tab Top-Level-Host tracking for Disconnect Checker

Currently it seems as though our add-on isn't checking whether incoming requests are an allowed resource or property of the top-level hosts in each tab. We are currently getting spammed (ironically) by Facebook Messenger requests which aren't truly ads. I'm thinking this is likely because the request is coming out of a tab which isn't active. Not certain.

Grab More Useful Data

We're aware that currently, asset load time, ad host URL, and origin URL don't give us a whole lot to report on. The issue is trying to gather information that is both useful and privacy respecting. We don't want to hit a situation where logs allow someone to be uniquely identified.

Some thoughts for useful metrics we can gather:

  1. Unique page-visit ID for each group of assets in one page visit
    This will allow us to determine page performance by host to some degree of accuracy. One variable which may throw off our data here is the amount of time spent on a page. If a user lets only 1 asset of a potential 300 load, while 2 other users had all 300 assets load, that throws off our average by quite a lot, which is why it would be useful to grab the next data point. (See the sketch after this list.)
  2. Time spent on page
    If we collect time spent on a page when we are grouping requests by page visit, we will be able to parse out page visits which were too short to grab a majority of the requests on said page.
  3. (DONE) Ad Network
    This determination can also happen server-side, so I'm unsure if we should be doing it in the extension. That said, we've got the list handy already in the extension.

More TBD
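
For point 1, a minimal sketch of minting a fresh visit ID per top-level navigation. It assumes the "webNavigation" permission and a modern crypto.randomUUID; visitIds is an illustrative name.

    // Map from tabId to the ID of the current page visit in that tab.
    const visitIds = new Map();

    chrome.webNavigation.onCommitted.addListener((details) => {
      if (details.frameId === 0) {
        // New top-level navigation: start a fresh visit ID for this tab,
        // so every asset recorded afterwards can be tagged with it.
        visitIds.set(details.tabId, crypto.randomUUID());
      }
    });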

.gitignore js/web-crawler.bundle.js

Ideally this file is built by npm install, so it should be .gitignored so we don't have to worry about committing it in pull requests.

Add Privacy Policy

To comply with AMO and any other entity we'd like to release this on eventually (Chrome?), we need to develop some sort of privacy policy outlining what we're collecting, how it is being transferred, and how it is being used.

Anonymize HTTP Header Info

We'd like to anonymize certain data sent in the HTTP headers which could be used to identify users in the event of a security issue on our server. Below is an example of the data sent between the client and our server. We should likely anonymize the User-Agent and Accept-Language headers by setting one standard agent + accept to send.

(Screenshot of HTTP headers, 2016-08-13.)
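
A sketch of one way to do this with a blocking webRequest listener scoped to uploads to our own server. The endpoint URL and replacement values are illustrative; the webRequest layer is used because User-Agent can't be overridden from XHR itself.

    // Replace identifying headers with one fixed value on our own uploads only.
    // Assumes the "webRequest" and "webRequestBlocking" permissions.
    chrome.webRequest.onBeforeSendHeaders.addListener((details) => {
      for (const header of details.requestHeaders) {
        const name = header.name.toLowerCase();
        if (name === "user-agent") header.value = "SiteSonar/1.0";
        if (name === "accept-language") header.value = "en-US";
      }
      return { requestHeaders: details.requestHeaders };
    }, { urls: ["http://site-sonar.com/*"] }, ["blocking", "requestHeaders"]);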

Fix xhr Memory Leaks

Currently our XMLHttpRequest is defined globally. This should be descoped to prevent memory leaks across multiple requests.
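
A sketch of the descoped version; sendBatch is an illustrative name.

    // Create the XHR per call instead of at module scope, so each request
    // object can be garbage-collected once it completes.
    function sendBatch(url, payload) {
      const xhr = new XMLHttpRequest(); // local, not global
      xhr.open("POST", url);
      xhr.setRequestHeader("Content-Type", "application/json");
      xhr.send(JSON.stringify(payload));
    }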

Collect Ad Asset Type

We'd like to collect the type of ad asset for each request. Examples include analytics, advertising, content, and social.

Implement Data Encryption

After determining what info we can collect while still remaining privacy-respecting, we need to encrypt that data. In this pursuit, we will salt+hash info before sending it to the db.

(Lower priority) We may also need to send individual salts from the db to the client each time the first "write" is requested, in order to verify that no one is mucking up our db results. The likelihood of someone caring enough to send us bad data is low, but this is something to consider at a minimum.
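
A sketch of the salt+hash step using the Web Crypto API; SHA-256 is an assumed choice and saltedHash is an illustrative name.

    // Hash a value with a salt before upload; returns a hex digest string.
    async function saltedHash(value, salt) {
      const data = new TextEncoder().encode(salt + value);
      const digest = await crypto.subtle.digest("SHA-256", data);
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }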

Implement Automated Navigation

We'd like to automatically navigate between pages so the user can add this add-on and let it do its thing while benchmarking.

In order to complete this, I'll need to identify when a page is truly "loaded", i.e., when all our potential ad scripts have been requested and received. This is not as simple as calling onLoad.

My current methodology is, at the point where onLoad would be triggered (utilizing webRequest.onCompleted), to note the number of blocked requests we've sent out and wait for them to be received back (see the sketch below). The issue is that even this triggers far too soon. Websites like cnn.com, which normally have hundreds of requests, will only actually log two. Further, sites like ksdk.com, which can often have upwards of 1000, reduce to a mere 10 blocked requests that we can track.
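
A sketch of that counting approach; all names are illustrative, and the 2-second quiet period is an arbitrary guess.

    // Count tracked requests in flight; consider the page settled only after
    // the counter has stayed at zero for a quiet period.
    let inFlight = 0;
    let settleTimer = null;

    function requestStarted() {
      inFlight++;
      clearTimeout(settleTimer);
    }

    function requestFinished(onSettled) {
      inFlight--;
      if (inFlight === 0) {
        // Wait a bit: late ad scripts often fire follow-up requests.
        settleTimer = setTimeout(onSettled, 2000);
      }
    }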

More to come

JSON within JSON in Export Data

When the JSONString builds in export.js, it adds the assets object twice (same data) nested within itself. Not sure why this is happening, but we've got a suspicion it has to do with async.

Determine whether Ad is loaded on Active Tab

Determine whether the ad asset is being loaded in the active tab or side-loaded in inactive tabs. This will allow us to aggregate assets accordingly, to see which ads are invasive and which are not.

Mitigate when data is too large to send (>15mb)

The dashboard is now updated so that it accepts data up to 15 MB. We should probably check client-side to make sure the string isn't larger than that, because if it is, users will have a growing Map which could increase in size indefinitely. While the likelihood that a user gets 15 MB worth of asset logs in 2 minutes is astronomically low (that would require loading >36,000 ad-asset records), we should probably make sure it isn't happening.

Reference to finding: #43
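
A sketch of the client-side guard; MAX_BYTES mirrors the 15 MB server limit and safeToSend is an illustrative name.

    // Refuse to send batches whose serialized size exceeds the server limit.
    const MAX_BYTES = 15 * 1024 * 1024;

    function safeToSend(batch) {
      return new Blob([JSON.stringify(batch)]).size <= MAX_BYTES;
    }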

Explore Building Local User Reports

We are considering adding a local page, accessible to users, which will report on the data that they've contributed. In the interest of not simply rebuilding Lightbeam with group-reporting capabilities, we should likely keep this dashboard fairly light and easy to understand. Although it would certainly be cool to have all the features Lightbeam has, that seems to be outside the scope of this project.

Grab only N% of Asset Benchmarks

To protect against too much PII being shared, and to mitigate a potential storage-space issue on our server, we'd like to grab only ~10% of all completed asset benchmarks (see the sketch below).
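
The sampling itself can be tiny; SAMPLE_RATE and shouldRecord are illustrative names.

    // Record roughly 10% of completed benchmarks, chosen independently at random.
    const SAMPLE_RATE = 0.1;

    function shouldRecord() {
      return Math.random() < SAMPLE_RATE;
    }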

Log Results to DB

Now that we've got a functioning web extension, we need to log our results somewhere other than the console. Ideally, we will hook this up to a database so every user can send their results to one place.

Remove "www." from all URLs

Right now, www.facebook.com and facebook.com are being treated as two different URLs.

Example with chris.com and www.chris.com
(Screenshot, 2016-08-08.)
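
A sketch of the normalization; normalizeHost is an illustrative name.

    // Collapse www.example.com and example.com into one key.
    function normalizeHost(hostname) {
      return hostname.replace(/^www\./i, "");
    }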

Remove Website Subdomains

To mitigate a PII leak, we'd like to remove subdomains which may contain personally identifiable info in some cases.
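
A naive sketch of the reduction; keeping only the last two labels mishandles multi-part suffixes like .co.uk, so a real implementation should consult the Public Suffix List (registrableDomain is an illustrative name).

    // Reduce a hostname to its last two labels, e.g. mail.example.com -> example.com.
    function registrableDomain(hostname) {
      return hostname.split(".").slice(-2).join(".");
    }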
