mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
License: Mozilla Public License 2.0
Adding a tutorial on Apache PyArrow for reading the Apache Parquet format.
Please find attached my initial review of #86. I will continue to work on examples of data based on my initial analysis. Any feedback would be great, as it ties into a few other issues that were created and listed in my analysis!
2019_04_CoderT_Adblocker and Tracker blocker analysis.pdf
The two links for the data sources (3.7 GB and 9.1 GB respectively) in the project's README.md are incorrect. Unable to download the zipped files from those links.
Add a Resources section to the README.md
And there are more papers with similar claims - let's add data from our dataset to these learnings.
While I was going through hello_world.ipynb, I noticed this error: `ValueError: Cannot run multiple SparkContexts at once`. It is a pretty common error that occurs because the system automatically initializes the SparkContext. I had to use `sc.stop()` to stop the earlier context and create a new one. @birdsarah, should I maybe add a cell just after this code snippet?

```python
import findspark
findspark.init('/opt/spark')  # Adjust for the location where you installed Spark

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext(appName="Overscripted")
spark = SparkSession(sc)
```

```python
# If you are already running a context, run this cell and rerun the cell above
sc.stop()
```
For easier communication around this dataset, let's migrate to the fork at: https://github.com/mozilla/overscripted.
No rush on this, so let's resolve the outstanding PRs and then migrate.
-- OR --
I can delete the other repo and rename this one; then we just need to update all local work to point to the correct upstream resource.
Adding a link to the hello_world.ipynb in the README.md
@birdsarah hello, I went through the workings and functionality of both, and I want to explain the difference between the two.
Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016 - a 2016 paper from USENIX - https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lerner
It contains a number of interesting metrics to describe tracking over time. While the OverScripted dataset does not have sufficient data to compare for all metrics, there may be some that we can reproduce and so continue the evolution of the data presented in the paper.
As of January 1, 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:
If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].
(Message COC001)
From a specified API or DOM interface (single or list), recursively generate a list of symbols to accumulate data on for each JavaScript file.
Works off the set of JSON files from Mozilla's browser-compat-data/api GitHub page (must have these saved locally).
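The recursion could be sketched roughly as below. This is a hedged illustration, not the actual script: it assumes the browser-compat-data JSON layout (a top-level "api" key with nested feature dicts whose metadata lives under `__compat`-style keys), and the function names are made up.

```python
import json
from pathlib import Path

def collect_symbols(node, prefix=""):
    """Recursively walk a browser-compat-data style dict and return
    dotted symbol names (e.g. 'Window.alert'), skipping metadata keys."""
    symbols = []
    for key, value in node.items():
        if key.startswith("__"):  # skip __compat metadata blocks
            continue
        name = f"{prefix}.{key}" if prefix else key
        symbols.append(name)
        if isinstance(value, dict):
            symbols.extend(collect_symbols(value, name))
    return symbols

def symbols_from_bcd_dir(api_dir):
    """Accumulate symbols from every locally saved api/*.json file."""
    all_symbols = []
    for path in Path(api_dir).glob("*.json"):
        data = json.loads(path.read_text())
        all_symbols.extend(collect_symbols(data.get("api", {})))
    return sorted(set(all_symbols))
```

The resulting symbol list could then be matched against the dataset's `symbol` column.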
Much of the analysis done on this dataset uses Dask. Dask is excellent for distributed numerical computing but seems to struggle with strings.
Array extensions unfortunately can't be serialized (pandas-dev/pandas#20612).
https://github.com/xhochy/fletcher is an array extension that adds string-processing functionality.
Using some standard tasks for this dataset (e.g. collecting all domains, or counting the number of script domains per location domain), compare the performance of Spark, Dask, and Dask with extension arrays.
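One of those benchmark tasks might be sketched as follows. This is an assumption-laden illustration: it uses `urlparse().netloc` rather than tldextract for domain extraction, and assumes the dataset's `script_url` and `location` columns.

```python
import pandas as pd
from urllib.parse import urlparse

def domains_per_location(df):
    """Count distinct script domains per location (page) domain.
    Expects 'script_url' and 'location' columns."""
    out = df.assign(
        script_domain=df["script_url"].map(lambda u: urlparse(u).netloc),
        location_domain=df["location"].map(lambda u: urlparse(u).netloc),
    )
    return out.groupby("location_domain")["script_domain"].nunique()

# The same expression should run under Dask by swapping the dataframe:
#   import dask.dataframe as dd
#   ddf = dd.read_parquet("clean.parquet")
#   domains_per_location(ddf).compute()
```

Timing the same function against a pandas frame, a Dask frame, and a Dask frame with fletcher-backed string columns would give the comparison described above.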
We need a section describing each column of the dataset.
Even a single line description for each field would be very helpful for somebody who is starting to work with the dataset.
Most of the multivariate datasets have descriptions of each field.
e.g.: https://archive.ics.uci.edu/ml/datasets/cardiotocography#
Here, the section "Attribute Information" describes each attribute/column
Overscripted dataset attributes:

```python
['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',
 'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',
 'arguments_n_keys', 'call_id', 'call_stack', 'file_name', 'func_name',
 'in_crawl_list', 'in_iframe', 'in_stripped_crawl_list', 'location',
 'locations_len', 'operation', 'script_url', 'symbol', 'time_stamp',
 'value_1000', 'value_len']
```

Descriptions for the above dataset attributes need to be added.
Check the specified download directory for existing files, generate a list of existing files, compare with the previously generated list of script URLs, and download missing files.
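The check-and-download step could be sketched like this. A hedged illustration only: the function names are invented, and `requests` is an assumed HTTP client, not something the repo necessarily uses.

```python
from pathlib import Path

def find_missing(url_by_filename, download_dir):
    """Compare the previously generated {filename: url} map against the
    files already present in download_dir; return pairs still needed."""
    existing = {p.name for p in Path(download_dir).iterdir() if p.is_file()}
    return [(name, url) for name, url in url_by_filename.items()
            if name not in existing]

def download_missing(url_by_filename, download_dir):
    """Fetch only the files not already on disk."""
    import requests  # assumption: any HTTP client would do
    for name, url in find_missing(url_by_filename, download_dir):
        resp = requests.get(url, timeout=30)
        if resp.ok:
            (Path(download_dir) / name).write_bytes(resp.content)
```

Keeping `find_missing` separate makes the resume logic easy to test without network access.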
Produce a tree walker script that analyzes the AST of each JS file.
>>Have a rough tree walker, but there are definite errors in my walking algorithm (specifically depth counting). Must investigate further<<
`getContext()` is not distinguished between `OffscreenCanvas.getContext()` vs `HTMLCanvasElement.getContext()`.

Go through the full dataset, pull out the unique script URLs, and save them as a list (to avoid working with the whole 70 GB dataset). Assign each URL a corresponding filename to use in the download stage (the raw URL is not compatible with the ext4 filesystem).
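One common way to make a URL filesystem-safe, sketched below as an assumption (the actual mapping used here isn't specified): hash the URL to a fixed-length name, which sidesteps both forbidden characters and ext4's 255-byte filename limit.

```python
import hashlib

def url_to_filename(url):
    """Map a raw script URL to a filesystem-safe name. Raw URLs can
    contain '/' and other characters invalid in ext4 filenames, so a
    fixed-length hex digest is used instead (extension kept for clarity)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest + ".js"
```

The mapping is deterministic, so the same {filename: url} list can be regenerated from the URL list at download time.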
Notes:
>>1st iteration done; not verified to be accurate yet<<
Around 75,000 rows of data were dropped for various reasons when cleaning the data (note the clean dataset is 131 million rows).
Although this is only a tiny fraction of the data, for completeness we should consider releasing that invalid data too.
cell 15 reads:

```python
on_complete_df = df[df.symbol == 'OfflineAudioContext.createOscillator']
on_complete_urls = on_complete_df.script_url.unique().persist()
```

Pretty sure that should be `oncomplete`.
Finding the total number of scripts of each of the three types present in the dataset, and calculating the percentage for each type as well as for the three types considered together.
The README.md of the analyses folder gives the false impression that both Anaconda and Spark must be installed in Step 2 (at least I found so). One can proceed without installing Spark by using conda, or vice versa (correct me if I am wrong).
In the Gitter chat I noticed most people get stuck on the environment import error in the analyses folder (issue #70), which makes it a frequently raised issue. So it would be good to document it in the README.md so that new contributors don't get stuck on the same issue previously resolved for another contributor.
This thesis, submitted by Amin Faiz Khademi for a Master of Science degree at Queen's University in Ontario, Canada, covers fingerprinting analysis, detection, and prevention at runtime.
Here is the link: https://qspace.library.queensu.ca/handle/1974/12604
The PDF is on the site and can be downloaded.
FYI: The following changes were made to this repository's wiki:
- Defacing spam has been removed.
- Restricting write access to contributors is strongly encouraged. Please make that change (documentation).
These were made as the result of a recent automated defacement of publicly writable wikis.
There are adblock lists - like EasyList, EasyPrivacy, someonewhocares, pgl, etc. (see https://github.com/uBlockOrigin/uAssets for ideas).
The OverScripted dataset is not designed to pick up ads; it's designed to capture JavaScript calls for known / expected fingerprinting vectors. That said, there may be overlap. It would be interesting to investigate correlations - what set of scripts would be blocked by these lists, and why?
System Specs:
Anaconda: Version 2018.12
Windows 10 (64-bit)
I get the error when importing Client from dask.distributed when using the overscripted environment.
The error is:

```
ImportError                               Traceback (most recent call last)
C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\dask\distributed.py in <module>
      4 try:
----> 5     from distributed import *
      6 except ImportError:

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\__init__.py in <module>
      3 from .config import config
----> 4 from .core import connect, rpc
      5 from .deploy import LocalCluster

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\core.py in <module>
     19
---> 20 from .comm import (connect, listen, CommClosedError,
     21                    normalize_address,

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\comm\__init__.py in <module>
     16
---> 17 _register_transports()

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\comm\__init__.py in _register_transports()
     12 def _register_transports():
---> 13     from . import inproc
     14     from . import tcp

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\comm\inproc.py in <module>
     14 from ..compatibility import finalize
---> 15 from ..protocol import nested_deserialize
     16 from ..utils import get_ip

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\protocol\__init__.py in <module>
      4
----> 5 from .compression import compressions, default_compression
      6 from .core import (dumps, loads, maybe_compress, decompress, msgpack)

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\protocol\compression.py in <module>
     22 from ..config import config
---> 23 from ..utils import ignoring, ensure_bytes
     24

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\utils.py in <module>
     37 from tornado import gen
---> 38 from tornado.ioloop import IOLoop, PollIOLoop
     39

ImportError: cannot import name 'PollIOLoop'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
      1 import dask.dataframe as dd
----> 2 from dask.distributed import Client
      3
      4 Client()

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\dask\distributed.py in <module>
      9         "  conda install dask distributed             # either conda install\n"
     10         "  pip install dask distributed --upgrade     # or pip install")
---> 11 raise ImportError(msg)

ImportError: Dask's distributed scheduler is not installed.

Please either conda or pip install dask distributed:
  conda install dask distributed             # either conda install
  pip install dask distributed --upgrade     # or pip install
```
It works fine with the base environment, but not with the overscripted environment. I included distributed in environment.yaml:

```yaml
name: overscripted
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.6
  - jupyter=1.0.0
  - pyarrow=0.9.0
  - pandas=0.23.0
  - distributed=1.21.1
  - dask=0.17.5
  - findspark=1.2.0
  - tldextract=2.2.0
```
Found a tutorial on Jupyter Notebook. Adding it to the Resources section of the README.md.
Link: https://www.youtube.com/watch?v=HW29067qVWk
Conda install did not work for me, so I installed Spark using https://datawookie.netlify.com/blog/2017/07/installing-spark-on-ubuntu/
Should I add this to the New contributor tips?
The `value` column contains some really large items - what's in there?
An initial look by @dzeber found some CSV files containing football scores, but a systematic review hasn't been done.
Do any indicate a potential for privacy / information loss?
We would like to use the information about website behaviour learnt from the crawl data to build out a system to help users stay safe online. The original idea was to design a scalar "risk" or "badness" metric that would alert the user to the potential riskiness of each page they visit.
However, we immediately run into the question of how to define "badness". While we can claim that some sites are intuitively less trustworthy than others, and some website behaviours are undesirable (e.g. respawning cookies deleted by the user), it is difficult to define an overall notion of badness that applies to all sites, as it is highly subjective and nuanced.
Some reasonable candidates for a definition of badness are trustworthiness, privacy loss or security risk. However, these are often difficult to quantify without imposing value judgements on both users and website owners. In particular, it is not realistic to attempt to quantify privacy loss in an absolute way, since different users are comfortable sharing different degrees of personal information, often in exchange for the utility or convenience they gain from some content or a service. Furthermore, a single user may have different thresholds of privacy risk for different content providers, depending on
how much they trust them. Security risk is more objective, but is difficult to measure using the crawl data.
Some examples to consider:
With this in mind, we propose an approach to assessing website riskiness that draws inspiration from the "Nutrition Facts" that are common in food labelling in many countries. We would want the metric to meet the following conditions:
As an initial version of such a metric, we propose to count things that happen behind the scenes when a webpage is loaded. While this needs some refinement, we generally consider this to mean "what happens in the browser when a page is loaded, outside of rendering visible content". This would include things like requests to third parties, background scripts, calls to Javascript APIs that are not directly related to the page content, etc., which are generally opaque to most users of the modern Web.
By design, this does not directly report a measure of risk or badness. However, it relies on the assumption that, when undesirable behaviour occurs, it occurs behind the scenes. Therefore, this metric would in fact provide information on such behaviours to the user, since they would be covered by the counts. Moreover, pages with higher counts can be considered more risky than those with lower counts, since they provide more opportunities for undesirable behaviour to occur, i.e. a larger potential attack surface.
To implement this metric, we propose the following:
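As a rough illustration of the counting idea (not the actual proposal; it assumes the dataset's `location`, `script_url`, and `symbol` columns and uses plain `urlparse` for domain extraction):

```python
import pandas as pd
from urllib.parse import urlparse

def behind_the_scenes_counts(df):
    """Per-page counts of behind-the-scenes activity: total JS API
    calls, distinct symbols touched, and distinct third-party
    script domains."""
    df = df.assign(
        page_domain=df["location"].map(lambda u: urlparse(u).netloc),
        script_domain=df["script_url"].map(lambda u: urlparse(u).netloc),
    )
    third_party = df[df["script_domain"] != df["page_domain"]]
    return pd.DataFrame({
        "api_calls": df.groupby("location").size(),
        "distinct_symbols": df.groupby("location")["symbol"].nunique(),
        "third_party_domains":
            third_party.groupby("location")["script_domain"].nunique(),
    }).fillna(0)
```

Like a nutrition label, each count is a neutral fact; interpreting higher counts as a larger potential attack surface is left to the reader of the label.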
Can we build code that takes the time-stamped calls (and the col/line number) and generates something useful from it?
Note that there is no guarantee that execution time would follow code order; calls from different asynchronous functions could be interleaved. This would be a very speculative investigation to see if we can get additional meaning out of the code.
Additionally, we are hoping to add an increment counter to future data collection.
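A first speculative step might just order each script's calls by timestamp. This is a sketch under the stated caveat, assuming a pandas DataFrame with the dataset's `script_url`, `time_stamp`, and `symbol` columns:

```python
def approximate_call_order(df):
    """Speculative: order calls from each script by timestamp to
    approximate execution order. Interleaving from asynchronous
    functions means this is only a heuristic, not true code order."""
    return (df.sort_values("time_stamp")
              .groupby("script_url")["symbol"]
              .apply(list))
```

An increment counter in future crawls would make this ordering reliable rather than heuristic.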
We welcome all explorations and analyses of this dataset to uncover patterns and insights it may hold.
Add a section about web crawlers to the glossary.
There are some scripts that we can pick out by name that are doing browser attribute fingerprinting:
- hs-analytics in the script_url
- /akam/ in the script_url
Can we build a heuristic for browser attribute fingerprinting that pulls out these scripts?
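The name-based part of such a heuristic could start as simply as the sketch below; the two patterns come from the examples above, and anything added to the list would need verification against the dataset.

```python
import re

# Known-by-name patterns from the issue; extend only after verification.
FINGERPRINT_PATTERNS = [r"hs-analytics", r"/akam/"]

def flag_fingerprinting_scripts(script_urls):
    """Return the subset of script URLs matching known
    fingerprinting name patterns."""
    pattern = re.compile("|".join(FINGERPRINT_PATTERNS))
    return [u for u in script_urls if pattern.search(u)]
```

The flagged scripts could then seed a behavioural heuristic (e.g. which symbols they call) that generalises beyond names.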
Build on @LABBsoft's work to evaluate prevalence in this dataset and document data deficiencies that we could maybe supplement with future crawls or related datasets.
Hi,
This is a humble request to extend the deadline by at least a day. Most of our analysis is done, but the final one is still running. The dataset is humongous, and as we had no previous experience working with big data, it's a time-consuming task.
It would be amazing if you could extend the deadline by even just a day! It would give us sufficient time to compile our final conclusion statement.
Thanks
@birdsarah todo - submit a sanitized version of the work done using WHOIS data to see if others can leverage / build upon it.
Adding additional information provided on Gitter
Develop a function for pulling out unique ids from the dataset.
We want to identify scripts that have stored / created unique ids. We also want to identify scripts that have not been storing / creating unique ids.
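A starting heuristic for "looks like a unique id", sketched here as an assumption (the regexes are illustrative, targeting the dataset's value columns):

```python
import re

# Heuristic patterns for id-like tokens: UUIDs and long hex runs.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                     r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")
HEX_RE = re.compile(r"\b[0-9a-f]{16,}\b")

def looks_like_unique_id(value):
    """Heuristic check for a stored / created unique identifier."""
    return bool(UUID_RE.search(value) or HEX_RE.search(value))
```

Applying this to stored values would partition scripts into those that appear to create ids and those that do not, which is exactly the split described above; false positives (e.g. hashes that aren't identifiers) would still need review.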
Presently only conda is mentioned. We could generate a requirements.txt or environment.txt file and let users set up the project using pip install after creating a virtual environment. This is especially helpful for Linux users.
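For illustration, a requirements.txt mirroring the pins in the existing environment.yaml might look like this (hypothetical; the exact pins would need checking against pip availability):

```text
jupyter==1.0.0
pyarrow==0.9.0
pandas==0.23.0
distributed==1.21.1
dask==0.17.5
findspark==1.2.0
tldextract==2.2.0
```

Users would then run `python3 -m venv venv`, activate it, and `pip install -r requirements.txt`.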
Add video tutorial for Dask and a cheatsheet for quick reference and overall view of the Dask workflow.