mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
License: Mozilla Public License 2.0
Adding a tutorial on Apache PyArrow for reading the Apache Parquet format.
Please find attached my initial review of #86. I will continue to work on examples of data based on my initial analysis. Any feedback would be great, as it ties into a few other issues that were created and listed in my analysis!
2019_04_CoderT_Adblocker and Tracker blocker analysis.pdf
The two links for the data sources (3.7 GB and 9.1 GB respectively) in the project's README.md are incorrect. Unable to download the zipped files from those links.
Add a Resources section to the README.md
And there are more papers with similar claims - let's add data from our dataset to these learnings.
While I was going through hello_world.ipynb, I noticed this error: `ValueError: Cannot run multiple SparkContexts at once`. It is a pretty common error that occurs because the system automatically initializes the SparkContext. I had to use `sc.stop()` to stop the earlier context and create a new one. @birdsarah, should I maybe add a cell just after this code snippet?

```python
import findspark
findspark.init('/opt/spark')  # Adjust for the location where you installed Spark

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext(appName="Overscripted")
spark = SparkSession(sc)
```

```python
# If you are already running a context, run this cell and rerun the cell above
sc.stop()
```
For easier communication around this dataset, let's migrate to the fork at: https://github.com/mozilla/overscripted.
No rush on this, so let's resolve the outstanding PRs and then migrate.
-- OR --
I can delete the other repo and rename this one; then we just need to update all local work to point to the correct upstream resource.
Adding a link to the hello_world.ipynb in the README.md
@birdsarah hello, I went through the workings and functionality of both, and I want to explain the difference between the two.
Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016 - a 2016 paper from USENIX - https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lerner
It contains a number of interesting metrics to describe tracking over time. While the OverScripted dataset does not have sufficient data to compare for all metrics, there may be some that we can reproduce and so continue the evolution of the data presented in the paper.
As of January 1, 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:
If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].
(Message COC001)
From a specified API or DOM interface (single or list), recursively generate a list of symbols to accumulate data on for each JavaScript file.
Works off the set of JSON files from Mozilla's browser-compat-data/api GitHub page (must have these saved locally).
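The recursion could be sketched roughly as below. This is a hedged illustration, not the actual script: it assumes the browser-compat-data JSON layout (a top-level "api" key with nested feature dicts whose metadata lives under `__compat`-style keys), and the function names are made up.

```python
import json
from pathlib import Path

def collect_symbols(node, prefix=""):
    """Recursively walk a browser-compat-data style dict and return
    dotted symbol names (e.g. 'Window.alert'), skipping metadata keys."""
    symbols = []
    for key, value in node.items():
        if key.startswith("__"):  # skip __compat metadata blocks
            continue
        name = f"{prefix}.{key}" if prefix else key
        symbols.append(name)
        if isinstance(value, dict):
            symbols.extend(collect_symbols(value, name))
    return symbols

def symbols_from_bcd_dir(api_dir):
    """Accumulate symbols from every locally saved api/*.json file."""
    all_symbols = []
    for path in Path(api_dir).glob("*.json"):
        data = json.loads(path.read_text())
        all_symbols.extend(collect_symbols(data.get("api", {})))
    return sorted(set(all_symbols))
```

The resulting symbol list could then be matched against the dataset's `symbol` column.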
Much of the analysis done on this dataset uses Dask. Dask is excellent for distributed numerical computing but seems to struggle with strings.
Array extensions unfortunately can't be serialized (pandas-dev/pandas#20612).
https://github.com/xhochy/fletcher is an array extension that adds string-processing functionality.
Using some standard tasks for this dataset (e.g. collecting all domains, or counting the number of script domains per location domain), compare the performance of Spark, Dask, and Dask with extension arrays.
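One of those benchmark tasks might be sketched as follows. This is an assumption-laden illustration: it uses `urlparse().netloc` rather than tldextract for domain extraction, and assumes the dataset's `script_url` and `location` columns.

```python
import pandas as pd
from urllib.parse import urlparse

def domains_per_location(df):
    """Count distinct script domains per location (page) domain.
    Expects 'script_url' and 'location' columns."""
    out = df.assign(
        script_domain=df["script_url"].map(lambda u: urlparse(u).netloc),
        location_domain=df["location"].map(lambda u: urlparse(u).netloc),
    )
    return out.groupby("location_domain")["script_domain"].nunique()

# The same expression should run under Dask by swapping the dataframe:
#   import dask.dataframe as dd
#   ddf = dd.read_parquet("clean.parquet")
#   domains_per_location(ddf).compute()
```

Timing the same function against a pandas frame, a Dask frame, and a Dask frame with fletcher-backed string columns would give the comparison described above.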
We need a section describing each column of the dataset.
Even a single line description for each field would be very helpful for somebody who is starting to work with the dataset.
Most of the multivariate datasets have descriptions of each field.
e.g.: https://archive.ics.uci.edu/ml/datasets/cardiotocography#
Here, the section "Attribute Information" describes each attribute/column
Overscripted dataset attributes:

```python
['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4',
 'argument_5', 'argument_6', 'argument_7', 'argument_8', 'arguments',
 'arguments_n_keys', 'call_id', 'call_stack', 'file_name', 'func_name',
 'in_crawl_list', 'in_iframe', 'in_stripped_crawl_list', 'location',
 'locations_len', 'operation', 'script_url', 'symbol', 'time_stamp',
 'value_1000', 'value_len']
```

Descriptions for the above dataset attributes need to be added.
Check the specified download directory for existing files, generate a list of existing files, compare with the previously generated list of script URLs, and download missing files.
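The check-and-download step could be sketched like this. A hedged illustration only: the function names are invented, and `requests` is an assumed HTTP client, not something the repo necessarily uses.

```python
from pathlib import Path

def find_missing(url_by_filename, download_dir):
    """Compare the previously generated {filename: url} map against the
    files already present in download_dir; return pairs still needed."""
    existing = {p.name for p in Path(download_dir).iterdir() if p.is_file()}
    return [(name, url) for name, url in url_by_filename.items()
            if name not in existing]

def download_missing(url_by_filename, download_dir):
    """Fetch only the files not already on disk."""
    import requests  # assumption: any HTTP client would do
    for name, url in find_missing(url_by_filename, download_dir):
        resp = requests.get(url, timeout=30)
        if resp.ok:
            (Path(download_dir) / name).write_bytes(resp.content)
```

Keeping `find_missing` separate makes the resume logic easy to test without network access.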
Produce a tree walker script that analyzes the AST of each JS file.
>>Have a rough tree walker, but there are definite errors in my walking algorithm (specifically depth counting). Must investigate further<<
`getContext()` is not distinguished between `OffscreenCanvas.getContext()` vs `HTMLCanvasElement.getContext()`.

Go through the full dataset, pull out the unique script URLs, and save them as a list (to avoid working with the whole 70 GB dataset). Assign each URL a corresponding filename to use in the download stage (the raw URL is not compatible with the ext4 filesystem).
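One common way to make a URL filesystem-safe, sketched below as an assumption (the actual mapping used here isn't specified): hash the URL to a fixed-length name, which sidesteps both forbidden characters and ext4's 255-byte filename limit.

```python
import hashlib

def url_to_filename(url):
    """Map a raw script URL to a filesystem-safe name. Raw URLs can
    contain '/' and other characters invalid in ext4 filenames, so a
    fixed-length hex digest is used instead (extension kept for clarity)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest + ".js"
```

The mapping is deterministic, so the same {filename: url} list can be regenerated from the URL list at download time.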
Notes:
>>1st iteration done; not verified to be accurate yet<<
Around 75,000 rows of data were dropped for various reasons when cleaning the data (note the clean dataset is 131 million rows).
Although this is only a tiny fraction of the data, for completeness we should consider releasing that invalid data too.
cell 15 reads:

```python
on_complete_df = df[df.symbol == 'OfflineAudioContext.createOscillator']
on_complete_urls = on_complete_df.script_url.unique().persist()
```

Pretty sure that should be `oncomplete`.
Finding the total number of scripts of each of the three types present in the dataset, and calculating the percentage for each type as well as for the three types considered together.
The README.md of the analyses folder gives the false impression that both Anaconda and Spark must be installed in Step 2 (at least I found so). One can proceed without installing Spark by using conda, or vice versa (correct me if I am wrong).
In the Gitter chat I noticed most people get stuck on the environment import error in the analyses folder (issue #70), which makes it a frequently raised issue. So it would be good to document it in the README.md so that new contributors don't get stuck on the same issue previously resolved for another contributor.
This thesis, submitted by Amin Faiz Khademi for a Master of Science degree at Queen's University in Ontario, Canada, covers fingerprinting analysis, detection, and prevention at runtime.
Here is the link: https://qspace.library.queensu.ca/handle/1974/12604
The PDF is on the site and can be downloaded.
FYI: The following changes were made to this repository's wiki:
- Defacing spam has been removed.
- Restricting write access to contributors is strongly encouraged. Please make that change (documentation).
These were made as the result of a recent automated defacement of publicly writable wikis.
There are adblock lists - like EasyList, EasyPrivacy, someonewhocares, pgl, etc. (see https://github.com/uBlockOrigin/uAssets for ideas).
The OverScripted dataset is not designed to pick up ads; it's designed to capture JavaScript calls for known / expected fingerprinting vectors. That said, there may be overlap. It would be interesting to investigate correlations - what set of scripts would be blocked by these lists, and why?
System Specs:
Anaconda: Version 2018.12
Windows 10 (64-bit)
I get the error when importing Client from dask.distributed when using the overscripted environment.
The error is:

```
ImportError                               Traceback (most recent call last)
C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\dask\distributed.py in <module>
      4 try:
----> 5     from distributed import *
      6 except ImportError:

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\__init__.py in <module>
      3 from .config import config
----> 4 from .core import connect, rpc
      5 from .deploy import LocalCluster

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\core.py in <module>
     19
---> 20 from .comm import (connect, listen, CommClosedError,
     21                    normalize_address,

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\comm\__init__.py in <module>
     16
---> 17 _register_transports()

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\comm\__init__.py in _register_transports()
     12 def _register_transports():
---> 13     from . import inproc
     14     from . import tcp

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\comm\inproc.py in <module>
     14 from ..compatibility import finalize
---> 15 from ..protocol import nested_deserialize
     16 from ..utils import get_ip

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\protocol\__init__.py in <module>
      4
----> 5 from .compression import compressions, default_compression
      6 from .core import (dumps, loads, maybe_compress, decompress, msgpack)

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\protocol\compression.py in <module>
     22 from ..config import config
---> 23 from ..utils import ignoring, ensure_bytes
     24

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\distributed\utils.py in <module>
     37 from tornado import gen
---> 38 from tornado.ioloop import IOLoop, PollIOLoop
     39

ImportError: cannot import name 'PollIOLoop'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input> in <module>
      1 import dask.dataframe as dd
----> 2 from dask.distributed import Client
      3
      4 Client()

C:\ProgramData\Anaconda3\envs\overscripted\lib\site-packages\dask\distributed.py in <module>
      9         "  conda install dask distributed             # either conda install\n"
     10         "  pip install dask distributed --upgrade     # or pip install")
---> 11 raise ImportError(msg)

ImportError: Dask's distributed scheduler is not installed.

Please either conda or pip install dask distributed:
  conda install dask distributed             # either conda install
  pip install dask distributed --upgrade     # or pip install
```
It works fine with the base environment, but not with the overscripted environment. I included distributed in environment.yaml:

```yaml
name: overscripted
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.6
  - jupyter=1.0.0
  - pyarrow=0.9.0
  - pandas=0.23.0
  - distributed=1.21.1
  - dask=0.17.5
  - findspark=1.2.0
  - tldextract=2.2.0
```
Found a tutorial on Jupyter Notebook. Adding it to the Resources section of the README.md.
Link: https://www.youtube.com/watch?v=HW29067qVWk
Conda install did not work for me, so I installed Spark using https://datawookie.netlify.com/blog/2017/07/installing-spark-on-ubuntu/
Should I add this to the New contributor tips?
The `value` column contains some really large items - what's in there?
An initial look by @dzeber found some CSV files containing football scores, but a systematic review hasn't been done.
Do any indicate a potential for privacy / information loss?
We would like to use the information about website behaviour learnt from the crawl data to build out a system to help users stay safe online. The original idea was to design a scalar "risk" or "badness" metric that would alert the user to the potential riskiness of each page they visit.
However, we immediately run into the question of how to define "badness". While we can claim that some sites are intuitively less trustworthy than others, and some website behaviours are undesirable (e.g. respawning cookies deleted by the user), it is difficult to define an overall notion of badness that applies to all sites, as it is highly subjective and nuanced.
Some reasonable candidates for a definition of badness are trustworthiness, privacy loss or security risk. However, these are often difficult to quantify without imposing value judgements on both users and website owners. In particular, it is not realistic to attempt to quantify privacy loss in an absolute way, since different users are comfortable sharing different degrees of personal information, often in exchange for the utility or convenience they gain from some content or a service. Furthermore, a single user may have different thresholds of privacy risk for different content providers, depending on
how much they trust them. Security risk is more objective, but is difficult to measure using the crawl data.
Some examples to consider:
With this in mind, we propose an approach to assessing website riskiness that draws inspiration from the "Nutrition Facts" that are common in food labelling in many countries. We would want the metric to meet the following conditions:
As an initial version of such a metric, we propose to count things that happen behind the scenes when a webpage is loaded. While this needs some refinement, we generally consider this to mean "what happens in the browser when a page is loaded, outside of rendering visible content". This would include things like requests to third parties, background scripts, calls to Javascript APIs that are not directly related to the page content, etc., which are generally opaque to most users of the modern Web.
By design, this does not directly report a measure of risk or badness. However, it relies on the assumption that, when undesirable behaviour occurs, it occurs behind the scenes. Therefore, this metric would in fact provide information on such behaviours to the user, since they would be covered by the counts. Moreover, pages with higher counts can be considered more risky than those with lower counts, since they provide more opportunities for undesirable behaviour to occur, i.e. a larger potential attack surface.
To implement this metric, we propose the following:
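As a rough illustration of the counting idea (not the actual proposal; it assumes the dataset's `location`, `script_url`, and `symbol` columns and uses plain `urlparse` for domain extraction):

```python
import pandas as pd
from urllib.parse import urlparse

def behind_the_scenes_counts(df):
    """Per-page counts of behind-the-scenes activity: total JS API
    calls, distinct symbols touched, and distinct third-party
    script domains."""
    df = df.assign(
        page_domain=df["location"].map(lambda u: urlparse(u).netloc),
        script_domain=df["script_url"].map(lambda u: urlparse(u).netloc),
    )
    third_party = df[df["script_domain"] != df["page_domain"]]
    return pd.DataFrame({
        "api_calls": df.groupby("location").size(),
        "distinct_symbols": df.groupby("location")["symbol"].nunique(),
        "third_party_domains":
            third_party.groupby("location")["script_domain"].nunique(),
    }).fillna(0)
```

Like a nutrition label, each count is a neutral fact; interpreting higher counts as a larger potential attack surface is left to the reader of the label.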
Can we build code that takes the time-stamped calls (and the col/line number) and generates something useful from it?
Note that there is no guarantee that execution time would follow code order; calls from different asynchronous functions could be interleaved. This would be a very speculative investigation to see if we can get additional meaning out of the code.
Additionally, we are hoping to add an increment counter to future data collection.
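A first speculative step might just order each script's calls by timestamp. This is a sketch under the stated caveat, assuming a pandas DataFrame with the dataset's `script_url`, `time_stamp`, and `symbol` columns:

```python
def approximate_call_order(df):
    """Speculative: order calls from each script by timestamp to
    approximate execution order. Interleaving from asynchronous
    functions means this is only a heuristic, not true code order."""
    return (df.sort_values("time_stamp")
              .groupby("script_url")["symbol"]
              .apply(list))
```

An increment counter in future crawls would make this ordering reliable rather than heuristic.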
We welcome all explorations and analyses of this dataset to uncover patterns and insights it may hold.
Add a section about web crawlers to the glossary.
There are some scripts that we can pick out by name that are doing browser attribute fingerprinting:
- hs-analytics in the script_url
- /akam/ in the script_url
Can we build a heuristic for browser attribute fingerprinting that pulls out these scripts?
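The name-based part of such a heuristic could start as simply as the sketch below; the two patterns come from the examples above, and anything added to the list would need verification against the dataset.

```python
import re

# Known-by-name patterns from the issue; extend only after verification.
FINGERPRINT_PATTERNS = [r"hs-analytics", r"/akam/"]

def flag_fingerprinting_scripts(script_urls):
    """Return the subset of script URLs matching known
    fingerprinting name patterns."""
    pattern = re.compile("|".join(FINGERPRINT_PATTERNS))
    return [u for u in script_urls if pattern.search(u)]
```

The flagged scripts could then seed a behavioural heuristic (e.g. which symbols they call) that generalises beyond names.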
Build on @LABBsoft's work to evaluate prevalence in this dataset and document data deficiencies that we could maybe supplement with future crawls or related datasets.
Hi,
This is a humble request to extend the deadline by at least a day. Most of our analysis is done, but the final one is still running. The dataset is humongous, and as we had no previous experience working with big data, it's a time-consuming task.
It would be amazing if you could extend the deadline by even just a day! It would give us sufficient time to compile our final conclusion statement.
Thanks
@birdsarah todo - submit a sanitized version of the work done using WHOIS data to see if others can leverage / build upon it.
Adding additional information provided on Gitter
Develop a function for pulling out unique ids from the dataset.
We want to identify scripts that have stored / created unique ids. We also want to identify scripts that have not been storing / creating unique ids.
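A starting heuristic for "looks like a unique id", sketched here as an assumption (the regexes are illustrative, targeting the dataset's value columns):

```python
import re

# Heuristic patterns for id-like tokens: UUIDs and long hex runs.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                     r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")
HEX_RE = re.compile(r"\b[0-9a-f]{16,}\b")

def looks_like_unique_id(value):
    """Heuristic check for a stored / created unique identifier."""
    return bool(UUID_RE.search(value) or HEX_RE.search(value))
```

Applying this to stored values would partition scripts into those that appear to create ids and those that do not, which is exactly the split described above; false positives (e.g. hashes that aren't identifiers) would still need review.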
Presently only conda is mentioned. We could generate a requirements.txt or environment.txt file and let users set up the project using pip install after creating a virtual environment. This is especially helpful for Linux users.
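For illustration, a requirements.txt mirroring the pins in the existing environment.yaml might look like this (hypothetical; the exact pins would need checking against pip availability):

```text
jupyter==1.0.0
pyarrow==0.9.0
pandas==0.23.0
distributed==1.21.1
dask==0.17.5
findspark==1.2.0
tldextract==2.2.0
```

Users would then run `python3 -m venv venv`, activate it, and `pip install -r requirements.txt`.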
Add video tutorial for Dask and a cheatsheet for quick reference and overall view of the Dask workflow.