Comments (6)
Note that you cannot see the "call" being made when you look at the HTML. You can only see the request to load the script.
In the context of this dataset, "call" means individual calls to individual JavaScript APIs that are made by the script, in this case googletagmanager.js.
from overscripted.
@birdsarah I've updated the blog post to align our phrasing. Good catch.
There are field descriptions for the raw data in the schema here: https://github.com/mozilla/overscripted/blob/master/data_prep/raw_data_schema.template
There are also the additional fields in_crawl_list and in_stripped_crawl_list. These are present because I processed this data from my own copy of the data. These fields can be ignored for now.
The other additional fields could be documented in a README in the data_prep folder if you want to.
An example to help understand this more:
Below is an actual row from one of the parquet files. I have removed the columns that had no value or were redundant.
location | operation | script_col | script_line | script_url | symbol | time_stamp | value |
---|---|---|---|---|---|---|---|
https://www.syracuse.edu/about/ | get | 1402 | 95 | https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL | window.navigator.userAgent | 2017-12-16 01:27:46.738 | Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0 |
- Location: The site where the JS call was made.
- Operation: The type of call.
- Script_col, script_line: Column and line in location's html where the call was made.
- Script_url: The JS script being called.
- Symbol: JS symbol (somehow related to where in the html the call is made)
- Time_stamp: When this call was recorded during the crawl.
- Value: Value that was passed to the JS call.
Now if we go to location, we can actually see the call being made when we look at the html. Finally, the value being passed is a browser identification of some sort. You can see here that the value field is actually a common user agent. Hope this helps :)
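To make the row a bit more concrete, here is a minimal Python sketch (standard library only) that holds the row above as a dict and pulls out the page host, the script host, and the symbol that was accessed. The `describe_call` helper and the `script_on_other_host` flag are hypothetical names used only for illustration, and treating every column as a string is an assumption, not something taken from the schema.

```python
from urllib.parse import urlparse

# The example row from the table above, held as a plain dict.
# Assumption: every column is treated as a string here.
row = {
    "location": "https://www.syracuse.edu/about/",
    "operation": "get",
    "script_col": "1402",
    "script_line": "95",
    "script_url": "https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL",
    "symbol": "window.navigator.userAgent",
    "time_stamp": "2017-12-16 01:27:46.738",
    "value": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}


def describe_call(row):
    """Summarise one row: which script touched which JS symbol on which page."""
    page_host = urlparse(row["location"]).netloc
    script_host = urlparse(row["script_url"]).netloc
    return {
        "page_host": page_host,
        "script_host": script_host,
        "symbol": row["symbol"],
        "operation": row["operation"],
        # Naive host comparison for illustration only, not a proper
        # first-party / third-party check.
        "script_on_other_host": page_host != script_host,
    }


summary = describe_call(row)
print(summary)
```

Read this way, the row says that a script loaded from www.googletagmanager.com performed a `get` on `window.navigator.userAgent` while the crawler was visiting a page on www.syracuse.edu.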
It is unfortunate that the Medium blog post confuses this issue. Here's a piece of the discussion between @asquare14 and me about this on the gitter chat on Mar 11:
@aSquare14: I was reading the blog post which is mentioned in the Readme. And I have a question.
"Given the set of pages making calls to session replay providers, we also looked into the consistency of SSL usage across these calls. Interestingly, the majority of such calls were made over HTTPS (75.7%), and 49.9% of the pages making these calls were accessed over HTTPS. " What's the difference between the calls being made over HTTPS and accessed over HTTPS ? I'm a little confused.
@birdsarah: .... this sentence is unclear - usually when I'm talking about this dataset the "calls" I'm referring to are JS API calls. Those calls have no relation to http/https - that is just how a script is loaded. Having looked at the blog post again, in the previous paragraph it says "checked for calls to script URLs". In this context (and for this whole section) "calls" appears to mean accessing resources.
The Medium blog post has a commenting facility; feel free to add a clarifying comment for future readers. This happened because a variety of authors contributed to the post. But it is a clear change in the use of the word "calls", which was earlier introduced to mean JS calls.
I initially suggested to @asquare14 to post a comment on medium, but perhaps some clarification in the main README is in order.
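One possible reading of the two percentages in the quoted sentence, sketched in Python under stated assumptions: "calls made over HTTPS" looks at the scheme of the script URL, while "pages accessed over HTTPS" looks at the scheme of the page (the location column). The `https_share` helper is a hypothetical name, and every row except the first is made-up data for illustration; this is not how the blog post authors necessarily computed their numbers.

```python
from urllib.parse import urlparse

# Only the first location/script_url pair comes from the example row in this
# thread; the other two rows are invented purely for illustration.
rows = [
    {"location": "https://www.syracuse.edu/about/",
     "script_url": "https://www.googletagmanager.com/gtm.js?id=GTM-5FC97GL"},
    {"location": "http://example.com/",
     "script_url": "https://cdn.example.net/replay.js"},
    {"location": "http://example.org/",
     "script_url": "http://cdn.example.net/replay.js"},
]


def https_share(urls):
    """Fraction of the given URLs whose scheme is https."""
    urls = list(urls)
    return sum(urlparse(u).scheme == "https" for u in urls) / len(urls)


# Share of script loads that went over HTTPS (one entry per row).
script_share = https_share(r["script_url"] for r in rows)

# Share of distinct pages that were themselves accessed over HTTPS.
page_share = https_share({r["location"] for r in rows})

print(f"scripts over HTTPS: {script_share:.1%}, pages over HTTPS: {page_share:.1%}")
```

The two numbers can differ because an HTTP page can still load a script over HTTPS. Neither number says anything about the JS API calls themselves.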
@mlopatka - should we edit the medium blog post?