edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")

License: Creative Commons Attribution Share Alike 4.0 International

Topics: web-monitoring, documentation, project-management, gsoc-2017

web-monitoring's Introduction

Code of Conduct  Project Status Board

Warning

This project is no longer actively maintained. It may receive security updates, but we are no longer making major changes or improvements. EDGI no longer makes active use of this toolset and it is hard to re-deploy in other contexts.

  • Looking for tools to monitor websites? Check out our Awesome Website Change Monitoring document or issue #18, which discusses similar projects. (This project is most useful if monitoring several thousand pages in bulk, but in most cases, other existing tools will solve your needs faster and cheaper.)

  • If you have questions about this project or the code, we’re happy to respond! Check out the Get Involved section below for information about contacting EDGI members via Slack or e-mail. You can also file an issue on this repo.

  • We still actively maintain Wayback and web-monitoring-diff. While we built them as part of this project, they are in wider, more generalized use.

EDGI: Web Monitoring Project

This repository is part of EDGI's Website Governance Project and contains tools for monitoring changes to government websites, both environment-related and otherwise. It includes technical tools for:

  • Loading, storing, and analyzing historical snapshots of web pages
  • Providing an API for retrieving and updating data about those snapshots
  • Visualizing and browsing changes between those snapshots on a website
  • Managing the workflow of a team of human analysts who use the above tools to track and publicize meaningful changes to government websites

EDGI uses these tools to publish reports that are written about in major publications such as The Atlantic or Vice. Teams at other organizations use parts of this project for similar purposes or to provide comparisons between different versions of public web pages.

This project and its associated efforts already monitor tens of thousands of government web pages, but we aspire to larger impact, eventually monitoring tens of millions or more. Currently, a lot of manual labor goes into reviewing every change, whether or not it is meaningful. Any system at that scale will need to emphasize both the usability of the UI and the efficiency of its computational resources.

For a combined view of all issues and status, check the project board. This repository is for project-wide documentation and issues.

Project Structure

The technical tooling for Web Monitoring is broken up into several repositories, each named web-monitoring-{name}:

| Repo | Description | Tools Used |
|------|-------------|------------|
| web-monitoring (this repo!) | Project-wide documentation and issue tracking. | Markdown |
| web-monitoring-db | A database and API that store metadata about the pages, versions, and changes we track, as well as human annotations about those changes. | Ruby, Rails, PostgreSQL |
| web-monitoring-ui | A web-based UI (built in React) that shows diffs between different versions of the pages we track. It’s built on the API provided by web-monitoring-db. | JavaScript, React |
| web-monitoring-processing | Python-based tools for importing data and for extracting and analyzing data in our database of monitored pages and changes. | Python |
| web-monitoring-diff | Algorithms for diffing web pages in a variety of ways, plus a web server that provides those diffs via an HTTP API. | Python, Tornado |
| web-monitoring-versionista-scraper | A set of Node.js scripts that extract data from Versionista and load it into web-monitoring-db. It also generates the CSV files that analysts currently use to manage their weekly work. | Node.js |
| web-monitoring-ops | Server configuration and other deployment information for managing EDGI’s live instance of all these tools. | Kubernetes, Bash, AWS |
| wayback | A Python API to the Internet Archive’s Wayback Machine. It gives you tools to search for and load mementos (historical copies of web pages). | Python |

For more on how all these parts fit together, see ARCHITECTURE.md.

Get Involved

We’d love your help improving this project! It is a two-part effort: we rely both on open source code contributors (who build this tool) and on volunteer analysts (who use the tool to identify and characterize changes to government websites). If you are interested in getting involved…

Get involved as an analyst

  • Read through the Project Overview and especially the section on "meaningful changes" to get a better idea of the work
  • Contact us either over Slack or at [email protected] to ask for a quick training

Get involved as a programmer

  • Be sure to check our contributor guidelines
  • Take a look through the repos listed in the Project Structure section and choose one that feels appropriate to your interests and skillset
  • Try to get the repo running on your machine (and if you have any challenges, please make issues about them!)
  • Find an issue labeled good-first-issue and work to resolve it

Project Overview

Project Goals

The purpose of the system is to enable analysts to quickly review monitored government websites in order to report on meaningful changes. To do so, the system (a.k.a. Scanner) performs several major tasks:

  1. Interfaces with other archival services (like the Internet Archive) to save snapshots of web pages.
  2. Imports those snapshots and other metadata from archival sources.
  3. Determines which snapshots represent a change from a previous version of the page.
  4. Processes changes to automatically determine a priority or sift out meaningful changes for deeper analysis by humans.
  5. Supports volunteers and experts working together to further sift out meaningful changes and qualify them for journalists by writing reports.
  6. Helps journalists build narratives and amplify stories for the wider public.
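Task 3 above boils down to comparing each new snapshot against the page's previous version. A minimal sketch in Python, assuming a simple byte-level comparison (the real pipeline is more nuanced; these function names are illustrative, not from the actual codebase):

```python
import hashlib

def body_hash(body: bytes) -> str:
    """Fingerprint a snapshot body so identical captures are cheap to detect."""
    return hashlib.sha256(body).hexdigest()

def is_change(previous_body: bytes, new_body: bytes) -> bool:
    """A snapshot represents a change if its fingerprint differs from the prior version's."""
    return body_hash(previous_body) != body_hash(new_body)
```

In practice, hashes like these can be stored alongside version metadata so change detection never has to re-fetch old page bodies.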

Identifying "Meaningful Changes"

The majority of changes to web pages are not relevant and we want to avoid presenting those irrelevant changes to human analysts. Identifying irrelevant changes in an automated way is not easy, and we expect that analysts will always be involved in a decision about whether some changes are "important" or not.

However, as we expand the number of web pages we monitor, we definitely need to develop tools to reduce the number of pages that analysts must look at.

Some examples of meaningless changes:

  • It's not unusual for a page to have a view counter at the bottom. In that case, the page changes by definition every time you view it.
  • many sites have "content sliders" or news feeds that update periodically. This change may be "meaningful", in that it's interesting to see news updates. But it's only interesting once, not (as is sometimes seen) 1000 or 10000 times.
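One common way to suppress churn like the examples above is to blank out known-noisy fragments before comparing versions. A sketch, with made-up patterns (production ignore rules would be maintained and tuned separately):

```python
import re

# Illustrative ignore patterns for churn analysts do not care about
# (view counters, bare dates); these patterns are invented for this sketch.
IGNORE_PATTERNS = [
    re.compile(r"\b\d+\s+views?\b", re.IGNORECASE),  # "1234 views" counters
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),            # bare dates
]

def normalize(text: str) -> str:
    """Blank out known-noisy fragments so they never register as a change."""
    for pattern in IGNORE_PATTERNS:
        text = pattern.sub("", text)
    return text

def is_meaningful_change(old: str, new: str) -> bool:
    """Compare versions only after stripping the noise both sides share."""
    return normalize(old) != normalize(new)
```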

An example of a meaningful change:

  • In February, we noticed a systematic replacement of the word "impact" with the word "effect" on one website. This change is very interesting because while "impact" and "effect" have similar meanings, "impact" is a stronger word. So, there is an effort being made to weaken the language on existing sites. Our question is in part: what tools would we need in order to have this change flagged by our tools and presented to the analyst as potentially interesting?
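A systematic substitution like "impact" → "effect" is detectable by counting word-level replacement pairs across many page versions. A rough sketch using Python's difflib (the threshold and function names are invented for illustration):

```python
from collections import Counter
from difflib import SequenceMatcher

def replacement_pairs(old_text: str, new_text: str):
    """Yield (removed_word, inserted_word) pairs from a word-level diff."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only count one-for-one swaps, the signature of a word substitution.
        if op == "replace" and (i2 - i1) == (j2 - j1):
            yield from zip(old_words[i1:i2], new_words[j1:j2])

def systematic_substitutions(page_pairs, threshold=3):
    """Flag word swaps (e.g. 'impact' -> 'effect') seen across many pages."""
    counts = Counter()
    for old, new in page_pairs:
        counts.update(replacement_pairs(old, new))
    return {pair: n for pair, n in counts.items() if n >= threshold}
```

A swap that recurs across dozens of pages is exactly the kind of change worth surfacing to an analyst as potentially interesting.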

Sample Data

The example-data folder contains examples of website changes to use for analysis.

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributors

Individuals

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!

Contributions Name
🔢 Chris Amoss
🔢 📋 🤔 Maya Anjur-Dietrich
🔢 Marcy Beck
🔢 📋 🤔 Andrew Bergman
📖 Kelsey Breseman
🔢 Madelaine Britt
🔢 Ed Byrne
🔢 Morgan Currie
🔢 Justin Derry
🔢 📋 🤔 Gretchen Gehrke
🔢 Jon Gobeil
🔢 Pamela Jao
🔢 Sara Johns
🔢 Abby Klionski
🔢 Katherine Kulik
🔢 Aaron Lamelin
🔢 📋 🤔 Rebecca Lave
🔢 Eric Nost
📖 Karna Patel
🔢 Lindsay Poirier
🔢 📋 🤔 Toly Rinberg
🔢 Justin Schell
🔢 Lauren Scott
🤔 🔍 Nick Shapiro
🔢 Miranda Sinnott-Armstrong
🔢 Julia Upfal
🔢 Tyler Wedrosky
🔢 Adam Wizon
🔢 Jacob Wylie

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

Sponsors & Partners

Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:

License & Copyright

Copyright (C) 2017-2020 Environmental Data and Governance Initiative (EDGI)
Web Monitoring documentation in this repository is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.

Software code in other Web Monitoring repositories is generally licensed under the GPL v3 license, but make sure to check each repository’s README for specifics.


web-monitoring's Issues

Create a service to diff PDF files

We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.

This should be a simple web service that takes two query arguments:

  • a: A URL for the “before” version of the PDF
  • b: A URL for the “after” version of the PDF

It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.

If it needs to work differently to be feasible, let’s talk about it! We can make other interfaces work as long as they are accessible as a web service.
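The query interface described above is small enough to pin down independently of the diffing itself. Here is a sketch of just the request parsing, using only the standard library; a real service (e.g. one built on Tornado, like web-monitoring-diff) would fetch both URLs, render a visual diff, and return it as the response body:

```python
from urllib.parse import parse_qs, urlparse

def parse_diff_request(request_url: str) -> tuple[str, str]:
    """Extract the 'a' (before) and 'b' (after) PDF URLs from a request URL.

    Raises ValueError when a required query argument is missing, so the
    service can answer with an HTTP 400 instead of crashing mid-diff.
    """
    params = parse_qs(urlparse(request_url).query)
    try:
        return params["a"][0], params["b"][0]
    except KeyError as missing:
        raise ValueError(f"missing required query argument: {missing}") from None
```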

Some open source libraries for diffing PDFs that might be useful:

Integrate web monitoring diff database efforts

From @ambergman on February 24, 2017 9:46

Wanted to summarize my thoughts here after a great conversation with @danielballan last night and hearing about the great work he and @Mr0grog are doing to coordinate their efforts. Apologies to @Mr0grog, @danielballan, and others if this issue frustrates work at all - happy to take it down and let you all lead:

The conversation about building EDGI's web monitoring software, and a diff database in particular, has been framed by @titaniumbones, myself, and others as a migration from using Versionista to using PageFreezer's snapshots. I wanted to suggest that that may have been a mistake and that, instead of migrating, we'd actually like to integrate our two sources and build a diff database that can store data coming from Versionista, PageFreezer, and any other credible source. This will be important in the short term, since we have snapshot history going further back and at a higher frequency for many of the pages we're watching with Versionista, and we don't want to lose that information after we start using PageFreezer's snapshots. As conversations with the Internet Archive progress, we'd also like to make sure all of the material from the Wayback Machine can be read into our DB as well.

Because Versionista and PageFreezer output different data in different formats, reading in data from the two sources will require different interfaces. And so it's great that the two interfaces are being developed separately in different apps for now - see @Mr0grog's repo for the Versionista app here - and it's great that Dan and Rob are working together to determine how to combine their efforts. In both cases the html data taken in can be converted to a series of diffs, and those diffs can be stored in one big database - with one additional column to denote where the material used to produce the diff came from. Down the road, we can even decide to store diffs made from two html snapshots from two different sources - but I think we can save that for later, perhaps if we've loaded everything into one snapshot database at IA at some point.

So, in short, I think it would be great to think about how to integrate the Versionista and PageFreezer diff databases, not just migrate between them. I know I haven't been specific about interfaces at all here, so I'm sure this wasn't all that helpful in terms of considering what schema to actually use to integrate the two sources - but that's probably the topic of a series of other issues. Let me know what you all think.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#29

Need text-only differ/filter

Edit (@dcwalk) March 12, 2017:
We have one implementation of a text differ: https://github.com/edgi-govdata-archiving/web-monitoring-differ , but there are still areas to investigate.

TODOs identified in comments below:

  • Integrate webmonitoring-differ with web-monitoring-processing as a supplement to PF.
  • Continue to investigate other text diffing strategies, like the semantic one in get_article_text.

From @titaniumbones on February 21, 2017 22:30

a text-only view of changes has been identified as strongly desirable. Is get_article_text still a good place to start?

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#27

Discussion on Content Moderation

From @ChaiBapchya on March 26, 2017 16:12

While searching for tools similar to Versionista, I stumbled across a term that is different from tracking/monitoring changes yet, I think, somewhat relevant: content moderation.

As we talk about meaningful changes, tracking changes, finding diffs and prioritizing them, one of my searches led me to AWS Marketplace products such as WebPurify Image Moderation, etc which moderate web content / traffic.

Drawing an analogy to our use case, what we are doing is essentially moderating changes. Under this ambit, also consider:

  1. Smart Moderation
    Based on ML, NLP + self-learning like Human
    API Documentation

  2. Implio
    Automated (ML based) + Manual content moderation + Ability to write filters
    Choosing between ML Generic/Custom-made and Filters

  3. IOSquare
    Monitoring, Automated Analysis and Visualization
    IoSquare merged with Besedo

On the flipside

Why automation can never replace human-content-moderation!
It talks about how human intervention is critical when it comes to:

  • context-awareness
  • Subjectivity to detect subtle references used
  • Cultural references, colloquialisms, and racial slurs

What do you make of this @dcwalk @b5 @danielballan ? Have you heard about it before? Do you find this of any use? Anything worth picking up / learning from?

Copied from original issue: edgi-govdata-archiving/overview#106

abandon PageFreezer Diff API for now?

From @titaniumbones on February 10, 2017 18:20

Given the limitations (10k hits/day, 5 s/response, no batch processing), should we stop trying to use the PF diff API for the time being and maybe revisit this question later on? We can still use their API's JSON format for our own outputs if we want, or we can add the additional diff streams that @lh00000000 has talked about. Then, if we decide to go back to their format later on, we can drop their differ in as a replacement for ours.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#20

Question: Does IA expose their raw WARCs?

I can't find any way to download raw WARCs from IA. It would be nice to get the original response from the server (which could be hashed and directly compared to other harvests, like Dat's), which it seems may be stored by IA in the WARC but not exposed through their API.

attn @titaniumbones

DevOps issues from slack/mr0grog

Thought this should be archived here before it gets lost in Slack history

mr0grog APP [12:34 PM]
In the meeting today, @dcwalk asked about ops/devops issues we need help with. Sorry I haven’t ever enumerated all that super clearly, so here’s a quick run down of what’s been simmering in the back of my mind that we could really use help with (I should probably make some more GitHub issues, too):

  • We need to have sane tools for log handling, error tracking. This issue is on DB but we need the same thing for all 4 code projects: edgi-govdata-archiving/web-monitoring-db#47
  • Upgrading our CI usage: edgi-govdata-archiving/web-monitoring-db#99 and edgi-govdata-archiving/web-monitoring-ui#91
  • (No issue for this) ^ related to the above, it would be lovely to have some kind of continuous deployment setup based on passing/failing CI (Update 2019-03-10: we explicitly decided to punt on this for now)
  • This documentation for deployment is about to merge: edgi-govdata-archiving/web-monitoring-processing#72, but we’ll be creating other issues for improvements to the process it describes
  • We should have similar documentation for all projects (-versionista-scraper has it, but -ui and -db do not) (Update 2019-03-10: all this now has a home in edgi-govdata-archiving/web-monitoring-ops)
  • We should have scripted deploy tools for all (we have only for -ui)
  • If the AWS stuff comes through, we probably want to move everything over to EC2. Would love expertise and help in how best to set things up (I can do this, but am 100% sure someone who is an actual ops or devops person would do something approximately 10,000% better) (Update 2019-03-10: this has long been done.)
  • Make the deploy/test/etc. process and tools for all 4 projects reasonably similar and consistent (Update 2019-03-10: see web-monitoring-ops)
  • Automated DB exports/dumps edgi-govdata-archiving/web-monitoring-db#45
  • Adding caching tools (e.g. Varnish) or integrating memcached/redis with Rails for better caching in the API and the other services (related, but smaller scope: #59)
  • We are likely going to eventually get to adding a queueing service in order to do #62 . That will make deployment more complicated (Update 2019-03-10: this was done)

(UPDATED 2018-07-27 by @Mr0grog to cross out stuff that’s since been done)
(UPDATED 2019-03-10 by @Mr0grog)

Determine how to trigger processing/analysis when new versions are added to DB

Adding this to the main repo because it has potential implications for both -processing and -db.

Now that @janakrajchadha is starting to make some headway on analysis, we need to get serious about determining how and when to run that analysis against data that has been added to the DB.

Things that have been discussed:

  1. Have processing tools poll DB on a regular interval looking for new data. (Highest overhead, but least effort; no major new systems or features have to be built for this.)

  2. Create a message queue. DB should add items to the queue when page versions are imported. Processing tools will read from that queue. If we move processing to AWS, we can probably rely on Amazon SQS for this (if staying on GCloud, Cloud Pub/Sub). Otherwise, we’ll need to set up our own queueing service. ZeroMQ and Beanstalkd both seem like good options here because they are language-agnostic.

  3. Webhooks. Processing tools could register themselves with DB as callable webhooks.

  4. Other ideas?
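For option 1 (polling), the core bookkeeping is deciding which fetched versions haven't been processed yet. A sketch of that logic, where the field names ('uuid', 'capture_time') mirror web-monitoring-db records but should be treated as assumptions:

```python
def unseen_versions(fetched: list[dict], seen_ids: set[str]) -> list[dict]:
    """Return versions not yet processed, oldest first, and mark them as seen.

    Each polling cycle: fetch recent versions from the API, keep only the
    new ones, hand them to the analysis pipeline, and remember their IDs.
    """
    fresh = [v for v in fetched if v["uuid"] not in seen_ids]
    fresh.sort(key=lambda v: v["capture_time"])
    seen_ids.update(v["uuid"] for v in fresh)
    return fresh
```

Options 2 and 3 replace this pull-based bookkeeping with push delivery (the queue or webhook does the "what's new" accounting for you), at the cost of more infrastructure.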

Hidden changes and Derived changes

While thinking about possible loopholes or drawbacks in our current web monitoring / change detection methodology, I identified two possibilities.

  1. Hidden changes
    Changes that get overlooked or missed by the diff mechanism of PageFreezer / Versionista. Such changes can be potentially harmful and critical.

  2. Derived changes
    Changes that have a cascading effect and thus lead to other changes. Such derived changes may be overlooked.

For example, consider the statement:

"This entire section belongs to March 2010 and updated code lies on page - http://xyz.com"

As a result, our copy needs to be updated with the content from the new URL. I suspect the current infrastructure doesn't equip us to take in information from such derived links.

I understand this might be a far off scenario. But reporting the possibility nonetheless.

QuickStart: What I need to do to start monitoring website X

  1. What environment is needed?
  2. How do we set up all the applications?
  3. How do we configure monitoring of website X?
  4. TODO: How do we set up the UI?
  5. View the beginning of the dev video

Output: QUICKSTART.md

Links for further onboarding:

GSoC Report for Week 5 of Phase 1

GSoC Report for Week 5 of Phase 1

I’m excited to have made progress on the documentation of the data format of the different sources. An initial draft of the documentation has been pushed to the web monitoring processing repository. I've also added a supporting notebook which displays actual examples.

I worked on the following issues:

I closed the following issues:

The following PR was merged:

Research similar projects

In our 2017-03-11 Dev standup, the question was raised about what comparable projects are out there. We should compile a list and pay attention to their features/implementation specifics.
@ambergman mentioned Klaxon as one.

This could also be a great first-timer issue: we could collect those projects and document important details.

GSoC Report for Week 1 of Phase 3

GSoC Report for Week 1 of Phase 3

I’m excited to have made progress on the methods for computing differences. I have found a few possible ways to add HTML diffs for rendering and I'm testing them. I have also fixed issues in the Filtering Work PR.
Alongside this, I have worked on different issues, reviewed PRs, and started discussion around the next part of the project on Slack.

I worked on the following issues:

I have fixed the following issues:

Work on this PR was completed:

I have reviewed the following PRs:

Add 1.0 target label?

Can we add a "1.0" target to identify those issues which need to be resolved for a 1.0 release (hoping for say June 1)?

Tagging

Via @trinberg, possible categories of tags for a Page (not to be confused with an Annotation):

  • Categorizing by agency, office.
  • What is the content related to?
  • A tag for categorizing kinds of change (shift in business-focused language)

A point from @dcwalk: try to use an established controlled vocabulary.

Discussion on Architecture diagram

With reference to discussion in edgi-govdata-archiving/overview#102
After surveying the entire web monitoring project with respect to data wrangling, from extraction from disparate sources (Versionista, web archive, PageFreezer) to storage in a database and visualization in the web app (UI), I came up with this view.
Let me know your ideas and opinions on it. I'm more than willing to make the requisite changes and add it to the documentation for everyone's ease of reference. @dcwalk @b5

Architecture diagram

edgi-architecture-diagram

Essential Diffing Tools for v0

The minimum set of diff views that need to be available:

  • Source diffs (i.e. the easiest possible thing)
  • Diffs of visible text content only (subtle, but a simple approach is enough to start)
  • Side-by-side comparisons of the rendered pages

This is according to @ambergman and generally fits in with the consensus view I've heard.
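The second bullet (diffs of visible text only) can be prototyped with the standard library alone: strip markup with html.parser, then diff the remaining text. A simplistic sketch that a real implementation would need to harden considerably:

```python
import difflib
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> list[str]:
    parser = _TextExtractor()
    parser.feed(html)
    return parser.chunks

def text_only_diff(old_html: str, new_html: str) -> list[str]:
    """Unified diff of just the visible text of two page versions."""
    return list(difflib.unified_diff(visible_text(old_html),
                                     visible_text(new_html), lineterm=""))
```

An empty result means the two versions differ only in markup, scripts, or styling, which is exactly the churn a text-only view is meant to hide.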

Use PageFreezer archives to generate a diff for the PageFreezer-Outputter

From @ambergman on February 10, 2017 8:17

Build a simple module that takes as input a URL and two timestamps (referring to two different archived versions of a page) and outputs a diff for the PageFreezer-Outputter. This can be done by pinging PageFreezer's diff server or using another module. See a discussion of what we know about PageFreezer's diff server and other diff services in the pagefreezer-clim README

This module, being developed by @allanpichardo, will likely be perfect for completing this issue.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#18

GSoC Report for Week 2 of Phase 3

GSoC Report for Week 2 of Phase 3

I’m excited to have made progress on HTML diff for rendering and mapping out a plan for prioritization of important changes.

I worked on the following issues:

The following issues were closed:

The following PRs were merged:

I have opened the following PR:

Moving ahead, I plan to work on the classification of text as significant/insignificant. I'm going through the important changes spreadsheet and plan to discuss ways to classify changes in a call today.

Comments on Trello board: EDGI: Web Monitoring Project - Onboarding

Ping @patcon

Describe Page Freezer zip formats

From @titaniumbones on February 2, 2017 14:35

Each Page Freezer archive for a domain BASEURL consists of a zipfile with the following structure:

Base URL is
storage/export/climate/BASEURL_NUMERICALID_YYYY-MM-DD/http[s]_URL/

inside this you'll find a file:
http[s]_URL_MM_DD_YYYY.xml

and a directory
MM_DD_YYYY/
potentially containing multiple subdirs of the form:
http[s]_URL
where URL is either the BASEURL or an external domain containing resources linked from BASEURL pages.

The first thing we need to do is understand the XML format. Here's an example document (rename it with the .xml extension; GitHub doesn't allow uploading XML files):
http_www.climateandsatellites.org_01_20_2017.txt
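Based on the naming convention above, the per-snapshot file names can be parsed mechanically. A sketch (the regex encodes my reading of the layout described here, so verify it against real archives):

```python
import re

# Assumed pattern: scheme, underscore-joined URL, then MM_DD_YYYY and the
# .xml extension (.txt also accepted, since the sample above was renamed).
_SNAPSHOT_NAME = re.compile(
    r"^(?P<scheme>https?)_(?P<url>.+)_"
    r"(?P<month>\d{2})_(?P<day>\d{2})_(?P<year>\d{4})\.(?:xml|txt)$"
)

def parse_snapshot_name(filename: str):
    """Split a per-snapshot file name into (scheme, url, ISO date)."""
    match = _SNAPSHOT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a PageFreezer snapshot name: {filename}")
    return (match["scheme"], match["url"],
            f"{match['year']}-{match['month']}-{match['day']}")
```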

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#2

Move over important issues from web-monitoring-ui

With the repo reorg we have some issues that have been left behind in https://github.com/edgi-govdata-archiving/web-monitoring-ui/issues.

I'd like to migrate them all over here, close whichever ones are no longer needed, and also set up larger milestones (like https://github.com/edgi-govdata-archiving/web-monitoring-ui/milestone/1) in this repo going forward.

@titaniumbones and @danielballan what is your take? I don't think many issues will remain open, but it would be nice to maintain that paper trail

GSoC Report for Week 3 of Phase 2

GSoC Report for Week 3 of Phase 2

I’m excited to have made progress on the filters created to segregate the irrelevant changes which will help analysts focus on the important ones. I'm also excited to have made progress on caching of diff results from different services to reduce the loading time for repeated access.

I worked on the following issues:

I closed the following issues:

The following PR was opened:

An impediment I faced is the uncertainty around the diffing services we plan to use in the future but I have discussed it with Mike, Kyala, Rob, and Toly and I have a better idea of the direction I want to move in.

I have gone through the current process followed by the analysts and I've also seen the developments on the dev side. I hope that with the inclusion of more diffing services, the things that I'm working on will be tried and tested by the analysts and I'll be able to incorporate feedback from them.

Update Trello onboarding with newer videos

@patcon, I'm not sure if you're the only one with the ability to do this, but I just saw you make some changes in this PR.

Can you update the following videos as well?
Demo: Use the new one with Rob https://www.youtube.com/watch?v=lQvpprUn8A8
Also new url for demo: http://web-monitoring-ui-staging.herokuapp.com/

New Analyst training (1/2): https://www.youtube.com/watch?v=1FNi4lfsY-k

Also, the user interview with Maya is a lot more interesting and pertinent now, than the UI+analysts videos: https://www.youtube.com/watch?v=xTN3jOqIXGM (Also, automagic computer thumbnail FTW here. Still wouldn't want computers to decide this all the time though 😉 )

Create sample dataset for machine learning projects

If possible, we should create a sample dataset drawn from the analysis team's records (collected January to February). Then we can start to implement some simple rules and see if they help us identify significant changes at a larger scale.

GSoC Report for Week 1 of Phase 2

GSoC Report for Week 1 of Phase 2

I’m excited to have made progress on exploring the diffs for various websites. I've looked at 5 different agency websites and have gone through changes between the webpages back in the late 90s and recent ones as well. I've kept all the diff csvs in a Google Drive folder for future use.

I worked on the following issues:

I completed the following issues:

Add functionality to get cabinet ID of a specific URL

Iterating through all cabinets to find a specific archive is a cumbersome task. Adding a function for this will make it easy for developers and analysts to find the archives of a specific domain/site when the need arises. It should be added to the file pf_edgi.py by @danielballan, which already includes various functions for using the PF API efficiently.

Designing Tentative Database Schema

Having gone through the architecture, I realized that the unknown database schema is a glaring hole that needs to be sorted out, and I'd like to work on creating it.

Decisions to be taken:

  • What type of model to choose (e.g. an entity-relationship model)
  • What kind of database to use: relational, NoSQL, or Hadoop/Spark (big data)

A basic template (that quickly comes to mind):

  • Page name
  • Website (to which the page belongs)
  • Page ID (unique identifier)
  • Previous state
  • Current state
  • List of previous states

GSoC Report for Week 2 of Phase 2

GSoC Report for Week 2 of Phase 2

I’m excited to have made progress on creating functions that characterize page elements for use in filtration.

Unfortunately, due to a health problem, I was unavailable for most of the week.

I worked on the following issues:

I've also been trying to work out how PageFreezer computes its 'Delta' score by fabricating examples.

I have discussed the possibility of setting up a meeting with folks from PageFreezer and I'm hoping we are able to set one up soon.

I will be focusing on the issues that follow the aforementioned ones, and will also experiment with and add a few things to our own diffing service, created by Dan.

GSoC Report for Week 4 of Phase 2

GSoC Report for Week 4 of Phase 2

I’m excited to have made progress on the filtering work. I have tested and pushed code which should be able to handle the tagging of different types of irrelevant changes. I have also fixed diff-match-patch issues along with Rob. I'm also excited to have made progress on creating new methods for computing differences. I have discussed a few ideas with Rob and we should have new functions soon.

I have finished working on the following issues:

I started working on the following issue:

I have put the following issues on hold for now:

The following PR was updated:

I'm working on pushing the filtering to the staging db and hopefully will get some analysts to test it once it has been done.

Build PageFreezer-Outputter that fits into current Versionista workflow

From @ambergman on February 10, 2017 8:05

To replicate the Versionista workflow with the new PageFreezer archives, we need a little module that takes as input a diff already returned (either by PageFreezer's server or another diff service we build), and simply outputs a row in a CSV, as our versionista-outputter already does. If a particular URL has not been altered, then the diff being returned as input to this module should be null, and no row should be output. Please see @danielballan's issue summarizing the Versionista workflow for reference.

I'll follow up soon with a list of the current columns being output to the CSV by the versionista-outputter, a short description of how the analysts use the CSV they're working with, and some screenshots of what everything looks like, for clarity.
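The module's contract (diff in, CSV row out, no row for a null diff) could be sketched like this; the column names are placeholder stand-ins pending the promised list of the versionista-outputter's real columns:

```python
import csv
import io

def diff_to_csv_row(url, diff):
    """One CSV row per changed page; a null diff means no row at all.

    Placeholder columns: URL, old capture time, new capture time, diff link.
    """
    if diff is None:
        return None
    return [url, diff["old_time"], diff["new_time"], diff["diff_url"]]

def write_rows(rows, stream):
    """Write only the non-null rows, mirroring the 'no change, no row' rule."""
    writer = csv.writer(stream)
    for row in rows:
        if row is not None:
            writer.writerow(row)
```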

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#17

Incorporate Cluster in the schema

An aspect of the conversation in SF that I didn't fully absorb until chatting with @aleatha is the importance of surfacing a Cluster of related Changes as a concept in the database and in the UI.

The app should request a set of Clusters of Changes for a user to check. The UI should present one representative Change with the option of drilling down into the rest. Here's one way we could adjust the schema. I haven't thought about this very long -- just trying to kick off discussion:

Cluster
    uuid
    priority
    created_at
    updated_at

Then adjust the Change table to add cluster_uuid (starts as NULL, is assigned later by an ETL job) and remove priority, which is a property of a Cluster, not a Change.

Where to put Annotation is a sticky question: does it belong to a Cluster or a Change? I think it's analogous to the (famously sticky) problem of regularly-occurring events on a calendar. If an Annotation belongs to a Change, then it's easy to customize individual Annotations when needed but it's hard to safely update the whole set. If Annotation belongs to a Cluster, we'd need to provide some UI for subdividing errant clusters into sub-clusters. My guess is that leaving an Annotation as a property of a Change is the easier place to start.
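The proposed adjustment can be written out as a quick modeling sketch (Python dataclasses here, standing in for the actual Rails/Postgres schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class Cluster:
    """Priority lives on the Cluster, not on individual Changes."""
    uuid: UUID = field(default_factory=uuid4)
    priority: float = 0.0
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

@dataclass
class Change:
    """cluster_uuid starts as None and is assigned later by an ETL job."""
    uuid: UUID = field(default_factory=uuid4)
    cluster_uuid: Optional[UUID] = None
```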
