edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")

License: Creative Commons Attribution Share Alike 4.0 International

Topics: web-monitoring, documentation, project-management, gsoc-2017

web-monitoring's Introduction

Code of Conduct  Project Status Board

Warning

This project is no longer actively maintained. It may receive security updates, but we are no longer making major changes or improvements. EDGI no longer makes active use of this toolset and it is hard to re-deploy in other contexts.

  • Looking for tools to monitor websites? Check out our Awesome Website Change Monitoring document or issue #18, which discusses similar projects. (This project is most useful if monitoring several thousand pages in bulk, but in most cases, other existing tools will solve your needs faster and cheaper.)

  • If you have questions about this project or the code, we’re happy to respond! Check out the Get Involved section below for information about contacting EDGI members via Slack or e-mail. You can also file an issue on this repo.

  • We still actively maintain Wayback and web-monitoring-diff. While we built them as part of this project, they are in wider, more generalized use.

EDGI: Web Monitoring Project

This repository is part of EDGI's Website Governance Project and contains tools for monitoring changes to government websites, both environment-related and otherwise. It includes technical tools for:

  • Loading, storing, and analyzing historical snapshots of web pages
  • Providing an API for retrieving and updating data about those snapshots
  • Visualizing and browsing changes between those snapshots on a website
  • Managing the workflow of a team of human analysts who use the above tools to track and publicize meaningful changes to government websites

EDGI uses these tools to publish reports that are written about in major publications such as The Atlantic or Vice. Teams at other organizations use parts of this project for similar purposes or to provide comparisons between different versions of public web pages.

This project and its associated efforts already monitor tens of thousands of government web pages, but we aspire to larger impact, eventually monitoring tens of millions or more. Currently, a lot of manual labor goes into reviewing every change, whether or not it is meaningful. Any system at that scale will need to emphasize both the usability of the UI and the efficiency of its computational resources.

For a combined view of all issues and status, check the project board. This repository is for project-wide documentation and issues.

Project Structure

The technical tooling for Web Monitoring is broken up into several repositories, each named web-monitoring-{name}:

| Repo | Description | Tools Used |
|------|-------------|------------|
| web-monitoring (this repo!) | Project-wide documentation and issue tracking. | Markdown |
| web-monitoring-db | A database and API that store metadata about the pages, versions, and changes we track, as well as human annotations about those changes. | Ruby, Rails, PostgreSQL |
| web-monitoring-ui | A web-based UI (built in React) that shows diffs between different versions of the pages we track. It’s built on the API provided by web-monitoring-db. | JavaScript, React |
| web-monitoring-processing | Python-based tools for importing data and for extracting and analyzing data in our database of monitored pages and changes. | Python |
| web-monitoring-diff | Algorithms for diffing web pages in a variety of ways, plus a web server that provides those diffs via an HTTP API. | Python, Tornado |
| web-monitoring-versionista-scraper | A set of Node.js scripts that extract data from Versionista and load it into web-monitoring-db. It also generates the CSV files that analysts currently use to manage their weekly work. | Node.js |
| web-monitoring-ops | Server configuration and other deployment information for managing EDGI’s live instance of all these tools. | Kubernetes, Bash, AWS |
| wayback | A Python API to the Internet Archive’s Wayback Machine. It gives you tools to search for and load mementos (historical copies of web pages). | Python |

For more on how all these parts fit together, see ARCHITECTURE.md.

Get Involved

We’d love your help improving this project! It is a two-part effort: we rely both on open source code contributors (who build this tool) and on volunteer analysts (who use the tool to identify and characterize changes to government websites). If you are interested in getting involved…

Get involved as an analyst

  • Read through the Project Overview and especially the section on "meaningful changes" to get a better idea of the work
  • Contact us either over Slack or at [email protected] to ask for a quick training

Get involved as a programmer

  • Be sure to check our contributor guidelines
  • Take a look through the repos listed in the Project Structure section and choose one that feels appropriate to your interests and skillset
  • Try to get the repo running on your machine (and if you have any challenges, please make issues about them!)
  • Find an issue labeled good-first-issue and work to resolve it

Project Overview

Project Goals

The purpose of the system is to enable analysts to quickly review monitored government websites in order to report on meaningful changes. To do so, the system (a.k.a. Scanner) performs several major tasks:

  1. Interfaces with other archival services (like the Internet Archive) to save snapshots of web pages.
  2. Imports those snapshots and other metadata from archival sources.
  3. Determines which snapshots represent a change from a previous version of the page.
  4. Processes changes to automatically determine a priority or sift out meaningful changes for deeper analysis by humans.
  5. Supports volunteers and experts working together to further sift out meaningful changes and qualify them for journalists by writing reports.
  6. Helps journalists build narratives and amplify stories for the wider public.
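Task 3 above boils down to comparing each new snapshot against the page's previous version. A minimal sketch in Python, assuming a simple byte-level comparison (the real pipeline is more nuanced; these function names are illustrative, not from the actual codebase):

```python
import hashlib

def body_hash(body: bytes) -> str:
    """Fingerprint a snapshot body so identical captures are cheap to detect."""
    return hashlib.sha256(body).hexdigest()

def is_change(previous_body: bytes, new_body: bytes) -> bool:
    """A snapshot represents a change if its fingerprint differs from the prior version's."""
    return body_hash(previous_body) != body_hash(new_body)
```

In practice, hashes like these can be stored alongside version metadata so change detection never has to re-fetch old page bodies.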

Identifying "Meaningful Changes"

The majority of changes to web pages are not relevant and we want to avoid presenting those irrelevant changes to human analysts. Identifying irrelevant changes in an automated way is not easy, and we expect that analysts will always be involved in a decision about whether some changes are "important" or not.

However, as we expand the number of web pages we monitor, we definitely need to develop tools to reduce the number of pages that analysts must look at.

Some examples of meaningless changes:

  • It's not unusual for a page to have a view counter at the bottom. In that case, the page changes by definition every time you view it.
  • many sites have "content sliders" or news feeds that update periodically. This change may be "meaningful", in that it's interesting to see news updates. But it's only interesting once, not (as is sometimes seen) 1000 or 10000 times.
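One common way to suppress churn like the examples above is to blank out known-noisy fragments before comparing versions. A sketch, with made-up patterns (production ignore rules would be maintained and tuned separately):

```python
import re

# Illustrative ignore patterns for churn analysts do not care about
# (view counters, bare dates); these patterns are invented for this sketch.
IGNORE_PATTERNS = [
    re.compile(r"\b\d+\s+views?\b", re.IGNORECASE),  # "1234 views" counters
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),            # bare dates
]

def normalize(text: str) -> str:
    """Blank out known-noisy fragments so they never register as a change."""
    for pattern in IGNORE_PATTERNS:
        text = pattern.sub("", text)
    return text

def is_meaningful_change(old: str, new: str) -> bool:
    """Compare versions only after stripping the noise both sides share."""
    return normalize(old) != normalize(new)
```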

An example of a meaningful change:

  • In February, we noticed a systematic replacement of the word "impact" with the word "effect" on one website. This change is very interesting because while "impact" and "effect" have similar meanings, "impact" is a stronger word. So, there is an effort being made to weaken the language on existing sites. Our question is in part: what tools would we need in order to have this change flagged by our tools and presented to the analyst as potentially interesting?
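A systematic substitution like "impact" → "effect" is detectable by counting word-level replacement pairs across many page versions. A rough sketch using Python's difflib (the threshold and function names are invented for illustration):

```python
from collections import Counter
from difflib import SequenceMatcher

def replacement_pairs(old_text: str, new_text: str):
    """Yield (removed_word, inserted_word) pairs from a word-level diff."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only count one-for-one swaps, the signature of a word substitution.
        if op == "replace" and (i2 - i1) == (j2 - j1):
            yield from zip(old_words[i1:i2], new_words[j1:j2])

def systematic_substitutions(page_pairs, threshold=3):
    """Flag word swaps (e.g. 'impact' -> 'effect') seen across many pages."""
    counts = Counter()
    for old, new in page_pairs:
        counts.update(replacement_pairs(old, new))
    return {pair: n for pair, n in counts.items() if n >= threshold}
```

A swap that recurs across dozens of pages is exactly the kind of change worth surfacing to an analyst as potentially interesting.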

Sample Data

The example-data folder contains examples of website changes to use for analysis.

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributors

Individuals

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!

Contributions Name
🔢 Chris Amoss
🔢 📋 🤔 Maya Anjur-Dietrich
🔢 Marcy Beck
🔢 📋 🤔 Andrew Bergman
📖 Kelsey Breseman
🔢 Madelaine Britt
🔢 Ed Byrne
🔢 Morgan Currie
🔢 Justin Derry
🔢 📋 🤔 Gretchen Gehrke
🔢 Jon Gobeil
🔢 Pamela Jao
🔢 Sara Johns
🔢 Abby Klionski
🔢 Katherine Kulik
🔢 Aaron Lamelin
🔢 📋 🤔 Rebecca Lave
🔢 Eric Nost
📖 Karna Patel
🔢 Lindsay Poirier
🔢 📋 🤔 Toly Rinberg
🔢 Justin Schell
🔢 Lauren Scott
🤔 🔍 Nick Shapiro
🔢 Miranda Sinnott-Armstrong
🔢 Julia Upfal
🔢 Tyler Wedrosky
🔢 Adam Wizon
🔢 Jacob Wylie

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

Sponsors & Partners

Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:

License & Copyright

Copyright (C) 2017-2020 Environmental Data and Governance Initiative (EDGI)
Web Monitoring documentation in this repository is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.

Software code in other Web Monitoring repositories is generally licensed under the GPL v3 license, but make sure to check each repository’s README for specifics.


web-monitoring's Issues

Create a service to diff PDF files

We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.

This should be a simple web service that takes two query arguments:

  • a: A URL for the “before” version of the PDF
  • b: A URL for the “after” version of the PDF

It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.

If it needs to work differently to be feasible, let’s talk about it! We can make other interfaces work as long as they are accessible as a web service.
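The query interface described above is small enough to pin down independently of the diffing itself. Here is a sketch of just the request parsing, using only the standard library; a real service (e.g. one built on Tornado, like web-monitoring-diff) would fetch both URLs, render a visual diff, and return it as the response body:

```python
from urllib.parse import parse_qs, urlparse

def parse_diff_request(request_url: str) -> tuple[str, str]:
    """Extract the 'a' (before) and 'b' (after) PDF URLs from a request URL.

    Raises ValueError when a required query argument is missing, so the
    service can answer with an HTTP 400 instead of crashing mid-diff.
    """
    params = parse_qs(urlparse(request_url).query)
    try:
        return params["a"][0], params["b"][0]
    except KeyError as missing:
        raise ValueError(f"missing required query argument: {missing}") from None
```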

Some open source libraries for diffing PDFs that might be useful:

Integrate web monitoring diff database efforts

From @ambergman on February 24, 2017 9:46

Wanted to summarize my thoughts here after a great conversation with @danielballan last night and hearing about the great work he and @Mr0grog are doing to coordinate their efforts. Apologies to @Mr0grog, @danielballan, and others if this issue frustrates work at all - happy to take it down and let you all lead:

The conversation about building EDGI's web monitoring software, and a diff database in particular, has been framed by @titaniumbones, myself, and others as a migration from using Versionista to using PageFreezer's snapshots. I wanted to suggest that that may have been a mistake and that, instead of migrating, we'd actually like to integrate our two sources and build a diff database that can store data coming from Versionista, PageFreezer, and any other credible source. This will be important in the short term, since we have snapshot history going further back and at a higher frequency for many of the pages we're watching with Versionista, and we don't want to lose that information after we start using PageFreezer's snapshots. As conversations with the Internet Archive progress, we'd also like to make sure all of the material from the Wayback Machine can be read into our DB as well.

Because Versionista and PageFreezer output different data in different formats, reading in data from the two sources will require different interfaces. And so it's great that the two interfaces are being developed separately in different apps for now - see @Mr0grog's repo for the Versionista app here - and it's great that Dan and Rob are working together to determine how to combine their efforts. In both cases the html data taken in can be converted to a series of diffs, and those diffs can be stored in one big database - with one additional column to denote where the material used to produce the diff came from. Down the road, we can even decide to store diffs made from two html snapshots from two different sources - but I think we can save that for later, perhaps if we've loaded everything into one snapshot database at IA at some point.

So, in short, I think it would be great to think about how to integrate the Versionista and PageFreezer diff databases, not just migrate between them. I know I haven't been specific about interfaces at all here, so I'm sure this wasn't all that helpful in terms of considering what schema to actually use to integrate the two sources - but that's probably the topic of a series of other issues. Let me know what you all think.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#29

Need text-only differ/filter

Edit (@dcwalk) March 12, 2017:
We have one implementation of a text differ: https://github.com/edgi-govdata-archiving/web-monitoring-differ , but there are still areas to investigate.

TODOs identified in comments below:

  • Integrate webmonitoring-differ with web-monitoring-processing as a supplement to PF.
  • Continue to investigate other text diffing strategies, like the semantic one in get_article_text.

From @titaniumbones on February 21, 2017 22:30

a text-only view of changes has been identified as strongly desirable. Is get_article_text still a good place to start?

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#27

Discussion on Content Moderation

From @ChaiBapchya on March 26, 2017 16:12

While searching for tools similar to Versionista, I stumbled across a term that is different from tracking/monitoring changes yet, I think, somewhat relevant: content moderation.

As we talk about meaningful changes, tracking changes, finding diffs and prioritizing them, one of my searches led me to AWS Marketplace products such as WebPurify Image Moderation, etc which moderate web content / traffic.

Drawing an analogy to our use case, what we are doing is essentially moderating changes. Under this ambit, also consider:

  1. Smart Moderation
    Based on ML, NLP + self-learning like Human
    API Documentation

  2. Implio
    Automated (ML based) + Manual content moderation + Ability to write filters
    Choosing between ML Generic/Custom-made and Filters

  3. IOSquare
    Monitoring, Automated Analysis and Visualization
    IoSquare merged with Besedo

On the flipside

Why automation can never replace human-content-moderation!
It talks about how human intervention is critical when it comes to:

  • context-awareness
  • Subjectivity to detect subtle references used
  • Cultural references, colloquialisms, and racial slurs

What do you make of this @dcwalk @b5 @danielballan ? Have you heard about it before? Do you find this of any use? Anything worth picking up / learning from?

Copied from original issue: edgi-govdata-archiving/overview#106

abandon PageFreezer Diff API for now?

From @titaniumbones on February 10, 2017 18:20

Given the limitations (10k hits/day, 5 s/response, no batch processing), should we stop trying to use the PF diff API for the time being and maybe revisit this question later on? We can still use their API's JSON format for our own outputs if we want, or we can add the additional diff streams that @lh00000000 has talked about. Then, if we decide to go back to their format later on, we can drop their differ in as a replacement for ours.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#20

Question: Does IA expose their raw WARCs?

I can't find any way to download raw WARCs from IA. It would be nice to get the original response from the server (which could be hashed and directly compared to other harvests, like Dat's), which it seems may be stored by IA in the WARC but not exposed through their API.

attn @titaniumbones

DevOps issues from slack/mr0grog

Thought this should be archived here before it gets lost in Slack history

mr0grog APP [12:34 PM]
In the meeting today, @dcwalk asked about ops/devops issues we need help with. Sorry I haven’t ever enumerated all that super clearly, so here’s a quick run down of what’s been simmering in the back of my mind that we could really use help with (I should probably make some more GitHub issues, too):

  • We need to have sane tools for log handling, error tracking. This issue is on DB but we need the same thing for all 4 code projects: edgi-govdata-archiving/web-monitoring-db#47
  • Upgrading our CI usage: edgi-govdata-archiving/web-monitoring-db#99 and edgi-govdata-archiving/web-monitoring-ui#91
  • (No issue for this) ^ related to the above, it would be lovely to have some kind of continuous deployment setup based on passing/failing CI (Update 2019-03-10: we explicitly decided to punt on this for now)
  • This documentation for deployment is about to merge: edgi-govdata-archiving/web-monitoring-processing#72, but we’ll be creating other issues for improvements to the process it describes
  • We should have similar documentation for all projects (-versionista-scraper has it, but -ui and -db do not) (Update 2019-03-10: all this now has a home in edgi-govdata-archiving/web-monitoring-ops)
  • We should have scripted deploy tools for all (we have only for -ui)
  • If the AWS stuff comes through, we probably want to move everything over to EC2. Would love expertise and help in how best to set things up (I can do this, but am 100% sure someone who is an actual ops or devops person would do something approximately 10,000% better) (Update 2019-03-10: this has long been done.)
  • Make the deploy/test/etc. process and tools for all 4 projects reasonably similar and consistent (Update 2019-03-10: see web-monitoring-ops)
  • Automated DB exports/dumps edgi-govdata-archiving/web-monitoring-db#45
  • Adding caching tools (e.g. Varnish) or integrating memcached/redis with Rails for better caching in the API and the other services (related, but smaller scope: #59)
  • We are likely going to eventually get to adding a queueing service in order to do #62 . That will make deployment more complicated (Update 2019-03-10: this was done)

(UPDATED 2018-07-27 by @Mr0grog to cross out stuff that’s since been done)
(UPDATED 2019-03-10 by @Mr0grog)

Determine how to trigger processing/analysis when new versions are added to DB

Adding this to the main repo because it has potential implications for both -processing and -db.

Now that @janakrajchadha is starting to make some headway on analysis, we need to get serious about determining how and when to run that analysis against data that has been added to the DB.

Things that have been discussed:

  1. Have processing tools poll DB on a regular interval looking for new data. (Highest overhead, but least effort; no major new systems or features have to be built for this.)

  2. Create a message queue. DB should add items to the queue when page versions are imported. Processing tools will read from that queue. If we move processing to AWS, we can probably rely on Amazon SQS for this (if staying on GCloud, Cloud Pub/Sub). Otherwise, we’ll need to set up our own queueing service. ZeroMQ and Beanstalkd both seem like good options here because they are language-agnostic.

  3. Webhooks. Processing tools could register themselves with DB as callable webhooks.

  4. Other ideas?
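For option 1 (polling), the core bookkeeping is deciding which fetched versions haven't been processed yet. A sketch of that logic, where the field names ('uuid', 'capture_time') mirror web-monitoring-db records but should be treated as assumptions:

```python
def unseen_versions(fetched: list[dict], seen_ids: set[str]) -> list[dict]:
    """Return versions not yet processed, oldest first, and mark them as seen.

    Each polling cycle: fetch recent versions from the API, keep only the
    new ones, hand them to the analysis pipeline, and remember their IDs.
    """
    fresh = [v for v in fetched if v["uuid"] not in seen_ids]
    fresh.sort(key=lambda v: v["capture_time"])
    seen_ids.update(v["uuid"] for v in fresh)
    return fresh
```

Options 2 and 3 replace this pull-based bookkeeping with push delivery (the queue or webhook does the "what's new" accounting for you), at the cost of more infrastructure.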

Hidden changes and Derived changes

While thinking about possible loopholes or drawbacks in our current web monitoring / change detection methodology, I identified two possibilities.

  1. Hidden changes
    Changes that get overlooked or missed by the diff mechanism of PageFreezer / Versionista. Such changes can be potentially harmful and critical.

  2. Derived changes
    Changes that have a cascading effect and thus lead to other changes. Such derived changes may be overlooked.

For example, consider the statement:

"This entire section belongs to March 2010 and updated code lies on page - http://xyz.com"

As a result, our copy needs to be updated with the content from the new URL. I suspect the current infrastructure doesn't equip us to take in information from such derived links.

I understand this might be a far off scenario. But reporting the possibility nonetheless.

QuickStart: What I need to do to start monitoring website X

  1. What environment is needed?
  2. How do we set up all the applications?
  3. How do we configure monitoring of website X?
  4. TODO: How do we set up the UI?
  5. View the beginning of the dev video

Output: QUICKSTART.md

Links for further onboarding:

GSoC Report for Week 5 of Phase 1

GSoC Report for Week 5 of Phase 1

I’m excited to have made progress on the documentation of the data format of the different sources. An initial draft of the documentation has been pushed to the web monitoring processing repository. I've also added a supporting notebook which displays actual examples.

I worked on the following issues:

I closed the following issues:

The following PR was merged:

Research similar projects

In our 2017-03-11 Dev standup, the question was raised about what comparable projects are out there. We should compile a list and pay attention to their features/implementation specifics.
@ambergman mentioned Klaxon as one.

This could also be a great first-timer issue: we could collect those projects and document important details.

GSoC Report for Week 1 of Phase 3

GSoC Report for Week 1 of Phase 3

I’m excited to have made progress on the methods for computing differences. I have found a few possible ways to add HTML diffs for rendering and I'm testing them. I have also fixed issues in the Filtering Work PR.
Alongside this, I have worked on different issues, reviewed PRs, and started discussion around the next part of the project on Slack.

I worked on the following issues:

I have fixed the following issues:

Work on this PR was completed:

I have reviewed the following PRs:

Add 1.0 target label?

Can we add a "1.0" target to identify those issues which need to be resolved for a 1.0 release (hoping for say June 1)?

Tagging

Via @trinberg, possible categories of tags for a Page (not to be confused with an Annotation):

  • Categorizing by agency, office.
  • What is the content related to?
  • A tag for categorizing kinds of change (shift in business-focused language)

A point from @dcwalk: try to use an established controlled vocabulary.

Discussion on Architecture diagram

With reference to discussion in edgi-govdata-archiving/overview#102
After surveying the entire web monitoring project with respect to data wrangling, from extraction from disparate sources (Versionista, web archive, PageFreezer) to storage in a database and visualization in the web app (UI), I came up with this view.
Let me know your ideas and opinions on it. I'm more than willing to make the requisite changes and add it to the documentation for everyone's ease of reference. @dcwalk @b5

Architecture diagram

edgi-architecture-diagram

Essential Diffing Tools for v0

The minimum set of diff views that need to be available:

  • Source diffs (i.e. the easiest possible thing)
  • Diffs of visible text content only (subtle, but a simple approach is enough to start)
  • Side-by-side comparisons of the rendered pages

This is according to @ambergman and generally fits in with the consensus view I've heard.
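The second bullet (diffs of visible text only) can be prototyped with the standard library alone: strip markup with html.parser, then diff the remaining text. A simplistic sketch that a real implementation would need to harden considerably:

```python
import difflib
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> list[str]:
    parser = _TextExtractor()
    parser.feed(html)
    return parser.chunks

def text_only_diff(old_html: str, new_html: str) -> list[str]:
    """Unified diff of just the visible text of two page versions."""
    return list(difflib.unified_diff(visible_text(old_html),
                                     visible_text(new_html), lineterm=""))
```

An empty result means the two versions differ only in markup, scripts, or styling, which is exactly the churn a text-only view is meant to hide.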

Use PageFreezer archives to generate a diff for the PageFreezer-Outputter

From @ambergman on February 10, 2017 8:17

Build a simple module that takes as input a URL and two timestamps (referring to two different archived versions of a page) and outputs a diff for the PageFreezer-Outputter. This can be done by pinging PageFreezer's diff server or using another module. See a discussion of what we know about PageFreezer's diff server and other diff services in the pagefreezer-clim README

This module, being developed by @allanpichardo, will likely be perfect for completing this issue.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#18

GSoC Report for Week 2 of Phase 3

GSoC Report for Week 2 of Phase 3

I’m excited to have made progress on HTML diff for rendering and mapping out a plan for prioritization of important changes.

I worked on the following issues:

The following issues were closed:

The following PRs were merged:

I have opened the following PR:

Moving ahead, I plan to work on the classification of text as significant/insignificant. I'm going through the important changes spreadsheet and plan to discuss ways to classify changes in a call today.

Comments on Trello board: EDGI: Web Monitoring Project - Onboarding

Ping @patcon

Describe Page Freezer zip formats

From @titaniumbones on February 2, 2017 14:35

Each Page Freezer archive for a domain BASEURL consists of a zipfile with the following structure:

Base URL is
storage/export/climate/BASEURL_NUMERICALID_YYYY-MM-DD/http[s]_URL/

inside this you'll find a file:
http[s]_URL_MM_DD_YYYY.xml

and a directory
MM_DD_YYYY/
potentially containing multiple subdirs of the form:
http[s]_URL
where URL is either the BASEURL or an external domain containing resources linked from BASEURL pages.

The first thing we need to do is understand the XML format. Here's an example document (rename it with the .xml extension; GitHub doesn't allow uploading XML files):
http_www.climateandsatellites.org_01_20_2017.txt
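Based on the naming convention above, the per-snapshot file names can be parsed mechanically. A sketch (the regex encodes my reading of the layout described here, so verify it against real archives):

```python
import re

# Assumed pattern: scheme, underscore-joined URL, then MM_DD_YYYY and the
# .xml extension (.txt also accepted, since the sample above was renamed).
_SNAPSHOT_NAME = re.compile(
    r"^(?P<scheme>https?)_(?P<url>.+)_"
    r"(?P<month>\d{2})_(?P<day>\d{2})_(?P<year>\d{4})\.(?:xml|txt)$"
)

def parse_snapshot_name(filename: str):
    """Split a per-snapshot file name into (scheme, url, ISO date)."""
    match = _SNAPSHOT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a PageFreezer snapshot name: {filename}")
    return (match["scheme"], match["url"],
            f"{match['year']}-{match['month']}-{match['day']}")
```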

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#2

Move over important issues from web-monitoring-ui

With the repo reorg we have some issues that have been left behind in https://github.com/edgi-govdata-archiving/web-monitoring-ui/issues.

I'd like to migrate them all over here, close whichever ones are no longer needed, and also set up larger milestones (like https://github.com/edgi-govdata-archiving/web-monitoring-ui/milestone/1) in this repo going forward.

@titaniumbones and @danielballan what is your take? I don't think many issues will remain open, but it would be nice to maintain that paper trail

GSoC Report for Week 3 of Phase 2

GSoC Report for Week 3 of Phase 2

I’m excited to have made progress on the filters created to segregate the irrelevant changes which will help analysts focus on the important ones. I'm also excited to have made progress on caching of diff results from different services to reduce the loading time for repeated access.

I worked on the following issues:

I closed the following issues:

The following PR was opened:

An impediment I faced is the uncertainty around the diffing services we plan to use in the future but I have discussed it with Mike, Kyala, Rob, and Toly and I have a better idea of the direction I want to move in.

I have gone through the current process followed by the analysts and I've also seen the developments on the dev side. I hope that with the inclusion of more diffing services, the things that I'm working on will be tried and tested by the analysts and I'll be able to incorporate feedback from them.

Update Trello onboarding with newer videos

@patcon, I'm not sure if you're the only one with the ability to do this, but I just saw you make some changes in this PR.

Can you update the following videos as well?
Demo: Use the new one with Rob https://www.youtube.com/watch?v=lQvpprUn8A8
Also new url for demo: http://web-monitoring-ui-staging.herokuapp.com/

New Analyst training (1/2): https://www.youtube.com/watch?v=1FNi4lfsY-k

Also, the user interview with Maya is a lot more interesting and pertinent now, than the UI+analysts videos: https://www.youtube.com/watch?v=xTN3jOqIXGM (Also, automagic computer thumbnail FTW here. Still wouldn't want computers to decide this all the time though 😉 )

Create sample dataset for machine learning projects

If possible, we should create a sample dataset drawn from the analysis team's records (collected January to February). Then we can start to implement some simple rules and see if they help us identify significant changes at a larger scale.

GSoC Report for Week 1 of Phase 2

GSoC Report for Week 1 of Phase 2

I’m excited to have made progress on exploring the diffs for various websites. I've looked at 5 different agency websites and have gone through changes between the webpages back in the late 90s and recent ones as well. I've kept all the diff csvs in a Google Drive folder for future use.

I worked on the following issues:

I completed the following issues:

Add functionality to get cabinet ID of a specific URL

Iterating through all cabinets to find a specific archive is a cumbersome task. Adding a function for this will make it easy for developers and analysts to find the archives of a specific domain/site when the need arises. It should be added to the file pf_edgi.py by @danielballan, which already includes various functions for using the PF API efficiently.

Designing Tentative Database Schema

Having gone through the architecture, I realized that the unknown database schema is a glaring hole that needs to be sorted out, and I'd like to work on creating it.

Decisions to be taken:

  • What type of model to choose (e.g. an entity-relationship model)
  • What kind of database to use: relational, NoSQL, or Hadoop/Spark (big data)

A basic template (that quickly comes to mind):

  • Page name
  • Website (to which the page belongs)
  • Page ID (unique identifier)
  • Previous state
  • Current state
  • List of previous states

GSoC Report for Week 2 of Phase 2

GSoC Report for Week 2 of Phase 2

I’m excited to have made progress on creating functions that characterize page elements for use in filtration.

Unfortunately, due to a health problem, I was unavailable for most of the week.

I worked on the following issues:

I've also been trying to work out how PageFreezer computes its 'Delta' score by fabricating examples.

I have discussed the possibility of setting up a meeting with folks from PageFreezer and I'm hoping we are able to set one up soon.

I will be focusing on the issues that follow the aforementioned ones, and will also experiment with and add a few things to our own diffing service, created by Dan.

GSoC Report for Week 4 of Phase 2

GSoC Report for Week 4 of Phase 2

I’m excited to have made progress on the filtering work. I have tested and pushed code which should be able to handle the tagging of different types of irrelevant changes. I have also fixed diff-match-patch issues along with Rob. I'm also excited to have made progress on creating new methods for computing differences. I have discussed a few ideas with Rob and we should have new functions soon.

I have finished working on the following issues:

I started working on the following issue:

I have put the following issues on hold for now:

The following PR was updated:

I'm working on pushing the filtering to the staging db and hopefully will get some analysts to test it once it has been done.

Build PageFreezer-Outputter that fits into current Versionista workflow

From @ambergman on February 10, 2017 8:05

To replicate the Versionista workflow with the new PageFreezer archives, we need a little module that takes as input a diff already returned (either by PageFreezer's server or another diff service we build), and simply outputs a row in a CSV, as our versionista-outputter already does. If a particular URL has not been altered, then the diff being returned as input to this module should be null, and no row should be output. Please see @danielballan's issue summarizing the Versionista workflow for reference.

I'll follow up soon with a list of the current columns being output to the CSV by the versionista-outputter, a short description of how the analysts use the CSV they're working with, and some screenshots of what everything looks like, for clarity.
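The module's contract (diff in, CSV row out, no row for a null diff) could be sketched like this; the column names are placeholder stand-ins pending the promised list of the versionista-outputter's real columns:

```python
import csv
import io

def diff_to_csv_row(url, diff):
    """One CSV row per changed page; a null diff means no row at all.

    Placeholder columns: URL, old capture time, new capture time, diff link.
    """
    if diff is None:
        return None
    return [url, diff["old_time"], diff["new_time"], diff["diff_url"]]

def write_rows(rows, stream):
    """Write only the non-null rows, mirroring the 'no change, no row' rule."""
    writer = csv.writer(stream)
    for row in rows:
        if row is not None:
            writer.writerow(row)
```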

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#17

Incorporate Cluster in the schema

An aspect of the conversation in SF that I didn't fully absorb until chatting with @aleatha is the importance of surfacing a Cluster of related Changes as a concept in the database and in the UI.

The app should request a set of Clusters of Changes for a user to check. The UI should present one representative Change with the option of drilling down into the rest. Here's one way we could adjust the schema. I haven't thought about this very long -- just trying to kick off discussion:

Cluster
    uuid
    priority
    created_at
    updated_at

Then adjust the Change table to add cluster_uuid (starts as NULL, is assigned later by an ETL job) and remove priority, which is a property of a Cluster, not a Change.

Where to put Annotation is a sticky question: does it belong to a Cluster or a Change? I think it's analogous to the (famously sticky) problem of regularly-occurring events on a calendar. If an Annotation belongs to a Change, then it's easy to customize individual Annotations when needed but it's hard to safely update the whole set. If Annotation belongs to a Cluster, we'd need to provide some UI for subdividing errant clusters into sub-clusters. My guess is that leaving an Annotation as a property of a Change is the easier place to start.
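The proposed adjustment can be written out as a quick modeling sketch (Python dataclasses here, standing in for the actual Rails/Postgres schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class Cluster:
    """Priority lives on the Cluster, not on individual Changes."""
    uuid: UUID = field(default_factory=uuid4)
    priority: float = 0.0
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

@dataclass
class Change:
    """cluster_uuid starts as None and is assigned later by an ETL job."""
    uuid: UUID = field(default_factory=uuid4)
    cluster_uuid: Optional[UUID] = None
```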
