archiveteam / archivebot

ArchiveBot, an IRC bot for archiving websites
Home Page: http://www.archiveteam.org/index.php?title=ArchiveBot
License: MIT License
1. ArchiveBot

<SketchCow> Coders, I have a question.
<SketchCow> Or, a request, etc.
<SketchCow> I spent some time with xmc discussing something we could do to make things easier around here.
<SketchCow> What we came up with is a trigger for a bot, which can be triggered by people with ops.
<SketchCow> You tell it a website. It crawls it. WARC. Uploads it to archive.org. Boom.
<SketchCow> I can supply machine as needed.
<SketchCow> Obviously there's some sanitation issues, and it is root all the way down or nothing.
<SketchCow> I think that would help a lot for smaller sites
<SketchCow> Sites where it's 100 pages or 1000 pages even, pretty simple.
<SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which runs the IRC interface and bookkeeping programs, and the crawlers, which do all the Web crawling. ArchiveBot users communicate with ArchiveBot by issuing commands in an IRC channel.

User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline

3. Local use

ArchiveBot was originally written as a set of separate programs for deployment on a server, which means it has a poor distribution story. However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline, dashboard, ignores, and control system and created a package intended for personal use. You can find it at https://github.com/ArchiveTeam/grab-site.

4. License

Copyright 2013 David Yip; made available under the MIT license. See LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to GNU Wget. Wget+lua was the first web crawler used by ArchiveBot.
Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web crawler.
Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and tracking down performance problems at scale.
Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO. The memory hole of the Web has gone too far. Don't look down, never look away; ArchiveBot's like the wind.

vim:ts=2:sw=2:tw=72:et
[11:24:16] <SketchCow> Archivebot will also not regrab a site if it's been grabbed recently, recently being 2 days.
Easy solution: Set archive records to expire 48 hours after successful completion.
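A minimal sketch of that rule, assuming a per-site completion timestamp is recorded somewhere (in production a Redis key with a 48-hour EXPIRE would decay on its own; the function and its names here are illustrative):

```python
import time

REGRAB_WINDOW = 48 * 3600  # seconds: "recently" means 2 days

def may_regrab(last_completed, now=None):
    """Allow a new grab only if the previous successful grab finished
    more than 48 hours ago, or never ran at all."""
    if last_completed is None:
        return True
    now = time.time() if now is None else now
    return now - last_completed > REGRAB_WINDOW

print(may_regrab(None))                      # True: never grabbed before
print(may_regrab(0, now=3600))               # False: grabbed an hour ago
print(may_regrab(0, now=REGRAB_WINDOW + 1))  # True: window has elapsed
```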
The dashboard uses JSON for exchanging log updates. The json gem (in ArchiveBot's environment) represents JSON in UTF-8. However, URL data in each log update is not guaranteed to be valid UTF-8. There are a few reasons for this, but they're beside the point: the dashboard is being handed URLs that are just strings of bytes, and it is interpreting them as UTF-8. That works in a lot of cases, but not all. Explicit transcoding is required.
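A sketch of the needed safeguard, assuming the raw URL arrives as a byte string (the function name is illustrative):

```python
def to_utf8(raw):
    """Interpret raw URL bytes as UTF-8, substituting U+FFFD for invalid
    sequences instead of letting the log serializer blow up."""
    return raw.decode("utf-8", errors="replace")

# Valid UTF-8 passes through unchanged; a bare Latin-1 byte is replaced.
print(to_utf8(b"http://example.com/caf\xc3\xa9"))  # http://example.com/café
print(to_utf8(b"http://example.com/caf\xe9"))      # replacement character appears
```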
It's possible that archive jobs may get out of control; the site may be bigger than anticipated.
We need a way to abort currently running jobs. !abort IDENT could work.
Any trusted ArchiveBot user can issue this command.
Now that ArchiveBot generates a WARC in the complete and abort cases, it should label the generated WARC as aborted.
Something like this, maybe: example_com-inf-aborted-1234567890.warc.gz
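A sketch of how the abort label could be spliced into the name (the helper and its arguments are hypothetical; only the resulting format comes from the example above):

```python
import time

def warc_filename(host, depth, aborted, ts=None):
    """Build a WARC filename, inserting an 'aborted' label when the job
    was cut short, e.g. example_com-inf-aborted-1234567890.warc.gz."""
    ts = int(time.time()) if ts is None else ts
    parts = [host.replace(".", "_"), depth]
    if aborted:
        parts.append("aborted")
    parts.append(str(ts))
    return "-".join(parts) + ".warc.gz"

print(warc_filename("example.com", "inf", aborted=True, ts=1234567890))
# example_com-inf-aborted-1234567890.warc.gz
```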
ArchiveBot currently logs which jobs have been started, but it doesn't log which jobs have finished. (The idea was to remove jobs once they finished.)
We should log jobs that have finished, as well as how they finished: failure, partial success, or success.
This has a couple of applications:
[02:18:57] <yipdw> !abort cuyp1tr5lig4d6ox4v7kyry9f
[02:18:57] <ATGoKart> yipdw: Sorry, only channel operators may start archive jobs.
The bot's response should be something like "Sorry, only channel operators may use !abort".
ArchiveBot should be able to upload generated WARCs to multiple simultaneous locations.
The use case:
There are (at least) two needs that this project is trying to address. The first is generating WARCs for injection into the Internet Archive. The second is a quick WARC generation tool.
Multiple simultaneous upload targets make it possible for us to satisfy both needs.
Each pipeline instance (i.e. each pipeline/pipeline.py
process) should periodically report on its RAM and disk usage.
For now, we just want to display that on ArchiveBot's dashboard. We could in future use it for more sophisticated job routing.
At present, ArchiveBot uses wget's --random-wait
option as a simple form of rate limiting.
This (unexpectedly) also seems to apply to ignored URLs. It shouldn't. Make it so.
I've seen ArchiveBot wget instances try to grab URLs of the form
http://www.example.com/%22http://www.example.com...
on some sites. If you apply URL decoding, this becomes
http://www.example.com/"http://www.example.com...
Many sites do not have content at such URLs, so even if wget generates such a URL, fetch will terminate with a 4xx (or 5xx, I guess, if the site has problems with "s in URLs). However, some sites will send back an HTTP 200 with links going deeper in the link graph. Ugh.
Assuming that URLs of the above form are very rare and are almost never legitimate, ArchiveBot should detect such patterns and reject them. (The other possibility: figure out why these URLs show up. If it's a wget bug, fix it.)
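A hedged sketch of the detection side, rejecting any URL whose path contains a second scheme (raw or percent-encoded); the pattern is illustrative and would sit alongside the other ignore rules rather than being the project's actual filter:

```python
import re

# A second "http(s)://" inside the path, possibly preceded by an encoded
# quote (%22), almost always marks a malformed scraped link.
EMBEDDED_URL = re.compile(
    r'^[a-z]+://[^/]+/.*(%22)?https?(://|%3A%2F%2F)', re.IGNORECASE)

def looks_malformed(url):
    """True when a fully-qualified URL embeds another URL in its path."""
    return bool(EMBEDDED_URL.search(url))

print(looks_malformed('http://www.example.com/%22http://www.example.com/page'))  # True
print(looks_malformed('http://www.example.com/about'))                           # False
```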
The profile image in the feed is hotlinked from Twitter, but that link is now dead (404). There used to be a URL in the API designed for hotlinking, but it no longer works. (Twitter is slowly shutting down the API to keep out third-party apps.) It might be best to hotlink a nice square image from the wiki instead.
http://folk.uib.no/hnohf/ contains links to resources in http://www.uib.no/people/hnohf/. However, GETs on these resources result in 301s back to resources on folk.uib.no/hnohf/.
ArchiveBot currently cannot deal with this: it will not follow links from www.uib.no to folk.uib.no.
Find out a way to handle cases like this.
Like #32, but monitoring available RAM. (Includes swap.)
!ao jobs should always be higher-priority than !a jobs, which can hold up time-sensitive things for days.
Currently, the !status IDENT mechanism is the only way to get status information from ArchiveBot.
ArchiveBot should be able to notify people when jobs complete or finish aborting.
<ivan`> yipdw: did [SITE] land on your server? will it be up for a week or so?
<yipdw> ivan`: [SITE] is not on my pipeline
<ivan`> guess it's on joepie91
<yipdw> or nico_32
<yipdw> ArchiveBot is now truly Cloud
<yipdw> because we can't tell where the fuck a job is
Fix this.
A lot of websites use external asset servers for CSS, Javascript, and other media needed to properly display a page.
wget's --domains
filter seems to apply to both retrieval and page requisites, which is understandable but unfortunate. We can't possibly predict all acceptable asset servers in advance.
We might, however, be able to implement custom filtering using a Lua script.
Cinch supposedly offers oper detection as follows:
on :message, "bleh" do |m|
  if m.user.oper?
    # stuff
  end
end
This didn't work for me on SynIRC, though, which is why I haven't implemented it yet. I haven't tried it on EFNet or other networks.
See what's going on and make it so that ArchiveBot will only start jobs requested by channel ops.
[15:00:22] <SketchCow> yipdw: I wish the archivebot would include a .txt file next to the warc files so we knew what they were.
[15:00:51] <SketchCow> Otherwise we're going to have these massive crawl items that sit there being obscure
A couple of possibilities (not mutually exclusive):
www-example-com-1234567890.warc.gz
ArchiveBot's Seesaw pipeline is quite large and is becoming painful to navigate and change.
Fortunately, it is composed of a few major components:
These would be a good place to start drawing package boundaries.
However the pipeline is packaged, it should be done with an eye towards clarifying inter-module dependencies and making it easier to change one part of the pipeline whilst thinking less about how it'll affect the rest of the pipeline. This is how you know you have gone too far.
The dashboard should show which pipeline a job is on, and should also show free RAM/disk space for each pipeline. Nothing else for now.
The same kind of junk appears to be grabbed for many blogs, and ArchiveBot should ignore all of it by default.
?replytocom=
?share=email
?share=linkedin
?share=twitter
?share=stumbleupon
?share=google-plus-1
?share=digg
https://plus.google.com/share?url=
http://www.facebook.com/login.php
https://ssl.reddit.com/login?dest=
http://www.reddit.com/login?dest=
http://www.reddit.com/submit?url=
http://digg.com/submit?url=
http://www.facebook.com/sharer.php?
http://www.facebook.com/sharer/sharer.php?
http://pixel.quantserve.com/pixel/
as well as these that appear for reasons unknown to me:
'%20+%20liker.profile_URL%20+%20'
'%20+%20liker.avatar_URL%20+%20'
%22%20+%20$wrapper.data(
(single quotes are actually in the first two URLs)
The dashboard's log view is designed to keep only a small number of log entries around in scrollback.
However, heap analysis in Chrome and Firefox indicates that a lot of objects hang around even after the views have been removed from the DOM. Chrome's heap analysis implicates Ember's views.
This might not actually be a memory leak (maybe it just needs that much memory and eventually stabilizes), but we've seen memory usage go above a gigabyte with no sign of slowing down. Additionally, disabling logs gives us much lower and more stable memory usage.
Look for and implement more efficient ways to render the log. For example, bypassing Ember's view system and dropping down to direct DOM manipulation -- while more painful and fraught with corner cases in an Ember application -- might be enough to get by.
While it's not wrong (on a very literal level) to call a pending job "in progress", it's misleading. The status display should distinguish between these categories.
Perhaps "queued" and "downloading" would be better.
Just because I think it'd be funny.
Easiest way to do this (though not strictly accurate) would be to sum up bytes_downloaded for all jobs. That misses upload, but we can relabel this "total bytes downloaded" or something to make it more accurate.
Here is an example from a job done on coilhouse.net:
https://s3.amazonaws.com/nw-depot/coilhouse_fetch.log.bz2 (expands to ~58 MB)
There are two conditions that seem to be necessary for this to happen:
Figure out how we can detect this condition.
On ArchiveBotDrone, I sometimes find the leftovers of failed jobs (.warc.gz and .json).
I upload them to Fos manually (I set aborted to true in the JSON and rename them to include -failed before uploading), but the bot should do this automatically.
Currently, the only way to stop a download is !abort. This halts downloading and deletes the WARC-in-progress.
It might make sense to have a command which halts downloading, but keeps the WARC (partial WARCs are okay, and can be fixed up pretty easily) and uploads what's there. Sort of like a !goodenough.
This wouldn't see wide use. The motivating case for this is our current grab of the Silk Road forums, which is currently 500-ing out on queued URLs. We did, however, grab about 10,000 URLs from said forums. It would be nice to be able to say "meh, good enough" and just upload what we have.
I can watch the number of 2xx, 3xx, etc. responses on the dashboard to get an idea of how much has been retrieved relative to what a Wayback query returns for a given domain. But if I'm away from the keyboard when the job finishes and go to the http://archivebot.at.ninjawedding.org:4567/#/histories/ page, all I get is the total file size and a histogram, with no final counts. It would be handy to see those.
Floods of log entries (i.e. many ignores in a row) can overwhelm a browser displaying the dashboard. Ember.js does some work to batch DOM updates together, but sometimes too much is just too much.
I have observed that the dashboard's performance significantly improves when the offending log's output is paused.
Determine whether implementing a "you're updating too fast" timer will offer significant performance improvements in the above extreme case whilst not imposing unacceptable overhead in the common case. The timer will be used to implement an update backoff: instead of updating every entry, we'll back off to every n entries.
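A sketch of the proposed backoff, assuming log entries arrive one at a time and the dashboard can ask, per entry, whether to render it (class and parameter names are illustrative):

```python
class LogThrottle:
    """Render every entry while traffic is light; once entries arrive
    faster than `rate_limit` per second, render only every n-th one."""

    def __init__(self, rate_limit=50, n=10):
        self.rate_limit = rate_limit
        self.n = n
        self.count = 0          # entries seen over the throttle's lifetime
        self.window_start = 0.0
        self.in_window = 0      # entries seen in the current 1-second window

    def should_render(self, now):
        self.count += 1
        if now - self.window_start >= 1.0:
            self.window_start, self.in_window = now, 0
        self.in_window += 1
        if self.in_window <= self.rate_limit:
            return True                    # common case: render everything
        return self.count % self.n == 0    # flood: back off to every n-th entry

# Six entries in the same instant with a limit of 3: the first three
# render normally, the rest fall back to every-2nd-entry rendering.
throttle = LogThrottle(rate_limit=3, n=2)
print([throttle.should_render(0.0) for _ in range(6)])
# [True, True, True, True, False, True]
```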
[23:26:51] <SketchCow> Speaking of feature creep
[23:27:04] <SketchCow> I am not against Archivebot tweeting what it's doing.
[22:02:11] <ivan`> feature request: submit a few hundred !archiveonly URLs via a URL to a text file listing said URLs
Maybe something like this?
!ao < http://www.example.com/urls.txt
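A sketch of the list ingestion, assuming the file is one URL per line; blank lines and #-comments being skipped is an assumed convention, not current behavior:

```python
def parse_url_list(text):
    """Parse the body of a fetched .txt file into archive targets:
    one URL per line, with blank lines and #-comments skipped."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if not line.startswith(("http://", "https://")):
            continue  # drop junk rather than queueing it
        urls.append(line)
    return urls

body = """# tweets to save
http://www.example.com/a
http://www.example.com/b

ftp://nope"""
print(parse_url_list(body))
# ['http://www.example.com/a', 'http://www.example.com/b']
```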
Previously, ArchiveBot's generated filenames looked like this:
f1ae3mhya6r5ujb8kp4v224e6-[date]-[time].warc.gz
That wasn't very useful for sorting or review purposes.
They now look like this:
www.example.com-inf-20130331-120000.warc.gz
which is much better for manual review, but has a collision problem.
Specifically, if two jobs are started from different parts in www.example.com's hierarchy in the same second, you're going to end up with an overwrite whilst rsyncing the data.
This happens a lot on !ao runs, which is also where you don't want this sort of thing to happen. (It's particularly common to encounter this while saving a set of tweets.)
A compromise like this would be useful:
www.example.com-224e6-inf-20130331-120000.warc.gz
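A sketch of the compromise naming, assuming the short disambiguating fragment is the tail of the job ident (the ident below is the one from the earlier example; the helper itself is hypothetical):

```python
def warc_name(host, ident, depth, timestamp):
    """Combine host, a short ident fragment, depth, and timestamp so that
    two same-second jobs on the same host cannot collide during rsync."""
    return "{}-{}-{}-{}.warc.gz".format(host, ident[-5:], depth, timestamp)

print(warc_name("www.example.com", "f1ae3mhya6r5ujb8kp4v224e6",
                "inf", "20130331-120000"))
# www.example.com-224e6-inf-20130331-120000.warc.gz
```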
e.g. due to redis being down
Many blogs embed audio, video, and other such media. In some cases, wget can detect this (say, if it's a direct link to a file), but custom Flash players can cause problems.
Suggested solution: use an external tool to either

* generate a downloadable URL to feed back into wget's fetch queue, or
* download the video file and store it to a separate WARC,

and integrate that into ArchiveBot's fetch process.
At present, ArchiveBot's pipeline hardcodes a Web-accessible archive URL into a work item.
This isn't going to work now that we're uploading to a staging area with eventual injection into the Internet Archive. However, the data model is flexible enough to support updates to job data, which means that we can add an archive URL when we know what that URL is going to be. (And we can update said URL as needed.)
Write a companion program that does this.
wget, while versatile, has two major limitations when it comes to downloading large sites:
Luckily, wget-lua provides the get_urls and httploop_result hooks, which give us the extension points needed to implement custom URL scraping and queuing logic. Investigate this possibility, keeping the following goals in mind:
From #27:
!ao jobs now always take precedence over !a jobs. However, all jobs still live in one queue. We can go further and build a separate !ao queue that (say) a separate pipeline could process. I'll delegate that to a separate issue.
There are (at least) two important considerations:
RPOPLPUSH
primitive, and therefore benefits from that primitive's atomicity guarantees. Any replacement should maintain those guarantees or conclusively demonstrate that they are not necessary.

wpull errors echoed to the dashboard look like this:
ERROR Fetching \u2018http://www.dogster.com/local/CO/Englewood/Pet_Stores_General/Precious_cat_catlitter-24478\u2019 encountered an error: Read timed out.
The escaped Unicode characters appear to be left and right quotes.
This confuses some ArchiveBot users by making them think that the \u2018 and \u2019 are part of the fetched URL.
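One possible fix is to decode the escape sequences before echoing the line; a sketch (where this hook would live in the dashboard is left open):

```python
def unescape_for_display(line):
    """Turn literal \\uXXXX escape sequences back into the characters
    they denote, so \\u2018 and \\u2019 render as curly quotes."""
    return line.encode("ascii", errors="backslashreplace").decode("unicode_escape")

raw = r"ERROR Fetching \u2018http://example.com/page\u2019 encountered an error: Read timed out."
print(unescape_for_display(raw))
# ERROR Fetching ‘http://example.com/page’ encountered an error: Read timed out.
```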
From #29:
For an RSS/Atom feed, I'm thinking that the dashboard app could serve it. Sounds good?
Something like http://dashboard.example.com/index.rss should work.
It'd be really slick if we had TwitterTweeter or a related object generate that file on disk.
Wget's memory usage tends to grow with the number of URLs examined, but we don't really have any data indicating what the growth pattern is like.
This would be useful to have primarily for satisfying curiosity, but it would also be good to have as a before-and-after dataset if/when we work on #9.
Ignore pattern sets are (or can be) updated frequently and need to be deployed independent of the rest of the ArchiveBot program set.
At present, ignore patterns are updated by me performing the updates manually in ArchiveBot's database. This has obvious scalability and reproducibility problems.
This sort of thing should be wrapped into a utility that can be run by anyone with sufficient access, where "sufficient" is something that I still need to define.
Ban on nick!hostmask, just like IRC servers.
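A sketch of mask matching with shell-style wildcards, as IRC server ban lists do (the function name and masks are illustrative):

```python
from fnmatch import fnmatch

def banned(prefix, masks):
    """Check a full IRC prefix (nick!user@host) against ban masks,
    where * and ? wildcard as they do in server ban lists."""
    return any(fnmatch(prefix, mask) for mask in masks)

masks = ["*!*@bad.example.net", "troll!*@*"]
print(banned("troll!u@anywhere.example.org", masks))  # True
print(banned("friend!u@good.example.com", masks))     # False
```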
When a WARC is uploading, there's no indication of this in the dashboard. However, rsync generates useful diagnostic output.
We should show that in the job's log, so that it doesn't look like the job has inexplicably stalled.
This is related to #15.
The Web is full of crap that ignores the semantics of HTTP. Figuring out whether something has gone wrong in a fetch loop often requires human analysis and intuition.
Why not make that tunable over time?
ArchiveBot's dashboard lets humans see what the spider's doing. We should also have a way to act on that.
Proposed interface:
<me> !ignore 957u6gyj7c536x4fsaam6mqsm turntable.fm/Coilhouse
<ArchiveBot> Pattern turntable.fm/Coilhouse added to job 957u6gyj7c536x4fsaam6mqsm.
This means "for all future retrievals, check if any of the proposed URLs match turntable.fm/Coilhouse. Do not fetch those that do."
!unignore would remove patterns:
<me> !unignore 957u6gyj7c536x4fsaam6mqsm turntable.fm/Coilhouse
<ArchiveBot> Pattern turntable.fm/Coilhouse removed from job 957u6gyj7c536x4fsaam6mqsm.
Because the fetch determination is controlled by a Lua script, any Lua pattern accepted by string.match will be accepted.
Caution: super feature creep.
There's quite a few requests for websites to be added into the Wayback Machine.
(e.g.: https://archive.org/iathreads/forum-display.php?forum=web&limit=1000) Perhaps we can fulfill their requests automatically?
This might work better as a separate client that feeds URLs in the IRC channel.
Or, if it already exists, document it in https://github.com/ArchiveTeam/ArchiveBot/blob/master/COMMANDS.
If a job's available disk space falls below some threshold, it should abort itself.
"Abort", in this context, means the same thing it does in the rest of ArchiveBot: immediately terminate the grab, upload the partial result, and inform the bot's task tracker.
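A sketch of the check itself; the 5 GiB floor is an assumed placeholder, and wiring the result into the abort path is the actual work:

```python
import shutil

GIB = 1024 ** 3

def should_self_abort(path, min_free_bytes=5 * GIB):
    """Return True when the filesystem holding `path` has less free
    space than the configured floor."""
    return shutil.disk_usage(path).free < min_free_bytes

# A job directory on a disk with any free space won't trip a zero floor.
print(should_self_abort("/", min_free_bytes=0))  # False
```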
At present, the Lua script establishes a Redis connection once, and has no way to recover from failure.
There should be a way for the script to refresh its connection and retry commands. (All commands issued by the script can be executed multiple times without harm, so I don't think we need complicated "did I already do this" logic.)
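The real script is Lua; the Python sketch below just shows the shape of the retry loop, and every name in it is illustrative. It leans entirely on the idempotence noted above: a command may run twice, so no "did I already do this" bookkeeping is needed.

```python
import time

def with_retries(op, reconnect, attempts=3, delay=1.0):
    """Run an idempotent Redis command, refreshing the connection and
    retrying on failure. Safe only because every command the script
    issues can be repeated without harm."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            time.sleep(delay)
            reconnect()  # e.g. open a fresh Redis connection

# Simulated flaky command: fails once, then succeeds after "reconnect".
state = {"ok": False}
def cmd():
    if not state["ok"]:
        raise ConnectionError("redis went away")
    return "PONG"
def reconnect():
    state["ok"] = True

print(with_retries(cmd, reconnect, delay=0))  # PONG
```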
[03:55:38] <joepie91> yipdw: suggestion, perhaps make it say "Queued x" instead of "Archiving x", and post a message into the channel (-without- a nickname prefix) as soon as the job actually starst?
This issue is a refinement of that request.
If a job can start immediately (where "immediately" means "within 5 seconds"), then we should still print Archiving SITE. However, if the job queue is full at the time of job submission, we should print something like Queued SITE for archival or Queued SITE for archival without recursion, and perhaps also the queue length.
Once the job starts, we should print a message into the control channel. Something like
<ArchiveBot> Job IDENT for SITE has started.