eggpi / citationhunt
A fun tool for quickly browsing unsourced snippets on Wikipedia.
Home Page: https://citationhunt.toolforge.org
License: MIT License
At the very least, track the number of clicks to each button (per session?).
I played around with the UI looking for alternatives to some stuff that I don't currently like, while hopefully keeping the current minimalism. Here's a mock-up:
I think the main points here are:
Other small tweaks:
Eventually I'd also like to change the Nope, next! button to be more neutral, both in text (Next!) and color, but for now I'll just let these ideas simmer in my head. Other suggestions are definitely welcome :)
Some configuration parameters:
The category for articles lacking references is:
https://es.wikipedia.org/wiki/Categor%C3%ADa:Wikipedia:Art%C3%ADculos_que_necesitan_referencias
The template to look for is:
https://es.wikipedia.org/wiki/Plantilla:Cita_requerida
The (meta-)category for hidden categories is:
https://es.wikipedia.org/wiki/Categor%C3%ADa:Wikipedia:Categor%C3%ADas_ocultas
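For illustration, here's roughly how these parameters might be grouped for a new language entry; the field names below are made up, not the actual structure of the project's config:

# -*- coding: utf-8 -*-
# Hypothetical field names, just to group the Spanish parameters above;
# the real per-language config in the repository has its own structure.
ES_CONFIG = {
    'lang_code': 'es',
    'lacking_references_category': u'Wikipedia:Artículos que necesitan referencias',
    'citation_needed_template': u'Cita requerida',
    'hidden_categories_category': u'Wikipedia:Categorías ocultas',
}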
Some ideas:
- post()
- done()
- cancel()
The current behavior is to do nothing -- if a worker crashes, we might still be able to continue with the previous ones, but if the receiver crashes, we eventually deadlock as the receiver queue overflows.
It would be nice if the Heroku deployment could just download the database we're using on Tools Labs so they're always in sync. Then we could remove the database from this repository, and/or keep it in the downloads area.
Categories that describe their articles well should be more likely to be chosen by assign_categories.py than those that don't. For example, it would be more desirable to categorize Vogue (Madonna song) under "Madonna (entertainer) songs" than, say, "Singles certified silver by the Syndicat National de l'Édition Phonographique".
One simple way to detect how well a category describes an article would be to count the occurrences of each word of the category's title in the article's title and text, and normalize that by the length (in words) of the category's title. This score (for all pages belonging to each category) can then be taken into account in our set-cover heuristic somehow.
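A rough sketch of that score (the function name and the exact tokenization are made up here, not taken from assign_categories.py):

import re

def category_article_score(category_title, article_title, article_text):
    # Occurrences of the category title's words in the article, normalized
    # by the length (in words) of the category title.
    category_words = re.findall(r'\w+', category_title.lower(), re.UNICODE)
    if not category_words:
        return 0.0
    article_words = re.findall(
        r'\w+', (article_title + ' ' + article_text).lower(), re.UNICODE)
    occurrences = sum(article_words.count(word) for word in category_words)
    return occurrences / float(len(category_words))

print category_article_score(
    'Madonna (entertainer) songs', 'Vogue (Madonna song)',
    'Vogue is a song by American singer Madonna.')
# roughly 0.67: "madonna" occurs twice, "entertainer" and "songs" never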
The server shouldn't just echo back invalid categories: GET /&cat=invalid should redirect to /&cat=all instead of /&cat=invalid&id=<some valid id> (or perhaps just 404 like invalid ids currently do). Even when an id is provided in the URL, the server should probably still normalize categories.
It is not enough to normalize the input client-side, as that doesn't protect us from links containing invalid categories or from cases in which JS is disabled.
See LeaVerou/awesomplete#14490 for an Awesomplete event we may need to implement this.
The snippet parser doesn't attempt to handle tables specially, so we just let mwparserfromhell strip the wikicode from them. This turns nicely formatted tables into a garbled run of cell text in the snippet.
The easiest thing to do is to just blacklist tables entirely, that is, not include snippets contained inside tables. But it might be possible to be smarter about this and actually make them look usable.
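One way to implement the blacklist, assuming we detect tables with mwparserfromhell (versions 0.4 and later parse wikitables into Tag nodes):

import mwparserfromhell

def contains_table(wikitext):
    # mwparserfromhell >= 0.4 represents {| ... |} tables as Tag nodes
    # whose tag name is "table".
    wikicode = mwparserfromhell.parse(wikitext)
    return any(str(tag.tag) == 'table' for tag in wikicode.filter_tags())

print contains_table('{| class="wikitable"\n|-\n| a || b\n|}')  # True
print contains_table('Just a paragraph.{{citation needed}}')    # False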
Currently, every click of the "next" button causes a GET /?cat=<some category> request to be made, which draws a new snippet id and redirects to /?id=<snippet id>&cat=<some category>. We could be a lot more responsive by pre-selecting the next snippet at each page load and making the "next" button a link to it, provided the user stays in the same category.
So in this case, GET /?id=<snippet id>&cat=<some category> would return a page in which the "next" button is a direct link to /?id=<some other snippet id>&cat=<some category>, and we wouldn't need a redirect when the button is clicked.
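A sketch of what this could look like in the Flask handler; the helper, the fake id list and the inline template below are stand-ins for illustration, not the actual app.py:

import random
import flask

app = flask.Flask(__name__)
SNIPPET_IDS = ['63836cf0', '1a2b3c4d', '5e6f7a8b']  # stand-in for the database

def pick_random_snippet_id(exclude=None):
    return random.choice([s for s in SNIPPET_IDS if s != exclude])

@app.route('/')
def snippet():
    snippet_id = flask.request.args.get('id')
    category = flask.request.args.get('cat', 'all')
    if snippet_id is None:
        # current behavior: draw an id and redirect to it
        return flask.redirect(flask.url_for(
            'snippet', id=pick_random_snippet_id(), cat=category))
    # proposed behavior: pre-select the next snippet so the "next" button
    # can be a plain link and no redirect is needed when it is clicked
    next_url = flask.url_for(
        'snippet', id=pick_random_snippet_id(exclude=snippet_id), cat=category)
    return flask.render_template_string(
        '<p>snippet {{ id }}</p><a href="{{ next_url }}">Next!</a>',
        id=snippet_id, next_url=next_url)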
uwsgi looks capable of handling some things that Heroku isn't, like gzip compression and setting cache headers. The logic for these things should be removed from app.py and placed into a uwsgi config file.
See http://uwsgi-docs.readthedocs.org/en/latest/Options.html for a list of configuration options.
This would drastically reduce the size of the main page and give us better caching behavior. Polyfill: http://webcomponents.org/polyfills/
For example, https://tools.wmflabs.org/citationhunt/fr/ causes a 404 in the default Flask configuration.
Sometimes, the titles of sections contain simple wikicode for things like italics. For example, the article on The San Diego Union-Tribune has a section called Acquisition of the North County Times, for which the wikicode is ===Acquisition of the ''North County Times''===.
We use the names of sections to try to infer their anchors so we can link directly to them. In this example, we want our code to be able to transform the wikicode into Acquisition_of_the_North_County_Times so we can produce a link such as https://en.wikipedia.org/wiki/The_San_Diego_Union-Tribune#Acquisition_of_the_North_County_Times.
The current version of this code is in a function called section_name_to_anchor in scripts/parse_live.py, and it produces the wrong anchor: instead of just Acquisition_of_the_North_County_Times, we end up with Acquisition_of_the_.27.27North_County_Times.27.27, because we're failing to remove the single quotes before calling urllib.quote(...).
I think the best way to make this work is to have the snippet parser remove the wikicode from section titles. That is, we should do something like section.get(0).title.strip_code().strip() in section_name_to_anchor (I'm not sure the second strip will even be needed).
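A minimal sketch of that fix, assuming section is a mwparserfromhell section whose first node is the heading; the real function in scripts/parse_live.py also needs whatever anchor-specific escaping MediaWiki applies:

import urllib

def section_name_to_anchor(section):
    # Strip the wikicode from the heading first, so e.g.
    # ===Acquisition of the ''North County Times''=== yields
    # Acquisition_of_the_North_County_Times instead of
    # Acquisition_of_the_.27.27North_County_Times.27.27.
    title = section.get(0).title.strip_code().strip()
    return urllib.quote(title.replace(' ', '_').encode('utf-8'))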
Then this can be covered by a unit test in snippet_parser/en_test.py.

assign_categories.py
currently only assigns one category for each article in the database, even if that article belongs to more than one of the chosen categories.
We can use the list of WikiProjects as a higher-level categorization of articles (e.g. Medicine), so these can be used in addition to the regular categories.
Note that the WikiProject category is actually added to a page's talk page, rather than the article. For instance, Category:All_WikiProject_Medicine_articles gets added to Talk:Diabetes mellitus rather than Diabetes mellitus.
This makes it obvious that the buttons are clickable, by behaving consistently with links.
It looks like we can't use a robots.txt file, as crawlers don't normally pick up these files from paths below the root. However, we can have some control over the behavior of crawlers by using a meta tag or header.
See Twitter discussion.
In case you are not aware, WMF provides [Tool Labs](https://wikitech.wikimedia.org/wiki/Help:Tool_Labs), which is a free hosting service for Wikimedia-related tools, with access to live database replicas and the latest dumps. With live database access, I propose another way to discover pages needing citations: query the templatelinks table (see the database layout and an example).
This way, we can avoid the lag caused by the use of dumps (what is shown represents the current state of the article) and there will be no need to pre-process the dumps. Editors also tend to trust tools hosted on Tool Labs more than those on third-party servers.
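A rough sketch of the query this would involve, assuming the English replica; the host name and credentials file below follow the usual Tool Labs conventions and are assumptions, not part of the proposal:

import os
import MySQLdb

db = MySQLdb.connect(
    host='enwiki.labsdb', db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))
cursor = db.cursor()
# Articles (namespace 0) that transclude Template:Citation needed (namespace 10).
cursor.execute('''
    SELECT page_title FROM page
    JOIN templatelinks ON tl_from = page_id
    WHERE page_namespace = 0
    AND tl_namespace = 10 AND tl_title = 'Citation_needed'
    LIMIT 50''')
for (title,) in cursor.fetchall():
    print title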
Tools on Tool Labs need to be licensed under an OSI-approved license.
Zhaofeng Li (User:Zhaofeng Li on Wikipedia)
workerpool.py was introduced way back when to make the parsing of the pages_articles dump a bit easier, and because only one process was allowed to write to SQLite.
Now that we've replaced pages_articles with the API and SQLite with MySQL, there's no reason to keep it around. It would be a lot cleaner to use multiprocessing.Pool instead.
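A rough sketch of what the replacement could look like; the two functions are placeholders for illustration, not the actual script code:

import multiprocessing

def parse_article(title):
    # Placeholder for fetching the wikicode via the API and extracting
    # snippets from it.
    return title, ['snippet of ' + title]

def save_to_database(title, snippets):
    # Placeholder for the MySQL insert; only the parent process writes.
    print title, len(snippets)

if __name__ == '__main__':
    titles = ['Diabetes mellitus', 'Vogue (Madonna song)']
    pool = multiprocessing.Pool()
    for title, snippets in pool.imap_unordered(parse_article, titles):
        save_to_database(title, snippets)
    pool.close()
    pool.join()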
We currently insert HTML into snippets before storing them into the database, but never escape existing HTML. Then, when rendering snippets on the site, we disable escaping so our HTML is preserved.
The existing HTML in snippets, if any, should be escaped somewhere -- either when we parse them, before adding our own markup, or when rendering. Note, however, that mwparserfromhell's strip_code() seems to remove HTML:
>>> print mwparserfromhell.parse('<script>alert("hi");</script>').strip_code()
alert("hi");
In which case we may not need to escape at all.
Each page contains a <link rel="prefetch"/> tag that points to the next page in the same category (if a category is selected), or a random other page, to speed up page transitions.
However, it looks like these links are only being followed by Chrome, not Safari or Firefox. Need to figure out why.
Or perhaps change the button after it's pressed so it says something like "I'm done", then move to the next snippet when it's pressed again.
A database update potentially invalidates many links, causing 404s, so that page should be friendlier. At the very least, it should link back to the root page.
It would be cool to allow users to log in while using Citation Hunt, as we can later figure out how many edits they have done and build a leaderboard, or award barnstars.
I found this documentation on using OAuth with a Python app, not sure if outdated.
Suggested by User:MusikAnimal.
A few ideas:
- the autofocus attribute, so we get a blinking cursor

It might be possible to adapt the database generation scripts so that we query the revision and text tables to get the wikicode for articles.
This way it should be possible to rebuild the database more often (we're currently limited by the frequency of dumps), while still keeping most of the current scripts intact, especially the algorithm for choosing categories.
We currently implement a selected set of Wikipedia templates and tags, and just remove the contents of the rest. In particular, for templates that are usually transcluded inside parenthesis, we sometimes end up with empty parenthesis in our snippets.
It's not a good idea to try to implement every single Wikipedia template, but it could be nice to support a few more to properly parse a larger set of articles. For example, implementing Template:Lang looks pretty easy, and this is a very widely employed template. Template:Chem, on the other hand, may not be worth it.
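For illustration, Template:Lang could be flattened to its text parameter along these lines; how the snippet parser's actual template handlers are structured is not shown here, this is just mwparserfromhell:

import mwparserfromhell

def expand_lang(template):
    # {{lang|la|Commentarii de Bello Gallico}} -> the second positional
    # parameter, i.e. the text itself.
    if template.name.matches('lang') and template.has(2):
        return template.get(2).value.strip_code().strip()
    return ''

wikicode = mwparserfromhell.parse(
    'He wrote {{lang|la|Commentarii de Bello Gallico}} around 50 BC.')
for template in wikicode.filter_templates():
    wikicode.replace(template, expand_lang(template))
print wikicode
# He wrote Commentarii de Bello Gallico around 50 BC.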
In some Wikipedias (Czech and Swedish came up recently), it seems to be common to use substitution instead of transclusion for the citation needed templates. I don't think Citation Hunt can currently handle that.
This sounds fairly simple to do, depending on how mwparserfromhell exposes those to us. If it's not the one-line fix I'm expecting, it might be worth gathering some data on those Wikipedias to see how often it is used, and if it's worth doing.
This makes it less frustrating to navigate through categories with few articles.
It would be nice if the UI would display, for instance, how many citations have been done in the current session, and the total number of snippets lacking citations.
Once we're gathering server-side statistics (issue #25), and if we have enough traffic that this is worth doing, this could also display the global number of citations done across all sessions.
I think I have most of the pieces I need for an Italian version of Citation Hunt. Hopefully going from 2 to 3 languages will be easier than going from 1 to 2 :)
Roadmap
/cc @Aubreymcfato @pepato
Hi, can you get the tool in Bengali, so that we can use it for Bengali Wikipedia?
When picking a random snippet to display in the UI, we use a hard-coded probability that any given snippet will be picked.
For languages containing a small number of snippets, this often causes the database query to fail, because no snippet could be selected. We should (1) not crash when this happens, and (2) dynamically set that probability to something reasonable, based on the total number of snippets available.
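A sketch of the proposed fix; the table and column names are made up, and the real query in app.py may look quite different:

def pick_random_snippet_id(cursor):
    cursor.execute('SELECT COUNT(*) FROM snippets')
    (total,) = cursor.fetchone()
    if total == 0:
        return None
    # Sample roughly 50 candidate rows instead of using a hard-coded
    # probability, so small databases still return something.
    probability = min(1.0, 50.0 / total)
    cursor.execute(
        'SELECT id FROM snippets WHERE RAND() < %s ORDER BY RAND() LIMIT 1',
        (probability,))
    row = cursor.fetchone()
    return row[0] if row is not None else None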
On Tools Labs, we have easy access to the latest pages-articles dumps, plus the categorylinks table. We can use the grid engine to schedule updates to the CH database.
A few things that will need to be done in the code or figured out:
- parse_pages_articles.py should probably un-bzip2 the XML dump;
- query the page table to get category names instead of grabbing them from the XML dump.

$ curl -v http://tools.wmflabs.org/citationhunt/
* Hostname was NOT found in DNS cache
* Trying 208.80.155.131...
* Connected to tools.wmflabs.org (208.80.155.131) port 80 (#0)
> GET /citationhunt/ HTTP/1.1
> User-Agent: curl/7.37.1
> Host: tools.wmflabs.org
> Accept: */*
>
< HTTP/1.1 302 FOUND
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Sun, 03 May 2015 20:50:54 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 283
< Connection: keep-alive
< Location: http://tools.wmflabs.org/citationhunt/?id=63836cf0&cat=all
< Cache-Control: no-cache, no-store
< X-Clacks-Overhead: GNU Terry Pratchett
<
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
* Connection #0 to host tools.wmflabs.org left intact
<p>You should be redirected automatically to target URL: <a href="/citationhunt/?id=63836cf0&cat=all">/citationhunt/?id=63836cf0&cat=all</a>.
It seems possible to use the redirect table to figure out that, say, {{cn}} is the same as {{Citation needed}}, instead of having to specify all redirects manually in the config.
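A sketch of the lookup on the database replicas (namespace 10 is the Template namespace); the function just takes a DB-API cursor and its name is made up:

def citation_needed_redirects(cursor):
    # All template pages that redirect to Template:Citation needed,
    # e.g. Cn, Fact, ...; these could be matched automatically instead
    # of being listed by hand in the config.
    cursor.execute('''
        SELECT page_title FROM page
        JOIN redirect ON rd_from = page_id
        WHERE page_namespace = 10
        AND rd_namespace = 10 AND rd_title = %s''', ('Citation_needed',))
    return [title for (title,) in cursor.fetchall()]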
A German version will take a bit more work, as the German Wikipedia doesn't use inline references. The idea here would be to just use the first paragraph of the article as a snippet, and update the strings in the UI accordingly (that is, "The Wikipedia article", not "The Wikipedia snippet").
I guess the code changes would be to allow the German version to use a different extract_snippets function in the snippet_parser package, somehow, then have that extract lead sections. I think the remainder of the code should be unaffected.
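An illustrative lead-section extractor with mwparserfromhell; whether this is how the snippet_parser package would plug it in is an open question:

import mwparserfromhell

def extract_lead_section(wikitext):
    wikicode = mwparserfromhell.parse(wikitext)
    # get_sections(include_lead=True) returns the lead section first.
    lead = wikicode.get_sections(include_lead=True)[0]
    return lead.strip_code().strip()

print extract_lead_section(
    'Intro paragraph of the article.\n\n== Geschichte ==\nBody text.')
# Intro paragraph of the article.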
Optionally, we might want to preserve the bold text in the lead section, since it indicates the main topic of the article. This will also require a bit of refactoring, as right now the snippets are stored in the database without any kind of wikicode (we drop the bold markup in the snippet parser).
@Sophivorus' idea: we could have a gadget or Javascript client that integrates Citation Hunt with Wikipedia itself. Basically a user would, say, use this UI to jump between articles, and we could have the snippet that needs citation be highlighted somehow.
It shouldn't be hard™ to expose an API endpoint from the current server that can be used by this new client.
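A hypothetical endpoint such a gadget could call; the route and the response fields below are made up, not something the current server exposes:

import flask

app = flask.Flask(__name__)

@app.route('/api/snippet')
def api_snippet():
    # In the real server this would draw a random snippet from the database;
    # here the values are hard-coded for illustration.
    return flask.jsonify(
        id='63836cf0',
        article='Matthew Titone',
        section='Political career')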
Lead sections usually don't require citations if the information is elsewhere in the article, so it's not as useful to flag them.
A counterargument here would be that, since the {{Citation needed}} in lead sections should be removed instead of replaced with citations, it may still be useful to flag them. Plus, it looks like this is not a hard rule, but a guideline.
Perhaps a compromise would be to recognize lead sections and flag them, but also suggest removing the template if it's redundant.
The group description at https://translatewiki.net/w/i.php?title=Translating:CitationHunt&action=edit&redlink=1 is missing. You can find examples at https://translatewiki.net/wiki/Group_descriptions
Originally reported at https://translatewiki.net/wiki/Thread:Support/Translating:CitationHunt
So we can make informed decisions about which templates to implement.
We currently link to Referencing for beginners, which is longer and more detailed than a first-time editor would like to see.
@WikiLibrary on Twitter suggested linking to this video tutorial instead of (or in addition to) that.
A few of the smaller Wikipedias don't require a dedicated snippet parser, but we end up having to create a stub file anyway (example: pl.py), which is yet another manual step in the configuration.
We should instead have a default snippet parser that is used when building the database for any language that doesn't have a file named after it in the snippet_parser directory. It's probably a good idea to output a log message when it is used as well.
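A sketch of that fallback, assuming per-language parsers are modules in the snippet_parser package named after the language code; the 'default' module name is made up:

import importlib
import logging

def get_snippet_parser(lang_code):
    try:
        return importlib.import_module('snippet_parser.' + lang_code)
    except ImportError:
        logging.info(
            'no snippet parser for %s, falling back to the default', lang_code)
        return importlib.import_module('snippet_parser.default')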
e.g. we could be linking (at least in the "I got this" button) to https://en.wikipedia.org/wiki/Matthew_Titone#Political_career rather than just https://en.wikipedia.org/wiki/Matthew_Titone, since the "Political career" section is the one that contains an unsourced statement.