
eggpi / citationhunt

106 stars, 14 watchers, 48 forks, 2.12 MB

A fun tool for quickly browsing unsourced snippets on Wikipedia.

Home Page: https://citationhunt.toolforge.org

License: MIT License

Languages: Python 57.04%, HTML 5.03%, CSS 4.67%, JavaScript 33.02%, Shell 0.25%
Topics: python, wikipedia, flask, citations, gamification, javascript

citationhunt's People

Contributors

amire80, ashokchakra-tj, bd808, bennylin, cladis, dependabot[bot], earwig, eggpi, enterprisey, erickguan, henriquecrang, hughlilly, intracer, jain-aditya, jessamynwest, jhsoby, karmadolkar, kht1992, maxzhou1902, nikerabbit, papuass, ricordisamoa, sarojdhakal, shdror, siebrand, sjoerddebruin, stryn, takot, translatewiki, venca24


citationhunt's Issues

Ideas for some UI tweaks.

I played around with the UI looking for alternatives to some stuff that I don't currently like, while hopefully keeping the current minimalism. Here's a mock-up:

[Screenshot: mock-up of the proposed UI]

I think the main points here are:

  • The category selection dropdown is in a much less awkward position
  • The languages are now under a dropdown, making room for more
  • I added a simple footer and put the GitHub link there. This seems more consistent with other tools I've seen, and fills that space nicely

Other small tweaks:

  • The instructions use a slightly smaller font size, which looks more proportional to the title
  • The WP:CITELEAD notice is smaller and sits next to the buttons. I'm not sure I like it there, but I couldn't find a better place.
  • Added arrows to the dropdowns

Eventually I'd also like to change the Nope, next! button to be more neutral, both in text (Next!) and color, but for now I'll just let these ideas simmer in my head. Other suggestions are definitely welcome :)

Define how workerpool should behave when there's an exception in a subprocess.

Some ideas:

  1. Kill all other processes and re-raise the exception on the next post()/done()/cancel()
  2. Respawn the crashed process up to a fixed number of times, then give up as in 1.

The current behavior is to do nothing -- if a worker crashes, we might still be able to continue with the remaining ones, but if the receiver crashes, we eventually deadlock as the receiver queue overflows.
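Here's a minimal sketch of idea 1 (assuming the default fork start method on Linux): workers report failures to the parent through a shared queue, and the pool kills everything and re-raises on the next call. The post()/done()/cancel() names come from the ideas above; the rest is hypothetical and not the actual workerpool module.

import multiprocessing

class WorkerPool(object):
    def __init__(self, worker_fn, nworkers):
        self._tasks = multiprocessing.Queue()
        self._errors = multiprocessing.Queue()
        self._processes = [
            multiprocessing.Process(target=self._run, args=(worker_fn,))
            for _ in range(nworkers)]
        for p in self._processes:
            p.start()

    def _run(self, worker_fn):
        # Runs in the child: stop at the first exception and report it.
        for task in iter(self._tasks.get, None):
            try:
                worker_fn(task)
            except Exception as exception:
                self._errors.put(exception)
                return

    def _check_errors(self):
        # empty() is only approximate for multiprocessing queues, but that's
        # good enough for a sketch.
        if not self._errors.empty():
            self.cancel()              # idea 1: kill all other processes...
            raise self._errors.get()   # ...and re-raise in the caller

    def post(self, task):
        self._check_errors()
        self._tasks.put(task)

    def done(self):
        self._check_errors()
        for _ in self._processes:
            self._tasks.put(None)      # sentinel: no more tasks
        for p in self._processes:
            p.join()

    def cancel(self):
        for p in self._processes:
            p.terminate()

Idea 2 would differ only in _check_errors(): instead of cancelling, respawn a replacement process and keep a retry counter before giving up.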

Keep Tools Labs and Heroku databases in sync.

It would be nice if the Heroku deployment could just download the database we're using on Tools Labs so they're always in sync. Then we could remove the database from this repository, and/or keep it in the downloads area.

Make 'good' categories more likely to be chosen.

Categories that describe their articles well should be more likely to be chosen by assign_categories.py than those that don't. For example, it would be more desirable to categorize Vogue (Madonna song) under "Madonna (entertainer) songs" than, say, "Singles certified silver by the Syndicat National de l'Édition Phonographique".

One simple way to detect how well a category describes an article would be to count the occurrences of each word of the category's title in the article's title and text, and normalize that by the length (in words) of the category's title. This score (for all pages belonging to each category) can then be taken into account in our set-cover heuristic somehow.
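A rough sketch of that score (a hypothetical helper, not actual code in assign_categories.py); it counts plain substring occurrences, so a real version would probably want word-boundary matching and some normalization of punctuation:

def category_score(category_title, article_title, article_text):
    """Crude measure of how well a category title describes an article."""
    words = category_title.lower().split()
    if not words:
        return 0.0
    haystack = (article_title + ' ' + article_text).lower()
    occurrences = sum(haystack.count(word) for word in words)
    return float(occurrences) / len(words)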

Handle invalid categories server-side too.

The server shouldn't just echo back invalid categories: GET /?cat=invalid should redirect to /?cat=all instead of /?cat=invalid&id=<some valid id> (or perhaps just 404, like invalid ids currently do). Even when an id is provided in the URL, the server should probably still normalize categories.

It is not enough to normalize the input client-side, as that doesn't protect us against links containing invalid categories, or against cases in which JS is disabled.
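A sketch of what the server-side check could look like in a Flask view (the route, the in-memory category set and the response below are illustrative, not the actual handler):

import flask

app = flask.Flask(__name__)

VALID_CATEGORIES = {'all'}  # in practice, loaded from the database

@app.route('/')
def citation_hunt():
    snippet_id = flask.request.args.get('id')
    cat = flask.request.args.get('cat', 'all')
    if cat not in VALID_CATEGORIES:
        # Normalize instead of echoing the invalid category back
        # (or abort(404), like invalid ids currently do).
        args = {'cat': 'all'}
        if snippet_id is not None:
            args['id'] = snippet_id
        return flask.redirect(flask.url_for('citation_hunt', **args))
    return 'snippet %s in category %s' % (snippet_id, cat)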

Wikicode tables look broken.

The snippet parser doesn't attempt to handle tables specially, so we just let mwparserfromhell strip the wikicode from them. However, this turns nice tables like this:

[Screenshot: the table as rendered on Wikipedia]

into this:

[Screenshot: the same table mangled into plain text in a snippet]

The easiest thing to do is to just blacklist tables entirely, that is, not include snippets contained inside tables. But it might be possible to be smarter about this and actually make them look usable.
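A sketch of the blacklist option, assuming an mwparserfromhell recent enough to parse wikitables into Tag nodes: drop the tables before snippet extraction, so nothing inside them ever becomes a snippet.

import mwparserfromhell

def strip_tables(wikitext):
    """Remove all tables so no snippet is extracted from inside one."""
    wikicode = mwparserfromhell.parse(wikitext)
    for tag in wikicode.filter_tags():
        if tag.tag == 'table':
            try:
                wikicode.remove(tag)
            except ValueError:
                pass  # already removed as part of an enclosing node
    return wikicode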

Don't use redirects when browsing snippets inside the same category.

Currently, every click of the "next" button causes a GET /?cat=<some category> request to be made, which draws a new snippet id, and redirects to /?id=<snippet id>&cat=<some category>. We could be a lot more responsive by pre-selecting the next snippet at each page load, and making the "next" button a link to it, provided the user stays in the same category.
So in this case, GET /?id=<snippet id>&cat=<some category> would return a page in which the "next" button is a direct link to /?id=<some other snippet id>&cat=<some category>, and we wouldn't need a redirect when the button is clicked.

Handle wikicode in section titles

Sometimes, the titles of sections contain simple wikicode for things like italics. For example, the article on The San Diego Union-Tribune has a section called Acquisition of the North County Times, for which the wikicode is ===Acquisition of the ''North County Times''===.

We use the names of sections to infer the anchor of the section containing each snippet, so we can link directly to it. In this example, we want our code to be able to transform the wikicode into Acquisition_of_the_North_County_Times so we can produce a link such as https://en.wikipedia.org/wiki/The_San_Diego_Union-Tribune#Acquisition_of_the_North_County_Times.

The current version of this code is in a function called section_name_to_anchor in scripts/parse_live.py, and it produces the wrong anchor: instead of just Acquisition_of_the_North_County_Times, we end up with Acquisition_of_the_.27.27North_County_Times.27.27, because we're failing to remove the single-quotes before calling urllib.quote(...).

I think the best way to make this work is to have the snippet parser remove the wikicode from section titles. That is, we should do something like

section.get(0).title.strip_code().strip()
here, before the title of the section gets returned to section_name_to_anchor (I'm not sure the second strip will even be needed). Then this can be covered by a unit test in snippet_parser/en_test.py.
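For illustration, here's the effect of that change on the example above (a sketch; the actual anchor logic would stay in section_name_to_anchor):

import mwparserfromhell

wikicode = mwparserfromhell.parse(
    "===Acquisition of the ''North County Times''===\nSome text.")
heading = wikicode.filter_headings()[0]
title = heading.title.strip_code().strip()
# title == 'Acquisition of the North County Times' -- the italics are gone,
# so quoting it no longer produces the stray .27.27 sequences.
anchor = title.replace(' ', '_')
# anchor == 'Acquisition_of_the_North_County_Times'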

Handle crawlers/robots.

It looks like we can't use a robots.txt file, as crawlers don't normally pick up these files from paths below the root. However, we can have some control over the behavior of crawlers by using a meta tag or header.
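For example (a sketch, not current behavior), the X-Robots-Tag response header is the header counterpart of the robots meta tag and can be set for every response in Flask; whether we want noindex, nofollow or something milder is still to be decided:

import flask

app = flask.Flask(__name__)

@app.after_request
def add_robots_header(response):
    # Equivalent to <meta name="robots" content="noindex, nofollow">,
    # but applies to every response without touching the templates.
    response.headers['X-Robots-Tag'] = 'noindex, nofollow'
    return response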

Alternative approach using live database

In case you are not aware, WMF provides [Tool Labs](https://wikitech.wikimedia.org/wiki/Help:Tool_Labs), a free hosting service for Wikimedia-related tools with access to live database replicas and the latest dumps. With live database access, I propose another way to discover pages needing citations:

  1. Using SQL, select a random page in article space (namespace 0) which transcludes {{cn}}, using the templatelinks table (see database layout and example; a query sketch follows below)
  2. Obtain the page content via the API (there doesn't seem to be a way to get this from the database)
  3. Parse the page using the existing code, to get the uncited part

This way we avoid the lag caused by using dumps (what is shown reflects the current state of the article), and there is no need to pre-process the dumps at all. Editors also tend to trust tools hosted on Tool Labs more than those on third-party servers.
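A sketch of step 1 against the Tool Labs replicas (the host, database and driver below are placeholders, and the templatelinks columns are the ones in the replica schema at the time of writing):

import os.path
import pymysql  # any MySQL driver would do; this one is just an example

QUERY = """
    SELECT page.page_id, page.page_title
    FROM page
    JOIN templatelinks ON tl_from = page_id
    WHERE page_namespace = 0            -- article space
      AND tl_namespace = 10             -- Template:
      AND tl_title = 'Citation_needed'  -- the canonical name behind {{cn}}
    ORDER BY RAND()                     -- fine for a sketch, slow at scale
    LIMIT 1
"""

connection = pymysql.connect(
    host='enwiki.labsdb', db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))
with connection.cursor() as cursor:
    cursor.execute(QUERY)
    page_id, page_title = cursor.fetchone()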

Tools on Tool Labs need to be licensed under an OSI-approved license.

Zhaofeng Li (User:Zhaofeng Li on Wikipedia)

Properly escape HTML in snippets.

We currently insert HTML into snippets before storing them into the database, but never escape existing HTML. Then, when rendering snippets on the site, we disable escaping so our HTML is preserved.

The existing HTML in snippets, if any, should be escaped somewhere -- either when we parse them, before adding our own markup, or when rendering. Note, however, that mwparserfromhell's strip_code() seems to remove HTML:

>>> print mwparserfromhell.parse('<script>alert("hi");</script>').strip_code()
alert("hi");

In which case we may not need to escape at all.
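If escaping does turn out to be necessary, one option (a sketch) is to escape the stripped text right before our own markup is added, e.g. with markupsafe, which Jinja2 already depends on:

import markupsafe

def add_markup(snippet_text):
    escaped = markupsafe.escape(snippet_text)  # neutralize any leftover HTML
    # Our own, trusted markup is added afterwards and survives rendering.
    return u'<div class="snippet">%s</div>' % escaped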

Prefetch links are only followed by Chrome.

Each page contains a <link rel="prefetch"/> tag that points to the next page in the same category (if a category is selected), or a random other page, to speed up page transitions.

However, it looks like these links are only being followed by Chrome, not Safari or Firefox. We need to figure out why.

Have a better 404 page.

A database update potentially invalidates many links, causing 404s, so that page should be friendlier. At the very least, it should link back to the root page.

Make the category filter more visible.

A few ideas:

  • Make the border darker
  • Change the default text to make it more inviting
  • Change the color of the default text
  • Move the input box somewhere more centered
  • Add the autofocus attribute, so we get a blinking cursor

Use live databases instead of the pages-articles dump.

It might be possible to adapt the database generation scripts so that we query the revision and text tables to get the wikicode for articles.

This way it should be possible to rebuild the database more often (we're currently limited by the frequency of dumps), while still keeping most of the current scripts intact, especially the algorithm for choosing categories.

Implement more Wikipedia templates.

We currently implement a selected set of Wikipedia templates and tags, and just remove the contents of the rest. In particular, for templates that are usually transcluded inside parentheses, we sometimes end up with empty parentheses in our snippets.

It's not a good idea to try to implement every single Wikipedia template, but it could be nice to support a few more to properly parse a larger set of articles. For example, implementing Template:Lang looks pretty easy, and this is a very widely employed template. Template:Chem, on the other hand, may not be worth it.
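For instance, here's a sketch of what handling Template:Lang could look like with mwparserfromhell (the function name and the way it would hook into the snippet parser are hypothetical): {{lang|fr|bonjour}} simply renders as its second positional parameter.

import mwparserfromhell

def expand_lang_templates(wikitext):
    wikicode = mwparserfromhell.parse(wikitext)
    for template in wikicode.filter_templates():
        if template.name.matches('lang') and template.has(2):
            # Replace {{lang|<code>|<text>}} with just <text>.
            wikicode.replace(template, template.get(2).value)
    return wikicode

print(expand_lang_templates("He said {{lang|fr|bonjour}} to everyone."))
# He said bonjour to everyone.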

Handle subst: for templates.

In some Wikipedias (Czech and Swedish came up recently), it seems to be common to use substitution instead of transclusion for the citation needed templates. I don't think Citation Hunt can currently handle that.

This sounds fairly simple to do, depending on how mwparserfromhell exposes those to us. If it's not the one-line fix I'm expecting, it might be worth gathering some data on those Wikipedias to see how often it is used, and if it's worth doing.

Add some simple statistics to the interface.

It would be nice if the UI would display, for instance, how many citations have been done in the current session, and the total number of snippets lacking citations.

Once we're gathering server-side statistics (issue #25), and if we have enough traffic that this is worth doing, this could also display the global number of citations done across all sessions.

Italian version.

I think I have most of the pieces I need for an Italian version of Citation Hunt. Hopefully going from 2 to 3 languages will be easier than going from 1 to 2 :)

Roadmap

  • Figure out whether the Italian Wikipedia has a special policy for citations in the lead section (like WP:CITELEAD in English) and get a link to that if it does.
  • Find a good introductory page about referencing (like Aide:Source in French).
  • Add a basic entry for Italian to the configuration and try parsing the dumps.
  • Implement a few of the most important Italian-specific templates in the snippet_parser package.
  • Link to it from the other languages.

/cc @Aubreymcfato @pepato

Dynamically adjust the probability of picking a given snippet.

When picking a random snippet to display in the UI, we use a hard-coded probability that any given snippet will be picked.

For languages containing a small number of snippets, this often causes the database query to fail, because no snippet could be selected. We should (1) not crash when this happens, and (2) dynamically set that probability to something reasonable, based on the total number of snippets available.
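A sketch of (2), assuming we know the total number of snippets for the language; the target constant below is made up. The idea is to pick the probability so that the sampling query is expected to return a handful of candidates, and to treat tiny languages as "take everything":

TARGET_CANDIDATES = 50  # how many rows we'd like the sampling query to yield

def snippet_probability(total_snippets):
    if total_snippets <= TARGET_CANDIDATES:
        return 1.0  # small language: every snippet is a candidate
    return float(TARGET_CANDIDATES) / total_snippets

Combined with (1), the query could also just be retried with probability 1.0 when it comes back empty, instead of crashing.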

Auto-update on Tools Labs as new dumps are released.

On Tools Labs, we have easy access to the latest pages-articles dumps, plus the categorylinks table. We can use the grid engine to schedule updates to the CH database.

A few things that will need to be done in the code or figured out:

  • parse_pages_articles.py should probably un-bzip2 the XML dump (see the sketch after this list);
  • Use the page table to get category names instead of grabbing them from the XML dump;
  • Evaluate the memory usage of our scripts, as grid jobs are usually limited in that;
  • Atomically update the database, and in particular, handle pageids that stop existing between redirects;
  • Use the correct credentials for database access, of course.
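A sketch of the un-bzip2 step from the first bullet: decompress the dump as a stream, so the whole XML never has to fit in memory (which also helps with the grid memory limits mentioned above).

import bz2

def open_dump(path):
    """Return a file-like object for a pages-articles dump, decompressing
    .bz2 files on the fly as they are read."""
    if path.endswith('.bz2'):
        return bz2.BZ2File(path)
    return open(path)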

HTTPS redirecting is not working on Tools Labs

$ curl -v http://tools.wmflabs.org/citationhunt/
* Hostname was NOT found in DNS cache
*   Trying 208.80.155.131...
* Connected to tools.wmflabs.org (208.80.155.131) port 80 (#0)
> GET /citationhunt/ HTTP/1.1
> User-Agent: curl/7.37.1
> Host: tools.wmflabs.org
> Accept: */*
>
< HTTP/1.1 302 FOUND
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Sun, 03 May 2015 20:50:54 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 283
< Connection: keep-alive
< Location: http://tools.wmflabs.org/citationhunt/?id=63836cf0&cat=all
< Cache-Control: no-cache, no-store
< X-Clacks-Overhead: GNU Terry Pratchett
<
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
* Connection #0 to host tools.wmflabs.org left intact
<p>You should be redirected automatically to target URL: <a href="/citationhunt/?id=63836cf0&amp;cat=all">/citationhunt/?id=63836cf0&amp;cat=all</a>.

German version

A German version will take a bit more work, as the German Wikipedia doesn't use inline references. The idea here would be to just use the first paragraph of the article as a snippet, and update the strings in the UI accordingly (that is, "The Wikipedia article", not "The Wikipedia snippet").

I guess the code changes would be to allow the German version to use a different extract_snippets function in the snippet_parser package, somehow, then have that extract lead sections. I think the remainder of the code should be unaffected.

Optionally, we might want to preserve the bold text in the lead section, since it indicates the main topic of the article. This will also require a bit of refactoring, as right now the snippets are stored in the database without any kind of wikicode (we drop the bold markup in the snippet parser).
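A rough sketch of such an extract function using mwparserfromhell (hypothetical; how it plugs into the snippet_parser package is exactly the refactoring described above). Note that strip_code() is also what drops the bold markup, so preserving it would need a different final step.

import mwparserfromhell

def extract_lead_section(wikitext):
    """Return the plain text of everything before the first heading."""
    wikicode = mwparserfromhell.parse(wikitext)
    lead = wikicode.get_sections(include_lead=True)[0]
    return lead.strip_code().strip()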

Write a gadget/js client for Wikipedia UI integration.

@Sophivorus' idea: we could have a gadget or JavaScript client that integrates Citation Hunt with Wikipedia itself. Basically, a user would, say, use this UI to jump between articles, and we could have the snippet that needs a citation be highlighted somehow.

It shouldn't be hard™ to expose an API endpoint from the current server that can be used by this new client.

Consider ignoring lead sections when there are other sections in the article.

Lead sections usually don't require citations if the information is elsewhere in the article, so it's not as useful to flag them.

A counterargument here would be that, since the {{ Citation needed }} in lead sections should be removed instead of replaced with citations, it may still be useful to flag them. Plus it looks like this is not a hard rule, but a guideline.

Perhaps a compromise would be to recognize lead sections and flag them, but also suggest removing the template if it's redundant.

Add a default snippet parser.

A few of the smaller Wikipedias don't require a dedicated snippet parser, but we end up having to create a stub file anyway (example: pl.py), which is yet another manual step in the configuration.

We should instead have a default snippet parser that is used when building the database for any language, if there's no file named after that language in the snippet_parser directory. It's probably a good idea to output a log message when it is used as well.
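A sketch of the fallback logic (module and function names are hypothetical; it assumes snippet parser modules are named after the language code, as in the pl.py example above):

import importlib
import logging

def load_snippet_parser(lang_code):
    try:
        return importlib.import_module('snippet_parser.' + lang_code)
    except ImportError:
        logging.info(
            'no snippet parser for %r, using the default one', lang_code)
        return importlib.import_module('snippet_parser.default')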
