
eggpi / citationhunt

106 stars, 14 watchers, 48 forks, 2.12 MB

A fun tool for quickly browsing unsourced snippets on Wikipedia.

Home Page: https://citationhunt.toolforge.org

License: MIT License

Languages: Python 57.04%, HTML 5.03%, CSS 4.67%, JavaScript 33.02%, Shell 0.25%
Topics: python, wikipedia, flask, citations, gamification, javascript

citationhunt's People

Contributors

amire80, ashokchakra-tj, bd808, bennylin, cladis, dependabot[bot], earwig, eggpi, enterprisey, erickguan, henriquecrang, hughlilly, intracer, jain-aditya, jessamynwest, jhsoby, karmadolkar, kht1992, maxzhou1902, nikerabbit, papuass, ricordisamoa, sarojdhakal, shdror, siebrand, sjoerddebruin, stryn, takot, translatewiki, venca24


citationhunt's Issues

Ideas for some UI tweaks.

I played around with the UI looking for alternatives to some stuff that I don't currently like, while hopefully keeping the current minimalism. Here's a mock-up:

[Screenshot: mock-up of the proposed UI]

I think the main points here are:

  • The category selection dropdown is in a much less awkward position
  • The languages are now under a dropdown, making room for more
  • I added a simple footer and put the GitHub link there. This seems more consistent with other tools I've seen, and fills that space nicely

Other small tweaks:

  • The instructions use a slightly smaller font size, which looks more proportional to the title
  • The WP:CITELEAD notice is smaller and sits next to the buttons. I'm not sure I like it there, but I couldn't find a better place.
  • Added arrows to the dropdowns

Eventually I'd also like to change the Nope, next! button to be more neutral, both in text (Next!) and color, but for now I'll just let these ideas simmer in my head. Other suggestions are definitely welcome :)

Define how workerpool should behave when there's an exception in a subprocess.

Some ideas:

  1. Kill all other processes and re-raise the exception on the next post()/done()/cancel()
  2. Respawn the crashed process up to a fixed number of times, then give up as in 1.

The current behavior is to do nothing -- if a worker crashes, we might still be able to continue with the remaining ones, but if the receiver crashes, we eventually deadlock as the receiver queue overflows.
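Here's a minimal sketch of idea 1 (assuming the default fork start method on Linux): workers report failures to the parent through a shared queue, and the pool kills everything and re-raises on the next call. The post()/done()/cancel() names come from the ideas above; the rest is hypothetical and not the actual workerpool module.

import multiprocessing

class WorkerPool(object):
    def __init__(self, worker_fn, nworkers):
        self._tasks = multiprocessing.Queue()
        self._errors = multiprocessing.Queue()
        self._processes = [
            multiprocessing.Process(target=self._run, args=(worker_fn,))
            for _ in range(nworkers)]
        for p in self._processes:
            p.start()

    def _run(self, worker_fn):
        # Runs in the child: stop at the first exception and report it.
        for task in iter(self._tasks.get, None):
            try:
                worker_fn(task)
            except Exception as exception:
                self._errors.put(exception)
                return

    def _check_errors(self):
        # empty() is only approximate for multiprocessing queues, but that's
        # good enough for a sketch.
        if not self._errors.empty():
            self.cancel()              # idea 1: kill all other processes...
            raise self._errors.get()   # ...and re-raise in the caller

    def post(self, task):
        self._check_errors()
        self._tasks.put(task)

    def done(self):
        self._check_errors()
        for _ in self._processes:
            self._tasks.put(None)      # sentinel: no more tasks
        for p in self._processes:
            p.join()

    def cancel(self):
        for p in self._processes:
            p.terminate()

Idea 2 would differ only in _check_errors(): instead of cancelling, respawn a replacement process and keep a retry counter before giving up.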

Keep Tools Labs and Heroku databases in sync.

It would be nice if the Heroku deployment could just download the database we're using on Tools Labs so they're always in sync. Then we could remove the database from this repository, and/or keep it in the downloads area.

Make 'good' categories more likely to be chosen.

Categories that describe their articles well should be more likely to be chosen by assign_categories.py than those that don't. For example, it would be more desirable to categorize Vogue (Madonna song) under "Madonna (entertainer) songs" than, say, "Singles certified silver by the Syndicat National de l'Édition Phonographique".

One simple way to detect how well a category describes an article would be to count the occurrences of each word of the category's title in the article's title and text, and normalize that by the length (in words) of the category's title. This score (for all pages belonging to each category) can then be taken into account in our set-cover heuristic somehow.
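A rough sketch of that score (a hypothetical helper, not actual code in assign_categories.py); it counts plain substring occurrences, so a real version would probably want word-boundary matching and some normalization of punctuation:

def category_score(category_title, article_title, article_text):
    """Crude measure of how well a category title describes an article."""
    words = category_title.lower().split()
    if not words:
        return 0.0
    haystack = (article_title + ' ' + article_text).lower()
    occurrences = sum(haystack.count(word) for word in words)
    return float(occurrences) / len(words)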

Handle invalid categories server-side too.

The server shouldn't just echo back invalid categories: GET /?cat=invalid should redirect to /?cat=all instead of /?cat=invalid&id=<some valid id> (or perhaps just 404, like invalid ids currently do). Even when an id is provided in the URL, the server should probably still normalize categories.

It is not enough to normalize the input client-side, as that doesn't protect us against links containing invalid categories, or against cases in which JS is disabled.
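A sketch of what the server-side check could look like in a Flask view (the route, the in-memory category set and the response below are illustrative, not the actual handler):

import flask

app = flask.Flask(__name__)

VALID_CATEGORIES = {'all'}  # in practice, loaded from the database

@app.route('/')
def citation_hunt():
    snippet_id = flask.request.args.get('id')
    cat = flask.request.args.get('cat', 'all')
    if cat not in VALID_CATEGORIES:
        # Normalize instead of echoing the invalid category back
        # (or abort(404), like invalid ids currently do).
        args = {'cat': 'all'}
        if snippet_id is not None:
            args['id'] = snippet_id
        return flask.redirect(flask.url_for('citation_hunt', **args))
    return 'snippet %s in category %s' % (snippet_id, cat)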

Wikicode tables look broken.

The snippet parser doesn't attempt to handle tables specially, so we just let mwparserfromhell strip the wikicode from them. However, this turns nice tables like this:

[Screenshot: the table as rendered on Wikipedia]

into this:

[Screenshot: the same table mangled into plain text in a snippet]

The easiest thing to do is to just blacklist tables entirely, that is, not include snippets contained inside tables. But it might be possible to be smarter about this and actually make them look usable.
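A sketch of the blacklist option, assuming an mwparserfromhell recent enough to parse wikitables into Tag nodes: drop the tables before snippet extraction, so nothing inside them ever becomes a snippet.

import mwparserfromhell

def strip_tables(wikitext):
    """Remove all tables so no snippet is extracted from inside one."""
    wikicode = mwparserfromhell.parse(wikitext)
    for tag in wikicode.filter_tags():
        if tag.tag == 'table':
            try:
                wikicode.remove(tag)
            except ValueError:
                pass  # already removed as part of an enclosing node
    return wikicode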

Don't use redirects when browsing snippets inside the same category.

Currently, every click of the "next" button causes a GET /?cat=<some category> request to be made, which draws a new snippet id, and redirects to /?id=<snippet id>&cat=<some category>. We could be a lot more responsive by pre-selecting the next snippet at each page load, and making the "next" button a link to it, provided the user stays in the same category.
So in this case, GET /?id=<snippet id>&cat=<some category> would return a page in which the "next" button is a direct link to /?id=<some other snippet id>&cat=<some category>, and we wouldn't need a redirect when the button is clicked.

Handle wikicode in section titles

Sometimes, the titles of sections contain simple wikicode for things like italics. For example, the article on The San Diego Union-Tribune has a section called Acquisition of the North County Times, for which the wikicode is ===Acquisition of the ''North County Times''===.

We use the names of sections to infer the anchor of the section containing each snippet, so we can link directly to it. In this example, we want our code to be able to transform the wikicode into Acquisition_of_the_North_County_Times so we can produce a link such as https://en.wikipedia.org/wiki/The_San_Diego_Union-Tribune#Acquisition_of_the_North_County_Times.

The current version of this code is in a function called section_name_to_anchor in scripts/parse_live.py, and it produces the wrong anchor: instead of just Acquisition_of_the_North_County_Times, we end up with Acquisition_of_the_.27.27North_County_Times.27.27, because we're failing to remove the single-quotes before calling urllib.quote(...).

I think the best way to make this work is to have the snippet parser remove the wikicode from section titles. That is, we should do something like

section.get(0).title.strip_code().strip()
here, before the title of the section gets returned to section_name_to_anchor (I'm not sure the second strip will even be needed). Then this can be covered by a unit test in snippet_parser/en_test.py.
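For illustration, here's the effect of that change on the example above (a sketch; the actual anchor logic would stay in section_name_to_anchor):

import mwparserfromhell

wikicode = mwparserfromhell.parse(
    "===Acquisition of the ''North County Times''===\nSome text.")
heading = wikicode.filter_headings()[0]
title = heading.title.strip_code().strip()
# title == 'Acquisition of the North County Times' -- the italics are gone,
# so quoting it no longer produces the stray .27.27 sequences.
anchor = title.replace(' ', '_')
# anchor == 'Acquisition_of_the_North_County_Times'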

Handle crawlers/robots.

It looks like we can't use a robots.txt file, as crawlers don't normally pick up these files from paths below the root. However, we can have some control over the behavior of crawlers by using a meta tag or header.
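For example (a sketch, not current behavior), the X-Robots-Tag response header is the header counterpart of the robots meta tag and can be set for every response in Flask; whether we want noindex, nofollow or something milder is still to be decided:

import flask

app = flask.Flask(__name__)

@app.after_request
def add_robots_header(response):
    # Equivalent to <meta name="robots" content="noindex, nofollow">,
    # but applies to every response without touching the templates.
    response.headers['X-Robots-Tag'] = 'noindex, nofollow'
    return response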

Alternative approach using live database

In case you are not aware, WMF provides [Tool Labs](https://wikitech.wikimedia.org/wiki/Help:Tool_Labs), a free hosting service for Wikimedia-related tools with access to live database replicas and the latest dumps. With live database access, I propose another way to discover pages needing citations:

  1. Using SQL, select a random page in article space (namespace 0) which transcludes {{cn}}, using the templatelinks table (see database layout and example; a query sketch follows below)
  2. Obtain the page content via the API (there doesn't seem to be a way to get this from the database)
  3. Parse the page using the existing code, to get the uncited part

This way we avoid the lag caused by using dumps (what is shown reflects the current state of the article), and there is no need to pre-process the dumps at all. Editors also tend to trust tools hosted on Tool Labs more than those on third-party servers.
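A sketch of step 1 against the Tool Labs replicas (the host, database and driver below are placeholders, and the templatelinks columns are the ones in the replica schema at the time of writing):

import os.path
import pymysql  # any MySQL driver would do; this one is just an example

QUERY = """
    SELECT page.page_id, page.page_title
    FROM page
    JOIN templatelinks ON tl_from = page_id
    WHERE page_namespace = 0            -- article space
      AND tl_namespace = 10             -- Template:
      AND tl_title = 'Citation_needed'  -- the canonical name behind {{cn}}
    ORDER BY RAND()                     -- fine for a sketch, slow at scale
    LIMIT 1
"""

connection = pymysql.connect(
    host='enwiki.labsdb', db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))
with connection.cursor() as cursor:
    cursor.execute(QUERY)
    page_id, page_title = cursor.fetchone()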

Tools on Tool Labs need to be licensed under an OSI-approved license.

Zhaofeng Li (User:Zhaofeng Li on Wikipedia)

Properly escape HTML in snippets.

We currently insert HTML into snippets before storing them into the database, but never escape existing HTML. Then, when rendering snippets on the site, we disable escaping so our HTML is preserved.

The existing HTML in snippets, if any, should be escaped somewhere -- either when we parse them, before adding our own markup, or when rendering. Note, however, that mwparserfromhell's strip_code() seems to remove HTML:

>>> print mwparserfromhell.parse('<script>alert("hi");</script>').strip_code()
alert("hi");

In which case we may not need to escape at all.
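If escaping does turn out to be necessary, one option (a sketch) is to escape the stripped text right before our own markup is added, e.g. with markupsafe, which Jinja2 already depends on:

import markupsafe

def add_markup(snippet_text):
    escaped = markupsafe.escape(snippet_text)  # neutralize any leftover HTML
    # Our own, trusted markup is added afterwards and survives rendering.
    return u'<div class="snippet">%s</div>' % escaped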

Prefetch links are only followed by Chrome.

Each page contains a <link rel="prefetch"/> tag that points to the next page in the same category (if a category is selected), or a random other page, to speed up page transitions.

However, it looks like these links are only being followed by Chrome, not Safari or Firefox. We need to figure out why.

Have a better 404 page.

A database update potentially invalidates many links, causing 404s, so that page should be friendlier. At the very least, it should link back to the root page.

Make the category filter more visible.

A few ideas:

  • Make the border darker
  • Change the default text to make it more inviting
  • Change the color of the default text
  • Move the input box somewhere more centered
  • Add the autofocus attribute, so we get a blinking cursor

Use live databases instead of the pages-articles dump.

It might be possible to adapt the database generation scripts so that we query the revision and text tables to get the wikicode for articles.

This way it should be possible to rebuild the database more often (we're currently limited by the frequency of dumps), while still keeping most of the current scripts intact, especially the algorithm for choosing categories.

Implement more Wikipedia templates.

We currently implement a selected set of Wikipedia templates and tags, and just remove the contents of the rest. In particular, for templates that are usually transcluded inside parentheses, we sometimes end up with empty parentheses in our snippets.

It's not a good idea to try to implement every single Wikipedia template, but it could be nice to support a few more to properly parse a larger set of articles. For example, implementing Template:Lang looks pretty easy, and this is a very widely employed template. Template:Chem, on the other hand, may not be worth it.
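For instance, here's a sketch of what handling Template:Lang could look like with mwparserfromhell (the function name and the way it would hook into the snippet parser are hypothetical): {{lang|fr|bonjour}} simply renders as its second positional parameter.

import mwparserfromhell

def expand_lang_templates(wikitext):
    wikicode = mwparserfromhell.parse(wikitext)
    for template in wikicode.filter_templates():
        if template.name.matches('lang') and template.has(2):
            # Replace {{lang|<code>|<text>}} with just <text>.
            wikicode.replace(template, template.get(2).value)
    return wikicode

print(expand_lang_templates("He said {{lang|fr|bonjour}} to everyone."))
# He said bonjour to everyone.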

Handle subst: for templates.

In some Wikipedias (Czech and Swedish came up recently), it seems to be common to use substitution instead of transclusion for the citation needed templates. I don't think Citation Hunt can currently handle that.

This sounds fairly simple to do, depending on how mwparserfromhell exposes those to us. If it's not the one-line fix I'm expecting, it might be worth gathering some data on those Wikipedias to see how often it is used, and if it's worth doing.

Add some simple statistics to the interface.

It would be nice if the UI would display, for instance, how many citations have been done in the current session, and the total number of snippets lacking citations.

Once we're gathering server-side statistics (issue #25), and if we have enough traffic that this is worth doing, this could also display the global number of citations done across all sessions.

Italian version.

I think I have most of the pieces I need for an Italian version of Citation Hunt. Hopefully going from 2 to 3 languages will be easier than going from 1 to 2 :)

Roadmap

  • Figure out whether the Italian Wikipedia has a special policy for citations in the lead section (like WP:CITELEAD in English) and get a link to that if it does.
  • Find a good introductory page about referencing (like Aide:Source in French).
  • Add a basic entry for Italian to the configuration and try parsing the dumps.
  • Implement a few of the most important Italian-specific templates in the snippet_parser package.
  • Link to it from the other languages.

/cc @Aubreymcfato @pepato

Dynamically adjust the probability of picking a given snippet.

When picking a random snippet to display in the UI, we use a hard-coded probability that any given snippet will be picked.

For languages containing a small number of snippets, this often causes the database query to fail, because no snippet could be selected. We should (1) not crash when this happens, and (2) dynamically set that probability to something reasonable, based on the total number of snippets available.
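A sketch of (2), assuming we know the total number of snippets for the language; the target constant below is made up. The idea is to pick the probability so that the sampling query is expected to return a handful of candidates, and to treat tiny languages as "take everything":

TARGET_CANDIDATES = 50  # how many rows we'd like the sampling query to yield

def snippet_probability(total_snippets):
    if total_snippets <= TARGET_CANDIDATES:
        return 1.0  # small language: every snippet is a candidate
    return float(TARGET_CANDIDATES) / total_snippets

Combined with (1), the query could also just be retried with probability 1.0 when it comes back empty, instead of crashing.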

Auto-update on Tools Labs as new dumps are released.

On Tools Labs, we have easy access to the latest pages-articles dumps, plus the categorylinks table. We can use the grid engine to schedule updates to the CH database.

A few things that will need to be done in the code or figured out:

  • parse_pages_articles.py should probably un-bzip2 the XML dump (see the sketch after this list);
  • Use the page table to get category names instead of grabbing them from the XML dump;
  • Evaluate the memory usage of our scripts, as grid jobs are usually limited in that;
  • Atomically update the database, and in particular, handle pageids that stop existing between redirects;
  • Use the correct credentials for database access, of course.
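A sketch of the un-bzip2 step from the first bullet: decompress the dump as a stream, so the whole XML never has to fit in memory (which also helps with the grid memory limits mentioned above).

import bz2

def open_dump(path):
    """Return a file-like object for a pages-articles dump, decompressing
    .bz2 files on the fly as they are read."""
    if path.endswith('.bz2'):
        return bz2.BZ2File(path)
    return open(path)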

HTTPS redirecting is not working on Tools Labs

$ curl -v http://tools.wmflabs.org/citationhunt/
* Hostname was NOT found in DNS cache
*   Trying 208.80.155.131...
* Connected to tools.wmflabs.org (208.80.155.131) port 80 (#0)
> GET /citationhunt/ HTTP/1.1
> User-Agent: curl/7.37.1
> Host: tools.wmflabs.org
> Accept: */*
>
< HTTP/1.1 302 FOUND
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Sun, 03 May 2015 20:50:54 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 283
< Connection: keep-alive
< Location: http://tools.wmflabs.org/citationhunt/?id=63836cf0&cat=all
< Cache-Control: no-cache, no-store
< X-Clacks-Overhead: GNU Terry Pratchett
<
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
* Connection #0 to host tools.wmflabs.org left intact
<p>You should be redirected automatically to target URL: <a href="/citationhunt/?id=63836cf0&amp;cat=all">/citationhunt/?id=63836cf0&amp;cat=all</a>.

German version

A German version will take a bit more work, as the German Wikipedia doesn't use inline references. The idea here would be to just use the first paragraph of the article as a snippet, and update the strings in the UI accordingly (that is, "The Wikipedia article", not "The Wikipedia snippet").

I guess the code changes would be to allow the German version to use a different extract_snippets function in the snippet_parser package, somehow, then have that extract lead sections. I think the remainder of the code should be unaffected.

Optionally, we might want to preserve the bold text in the lead section, since it indicates the main topic of the article. This will also require a bit of refactoring, as right now the snippets are stored in the database without any kind of wikicode (we drop the bold markup in the snippet parser).
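A rough sketch of such an extract function using mwparserfromhell (hypothetical; how it plugs into the snippet_parser package is exactly the refactoring described above). Note that strip_code() is also what drops the bold markup, so preserving it would need a different final step.

import mwparserfromhell

def extract_lead_section(wikitext):
    """Return the plain text of everything before the first heading."""
    wikicode = mwparserfromhell.parse(wikitext)
    lead = wikicode.get_sections(include_lead=True)[0]
    return lead.strip_code().strip()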

Write a gadget/js client for Wikipedia UI integration.

@Sophivorus' idea: we could have a gadget or JavaScript client that integrates Citation Hunt with Wikipedia itself. Basically, a user would, say, use this UI to jump between articles, and we could have the snippet that needs a citation be highlighted somehow.

It shouldn't be hard™ to expose an API endpoint from the current server that can be used by this new client.

Consider ignoring lead sections when there are other sections in the article.

Lead sections usually don't require citations if the information is elsewhere in the article, so it's not as useful to flag them.

A counterargument here would be that, since the {{ Citation needed }} in lead sections should be removed instead of replaced with citations, it may still be useful to flag them. Plus it looks like this is not a hard rule, but a guideline.

Perhaps a compromise would be to recognize lead sections and flag them, but also suggest removing the template if it's redundant.

Add a default snippet parser.

A few of the smaller Wikipedias don't require a dedicated snippet parser, but we end up having to create a stub file anyway (example: pl.py), which is yet another manual step in the configuration.

We should instead have a default snippet parser that is used when building the database for any language, if there's no file named after that language in the snippet_parser directory. It's probably a good idea to output a log message when it is used as well.
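A sketch of the fallback logic (module and function names are hypothetical; it assumes snippet parser modules are named after the language code, as in the pl.py example above):

import importlib
import logging

def load_snippet_parser(lang_code):
    try:
        return importlib.import_module('snippet_parser.' + lang_code)
    except ImportError:
        logging.info(
            'no snippet parser for %r, using the default one', lang_code)
        return importlib.import_module('snippet_parser.default')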
