citizenlab / test-lists Goto Github PK

View Code? Open in Web Editor NEW

407.0 49.0 336.0 7.86 MB

URL testing lists intended for discovering website censorship

Python 100.00%

censorship network-measurement network-test csv-files csv

test-lists's Issues

Make list of government issued blocklists

Add links into the comments below!

Are you aware of a government issued blocklist that is available for download?

Then add a link pointing to where it can be downloaded (even better if you specify if it's legal or not to mirror/publish it) and what country to applies to.

If you want to be 👍 💯 then you should write the parser code and submit a pull request ;)

Test list for North Korea(kp)?

seems like Instagram is blocked in DPRK
https://news.vice.com/article/north-korea-has-blocked-instagram

Translate the text of the submission wizard

We should make a list of what languages we want to support in the first release (1.0.0) and then make a call for translators and/or use transifex.

Add documentation about what countries we would like URLs on

Design and implement a page that will tell potential contributors what countries need to have their URL list created or reviewed.

Character 0x9d present in some lists

I noticed that some URLs contain the character 0x9d at the end. This triggers a parsing error when I try to convert it to unicode.

Can I just remove them?

What country does "cis" correspond to?

Inside of the test lists there is a file for an unknown country code: https://github.com/citizenlab/test-lists/blob/master/csv/cis.csv.

From the looks of the content of the file it seems to be applicable to some russian speaking country.

Initial new list for Cambodia kh.csv

url,category_code,category_description,date_added,source,notes
https://www.cambodiadaily.com/,NEWS,News and media,2016-09-17,cchr,Indepdenent online news website
http://khmer.voanews.com/,NEWS,News,2016-09-17,cchr,Indepdenent online news website
http://cchrcambodia.org/,HUMR,Human Rights,2016-09-17,cchr,Human Rights Organization
http://www.cpp.org.kh/,POLR,Political Party,2016-09-17,,cchr,Current ruling political party

hdl.handle.net is blocked in Iran, not sure how to categorize

Handle.net is a provider of persistent URLs and is commonly used for institutional repositories (similar to DOIs). I am the administrator of a large open access repository of agricultural outputs and we recently noticed that the Handles do not work in Iran.

Here is an example of a persistent Handle link to an item on our repository: http://hdl.handle.net/10568/91553

I would submit a pull request with the URL in the Iran country list, but I'm not sure what to classify it as.

Archive pages that get added to the test list

It would be cool if as part of the CI process we also archived pages via http://archive.is/ as they are added.

This way if the site goes down or is self-censored, we still have an archived copy of it (and we get to know what the site was about when it was still accessible).

archive.is supports something called the memento API that we can use to automate this: http://mementoweb.org/depot/native/archiveis/.

Agree on convention for including or not HTTPS URLs

In #38 (comment) we discussed about the pro and contra of including or not https URLs in the testing lists.

@sneft

This is a debate we've had internally a number of times. It's obvious inefficient to repeatedly test the full path of HTTPS URLs for the same domain, but at the same time we didn't want to make any initial assumptions in constructing the lists that would limit their usefulness in the long-run. In other words, trying to future-proof the lists, leave open the possibility that someone's being MITM'd, etc.
We generally have settled on being exhaustive (in some cases testing both HTTP/HTTPS versions) as the performance impact wasn't our primary consideration.
One middle ground may be to include a small sample of HTTPS full-path URLs (e.g. a handful of full HTTPS URLs of sensitive Twitter accounts or YouTube videos) while testing the HTTP version of the rest.

@hellais

One thing to take into consideration is the fact that if you only do testing for the HTTPS version of the site you will not catch cases in which the specific path is being filtered by a transparent proxy that doesn't do MITM.
I guess in the end it's probably up to the tool that does the measurements to take into account the fact that it should be testing the URL in both it's HTTPS and HTTP version.
That said if we decide to include in the URLs to test the HTTPS version of websites it should be consistent across all of them. That is if the site does support HTTPS the URL should always be listed using https. It will then be up to the tool consuming the testing list to take into account the fact it should be testing for the URL in both it's HTTPS and HTTP version.

Do we agree that the convention to adopt is "when the site supports HTTPS list it in HTTPS" and it should be up to the tool using the list to figure out that it should ALSO be testing for "HTTP" when testing a URL?
If so I can apply changes in batch to all the lists to update the URLs that also support HTTPS to have them prefixed by that.
Changes will also be made to ooniprobe to support testing for HTTP AND HTTPS when it finds HTTPS URLs.

I think this is also relevant to @rpanah for centinel testing.

Add a tags column

I am thinking that it would be valuable to have an extra column to annotate in a machine readable way properties of the URLs.

The use-case of this is, for example, instead of removing dead URLs that should no longer be tested, simply adding the inactive tag or for URLs that are generated as part of some automated tooling, marking those as autogenerated, etc.

I think having a generic tags column is sufficiently flexible to support many use-cases.

Thoughts?

Does this create parsing issues for people consuming these test lists?

I believe this would also make it easier to integrate the @berkmancenter effort (see: #236)
cc @sneft @agrabeli @jakubd @rpanah @jdcc

Restful API

Tracking issue for development of an API for extracting URLs for specific categories and/or countries.

Consolidate Iran lists

It looks like there are two ir.csv files. One of them is in the root directory, and the other one is in the lists directory. Futhermore, they appear to be out of sync.

How to deal with blocked cloud service DNS?

Sarawak Report uses AWS auto assigned DNS for cloud host servers as a way to temporarily overcome DNS blocks on their main server. These addresses eventually get added to Malaysia's official block lists.

What's best way of tracking these domains? Should we add them to my.csv country lists too?

http://digimon.sinarproject.org/incidents/sarawak-report

White-listing/Black-listing URLs per country?

At TorDev 2016 it was suggested to OONI that instead of creating criteria for URL risk assessment, perhaps we could consider white-listing and black-listing URLs per country?

Whitelists can include URLs that are commonly accessed and present low risk (e.g. Alexa top 100), while blacklists can include pornography, hate speech, and other objectionable categories.

My main concern with this suggestion is that many URLs might not clearly fall in a "white" or "black" list, or it might not be clear/obvious if they should be white-listed or black-listed. In such cases, would we be creating even more risk for users (if, for example, we have white-listed a URL which is actually quite risky to test)? Furthermore, would this inevitably lead to fewer tests for URLs that are riskier and possibly more interesting to test?

Review the proposed new content category codes

I have drafted a proposed set of revised content category codes which would replace the existing codes. The proposed set of codes are here:

https://github.com/citizenlab/test-lists/blob/master/lists/00-proposed-category_codes.csv

The aim here has been to streamline the categories, collapsing redundant categories and revising categories which were frequently confusing for list-creators. The goal was simplicity - fewer categories will be easier to use, prevent duplication and ideally limit miscategorization.

Alternative approaches considered were to adopt an existing category scheme wholesale (such as Netsweeper or Blue Coat). These generally have many more categories than the 30 proposed here, which can be a pro or a con depending on your needs.

Please feel free to offer feedback here. This list format is very much building on the prior ONI scheme, so I would love to hear comments based on alternative uses for the lists.

All URLs in the Global List Should be Active URLs.

I realize that there is an intention to keep some URLs that are dead (404ing, parked pages,etc) on the local lists because this repo is intended to measure censorship and this may sometimes last longer than the site itself. I believe that this logic does not hold as well when talking about the global list. Is there any objection to checking the consistency of the global list and making sure that all the global list URLs are active non-dead URLs?

Add documentation on usage of bin/*

Currently the README does not explain how to use the code in this repository to update the URL lists nor how to add new URLs.

"LGBT (excluding pornography)"

There are several different category codes (other than PORN) that could arguably include pornography, notably

Provocative Attire,PROV,PROV,"Websites which show provocative attire and portray women in a sexual manner, wearing minimal clothing. "
Sex Education,XED,XED,"Includes contraception, abstinence, STDs, healthy sexuality, teen pregnancy, rape prevention, abortion, sexual rights, and sexual health services."
Online Dating,DATE,DATE,"Online dating services which can be used to meet people, post profiles, chat, etc"
Media sharing,MMED,MMED,"Video, audio or photo sharing platforms."

The LGBT category, however, is alone in having a description that specifies that it excludes pornography:

LGBT,LGBT,GAYL,A range of gay-lesbian-bisexual-transgender queer issues. (Excluding pornography)

This seems inconsistent. I propose that either "Excluding pornography" be added to the descriptions of PROV, XED, DATE, and MMED, or that it be removed from the description of LGBT.

How to cite?

Hi, I'm adding support for using these lists in https://github.com/irl/hellfire which will end up in https://pathspider.net eventually.

As an academic, I may need to cite the lists. Is there a way in which you'd like that to be done?

Thanks,

Google Earth blocking

Some countries block Google Earth by blocking http://kh.google.com/

Seems that it should be added to the default global list.

Update CN list from gfwlist

There's a maintained list of cn domains at
https://github.com/gfwlist/gfwlist
that would be worth interacting with, since it is used by many vpn clients to selectively proxy only blocked content.

Categorise MISC websites

Currently I count 317 MISC category codes inside of the country and global testing lists.

It would be great if these sites were categorised. It's not something that needs to be done all at once, but maybe tackled country by country.

These are all the countries that have MISC category codes in them:

Update URL for gay.com

As of 03 August 2017, gay.com has changed ownership. It appears that the website no longer has an HTTPS version, just an HTTP connection that redirects to https://vanguardnow.org/

This causes a problem in the "Web Connectivity" test in ooniprobe, which expects "https://gay.com" to resolve correctly.

Relevant line:
https://github.com/citizenlab/test-lists/blob/master/lists/global.csv#L582

More information:
This was tested and verified by performing

# Checks open ports
nmap gay.com
# Checks HTTPS
curl -v https://gay.com

from 3 locations:

Home internet
School internet (MTSU)
DigitalOcean droplet (NYC)

On all 3 locations, port 443 does not show as open, and performing curl fails to connect using HTTPS

Cross-reference:
https://github.com/TheTorProject/ooniprobe-android/issues/106
ooni/probe-android#107

Consolidate this effort with the berkmancenter/url-lists effort

This is the other flip of the coin ticket for the issue: berkmancenter/url-lists#1.

Much more is missing from List.

Pakistan Blocked much more websites and links. If only these you tested is fine but if this is Total blocked URL list then I can add much more.

Write text for the URL submission wizard

We should describe the workflow of the URL submission wizard as well as the text to be presented on the various steps.

@jakubd @mobilesuit do you have access to the @citizenlab spreadsheets used to build the original lists? If so you should add the content of them into here.

latinmain.com: make sure it still makes sense to test it

From an heads-up email we received as OONI:

It doesn't look like DNS-based censorship to me:
https://uncomocorreo.com/por-que-latinmail-no-funciona-que-hacer-para-recuperar-mi-correo/

Extra consistency checks on category codes and URLs

We have run into the issue, when using the lists in OONI, that some country lists present the following problems:

The same URL present in different country specific lists, presents a different category code

ex.
id.csv:http://denypagetests.netsweeper.com,NEWS,News Media,2014-04-15,citizenlab,
kw.csv:http://denypagetests.netsweeper.com,CTRL,Control content,2014-04-15,citizenlab,

The same URL is present in both the global and the country specific list

ex.
global.csv:http://www.crazyshit.com,PORN,Pornography,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
sg.csv:http://www.crazyshit.com,NEWS,News Media,2014-04-15,citizenlab,

We should add checks to the lint-lists.py script that checks if:

There are inconsistencies in category codes across lists
If a URL is present in the global list it should not also be present in the country specific list

On this second point I would like to hear from @sneft and others to know if this is reasonable or if it's maybe just a OONI specific usage of the lists.

Improve categorisation of URLs

As noted in the bofh test list downloader there are some categories that are missing.

We should discuss how to expand the list of categories and perhaps also how to integrate automatic categorisation from third parties such as OpenDNS, Alexa, BlueCoat.

Convert whiteboard sketched out at CSI15 into tickets and milestones

Here is a dump of the content of the whiteboard that should become a roadmap for the next 6-12 months of development:

Questions that should be answered via documentation

How many countries have public block-lists?
What countries are we missing test lists for?

Things we would like to have

IPs of block pages
Government issued block-lists for the various countries
REST API for getting a subset of testing URLs into your measurement software
URLs used for captive portal testing (ooniprobe has this as part of the captive portal test)

Help wanted

Review current URL for your country (somebody agreed to take care of ZA and I gave them the template as an excel spreadsheet)
Write text for the wizard (@citizenlab can you make available the excel you used for defining the existing lists?)
Translation of text for the wizard

TODO

Define the policy by which URLs are accepted into the test-lists repository
Add tags to URLs for Safe/Not safe (perhaps do this in an automated way by prompting to the user "Do you feel safe testing these URLs and show them a random sample, based on user feedback we can build a metric for safety)
Integrate tags from other third parties (OpenDNS, Alexa, BlueCoat, etc.)

Excess "," at the end of all entries in pk.csv

All entries in pk.csv have an excess trailing comma. Please fix.

Request for access to milestone creation

I would like to be added as a collaborator to this repository so that I can begin creating milestones for it.

using or not www

Hello!

I'm working in a huge revamp of the venezuelan list and i'm a little bit confused in what to do with sites that support www. because in some lists we have sites with, others without and others with both. Which approach do you recommend for this repo?

I am almost ready to upload the new list but this is stopping me.

Thank you very much!

Missing comma's in ae.csv

The last two entries in ae.csv are missing the final comma for the 'notes' column. Please correct these.

Review the URL list for your country

Are you interested in helping us out in making censorship measurement work better in your country?

Here is what you should do:

Look to see if we already have a test-list for your country.

To figure out what is the the two letter country code for your country see the country code mappings.

Then find the file called XY.csv (where XY is your countries two letter code) in here: https://github.com/citizenlab/test-lists/tree/master/lists

Review the list of URLs we test by adding new ones or creating a new file for your country.

Currently this process requires a bit of technical knowledge. If you want to contribute, but don't know git just send an email to art torproject org or comment this ticket with the suggested changes.

Add urls blocked in Spain related to Catalan referendum

Since last week, the Spanish government has been running after websites hosting information about the Catalan referendum that will be hold next 1st October. They started blocking the main website referendum.cat by forcing Spanish operators to hijack the domain name:

$ host referendum.cat
referendum.cat is an alias for paginaintervenida.edgesuite.net.
paginaintervenida.edgesuite.net is an alias for a1836.b.akamai.net.
a1836.b.akamai.net has address 2.16.65.170
a1836.b.akamai.net has address 2.16.65.152

At the beginning an error was shown because they didn't configure properly the webserver to handle the redirection. Now it shows the following image:

The day after it happened, someone uploaded the content of the website to github (https://github.com/GrenderG/referendum_cat_mirror) and more mirrors have appeared online.

Again, they have been running after these mirrors to the extent that they have blocked the whole gateway.ipfs.io domain.

A list of the blocked domains (may be incomplete) can be found here
1-o_catalan_referendum_blocked_websites_in_spain.txt

The virtual harassment has been paralleled with physical harassment. Read what EFF wrote about it at https://www.eff.org/deeplinks/2017/09/cat-domain-casualty-catalonian-independence-crackdown

Also interesting reading about technical measures taken beforehand to prevent DDoS and censorship: https://medium.com/@josepot/is-sensitive-voter-data-being-exposed-by-the-catalan-government-af9d8a909482

suggested category:GLAM

"GLAM", recognised as "Galleries, Libraries Archive and Museums", is a category that is suggested to be added to what's currently possible, as in 00-LEGEND-new_category_codes.csv

The need became obvious to me when trying to categorise https://web.archive.org, which is now blocked in Egypt

Discuss possibility of including application specific addresses

There are some cases where we want to have lists of some addresses (or strings) that are used specifically in the context of some application.

I believe the most logical place to put these is inside of a dedicated "global" list, but it should be placed in the context of the software using it.

Here are some examples of the sorts of addresses I am talking about:

The pluggable transport addresses (and in some cases shared secrets or hash fingerprints) of the Tor bridges shipped as part of Tor Browser Bundle release X
The IP and ports (and possibly fingerprints) or Tor directory authorities that are used for discovery of the Tor network by "vanilla" tor (country that block access to the Tor network will block these, but may not be blocking also the bridges)
The list of IP and ports used by a $INSERT_POPULAR_INTERNET_CHAT_APP application (I'm thinking whatsapp, viber, etc.)
The list of IP, port and protocol by the various VPN providers (it can be useful to check if these are being blocked)
List of IPs of open DNS resolvers per country

I think you get an idea of the sorts of data I am talking about.

I would propose we create the following new directory structure:

lists
├── services
│   ├── tor
│   │   ├── bridges.csv
│   │   └── directory_authorities.csv
│   └── vpn
etc.

Each service specific namespace can have it's own format, maintaing some common elements such as: name, date_added, data_format_version, source, notes.

@sneft, @chokepoint-project, @mobilesuit, @jakubd what do you think?

Icons for the new citizenlab category codes

I was working on some data viz for OONI and as part of that I came up with an icon set for each of the citizenlab category codes.

Most of the icons are taken from either font-awesome or material design icons (which both use SIL Open Font License) and some are designed by OONI (mostly based on stacking font-awesome icons). @bnvk do you think it would be useful to add the OONI made ones to @opensourcedesign?

Feedback an input on the iconset is greatly appreciated. Also if it would be useful I can possibly add these to the repository itself, though maybe that will just increase the size of the repo and maybe most people don't care about the icons.

Move Gay Star News to global list?

http://www.gaystarnews.com/,LGBT,LGBT,2017-10-15,citizenlab,

Noticed that this UK site providing global news is duplicated across several countries, and needs to be added in my.csv list where it is blocked.

Probably better to remove from country lists and move to global.csv?

Define the policy for adding URLs to this repository

Currently we don't have a clear policy that allows us to evaluate if a certain set of URLs should or should not be merged.

We should consult some lawyers that have a good understanding of Canadian law to help us draft a good policy.

Add list of URLs to test for censorship in Australia

I'm not sure what the process is for getting something added.
Would I just add the domain to the AU test list?

Add IPs of blockpages

@citizenlab should look to see if it's possible to extract the IPs of blockpages from the source data set for: https://github.com/citizenlab/blockpages.

Add URLs for captive portal tests

Some people were interested in having the list of URLs that some vendors use to detect the presence of captive portals.

We have this information inside of the ooniprobe captive portal test.

I will paste the important text in here:

# This test is a collection of tests to detect the presence of a
# captive portal. Code is taken, in part, from the old ooni-probe,
# which was written by Jacob Appelbaum and Arturo Filastò.
#
# This module performs multiple tests that match specific vendor captive
# portal tests. This is a basic internet captive portal filter tester written
# for RECon 2011.
#
# Read the following URLs to understand the captive portal detection process
# for various vendors:
#
# http://technet.microsoft.com/en-us/library/cc766017%28WS.10%29.aspx
# http://blog.superuser.com/2011/05/16/windows-7-network-awareness/
# http://isc.sans.org/diary.html?storyid=10312&
# http://src.chromium.org/viewvc/chrome?view=rev&revision=74608
# http://code.google.com/p/chromium-os/issues/detail?3281ttp,
# http://crbug.com/52489
# http://crbug.com/71736
# https://bugzilla.mozilla.org/show_bug.cgi?id=562917
# https://bugzilla.mozilla.org/show_bug.cgi?id=603505
# http://lists.w3.org/Archives/Public/ietf-http-wg/2011JanMar/0086.html
# http://tools.ietf.org/html/draft-nottingham-http-portal-02
#

        vendor_tests = [['http://www.apple.com/library/test/success.html',
                         'Success',
                         '200',
                         'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3',
                         'Apple HTTP Captive Portal'],
                        ['http://tools.ietf.org/html/draft-nottingham-http-portal-02',
                         '428 Network Authentication Required',
                         '428',
                         'Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0',
                         'W3 Captive Portal'],
                        ['http://www.msftncsi.com/ncsi.txt',
                         'Microsoft NCSI',
                         '200',
                         'Microsoft NCSI',
                         'MS HTTP Captive Portal', ]]

Add "safety" column to the test-lists

Add tags to URLs for Safe/Not safe or an extra column that provides a level of safety for that certain URL.

It was suggested by @josswright that we could also do this in an automated way by prompting the user before doing a scan "Do you feel safe testing these URLs":

URL 1 - yes/no
URL 2 - yes/no
etc.

This user feedback can then be submitted to us in a safe way and used to defined a metric of safety or perceived safety (perhaps even on a country by country basis).

Can source be anon?

We have a request from a country partner, that wishes to not be credited as direct source due to the sensitive nature of some of the urls.

Should we put the source as anon? Or should an intermediary that is not at risk such as sinarproject or ooni be put in there instead?

Regards

Add some URLs from the official Germany blocklist to the de.csv list

https://tor.stackexchange.com/q/14006

Germany censors 3.000 domains, this censorship list even got leaked here: https://bpjmleak.neocities.org/

"description" field

It's hard for list maintainers to figure out what some of the test urls are as well as when writing reports manually or generated.

A one line description field would be helpful for maintainers and users of the list (human or machine) to able to to know what the site is about especially if it's in global list or for another country unfamiliar to the maintainer

eg.

"The Citizen Lab is an interdisciplinary laboratory based at the Munk School of Global Affairs at the University of Toronto, Canada."
"The Coalition for Clean and Fair Elections or Bersih is a coalition of non-governmental organisations which seeks to reform the current electoral system in Malaysia to ensure free, clean and fair elections."

We might want to have a title field also.

Blocked Websites in South Korea

Here are some websites that are blocked by the South Korean government:

http://www.kcna.kp/ (North Korean website)
http://www.marijuanatravels.com/ (Drugs)

Blocked websites are redirected to http://warning.or.kr/