citizenlab / test-lists
URL testing lists intended for discovering website censorship
Are you aware of a government issued blocklist that is available for download?
Then add a link pointing to where it can be downloaded (even better if you specify whether it's legal to mirror/publish it) and what country it applies to.
If you want to go the extra mile, then you should write the parser code and submit a pull request ;)
seems like Instagram is blocked in DPRK
https://news.vice.com/article/north-korea-has-blocked-instagram
We should make a list of what languages we want to support in the first release (1.0.0) and then make a call for translators and/or use transifex.
Design and implement a page that will tell potential contributors what countries need to have their URL list created or reviewed.
I noticed that some URLs contain the byte 0x9d at the end. This triggers a parsing error when I try to decode them to Unicode.
Can I just remove them?
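A minimal sketch of one way to do this, assuming the lists are read as raw bytes and the stray 0x9d only ever appears as trailing junk:

```python
def clean_url(raw: bytes) -> str:
    """Drop stray 0x9d bytes (a common cp1252/UTF-8 mixup artifact)
    before decoding; everything else must still be valid UTF-8."""
    cleaned = bytes(b for b in raw if b != 0x9D)
    return cleaned.decode("utf-8")

print(clean_url(b"http://example.com/\x9d"))  # http://example.com/
```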
Inside the test lists there is a file for an unknown country code: https://github.com/citizenlab/test-lists/blob/master/csv/cis.csv.
From the content of the file, it seems to apply to some Russian-speaking country.
url,category_code,category_description,date_added,source,notes
https://www.cambodiadaily.com/,NEWS,News and media,2016-09-17,cchr,Independent online news website
http://khmer.voanews.com/,NEWS,News,2016-09-17,cchr,Independent online news website
http://cchrcambodia.org/,HUMR,Human Rights,2016-09-17,cchr,Human Rights Organization
http://www.cpp.org.kh/,POLR,Political Party,2016-09-17,cchr,Current ruling political party
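A stray double comma like the one in the last row above shifts every following field by one. A sketch of a lint check that catches this kind of row (the six-column format is taken from the header above):

```python
import csv
import io

EXPECTED_COLUMNS = 6  # url, category_code, category_description, date_added, source, notes

def find_bad_rows(csv_text: str):
    """Return (line_number, field_count) for data rows that do not
    have exactly six fields."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip header
    bad = []
    for lineno, row in enumerate(reader, start=2):
        if len(row) != EXPECTED_COLUMNS:
            bad.append((lineno, len(row)))
    return bad

sample = (
    "url,category_code,category_description,date_added,source,notes\n"
    "http://cchrcambodia.org/,HUMR,Human Rights,2016-09-17,cchr,Human Rights Organization\n"
    "http://www.cpp.org.kh/,POLR,Political Party,2016-09-17,,cchr,Current ruling political party\n"
)
print(find_bad_rows(sample))  # the double comma gives the last row seven fields
```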
Handle.net is a provider of persistent URLs and is commonly used for institutional repositories (similar to DOIs). I am the administrator of a large open access repository of agricultural outputs and we recently noticed that the Handles do not work in Iran.
Here is an example of a persistent Handle link to an item on our repository: http://hdl.handle.net/10568/91553
I would submit a pull request with the URL in the Iran country list, but I'm not sure what to classify it as.
It would be cool if as part of the CI process we also archived pages via http://archive.is/ as they are added.
This way if the site goes down or is self-censored, we still have an archived copy of it (and we get to know what the site was about when it was still accessible).
archive.is supports something called the memento API that we can use to automate this: http://mementoweb.org/depot/native/archiveis/.
In #38 (comment) we discussed the pros and cons of including HTTPS URLs in the testing lists.
This is a debate we've had internally a number of times. It's obviously inefficient to repeatedly test the full path of HTTPS URLs for the same domain, but at the same time we didn't want to make any initial assumptions in constructing the lists that would limit their usefulness in the long run. In other words, trying to future-proof the lists, leave open the possibility that someone is being MITM'd, etc.
We generally have settled on being exhaustive (in some cases testing both HTTP/HTTPS versions) as the performance impact wasn't our primary consideration.
One middle ground may be to include a small sample of HTTPS full-path URLs (e.g. a handful of full HTTPS URLs of sensitive Twitter accounts or YouTube videos) while testing the HTTP version of the rest.
One thing to take into consideration is the fact that if you only do testing for the HTTPS version of the site you will not catch cases in which the specific path is being filtered by a transparent proxy that doesn't do MITM.
I guess in the end it's up to the tool that does the measurements to take into account that it should test the URL in both its HTTPS and HTTP versions.
That said, if we decide to include the HTTPS versions of websites in the URLs to test, it should be consistent across all of them: if a site supports HTTPS, the URL should always be listed using https. It will then be up to the tool consuming the testing list to test the URL in both its HTTPS and HTTP versions.
Do we agree that the convention to adopt is "when the site supports HTTPS, list it in HTTPS", and that it should be up to the tool using the list to figure out that it should ALSO test HTTP when testing a URL?
If so, I can apply batch changes to all the lists so that the URLs of sites that also support HTTPS are prefixed with https.
Changes will also be made to ooniprobe to support testing for HTTP AND HTTPS when it finds HTTPS URLs.
I think this is also relevant to @rpanah for centinel testing.
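The proposed convention ("list HTTPS when the site supports it, and let the measurement tool also test HTTP") could be sketched on the consumer side like this; the function name is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit

def urls_to_measure(listed_url: str) -> list:
    """Expand a listed URL into the set of URLs a measurement tool
    should test. A URL listed as https:// implies the site supports
    HTTPS, so both the HTTPS and HTTP variants get tested."""
    parts = urlsplit(listed_url)
    if parts.scheme == "https":
        http_variant = urlunsplit(("http",) + tuple(parts[1:]))
        return [listed_url, http_variant]
    return [listed_url]

print(urls_to_measure("https://twitter.com/example"))
```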
I am thinking that it would be valuable to have an extra column to annotate in a machine readable way properties of the URLs.
The use-case of this is, for example, instead of removing dead URLs that should no longer be tested, simply adding an inactive tag; or for URLs that are generated as part of some automated tooling, marking those as autogenerated; etc.
I think having a generic tags column is sufficiently flexible to support many use-cases.
Thoughts?
Does this create parsing issues for people consuming these test lists?
I believe this would also make it easier to integrate the @berkmancenter effort (see: #236)
cc @sneft @agrabeli @jakubd @rpanah @jdcc
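On the parsing question: as long as consumers read fields by name rather than by position, an extra column should be backward compatible. A sketch, assuming a hypothetical whitespace-separated tags field:

```python
import csv
import io

def load_entries(csv_text: str):
    """Parse a test list, tolerating an optional extra `tags` column.
    csv.DictReader maps only the columns named in the header, so
    consumers reading fields by name keep working either way."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Split the (hypothetical) whitespace-separated tags field, if any.
        row["tags"] = (row.get("tags") or "").split()
        entries.append(row)
    return entries

with_tags = (
    "url,category_code,category_description,date_added,source,notes,tags\n"
    "http://example.com/,MISC,Miscellaneous,2017-01-01,ooni,,inactive autogenerated\n"
)
print(load_entries(with_tags)[0]["tags"])
```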
Tracking issue for development of an API for extracting URLs for specific categories and/or countries.
It looks like there are two ir.csv files. One of them is in the root directory, and the other one is in the lists directory. Furthermore, they appear to be out of sync.
Sarawak Report uses AWS auto assigned DNS for cloud host servers as a way to temporarily overcome DNS blocks on their main server. These addresses eventually get added to Malaysia's official block lists.
What's the best way of tracking these domains? Should we add them to the my.csv country list too?
At TorDev 2016 it was suggested to OONI that instead of creating criteria for URL risk assessment, perhaps we could consider white-listing and black-listing URLs per country?
Whitelists can include URLs that are commonly accessed and present low risk (e.g. Alexa top 100), while blacklists can include pornography, hate speech, and other objectionable categories.
My main concern with this suggestion is that many URLs might not clearly fall in a "white" or "black" list, or it might not be clear/obvious if they should be white-listed or black-listed. In such cases, would we be creating even more risk for users (if, for example, we have white-listed a URL which is actually quite risky to test)? Furthermore, would this inevitably lead to fewer tests for URLs that are riskier and possibly more interesting to test?
I have drafted a proposed set of revised content category codes which would replace the existing codes. The proposed set of codes are here:
https://github.com/citizenlab/test-lists/blob/master/lists/00-proposed-category_codes.csv
The aim here has been to streamline the categories, collapsing redundant categories and revising categories which were frequently confusing for list-creators. The goal was simplicity - fewer categories will be easier to use, prevent duplication and ideally limit miscategorization.
Alternative approaches considered were to adopt an existing category scheme wholesale (such as Netsweeper or Blue Coat). These generally have many more categories than the 30 proposed here, which can be a pro or a con depending on your needs.
Please feel free to offer feedback here. This list format is very much building on the prior ONI scheme, so I would love to hear comments based on alternative uses for the lists.
I realize that there is an intention to keep some URLs that are dead (404ing, parked pages, etc.) on the local lists, because this repo is intended to measure censorship, which may sometimes last longer than the site itself. I believe that this logic does not hold as well when talking about the global list. Is there any objection to checking the consistency of the global list and making sure that all the global list URLs are active, non-dead URLs?
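A hedged sketch of such a consistency check, using the standard library rather than whatever HTTP stack the CI actually runs; which status codes count as "dead" is an assumption:

```python
import urllib.request
import urllib.error

DEAD_STATUS = {404, 410}  # assumption: which statuses count as "dead"

def classify_status(status: int) -> str:
    """Map an HTTP status code to a liveness verdict."""
    return "dead" if status in DEAD_STATUS else "alive"

def check_url(url: str, timeout: float = 10.0) -> str:
    """Classify a URL as 'alive', 'dead', or 'unreachable' (network sketch)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as e:
        return classify_status(e.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"

print(classify_status(404))  # dead
```

Parked pages would still slip through a status-code check and need content heuristics on top.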
Currently the README does not explain how to use the code in this repository to update the URL lists nor how to add new URLs.
There are several different category codes (other than PORN) that could arguably include pornography, notably
Provocative Attire,PROV,PROV,"Websites which show provocative attire and portray women in a sexual manner, wearing minimal clothing. "
Sex Education,XED,XED,"Includes contraception, abstinence, STDs, healthy sexuality, teen pregnancy, rape prevention, abortion, sexual rights, and sexual health services."
Online Dating,DATE,DATE,"Online dating services which can be used to meet people, post profiles, chat, etc"
Media sharing,MMED,MMED,"Video, audio or photo sharing platforms."
The LGBT category, however, is alone in having a description that specifies that it excludes pornography:
LGBT,LGBT,GAYL,A range of gay-lesbian-bisexual-transgender queer issues. (Excluding pornography)
This seems inconsistent. I propose that either "Excluding pornography" be added to the descriptions of PROV, XED, DATE, and MMED, or that it be removed from the description of LGBT.
Hi, I'm adding support for using these lists in https://github.com/irl/hellfire which will end up in https://pathspider.net eventually.
As an academic, I may need to cite the lists. Is there a way in which you'd like that to be done?
Thanks,
Some countries block Google Earth by blocking http://kh.google.com/
Seems that it should be added to the default global list.
There's a maintained list of cn domains at
https://github.com/gfwlist/gfwlist
that would be worth interacting with, since it is used by many vpn clients to selectively proxy only blocked content.
Currently I count 317 MISC category codes inside of the country and global testing lists.
It would be great if these sites were categorised. It's not something that needs to be done all at once, but maybe tackled country by country.
These are all the countries that have MISC category codes in them:
As of 03 August 2017, gay.com has changed ownership. It appears that the website no longer has an HTTPS version, just an HTTP connection that redirects to https://vanguardnow.org/
This causes a problem in the "Web Connectivity" test in ooniprobe, which expects "https://gay.com" to resolve correctly.
Relevant line:
https://github.com/citizenlab/test-lists/blob/master/lists/global.csv#L582
More information:
This was tested and verified by performing
# Checks open ports
nmap gay.com
# Checks HTTPS
curl -v https://gay.com
from 3 locations:
On all 3 locations, port 443 does not show as open, and curl fails to connect using HTTPS.
Cross-reference:
https://github.com/TheTorProject/ooniprobe-android/issues/106
ooni/probe-android#107
This is the flip side of the coin for the issue: berkmancenter/url-lists#1.
Pakistan has blocked many more websites and links. If these are only the ones you tested, that's fine, but if this is meant to be the total blocked URL list then I can add many more.
We should describe the workflow of the URL submission wizard as well as the text to be presented on the various steps.
@jakubd @mobilesuit do you have access to the @citizenlab spreadsheets used to build the original lists? If so you should add the content of them into here.
From a heads-up email we received as OONI:
It doesn't look like DNS-based censorship to me:
https://uncomocorreo.com/por-que-latinmail-no-funciona-que-hacer-para-recuperar-mi-correo/
We have run into the issue, when using the lists in OONI, that some country lists present the following problems:
e.g.
id.csv:http://denypagetests.netsweeper.com,NEWS,News Media,2014-04-15,citizenlab,
kw.csv:http://denypagetests.netsweeper.com,CTRL,Control content,2014-04-15,citizenlab,
e.g.
global.csv:http://www.crazyshit.com,PORN,Pornography,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
sg.csv:http://www.crazyshit.com,NEWS,News Media,2014-04-15,citizenlab,
We should add checks to the lint-lists.py script that check whether:
On this second point I would like to hear from @sneft and others to know if this is reasonable or if it's maybe just an OONI-specific usage of the lists.
As noted in the bofh test list downloader there are some categories that are missing.
We should discuss how to expand the list of categories and perhaps also how to integrate automatic categorisation from third parties such as OpenDNS, Alexa, BlueCoat.
Here is a dump of the content of the whiteboard that should become a roadmap for the next 6-12 months of development:
All entries in pk.csv have an excess trailing comma. Please fix.
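A sketch of a batch fix for rows like these (naive comma split; it assumes the affected rows contain no quoted fields with embedded commas, which would need the csv module instead):

```python
def strip_excess_trailing_comma(line: str) -> str:
    """Drop one trailing empty field from a seven-field row, restoring
    the standard six-column test-list format."""
    fields = line.rstrip("\n").split(",")
    if len(fields) == 7 and fields[-1] == "":
        fields = fields[:-1]
    return ",".join(fields)

print(strip_excess_trailing_comma(
    "http://example.pk/,NEWS,News Media,2014-04-15,citizenlab,,"
))
```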
I would like to be added as a collaborator to this repository so that I can begin creating milestones for it.
Hello!
I'm working on a major revamp of the Venezuelan list and I'm a little confused about what to do with sites that support www., because in some lists we have sites with it, others without, and others with both. Which approach do you recommend for this repo?
I am almost ready to upload the new list but this is stopping me.
Thank you very much!
The last two entries in ae.csv are missing the final comma for the 'notes' column. Please correct these.
Are you interested in helping us out in making censorship measurement work better in your country?
Here is what you should do:
To figure out the two-letter country code for your country, see the country code mappings.
Then find the file called XY.csv (where XY is your country's two-letter code) in here: https://github.com/citizenlab/test-lists/tree/master/lists
Currently this process requires a bit of technical knowledge. If you want to contribute but don't know git, just send an email to art torproject org or comment on this ticket with the suggested changes.
Since last week, the Spanish government has been going after websites hosting information about the Catalan referendum that will be held on 1 October. They started blocking the main website referendum.cat by forcing Spanish operators to hijack the domain name:
$ host referendum.cat
referendum.cat is an alias for paginaintervenida.edgesuite.net.
paginaintervenida.edgesuite.net is an alias for a1836.b.akamai.net.
a1836.b.akamai.net has address 2.16.65.170
a1836.b.akamai.net has address 2.16.65.152
At the beginning an error was shown because they didn't properly configure the webserver to handle the redirection. Now it shows the following image:
The day after it happened, someone uploaded the content of the website to github (https://github.com/GrenderG/referendum_cat_mirror) and more mirrors have appeared online.
Again, they have been going after these mirrors, to the extent that they have blocked the whole gateway.ipfs.io domain.
A list of the blocked domains (may be incomplete) can be found here
1-o_catalan_referendum_blocked_websites_in_spain.txt
The virtual harassment has been paralleled with physical harassment. Read what EFF wrote about it at https://www.eff.org/deeplinks/2017/09/cat-domain-casualty-catalonian-independence-crackdown
Also interesting reading about technical measures taken beforehand to prevent DDoS and censorship: https://medium.com/@josepot/is-sensitive-voter-data-being-exposed-by-the-catalan-government-af9d8a909482
"GLAM" (Galleries, Libraries, Archives and Museums) is a category that I suggest adding to what's currently possible, as in 00-LEGEND-new_category_codes.csv.
The need became obvious to me when trying to categorise https://web.archive.org, which is now blocked in Egypt.
There are some cases where we want to have lists of some addresses (or strings) that are used specifically in the context of some application.
I believe the most logical place to put these is inside of a dedicated "global" list, but it should be placed in the context of the software using it.
Here are some examples of the sorts of addresses I am talking about:
I think you get an idea of the sorts of data I am talking about.
I would propose we create the following new directory structure:
lists
└── services
    ├── tor
    │   ├── bridges.csv
    │   └── directory_authorities.csv
    └── vpn
etc.
Each service-specific namespace can have its own format, maintaining some common elements such as: name, date_added, data_format_version, source, notes.
@sneft, @chokepoint-project, @mobilesuit, @jakubd what do you think?
I was working on some data viz for OONI and as part of that I came up with an icon set for each of the citizenlab category codes.
Most of the icons are taken from either font-awesome or material design icons (which both use SIL Open Font License) and some are designed by OONI (mostly based on stacking font-awesome icons). @bnvk do you think it would be useful to add the OONI made ones to @opensourcedesign?
Feedback and input on the iconset is greatly appreciated. Also, if it would be useful, I can add these to the repository itself, though maybe that would just increase the size of the repo, and maybe most people don't care about the icons.
http://www.gaystarnews.com/,LGBT,LGBT,2017-10-15,citizenlab,
Noticed that this UK site providing global news is duplicated across several countries, and needs to be added to the my.csv list where it is blocked.
Probably better to remove it from the country lists and move it to global.csv?
Currently we don't have a clear policy that allows us to evaluate if a certain set of URLs should or should not be merged.
We should consult some lawyers that have a good understanding of Canadian law to help us draft a good policy.
I'm not sure what the process is for getting something added.
Would I just add the domain to the AU test list?
@citizenlab should look to see if it's possible to extract the IPs of blockpages from the source data set for: https://github.com/citizenlab/blockpages.
Some people were interested in having the list of URLs that some vendors use to detect the presence of captive portals.
We have this information inside of the ooniprobe captive portal test.
I will paste the important text in here:
# This test is a collection of tests to detect the presence of a
# captive portal. Code is taken, in part, from the old ooni-probe,
# which was written by Jacob Appelbaum and Arturo Filastò.
#
# This module performs multiple tests that match specific vendor captive
# portal tests. This is a basic internet captive portal filter tester written
# for RECon 2011.
#
# Read the following URLs to understand the captive portal detection process
# for various vendors:
#
# http://technet.microsoft.com/en-us/library/cc766017%28WS.10%29.aspx
# http://blog.superuser.com/2011/05/16/windows-7-network-awareness/
# http://isc.sans.org/diary.html?storyid=10312&
# http://src.chromium.org/viewvc/chrome?view=rev&revision=74608
# http://code.google.com/p/chromium-os/issues/detail?3281ttp,
# http://crbug.com/52489
# http://crbug.com/71736
# https://bugzilla.mozilla.org/show_bug.cgi?id=562917
# https://bugzilla.mozilla.org/show_bug.cgi?id=603505
# http://lists.w3.org/Archives/Public/ietf-http-wg/2011JanMar/0086.html
# http://tools.ietf.org/html/draft-nottingham-http-portal-02
#
vendor_tests = [['http://www.apple.com/library/test/success.html',
'Success',
'200',
'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3',
'Apple HTTP Captive Portal'],
['http://tools.ietf.org/html/draft-nottingham-http-portal-02',
'428 Network Authentication Required',
'428',
'Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0',
'W3 Captive Portal'],
['http://www.msftncsi.com/ncsi.txt',
'Microsoft NCSI',
'200',
'Microsoft NCSI',
'MS HTTP Captive Portal', ]]
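A hedged sketch of how a tool might drive the vendor_tests entries above, using urllib in place of whatever HTTP stack ooniprobe actually uses: a vendor check passes when both the status code and the expected body string match, and anything else hints at a captive portal.

```python
import urllib.request

def matches_expectation(body: str, status: int,
                        expected_body: str, expected_status: str) -> bool:
    """True when the response looks like the vendor's real endpoint."""
    return str(status) == expected_status and expected_body in body

def run_vendor_test(url, expected_body, expected_status, user_agent, name,
                    timeout=10.0):
    """Fetch one vendor URL and compare against expectations (network sketch)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return name, matches_expectation(body, resp.status,
                                             expected_body, expected_status)
    except OSError:
        return name, False

# usage: for test in vendor_tests: print(run_vendor_test(*test))
```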
Add tags to URLs for Safe/Not safe or an extra column that provides a level of safety for that certain URL.
It was suggested by @josswright that we could also do this in an automated way by prompting the user before doing a scan "Do you feel safe testing these URLs":
URL 1 - yes/no
URL 2 - yes/no
etc.
This user feedback can then be submitted to us in a safe way and used to define a metric of safety or perceived safety (perhaps even on a country-by-country basis).
We have a request from a country partner who wishes not to be credited as the direct source due to the sensitive nature of some of the URLs.
Should we put the source as anon? Or should an intermediary that is not at risk such as sinarproject or ooni be put in there instead?
Regards
https://tor.stackexchange.com/q/14006
Germany censors 3,000 domains; this censorship list was even leaked here: https://bpjmleak.neocities.org/
It's hard for list maintainers to figure out what some of the test URLs are, as well as when writing reports, whether manual or generated.
A one-line description field would help maintainers and users of the list (human or machine) know what a site is about, especially if it's in the global list or in a list for a country unfamiliar to the maintainer.
e.g.
We might want to have a title field also.
Here are some websites that are blocked by the South Korean government:
http://www.kcna.kp/ (North Korean website)
http://www.marijuanatravels.com/ (Drugs)
Blocked websites are redirected to http://warning.or.kr/