heuer / cablemap
Cablemap - WikiLeaks Cablegate parser and Topic Maps converter
License: BSD 3-Clause "New" or "Revised" License
FRONT.
UNCLASSIFIED
PAGE 04 HO CHI 00197 280506Z
RAY
UNCLASSIFIED
E.g.: 09BRUSSELS1332 results in U.S._Embassy_Brussels, which is wrong since the origin is US EU Brussels
282f7b9
Added several missing subjects, but other publications provide the subjects, e.g.
06COPENHAGEN336: http://aebr.home.xs4all.nl/wl/aftenposten/06COPENHAGEN336.html
In a previous release, Cablemap assumed that 06SAOPAULO348 has three references, cf.
e9b4aed
The cable has changed and there is no (C) anymore.
Check if the cable should contain a reference to 05 Sao Paulo 975
02ROME1196:
REF: A. STATE 40721
CONFIDENTIAL
PAGE 02 ROME 01196 01 OF 02 082030Z B. ROME 1098 C. ROME 894 D. MYRIAD POST-DEPARTMENT E-MAILS FROM 10/01-02/02 E. ROME 348
Result:
>>> cable.references
[u'02STATE40721']
Should be:
[u'02STATE40721', u'02ROME1098', u'02ROME894', u'02ROME348']
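A minimal sketch of how the enumerated entries could be picked up even when a PAGE header interrupts the REF section. The function name `references_from_ref_text` and the regular expression are assumptions for illustration, not Cablemap's actual code. It only handles upper-case origins, takes the year from the referencing cable when none is given, and returns plain `str` values instead of `u''` literals.

```python
import re

# Matches e.g. "A. STATE 40721" or "C. 05 SAO PAULO 975": an enumerator,
# an optional two-digit year, an upper-case origin and the cable number.
_RAW_REF = re.compile(r'\b[A-Z]\.\s+(?:(\d{2})\s+)?([A-Z]+(?: [A-Z]+)*)\s+(\d+)\b')


def references_from_ref_text(ref_text, cable_id):
    """Return normalized reference ids found in the raw REF text.

    Hypothetical helper: the year defaults to the first two characters
    of the referencing cable's id.
    """
    default_year = cable_id[:2]
    refs = []
    for year, origin, number in _RAW_REF.findall(ref_text):
        # Drop blanks inside the origin and leading zeros of the number
        refs.append('%s%s%d' % (year or default_year,
                                origin.replace(' ', ''),
                                int(number)))
    return refs
```

Entries such as "D. MYRIAD POST-DEPARTMENT E-MAILS FROM 10/01-02/02" are skipped because they do not end in an origin followed by a number.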
The extraction algorithm for the recipients is incomplete:
U.S. Department of State Foreign Affairs Handbook Volume 5 Handbook 1—Correspondence
5 FAH-1 H-230
[...] c. Use existing collective addressees to obviate listing a long list of
addressees on an outgoing telegram. When modifying a collective
address, follow the collective address with a comma, space, "XMT," and
the name(s) of the posts in the collective that will not receive the
telegram. [...]
The reader simply removes the XMT and treats the name of the embassy as a recipient, although XMT marks the posts that will not receive the telegram.
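A rough sketch of how a collective address with an XMT clause could be split into the collective and the excluded posts. `split_collective` is a hypothetical helper, and the assumption that several excluded posts are separated by commas is mine; the handbook excerpt above does not say how they are delimited.

```python
import re


def split_collective(addressee):
    """Split 'COLLECTIVE, XMT POST A, POST B' into (collective, excluded).

    Assumption: excluded posts are comma-separated after the XMT marker.
    """
    # Split once at the standalone XMT marker, swallowing adjacent commas
    parts = re.split(r',?\s*\bXMT\b,?\s*', addressee, maxsplit=1)
    collective = parts[0].strip()
    if len(parts) == 1:
        # No XMT clause: nothing is excluded
        return collective, []
    excluded = [post.strip() for post in parts[1].split(',') if post.strip()]
    return collective, excluded
```

With something like this, the excluded posts could be kept apart from the recipients instead of being added to them.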
The cable has an incomplete REF section:
SENSITIVE
SIPDIS
E.O. 12958: N/A TAGS: PGOV PHUM KCRM SOCI SNAR ASEC BR
(B) Sao Paulo 215;
(C) 05 Sao Paulo 975
Cables like 06GENEVA1673 contain ""
These should be removed in advance by reader.py
06BUENOSAIRES579 subject contains
US-
LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US-
URUGUAYAN
the resulting subject looks like this:
US- LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US- URUGUAYAN
Would be nice if the reader removed the whitespace after the "-":
US-LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US-URUGUAYAN
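One possible way to do this, as a sketch: `join_hyphen_breaks` is a made-up name, and the rule (only touch whitespace that follows a hyphen attached to a word) is an assumption about what is safe. A free-standing " - " or " -- " is left alone.

```python
import re


def join_hyphen_breaks(subject):
    """Remove whitespace after a hyphen that is attached to a word."""
    # (?<=\w-) requires "word character + hyphen" directly before the gap,
    # (?=\w) requires a word character directly after it.
    return re.sub(r'(?<=\w-)\s+(?=\w)', '', subject)
```

Applied to the subject above, 'US- LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US- URUGUAYAN' becomes 'US-LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US-URUGUAYAN'.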
parse_references omits the enumeration information: A) 09FOO12 B) 07BAR34 becomes [u'09FOO12', u'07BAR34']. Might be useful to return tuples: [(u'A', u'09FOO12'), (u'B', u'07BAR34')]
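A sketch of what a tuple-returning variant might look like. `parse_enumerated_references` is a hypothetical name and not part of Cablemap; it assumes the reference ids are already normalized (two-digit year, upper-case origin, number) and that the enumerator is a single letter followed by ")" or ".".

```python
import re

# Enumerator like "A)", "(A)" or "A." followed by a normalized cable id
_ENUM_REF = re.compile(r'\(?\b([A-Z])[).]\s*(\d{2}[A-Z]+\d+)')


def parse_enumerated_references(text):
    """Return (enumerator, reference id) tuples in order of appearance."""
    return [(m.group(1), m.group(2)) for m in _ENUM_REF.finditer(text)]
```

Callers that only need the ids could still take the second element of each tuple.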
TODO:
The cable has two ref sections:
SUBJECT: VISAS DONKEY: CORRUPTION 212(F) VISA DENIAL
REF: A. 08 STATE 81854S e c r e t nairobi 001938
Later:
Subject: visas donkey: corruption 212(f) visa denial
Ref: a. 08 state 81854
b. Td-314/014437-09
c. Nairobi 1830
d. Nairobi 1859
e. Nairobi 1831
The parser detects only the first REF section.
Find out if "PARISFR" is the same station as "UNESCOPARISFR". If they are the same the TM-Layer should merge them. Or create a helper topic map which merges them, c.f. https://github.com/heuer/topicmaps/tree/master/cablegate
Would be nice to extract the author of the cable
05SANJOSE2020 is tagged as "ESENV". IMO that's an error, should be "SENV"
If the TO header is missing, the reader raises an exception. Errors like these should result in a WARN log entry and be ignored.
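A small sketch of the lenient behaviour, using the standard `logging` module. The wrapper name, the logger name and the assumption that the strict parser signals the problem with a `ValueError` are all hypothetical; the reader's real exception type may differ.

```python
import logging

# Hypothetical logger name, chosen only for this sketch
logger = logging.getLogger('cablemap.reader.sketch')


def parse_recipients_leniently(header, parse_recipients):
    """Call the strict parser; log a warning and go on if it fails.

    `parse_recipients` stands in for the existing strict function and is
    assumed to raise ValueError when the TO header is missing.
    """
    try:
        return parse_recipients(header)
    except ValueError as ex:
        logger.warning('Cannot extract the recipients: %s', ex)
        return []
```

Returning an empty list keeps the rest of the cable usable while the log still records which cables are affected.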
06GENEVA2654, 06GENEVA1673, and 07BERN881 contain the "header" section in the "content" section. search-for-subject(content, max-index=1200) does not work for them.
Setting the max index to 2200 works for the cables mentioned above but delivers wrong results for cables like 10STATE284, which has no subject: the lenient regex still delivers one because SUBJECT can be found in the cable's text.
WL published 08ECTION01OF02MANAMA492 and 08MANAMA492. The first is a malformed version of the latter. Strangely, they have different transmission ids.
cablemap.core.cable_by_id always reads the malformed version since it internally translates the id 08MANAMA492 to 08ECTION01OF02MANAMA492
Result of 04THEHAGUE1717:
[u'04THEHAGUE1701', u'04STATE1', u'04THEHAGUE1701', u'04STATE147536']
should be:
[u'04THEHAGUE1701', u'04STATE147536']
Problem: The cable contains two REF sections. The parser should ignore the first REF section
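A sketch of how the parser could prefer the last REF header when a cable contains more than one. `last_ref_section` is a hypothetical helper; it only locates where the last section starts and does not decide where it ends, which the real parser would still have to do.

```python
import re

# A REF header at the start of a line, in any letter case
_REF_HEADER = re.compile(r'^[ \t]*REF:', re.IGNORECASE | re.MULTILINE)


def last_ref_section(content):
    """Return the text starting at the last REF header, or None."""
    matches = list(_REF_HEADER.finditer(content))
    if not matches:
        return None
    return content[matches[-1].start():].lstrip()
```

Whether "the last one wins" is right for every cable with repeated headers is an open question; it matches the two cases reported so far.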
utils.signer_name has to decide if the woman or man of the couple is meant. Difficult.
E.g. 08TRIPOLI402:
>>> from cablemap.core import cable_from_html; from cablemap.core.utils import cable_page_by_id
>>> cable = cable_from_html(cable_page_by_id('08TRIPOLI402'))
>>> cable.references
[u'08TRIPOLI199', u'08TRIPOLI227', u'08TRIPOLI402']
Should be:
[u'08TRIPOLI199', u'08TRIPOLI227']
Simple solution: Remove self-references?
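The simple solution could look like the following sketch. `without_self_references` is a hypothetical helper name; it keeps the order of the remaining references.

```python
def without_self_references(cable_id, references):
    """Return the references without the cable's own id."""
    return [ref for ref in references if ref != cable_id]
```

For 08TRIPOLI402 this would reduce the list above to the two expected entries.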
The cable has no subject. Acc. to http://aebr.home.xs4all.nl/wl/derspiegel/06MUNICH397.html it should have the subject "bruno's last stand -- first wild bear in 170 years proves too wild for bavaria"
Acc. to http://www.xs4all.nl/~aebr/wl/rusrep/georgia.html 08MOSCOW2426 should have the subject "ENERGY AND THE CONFLICT IN GEORGIA"
Acc. to http://apublica.org/2011/06/08brasilia672/ the cable should have the subject "SOURCES OF GENERATION – ELECTRICITY SERIES #2". The WikiLeaks source subject is empty.
The cable has END SUMMARY but not SUMMARY or START SUMMARY. Further it has no paragraph which starts with "1."
PAGE 02 HARARE 01632 02 OF 02 170806Z
E.O. 12958: DECL: 05/17/11
TAGS: PREL PGOV PINS ETRD ZI CA DA
SUBJECT: FOREIGN MINISTER INFORMS WESTERN
AMBASSADORS OF CABINET DECISION TO REIN IN WAR VETERANS
IRVING
CONFIDENTIAL
>>> from cablemap.core import cable_by_id
>>> cable = cable_by_id('08SCTION02OF02SAOPAULO335')
works but
>>> cable = cable_by_id('08SAOPAULO335')
doesn't work
Extract the Ambassador's comment
08ECTION01OF02MANAMA492 is not a valid cable id
helpers/fix_subjects.py needs the no_subjects.txt file, which is generated by update_helpers.py.
But once fix_subjects has generated the subjects, update_helpers removes the cable ids from no_subjects.txt since the cables have subjects now. The next time it runs, fix_subjects removes all subject fixes from subjects.txt, which is bad.
cablemap.core cannot handle malformed cable IDs
# WikiLeak cable identifiers which are wrong
MALFORMED_CABLE_IDS = {
'08SCTION02OF02SAOPAULO335': u'08SAOPAUL335',
'08SECTION01GF02BISHIEK21': u'08BISHKEK1021',
'09SECTION02OF03QRIPOLI583': u'09TRIPOLI583',
}
>>> cable = cable_by_id('08SCTION02OF02SAOPAULO335')
Traceback (most recent call last):
...
ValueError: Cannot extract the cable's reference id
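One way the lookup could be made tolerant, as a sketch: translate a known malformed id to its canonical form first and validate afterwards. The function name `canonical_cable_id` and the pattern for a valid id (two-digit year, upper-case origin, number) are assumptions; the table below contains a single entry copied from the mapping quoted above.

```python
import re

# One entry copied from the table above; the real table is larger
MALFORMED_CABLE_IDS = {
    '09SECTION02OF03QRIPOLI583': '09TRIPOLI583',
}

# Assumption: a canonical id is a two-digit year, an origin, and a number
_CANONICAL_ID = re.compile(r'^\d{2}[A-Z]+\d+$')


def canonical_cable_id(cable_id):
    """Return the canonical id for a possibly malformed cable id."""
    fixed = MALFORMED_CABLE_IDS.get(cable_id, cable_id)
    if not _CANONICAL_ID.match(fixed):
        raise ValueError('Not a valid cable id: %r' % cable_id)
    return fixed
```

With this shape, both the malformed and the canonical spelling resolve to the same id, and an unknown malformed id such as 08ECTION01OF02MANAMA492 is rejected instead of being passed on.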
IS GAZA DISENGAGEMENT ISRAELI PALESTINIAN AFFAIRS
should probably be treated as:
[IS, GAZA DISENGAGEMENT, ISRAELI PALESTINIAN AFFAIRS]
and not as
[IS, GAZA, DISENGAGEMENT, ISRAELI, PALESTINIAN, AFFAIRS]
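A sketch of how multi-word tags could be kept together, assuming a set of known multi-word tags is available. `split_tags` and the `multiword_tags` argument are hypothetical; where such a list would come from is left open.

```python
def split_tags(tag_line, multiword_tags):
    """Split a TAGS line, keeping known multi-word tags together.

    Greedy: at each position the longest known multi-word tag wins,
    otherwise a single token is emitted.
    """
    tokens = tag_line.split()
    max_len = max((len(tag.split()) for tag in multiword_tags), default=1)
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = ' '.join(tokens[i:i + size])
            if candidate in multiword_tags:
                result.append(candidate)
                i += size
                break
        else:
            # No multi-word tag starts here: emit the single token
            result.append(tokens[i])
            i += 1
    return result
```

Called with the line above and the set {'GAZA DISENGAGEMENT', 'ISRAELI PALESTINIAN AFFAIRS'}, it yields the three tags instead of six single words.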