
cablemap's People

Contributors

heuer

cablemap's Issues

06SAOPAULO348: References incomplete?

In a previous release, Cablemap assumed that 06SAOPAULO348 has three references (cf. e9b4aed).

The cable has since changed and no longer contains a (C) reference.

Check whether the cable should contain a reference to 05 Sao Paulo 975.

Smarter reference detection?

02ROME1196:

REF: A. STATE 40721
 CONFIDENTIAL
PAGE 02 ROME 01196 01 OF 02 082030Z  B. ROME 1098  C. ROME 894  D. MYRIAD POST-DEPARTMENT E-MAILS FROM 10/01-02/02  E. ROME 348

Result:
>>> cable.references
[u'02STATE40721']

Should be:
[u'02STATE40721', u'02ROME1098', u'02ROME894', u'02ROME348']
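
One possible approach, sketched here under assumptions rather than taken from the cablemap code base: scan the whole REF area for enumerated items of the form "X. ORIGIN NUMBER" instead of stopping at the first line break. The function name `find_enumerated_references` and the `default_year` parameter are hypothetical, and the sketch is written for Python 3, so the results are plain strings without the `u` prefix.

```python
import re

# Enumerated item: a letter, "." or ")", an optional two-digit year,
# an origin made of upper-case words, and a serial number.
_ENUM_REF = re.compile(
    r'\b[A-Z][.)]\s+(?:(\d{2})\s+)?([A-Z][A-Z ]*?)\s+(\d+)\b')


def find_enumerated_references(text, default_year):
    """Return canonical reference ids found in `text` (hypothetical helper).

    `default_year` is the two-digit year of the referencing cable and is
    used whenever the reference itself does not state a year.
    """
    refs = []
    for year, origin, number in _ENUM_REF.findall(text):
        ref = '%s%s%d' % (year or default_year,
                          origin.replace(' ', ''), int(number))
        if ref not in refs:  # keep order, drop duplicates
            refs.append(ref)
    return refs
```

Applied to the 02ROME1196 excerpt above with `default_year='02'`, this skips the page header and the free-text item D and yields the four expected ids. It will not cope with origins that contain characters other than upper-case letters and spaces.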

Better recipient extraction

The recipient extraction algorithm is incomplete:

U.S. Department of State Foreign Affairs Handbook Volume 5 Handbook 1—Correspondence
5 FAH-1 H-230

[...] c. Use existing collective addressees to obviate listing a long list of
addressees on an outgoing telegram. When modifying a collective
address, follow the collective address with a comma, space, "XMT," and
the name(s) of the posts in the collective that will not receive the
telegram. [...]

The reader simply strips the "XMT" marker and treats the posts listed after it as recipients, although those are exactly the posts that should not receive the telegram.
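
A minimal sketch of how the ", XMT" convention quoted above could be separated out. The helper name `split_xmt` is hypothetical, and how the excluded posts are delimited after "XMT" is not specified here, so the remainder is returned as one unparsed string.

```python
import re

# ", XMT " as described in 5 FAH-1 H-230: comma, space, "XMT".
_XMT = re.compile(r',\s*XMT\s+', re.IGNORECASE)


def split_xmt(addressee):
    """Split an addressee entry into (collective, excluded posts text).

    Hypothetical helper: the second element is the raw text after "XMT",
    or an empty string if the entry carries no exclusion.
    """
    parts = _XMT.split(addressee, maxsplit=1)
    collective = parts[0].strip()
    excluded = parts[1].strip() if len(parts) > 1 else ''
    return collective, excluded
```

The posts in the second element could then be subtracted from the recipient list instead of being added to it.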

Handle 06SAOPAULO348

The cable has an incomplete REF section:

SENSITIVE
SIPDIS

E.O. 12958: N/A TAGS: PGOV PHUM KCRM SOCI SNAR ASEC BR
(B) Sao Paulo 215;
(C) 05 Sao Paulo 975

Deal with \ in texts

Cables like 06GENEVA1673 contain "\" characters.
These should be removed in advance by reader.py.

Optimize subject parsing

06BUENOSAIRES579 subject contains

US-
LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US-
URUGUAYAN

the resulting subject looks like this:

US- LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US- URUGUAYAN

It would be nice if the reader removed the whitespace after the "-":

US-LATIN AMERICAN RELATIONSHIP US-CHILEAN TIES US-URUGUAYAN
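
A small sketch of that clean-up, assuming a regular expression is acceptable in the reader. The function name `fix_hyphen_breaks` is hypothetical. The lookbehind keeps free-standing dashes such as "A - B" untouched.

```python
import re

# A hyphen directly after a word character, followed by whitespace
# (including a line break) and another word character.
_HYPHEN_BREAK = re.compile(r'(?<=\w)-\s+(?=\w)')


def fix_hyphen_breaks(subject):
    """Join words that were split after a hyphen (hypothetical helper)."""
    return _HYPHEN_BREAK.sub('-', subject)
```

For the 06BUENOSAIRES579 subject this turns "US- LATIN" into "US-LATIN" and "US- URUGUAYAN" into "US-URUGUAYAN" while leaving "US-CHILEAN" as it is.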

C14N issues

  • 06TOKYO442: 05 OSAKA KOBE 367
  • 06WELLINGTON35: ATLANTA GA 0067, ATD 0245

Add an option to keep the reference enumeration character

parse_references omits the enumeration information: A) 09FOO12 B) 07BAR34 becomes [u'09FOO12', u'07BAR34']. It might be useful to return tuples instead: [(u'A', u'09FOO12'), (u'B', u'07BAR34')]

TODO:

  • Should this be enabled by default?
  • Impact on the default models.Cable? Break the API or add another property (Cable.references_ext)?
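
To illustrate the tuple form, here is a rough sketch that works on already canonicalized ids only. `parse_enumerated_references` is a hypothetical name and not part of the existing API. Written for Python 3, it returns plain strings.

```python
import re

# Enumeration letter, "." or ")", then a canonical id such as 09FOO12.
_ENUM_ITEM = re.compile(r'\b([A-Z])[.)]\s*(\d{2}[A-Z]+\d+)\b')


def parse_enumerated_references(text):
    """Return (enumeration letter, reference id) tuples (hypothetical)."""
    return _ENUM_ITEM.findall(text)


def plain_references(text):
    """Derive the existing list-of-ids form from the tuple form."""
    return [ref for _, ref in parse_enumerated_references(text)]
```

Deriving the plain list from the tuples, as `plain_references` does, would let a separate property such as Cable.references_ext coexist with the current one without breaking the API.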

09NAIROBI1938 references incomplete

The cable has two ref sections:

SUBJECT: VISAS DONKEY: CORRUPTION 212(F) VISA DENIAL
REF: A. 08 STATE 81854

S e c r e t nairobi 001938

Later:

Subject: visas donkey: corruption 212(f) visa denial

Ref: a. 08 state 81854
b. Td-314/014437-09
c. Nairobi 1830
d. Nairobi 1859
e. Nairobi 1831

The parser detects only the first REF section.
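
One option, sketched under the assumption that the later REF section is the more complete one: locate every line that starts with "REF:" (case-insensitively) and hand only the last section to the reference parser. The name `last_ref_section` is hypothetical, and ending the section at the first blank line is a simplification.

```python
import re

# A REF section starts at a line beginning with "REF:" in any casing.
_REF_START = re.compile(r'^[ \t]*REF:', re.IGNORECASE | re.MULTILINE)


def last_ref_section(text):
    """Return the text of the last REF section, or '' if there is none.

    Hypothetical helper; the section is assumed to end at the first
    blank line or at the end of the text.
    """
    starts = [m.start() for m in _REF_START.finditer(text)]
    if not starts:
        return ''
    section = text[starts[-1]:].split('\n\n', 1)[0]
    return section.strip()
```

For 09NAIROBI1938 this would return the five-item section (a. to e.) rather than the single-item one.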

TAG "ESENV" valid?

05SANJOSE2020 is tagged as "ESENV". IMO that's an error; it should be "SENV".

Subject parsing broken

06GENEVA2654, 06GENEVA1673, and 07BERN881 contain the "header" section in the "content" section. search-for-subject(content, max-index=1200) does not work for them.

Setting the max index to 2200 works for the cables mentioned above but yields wrong results for cables like 10STATE284, which has no subject: the lenient regex still returns one because the word SUBJECT occurs in the cable's text.

Cable id oddities

WL published 08ECTION01OF02MANAMA492 and 08MANAMA492. The first is a malformed version of the latter. Strangely, they have different transmission ids.

cablemap.core.cable_by_id always reads the malformed version because it internally translates the id 08MANAMA492 to 08ECTION01OF02MANAMA492.

04THEHAGUE1717: Illegal references detected

Result of 04THEHAGUE1717:

[u'04THEHAGUE1701', u'04STATE1', u'04THEHAGUE1701', u'04STATE147536']

should be:

[u'04THEHAGUE1701', u'04STATE147536']

Problem: The cable contains two REF sections. The parser should ignore the first REF section

Better reference extraction for malformed cables

E.g. 08TRIPOLI402:

>>> from cablemap.core import cable_from_html
>>> from cablemap.core.utils import cable_page_by_id
>>> cable = cable_from_html(cable_page_by_id('08TRIPOLI402'))
>>> cable.references
[u'08TRIPOLI199', u'08TRIPOLI227', u'08TRIPOLI402']

Should be:
[u'08TRIPOLI199', u'08TRIPOLI227']

Simple solution: Remove self-references?
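
The simple solution could look roughly like this. The helper name `remove_self_references` is hypothetical, and it assumes both the references and the cable's own id are already in canonical form.

```python
def remove_self_references(references, cable_id):
    """Drop every reference that points back to the cable itself.

    Hypothetical helper; order of the remaining references is kept.
    """
    return [ref for ref in references if ref != cable_id]
```

Filtering the 08TRIPOLI402 result with `cable_id='08TRIPOLI402'` leaves only 08TRIPOLI199 and 08TRIPOLI227.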

Signer of 01HARARE1632 cannot be detected

PAGE 02        HARARE  01632  02 OF 02  170806Z 
E.O. 12958: DECL: 05/17/11 
TAGS: PREL PGOV PINS ETRD ZI CA DA
SUBJECT: FOREIGN MINISTER INFORMS WESTERN 
AMBASSADORS OF CABINET DECISION TO REIN IN WAR VETERANS 

IRVING 

                   CONFIDENTIAL 


Handle reverse malformed cable ids

>>> from cablemap.core import cable_by_id
>>> cable = cable_by_id('08SCTION02OF02SAOPAULO335')

works, but

>>> cable = cable_by_id('08SAOPAULO335')

does not.
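
A sketch of a two-way lookup that would let either form of the id resolve. The dictionary below is a one-entry excerpt assumed for illustration, and `candidate_ids` is a hypothetical helper, not an existing cablemap function.

```python
# Assumed excerpt of a malformed-to-canonical id mapping.
MALFORMED_TO_CANONICAL = {
    '08SCTION02OF02SAOPAULO335': '08SAOPAULO335',
}

# Reverse direction: canonical id -> malformed id as published.
CANONICAL_TO_MALFORMED = {
    canonical: malformed
    for malformed, canonical in MALFORMED_TO_CANONICAL.items()
}


def candidate_ids(cable_id):
    """Return the ids to try, canonical first (hypothetical helper)."""
    canonical = MALFORMED_TO_CANONICAL.get(cable_id, cable_id)
    ids = [canonical]
    malformed = CANONICAL_TO_MALFORMED.get(canonical)
    if malformed is not None:
        ids.append(malformed)
    return ids
```

A lookup function could then try each candidate in turn and return the first cable it finds, regardless of which form the caller passed in.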

fix_subjects and update_helpers don't work together

helpers/fix_subjects.py needs the no_subjects.txt file, which is generated by update_helpers.py.
But once fix_subjects has generated the subjects, update_helpers removes those cable ids from no_subjects.txt, since the cables now have subjects. On its next run, fix_subjects then removes all subject fixes from subjects.txt, which is bad.

Handle malformed cable ids

cablemap.core cannot handle malformed cable IDs:

# WikiLeaks cable identifiers which are wrong
MALFORMED_CABLE_IDS = {
    '08SCTION02OF02SAOPAULO335': u'08SAOPAULO335',
    '08SECTION01GF02BISHIEK21': u'08BISHKEK1021',
    '09SECTION02OF03QRIPOLI583': u'09TRIPOLI583',
}

>>> cable = cable_by_id('08SCTION02OF02SAOPAULO335')

Traceback (most recent call last):
...
ValueError: Cannot extract the cable's reference id

Optimize TAG extraction

IS GAZA DISENGAGEMENT ISRAELI PALESTINIAN AFFAIRS

should probably be treated as:

[IS, GAZA DISENGAGEMENT, ISRAELI PALESTINIAN AFFAIRS]

and not as

[IS, GAZA, DISENGAGEMENT, ISRAELI, PALESTINIAN, AFFAIRS]
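
One way to get there, sketched under the assumption that a list of known multi-word TAGs is available: match the longest known phrase at each position and fall back to single words. Both `split_tags` and the `known_phrases` set are hypothetical. The sketch needs Python 3.4 or later for `max(..., default=...)`.

```python
def split_tags(raw, known_phrases):
    """Split a TAGS string, keeping known multi-word tags together.

    Hypothetical helper: `known_phrases` is a set of multi-word tags
    that must be supplied by the caller.
    """
    words = raw.split()
    max_len = max((len(p.split()) for p in known_phrases), default=1)
    tags = []
    i = 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for size in range(min(max_len, len(words) - i), 0, -1):
            candidate = ' '.join(words[i:i + size])
            if size == 1 or candidate in known_phrases:
                tags.append(candidate)
                i += size
                break
    return tags
```

With `known_phrases = {'GAZA DISENGAGEMENT', 'ISRAELI PALESTINIAN AFFAIRS'}` the example above is split into the three intended tags. Without any known phrases it falls back to the current word-by-word behaviour.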
