glasgow-ipl / ietfdata

Python libraries to access the IETF DataTracker and RFC Index

License: BSD 2-Clause "Simplified" License

Python 99.71% Makefile 0.29%

ietfdata's Introduction

ietfdata - Access the IETF Datatracker and related resources

This project contains Python 3 libraries to interact with, and access, the IETF datatracker, RFC index, and related resources.

Getting started

The project uses Pipenv for dependency management. To begin, run:

pipenv install --dev -e .

to create a Python virtual environment with the appropriate packages installed. Then, run:

pipenv shell

to start the virtual environment, within which you can run the scripts.

Once the virtual environment is started, running:

python3 tests/test_datatracker.py

will run the test suite for the datatracker module. Running:

python3 tests/test_rfcindex.py

will test the rfcindex module.

Caching

The ietfdata library can use a MongoDB instance as a cache. Using a cache reduces the number of requests that are made directly to the Datatracker, improving performance, and reducing the impact on the IETF's infrastructure. While using a cache is optional when accessing the Datatracker, it is required when accessing the mail archive.

The hostname, port, username, and password for the MongoDB instance that is to be used as the cache can be set when instantiating the DataTracker or MailArchive objects. Alternatively, the following environment variables can be set:

IETFDATA_CACHE_HOST (defaults to localhost when accessing the mail archive)
IETFDATA_CACHE_PORT (defaults to 27017)
IETFDATA_CACHE_USER (optional)
IETFDATA_CACHE_PASSWORD (optional)
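
A minimal sketch of how these settings might be resolved, assuming explicit arguments take precedence over the environment. The helper resolve_cache_settings and its parameters are hypothetical and not part of ietfdata, and the password variable is assumed to be named IETFDATA_CACHE_PASSWORD:

```python
import os
from typing import Mapping, Optional


def resolve_cache_settings(host: Optional[str] = None,
                           port: Optional[int] = None,
                           username: Optional[str] = None,
                           password: Optional[str] = None,
                           env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: explicit arguments win, then environment, then defaults."""
    env = os.environ if env is None else env
    return {
        # No default host is applied here; a caller that requires a cache
        # (e.g. for the mail archive) could fall back to "localhost".
        "host": host if host is not None else env.get("IETFDATA_CACHE_HOST"),
        "port": port if port is not None else int(env.get("IETFDATA_CACHE_PORT", "27017")),
        "username": username if username is not None else env.get("IETFDATA_CACHE_USER"),
        # Assumed variable name for the password setting.
        "password": password if password is not None else env.get("IETFDATA_CACHE_PASSWORD"),
    }


# Pass a mapping instead of os.environ to keep the example deterministic.
settings = resolve_cache_settings(env={"IETFDATA_CACHE_HOST": "localhost"})
print(settings)
```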

Release Process

Edit CHANGELOG.md and ensure it is up to date
Edit setup.py to ensure the correct version number is present
Edit ietfdata/datatracker.py to fix version number in DataTracker::ua
Run make test to run the test suite. If any tests fail, fix them, then restart the release process
Commit changes and push to GitHub
Check that the GitHub Continuous Integration run succeeds, and fix any problems (this runs with a fresh cache, so can sometimes catch problems that aren't found by local tests).
Run python3 setup.py sdist bdist_wheel to prepare the package
Run python3 -m twine upload dist/* to upload the package
Commit the package files in dist/* and push to GitHub
Tag the release in GitHub
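
The two version-number steps could be cross-checked before committing. The following is only a sketch: the sample file contents, the user-agent format, and the regular expressions are invented for illustration and may not match the real setup.py or DataTracker::ua.

```python
import re
from typing import Optional


def find_version(text: str, pattern: str) -> Optional[str]:
    """Return the first capture group of pattern in text, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Invented sample contents; in practice these would be read from the files.
setup_py = 'setup(name="ietfdata", version="0.0.0")'
datatracker_py = 'self.ua = "glasgow-ietfdata/0.0.0"'

setup_version = find_version(setup_py, r'version\s*=\s*"([^"]+)"')
ua_version = find_version(datatracker_py, r'ietfdata/([0-9][^"\s]*)')

versions_match = setup_version is not None and setup_version == ua_version
print(f"setup.py: {setup_version}, user agent: {ua_version}, match: {versions_match}")
```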

ietfdata's People


glowyee brodiee121 sbenthall csperkins djichthys moates georgefourm moonshiner daniel4x luohk19 zandermonc mladenk42 lumisota

ietfdata's Issues

No programmatic path from document to document type

It is possible to inspect a document to get a DocumentTypeURI:

$ doc.type
DocumentTypeURI(uri='/api/v1/name/doctypename/draft/', params={})

But retrieving the document type object from this URI is not currently possible without ad hoc string manipulation because the document type lookup function uses the slug, i.e.:

$ dt.document_type('draft')
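
For illustration, the ad hoc string manipulation mentioned above might look like the sketch below. The helper slug_from_uri is hypothetical and not part of ietfdata; it assumes the slug is always the last path component of the URI.

```python
def slug_from_uri(uri: str) -> str:
    """Return the last non-empty path component of a Datatracker URI."""
    parts = [part for part in uri.split("/") if part]
    if not parts:
        raise ValueError(f"no slug found in URI: {uri!r}")
    return parts[-1]


slug = slug_from_uri("/api/v1/name/doctypename/draft/")
print(slug)  # draft
```

The resulting slug could then be passed to dt.document_type().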

Dependencies not installed when PyPI wheel is installed

A clean installation of ietfdata from PyPI does not work because dependencies (e.g., Pavlova) are missing. setup.py doesn't appear to specify the install_requires field.

Missing test coverage: DataTracker::document_events()

No tests for DataTracker::document_events()

Missing test coverage: DataTracker::submission_events()

No tests for DataTracker::submission_events()

pip installation of ietfdata not working

ietfdata is listed on pypi

https://pypi.org/project/ietfdata/

But is not available in pip on the command line:

$ pip install ietfdata
ERROR: Could not find a version that satisfies the requirement ietfdata (from versions: none)
ERROR: No matching distribution found for ietfdata

Add support for recent Python versions

The library appears not to work with recent Python versions (3.10, 3.11) (see #126). May need to replace Pavlova (as per #55, it isn't well maintained).

Invalidate cache on datatracker version update

IETF datatracker v7.17.0 will expose its version number via /api/version. Need to invalidate the cache for objects that don't have an explicit update time when the version changes.
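
One way the invalidation could work is sketched below with an in-memory stand-in for the cache. The classes CacheEntry and VersionedCache are hypothetical, as is the later version number used in the example; the real cache is backed by MongoDB and may be organised differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class CacheEntry:
    value: Any
    updated: Optional[datetime] = None  # None: object has no explicit update time


class VersionedCache:
    """Hypothetical in-memory cache that remembers the Datatracker version."""

    def __init__(self, version: str) -> None:
        self.version = version
        self.entries: Dict[str, CacheEntry] = {}

    def put(self, uri: str, value: Any, updated: Optional[datetime] = None) -> None:
        self.entries[uri] = CacheEntry(value, updated)

    def check_version(self, current_version: str) -> int:
        """Drop entries without an update time if the version changed.

        Returns the number of entries dropped.
        """
        if current_version == self.version:
            return 0
        stale = [uri for uri, entry in self.entries.items() if entry.updated is None]
        for uri in stale:
            del self.entries[uri]
        self.version = current_version
        return len(stale)


cache = VersionedCache("7.17.0")
cache.put("/api/v1/name/doctypename/draft/", {"slug": "draft"})
cache.put("/api/v1/doc/document/example/", {"name": "example"}, updated=datetime(2020, 1, 1))
dropped = cache.check_version("7.18.0")  # hypothetical newer version
print(dropped, sorted(cache.entries))
```

Entries with an explicit update time are kept, since they can be revalidated against that timestamp instead.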

PavlovaParsingError: Field: consent missing on dt.iab_chair().name

Running this code from the examples/ directory in a local notebook in an environment with ietfdata installed using pip install -U ietfdata:

from datetime                 import timedelta
from pathlib                  import Path
from ietfdata.datatracker     import *
from ietfdata.datatracker_ext import *
from dateutil.parser          import *

dt = DataTrackerExt()

# =============================================================================
# Information about the IAB:

print(F"The IAB chair is {dt.iab_chair().name}")

print("The IAB members are:")
for m in dt.iab_members():
    print(F"  {m.name}")
print("")

I'm getting the following error/stack trace:
https://gist.github.com/sbenthall/688b3e9fb920e0f1f7ec212718a491a0

I wonder if there's been an API change, or if I've mistaken the configuration/installation somehow.

No way to navigate from a document to its authors?

The resolution of #14 makes the author affiliation available given the document author, and the ability to search documents by author.

I don't think there's a way yet to get the authors of a document.

>>> dt.document(DocumentURI('/api/v1/doc/document/draft-gharai-avt-uncomp-video/'))
Document(resource_uri=DocumentURI(uri='/api/v1/doc/document/draft-gharai-avt-uncomp-video/', params={}), id=26290, name='draft-gharai-avt-uncomp-video', title='RTP Payload Format for Uncompressed Video', pages=11, words=2827, time=datetime.datetime(2009, 2, 24, 0, 0), notify='', expires='2002-12-26T00:00:00', type=DocumentTypeURI(uri='/api/v1/name/doctypename/draft/', params={}), rfc=None, rev='00', abstract='This memo specifies a packetization scheme for encapsulating\r\nuncompressed HDTV as defined by SMPTE 274M and SMPTE 296M into\r\na payload format for  the Real-Time Transport Protocol (RTP).\r\nSMPTE 274M  and SMPTE 296M  define the analog and digital\r\nrepresentation of HDTV with image formats of 1920x1080  and\r\n1280x720, respectively. The payload has been designed such\r\nthat it may scale to future higher resolutions, suhc as\r\nDigital Cinema.', internal_comments='', order=1, note='', ad=None, shepherd=None, group=GroupURI(uri='/api/v1/group/group/1027/', params={}), stream=None, intended_std_level=None, std_level=None, states=[DocumentStateURI(uri='/api/v1/doc/state/4/', params={}), DocumentStateURI(uri='/api/v1/doc/state/150/', params={})], submissions=[], tags=[], uploaded_filename='', external_url='')
>>> doc = dt.document(DocumentURI('/api/v1/doc/document/draft-gharai-avt-uncomp-video/'))

I wonder how to find the authors.

Unable to access role history correctly

I am trying to use this package to get records about working group leadership in the present and past.

Getting them for the present seems to work fine. As an example I'm using dnsop.
https://datatracker.ietf.org/group/dnsop/history/

> wg = dt.group_from_acronym("dnsop")
> [dt.person(r.person).name for r in dt.group_roles(group = wg)]
['Warren "Ace" Kumari', 'Benno Overeinder', 'Suzanne Woolf', 'Tim Wicinski']

These are indeed the names of the people currently in leadership of this group.

Now I'm trying to find the people historically but not currently in roles in this working group.

> [dt.person(r.person).name for r in dt.group_role_histories(group = wg)]
['Andrew Sullivan', 'Marc Blanchet']

I don't think these people have ever been in dnsop leadership. I'm expecting the names Peter Koch, Joel Jaeggli, and Ronald Bonica. These names are mentioned in certain GroupEvents about leadership change. Is there a link to these other persons maintained in the group history?

PavlovaParsingError: Field: consent missing on dt.document(...) call

Running this snippet from the examples/ directory in a local environment with the package installed via pip install -U ietfdata:

from pathlib              import Path
from ietfdata.datatracker import *

# =============================================================================
# Example: print information about document authors

dt = DataTracker()

doc = dt.document(DocumentURI('/api/v1/doc/document/draft-ietf-mmusic-rfc4566bis/'))
print("Title: {}".format(doc.title))

Gets error with the following stack trace:
https://gist.github.com/sbenthall/8382edbb0df5c2b09165cd4a9e55f630

HeaderDataMailHelper crashes with bad dates

The scan_message() method of HeaderDataMailHelper can crash if given a bad date:

INFO:ietfdata:scan message dnsext/015459 for metadata
Traceback (most recent call last):
  File "examples/emails_2019.py", line 48, in <module>
    ml = archive.mailing_list("dnsext")
  File "/Users/csp/Projects/glasgow-ipl/ietfdata/ietfdata/mailarchive.py", line 374, in mailing_list
    self._mailing_lists[mailing_list_name] = MailingList(self._cache_dir, mailing_list_name, self._helpers)
  File "/Users/csp/Projects/glasgow-ipl/ietfdata/ietfdata/mailarchive.py", line 189, in __init__
    self._msg_metadata[msg_id] = {**(helper.scan_message(message_text)), **(self._msg_metadata[msg_id])}
  File "/Users/csp/Projects/glasgow-ipl/ietfdata/ietfdata/mailhelper_headerdata.py", line 43, in scan_message
    timestamp = datetime.fromtimestamp(time.mktime(msg_date))
OverflowError: mktime argument out of range

The message in question here has Date: Wed, 14 Jun 100 05:41:34 -0700 (PDT) (i.e., a Y2K bug in the date string).

It may be worth checking how many message dates fail to parse across the entire archive, to see if it's worth writing a work-around.
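
A possible work-around is sketched below, using only the standard library. It avoids time.mktime() and returns None for dates that cannot be interpreted. The function name parse_message_date is hypothetical, and the rule that a three-digit year means "years since 1900" is an assumption based on the single example above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_tz
from typing import Optional


def parse_message_date(header: str) -> Optional[datetime]:
    """Parse a Date: header value, returning None if it cannot be interpreted."""
    parsed = parsedate_tz(header)
    if parsed is None:
        return None
    year, month, day, hour, minute, second = parsed[:6]
    # Assumed heuristic: a three-digit year such as 100 is "years since 1900".
    if year < 1000:
        year += 1900
    # parsedate_tz gives the UTC offset in seconds, or None if it is unknown;
    # an unknown offset is treated as UTC here.
    offset = timedelta(seconds=parsed[9] or 0)
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))
    except (ValueError, OverflowError):
        return None


timestamp = parse_message_date("Wed, 14 Jun 100 05:41:34 -0700 (PDT)")
print(timestamp)
print(parse_message_date("not a date"))
```

Counting the None results across the archive would give the number of unparseable dates.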

DocumentTypeURI is not defined for recent pip installation of ietfdata

Today, I've pip installed ietfdata successfully into an environment.

Within that environment, I am able to import the DataTracker, but not able to import the DocumentTypeURI object:

$ python
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> from ietfdata.datatracker import *
>>> dt = DataTracker()
>>> DocumentTypeURI
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'DocumentTypeURI' is not defined

This is unexpected.

Group uri/resource_uri confusion?

dt = DataTracker(cache_dir=Path("cache"))
group = dt.group_from_acronym('mmusic')
drafts = dt.documents(group = group,
                      doctype = dt.document_type('draft'))

This errors with:

Traceback (most recent call last):
  File "affiliations.py", line 36, in <module>
    doctype = dt.document_type('draft')) ###
  File "/home/sb/projects/bigbang-multi/ietfdata/ietfdata/datatracker.py", line 1648, in document_type
    return self._retrieve(doc_type_uri, DocumentType)
  File "/home/sb/projects/bigbang-multi/ietfdata/ietfdata/datatracker.py", line 1367, in _retrieve
    if self._obj_is_cached(resource_uri):
  File "/home/sb/projects/bigbang-multi/ietfdata/ietfdata/datatracker.py", line 1338, in _obj_is_cached
    return self._cache_filepath(resource_uri).exists()
  File "/home/sb/projects/bigbang-multi/ietfdata/ietfdata/datatracker.py", line 1332, in _cache_filepath
    return Path(self.cache_dir, resource_uri.uri[1:-1] + ".json")
AttributeError: 'str' object has no attribute 'uri'
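
The traceback suggests that a plain string reached code that expects a URI object. The sketch below shows an early type check that would report this more clearly. The URI class is a simplified stand-in and cache_filepath only mirrors the expression in the traceback; neither is the library's actual code.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class URI:
    """Simplified stand-in for the library's URI classes."""
    uri: str


def cache_filepath(cache_dir: Path, resource_uri: URI) -> Path:
    """Map a resource URI to a JSON file below cache_dir."""
    if not isinstance(resource_uri, URI):
        raise TypeError(f"expected a URI object, got {type(resource_uri).__name__}")
    # Strip the leading and trailing "/" before appending the extension.
    return Path(cache_dir, resource_uri.uri[1:-1] + ".json")


path = cache_filepath(Path("cache"), URI("/api/v1/name/doctypename/draft/"))
print(path)
```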

MailArchiveHelper serialise and deserialise do the same thing?

In mailhelper_*.py files the serialise() and deserialise() functions seem to do the same thing. Should they?

Submission authors are not tied to an affiliation

Submissions have author fields denoting the authors:

dt.submission(SubmissionURI('/api/v1/submit/submission/47519/'))
Submission(abstract='   This document describes an improved IS-IS neighbor management scheme\n   which can be used to enhance network performance by allowing\n   operators to quickly and accurately shift traffic away from a point-\n   to-point or multi-access LAN interface by allowing one IS-IS router\n   to signal to a second, adjacent IS-IS neighbor to adjust its IS-IS\n   metric that should be used to temporarily reach the first IS-IS\n   router during network maintenance events.\n', access_key='16722aed4cd4ff65aaa4c1f0398e6f28', auth_key='1c14c45f418f8172a9cee94f58f7645abdf3a1ad', authors="[{'email': '[email protected]', 'name': 'Naiming Shen'}, {'email': '[email protected]', 'name': 'Tony Li'}, {'email': '[email protected]', 'name': 'Shane Amante'}, {'email': '[email protected]', 'name': 'Mikael Abrahamsson'}]", checks=['/api/v1/submit/submissioncheck/45495/'],  ...

But the affiliation data, which is available in the RFC text, is lost.
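
The authors field in the output above is a string holding a Python-style list of dictionaries. A minimal sketch of turning it into structured data with the standard library follows; the helper name parse_submission_authors is hypothetical and the email addresses are placeholders.

```python
import ast
from typing import Dict, List


def parse_submission_authors(authors_field: str) -> List[Dict[str, str]]:
    """Safely evaluate the authors string into a list of dictionaries."""
    authors = ast.literal_eval(authors_field)
    if not isinstance(authors, list):
        raise ValueError("authors field is not a list")
    return authors


# Placeholder addresses; only the shape of the string follows the output above.
raw = "[{'email': 'a@example.org', 'name': 'Naiming Shen'}, {'email': 'b@example.org', 'name': 'Tony Li'}]"
authors = parse_submission_authors(raw)
names = [author["name"] for author in authors]
print(names)
```

This recovers names and email addresses only; the string carries no affiliation key.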

The people can be looked up by email, but there's no way to recover, for example, that this person was affiliated with Apple Inc. for RFC 8500.

dt.person_from_email('[email protected]')
Person(resource_uri=PersonURI(uri='/api/v1/person/person/109004/'), id=109004, name='Shane Amante', name_from_draft='Shane Amante', ascii='Shane Amante', ascii_short=None, user='', time='2012-02-26T00:17:36', photo='None', photo_thumb='None', biography='', consent=False)

Missing test coverage: DataTracker::group_state()

No tests for DataTracker::group_state()

Missing test coverage: DataTracker::related_documents()

No tests for DataTracker::related_documents()

rfcindex.py

There's a duplicate test in line 265. Is it a typo?

elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2497") or \

Also, regarding the test in line 268 for RFC2708, as far as I can tell, U+0092 was introduced in draft-ietf-printmib-job-protomap-01 in multiple places. In -02, it was replaced with U+0027 everywhere except section 5.0. Somehow, that stray character became the corrupt text you identified.

PR #9 breaks the tests

I merged PR #9, but it breaks the tests:

(ietfdata) [mangole] > make test
mypy ietfdata/rfcindex.py
Success: no issues found in 1 source file
mypy ietfdata/datatracker.py
ietfdata/datatracker.py:1346: error: Unsupported operand types for + ("str" and "Document")
ietfdata/datatracker.py:1346: error: Unsupported operand types for + ("str" and "RelationshipType")
ietfdata/datatracker.py:1348: error: Unsupported operand types for + ("str" and "Document")
ietfdata/datatracker.py:1350: error: Unsupported operand types for + ("str" and "Document")
ietfdata/datatracker.py:1350: error: Unsupported operand types for + ("str" and "RelationshipType")
ietfdata/datatracker.py:1352: error: Unsupported operand types for + ("str" and "Document")
ietfdata/datatracker.py:1352: error: Unsupported operand types for + ("str" and "RelationshipType")
ietfdata/datatracker.py:1354: error: Unsupported operand types for + ("str" and "Document")
ietfdata/datatracker.py:1356: error: Unsupported operand types for + ("str" and "Document")
ietfdata/datatracker.py:1358: error: Unsupported operand types for + ("str" and "RelationshipType")
Found 10 errors in 1 file (checked 1 source file)
make: *** [Makefile:28: test] Error 1
(ietfdata) [mangole] >

Consider replacing Pavlova with Pydantic

Pydantic seems to address the same problem as Pavlova, but looks to be more flexible and better supported.

RFC documents no longer have submissions

A few months ago, the following code would report that there were multiple submissions for each of the documents queried:

from ietfdata.datatracker import *
from ietfdata.datatracker_ext import *
import numpy as np

dt = DataTrackerExt()

g = dt.group_from_acronym('quic')
docs = list(dt.documents(group=g, doctype=dt.document_type_from_slug("rfc")))

np.array([len(draft.submissions) for draft in docs])

This made it possible to collect information about the date and authors of the final draft (by looking at the submissions list, ordering it by submission_date, and pulling the author information).
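
The selection step described above can be sketched on plain data. SubmissionRecord is a hypothetical stand-in for the library's submission objects, with only the two fields used here, and the sample values are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class SubmissionRecord:
    """Hypothetical stand-in with only the fields needed for this example."""
    submission_date: date
    authors: List[str]


def final_submission(submissions: Iterable[SubmissionRecord]) -> Optional[SubmissionRecord]:
    """Return the most recently submitted record, or None if there are none."""
    return max(submissions, key=lambda s: s.submission_date, default=None)


# Invented sample data.
history = [
    SubmissionRecord(date(2020, 3, 1), ["A. Author"]),
    SubmissionRecord(date(2021, 1, 15), ["A. Author", "B. Author"]),
    SubmissionRecord(date(2020, 9, 30), ["A. Author"]),
]
last = final_submission(history)
print(last.submission_date, last.authors)
```

With an empty submissions list, as now reported for RFC documents, the function returns None.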

Now it seems that something has changed, and RFC documents appear to, as a rule, have 0 submissions.

What was the change?
Is it documented anywhere?

How can one do the following:

find all the RFCs for a working group
get the author and date information from that RFC document (the last document in the series)

Thanks!

Bug in rfc-data example

In the rfc-data example, there is the code:

        person = dt.person(rfc, author.person)
        print_dt_person(person)

https://github.com/glasgow-ipl/ietfdata/blob/master/examples/rfc-data.py#L58-L59

I believe the arguments are transposed. It should be:

        person = dt.person(author.person)
        print_dt_person(rfc, person)

Code for scraping document affiliation information hangs

The following code hangs when run locally:

https://gist.github.com/sbenthall/0c919094f6529b015015f9857e231ff5

This has been taking 10 minutes or so with no response.

Are we doing something wrong?

Unify CI and Makefile test scripts

Presently, there are two test scripts: the Makefile's test target, and the CI configuration's build-and-test job. As these are separate, it is possible that they would produce different results, making it difficult to determine if changes that pass one set of tests will also pass the other.

These should be unified, ideally around the Makefile's test target. Unifying these test scripts should also include ensuring that sensible, human-readable output is produced locally, while reports in the required archival format are produced by the CI build.

Missing test coverage: Meeting::status()

No tests for Meeting::status()

tests fail on fresh checkout

On a fresh clone and running through the pipenv install instructions, I get the following error when running the tests:

$ python tests/test_datatracker.py 
....................................F............
======================================================================
FAIL: test_meeting_session_assignments (__main__.TestDatatracker)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_datatracker.py", line 1103, in test_meeting_session_assignments
    self.fail("not implemented")
AssertionError: not implemented

----------------------------------------------------------------------
Ran 49 tests in 12.185s

FAILED (failures=1)
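
One option for placeholder tests, shown here only as a sketch, is to mark them as skipped so that a fresh checkout reports them without failing the run. The test class below is a stand-alone stand-in that reuses the names from the output above; it is not the project's test file.

```python
import io
import unittest


class TestDatatracker(unittest.TestCase):
    @unittest.skip("not implemented")
    def test_meeting_session_assignments(self) -> None:
        # The body never runs while the skip decorator is in place.
        self.fail("not implemented")


# Run the suite programmatically and capture the report in a string buffer.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDatatracker)
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(f"run={result.testsRun} skipped={len(result.skipped)} failures={len(result.failures)}")
```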