The python-dwca-reader from belgianbiodiversityplatform

Working files not cleaned up after use

A "v" folder appears with archive content.

Support DwC-A files with column titles on first line

For now, no detection is done, so it is the data consumer's responsibility to skip the first line if necessary.

Ideally we could have 3 modes when opening a file: skip_first_line = true | false | autodetect
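
The three modes could be sketched as follows. This is only an illustration built on the standard csv module, not the reader's actual code: the function name iter_data_rows and its parameters are hypothetical, and csv.Sniffer.has_header() is a heuristic that can guess wrong (or raise csv.Error when it cannot determine a delimiter).

```python
import csv
import io


def iter_data_rows(text, skip_first_line="autodetect", delimiter=","):
    """Yield data rows from CSV text, optionally skipping a header line.

    skip_first_line may be True, False, or the string "autodetect".
    (Hypothetical helper, not part of python-dwca-reader.)
    """
    if skip_first_line == "autodetect":
        # Heuristic guess based on the first few kilobytes of the file.
        skip_first_line = csv.Sniffer().has_header(text[:4096])
    elif not isinstance(skip_first_line, bool):
        raise ValueError("skip_first_line must be True, False or 'autodetect'")

    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    if skip_first_line:
        next(reader, None)  # discard the header line
    yield from reader
```

An explicit True or False stays the safest choice when the consumer knows the file; autodetection is a convenience for the other cases.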

API: add a method to DwcA reader to retrieve a specific line per id

.get_line(asked_id) ? or combine with some search/filtering feature ?

Raise InvalidArchive in more cases

An InvalidArchive exception will soon be raised when an archive lacks the metadata file it references.

We should think of other ways an archive can be invalid, and raise this exception in those failure cases, too.

Doc: set vocabulary

Decide on specific terminology (core file vs core data file, descriptor, metadata, ...) and ensure they are used everywhere (tutorial, docstrings, ...)

GBIFResultsReader

It seems GBIF has changed its export format; verify that it still works (mainly row.source_metadata...)

Also new files included that we should support ?

Support archive without metafile

See conditions and details in Darwin Core Archive Format, Reference Guide to the XML Descriptor File

Terrible performance with large extensions data files

Using GBIF Downloads, it has been noticed that looping on the archive was incredibly slow when there's a large verbatim.txt data file in addition to the main file. This continues even if we truncate the main occurrence.txt file to 10 records or so.

The reason is easy to identify: there's a design problem in CoreRow's constructor, where an _EmbeddedCSV instance is created for each CoreRow. Creating an _EmbeddedCSV is pretty expensive (mainly because of the _line_offsets attribute), so it should be done only once per archive.
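
A minimal sketch of the "once per archive" idea, using hypothetical stand-in classes (ExtensionFile for _EmbeddedCSV, Row for CoreRow, Archive for the reader). It does not reproduce the real constructors; it only shows the expensive index being built in one place and shared by every row.

```python
class ExtensionFile:
    """Hypothetical stand-in for _EmbeddedCSV: building the index is the costly part."""

    instances_created = 0  # counter used only to show how often we pay the cost

    def __init__(self, lines):
        ExtensionFile.instances_created += 1
        self._lines = lines
        # Costly step, standing in for the computation of _line_offsets.
        self._index = {}
        for number, line in enumerate(lines):
            core_id = line.split("\t")[0]
            self._index.setdefault(core_id, []).append(number)

    def lines_for(self, core_id):
        return [self._lines[n] for n in self._index.get(core_id, [])]


class Row:
    """Hypothetical stand-in for CoreRow: it receives the shared extension files."""

    def __init__(self, core_id, extension_files):
        self.id = core_id
        self._extension_files = extension_files  # shared, never rebuilt here

    @property
    def extensions(self):
        return [line for f in self._extension_files for line in f.lines_for(self.id)]


class Archive:
    def __init__(self, core_ids, extension_lines):
        self._core_ids = core_ids
        # Built once per archive, then handed to each row.
        self._extension_files = [ExtensionFile(extension_lines)]

    def __iter__(self):
        for core_id in self._core_ids:
            yield Row(core_id, self._extension_files)
```

With this layout, the cost of indexing a large verbatim.txt no longer grows with the number of core rows.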

What happens when we provide incorrect input to extensions_to_ignore ?

Currently, it's undocumented and it (silently) does nothing.
Better: it should throw an exception
This should be documented
This should be tested
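
One possible behaviour, sketched under assumptions: "incorrect input" is taken to mean either a bare string instead of a list, or a file name that matches no extension declared in the archive. The helper name check_extensions_to_ignore and the choice of exception types are hypothetical.

```python
def check_extensions_to_ignore(extensions_to_ignore, extension_files):
    """Raise if an ignored name matches no extension file of the archive.

    extension_files: names of the extension files declared in the archive.
    (Hypothetical helper, not part of python-dwca-reader.)
    """
    if isinstance(extensions_to_ignore, str):
        # A bare string would otherwise be iterated character by character.
        raise TypeError("extensions_to_ignore must be a list of file names, not a string")
    unknown = sorted(set(extensions_to_ignore) - set(extension_files))
    if unknown:
        raise ValueError("Unknown extension file(s): " + ", ".join(unknown))
```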

Doc: split tutorials

On home page:

Tell the purpose is to be a joy to use
Show tiny, beautiful example

Move "advanced" tutorials to another page.

API: add a filter method to DwCAReader that returns all lines that match some criteria...

Follow the example of Django's ORM
Define the filter parameter structure and capabilities (should we also filter by extensions?)

No test suite

We need a proper test suite

API: Filtering

Proposal 1) DwCAReader.filter_lines(params), with params similar to get_line() method ?
Proposal 2) DwCAReader.lines.filter(params) (better separation of concerns)
Proposal 3): both, 1) being an alias to 2) ?
Something else ?
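
To make the discussion concrete, here is a small sketch of what Django-like filter parameters could look like. Everything in it is an assumption: rows are plain dicts keyed by short term names (full Darwin Core URIs are not valid keyword names), and only exact match and a "__contains" lookup are supported.

```python
def filter_rows(rows, **criteria):
    """Yield the rows (dicts) whose fields match every criterion.

    Supports term=value (exact match) and term__contains=value.
    (Hypothetical helper, not part of python-dwca-reader.)
    """
    for row in rows:
        if all(_matches(row, key, expected) for key, expected in criteria.items()):
            yield row


def _matches(row, key, expected):
    # "locality__contains" -> term "locality", lookup "contains"
    term, _, lookup = key.partition("__")
    value = row.get(term)
    if lookup == "":
        return value == expected
    if lookup == "contains":
        return value is not None and expected in value
    raise ValueError("Unsupported lookup: %r" % lookup)
```

Whether this lives on the reader (proposal 1) or on a lines object (proposal 2) does not change the parameter structure.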

Doc: specify requirements

beautifulsoup4 + lxml (currently PITA to install on Mac OS X)

API: make DwCAReader support the for..in construct to iterate over lines.

It would be more elegant than the each_line() method. Also evaluate implications for other (lines attribute, ...).
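
Supporting for..in only requires an __iter__ method. The sketch below uses a hypothetical LineReader class holding its lines in memory; the each_line() shown here is a stand-in whose signature may differ from the real one.

```python
class LineReader:
    """Hypothetical reader; only the iteration protocol is the point here."""

    def __init__(self, lines):
        self._lines = list(lines)

    def __iter__(self):
        # A generator gives each for-loop its own independent cursor.
        for line in self._lines:
            yield line

    def each_line(self):
        """Stand-in for the existing accessor, kept as a thin alias."""
        return iter(self)
```

Usage then reads naturally: for line in reader: ...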

Polish/complete documentation

API: DwCALine: remove get

line.dataline is a dict, so it should be its responsibility to retrieve a specific value.

If so:

rename it to something shorter/better (.data? .fields?)
keep .get as a shortcut, but document it as such

Quick finish-level check

API: constructor should probably not open the file itself

Proper separation of concerns. Follow the example of the Python CSV classes.

Fill data for qualname() helper

It is totally incomplete...

No support for DwC-A extension files

Currently only data in core file (+archive metadata) is available.

Incompatible with Python 3.5

Support for very basic archives

I.e. a simple CSV with column headers.

This was mentioned by Peter and Stijn in the context of their dwca validator. This definitely looks doable. The next question is: is that in the scope of python-dwca-reader?

If not and we don't want to clutter this package with such code, it may be a good idea to implement a higher-level wrapper to abstract things, in a way similar to:

if given_file_type == dwca:
    dispatch to python-dwca-reader()
elif given_file_type==csv:
    analyze_headers()
    parse_as_csv()
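
A slightly more concrete version of that wrapper, as a sketch only. The function names are hypothetical, the type check is deliberately crude (anything that is a zip file is assumed to be a DwC-A), and the CSV branch relies on the standard csv.DictReader to use the first line as column headers.

```python
import csv
import zipfile


def detect_file_type(path):
    """Return 'dwca' for zip files and 'csv' otherwise (a crude assumption)."""
    return "dwca" if zipfile.is_zipfile(path) else "csv"


def read_csv_rows(path):
    """Read a plain CSV with column headers into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

The 'dwca' branch would then hand the path to python-dwca-reader, while the 'csv' branch stays outside the package.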

API: constructor could accept directory instead of zipfile

For example, that can make debugging easier

API: factorize get_line_by_index() and get_line_by_id()

... to get_line(params)

Doc: warn about performance issues when using DwCAReader.rows

Windows line endings read as data

Data is read, but the line ending '\r' is kept as part of the data value, so no extensions are found for the rows of the core data file.

Data: {'http://rs.tdwg.org/dwc/terms/taxonID': 'urn:ipni.org:name:77126806-1\r'}
----------------------------------------------------------------------------^
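
One way to avoid the stray carriage return, sketched with a hypothetical helper: remove the terminator the reader split on, then also drop a trailing carriage return left over from a CRLF file. This assumes no data value legitimately ends with a carriage return.

```python
def strip_line_terminator(line, terminator="\n"):
    """Remove the terminator, plus a trailing carriage return left by CRLF files.

    (Hypothetical helper, not part of python-dwca-reader.)
    """
    if terminator and line.endswith(terminator):
        line = line[: -len(terminator)]
    return line.rstrip("\r")
```

Applied to the line above, the taxonID becomes 'urn:ipni.org:name:77126806-1' and can match the identifiers used in the extension files.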

Cannot deal with archives with subdirectories

Example archive:
http://rs.gbif.org/datasets/german_sl.zip

This is the default archive used in the GBIF DWCA validator

It contains the following files:

nickyn@ubuntu:~/dwca$ unzip -l german_sl.zip
Archive:  german_sl.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2009-12-11 20:41   german_sl/
     6148  2009-11-04 16:11   german_sl/.DS_Store
   725969  2009-07-15 17:33   german_sl/distribution.txt
        0  2010-01-15 11:18   __MACOSX/
        0  2010-01-15 11:18   __MACOSX/german_sl/
      184  2009-07-15 17:33   __MACOSX/german_sl/._distribution.txt
     1374  2009-12-09 12:35   german_sl/eml.xml
     3195  2009-10-28 15:24   german_sl/meta.xml
      186  2009-10-28 15:24   __MACOSX/german_sl/._meta.xml
   272992  2009-07-15 16:16   german_sl/species_info.txt
  4149979  2009-10-28 15:39   german_sl/taxa.txt
   177967  2009-07-15 13:49   german_sl/vernacular.txt
---------                     -------
  5337994                     12 files

The dwca-reader unzips, but fails to find a meta.xml - as it is inside a subdirectory. The following error is produced:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-18-8b040c988d99> in <module>()
     13 
     14 
---> 15 with DwCAReader('german_sl.zip') as dwca:
     16     # We can now interact with the 'dwca' object
     17     print("Core type is: %s" % dwca.descriptor.core.type)

/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in __init__(self, path, extensions_to_ignore)
     83         #: An :class:`descriptors.ArchiveDescriptor` instance giving access to the archive
     84         #: descriptor (``meta.xml``)
---> 85         self.descriptor = ArchiveDescriptor(self._read_additional_file('meta.xml'),
     86                                             files_to_ignore=extensions_to_ignore)
     87 

/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in _read_additional_file(self, relative_path)
    163         """Read an additional file in the archive and return its content."""
    164         p = self.absolute_temporary_path(relative_path)
--> 165         return open(p).read()
    166 
    167     def _parse_metadata_file(self):

FileNotFoundError: [Errno 2] No such file or directory: '/home/nickyn/dwca/t/meta.xml'

Presumably this is a valid archive. If so, should the reader locate the meta.xml and continue relative to that location?
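
If the answer is yes, locating the descriptor after extraction could look like the sketch below. The function name find_archive_root is hypothetical, and skipping __MACOSX/ is an assumption based on the listing above.

```python
import os


def find_archive_root(extracted_dir, descriptor_name="meta.xml"):
    """Return the directory that contains the descriptor, or None if absent.

    (Hypothetical helper, not part of python-dwca-reader.)
    """
    for dirpath, dirnames, filenames in os.walk(extracted_dir):
        # Prune macOS resource-fork folders such as __MACOSX/ from the walk.
        dirnames[:] = [d for d in dirnames if d != "__MACOSX"]
        if descriptor_name in filenames:
            return dirpath
    return None
```

All other paths (data files, eml.xml) would then be resolved relative to the returned directory rather than to the temporary extraction folder.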

Improve docstrings

This is getting critical because the doc. is now automatically extracted by Sphinx and published on readthedocs.

Update to BeautifulSoup v4

It's the currently suggested version: faster, compatible with Python 3, ...

DwCAReader: line truncated at UTF-8 EOL char

When iterating over lines, the reader moves to the next line prematurely when it encounters a Unicode end-of-line character, U+0085 NEXT LINE (charbase.com/0085-unicode-next-line-nel). The issue is similar to: http://stackoverflow.com/questions/16227114/utf-8-files-read-in-python-will-line-break-at-character-x85.

Given the description of this character, the behaviour does make sense. However, since the EOL character is specified in meta.xml, we decided that it makes sense (and makes DwCAReader more resilient) to ignore it in this case.
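
The difference can be reproduced with plain strings. This is only an illustration of the effect, not the reader's actual code path: str.splitlines() treats U+0085 as a line boundary, whereas splitting on the terminator declared in meta.xml (assumed here to be a newline) does not.

```python
text = "id\tremark\n1\tfirst\x85half\n2\tplain\n"

# str.splitlines() breaks on U+0085 (NEL) as well as on the newline.
naive = text.splitlines()

# Splitting only on the declared terminator keeps the record whole.
terminator = "\n"  # assumed value of linesTerminatedBy in meta.xml
strict = text.split(terminator)
if strict and strict[-1] == "":
    strict.pop()  # drop the empty piece after the final terminator
```

Here naive contains four pieces because the second record is cut in two, while strict contains the expected three lines.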

The issue was discovered when playing with a sample export from the new GBIF data portal. The "issue" has also been fixed on their side, so this portal will probably not generate such exports in the future.

Complete, automatically generated, API documentation

On ReadTheDocs ? Other ? Explore tools & options...

Missing doc/example for dwca.core_terms

build_dc_terms_list.py still relies on BeautifulSoup

BeautifulSoup dependency has been removed in v0.7, but the utility script
dwca/darwincore/build_dc_terms_list.py (used during development, not needed for normal use) still uses BeautifulSoup to do its job.

It would be nicer if - like the rest of the package - it would use ElementTree instead.

Test DwCALine.id

several cases:

id is a "real" id (string)
id is an empty string (column specified, but empty)
no id (is it possible according to standards?) => None

python-dwca-reader in Jython

Currently the python-dwca-reader has lxml as a requirement. Is there a reason for this? I do not see where it is actually used. The reason I ask is that I would very much like to use the python-dwca-reader with Jython, but the dependency on lxml (which has no implementation that works with Jython, since it is based on C and has not been ported to date) makes this impossible. BeautifulSoup can use other parsers, so I wonder if it is possible to elect the parser rather than require lxml.

Split documentation in multiple files

Improve implementation based on the standard definition

(Default values, ...)
Definition here: http://rs.tdwg.org/dwc/terms/guides/text/index.htm

Error in doc

In the API page, we give an example using from dwca import DwCAReader.

It should be: from dwca.read import DwCAReader

Specify (and test) what should happen when providing an invalid archive

Currently, the case where the file is not a zip file is tested, but what about a non-existent file? Or a zip file that does not contain a DwC-A?
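
The three cases could be checked along these lines. This is a sketch under assumptions: InvalidArchive is redefined locally as a stand-in for the library's exception, a missing path raises FileNotFoundError (one possible choice among others), and a zip without meta.xml at its root is treated as invalid purely for illustration, which leaves aside the separate question of archives without a metafile.

```python
import os
import zipfile


class InvalidArchive(Exception):
    """Local stand-in for the library's exception of the same name."""


def check_archive(path, descriptor_name="meta.xml"):
    """Raise if path is missing, is not a zip file, or lacks the descriptor.

    (Hypothetical helper, not part of python-dwca-reader.)
    """
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if not zipfile.is_zipfile(path):
        raise InvalidArchive("%s is not a zip file" % path)
    with zipfile.ZipFile(path) as archive:
        if descriptor_name not in archive.namelist():
            raise InvalidArchive("%s contains no %s" % (path, descriptor_name))
```

Each branch maps directly to one test case, which makes the expected behaviour easy to specify.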

Doc: document the new optional extensions_to_ignore parameter of DwCAReader()

Document API

Now that API is stabilizing... take examples out of example.py

Make Darwin Core terms shortcuts less naive

The current solution is very simple and efficient, but might be a little naive in the long run:

There can be "collisions" between qualnames: http://xxxxxxxxxxx/machin and http://yyyyyyyy/machin. With current implementation, the last imported qualname would overwrite previous ones.
The list is statically generated from XML files, maybe a little rigid ?
If things get more complex, maybe this should be moved to another "Darwin Core" project (reusability, separation of concerns, ...)

Anyway, this is just a small helper that can be bypassed (by providing the full "qualnames" to python-dwca-reader), so this is probably enough for now.
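
For reference, the collision case could at least be detected instead of silently overwritten. The sketch below assumes the shortcut is the last path segment of the qualname; build_shortcuts is a hypothetical name, and the example.org/example.com URLs in the usage are placeholders.

```python
def build_shortcuts(qualnames):
    """Map short names to qualnames, raising on collisions instead of overwriting.

    (Hypothetical helper, not part of python-dwca-reader.)
    """
    shortcuts = {}
    for qualname in qualnames:
        # Assumption: the short name is the last path segment of the URI.
        short = qualname.rstrip("/").rsplit("/", 1)[-1]
        existing = shortcuts.get(short)
        if existing is not None and existing != qualname:
            raise ValueError(
                "Shortcut %r is ambiguous: %s vs %s" % (short, existing, qualname)
            )
        shortcuts[short] = qualname
    return shortcuts
```

For instance, build_shortcuts(["http://example.org/a/machin", "http://example.com/b/machin"]) raises ValueError rather than keeping only the last entry.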

DwCAReader._metaxml should be made public

Removing _
Testing
Documenting
