belgianbiodiversityplatform / python-dwca-reader

๐Ÿ A Python package to read Darwin Core Archive (DwC-A) files.

License: BSD 3-Clause "New" or "Revised" License

Python 99.98% Shell 0.02%
biodiversity-informatics python biodiversity biodiversity-standards gbif dwc

python-dwca-reader's People

Contributors

bcail, dependabot[bot], evindunn, jwcook, niconoe, stijnvanhoey

python-dwca-reader's Issues

Support DwC-A files with column titles on first line

For now, no detection is done, so it is the data consumer's responsibility to skip the first line if necessary.

Ideally we could have 3 modes when opening a file: skip_first_line = true | false | autodetect
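
A minimal sketch of how the three modes could behave. The `SkipFirstLine` enum, the helper names, and the autodetection heuristic (the first row counts as a header when every cell matches the short name of a term declared in meta.xml) are all assumptions made for illustration; none of them exist in python-dwca-reader.

```python
from enum import Enum


class SkipFirstLine(Enum):
    """Hypothetical setting mirroring skip_first_line = true | false | autodetect."""
    TRUE = "true"
    FALSE = "false"
    AUTODETECT = "autodetect"


def short_name(term):
    """Return the last path segment of a qualified term URI."""
    return term.rstrip("/").rsplit("/", 1)[-1]


def looks_like_header(first_row, declared_terms):
    """Assumed heuristic: every cell equals the short name of a declared term."""
    names = {short_name(term).lower() for term in declared_terms}
    cells = [cell.strip().lower() for cell in first_row]
    return bool(cells) and all(cell in names for cell in cells)


def should_skip_first_line(mode, first_row, declared_terms):
    """Decide whether the first line of a data file should be skipped."""
    if mode is SkipFirstLine.TRUE:
        return True
    if mode is SkipFirstLine.FALSE:
        return False
    return looks_like_header(first_row, declared_terms)
```

With this heuristic, a first row of `taxonID, scientificName` would be skipped in autodetect mode, while a row of actual values would be kept.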

Raise InvalidArchive in more cases

An InvalidArchive exception will soon be raised when an archive lacks the metadata file it references.

We should think of other ways an archive can be invalid, and raise this exception in those failure cases, too.
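
As a starting point for that discussion, here is a sketch of a few checks on an already-extracted archive. The `InvalidArchive` class below is a local stand-in so the example runs on its own, `check_extracted_archive` is a hypothetical helper, and the list of failure cases (malformed descriptor, missing metadata file, missing `<core>`, missing data files) is only a suggestion. It assumes meta.xml is present in the given directory.

```python
import os
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core text descriptors (meta.xml).
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


class InvalidArchive(Exception):
    """Local stand-in for the package's exception, so the sketch runs alone."""


def check_extracted_archive(directory):
    """Raise InvalidArchive for a few assumed failure cases; return None otherwise."""
    meta_path = os.path.join(directory, "meta.xml")
    try:
        root = ET.parse(meta_path).getroot()
    except ET.ParseError as exc:
        raise InvalidArchive("meta.xml is not well-formed XML") from exc

    # Case from the issue: the descriptor references a metadata file that is absent.
    metadata = root.get("metadata")
    if metadata and not os.path.isfile(os.path.join(directory, metadata)):
        raise InvalidArchive("metadata file %r is referenced but missing" % metadata)

    if root.find("dwc:core", NS) is None:
        raise InvalidArchive("descriptor has no <core> element")

    # Every data file named in the descriptor should exist on disk.
    for location in root.iterfind(".//dwc:files/dwc:location", NS):
        name = (location.text or "").strip()
        if not os.path.isfile(os.path.join(directory, name)):
            raise InvalidArchive("data file %r is referenced but missing" % name)
```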

Doc: set vocabulary

Decide on specific terminology (core file vs core data file, descriptor, metadata, ...) and ensure they are used everywhere (tutorial, docstrings, ...)

GBIFResultsReader

It seems GBIF has changed its export format; verify that the reader still works (mainly row.source_metadata...).

Also, are there new files included that we should support?

Terrible performance with large extensions data files

Using GBIF Downloads, it has been noticed that looping over the archive is incredibly slow when there is a large verbatim.txt data file in addition to the main file. This continues even if we truncate the main occurrence.txt file to 10 records or so.

The reason is easy to identify: there is a design problem in CoreRow's constructor, which creates an _EmbeddedCSV instance for each CoreRow. Creating an _EmbeddedCSV is pretty expensive (mainly because of the _line_offsets attribute), so it should be done only once per archive.
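
The fix can be sketched as follows: build the expensive index once and hand the same object to every row. The classes below are simplified stand-ins (hypothetical names, a dict index instead of line offsets) and not the package's actual `CoreRow` or `_EmbeddedCSV`.

```python
class IndexedExtensionFile:
    """Stand-in for an expensive-to-build data file wrapper (hypothetical)."""

    builds = 0  # counts how many times the costly index is built

    def __init__(self, lines, delimiter="\t"):
        IndexedExtensionFile.builds += 1
        self._rows_by_core_id = {}
        for line in lines:
            core_id, _, rest = line.partition(delimiter)
            self._rows_by_core_id.setdefault(core_id, []).append(rest)

    def rows_for(self, core_id):
        return self._rows_by_core_id.get(core_id, [])


class Row:
    """Core row that borrows the shared index instead of rebuilding it."""

    def __init__(self, core_id, extension_files):
        self.id = core_id
        self._extension_files = extension_files

    @property
    def extensions(self):
        return [row for ext in self._extension_files for row in ext.rows_for(self.id)]


def iter_rows(core_ids, extension_lines):
    """Yield one Row per core id, all sharing the same extension index."""
    shared = [IndexedExtensionFile(extension_lines)]  # built once per archive
    for core_id in core_ids:
        yield Row(core_id, shared)
```

The cost of indexing the extension file is then paid once, regardless of how many core rows are iterated.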

Doc: split tutorials

On home page:

  • Tell the purpose is to be a joy to use
  • Show tiny, beautiful example

Move "advanced" tutorials to another page.

API: Filtering

  • Proposal 1) DwCAReader.filter_lines(params), with params similar to get_line() method ?
  • Proposal 2) DwCAReader.lines.filter(params) (better separation of concern)
  • Proposal 3): both, 1) being an alias to 2) ?
  • Something else ?
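
For discussion, proposal 3 might look like the sketch below. `Reader` and `Lines` are hypothetical stand-ins, rows are modelled as plain dicts, and matching on keyword equality is an assumption, since the parameters of `get_line()` are not spelled out here.

```python
class Lines:
    """Hypothetical collection of rows, each modelled here as a plain dict."""

    def __init__(self, rows):
        self._rows = list(rows)

    def filter(self, **criteria):
        """Yield the rows whose fields equal every given keyword argument."""
        for row in self._rows:
            if all(row.get(key) == value for key, value in criteria.items()):
                yield row


class Reader:
    """Hypothetical reader exposing both spellings (proposal 3)."""

    def __init__(self, rows):
        self.lines = Lines(rows)

    def filter_lines(self, **criteria):
        # Alias: delegates to lines.filter() so the logic lives in one place.
        return self.lines.filter(**criteria)
```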

API: DwCALine: remove get

line.dataline is a dict, so it should be its responsibility to retrieve a specific value.

If so:

  • rename it to something shorter/better (.data? .fields?)
  • keep .get as a shortcut, but document it as such

Support for very basic archives

I.e., a simple CSV with column headers.

This was mentioned by Peter and Stijn in the context of their dwca validator. This looks definitely doable. The next question is: is that in the scope of python-dwca-reader?

If not and we don't want to clutter this package with such code, it may be a good idea to implement a higher-level wrapper to abstract things, in a way similar to:

if given_file_type == dwca:
    dispatch to python-dwca-reader()
elif given_file_type==csv:
    analyze_headers()
    parse_as_csv()

Windows line endings read as data

Data are read OK, but the line ending '\r' is included as part of the data value, so no extensions are found for the rows of the core data file.

Data: {'http://rs.tdwg.org/dwc/terms/taxonID': 'urn:ipni.org:name:77126806-1\r'}
----------------------------------------------------------------------------^
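
One possible workaround, shown as a sketch: strip any trailing carriage return or line feed before splitting a raw line into fields. `split_fields` is a hypothetical helper, not part of the package, and it assumes that no field legitimately ends with '\r'.

```python
def split_fields(raw_line, delimiter="\t"):
    """Split one raw line into values, dropping any trailing CR/LF first."""
    return raw_line.rstrip("\r\n").split(delimiter)


# A line with Windows line endings no longer leaks '\r' into the last value.
fields = split_fields("urn:ipni.org:name:77126806-1\r\n")
```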

Cannot deal with archives with subdirectories

Example archive:
http://rs.gbif.org/datasets/german_sl.zip

This is the default archive used in the GBIF DWCA validator

It contains the following files:

nickyn@ubuntu:~/dwca$ unzip -l german_sl.zip
Archive:  german_sl.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2009-12-11 20:41   german_sl/
     6148  2009-11-04 16:11   german_sl/.DS_Store
   725969  2009-07-15 17:33   german_sl/distribution.txt
        0  2010-01-15 11:18   __MACOSX/
        0  2010-01-15 11:18   __MACOSX/german_sl/
      184  2009-07-15 17:33   __MACOSX/german_sl/._distribution.txt
     1374  2009-12-09 12:35   german_sl/eml.xml
     3195  2009-10-28 15:24   german_sl/meta.xml
      186  2009-10-28 15:24   __MACOSX/german_sl/._meta.xml
   272992  2009-07-15 16:16   german_sl/species_info.txt
  4149979  2009-10-28 15:39   german_sl/taxa.txt
   177967  2009-07-15 13:49   german_sl/vernacular.txt
---------                     -------
  5337994                     12 files

The dwca-reader unzips, but fails to find a meta.xml - as it is inside a subdirectory. The following error is produced:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-18-8b040c988d99> in <module>()
     13 
     14 
---> 15 with DwCAReader('german_sl.zip') as dwca:
     16     # We can now interact with the 'dwca' object
     17     print("Core type is: %s" % dwca.descriptor.core.type)

/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in __init__(self, path, extensions_to_ignore)
     83         #: An :class:`descriptors.ArchiveDescriptor` instance giving access to the archive
     84         #: descriptor (``meta.xml``)
---> 85         self.descriptor = ArchiveDescriptor(self._read_additional_file('meta.xml'),
     86                                             files_to_ignore=extensions_to_ignore)
     87 

/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in _read_additional_file(self, relative_path)
    163         """Read an additional file in the archive and return its content."""
    164         p = self.absolute_temporary_path(relative_path)
--> 165         return open(p).read()
    166 
    167     def _parse_metadata_file(self):

FileNotFoundError: [Errno 2] No such file or directory: '/home/nickyn/dwca/t/meta.xml'

Presumably this is a valid archive - if so, should the reader locate the meta.xml and continue relative to that location?
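
If the answer is yes, the lookup could be sketched like this. `find_descriptor_dir` is a hypothetical helper; skipping `__MACOSX` is an assumption based on the listing above, where that folder only holds resource-fork files.

```python
import os


def find_descriptor_dir(extracted_root):
    """Return the directory holding meta.xml, or None if there is none."""
    for current, subdirs, files in os.walk(extracted_root):
        # Prune macOS resource-fork folders such as __MACOSX/ from the walk.
        subdirs[:] = sorted(name for name in subdirs if name != "__MACOSX")
        if "meta.xml" in files:
            return current
    return None
```

Paths to the data files would then be resolved relative to the returned directory rather than to the extraction root.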

Improve docstrings

This is getting critical because the doc. is now automatically extracted by Sphinx and published on readthedocs.

DwCAReader: line truncated at UTF-8 EOL char

When iterating over lines, the reader moves to the next line prematurely when it encounters a Unicode EOL character (U+0085, see charbase.com/0085-unicode-next-line-nel). The issue is similar to: http://stackoverflow.com/questions/16227114/utf-8-files-read-in-python-will-line-break-at-character-x85.

Given the description of this character, the behaviour does make sense. However, since the EOL character is specified in meta.xml, we decided that it makes sense (and makes DwCAReader more resilient) to ignore it in this case.

The issue was discovered when playing with a sample export from the new GBIF data portal. The "issue" has also been fixed on their side, so this portal will probably not generate such exports in the future.
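
The difference can be illustrated with a short sketch: `str.splitlines()` treats U+0085 as a line boundary, whereas splitting on the terminator declared in meta.xml does not. `iter_lines` is a hypothetical helper, not the reader's actual implementation.

```python
def iter_lines(text, terminator="\n"):
    """Split on the declared terminator only, ignoring other Unicode line breaks."""
    lines = text.split(terminator)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing terminator does not start a new line
    return lines


sample = "1\tfoo\x85bar\n2\tbaz\n"

naive = sample.splitlines()     # breaks the first record at U+0085
resilient = iter_lines(sample)  # keeps U+0085 inside the value
```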

build_dc_terms_list.py still relies on BeautifulSoup

BeautifulSoup dependency has been removed in v0.7, but the utility script
dwca/darwincore/build_dc_terms_list.py (used during development, not needed for normal use) still uses BeautifulSoup to do its job.

It would be nicer if - like the rest of the package - it would use ElementTree instead.
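
A rough sketch of what the ElementTree version might look like. It assumes the source XML lists each term as an element carrying a `qualName` attribute; that is an assumption about the input files, and `qualnames_from_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def qualnames_from_xml(xml_text):
    """Collect every qualName attribute found in a term-definition document."""
    root = ET.fromstring(xml_text)
    return [el.get("qualName") for el in root.iter() if el.get("qualName")]
```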

Test DwCALine.id

several cases:

  • id is a "real" id (string)
  • id is an empty string (column specified, but empty)
  • no id (is it possible according to standards?) => None
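
The three cases could be covered along these lines. `extract_id` is a hypothetical stand-in for the real id logic, and returning None when no id column is declared is the behaviour proposed above, not a confirmed one.

```python
import unittest


def extract_id(fields, id_index):
    """Stand-in for the id logic: None when no id column is declared."""
    if id_index is None:
        return None
    return fields[id_index]


class TestLineId(unittest.TestCase):
    def test_real_id(self):
        self.assertEqual(extract_id(["urn:1", "Abies alba"], 0), "urn:1")

    def test_empty_id(self):
        # Column specified, but the value is empty.
        self.assertEqual(extract_id(["", "Abies alba"], 0), "")

    def test_no_id_column(self):
        self.assertIsNone(extract_id(["Abies alba"], None))
```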

python-dwca-reader in Jython

Currently the python-dwca-reader has lxml as a requirement. Is there a reason for this? I do not see where it is actually used. The reason I ask is that I would very much like to use the python-dwca-reader with Jython, but the dependency on lxml (which has no implementation that works with Jython, since it is based on C and has not been ported to date) makes this impossible. BeautifulSoup can use other parsers, so I wonder if it is possible to elect the parser rather than require lxml.
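
One common way to make lxml optional is an import fallback to the standard library's ElementTree, sketched below. `core_row_type` is a hypothetical helper used only to show that the same calls work with either parser; it is not how the package currently parses the descriptor.

```python
try:
    from lxml import etree  # optional; not usable where C extensions are unavailable
except ImportError:
    import xml.etree.ElementTree as etree  # pure-Python standard library fallback


def core_row_type(meta_xml_text):
    """Return the rowType attribute of the <core> element in a meta.xml string."""
    root = etree.fromstring(meta_xml_text)
    core = root.find("{http://rs.tdwg.org/dwc/text/}core")
    return core.get("rowType")
```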

Error in doc

In the API page, we give an example using: from dwca import DwCAReader

It should be: from dwca.read import DwCAReader

Document API

Now that API is stabilizing... take examples out of example.py

Make Darwin Core terms shortcuts less naive

The current solution is very simple and efficient, but might be a little naive in the long run:

  • There can be "collisions" between qualnames: http://xxxxxxxxxxx/machin and http://yyyyyyyy/machin. With current implementation, the last imported qualname would overwrite previous ones.
  • The list is statically generated from XML files, maybe a little rigid ?
  • If things get more complex, maybe this should be moved to another "Darwin Core" project (reusability, separation of concerns, ...)

Anyway, this is just a small helper that can be bypassed (by providing the full "qualnames" to python-dwca-reader), so this is probably enough for now.
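
The collision problem from the first bullet can be made visible with a small sketch. `build_shortcuts` is a hypothetical helper; unlike the behaviour described above, it keeps the first qualname seen and reports the others instead of silently overwriting.

```python
def build_shortcuts(qualnames):
    """Map short names to qualnames and report short names claimed more than once."""
    shortcuts = {}
    collisions = {}
    for qualname in qualnames:
        short = qualname.rstrip("/").rsplit("/", 1)[-1]
        if short in shortcuts and shortcuts[short] != qualname:
            # Record every qualname competing for this short name.
            collisions.setdefault(short, [shortcuts[short]]).append(qualname)
        else:
            shortcuts[short] = qualname  # the first qualname wins
    return shortcuts, collisions


shortcuts, collisions = build_shortcuts(
    ["http://xxxxxxxxxxx/machin", "http://yyyyyyyy/machin"]
)
```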

API reality check

Adapt the formicidae-atlas.be webapp and maybe Meuh to use the current version of python-dwca-reader, and use the experience to ensure API's quality and polish the documentation.

Missing in doc

API: the (super important) extensions attribute of CoreRow is not documented anymore?

Improve string representation of DwCALine

Currently the representation is ugly, and so is the code, which is also untested (so it's probably a good idea to clean up and define the expected output format at this time).
