belgianbiodiversityplatform / python-dwca-reader
A Python package to read Darwin Core Archive (DwC-A) files.
License: BSD 3-Clause "New" or "Revised" License
A "v" folder appears with archive content.
For now, no detection is done, so it's the data consumer's responsibility to skip the first line if necessary.
Ideally we could have 3 modes when opening a file: skip_first_line = true | false | autodetect
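The autodetect mode above could be sketched as a simple heuristic: skip the first line only when its fields match the (short) term names declared in meta.xml. The function name and exact matching rule are illustrative assumptions, not the package's API.

```python
# Hypothetical helper: decide whether the first data line is a header row,
# by comparing its fields against the term names declared in meta.xml.
def looks_like_header(first_line, term_names, delimiter="\t"):
    fields = [f.strip().lower() for f in first_line.rstrip("\r\n").split(delimiter)]
    declared = [t.lower() for t in term_names]
    return fields == declared
```

With autodetect enabled, the reader would call this on the first line and skip it only when it returns True.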
.get_line(asked_id)? Or combine with some search/filtering feature?
An InvalidArchive exception will soon be raised when an archive lacks the metadata file it references.
We should think of other ways an archive can be invalid, and raise this exception in those failure cases, too.
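The planned behaviour could be sketched as follows. The exception name matches the issue; the helper name and check are illustrative assumptions, not the package's real validation code.

```python
# Hypothetical sketch: a dedicated exception type plus one validation check
# (the missing-metadata-file case mentioned above). More checks could be
# added for the other failure cases.
import os


class InvalidArchive(Exception):
    """Raised when an archive is structurally broken."""


def check_archive(extracted_dir, metadata_filename):
    """Raise InvalidArchive if the referenced metadata file is missing."""
    path = os.path.join(extracted_dir, metadata_filename)
    if not os.path.isfile(path):
        raise InvalidArchive(
            "Archive references missing metadata file: %s" % metadata_filename
        )
```

Other invalidity cases (missing meta.xml, missing data files, ...) would raise the same exception type so consumers only need one except clause.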
Decide on specific terminology (core file vs core data file, descriptor, metadata, ...) and ensure they are used everywhere (tutorial, docstrings, ...)
It seems GBIF has changed its export format; verify that the package still works (mainly row.source_metadata...)
Are new files included that we should support?
See conditions and details in Darwin Core Archive Format, Reference Guide to the XML Descriptor File
Using GBIF Downloads, it has been noticed that looping over the archive is incredibly slow when there's a large verbatim.txt data file in addition to the main file. This continues even if we truncate the main occurrence.txt file to 10 records or so.
The reason is easy to identify: there's a design problem in CoreRow's constructor: an _EmbeddedCSV instance is created for each CoreRow. Creating an _EmbeddedCSV is pretty expensive (mainly the _line_offsets attribute), so it should only be done once per archive.
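The proposed fix could look like this. The classes are simplified stand-ins mirroring the issue's description, not the package's actual implementation: the point is that the expensive per-file index is built once per archive and shared by all rows.

```python
# Simplified sketch of the fix: build the expensive line-offset index once
# per archive and share it across rows, instead of rebuilding it in every
# CoreRow constructor.

class _EmbeddedCSV:
    def __init__(self, lines):
        # Expensive step: index the byte offset of every line, once.
        self.line_offsets = []
        offset = 0
        for line in lines:
            self.line_offsets.append(offset)
            offset += len(line)


class Archive:
    def __init__(self, extension_lines):
        # Built once per archive...
        self._extension_csv = _EmbeddedCSV(extension_lines)

    def core_row(self):
        # ...and reused by every row.
        return CoreRow(self._extension_csv)


class CoreRow:
    def __init__(self, extension_csv):
        self.extension_csv = extension_csv  # shared, not rebuilt
```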
On home page:
Move "advanced" tutorials to another page.
We need a proper test suite
beautifulsoup4 + lxml (currently a pain to install on Mac OS X)
It would be more elegant than the each_line() method. Also evaluate the implications for other features (the lines attribute, ...).
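Making the reader iterable could be sketched like this, assuming rows come lazily from an underlying sequence. The class is a simplified stand-in, with each_line() kept as an alias during a deprecation period.

```python
# Sketch: support the iterator protocol instead of (and alongside) each_line().
class Reader:
    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        # Yield rows lazily; callers can write "for row in reader: ...".
        for row in self._rows:
            yield row

    def each_line(self):
        # Kept for backward compatibility; delegates to the iterator.
        return iter(self)
```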
line.dataline is a dict, so it should be its responsibility to retrieve a specific value.
If so:
Proper separation of concerns. Follow the example of the Python csv classes.
It is totally incomplete...
Currently only data in core file (+archive metadata) is available.
I.e. a simple CSV with column headers.
This was mentioned by Peter and Stijn in the context of their dwca validator. This definitely looks doable. The next question is: is that in the scope of python-dwca-reader?
If not and we don't want to clutter this package with such code, it may be a good idea to implement a higher-level wrapper to abstract things, in a way similar to:
if given_file_type == dwca:
    dispatch to python-dwca-reader
elif given_file_type == csv:
    analyze_headers()
    parse_as_csv()
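The dispatch above could be made runnable along these lines. The handler functions are hypothetical placeholders; only the detection/dispatch logic is the point, using the fact that a DwC-A is a zip archive.

```python
# Sketch of a higher-level wrapper that routes a file either to
# python-dwca-reader or to plain CSV handling. The handlers called from
# open_biodiversity_file() are hypothetical.
import zipfile


def detect_file_type(path):
    # A DwC-A is a zip archive; anything else is treated as a bare CSV.
    return "dwca" if zipfile.is_zipfile(path) else "csv"


def open_biodiversity_file(path):
    if detect_file_type(path) == "dwca":
        return dispatch_to_dwca_reader(path)  # hypothetical handler
    else:
        analyze_headers(path)                 # hypothetical handler
        return parse_as_csv(path)             # hypothetical handler
```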
For example, that can make debugging easier
... to get_line(params)
Data is read OK, but as the line ending '\r' is included as part of the data value, no extensions are found for the rows of the core data file.
Data: {'http://rs.tdwg.org/dwc/terms/taxonID': 'urn:ipni.org:name:77126806-1\r'}
----------------------------------------------------------------------------^
Example archive:
http://rs.gbif.org/datasets/german_sl.zip
This is the default archive used in the GBIF DWCA validator
It contains the following files:
nickyn@ubuntu:~/dwca$ unzip -l german_sl.zip
Archive: german_sl.zip
Length Date Time Name
--------- ---------- ----- ----
0 2009-12-11 20:41 german_sl/
6148 2009-11-04 16:11 german_sl/.DS_Store
725969 2009-07-15 17:33 german_sl/distribution.txt
0 2010-01-15 11:18 __MACOSX/
0 2010-01-15 11:18 __MACOSX/german_sl/
184 2009-07-15 17:33 __MACOSX/german_sl/._distribution.txt
1374 2009-12-09 12:35 german_sl/eml.xml
3195 2009-10-28 15:24 german_sl/meta.xml
186 2009-10-28 15:24 __MACOSX/german_sl/._meta.xml
272992 2009-07-15 16:16 german_sl/species_info.txt
4149979 2009-10-28 15:39 german_sl/taxa.txt
177967 2009-07-15 13:49 german_sl/vernacular.txt
--------- -------
5337994 12 files
The dwca-reader unzips the archive, but fails to find meta.xml, as it is inside a subdirectory. The following error is produced:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-18-8b040c988d99> in <module>()
13
14
---> 15 with DwCAReader('german_sl.zip') as dwca:
16 # We can now interact with the 'dwca' object
17 print("Core type is: %s" % dwca.descriptor.core.type)
/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in __init__(self, path, extensions_to_ignore)
83 #: An :class:`descriptors.ArchiveDescriptor` instance giving access to the archive
84 #: descriptor (``meta.xml``)
---> 85 self.descriptor = ArchiveDescriptor(self._read_additional_file('meta.xml'),
86 files_to_ignore=extensions_to_ignore)
87
/home/nickyn/anaconda3/lib/python3.5/site-packages/dwca/read.py in _read_additional_file(self, relative_path)
163 """Read an additional file in the archive and return its content."""
164 p = self.absolute_temporary_path(relative_path)
--> 165 return open(p).read()
166
167 def _parse_metadata_file(self):
FileNotFoundError: [Errno 2] No such file or directory: '/home/nickyn/dwca/t/meta.xml'
Presumably this is a valid archive - if so, should the reader locate the meta.xml and continue relative to that location?
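The suggested fix could be sketched as follows: after extraction, search for meta.xml and treat its parent directory as the archive root. The function name is illustrative, not the package's API; note that macOS resource-fork folders like __MACOSX (present in german_sl.zip) must be skipped.

```python
# Hypothetical helper: locate the directory that actually contains meta.xml
# inside an extracted archive, so reading can continue relative to it.
import os


def find_archive_root(extracted_dir):
    """Return the directory containing meta.xml, or None if absent."""
    for dirpath, dirnames, filenames in os.walk(extracted_dir):
        # Skip macOS resource-fork folders such as __MACOSX
        if "__MACOSX" in dirpath:
            continue
        if "meta.xml" in filenames:
            return dirpath
    return None
```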
This is getting critical because the documentation is now automatically extracted by Sphinx and published on Read the Docs.
It's the currently suggested version: faster, compatible with Python 3, ...
When iterating over lines, it goes to the next line prematurely when encountering a UTF-8 NEL character (charbase.com/0085-unicode-next-line-nel). Issue similar to: http://stackoverflow.com/questions/16227114/utf-8-files-read-in-python-will-line-break-at-character-x85.
Given the description of this character, it does make sense. However, since the EOL character is specified in meta.xml, we decided that it makes sense (and makes DwCAReader more resilient) to ignore it in this case.
The issue was discovered when playing with a sample export from the new GBIF data portal. The "issue" has also been fixed on their side, so this portal will probably not generate such exports in the future.
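The fix described above amounts to splitting rows only on the terminator declared in meta.xml, rather than relying on a line-splitting API that also breaks on NEL (str.splitlines() treats U+0085 as a line boundary). A minimal sketch, with an assumed helper name:

```python
# Hypothetical helper: split raw text into rows using only the terminator
# declared in meta.xml, so a stray U+0085 (NEL) inside a field value does
# not break a row in two.
def split_rows(raw_text, declared_terminator="\n"):
    rows = raw_text.split(declared_terminator)
    # Drop the trailing empty chunk left by a terminator at end of file.
    if rows and rows[-1] == "":
        rows.pop()
    return rows
```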
On ReadTheDocs ? Other ? Explore tools & options...
BeautifulSoup dependency has been removed in v0.7, but the utility script
dwca/darwincore/build_dc_terms_list.py (used during development, not needed for normal use) still uses BeautifulSoup to do its job.
It would be nicer if, like the rest of the package, it used ElementTree instead.
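Extracting term URIs with the standard library's ElementTree could look like this, assuming the input is well-formed RDF/XML (as the dwcterms RDF document is). The function name is an assumption; treat this as an illustration, not a drop-in replacement for the script.

```python
# Sketch: collect rdf:about URIs from an RDF/XML document with ElementTree,
# removing the need for BeautifulSoup in build_dc_terms_list.py.
import xml.etree.ElementTree as ET

RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def extract_term_uris(xml_text):
    root = ET.fromstring(xml_text)
    return [
        el.get(RDF_NS + "about")
        for el in root.iter(RDF_NS + "Description")
        if el.get(RDF_NS + "about")
    ]
```

Unlike BeautifulSoup, ElementTree requires well-formed XML; that constraint is acceptable here since the source document is machine-generated.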
several cases:
Currently the python-dwca-reader has lxml as a requirement. Is there a reason for this? I do not see where it is actually used. The reason I ask is that I would very much like to use the python-dwca-reader with Jython, but the dependency on lxml (which has no implementation that works with Jython, since it is based on C and has not been ported to date) makes this impossible. BeautifulSoup can use other parsers, so I wonder if it is possible to select the parser rather than require lxml.
(Default values, ...)
Definition here: http://rs.tdwg.org/dwc/terms/guides/text/index.htm
In the API page, we give an example using from dwca import DwCAReader.
It should be: from dwca.read import DwCAReader
Currently, the fact that it's not a zip file is tested, but what about a non-existent file? Or a zip file that does not contain a DwC-A?
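The missing cases could be covered along these lines. The open_archive() entry point and the exception types chosen here are hypothetical stand-ins; the real reader's constructor and errors may differ.

```python
# Sketch of the three input-validation cases: non-existent file, non-zip
# file, and a zip file that is not a DwC-A (no meta.xml inside).
import os
import zipfile


def open_archive(path):
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if not zipfile.is_zipfile(path):
        raise ValueError("Not a zip file: %s" % path)
    with zipfile.ZipFile(path) as z:
        if "meta.xml" not in z.namelist():
            raise ValueError("Zip file is not a DwC-A (no meta.xml): %s" % path)
    return "ok"
```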
Now that the API is stabilizing... take the examples out of example.py
The current solution is very simple and efficient, but might be a little naive in the long run:
Anyway, this is just a small helper that can be bypassed (by providing the full "qualnames" to python-dwca-reader), so this is probably enough for now.
Adapt the formicidae-atlas.be webapp and maybe Meuh to use the current version of python-dwca-reader, and use the experience to ensure API's quality and polish the documentation.
API: the (super important) extensions attribute of CoreRow is not documented anymore?
Currently the representation is ugly, the code too, and it is untested (so it's probably a good idea to clean up and define the expected output format at this time).
According to http://www.gbif.org/resource/80636, tar.gz/tgz files are also a valid way to compress Archives.
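Supporting both formats could be sketched with the standard library's zipfile and tarfile modules, detecting by content rather than by extension. The function name is an assumption, not the package's API.

```python
# Sketch: extract either a zip or a tar.gz/tgz archive to a destination
# directory, detecting the format from the file's content.
import tarfile
import zipfile


def extract_archive(path, dest):
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as z:
            z.extractall(dest)
    elif tarfile.is_tarfile(path):
        with tarfile.open(path) as t:
            t.extractall(dest)
    else:
        raise ValueError("Unsupported archive format: %s" % path)
```

Detecting by content (is_zipfile/is_tarfile) avoids relying on the .zip/.tgz extension, which downloads don't always carry.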