weecology / retriever
Quickly download, clean up, and install public datasets into a database management system
Home Page: http://data-retriever.org
License: Other
INSERT INTO PlantTaxonomy.PlantTaxonomy(symbol, synonym_symbol, scientific_name, common_name, family) VALUES ('ACNEI2', 'ACNEI'), "Acer negundo.....
Sorry PostgreSQL is such a hassle!
On Ubuntu 12.04 (the new LTS release) MySQL tests fail on 6 datasets.
Engine, Dataset, Error
('m', 'gentry', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'bbs', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_portal_mammals', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_zachmann2010', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'bbs50stop', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_del_moral_2010', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
It would be nice for error reporting if the error messages returned by the retriever could be copied from the GUI.
Importing FIA fails in Postgres 8.4.1 on Ubuntu.
Creating table FIA.SURVEY...
INSERT INTO FIA.SURVEY (cn, invyr, p3_ozone_ind, statecd, stateab, statenm, rscd, ann_inventory, notes, created_by, created_record_date, created_in_instance, modified_by, modified_record_date, modified_in_instance, cycle, subcycle) VALUES (22300165010478, 2001, 'N', 1, 'AL', 'Alabama', '33', 'Y', '', '', '2006-02-16', 10478, '', '2009-12-03', '10854', 8, 1);
Traceback (most recent call last):
File "/usr/local/bin/retriever", line 9, in
load_entry_point('retriever==1.2.1', 'console_scripts', 'retriever')()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/main.py", line 83, in main
script.download(engine)
File "scripts/fia.py", line 76, in download
engine.insert_data_from_file(engine.format_filename(prep_file_name))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/engines/postgres.py", line 88, in insert_data_from_file
return Engine.insert_data_from_file(self, filename)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 602, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 88, in add_to_table
self.execute(insert_stmt, commit=False)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 432, in execute
self.cursor.execute(statement)
psycopg2.DataError: integer out of range
I'm getting this error on MySQL, Postgres, and SQLite (Ubuntu 12.04):
INSERT INTO MCDB_communities (record_id, site_id, initial_year, species_id, presence_only, abundance, mass) VALUES (null, 1008, 2002, CHPE, '0', 52, null);
psycopg2.ProgrammingError: column "chpe" does not exist
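The unquoted CHPE in the statement above is parsed as an identifier, which is why PostgreSQL reports a missing column. One way to avoid hand-quoting values is to let the driver bind them. The following is only a minimal sketch using Python's built-in sqlite3 module; the column types are assumptions for illustration, and this is not the Retriever's actual insert code:

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MCDB_communities (record_id INTEGER, site_id INTEGER, "
    "initial_year INTEGER, species_id TEXT, presence_only TEXT, "
    "abundance INTEGER, mass REAL)"
)

# Values are passed separately from the SQL text, so the driver handles
# quoting: 'CHPE' is bound as a string and None is bound as NULL.
row = (None, 1008, 2002, "CHPE", "0", 52, None)
conn.execute("INSERT INTO MCDB_communities VALUES (?, ?, ?, ?, ?, ?, ?)", row)

species = conn.execute("SELECT species_id FROM MCDB_communities").fetchone()[0]
print(species)
```

Placeholder syntax differs between drivers (sqlite3 uses ?, psycopg2 uses %s), so an engine-independent version would need to account for that.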
Engine, Dataset, Error
('p', 'EA_zachmann2010', DataError('value too long for type character varying(4)\n',))
('p', 'EA_del_moral_2010', DataError('length for type varchar must be at least 1\nLINE 1: ...ame varchar(19), plot_code varchar(6), first_year varchar(0)...\n ^\n',))
('p', 'EA_clark2006', DataError('length for type varchar must be at least 1\nLINE 1: CREATE TABLE Clark2006.trees (id varchar(0), site varchar(9)...\n ^\n',))
('p', 'EA_barnes2008', DataError('value too long for type character varying(3)\n',))
Here is the error message:
Traceback (most recent call last):
File "retriever\app\connect_wizard.pyo", line 158, in open_file_dialog
IndexError: tuple index out of range
Windows reports "Error: " with no description. No log file generated.
Ubuntu reports "Error: Not a zip file." (the Ubuntu test was on yesterday's Retriever rc, prior to the small tweaks).
I was just looking through the FIA script in response to an email query and noticed the inclusion of the "year annual inventory began for that state", but can't see that those years are being used anywhere in the script. Am I missing something or is this part of some planned future addition to the script?
Windows 7 - PostgreSQL 8.4
Error: invalid literal for int() with base 10: 'Kotar S WI'
The FIA database is an awesome resource, but it is overwhelmingly complex, requiring a large amount of time with the metadata and substantial effort and knowledge to use. We should provide a simplified version of this database that contains the core data that would be used in most ecological analyses. This issue is to facilitate discussion of what exactly should be in the simplified FIA database and what the structure should be.
In MySQL, in some cases the 'notes' field in the 'SURVEY' table is broken up and included as the values for multiple fields, shifting the values of those fields into other fields. For example, see:
cn=2147283647
invyr=2001
The following error occurs almost immediately when trying to import CRC using the new deb install:
Error:(1054, "Unknown column 'estimated' in 'field list'")
The Advanced Search produces a more usable format to the data, because you can opt to separate out the authority info from the scientific name, and it generates more information broken up into more fields, in general. The current state of the Retriever version has all variety and authority information included in the scientific_name which doesn't allow for easy joins with other data. Thanks!
Problem:
The Retriever does not launch properly. The splash screen loads and the Retriever reports that it is downloading scripts, but the names it lists are not script names; they are tags, bits of code, etc. It keeps "Downloading scripts" for a long time (I cut it off after a couple of minutes), and the /scripts folder fills with text files that have junk names and contain only 'Not found'.
Steps to replicate:
Either upgrade a current Retriever installation from 1.0 or do a fresh install after removing the old installation and associated directories.
Hey guys,
I just downloaded a clean version of the following file:
https://github.com/weecology/retriever/raw/v1.2/windows/retriever.exe
which is linked to from this page: http://ecologicaldata.org/ecodata-retriever
and on two different machines both running windows 7 professional the following behavior was observed.
Retriever loads and downloads the relevant scripts, the connection to the database prompt appears to work, but then program appears to fail at loading the metadata for the available databases for download. Let me know if you need more details.
Dan
Importing PortalMammals on Windows 7 into SQLite results in an empty main table.
Probably due to the standard weirdness in Texas:
('c', 'bbs', AttributeError("DummyConnection instance has no attribute 'rollback'",))
Failed bulk insert on Texas, inserting manually.448
There was an error in Texas.
ERROR.
It would be nice to be able to create a single data table out of multiple data files, i.e.:
# tables
table: counts, count_*.csv
would create table "counts" out of files count_1.csv, count_2.csv, etc.
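A rough sketch of how the wildcard could be expanded, assuming every matching file shares the same header row. The function name read_matching_rows is hypothetical and not part of the Retriever:

```python
import csv
import glob


def read_matching_rows(pattern):
    """Combine all CSV files matching a glob pattern.

    Returns (header, rows). Assumes each file starts with the same
    header row; raises ValueError if a file's header differs.
    """
    header, rows = None, []
    # Sort so files are read in a predictable order.
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError("header mismatch in " + path)
            rows.extend(reader)
    return header, rows
```

Calling read_matching_rows("count_*.csv") would then yield one header plus the rows of count_1.csv, count_2.csv, etc., ready to be inserted into a single "counts" table.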
To facilitate better interaction with eBird data and the Hurlbert lab Taxonomy table, make the BBS species table more similar to this existing structure by:
I have just recently learned that PostgreSQL has a different definition of 'database' than MySQL. In PostgreSQL, you cannot get tables in different databases to talk to each other, as you can in MySQL. But, in PostgreSQL, you can organize data into schemas within a database. Tables in different schemas in the same database can talk to each other. The Retriever only allows you to select the database (used as it is in MySQL). When you specify a database in PostgreSQL, the Retriever creates a schema and places the table within that. This is not a big deal, as tables can be moved around, and it might in fact be the best organizational solution. The alternative is to provide the feature to place a table or set of tables directly into a specified schema in a specified database. I'm not sure what the answer here is, and this shouldn't be high on the priority list. Just thought I'd share.
Traceback (most recent call last):
File "", line 1, in
File "scripts/fia.py", line 69, in download
self.engine.insert_data_from_url(file)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 552, in insert_data_from_url
self.insert_data_from_file(self.format_filename(filename))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 542, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 65, in add_to_table
for n in range(len(linevalues))]
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 453, in format_insert_value
return int(intvalue)
ValueError: invalid literal for int() with base 10: 'CPPSSS11'
Previous line:
INSERT INTO FIA_COND (cn, plt_cn, invyr, statecd, unitcd, countycd, plot, condid, cond_status_cd, cond_nonsample_reasn_cd, reservcd, owncd, owngrpcd, forindcd, adforcd, fortypcd, fldtypcd, mapden, stdage, stdszcd, fldszcd, siteclcd, sicond, sibase, sisp, stdorgcd, stdorgsp, prop_basis, condprop_unadj, micrprop_unadj, subpprop_unadj, macrprop_unadj, slope, aspect, physclcd, gsstkcd, alstkcd, dstrbcd1, dstrbyr1, dstrbcd2, dstrbyr2, dstrbcd3, dstrbyr3, trtcd1, trtyr1, trtcd2, trtyr2, trtcd3, trtyr3, presnfcd, balive, fldage, alstk, gsstk, fortypcdcalc, habtypcd1, habtypcd1_pub_cd, habtypcd1_descr_pub_cd, habtypcd2, habtypcd2_pub_cd, habtypcd2_descr_pub_cd, mixedconfcd, vol_loc_grp, siteclcdest, sitetree_tree, sitecl_method, carbon_down_dead, carbon_litter, carbon_soil_org, carbon_standing_dead, carbon_understory_ag, carbon_understory_bg, created_by, created_record_date, created_in_instance, modified_by, modified_record_date, modified_in_instance, cycle, subcycle, soil_rooting_depth_pnw, ground_land_class_pnw, plant_stockability_factor_pnw, stnd_cond_cd_pnwrs, stnd_struc_cd_pnwrs, stump_cd_pnwrs, fire_srs, grazing_srs, harvest_type1_srs, harvest_type2_srs, harvest_type3_srs, land_use_srs, operability_srs, stand_structure_srs, nf_cond_status_cd, nf_cond_nonsample_reasn_cd, canopy_cvr_sample_method_cd, live_canopy_cvr_pct, live_missing_canopy_cvr_pct, nbr_live_stems) VALUES (3337761010690, 3337759010690, 1989, 32, 1, 3, 351, 1, 1, 0, 1, 11, 10, 0, 417, 261, 261, 1, 164, 1, 0, 6, 21, 50, 108, 0, 0, '', 1, 1, 1, 0, 67, 289, 0, 1, 1, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 260.6012, 0, 0, 0, 0, '', 0, 0, 0, 0, 0, '', '', 0, 0, 0, 6.885517, 18.602616, 14.039627, 7.507694, .393915, .043768, 0, '2004-05-28', 10690, 0, '2010-07-07', 10690, 1, 0, 0, 0, 0, 0, 0, '', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
I'm wondering if there would be a way to generate the number of rows per table that should show up in your completed database, perhaps to accompany the super-sweet 'download complete' graphic? I am just looking for a quick way to reassure myself that I have everything I should, rather than having to check against all the raw data files. Thanks!
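For the installed side of that comparison, a per-table row count is cheap to compute. This is a sketch for SQLite only, using Python's built-in sqlite3 module; table_row_counts is a hypothetical helper, and other engines would need their own catalog query:

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in a SQLite database."""
    names = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(
            "SELECT COUNT(*) FROM " + quoted
        ).fetchone()[0]
    return counts
```

The expected counts would still have to come from the raw files or the dataset's metadata; this only reports what actually landed in the database.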
I tried twice to dump PanTHERIA into my PostgreSQL db, and each time it only populated 3057 rows (when the Ecological Archives file has 5416).
One thing the Retriever doesn't really accomplish is helping the user with tracking when they downloaded the data (in case the source changes) and what version of the Retriever a particular database was built with (in case the Retriever changes). Many DBMSs allow metadata to be included as part of the database and including key provenance information in this metadata would be useful. An alternative would be to create a separate Retriever table that stores this information for all datasets installed using the Retriever (e.g., fields could include Dataset, Table, Date Created, Retriever Version). The advantage of this alternative is that it is probably less DBMS specific. This information could be stored as comments in .csv files.
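A minimal sketch of the separate-table alternative, using the fields suggested above (Dataset, Table, Date Created, Retriever Version). The table name retriever_provenance and the function record_provenance are assumptions for illustration, and SQLite is used only because it ships with Python:

```python
import sqlite3
from datetime import date


def record_provenance(conn, dataset, table, retriever_version, created=None):
    """Append one provenance row, creating the tracking table if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retriever_provenance ("
        "dataset TEXT, table_name TEXT, date_created TEXT, "
        "retriever_version TEXT)"
    )
    # Default to today's date in ISO format (YYYY-MM-DD).
    created = created or date.today().isoformat()
    conn.execute(
        "INSERT INTO retriever_provenance VALUES (?, ?, ?, ?)",
        (dataset, table, created, retriever_version),
    )
    conn.commit()
```

Because it relies only on plain CREATE TABLE and INSERT statements, the same idea should carry over to the other engines with little change.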
While importing the Stems table the retriever fails and returns the following error:
Error: ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Missing ), ], or Item in query expression '`, 5226, 'CUEVA', null, 24;'. (-3100) (SQLExecDirectW)")
Steps to reproduce:
Inserting rows to FIA_PLOT: 31355 / 31355
Traceback (most recent call last):
File "", line 1, in
File "scripts/fia.py", line 59, in download
self.engine.insert_data_from_url(file)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 538, in insert_data_from_url
self.insert_data_from_file(self.format_filename(filename))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 528, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 65, in add_to_table
for n in range(len(linevalues))]
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 441, in format_insert_value
return int(strvalue.split('.')[0])
ValueError: invalid literal for int() with base 10: 'S27LAK1B'
Using first time install of the data retriever, I get an error (red exclamation mark in the GUI) with the message "INSERT INTO BBS.species(AOU,genus,species,subspecies,id_to_species) VALUES(10010,'TINAMUS','major',null,1);" GUI had previously reported that it created 'routes' and 'weather' tables in database and inserted rows into them, but no tables or data were found.
All values of 'cn' == 2147483647 instead of the appropriate value from the CSV files.
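2147483647 is the largest signed 32-bit integer (2**31 - 1), which suggests the 'cn' values are overflowing a 32-bit integer column and being stored as the maximum. A sketch of a range check that could pick a wider type follows; integer_column_type is a hypothetical helper, and the type names INT and BIGINT are assumed to map onto the target engine:

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # 2147483647


def integer_column_type(values):
    """Return 'INT' if every value fits in a signed 32-bit integer,
    otherwise 'BIGINT'."""
    for value in values:
        if not INT32_MIN <= int(value) <= INT32_MAX:
            return "BIGINT"
    return "INT"
```

For example, the value 22300165010478 seen in FIA.SURVEY is far outside the 32-bit range and would select BIGINT.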
I tried to install the 1.1 RC deb on my Ubuntu 10.04 (the current LTS release) machine and received the following Package Installer error:
Error: Dependency is not satisfiable: python (>= 2.7.1-0ubuntu2)
The default Python install on Ubuntu 10.04 and on 10.10 is 2.6.x.
On fresh installs on both Windows 7 and XP importing USDA plants fails immediately following the download and returns the following error:
Error: [Errno 2] No such file or directory: '_new.raw_data\PlantTaxonomy\downloadData'
In an SQLite import of the Gentry database on Ubuntu 11.04 all of the text fields have " at both the beginning and end of the value. Looks like an escaping problem.
Documentation of testing prior to a 1.4 release. The goal is to release by April 27th.
Output of tests.py:
('s', 'EA_del_moral_2010', OperationalError('duplicate column name: species',))
('m', 'EA_del_moral_2010', OperationalError(1060, "Duplicate column name 'species'"))
Under sqlite and MySQL (and Access), the Species column of Del Moral is apparently being created twice.
When attempting to use the Retriever on both 32-bit and 64-bit Windows, there is a massive slowdown in the program's performance when using a PostgreSQL database connection. Specifically, this occurs only after configuring the database connection settings. It takes approximately 5 minutes for the Retriever to bring up its GUI. You are still able to download the data; the program is just very unresponsive and slow. Note that there are ample RAM and CPU resources available. In contrast, if the database connection is set up for a MySQL database, for example, the GUI comes right up and you are able to download datasets.
Via email:
"I've installed MySQL on my Mac, and Retriever. Here is my error. Interesting that is says the matched file is the 'wrong architecture'."
Complete with screen shot.
This is presumably a 32-bit vs. 64-bit issue, so I'm not sure if we can work around this with our existing Mac or if we'll need one with whatever the other architecture is (I'm assuming ours is 32-bit, but can't look until tomorrow). Here's a discussion on Stack Overflow to get us started.
Retriever fails while, or immediately after, creating the weather table and returns the error:
Error: ('07002', '[07002] [Microsoft][ODBC Microsoft Access Driver] Too few parameters. Expected 1. (-3010) (SQLExecDirectW)')
Occurs on both Windows 7 and XP.
tests.py output:
('p', 'EA_portal_mammals', IntegrityError('null value in column "plotid" violates not-null constraint\n',))
Error:[Errno ftp error] [Errno ftp error] 550 failed to change directory. Windows exe - PostgreSQL 8.4
Via email I've received a request to support a situation where the database already exists, but the user does not have the permission to CREATE new databases. This is definitely an edge use case, but one that clearly can come up with shared resources.
A quick look at engine.py suggests that we can probably accomplish this fairly easily by adding a try-catch in create_db. The trick will be failing gracefully in the case where the user doesn't have permission to create and the database doesn't already exist.
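The control flow might look roughly like the sketch below. Everything here is hypothetical: PermissionDenied stands in for whatever driver-specific error a failed CREATE raises, and the create and exists callables stand in for the engine's own methods. It is not the actual create_db code:

```python
class PermissionDenied(Exception):
    """Stand-in for the driver-specific error raised by a failed CREATE."""


def ensure_database(name, create, exists):
    """Try to create a database, falling back to an existing one.

    create(name) attempts the CREATE and may raise PermissionDenied.
    exists(name) returns True if the database is already present.
    """
    try:
        create(name)
        return "created"
    except PermissionDenied:
        if exists(name):
            # No permission to create, but the database is already there.
            return "existing"
        # Fail with a clear message instead of a raw driver error.
        raise RuntimeError(
            "Database %r does not exist and the current user "
            "cannot create it" % name
        )
```

The third branch is the "failing gracefully" case: the user gets an explanation of what to ask their database administrator for.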
The output of tests.py:
('p', 'EA_clark2006', DataError('value too long for type character varying(10)\n',))
('p', 'USDA_plants', DataError('value too long for type character varying(7)\n',))
('p', 'gentry', DataError('value too long for type character varying(20)\n',))
('p', 'EA_portal_mammals', DataError('value too long for type character varying(3)\n',))
('p', 'EA_mom2003', DataError('value too long for type character varying(7)\n',))
Sometimes in pg, comma-delimited strings surrounded by quotes are being read in as one long value which is too long for the column. This is a strange problem and I'm trying to figure out why it would only happen with pg.
The following tests failed on MS Access:
EA_mom2003: IndexError, list index out of range
gentry: IndexError, list index out of range
USDA_plants: IndexError, list index out of range
My PostgreSQL database is configured on a non-standard port. If I do not configure my port correctly in the retriever database connection file then retriever generates an error log (see below) and experiences a major performance slowdown.
<retriever.exe.log>
Traceback (most recent call last):
File "retriever\app\controls.pyo", line 54, in OnGetItem
File "retriever\app\controls.pyo", line 91, in HtmlScriptSummary
File "retriever\lib\templates.pyo", line 57, in exists
File "retriever\lib\engine.pyo", line 452, in exists
File "retriever\engines\postgres.pyo", line 101, in table_exists
File "retriever\lib\engine.pyo", line 555, in get_cursor
File "retriever\engines\postgres.pyo", line 127, in get_connection
File "psycopg2__init__.pyo", line 179, in connect
psycopg2.OperationalError: could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 5432?
could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?
The retriever continues to report the above error message as it continually attempts to find the PostgreSQL connection.
I'm getting an exception from py2app:
WARNING: Mach-O header may be too large to relocate
WARNING: Mach-O header may be too large to relocate
WARNING: Mach-O header may be too large to relocate
Traceback (most recent call last):
File "setup.py", line 108, in
'no_chdir': True,
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 480, in run
self._run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 643, in _run
self.run_normal()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 731, in run_normal
self.create_binaries(py_files, pkgdirs, extensions, loader_files)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 876, in create_binaries
platfiles = mm.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachOStandalone.py", line 133, in run
node.write(f)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 117, in write
header.write(f)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 312, in write
self.synchronize_size()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 302, in synchronize_size
raise ValueError("New Mach-O header is too large to relocate")
ValueError: New Mach-O header is too large to relocate
Importing into MS Access (2007 and 2010) currently fails on 4 datasets: Avian Body Size, USDA plants, Marine Predator and Prey Body Sizes, and Vegetation plots - del Moral, 2010.
Error: ('42000',"[42000][Microsoft][ODBC Microsoft Access Driver] Syntax error (missing operator in query expression...(-3100)(SQLExecDirectW)")
where the ... is a line of data that starts with:
Schlegel's Francolin', -- null for Avian Body Size
''"ABAM@" -- for USDA plants
29\xba40'N', -- for Marine Predator and Prey Body Sizes
Mertens' sedge', null, -- for Vegetation plots - del Moral, 2010
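The failing rows shown above all appear to contain a quote character inside a text value, which would end the SQL string literal early. A sketch of the standard SQL escape, doubling each embedded single quote; quote_sql_string is a hypothetical helper, not the Retriever's own formatting function:

```python
def quote_sql_string(value):
    """Wrap a text value in single quotes, doubling any embedded
    single quotes so they do not terminate the SQL literal."""
    return "'" + value.replace("'", "''") + "'"


print(quote_sql_string("Schlegel's Francolin"))
print(quote_sql_string("Mertens' sedge"))
```

Where the driver supports it, binding values as query parameters avoids manual escaping altogether.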
Texas BBS data are not imported into postgres database. Manual insert of "CTexas.csv" shows error:
ERROR: invalid input syntax for integer: ""
CONTEXT: COPY test, line 17137, column count20: ""
Line 17137 of CTexas.csv has asterisks (*) in column count20 and SpeciesTotal
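One possible cleanup step before the bulk insert is to map non-numeric placeholders to NULL. Whether an asterisk in the BBS files really means "missing" is an assumption that should be checked against the BBS metadata; to_int_or_none is a hypothetical helper:

```python
def to_int_or_none(value):
    """Convert a CSV field to int, returning None (SQL NULL) for
    blanks and non-numeric placeholders such as '*'."""
    text = value.strip()
    try:
        return int(text)
    except ValueError:
        return None
```

Applied to columns like count20 and SpeciesTotal, this would let the Texas file load with NULLs in place of the asterisks.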
On Ubuntu 12.04 the following error occurs for MySQL, Postgres, and SQLite:
Error: 'ascii' codec can't decode byte 0xb2 in position 32: ordinal not in range(128)
When I try to open the raw file in gedit it reports:
There was a problem opening the file /path/to/MCDB_trappings.csv
The file you opened has some invalid characters. If you continue editing this file you could corrupt this document.
You can also choose another character encoding and try again.
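Byte 0xb2 is not valid ASCII, so the file is in some other encoding. A sketch of a tolerant decode that tries UTF-8 first and falls back to Latin-1; that the file is actually Latin-1 is an assumption (in Latin-1, 0xb2 is the superscript two character), and decode_line is a hypothetical helper:

```python
def decode_line(raw):
    """Decode raw bytes as UTF-8, falling back to Latin-1.

    Latin-1 maps every byte to a character, so the fallback never
    raises, but the result is only correct if the file really is
    Latin-1 encoded.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```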
Looks like a change in the species table:
('p', 'bbs', DataError('value too long for type character varying(30)\n',))
INSERT INTO BBS.species (AOU, genus, species, subspecies, id_to_species) VALUES (1, 'Admin', 'Code', 'Admin Code Admin Code Admin Code Admin Code', TRUE);
ERROR.
I just downloaded the new release and tried to install into my PostgreSQL database. I got the error:
There was a problem with my database connection. No module named psycopg2. Is this a personal problem?
Use the year information in fia.py to only include data starting in the year that the standardized methods began.