weecology / retriever

Quickly download, clean up, and install public datasets into a database management system

License: Other

Python 98.31% Shell 0.04% Inno Setup 1.18% TeX 0.09% Dockerfile 0.38%

data-retrieval dataset python data data-science datasets hacktobefest

retriever's Introduction

Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing, and importing publicly available data is time-consuming because many datasets lack machine-readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. Automating this process reduces the time it takes a user to get most large datasets up and running by hours, and in some cases days.

Installing the Current Release

If you have Python installed you can install the current release using either pip:

pip install retriever

or conda after adding the conda-forge channel (conda config --add channels conda-forge):

conda install retriever

Depending on your system configuration this may require sudo for pip:

sudo pip install retriever

Precompiled binary installers are also available for Windows, OS X, and Ubuntu/Debian on the releases page. These do not require a Python installation.

List of Available Datasets

Installing From Source

To install the Data Retriever from source, you'll need Python 3.6.8+ with the following packages installed:

xlrd

The following packages are optionally needed to interact with associated database management systems:

PyMySQL (for MySQL)
sqlite3 (for SQLite)
psycopg2-binary (for PostgreSQL), previously psycopg2.
pyodbc (for MS Access - this option is only available on Windows)
Microsoft Access Driver (ODBC for Windows)

To install from source

Either use pip to install directly from GitHub:

pip install git+https://git@github.com/weecology/retriever.git

or:

Clone the repository
From the directory containing setup.py, run pip install . (depending on your system you may need to put sudo at the beginning of the command, i.e., sudo pip install .)

More extensive documentation for those that are interested in developing can be found here

Using the Command Line

After installing, run retriever update to download all of the available dataset scripts. To see the full list of command line options and datasets run retriever --help. The output will look like this:

usage: retriever [-h] [-v] [-q]
                 {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                 ...

positional arguments:
  {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                        sub-command help
    download            download raw data files for a dataset
    install             download and install dataset
    defaults            displays default options
    update              download updated versions of scripts
    new                 create a new sample retriever script
    new_json            CLI to create retriever datapackage.json script
    edit_json           CLI to edit retriever datapackage.json script
    delete_json         CLI to remove retriever datapackage.json script
    ls                  display a list all available dataset scripts
    citation            view citation
    reset               reset retriever: removes configuration settings,
                        scripts, and cached data
    help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q, --quiet           suppress command-line output

To install datasets, use retriever install:

usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
  {mysql,postgres,sqlite,msaccess,csv,json,xml}
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode

Examples

These examples use the Iris flower dataset. More examples can be found in the Data Retriever documentation.

Using Install

retriever install -h   (gives install options)

Using a specific database engine, retriever install {Engine}

retriever install mysql -h     (gives install mysql options)
retriever install mysql --user myuser --password ******** --host localhost --port 8888 --database_name testdbase iris

To install data into an SQLite database named iris.db you would use:

retriever install sqlite iris -f iris.db
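
The same command can be driven from a Python script. The following is a minimal sketch that shells out to the command line tool shown above; the helper names build_install_command and run_install are made up for this example and are not part of the Data Retriever.

```python
import subprocess


def build_install_command(engine, dataset, *options):
    """Return the argument list for `retriever install <engine> <dataset> [options]`."""
    return ["retriever", "install", engine, dataset, *options]


def run_install(engine, dataset, *options):
    """Run the command; raises CalledProcessError if retriever exits with an error."""
    return subprocess.run(build_install_command(engine, dataset, *options), check=True)


if __name__ == "__main__":
    # Prints the command equivalent to: retriever install sqlite iris -f iris.db
    print(" ".join(build_install_command("sqlite", "iris", "-f", "iris.db")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.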

Using download

retriever download -h    (gives you help options)
retriever download iris
retriever download iris --path C:\Users\Documents

Using citation

retriever citation   (citation of the retriever engine)
retriever citation iris  (citation for the iris data)

Spatial Dataset Installation

Set up Spatial support

To set up spatial support for Postgres using PostGIS, please refer to the spatial set-up docs.

retriever install postgres harvard-forest # Vector data
retriever install postgres bioclim # Raster data
# Install only the data of USGS elevation in the given extent
retriever install postgres usgs-elevation -b -94.98704597353938 39.027001800158615 -94.3599408119917 40.69577051867074

Website

For more information see the Data Retriever website.

Acknowledgments

Development of this software was funded by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.

retriever's Issues

PortalMammals Main Table is blank in SQLite import

Importing PortalMammals on Windows 7 into SQLite results in an empty main table.

Error importing BBS into PostgreSQL 8.4; Ubuntu 10.04

Using a first-time install of the Data Retriever, I get an error (red exclamation mark in the GUI) with the message "INSERT INTO BBS.species(AOU,genus,species,subspecies,id_to_species) VALUES(10010,'TINAMUS','major',null,1);". The GUI had previously reported that it created the 'routes' and 'weather' tables in the database and inserted rows into them, but no tables or data were found.

1.1 RC deb package won't install on older versions of Ubuntu due to Python 2.7 dependency

I tried to install the 1.1 RC deb on my Ubuntu 10.04 (the current LTS release) machine and received the following Package Installer error:

Error: Dependency is not satisfiable: python (>= 2.7.1-0ubuntu2)

The default Python install on Ubuntu 10.04 and on 10.10 is 2.6.x.

Error importing BBS into MS Access 2007

Retriever fails while, or immediately after, creating the weather table and returns the error:

Error: ('07002', '[07002] [Microsoft][ODBC Microsoft Access Driver] Too few parameters. Expected 1. (-3010) (SQLExecDirectW)')

Occurs on both Windows 7 and XP.

FIA fails to import due to Memory Error (too little memory)

Windows reports "Error: " with no description. No log file generated.

Ubuntu reports "Error: Not a zip file." (the Ubuntu test was on yesterday's Retriever rc, prior to the small tweaks).

If port is misspecified Retriever fails to fail gracefully

My PostgreSQL database is configured on a non-standard port. If I do not configure my port correctly in the retriever database connection file then retriever generates an error log (see below) and experiences a major performance slowdown.

<retriever.exe.log>
Traceback (most recent call last):
File "retriever\app\controls.pyo", line 54, in OnGetItem
File "retriever\app\controls.pyo", line 91, in HtmlScriptSummary
File "retriever\lib\templates.pyo", line 57, in exists
File "retriever\lib\engine.pyo", line 452, in exists
File "retriever\engines\postgres.pyo", line 101, in table_exists
File "retriever\lib\engine.pyo", line 555, in get_cursor
File "retriever\engines\postgres.pyo", line 127, in get_connection
File "psycopg2__init__.pyo", line 179, in connect
psycopg2.OperationalError: could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 5432?
could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?

The retriever continues to report the above error message as it continually attempts to find the PostgreSQL connection.
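
The traceback suggests that a connection is attempted each time the GUI asks whether a table exists. One way to fail gracefully, sketched here under that assumption, is to test the host and port once with a short timeout and report the problem before any per-dataset checks run. The function name server_reachable is hypothetical.

```python
import socket


def server_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

A caller could run this check once after the connection settings are entered and show a single message such as "cannot reach localhost on port 5432" instead of retrying for every dataset.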

"Choose... " button erroring for Access database connection

Here is the error message:

Traceback (most recent call last):
File "retriever\app\connect_wizard.pyo", line 158, in open_file_dialog
IndexError: tuple index out of range

PostgreSQL database vs. schema issue

I have just recently learned that PostgreSQL has a different definition of 'database' than MySQL. In PostgreSQL, you cannot get tables in different databases to talk to each other, as you can in MySQL. But, in PostgreSQL, you can organize data into schemas within a database. Tables in different schemas in the same database can talk to each other. The Retriever only allows you to select the database (used as it is in MySQL). When you specify a database in PostgreSQL, the Retriever creates a schema and places the table within that. This is not a big deal, as tables can be moved around, and it might in fact be the best organizational solution. The alternative is to provide the feature to place a table or set of tables directly into a specified schema in a specified database. I'm not sure what the answer here is, and this shouldn't be high on the priority list. Just thought I'd share.

"Duplicate column name: 'species'" on Del Moral (sqlite, MySQL, MS Access)

Output of tests.py:

('s', 'EA_del_moral_2010', OperationalError('duplicate column name: species',))
('m', 'EA_del_moral_2010', OperationalError(1060, "Duplicate column name 'species'"))

Under sqlite and MySQL (and Access), the Species column of Del Moral is apparently being created twice.

Texas BBS import error (Ubuntu 10.04; PostgreSQL 8.4)

Texas BBS data are not imported into postgres database. Manual insert of "CTexas.csv" shows error:

ERROR: invalid input syntax for integer: ""
CONTEXT: COPY test, line 17137, column count20: ""

Line 17137 of CTexas.csv has asterisks (*) in column count20 and SpeciesTotal
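
A possible fix is to treat asterisks and empty strings as missing values before they reach an integer column. This is only a sketch; parse_int_or_null is a made-up helper, and it assumes that asterisks mean "no data" in this file.

```python
def parse_int_or_null(raw):
    """Convert a raw CSV field to int, or None for empty or asterisk-only fields."""
    text = raw.strip()
    if not text or set(text) == {"*"}:
        return None  # inserted as NULL rather than as an invalid integer
    return int(text)
```

Any other non-numeric text still raises ValueError, so unexpected values are not silently dropped.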

Text fields in SQLite Gentry database have incorrect special characters

In an SQLite import of the Gentry database on Ubuntu 11.04 all of the text fields have " at both the beginning and end of the value. Looks like an escaping problem.

BBS import fails in PostgreSQL

Looks like a change in the species table:
('p', 'bbs', DataError('value too long for type character varying(30)\n',))
INSERT INTO BBS.species (AOU, genus, species, subspecies, id_to_species) VALUES (1, 'Admin', 'Code', 'Admin Code Admin Code Admin Code Admin Code', TRUE);
ERROR.

PanTHERIA incomplete import

I tried twice to dump PanTHERIA into my PostgreSQL db, and each time it only populated 3057 rows (when the Ecological Archives file has 5416).

Postgres: "value too long for type character varying(n)"

The output of tests.py:

('p', 'EA_clark2006', DataError('value too long for type character varying(10)\n',))
('p', 'USDA_plants', DataError('value too long for type character varying(7)\n',))
('p', 'gentry', DataError('value too long for type character varying(20)\n',))
('p', 'EA_portal_mammals', DataError('value too long for type character varying(3)\n',))
('p', 'EA_mom2003', DataError('value too long for type character varying(7)\n',))

Sometimes in pg, comma-delimited strings surrounded by quotes are being read in as one long value which is too long for the column. This is a strange problem and I'm trying to figure out why it would only happen with pg.
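
One way to investigate is to parse the raw file with Python's csv module and list every field that is longer than its column allows. The sketch below assumes the column limits are known in advance; oversized_values is a hypothetical helper, not part of the Retriever.

```python
import csv


def oversized_values(lines, max_lengths):
    """Yield (row_number, column_index, value) for fields longer than the column limit."""
    for row_number, row in enumerate(csv.reader(lines), start=1):
        for column_index, value in enumerate(row):
            if column_index < len(max_lengths) and len(value) > max_lengths[column_index]:
                yield row_number, column_index, value


# A row that is wrapped in quotes as a whole is parsed as a single long field.
sample = ['1,oak,ok', '"2,maple,ok"']
problems = list(oversized_values(sample, [3, 10, 2]))
```

Here the second sample row is reported because the surrounding quotes turn the whole line into one value for the first column.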

Wrong architecture error for MySQL on Mac

Via email:

"I've installed MySQL on my Mac, and Retriever. Here is my error. Interesting that is says the matched file is the 'wrong architecture'."

Complete with screen shot.

This is presumably a 32-bit vs. 64-bit issue, so I'm not sure if we can work around this with our existing Mac or if we'll need one with whatever the other architecture is (I'm assuming ours is 32-bit, but can't look until tomorrow). Here's a discussion on Stack Overflow to get us started.

What should a simplified version of the FIA database look like?

The FIA database is an awesome resource, but it is overwhelmingly complex requiring a large amount of time to be spent with the metadata and a substantial amount of effort and knowledge to use. We should provide a simplified version of this database that provides the core data that would be used in most ecological analyses. This issue is to facilitate discussion of what exactly should be in the simplified FIA database and what the structure should be.

py2app exception when trying to build

I'm getting an exception from py2app:

WARNING: Mach-O header may be too large to relocate
WARNING: Mach-O header may be too large to relocate
WARNING: Mach-O header may be too large to relocate
Traceback (most recent call last):
File "setup.py", line 108, in
'no_chdir': True,
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 480, in run
self._run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 643, in _run
self.run_normal()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 731, in run_normal
self.create_binaries(py_files, pkgdirs, extensions, loader_files)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 876, in create_binaries
platfiles = mm.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachOStandalone.py", line 133, in run
node.write(f)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 117, in write
header.write(f)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 312, in write
self.synchronize_size()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 302, in synchronize_size
raise ValueError("New Mach-O header is too large to relocate")
ValueError: New Mach-O header is too large to relocate

Allow error messages to be copied from Retriever GUI

It would be nice for error reporting if the error messages returned by the retriever could be copied from the GUI.

Bring BBS species table more in line with the eBird Taxonomy

To facilitate better interaction with eBird data and the Hurlbert lab Taxonomy table, make the BBS species table more similar to this existing structure by:

Add Order and Family fields.
Change id_to_species to a Category field that indicates whether the record is identified only to genus ('spuh'), to species ('species'), to subspecies ('issf'), only to a group of species ('slash'), or is a hybrid ('hybrid').

Allow multi-file table specifications in Retriever scripts

It would be nice to be able to create a single data table out of multiple data files, i.e.:

# tables
table: counts, count_*.csv

would create table "counts" out of files count_1.csv, count_2.csv, etc.
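
A minimal sketch of how such a pattern could be expanded, assuming every matching file has a header row and the same columns. The function read_table_from_files is hypothetical and only illustrates the idea.

```python
import csv
import glob


def read_table_from_files(pattern):
    """Combine the rows of all CSV files matching `pattern` into one table."""
    header, rows = None, []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"{path} has a different header: {file_header}")
            rows.extend(reader)
    return header, rows
```

Files are read in sorted name order, so count_10.csv would come before count_2.csv; that matters only if row order is significant.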

psycopg2 error

I just downloaded the new release and tried to install into my PostgreSQL database. I got the error:

There was a problem with my database connection. No module named psycopg2. Is this a personal problem?

MCDB 'ascii' error when importing Trapping table

On Ubuntu 12.04 the following error occurs for MySQL, Postgres, and SQLite:

Error: 'ascii' codec can't decode byte 0xb2 in position 32: ordinal not in range(128)

When I try to open the raw file in gedit it reports:

There was a problem opening the file /path/to/MCDB_trappings.csv
The file you opened has some invalid characters. If you continue editing this file you could corrupt this document.
You can also choose another character encoding and try again.
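
The byte 0xb2 is not valid ASCII, so the file has to be decoded with an explicit encoding. The sketch below tries UTF-8 first and falls back to Latin-1; the real encoding of the file is not known, so the fallback is an assumption, and decode_line is a made-up helper.

```python
def decode_line(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes with the first encoding in `encodings` that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode the line")
```

Latin-1 maps every byte to a character, so with the default encodings the function always returns a string; under Latin-1 the byte 0xb2 becomes a superscript two.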

FIA error

Windows 7 - PostgreSQL 8.4
Error: invalid literal for int() with base 10: 'Kotar S WI'

Access import errors for 4 datasets

Importing into MS Access (2007 and 2010) currently fails on 4 datasets: Avian Body Size, USDA plants, Marine Predator and Prey Body Sizes, and Vegetation plots - del Moral, 2010.

Error: ('42000',"[42000][Microsoft][ODBC Microsoft Access Driver] Syntax error (missing operator in query expression...(-3100)(SQLExecDirectW)")

where the ... is a line of data that starts with:
Schlegel's Francolin', -- null for Avian Body Size
''"ABAM@" -- for USDA plants
29\xba40'N', -- for Marine Predator and Prey Body Sizes
Mertens' sedge', null, -- for Vegetations plots - del Moral, 2010
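
All four failing values contain an apostrophe or another special character, which points to values being pasted into the SQL text. A common remedy is to pass values as query parameters so the driver handles quoting. The sketch below uses the standard library's sqlite3 module for illustration; pyodbc accepts the same ? placeholder style, but whether this fits the Retriever's Access engine is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (common_name TEXT, mass REAL)")

# Values with apostrophes are passed separately from the SQL text.
rows = [("Schlegel's Francolin", None), ("Mertens' sedge", None)]
conn.executemany("INSERT INTO species (common_name, mass) VALUES (?, ?)", rows)

stored = [row[0] for row in conn.execute("SELECT common_name FROM species ORDER BY rowid")]
```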

Retriever experiences massive slowdown with postgreSQL

When using retriever on both 32-bit and 64-bit Windows there is a massive slowdown in the program's performance with a PostgreSQL database connection. This occurs only after configuring the database connection settings. It takes approximately 5 minutes for retriever to bring up its GUI. You are still able to download the data, but the program is very unresponsive and slow, even though ample RAM and CPU resources are available. In contrast, if the database connection is set up for a MySQL database, for example, the GUI opens right away and you are able to download datasets.

FIA

Traceback (most recent call last):
File "", line 1, in
File "scripts/fia.py", line 69, in download
self.engine.insert_data_from_url(file)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 552, in insert_data_from_url
self.insert_data_from_file(self.format_filename(filename))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 542, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 65, in add_to_table
for n in range(len(linevalues))]
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 453, in format_insert_value
return int(intvalue)
ValueError: invalid literal for int() with base 10: 'CPPSSS11'

Previous line:

INSERT INTO FIA_COND (cn, plt_cn, invyr, statecd, unitcd, countycd, plot, condid, cond_status_cd, cond_nonsample_reasn_cd, reservcd, owncd, owngrpcd, forindcd, adforcd, fortypcd, fldtypcd, mapden, stdage, stdszcd, fldszcd, siteclcd, sicond, sibase, sisp, stdorgcd, stdorgsp, prop_basis, condprop_unadj, micrprop_unadj, subpprop_unadj, macrprop_unadj, slope, aspect, physclcd, gsstkcd, alstkcd, dstrbcd1, dstrbyr1, dstrbcd2, dstrbyr2, dstrbcd3, dstrbyr3, trtcd1, trtyr1, trtcd2, trtyr2, trtcd3, trtyr3, presnfcd, balive, fldage, alstk, gsstk, fortypcdcalc, habtypcd1, habtypcd1_pub_cd, habtypcd1_descr_pub_cd, habtypcd2, habtypcd2_pub_cd, habtypcd2_descr_pub_cd, mixedconfcd, vol_loc_grp, siteclcdest, sitetree_tree, sitecl_method, carbon_down_dead, carbon_litter, carbon_soil_org, carbon_standing_dead, carbon_understory_ag, carbon_understory_bg, created_by, created_record_date, created_in_instance, modified_by, modified_record_date, modified_in_instance, cycle, subcycle, soil_rooting_depth_pnw, ground_land_class_pnw, plant_stockability_factor_pnw, stnd_cond_cd_pnwrs, stnd_struc_cd_pnwrs, stump_cd_pnwrs, fire_srs, grazing_srs, harvest_type1_srs, harvest_type2_srs, harvest_type3_srs, land_use_srs, operability_srs, stand_structure_srs, nf_cond_status_cd, nf_cond_nonsample_reasn_cd, canopy_cvr_sample_method_cd, live_canopy_cvr_pct, live_missing_canopy_cvr_pct, nbr_live_stems) VALUES (3337761010690, 3337759010690, 1989, 32, 1, 3, 351, 1, 1, 0, 1, 11, 10, 0, 417, 261, 261, 1, 164, 1, 0, 6, 21, 50, 108, 0, 0, '', 1, 1, 1, 0, 67, 289, 0, 1, 1, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 260.6012, 0, 0, 0, 0, '', 0, 0, 0, 0, 0, '', '', 0, 0, 0, 6.885517, 18.602616, 14.039627, 7.507694, .393915, .043768, 0, '2004-05-28', 10690, 0, '2010-07-07', 10690, 1, 0, 0, 0, 0, 0, 0, '', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

BBS download error

Error:[Errno ftp error] [Errno ftp error] 550 failed to change directory. Windows exe - PostgreSQL 8.4

FIA: invalid literal for int()

Inserting rows to FIA_PLOT: 31355 / 31355
Traceback (most recent call last):
File "", line 1, in
File "scripts/fia.py", line 59, in download
self.engine.insert_data_from_url(file)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 538, in insert_data_from_url
self.insert_data_from_file(self.format_filename(filename))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 528, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 65, in add_to_table
for n in range(len(linevalues))]
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 441, in format_insert_value
return int(strvalue.split('.')[0])
ValueError: invalid literal for int() with base 10: 'S27LAK1B'

BBS import fails in CSV

Probably due to the standard weirdness in Texas:
('c', 'bbs', AttributeError("DummyConnection instance has no attribute 'rollback'",))
Failed bulk insert on Texas, inserting manually.448
There was an error in Texas.
ERROR.

All FIA plots have the same 'cn' value in the SURVEY table in MySQL

All values of 'cn' == 2147483647 instead of the appropriate value from the CSV files.
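
2147483647 is 2**31 - 1, the largest value a signed 32-bit integer column can hold, so the larger identifiers are probably being clamped to the column maximum. A sketch of choosing the column type from the data follows; integer_column_type is a hypothetical helper and the type names are the usual MySQL ones.

```python
INT32_MAX = 2**31 - 1  # 2147483647
INT32_MIN = -(2**31)


def integer_column_type(values):
    """Return 'INT' if every value fits in a signed 32-bit integer, else 'BIGINT'."""
    if all(INT32_MIN <= value <= INT32_MAX for value in values):
        return "INT"
    return "BIGINT"
```

For example, a 14-digit identifier such as 22300165010478 needs BIGINT.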

Feature request: Allow Retriever to work without the CREATE permission if the database already exists

Via email I've received a request to support a situation where the database already exists, but the user does not have the permission to CREATE new databases. This is definitely an edge use case, but one that clearly can come up with shared resources.

A quick look at engine.py suggests that we can probably accomplish this fairly easily by adding a try-catch in create_db. The trick will be failing gracefully in the case where the user doesn't have permission to create and the database doesn't already exist.
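
A rough sketch of that idea, written against two stand-in callables rather than the real engine code: execute runs one SQL statement and database_exists reports whether the database is already there. Both names, and create_db_if_allowed itself, are assumptions made for this example.

```python
def create_db_if_allowed(execute, db_name, database_exists):
    """Try to create the database; tolerate failure only if it already exists."""
    try:
        # db_name is assumed to be a trusted, already validated identifier.
        execute(f"CREATE DATABASE {db_name}")
        return "created"
    except Exception as error:
        if database_exists(db_name):
            return "already exists"
        raise RuntimeError(
            f"Could not create database {db_name} and it does not already exist"
        ) from error
```

The final branch is the graceful failure case: no permission to create and no existing database, reported with one clear message.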

Error importing USDA plants into MS Access 2007

On fresh installs on both Windows 7 and XP importing USDA plants fails immediately following the download and returns the following error:

Error: [Errno 2] No such file or directory: '_new.raw_data\PlantTaxonomy\downloadData'

Testing for 1.4 release

Documentation of testing prior to a 1.4 release. The goal is to release by April 27th.

For FIA only include data collected using the standardized method

Use the year information in fia.py to only include data starting in the year that the standardized methods began.
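
A sketch of the filter, assuming each row is a dict with the statecd and invyr fields and that the start years are available as a mapping from state code to year. The function name is hypothetical, and any years used with it here are placeholders, not the real inventory start years.

```python
def rows_from_standardized_years(rows, start_year_by_state):
    """Keep rows collected in or after the year the standardized method began."""
    kept = []
    for row in rows:
        start_year = start_year_by_state.get(row["statecd"])
        # Rows from states without a known start year are dropped.
        if start_year is not None and row["invyr"] >= start_year:
            kept.append(row)
    return kept
```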

Improve provenance tracking

One thing the Retriever doesn't really accomplish is helping the user with tracking when they downloaded the data (in case the source changes) and what version of the Retriever a particular database was built with (in case the Retriever changes). Many DBMSs allow metadata to be included as part of the database and including key provenance information in this metadata would be useful. An alternative would be to create a separate Retriever table that stores this information for all datasets installed using the Retriever (e.g., fields could include Dataset, Table, Date Created, Retriever Version). The advantage of this alternative is that it is probably less DBMS specific. This information could be stored as comments in .csv files.
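
The separate-table alternative could look roughly like this. The sketch uses SQLite from the standard library; the table name retriever_provenance, the column names, and the function record_provenance are all illustrative choices, not an existing feature.

```python
import sqlite3
from datetime import date


def record_provenance(conn, dataset, table, retriever_version, created=None):
    """Append one provenance row, creating the tracking table if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retriever_provenance ("
        "dataset TEXT, table_name TEXT, date_created TEXT, retriever_version TEXT)"
    )
    conn.execute(
        "INSERT INTO retriever_provenance VALUES (?, ?, ?, ?)",
        (dataset, table, (created or date.today()).isoformat(), retriever_version),
    )
    conn.commit()
```

Storing the date as ISO text keeps the table readable in any DBMS and sortable as a plain string.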

Error importing CRC into MySQL 1.1RC on Ubuntu 10.04

The following error occurs almost immediately when trying to import CRC using the new deb install:
Error:(1054, "Unknown column 'estimated' in 'field list'")

1.1 RC fails to successfully load scripts on Ubuntu 11.04

Problem:
The Retriever does not launch properly. The splash screen loads and the Retriever reports that it is downloading scripts, but the names shown are not script names; they are tags, bits of code, etc. It keeps "Downloading scripts" for a long time (I cut it off after a couple of minutes) and the /scripts folder is filled with text files that have junky names and contain only 'Not found'.

Steps to replicate:
Either upgrade a current Retriever installation from 1.0 or do a fresh install after removing the old installation and associated directories.

FIA 'integer out of range' error on Postgres 8.4

Importing FIA fails in Postgres 8.4.1 on Ubuntu.

Creating table FIA.SURVEY...
INSERT INTO FIA.SURVEY (cn, invyr, p3_ozone_ind, statecd, stateab, statenm, rscd, ann_inventory, notes, created_by, created_record_date, created_in_instance, modified_by, modified_record_date, modified_in_instance, cycle, subcycle) VALUES (22300165010478, 2001, 'N', 1, 'AL', 'Alabama', '33', 'Y', '', '', '2006-02-16', 10478, '', '2009-12-03', '10854', 8, 1);
Traceback (most recent call last):
File "/usr/local/bin/retriever", line 9, in
load_entry_point('retriever==1.2.1', 'console_scripts', 'retriever')()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/main.py", line 83, in main
script.download(engine)
File "scripts/fia.py", line 76, in download
engine.insert_data_from_file(engine.format_filename(prep_file_name))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/engines/postgres.py", line 88, in insert_data_from_file
return Engine.insert_data_from_file(self, filename)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 602, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 88, in add_to_table
self.execute(insert_stmt, commit=False)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 432, in execute
self.cursor.execute(statement)
psycopg2.DataError: integer out of range

MS Access failed tests

The following tests failed on MS Access:

EA_mom2003: IndexError, list index out of range
gentry: IndexError, list index out of range
USDA_plants: IndexError, list index out of range

MCDB insert error

I'm getting this error on MySQL, Postgres, and SQLite (Ubuntu 12.04):

INSERT INTO MCDB_communities (record_id, site_id, initial_year, species_id, presence_only, abundance, mass) VALUES (null, 1008, 2002, CHPE, '0', 52, null);

psycopg2.ProgrammingError: column "chpe" does not exist

MySQL Error 1148 on MySQL 5.5.22 for several datasets

On Ubuntu 12.04 (the new LTS release) MySQL tests fail on 6 datasets.

Engine, Dataset, Error
('m', 'gentry', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'bbs', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_portal_mammals', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_zachmann2010', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'bbs50stop', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_del_moral_2010', OperationalError(1148, 'The used command is not allowed with this MySQL version'))

Error importing Gentry data in 1.1 RC on Windows 7

While importing the Stems table the retriever fails and returns the following error:

Error: ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Missing ), ], or Item in query expression '`, 5226, 'CUEVA', null, 24;'. (-3100) (SQLExecDirectW)")

Steps to reproduce:

Remove all previous traces of the retriever (.exe, scripts and raw_data folders, connections file)
Download retriever
Select already created, blank, MS Access database
Select the Gentry dataset

retriever.exe not working in Windows 7 professional

Hey guys,

I just downloaded a clean version of the following file:

https://github.com/weecology/retriever/raw/v1.2/windows/retriever.exe

which is linked to from this page: http://ecologicaldata.org/ecodata-retriever

and on two different machines both running windows 7 professional the following behavior was observed.

Retriever loads and downloads the relevant scripts, the connection to the database prompt appears to work, but then program appears to fail at loading the metadata for the available databases for download. Let me know if you need more details.

Dan

Bug in auto_get_datatypes for Postgres varchar type

Engine, Dataset, Error
('p', 'EA_zachmann2010', DataError('value too long for type character varying(4)\n',))
('p', 'EA_del_moral_2010', DataError('length for type varchar must be at least 1\nLINE 1: ...ame varchar(19), plot_code varchar(6), first_year varchar(0)...\n ^\n',))
('p', 'EA_clark2006', DataError('length for type varchar must be at least 1\nLINE 1: CREATE TABLE Clark2006.trees (id varchar(0), site varchar(9)...\n ^\n',))
('p', 'EA_barnes2008', DataError('value too long for type character varying(3)\n',))
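
Both kinds of error are consistent with the varchar length being derived from only some of the rows, although that cause is an assumption. A sketch of a length calculation that scans every value and never returns less than 1 follows; varchar_length is a made-up name, not the Retriever's auto_get_datatypes.

```python
def varchar_length(values, minimum=1):
    """Return the longest string length in `values`, but at least `minimum`."""
    longest = max((len(value) for value in values if value is not None), default=0)
    return max(longest, minimum)
```

A column that contains only empty strings or nulls then becomes varchar(1) instead of the invalid varchar(0).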

What is the purpose of the years in the fia.py script

I was just looking through the FIA script in response to an email query and noticed the inclusion of the "year annual inventory began for that state", but can't see that those years are being used anywhere in the script. Am I missing something or is this part of some planned future addition to the script?

FIA 'notes' field in 'SURVEY' table not properly parsed in some cases

In MySQL, in some cases the 'notes' field in the 'SURVEY' table is broken up and its pieces are inserted as the values of multiple fields, shifting the values of those fields into other fields. For example, see:

cn=2147283647
invyr=2001

USDA plants - use Advanced Search Download option?

The Advanced Search produces a more usable format for the data, because you can opt to separate the authority info from the scientific name, and in general it generates more information broken up into more fields. The current Retriever version has all variety and authority information included in scientific_name, which doesn't allow for easy joins with other data. Thanks!

Feature request: table stats to doublecheck successful import

I'm wondering if there would be a way to generate the number of rows per table that should show up in your completed database, perhaps to accompany the super-sweet 'download complete' graphic? I am just looking for a quick way to reassure myself that I have everything I should, rather than having to check against all the raw data files. Thanks!
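
For an SQLite install, the installed side of that comparison can be obtained with a few lines. This is a sketch with a hypothetical function name, table_row_counts; the expected counts would still have to come from the raw data files.

```python
import sqlite3


def table_row_counts(conn):
    """Return a dict mapping each table name in an SQLite database to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }
```

Calling table_row_counts(sqlite3.connect("iris.db")) after an install would give the counts to compare.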

USDA plants error - PostgreSQL 8.4

INSERT INTO PlantTaxonomy.PlantTaxonomy(symbol, synonym_symbol, scientific_name, common_name, family) VALUES ('ACNEI2', 'ACNEI'), "Acer negundo.....

Sorry PostgreSQL is such a hassle!

Portal Mammals - null value being inserted into plotid

tests.py output:

('p', 'EA_portal_mammals', IntegrityError('null value in column "plotid" violates not-null constraint\n',))