weecology / retriever

Quickly download, clean up, and install public datasets into a database management system

Home Page: http://data-retriever.org

License: Other

Python 98.31% Shell 0.04% Inno Setup 1.18% TeX 0.09% Dockerfile 0.38%
data-retrieval dataset python data data-science datasets hacktobefest

retriever's Issues

USDA plants error - PostgreSQL 8.4

INSERT INTO PlantTaxonomy.PlantTaxonomy(symbol, synonym_symbol, scientific_name, common_name, family) VALUES ('ACNEI2', 'ACNEI'), "Acer negundo.....

Sorry PostgreSQL is such a hassle!

MySQL Error 1148 on MySQL 5.5.22 for several datasets

On Ubuntu 12.04 (the new LTS release) MySQL tests fail on 6 datasets.

Engine, Dataset, Error
('m', 'gentry', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'bbs', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_portal_mammals', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_zachmann2010', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'bbs50stop', OperationalError(1148, 'The used command is not allowed with this MySQL version'))
('m', 'EA_del_moral_2010', OperationalError(1148, 'The used command is not allowed with this MySQL version'))

FIA 'integer out of range' error on Postgres 8.4

Importing FIA fails in Postgres 8.4.1 on Ubuntu.

Creating table FIA.SURVEY...
INSERT INTO FIA.SURVEY (cn, invyr, p3_ozone_ind, statecd, stateab, statenm, rscd, ann_inventory, notes, created_by, created_record_date, created_in_instance, modified_by, modified_record_date, modified_in_instance, cycle, subcycle) VALUES (22300165010478, 2001, 'N', 1, 'AL', 'Alabama', '33', 'Y', '', '', '2006-02-16', 10478, '', '2009-12-03', '10854', 8, 1);
Traceback (most recent call last):
File "/usr/local/bin/retriever", line 9, in <module>
load_entry_point('retriever==1.2.1', 'console_scripts', 'retriever')()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/main.py", line 83, in main
script.download(engine)
File "scripts/fia.py", line 76, in download
engine.insert_data_from_file(engine.format_filename(prep_file_name))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/engines/postgres.py", line 88, in insert_data_from_file
return Engine.insert_data_from_file(self, filename)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 602, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 88, in add_to_table
self.execute(insert_stmt, commit=False)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.2.1-py2.7.egg/retriever/lib/engine.py", line 432, in execute
self.cursor.execute(statement)
psycopg2.DataError: integer out of range
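
One plausible reading of this traceback is that the first value in the statement, 22300165010478, is larger than the maximum of a 4-byte PostgreSQL integer (2147483647), so the column would need to be a bigint. The sketch below shows how a column type could be chosen from the data; `pick_integer_type` is a hypothetical helper, not a Retriever function.

```python
# Upper limits of PostgreSQL's signed integer types.
INT4_MAX = 2**31 - 1  # integer: 2147483647
INT8_MAX = 2**63 - 1  # bigint: 9223372036854775807


def pick_integer_type(values):
    """Return the narrowest PostgreSQL type that holds every value.

    Hypothetical helper for illustration. It compares magnitudes only,
    so it is slightly conservative at the negative end of each range.
    """
    largest = max((abs(int(v)) for v in values), default=0)
    if largest <= INT4_MAX:
        return "integer"
    if largest <= INT8_MAX:
        return "bigint"
    return "numeric"


cn_type = pick_integer_type([22300165010478])
year_type = pick_integer_type([2001, 1989])
```

Here `cn_type` comes out as a bigint while `year_type` stays an integer, which matches the values in the failing INSERT.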

MCDB insert error

I'm getting this error on MySQL, PostgreSQL, and SQLite (Ubuntu 12.04)

INSERT INTO MCDB_communities (record_id, site_id, initial_year, species_id, presence_only, abundance, mass) VALUES (null, 1008, 2002, CHPE, '0', 52, null);

psycopg2.ProgrammingError: column "chpe" does not exist
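
The species code CHPE reaches the statement without quotes, so PostgreSQL treats it as a column name. A minimal sketch of one way to avoid this class of error is to pass values as query parameters and let the driver quote them. The example uses SQLite so that it is self-contained; the column types are assumptions, and this is not the Retriever's own insert code.

```python
import sqlite3


def insert_community_row(conn, row):
    """Insert one row with placeholders so the driver quotes each value."""
    conn.execute(
        "INSERT INTO MCDB_communities "
        "(record_id, site_id, initial_year, species_id, "
        "presence_only, abundance, mass) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        row,
    )


conn = sqlite3.connect(":memory:")
# Column types are assumed for the sake of the example.
conn.execute(
    "CREATE TABLE MCDB_communities (record_id INTEGER, site_id INTEGER, "
    "initial_year INTEGER, species_id TEXT, presence_only TEXT, "
    "abundance INTEGER, mass REAL)"
)
insert_community_row(conn, (None, 1008, 2002, "CHPE", "0", 52, None))
stored = conn.execute("SELECT species_id FROM MCDB_communities").fetchone()[0]
```

With psycopg2 the placeholder would be `%s` rather than `?`, but the idea is the same.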

Bug in auto_get_datatypes for Postgres varchar type

Engine, Dataset, Error
('p', 'EA_zachmann2010', DataError('value too long for type character varying(4)\n',))
('p', 'EA_del_moral_2010', DataError('length for type varchar must be at least 1\nLINE 1: ...ame varchar(19), plot_code varchar(6), first_year varchar(0)...\n ^\n',))
('p', 'EA_clark2006', DataError('length for type varchar must be at least 1\nLINE 1: CREATE TABLE Clark2006.trees (id varchar(0), site varchar(9)...\n ^\n',))
('p', 'EA_barnes2008', DataError('value too long for type character varying(3)\n',))
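
The `varchar(0)` declarations suggest that the detected width for an all-empty column is used as is. A minimal sketch of a guard, assuming the width is computed from the observed values (`varchar_type` is a hypothetical name, not the Retriever's `auto_get_datatypes`):

```python
def varchar_type(values, minimum=1):
    """Build a varchar(n) declaration wide enough for every value.

    The width never drops below `minimum`, because PostgreSQL rejects
    varchar(0). None values are ignored.
    """
    longest = max((len(str(v)) for v in values if v is not None), default=0)
    return "varchar(%d)" % max(longest, minimum)


empty_column = varchar_type(["", "", None])
site_column = varchar_type(["abc", "abcdefghi"])
```

The "value too long" errors could have a related cause, for example a width estimated from only part of the file, but that is a guess; the guard above only addresses the zero-length case.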

What is the purpose of the years in the fia.py script

I was just looking through the FIA script in response to an email query and noticed the inclusion of the "year annual inventory began for that state", but can't see that those years are being used anywhere in the script. Am I missing something or is this part of some planned future addition to the script?

FIA error

Windows 7 - PostgreSQL 8.4
Error: invalid literal for int() with base 10: 'Kotar S WI'

What should a simplified version of the FIA database look like?

The FIA database is an awesome resource, but it is overwhelmingly complex requiring a large amount of time to be spent with the metadata and a substantial amount of effort and knowledge to use. We should provide a simplified version of this database that provides the core data that would be used in most ecological analyses. This issue is to facilitate discussion of what exactly should be in the simplified FIA database and what the structure should be.

USDA plants - use Advanced Search Download option?

The Advanced Search produces a more usable format for the data: you can opt to separate the authority info from the scientific name, and in general it generates more information broken up into more fields. The current Retriever version includes all variety and authority information in scientific_name, which doesn't allow for easy joins with other data. Thanks!

1.1 RC fails to successfully load scripts on Ubuntu 11.04

Problem:
The Retriever does not launch properly. The splash screen loads and the Retriever reports that it is downloading scripts, but the names shown are not script names; they are tags, bits of code, etc. It keeps "Downloading scripts" for a long time (I cut it off after a couple of minutes) and the /scripts folder is filled with text files that have junk names and contain only 'Not found'.

Steps to replicate:
Either upgrade a current Retriever installation from 1.0 or do a fresh install after removing the old installation and associated directories.

retriever.exe not working in Windows 7 professional

Hey guys,

I just downloaded a clean version of the following file:

https://github.com/weecology/retriever/raw/v1.2/windows/retriever.exe

which is linked to from this page: http://ecologicaldata.org/ecodata-retriever

and on two different machines both running windows 7 professional the following behavior was observed.

Retriever loads and downloads the relevant scripts, and the database connection prompt appears to work, but then the program appears to fail at loading the metadata for the databases available for download. Let me know if you need more details.

Dan

BBS import fails in CSV

Probably due to the standard weirdness in Texas:
('c', 'bbs', AttributeError("DummyConnection instance has no attribute 'rollback'",))
Failed bulk insert on Texas, inserting manually.448
There was an error in Texas.
ERROR.

Bring BBS species table more in line with the eBird Taxonomy

To facilitate better interaction with eBird data and the Hurlbert lab Taxonomy table, make the BBS species table more similar to this existing structure by:

  1. Add Order and Family fields.
  2. Change id_to_species to a Category field that indicates whether the record is identified only to genus ('spuh'), to species ('species'), to subspecies ('issf'), only to a group of species ('slash'), or is a hybrid ('hybrid').

PostgreSQL database vs. schema issue

I have just recently learned that PostgreSQL has a different definition of 'database' than MySQL. In PostgreSQL, you cannot get tables in different databases to talk to each other, as you can in MySQL. But, in PostgreSQL, you can organize data into schemas within a database. Tables in different schemas in the same database can talk to each other. The Retriever only allows you to select the database (used as it is in MySQL). When you specify a database in PostgreSQL, the Retriever creates a schema and places the table within that. This is not a big deal, as tables can be moved around, and it might in fact be the best organizational solution. The alternative is to provide the feature to place a table or set of tables directly into a specified schema in a specified database. I'm not sure what the answer here is, and this shouldn't be high on the priority list. Just thought I'd share.

FIA

Traceback (most recent call last):
File "", line 1, in
File "scripts/fia.py", line 69, in download
self.engine.insert_data_from_url(file)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 552, in insert_data_from_url
self.insert_data_from_file(self.format_filename(filename))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 542, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 65, in add_to_table
for n in range(len(linevalues))]
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 453, in format_insert_value
return int(intvalue)
ValueError: invalid literal for int() with base 10: 'CPPSSS11'

Previous line:

INSERT INTO FIA_COND (cn, plt_cn, invyr, statecd, unitcd, countycd, plot, condid, cond_status_cd, cond_nonsample_reasn_cd, reservcd, owncd, owngrpcd, forindcd, adforcd, fortypcd, fldtypcd, mapden, stdage, stdszcd, fldszcd, siteclcd, sicond, sibase, sisp, stdorgcd, stdorgsp, prop_basis, condprop_unadj, micrprop_unadj, subpprop_unadj, macrprop_unadj, slope, aspect, physclcd, gsstkcd, alstkcd, dstrbcd1, dstrbyr1, dstrbcd2, dstrbyr2, dstrbcd3, dstrbyr3, trtcd1, trtyr1, trtcd2, trtyr2, trtcd3, trtyr3, presnfcd, balive, fldage, alstk, gsstk, fortypcdcalc, habtypcd1, habtypcd1_pub_cd, habtypcd1_descr_pub_cd, habtypcd2, habtypcd2_pub_cd, habtypcd2_descr_pub_cd, mixedconfcd, vol_loc_grp, siteclcdest, sitetree_tree, sitecl_method, carbon_down_dead, carbon_litter, carbon_soil_org, carbon_standing_dead, carbon_understory_ag, carbon_understory_bg, created_by, created_record_date, created_in_instance, modified_by, modified_record_date, modified_in_instance, cycle, subcycle, soil_rooting_depth_pnw, ground_land_class_pnw, plant_stockability_factor_pnw, stnd_cond_cd_pnwrs, stnd_struc_cd_pnwrs, stump_cd_pnwrs, fire_srs, grazing_srs, harvest_type1_srs, harvest_type2_srs, harvest_type3_srs, land_use_srs, operability_srs, stand_structure_srs, nf_cond_status_cd, nf_cond_nonsample_reasn_cd, canopy_cvr_sample_method_cd, live_canopy_cvr_pct, live_missing_canopy_cvr_pct, nbr_live_stems) VALUES (3337761010690, 3337759010690, 1989, 32, 1, 3, 351, 1, 1, 0, 1, 11, 10, 0, 417, 261, 261, 1, 164, 1, 0, 6, 21, 50, 108, 0, 0, '', 1, 1, 1, 0, 67, 289, 0, 1, 1, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 260.6012, 0, 0, 0, 0, '', 0, 0, 0, 0, 0, '', '', 0, 0, 0, 6.885517, 18.602616, 14.039627, 7.507694, .393915, .043768, 0, '2004-05-28', 10690, 0, '2010-07-07', 10690, 1, 0, 0, 0, 0, 0, 0, '', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

Feature request: table stats to doublecheck successful import

I'm wondering if there would be a way to generate the number of rows per table that should show up in your completed database, perhaps to accompany the super-sweet 'download complete' graphic? I am just looking for a quick way to reassure myself that I have everything I should, rather than having to check against all the raw data files. Thanks!
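
A minimal sketch of what such a report could look like, using SQLite so the example runs on its own. `table_row_counts` is a hypothetical helper; other engines would list their tables through their own catalog rather than `sqlite_master`.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in a SQLite database."""
    names = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
    ]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute("SELECT COUNT(*) FROM " + quoted).fetchone()[0]
    return counts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (id INTEGER)")
conn.executemany("INSERT INTO species VALUES (?)", [(1,), (2,), (3,)])
counts = table_row_counts(conn)
```

Comparing these counts with the number of data rows in each raw file would flag an incomplete import without opening the files by hand.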

PanTHERIA incomplete import

I tried twice to dump PanTHERIA into my PostgreSQL db, and each time it only populated 3057 rows (when the Ecological Archives file has 5416).

Improve provenance tracking

One thing the Retriever doesn't really accomplish is helping the user with tracking when they downloaded the data (in case the source changes) and what version of the Retriever a particular database was built with (in case the Retriever changes). Many DBMSs allow metadata to be included as part of the database and including key provenance information in this metadata would be useful. An alternative would be to create a separate Retriever table that stores this information for all datasets installed using the Retriever (e.g., fields could include Dataset, Table, Date Created, Retriever Version). The advantage of this alternative is that it is probably less DBMS specific. This information could be stored as comments in .csv files.
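
The separate-table alternative could be sketched as follows. The table name `retriever_provenance`, the column names, and the sample values are assumptions chosen to mirror the fields listed above; SQLite is used only to keep the example self-contained.

```python
import sqlite3
from datetime import date


def record_provenance(conn, dataset, table, retriever_version, created=None):
    """Append one provenance row, creating the table on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retriever_provenance "
        "(dataset TEXT, table_name TEXT, date_created TEXT, retriever_version TEXT)"
    )
    conn.execute(
        "INSERT INTO retriever_provenance VALUES (?, ?, ?, ?)",
        (dataset, table, (created or date.today()).isoformat(), retriever_version),
    )


conn = sqlite3.connect(":memory:")
# Sample values are illustrative only.
record_provenance(conn, "gentry", "stems", "1.2.1", created=date(2012, 5, 1))
rows = conn.execute("SELECT * FROM retriever_provenance").fetchall()
```

Because it uses only CREATE TABLE and INSERT, the same approach should carry over to the other engines with little change.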

Error importing Gentry data in 1.1 RC on Windows 7

While importing the Stems table the retriever fails and returns the following error:

Error: ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Missing ), ], or Item in query expression '`, 5226, 'CUEVA', null, 24;'. (-3100) (SQLExecDirectW)")

Steps to reproduce:

  1. Remove all previous traces of the retriever (.exe, scripts and raw_data folders, connections file)
  2. Download retriever
  3. Select already created, blank, MS Access database
  4. Select the Gentry dataset

FIA: invalid literal for int()

Inserting rows to FIA_PLOT: 31355 / 31355
Traceback (most recent call last):
File "", line 1, in
File "scripts/fia.py", line 59, in download
self.engine.insert_data_from_url(file)
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 538, in insert_data_from_url
self.insert_data_from_file(self.format_filename(filename))
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 528, in insert_data_from_file
self.add_to_table()
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 65, in add_to_table
for n in range(len(linevalues))]
File "/usr/local/lib/python2.7/dist-packages/retriever-1.0.0-py2.7.egg/retriever/lib/engine.py", line 441, in format_insert_value
return int(strvalue.split('.')[0])
ValueError: invalid literal for int() with base 10: 'S27LAK1B'
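
The traceback shows `int()` being applied to a value that is not numeric, which suggests the column was typed as an integer before a text value such as 'S27LAK1B' turned up. A minimal sketch of a stricter check, assuming the type is inferred from the raw strings (both function names are hypothetical, not Retriever code):

```python
def is_integer_like(value):
    """True when int() accepts the string."""
    try:
        int(value)
    except ValueError:
        return False
    return True


def infer_column_type(values):
    """Call a column 'int' only if every non-empty value parses as an integer."""
    non_empty = [v for v in values if v.strip()]
    if non_empty and all(is_integer_like(v) for v in non_empty):
        return "int"
    return "char"


numeric_column = infer_column_type(["1", "27", ""])
mixed_column = infer_column_type(["1", "27", "S27LAK1B"])
```

The cost is that every value has to be inspected before the table is created, not just the first rows.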

Error importing BBS into PostgreSQL 8.4; Ubuntu 10.04

Using first time install of the data retriever, I get an error (red exclamation mark in the GUI) with the message "INSERT INTO BBS.species(AOU,genus,species,subspecies,id_to_species) VALUES(10010,'TINAMUS','major',null,1);" GUI had previously reported that it created 'routes' and 'weather' tables in database and inserted rows into them, but no tables or data were found.

Error importing USDA plants into MS Access 2007

On fresh installs on both Windows 7 and XP importing USDA plants fails immediately following the download and returns the following error:

Error: [Errno 2] No such file or directory: '_new.raw_data\PlantTaxonomy\downloadData'

Retriever experiences massive slowdown with postgreSQL

On both 32-bit and 64-bit Windows, the Retriever slows down massively when it is configured to use a PostgreSQL connection. The slowdown appears only after the database connection settings have been configured: it then takes approximately 5 minutes for the Retriever to bring up its GUI. You can still download data, but the program is very unresponsive. Ample RAM and CPU resources are available. In contrast, if the connection is set up for a MySQL database, the GUI opens right away and you can download datasets.

Wrong architecture error for MySQL on Mac

Via email:

"I've installed MySQL on my Mac, and Retriever. Here is my error. Interesting that is says the matched file is the 'wrong architecture'."

Complete with screen shot.

This is presumably a 32-bit vs. 64-bit issue, so I'm not sure if we can work around this with our existing Mac or if we'll need one with whatever the other architecture is (I'm assuming ours is 32-bit, but can't look until tomorrow). Here's a discussion on Stack Overflow to get us started.

Error importing BBS into MS Access 2007

Retriever fails while, or immediately after, creating the weather table and returns the error:

Error: ('07002', '[07002] [Microsoft][ODBC Microsoft Access Driver] Too few parameters. Expected 1. (-3010) (SQLExecDirectW)')

Occurs on both Windows 7 and XP.

BBS download error

Error:[Errno ftp error] [Errno ftp error] 550 failed to change directory. Windows exe - PostgreSQL 8.4

Feature request: Allow Retriever to work without the CREATE permission if the database already exists

Via email I've received a request to support a situation where the database already exists, but the user does not have the permission to CREATE new databases. This is definitely an edge use case, but one that clearly can come up with shared resources.

A quick look at engine.py suggests that we can probably accomplish this fairly easily by adding a try-catch in create_db. The trick will be failing gracefully in the case where the user doesn't have permission to create and the database doesn't already exist.
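
A minimal sketch of that try-catch, written against two hypothetical callables rather than the real `create_db`: `create(name)` is assumed to issue CREATE DATABASE and raise on failure, and `exists(name)` is assumed to report whether the database is already there.

```python
class DatabaseUnavailable(Exception):
    """Raised when the database can neither be created nor found."""


def ensure_database(create, exists, name):
    """Try to create `name`; if that fails, accept an existing database.

    Returns "created" or "existing". Raises DatabaseUnavailable when the
    user lacks CREATE permission and the database does not exist.
    """
    try:
        create(name)
        return "created"
    except Exception as error:
        if exists(name):
            return "existing"
        raise DatabaseUnavailable(
            "Cannot create database %r and it does not already exist: %s"
            % (name, error)
        )


def denied(name):
    # Stand-in for an engine call that fails for lack of privileges.
    raise PermissionError("permission denied to create database")


outcome = ensure_database(denied, lambda name: True, "bbs")
```

The third branch is the graceful failure: the user gets one clear message instead of a raw driver error.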

Postgres: "value too long for type character varying(n)"

The output of tests.py:

('p', 'EA_clark2006', DataError('value too long for type character varying(10)\n',))
('p', 'USDA_plants', DataError('value too long for type character varying(7)\n',))
('p', 'gentry', DataError('value too long for type character varying(20)\n',))
('p', 'EA_portal_mammals', DataError('value too long for type character varying(3)\n',))
('p', 'EA_mom2003', DataError('value too long for type character varying(7)\n',))

Sometimes in pg, comma-delimited strings surrounded by quotes are being read in as one long value which is too long for the column. This is a strange problem and I'm trying to figure out why it would only happen with pg.

MS Access failed tests

The following tests failed on MS Access:

EA_mom2003: IndexError, list index out of range
gentry: IndexError, list index out of range
USDA_plants: IndexError, list index out of range

If port is misspecified Retriever fails to fail gracefully

My PostgreSQL database is configured on a non-standard port. If I do not configure my port correctly in the retriever database connection file then retriever generates an error log (see below) and experiences a major performance slowdown.

<retriever.exe.log>
Traceback (most recent call last):
File "retriever\app\controls.pyo", line 54, in OnGetItem
File "retriever\app\controls.pyo", line 91, in HtmlScriptSummary
File "retriever\lib\templates.pyo", line 57, in exists
File "retriever\lib\engine.pyo", line 452, in exists
File "retriever\engines\postgres.pyo", line 101, in table_exists
File "retriever\lib\engine.pyo", line 555, in get_cursor
File "retriever\engines\postgres.pyo", line 127, in get_connection
File "psycopg2\__init__.pyo", line 179, in connect
psycopg2.OperationalError: could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 5432?
could not connect to server: Connection refused (0x0000274D/10061)
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?

The retriever continues to report the above error message as it continually attempts to find the PostgreSQL connection.
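
The traceback starts in `OnGetItem`, so the connection attempt appears to be repeated every time a list item is drawn. One way to fail gracefully, sketched under that assumption, is to remember the first failure and stop retrying until the settings change. `CachedConnector` and its `connect` argument are hypothetical; `connect` stands in for whatever wraps `psycopg2.connect`.

```python
class CachedConnector:
    """Wrap a connect function so a failed attempt is remembered, not repeated."""

    def __init__(self, connect):
        self._connect = connect
        self._connection = None
        self._error = None

    def get(self):
        """Return the connection, or re-raise the cached error without retrying."""
        if self._error is not None:
            raise self._error
        if self._connection is None:
            try:
                self._connection = self._connect()
            except Exception as error:
                self._error = error
                raise
        return self._connection

    def reset(self):
        """Forget the cached result, e.g. after the user corrects the port."""
        self._connection = None
        self._error = None


attempts = []


def refused():
    # Stand-in for a connection attempt against the wrong port.
    attempts.append(1)
    raise ConnectionError("could not connect to server: Connection refused")


connector = CachedConnector(refused)
for _ in range(3):
    try:
        connector.get()
    except ConnectionError:
        pass
```

After the loop only one real connection attempt has been made, so the GUI would pay the connection timeout once instead of on every redraw.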

py2app exception when trying to build

I'm getting an exception from py2app:

WARNING: Mach-O header may be too large to relocate
WARNING: Mach-O header may be too large to relocate
WARNING: Mach-O header may be too large to relocate
Traceback (most recent call last):
File "setup.py", line 108, in <module>
'no_chdir': True,
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 480, in run
self._run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 643, in _run
self.run_normal()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 731, in run_normal
self.create_binaries(py_files, pkgdirs, extensions, loader_files)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app-0.6.5-py2.7.egg/py2app/build_app.py", line 876, in create_binaries
platfiles = mm.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachOStandalone.py", line 133, in run
node.write(f)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 117, in write
header.write(f)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 312, in write
self.synchronize_size()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/macholib/MachO.py", line 302, in synchronize_size
raise ValueError("New Mach-O header is too large to relocate")
ValueError: New Mach-O header is too large to relocate

Access import errors for 4 datasets

Importing into MS Access (2007 and 2010) currently fails on 4 datasets: Avian Body Size, USDA plants, Marine Predator and Prey Body Sizes, and Vegetation plots - del Moral, 2010.

Error: ('42000',"[42000][Microsoft][ODBC Microsoft Access Driver] Syntax error (missing operator in query expression...(-3100)(SQLExecDirectW)")

where the ... is a line of data that starts with:
Schlegel's Francolin', -- null for Avian Body Size
''"ABAM@" -- for USDA plants
29\xba40'N', -- for Marine Predator and Prey Body Sizes
Mertens' sedge', null, -- for Vegetation plots - del Moral, 2010
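
Several of the failing fragments contain an apostrophe inside the value (Schlegel's, 40'N, Mertens'), which would end the string literal early. A minimal sketch of the standard SQL fix, doubling each single quote, is shown below. `sql_literal` is a hypothetical helper; passing values as query parameters would avoid the problem as well.

```python
def sql_literal(value):
    """Render a value as a SQL string literal, doubling embedded single quotes."""
    if value is None:
        return "null"
    return "'" + str(value).replace("'", "''") + "'"


francolin = sql_literal("Schlegel's Francolin")
sedge = sql_literal("Mertens' sedge")
```

This covers the apostrophe cases only; whether it also resolves the USDA plants row, which involves double quotes, is not clear from the error text.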

Texas BBS import error (Ubuntu 10.04; PostgreSQL 8.4)

Texas BBS data are not imported into postgres database. Manual insert of "CTexas.csv" shows error:

ERROR: invalid input syntax for integer: ""
CONTEXT: COPY test, line 17137, column count20: "
"

Line 17137 of CTexas.csv has asterisks (*) in column count20 and SpeciesTotal
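
A minimal sketch of one way to handle such rows, assuming that an asterisk or a blank in a count column should be stored as null. `clean_count` and the sample row are hypothetical.

```python
def clean_count(value):
    """Return the integer in `value`, or None for blanks and placeholders such as '*'."""
    try:
        return int(value)
    except ValueError:
        return None


# Illustrative row: a valid count, an asterisk, another count, and a blank.
row = ["3", "*", "12", ""]
cleaned = [clean_count(v) for v in row]
```

Cleaning the values this way means the bulk COPY path cannot be used on the raw file directly; the rows would need to be rewritten or inserted individually.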

MCDB 'ascii' error when importing Trapping table

On Ubuntu 12.04 the following error occurs for MySQL, Postgres, and SQLite:

Error: 'ascii' codec can't decode byte 0xb2 in position 32: ordinal not in range(128)

When I try to open the raw file in gedit it reports:

There was a problem opening the file /path/to/MCDB_trappings.csv
The file you opened has some invalid characters. If you continue editing this file you could corrupt this document.
You can also choose another character encoding and try again.
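
The byte 0xb2 is not valid ASCII, and on its own it is not valid UTF-8 either; in Latin-1 it is the superscript two character. The file's real encoding is not stated here, so the sketch below simply tries a list of candidate encodings in order. `decode_line` and the sample bytes are hypothetical.

```python
def decode_line(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes with the first encoding that accepts them."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Sample bytes made up for illustration; only the 0xb2 byte comes from the error.
text = decode_line(b"plot area 5 m\xb2")
```

Latin-1 accepts every byte value, so it works as a last resort but can silently produce the wrong characters if the file actually uses another 8-bit encoding.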

BBS import fails in PostgreSQL

Looks like a change in the species table:
('p', 'bbs', DataError('value too long for type character varying(30)\n',))
INSERT INTO BBS.species (AOU, genus, species, subspecies, id_to_species) VALUES (1, 'Admin', 'Code', 'Admin Code Admin Code Admin Code Admin Code', TRUE);
ERROR.

psycopg2 error

I just downloaded the new release and tried to install into my PostgreSQL database. I got the error:

There was a problem with my database connection. No module named psycopg2. Is this a personal problem?
