ajdamico / asdfree



analyze survey data for free

Home Page: http://asdfree.com/

License: GNU General Public License v3.0

R 88.27% TeX 4.79% Shell 1.05% CSS 5.74% HTML 0.16%


asdfree's Issues

flip the consumer expenditure survey over to monetdblite & update eanthony

flip survey of business owners over to monetdblite & update eanthony

Server not ready (monetdb.program.path is incorrect)

Thanks for your amazing work on the ACS!

I don't know if the script can address this, but when I run your script, I see this "Server not ready" message:

db <- dbConnect( MonetDB.R() , monet.url , wait = TRUE )
/home/USERNAME/R/ACS/MonetDB/acs.sh: 2: /home/USERNAME/R/ACS/MonetDB/acs.sh: /bin/mserver5: not found
Server not ready(Could not connect to localhost:50001), retrying (ESC or CTRL+C to abort)
...

On my Ubuntu 14.04 machine, following install directions here (https://www.monetdb.org/Documentation/UserGuide/Downloads/UbuntuDebian), my mserver5 binary ends up in /usr/bin. However, the script appears to search for mserver5 in /bin. Both folders are in my $PATH.

I altered the script so that `monetdb.program.path = "/usr"`.

This now connects to the Monet Server.

Thanks again for outstanding work!
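A minimal sketch of one way to avoid hard-coding the location, assuming (as the `/usr` fix above suggests) that `monetdb.program.path` should be the folder containing `bin/mserver5`. `find_program_path` is a hypothetical helper, not part of the asdfree scripts.

```r
# hypothetical helper: given the full path to an executable,
# return the directory that contains its bin/ folder
find_program_path <- function( binary.path ) {
  if ( !nzchar( binary.path ) ) stop( "mserver5 not found on the PATH" )
  dirname( dirname( binary.path ) )
}

# Sys.which() searches the PATH and returns "" when nothing is found
mserver5.path <- unname( Sys.which( "mserver5" ) )

if ( nzchar( mserver5.path ) ) {
  monetdb.program.path <- find_program_path( mserver5.path )
  message( "using monetdb.program.path = " , monetdb.program.path )
} else {
  message( "mserver5 is not on the PATH; set monetdb.program.path by hand" )
}
```

With the binary at `/usr/bin/mserver5`, the helper returns `/usr`, matching the manual change described in the issue.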

Download issue with ESS data

Excellent ESS code!

But I got the following error towards the end, while downloading ESS Round 1:

importing /download.html?file=ESS1cfNO&c=NO&y=2002 ...
Error in factor([email protected], levels = values[use.levels], labels = labels[use.levels]) : 
RMate stopped at line 0
  invalid 'labels'; length 2 should be 1 or 1
Calls: data.frame ... as.data.frame.double.item -> as.factor -> as.factor
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

Could it be due to some SPSS import setting?

P.S. I have not forgotten that I need to work on Insee data when I get some free time!

nvss mortality 2000-2002 files do not contain any contents

https://github.com/ajdamico/asdfree/blob/master/National%20Vital%20Statistics%20System/download%20all%20microdata.R

flip the nvss over to monetdblite & update eanthony

Provide combined db?

Is it possible to run all the scripts into the same database to create one table with data for all years?

think about converting censo demografico and IPUMSI over with `as.svrepdesign`?

looks like many linearization calculations are very slow, but once converted to jackknifed weights, things should move as fast as the ACS does.. need to speed up basic analysis commands like svymean in http://monetdb.cwi.nl/testweb/web/eanthony/fedora20-3/1452365149/censolite.log

@DjalmaPessoa from your perspective is there any methodological problem with taking the cluster/strata/fpc and coercing them all to replicate weights with survey:::as.svrepdesign?

http://r-survey.r-forge.r-project.org/survey/html/as.svrepdesign.html

you can read about what the function actually does by looking at the code in survey:::jknweights and survey:::jk1weights

cc @hannesmuehleisen

flip the censo over to monetdblite & update eanthony

i cannot reopen the issue...

hey!

sorry but without admin privileges i cannot reopen the issue! :-)

anyway, left another comment there.

cheers

Suggestion: README files

Hi Anthony,

I'm a huge fan. Apologies for not being able to answer your call to user contributions: I have limited experience with complex survey objects, and French official statistics are locked into paper-based institutions that make automation impossible for most sources.

Reading your recent SEER scripts, I was thinking that a lot of the information would benefit from appearing in a folder-specific README file, as GitHub recommends. It would shorten the scripts, show up online when users browse your repo, and perhaps make your invitation to contribute more visible. It might also be easier to update.

All the best,

François

World Values Survey: longitudinal dataset

Dear Anthony,

I cannot seem to download the longitudinal file for the [World Values Survey](https://github.com/ajdamico/usgsd/tree/master/World Values Survey) through your script. I checked the code line by line, and it seems that the HTTP headers that you get at AJDocumentation.jsp?CndWAVE=-1 no longer return the links to the data files (or the documentation).

Not sure if that's connected, but the WVS website recently updated its longitudinal data file.

Cheers,

François

Feature Request Economic Census

Hi:

I was wondering if there were any plans to port the US economic census into R? It is an important data source for some economists.

confirm all multiply-imputed, database-backed have a working update() method

flip hmda over to monetdblite & update eanthony

error downloading Current Population Survey

When running

library(downloader)
# setwd( "C:/My Directory/CPS/" )
cps.years.to.download <- 2014:1998
source_url( "https://raw.github.com/ajdamico/usgsd/master/Current%20Population%20Survey/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE )

I received the error

Invalid file, or file has unsupported features.
In addition: Warning message:
In parse.SAScii(sas_ri, beginline, lrecl) : NAs introduced by coercion

The file being parsed was http://www.census.gov/housing/povmeas/spmresearch/spmresearch2013.sas7bdat

wrong to assume missing values stay missing and not zero

read.SAScii.sqlite suffers from this issue-

problem <- c( "v01\tv02" , "1000\t1000" , "\t", "\t" , "1000\t1000" )

library(RSQLite)
db <- dbConnect( SQLite() )

tf <- tempfile()
writeLines( problem , tf )
dbWriteTable(db, 'this_table', tf, sep = "\t", header = TRUE)

# missing data stays missing
read.table( tf , sep = '\t' , header = TRUE )

# missing data is zero
dbGetQuery( db , "SELECT * FROM this_table" )

problem in read.SAScii.sqlite

When running the 2004 SIPP downloader script, I hit a bug at line 141 of the file.

I tracked it down to this line in read.SAScii.sqlite.

dbSendQuery receives a `statement` object that is

"ALTER TABLE w1 RENAME TO temp_backup"

after which it fails with

Error in sqliteExecStatement(conn, statement, ...) : RS-DBI driver: (error in statement: no such table: w1)

it seems there is no table w1? Notice that the read.SAScii.sqlite function has at that point been called successfully at line 122 of the file

this error occurs on both MacOS and unix in exactly identical fashion (posting both sessionInfo()s here).

Cheers
flo

sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] descr_1.0.2 RSQLite_0.11.4 DBI_0.2-6 SAScii_1.0 downloader_0.3 vimcom_0.9-8

loaded via a namespace (and not attached):
[1] digest_0.6.3 tools_3.0.0 xtable_1.7-1

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] descr_1.0.2 downloader_0.3 SAScii_1.0 RSQLite_0.11.4 DBI_0.2-7
[6] devtools_1.3

loaded via a namespace (and not attached):
[1] digest_0.6.3 evaluate_0.5.1 httr_0.2 memoise_0.1 parallel_3.0.1
[6] RCurl_1.95-4.1 stringr_0.6.2 tools_3.0.1 whisker_0.3-2 xtable_1.7-1

Use message instead of winDialog

So code works across platforms
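A small sketch of what that could look like. `notify_user` is a hypothetical name; the assumption is that the scripts only use `winDialog` to show a plain "ok" notice, which `message()` can do on every platform.

```r
# hypothetical helper: show a dialog on interactive Windows sessions,
# fall back to message() everywhere else
notify_user <- function( msg ) {
  use.dialog <-
    .Platform$OS.type == "windows" &&
    interactive() &&
    exists( "winDialog" , mode = "function" )

  if ( use.dialog ) {
    # winDialog only exists in the Windows build of R
    get( "winDialog" , mode = "function" )( "ok" , msg )
  } else {
    message( msg )
  }

  invisible( msg )
}

notify_user( "download complete" )
```

Dropping the Windows branch entirely and always calling `message()` would be simpler still, and it keeps the text in the console log.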

update the monetdb installation instructions and blog post

OS X Yosemite 10.10.1 Script fails

It could be something specific to my system but this script would not work for me. For:

install.packages( "sqlsurvey" , repos = c( "http://cran.r-project.org" , "http://R-Forge.R-project.org" ) , dep=TRUE )

I kept getting

Installing package into ‘/Users/phparker/Library/R/3.1/library’
(as ‘lib’ is unspecified)
Warning: unable to access index for repository http://R-Forge.R-project.org/bin/macosx/mavericks/contrib/3.1

   package ‘sqlsurvey’ is available as a source package but not as a binary

Warning message:
package ‘sqlsurvey’ is not available (for R version 3.1.2)

Instead I had to:

a) install.packages( c( 'SAScii' , 'descr' , 'survey' , 'MonetDB.R' , 'downloader' , 'R.utils' ) )
Then in terminal:
b) $ svn checkout svn://r-forge.r-project.org/svnroot/sqlsurvey/
c) $ R --vanilla CMD INSTALL --build sqlsurvey/pkg/sqlsurvey
d) $ R --vanilla CMD INSTALL --build sqlsurvey/pkg/RMonetDB

i believe this script and this excel file are out of date? what numbers should we reproduce instead?

https://github.com/ajdamico/asdfree/blob/master/Pesquisa%20Nacional%20por%20Amostra%20de%20Domicilios/replicate%20IBGE%20estimates%20-%202011.R

https://github.com/ajdamico/asdfree/blob/master/Pesquisa%20Nacional%20por%20Amostra%20de%20Domicilios/ESTIMATES%20from%20IBGE.XLS

Use downloader::download to source gists

This eliminates the (heavy) RCurl dependency

clean up the pesquisa nacional de saude

djalma, it looks like ibge has released a new 2013 microdata file [1] since you wrote the code in [2].

do you know if this new version requires additional information in your analysis examples? thanks

[1] http://www.ibge.gov.br/home/estatistica/populacao/pns/2013/default_microdados.shtm
[2] https://github.com/ajdamico/asdfree/tree/master/Pesquisa%20Nacional%20De%20Saude

Don't use `setwd()`

Generally it's a bad idea to use setwd() because it means your code is no longer portable
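One possible alternative, sketched with made-up names: keep a base folder in a variable and build every path from it with `file.path()`, so the working directory is never changed. `output.dir` and the file name are assumptions for illustration only.

```r
# hypothetical base folder; nothing here changes the working directory
output.dir <- file.path( tempdir() , "CPS" )
dir.create( output.dir , showWarnings = FALSE , recursive = TRUE )

# build every path from the base folder instead of calling setwd()
cps.file <- file.path( output.dir , "cps2014.rds" )
saveRDS( data.frame( year = 2014 ) , cps.file )

# read it back through the same full path
cps <- readRDS( cps.file )
```

A user who wants the files somewhere else only has to change `output.dir`.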

prepare ipumsi for eanthony integration

for the password--
what i would do is put the username and password into a list, save that list to an rds file, read it back in the script, and add that file to the gitignore list

[1:51:44 PM] Hannes Mühleisen: a <- list(username="bla", password="blubb")
[1:52:13 PM] Hannes Mühleisen: tf <- tempfile()
[1:52:17 PM] Hannes Mühleisen: saveRDS(a, tf)
[1:52:36 PM] Hannes Mühleisen: a <- readRDS(tf)
[1:52:50 PM] Hannes Mühleisen: its nicer because it only loads a single variable
[1:52:53 PM] Anthony Damico: and just make sure never to print the contents of `a` to the logs

Public Library Survey data has moved

The link in the script gives "Access denied", and there's an interactive thingy at https://data.imls.gov/ and an API that apparently requires an access token

flip the area resource file over to monetdblite & update eanthony

flip the bsapufs over to monetdblite & update eanthony

censolite subsetting mistake

2016-01-10 10:56:33 > dbGetQuery(db, "SELECT SUM( pes_wgt * v6033 ) / SUM( pes_wgt ) AS mean_age FROM c10 WHERE v6033 < 900")
2016-01-10 10:56:33 QQ: 'SELECT SUM( pes_wgt * v6033 ) / SUM( pes_wgt ) AS mean_age FROM c10 WHERE v6033 < 900'
2016-01-10 10:56:35 II: Finished in 2s
2016-01-10 10:56:35 mean_age
2016-01-10 10:56:35 1 32.03549
2016-01-10 10:56:35
2016-01-10 10:56:35 > svymean(~v6033, pes.d, na.rm = TRUE)
2016-01-10 10:56:35 QQ: 'SELECT name, value from sys.env()'
2016-01-10 10:56:35 II: Finished in 0s
2016-01-10 10:56:35 QQ: 'select v6033 from c10'
2016-01-10 10:56:36 II: Finished in 0.99s
2016-01-10 21:27:28 mean SE
2016-01-10 21:27:28 v6033 44.532 0.0265
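A toy illustration of why the two numbers can differ, assuming the gap comes from the `WHERE v6033 < 900` filter that the SQL query applies and the `svymean` call does not. The data are made up, and treating 999 as an "unknown age" code is an assumption.

```r
# made-up records: three real ages plus one special code of 999
toy <- data.frame(
  v6033 = c( 10 , 30 , 50 , 999 ) ,
  pes_wgt = c( 1 , 1 , 1 , 1 )
)

# mirrors the SQL query: rows with v6033 >= 900 are excluded
filtered <- subset( toy , v6033 < 900 )
mean.filtered <- weighted.mean( filtered$v6033 , filtered$pes_wgt )

# mirrors the unsubsetted call: every row contributes
mean.unfiltered <- weighted.mean( toy$v6033 , toy$pes_wgt )

print( c( filtered = mean.filtered , unfiltered = mean.unfiltered ) )
```

If `pes.d` is a design object from the survey package, applying `subset( pes.d , v6033 < 900 )` before `svymean` would be the analogous step.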

ipumsi authentication needs to be fixed or removed

https://github.com/ajdamico/asdfree/blob/master/IPUMS%20International/ipumsi%20functions.R#L6-L28

!MALException:setScenario:Scenario not initialized 'sql'

Is it normal to see this error a lot while the script is running?

I'll put it in a bit of context. This is the latest release of MonetDB, R, RStudio, and Windows:

> # loop through each possible acs year
> for ( year in 2050:2005 ){
+ 
+   # loop through each possible acs dataset size category
+   for ( size in c(  .... [TRUNCATED] 
Downloading from URL 'http://www2.census.gov/acs2013_1yr/pums/unix_hwy.zip' to file 'C:\Users\Jason\AppData\Local\Temp\RtmpaYkKCC\file39a43b7b650f'... 
trying URL 'http://www2.census.gov/acs2013_1yr/pums/unix_hwy.zip'
Content type 'application/zip' length 630801 bytes (616 KB)
downloaded 616 KB

MonetDB: Switching to single-threaded query execution.
Downloading from URL 'http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip' to file 'C:\Users\Jason\AppData\Local\Temp\RtmpaYkKCC\file39a43b7b650f'... 
trying URL 'http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip'
Content type 'application/zip' length 259462625 bytes (247.4 MB)
downloaded 247.4 MB

[1] "warning: column name 'type' unacceptable in monetdb.  changing to 'type_'"
SUCCESS: The process with PID 12204 has been terminated.
Downloading from URL 'http://www2.census.gov/acs2013_1yr/pums/unix_pwy.zip' to file 'C:\Users\hackr\AppData\Local\Temp\RtmpaYkKCC\file39a43b7b650f'... SUCCESS: The process with PID 12204 has been terminated.
Downloading from URL 'http://www2.census.gov/acs2013_1yr/pums/unix_pwy.zip' to file 'C:\Users\hackr\AppData\Local\Temp\RtmpaYkKCC\file39a43b7b650f'... 
trying URL 'http://www2.census.gov/acs2013_1yr/pums/unix_pwy.zip'
Content type 'application/zip' length 1240992 bytes (1.2 MB)
downloaded 1.2 MB

Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
MonetDB: Switching to single-threaded query execution.
Downloading from URL 'http://www2.census.gov/acs2013_1yr/pums/csv_pus.zip' to file 'C:\Users\Jason\AppData\Local\Temp\RtmpaYkKCC\file39a43b7b650f'... 
trying URL 'http://www2.census.gov/acs2013_1yr/pums/csv_pus.zip'
Content type 'application/zip' length 616326250 bytes (587.8 MB)
downloaded 587.8 MB

SUCCESS: The process with PID 12292 has been terminated.
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
Server not ready(Authentication error: !MALException:setScenario:Scenario not initialized 'sql'
), retrying (ESC or CTRL+C to abort)
MonetDB: Switching to single-threaded query execution.
Error in .local(conn, statement, ...) : 
  Unable to execute statement 'create table acs2013_1yr_m as select 'M' as rt, a.serialno, a.division, a.puma, a.region, a.st, a.ad...'.
Server says '!GDK reported error.'.

By the way, why the single-threaded option? Won't that slow it down a good bit? I suppose something bad happens without it?

flip the survey of income and program participation over to monetdblite & update eanthony

nsduh download script breaks if there's a space in the folder name
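A sketch of one likely fix, assuming the break happens where the script pastes a file path into a shell command. The folder name and the `unzip` command are hypothetical; the command string is built but not run.

```r
# hypothetical path containing a space
zip.path <- "C:/My Directory/NSDUH/data.zip"

# unquoted, a shell would split this path at the space;
# shQuote() wraps it so it is passed as a single argument.
# type = "sh" is for unix shells; type = "cmd" is the windows equivalent
quoted.path <- shQuote( zip.path , type = "sh" )

# example command string (built but not executed here)
cmd <- paste( "unzip" , quoted.path )
print( cmd )
```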

A few notes and suggestions

Hi Anthony,

I've written this Gist that contains pointers to bits and pieces of survey data online, and how to read them into R. There's also a short bibliography of example studies and survey-specific packages.

I have not documented the weighting schemes, but the European Social Survey has a very simple structure, and the ANES data extracts should be straightforward too. The GSS is best weighted as you show in your own, much more elaborate script.

Some quick notes after replicating the NHIS scripts: would it be a good idea to use file.exists and avoid re-downloading any existing file in the download-all-microdata routines? If you run the script repeatedly (a plausible scenario, given the amount of data), it would help to skip previously done jobs, e.g. documentation files.

Adding a makefile could also help run all tasks in the background with the smallest possible amount of CPU. That would also hint to the reader that it is not a good idea to run some of the scripts in RStudio (which renders SAScii progress statements in a slightly strange way).

I can make a quick fork to illustrate both suggestions.
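The file.exists suggestion above could be sketched roughly like this. `download_if_missing` and its `fetch` argument are hypothetical names, not functions from the asdfree scripts; `fetch` defaults to base R's `download.file` and is only there so the helper can be tried without a network connection.

```r
# hypothetical helper: only download when the destination file is absent
download_if_missing <- function( url , destfile , fetch = utils::download.file ) {
  if ( file.exists( destfile ) ) {
    message( "skipping " , basename( destfile ) , " (already downloaded)" )
    return( invisible( FALSE ) )
  }
  fetch( url , destfile , mode = "wb" )
  invisible( TRUE )
}

# usage with a stand-in fetch function that just writes a small file
fake.fetch <- function( url , destfile , ... ) writeLines( "documentation" , destfile )
tf <- tempfile( fileext = ".txt" )

first.run <- download_if_missing( "http://example.com/doc.txt" , tf , fetch = fake.fetch )
second.run <- suppressMessages(
  download_if_missing( "http://example.com/doc.txt" , tf , fetch = fake.fetch )
)
```

Note that `file.exists` is also true for a partial or zero-byte file left behind by an interrupted run, so a size check may be worth adding.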

if HPSA was orig. designated 'withdrawn' before the time point, AND never updated, throw it out

I have one suggestion that should help pick out only those Designated areas. Currently, the code appears to keep those areas that were withdrawn on an early date, but then never updated.

In my tests, just below this snippet:

# if the hpsa was updated to 'withdrawn' before the user-defined time point, throw it out

x <- x[ !( no.na( x$ud < designated.time.point ) & x$status.description == 'Withdrawn' ) , ]

I added this snippet:

# RTM: if HPSA was orig. designated 'withdrawn' before the time point, AND never updated, throw it out:

x <- x[ !( (x$dd < designated.time.point) & (is.na(x$ud)) & (x$status.description == 'Withdrawn') ) , ]

and this appears to exclude those areas that were designated Withdrawn before the 'designated.time.point' and never updated. (For me, this eliminates 412 additional areas today).
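A toy check of the added rule on made-up rows, assuming `dd` is the designation date and `ud` the update date as in the snippets above. It uses `which()` so that rows where the comparison is `NA` are kept instead of turning into all-`NA` rows.

```r
designated.time.point <- as.Date( "2012-01-01" )

# made-up rows: dd = designation date, ud = update date
x <- data.frame(
  id = 1:4 ,
  dd = as.Date( c( "2010-05-01" , "2010-05-01" , "2013-03-01" , "2009-01-01" ) ) ,
  ud = as.Date( c( NA , NA , NA , "2011-06-01" ) ) ,
  status.description = c( "Withdrawn" , "Designated" , "Withdrawn" , "Withdrawn" ) ,
  stringsAsFactors = FALSE
)

# withdrawn before the time point and never updated
never.updated <-
  ( x$dd < designated.time.point ) &
  is.na( x$ud ) &
  ( x$status.description == "Withdrawn" )

# which() ignores NA results, so only definite matches are dropped
drop.rows <- which( never.updated )
x.kept <- if ( length( drop.rows ) ) x[ -drop.rows , ] else x
```

Only the first row is removed here; the fourth row has an update date, so it is left for the earlier `ud`-based rule to handle.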

your.email

Hello,

Small bug in the (awesome) ANES code: your.username at lines 57 and 61 should be your.email to match the later call to that object. Keep on rocking.

flip the current population survey over to monetdblite & update eanthony

flip nhts over to monetdblite & update eanthony

World Values Survey: using countries as strata

Quick question related to #25:

Some time ago on Statalist, Stas Kolenikov recommended using countries as super-strata when weighting the WVS longitudinal dataset for multi-country analysis.

As far as I understand, this translates into this survey design:

wvs.design = svydesign(~ 1, data = wvs.multi.country.dataset, strata = ~ S009, weights = ~ S019)

where S009 designates the country variable and S019 designates the 'N = 1500' homogenized survey weights that give every country in the WVS wave the same sample weight, regardless of their actual number of observations.

Is that something you would recommend?

flip pisa over to monetdblite & update eanthony

flip the american housing survey over to monetdblite & update eanthony

think of better ways to catch and restart broken downloads

currently working on 2002
Downloading from URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip' to file 'C:\Users\AnthonyD\AppData\Local\Temp\3\RtmpKsOkkL\filefd0772f6b8a'...
trying URL 'ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/reponderacao_2001_2012/PNAD_reponderado_2002.zip'
ftp data connection made, file length unknown
downloaded 0 bytes


Error in file(con, "r") : invalid 'description' argument
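One possible approach, as a rough sketch: retry the download a few times and treat a zero-byte file (as in the log above) as a failure. `download_with_retry`, `attempts`, `wait` and `fetch` are hypothetical names; `fetch` defaults to base R's `download.file`.

```r
# hypothetical helper: retry a download until a non-empty file arrives
download_with_retry <- function( url , destfile , attempts = 3 , wait = 1 ,
                                 fetch = utils::download.file ) {
  for ( i in seq_len( attempts ) ) {

    # an error inside fetch() counts as a failed attempt
    ok <- tryCatch( {
      fetch( url , destfile , mode = "wb" )
      TRUE
    } , error = function( e ) FALSE )

    # a zero-byte file also counts as a failed attempt
    if ( ok && file.exists( destfile ) && file.size( destfile ) > 0 ) {
      return( invisible( i ) )
    }

    message( "attempt " , i , " failed" )
    if ( i < attempts ) Sys.sleep( wait )
  }
  stop( "download failed after " , attempts , " attempts: " , url )
}
```

The function returns the number of the attempt that succeeded, which could be logged to see which servers tend to drop connections.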

"no longer logged in" error

Hi!

When trying to run the save.psid function here:

https://github.com/ajdamico/asdfree/blob/master/Panel%20Study%20of%20Income%20Dynamics/download%20all%20microdata.R#L247

I get this error about not being logged in anymore.

Error in save.psid(family[i, "file"], paste0("fam", family[i, "year"]),  : 
  no longer logged in

cheers
florian

flip the pums over to monetdblite & update eanthony

pns `dom` table does not get used in the import script

djalma, i made some changes to the main script to get started--

e7d9b6e

--but the dom table never gets used, and i believe it should be merged onto the pes table before the survey design gets created? that way, it is a rectangular file (like the censo demografico gets dom and pes merged). which merge fields should be used to bring the household-level information onto pes?

using the current version of the import file, the intersecting fields are--

> intersect(names(dom),names(pes))
 [1] "v0001"     "v0024"     "upa_pns"   "v0006_pns" "upa"       "v0028"     "v0029"     "v00281"    "v00291"   "v00282"    "v00292"    "v00283"    "v00293"

thanks!

Warning messages:
1: In readLines("http://www.bls.gov/cex/pumdhome.htm") :
  incomplete final line found on 'http://www.bls.gov/cex/pumdhome.htm'
2: In grep("/pumd_([0-9][0-9][0-9][0-9]).htm", readLines("http://www.bls.gov/cex/pumdhome.htm"),  :
  input string 785 is invalid in this locale

I was able to get the two warnings to go away by making the following changes:

add warn=FALSE to the readLines call, so it doesn't complain about the html not ending with a new line
set the locale used for interpreting the string format, so that it doesn't complain about invalid input strings
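A sketch of both changes on a local stand-in file rather than the live bls.gov page. The issue does not say which locale was set; using the "C" locale here is an assumption.

```r
# stand-in for the downloaded page: a local file with no trailing newline
tf <- tempfile()
cat( "<a href='/cex/pumd_2013.htm'>2013</a>" , file = tf )

# warn = FALSE silences "incomplete final line found"
page <- readLines( tf , warn = FALSE )

# switch to the "C" locale (an assumption) so strings are treated as plain bytes,
# then restore the previous setting afterwards
old.locale <- Sys.getlocale( "LC_CTYPE" )
Sys.setlocale( "LC_CTYPE" , "C" )
hits <- grep( "/pumd_([0-9][0-9][0-9][0-9]).htm" , page , value = TRUE )
Sys.setlocale( "LC_CTYPE" , old.locale )
```

Passing `useBytes = TRUE` to `grep` is another base R option that avoids changing the session locale at all.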