uc-macss / persp-analysis

Course materials for MACS 30000 (Perspectives on Computational Analysis)


persp-analysis's Introduction

MACS 30000 - Perspectives on Computational Analysis

	Dr. Benjamin Soltoff	Ryan C. Hughes (TA)	Joshua G. Mausolf (TA)
Email	[email protected]	[email protected]	[email protected]
Office	249 Saieh Hall	251 Saieh Hall	251 Saieh Hall
Office Hours	Th 1-3pm	M 8:00-10:00am	F 9:30-11:30am
GitHub	bensoltoff	rchughes	jmausolf

Meeting day/time: MW 11:30-1:20pm, 247 Saieh Hall for Economics
Lab session: W 4:30-5:20pm, 247 Saieh Hall for Economics
Office hours also available by appointment

Course description

Massive digital traces of human behavior and ubiquitous computation have both extended and altered classical social science inquiry. This course surveys successful social science applications of computational approaches to the representation of complex data, information visualization, and model construction and estimation. We will reexamine the scientific method in the social sciences in the context of both theory development and testing, exploring how computation and digital data enable new answers to classic investigations, the posing of novel questions, and new ethical challenges and opportunities. Students will review fundamental research designs such as observational studies and experiments, statistical summaries, visualization of data, and how computational opportunities can enhance them. The focus of the course is on exploring the wide range of contemporary approaches to computational social science, with practical programming assignments to train with these approaches.

Required textbooks

All textbooks are available in electronic editions either directly from the author or via the UChicago library (authentication required). Hardcopies can be purchased at your preferred retailer.

Evaluation

Assignment	Quantity	Points	Total Points
Short assignments	8	10	80
Final exam	1	20	20

Short assignments will vary depending on subject matter. They could include writing assignments analyzing computational research designs and/or problem sets implementing specific computational methods.
Final exam will be a timed take-home exam. Details to be furnished near the end of term.

Disability services

If you need any special accommodations, please provide me (Dr. Soltoff) with a copy of your Accommodation Determination Letter (provided to you by the Student Disability Services office) as soon as possible so that you may discuss with me how your accommodations may be implemented in this course.

Course schedule (lite)

#	Date	Topic	Assignment due
1.	Mon, Sep. 25	Introduction to Computational Social Science
2.	Wed, Sep. 27	Science in a computational era
3.	Mon, Oct. 2	Observational data - counting things
4.	Wed, Oct. 4	Observational data - measurement
5.	Mon, Oct. 9	Observational data - forecasting
6.	Wed, Oct. 11	Observational data - approximating experiments
7.	Mon, Oct. 16	Asking questions - fundamentals	Proposing an observational study
8.	Wed, Oct. 18	Asking questions - digital enrichment
9.	Mon, Oct. 23	Experiments	Proposing a survey study
10.	Wed, Oct. 25	Experiments
11.	Mon, Oct. 30	Simulated data	Proposing an experiment
12.	Wed, Nov. 1	Simulated data
13.	Mon, Nov. 6	Collaboration	Simulating your income
14.	Wed, Nov. 8	Collaboration
15.	Mon, Nov. 13	Ethics	Collaboration
16.	Wed, Nov. 15	Ethics
17.	Mon, Nov. 20	Exploratory data analysis - univariate visualizations	The ethics of the Montana election experiment
18.	Wed, Nov. 22	Exploratory data analysis - multivariate visualizations
19.	Mon, Nov. 27	Exploratory data analysis - clustering	Exploring the General Social Survey
20.	Wed, Nov. 29	Exploratory data analysis - dimension reduction
21.	Mon, Dec. 4		Unsupervised learning

The final exam will be distributed on Tuesday December 5 at 12pm and must be submitted by 11:59pm Wednesday December 6.

Course schedule (readings)

All readings are required unless otherwise noted. Adjustments can be made throughout the quarter; be sure to check this repository frequently to make sure you know all the assigned readings.

Introduction to computational social science
- Watts, D. J. (2007). A twenty-first century science. Nature, 445(7127), 489-489.
- Lazer et al. (2009). Computational Social Science. Science, 323, 721-723.
Social science in a computational era
- Bhattacherjee, A. (2012). Social science research: principles, methods, and practices. Chapters 1-4. Skim/review as needed.
- Shmueli, G. (2010). To explain or to predict?. Statistical science, 25(3), 289-310.
- Anderson, C. (2008). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired.
- Schrodt, P. A. (2014). Seven deadly sins of contemporary quantitative political analysis. Journal of Peace Research, 51(2), 287-300.
Observational data (counting things)
- "Chapter 2: Observing Behavior." Bit by Bit. Sections 2.1-2.4.1.3.
- King, G., Pan, J., & Roberts, M. E. (2013). How censorship in China allows government criticism but silences collective expression. American Political Science Review, 107(02), 326-343.
- Kossinets, G., & Watts, D. J. (2006). Empirical analysis of an evolving social network. Science, 311(5757), 88-90.
Observational data (measurement)
- Bonica, A. (2014). Mapping the ideological marketplace. American Journal of Political Science, 58(2), 367-386.
- Wojcik, S. P., Hovasapian, A., Graham, J., Motyl, M., & Ditto, P. H. (2015). Conservatives report, but liberals display, greater happiness. Science, 347(6227), 1243-1246.
- Emotional timeline of September 11, 2001
Observational data (forecasting)
- 2.4.2 Forecasting and nowcasting. Bit by Bit.
- Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M., & Watts, D. J. (2010). Predicting consumer behavior with Web search. PNAS, 107(41), 17486-17490.
- Schrodt, P. A., Yonamine, J., & Bagozzi, B. E. (2013). Data-based computational approaches to forecasting political violence. In Handbook of computational approaches to counterterrorism (pp. 129-162). Springer New York.
- Google Flu Trends
  - Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012-1014.
  - Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google flu: traps in big data analysis. Science, 343(6176), 1203-1205.
Observational data (approximating experiments)
- 2.4.3 Approximating experiments. Bit by Bit.
- Phan, T. Q., & Airoldi, E. M. (2015). A natural experiment of social network formation and dynamics. PNAS, 112(21), 6595-6600.
- Hersh, E. D. (2013). Long-term effect of September 11 on the political behavior of victims' families and neighbors. PNAS, 110(52), 20959-20963.
- Cohen, P., et al. (2016). Using Big Data to Estimate Consumer Surplus: The Case of Uber. Working paper.
Asking questions (fundamentals)
- "Chapter 3: Asking Questions." Bit by Bit. Sections 3.1-3.4.
- Schuldt, J. P., Konrath, S. H., & Schwarz, N. (2011). "Global warming" or "climate change"? Whether the planet is warming depends on question wording. Public Opinion Quarterly, 75(1): 115-124.
- Wang, W., Rothschild, D., Goel, S., & Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31(3), 980-991.
- The Upshot: We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results.
Asking questions (digitally-enriched)
- "Chapter 3: Asking Questions." Bit by Bit. Sections 3.5-3.7.
- Sugie, N. F. (2016). Utilizing Smartphones to Study Disadvantaged and Hard-to-Reach Groups. Sociological Methods & Research, 0049124115626176.
- Lax, J. R., & Phillips, J. H. (2009). How should we estimate public opinion in the states?. American Journal of Political Science, 53(1), 107-121.
- Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802-5805.
Experiments
- "Chapter 4: Running experiments." Bit by Bit.
- Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415), 295-298.
- Milkman, K. L., Akinola, M., & Chugh, D. (2015). What happens before? A field experiment exploring how pay and representation differentially shape bias on the pathway into organizations. Journal of Applied Psychology, 100(6), 1678.
Experiments (more)
- Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2012). Evaluating online labor markets for experimental research: Amazon. com's Mechanical Turk. Political Analysis, 20(3), 351-368.
- King, G., Pan, J., & Roberts, M. E. (2014). Reverse-engineering censorship in China: Randomized experimentation and participant observation. Science, 345(6199), 1251722.
- Munger, K. (2017). Tweetment effects on the tweeted: Experimentally reducing racist harassment. Political Behavior, 39(3), 629-649.
Simulated data
- "Indirect Inference," New Palgrave Dictionary of Economics
- Benoit, Kenneth, "Simulation Methodologies for Political Scientists," The Political Methodologist, 10:1, pp. 12-16.
- Recommended readings on simulation methods (not required for class)
  - Wolpin, Kenneth I., The Limits of Inference without Theory, MIT Press, 2013.
  - Davidson, Russell and James G. MacKinnon, "Section 9.6: The Method of Simulated Moments," Econometric Theory and Methods, Oxford University Press, 2004.
Simulated data (cont.)
Collaboration
- "Chapter 5: Collaborating". Bit by Bit.
- Pro Git
Collaboration (cont.)
Ethics
- "Chapter 6: Ethics." Bit by Bit.
- Zimmer, M. (2016). OkCupid Study Reveals the Perils of Big-Data Science. Wired.
Ethics (cont.)
- UChicago Social & Behavioral Sciences Institutional Review Board
  - Skim site
  - Specifically read "Does My Research Need IRB Review?"
- Facebook emotional contagion study
- Kosinski, M., & Wang, Y. (2017, September 24). Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Journal of Personality and Social Psychology. Retrieved from psyarxiv.com/hv28a
- Parry, M. (2011). Harvard Researchers Accused of Breaching Students' Privacy. Chronicle of Higher Education.
Exploratory data analysis
- Exploring Histograms
- Unwin, A. (2015). Graphical data analysis with R (Vol. 27). CRC Press. - lots of good material here on graphical methods for EDA and how to implement them using different packages in R (e.g. graphics, ggplot2, lattice)
- VanderPlas, Jake. (2016). Python Data Science Handbook. O'Reilly Media, Inc. - see chapter 4 for implementing visualization methods in Python with matplotlib and seaborn
Exploratory data analysis (cont.)
Exploratory data analysis - dimension reduction
- Chapter 10.1-10.2 in An Introduction to Statistical Learning
Exploratory data analysis - clustering
- Chapter 10.1-10.3 in An Introduction to Statistical Learning


persp-analysis's Issues

Petition to enroll in MACS 30000 (for non-Computation students only)

Here is the petition to enroll in MACS 30000. All students who are not part of the MA in Computational Social Science (MACSS) program need to complete this petition.

If you are a Computation student (member of MACSS) and have not yet registered for the course, contact Brett Baker to be added and tell him you have my consent to enroll.

Return Tibble

Office hours today from 1-3pm

As a reminder, I will have office hours today from 1-3pm. I will not have office hours this Thursday.

Questions re: Assignment 1

Should we be doing a literature review? If yes, to what extent?
Can you elaborate a bit on the following point? (Beyond what is in the instructions)
- Are we justifying its use over other methods?

"A justification for how your proposed research design takes advantage of specific methods for observational study versus alternative observational methods"

Does this have to be a research project we are capable of doing? I.e., Can we suppose certain enabling factors (funding, access to data, time).

Additional experiment reading

Just a heads up, I added another reading for class tomorrow on a Twitter bot experiment designed to reduce racially biased online harassment. See the readme for the link. Make sure to read it for class tomorrow.

Can we use jupyter notebook for Kaggle dataset plotting?

as titled

Please Pull My right Version of HW1

Dear Joshua,
I see that you still have not merged my pull request with the right version of HW1.
Best,
Philip, X, Cao

EDA readings

Different forms for visualizing distributions of a variable

Won't let me commit to student folder

For some reason when I try to commit something to my student folder (gamarnik_dan) it goes into a new "patch" instead (which I think is my forked directory of the class). It should be mapped to the master and not into the forked folder.

Updates to Problem Set 1

I have updated parts (a), (b), (c), and (d) in Problem Set 1.

The first update is to have you do 10,000 simulations rather than 1,000 simulations. This will make your plots smoother and your answers more uniform.
The second change is to make the histogram in part (b) have 50 bins instead of 30 bins.
The last change is to make your histograms for parts (c) and (d) only have as many bins as there are years in which people pay off their debts. For most people's simulations, this should only be 3 or 4 bins, but it is possible to have 5. It just depends on how many extreme outliers you get in your simulations.

Is it possible to borrow an MTurk account from others?

For example, can we ask a friend who's not in this course to register an account and use it to complete the question? Or if we don't have an SSN and therefore cannot get our account verified, the only option is the alternative question? Thanks!

Citation style

Is there a standard citation style we should use for sources in the assignments?

Issue with creating histogram

In exercises (1c) and (1d), you are asked to create histograms, respectively, for the frequency of the various years in which each simulated individual pays off his debt of $95,000. Some of you are having trouble getting your histograms to look right.

For example, suppose you are able to create a unidimensional vector named payoff_year of length 10,000 that contains the year in which each individual pays off his debt.

payoff_year = np.array([1986, 1987, 1986, 1988,... 1987])

You could use the np.unique() function to get a list of the unique year values and a count of how many of each value there is.

payoff_yr_list, payoff_yr_cnt = np.unique(payoff_year,  return_counts=True)

The vector payoff_yr_list will be a numpy array with each unique year in your big payoff_year array. The vector payoff_yr_cnt will have the same number of elements as payoff_yr_list and will contain the counts of how many times each corresponding year occurs.
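As a minimal sketch of the np.unique() step, here is the call run on a short, made-up payoff_year array (six values chosen only for illustration; your own array would have 10,000 entries):

```python
import numpy as np

# Hypothetical payoff years for six simulated individuals (illustrative only)
payoff_year = np.array([1986, 1987, 1986, 1988, 1987, 1987])

# Unique years (sorted) and how many times each one occurs
payoff_yr_list, payoff_yr_cnt = np.unique(payoff_year, return_counts=True)

for year, count in zip(payoff_yr_list, payoff_yr_cnt):
    print(year, count)
```

The counts always add up to the length of payoff_year, which is a quick sanity check on your simulation output.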

You should only have three or four different unique years in which simulated households pay off their debt. You can make a nice looking histogram by passing in a bins argument.

plt.hist(payoff_year, bins=np.arange(start_bin, stop_bin, binwidth), weights=hist_wgts)

Suppose in the example above that you only had three unique years that ever occurred (1986, 1987, and 1988). The bins object tells the plt.hist() function where each bin boundary should be. The start_bin element is the left edge of the leftmost bin. If we want each bin centered on the year of payoff, and we want each bin to be one year wide, then binwidth = 1.0 and start_bin = 1986 - 0.5. Because np.arange() excludes its stop value, stop_bin just needs to be greater than the right edge of the rightmost bin and no more than 1 unit above it. That right edge should be 1988 + 0.5, so stop_bin = 1989 would work great. Try this and see if you get a great looking histogram.
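The bin arithmetic above can be checked without drawing anything. The sketch below uses the same made-up six-element payoff_year array (an assumption for illustration) and np.histogram(), which accepts the same bins and weights arguments as plt.hist():

```python
import numpy as np

# Hypothetical payoff years (illustrative only)
payoff_year = np.array([1986, 1987, 1986, 1988, 1987, 1987])

binwidth = 1.0
start_bin = payoff_year.min() - 0.5   # left edge of the leftmost bin
stop_bin = payoff_year.max() + 1      # just past the right edge of the last bin
bin_edges = np.arange(start_bin, stop_bin, binwidth)

# Equal weights so that bar heights are shares of all observations
hist_wgts = np.ones(payoff_year.size) / payoff_year.size

shares, _ = np.histogram(payoff_year, bins=bin_edges, weights=hist_wgts)
print(bin_edges)
print(shares)
```

Passing bin_edges and hist_wgts to plt.hist(payoff_year, bins=bin_edges, weights=hist_wgts) should then give one bar per year, centered on that year.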

Summary of mid-course feedback

So a brief summary of the most frequent comments in the feedback you provided (for those interested, the response rate was approximately 50%, though I still don't know if my analysis is plagued by non-response bias):

What has contributed to your learning in this class

Lots of people like Bit by Bit

I too think it's a good summary of the major research designs and prominent research published in computational social science, though it does not serve as a replacement for reading the original research article. Several of you noted it was beneficial when I assigned an article also discussed in Bit by Bit as the textbook helped to summarize and clarify the major points from the article.

The connection to your own future research projects

A few of you said it was helpful that we are relating these projects to research you may conceivably do in the next couple of years here at UChicago. That is one of the major objectives for the course, and my hope is that you do not forget all about survey and experimental designs when it comes to your own research. It is easy to find a canned package of observational data and apply some machine learning algorithms to it, but remember the Schrodt article - we don't learn a ton more by re-analysis of the same dataset. Don't forget this when you think about your research paper in the spring and your thesis next year.

What you would like to change

Make it an elective

No.

Not enough application

Some of you asked when you'd be doing applied work, actually building and testing statistical models. You will be doing that next term (see the syllabus from last year for examples of what methods we will teach you). This term focuses on research design and the process of designing a research project prior to collecting data and analyzing results.

A lot of reading/not enough reading

Some asked for more articles to read with the different methods, while others asked for fewer articles. I am trying to strike a balance between exposing you to substantive applications of these broad methodological approaches and overloading you with reading. I already cut back the reading load from last year. That said, the value of this course to you is what you make it. If you skip all the readings, then you will get very little out of class discussions and have nothing to contribute. If you read the articles, then you'll be able to better follow the discussion and contribute in small groups.

For those who want to read more articles, most of the Bit by Bit chapters include a table of additional articles employing the methods discussed in that chapter (I know at least this exists for the experiments chapter). I strongly encourage you to look at some of those readings if you want to dive further into a specific methodology.

Too many political science articles

I am a political scientist, so I admit I am a bit biased by my perspective and prior exposure. That said, I went back and tabulated the frequency of articles by major discipline:

Discipline	Number of readings
Political science	10
Sociology	7
Statistics/other	6
Economics	3
Psychology	1-ish (the conservatives/liberals are happier article)

So we could use some more econ and psych articles, and perhaps fewer poli sci articles. I'll try to balance it out more in the second half of the term.

Class is too long/I'm hungry at noon

Why does Perspectives meet when it does? Simply put, we wanted a single time slot for the Perspectives course to meet in the fall, winter, and spring quarters. And there is a significant range of courses first-year Computation students may take, including required courses such as the CAPP programming sequence, linear algebra, statistics, etc. Computational psychology students also have a regimented sequence of courses that meet only once per term. Plus we draw a lot of certificate students from the MAPSS program which also includes some required courses. We tried to find a time that avoided conflicts with any of these requirements, which pretty much left MW at 11:30 as the only option.

We also found that an 80 minute class session was not sufficient on many days to cover the range of material we need to teach you at a sufficient depth, hence we extended it to a two-hour class. The alternative approach, which is common at UChicago, is to squeeze a semester of material (16 weeks) into 10 weeks by assigning tons of outside readings and assignments with no in-class instruction on the material. Which many students dislike, for good reason. As you've seen occasionally this term, if we finish the material early I have no problem ending class early. This is more likely to occur when students come prepared to discuss the articles and I can spend less time summarizing them. But I can only do that when the majority of students come to class prepared. A classic question of causality.

As for your hunger, look at it from my point of view. I teach Perspectives MW 11:30-1:20. I also teach a computing class MW 1:30-2:50. And I have a lab for that computing class W 3:00-4:20. If you're hungry during Perspectives, do what I do: eat lunch at 11. It's a bit early, so pack an afternoon snack (once you have a child of your own you'll discover the benefits of afternoon snacks).

Random seed of simulation assignment

Do we need to set the random seed in the assignment? If yes, do we need to set the same random seed as in the example, which is 524, or any random seed is OK? Thank you!

About rubric

Hi all!

I was wondering if the rubric for assignment 2 will be the same as for assignment 1? When will you post it on Canvas?

Thanks!
Best,
Fiona

Initial income?

Is there a particular value we should set our initial income to? Is $1 acceptable, as it is effectively zero?

To clarify, I'm referring to income at t = 2019-1.

Edit: please disregard, I found the answer.

Can I Make a Pie Chart For My Kaggle Plot?

I would like to use a pie chart for my Kaggle plot. I know there is some controversy about pie charts in the world of data visualization, and that pie charts do not have axes that I can label.

Can I still score full points on my data visualization by creating a pie chart? I think it would be the best visualization for data I have in mind.

Survey paper evaluations posted on Canvas

Sorry for the delay in returning your scores on the last assignment. Distribution of scores compared to the first assignment:

Median score is the same as the first assignment (8.5/10) with a little wider variance this time. If you have questions about your feedback, please feel free to reach out privately to me or one of the TAs.

Observational data assignment grades available

Check Canvas for your grades and comments on your observational data papers. Overall I am pleased with the results. We saw a substantial mix of topics and observational designs, demonstrating the wide range of computationally-enhanced approaches afforded by digital trace data. The distribution of grades was mixed:

This is to be expected for the first assignment of the year. Ask any second-year student - it was an initial shock, but you will learn from this assignment and your writing will improve (as will your grades). If you have questions about your assignment feedback, please talk with me or one of the preceptors.

How to save README.md as a PDF

We want to be able to save the README.md file for the repository as a PDF. In this instance, the README.md is the syllabus for our Perspectives on Analysis class. My solution that worked pretty well is to do the following steps.

1. Download the grip application. I used the Homebrew package manager to install grip.
2. Navigate in terminal to the folder where your README.md file is. Type grip README.md.
3. This will render the README.md file as HTML on a localhost URL that you can access via your browser. In your terminal, it should tell you where it has rendered the README.md file, for example: * Running on http://localhost:6419/.
4. Paste the localhost URL into your browser.
5. Use your browser's functionality to print the page.
6. Select PDF as your printer.

This method worked pretty well (see syllabus.pdf) except for one strange overlap at the bottom of the first page and the top of the second page. Let me know if you find anything better.

Alternative to Amazon MTurk activity

If you tried to register a worker account on Amazon MTurk and were rejected, I just posted an alternative assignment you can complete for the collaboration homework. See the assignment instructions for more information

Potential Legal Issues?

Hi, Dr. Soltoff,

While I am waiting to hear back from the Amazon review team, I looked at the tasks and saw that many of them provide some monetary reward for finishing the task. My intuition, along with the OIA office's guidelines, is telling me that it is illegal for international students with F-1 visa to accept those tasks. By receiving any monetary compensation for their work from off-campus unauthorized sources, international students would be violating the specific government regulations that come attached to our F-1 status.

It is completely sensible to choose only tasks with zero compensation (since this is a homework assignment anyway), that's not the problem for me. I simply want to make sure that my understanding of the situation is correct. If so, perhaps you could let other international students know about this caveat. If it's not a problem at all, I would like to know the evidence supporting such a statement, namely, receiving financial compensation from employers at MTurk is legal for students with F-1 status.

Thanks!

Edit:
Here are some links to MTurk that seem to suggest Chinese citizens (from mainland China), at least, are not legally allowed to register to become a turker.
MTurk is now available for Requesters in 10 more countries
FAQs

Pull Requests pending

Dear Joshua,

It seems my pull request is pending.

May I know what is wrong?

Best,
Xinyu

Histogram reading

Use this

Matplotlib plots and histograms

In your problem sets, and in general analysis, you are asked to plot results and data. I wanted to give you some code for doing this using Python's matplotlib plotting library. Here is an advanced piece of plotting code for plotting a line. The plot following the code is the figure produced by that code. I will explain its separate pieces below. Then I will give some discussion about histograms.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import os
...
graph = True
...
if graph:
    '''
    --------------------------------------------------------------------
    cur_path    = string, path name of current directory
    output_fldr = string, folder in current path to save files
    output_dir  = string, total path of images folder
    output_path = string, path of file name of figure to be saved
    year_vec    = (lf_years,) vector, years from beg_year to
                  beg_year + lf_years
    individual  = integer in [0, numdraws-1], index of particular series
                  to plot
    --------------------------------------------------------------------
    '''
    # Create directory if images directory does not already exist
    cur_path = os.path.split(os.path.abspath(__file__))[0]
    output_fldr = 'images'
    output_dir = os.path.join(cur_path, output_fldr)
    if not os.access(output_dir, os.F_OK):
        os.makedirs(output_dir)

    # Plot one lifetime income series from set of simulations
    x_vals = year_vec
    y_vals = inc_mat[:, 500]
    fig, ax = plt.subplots()
    plt.plot(x_vals, y_vals)
    # for the minor ticks, use no labels; default NullFormatter
    minorLocator = MultipleLocator(1)
    ax.xaxis.set_minor_locator(minorLocator)
    plt.grid(True, which='major', color='0.65', linestyle='-')
    plt.title('One simulated lifetime income path', fontsize=20)
    plt.xlabel(r'Year $t$')
    plt.ylabel(r'Annual income (\$s)')
    # plt.xlim((xmin, xmax))
    # plt.ylim((ymin, ymax))
    # plt.legend(loc='upper left')
    output_path = os.path.join(output_dir, 'Fig_1a')
    plt.savefig(output_path)
    # plt.show()
    plt.close()

The first lines of this code import the Python packages we need to run this code (numpy, matplotlib, and os). The first thing I do when I write a script that creates a plot is I create a Boolean (True or False, 0 or 1) that says whether or not that section of code will create the plot. This is nice because it separates the analysis from the plotting. Further, you can use code folding in your text editor to minimize the plotting commands under the indented if statement.

Immediately following the if graph: statement, you'll see some code defining paths and using the os package. This code is some nice housekeeping for images. The variable cur_path is a string of the path of the directory that contains this script. Those 5 lines of code get that directory, name an "images" folder to be placed in it, then check whether that folder already exists. If the folder does not already exist, the code creates it. This creates a nice, intuitive place for you to save your images that does not clutter up the directory where your script resides.
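The same housekeeping can be wrapped in a small helper. This is only a sketch: the function name ensure_output_dir is hypothetical (it does not appear in the problem set code), and it uses os.makedirs(..., exist_ok=True), which creates the folder only when it is missing and so replaces the explicit os.access() check.

```python
import os
import tempfile


def ensure_output_dir(base_dir, folder_name="images"):
    """Create base_dir/folder_name if needed and return its full path."""
    output_dir = os.path.join(base_dir, folder_name)
    # exist_ok=True makes this a no-op when the folder is already there
    os.makedirs(output_dir, exist_ok=True)
    return output_dir


# Demonstration in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    first = ensure_output_dir(tmp_dir)
    second = ensure_output_dir(tmp_dir)  # safe to call a second time
    print(os.path.isdir(first), first == second)
```

In a script you would pass the script's own directory, for example os.path.dirname(os.path.abspath(__file__)), as base_dir.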

The rest of the code is the plotting code. You could just write plt.plot(x_vals, y_vals), but you want your plot to be usable, labeled, and clean. The MultipleLocator class I imported places minor tick marks at every year along the x-axis, and plt.grid() draws gridlines at the major ticks. You also want to make sure that your plot has a title (telling you what it is) as well as clearly labeled axes. Not labeling your axes is one of the cardinal sins of rookie analysts. The philosophy is that a plot should be able to communicate its information independently.

The final four lines simply save the plot. Note the commented out plt.show() command. If you uncomment this, Python will produce the plot on your screen. However, a drawback to plt.show() is that it stops Python from running past that command until you close the plot window. Finally, you want to include the plt.close() command at the end of the plotting script, or else you might fill up your computer's memory with plots. matplotlib's pyplot interface keeps every figure you create in memory until it is explicitly closed. Many times, while working on a script, I have noticed my computer slow down or freeze for no apparent reason. Often, I have realized that the reason for the slowdown was that my script was creating plots that had not been closed.
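To see the effect of plt.close() directly, the sketch below creates a few throwaway figures and asks pyplot which ones are still open. It assumes the non-interactive Agg backend so that it runs without a display; the plotted values are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch
import matplotlib.pyplot as plt

# Create three figures without closing them
for slope in range(3):
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, slope])

open_before = plt.get_fignums()  # figure numbers pyplot is still holding

plt.close("all")                 # release every open figure

open_after = plt.get_fignums()
print(len(open_before), len(open_after))
```

Calling plt.close() with no argument closes only the current figure, which is why the script above places it right after each plt.savefig().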

Not only does Problem Set 1 ask you to make a line plot [part (a)], but it also asks you to make a histogram. Below is some code to make a histogram. Suppose that the data for which I want to create a histogram is stored in a numpy array of length N called data. The following code will create that histogram.

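import numpy as np
import matplotlib.pyplot as plt

# data: numpy array holding one observation per draw
num_draws = len(data)
fig, ax = plt.subplots()
# Weight each observation by 1/num_draws so the bar heights sum to 1
hist_wgts = (1 / num_draws) * np.ones(num_draws)
num_bins = 50
plt.hist(data, num_bins, weights=hist_wgts)
plt.title('Histogram of first year ($t$=2018) income', fontsize=20)
plt.xlabel(r'Annual income (\$s)')
plt.ylabel(r'Percent of students')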

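One thing to note is that I have to give the plt.hist() function some weights; otherwise the height of each bar is the raw count of observations in that bin. With each observation weighted by 1 / num_draws, the bar heights sum to 1, so each bar shows the share of the observations in that bin (multiply the weights by 100 if you want the axis in percent).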
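To see what the weights do without drawing anything, you can compute the bin heights directly with np.histogram, which is the routine plt.hist() relies on for its binning. The sketch below uses randomly generated placeholder data, not the problem set's income simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=1000)  # placeholder data
num_draws = data.size
num_bins = 50

# Unweighted: bar heights are raw counts and sum to the sample size
counts, _ = np.histogram(data, bins=num_bins)

# Weighted by 1/num_draws: bar heights are shares of the sample, sum to 1
hist_wgts = (1 / num_draws) * np.ones(num_draws)
shares, _ = np.histogram(data, bins=num_bins, weights=hist_wgts)

print(counts.sum())
print(round(shares.sum(), 6))
```

Each weighted height is just the corresponding count divided by num_draws, so the shape of the histogram is unchanged; only the scale of the vertical axis differs.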

Add markdown instructions

Write a brief tutorial for creating Markdown documents

How it works
Formatting guide
Recommended Markdown editors

Simulating your income evaluations

Overall grades improved tremendously on this most recent assignment. The median grade was a 9.75. Nice work everyone!

weather data

@rickecon

Hi professor,

When I downloaded the weather data "Daily Summaries" from the website for a given city and time period, it gave me data from multiple stations within the city rather than a single overall series for the city. In this case, should I just pick one station at random? Or should we calculate the average temperature across the different stations on the same day...?

Thanks and happy thanksgiving!

yinxian

Asking Questions, Questions

For the Asking Questions assignment:

To what extent would you like the project developed (i.e., general topics for questions--> draft survey?)
Are the rules the same as the previous assignment re: reasonable assumptions of funding, etc.?
Would the specific tools (e.g., ODK, Google Forms) be worth including?

Experimental paper evaluations on Canvas

Median grade increases to a 9/10. Distribution is slightly more skewed, mainly by proposals which don't actually include a digital experiment. If you have questions about your evaluation, please contact one of the TAs or myself in private. I had several students reach out to me last week and I think I was able to help clarify the original evaluations.

Clarification on simulations assignment submission format

I just want to clarify how you should submit your simulations assignment. The main objective is to submit it in a reproducible format. This can be any of the following:

R Markdown document (.Rmd knitted with output: md_document or output: github_document in the front matter)
Jupyter Notebook (.ipynb)
Python (.py) or R (.R) script which saves the graphs to the local directory AND a Markdown document .md which embeds the graphs and your written responses to questions

Generally for problem sets such as this I recommend a notebook format such as R Markdown or Jupyter Notebooks, as it embeds the code, output, and written analysis in a single document. This makes it easy to immediately read through and see how the code generates each of the graphs and statistics. That said, I know many students may not have used a notebook format before, instead only writing scripts in Python or R. Use whatever format seems most comfortable to you, as long as your final submission includes easy access to your code, graphs, and written answers from within GitHub (i.e. we should be able to run your code locally to view your responses, but this should not be required - all the pertinent information should be viewable directly in the repo).

Adding a Number Column to the Course Schedule?

Would it be possible to add the class number to the course schedule? Right now the course schedule has the date, and the reading schedule has the class number, but it is a little bit tricky to see how they relate to each other.

I think adding a class number column would make it much easier to see which readings correspond to each class.

Does MTurk Explicitly Say How Long You Worked For?

The qualifications page seems to show the number of studies I participated in, but I'm not sure whether it includes how long I worked.

Previous Kaggle competitions?

Can we use previous (non-active) Kaggle competitions for problem 2?

Thanks!

Mid-course evaluation form

Please complete this short two-question survey evaluating the first half of the course

about the journal article

How recent does it have to be? I am looking at a 2004 article, would that count as recent?

Thanks!

Estimated verification time for being an MTurk worker?

For people without any potential international complications, how long did it take for Amazon to verify them? Is this like a half-day verification thing or a long process?

Thanks!

What to submit for problem sets

As described in the syllabus for this class, you will submit 4 problem sets during the last half of this term. These problem sets will primarily involve writing code. I want you to submit your assignments in a particular way.

Your assignment submission will involve two parts: (a) your Python code, and (b) a PDF document that you compile using LaTeX that has your answers.
Your Python code should use Python 3.5 or higher. I recommend that you download the Anaconda distribution of Python from Continuum Analytics.
In your code, you should label your sections with the particular part of the Problem Set that that section of code is solving. Below is an example.

'''
------------------------------------------------------------------------
Exercise 1a: Simulate the data
------------------------------------------------------------------------
plot_1a     = Boolean, =True if make a plot of one series of the
              simulated income data
norm_errors = (lf_years, num_draws) matrix, normally distributed errors
              with mean 0 and standard deviation sigma
------------------------------------------------------------------------
'''
plot_1a = True
norm_errors = np.random.normal(0, sigma, (lf_years, num_draws))

You should use a print command for any answers that your code produces in response to things the Problem Set asks for. In this way, we can just run your script and see what answers it produces.

print('1b. Percent of students getting more than $100k in first period: ',
      inc0_gt100k_pct * 100, '%')

The code above produces the following output when I run this script.

1b. Percent of students getting more than $100k in first period: x.xx %

With regard to plots that the Problem Set asks for, follow the template in Issue #47 about having your script save those plots to an images folder. Further, make sure that your plot is included in the PDF document that you submit.
For your PDF document with your answers that you produce using LaTeX, I have included a LaTeX tutorial document as well as a LaTeX problem set template in this repository. Note that when using the LaTeX problem set template, you must have the image document (pencildrawing.png) in the same folder as the template in order to compile the PDF. If you are not using any images, you can just delete or comment out that section of the template.

In summary, we will grade your assignments based on your code that you submit, what output it gives when we execute it, and your accompanying PDF with your answers.
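The print convention described above can be sketched as follows. The helper format_answer and the value assigned to inc0_gt100k_pct are hypothetical, chosen only to show how a share can be reported as a labeled percent with two decimal places, matching the x.xx % pattern in the sample output.

```python
def format_answer(label, share):
    """Return one labeled answer line, with share shown as a percent."""
    return "{} {:.2f} %".format(label, share * 100)


# Hypothetical value standing in for a result computed earlier in a script
inc0_gt100k_pct = 0.125
line = format_answer(
    "1b. Percent of students getting more than $100k in first period:",
    inc0_gt100k_pct,
)
print(line)
```

Formatting the number explicitly keeps long floating-point tails out of the printed answer, which makes the output easier to compare against your PDF.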

Debt Question on PSet

Hi,

Nora and I were having difficulty in part 3. We are using R and trying to create a for loop in order to have it deduct from the debt.

The best that we have been able to figure out is the following, but we have two issues:

We are stuck on the last step of figuring out how to call the debt from the previous year (since this is a dataframe instead of a vector, we aren't sure how to do that).
We keep getting the following error and are not sure what the problem with our for loop is: Error in for (. in year) seq_along(id) : 4 arguments passed to 'for' which requires 3

Code:

simulatedIncome %>%
  mutate(debt = 95000) %>%
  for(year in seq_along(id)){
    if(year == 2019) {
      debt <- (95000 - income*.1)
    } else{
      debt <- simulatedIncome$debt[year-1] - income*.1
    }
  }

Thanks!
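For comparison, here is the same recurrence written in Python, the language used in the course's other code examples. It is a sketch under assumptions, not the intended solution: the income values are hypothetical, and only the starting debt of 95000 and the 10% repayment share are taken from the code above. As for the error message, it suggests that piping with %>% hands the data frame to for as an extra argument, so the loop likely needs to sit outside the pipe.

```python
import numpy as np

# Hypothetical incomes for one simulated career, one entry per year
income = np.array([80000.0, 82000.0, 85000.0, 90000.0])
initial_debt = 95000.0
repay_share = 0.1  # 10% of each year's income goes toward the debt

# Loop version: each year's balance depends on the previous year's balance
debt_loop = np.empty_like(income)
prev = initial_debt
for t in range(len(income)):
    debt_loop[t] = prev - repay_share * income[t]
    prev = debt_loop[t]

# Vectorized version: the same recurrence is a running (cumulative) sum
debt_vec = initial_debt - repay_share * np.cumsum(income)
```

The vectorized form shows that the previous year's debt never has to be looked up explicitly, since each balance is the starting debt minus all repayments made so far.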

Error term in simulation assignment

Is it distributed normally or log-normally?

Does anybody know how long it takes to get verified by MTurk? Thanks!

International students completing the collaboration assignment

It was brought to my attention by @ruixue-li and a couple other students that Amazon MTurk did not allow individuals of certain nationalities to register as workers. Additionally, some students expressed concerns that completing the assignment would be considered employment under US law and put their student visas at risk. I never intended for that to occur, and I understand if you do not wish to complete the MTurk portion of the homework assignment.

I am not familiar with other micro-task job market sites similar to Amazon MTurk, so I cannot assign you to work on a different site. If you are aware of one that you can legally participate on, feel free to use that site instead of MTurk - again, completing an hour's worth of micro-task assignments. But I am not asking you to go out and spend time hunting down such a site. As I stated in class, there is an alternative assignment available for students to complete if you cannot complete the MTurk assignment - the InfluenzaNet evaluation. Given all your experience gained this term in reading and assessing research articles, you should be able to complete the alternative assignment in a similar time frame (approximately one hour).

Submission of Collaboration Assignment

May we submit a single ipynb with the entire assignment, instead of submitting separate files?

R Markdown as Submission?

I just wanted to check that it would be okay to submit my assignment as an HTML file knitted from an R Markdown document, with the R script and the photos of the MTurk pages.

Thanks!

Article on polling techniques and sources of error

I'm originally from the state of Virginia, which is holding an election this year for governor. I just saw this article in my Facebook newsfeed (aka the algorithmically-defined echo chamber) examining how public polling results differ wildly depending on the method of identifying the frame population (random-digit dialing vs. contacting only registered/active voters). I thought it an interesting discussion given our unit on the total survey error framework a couple weeks ago.

persp-analysis's Issues

What has contributed to your learning in this class

Lots of people like Bit by Bit

The connection to your own future research projects

What you would like to change

Make it an elective

Not enough application

A lot of reading/not enough reading

Too many political science articles

Class is too long/I'm hungry at noon
