Giter Site home page Giter Site logo

electiondataanalysis / electiondata Goto Github PK

View Code? Open in Web Editor NEW
18.0 3.0 5.0 49.72 MB

Tools for consolidation and analysis of raw election results from the most reliable sources -- the election agencies themselves.

License: Other

Python 100.00%
election election-data common-data-format

electiondata's Introduction

CII Best Practices

Overview

This repository provides tools for consolidation and analysis of raw election results from the most reliable sources -- the election agencies themselves.

  • Consolidation: take as input election results files from a wide variety of sources and load the data into a relational database
  • Export: create consistent-format export files of results sets rolled up to any desired intermediate geography
    • tabular (tab-separated text)
    • xml (following NIST Election Results Reporting Common Data Format V2)
    • json (following NIST Election Results Reporting Common Data Format V2)
  • Analysis:
    • Curates one-county outliers of interest
    • Calculates difference-in-difference for results available by vote type
  • Visualization:
    • Scatter plots
    • Bar charts

Target Audience

This system is intended to be of use to news media, campaigns, election officials, students of politics and elections, and anyone else who is interested in assembling and understanding election results. If you have ideas for using this system or if you would like to stay updated on the progress of this project, we'd like to hear from you.

How to use the app

See documentation directory, which includes

How to Contribute Code

See CONTRIBUTING.MD.

Contributors

  • Stephanie Singer, Hatfield School of Government (Portland State University), former Chair, Philadelphia County Board of Elections
  • Janaki Raghuram Srungavarapu, Hatfield School of Government (Portland State University)
  • Eric Tsai, Hatfield School of Government (Portland State University)
  • Todd Graham, Hatfield School of Government (Portland State University)
  • Bryan Loy
  • Jon Wolgamott
  • Elliot Meyerson

Copyright

Copyright (c) Portland State University 2021

Funding

Funding provided October 2019 - November 2021 by the National Science Foundation

  • Award #1936809, "EAGER: Data Science for Election Verification"
  • Award #2027089, "RAPID: Election Result Anomaly Detection for 2020" Data collection and consolidation for the 2020 US General Election funded in part by the Verified Voting Foundation.

License

See LICENSE.md

electiondata's People

Contributors

bloy86 avatar ekmeyerson avatar ericmtsai avatar nphershm avatar raghu-srungavarapu avatar sfsinger19103 avatar toddgraham121 avatar wonderbug2020 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

electiondata's Issues

Redundant results files

Scenario: A state publishes:

  • On Nov 4, a preliminary election-day results file by precinct
  • On Nov 5, a preliminary absentee results file by county
  • On Nov 7, an updated absentee results file by county.

How should the system handle this? If all three files are munged, and then all results for 2020 General are pulled for the state, some absentee votes will be counted twice, while some other absentee and all the election-day votes will be counted once.

Cross-listed candidates in South Carolina

SC allows candidates to cross register, but doesn’t necessarily distinguish the votes from one party line from the other party line. See, e.g., Constance Anastopoulo, candidate for Attorney General in 2018 General election. She is on the Democratic ticket and the Working Families Party ticket. But in the block of text giving results for that contest, parties are not specified. As of 9/4/2020, our system cannot handle that doubling.

Arkansas bar chart throws error

After loading Arkansas data, running the bar chart on any combination of elections/arguments results in this error:

  File "/Users/eric/Desktop/election_data_analysis/src/election_data_analysis/__init__.py", line 1058, in bar
    False,
  File "/Users/eric/Desktop/election_data_analysis/src/election_data_analysis/analyze/__init__.py", line 309, in create_bar
    ranked = assign_anomaly_score(unsummed)
  File "/Users/eric/Desktop/election_data_analysis/src/election_data_analysis/analyze/__init__.py", line 506, in assign_anomaly_score
    df["score"] = df["score"].fillna(0)
  File "/Users/eric/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/eric/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    raise KeyError(key) from err
KeyError: 'score'```

demote unused requirements

Ideally the requirements.txt should refer only to packages that are actually imported (or implied by imports):

Currently ours has the following entries which, last I checked, were not used anywhere:

alembic==1.3.3 kiwisolver==1.1.0 pyparsing==2.4.6 python-dateutil==2.8.1 pytz==2019.3 six==1.14.0 easygui xlrd sqlalchemy_utils

Having these in the requirements.txt has the effect of (1) slowing the build process and (2) impeding comprehension of the codebase as a whole.

If the idea in putting them there was as a "memo" of sort about packages we're considering using at some point -- that's okay (in a way) but at the very least these entries should be commented. Ideally there should be no "live" entries referencing packages not in actual current use.

Arkansas loads vote types incorrectly

Arkansas data loaded using these materials do not include votes with the label totals. Opening up the raw data files, note that "Total" is an available column. However, the resulting CountItemTypes in the database only include the following:

  • absentee
  • early
  • election-day
  • other
  • provisional

From Stephanie:
Looking at a db with only AR data, I see five vote types, shown here with counts for one selection in one reporting unit:

  • absentee,5
  • early,7
  • election-day,1090
  • other,1280
  • provisional,2382

So it seems, first of all, that the system is recognizing 'total' as 'other' -- not sure why. Also, I'm guessing that the categories are mislabeled, since 2382 is the sum of all the others, and so should be 'total'.

This causes #32

Consolidate SQL used in analyzer

There is a lot of repeated SQL in the database module that is used only by the Analyzer. This varies only in a couple areas: the select statement and the Candidate vs BallotMeasure joins. See here, here, and here.

These should be consolidated into one function that creates the SQL to support the various areas where the query is used.

Backfill tests for jurisdictions that are ready to go

Jurisdictions that are ready to go are listed in our shared google drive. The states labeled check can be loaded into a DB and have tests written against that data.

Tests are located at election_data_analysis/tests/test_dataloader.py. Each state will require at least 6 tests written. Please follow the naming convention with the existing North Carolina examples. Some tests are irrelevant for a particular jurisdiction when counts by vote type are not provided; for these, check the North Carolina examples and you can see an example of a test that is simply assert True == True. (We do this to demonstrate completeness of the tests and indicate that it's not simply a test that has been overlooked.)

The counts that the tests assert against should come from a manual (human) calculation out of the raw datafile. Essentially we are ensuring that the data gets loaded correctly and the aggregation occurs correctly.

You may need to do a pip install pytest to install the correct package to run these tests. Then, from the tests directory, simply run pytest.

An incremental release process (ideally with semantic versioning) would be very helpful

Per a chat w/ @Raghu-Srungavarapu this afternoon:

Even in a project of this scale it would be very helpful to some kind of process for incremental (internal) releases in place. That is, some way of 'checkpointing' our code so we can say "OK, here's a stable version you can pull and do further testing / development on. BTW it includes pull requests X, Y, Z".

I see now we're using git tags [1] which is nice - but tags are meaningless if code (and functionality) keeps changing but the tag value always stays the same (ours currently says v1.0.0 even as new PRs keep coming in). Better would be to start with a much more modest release number (e.g. v0.1.1) and increment that slowly (e.g. v0.1.2, v0.1.3) as new releases are announced.[2].

BTW as "release" doesn't have to be anything more magical than a sequence of PR merges (such that the end result is deemed "stable" by one criterion or another). But the point is, at this point it would be desirable to have some form of (properly versioned) incremental releases - so we at least of some way of saying "ok, I tried doing on v0.1.3 and here's what happened" in our issue writeups.

Also, it'd be nice to have a changelog[3] with every release (which can just be a file CHANGES.md at the top of the repo). And of course the version parameter in the setup.py should always be in synch with the current git tag (there are ways of getting these to synch automatically; but for now, manual fudging will do just fine).

Right now the only way to tell if anything has changed in the state of the code is to look at the output of git log and git branch and then surmise from that that something has been fixed or not (which can sort of work, but not very well).

Finally, it'd be nice to see some kind of branching model apart from "everything into main" (which is of course equivalent to "everything into master). For example a bi-level (master / develop) or 3-level (master / qa / develop) model would make it easier to identify stable releases for testing and production. There are lots of fine articles and blog posts on this topic e.g. [4], [5].

[1] https://git-scm.com/book/en/v2/Git-Basics-Tagging

[2] https://semver.org

[3] https://keepachangelog.com/en/1.0.0/

[4] https://nvie.com/posts/a-successful-git-branching-model/

[5] https://docs.microsoft.com/en-us/azure/devops/repos/git/git-branching-guidance

ExpressVote munger

Georgia, South Carolina and Arkansas all have a text export (presumably from statewide ExpressVotes) that needs munging. It's not a flat file, but a concatenation of blocks of data.

Possible duplication of contests

Once Arkansas 2018 General election data is loaded, run this in your DB:

select  *
from    "Contest"
where   "Name" ~* 'US House AR District 2';

Notice that the results show a lot of similarity in the contest name. Now if you run this:

select  vc.*
from    "VoteCount" vc
        join "Contest" c on vc."Contest_Id" = c."Id"
where   "Name" ~* 'US House AR District 2'
order by "ReportingUnit_Id", "CountItemType_Id", "Count";

It is apparent that the vote counts are duplicated across the 4 contests. Should these be considered 4 different contests?

Remove Pandas warning on bar chart

Running the Analyzer.bar() function produces this warning:

/usr/local/lib/python3.7/site-packages/election_data_analysis-0.1-py3.7.egg/election_data_analysis/analyze/__init__.py:685: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

This does not affect the output but looks ugly. Fix this.

cull unused imports please

There's a lot of them, and (in bulk) they have a negative impact on overall readability and maintainability.

When we get a chance we should clean these up.

pyflakes src src/election_data_analysis/__init__.py:6: 'pathlib.Path' imported but unused src/election_data_analysis/__init__.py:9: 'pprint.pprint' imported but unused src/election_data_analysis/__init__.py:10: 'sys' imported but unused src/election_data_analysis/__init__.py:63: local variable 'e' is assigned to but never used src/election_data_analysis/__init__.py:99: local variable 'e' is assigned to but never used src/election_data_analysis/__init__.py:362: local variable 'e' is assigned to but never used src/election_data_analysis/__init__.py:918: local variable 'e' is assigned to but never used src/election_data_analysis/database/__init__.py:12: 'psycopg2.OperationalError' imported but unused src/election_data_analysis/database/__init__.py:12: 'psycopg2.errorcodes' imported but unused src/election_data_analysis/database/__init__.py:12: 'psycopg2.errors' imported but unused src/election_data_analysis/database/__init__.py:23: 'numpy as np' imported but unused src/election_data_analysis/database/__init__.py:24: 'sqlalchemy.MetaData' imported but unused src/election_data_analysis/database/__init__.py:24: 'sqlalchemy.Table' imported but unused src/election_data_analysis/database/__init__.py:24: 'sqlalchemy.Column' imported but unused src/election_data_analysis/database/__init__.py:24: 'sqlalchemy.Integer' imported but unused src/election_data_analysis/database/__init__.py:24: 'sqlalchemy.Float' imported but unused src/election_data_analysis/database/__init__.py:25: 'election_data_analysis.analyze as a' imported but unused src/election_data_analysis/database/__init__.py:175: local variable 'e' is assigned to but never used src/election_data_analysis/database/__init__.py:181: local variable 'e' is assigned to but never used src/election_data_analysis/database/__init__.py:723: local variable 'exc' is assigned to but never used src/election_data_analysis/database/__init__.py:932: undefined name 'element' src/election_data_analysis/database/create_cdf_db/__init__.py:18: 'sqlalchemy.Date' imported but unused src/election_data_analysis/database/create_cdf_db/__init__.py:18: 'sqlalchemy.TIMESTAMP' imported but unused src/election_data_analysis/database/create_cdf_db/__init__.py:19: 'psycopg2.sql' imported but unused src/election_data_analysis/database/create_cdf_db/__init__.py:23: 'datetime' imported but unused src/election_data_analysis/juris_and_munger/__init__.py:88: local variable 'e' is assigned to but never used src/election_data_analysis/visualize/__init__.py:1: 'plotly' imported but unused src/election_data_analysis/visualize/__init__.py:3: 'numpy as np' imported but unused src/election_data_analysis/verify_results/__init__.py:6: 'election_data_analysis.user_interface as ui' imported but unused src/election_data_analysis/verify_results/__init__.py:8: 'time' imported but unused src/election_data_analysis/verify_results/__init__.py:9: 'pandas.io.common as pderror' imported but unused src/election_data_analysis/special_formats/__init__.py:118: local variable 'exc' is assigned to but never used src/election_data_analysis/munge/__init__.py:128: local variable 'final' is assigned to but never used src/election_data_analysis/munge/__init__.py:515: local variable 'missing' is assigned to but never used src/election_data_analysis/munge/__init__.py:701: local variable 'exc' is assigned to but never used src/election_data_analysis/user_interface/__init__.py:5: 'pandas.errors.ParserWarning' imported but unused src/election_data_analysis/user_interface/__init__.py:6: 'numpy as np' imported but unused src/election_data_analysis/user_interface/__init__.py:8: 'sqlalchemy.orm.sessionmaker' imported but unused src/election_data_analysis/user_interface/__init__.py:11: 'ntpath' imported but unused src/election_data_analysis/user_interface/__init__.py:12: 're' imported but unused src/election_data_analysis/user_interface/__init__.py:13: 'datetime' imported but unused src/election_data_analysis/user_interface/__init__.py:15: 'random' imported but unused src/election_data_analysis/analyze/__init__.py:1: 'csv' imported but unused src/election_data_analysis/analyze/__init__.py:5: 'election_data_analysis.user_interface as ui' imported but unused src/election_data_analysis/analyze/__init__.py:10: 'matplotlib.pyplot as plt' imported but unused src/election_data_analysis/analyze/__init__.py:12: 'pandas.api.types.is_numeric_dtype' imported but unused src/election_data_analysis/analyze/__init__.py:16: 'math' imported but unused src/election_data_analysis/preparation/__init__.py:5: 'election_data_analysis.juris_and_munger as jm' imported but unused src/election_data_analysis/preparation/__init__.py:6: 'pathlib.Path' imported but unused

Add vote count aggregation data exports back in

The functions that aggregated the data by contest/subdivision type/vote count type were removed because there were unexplained permission issues when running. However, these can be replaced by switching back to using Pandas as an intermediary between reading from the database and then writing to a local filesystem, which should avoid the permission issues that were encountered while trying to use the Postgres COPY comman.

Error handling in db.test_connection()

Attached is the error report while trying to connect to the database.
Branch = main
commit 0933b0a

C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\python.exe "C:\Program Files\JetBrains\PyCharm 2019.3.2\plugins\python\helpers\pydev\pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 57270 --file C:/Users/jsru2/Desktop/GitHub/election_data_analysis/test.py
pydev debugger: process 4684 is connecting

Connected to pydev debugger (build 193.6015.41)
Traceback (most recent call last):
File "C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis\database_init_.py", line 198, in test_connection
err,
UnboundLocalError: local variable 'err' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis\database_init_.py", line 209, in test_connection
err,
UnboundLocalError: local variable 'err' referenced before assignment

Jurisdiction and template files not accessible relative to project root

If the app is run from inside the repository, there are no issues. But if it's run from "out in the wild", you may get this error:

>>> jp.new_juris_files()
Starting new_juris_files
Directory created: /Users/singer3/PycharmProjects/election_data_analysis/src/jurisdictions/Indiana
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/election_data_analysis-0.1-py3.8.egg/election_data_analysis/__init__.py", line 432, in new_juris_files
    error = jm.ensure_jurisdiction_dir(self.d["jurisdiction_path"])
  File "/usr/local/lib/python3.8/site-packages/election_data_analysis-0.1-py3.8.egg/election_data_analysis/juris_and_munger/__init__.py", line 321, in ensure_jurisdiction_dir
    err = ensure_juris_files(juris_path, ignore_empty=ignore_empty)
  File "/usr/local/lib/python3.8/site-packages/election_data_analysis-0.1-py3.8.egg/election_data_analysis/juris_and_munger/__init__.py", line 350, in ensure_juris_files
    template_list = [x[:-4] for x in os.listdir(templates_dir)]
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.8/site-packages/election_data_analysis-0.1-py3.8.egg/templates/jurisdiction_templates'

This occurs because the directory that holds the template files live outside the built package's directory structure.

Since these files are directly relevant to building the jurisdiction and munger, this issue proposes moving those template files into the juris_and_munger module.

Ability to handle several sheets in a result file.

The program must be able to handle results files that have data spread out over multiple sheets in the same file. Documentation should also be updated to indicate how to setup mungers to handle multiple sheets.

Move jurisdiction_path to run_time.ini

  • Create a dictionary of the jurisdiction counts (senate, house, and congressional districts) for the JurisdictionPrepper class.
  • Print a warning informing that the jurisdiction counts are mostly right but might have to be verified against the information on the state board of elections.
  • Update the documentation.

Fix SQL issues

There are a few places where SQL injection may be an issue. Fix these!

Parent-child weirdness in `ComposingReportingUnitJoin` table

After running the PA-2018g model we end up with an instance of the ComposingReportingUnitJoin where almost every child unit has two parents: one Id for the statewide entity (like one would expect), and another pointing back to itself:

select * from "ComposingReportingUnitJoin" order by "ChildReportingUnit_Id", "ParentReportingUnit_Id" limit 6;
 Id  | ParentReportingUnit_Id | ChildReportingUnit_Id 
-----+------------------------+-----------------------
 742 |                     62 |                    62
 402 |                    129 |                    62
 743 |                     63 |                    63
 403 |                    129 |                    63
 744 |                     64 |                    64
 404 |                    129 |                    64

Seems unlikely that this the intended design.

Take all system-specific info out of results.ini

Currently the *.ini file for each results file need a 'jurisdiction_path' parameter. This is path depends on the user's environment, which makes the *.ini file not portable without modification.

Since the path to the jurisdictions directory is now in run_time.ini, the jurisdiction_path parameter is redundant (except for the name (not path) of the jurisdiction itself.

The *.ini file should use only the name of the jurisdiction folder, not the whole path.

There will be backwards compatibility issues.

Wrong Project root

The following code gets the wrong path.

From juris_and_munger_init_.py (line 334 and [335)]

project_root = Path(__file__).parents[1].absolute()
templates_dir = os.path.join(project_root, "juris_and_munger", "jurisdiction_templates")

Branch = main
commit 0933b0a

error = dl.load_all()
Directory already exists: C:\Users\jsru2\Desktop\GitHub\election_data_analysis\src\jurisdictions\Pennsylvania
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis_init_.py", line 167, in load_all
juris[jp], new_err = ui.pick_juris_from_filesystem(
File "C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis\user_interface_init_.py", line 349, in pick_juris_from_filesystem
new_err = jm.ensure_jurisdiction_dir(juris_path)
File "C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis\juris_and_munger_init_.py", line 321, in ensure_jurisdiction_dir
err = ensure_juris_files(juris_path, ignore_empty=ignore_empty)
File "C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis\juris_and_munger_init_.py", line 337, in ensure_juris_files
extraneous = [
File "C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis\juris_and_munger_init_.py", line 340, in
if f not in os.listdir(templates_dir) and f[0] != "."
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\Users\jsru2\AppData\Local\Programs\Python\Python38-32\lib\site-packages\election_data_analysis-1.0-py3.8.egg\election_data_analysis\juris_and_munger\jurisdiction_templates'

Strange error after running `DataLoader.load_all()`

After running the PA-2018g model (using the setup in the in the debug-PA-1 branch) we see the err dict contains the following cryptic (to us) entry:

err = {'pa18g.ini': {'pa_gen': None}}

This is worrisome, first because in theory there should be no errors - but even more so because we can't figure out what this message is supposed to mean.

Here's the full output transcript from that script:

python ../bin/load-all.py 
Loading jurisdiction from /opt/verify/src/jurisdictions/Pennsylvania to Engine(postgresql://verify:***@localhost:5432/eda2)
Processing results files ['pa18g.ini']
Processing Official_7232020043015PM.CSV
VoteCount load time in seconds: 2.8637200269999994
	Results uploaded with munger pa_gen to database Engine(postgresql://verify:***@localhost:5432/eda2)
from file /opt/verify/model/PA/data/2018g/Official_7232020043015PM.CSV
assuming jurisdiction /opt/verify/src/jurisdictions/Pennsylvania
	Archived pa18g.ini and its results file.
err = {'pa18g.ini': {'pa_gen': None}}

And here's that load-all.py script itself:

import election_data_analysis as ea
an = ea.Analyzer()
dl = ea.DataLoader()
err = dl.load_all()
print(f"err = {err}")

Report file not found error when results folder is empty.

When the specified results folder is empty the code runs without notifying user of any error.
When you run the code to upload data the result file and corresponding .ini move to the archive. These make the results file empty.

Jurisdictions without vote types should be limited in what bar chart selections are available

If a jurisdiction only reports by vote totals, then certain comparisons of data do not make sense. For example, if PA only reports total counts, and a user passes in an election, Pennsylvania, a contest type, and a particular contest into DataLoader.bar(), then there is nothing more granular within that universe to compare and the code errors out. There should be some mechanism to limit what is available to pass into this function, either on the front end or elsewhere, but no such mechanism exists.

Allow election selection for bar graph viz

As discussed with the Verified Voting team, the first selection in the bar graph user flow is to select a state. While tweaking this, some possibly related updates to include:

  • Indicate whether a state has data for that election using boolean
  • Return value: add new line saying either State X provides data by vote type or Data unavailable by vote type.

`Selection` table has `Id` column only

After running the PA-2018g model (using the setup in the in the debug-PA-1 branch) we end up with a Selection table that has just as single column for the PK (and no other columns):

select * from "Selection" limit 5;
  Id  
------
   59
   60
   61
 4774
 4775

Presumably this is not the intended functionality -- as it's hard to to see the purpose of joining on a table that contains no other information besides the primary key.

Proper naming convention for reporting unit

For the Reportingunit.txt and dictionary.txt jurisdiction files, the naming convention for counties should include the word county. For example, Indiana;Adams County as opposed to simply Indiana;Adams

Missing `ElectionDistrict` table

After running the PA-2018g model (using the setup in the in the debug-PA-1 branch) we
see that two tables - Office and BallotMeasureContest - have FK columns for the
table ElectionDistrict - and these keys are even populated in the running database:

select * from "Office" limit 1;
  Id  |        Name         | Description | ElectionDistrict_Id 
------+---------------------+-------------+---------------------
 1080 | PA House District 1 |             |                 130

select * from "BallotMeasureContest" limit 1;
  Id  | ElectionDistrict_Id | Election_Id 
------+---------------------+-------------
 2825 |                 401 |        2824

Only problem is -- there's no such table (ElectionDistrict) present in the database.

Analyzer.top_counts() is slow, needs some indexing

There is a pretty massive SQL query that supports a number of functions in the Analyzer class, which is quite slow. This is most prevalent when running the Analyzer.top_counts() function which takes a few minutes. This should be able to be tuned by adding in appropriate indexes.

Allow flexibility in reporting unit hierarchy for bar chart

Currently we identify most anomalous at the county level. This assumes 2 things: that this reporting unit type is nested exactly one level below the jurisdiction; and that this reporting unit type exists in the hierarchy. However, there are examples such as Louisiana which don't use county but instead use something analogous like parish. This would break the existing code.

EDIT:
Much simpler way of doing this from Stephanie: query the ReportingUnit table for units that are of length 2 when split into an array; get the type of that unit.

Ensure that the SQL is written in a safe manner.

Bar chart results show votes_at_stake=0

As an example, load Arkansas general election data for 2018. Run Analyzer.bar('2018 General', 'Arkansas'). In the return value, notice that there are some results with votes_at_stake = 0. This seems unlikely, maybe even impossible, for a contest that's ranked as an outlier or anomalous.

Every row in `VoteCount` repeats - with empty vote counts

From our understanding of the data model, rows in the VoteCount table should be distinct on the following composite key
(as these are the parameters that effectively specify a single result count):

("_datafile_Id", "Election_Id", "Contest_Id", "Selection_Id", "ReportingUnit_Id", "CountItemType_Id")

Yet after running the PA-2018g model (using the setup in the in the debug-PA-1 branch)
we find that every row is nearly duplicated (exactly once) on the above tuple - except in the
second row, the Count column is empty:

select * from "VoteCount" order by "_datafile_Id", "Election_Id", "Contest_Id", "Selection_Id", "ReportingUnit_Id", "Id" limit 20; 
  Id  | Count | CountItemType_Id | OtherCountItemType | ReportingUnit_Id | Contest_Id | Selection_Id | Election_Id | _datafile_Id 
------+-------+------------------+--------------------+------------------+------------+--------------+-------------+--------------
 5715 | 13019 |               55 |                    |               86 |       2826 |         5016 |        2820 |         4773
 7033 |     0 |               55 |                    |               86 |       2826 |         5016 |        2820 |         4773
 5716 | 13736 |               55 |                    |               86 |       2827 |         5017 |        2820 |         4773
 7034 |     0 |               55 |                    |               86 |       2827 |         5017 |        2820 |         4773
 5717 |  7429 |               55 |                    |               86 |       2827 |         5018 |        2820 |         4773
 7035 |     0 |               55 |                    |               86 |       2827 |         5018 |        2820 |         4773
 5718 | 20265 |               55 |                    |               86 |       2828 |         5019 |        2820 |         4773
 7036 |     0 |               55 |                    |               86 |       2828 |         5019 |        2820 |         4773
 5719 | 17389 |               55 |                    |               86 |       2829 |         5020 |        2820 |         4773
 7037 |     0 |               55 |                    |               86 |       2829 |         5020 |        2820 |         4773

(Note we've left out the OtherCountItemType as its purpose is unclear - and in any case is always NULL).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.