
biglocalnews / civic-scraper


Tools for downloading agendas, minutes and other documents produced by local government

Home Page: https://civic-scraper.readthedocs.io

License: Other

Languages: Python 96.74%, Makefile 2.84%, Dockerfile 0.42%
Topics: agendawatch, python, data-journalism, journalism, news, city-council-data, scraper


civic-scraper's People

Contributors

antidipyramid, dependabot[bot], dipierro, fgomez828, fgregg, hancush, palewire, zstumgoren


civic-scraper's Issues

General code base cleanup

Misc code base clean-ups to perform:

  • Create a scripts/ dir in the project root
  • Move civic_scraper/run_scraper.py to scripts/
  • Move civic_scraper/scrapers/generate_civicplus_sites.py to scripts/
  • Move or remove Pipenv files in civic_scraper/
  • Remove all if __name__ blocks from modules in the civic_scraper package
  • Tests: remove obsolete tests and fixtures, if any
  • Delete the testing/ directory
  • CLI integration: fix any potential breakage for end-user install (#49)
  • Remove unimplemented code (Legistar and Granicus?) from the master branch and store it on separate branches until ready to merge
  • Legistar integration: remove any references from master until fully implemented

Decide on open sourcing

Let's decide whether to open source this project and, if so, which open source license to settle on. Some key questions we'll need to decide are nicely laid out here and include:

  • distribution (to third parties)
  • modification
  • private use (i.e. without open sourcing after modifying)
  • sublicensing
  • granting of copyright by contributors to us (e.g. in Apache license)

Make AssetCollection collection-like

It would be nice for AssetCollection to be collection-like, so that it supports the following:

>>> asset_collection = AssetCollection()
>>> len(asset_collection)
4
>>> urls = [asset.url for asset in asset_collection]

To do this, add methods along these lines to AssetCollection (I haven't tested them!):

    def __iter__(self):
        # Delegating to the list iterator makes the collection iterable;
        # no separate __next__ method is needed.
        return iter(self.assets)

    def __len__(self):
        return len(self.assets)
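
For reference, here is a self-contained sketch showing the methods in context (the assets list attribute and the tiny Asset stand-in are illustrative assumptions, not the package's real classes):

    class Asset:
        # Minimal stand-in; the real Asset carries many more fields.
        def __init__(self, url):
            self.url = url

    class AssetCollection:
        def __init__(self, assets=None):
            self.assets = list(assets or [])

        def __iter__(self):
            # Delegating to the list iterator makes `for asset in collection` work.
            return iter(self.assets)

        def __len__(self):
            return len(self.assets)

    collection = AssetCollection([Asset("http://example.com/agenda.pdf"),
                                  Asset("http://example.com/minutes.pdf")])
    print(len(collection))                        # 2
    print([asset.url for asset in collection])    # both URLs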

Cleanup civic-scraper

This repo is looking good but could use some tidying.

  • Both civic-scraper and civic_scraper exist right now. Do they do the same thing or different things? Can they be merged?

  • civicplus_site_statuses.csv and civicplus_urls.csv are likely best moved out of this repo.

Create a master list of platforms to scrape

We need a list of platforms to target. We have some of this information in various scattered places.

Let's gather up our scattered resources and create a centralized public list of platforms to target.

As part of this, let's note for each platform a handful of example sites corresponding to its various versions (if there are multiple versions).

Fix `run_scraper.py`

Changes to the code base now mean that run_scraper.py no longer works.

First: note that, as currently written, calling python run_scraper.py from the command line raises an error that appears to originate in asset.py.

Second: uncomment the command line parser so that the URL and other arguments can be passed straight to the scraper. Note that the scraper_args argument should probably be modified to accept a list of strings instead of an unwieldy JSON dict. For example, the following invocation syntax would work nicely:

python run_scraper.py civicplus http://pa-westchester2.civicplus.com/AgendaCenter path/to/target.csv --scraper_args start_date 2015-09-09 end_date 2015-10-14
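
A minimal sketch of what such a parser could look like (the module layout and the hand-off to the scraper are assumptions for illustration, not the current run_scraper.py):

    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Run a civic-scraper scraper.")
        parser.add_argument("scraper", help="Scraper name, e.g. 'civicplus'")
        parser.add_argument("url", help="Base URL of the site to scrape")
        parser.add_argument("target", help="Path to the output CSV")
        # Accept a flat list of strings instead of a JSON dict, e.g.:
        #   --scraper_args start_date 2015-09-09 end_date 2015-10-14
        parser.add_argument("--scraper_args", nargs="*", default=[])
        args = parser.parse_args()

        # Pair up alternating key/value strings into a dict for the scraper.
        extra = dict(zip(args.scraper_args[::2], args.scraper_args[1::2]))
        print(args.scraper, args.url, args.target, extra)  # hand off to the scraper here

    if __name__ == "__main__":
        main()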

Identify *novusagenda.com subdomains

A number of local governments in the Bay Area and in other parts of the country post their meeting minutes, agendas, and other documents on websites hosted on *novusagenda.com subdomains. These websites typically look something like this or this and follow the web address convention PLACE.novusagenda.com/agendapublic, where PLACE is a custom field.

Your task is to compile a list of as many *novusagenda.com subdomains as you can find. This will allow us to evaluate how many government agencies are using this website format, which, in turn, will help us to decide which scrapers to build next.

In the past, we have found that this subdomain enumerating search engine is the easiest and most comprehensive way to compile lists of subdomains. (Note that we may need to set up an account to unlock all of the search features on this website.) However, there are many different ways to find subdomains, including using advanced Google searches or using certain pen testing Python libraries and command line utilities (see nmmapper.com for a few examples), and we encourage you to be creative.

To complete this task, please do the following:

  • Create a Google Sheet with a single column, where each row is a unique *novusagenda.com subdomain. Be sure to change the sharing settings so that this sheet is public to anyone with the link.
  • Paste a link to your *novusagenda.com Google Sheet to the sites_sheet field of this spreadsheet for the row where the short_name is "novusagenda".
  • Write a brief reply to this issue documenting the process you used to identify subdomains so that we can continue to develop best practices.

Define and implement command line invocation, if any, for `civicplus.py` and other scrapers

Right now, there's a command line invocation for civicplus.py that's out of date.

I'm not sure it's necessary that the files scrapers/*.py have command line invocations at all - I'd recommend getting rid of them since we now have other scripts that basically wrap around the individual scrapers.

But if we want to keep command line invocations at the level of individual scrapers, let's decide what they should be, then update them accordingly.

Cape May data bug

The following data bug was discovered in docs/civicplus_sites.csv:

# Below should be Cape May County, NJ
http://dev135.civicplus.com/AgendaCenter,2012,2018,civicplus,FALSE,Name error,DE,USA,County....

Make an IQM2 scraper

Add additional data fields to the Asset class and go over data validation

There are additional data fields that would be good to add to Asset, listed below. Relatedly, let's carefully go through data validation together to make sure the validation logic (and its documentation!) agrees with our intent.

[
'committee_type',   # None, or one of a specified list of allowed committee types: "water board", "city council", ...
'committee_description',    # None, or human readable committee descriptor. "San Jose City Council"
'meeting_type',     # None, or one of a specified list of allowed meeting types: "regular", "special", ...
'meeting_description',     # None, or human readable meeting descriptor.  "Regular Meeting of San Jose City Council, June 22, 2020"
'meeting_location', # None, or string containing address, URL (for online meetings), or other location identifier provided by the platform (this is a hard field to do data validation on)
'asset_description',       # None, or human readable descriptor of asset. "Minutes of Regular Meeting of San Jose City Council, June 22, 2020"
'asset_type',       # None, or one of a specified list of allowed asset types. "agenda", "minutes", "meeting_video", ...
'place_description',       # None, or human readable place descriptor. "San Jose", "City and County of San Francisco"
]

Some of these rename existing fields (e.g. changing the _name suffix to _description, which emphasizes that the field is human readable but not necessarily a unique identifier).
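
A rough sketch of what the expanded Asset might look like (the dataclass form and the allowed-value set are illustrative assumptions; only the field names come from the list above):

    from dataclasses import dataclass
    from typing import Optional

    ALLOWED_ASSET_TYPES = {"agenda", "minutes", "meeting_video"}  # illustrative subset

    @dataclass
    class Asset:
        url: str
        committee_type: Optional[str] = None
        committee_description: Optional[str] = None
        meeting_type: Optional[str] = None
        meeting_description: Optional[str] = None
        meeting_location: Optional[str] = None
        asset_description: Optional[str] = None
        asset_type: Optional[str] = None
        place_description: Optional[str] = None

        def __post_init__(self):
            # Example of the kind of validation under discussion.
            if self.asset_type is not None and self.asset_type not in ALLOWED_ASSET_TYPES:
                raise ValueError(f"Unsupported asset_type: {self.asset_type!r}")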

Define and implement command line invocation, if any, for `asset.py`

Right now the command line invocation of asset.py is hard coded to work for Amy's computer.

It's not clear to me that any command line invocation of this file is necessary at all - I can't think of what I'd want to do with Asset or AssetCollection from the command line (I'm certainly not going to be in the business of constructing an Asset instance by writing out all the asset fields as individual command line arguments, for example).

I suggest scrapping the command line invocation of asset.py, or otherwise defining what you want it to do and updating it.

Complete CivicPlus Scraper

Complete the OO model and make it importable/usable by the production ETL pipeline.

Ultimately we want a function, maybe site.scrape_to_csv(local_path), that writes something like the following CSV to local_path, including all documents on the site from the last month (a sketch of such a function follows the sample rows below).

Also include a column for site_type (e.g. civicplus, legistar, etc.).

,city,date,committee,doc_format,url,doc_type
0,Hayward,2019-12-16,Hayward Youth Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=750204&GUID=F1EFFE96-1D3B-4CF5-A4B6-046D41BC402A,Agenda
1,Hayward,2019-12-12,Planning Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=729839&GUID=315AE98C-D729-4CCE-B4AB-0680E98AE87C,Agenda
2,Hayward,2019-12-12,Personnel Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=714329&GUID=9FFE26A5-7AFB-43FB-90CE-61C5BA123DE9,Agenda
3,Hayward,2019-12-09,Homelessness-Housing Task Force,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=734306&GUID=854CF140-88DD-4163-B107-D06F449EAAE0,Agenda
4,Hayward,2019-12-05,Homelessness-Housing Task Force,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=682107&GUID=BE980F9A-0F52-4883-A139-A2A5DAFF7108,Agenda
5,Hayward,2019-12-05,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=743352&GUID=C676DE80-DF5E-4C8B-A385-DD4BFA4F0DEE,Agenda
6,Hayward,2019-12-04,Council Budget and Finance Committee,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=718571&GUID=53E8F44A-4144-401A-8469-61870F530B91,Agenda
7,Hayward,2019-12-03,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=736075&GUID=DD500914-EA51-4C20-AFE7-DCDAB8947DEA,Agenda
8,Hayward,2019-12-03,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=743832&GUID=DC31DF76-0219-498D-8512-24CCA95C76C4,Agenda
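
A sketch of the kind of function this describes (the function name follows the suggestion above; the shape of the assets argument is an assumption for illustration):

    import csv

    FIELDNAMES = ["city", "date", "committee", "doc_format", "url", "doc_type", "site_type"]

    def scrape_to_csv(assets, local_path, site_type="civicplus"):
        # `assets` is assumed to be an iterable of dicts keyed like FIELDNAMES
        # (minus site_type), e.g. produced by a site's scrape routine.
        with open(local_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
            writer.writeheader()
            for asset in assets:
                writer.writerow({**asset, "site_type": site_type})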

Update Legistar scrapers to produce sites-to-scrape index with agencies in a nested field

The Legistar sites index does not currently contain a list of the agencies associated with a given site. To support the Agency Type filtering/faceting for the new website design, we'll need to generate a CSV containing this information or add it manually.

The CivicPlus sites index already contains a list of agencies for each site as a nested field.

Update civicplus scraper to capture entity metadata

Let's perform a level-of-effort assessment on gathering entity metadata (e.g. committee name, committee type, and jurisdiction level such as city vs. county). This is a blocker for including an entity-based filter or search feature in our v1 design.

Re-spec `doc_format` in civic-scrapers

Motivation: we'd like to know something about a file before downloading it. Specifically, we want to know 1) how big the file is and 2) what the file type is.

Solution: It is generally possible to retrieve some metadata about a file via HTTP before actually downloading it, and the scraper code is a great place to do that.

import requests

url = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'
# A HEAD request retrieves the headers (Content-Type, Content-Length) without the body.
r = requests.head(url)
print(r.headers)

TODO:

  1. remove the doc_format argument from Document.__init__() in document.py
  2. add a content_type argument that mirrors the Content-Type HTTP header and takes MIME type string values (e.g. "application/pdf; qs=0.01")
  3. add a content_length argument that mirrors the Content-Length HTTP header and takes positive integer values
  4. modify the current scrapers (in the .scrape() method) to report this information for all the document links they find (see the sketch below)
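
A sketch of a helper the scrapers could use to gather these two values (the helper name is hypothetical; only requests.head and the standard response headers are relied upon):

    import requests

    def fetch_asset_metadata(url, timeout=10):
        # Returns (content_type, content_length); either may be None if the
        # server omits the header or the request fails.
        try:
            response = requests.head(url, allow_redirects=True, timeout=timeout)
            content_type = response.headers.get("Content-Type")
            length = response.headers.get("Content-Length")
            return content_type, int(length) if length else None
        except requests.RequestException:
            return None, None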

Identify *iqm2.com subdomains

A number of local governments in the Bay Area and in other parts of the country post their meeting minutes, agendas, and other documents on websites hosted on *iqm2.com subdomains. These websites typically look something like this and follow the web address convention PLACE.iqm2.com/Citizens/default.aspx, where PLACE is a custom field.

Your task is to compile a list of as many *iqm2.com subdomains as you can find. This will allow us to evaluate how many government agencies are using this website format, which, in turn, will help us to decide which scrapers to build next.

In the past, we have found that this subdomain enumerating search engine is the easiest and most comprehensive way to compile lists of subdomains. (Note that we may need to set up an account to unlock all of the search features on this website.) However, there are many different ways to find subdomains, including using advanced Google searches or using certain pen testing Python libraries (see nmmapper.com for a few examples), and we encourage you to be creative.

Note that the company Granicus, in addition to making the software behind these *iqm2.com websites, also makes other common government meeting website types, including Legistar and *granicus.com. Anecdotally, there does not appear to be much, if any, overlap between *iqm2.com websites and the other website types owned by Granicus, but you should determine the degree of overlap as part of your research.

To complete this task, please do the following:

  • Create a Google Sheet with a single column, where each row is a unique *iqm2.com subdomain. Be sure to change the sharing settings so that this sheet is public to anyone with the link.
  • Compare your sheet of *iqm2.com subdomains to this sheet of *legistar.com subdomains. Add a column legistar to your *iqm2.com sheet that is TRUE if a *legistar.com subdomain is available for the same government agency and FALSE otherwise. You may also wish to compare your sheet to the sheet of *granicus.com subdomains, if it exists. (See #52.)
  • Paste a link to your *iqm2.com Google Sheet to the sites_sheet field of this spreadsheet for the row where the short_name is "iqm2".
  • Write a brief reply to this issue documenting the process you used to identify subdomains so that we can continue to develop best practices.

Write README

The README should briefly summarize what this package does, what features it offers, and how to install it. It should also include a brief tutorial on how to run the package both from Python and from the command line.

File handle error for Windows users

Hi Stanford team,

Some members of our team are getting a consistent error with the file names in the testing folder. It seems that Windows users can't check out the repo because of a colon used in a filename:

Example:

'testing/civicplus.py_v2020-07-09-no_data-no_data-2020-07-18T15:59:03.943461.csv'

Cloning fails with this error:

Cloning into 'C:\Users\Krammy\Documents\GitHub\civic-scraper'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (333/333), done.
remote: Compressing objects: 100% (238/238), done.
remote: Total 3295 (delta 204), reused 209 (delta 94), pack-reused 2962
Receiving objects: 100% (3295/3295), 34.10 MiB | 8.63 MiB/s, done.
Resolving deltas: 100% (622/622), done.
error: invalid path 'testing/civicplus.py_v2020-07-09-no_data-no_data-2020-07-18T15:59:03.943461.csv'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Is there another way to name these files?
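
One possible approach (an illustrative sketch, not necessarily how civic-scraper builds these names) is to format the timestamp without colons so the filename is valid on Windows:

    from datetime import datetime

    def safe_timestamp():
        # 2020-07-18T15:59:03.943461 -> 20200718T155903 (no colons, Windows-safe)
        return datetime.now().strftime("%Y%m%dT%H%M%S")

    filename = f"civicplus.py_v2020-07-09-no_data-no_data-{safe_timestamp()}.csv"
    print(filename)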

Identify *granicus.com subdomains

A number of local governments in the Bay Area and in other parts of the country post their meeting minutes, agendas, and other documents on websites hosted on *granicus.com subdomains. These websites typically look something like this or this and follow the web address convention PLACE.granicus.com/ViewPublisher.php?view_id=NUMBER, where PLACE and NUMBER are customized fields.

Your task is to compile a list of as many *granicus.com subdomains as you can find. This will allow us to evaluate how many government agencies are using this website format, which, in turn, will help us to decide which scrapers to build next.

In the past, we have found that this subdomain enumerating search engine is the easiest and most comprehensive way to compile lists of subdomains. (Note that we may need to set up an account to unlock all of the search features on this website.) However, there are many different ways to find subdomains, including using advanced Google searches or using the CLIs and Python tools discussed on nmmapper.com, and we encourage you to be creative.

Note that the company Granicus, in addition to making the software behind these *granicus.com websites, also makes Legistar, another common government meeting website type. While some government agencies with a *legistar.com/Calendar.aspx page also have a *granicus.com/viewpublisher.php page, the two do not have a one-to-one relationship. For example, https://fairfax.granicus.com/viewpublisher.php?view_id=11 exists but https://fairfax.legistar.com/Calendar.aspx does not.

To complete this task, please do the following:

  • Create a Google Sheet with a single column, where each row is a unique *granicus.com subdomain. Be sure to change the sharing settings so that this sheet is public to anyone with the link.
  • Compare your sheet of *granicus.com subdomains to this sheet of *legistar.com subdomains. Add a column legistar to your *granicus.com sheet that is TRUE if a *legistar.com subdomain is available for the same government agency and FALSE otherwise. You may also wish to compare your sheet to the sheet of *iqm2.com subdomains, if it exists. (See #53.)
  • Paste a link to your *granicus.com Google Sheet to the sites_sheet field of this spreadsheet for the row where the short_name is "granicus".
  • Write a brief reply to this issue documenting the process you used to identify subdomains so that we can continue to develop best practices.

Create contributor-friendly issues

Create Issues for code and documentation contributions and assign them clear labels to help contributors more easily locate items appropriate to their interests and skill level.

Add __repr__() methods to Asset and AssetCollection

Add __repr__() methods to support something like the following:

>>> asset = Asset()
>>> print(asset)
Asset(url='http://blahblah', asset_type='minutes', content_length=1234, ...)

>>> asset_collection = AssetCollection()
>>> print(asset_collection)
AssetCollection(
Asset(url='http://blahblah', asset_type='minutes', content_length=1234, ...),
Asset(url='http://foobar', asset_type='agenda', content_length=2345, ...)
)
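
A minimal sketch of what these methods could look like (assuming Asset keeps its fields as attributes and AssetCollection stores items in a list named assets; the field set shown is illustrative):

    class Asset:
        def __init__(self, url=None, asset_type=None, content_length=None):
            self.url = url
            self.asset_type = asset_type
            self.content_length = content_length

        def __repr__(self):
            return (f"Asset(url={self.url!r}, asset_type={self.asset_type!r}, "
                    f"content_length={self.content_length!r})")

    class AssetCollection:
        def __init__(self, assets=None):
            self.assets = list(assets or [])

        def __repr__(self):
            inner = ",\n".join(repr(asset) for asset in self.assets)
            return f"AssetCollection(\n{inner}\n)"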

Add signatures of public CivicPlusSite methods to Site class

Since the inputs and outputs of any public method of CivicPlusSite ought not to be specific to CivicPlus, copy the method signatures from CivicPlusSite to the Site class in site.py. For example:

class Site(object):

    def scrape(
            self,
            start_date=None,
            end_date=None,
            download=False,
            target_dir=None,
            file_size=None,
            asset_list=SUPPORTED_ASSET_TYPES,
            csv_export=None,
            append=False
    ) -> AssetCollection:
        """
        Scrape the site and return an AssetCollection instance.
        """
        raise NotImplementedError

or whatever the final interface for .scrape() will be.

Create Legistar scraper

Create a LegistarSite scraper class that generates a CSV of values for each site. See description in ticket #8 for example format of expected output.

If possible, this scraper should use requests and bs4 instead of Selenium.
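
An illustrative sketch of the requests + bs4 approach (the calendar URL pattern and the View.ashx link filter are rough assumptions about Legistar pages, not verified selectors):

    import requests
    from bs4 import BeautifulSoup

    def fetch_legistar_links(place):
        # `place` is the Legistar subdomain, e.g. "hayward".
        url = f"https://{place}.legistar.com/Calendar.aspx"
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Agenda/minutes documents are typically served via View.ashx links.
        return [a["href"] for a in soup.find_all("a", href=True) if "View.ashx" in a["href"]]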

Create a CLI executable

Complete implementation of the civic-scraper command-line task (#29).

Along the way, we should remove "if-main" code blocks in modules such as assets.py (#34).

Turn `video` and `video2` into a single asset type, `meeting_video`

The purpose of the asset_type field is to help us decide later on how to treat a file based on its general contents. If two video formats of the same meeting are provided, but otherwise contain the same content, either just supply the more standard format (.mp4, say) or supply them both, call them both meeting_video, and let them be distinguished by the MIME type field.
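
A small normalization along these lines (illustrative; the mapping would live wherever asset_type gets assigned):

    ASSET_TYPE_ALIASES = {"video": "meeting_video", "video2": "meeting_video"}

    def normalize_asset_type(raw_type):
        # Collapse the legacy video labels into the single meeting_video type.
        return ASSET_TYPE_ALIASES.get(raw_type, raw_type)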

Default Site.scrape behavior should list everything but not download

It's most intuitive to me that Site.scrape should, if not given any date arguments, return a list of everything it finds (but not download any of them).

Other defaults I'd suggest:

  • If just given a start_date argument, return everything after that start date.
  • If just given an end_date argument, return everything before that end date.

The current default is to look only for documents dated today (the date the scraper is invoked), which I don't understand because that's a very infrequent use case.
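
A sketch of the proposed defaults (assuming each asset exposes a date attribute; the helper is illustrative, not existing API):

    def filter_by_date(assets, start_date=None, end_date=None):
        # No dates: keep everything. Only start_date: everything on or after it.
        # Only end_date: everything on or before it.
        selected = []
        for asset in assets:
            if start_date is not None and asset.date < start_date:
                continue
            if end_date is not None and asset.date > end_date:
                continue
            selected.append(asset)
        return selected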

Update README example

East Palo Alto has switched from CivicPlus Agenda Center to a different meeting software. As a result, we need to update the example in the README to a different government agency.

Add spec that documents the format of the output of download_csv()

I don't see anywhere that civic-scraper documents the format of the CSV produced by CivicScraper.download_csv() - in particular, the headers of that table and what they mean. Ideally this should be documented in a Markdown file in the repo (perhaps the README).

Decide which scrapers to make next

Identify *primegov.com subdomains

A number of local governments in the Bay Area and in other parts of the country post their meeting minutes, agendas, and other documents on websites hosted on *primegov.com subdomains. These websites typically look something like this and follow the web address convention PLACE.primegov.com/public/portal, where PLACE is a custom field.

Your task is to compile a list of as many *primegov.com subdomains as you can find. This will allow us to evaluate how many government agencies are using this website format, which, in turn, will help us to decide which scrapers to build next.

In the past, we have found that this subdomain enumerating search engine is the easiest and most comprehensive way to compile lists of subdomains. (Note that we may need to set up an account to unlock all of the search features on this website.) However, there are many different ways to find subdomains, including using advanced Google searches or using certain pen testing Python libraries and command line utilities (see nmmapper.com for a few examples), and we encourage you to be creative.

To complete this task, please do the following:

  • Create a Google Sheet with a single column, where each row is a unique *primegov.com subdomain. Be sure to change the sharing settings so that this sheet is public to anyone with the link.
  • Paste a link to your *primegov.com Google Sheet to the sites_sheet field of this spreadsheet for the row where the short_name is "primegov".
  • Write a brief reply to this issue documenting the process you used to identify subdomains so that we can continue to develop best practices.

What should we name our agenda/minutes scraper repo?

Naming things is hard :) Right now we're using muniscrape as the name for this repo, which will provide a single interface for scraping CivicPlus and Legistar (and possibly other sites).

We need a name that reflects the fact that we're scraping government records (starting with agendas and minutes but possibly other related docs or content such as video), while allowing for different types of government agencies.

We also need to make sure the package name (i.e. the package that will be installed and importable in Python code such as Lambda functions) ideally doesn't clash with any of our other project packages.

Here are some options to consider:

  • agenda-watch
  • local-gov-scrape
  • gov-scrape
  • gov-recs

Fill out metadata about each CivicPlus site

For the purposes of having consistent metadata about our sites, it would be best if the table stored in civicplus_urls.csv were filled out with the following additional fields:

  • NEED-TO-HAVE

    scraper_type: set to "civicplus"

    endpoint: the URL to point the scraper to

    blacklisted: "true" or "false"; if true, we won't try to scrape the site

    whitelisted: "true" or "false"; if true, we will try to scrape the site even if it has thrown errors in the past

  • NICE-TO-HAVE

    name: e.g., "Albuquerque", "San Joaquin Council of Governments"

    state: e.g., "CA", "AB" (Canadian provinces OK); if the site spans multiple states, a comma-separated list

    country: "United States" or "Canada"

    govt_level: specifies the atomic level of government represented by the agencies that use the site. Examples are "municipality", "county", "parish", "regional district" (that's a county in British Columbia, don'tcha know?), "coalition" (e.g. Association of Bay Area Governments), "state", "province", "transportation authority" (such as MTA/MBTA/SEPTA), "special district" (such as BART, water districts), "federal"; or, if more than one of these, a comma-separated list: e.g. "municipality, county" for City and County of San Francisco
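
For illustration, a row with the proposed fields might look like this (the endpoint and values are hypothetical):

scraper_type,endpoint,blacklisted,whitelisted,name,state,country,govt_level
civicplus,http://ca-example.civicplus.com/AgendaCenter,false,false,Example City,CA,United States,municipality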
