
data4democracy / town-council


Tools to scrape and centralize the text of meeting agendas & minutes from local city governments. NOT ACTIVE -- looking for new lead(s)!

Home Page: http://datafordemocracy.slack.com/messages/p-town-council

Python 93.33% HTML 6.67%

town-council's People

Contributors

andysbolton, brucerowan, bstarling, chooliu, josephpd3, marktrovinger


town-council's Issues

Unable to open database file (Mac OS)

I tried to run the spider out of the box on my MacBook, but the pipelines had trouble reading or writing the SQLite database file.

Steps to reproduce:

  • Set up spider dependencies. This was done in a conda env, but I don't think pipenv or another environment manager would really make a difference (check if you think it would).
  • scrapy crawl belmont
(council-crawler) ➜  council_crawler git:(master) scrapy crawl belmont
2017-07-10 20:18:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: council_crawler)
2017-07-10 20:18:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'council_crawler', 'NEWSPIDER_MODULE': 'council_crawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['council_crawler.spiders']}
2017-07-10 20:18:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2017-07-10 20:18:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-10 20:18:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2017-07-10 20:18:11 [twisted] CRITICAL: Unhandled error in Deferred:

2017-07-10 20:18:11 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 766, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1229, in _do_get
    return self._create_connection()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return dialect.connect(*cargs, **cparams)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 393, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlite3.OperationalError: unable to open database file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/middleware.py", line 40, in from_settings
    mw = mwcls()
  File "/Users/jdebartola/d4d/town-council/council_crawler/council_crawler/pipelines.py", line 49, in __init__
    models.create_tables(engine)
  File "/Users/jdebartola/d4d/town-council/council_crawler/council_crawler/models.py", line 20, in create_tables
    DeclarativeBase.metadata.create_all(engine)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 3934, in create_all
    tables=tables)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1928, in _run_visitor
    with self._optional_conn_ctx_manager(connection) as conn:
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/contextlib.py", line 82, in __enter__
    return next(self.gen)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1921, in _optional_conn_ctx_manager
    with self.contextual_connect() as conn:
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2151, in _wrap_pool_connect
    e, dialect, self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1465, in _handle_dbapi_exception_noconnection
    exc_info
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 766, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1229, in _do_get
    return self._create_connection()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return dialect.connect(*cargs, **cparams)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 393, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file
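This error usually means SQLite cannot open the file behind the sqlite:/// URL, most often because the path is relative and resolves against an unexpected working directory, or because a parent directory does not exist. A minimal defensive sketch (the get_engine helper and default filename are illustrative assumptions, not the project's actual code):

    import os
    from sqlalchemy import create_engine

    def get_engine(db_path='council_crawler.db'):
        # Resolve to an absolute path so the engine does not depend on the
        # directory `scrapy crawl` was launched from.
        db_path = os.path.abspath(db_path)
        # SQLite will create a missing database file, but not missing
        # parent directories.
        os.makedirs(os.path.dirname(db_path), exist_ok=True)
        return create_engine('sqlite:///{}'.format(db_path))

Running the crawl from the council_crawler project directory (where scrapy.cfg lives) may also sidestep the relative-path problem.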

create wrapper script / online submission form to apply Legistar template

We are prospectively writing templates for common city content management systems like Legistar (town-council/council_crawler/templates/legistar_cms.py). This template can hopefully be used for all cities running the Legistar CMS.

Issue: develop a wrapper script that lets a Python user enter the fields needed to apply the template (e.g., city, state, URL); a rough sketch follows below.

Or, ideally, this could be made even more user-friendly for non-technical users (input info into a Google Form/Document --> output is the city.py scraper?).
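One possible shape for the wrapper (the LegistarSpider base class name and its fields are hypothetical; the real template in legistar_cms.py may expose different hooks):

    from pathlib import Path

    # Hypothetical spider skeleton generated from user input.
    SPIDER_TEMPLATE = '''\
    from council_crawler.templates.legistar_cms import LegistarSpider

    class {class_name}(LegistarSpider):
        name = '{city_slug}'
        city = '{city}'
        state = '{state}'
        start_urls = ['{url}']
    '''

    def make_spider():
        city = input('City name: ').strip()
        state = input('State (two-letter): ').strip().upper()
        url = input('Legistar URL: ').strip()
        city_slug = city.lower().replace(' ', '_')
        source = SPIDER_TEMPLATE.format(
            class_name=city.title().replace(' ', '') + 'Spider',
            city_slug=city_slug, city=city, state=state, url=url)
        Path('council_crawler/spiders/{}.py'.format(city_slug)).write_text(source)

    if __name__ == '__main__':
        make_spider()

The Google Form idea could reuse the same template: a small script reads the form's response spreadsheet and emits one city.py per row.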

System architecture

To get the conversation started, a rough outline can be found here.

Comment below with input/suggestions/improvements

apply legistar template to Bay Area initial set of cities

Apply Legistar templates in ./council_crawler/templates to the six cities that use Legistar in our initial set of cities (see list_of_cities.csv).

  • Cupertino
  • Hayward
  • Mountain View
  • San Leandro
  • San Mateo
  • Sunnyvale

(legistar_cms.py has not been validated against all Legistar sites, so check the .json output before submitting a pull request! A quick spot-check is sketched below.)
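One way to spot-check the output before a PR (assuming the spider was run with scrapy crawl cupertino -o cupertino.json, which writes a single JSON array; the filename is just an example):

    import json

    with open('cupertino.json') as f:
        events = json.load(f)

    for event in events:
        # Every item should be an event with at least one document, and
        # every document URL should already be absolute.
        assert event['_type'] == 'event'
        assert event['documents'], 'no documents for {}'.format(event['name'])
        for doc in event['documents']:
            assert doc['url'].startswith('http'), doc['url']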

municipality naming scheme

Find a system for uniquely identifying cities within the database (e.g., to resolve the possible ambiguity of the "city name, state" naming scheme). One possible direction is sketched below.
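As a starting point (purely an assumption, not a settled scheme), a composite key could include a disambiguating component such as the county:

    def place_id(city, state, county=None):
        # e.g. place_id('Belmont', 'CA') -> 'ca-belmont'
        #      place_id('Greenville', 'CA', 'Plumas') -> 'ca-plumas-greenville'
        parts = [state, county, city]
        return '-'.join(p.lower().replace(' ', '_') for p in parts if p)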

Fix URL logic in Belmont Spider

Some document links on Belmont's page are full URLs, but others are relative links.

Task:
Implement logic to check whether a link is relative. If so, join it with the base URL so that the full URL is saved in the documents.url field. (A sketch using urljoin follows the examples below.)

Ex:
Full URL stored (correct):

{
    "_type": "event",
    "name": "Belmont, CA City Council Planning Commission Meeting",
    "scraped_datetime": "2017-06-04 11:03:33",
    "record_date": "06/06/2017 7:00 PM ",
    "source": "belmont",
    "source_url": "http://www.belmont.gov/city-hall/city-government/city-meetings/-toggle-all",
    "meeting_type": "Planning Commission Meeting",
    "documents": [
        {
            "media_type": "text/html",
            "url": "http://belmont-ca.granicus.com/GeneratedAgendaViewer.php?event_id=a4bb6ac3-4714-11e7-b343-f04da2064c47",
            "url_hash": "d5fae19fe93ee064274e3c7a26fabf88",
            "category": "agenda"
        }
    ]
}

Partial/relative URL stored (incorrect):

{
    "_type": "event",
    "name": "Belmont, CA City Council Audit Committee Meeting",
    "scraped_datetime": "2017-06-04 11:03:33",
    "record_date": "06/05/2017 10:00 AM - 11:00 AM ",
    "source": "belmont",
    "source_url": "http://www.belmont.gov/city-hall/city-government/city-meetings/-toggle-all",
    "meeting_type": "Audit Committee Meeting",
    "documents": [
        {
            "media_type": "application/pdf",
            "url": "/Home/ShowDocument?id=15455",
            "url_hash": "c180f50e6d81ca6462de38a073f19d7d",
            "category": "agenda"
        }
    ]
}
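A minimal sketch of the fix using urljoin, which leaves absolute URLs untouched and resolves relative paths against the page URL:

    from urllib.parse import urljoin

    base = 'http://www.belmont.gov/city-hall/city-government/city-meetings/-toggle-all'

    urljoin(base, '/Home/ShowDocument?id=15455')
    # -> 'http://www.belmont.gov/Home/ShowDocument?id=15455'

    urljoin(base, 'http://belmont-ca.granicus.com/GeneratedAgendaViewer.php?event_id=a4bb6ac3-4714-11e7-b343-f04da2064c47')
    # -> unchanged, because the link is already absolute

Inside the spider, response.urljoin(href) does the same thing with the response's own URL as the base.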

Standardize Belmont record_date field

The record_date output by the Belmont spider is the raw text found on the website. The text should be parsed and stored as a Python date object. (For reference, this was recently fixed in the Dublin spider.) A parsing sketch follows below.

Current output:
record_date": "05/23/2017 6:30 PM - 7:00 PM "

Desired output:
"record_date": "2017-05-23"
