
data4democracy / town-council


Tools to scrape and centralize the text of meeting agendas & minutes from local city governments. NOT ACTIVE -- looking for new lead(s)!

Home Page: http://datafordemocracy.slack.com/messages/p-town-council

Python 93.33% HTML 6.67%

town-council's People

Contributors

andysbolton, brucerowan, bstarling, chooliu, josephpd3, marktrovinger


town-council's Issues

Unable to open database file (Mac OS)

I tried to run the spider out of the box on my MacBook, but the pipelines had trouble reading or writing the SQLite database file.

Steps to reproduce:

  • Set up spider dependencies. This was done in a conda env, but I don't think pipenv or another environment manager would really make a difference (check if you think it would).
  • scrapy crawl belmont
(council-crawler) ➜  council_crawler git:(master) scrapy crawl belmont
2017-07-10 20:18:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: council_crawler)
2017-07-10 20:18:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'council_crawler', 'NEWSPIDER_MODULE': 'council_crawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['council_crawler.spiders']}
2017-07-10 20:18:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2017-07-10 20:18:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-10 20:18:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Unhandled error in Deferred:
2017-07-10 20:18:11 [twisted] CRITICAL: Unhandled error in Deferred:

2017-07-10 20:18:11 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 766, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1229, in _do_get
    return self._create_connection()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return dialect.connect(*cargs, **cparams)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 393, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlite3.OperationalError: unable to open database file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/scrapy/middleware.py", line 40, in from_settings
    mw = mwcls()
  File "/Users/jdebartola/d4d/town-council/council_crawler/council_crawler/pipelines.py", line 49, in __init__
    models.create_tables(engine)
  File "/Users/jdebartola/d4d/town-council/council_crawler/council_crawler/models.py", line 20, in create_tables
    DeclarativeBase.metadata.create_all(engine)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 3934, in create_all
    tables=tables)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1928, in _run_visitor
    with self._optional_conn_ctx_manager(connection) as conn:
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/contextlib.py", line 82, in __enter__
    return next(self.gen)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1921, in _optional_conn_ctx_manager
    with self.contextual_connect() as conn:
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2151, in _wrap_pool_connect
    e, dialect, self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1465, in _handle_dbapi_exception_noconnection
    exc_info
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 766, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1229, in _do_get
    return self._create_connection()
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return dialect.connect(*cargs, **cparams)
  File "/Users/jdebartola/anaconda/envs/council-crawler/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 393, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file
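This error usually means SQLite cannot open the file behind the sqlite:/// URL, most often because the path is relative and resolves against an unexpected working directory, or because a parent directory does not exist. A minimal defensive sketch (the get_engine helper and default filename are illustrative assumptions, not the project's actual code):

    import os
    from sqlalchemy import create_engine

    def get_engine(db_path='council_crawler.db'):
        # Resolve to an absolute path so the engine does not depend on the
        # directory `scrapy crawl` was launched from.
        db_path = os.path.abspath(db_path)
        # SQLite will create a missing database file, but not missing
        # parent directories.
        os.makedirs(os.path.dirname(db_path), exist_ok=True)
        return create_engine('sqlite:///{}'.format(db_path))

Running the crawl from the council_crawler project directory (where scrapy.cfg lives) may also sidestep the relative-path problem.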

create wrapper script / online submission form to apply Legistar template

We are prospectively writing templates for common city content management systems like Legistar (town-council/council_crawler/templates/legistar_cms.py). This template can hopefully be used for all cities running the Legistar CMS.

Issue: develop a wrapper script that lets a Python user enter the fields needed to apply the template (e.g., city, state, URL); a rough sketch follows below.

Or, ideally, this could be made even more user-friendly for non-technical users (input info into a Google Form/Document --> output is the city.py scraper?).
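One possible shape for the wrapper (the LegistarSpider base class name and its fields are hypothetical; the real template in legistar_cms.py may expose different hooks):

    from pathlib import Path

    # Hypothetical spider skeleton generated from user input.
    SPIDER_TEMPLATE = '''\
    from council_crawler.templates.legistar_cms import LegistarSpider

    class {class_name}(LegistarSpider):
        name = '{city_slug}'
        city = '{city}'
        state = '{state}'
        start_urls = ['{url}']
    '''

    def make_spider():
        city = input('City name: ').strip()
        state = input('State (two-letter): ').strip().upper()
        url = input('Legistar URL: ').strip()
        city_slug = city.lower().replace(' ', '_')
        source = SPIDER_TEMPLATE.format(
            class_name=city.title().replace(' ', '') + 'Spider',
            city_slug=city_slug, city=city, state=state, url=url)
        Path('council_crawler/spiders/{}.py'.format(city_slug)).write_text(source)

    if __name__ == '__main__':
        make_spider()

The Google Form idea could reuse the same template: a small script reads the form's response spreadsheet and emits one city.py per row.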

System architecture

To get the conversation started, a rough outline can be found here.

Comment below with input/suggestions/improvements

apply legistar template to Bay Area initial set of cities

Apply Legistar templates in ./council_crawler/templates to the six cities that use Legistar in our initial set of cities (see list_of_cities.csv).

  • Cupertino
  • Hayward
  • Mountain View
  • San Leandro
  • San Mateo
  • Sunnyvale

(legistar_cms.py has not been validated against all Legistar sites, so check the .json output before submitting a pull request! A quick spot-check is sketched below.)
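One way to spot-check the output before a PR (assuming the spider was run with scrapy crawl cupertino -o cupertino.json, which writes a single JSON array; the filename is just an example):

    import json

    with open('cupertino.json') as f:
        events = json.load(f)

    for event in events:
        # Every item should be an event with at least one document, and
        # every document URL should already be absolute.
        assert event['_type'] == 'event'
        assert event['documents'], 'no documents for {}'.format(event['name'])
        for doc in event['documents']:
            assert doc['url'].startswith('http'), doc['url']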

municipality naming scheme

Find a system for uniquely identifying cities within the database (e.g., to resolve the possible ambiguity of the "city name, state" naming scheme). One possible direction is sketched below.
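As a starting point (purely an assumption, not a settled scheme), a composite key could include a disambiguating component such as the county:

    def place_id(city, state, county=None):
        # e.g. place_id('Belmont', 'CA') -> 'ca-belmont'
        #      place_id('Greenville', 'CA', 'Plumas') -> 'ca-plumas-greenville'
        parts = [state, county, city]
        return '-'.join(p.lower().replace(' ', '_') for p in parts if p)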

Fix URL logic in Belmont Spider

Some document links on Belmont's page are full URLs, but others are relative links.

Task:
Implement logic to check whether a link is relative. If so, join it with the base URL so that the full URL is saved in the documents.url field. (A sketch using urljoin follows the examples below.)

Ex:
Full URL stored (correct):

{
    "_type": "event",
    "name": "Belmont, CA City Council Planning Commission Meeting",
    "scraped_datetime": "2017-06-04 11:03:33",
    "record_date": "06/06/2017 7:00 PM ",
    "source": "belmont",
    "source_url": "http://www.belmont.gov/city-hall/city-government/city-meetings/-toggle-all",
    "meeting_type": "Planning Commission Meeting",
    "documents": [
        {
            "media_type": "text/html",
            "url": "http://belmont-ca.granicus.com/GeneratedAgendaViewer.php?event_id=a4bb6ac3-4714-11e7-b343-f04da2064c47",
            "url_hash": "d5fae19fe93ee064274e3c7a26fabf88",
            "category": "agenda"
        }
    ]
}

Partial/relative URL stored (incorrect):

{
    "_type": "event",
    "name": "Belmont, CA City Council Audit Committee Meeting",
    "scraped_datetime": "2017-06-04 11:03:33",
    "record_date": "06/05/2017 10:00 AM - 11:00 AM ",
    "source": "belmont",
    "source_url": "http://www.belmont.gov/city-hall/city-government/city-meetings/-toggle-all",
    "meeting_type": "Audit Committee Meeting",
    "documents": [
        {
            "media_type": "application/pdf",
            "url": "/Home/ShowDocument?id=15455",
            "url_hash": "c180f50e6d81ca6462de38a073f19d7d",
            "category": "agenda"
        }
    ]
}
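A minimal sketch of the fix using urljoin, which leaves absolute URLs untouched and resolves relative paths against the page URL:

    from urllib.parse import urljoin

    base = 'http://www.belmont.gov/city-hall/city-government/city-meetings/-toggle-all'

    urljoin(base, '/Home/ShowDocument?id=15455')
    # -> 'http://www.belmont.gov/Home/ShowDocument?id=15455'

    urljoin(base, 'http://belmont-ca.granicus.com/GeneratedAgendaViewer.php?event_id=a4bb6ac3-4714-11e7-b343-f04da2064c47')
    # -> unchanged, because the link is already absolute

Inside the spider, response.urljoin(href) does the same thing with the response's own URL as the base.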

Standardize Belmont record_date field

The record_date output by the Belmont spider is the raw text found on the website. The text should be parsed and stored as a Python date object. (For reference, this was recently fixed in the Dublin spider.) A parsing sketch follows below.

Current output:
record_date": "05/23/2017 6:30 PM - 7:00 PM "

Desired output:
"record_date": "2017-05-23"
