civicactions / allusgov

License: GNU General Public License v3.0

Python 96.63% Makefile 3.37%

allusgov's Introduction

Overview

This project attempts to map the organization of the US Federal Government by gathering and consolidating information from various directories.

Current sources:

Each source is scraped into raw JSON format (see the out directory). Each record includes the organizational unit's name and its parent (if any), unique ID and parent-ID fields (for sources where names are not unique), and any other attribute data that the source provides for the organization.
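As a rough illustration of the record shape described above, here is a minimal sketch. The field names (`id`, `name`, `parent_id`, `url`, `phone`) and the sample values are assumptions made for this example and are not taken from the project's actual output.

```python
import json

# Hypothetical records: field names and values are illustrative assumptions,
# not the project's actual schema.
records = [
    {"id": "1", "name": "Department of Example", "parent_id": None, "url": "https://example.gov"},
    {"id": "2", "name": "Office of Samples", "parent_id": "1", "phone": "555-0100"},
    {"id": "3", "name": "Office of Samples", "parent_id": "2"},
]

def index_by_id(rows):
    """Index records by unique ID so that units sharing a name stay distinct."""
    return {row["id"]: row for row in rows}

def parent_name(row, by_id):
    """Resolve a record's parent name through its parent ID, if it has one."""
    parent = by_id.get(row["parent_id"])
    return parent["name"] if parent else None

by_id = index_by_id(records)
print(json.dumps(records[1], indent=2))
print(parent_name(records[2], by_id))
```

Records 2 and 3 share a name, which is why the sketch resolves parents through the ID fields rather than through names.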

A normalized name (still a work in progress) is then added, which corrects letter case and spacing and expands acronyms. Acronyms are selected and verified manually using data from UCSD GovSpeak and the DOD Dictionary of Military and Associated Terms, with manual entry where needed.
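The normalization step could look roughly like the sketch below. It is not the project's implementation: the `ACRONYMS` table, the `SMALL_WORDS` set and the `normalize_name` function are hypothetical names introduced here, and the real acronym list is curated manually as described above.

```python
import re

# Hypothetical acronym table; the project maintains its own curated list.
ACRONYMS = {"DOD": "Department of Defense", "OFC": "Office"}
# Words kept lowercase unless they start the name (an assumption for this sketch).
SMALL_WORDS = {"of", "the", "and", "for", "in"}

def normalize_name(raw):
    """Collapse whitespace, expand known acronyms, and fix letter case."""
    words = re.sub(r"\s+", " ", raw).strip().split(" ")
    out = []
    for i, word in enumerate(words):
        expansion = ACRONYMS.get(word.upper())
        if expansion:
            out.append(expansion)          # replace the acronym with its expansion
        elif i > 0 and word.lower() in SMALL_WORDS:
            out.append(word.lower())       # keep connecting words lowercase
        else:
            out.append(word.capitalize())  # title-case everything else
    return " ".join(out)

print(normalize_name("  OFC   OF THE  secretary "))
```

A table lookup like this only handles whole-word acronyms; ambiguous acronyms would still need the manual verification the project describes.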

Each source is then imported into a tree and exported into the following formats for easy consumption:

  • Plain text tree
  • JSON flat format (with path to each element)
  • JSON nested tree format
  • CSV format (with embedded JSON attributes)
  • Wide CSV format (with flattened attributes)
  • DOT file (does not include attributes)
  • GEXF graph file (includes flattened attributes)
  • GraphML graph file (includes flattened attributes)
  • Cytoscape.js JSON format (includes flattened attributes)
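To make three of the simpler formats concrete, the sketch below builds a tree from parent-ID rows and emits a plain text tree, a flat list with the path to each element, and a nested structure. The row layout and all function names are assumptions made for this example, using only the standard library; the project's own exporters may work differently.

```python
from collections import defaultdict

# Hypothetical rows: (id, name, parent_id)
rows = [
    ("1", "Department of Example", None),
    ("2", "Office of Samples", "1"),
    ("3", "Sample Review Board", "2"),
]

def build_children(rows):
    """Group (id, name) pairs under their parent ID; roots sit under None."""
    children = defaultdict(list)
    for node_id, name, parent_id in rows:
        children[parent_id].append((node_id, name))
    return children

def nested(children, parent_id=None):
    """Nested tree format: each node carries a list of its children."""
    return [{"name": name, "children": nested(children, node_id)}
            for node_id, name in children.get(parent_id, [])]

def flat(children, parent_id=None, path=()):
    """Flat format: one entry per node, with the path from the root."""
    out = []
    for node_id, name in children.get(parent_id, []):
        node_path = path + (name,)
        out.append({"name": name, "path": list(node_path)})
        out.extend(flat(children, node_id, node_path))
    return out

def text_tree(children, parent_id=None, depth=0):
    """Plain text tree: one line per node, indented two spaces per level."""
    lines = []
    for node_id, name in children.get(parent_id, []):
        lines.append("  " * depth + name)
        lines.extend(text_tree(children, node_id, depth + 1))
    return lines

children = build_children(rows)
print("\n".join(text_tree(children)))
```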

To merge the lists, each tree is merged into a selected base tree: the normalized name of each node is compared with the names of the nodes in the base tree using a fuzzy matching algorithm. The similarity score between each pair of parents is incorporated into the overall score, which helps to distinguish cases where the same or a similar office or program name is used by different organizations.
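The idea of folding parent similarity into the match score can be sketched as follows. This uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever fuzzy matcher the project uses; the `parent_weight` and `threshold` values, and the function names, are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def combined_score(node, candidate, parent_weight=0.3):
    """Blend name and parent-name similarity; both arguments are (name, parent_name)."""
    name_score = similarity(node[0], candidate[0])
    parent_score = similarity(node[1] or "", candidate[1] or "")
    return (1 - parent_weight) * name_score + parent_weight * parent_score

def best_match(node, base_nodes, threshold=0.85):
    """Return the best-scoring base node, or None if nothing clears the threshold."""
    if not base_nodes:
        return None
    scored = [(combined_score(node, c), c) for c in base_nodes]
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None

# Two base nodes share a name; the parent score breaks the tie.
base = [
    ("Office of Public Affairs", "Department of Energy"),
    ("Office of Public Affairs", "Department of State"),
]
print(best_match(("Office of Public Affairs", "Department of State"), base))
```

Without the parent term, both base nodes would score identically on name alone, which is the ambiguity the parent comparison is meant to resolve.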

Note that the fuzzy matching is imperfect: some mappings may be inaccurate (although most appear OK), and some entries that should have been merged certainly are not.

The final merged dataset is written in the above formats to the data/merged directory.

Setup

Requirements

Installation

Check out this repository, then from the repository root, install dependencies:

poetry install

See command line usage:

poetry run allusgov --help

Run a complete scrape and merge:

poetry run allusgov
