
Comments (6)

esonderegger commented on August 16, 2024

If I had one regret with respect to how this library is set up, it would be that I wish I had used a git submodule to have the CSV files from https://github.com/dwillis/fech-sources be the source of mapping data.

At the time, the Senate was doing its filings via paper, and those Senate filings were important to my employer. So I spent a fair amount of time extending the js/json file that @chriszs had built for fec-parse (which also uses fech-sources as its upstream source of data). I found the json easy to hand-edit, but getting the mappings I added back into fech-sources has long been on my to-do list.

The most definitive source of data for these mappings is the xls/xlsx files the FEC hosts here. Click to expand "Electronically filed reports" and "Paper filed reports" and then click to download the "File formats, header file and metadata" files. I also highly recommend reading through those files before embarking on writing a parser, as they will be a huge help in understanding how the filings are structured.

To answer your two questions:

  • The CSVs are complete for filings like the F3 - the F3N, F3A, and F3T all use the same mappings. I believe N stands for "new", A stands for "amended", and T stands for "terminate". (For anyone who knows for sure, please correct me if I'm wrong on these.) For example, if a committee is filing an F3 for the first time in a given cycle, it's an F3N. If they need to amend it, it's an F3A. If the committee is shutting down, their last F3 filing will be an F3T.
  • I wouldn't recommend using my types.json file for anything, as it's nowhere near complete. I've started working on a script to merge mappings from the json back into the CSVs, and in that script I've worked on bringing in type information from the FEC's xls files. It would then be up to the maintainers of the parsing libraries for each language to map those FEC type codes onto native types (see the sketch after this list). For example, the FEC uses AMT-12 for fields that are normally monetary values, and NUM-8 for dates. One potential minefield to warn you about: fields marked A-1 could be boolean, but could also have more options. I don't think there's any way to programmatically know, and I'd recommend keeping those as strings/chars.
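
To make that concrete, here's a minimal Python sketch of what a language-level mapping might look like. The suffix-stripping rule and the converter table are my own illustration based on the descriptions above (the YYYYMMDD date format for NUM-8 is an assumption), not code from fecfile or fech-sources:

    import re
    from datetime import datetime

    # F3N/F3A/F3T all share the F3 mappings, so strip the
    # new/amended/terminated suffix before looking up a mapping.
    def base_form_type(form_type: str) -> str:
        return re.sub(r"[NAT]$", "", form_type)

    # Illustrative converters for the FEC type codes mentioned above.
    def parse_amount(value: str):
        return float(value) if value else None

    def parse_date(value: str):
        # Assumes NUM-8 dates are formatted YYYYMMDD.
        return datetime.strptime(value, "%Y%m%d").date() if value else None

    CONVERTERS = {
        "AMT-12": parse_amount,
        "NUM-8": parse_date,
        "A-1": str,  # could be boolean, could be more options: keep as a string
    }

    assert base_form_type("F3A") == "F3"
    assert parse_date("20240816") == datetime(2024, 8, 16).date()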

What I see as the pros/cons of each are:

  • The mappings.json file has some mappings for either super-old versions or obscure filings that don't exist in the CSVs. If your goal is to write a script that ingests every filing going back to the beginning (1997?) and gets something out of each one, I'd go this route.
  • The CSVs are likely to be the first to be updated if the FEC ever comes out with a new version. If you want to parse filings from present day going forward, I'd recommend using the CSVs as your data source.

One additional reason to use the CSVs as your data source: if/when you find issues in the CSVs, if you make PRs into fech-sources to fix them, they will benefit everyone, as we are all downstream from them.

Good luck! Please don't hesitate to reach out if you have any questions.

chriszs commented on August 16, 2024

I spent some time a year ago trying to reproduce the CSVs from the original source Excel files - well, actually to merge the two and create JSON Schemas with typing information. This won't come as a surprise to you, but what I found is that both sources are fairly dirty: the CSVs are sometimes incorrect, there are a ton of records once you multiply the number of fields by the number of versions, JSON Schema has a lot of depth, and it's a difficult task overall. Some of that work fed into the draft PR you used as the basis for your fech-sources contribution. A contractor for the FEC was slowly working on a similar project in their fecfile-validate project, but with a slightly different scope (just current filings). FastFEC uses a version of the two .json files, including mappings.json, which I converted from Derek's original Ruby and then Evan and I improved over time. That's as close to a clean source as you'll find, though it originally derives from the CSVs.

chriszs commented on August 16, 2024

Oh, also Evan is correct about F3s. There's a PDF technical manual somewhere on the FEC site which details some of this; if I can find it again I'll link to it.

NickCrews commented on August 16, 2024

Thank you both so much for this. Oh man I just got overwhelmed ;)

I've been looking into JSONSchema for a while, and I've tentatively concluded it's overkill for what we need, but I wanted to write down a few thoughts I had along the way.

JSONSchema musings

fecfile-validate looks as canonical as you can get. It looks like they are sourcing their schemas from the .xls files you mentioned above, but apparently they don't fully trust those .xls files either and have to hand-edit them.

@chriszs by "current filings" do you mean fecfile-validate only supports filing versions 8.3+? That wouldn't be adequate for my (and I bet others') needs. I doubt the FEC will be motivated to support older versions, so we would need to supplement this.

@chriszs mentions the combinatorial explosion, but I think we could get around this by re-using sub-schemas (see the sketch after the snippet below) - am I missing something there? Still, I'm not sure we need the full power of JSONSchema, and therefore I'm not sure it's worth bringing in that complication. Am I right that all we need extra are the dtypes that should get parsed? Like, I don't think we need the full

"form_type": {
            "title": "FORM TYPE",
            "description": "",
            "const": "SA11C",
            "examples": [
                "SA11C"
            ],
            "fec_spec": {
                "FIELD_DESCRIPTION": "FORM TYPE",
                "TYPE": "A/N-8",
                "REQUIRED": "X (error)",
                "SAMPLE_DATA": "SA11C",
                "VALUE_REFERENCE": "SA11C Only",
                "RULE_REFERENCE": null,
                "FIELD_FORM_ASSOCIATION": null
            }
        }

that JSONSchema provides.
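
For what it's worth, here's a rough sketch of the sub-schema reuse I have in mind, written as a Python dict and checked with the jsonschema library. The field names and patterns are made up for illustration; the point is just that shared definitions can be referenced from many per-version schemas via $ref:

    from jsonschema import validate  # pip install jsonschema

    # Hypothetical shared field definitions; each filing version's schema
    # can point at these with $ref instead of repeating them.
    schema = {
        "definitions": {
            "amount": {"type": "string", "pattern": r"^-?\d*\.?\d*$"},
            "date8": {"type": "string", "pattern": r"^\d{8}$"},
        },
        "type": "object",
        "properties": {
            "contribution_amount": {"$ref": "#/definitions/amount"},
            "contribution_date": {"$ref": "#/definitions/date8"},
        },
    }

    # Raises jsonschema.ValidationError if the instance doesn't conform.
    validate({"contribution_amount": "250.00", "contribution_date": "20240816"}, schema)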

Path Forward

OK, it sounds like updating fech-sources is what both of you are most supportive of, and I think that would work just fine for me. Adding types would be great, but even just having the mappings functional would be fine. I think the todos would be

[] merge dwillis/fech-sources#11
[] figure out dwillis/fech-sources#12
[] use the migration scripts @esonderegger wrote to bring back in the .json stuff, and add types. I can help here @esonderegger if you point me in the right direction.

CC @mjtravers from fecfile-validate, if you have any thoughts on how we could team up at all.

freedmand commented on August 16, 2024

Hi! Firstly: sorry I haven't been able to find time to get to your PR in FastFEC. (Though I have validated that there is no perf difference in your version, I did notice some diffs that I'm still going through to figure out.) We will be focusing more on FEC at The Post later this year (I'm hoping to find time sooner). But I do want to chime in here to say:

  1. Great to have you working on this! I'm very curious to know what your goals/motivation are for developing this generally. I would love to be able to collaborate as effectively as possible. It's a small world of folks tackling these problems, and you've cc'd a good chunk of them. If you ever want to hop on a call to discuss any of this, I'd be game to find some time.

  2. FastFEC very much comes from translating fecfile to C, and it is downstream of all the work you've identified. It seems like you've already mostly uncovered this lineage, in addition to how the mapping files have been handed down. At some point, cleaning all this up and having a centralized source for these that's community- and/or FEC-maintained would be wonderful. I think the filings <8.3 are handled decently well by the current mappings files; at least they have been for our purposes at The Post loading many historic filings into a centralized/searchable database. But it would be worth investigating further; there's surely possible improvements.

  3. @chriszs indeed has put some time into trying to standardize the typings in a unified way. He can correct me, but re: combinatorial expansion, I think there are just very minute differences between each version that would be hard to capture in a nice way, even with reusing sub-schemas. It's a painstaking process generally, as the source xls files mentioned above are not always perfect. And despite all this, filings themselves have various errors too, so parsers may need several layers of tolerance baked in to handle weirdness - filings can be very messy, e.g. missing columns, shifted columns, inconsistent formats, missing characters, etc. (see the sketch below for one such layer).
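
As one example of the kind of tolerance I mean (a sketch, not FastFEC's actual code): a parser might force every raw row to the field count the mapping expects, padding short rows and stashing overflow columns instead of failing outright. The field names here are illustrative:

    # Force a raw row to the expected shape rather than raising.
    def normalize_row(values: list[str], fields: list[str]) -> dict:
        row = dict(zip(fields, values))
        for missing in fields[len(values):]:
            row[missing] = None  # short row: fill in missing columns
        extra = values[len(fields):]
        if extra:
            row["_extra"] = extra  # long row: keep the overflow, don't crash
        return row

    fields = ["form_type", "filer_id", "contribution_amount"]
    print(normalize_row(["SA11C", "C00123456"], fields))
    # {'form_type': 'SA11C', 'filer_id': 'C00123456', 'contribution_amount': None}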

Looking forward to seeing what you come up with. And thanks for organizing this discussion.

chriszs commented on August 16, 2024

Yes, my design heavily uses sub-schemas.

Yes, there are a lot of edge cases.

Correct that fecfile-validate only seems interested in the current version.

I think a plan that focuses on improving the CSVs and converting from there sounds reasonable.
