data_curators

Tools for curating data.

Requirements

  • Python >= 3.8

Dictionary to Dataclass curator

from datacurators.dataclass_curator import dataclass_curator

Accepts any dataclass and a dictionary. Returns a dictionary that unpacks into the dataclass without error.

This is done by matching the dataclass attributes against the dictionary's keys. By default, missing keys are added with a value of None, and unexpected keys are removed from the result.

  1. Does not enforce type-hints
  2. Does not override attribute defaults
  3. Default value for missing keywords can be set per call
  4. Behavior of adding/removing values can be toggled
  5. Optional logging of changes made as INFO level logs

import dataclasses

from datacurators.dataclass_curator import dataclass_curator

@dataclasses.dataclass
class Example:
    first_value: str
    second_value: int
    third_value: bool = False

sample_input = {
    "some_value": "This doesn't match",
    "second_value": 42,
}

# This raises a TypeError because:
#   - `first_value` is missing
#   - `some_value` is an unexpected keyword
try:
    mydataclass = Example(**sample_input)
except TypeError as err:
    print(err)

# This does not raise because the curator:
#   - Adds `first_value` with a default of None
#   - Removes `some_value`
mydataclass = Example(**dataclass_curator(Example, sample_input))
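
The matching described above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation: `curate_sketch` and its `default` parameter are hypothetical names, and the sketch omits the toggles and logging listed earlier. It assumes that fields with their own defaults are left out of the result so the dataclass default still applies.

```python
import dataclasses
from typing import Any, Dict


def curate_sketch(cls: type, data: Dict[str, Any], default: Any = None) -> Dict[str, Any]:
    """Hypothetical helper: shape `data` so it unpacks into dataclass `cls`."""
    result: Dict[str, Any] = {}
    for field in dataclasses.fields(cls):
        if not field.init:
            # Fields excluded from __init__ cannot be passed as keywords
            continue
        if field.name in data:
            # Keep values whose key matches a dataclass attribute
            result[field.name] = data[field.name]
        elif (
            field.default is dataclasses.MISSING
            and field.default_factory is dataclasses.MISSING
        ):
            # Only fill fields that have no default of their own
            result[field.name] = default
    # Keys in `data` that match no field are never copied, so they are dropped
    return result


@dataclasses.dataclass
class Example:
    first_value: str
    second_value: int
    third_value: bool = False


curated = curate_sketch(Example, {"some_value": "no match", "second_value": 42})
print(Example(**curated))
# Example(first_value=None, second_value=42, third_value=False)
```

As in the library's description, no type-hints are enforced here: `first_value` is annotated as `str` but receives None.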

A better .update() for nested mappings

from datacurators.nested_update import nested_update

Merge two mappings, including nested mappings, returning a new result.

The result starts as a copy of map1. Key/value pairs that are unique to map2 are added to the result. Keys that exist in both are overwritten with the value from map2, as in result[key] = map2[key]. Nested mappings are merged recursively with the same logic.

Note: Collections of mappings, such as a list of dicts, must be handled with your own implemented logic. This is because there is no way to ensure the order in which objects of the list are updated.
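
One way to handle a list of dicts yourself is to merge items by an identifying key, which removes the dependence on list order. The helper below is a hypothetical sketch and not part of this library. It assumes every item carries a unique "id" key and that a shallow merge of each item is enough.

```python
from typing import Any, Dict, List


def merge_by_id(
    old_items: List[Dict[str, Any]],
    new_items: List[Dict[str, Any]],
    key: str = "id",
) -> List[Dict[str, Any]]:
    """Hypothetical helper: merge two lists of dicts, matching items on `key`."""
    # Index copies of the existing items by their identifier
    merged = {item[key]: dict(item) for item in old_items}
    for item in new_items:
        # Update a matching item, or add the new item if the id is unseen
        merged[item[key]] = {**merged.get(item[key], {}), **item}
    # dicts preserve insertion order, so existing items keep their positions
    return list(merged.values())


old = [{"id": 0, "name": "a"}, {"id": 1, "name": "b"}]
new = [{"id": 1, "name": "B"}, {"id": 2, "name": "c"}]
print(merge_by_id(old, new))
```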

import json
from datacurators.nested_update import nested_update

STEP01 = {"name": "preocts", "type": {}}

STEP02 = {
    "name": "Preocts",
    "type": {
        "style": "egg",
        "size": "smol",
    },
    "likes": [
        {"id": 0},
        {"id": 1},
    ],
}

STEP03 = {
    "type": {
        "style": "Egg",
        "shell": "thicc",
    },
    "likes": [
        {"id": 0},
    ],
}

step_one = nested_update(STEP01, STEP02)
step_two = nested_update(step_one, STEP03)

print(json.dumps(step_one, indent=2))
print(json.dumps(step_two, indent=2))

Expected results

{
  "name": "Preocts",
  "type": {
    "style": "egg",
    "size": "smol"
  },
  "likes": [
    {
      "id": 0
    },
    {
      "id": 1
    }
  ]
}
{
  "name": "Preocts",
  "type": {
    "style": "Egg",
    "size": "smol",
    "shell": "thicc"
  },
  "likes": [
    {
      "id": 0
    }
  ]
}
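
The recursive merge described in this section can be sketched as follows. This is an illustration under stated assumptions, not the library's actual code: `nested_update_sketch` is a hypothetical name, and the sketch assumes deep copies are acceptable for keeping both inputs unchanged. Non-mapping values, including lists, are replaced wholesale, which matches the "likes" list in the expected results.

```python
import copy
from collections.abc import Mapping
from typing import Any, Dict


def nested_update_sketch(map1: Mapping, map2: Mapping) -> Dict[Any, Any]:
    """Hypothetical helper: merge map2 into a copy of map1, recursing into mappings."""
    # Start from a deep copy so neither input is mutated
    result = copy.deepcopy(dict(map1))
    for key, value in map2.items():
        if isinstance(value, Mapping) and isinstance(result.get(key), Mapping):
            # Both sides hold a mapping: apply the same logic one level down
            result[key] = nested_update_sketch(result[key], value)
        else:
            # New key, or a non-mapping value: take the value from map2
            result[key] = copy.deepcopy(value)
    return result


base = {"name": "preocts", "type": {"style": "egg", "size": "smol"}}
patch = {"type": {"style": "Egg", "shell": "thicc"}, "likes": [{"id": 0}]}
merged = nested_update_sketch(base, patch)
print(merged)
```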

Local developer installation

Using a venv for installation is highly recommended. A venv ensures the installed dependencies do not affect other Python projects.

Clone this repo and enter its root directory:

$ git clone https://github.com/preocts/data_curators
$ cd data_curators

Create and activate venv:

# Linux/MacOS
python3 -m venv venv
. venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate.bat
# or
py -m venv venv
venv\Scripts\activate.bat

Your command prompt should now have a (venv) prefix on it.

Install editable library and development requirements:

# Linux/MacOS
pip install -r requirements-dev.txt
pip install --editable .

# Windows
python -m pip install -r requirements-dev.txt
python -m pip install --editable .
# or
py -m pip install -r requirements-dev.txt
py -m pip install --editable .

Install pre-commit hooks to local repo:

pre-commit install
pre-commit autoupdate

Run tests:

tox

To exit the venv:

deactivate

Makefile

This repo has a Makefile with some quality of life scripts if your system supports make.

  • install : Clean all artifacts, update pip, install requirements with no updates
  • update : Clean all artifacts, update pip, update requirements, install everything
  • build-dist : Build source distribution and wheel distribution
  • clean-pyc : Delete Python/mypy artifacts
  • clean-tests : Delete tox, coverage, and pytest artifacts
  • clean-build : Delete build artifacts
  • clean-all : Run all clean scripts
