
moj-analytical-services / dataengineeringutils


A Python package containing functions that help manage our data management processes on AWS

License: MIT License

Python 91.02% Jupyter Notebook 8.98%
data-engineering

dataengineeringutils's Introduction

Data Engineering Utils

A Python package containing functions that help manage our data management processes on AWS

To install this package, run: pip install git+git://github.com/moj-analytical-services/dataengineeringutils.git#egg=dataengineeringutils

To update the package, uninstall it first and then reinstall, i.e. run: pip uninstall dataengineeringutils

Warning: This package has the following dependencies:

  • numpy
  • pandas
  • io
  • boto3

This package doesn't declare its dependencies because I hit errors with io when installing via pip, so I have left the list blank for now ¯\_(ツ)_/¯
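Since the dependencies are not declared, pip will not install them for you. A minimal sketch of a pre-flight check, assuming you just want to know which of the modules listed above are missing from the current environment. The helper name `missing_dependencies` is hypothetical and not part of this package. Note that io is part of the Python standard library, so it should always be importable and does not need installing.

```python
import importlib.util

# Dependencies named in the README. `io` ships with the Python standard
# library, so it never needs installing via pip.
README_DEPENDENCIES = ["numpy", "pandas", "io", "boto3"]


def missing_dependencies(module_names):
    """Return the names from module_names that cannot be imported here.

    Hypothetical helper for illustration only; not part of dataengineeringutils.
    """
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_dependencies(README_DEPENDENCIES)
if missing:
    print("Install these first: pip install " + " ".join(missing))
else:
    print("All listed dependencies are importable.")
```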

dataengineeringutils's People

Contributors

isichei, robinl

dataengineeringutils's Issues

Cannot use float in pandas for columns of type int

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
Because of the above gotcha, we started by using floats in pandas where the metadata said the column was an int.

This solution can't work: if we impose metadata types on a pandas dataframe, ints get converted to floats and CSVs are output with values like 9.0 rather than 9. Hive then throws an error when it sees such a value in a column that is meant to be an int.

NA support for pandas ints is coming soon (0.24.0), documented here

The principle of least surprise suggests we treat ints as ints in pandas, and then deal explicitly with the current problems around NAs.
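A minimal sketch of the problem and one possible way around it, assuming pandas 0.24.0 or later, where the nullable "Int64" dtype (capital I) is available. The column name `n` and the sample values are made up for illustration. The float column should write 9.0 to the CSV, while the nullable integer column should write 9 and leave the missing value empty.

```python
import numpy as np
import pandas as pd

# An "int" column with a missing value: pandas falls back to float64,
# because the default numpy-backed int dtype cannot hold NaN.
df = pd.DataFrame({"n": [9, np.nan]})
float_dtype = str(df["n"].dtype)
float_csv = df.to_csv(index=False)

# Cast to the nullable integer dtype (assumes pandas >= 0.24.0).
df["n"] = df["n"].astype("Int64")
int_dtype = str(df["n"].dtype)
int_csv = df.to_csv(index=False)

print(float_dtype, float_csv.splitlines())
print(int_dtype, int_csv.splitlines())
```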

Impose metadata conformance doesn't guarantee strings

robinlinacre [11:11 AM]
little warning/note about DataFrame.to_parquet. If you load a string column into pandas, e.g. from a csv, the dtype will be object.
But if that col contains e.g. 2,2,2C,2, then you might end up with a column of mixed types:
the dtype will still be object, but some of the contents will be floats/ints and some will be strings.
to_parquet will fail because columns in parquet have to be a single type,
so you have to explicitly convert using df[col] = df[col].astype(str)
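A minimal sketch of the mixed-type situation described above, built by hand rather than read from a csv. The column name `code` is hypothetical. It only inspects the Python types in the column before and after the conversion and does not call to_parquet, so no parquet engine is needed to run it. Be aware that astype(str) also turns missing values into the literal string "nan", which may need handling separately.

```python
import pandas as pd

# A column that is meant to hold strings but contains a mix of ints and strs.
df = pd.DataFrame({"code": [2, 2, "2C", 2]})

# The dtype is object, which hides the fact that the values differ in type.
dtype_before = str(df["code"].dtype)
types_before = {type(value).__name__ for value in df["code"]}

# Explicitly convert every value to a string so the column has a single type.
df["code"] = df["code"].astype(str)
types_after = {type(value).__name__ for value in df["code"]}

print(dtype_before, types_before)
print(types_after, df["code"].tolist())
```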
