
khp-data's Introduction

👋

I'm currently the Co-Founder & CEO of SELECT, a data warehouse cost observability & optimization platform. If you are trying to optimize your Snowflake usage, improve performance, or are curious to learn more about Snowflake in general, don't hesitate to email me - I'd be more than happy to chat!

khp-data's People

Contributors

ian-whitestone


khp-data's Issues

CSI Pipeline

  • Download from FTP Server
  • Upload files to S3
    • Check which files are already uploaded
  • Upload files to DB
    • Check which files have already been uploaded (need a table with metadata of what has been loaded)
  • Delete files from S3
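A minimal sketch of the "check which files are already uploaded" step. The bucket/prefix values and the helper names are hypothetical; the S3 client is injected so the pure bookkeeping logic is testable without boto3 or network access.

```python
def already_uploaded(s3_client, bucket, prefix=""):
    """Return the set of keys already present under bucket/prefix.

    s3_client is a boto3 S3 client (injected for testability).
    """
    keys = set()
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys

def files_to_upload(local_files, uploaded_keys):
    """Pure set difference: local files not yet uploaded."""
    return sorted(set(local_files) - set(uploaded_keys))
```

The same set-difference pattern works for the "upload files to DB" step, with the uploaded-keys set coming from the load-metadata table instead of S3.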

Database documentation

Document schema and add field descriptions

Probably do it manually in an .rst file and incorporate it into Sphinx.

Use dask to speed up process

Historical loading (i.e. re-processing all contacts) can take some time.

Look into using dask for:

  • Downloading contacts & loading them to db
  • Downloading transcripts & loading them to db
  • Enhanced transcripts
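A minimal sketch of the dask.delayed pattern for the steps above. `process_contact` is a hypothetical stand-in for the download-and-load work, and the `except ImportError` shim just lets the sketch run sequentially when dask isn't installed.

```python
try:
    from dask import delayed, compute  # dask builds a task graph and runs tasks in parallel
except ImportError:
    # fallback so this sketch runs (sequentially) without dask installed
    def delayed(fn):
        return fn
    def compute(*tasks):
        return tasks

def process_contact(contact_id):
    # hypothetical stand-in for: download one contact, transform, load to db
    return f"loaded contact {contact_id}"

# one delayed task per contact; with dask these run across threads/processes
tasks = [delayed(process_contact)(cid) for cid in range(4)]
results = compute(*tasks)
```

The same shape should apply to transcripts and enhanced transcripts: wrap the per-record function in `delayed`, collect the tasks, and `compute` them in one go.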

Refactoring

  • Use glom for parsing API response data
  • Delete files after loading to database
  • Timestamp handling
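For the glom idea, a hedged sketch of its dotted-path spec style; the response payload below is made up, and the `except ImportError` shim reimplements only the tiny subset of glom used here so the sketch runs without the dependency.

```python
try:
    from glom import glom
except ImportError:
    # tiny fallback covering only dotted-path strings and dict specs
    from functools import reduce
    def glom(target, spec):
        if isinstance(spec, dict):
            return {k: glom(target, v) for k, v in spec.items()}
        return reduce(lambda d, key: d[key], spec.split("."), target)

# made-up shape of an API response
response = {"contact": {"id": 17, "survey": {"nps": 9}}}

# a dict spec maps output field names to paths in the nested response
record = glom(response, {"contact_id": "contact.id", "nps": "contact.survey.nps"})
```

The win over hand-rolled `response["contact"]["survey"]["nps"]` chains is that the spec documents the expected shape in one place.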

FTP library errors

Bytes versus strings strikes again...

In the FTP.getresp() method, the KHP FTP server returns a bytes message where the FTP library expects a string.

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/whitesi/khp/ImplicitlyTLS.py", line 42, in connect
    self.welcome = self.getresp()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 236, in getresp
    resp = self.getmultiline()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 222, in getmultiline
    line = self.getline()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 213, in getline
    elif line[-1:] in CRLF:
TypeError: 'in <string>' requires string as left operand, not bytes

Potential Fix
Adding this check in the library solved the problem. However, that will be really annoying in prod because ftplib is a standard library module.

def getline(self):
    line = self.file.readline(self.maxline + 1)

    #Ian: added this check
    if type(line) == bytes:
        line = line.decode('ascii')

    if len(line) > self.maxline:
        raise Error("got more than %d bytes" % self.maxline)
    if self.debugging > 1:
        print('*get*', self.sanitize(line))
    if not line:
        raise EOFError
    if line[-2:] == CRLF:
        line = line[:-2]
    elif line[-1:] in CRLF:
        line = line[:-1]
    return line

FTCI Pipeline

  • Download from FTP Server
  • Upload files to S3
    • Check which files are already uploaded
  • Upload files to DB
    • Check which files have already been uploaded (need a table with metadata of what has been loaded)
  • Delete files from S3

Exploratory Data Analysis

  • How many high-distress conversations did we miss?
  • Do highly distressed kids drop off more? Or wait longer?
  • How often do kids try to chat before the conversation has started?
  • How does wait time influence drop-off rate?
  • Can you quantify repeat callers?
  • How do factors like chat time, message length, response time, wait time, etc. affect NPS?

Other Modelling Ideas:

  • Topic modelling
    • unsupervised clustering, then manually tag based on keyword frequencies
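A hedged sketch of the manual-tagging half of that idea: given cluster labels from whatever unsupervised step is chosen, surface the top keywords per cluster so a human can assign topic names. All data and the stopword list below are made up for illustration.

```python
from collections import Counter, defaultdict

def top_keywords_per_cluster(docs, labels, n=3,
                             stopwords=frozenset({"the", "a", "i", "to"})):
    """For each cluster label, return the n most frequent non-stopword terms."""
    buckets = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        words = [w for w in doc.lower().split() if w not in stopwords]
        buckets[label].update(words)
    return {label: [w for w, _ in counts.most_common(n)]
            for label, counts in buckets.items()}
```

The keyword lists would then be reviewed by hand and mapped to topic tags, which sidesteps needing labeled training data up front.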

Pandas read from s3

Not at my computer, so capturing thoughts here re: a wrapper package that provides a method for reading an S3 object into pandas.

  • Instead of calling pd.read_csv, call s3pd.read_csv
  • s3pd will have boto3 & pandas as optional dependencies; don't want to force a specific version of boto3 or pandas on people, so would need to figure out which versions it's compatible with
  • General flow would be:
    • take the filename/buffer argument from the read_csv call and fetch that S3 object
    • s3pd.read_csv would need to take some additional, optional arguments (on top of the usual pd.read_csv args) that could specify S3-specific options
    • pass the S3 object as a StringIO to pd.read_csv
      • would be nice to be able to stream the object as well if it's big; follow discussions here and here for how this can be implemented
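A minimal sketch of that flow. The `s3pd.read_csv` name comes from the notes above; the lazy imports match the optional-dependency idea, and the injectable client is an assumption added here so the parsing path can be exercised without boto3 or network access.

```python
import io

def read_csv(path, s3_client=None, **kwargs):
    """Read s3://bucket/key into a DataFrame; kwargs pass through to pd.read_csv."""
    import pandas as pd        # optional dependency, imported lazily
    if s3_client is None:
        import boto3           # only required when no client is injected
        s3_client = boto3.client("s3")
    bucket, _, key = path.replace("s3://", "").partition("/")
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    # small objects: buffer fully in memory; streaming is the next step for big files
    return pd.read_csv(io.BytesIO(body), **kwargs)
```

Because everything after the fetch is just `pd.read_csv` on a buffer, the same wrapper shape should extend to read_json, read_parquet, etc.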

FTCI Summary Report

  • Summarize evt_type_cd values into agent metadata (at the daily level)
    • number of calls taken

Enhanced transcripts table

Build a Python job to process the transcript database records and output them to a new table, enhanced_transcripts

  • Options to re-run for all records, or only pull records that haven't been processed
    • Full re-run done by clearing the enhanced_transcripts table
  • Run dataframe through transformations
    • queue wait time
    • conversation duration
    • message length (mean, median, max)
      • split by counsellor and counsellee
    • response time (mean, median, max)
      • split by counsellor and counsellee
    • survey info
  • Load to database
    • Starting to think a key-value store would be way better for this, as I will probably be adding new metrics over time... food for thought later on.
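A sketch of one of the transformations above: message length split by counsellor/counsellee. The column names (`sender`, `message`) and the sample rows are assumptions for illustration, not the real schema.

```python
import pandas as pd

def message_length_metrics(transcript):
    """Mean/median/max message length, split by sender."""
    transcript = transcript.assign(msg_len=transcript["message"].str.len())
    return transcript.groupby("sender")["msg_len"].agg(["mean", "median", "max"])

# made-up transcript rows
df = pd.DataFrame({
    "sender": ["counsellor", "counsellee", "counsellee"],
    "message": ["how are you feeling?", "bad", "really bad today"],
})
metrics = message_length_metrics(df)
```

Response-time metrics would follow the same groupby/agg shape, just computed on timestamp diffs instead of string lengths.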

MS SQL Server RDS

Testing out PowerBI connection to AWS RDS instance.

  1. Created through console.
  2. Ran the following:
/* 
https://serverfault.com/questions/496966/amazon-rds-sql-server-master-user-privileges
*/

USE master;
CREATE DATABASE test_db;
USE test_db;

create table customers (
	name text,
	balance integer
)
;


insert into customers values ('ian', 1000);
insert into customers values ('rob', 1000);

use test_db;
select * from customers;
select * from [test_db].[dbo].[customers]
;

AWS Architecture

Once the intent and data sources are locked down, refactor and deploy on AWS.

Consecutive Dataframe Filters Not Applied

def filter_df(dataframe, filters):
    """Filter a dataframe

    Args:
        dataframe (pandas.DataFrame): Input dataframe
        filters (list): list of filters (dicts) to apply

    Returns:
        pandas.DataFrame: Filtered dataframe
    """
    for fltr_dict in filters:
        fltr_fn = getattr(operator, fltr_dict['operator'])
        fltr_value = fltr_dict['value']
        fltr_value = getattr(builtins, fltr_dict['value_type'])(fltr_value)
        fltr_check = fltr_fn(dataframe[fltr_dict['column']], fltr_value)
        fltr_df = dataframe[fltr_check]
    return fltr_df

Very dumb mistake where only the last filter will be applied: fltr_df is recomputed from the original dataframe on every iteration, so each pass throws away the previous filter's result.
