
khp-data's Introduction

👋

I'm currently the Co-Founder & CEO of SELECT, a data warehouse cost observability & optimization platform. If you are trying to optimize your Snowflake usage, improve performance, or are curious to learn more about Snowflake in general, don't hesitate to email me - I'd be more than happy to chat!

khp-data's People

Contributors

ian-whitestone


khp-data's Issues

CSI Pipeline

  • Download from FTP Server
  • Upload files to S3
    • Check which files are already uploaded
  • Upload files to DB
    • Check which files have already been uploaded (need a table with metadata of what has been loaded)
  • Delete files from S3
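A minimal sketch of the "check which files are already uploaded" step. The bucket/prefix values and the helper names are hypothetical; the S3 client is injected so the pure bookkeeping logic is testable without boto3 or network access.

```python
def already_uploaded(s3_client, bucket, prefix=""):
    """Return the set of keys already present under bucket/prefix.

    s3_client is a boto3 S3 client (injected for testability).
    """
    keys = set()
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys

def files_to_upload(local_files, uploaded_keys):
    """Pure set difference: local files not yet uploaded."""
    return sorted(set(local_files) - set(uploaded_keys))
```

The same set-difference pattern works for the "upload files to DB" step, with the uploaded-keys set coming from the load-metadata table instead of S3.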

Database documentation

Document schema and add field descriptions

Probably do it manually in an .rst file and incorporate it into Sphinx.

Use dask to speed up process

Historical loading (i.e. re-processing all contacts) can take some time.

Look into using dask for:

  • Downloading contacts & loading them to db
  • Downloading transcripts & loading them to db
  • Enhanced transcripts
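A minimal sketch of the dask.delayed pattern for the steps above. `process_contact` is a hypothetical stand-in for the download-and-load work, and the `except ImportError` shim just lets the sketch run sequentially when dask isn't installed.

```python
try:
    from dask import delayed, compute  # dask builds a task graph and runs tasks in parallel
except ImportError:
    # fallback so this sketch runs (sequentially) without dask installed
    def delayed(fn):
        return fn
    def compute(*tasks):
        return tasks

def process_contact(contact_id):
    # hypothetical stand-in for: download one contact, transform, load to db
    return f"loaded contact {contact_id}"

# one delayed task per contact; with dask these run across threads/processes
tasks = [delayed(process_contact)(cid) for cid in range(4)]
results = compute(*tasks)
```

The same shape should apply to transcripts and enhanced transcripts: wrap the per-record function in `delayed`, collect the tasks, and `compute` them in one go.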

Refactoring

  • Use glom for parsing API response data
  • Delete files after loading to database
  • Timestamp handling
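For the glom idea, a hedged sketch of its dotted-path spec style; the response payload below is made up, and the `except ImportError` shim reimplements only the tiny subset of glom used here so the sketch runs without the dependency.

```python
try:
    from glom import glom
except ImportError:
    # tiny fallback covering only dotted-path strings and dict specs
    from functools import reduce
    def glom(target, spec):
        if isinstance(spec, dict):
            return {k: glom(target, v) for k, v in spec.items()}
        return reduce(lambda d, key: d[key], spec.split("."), target)

# made-up shape of an API response
response = {"contact": {"id": 17, "survey": {"nps": 9}}}

# a dict spec maps output field names to paths in the nested response
record = glom(response, {"contact_id": "contact.id", "nps": "contact.survey.nps"})
```

The win over hand-rolled `response["contact"]["survey"]["nps"]` chains is that the spec documents the expected shape in one place.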

FTP library errors

Bytes versus strings strikes again...

In the FTP.getresp() method, the KHP FTP server returns a bytes message where the FTP library expects a string.

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/whitesi/khp/ImplicitlyTLS.py", line 42, in connect
    self.welcome = self.getresp()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 236, in getresp
    resp = self.getmultiline()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 222, in getmultiline
    line = self.getline()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 213, in getline
    elif line[-1:] in CRLF:
TypeError: 'in <string>' requires string as left operand, not bytes

Potential Fix
Adding this check in the library solved the problem. However, that will be really annoying in prod because ftplib is a standard library module.

def getline(self):
    line = self.file.readline(self.maxline + 1)

    #Ian: added this check
    if type(line) == bytes:
        line = line.decode('ascii')

    if len(line) > self.maxline:
        raise Error("got more than %d bytes" % self.maxline)
    if self.debugging > 1:
        print('*get*', self.sanitize(line))
    if not line:
        raise EOFError
    if line[-2:] == CRLF:
        line = line[:-2]
    elif line[-1:] in CRLF:
        line = line[:-1]
    return line

FTCI Pipeline

  • Download from FTP Server
  • Upload files to S3
    • Check which files are already uploaded
  • Upload files to DB
    • Check which files have already been uploaded (need a table with metadata of what has been loaded)
  • Delete files from S3

Exploratory Data Analysis

  • How many high-distress conversations did we miss?
  • Do highly distressed kids drop off more? Or wait longer?
  • How often do kids try to chat before the conversation has started?
  • How does wait time influence drop-off rate?
  • Can you quantify repeat callers?
  • How do factors like chat time, message length, response time, wait time, etc. affect NPS?

Other Modelling Ideas:

  • Topic modelling
    • unsupervised clustering, then manually tag based on keyword frequencies
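A hedged sketch of the manual-tagging half of that idea: given cluster labels from whatever unsupervised step is chosen, surface the top keywords per cluster so a human can assign topic names. All data and the stopword list below are made up for illustration.

```python
from collections import Counter, defaultdict

def top_keywords_per_cluster(docs, labels, n=3,
                             stopwords=frozenset({"the", "a", "i", "to"})):
    """For each cluster label, return the n most frequent non-stopword terms."""
    buckets = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        words = [w for w in doc.lower().split() if w not in stopwords]
        buckets[label].update(words)
    return {label: [w for w, _ in counts.most_common(n)]
            for label, counts in buckets.items()}
```

The keyword lists would then be reviewed by hand and mapped to topic tags, which sidesteps needing labeled training data up front.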

Pandas read from s3

Not at my computer, so capturing thoughts here re: a wrapper package that provides a method for reading an S3 object into pandas.

  • Instead of calling pd.read_csv, call s3pd.read_csv
  • s3pd will have boto3 & pandas as optional dependencies; don't want to force a specific version of boto3 or pandas on people, so would need to figure out which versions it's compatible with
  • General flow would be:
    • take the filename/buffer argument from the read_csv call and fetch that S3 object
    • s3pd.read_csv would need to take some additional, optional arguments (on top of the usual pd.read_csv args) that could specify S3-specific options
    • pass the S3 object as a StringIO to pd.read_csv
      • would be nice to be able to stream the object as well if it's big; follow discussions here and here for how this can be implemented
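A minimal sketch of that flow. The `s3pd.read_csv` name comes from the notes above; the lazy imports match the optional-dependency idea, and the injectable client is an assumption added here so the parsing path can be exercised without boto3 or network access.

```python
import io

def read_csv(path, s3_client=None, **kwargs):
    """Read s3://bucket/key into a DataFrame; kwargs pass through to pd.read_csv."""
    import pandas as pd        # optional dependency, imported lazily
    if s3_client is None:
        import boto3           # only required when no client is injected
        s3_client = boto3.client("s3")
    bucket, _, key = path.replace("s3://", "").partition("/")
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    # small objects: buffer fully in memory; streaming is the next step for big files
    return pd.read_csv(io.BytesIO(body), **kwargs)
```

Because everything after the fetch is just `pd.read_csv` on a buffer, the same wrapper shape should extend to read_json, read_parquet, etc.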

FTCI Summary Report

  • Summarize evt_type_cd values into agent metadata (at the daily level)
    • number of calls taken

Enhanced transcripts table

Build a Python job to process the transcript database records and output them to a new table, enhanced_transcripts

  • Options to re-run for all records, or only pull records that haven't been processed
    • Full re-run done by clearing the enhanced_transcripts table
  • Run dataframe through transformations
    • queue wait time
    • conversation duration
    • message length (mean, median, max)
      • split by counsellor and counsellee
    • response time (mean, median, max)
      • split by counsellor and counsellee
    • survey info
  • Load to database
    • Starting to think a key-value store would be way better for this, as I will probably be adding new metrics over time... food for thought later on.
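A sketch of one of the transformations above: message length split by counsellor/counsellee. The column names (`sender`, `message`) and the sample rows are assumptions for illustration, not the real schema.

```python
import pandas as pd

def message_length_metrics(transcript):
    """Mean/median/max message length, split by sender."""
    transcript = transcript.assign(msg_len=transcript["message"].str.len())
    return transcript.groupby("sender")["msg_len"].agg(["mean", "median", "max"])

# made-up transcript rows
df = pd.DataFrame({
    "sender": ["counsellor", "counsellee", "counsellee"],
    "message": ["how are you feeling?", "bad", "really bad today"],
})
metrics = message_length_metrics(df)
```

Response-time metrics would follow the same groupby/agg shape, just computed on timestamp diffs instead of string lengths.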

MS SQL Server RDS

Testing out PowerBI connection to AWS RDS instance.

  1. Created through console.
  2. Ran the following:
/* 
https://serverfault.com/questions/496966/amazon-rds-sql-server-master-user-privileges
*/

USE master;
CREATE DATABASE test_db;
USE test_db;

create table customers (
	name text,
	balance integer
)
;


insert into customers values ('ian', 1000);
insert into customers values ('rob', 1000);

use test_db;
select * from customers;
select * from [test_db].[dbo].[customers]
;

AWS Architecture

Once the intent and data sources are locked down, refactor and deploy on AWS.

Consecutive Dataframe Filters Not Applied

def filter_df(dataframe, filters):
    """Filter a dataframe

    Args:
        dataframe (pandas.DataFrame): Input dataframe
        filters (list): list of filters (dicts) to apply

    Returns:
        pandas.DataFrame: Filtered dataframe
    """
    for fltr_dict in filters:
        fltr_fn = getattr(operator, fltr_dict['operator'])
        fltr_value = fltr_dict['value']
        fltr_value = getattr(builtins, fltr_dict['value_type'])(fltr_value)
        fltr_check = fltr_fn(dataframe[fltr_dict['column']], fltr_value)
        fltr_df = dataframe[fltr_check]
    return fltr_df

Very dumb mistake where only the last filter will be applied: fltr_df is recomputed from the original dataframe on every iteration, so each pass throws away the previous filter's result.
