Giter Site home page Giter Site logo

khp-data's Introduction

๐Ÿ‘‹

I'm currently the Co-Founder & CEO of SELECT, a data warehouse cost observability & optimization platform. If you are trying to optimize your Snowflake usage, improve performance, or curious to learn more about Snowflake in general, don't hesitate to email me - I'd be more than happy to chat!

khp-data's People

Contributors

ian-whitestone avatar

Watchers

 avatar

khp-data's Issues

MS SQL Server RDS

Testing out PowerBI connection to AWS RDS instance.

  1. Created through console.
  2. Ran the following:
/* 
https://serverfault.com/questions/496966/amazon-rds-sql-server-master-user-privileges
*/

USE master;
CREATE DATABASE test_db;
USE test_db;

create table customers (
	name text,
	balance integer
)
;


insert into customers values ('ian', 1000);
insert into customers values ('rob', 1000);

use test_db;
select * from customers;
select * from [test_db].[dbo].[customers]
;

Consecutive Dataframe Filters Not Applied

def filter_df(dataframe, filters):
    """Filter a dataframe

    Args:
        dataframe (pandas.DataFrame): Input dataframe
        filters (list): list of filters (dicts) to apply

    Returns:
        pandas.DataFrame: Filtered dataframe
    """
    for fltr_dict in filters:
        fltr_fn = getattr(operator, fltr_dict['operator'])
        fltr_value = fltr_dict['value']
        fltr_value = getattr(builtins, fltr_dict['value_type'])(fltr_value)
        fltr_check = fltr_fn(dataframe[fltr_dict['column']], fltr_value)
        fltr_df = dataframe[fltr_check]
    return fltr_df

Very dumb mistake where only the last filter will be applied...

FTP lib library errors

Bytes versus strings strikes again...

In the FTP.getresp() method, the KHP FTP server returns a bytes message, where it looks like the FTP library was expecting a string?

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/whitesi/khp/ImplicitlyTLS.py", line 42, in connect
    self.welcome = self.getresp()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 236, in getresp
    resp = self.getmultiline()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 222, in getmultiline
    line = self.getline()
  File "/Users/whitesi/anaconda/envs/khp-data/lib/python3.6/ftplib.py", line 213, in getline
    elif line[-1:] in CRLF:
TypeError: 'in <string>' requires string as left operand, not bytes

Potential Fix
Adding this line in the library solved the problem. However, that will be really annoying in prod cause it's a standard library.

def getline(self):
    line = self.file.readline(self.maxline + 1)

    #Ian: added this check
    if type(line) == bytes:
        line = line.decode('ascii')

    if len(line) > self.maxline:
        raise Error("got more than %d bytes" % self.maxline)
    if self.debugging > 1:
        print('*get*', self.sanitize(line))
    if not line:
        raise EOFError
    if line[-2:] == CRLF:
        line = line[:-2]
    elif line[-1:] in CRLF:
        line = line[:-1]
    return line

Use dask to speed up process

Historical loading (i.e. re-processing all contacts), can take some time.

Look into using dask for:

  • Downloading contacts & loading them to db
  • Downloading transcripts & loading them to db
  • Enhanced transcripts

Database documentation

Document schema and add field descriptions

Probably do it manually in an .rst file and incorporate into sphinx.

Enhanced transcripts table

Build a python job to process the transcript database records, output them to a new table, enhanced_transcripts

  • Options to re-run for all records, or only pull records that haven't been processed
    • Full re-run done by clearing the enhanced_transcritps table
  • Run dataframe through transformations
    • queue wait time
    • conversation duration
    • message length (mean, median, max)
      • split by counsellor and counsellee
    • response time (mean, median, max)
      • split by counsellor and counsellee
    • survey info
  • Load to database
    • Starting to think a key-value store would be way better for this as I will probably be adding new metrics over time...food for thought later on..

FTCI Pipeline

  • Download from FTP Server
  • Upload files to S3
    • Check which files are already uploaded
  • Upload files to DB
    • Check which files have already been uploaded (need a table with metadata of what has been loaded)
  • Delete files from S3

Refactoring

  • Use glom for parsing API response data
  • Delete files after loading to database
  • Timestamp handling

AWS Architecture

Once the intent and data sources are locked down, refactor and deploy on AWS.

CSI Pipeline

  • Download from FTP Server
  • Upload files to S3
    • Check which files are already uploaded
  • Upload files to DB
    • Check which files have already been uploaded (need a table with metadata of what has been loaded)
  • Delete files from S3

FTCI Summary Report

  • Summarize evt_type_cd's into agent metadata (at the daily level)
    • number of calls taken

Pandas read from s3

Not on my computer so wanna capture thoughts here re: a wrapper package that provides a method for reading an S3 object into pandas.

  • Instead of calling pd.read_csv, call s3pd.read_csv
  • s3pd will have boto3 & pandas as an optional dependencies, don't want to force a specific version of boto3 or pandas on people, would need to figure out which versions it's compatible with
  • General flow would be:
    • take filename/buffer argument from read_csv call and read that S3 object
    • s3pd.read_csv would need to take some additional, optional arguments (on top of the usual pd.reas_csv args, that could specify things like:
    • pass the S3 object as a StringIO to pd.reas_csv
      • would be nice to able to stream the object as well if it's big, follow discussions here and here for how this can be implemented

Exploratory Data Analysis

  • How many high distress conversations did we miss?
  • Do highly distressed kids drop off more? Or wait longer?
  • How often do kids try and chat before the conversation has started?
  • How does wait-time influence drop-off rate?
  • Can you quantify repeat callers?
  • How do factors like chat time, message length, response time, wait time etc. affect NPS

Other Modelling Ideas:

  • Topic modelling
    • unsupervised clustering, then manually tag based on keyword frequencies

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.