tnguyen21 / forest

Project repository for Drexel University Senior Thesis

Python 91.10% HTML 0.85% JavaScript 8.05%

forest's Introduction

Hi there 👋, I'm Tommy [he/him/his]

I'm a Machine Learning Engineer @ Nuuly. I graduated with a Bachelor of Science in Computer Science from Drexel University. In my spare time I like to write. You can find my blog @ tommynguyen.dev.

🛠 I’m currently working on ...

building ML infrastructure as a Machine Learning Engineer @ Nuuly

📫 How to reach me ...

Twitter @tommy_b_nguyen
LinkedIn @tommybnguyen

📜 In the past I've worked at ...

U.S. Census Bureau as a Software Engineering Civic Digital Fellow developing a content-first API for educational material created @ the USCB
Sharing Excess to help reduce food waste from grocery stores and restaurants
Rise First as a web developer building out content pages for their website
Vanguard (Contracted) as an IT Developer intern where I helped automate middle-office processes using Python and Java
SASE Drexel as webmaster helping to maintain their websites
Drexel Wireless Systems Laboratory as an undergrad researcher
Code for Chicago and Code for Philly making small contributions to front-end projects

☕ Fun facts:

Published in Kernel Magazine with a piece about tech and sustainability

forest's Issues

Search Algorithm and queries with doubled characters

While working on the single word pre-task and timing exact-match queries, I noticed incorrect behavior. Here are some logs I've printed out to help correctly set up the code:

Query: aminophylin | Expected ED: 1 | Found: [('aminophyllin', 0, 0.9972)]
Query: amonul | Expected ED: 1 | Found: [('ammonul', 0, 0.9619)]
Query: aquaphylin | Expected ED: 1 | Found: [('aquaphyllin', 0, 0.9939)]
Query: aranon | Expected ED: 1 | Found: [('arranon', 0, 0.9619)]
Query: atna | Expected ED: 1 | Found: [('atnaa', 0, 0.96)]
Query: aved | Expected ED: 1 | Found: [('aveed', 0, 0.9533)]
Query: bacim | Expected ED: 1 | Found: [('baciim', 0, 0.9667)]
Query: brisdele | Expected ED: 1 | Found: [('brisdelle', 0, 0.9889)]
Query: bucal | Expected ED: 1 | Found: [('buccal', 0, 0.9611)]
Query: coper | Expected ED: 1 | Found: [('copper', 0, 0.9611)]
Query: davp | Expected ED: 1 | Found: [('ddavp', 0, 0.94)]
Query: duave | Expected ED: 1 | Found: [('duavee', 0, 0.9722)]
Query: dycil | Expected ED: 1 | Found: [('dycill', 0, 0.9722)]
Query: efexor | Expected ED: 1 | Found: [('effexor', 0, 0.9619)]
Query: emoquete | Expected ED: 1 | Found: [('emoquette', 0, 0.9889)]
Query: erycete | Expected ED: 1 | Found: [('erycette', 0, 0.9833)]
Query: evkeza | Expected ED: 1 | Found: [('evkeeza', 0, 0.9714)]
Query: ingreza | Expected ED: 1 | Found: [('ingrezza', 0, 0.9833)]
Query: inovar | Expected ED: 1 | Found: [('innovar', 0, 0.9619)]
Query: kimides | Expected ED: 1 | Found: [('kimidess', 0, 0.9875)]
Query: kwel | Expected ED: 1 | Found: [('kwell', 0, 0.96)]
Query: lunele | Expected ED: 1 | Found: [('lunelle', 0, 0.9762)]
Query: merem | Expected ED: 1 | Found: [('merrem', 0, 0.9611)]
Query: minipres | Expected ED: 1 | Found: [('minipress', 0, 0.9926)]
Query: mycapsa | Expected ED: 1 | Found: [('mycapssa', 0, 0.9833)]
Query: niki | Expected ED: 1 | Found: [('nikki', 0, 0.9533)]
Query: oral | Expected ED: 1 | Found: [('oral', 0, 1.0)]
Query: paladone | Expected ED: 1 | Found: [('palladone', 0, 0.9741)]
Query: pastile | Expected ED: 1 | Found: [('pastille', 0, 0.9833)]
Query: pellet | Expected ED: 1 | Found: [('pellet', 0, 1.0)]
Query: phexi | Expected ED: 1 | Found: [('phexxi', 0, 0.9667)]
Query: sebri | Expected ED: 1 | Found: [('seebri', 0, 0.9556)]
Query: shampo | Expected ED: 1 | Found: [('shampoo', 0, 0.981)]
Query: sula | Expected ED: 1 | Found: [('sulla', 0, 0.9533)]
Query: suprelin | Expected ED: 1 | Found: [('supprelin', 0, 0.9741)]
Query: talzena | Expected ED: 1 | Found: [('talzenna', 0, 0.9833)]
Query: vetids | Expected ED: 1 | Found: [('veetids', 0, 0.9619)]
Query: vivele | Expected ED: 1 | Found: [('vivelle', 0, 0.9762)]
Query: xaracol | Expected ED: 1 | Found: [('xaracoll', 0, 0.9875)]
Query: xidra | Expected ED: 1 | Found: [('xiidra', 0, 0.9556)]
{"avg_search_time_per_word": 9.558915955462466e-05, "false_postive_count": 40}

Average search time per word should be self-explanatory.

False positive count is the number of exact-match (ED=0) queries that returned a result reported at edit distance 0 even though the query and the result differ.

Looking at the queries and returned results, it appears that words with "doubled characters" (e.g. ee, aa, etc.) cause the search algorithm to treat the extra character as not adding to the word's edit distance from the query. This is bad! We should re-run all tasks completed during term 1 and re-evaluate any conclusions drawn from them.
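To make the expected distances concrete, here is a minimal sketch of the classic dynamic-programming Levenshtein distance, applied to a few (query, result, reported distance) triples copied from the logs above. It is an independent check, not the trie's own logic; the names levenshtein, logged and mismatches are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance counting insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# (query, returned word, distance reported by the search) from the logs
logged = [("aved", "aveed", 0), ("kwel", "kwell", 0), ("xidra", "xiidra", 0)]

# Keep every entry whose reported distance disagrees with the true distance.
mismatches = [
    (query, word, reported, levenshtein(query, word))
    for query, word, reported in logged
    if levenshtein(query, word) != reported
]
print(mismatches)
```

Each of these pairs differs by one inserted character, so the true distance is 1 while the search reported 0.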

Deal with bad output from phonetic algorithms

Adding a hack to deal with bad output from phonetic algorithms. Need to think about how to handle the case where a phonetic library produces bad output (i.e. the word is not of type string).

71bae5f

        #! bad bug -- sometimes word is NaN because of phonetic algo; i think it's nysiis
        #! for the time being, just make non-strings (e.g. float NaN) into empty string FIX THIS
        if not isinstance(word, str):
            word = ""
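One possible direction for a less hacky fix, sketched under assumptions: rather than silently turning bad values into empty strings, separate them out so they can be counted and inspected. The function name split_valid_codes and the (word, code) pair layout are hypothetical and do not come from the repository.

```python
from typing import Any, Iterable, List, Tuple


def split_valid_codes(
    pairs: Iterable[Tuple[str, Any]]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split (word, phonetic_code) pairs into usable pairs and rejected words.

    A pair is usable only when the phonetic code is a string; anything else
    (for example a float NaN) sends the original word to the rejected list.
    """
    good: List[Tuple[str, str]] = []
    bad: List[str] = []
    for word, code in pairs:
        if isinstance(code, str):
            good.append((word, code))
        else:
            bad.append(word)
    return good, bad


# Made-up sample data: the second code mimics the NaN case described above.
sample = [("copper", "CAPAR"), ("ddavp", float("nan"))]
good, bad = split_valid_codes(sample)
print(good, bad)
```

Keeping the rejected words around makes it easier to confirm which phonetic algorithm produces them.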

Single Word Evaluation Task: Pre-Tasks Timings

A tentative version of the scripts that run the pre-tasks for the single word evaluation task can be found in /experiments/single_word-evaluation_task. They search the FDA term training dictionary with ED=0, 1, 2.

A few things to note about these tentative scripts, which are not 100% complete (the metrics for the returned search results still need to be calculated):

ED=0 takes seconds to run. Quite fast, with an average search time per query of 1e-5 seconds.
ED=1 takes less than a minute to run, perhaps ~10-20 seconds.
ED=2 takes quite a while to get through the whole query set. Based on the timings I logged for the first hundred or so queries, each search with ED=2 takes ~0.1 seconds. This means 0.1 * 17,000 searches = 1,700 seconds = ~28 minutes to run the pre-task.
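The ED=2 estimate above is just per-search time multiplied by the number of queries. A tiny sketch of that arithmetic, using the figures quoted above (the helper name estimated_minutes is made up for this example):

```python
def estimated_minutes(seconds_per_search: float, num_searches: int) -> float:
    """Total wall-clock estimate in minutes for a batch of searches."""
    return seconds_per_search * num_searches / 60


# ~0.1 s per ED=2 search over ~17,000 queries
minutes = round(estimated_minutes(0.1, 17_000), 1)
print(minutes)
```

Plugging in a different per-search time shows how much any speed-up of a single search would shorten the whole pre-task.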

I profiled a single ED=2 search to get a sense of which operations are slowest and how we might optimize them to make ED=2 searches faster. Using cProfile to profile the code, and snakeviz to visualize the run, I got this table:

A single search is quite fast. It's only when multiplied by tens of thousands of queries that the process becomes a bit slow/tedious.

The slowest operation is recursive calls of trie.update_further_children(). If there's a way to make that function faster, we'll probably be able to cut down the time it takes for ED=2 searches.
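For anyone wanting to repeat the profiling step, a minimal sketch with the standard-library cProfile and pstats modules could look like the following. The function slow_search is a hypothetical stand-in for a single ED=2 trie search, since the real call is not shown here.

```python
import cProfile
import io
import pstats


def slow_search(n: int) -> int:
    """Hypothetical stand-in for one ED=2 search; just burns some CPU."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
slow_search(100_000)
profiler.disable()

# Write the report into a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)

# To view the same data in snakeviz, dump it to a file and then run
# `snakeviz search.prof` from a shell:
# stats.dump_stats("search.prof")
```

Sorting by cumulative time is what surfaces a recursive function as the dominant cost, since time spent in its nested calls is attributed to it.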

Bug with duplicated characters

https://github.com/tnguyen21/trie/blob/bcbba6e28915c393f2c77447a2d390de953f77e0/trie/trie.py#L138

My guess (only about 50% sure) is that when update_further_children is called recursively, the current char should be ignored, so my first simple attempt would be to change line 138 as follows:

from:
L138 self.update_further_children(child_node, char)

to:
L138 self.update_further_children(child_node, "")

Let me know the result of that change. Be careful: it may fix one bug and cause two others.
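To judge whether a change like this fixes the bug without causing others, it may help to compare results against a small reference search. The sketch below is a well-known trie plus Levenshtein-row approach written from scratch for this purpose. It is not the code in trie/trie.py, it does not use update_further_children, and the names Node, build_trie, search and _walk are all made up for this example.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """One trie node; `word` is set only where a dictionary word ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.word: Optional[str] = None


def build_trie(words: List[str]) -> Node:
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.word = w
    return root


def search(root: Node, query: str, max_ed: int) -> List[Tuple[str, int]]:
    """Return sorted (word, edit_distance) pairs with distance <= max_ed."""
    results: List[Tuple[str, int]] = []
    first_row = list(range(len(query) + 1))  # DP row for the empty prefix
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_ed, results)
    return sorted(results)


def _walk(node: Node, ch: str, query: str, prev_row: List[int],
          max_ed: int, results: List[Tuple[str, int]]) -> None:
    # Build the DP row for the trie prefix ending in `ch` from the parent row.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        cost = 0 if query[i - 1] == ch else 1
        row.append(min(row[i - 1] + 1,           # query char missing from prefix
                       prev_row[i] + 1,          # extra char `ch` in prefix
                       prev_row[i - 1] + cost))  # match or substitution
    if node.word is not None and row[-1] <= max_ed:
        results.append((node.word, row[-1]))
    # Prune: no descendant can score below the smallest value in this row.
    if min(row) <= max_ed:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_ed, results)


root = build_trie(["aveed", "kwell", "oral"])
print(search(root, "aved", 0))
print(search(root, "aved", 1))
```

Because every character of a dictionary word contributes its own row, a doubled character such as the second e in aveed always costs one edit here, which is the behavior the issue expects.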
