forest's Introduction

Hi there 👋, I'm Tommy [he/him/his]

I'm a Machine Learning Engineer @ Nuuly. I graduated with a Bachelor of Science in Computer Science from Drexel University. In my spare time I like to write. You can find my blog @ tommynguyen.dev.

🛠 I’m currently working on ...

  • building ML infrastructure as a Machine Learning Engineer @ Nuuly

📫 How to reach me ...

📜 In the past I've worked at ...

☕ Fun facts:

forest's Issues

Search Algorithm and queries with doubled characters

While working on the single word pre-task and timing exact-match queries, I noticed incorrect behavior. Here are some logs I've printed out to help correctly set up the code:

Query: aminophylin | Expected ED: 1 | Found: [('aminophyllin', 0, 0.9972)]
Query: amonul | Expected ED: 1 | Found: [('ammonul', 0, 0.9619)]
Query: aquaphylin | Expected ED: 1 | Found: [('aquaphyllin', 0, 0.9939)]
Query: aranon | Expected ED: 1 | Found: [('arranon', 0, 0.9619)]
Query: atna | Expected ED: 1 | Found: [('atnaa', 0, 0.96)]
Query: aved | Expected ED: 1 | Found: [('aveed', 0, 0.9533)]
Query: bacim | Expected ED: 1 | Found: [('baciim', 0, 0.9667)]
Query: brisdele | Expected ED: 1 | Found: [('brisdelle', 0, 0.9889)]
Query: bucal | Expected ED: 1 | Found: [('buccal', 0, 0.9611)]
Query: coper | Expected ED: 1 | Found: [('copper', 0, 0.9611)]
Query: davp | Expected ED: 1 | Found: [('ddavp', 0, 0.94)]
Query: duave | Expected ED: 1 | Found: [('duavee', 0, 0.9722)]
Query: dycil | Expected ED: 1 | Found: [('dycill', 0, 0.9722)]
Query: efexor | Expected ED: 1 | Found: [('effexor', 0, 0.9619)]
Query: emoquete | Expected ED: 1 | Found: [('emoquette', 0, 0.9889)]
Query: erycete | Expected ED: 1 | Found: [('erycette', 0, 0.9833)]
Query: evkeza | Expected ED: 1 | Found: [('evkeeza', 0, 0.9714)]
Query: ingreza | Expected ED: 1 | Found: [('ingrezza', 0, 0.9833)]
Query: inovar | Expected ED: 1 | Found: [('innovar', 0, 0.9619)]
Query: kimides | Expected ED: 1 | Found: [('kimidess', 0, 0.9875)]
Query: kwel | Expected ED: 1 | Found: [('kwell', 0, 0.96)]
Query: lunele | Expected ED: 1 | Found: [('lunelle', 0, 0.9762)]
Query: merem | Expected ED: 1 | Found: [('merrem', 0, 0.9611)]
Query: minipres | Expected ED: 1 | Found: [('minipress', 0, 0.9926)]
Query: mycapsa | Expected ED: 1 | Found: [('mycapssa', 0, 0.9833)]
Query: niki | Expected ED: 1 | Found: [('nikki', 0, 0.9533)]
Query: oral | Expected ED: 1 | Found: [('oral', 0, 1.0)]
Query: paladone | Expected ED: 1 | Found: [('palladone', 0, 0.9741)]
Query: pastile | Expected ED: 1 | Found: [('pastille', 0, 0.9833)]
Query: pellet | Expected ED: 1 | Found: [('pellet', 0, 1.0)]
Query: phexi | Expected ED: 1 | Found: [('phexxi', 0, 0.9667)]
Query: sebri | Expected ED: 1 | Found: [('seebri', 0, 0.9556)]
Query: shampo | Expected ED: 1 | Found: [('shampoo', 0, 0.981)]
Query: sula | Expected ED: 1 | Found: [('sulla', 0, 0.9533)]
Query: suprelin | Expected ED: 1 | Found: [('supprelin', 0, 0.9741)]
Query: talzena | Expected ED: 1 | Found: [('talzenna', 0, 0.9833)]
Query: vetids | Expected ED: 1 | Found: [('veetids', 0, 0.9619)]
Query: vivele | Expected ED: 1 | Found: [('vivelle', 0, 0.9762)]
Query: xaracol | Expected ED: 1 | Found: [('xaracoll', 0, 0.9875)]
Query: xidra | Expected ED: 1 | Found: [('xiidra', 0, 0.9556)]
{"avg_search_time_per_word": 9.558915955462466e-05, "false_postive_count": 40}

Average search time per word should be self-explanatory.

False positive count is the number of queries with an expected edit distance of 1 for which an exact-match search (ED=0) still returned a result reported at edit distance 0.

Looking at the queries and returned results, it appears that words with "doubled characters" (e.g. ee, aa) cause the search algorithm to treat the extra character as not contributing to the edit distance between the word and the query. This is bad! We should re-run all tasks completed during term 1 and re-evaluate any conclusions drawn from them.
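One way to confirm the log above independently of the trie is to recompute the edit distance for each (query, returned word) pair with a plain dynamic-programming Levenshtein function. The sketch below is not code from this repository; `levenshtein`, `count_false_positives`, and `logged_pairs` are hypothetical names, and the pairs are a small sample copied from the log.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# (query, returned word) pairs sampled from the log; all were reported at ED=0.
logged_pairs = [
    ("aved", "aveed"),
    ("coper", "copper"),
    ("xidra", "xiidra"),
    ("oral", "oral"),
]


def count_false_positives(pairs):
    """Count results reported at ED=0 whose true distance to the query is not 0."""
    return sum(1 for query, found in pairs if levenshtein(query, found) != 0)


print(count_false_positives(logged_pairs))
```

Under this check the `oral` pair does not count, since the returned word is identical to the query, while each doubled-character pair has a true distance of 1.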

Deal with bad output from phonetic algorithms

Adding a hack to deal with bad output from phonetic algorithms. We need to think about how to handle cases where a phonetic library produces bad output (i.e. word is not of type string).

71bae5f

        #! bad bug: sometimes word is NaN (a float) because of the phonetic algo; I think it's nysiis
        #! for the time being, turn any non-string value into an empty string. FIX THIS
        if not isinstance(word, str):
            word = ""
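A slightly stricter version of this workaround could accept only the failure modes that have actually been seen (NaN, and by assumption `None`) and raise on anything else, so new kinds of bad output are not silently swallowed. This is only a sketch; `normalize_phonetic_output` is a hypothetical helper, not part of the repository.

```python
import math


def normalize_phonetic_output(word):
    """Return a string for a phonetic encoding, mapping NaN/None to ""."""
    if isinstance(word, str):
        return word
    # NaN shows up as a float; treating None the same way is an assumption.
    if word is None or (isinstance(word, float) and math.isnan(word)):
        return ""
    # Anything else is unexpected, so fail loudly instead of hiding it.
    raise TypeError(f"unexpected phonetic output: {word!r}")
```

Calling it with `float("nan")` returns an empty string, a normal string passes through unchanged, and a value such as `3` raises `TypeError`.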

Single Word Evaluation Task: Pre-Tasks Timings

A tentative version of the scripts that run the pre-tasks for the single word evaluation task can be found in /experiments/single_word-evaluation_task. They search the FDA term training dictionary with ED=0, 1, and 2.

A few things to note about these tentative scripts, which are not yet complete (metrics for the returned search results still need to be calculated):

  • ED=0 takes seconds to run. Quite fast, with an average search time per query of about 1e-5 seconds.
  • ED=1 takes less than a minute to run, perhaps ~10-20 seconds.
  • ED=2 takes quite a while to get through the whole query set. Based on the timings I logged for the first hundred or so queries, each ED=2 search takes about 0.1 seconds. This means 0.1 * 17,000 searches = 1,700 seconds = ~28 minutes to run the pre-task.
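The ED=2 estimate is simple arithmetic and can be reproduced as below. The constant names are made up for illustration; the values are the rough figures quoted in the list above.

```python
SECONDS_PER_SEARCH = 0.1  # observed for ED=2 over the first hundred or so queries
NUM_QUERIES = 17_000      # approximate number of searches in the pre-task

total_seconds = SECONDS_PER_SEARCH * NUM_QUERIES
total_minutes = total_seconds / 60

print(f"{total_seconds:.0f} s = {total_minutes:.1f} min")  # 1700 s = 28.3 min
```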

I profiled a single ED=2 search to get a sense of which operations are slowest and how we might optimize them to make ED=2 searches faster. Using cProfile to profile the code and snakeviz to visualize the run, I got this table:
(image: profiling table)

A single search is quite fast. The process only becomes slow and tedious when that cost is multiplied across tens of thousands of queries.

The slowest operation is recursive calls of trie.update_further_children(). If there's a way to make that function faster, we'll probably be able to cut down the time it takes for ED=2 searches.
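For reference, a profiling run like this can be reproduced with the standard library alone. The sketch below wraps a call in cProfile and returns a text report sorted by cumulative time. `profile_call` is a hypothetical helper, and `slow_recursive` is only a stand-in for a recursive search, not the repository's trie code.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_recursive(n):
    """Hypothetical stand-in for a recursive search routine."""
    return 1 if n <= 1 else slow_recursive(n - 1) + slow_recursive(n - 2)


result, report = profile_call(slow_recursive, 15)
print(report)
```

In the report, a recursive function shows up with a call count much larger than its number of top-level calls, which is the same signal that points at the recursive calls here. To view a run in snakeviz instead, the profiler's `dump_stats` method can write a stats file that snakeviz opens.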

Bug with duplicated characters

https://github.com/tnguyen21/trie/blob/bcbba6e28915c393f2c77447a2d390de953f77e0/trie/trie.py#L138

I have a guess (only about 50% sure): when update_further_children is called recursively, the current char may need to be ignored. My first simple attempt would be to change line 138 as follows:

from:
L138 self.update_further_children(child_node, char)

to:
L138 self.update_further_children(child_node, "")

Let me know the result of that change. Be careful: it can fix one bug and cause two others.
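To check whether a change like this fixes the doubled-character cases without breaking others, it may help to compare results against a small reference implementation. The sketch below is not the code in trie/trie.py; it is the common approach of carrying one Levenshtein row per trie node, and `TrieNode`, `Trie`, `search`, and `_walk` are hypothetical names chosen for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a dictionary word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.word = word

    def search(self, query, max_distance):
        """Return sorted (word, distance) pairs within max_distance of query."""
        results = []
        first_row = list(range(len(query) + 1))
        for char, child in self.root.children.items():
            self._walk(child, char, query, first_row, max_distance, results)
        return sorted(results)

    def _walk(self, node, char, query, prev_row, max_distance, results):
        # Build the Levenshtein row for the prefix that ends at this node.
        row = [prev_row[0] + 1]
        for col in range(1, len(query) + 1):
            cost = 0 if query[col - 1] == char else 1
            row.append(min(row[col - 1] + 1,          # insertion
                           prev_row[col] + 1,         # deletion
                           prev_row[col - 1] + cost)) # substitution or match
        if node.word is not None and row[-1] <= max_distance:
            results.append((node.word, row[-1]))
        # Descend only while some cell can still end within the budget.
        if min(row) <= max_distance:
            for next_char, child in node.children.items():
                self._walk(child, next_char, query, row, max_distance, results)


reference = Trie(["aveed", "copper", "oral"])
print(reference.search("aved", 0))  # []
print(reference.search("aved", 1))  # [('aveed', 1)]
```

Running the logged queries through both implementations at ED=0, 1, and 2 and comparing the outputs would show whether the fix removes the false positives and whether it introduces new differences.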
