1. Webhook support
We need the embedding vectors, along with their payloads, returned as a consistent type within the payload sent to a webhook route we can specify.
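As a purely illustrative sketch, a webhook body like the following would work for us; every field name here is an assumption on our part, not an agreed contract:

```python
import json

# Hypothetical webhook POST body; all field names below are assumptions.
webhook_body = {
    "document_id": "<our uuid, echoed back>",  # see the document-id section below
    "results": [
        {
            "chunk_index": 0,
            "embedding": [0.12, -0.08, 0.33],  # truncated vector, for illustration
            "metadata": {"from": "alice@example.com"},  # payload carried with the vector
        },
    ],
}

# Whatever the final shape is, it should round-trip cleanly as JSON.
serialized = json.dumps(webhook_body)
```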
3. Collision detection
We want to provide the following RPC route together with a threshold `full_text_score`: `get_top_full_text_match`, which returns `{id: uuid, full_text_score: float32}`.
We would like that to get called before using GPU time or a usage-priced provider call to make the embedding. If the score exceeds the given threshold, we don't need the embedding; instead, the index in the webhook request payload that would have contained the embedding should carry a `collision_id: id` that marks it as a collision. Our system then handles marking it as such.
It would be cool if we could also provide the following RPC and a threshold `distance`: `get_top_semantic_match`, which returns `{id: uuid, distance: float32}`.
If the `distance` exceeds the threshold, we would want to do the same as above. This semantic distance check is not required but would be a nice add; if we don't have it, we will have to implement it on our end to validate the embeddings returned in the webhook request payload.
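To make the intended ordering concrete, here is a minimal sketch of the flow, using hypothetical stand-ins for the proposed RPC and for the expensive embedding call (none of these function names are real APIs):

```python
from uuid import uuid4

FULL_TEXT_THRESHOLD = 0.9  # example value; the real threshold is caller-supplied


def get_top_full_text_match(text):
    # Stand-in for the proposed RPC; returns {id, full_text_score}.
    score = 0.95 if "duplicate" in text else 0.1
    return {"id": uuid4(), "full_text_score": score}


def compute_embedding(text):
    # Stand-in for the GPU / usage-priced provider call we want to avoid.
    return [0.0] * 8


def embed_or_flag_collision(text):
    match = get_top_full_text_match(text)
    if match["full_text_score"] > FULL_TEXT_THRESHOLD:
        # Threshold exceeded: skip the embedding entirely and mark the
        # webhook payload entry for this chunk with collision_id instead.
        return {"collision_id": match["id"]}
    return {"embedding": compute_embedding(text)}
```

The same gate would apply to `get_top_semantic_match`, with `distance` in place of `full_text_score`.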
4. Document IDs
`job_id` seems like a proxy for a `document_id`. We don't want to have to track these on our end. Sending a document and getting back a `job_id` would require us to associate our document's `uuid` with the `job_id` so that we can link them when the vectors come in to the webhook. Instead, we want to send a `document_id` with our submission of the document (alongside the signed S3 URL) and get that same `document_id` back in the request payload on the webhook hit.
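A sketch of the pass-through we are asking for; the field names and values are illustrative only:

```python
# Illustrative enqueue request: we supply document_id alongside the signed URL.
submission = {
    "document_id": "b3f1c2d4-0000-0000-0000-000000000000",  # made-up UUID
    "signed_url": "https://example-bucket.s3.amazonaws.com/doc.pdf?X-Amz-Signature=abc",
}

# What we want on the webhook hit: our document_id echoed back verbatim,
# so no job_id -> uuid bookkeeping is needed on our side.
webhook_payload = {
    "document_id": submission["document_id"],
    "embeddings": [],
}
```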
5. Email chunker
Emails typically export as HTML or XML. We extract all of the innerText into a `txt` file before starting to chunk.
For emails, we want one email to constitute a chunk. However, emails are typically chains of several individual emails that have to get parsed. As we chunk, we need to track where each email ends. To do so, look for sign-off phrases like `regards,`, `thanks,`, etc.:
words_trigger_email_end = ['forwarded message', 'has invited you', 'open in', 'google llc', 'original message', 'original message follows', '───', '--', '***', '===', 'regards,', 'from,', 'sincerely,', 'yours,', 'gratitude,', 'appreciation,', 'care,', 'cheers,', 'cordially,', 'respectfully,', 'warmly,', 'best,', 'wishes,', 'humbly,', 'thanks,']
We also have to track where emails start, because we don't want the noise between emails, such as confidentiality notices and PII:
words_trigger_email_start = ['to:', 'cc:', 'from:', 'date:', 'sent:', 'subject:', 're:', 'fw:', 'fwd:', 'attachments:', 'attached:', 'wrote:']
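A rough sketch of how the start/end triggers could classify lines while scanning the `txt`; the trigger lists here are abbreviated, and the start/anywhere matching rule is our assumption:

```python
# Abbreviated trigger lists; the full lists are given above.
words_trigger_email_start = ['to:', 'cc:', 'from:', 'date:', 'subject:']
words_trigger_email_end = ['forwarded message', 'regards,', 'thanks,', 'sincerely,', '--']


def classify_line(line):
    lowered = line.strip().lower()
    # Start triggers are header keys, so match them at the start of the line.
    # (Note: a trailer like "On ... wrote:" would need an endswith check.)
    if any(lowered.startswith(t) for t in words_trigger_email_start):
        return "start"
    # End triggers (sign-offs, separators) can appear anywhere in the line.
    if any(t in lowered for t in words_trigger_email_end):
        return "end"
    return "body"
```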
Then, we also need to track the keys on the email:
keys_to_track = ['to:', 'cc:', 'from:', 'date:', 'sent:']
Every line of the `txt` basically has to get split on those keys; you then grab the element at index 1 and add the KV pairs to the metadata that will be included on the embedding type returned in the webhook payload.
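A minimal sketch of that key extraction, assuming each header key sits at the start of its own line:

```python
keys_to_track = ['to:', 'cc:', 'from:', 'date:', 'sent:']


def extract_email_metadata(lines):
    # Split each line on its tracked key and keep index 1 (the value side),
    # accumulating KV pairs for the embedding metadata.
    metadata = {}
    for line in lines:
        lowered = line.lower()
        for key in keys_to_track:
            if lowered.startswith(key):
                metadata[key.rstrip(':')] = line.split(':', 1)[1].strip()
                break
    return metadata
```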
There is also an edge case here: emails can sometimes exceed a given `max_chunk` size and then need to be split. Handling that is very hard.
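As a starting point only, a naive split on paragraph boundaries with a hard character fallback; the genuinely hard parts (quoted replies or signatures landing mid-split) are not handled here:

```python
def split_oversized_email(text, max_chunk=2000):
    # Greedily pack paragraphs into chunks of at most max_chunk characters,
    # hard-splitting any single paragraph that is itself too long.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) + 2 <= max_chunk:
            current = f"{current}\n\n{para}" if current else para
        else:
            if current:
                chunks.append(current)
            while len(para) > max_chunk:
                chunks.append(para[:max_chunk])
                para = para[max_chunk:]
            current = para
    if current:
        chunks.append(current)
    return chunks
```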
I will try to share our code soon, but it has a bunch of hardcoded stuff related to POs we signed, so it requires significant editing.
6. Sentence/Line ignore triggers during chunking
Assuming an OCR parse into `txt`, these ignores are typically line-based because OCR does not handle punctuation well in our testing. Otherwise these ignores are usually sentence-based.
If a word in the array we pass when queuing appears in a line or sentence, we want to skip over that line/sentence during chunking:
words_to_trigger_line_ignore = ['to:', 'cc:', 'from:', 'date:', 'sent:', 'forwarded message', '@', '───', 'meeting id:', 'password:', 'all rights reserved', 'has invited you', 'google llc', "recognize the sender's email", 'docs.google.com', 'external sender', 'trust this email', 'content is safe', 'proof of sender', 'do not click', 'mentioned in this thread', 'google docs sends']
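A sketch of the line-based ignore pass, using an abbreviated trigger list (the full list is above):

```python
# Abbreviated; the full trigger list is given above.
words_to_trigger_line_ignore = ['to:', 'from:', '@', 'meeting id:', 'password:',
                                'do not click', 'external sender']


def keep_line(line, triggers=words_to_trigger_line_ignore):
    # Drop the whole line if any trigger appears anywhere in it.
    lowered = line.lower()
    return not any(t in lowered for t in triggers)


def filter_lines(lines):
    return [line for line in lines if keep_line(line)]
```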
Future needs
For marketing purposes we are standing up a bunch of "chat with YouTuber foo" demos, which means we are using Whisper to chunk audio files. We need support for this kind of chunking. Here is a full implementation we think can mostly be copied.