peterpla / lead-expert

Generic Go web app with APIs and html/template UI

License: MIT License

Shell 3.36% Go 96.64%

lead-expert's Issues

add repo.Update in key pipeline stages

Type of Issue

Feature (add missing functionality)

Describe the issue

Requests are added to the database (repo.Create) by cmd/server/main.go/postHandler, but subsequent changes are not written, e.g., at the end of the transcriptionGCP and completionProcessing pipeline stages.

Use repo.Update to update the Request at those stages.

Expected behavior

The Request, with updated WorkingTranscript (after transcriptionGCP) and CompletedAt (after completionProcessing), is written to the database using repo.Update.

Additional context

Most other pipeline stages currently don't do anything (except add Begin/End timestamps), so we can avoid hitting Firestore's 1 update/doc/second limit by not updating the database at those points.

Currently default (handles all client HTTP requests), transcriptionGCP (obtain text transcript) and completionProcessing are appropriate pipeline stages to update the Request in the database.

expose app metrics using expvar

Type of Issue

Feature (add missing functionality)

Describe the issue

Expose application metrics (e.g., operations/sec) through an HTTP endpoint that delivers metrics data as JSON. See How to instrument Go code with custom expvar metrics and package expvar.

Identify standard metrics we want from all services
Identify service-specific metrics

Expected behavior

Hit the endpoint, get back the latest set of metrics.

Additional context

Super-easy to implement on the server side, so it could actually get done.
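
A minimal sketch of what this could look like with the standard library's expvar package. The counter name ops_total and the helpers recordOp and newMux are hypothetical, chosen only for illustration; which metrics to publish is still the open question above. The demo issues one in-memory request so it terminates; a real service would pass newMux() to http.ListenAndServe instead.

```go
package main

import (
	"expvar"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// opsTotal is a hypothetical counter, published under the name "ops_total".
var opsTotal = expvar.NewInt("ops_total")

// recordOp increments the counter and returns its new value.
func recordOp() int64 {
	opsTotal.Add(1)
	return opsTotal.Value()
}

// newMux wires an example endpoint plus the metrics endpoint.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "handled %d operations\n", recordOp())
	})
	// expvar.Handler serves every published variable as one JSON object.
	mux.Handle("/debug/vars", expvar.Handler())
	return mux
}

func main() {
	recordOp()
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/debug/vars", nil))
	fmt.Println(rec.Code, strings.Contains(rec.Body.String(), "ops_total"))
}
```

Besides custom variables, expvar publishes cmdline and memstats by default, so the endpoint returns those too.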

robust parsing of JSON Body

Type of Issue

Feature (add missing functionality)

Describe the issue

Production-level parsing of JSON request Body adds security and robustness to request handling. See Alex Edwards' "How to Parse a JSON Request Body in Go" which includes a decodeJSONBody helper function.

Expected behavior

Many problem scenarios are discussed in the article. Our expected behavior is to not experience those problems in production.
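
A rough sketch in the spirit of such a helper, not a copy of the one in the article. It caps the body size, rejects unknown fields, and rejects anything after the first JSON object. The request struct, its JSON field names, the 1 MiB limit and the tryDecode helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// request is a hypothetical payload shape, used only for this illustration.
type request struct {
	CustomerID   int    `json:"customer_id"`
	MediaFileURI string `json:"media_uri"`
}

// decodeJSONBody decodes exactly one JSON object from the request body into
// dst, rejecting oversized bodies, unknown fields, and trailing data.
func decodeJSONBody(w http.ResponseWriter, r *http.Request, dst interface{}) error {
	r.Body = http.MaxBytesReader(w, r.Body, 1<<20) // assumed 1 MiB cap
	dec := json.NewDecoder(r.Body)
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return fmt.Errorf("invalid JSON body: %w", err)
	}
	// A second Decode must hit io.EOF; anything else means extra content.
	if err := dec.Decode(&struct{}{}); !errors.Is(err, io.EOF) {
		return errors.New("body must contain a single JSON object")
	}
	return nil
}

// tryDecode runs decodeJSONBody against an in-memory request.
func tryDecode(body string) bool {
	r := httptest.NewRequest(http.MethodPost, "/requests", strings.NewReader(body))
	var req request
	return decodeJSONBody(httptest.NewRecorder(), r, &req) == nil
}

func main() {
	fmt.Println(tryDecode(`{"customer_id": 1234567, "media_uri": "gs://bucket/audio-01.mp3"}`))
	fmt.Println(tryDecode(`{"customer_id": 1234567, "extra": true}`))
	fmt.Println(tryDecode(`{"customer_id": 1}{"customer_id": 2}`))
}
```

A production version would also map each failure to a specific HTTP status and check the Content-Type header; that is omitted here.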

feat: implement queue and worker service

Type of Issue

Feature (add missing functionality)

Describe the issue

Implement task queues, where an API endpoint adds a request to a queue, which is picked up by a worker service. Presumably using Google Cloud Pub/Sub.

bad time.Duration calculation in completionProcessing

Type of Issue

Bug (something doesn't work as expected)

Describe the issue

At the end of cmd/completionProcessing/main.go/taskHandler where the total request time is reported, a bad value is displayed:

=====> Request Processed in -2562047h47m16.854775808s <====

To Reproduce

Appears to occur every time, running locally or in the cloud. Shown in automated testing, i.e., go test -v ./....

Expected behavior

The expected duration is approximately 45 seconds.

Additional context

Log file entry from GCP StackDriver:

2019/12/16 05:20:03 completion-processing.taskHandler completed in 650.728µs =====>
Request Processed in -2562047h47m16.854775808s <==== : queue "CompletionProcessing",
task "7275844481679864645", response: {RequestID:ee11b366-97ab-
4dc2-8ec9-9d35262c0fe6 CustomerID:1234567 MediaFileURI:gs://elated-practice-
224603.appspot.com/audio_uploads/audio-01.mp3 
AcceptedAt:2019-12-16T05:19:01.768536301Z FinalTranscript:thank you for calling Park 
flooring this is Michael how may I help you hey Michael how are you today good what's up my 
name is Yuri well the sun is finally up that's what's up this off a little bit um the reason for my 
call is I had a I just wanted to see who's in charge of your lead management or lead distribution 
I am the owner oh very good well then glad we're chatting so my company and my rather my 
partner and I are focused on helping flooring retailers get five to ten more phone calls per day 
from customers qualified customers without actually any advertising effort we're not we're not 
you know we're not an agency so we don't do any external advertising on any platforms but we 
do help with the the optimizing the existing traffic on your site and the flow of traffic so we can 
let this sort of like the traffic control of your website and we'd be able to get you additional 
phone calls base wage CompletedAt:2019-12-16T05:20:03.310908558Z}
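
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: the value shown is the smallest representable time.Duration, which is what time.Time.Sub returns when the result underflows, for example when an unset (zero) time.Time has a 2019 timestamp subtracted from it. That would happen if the duration is computed before CompletedAt is set. The sketch below reproduces the value using the timestamps from the log entry; elapsed and demo are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// elapsed returns end-start, or false when either timestamp is unset.
func elapsed(start, end time.Time) (time.Duration, bool) {
	if start.IsZero() || end.IsZero() {
		return 0, false
	}
	return end.Sub(start), true
}

// demo returns the saturated duration and the guarded one, as strings.
func demo() (bad, good string) {
	acceptedAt := time.Date(2019, 12, 16, 5, 19, 1, 768536301, time.UTC)
	completedAt := time.Date(2019, 12, 16, 5, 20, 3, 310908558, time.UTC)

	// Subtracting from a zero time.Time underflows and saturates.
	var unset time.Time
	bad = unset.Sub(acceptedAt).String()

	if d, ok := elapsed(acceptedAt, completedAt); ok {
		good = d.String()
	}
	return bad, good
}

func main() {
	bad, good := demo()
	fmt.Println(bad)  // -2562047h47m16.854775808s
	fmt.Println(good) // 1m1.542372257s
}
```

If this is the cause, setting CompletedAt before computing the duration, or guarding with IsZero as above, should remove the bad value.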

need tests for command line argument handling

Error completion - end request after an error

Type of Issue

Feature (add missing functionality)

Describe the issue

After a request is accepted (202 Accepted returned by cmd/server/main.go/postHandler), an error at a later pipeline stage is not retained or communicated to the client; it's only reported to Cloud Tasks.

Now that we can persist Requests to the database, we can persist error status and report it to the client when they poll /queues/:uuid.

Expected behavior

The cmd/server/main.go/getQueueHandler should read the indicated Request and return the original status, per interface.MD documentation.

design for days-long QA processing

Type of Issue

Feature (add missing functionality)

Describe the issue

Human QA review of transcripts and tagging is likely to require longer than the 10- to 60-minute timeout limits of Google Cloud Tasks (for standard and flexible environments, respectively). This suggests the *QA services (e.g., transcriptQA, taggingQA) will need to use a different solution for managing QA processing.

Expected behavior

Proposed:

The default /task_handler endpoint for QA services receives the QA request
The request is written to a database-hosted queue
The /task_handler returns 202 Accepted telling Cloud Tasks that request has successfully completed
One or more human reviewers are presented with details of the request for their review; e.g., by email or text.
When human review is completed, the (potentially modified) request is sent to the service's /QA_complete endpoint (does not yet exist).
The /QA_complete handler:
- "completes" the task in the database (details TBD)
- updates the Requests database to reflect any changes from the QA process
- writes the request to the Cloud Tasks queue for the next pipeline stages

This takes the long-running human QA process outside the Cloud Tasks infrastructure, thereby avoiding its short timeout limits (in this context).

Alternative: as above, but the default service (cmd/server/main.go) handles /QA_complete endpoint. That keeps the implementation of all public-facing endpoints in one service.

Additional context

A mechanism like this will be needed for many different Beam scenarios, assuming work is dispatched to a network of Beam-affiliated freelance workers handling requests/tasks of various types. Thus it warrants some thought to produce a generalized model.

bug: return error on empty transcript

Type of Issue

Bug (something doesn't work as expected)

Describe the issue

If a request produces no transcript, all subsequent pipeline stages will have nothing to do, so report the error and avoid sending the request to the rest of the pipeline.

Expected behavior

If WorkingTranscript is empty:

[] add the appropriate error (TBD) to the Request
[] add status = ERROR to the Request
[] add CompletedAt to the Request
[] update the Request database
[] Do not post the request to the queue for the next pipeline stage.
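
The checklist above could be sketched roughly as follows. The Request type here is a pared-down, hypothetical stand-in for the project's type, errEmptyTranscript is a placeholder for the still-TBD error, and the update and enqueue callbacks stand in for repo.Update and the post to the next queue. Treating a whitespace-only transcript as empty is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// Request is a hypothetical, pared-down stand-in for the project's Request.
type Request struct {
	RequestID         string
	WorkingTranscript string
	Status            string
	Error             string
	CompletedAt       time.Time
}

// errEmptyTranscript is a placeholder; the issue leaves the real error TBD.
var errEmptyTranscript = errors.New("transcription produced no transcript")

// finishStage ends the request with an error when the transcript is empty,
// otherwise forwards it to the next pipeline stage.
func finishStage(req *Request, update, enqueue func(*Request) error) error {
	if strings.TrimSpace(req.WorkingTranscript) == "" {
		req.Error = errEmptyTranscript.Error()
		req.Status = "ERROR"
		req.CompletedAt = time.Now().UTC()
		if err := update(req); err != nil {
			return err
		}
		return errEmptyTranscript // do not post to the next stage
	}
	return enqueue(req)
}

// run reports the resulting status and whether the request was enqueued.
func run(transcript string) (status string, enqueued bool) {
	req := &Request{RequestID: "example", WorkingTranscript: transcript, Status: "PENDING"}
	update := func(*Request) error { return nil }
	enqueue := func(*Request) error { enqueued = true; return nil }
	_ = finishStage(req, update, enqueue)
	return req.Status, enqueued
}

func main() {
	fmt.Println(run(""))                       // ERROR false
	fmt.Println(run("thank you for calling")) // PENDING true
}
```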

Additional context

A truly empty transcript likely occurs due to an error processing the recording.

A functionally empty transcript contains words but no actionable information. TBD whether that can be identified programmatically during transcriptQA processing, or if that requires human review.

implement HTTP GET /transcripts

Type of Issue

Feature (add missing functionality)

Describe the issue

Complete implementation of HTTP GET /transcripts/:uuid to get the final transcript from an Accepted request.

Expected behavior

A successful HTTP POST /requests returns 202 Accepted and the response body contains poll_endpoint with the URI to poll for status of the request (e.g., /status/:uuid). A GET /status/:uuid will return 303 See Other and the response body contains transcript_endpoint with the URI to retrieve the final transcript for the request (e.g., /transcripts/:uuid).

This issue is to fully implement that GET /transcripts/:uuid endpoint, as documented in interface.MD.

feat: implement basic flow through queues and services

Type of Issue

Feature (add missing functionality)

Describe the issue

Implement the basic flow of an incoming request through various queues and services/handlers.

Expected behavior

A POST /requests will be handled by the default service, which adds the request to the InitialRequest queue, which triggers the next service/handler, which adds the request to the next queue, which triggers the next service/handler, etc., until the request is added to the TaggingQAComplete queue which triggers the Completion service/handler. See Airtable, "Flooring: Transcription Pipeline" for details of request flow through queues.

For these purposes, each service/handler merely logs the arrival of a request and adds the request to the next queue, so the end-to-end flow is operational (albeit stubbed).

Implementation Tasks

implemented encrypted configuration file support

Goal: use an encrypted app config file for staging/production.

Approach: during deployment to staging/production, copy the current local dev config file (not encrypted), encrypt it into a temporary file, and deploy the encrypted file.

Design:
Deployment: use gsutil and gcloud tools to:

create KMS keyring and key
create an encrypted copy of the local dev config file
copy that to Cloud Storage

App:

read the config file from Cloud Storage
decrypt it
load the configuration

import media files from initial supported sources

Type of Issue

Feature (add missing functionality)

Describe the issue

Cloud-based ML transcription services expect the recordings they analyze to be available either in their cloud storage (e.g., Google Speech-to-Text supports Google Cloud Storage) or in some cases to be streamed into the transcription service.

Recording services host their recordings and offer various APIs to access them.

Understand common scenarios and implement handlers for them.

Expected behavior

Just make it work, but long-term we'll want to minimize the amount of copying done to limit network fees.

migrate go code into project OldManCodes

Type of Issue

Chore (rearrange code without adding or fixing something)

Describe the issue

Need to move the Go code out of the Google Cloud elated-practice project, which was tied to a free trial that has expired, into the oldmancodes project. That includes creating equivalent Cloud Storage and Cloud KMS resources, and setting the needed IAM policies.

Additional Context

Focus on the service account specified in gdeploy.sh, make sure it has the right permissions.

store requests in persistent storage

Type of Issue

Feature (add missing functionality)

Describe the issue

Store requests in persistent storage

Requests and processing currently use key:value pairs, so a NoSQL database like Cloud Firestore would make sense. See Cloud Firestore documentation.

Expect to archive requests after some period of time.

Expected behavior

Requests are the cornerstone of the business and must be reliably persisted for later reference, audit, etc. Each time a request's information is altered, the change must be persisted.

Additional context

investigate Go-based in-memory caches with configurable write-back.

The potential frequency of updates suggests the need for write-back caching. For example, Firestore's limit is 1 update per second to a given document (i.e., a transcription Request), and 1,000 updates per second to a collection (i.e., all active Requests). See Best practices for Cloud Firestore.

Cloud Memorystore for Redis would cost about $15/day (about $465/month) for the current 12 services, each with a 1 GB cache. See Pricing; the current price is $0.049/GB/hr provisioned, even if it's not being used.

implement using Google Speech to Text to generate a transcript of audio files

Type of Issue

Feature (add missing functionality)

Describe the issue

Use Google Speech-to-Text to obtain a text transcript of the recording referenced in the request.

Expected behavior

Pass the audio file from the request to Google Speech-to-Text, receiving back a text transcript or an error.

Additional context

I believe the Google Speech-to-Text APIs distinguish between short/small vs. long/large recordings. Parametrize the thresholds to simplify adding other services with different thresholds.

tag key info like address and phone number

Type of Issue

Feature (add missing functionality)

Describe the issue

Use Google Cloud Data Loss Prevention Classification, using pre-defined Infotype detectors: InfoTypes and infoType detectors.

Expected behavior

Capture details of targeted InfoTypes found in the transcript, e.g., PERSON_NAME, PHONE_NUMBER, CREDIT_CARD_NUMBER. For each, record whether it was found (matched), the value found for the InfoType, its location in the transcript (start and end byte offsets), etc.

move most env vars from app.yaml to config.yaml

Type of Issue

Chore (rearrange code without adding or fixing something)

Describe the issue

The app.yaml files for each service set many of the same environment variables. They must be kept in sync, and if/when any of their values change, they must be changed in many places.

Move as many as possible into config.yaml. It will be encrypted and stored in Google Cloud Storage by gupload_encrypted.sh, and decrypted and processed by pkg/config during each service's init().

Timestamps are lost near the end of the pipeline

Type of Issue

Bug (something doesn't work as expected)

Describe the issue

The Timestamps map, with Begin/End timestamps for each pipeline stage, is present in the response from TaggingQAComplete but is not present in the response from CompletionProcessing.

To Reproduce

See StackDriver logs

Expected behavior

The complete Timestamps map should be included in the Request when CompletionProcessing finishes with it. That allows us to persist the Request with timestamps, and perform later analysis on duration of processing by each pipeline stage, based on those timestamps.

feat: ML transcription of audio file

Type of Issue

Feature (add missing functionality)

Describe the issue

Given an audio file, submit it to an ML transcription service like Google Cloud Speech-to-Text or Sonix.ai, and receive a text transcript back.

Additional context

Audio files expected to be captured by Twilio or equivalent, from incoming customer calls or outbound salesperson calls.

The audio format recorded needs to be supported by the ML transcription service(s). Which format(s) match up are TBD.

How audio capture is done is outside the scope of this feature; assume it's been done and the audio recording exists.

The resulting transcription will be in some text format (e.g., .docx, .rtf, .txt) that is human-readable and readily edited.

Transcription quality is expected to improve over time as the ML transcription services improve. Nonetheless, quality control - e.g., via human reviewers - is planned but is again outside the scope of this feature.

support command line arguments

Goal: implement basic command line argument support.

Approach: accept a few mundane options on the command line, as a starting point.

Design: use the standard library flag package to handle optional flags:

port an int to specify the network port the app should listen on; default :8080
config=path/filename a string to specify the location of the config file; default "" (empty string)
v to specify verbose output; default is normal output
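
A minimal sketch of the three flags using the standard library flag package. The options struct and parseArgs are hypothetical names; using a dedicated FlagSet rather than the package-level flags is a choice made here so the parsing can be exercised from tests without touching os.Args.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the parsed command line settings.
type options struct {
	port    int
	config  string
	verbose bool
}

// parseArgs parses args (without the program name) into options.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("server", flag.ContinueOnError)
	fs.IntVar(&o.port, "port", 8080, "network port the app should listen on")
	fs.StringVar(&o.config, "config", "", "path/filename of the config file")
	fs.BoolVar(&o.verbose, "v", false, "verbose output")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2) // flag has already printed the error and usage
	}
	fmt.Printf("listening on :%d, config=%q, verbose=%t\n", o.port, o.config, o.verbose)
}
```

For example, running with -port 9090 -config=./config.yaml -v would yield port 9090, that config path, and verbose output enabled.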

return HTTP 202 Accepted with status polling link

Type of Issue

Bug (something doesn't work as expected)

Describe the issue

The default service (cmd/server/main.go) currently returns HTTP 202 Accepted when it receives a request, but provides no mechanism for the client to check status, recognize the request has been processed, or retrieve the transcript.

The HTTP 202 Accepted should include a URL the client can use to test status and eventually - minutes, hours or days later - retrieve the completed transcript.

E.g., the HTTP 202 Accepted can include /queue/12345 to indicate the client should check status using GET /queue/12345.

implement handling GET /queue/{uuid}

The response when the request has completed should include the URL to retrieve the transcript.

E.g., the response to GET /queue/12345 upon request completion can be HTTP 303 See Other and include /transcript/12345 to indicate the client can retrieve the transcript using GET /transcript/12345.

implement handling GET /transcript/{uuid}

(This is based on REST and long-running jobs.)
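
A rough sketch of the 202 Accepted and 303 See Other flow described above. The issue does not say where in the response the URL goes; returning it in a Location header is an assumption here (the response body would be another option). The in-memory completed map, the fixed IDs, and the probe helper are hypothetical stand-ins for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// completed is a hypothetical in-memory stand-in for stored requests:
// request ID -> whether processing has finished.
var completed = map[string]bool{"12345": true, "67890": false}

// postHandler accepts a request and points the client at its status URL.
func postHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	id := "67890" // placeholder; a real handler would generate a unique ID
	w.Header().Set("Location", "/queue/"+id)
	w.WriteHeader(http.StatusAccepted)
}

// queueHandler reports status, redirecting to the transcript once complete.
func queueHandler(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/queue/")
	isDone, ok := completed[id]
	switch {
	case !ok:
		http.NotFound(w, r)
	case isDone:
		w.Header().Set("Location", "/transcript/"+id)
		w.WriteHeader(http.StatusSeeOther)
	default:
		fmt.Fprintln(w, `{"status":"PENDING"}`)
	}
}

func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/requests", postHandler)
	mux.HandleFunc("/queue/", queueHandler)
	return mux
}

// probe sends one in-memory request and returns the status code and Location.
func probe(method, path string) (int, string) {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Header().Get("Location")
}

func main() {
	fmt.Println(probe(http.MethodPost, "/requests")) // 202 /queue/67890
	fmt.Println(probe(http.MethodGet, "/queue/12345")) // 303 /transcript/12345
	fmt.Println(probe(http.MethodGet, "/queue/67890"))
}
```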

implement status-specific response handling

See interface.MD, e.g., for GET /queues/12345 when the status of request 12345 is PENDING the response will include an eta value with the estimated time of completion.

create documentation

Update README.md to reflect current capabilities.

set request ID that propagates through pipeline

Type of Issue

Feature (add missing functionality)

Describe the issue

Need to be able to identify and track a request as it works through the pipeline.

Currently each task added to a Cloud Tasks queue gets a system-defined "taskname" (number) which is unique to that task. When our handler processes that request and adds it to the next queue in the pipeline, a new, different and unrelated system-defined taskname is created.

Because there's no connection between the original request and the tasks created as the request moves through the pipeline, it's very difficult to follow a specific task from beginning to end.

Expected behavior

Add a request ID to each incoming request that propagates through the processing pipeline.

Logging must include that request ID to simplify following a request through the pipeline.

Additional context

Recommend using github.com/google/uuid; it's the most actively maintained of the popular UUID implementations. See the note in func NewRandom() re: the probability of creating colliding UUIDs.

retain (or not) intermediate forms of the transcript

Type of Issue

Bug (something doesn't work as expected)

Describe the issue

Currently the WorkingTranscript field in a Request holds the in-progress transcript as a Request moves through the pipeline stages. cmd/completionProcessing/main.go/taskHandler processes that into a customer-ready form and saves that in FinalTranscript field. Currently both WorkingTranscript and FinalTranscript are retained in the final Request stored in the database.

Decide whether we want to retain some or all intermediate stages of the transcript, e.g., before tagging is done, before processing to be customer-ready.
Modify the implementation accordingly.

Expected behavior

The final transcript is retained, along with any desired intermediate transcript forms. Others are removed before the final Request is saved in the database.

Additional context

To be able to analyze/compare future results vs. past results, we must retain those past results. The more we retain, the more we'll be able to compare. There is little near-term value in this, only future value; and even the future value is hypothetical. E.g., perhaps we can plot past improvements in specific services and forecast future improvements?

404 getting favicon.ico

Type of Issue

Bug (something doesn't work as expected)

Describe the issue

GET /favicon.ico produces 404 Not Found.

To Reproduce

Open browser and developer tools
gcloud app browse or https://elated-practice-224603.appspot.com/
Observe StackDriver log: 404 for GET /favicon.ico request

Expected behavior

Repeat steps 1 and 2 above; step 3 should result in 200 for the GET /favicon.ico request.

Additional context

favicon.ico was removed from the project because it wasn't being served correctly. Need to add it and serve it correctly.

proper local queueing to aid localhost dev/test

Type of Issue

Feature (add missing functionality)

Describe the issue

Add a file system-based queueing service so requests can cascade from one pipeline stage (service/process) to the next when running locally.

Expected behavior

Local execution should mimic the operational characteristics of the cloud-deployed application.

Additional context

The new queueing model should make this fairly straightforward, which would be a big boon to development and testing. E.g., currently cmd/server/main.go/postHandler assigns a UUID to RequestID, but during local execution and testing there's currently no mechanism to pass the request and its UUID to the next pipeline stage.

Sending a new request directly to the next pipeline stage results in an all-zero UUID. Since that UUID is used as the Firestore document ID for each request, this really messes up local testing.

Add consistency and structure to issues, pull requests and commit msgs

Type of Issue

Chore

Describe the issue

Create MD templates for Issues and Pull Requests, and a text template for use with git commit.

Expected behavior

Revise the MD template that will populate a new Pull Request in VSCode; see VSCode GitHub Pull Request Extension
Add an MD template to use when creating a new issue.
Add a text template for Git commit messages. See VSCode Git commit message templates #7830

add initial middleware

Type of Issue

Feature (add missing functionality)

Describe the issue

Implement initial middleware services.

Expected behavior

Log the start and end times of each HTTP request.
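
A minimal sketch of such middleware using only the standard library. logRequests and the capture helper are hypothetical names. The start and end times come from the logger's own timestamp prefix (log.LstdFlags with log.Lmicroseconds), and the elapsed time is added to the end line.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

// logRequests wraps next and logs when each request begins and ends.
func logRequests(logger *log.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		logger.Printf("begin %s %s", r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
		logger.Printf("end %s %s after %s", r.Method, r.URL.Path, time.Since(start))
	})
}

// capture serves one in-memory GET through the middleware and returns the
// number of log lines produced along with the log text.
func capture(path string) (int, string) {
	var buf bytes.Buffer
	logger := log.New(&buf, "", log.LstdFlags|log.Lmicroseconds)
	h := logRequests(logger, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	return bytes.Count(buf.Bytes(), []byte("\n")), buf.String()
}

func main() {
	_, text := capture("/about")
	fmt.Print(text)
}
```

In a service, the same wrapper would go around the router, e.g. http.ListenAndServe(addr, logRequests(logger, mux)).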

implement response body as defined in interface.MD

Type of Issue

Bug (something doesn't work as expected)

Describe the issue

The interface.MD document defines the JSON response body, but the current implementation does not match that definition.

Expected behavior

The response body for a successful (accepted) request should be as documented, e.g.:

{
  "RequestID": "269fd581-35aa-465d-81df-c0295034c723",
  "CreatedAt": "2019-06-22T21:54:56.714Z"
}

Additional context

See also issue #32 which will likely result in additional changes to the implementation of the response body.

implement HTTP GET /status/:uuid

Type of Issue

Feature (add missing functionality)

Describe the issue

Complete implementation of HTTP GET /status/:uuid to get the status of an Accepted request.

Expected behavior

A successful HTTP POST /requests returns 202 Accepted and the response body contains poll_endpoint with the URI (including /status/:uuid) to poll for status of the request.

This issue is to fully implement that GET /status/:uuid endpoint, as documented in interface.MD.

use all recognition services: diarization (identify speakers), punctuation, etc.

Type of Issue

Feature (add missing functionality)

Describe the issue

At this stage we specify rudimentary transcription without asking for many additional features available from Google Cloud Speech-to-Text, including:

Speaker diarization, separate different speakers in an audio recording
Punctuation, recognize commas, question marks, and periods in transcription requests
Speech adaptation, providing specific phrases to recognize from your audio data - e.g., "our special offer" - to improve the output for speech transcription. Includes specifying "Classes" - e.g., $MONTH, $MONEY, $POSTALCODE, $FULLPHONENUM - so those concepts are more likely to be correctly transcribed.

Expected behavior

Take full advantage of all applicable recognition services, to produce the best possible transcript.

Tagging adds all Findings, TaggingQA removes all but the best

Type of Issue

Feature (add missing functionality)

Describe the issue

The initial tagging service implementation adds only the top-Likelihood findings (tags) to the Request object; that's a premature optimization. tagging should add all findings to Request, and leave it to later pipeline stages to identify which to report back to the user.

The taggingQA service is where that selection should occur.

Expected behavior

All Findings (matches) are added to the Request object by tagging.

Only the best Findings are retained; taggingQA deletes the others.

add basic routing

Type of Issue

Feature (add missing functionality)

Describe the issue

Implement basic routing with handlers for "/", "/home" and "/about"

add timestamps to requests as they move through the pipeline

Type of Issue

Feature (add missing functionality)

Describe the issue

Understanding how much processing each request requires, and how long that processing takes, is key to managing performance and cost. To ensure that information is captured and available for each request:

Capture start and end times for request processing in each pipeline stage, and store within the request itself.
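
One way this could be sketched, assuming a Timestamps map keyed by stage name. The Request and StageTimes types and the runStage helper are hypothetical and are not taken from the project's code.

```go
package main

import (
	"fmt"
	"time"
)

// StageTimes holds the Begin/End timestamps for one pipeline stage.
type StageTimes struct {
	Begin time.Time
	End   time.Time
}

// Duration reports how long the stage took.
func (s StageTimes) Duration() time.Duration { return s.End.Sub(s.Begin) }

// Request is a hypothetical, pared-down stand-in for the project's Request.
type Request struct {
	RequestID  string
	Timestamps map[string]StageTimes
}

// runStage records Begin and End around the stage's work and stores them
// within the request itself.
func runStage(req *Request, stage string, work func(*Request)) {
	if req.Timestamps == nil {
		req.Timestamps = make(map[string]StageTimes)
	}
	st := StageTimes{Begin: time.Now().UTC()}
	work(req)
	st.End = time.Now().UTC()
	req.Timestamps[stage] = st
}

func main() {
	req := &Request{RequestID: "example"}
	runStage(req, "transcriptionGCP", func(r *Request) { time.Sleep(10 * time.Millisecond) })
	runStage(req, "completionProcessing", func(r *Request) {})
	for stage, st := range req.Timestamps {
		fmt.Printf("%s took %s\n", stage, st.Duration())
	}
}
```

Because the timestamps travel inside the request, each stage only adds an entry; whether and when to persist them is a separate decision.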

Additional context

A fast-to-process request will see many updates to the request in a short period of time, which - at scale - can be a challenging scenario for some cloud storage services. Leave open the possibility of using a write-back cache to consolidate multiple updates to a request, to reduce the number of writes to persistent storage.