ibmstreams / streamsx.speech2text

(incubation) Provides ability to transform speech to text in a Streams application

Home Page: http://ibmstreams.github.io/streamsx.speech2text

License: Apache License 2.0

Languages: Makefile 21.92%, Shell 40.21%, Java 37.87%

Topics: speech2text, ibm-streams, stream-processing, speech-to-text

streamsx.speech2text's Introduction

streamsx.speech2text Repository

This repository provides supporting applications and solutions, as well as microservices, for transforming speech to text and analyzing the result in a Streams application, using the Speech2Text Toolkit that is included with the product.

Check out this video about how Verizon is using Speech2Text in Streams: https://youtu.be/Zg-_BJt6jdc

This is NOT the Speech2Text Toolkit

Using the Speech2Text toolkit with the WatsonS2T operator requires purchasing the IBM Streams product. The toolkit itself is provided as a separate download at no extra cost.

Build toolkit

  1. Install cyrus-sasl-devel.x86_64 (this is only needed for the dps toolkit, i.e. the CallState application): yum install cyrus-sasl-devel.x86_64
  2. Run ant: ant

streamsx.speech2text's People

Contributors

alex-cook4, chanskw

streamsx.speech2text's Issues

Project population and graduation status

Please populate repository.

To graduate from incubation phase, please keep GRADUATION_STATUS.md up to date.

When ready to move out of the incubation phase, please open an issue in the streamsx.administration project. PMC members will review the GRADUATION_STATUS.md page and vote on whether the project is ready to move out of incubation.

Naming of some attributes/types is deceptive

I haven't started changing attribute names yet because I want to try to keep the work backwards compatible for now. This issue is more for guidance when we decide we can really overhaul things.

Since I keep getting confused, I'm working on adding documentation on what each type means. I will call out attributes that I believe are misleading.

In the Utterance type:

type Utterance =
    tuple<rstring callId,                       // ipAddress + captureSeconds -> In an environment where
                                                // each speaker is on a separate RTP stream, the callId
                                                // is effectively the ID for that speaker's stream.
                                                // In the one-speaker to one-stream case, CTI correlation
                                                // must be done.
    int32 utteranceNumber,                      // The utterance number for a given RTP stream, [0,1,...]
    float64 utteranceStartTime,                 // Seconds of audio processed for a given RTP stream
                                                // up to the start of the utterance
    float64 utteranceEndTime,                   // Seconds of audio processed for a given RTP stream
                                                // up to the end of the utterance
    uint32 captureSeconds,                      // This refers to the capture time in seconds of the first
                                                // RTP packet in the SSRC stream
    rstring role,                               // role = "AGENT" -- this is currently useless
    rstring utterance,                          // The text of a single utterance
    int32 speakerId,                            // Not used - based on a channel id that is set to 0, since
                                                // we only handle a single channel at a time
    rstring callCenter,                         // ID for the call center the utterance is coming from
    float64 utteranceConfidence,                // Statistical confidence in the transcription of the utterance
    list<float64> utteranceTokenConfidences/*,  // Statistical confidence in each token/word of the utterance
    list<int32> utteranceSpeakers,              // If using diarization, speaker of each token/word
    list<rstring> nBestHypotheses*/>;           // Alternative guesses for the utterance text

I recommend the following:

  • callId -> rtpStreamId: this is not actually the id of a call, because each stream carries only a single speaker. The true call id comes from CTI correlation and would cover several of these "callId" values.
  • captureSeconds -> rtpStreamsStartTime: since it actually refers to the captureSeconds of the first packet in the RTP stream
  • role -> REMOVE: unless there are plans to support this in some way
  • speakerId -> REMOVE: unless there are plans to support this in some way
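
To keep existing consumers working while the names change, tuples could be translated at a boundary instead of being renamed everywhere at once. The sketch below is in Python, not SPL, and models a tuple as a plain dict. The function name to_proposed_names, the dict representation, and the sample values are illustrative assumptions, not part of this repository.

```python
# Proposed renames from the list above (old attribute name -> new attribute name).
RENAMES = {
    "callId": "rtpStreamId",
    "captureSeconds": "rtpStreamsStartTime",
}

# Attributes proposed for removal unless support for them is planned.
REMOVED = {"role", "speakerId"}


def to_proposed_names(utterance: dict) -> dict:
    """Return a copy of an Utterance-like dict that uses the proposed names.

    The input dict is left untouched, so code that still reads the old
    attribute names keeps working.
    """
    return {
        RENAMES.get(name, name): value
        for name, value in utterance.items()
        if name not in REMOVED
    }


if __name__ == "__main__":
    # Hypothetical sample tuple; the callId format is made up for illustration.
    old = {
        "callId": "example-stream-id",
        "captureSeconds": 1500000000,
        "role": "AGENT",
        "speakerId": 0,
        "utterance": "hello",
    }
    print(to_proposed_names(old))
```

Attributes not mentioned in the list (for example utterance or callCenter) pass through unchanged.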

As I see other types/attributes I think could be cleaned up, I will add them to this issue.

CallCenter application should leverage "transcriptionComplete" attribute

Newer versions of the WatsonS2T operator have a "transcriptionComplete" field that is set when sending the last utterance for a given id (in the case of files, it's the last utterance of the file).

Currently, parts of the code rely on utteranceNumber == -1 to indicate that an RTPStream has finished processing. Although this tuple with -1 is emitted most of the time, it is emitted only when no "partialUtterance" was found at the moment the reset signal was received, so it is unreliable.

The -1 tuple is most likely to be missing when a call is disconnected mid-sentence; otherwise it is less likely to be missing.
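
The check suggested by this issue could look roughly like the following. This is a Python sketch, not SPL, with tuples modeled as dicts. The helper name stream_finished is hypothetical, and treating a missing "transcriptionComplete" attribute as false (as with older operator versions) is an assumption.

```python
def stream_finished(utterance: dict) -> bool:
    """Decide whether an Utterance-like dict marks the end of its RTP stream."""
    # Preferred signal: newer WatsonS2T versions set this on the last
    # utterance sent for a given id.
    if utterance.get("transcriptionComplete", False):
        return True
    # Legacy signal: the -1 tuple is not always emitted, so it is kept
    # only as a fallback for tuples that lack the newer attribute.
    return utterance.get("utteranceNumber") == -1
```

Checking the flag first means a stream whose -1 tuple was never emitted is still recognized as finished, provided the operator version sets "transcriptionComplete".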
