kyegomez / usm

Implementation of Google's USM speech model in Pytorch

Home Page: https://discord.gg/GYbXvDGevY

License: MIT License


USM

Implementation of Google's Universal Speech Model (USM) from the paper "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages". I'm implementing this mostly because Gemini, Google's new multi-modality foundation model, uses it. Check out our Gemini implementation here:

Install

pip install usm-torch

Usage

import torch
from usm_torch import USMEncoder

# Initialize model
model = USMEncoder(
    dim=80,  # Dimension of the input
    heads=4,  # Number of attention heads
    ff_dim=128,  # Dimension of the feed-forward layer
    depth=4,  # Number of transformer layers
    depthwise_conv_kernel_size=31,  # Kernel size for depthwise convolution
    dropout=0.5,  # Dropout rate
)

# Example input
batch_size = 10  # Number of samples in a batch
max_length = 400  # Maximum length of the input sequence
lengths = torch.randint(1, max_length, (batch_size,))  # Randomly generate sequence lengths
inputs = torch.rand(batch_size, int(lengths.max()), 80)  # Randomly generate input tensor

# Forward pass
outputs, output_lengths = model(inputs, lengths)  # Perform forward pass
print(f"outputs.shape: {outputs.shape}")  # Print the shape of the output tensor
print(f"output_lengths.shape: {output_lengths.shape}")  # Print the shape of the output lengths tensor

License

MIT

Citation

@misc{zhang2023google,
    title={Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages}, 
    author={Yu Zhang and Wei Han and James Qin and Yongqiang Wang and Ankur Bapna and Zhehuai Chen and Nanxin Chen and Bo Li and Vera Axelrod and Gary Wang and Zhong Meng and Ke Hu and Andrew Rosenberg and Rohit Prabhavalkar and Daniel S. Park and Parisa Haghani and Jason Riesa and Ginger Perng and Hagen Soltau and Trevor Strohman and Bhuvana Ramabhadran and Tara Sainath and Pedro Moreno and Chung-Cheng Chiu and Johan Schalkwyk and Françoise Beaufays and Yonghui Wu},
    year={2023},
    eprint={2303.01037},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Todo

  • Implement the projection -> cosine similarity -> codebook quantizer
  • Implement chunk-wise attention
  • Implement paired input with the text encoder: embed extractor -> resampler -> refiner -> text embedding, with RNN-T reconstruction loss
  • Text input: text input -> speech encoder -> text decoder -> RNN-T reconstruction
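
The first Todo item (projection, cosine similarity, codebook lookup) can be read as a random-projection quantizer in the style of BEST-RQ, which the USM paper uses to produce pre-training targets. The sketch below is a minimal illustration under that assumption and is not code from usm_torch: the class name RandomProjectionQuantizer, its parameters, and the default sizes are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomProjectionQuantizer(nn.Module):
    """Hypothetical helper (not part of usm_torch): maps frames to codebook indices."""

    def __init__(self, input_dim: int, code_dim: int = 16,
                 codebook_size: int = 4096, seed: int = 0):
        super().__init__()
        gen = torch.Generator().manual_seed(seed)
        # Both the projection and the codebook are random and frozen,
        # so they are stored as buffers rather than trainable parameters.
        self.register_buffer("proj", torch.randn(input_dim, code_dim, generator=gen))
        codebook = torch.randn(codebook_size, code_dim, generator=gen)
        self.register_buffer("codebook", F.normalize(codebook, dim=-1))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim)
        z = F.normalize(x @ self.proj, dim=-1)  # (batch, time, code_dim), unit length
        sim = z @ self.codebook.t()             # cosine similarity to every code
        return sim.argmax(dim=-1)               # (batch, time) index of the closest code


quantizer = RandomProjectionQuantizer(input_dim=80, codebook_size=512)
features = torch.rand(2, 50, 80)  # (batch, time, feature dim), random stand-in data
targets = quantizer(features)
print(targets.shape)
```

Because both vectors are L2-normalized, picking the code with the highest cosine similarity is equivalent to picking the nearest code by Euclidean distance. The resulting indices could serve as classification targets for masked frames during pre-training.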


usm's Issues

Chunk-wise Self-attention.

Hi! Thanks for providing the architecture. I am a bit confused about how the chunk-wise self-attention is implemented.
Could you point me to where it is?
I thought it might be an alternative to the torch Conformer API, but that doesn't seem to be the case.
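
For context, the Todo list above still has chunk-wise attention as an open item. One common reading of the idea is self-attention restricted to fixed-size, non-overlapping chunks of frames, which can be sketched with a block-diagonal mask. The helper names chunk_attention_mask and chunkwise_attention below are hypothetical and are not part of usm_torch; this is only an illustration of the masking idea, not the repository's implementation.

```python
import math
import torch


def chunk_attention_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask, True where query i and key j share a chunk."""
    chunk_ids = torch.arange(seq_len) // chunk_size
    return chunk_ids.unsqueeze(0) == chunk_ids.unsqueeze(1)


def chunkwise_attention(q, k, v, chunk_size: int) -> torch.Tensor:
    """Scaled dot-product attention where each frame only sees its own chunk.

    q, k, v: (batch, heads, time, head_dim)
    """
    mask = chunk_attention_mask(q.size(-2), chunk_size).to(q.device)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Positions outside the chunk get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


q = torch.rand(1, 2, 6, 4)
k = torch.rand(1, 2, 6, 4)
v = torch.rand(1, 2, 6, 4)
out = chunkwise_attention(q, k, v, chunk_size=3)
print(out.shape)
```

Every frame always shares a chunk with itself, so no row of the mask is fully blocked and the softmax stays well defined. This version still computes the full score matrix; a more efficient variant would reshape the sequence into chunks before attending.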
