notai-tech / fastpunct
Punctuation restoration and spell correction experiments.
License: MIT License
The fastPunct.punct() function takes a correct boolean argument, which is supposed to trigger text correction. However, the model corrects text even when correct is set to False. Steps to reproduce:
model = FastPunct('english', checkpoint_local_path=str(models.get_unzip('zenai-models/punct/FastPunct_2_0_2_en.zip')))
model.punct('effortless', correct=True) --> 'Easy, easy.'
model.punct('effortless', correct=False) --> 'Easy, easy.'
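One way to catch this from the caller's side is to compare the words of the input and the output while ignoring case and punctuation. The sketch below is a hypothetical helper (words_unchanged is not part of fastPunct) and assumes only that the model returns a plain string.

```python
import re


def _words(text):
    # Lowercase, drop everything except word characters and whitespace, then split.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def words_unchanged(original, punctuated):
    """Return True if the two strings differ only in case and punctuation."""
    return _words(original) == _words(punctuated)
```

With correct=False, a check like this should pass for every output; the 'effortless' example above would fail it.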
Hi,
I am trying to run the fastpunct.py script as is, but I am facing the following issue:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-5-57df30c6aaa4> in <module>
----> 1 print(fastpunct.punct(["call haris mom", "oh i thought you were here", "where are you going", "in theory everyone knows what a comma is", "hey how are you doing", "my name is sheela i am in love with hrithik"]))
<ipython-input-3-de41481cee39> in punct(self, input_texts, batch_size)
101
102 def punct(self, input_texts, batch_size=32):
--> 103 return decode(self.model, self.parameters, input_texts, self.allowed_extras, batch_size)
104
105 def fastpunct(self, input_texts, batch_size=32):
<ipython-input-3-de41481cee39> in decode(model, parameters, input_texts, allowed_extras, batch_size)
18 curr_char_index = [i - extra_char_count[j] for j in range(len(input_texts))]
19 input_encodings = np.argmax(input_sequences, axis=2)
---> 20 cur_inp_list = [input_encodings[_][curr_char_index[_]] for _ in range(len(input_texts))]
21 output_tokens = model.predict([input_sequences, target_seq_hot], batch_size=batch_size)
22 sampled_possible_indices = np.argsort(output_tokens[:, i, :])[:, ::-1].tolist()
<ipython-input-3-de41481cee39> in <listcomp>(.0)
18 curr_char_index = [i - extra_char_count[j] for j in range(len(input_texts))]
19 input_encodings = np.argmax(input_sequences, axis=2)
---> 20 cur_inp_list = [input_encodings[_][curr_char_index[_]] for _ in range(len(input_texts))]
21 output_tokens = model.predict([input_sequences, target_seq_hot], batch_size=batch_size)
22 sampled_possible_indices = np.argsort(output_tokens[:, i, :])[:, ::-1].tolist()
IndexError: index 43 is out of bounds for axis 0 with size 43
Any suggestions/workarounds?
Not sure if this is within scope, but the model can't detect where the input text should be split with a period. In other words, it can't detect whether an input actually contains two sentences.
e.g.
fastpunct.punct([
"There are three ways to slice a fish on the left on the right and on the middle after you sliced the fish you can go to the house"], correct=True)
yields
['There are three ways to slice a fish on the left, on the right, and on the middle, after you sliced the fish, you can go to the house.']
What is the format of the training data set? Knowing it would help with fine-tuning the model on contextual data.
The EMR needs model files to be downloaded to /tmp, and the default home /var/lib/livy is inaccessible to users.
So, I managed to download the content separately in my case, but even passing the weights path in the parameters didn't work.
Hi! I am hoping to use this in an iOS project. Could you convert the model to CoreML or at least provide the frozen model (.pb file)? Thanks
Hello all,
Thank you for your amazing effort in this area. Could you share how we can contribute to this work by adding support for new languages like Arabic? Also, could you share the training code, please?
Best regards,
Abdullah
Hi!
Thank you so much for the model. I am trying to use it for a project on iOS. I have successfully converted it using CoreMLTools but now I am trying to use it. Would you please be so kind as to provide documentation on the expected input of the model? After conversion I see:
The only thing I had done was add:
import coremltools
import tensorflow as tf
and
mlmodel = coremltools.convert(self.model)
mlmodel.save('punctuation.mlmodel')
(to the init)
It would be nice if in future iterations, the input/output could be a little more straightforward.
Thanks!
Hey!
The parameter_dict.pkl file and the fastpunct_eng_weights.h5 file are not uploaded.
Hey there!
I started reading about text correction with Deep Learning and most certainly read all your blog posts.
But I still wonder why you would choose a Seq2Seq network for punctuation restoration over a classification network that decides, for each token, whether it should be followed by some sort of punctuation. For a Seq2Seq model, you have to make sure the network does not change anything but the punctuation in the sequence; in a classification network you get this out of the box.
Does a Seq2Seq model perform better or is this easier to train for this purpose? If so, could you elaborate on why?
Or is this maybe part of a greater goal to do punctuation restoration, spelling, and grammar correction with one large Seq2Seq network?
Also, what is your network input? As far as I can tell, you are not using word embeddings. I guess this is why your model checkpoint is so small.
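To make the contrast concrete, here is a minimal sketch of how the output of a per-token classifier could be applied. The label names and the apply_punctuation helper are assumptions for illustration and say nothing about how fastPunct is implemented. The point is only that the input words pass through untouched.

```python
# Hypothetical label set: each label names the mark that follows a word.
LABEL_TO_MARK = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def apply_punctuation(words, labels):
    """Attach the predicted mark to each word; the words themselves never change."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    return " ".join(word + LABEL_TO_MARK[label] for word, label in zip(words, labels))
```

Capitalisation is not handled here; in this kind of setup it would need its own label per word.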
If I install fastPunct without having TF/Keras installed already, it won't throw an error or install the dependency; it will just fail at runtime. It looks like the fix should be as easy as adding them to the existing list of required packages.
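Until the dependency is declared in the package metadata, a caller could guard against the late failure with an explicit check at startup. This is a sketch under assumptions: missing_packages and require are hypothetical helpers, and the package names passed in (for example "tensorflow") are whatever your installed version of the library actually imports.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require(names):
    # Raise a clear error up front instead of failing later at call time.
    missing = missing_packages(names)
    if missing:
        raise ImportError("missing required packages: " + ", ".join(missing))
```

Calling require(["tensorflow"]) before constructing the model would then report the missing package by name.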
fastPunct punctuation fails with the error quoted below if the input text is longer than around 400 characters.
To replicate, run the fastPunct.punct method with any input string of more than 400 characters.
input_text_len 407
File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 175, in punct
return decode(self.model, self.parameters, input_texts, self.allowed_extras, batch_size)
File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 119, in decode
outputs = [out_dict[text] for text in input_texts_c]
File "/opt/conda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 119, in <listcomp>
outputs = [out_dict[text] for text in input_texts_c]
KeyError: "lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the dkdd may be dd"
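A possible workaround, not a fix for the underlying bug, is to split long input into shorter pieces before calling the model. The helper below is a hypothetical sketch; the default of 300 characters is an assumption chosen to stay under the roughly 400-character threshold reported above.

```python
def split_into_chunks(text, max_chars=300):
    """Split text on whitespace into pieces of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            # Keep growing the current chunk.
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this word.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to punct separately and the results joined. Chunk boundaries may fall in the middle of a sentence, so punctuation near the seams is likely to be less reliable.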
Hi there,
Can you please add support for Dutch or tell us how to set this up ourselves/train this?
input = ['My name is sid and i want to become a data scientist.']
output =['Y name is Sid, and I want to become a data scientist.']
It removes the M from the start, which is weird.
If the input contains a hyphen, the hyphen is missing from the output. I want to keep the hyphen in the output as well. How can I do this?
Example
Input
Last week it was the return of the world's longest flight -- Singapore to New York JFK. This week comes another new aviation record: the world's longest flight in a single-aisle aircraft. Air Transat flight TS690 flew transatlantic from Montreal, Canada, to Athens, Greece, on Monday -- a journey of 7,600 kilometers, or 4,754 miles. So far, so normal -- except the eight-hour, 32-minute flight was performed in a narrowbody Airbus A321neoLR.
Output
Last week it was the return of the world's longest flight Singapore to New York JFK this week comes another new aviation record: the world's longest flight in a Singleaisle aircraft air Transat flight TS690 flew transatlantic from Montreal Canada to Athens Greece on Monday a journey of 7,600 kilometers, or 475.4 miles so far. So normal Except the Eighthour 32Minute flight was performed in a Narrowbody Airbus A321neolR.
The PyPI version can be automatically reverted to the latest commit on the develop branch.
Hi! In some cases I'm getting unwanted whitespace around the quotation marks, e.g.:
fastpunct.punct('i m going for sure explained my friend')
-> " I'm going for sure ", explained my friend.
But
fastpunct.punct('im going for sure explained my friend')
-> "I'm going for sure", explained my friend.
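Until this is handled in the model output, the stray spaces can be removed in a post-processing step. The helper below is a hypothetical sketch that assumes straight double quotes and that every opening quote has a matching closing quote.

```python
import re


def tighten_quotes(text):
    """Remove whitespace just inside each pair of straight double quotes."""
    return re.sub(r'"\s*(.*?)\s*"', r'"\1"', text)
```

Output that is already tight, like the second example above, passes through unchanged.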
How and when would it be possible to use fastPunct for the German language?
>>> from fastpunct import FastPunct
>>> fastpunct = FastPunct()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 46, in __init__
self.tokenizer = T5Tokenizer.from_pretrained(lang_path)
File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
**kwargs,
File "/home/soldanm/anaconda3/envs/test/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1782, in _from_pretrained
init_kwargs = json.load(tokenizer_config_handle)
File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 296, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/soldanm/anaconda3/envs/test/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Please provide me with some intuition on how to overcome this issue.
I have tried it in a Python 3.7 environment without CUDA.
Succefully Downloaded to: /home/ubuntu/.fastPunct_en/params.pkl
2020-07-26 00:38:16.243790: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-07-26 00:38:16.243966: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-07-26 00:38:16.244079: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (VM-0-7-ubuntu): /proc/driver/nvidia/version does not exist
2020-07-26 00:38:16.244661: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-26 00:38:16.545005: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2394445000 Hz
2020-07-26 00:38:16.545635: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe014000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-26 00:38:16.545671: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
File "", line 1, in
File "/home/ubuntu/miniconda/lib/python3.7/site-packages/fastpunct/fastpunct.py", line 170, in __init__
self.model.load_weights(weights_path)
File "/home/ubuntu/miniconda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 250, in load_weights
...
Hi. I am trying to restore punctuation on an auto-generated transcript and I see this in the console:
Token indices sequence length is longer than the specified maximum sequence length for this model (998 > 512). Running this sequence through the model will result in indexing errors
Is it possible to increase the limit?
I can't find any scores for the model, such as precision, recall, or F-score, or the data it was trained on.
Could you please give some idea of how the model performs under ideal conditions?
It takes fastpunct around 5-7 seconds to process one short sentence. I have:
Ubuntu 18.04 on aws
Python 3.6.9
Tensorflow 1.14.0
I am wondering what I'm doing wrong. I've tried this both as a straight cmdline call and using zerorpc, which is what I'd ultimately like it to do in order to load the training first. Right now, it's unusable as I basically need real-time results.
Thank you.
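If most of those seconds go to constructing the model rather than to inference (an assumption; the report does not say), keeping one loaded instance alive in a long-running process should help. A minimal sketch, where load_model is a stand-in for the real constructor call:

```python
from functools import lru_cache


def load_model():
    # Stand-in for the expensive step (building the network, reading weights).
    # Replace the body with the real constructor call in your environment.
    return object()


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and return the same instance afterwards."""
    return load_model()
```

A server process, such as the zerorpc setup mentioned above, would call get_model() in each request handler and pay the loading cost only once.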
Hi, I ran this example on a GeForce RTX 2080 Ti, but it took more than one minute. Is that slower than expected?
Thanks!