apm1467 / videocr

Extract hardcoded subtitles from videos using machine learning

License: MIT License

Python 100.00%

videocr's Introduction

videocr

Extract hardcoded (burned-in) subtitles from videos using the Tesseract OCR engine with Python.

Input a video with hardcoded subtitles:

# example.py

from videocr import get_subtitles

if __name__ == '__main__':  # This check is mandatory for Windows.
    print(get_subtitles('video.mp4', lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))

$ python3 example.py

Output:

0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ? 
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.

2
00:00:08,091 --> 00:00:10,635
休闲 时 光 …
For relaxing times, make it...

3
00:00:10,677 --> 00:00:12,595
三 得 利 时 光
Bartender, Bob Suntory time.

4
00:00:14,472 --> 00:00:17,142
我 要 一 杯 伏特 加
Un, I'll have a vodka tonic.

5
00:00:18,059 --> 00:00:19,019
谢谢
Laughs Thanks.

Performance

The OCR process is CPU-intensive. It takes 3 minutes on my dual-core laptop to extract subtitles from a 20-second video. More CPU cores will make it faster.

Installation

Install Tesseract and make sure it is in your $PATH
$ pip install videocr

API

Return subtitle string in SRT format

get_subtitles(
    video_path: str, lang='eng', time_start='0:00', time_end='',
    conf_threshold=65, sim_threshold=90, use_fullframe=False)

Write subtitles to file_path

save_subtitles_to_file(
    video_path: str, file_path='subtitle.srt', lang='eng', time_start='0:00', time_end='',
    conf_threshold=65, sim_threshold=90, use_fullframe=False)

Parameters

lang

The language of the subtitles. You can extract subtitles in almost any language. All language codes on this page (e.g. 'eng' for English) and all script names in this repository (e.g. 'HanS' for simplified Chinese) are supported.

Note that you can use more than one language, e.g. lang='hin+eng' for Hindi and English together.

Language files will be automatically downloaded to your ~/tessdata. You can read more about Tesseract language data files on their wiki page.

conf_threshold

Confidence threshold for word predictions. Words with lower confidence than this value will be discarded. The default value 65 is fine for most cases.

Make it closer to 0 if you get too few words in each line, or make it closer to 100 if there are too many excess words in each line.
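
The effect of conf_threshold can be pictured with a minimal sketch. This is not videocr's actual code: the helper name `filter_words` and the sample (word, confidence) pairs are hypothetical, and the only behaviour taken from the text above is that words below the threshold are discarded.

```python
from typing import List, Tuple


def filter_words(words: List[Tuple[str, int]], conf_threshold: int = 65) -> str:
    """Join the words whose confidence is not lower than conf_threshold."""
    kept = [text for text, conf in words if conf >= conf_threshold]
    return ' '.join(kept)


# Hypothetical OCR result for one frame: (word, confidence from 0 to 100).
sample = [('What', 96), ('can', 91), ('|', 12), ('I', 88), ('get', 93), ('you?', 70)]

print(filter_words(sample))                     # What can I get you?
print(filter_words(sample, conf_threshold=95))  # What
```

Lowering the threshold lets more uncertain words through; raising it trims noise at the cost of dropping real words.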

sim_threshold

Similarity threshold for subtitle lines. Subtitle lines with larger Levenshtein ratios than this threshold will be merged together. The default value 90 is fine for most cases.

Make it closer to 0 if you get too many duplicated subtitle lines, or make it closer to 100 if you get too few subtitle lines.
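
To illustrate how a similarity threshold collapses near-duplicate lines, here is a rough sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a Levenshtein ratio, so the scores will not match videocr's exactly; `similarity` and `merge_similar` are hypothetical names, not part of the package.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in for a Levenshtein ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def merge_similar(lines: List[str], sim_threshold: float = 90) -> List[str]:
    """Drop a line when it is more similar to the previously kept line than the threshold."""
    merged: List[str] = []
    for line in lines:
        if merged and similarity(merged[-1], line) > sim_threshold:
            continue  # treated as a repeat of the previous subtitle
        merged.append(line)
    return merged


# Hypothetical per-frame OCR results; the third one contains a misread character.
frames = ['What can I get you?', 'What can I get you?', 'What can I get you7', "Um, I'm not sure."]

print(merge_similar(frames))                    # two lines remain
print(merge_similar(frames, sim_threshold=99))  # the misread variant is kept as a third line
```

With the default of 90 the misread frame is folded into the previous line; at 99 only exact repeats are merged.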

time_start and time_end

Extract subtitles from only a clip of the video. The subtitle timestamps are still calculated according to the full video length.
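
The relationship between a clip start time and full-video timestamps can be sketched as follows. The accepted time formats ('M:SS' and 'H:MM:SS'), the frame rate, and both helper names are assumptions made for illustration; videocr's own parsing may differ.

```python
def time_to_seconds(time_str: str) -> int:
    """Parse 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in time_str.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds


def srt_timestamp(frame_index: int, fps: float) -> str:
    """Format the position of an absolute frame index as an SRT timestamp."""
    total_ms = round(frame_index * 1000 / fps)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'


fps = 25.0                                         # assumed frame rate
first_frame = int(time_to_seconds('1:30') * fps)   # the clip starts at frame 2250
print(srt_timestamp(first_frame, fps))             # 00:01:30,000
```

Because the frame index is counted from the start of the whole video, the first subtitle of a clip starting at 1:30 is stamped 00:01:30 rather than 00:00:00.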

use_fullframe

By default, only the bottom half of each frame is used for OCR. You can explicitly use the full frame if your subtitles are not within the bottom half of each frame.
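
As a small illustration of the cropping described above, the sketch below treats a frame as a list of pixel rows and keeps either all rows or only the bottom half. The function name `select_ocr_region` is hypothetical, and the real package works on OpenCV images rather than nested lists.

```python
from typing import List, Sequence


def select_ocr_region(frame: Sequence[Sequence[int]], use_fullframe: bool = False) -> List[Sequence[int]]:
    """Return all rows, or only the bottom half of the rows."""
    if use_fullframe:
        return list(frame)
    height = len(frame)
    return list(frame[height // 2:])


# A tiny 4-row "frame"; each inner list is one row of grayscale pixels.
frame = [[0, 0], [1, 1], [2, 2], [3, 3]]

print(select_ocr_region(frame))                      # [[2, 2], [3, 3]]
print(select_ocr_region(frame, use_fullframe=True))  # all four rows
```

For a NumPy image array the equivalent slice would be `frame[height // 2:, :]`.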

videocr's Issues

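How to set a stride for reading frames

Hello, and thank you very much; videocr has been a great help to me.
However, my machine is rather slow, so I would like to improve throughput by increasing the interval between the frames that are read. How should I modify the code to do that?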
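
The idea being asked about can be sketched independently of videocr: instead of running OCR on every frame in the requested range, pick every Nth frame index. The function `frame_indices` and its `stride` parameter are hypothetical and not part of the current API.

```python
from typing import Iterator


def frame_indices(start: int, end: int, stride: int = 1) -> Iterator[int]:
    """Yield the frame indices in [start, end) that would be sent to OCR."""
    if stride < 1:
        raise ValueError('stride must be at least 1')
    return iter(range(start, end, stride))


# With a stride of 3, only every third frame is processed.
print(list(frame_indices(0, 10, stride=3)))  # [0, 3, 6, 9]
```

A larger stride reduces OCR work roughly in proportion, at the cost of coarser subtitle start and end times.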

404 error by using the example code

Hi, I'm using the example code but changed eng for jpn, like this:

from videocr import get_subtitles

if __name__ == '__main__':  # This check is mandatory for Windows.
    print(get_subtitles('video.mp4', lang='chi_sim+jpn', sim_threshold=70, conf_threshold=65))

However, I'm getting this result:

Traceback (most recent call last):
  File "/media/user/hdd/test.py", line 6, in <module>
    print(get_subtitles('video.mp4', lang='chi_sim+jpn', sim_threshold=70, conf_threshold=65))
  File "/home/user/.local/lib/python3.10/site-packages/videocr/api.py", line 8, in get_subtitles
    utils.download_lang_data(lang)
  File "/home/user/.local/lib/python3.10/site-packages/videocr/utils.py", line 21, in download_lang_data
    with urlopen(url) as res, open(filepath, 'w+b') as f:
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

I manually downloaded chi_sim.traineddata and jpn.traineddata and placed them in the tessdata folder, but I still get this error.

I'm using Linux Mint 21.

Can anyone help me with how to use this?

I'm trying to understand how to make it work, but it's all very confusing.
I'm using Windows 10, I already have Python installed, I already have tesseract working, added to PATH, but I don't know how to make it work. I tried to follow what is explained in this issue: #2
I created the file get_sub.py

I put the video in the same folder, I put all the scripts in the same folder but when I run, I get this error:

Traceback (most recent call last):
  File "C:\Users\user\Programs\Python 3.7\venv\Lib\site-packages\videocr\get_sub.py", line 3, in <module>
    import video
  File "C:\Users\user\Python 3.7\venv\Lib\site-packages\videocr\video.py", line 8, in <module>
    from . import constants
ImportError: attempted relative import with no known parent package

Could someone please help me?

Doesn't work

Hi. I am trying this project with a couple of videos, but none of them work. Tesseract hammers my CPU, but I see no output.

AttributeError: module 'videocr' has no attribute 'get_subtitles'

What could be the problem here?

PS C:\Users\Wizek\sandbox> tesseract --version
tesseract v5.0.0-alpha.20190708
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5


PS C:\Users\Wizek\sandbox> pip3.7 install videocr
Requirement already satisfied: videocr in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (0.1.5)
Requirement already satisfied: fuzzywuzzy>=0.17 in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (from videocr) (0.17.0)
Requirement already satisfied: python-Levenshtein>=0.12 in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (from videocr) (0.12.0)
Requirement already satisfied: opencv-python<5.0,>=4.1 in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (from videocr) (4.1.1.26)
Requirement already satisfied: pytesseract>=0.2.6 in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (from videocr) (0.3.0)
Requirement already satisfied: setuptools in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (from python-Levenshtein>=0.12->videocr) (41.2.0)
Requirement already satisfied: numpy>=1.14.5 in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (from opencv-python<5.0,>=4.1->videocr) (1.17.2)
Requirement already satisfied: Pillow in c:\users\wizek\appdata\local\programs\python\python37-32\lib\site-packages (from pytesseract>=0.2.6->videocr) (6.1.0)
You are using pip version 19.0.3, however version 19.2.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


PS C:\Users\Wizek\sandbox> py -3.7 .\videocr.py
Traceback (most recent call last):
  File ".\videocr.py", line 4, in <module>
    print(videocr.get_subtitles('1.avi', lang='eng', sim_threshold=70, conf_threshold=65))
AttributeError: module 'videocr' has no attribute 'get_subtitles'
PS C:\Users\Wizek\sandbox>

.\videocr.py:

import videocr

if __name__ == '__main__':
    print(videocr.get_subtitles('1.avi', lang='eng', sim_threshold=70, conf_threshold=65))

Could this be some simple python thing I am missing or mixing up? I'm not very experienced with python.
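
One likely cause, judging from the file name in the traceback: the script itself is called videocr.py, so `import videocr` resolves to the script rather than to the installed package. The helper below is a minimal sketch of a check for that situation; the name `shadows_module` is hypothetical.

```python
from pathlib import Path


def shadows_module(script_name: str, module_name: str) -> bool:
    """Return True if a script with this file name would shadow `module_name` on import."""
    return Path(script_name).stem == module_name


print(shadows_module('videocr.py', 'videocr'))  # True: the script hides the package
print(shadows_module('example.py', 'videocr'))  # False
```

Renaming the script (for example to example.py, as in the README) should let the import find the real package.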

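Can the timestamps be removed from the output?

Thank you very much! If I don't need the timestamps in the output, is there an API option to omit them and output only the text? Thanks.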
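
The README lists no such option, but the SRT string returned by get_subtitles can be post-processed. The following is a sketch under the assumption that the output follows the index / timing / text layout shown in the README example; `srt_to_text` is a hypothetical helper, not part of videocr.

```python
import re
from typing import List

# Matches an SRT timing row such as "00:00:01,042 --> 00:00:02,877".
TIMING = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')


def srt_to_text(srt: str) -> str:
    """Strip the index and timing rows from each SRT block, keeping only the text rows."""
    texts: List[str] = []
    for block in re.split(r'\n\s*\n', srt.strip()):
        rows = block.splitlines()
        if rows and rows[0].strip().isdigit():
            rows = rows[1:]  # drop the subtitle index
        if rows and TIMING.match(rows[0].strip()):
            rows = rows[1:]  # drop the timing row
        texts.extend(rows)
    return '\n'.join(texts)


srt = (
    '0\n00:00:01,042 --> 00:00:02,877\nWhat can I get you?\n\n'
    "1\n00:00:03,044 --> 00:00:05,463\nUm, I'm not sure.\n"
)
print(srt_to_text(srt))
```

In practice the `srt` string would come from `get_subtitles(...)` instead of the literal used here.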

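The TESSDATA path is hardcoded

The TESSDATA path uses the tessdata directory under the home directory rather than the environment variable. The tesseract command itself runs fine, but this code does not, and changing the environment variable has no effect, which is very confusing.

Also, I am on Windows, and even after placing a tessdata folder in my user directory, tessdata still cannot be found.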

Runtime exception

Hello, running it produces the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 262, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 95, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/gfw/project/videocr/videocr/api.py", line 6, in <module>
    from videocr.video import Video
  File "/Users/gfw/project/videocr/videocr/__init__.py", line 2, in <module>
    from .api import get_subtitles, save_subtitles_to_file
  File "/Users/gfw/project/videocr/videocr/api.py", line 40, in <module>
    print(get_subtitles('/Users/111111/视频/六小龄童.flv',
  File "/Users/gfw/project/videocr/videocr/api.py", line 26, in get_subtitles
    v.run_ocr(lang, time_start, time_end, conf_threshold, use_fullframe)
  File "/Users/gfw/project/videocr/videocr/video.py", line 49, in run_ocr
    with Pool() as pool:
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
    return Popen(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

OS:
python: 3.8

error originates from a subprocess. How to fix this error

pip install videocr
Collecting videocr
  Using cached videocr-0.1.6.tar.gz (6.5 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fuzzywuzzy>=0.17 (from videocr)
  Using cached fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-Levenshtein>=0.12 (from videocr)
  Using cached python_Levenshtein-0.25.1-py3-none-any.whl.metadata (3.7 kB)
Collecting opencv-python<5.0,>=4.1 (from videocr)
  Downloading opencv-python-4.9.0.80.tar.gz (92.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.9/92.9 MB 182.4 kB/s eta 0:00:00
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [111 lines of output]
      WARNING: Skip installing pip, this will break the python-pip package (termux).
      Ignoring numpy: markers 'python_version == "3.6" and platform_machine != "aarch64" and platform_machine != "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.7" and platform_machine != "aarch64" and platform_machine != "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.8" and platform_machine != "aarch64" and platform_machine != "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version <= "3.9" and sys_platform == "linux" and platform_machine == "aarch64"' don't match your environment
      Ignoring numpy: markers 'python_version <= "3.9" and sys_platform == "darwin" and platform_machine == "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.9" and platform_machine != "aarch64" and platform_machine != "arm64"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.10" and platform_system != "Darwin"' don't match your environment
      Ignoring numpy: markers 'python_version == "3.10" and platform_system == "Darwin"' don't match your environment
      Collecting cmake>=3.1
        Downloading cmake-3.29.2.tar.gz (30 kB)
        Installing build dependencies: started
        Installing build dependencies: finished with status 'done'
        Getting requirements to build wheel: started
        Getting requirements to build wheel: finished with status 'done'
        Installing backend dependencies: started
        Installing backend dependencies: finished with status 'error'
        error: subprocess-exited-with-error

        × pip subprocess to install backend dependencies did not run successfully.
        │ exit code: 2
        ╰─> [79 lines of output]
            Collecting pathspec
              Downloading pathspec-0.12.1-py3-none-any.whl.metadata (21 kB)
            Collecting ninja>=1.5
              Downloading ninja-1.11.1.1.tar.gz (132 kB)
                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.4/132.4 kB 420.8 kB/s eta 0:00:00
              Installing build dependencies: started
              Installing build dependencies: finished with status 'done'
              Getting requirements to build wheel: started
              Getting requirements to build wheel: finished with status 'done'
              Preparing metadata (pyproject.toml): started
              Preparing metadata (pyproject.toml): finished with status 'done'
            Collecting cmake
              Using cached cmake-3.29.2.tar.gz (30 kB)
            ERROR: Exception:
            Traceback (most recent call last):
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/cli/base_command.py", line 180, in exc_logging_wrapper
                status = run_func(*args)
                         ^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/cli/req_command.py", line 245, in wrapper
                return func(self, options, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/commands/install.py", line 391, in run
                requirement_set = resolver.resolve(
                                  ^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 95, in resolve
                result = self._result = resolver.resolve(
                                        ^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
                state = resolution.resolve(requirements, max_rounds=max_rounds)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
                self._add_to_criteria(self.state.criteria, r, parent=None)
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
                if not criterion.candidates:
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_vendor/resolvelib/structs.py", line 156, in __bool__
                return bool(self._sequence)
                       ^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
                return any(self)
                       ^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
                return (c for c in iterator if id(c) not in self._incompatible_ids)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
                candidate = func()
                            ^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 182, in _make_candidate_from_link
                base: Optional[BaseCandidate] = self._make_base_candidate_from_link(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 228, in _make_base_candidate_from_link
                self._link_candidate_cache[link] = LinkCandidate(
                                                   ^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 290, in __init__
                super().__init__(
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 156, in __init__
                self.dist = self._prepare()
                            ^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 222, in _prepare
                dist = self._prepare_distribution()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 301, in _prepare_distribution
                return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/operations/prepare.py", line 525, in prepare_linked_requirement
                return self._prepare_linked_requirement(req, parallel_builds)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/operations/prepare.py", line 640, in _prepare_linked_requirement
                dist = _get_prepared_distribution(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/operations/prepare.py", line 70, in _get_prepared_distribution
                with build_tracker.track(req, tracker_id):
              File "/data/data/com.termux/files/usr/lib/python3.11/contextlib.py", line 137, in __enter__
                return next(self.gen)
                       ^^^^^^^^^^^^^^
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/operations/build/build_tracker.py", line 137, in track
                self.add(req, tracker_id)
              File "/data/data/com.termux/files/usr/lib/python3.11/site-packages/pip/_internal/operations/build/build_tracker.py", line 103, in add
                raise LookupError(message)
            LookupError: https://files.pythonhosted.org/packages/80/bf/4f9a9f754507992be28b985d1e9b17f93a2271106b5916a212efe1d65205/cmake-3.29.2.tar.gz (from https://pypi.org/simple/cmake/) (requires-python:>=3.7) is already being built: cmake>=3.1 from https://files.pythonhosted.org/packages/80/bf/4f9a9f754507992be28b985d1e9b17f93a2271106b5916a212efe1d65205/cmake-3.29.2.tar.gz
            [end of output]

        note: This error originates from a subprocess, and is likely not a problem with pip.
      error: subprocess-exited-with-error

      × pip subprocess to install backend dependencies did not run successfully.
      │ exit code: 2
      ╰─> See above for output.

      note: This error originates from a subprocess, and is likely not a problem with pip.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

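Recognition rate is low; is the image preprocessed before OCR?

Hello author, I would like to ask whether the image is preprocessed before OCR is run.
Also, could a new feature be added so that not every frame is recognized, and OCR runs only at a configurable time interval?
Thanks!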

Support for python-Levenshtein 0.13.1

After hours of fiddling and having to install numerous packages manually, this package is the one that manages to not install.
python-Levenshtein
The version that it downloads is the 0.12 while there's already a newer version 0.13.1

The old module is way back from 2014. While installing videocr it checks for several packages, downloads them all but throws an error at that above package.

https://pastebin.com/ksSVw1Wa

I already downloaded the other packages and they all pass the requirement except for that one. I also tried installing vs_buildtools (although left the MSVC v142 - VS 2019 C++ x64/x86 build tools (v14.23) and Windows 10 SDK (10.0.18362.0) packages due to storage limitations.)

I'm running Windows 10 x64 version. This whole module obstructs the ability to install videocr and eventually use it.

Tesseract version requirement

Thanks for the 0.1.6 update; it fixes the last issue I opened about the NoneType error.
Now I have some other issue instead. Last time when I had tried this software I think I might have just used the tesseract 4.0 instead of the 3.0. Videocr does indeed ask for >3.05, but all the binaries I tried for 3.05 give this error instead

Now I tried with 4.0.0 instead and surprisingly there's 0 errors at all. But the detection is so bad, there's literally nothing detected at all. I am wondering if you have the compiled binaries for tesseract perhaps for testing or would 4.00/5.00 work too? I think the alpha 5.00 refused to work, saying unrecognized version of tesseract detected or something.

ValueError: invalid literal for int() with base 10

ValueError: 'invalid literal for int() with base 10: '"₪ץ'' (several different words get caught here)

function get_subtitles in api.py at line 11
v.run_ocr(lang, time_start, time_end, conf_threshold, use_fullframe)
function run_ocr in video.py at line 52
for i, data in enumerate(it_ocr)
function <listcomp> in video.py at line 52
for i, data in enumerate(it_ocr)
function __init__ in models.py at line 32
block_num, conf = int(block_num), int(conf)
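
The failing call is a bare `int()` on a field that is not always numeric. A defensive parse along the following lines would avoid the crash; this is only a sketch, `parse_conf` is a hypothetical helper, and it does not address why a non-numeric value ends up in the confidence field in the first place.

```python
from typing import Optional


def parse_conf(raw: str) -> Optional[int]:
    """Return the confidence as an int, or None when the field is not a number."""
    try:
        return int(float(raw))  # also accepts values such as '96.5'
    except (ValueError, OverflowError):
        return None


print(parse_conf('65'))    # 65
print(parse_conf('96.5'))  # 96
print(parse_conf('-'))     # None
```

A caller could then skip any row whose confidence parses to None instead of aborting the whole run.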

RapidVideOCR

I recently used this repo to extract subtitles, but I found Tesseract very hard to use.
I combined RapidOCR with the code from this repo, made a few simple changes, and made it easier to use.
The link is as follows:
https://github.com/SWHL/RapidVideOCR

Could it support GPU acceleration?

It would be more efficient with GPU acceleration. Is it possible to support it?

The URL used in utils/constants.py is no longer valid

Try this webpage and this one.

Can't detect anything.

video.tar.gz
I tried this repo on this video and couldn't get any subtitles.

Any idea why?

Predicting on images

Great work guys.

Quick question. Is it also possible to use this for predicting subtitles on images as well?

Thanks,

Video path problem.

Good day. What is the correct argument to pass for the video_path parameter?

A small problem

I set everything up following the tutorial, but when I start running it I get:
TSVNotSupported: TSV output not supported. Tesseract >= 3.05 required
I don't know how to fix this.

Could not install packages due to an EnvironmentError.

I'm trying to install this, but I keep getting this error.
I've tried looking on Google, but did not find anything.

Installing collected packages: setuptools, wheel, distro, six, pyparsing, packaging, scikit-build, cmake, pip, numpy
      Running setup.py install for numpy: started
      Running setup.py install for numpy: still running...
      Running setup.py install for numpy: finished with status 'done'
  ERROR: Could not install packages due to an EnvironmentError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '"C:'

Is there any way to fix this error?

[REQ] Online G-Colab version

Hi there, this software is very interesting.

We would like to suggest building a Google Colab notebook version so that users can try it online.

Check out (and, why not, ask for help or collaboration from) these other interesting virtual notebooks:

MiXLab by @shirooo39;
Codemaster by @mohitjoshi155

Hope that inspires !

urllib.error.HTTPError: HTTP Error 404: Not Found

Traceback (most recent call last):
  File "run.py", line 7, in <module>
    print(get_subtitles(video, lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))
  File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/api.py", line 8, in get_subtitles
    utils.download_lang_data(lang)
  File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/utils.py", line 21, in download_lang_data
    with urlopen(url) as res, open(filepath, 'w+b') as f:
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Not sure why this is happening. I'm guessing it's a version problem. I'm trying to run the example code with my own video (full system path specified).

@apm1467 Any chance you could provide the exact Python and Tesseract versions you used successfully?

Google Colab compatibility

I'm trying to speed up things by running it in Colab but I get this:

  File "/usr/local/lib/python3.6/dist-packages/videocr/video.py", line 1
    from __future__ import annotations
    ^
SyntaxError: future feature annotations is not defined

Can you make it compatible with python 3.6 so it can be used in Google Colab with a GPU?
Thanks!

how to use this

How to use this

Tesseract output improvement

Hi,

First of all, thank you for your work. I was looking for OCR projects since it's very difficult to find English subtitles for Chinese YouTube shows.

I'm wondering if you've attempted to optimize the Tesseract output with different image processing techniques as illustrated here. The use_fullframe argument could be changed to specific rectangular coordinates. Also, the Tesseract wiki indicates a dark text with light background is preferable so adding an option to invert the colors could be helpful. Binarisation could also help further isolate the subtitles. Finally, I believe adding the --psm 6 option to the Tesseract config to indicate a single uniform block of text would be beneficial.
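
The preprocessing steps suggested here can be sketched on a grayscale image represented as nested lists. The helper names (`crop`, `invert`, `binarise`) and the threshold of 128 are illustrative assumptions; nothing below is part of videocr, and a real implementation would more likely use OpenCV or NumPy operations.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def crop(img: Gray, top: int, bottom: int, left: int, right: int) -> Gray:
    """Keep only the rectangle [top:bottom, left:right]."""
    return [row[left:right] for row in img[top:bottom]]


def invert(img: Gray) -> Gray:
    """Turn light text on a dark background into dark text on a light background."""
    return [[255 - p for p in row] for row in img]


def binarise(img: Gray, threshold: int = 128) -> Gray:
    """Map every pixel to pure black or pure white."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


# Bright "text" pixels (240, 250) on a dark background.
frame = [[20, 30, 240], [25, 250, 35]]
print(binarise(invert(frame)))  # [[255, 255, 0], [255, 0, 255]]
```

The page segmentation mode mentioned above is a separate matter: with pytesseract it would be passed through the `config` argument, for example `config='--psm 6'`.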

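No result after running

I followed the tutorial given in another issue.
However, after finally entering python get_sub.py,
although subtitle.srt appeared in the folder,
it contained no subtitles once the CPU had finished running at full load.
After that, this message appeared: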

Runtime error

OS: Fedora 31
Python: 3.7
Code:

from videocr import get_subtitles

if __name__ == '__main__':  # This check is mandatory for Windows.
    print(get_subtitles('22.mp4', lang='chi_sim+eng', sim_threshold=70, conf_threshold=65, use_fullframe=False))

After launching, the CPU runs at full load, and then this is reported:

Traceback (most recent call last):
  File "t.py", line 8, in <module>
    print(get_subtitles('22.mp4', lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))
  File "/usr/local/lib/python3.7/site-packages/videocr/api.py", line 11, in get_subtitles
    v.run_ocr(lang, time_start, time_end, conf_threshold, use_fullframe)
  File "/usr/local/lib/python3.7/site-packages/videocr/video.py", line 52, in run_ocr
    for i, data in enumerate(it_ocr)
  File "/usr/local/lib/python3.7/site-packages/videocr/video.py", line 52, in <listcomp>
    for i, data in enumerate(it_ocr)
  File "/usr/local/lib/python3.7/site-packages/videocr/models.py", line 32, in __init__
    block_num, conf = int(block_num), int(conf)
ValueError: invalid literal for int() with base 10: '-'

Images for README

support for Baidu's PaddleOCR (supports gpu, newer, and more actively developed compared to Tesseract)

In case this is still being maintained, it would be nice if it could also support PaddleOCR.

I have a working fork here with some added features (e.g. cropping, the ability to skip frames) that I wouldn't mind merging back into the original: https://github.com/oliverfei/videocr-PaddleOCR

urllib.error.URLError: <urlopen error [Errno 61] Connection refused>

What on earth? With so many people using this, has nobody noticed that it no longer runs?

Traceback (most recent call last):
  File "/Users/wangyeming/workspace/python/SubOCR/ocr.py", line 9, in <module>
    save_subtitles_to_file(
  File "/Users/wangyeming/SubOCR/lib/python3.8/site-packages/videocr/api.py", line 20, in save_subtitles_to_file
    f.write(get_subtitles(
  File "/Users/wangyeming/SubOCR/lib/python3.8/site-packages/videocr/api.py", line 8, in get_subtitles
    utils.download_lang_data(lang)
  File "/Users/wangyeming/SubOCR/lib/python3.8/site-packages/videocr/utils.py", line 21, in download_lang_data
    with urlopen(url) as res, open(filepath, 'w+b') as f:
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error Tunnel connection failed: 503 Error>

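Debugging shows that the traineddata download link now returns a 302, and the redirect target is

https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/master/script/**.traineddata

instead of the original

https://github.com/tesseract-ocr/tessdata_best/raw/master/script/**.traineddata

The URL request in the code does not handle the redirect and raises an error. Please fix this and publish a new release soon.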
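The two URL forms above differ only in host and in the `raw` path segment, so one can be derived from the other mechanically. Below is a minimal sketch of that mapping, in case the file has to be fetched by hand. `to_raw_url` and `download` are hypothetical helper names, not part of videocr, and the `~/tessdata` target directory is taken from the README. Note that `urllib.request.urlopen` normally follows 302 redirects on its own, so a "Tunnel connection failed" error after the redirect may instead point to a proxy or network problem; that reading is an assumption, not something the report confirms.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit
from urllib.request import urlopen


def to_raw_url(url: str) -> str:
    """Rewrite github.com/<owner>/<repo>/raw/<ref>/<path> to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip('/').split('/')
    # Only rewrite URLs that match the github.com ".../raw/..." layout from the report.
    if parts.netloc != 'github.com' or len(segments) < 5 or segments[2] != 'raw':
        return url
    owner, repo, _, *rest = segments
    return 'https://raw.githubusercontent.com/' + '/'.join([owner, repo, *rest])


def download(url: str, dest_dir: Optional[Path] = None) -> Path:
    """Fetch one .traineddata file into dest_dir (hypothetical helper; performs a network request)."""
    dest_dir = dest_dir or Path.home() / 'tessdata'  # directory name as given in the README
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit('/', 1)[-1]
    with urlopen(to_raw_url(url)) as res, open(target, 'wb') as f:
        f.write(res.read())
    return target
```

Calling `download('https://github.com/tesseract-ocr/tessdata_best/raw/master/script/HanS.traineddata')` would then request the `raw.githubusercontent.com` address directly and skip the redirect.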

Exception running with Python 3.7

Hello,
I was trying to run videocr on a test AVI file, but I ran into this issue:

$ python3.7 test.py 
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 470, in _handle_results
    task = get()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
TypeError: __init__() takes 1 positional argument but 2 were given

I'm not a Python guy, so I don't know where or how to get more information about this.
Thank you, this library looks very promising!
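A `TypeError` of this shape raised inside `_ForkingPickler.loads` typically means that a worker process raised an exception which could not be rebuilt in the parent process. The sketch below reproduces that mechanism in isolation. `ZeroArgError` is a hypothetical stand-in, and whether this is what happened in the report above (for example, because an external tool such as Tesseract could not be found) is an assumption, not something the traceback confirms.

```python
class ZeroArgError(Exception):
    """Hypothetical exception whose __init__ accepts no arguments."""

    def __init__(self):
        # The message is passed on to Exception, so it ends up in self.args.
        super().__init__('tool is not installed')


def rebuild(exc: BaseException) -> BaseException:
    """Rebuild an exception the way unpickling does: call the class with the saved args."""
    cls, args = exc.__reduce__()[:2]
    return cls(*args)


def can_cross_processes(exc: BaseException) -> bool:
    """Return True if the exception survives the rebuild step."""
    try:
        rebuild(exc)
    except TypeError:
        return False
    return True


print(can_cross_processes(ValueError('bad value')))  # an ordinary exception rebuilds fine
print(can_cross_processes(ZeroArgError()))           # fails: __init__() receives an extra argument
```

If this is the cause, the original error is hidden behind the `TypeError`; running the failing step once in the main process, outside the pool, should surface the real exception.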

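How do I use this?

I think your work is impressive and I would really like to try it out.
However, I'm a Python beginner, and without a tutorial I don't know where to start.
Could you please write a short tutorial?
Thank you.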
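A minimal getting-started sketch based on the API listed in the README. It assumes Tesseract is installed and on `$PATH`, that videocr was installed with `pip install videocr`, and that the video file exists; the `extract` wrapper, the script name, and the file names are placeholders.

```python
# extract_subs.py
import sys


def extract(video_path: str, srt_path: str, lang: str = 'chi_sim+eng') -> None:
    """Run OCR on the hardcoded subtitles in video_path and write them to srt_path."""
    # Imported here so the script can at least print its usage before videocr is installed.
    from videocr import save_subtitles_to_file

    save_subtitles_to_file(video_path, file_path=srt_path, lang=lang)


if __name__ == '__main__':  # The README marks this check as mandatory on Windows.
    if len(sys.argv) == 3:
        extract(sys.argv[1], sys.argv[2])
    else:
        print('usage: python3 extract_subs.py VIDEO SRT_FILE')
```

Run it as `python3 extract_subs.py video.mp4 subtitle.srt`. The `lang` default mirrors the README example; replace it with the language code that matches the subtitles.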

TypeError: 'NoneType' object is not subscriptable

https://pastebin.com/VRnxwxpQ

OS: Windows 10 x64
Python: 3.8
Videocr: 0.1.4
fuzzywuzzy: 0.17.0
Pillow: 6.2.1
pytesseract: 0.3.0
python-Levenshtein-wheels: 0.13.1
opencv-python: 4.1.2

All of the modules are the latest versions for Python 3.8 on x64. OpenCV wasn't available for Python 3.8, so I had to download it from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#opencv
python-Levenshtein-wheels: Video OCR downloads an old version of this module, dating back to 2014, which didn't install for me. For this reason, I installed the newer version and then removed the requirement for the module from Video OCR's setup.py.

The module requirement is stated as >=0.12.0, but it still downloads the old version. I'm guessing this is the reason for the error above, though I might be wrong, since the error seems to be related to OpenCV.

Edit: Here's the code in my file.
`# print_sub.py

import videocr

if __name__ == '__main__':
    videocr.get_subtitles("sample.mp4", lang='chi_sim', time_start='0:00', time_end='', conf_threshold=65, sim_threshold=90, use_fullframe=False)`

Add a progress display for video processing

How do I estimate the processing time if it runs silently?
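Nothing in the API listed in the README reports progress, so one rough option is to time the work yourself. The sketch below is a generic progress-and-ETA wrapper around any iterable of work items. `with_progress` is a hypothetical helper, not a videocr function, and wiring it into videocr's frame loop would require changing the library. For a coarse up-front estimate, the README's own figure (about 3 minutes for a 20-second clip on a dual-core laptop) suggests roughly 9 seconds of processing per second of video on comparable hardware.

```python
import sys
import time
from typing import Iterable, Iterator, TextIO, TypeVar

T = TypeVar('T')


def with_progress(items: Iterable[T], total: int, out: TextIO = sys.stderr) -> Iterator[T]:
    """Yield items unchanged while writing 'done/total' and a linear ETA to out."""
    start = time.monotonic()
    for done, item in enumerate(items, start=1):
        yield item
        # Runs once the consumer has finished with the item and asks for the next one.
        elapsed = time.monotonic() - start
        remaining = elapsed / done * (total - done)
        out.write(f'\r{done}/{total} done, about {remaining:.0f}s left')
        out.flush()
    out.write('\n')


# Example: pretend each "frame" takes a little time to process.
for frame in with_progress(range(5), total=5):
    time.sleep(0.01)
```

The ETA assumes every item costs about the same, which is only approximately true for OCR, where frames with more text tend to take longer.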