videocr

Extract hardcoded (burned-in) subtitles from videos using the PaddleOCR OCR engine with Python.

# example.py

from videocr import save_subtitles_to_file

if __name__ == '__main__':
    save_subtitles_to_file(
        'example_cropped.mp4', 'example.srt', lang='ch',
        time_start='7:10', time_end='7:34',
        sim_threshold=80, conf_threshold=75, use_fullframe=True,
        brightness_threshold=210, similar_image_threshold=1000,
        frames_to_skip=1)

$ python3 example.py

example.srt:

0
00:07:10,000 --> 00:07:10,083
商城......现在没什么东西

1
00:07:10,416 --> 00:07:12,000
这边是战斗辅助系统

2
00:07:13,083 --> 00:07:14,500
要进去才能了解了

3
00:07:15,083 --> 00:07:15,916
没问题了吧

4
00:07:16,333 --> 00:07:17,166
我们准备登录

5
00:07:18,416 --> 00:07:21,083
啊对了， 登录没有服务器的选择么

6
00:07:21,333 --> 00:07:25,000
没有本游戏所有玩家， 都在个服务器内

7
00:07:25,833 --> 00:07:28,833
刺激了， 这么多玩家居然都不分流的么

8
00:07:29,500 --> 00:07:31,083
那......现在登录吗？

9
00:07:31,166 --> 00:07:32,416
好，登录吧！

Install prerequisites

Python 3.7
PaddleOCR
- 2.0+ (recommended): download the latest release from https://github.com/PaddlePaddle/PaddleOCR/releases, unzip it, and run python -m pip install -e . from the root project directory (pip does not appear to have the latest version at the moment)
- or 1.1: python -m pip install paddleocr==1.1.1
PaddlePaddle
- CPU: python -m pip install paddlepaddle
- GPU (CUDA 9 or CUDA 10): python -m pip install paddlepaddle-gpu

Installation

Clone or download and extract this repo
From the root directory of this repository run python -m pip install -e .

Performance

The OCR process can be very slow on CPU. Running with paddlepaddle-gpu is recommended if you have a CUDA 9 or CUDA 10 GPU.

Tips

To reduce the time OCR takes on each frame, use a tool such as ffmpeg to crop the video down to the area where the subtitles appear. When cropping, leave some buffer space above and below the text to ensure accurate readings.
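
As an illustration, the sketch below builds an ffmpeg command that keeps only the bottom part of each frame using ffmpeg's crop filter (crop=w:h:x:y). The helper names and the keep_fraction parameter are hypothetical and not part of videocr. It assumes the subtitles sit at the bottom of the frame; pick a fraction generous enough to leave the buffer space mentioned above.

```python
import subprocess


def build_crop_command(src, dst, keep_fraction=1 / 3):
    """Build an ffmpeg command that keeps the bottom part of each frame.

    keep_fraction is the share of the frame height to keep
    (hypothetical parameter, not part of videocr).
    """
    # crop=w:h:x:y, where iw and ih are the input width and height
    crop = f"crop=iw:ih*{keep_fraction:.3f}:0:ih*{1 - keep_fraction:.3f}"
    # Copy the audio stream unchanged; only the video is re-encoded
    return ["ffmpeg", "-i", src, "-vf", crop, "-c:a", "copy", dst]


def crop_video(src, dst, keep_fraction=1 / 3):
    """Run the crop; requires ffmpeg to be available on PATH."""
    subprocess.run(build_crop_command(src, dst, keep_fraction), check=True)


if __name__ == "__main__":
    # Only print the command here; call crop_video() to actually run it
    print(" ".join(build_crop_command("example.mp4", "example_cropped.mp4")))
```

If the video is cropped this way, pass use_fullframe=True so that the whole cropped frame is used for OCR.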

Quick Configuration Cheatsheet

| Setting | More Speed | More Accuracy | Notes |
| --- | --- | --- | --- |
| Prebuilt PaddleOCR Models | Use default 'mobile' models | Use 'server' models | Running on CPU, 'server' models take significantly more time to run. |
| Input Video Quality | Use lower quality | Use higher quality | The performance impact of higher-resolution video can be reduced with cropping. |
| `frames_to_skip` | Higher number | Lower number | |
| `brightness_threshold` | Higher threshold | N/A | A brightness threshold can help speed up the OCR process by filtering out dark frames. In certain circumstances, such as when subtitles are white and against a bright background, it may also help with accuracy. |

API

Return subtitle string in SRT format

get_subtitles(
    video_path: str, lang='ch', time_start='0:00', time_end='',
    conf_threshold=75, sim_threshold=80, use_fullframe=False,
    det_model_dir=None, rec_model_dir=None,
    brightness_threshold=None, similar_image_threshold=100, similar_pixel_threshold=25, frames_to_skip=1)

Write subtitles to file_path

save_subtitles_to_file(
    video_path: str, file_path='subtitle.srt', lang='ch', time_start='0:00', time_end='', 
    conf_threshold=75, sim_threshold=80, use_fullframe=False,
    det_model_dir=None, rec_model_dir=None,
    brightness_threshold=None, similar_image_threshold=100, similar_pixel_threshold=25, frames_to_skip=1)
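
Because get_subtitles returns the SRT text as a string, the result can be post-processed without touching the disk. The sketch below splits such a string into entries; parse_srt is a hypothetical helper, not part of videocr, and the sample string copies the first two entries of example.srt above in place of a real get_subtitles(...) call.

```python
import re

# Matches a timing line such as "00:07:10,000 --> 00:07:10,083"
TIME_LINE = re.compile(r"(\d+:\d{2}:\d{2},\d{3}) --> (\d+:\d{2}:\d{2},\d{3})")


def parse_srt(srt_text):
    """Split an SRT string into (index, start, end, text) tuples."""
    entries = []
    # Entries are separated by blank lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # not a complete entry
        match = TIME_LINE.match(lines[1].strip())
        if match is None:
            continue  # second line is not a timing line
        text = "\n".join(lines[2:])
        entries.append((int(lines[0].strip()), match.group(1), match.group(2), text))
    return entries


# Stand-in for the string returned by get_subtitles(...)
sample = """0
00:07:10,000 --> 00:07:10,083
商城......现在没什么东西

1
00:07:10,416 --> 00:07:12,000
这边是战斗辅助系统
"""

for index, start, end, text in parse_srt(sample):
    print(index, start, end, text)
```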

Parameters

lang

The language of the subtitles, given as a PaddleOCR language code such as 'ch' (the default) or 'en'.

conf_threshold

Confidence threshold for word predictions. Words with lower confidence than this value will be discarded. The default value 75 is fine for most cases.

Make it closer to 0 if you get too few words in each line, or make it closer to 100 if there are too many excess words in each line.
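
The effect of the threshold can be pictured with a small filter. This is only a sketch: filter_words is a hypothetical helper, the (word, confidence) pair format with confidence on a 0-100 scale is an assumption, and the confidence values are made up for illustration.

```python
def filter_words(predictions, conf_threshold=75):
    """Keep words whose confidence is at least conf_threshold.

    predictions is a list of (word, confidence) pairs with confidence
    on a 0-100 scale (an assumed format, for illustration only).
    """
    return [word for word, conf in predictions if conf >= conf_threshold]


# Made-up predictions for one subtitle line
predictions = [("好", 98.2), ("登录吧", 91.0), ("口", 40.5)]

print(filter_words(predictions))      # ['好', '登录吧']
print(filter_words(predictions, 30))  # ['好', '登录吧', '口']
```
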
sim_threshold

Similarity threshold for subtitle lines. Subtitle lines with larger Levenshtein ratios than this threshold will be merged together. The default value 80 is fine for most cases.

Make it closer to 0 if you get too many duplicated subtitle lines, or make it closer to 100 if you get too few subtitle lines.
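
To make the merge rule concrete, here is a minimal sketch of a Levenshtein-based similarity on a 0-100 scale. The normalization used (1 minus the edit distance divided by the longer length) is one common choice and an assumption; the exact ratio formula used internally may differ. All function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity on a 0-100 scale; 100 means identical strings."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein_distance(a, b) / max(len(a), len(b)))


def should_merge(a, b, sim_threshold=80):
    """Merge two subtitle lines if their similarity exceeds the threshold."""
    return similarity(a, b) > sim_threshold


# One misread character out of six: treated as the same subtitle line
print(should_merge("我们准备登录", "我们准各登录"))  # True
# Completely different lines: kept separate
print(should_merge("我们准备登录", "没问题了吧"))    # False
```
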
time_start and time_end

Extract subtitles from only a clip of the video. The subtitle timestamps are still calculated according to the full video length.
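
The relationship between a time_start value and the timestamps in the output can be sketched as follows. Both helpers are hypothetical and not part of videocr; support for the 'HH:MM:SS' form is an assumption, since only 'M:SS' values appear in this README, and the empty-string default of time_end is not handled.

```python
def to_seconds(time_str):
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS' to a number of seconds."""
    seconds = 0
    for part in time_str.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def to_srt_time(milliseconds):
    """Format milliseconds since the start of the video as HH:MM:SS,mmm."""
    seconds, ms = divmod(int(milliseconds), 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


start = to_seconds("7:10")
print(start)                      # 430
# Timestamps count from the start of the full video, not from time_start
print(to_srt_time(start * 1000))  # 00:07:10,000
```
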
use_fullframe

By default, only the bottom third of each frame is used for OCR. You can explicitly use the full frame if your subtitles are not within the bottom third of each frame.
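
As a rough picture of the default behaviour, the sketch below keeps only the bottom third of a frame represented as a list of pixel rows. bottom_third is a hypothetical helper and the exact row boundary used internally is an assumption; the same slice also works on a NumPy image array.

```python
def bottom_third(frame):
    """Return the bottom third of a frame given as a sequence of pixel rows."""
    height = len(frame)
    return frame[2 * height // 3:]


# A tiny 6-row "frame" where each pixel holds its row number
frame = [[row] * 4 for row in range(6)]
print(bottom_third(frame))  # [[4, 4, 4, 4], [5, 5, 5, 5]]
```
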
det_model_dir

The text detection inference model folder. There are two options: None, which automatically downloads the built-in model to ~/.paddleocr/det, or the path to a specific inference model folder, which must contain the model and params files.

Prebuilt detection models (including bigger/slower ones with better accuracy than the default mobile models) can be found here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/doc/doc_en/models_list_en.md#1-text-detection-model.

rec_model_dir

The text recognition inference model folder. There are two options: None, which automatically downloads the built-in model to ~/.paddleocr/rec, or the path to a specific inference model folder, which must contain the model and params files.

Prebuilt recognition models (including bigger/slower ones with better accuracy than the default mobile models) can be found here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/doc/doc_en/models_list_en.md#2-text-recognition-model.

brightness_threshold

If set, pixels whose brightness is less than the threshold will be blackened out. Valid brightness values range from 0 (black) to 255 (white). This can help improve accuracy when performing OCR on videos with white subtitles.
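
The blackening step can be illustrated on a grayscale frame stored as rows of brightness values. This is a sketch under that assumption; apply_brightness_threshold is a hypothetical helper, not the library's own code.

```python
def apply_brightness_threshold(gray_frame, brightness_threshold):
    """Black out pixels darker than the threshold.

    gray_frame is a list of rows of brightness values from 0 to 255.
    """
    return [[pixel if pixel >= brightness_threshold else 0 for pixel in row]
            for row in gray_frame]


frame = [[12, 200, 255],
         [230, 90, 209]]
print(apply_brightness_threshold(frame, 210))
# [[0, 0, 255], [230, 0, 0]]
```
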
similar_image_threshold

The number of non-similar pixels there can be before the program considers 2 consecutive frames to be different. If a frame is not different from the previous frame, then the OCR result from the previous frame will be reused, which can save a lot of time depending on how long each OCR inference takes.

similar_pixel_threshold

Brightness threshold from 0-255 used with similar_image_threshold to determine if 2 consecutive frames are different. If the brightness difference between a pixel and the pixel at the same position in the previous frame exceeds the threshold, the two pixels are considered non-similar.
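
How the two thresholds combine can be sketched as below, assuming grayscale frames stored as rows of brightness values. frames_differ is a hypothetical helper, and whether the comparisons are strict or inclusive in the actual implementation is an assumption.

```python
def frames_differ(prev_frame, frame, similar_image_threshold=100,
                  similar_pixel_threshold=25):
    """Return True if enough pixels changed between two grayscale frames."""
    non_similar = 0
    for prev_row, row in zip(prev_frame, frame):
        for prev_pixel, pixel in zip(prev_row, row):
            # A pixel is non-similar if its brightness changed by more
            # than similar_pixel_threshold
            if abs(pixel - prev_pixel) > similar_pixel_threshold:
                non_similar += 1
    # The frames differ once the count exceeds similar_image_threshold
    return non_similar > similar_image_threshold


a = [[10, 10, 10], [10, 10, 10]]
b = [[10, 40, 10], [10, 10, 10]]   # one pixel changed
c = [[10, 40, 90], [200, 10, 10]]  # three pixels changed

print(frames_differ(a, b, similar_image_threshold=1))  # False: reuse OCR result
print(frames_differ(a, c, similar_image_threshold=1))  # True: run OCR again
```
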
frames_to_skip

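The number of frames to skip before sampling a frame for OCR. Keep the fps of the input video in mind before increasing this value: the lower the fps, the more video time each skipped frame covers, so short subtitles are more likely to be missed.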
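
The arithmetic behind that trade-off is sketched below. It assumes that frames_to_skip=N means N frames are skipped between consecutive samples, so every (N+1)-th frame is sampled; that reading and both helper names are assumptions, not confirmed behaviour.

```python
def sampled_frame_indices(total_frames, frames_to_skip=1):
    """Indices of the frames that would be sent to OCR."""
    return list(range(0, total_frames, frames_to_skip + 1))


def sampling_interval_ms(fps, frames_to_skip=1):
    """Time between two sampled frames, in milliseconds."""
    return 1000 * (frames_to_skip + 1) / fps


print(sampled_frame_indices(10, frames_to_skip=1))  # [0, 2, 4, 6, 8]
print(sampling_interval_ms(24, frames_to_skip=5))   # 250.0
```

Under this assumption, a subtitle shown for less than the sampling interval may fall entirely between two sampled frames.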

TODO

parallel processing
handle multiple lines of text in the same frame
publish to PyPI
command-line interface
