
Keyframe Extraction for Video Summarization

Description

In this project we use keyframe extraction for video summarization. We present a large-model-based sequential keyframe extraction method, dubbed LMSKE, which extracts a minimal set of keyframes that summarize a given video while preserving their temporal order.

First, the large model TransNetV2 is used to perform shot segmentation, and the large model CLIP is employed to extract semantic features for each frame within each shot. Second, an adaptive clustering method is devised to automatically determine the optimal clusters, based on which we perform candidate keyframe selection and redundancy elimination shot by shot. Finally, the keyframe set is obtained by concatenating the keyframes of all shots in chronological order.

This project is organized as follows:

The extraction folder contains our code for extracting keyframes, and the scripts folder contains the script files for saving keyframes and evaluating extraction performance.

Method

Shot segmentation

We use the large model TransNetV2 for shot segmentation; the code is from https://github.com/soCzech/TransNetV2. This step produces the shot segmentation result for the video, which is saved locally as a txt file. Taking akI8YFjEmUw.mp4 as an example, the shot segmentation result is as follows:

Some of the shots shown in the image are segmented very accurately: the first shot shows the presenter narrating, the second shows two people visiting the car, the third shows two people talking, and the fourth shows the car charging.
In each line, the first number is the start frame of the shot and the second is the end frame.
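The saved txt file can be read back into (start, end) frame pairs before the per-shot processing. A minimal sketch, assuming one whitespace-separated "start end" pair per line (the function name `parse_shots` is illustrative, not from the repository):

```python
def parse_shots(path):
    """Parse a shot-segmentation txt file into (start_frame, end_frame) tuples.

    Assumed file format: one shot per line, "start end" separated by whitespace.
    """
    shots = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                shots.append((int(parts[0]), int(parts[1])))
    return shots
```

Each tuple can then be used to slice the video's frames (and later its feature matrix) shot by shot.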

Feature Extraction

We use the large model CLIP to extract semantic features for each frame in each shot; the code is from https://www.modelscope.cn/models/damo/multi-modal_clip-vit-large-patch14_336_zh/summary.

Feature extraction with the CLIP large model yields a 768-dimensional feature vector for each frame of the video. We save the features of the whole video locally as a .pkl file for subsequent use.
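The normalization and .pkl serialization around the model call can be sketched as follows. This is a generic sketch, not the project's ModelScope pipeline: the helper names (`l2_normalize`, `save_features`) are illustrative, and the commented model call uses the Hugging Face `transformers` English CLIP-L/14-336 checkpoint rather than the Chinese-CLIP model linked above:

```python
import pickle
import numpy as np

def l2_normalize(feats, eps=1e-8):
    """Row-wise L2 normalization, as is standard for CLIP embeddings."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)

def save_features(feats, path):
    """Persist the (n_frames, 768) feature matrix as a .pkl file."""
    with open(path, "wb") as f:
        pickle.dump(feats, f)

# Obtaining the per-frame embeddings themselves would look roughly like
# (requires torch + transformers; `frames` is a list of PIL images):
#
#   from transformers import CLIPModel, CLIPProcessor
#   model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
#   processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
#   inputs = processor(images=frames, return_tensors="pt")
#   feats = model.get_image_features(**inputs).detach().numpy()  # (n, 768)
```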

Clustering

We designed an adaptive clustering method to automatically determine the best clustering result. The code can be found in Keyframe_extraction.py in our repository.

We take the video feature file (features_path) and the shot segmentation result file (scenes_path) as inputs; after running, we obtain best_clusters, best_centers, best_k, and centre_indices.
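The "determine the optimal clusters automatically" step can be sketched with a standard model-selection loop: run k-means for a range of k and keep the k with the best silhouette score, then take the frame nearest each center as a candidate keyframe. This is a generic sketch, not necessarily the criterion used in Keyframe_extraction.py; the function name `adaptive_cluster` is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def adaptive_cluster(feats, k_max=6, seed=0):
    """Cluster per-shot frame features, picking k by silhouette score.

    Returns (labels, centers, best_k, centre_indices), where centre_indices
    holds, for each cluster, the index of the frame closest to its center.
    """
    best_score, best_km = None, None
    for k in range(2, min(k_max, len(feats) - 1) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(feats)
        score = silhouette_score(feats, km.labels_)
        if best_score is None or score > best_score:
            best_score, best_km = score, km
    centre_indices = np.array([
        np.argmin(np.linalg.norm(feats - c, axis=1))
        for c in best_km.cluster_centers_
    ])
    return best_km.labels_, best_km.cluster_centers_, best_km.n_clusters, centre_indices
```

Applied per shot, the returned centre_indices play the role of the candidate keyframes that the next stage filters.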

Redundancy

After obtaining the clustering results, we perform candidate keyframe selection and redundancy elimination shot by shot. Redundancy elimination addresses two aspects: on the one hand, solid-colour or low-information frames; on the other, frames with a high degree of mutual similarity. The code can be found in Redundancy.py in our repository; running it yields the final keyframe sequence.
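The two filters described above can be sketched as a single pass over the candidate keyframes: drop frames whose pixel standard deviation is near zero (solid-colour / low-information), then drop frames whose feature is too close, in cosine similarity, to an already-kept frame. The thresholds and function names here are illustrative assumptions, not the values used in Redundancy.py:

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def filter_keyframes(frames, feats, std_thresh=5.0, sim_thresh=0.95):
    """Return indices of candidate keyframes that survive both filters.

    frames: list of pixel arrays; feats: matching per-frame feature vectors.
    """
    kept = []
    for i, (frame, feat) in enumerate(zip(frames, feats)):
        if frame.std() < std_thresh:  # near-solid colour -> low information
            continue
        # drop near-duplicates of frames already kept
        if any(cosine_sim(feat, feats[j]) > sim_thresh for j in kept):
            continue
        kept.append(i)
    return kept
```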

Evaluation

Benchmark dataset

The dataset we use is a benchmark dataset we built ourselves, called TvSum20, which is used to evaluate the performance of keyframe extraction methods. The dataset is from https://github.com/ttharden/Keyframe-extraction

Scripts

We use a script to evaluate the extracted keyframes. The code can be found in Evaluation.py in our repository.

We take the keyframe sequence extracted by the method (test_index) and the annotated sequences of the benchmark dataset (keyframe_center) as inputs, and output the corresponding performance metric values, which include the F-value, fidelity, and compression ratio.
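Two of these metrics can be sketched directly. The sketch below assumes a tolerance-based greedy matching for the F-value (a predicted keyframe counts as correct if it lies within `tol` frames of an unused ground-truth keyframe) and defines compression ratio as the fraction of frames discarded; these are common conventions, not necessarily the exact definitions in Evaluation.py, and the fidelity metric (typically a semi-Hausdorff distance over features) is omitted:

```python
def f_score(pred, gt, tol=5):
    """F-value via greedy one-to-one matching within a frame tolerance."""
    used, matched = set(), 0
    for p in pred:
        for j, g in enumerate(gt):
            if j not in used and abs(p - g) <= tol:
                used.add(j)
                matched += 1
                break
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def compression_ratio(n_keyframes, n_frames):
    """Fraction of the video's frames not kept as keyframes."""
    return 1.0 - n_keyframes / n_frames
```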

Case

This is a case of keyframe extraction by our proposed method. Below are examples of keyframe sequences extracted by different methods:

In the GMC method, keyframes are extracted by local characterisation, point-of-interest matching, and graph modularity. In the VSUMM method, clustering is preceded by pre-sampling. In the INCEPTION method, the feature vector is a 2048-dimensional latent vector from InceptionV3 trained on ImageNet. In the UID method, the metric is the 2-Wasserstein distance computed between univariate Gaussian models. In the SHOT method, within each shot the frames with the maximum coefficient of variation are selected as keyframes, where the coefficient of variation is computed from the Sobel gradient function.

Use it

Step 1: Run TransNetV2 to get the result of shot segmentation

Step 2: Run CLIP to get the features of the video frame

Step 3: Run Keyframe_extraction.py to get the keyframe sequence

Step 4: Run save_keyframe.py to save the keyframes and Evaluation.py to evaluate them

Reference

Tomas Soucek and Jakub Lokoc, "TransNet V2: an effective deep network architecture for fast shot transition detection," arXiv:2008.04838, pp. 1–4, 2020.

An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou, "Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese," arXiv:2211.01335, 2022.

Sandra Eliza Fontes de Avila, Ana Paula Brandao Lopes, Antonio da Luz Jr., and Arnaldo de Albuquerque Araujo, “VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method,” Pattern Recognit. Lett., vol. 32, no. 1, pp. 56–68, 2011.

Hana Gharbi, Sahbi Bahroun, Mohamed Massaoudi, and Ezzeddine Zagrouba, “Key frames extraction using graph modularity clustering for efficient video summarization,” in ICASSP, 2017, pp. 1502–1506.

H.M. Nandini, H.K. Chethan, and B.S. Rashmi, “Shot based keyframe extraction using edge-lbp approach,” Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 7, pp. 4537–4545, 2022.

Luis Carlos Garcia-Peraza, Sebastien Ourselin, and Tom Vercauteren, "VideoSum: A Python Library for Surgical Video Summarization," pp. 1–2, 2023.
