CLOCR-C: Context Leveraging OCR Correction using Language Models

CLOCR-C (Clock-Er-Sea) is a post-OCR correction approach that utilizes the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. This repository contains the academic work used to demonstrate the concept.

Key Notebooks

This work is organised into notebooks. The key notebooks are numbered as follows:

  1. 01_prompt_testing.ipynb: Tests the range of sub-prompt combinations on the three datasets across all models, using the dev set.
  2. 02_llm_comparison.ipynb: Uses the best-performing prompt to run the tests on the test set.
  3. 03_downstream_ner.ipynb: Measures how much the corrected texts improve NER.
  4. 04_sociocultural_context.ipynb: Explores the importance of the socio-cultural context in the prompt and of Task-Induced In-Context Learning.
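
The abstract names Cosine Named Entity Similarity as one of the downstream NER measures, but this README does not define it. The sketch below assumes one plausible reading: count the named entities found in the reference text and in the OCR (or corrected) text, then take the cosine similarity of the two count vectors. The function name, the `(text, label)` entity format, and the sample entities are hypothetical and are not taken from the notebooks.

```python
from collections import Counter
from math import sqrt


def cosine_entity_similarity(reference_entities, candidate_entities):
    """Cosine similarity between two bags of named entities.

    Each argument is an iterable of hashable entities, for example
    (text, label) tuples produced by any NER tool. This is an assumed
    definition for illustration, not the notebooks' implementation.
    """
    ref = Counter(reference_entities)
    cand = Counter(candidate_entities)
    # Dot product over the entities that appear in both bags.
    dot = sum(ref[key] * cand[key] for key in ref.keys() & cand.keys())
    norm_ref = sqrt(sum(count * count for count in ref.values()))
    norm_cand = sqrt(sum(count * count for count in cand.values()))
    if norm_ref == 0 or norm_cand == 0:
        # No entities on one side: treat the similarity as zero.
        return 0.0
    return dot / (norm_ref * norm_cand)


if __name__ == "__main__":
    # Made-up entities, purely for illustration.
    truth = [("London", "LOC"), ("London", "LOC"), ("Gladstone", "PER")]
    raw_ocr = [("Lonclon", "LOC"), ("London", "LOC"), ("Gladstone", "PER")]
    print(cosine_entity_similarity(truth, raw_ocr))
```

Under this reading, a score of 1 means the two texts yield the same entities in the same proportions, and a score of 0 means they share none.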

Dataset

This study uses three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The NCSE dataset, consisting of 91 transcribed articles with a total of 40 thousand words, is available in a separate data repository at [link to repo].

Requirements

To run the notebooks, you will need the following:

  • Python 3.11
  • Jupyter Notebook
  • The Python packages listed in the requirements.txt file

Models

The experiments in this study were conducted using seven language models:

  • GPT-4
  • GPT-3.5
  • Llama 3
  • Gemma
  • Mixtral 8x7b
  • Claude 3 (Opus)
  • Claude 3 (Haiku)

API keys for the models are required to reproduce this work.
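
A small pre-flight check can confirm that the keys are set before a notebook starts making requests. The sketch below assumes the keys are supplied as environment variables. The variable names are illustrative assumptions; the README does not say which providers or variable names the notebooks expect, so adjust the list to match your setup.

```python
import os

# Assumed variable names for illustration only; replace with the ones
# your provider clients actually read.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys(REQUIRED_KEYS)
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All listed API keys are set.")
```

Reading keys from the environment keeps them out of the notebooks themselves, which matters if the notebooks are committed to a public repository.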

License

This project is licensed under the MIT License. See the LICENSE file for more information.

Abstract

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction, whether the corrections improve downstream NLP tasks, and what value there is in providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs (GPT-4, GPT-3.5, Llama 3, Gemma, Mixtral 8x7b, and Claude 3 Opus and Haiku) on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity and F1 scores. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
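
The headline result is stated as a reduction in character error rate (CER). As a minimal sketch, assuming the common definition of CER as the character-level edit distance divided by the length of the reference transcription, the calculation can be written as follows. The function names and sample strings are illustrative, and the notebooks may compute the metric with a library rather than by hand.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` measured against `reference`."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def error_reduction(cer_before: float, cer_after: float) -> float:
    """Relative reduction in CER, e.g. 0.6 for a 60% reduction."""
    if cer_before == 0:
        raise ValueError("cer_before must be greater than zero")
    return (cer_before - cer_after) / cer_before


if __name__ == "__main__":
    # Made-up strings, purely for illustration.
    truth = "the quick brown fox"
    raw_ocr = "tbe qu1ck brovvn f0x"
    corrected = "the quick brown f0x"
    before = cer(truth, raw_ocr)
    after = cer(truth, corrected)
    print(f"CER before: {before:.3f}, after: {after:.3f}")
    print(f"Relative reduction: {error_reduction(before, after):.0%}")
```

Note that the reduction is relative to the starting CER, so a 60% reduction on a text with a CER of 0.10 leaves a CER of 0.04, not a CER of zero.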

Citing

Citation information not yet available
