fd-net's Introduction

FD-Net: A Fully Dilated Convolutional Network for Historical Document Image Binarization

If this code is helpful to your research, please refer to https://link.springer.com/chapter/10.1007/978-3-030-88004-0_42 and cite the following paper:

  • W. Xiong, L. Yue, L. Zhou, L. Wei, M. Li, "FD-Net: A fully dilated convolutional network for historical document image binarization," in Proceedings of the 4th Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2021), Beijing, China, 2021, pp. 518-529. doi: 10.1007/978-3-030-88004-0_42

Introduction

Binarization of antiquarian document images is one of the most important operations in the digitization process, which helps to resolve the conflict between document conservation and cultural heritage.

Historical document images suffer from severe degradation, such as torn pages, ink bleed-through, faded text strokes, page stains, and artifacts. In addition, the variation of text strokes in degraded handwritten manuscripts further increases the difficulty of binarization.

Figure 1: Historical document image samples from recent DIBCO and H-DIBCO benchmark datasets

Motivation

State-of-the-art (SOTA) models for document image binarization are variants of the encoder-decoder architecture, such as FCN and U-Net. These segmentation models have three key components in common:

  • The encoder comprises consecutive convolutions and downsampling operations (e.g., max-pooling) to extract higher-level features, but it reduces the spatial resolution of intermediate feature maps, which may lose internal data structure or spatial hierarchical information;
  • The decoder consists of repeated upsampling (e.g., bilinear interpolation) and convolutions to restore feature maps to the desired spatial resolution, which may also result in pixelation or texture smoothing;
  • Skip connections merge feature maps at the same level and transfer localization information from the encoder to the decoder.

In addition, sampling operations like max-pooling and bilinear interpolation are deterministic, i.e., they are not learnable or trainable.

To overcome the above problems, an intuitive approach is to simply remove those downsampling and upsampling layers from the model, but this will also decrease the receptive field size and thus severely reduce the amount of context.
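The loss of context can be made concrete with the standard receptive-field recurrence for a stack of layers. The sketch below is illustrative only: the two layer lists are hypothetical toy encoders (four 3×3 convolutions, with or without a 2×2 max-pooling after each), not the actual U-Net or FD-Net configurations.

```python
def receptive_field(layers):
    """Receptive field (input pixels along one axis) of a stack of layers.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    rf, jump = 1, 1  # jump: spacing in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


CONV = (3, 1, 1)  # 3x3 convolution, stride 1, no dilation
POOL = (2, 2, 1)  # 2x2 max-pooling, stride 2

with_pooling = [CONV, POOL] * 4   # toy encoder that downsamples
without_pooling = [CONV] * 4      # same convolutions, pooling removed

print(receptive_field(with_pooling))     # 46
print(receptive_field(without_pooling))  # 9
```

In this toy setting, dropping the pooling layers shrinks the receptive field from 46 to 9 input pixels, which is the reduction in context described above.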

Method/Model

We present a fully dilated convolutional network, termed FD-Net, for degraded historical document image binarization.

What distinguishes the proposed FD-Net from other semantic segmentation models is that our paradigm replaces all the downsampling and upsampling layers with dilated convolutions (a.k.a. atrous convolutions).

Therefore, the proposed segmentation model contains only convolutional and dilated convolutional layers, which are fully trainable. In this way, all intermediate feature maps keep the same spatial resolution, without a significant increase in the number of model parameters.
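Both properties follow from the usual convolution arithmetic, as the minimal sketch below suggests. It assumes 3×3 kernels, stride 1, and "same"-style zero padding; the 128-pixel patch size and 32 channels are borrowed from Table 1 purely for illustration, and the helper names are hypothetical rather than taken from the FD-Net code.

```python
def conv_output_size(size, kernel, dilation=1, stride=1, padding=0):
    """Output length along one axis for a (possibly dilated) convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def same_padding(kernel, dilation=1):
    """Padding per side that preserves the size for stride 1 and odd kernels."""
    return dilation * (kernel - 1) // 2


def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a 2-D convolution; independent of the dilation."""
    return kernel * kernel * in_channels * out_channels + out_channels


for rate in (1, 2, 3, 5, 7):
    out = conv_output_size(128, kernel=3, dilation=rate,
                           padding=same_padding(3, rate))
    print(rate, out, conv_params(3, 32, 32))  # size stays 128, params stay 9248

# For comparison, a 2x2 max-pooling with stride 2 halves the resolution.
print(conv_output_size(128, kernel=2, stride=2))  # 64
```

Dilation only spreads the kernel taps apart, so the number of weights per layer is the same as for an ordinary convolution with the same kernel size.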

Instead of using equal dilation rates or those with a common factor relationship among all the convolutional layers, we introduce a simple hybrid dilation rate solution to avoid the gridding effect.

It has been shown that choosing appropriate dilation rates not only increases the receptive field size effectively but can also significantly improve the segmentation accuracy.
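The gridding effect can be illustrated in one dimension by listing which input offsets can reach a single output pixel. The sketch below is a simplification under stated assumptions: one 3-tap convolution per dilation rate, stride 1, and the three rate settings from Table 1. The real FD-Net layer layout is not described in this README, so the numbers only illustrate the idea.

```python
from itertools import product


def stack_receptive_field(rates, kernel=3):
    """Receptive field of stacked stride-1 convolutions with the given rates."""
    return 1 + (kernel - 1) * sum(rates)


def covered_offsets(rates):
    """1-D input offsets that can influence one output pixel (3-tap kernels)."""
    return {
        sum(tap * rate for tap, rate in zip(taps, rates))
        for taps in product((-1, 0, 1), repeat=len(rates))
    }


for rates in [(2, 2, 2, 2), (2, 3, 5, 7), (2, 4, 8, 16)]:
    rf = stack_receptive_field(rates)
    covered = len(covered_offsets(rates))
    print(rates, "receptive field:", rf, "pixels actually used:", covered)
```

Under these assumptions, equal rates (2,2,2,2) touch 9 of 17 positions and rates sharing a common factor (2,4,8,16) touch 31 of 61, so every odd offset is skipped. The hybrid setting (2,3,5,7) touches 33 of the 35 positions in its receptive field.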

Figure 2: The proposed FD-Net architecture

Implementation Details

Requirements

  • anaconda
  • cudatoolkit=10.1.243
  • cudnn=7.6.5
  • keras=2.3.1
  • opencv=3.4.2
  • python=3.7.9
  • tensorflow-gpu=2.1.0
  • tqdm=4.51.0

Download pre-trained U-Net and FD-Net model weights for Bickley Diary and DIBCO datasets

Experiments

Adam optimization and BCE-Dice loss

Learning rate reduced by a factor of 0.5 if no improvement is seen for 10 consecutive epochs

Early stopping strategy once the learning stagnates for 20 consecutive epochs
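The training settings above can be sketched in plain Python as follows. This is a rough illustration, not the authors' training code: the BCE-Dice loss is assumed to be the sum of binary cross-entropy and (1 - Dice), and the initial learning rate, the smoothing constant, and all function names are hypothetical.

```python
import math


def bce_dice_loss(y_true, y_pred, eps=1e-7, smooth=1.0):
    """Binary cross-entropy plus (1 - Dice) over flat lists of pixel values."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]  # avoid log(0)
    bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
               for t, p in zip(y_true, clipped)) / len(clipped)
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    dice = (2 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)
    return bce + (1.0 - dice)


def run_schedule(val_losses, lr=1e-3, factor=0.5,
                 lr_patience=10, stop_patience=20):
    """Apply both rules to a list of validation losses.

    Returns (index of the last epoch that ran, final learning rate).
    """
    best = math.inf
    since_best = 0     # epochs since the best loss (drives early stopping)
    since_lr_cut = 0   # epochs without improvement since the last LR cut
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best, since_lr_cut = loss, 0, 0
            continue
        since_best += 1
        since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= factor
            since_lr_cut = 0
        if since_best >= stop_patience:
            return epoch, lr
    return len(val_losses) - 1, lr


# A validation loss that never improves after the first epoch.
print(run_schedule([0.5] * 30))
```

In Keras, comparable behaviour is usually obtained with the `ReduceLROnPlateau(factor=0.5, patience=10)` and `EarlyStopping(patience=20)` callbacks, although their handling of details such as cooldown and minimum improvement may differ from this simplified loop.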

We collect 50 training images from the READ project. The Bickley Diary dataset is used for ablation study while the DIBCO and H-DIBCO 2009-2019 benchmark datasets are used for more segmentation experiments.

Table 1: Ablation study of FD-Net on the Bickley Diary dataset with varying dilation rate settings (image patch size: 128×128, and batch size: 32)

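| Network Model | Dilation Rates | # of 1st Layer Channels | Validation Loss | Validation Accuracy | # of Model Parameters |
|---------------|----------------|-------------------------|-----------------|---------------------|-----------------------|
| U-Net         | -              | 32                      | 0.0577          | 0.9903              | 8,630,177             |
| U-Net         | -              | 64                      | 0.0541          | 0.9917              | 34,512,705            |
| FD-Net        | 2,2,2,2        | 32                      | 0.0600          | 0.9899              | 9,414,017             |
| FD-Net        | 2,3,5,7        | 32                      | 0.0514          | 0.9931              | 9,414,017             |
| FD-Net        | 2,4,8,16       | 32                      | 0.0524          | 0.9914              | 9,414,017             |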

Table 2: Performance evaluation results of our proposed method against SOTA techniques on the 10 DIBCO and H-DIBCO test datasets

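| Method  | FM (%) | pFM (%) | PSNR (dB) | NRM (%) | DRD   | MPM (‰) |
|---------|--------|---------|-----------|---------|-------|---------|
| SAE     | 79.221 | 81.123  | 16.089    | 9.094   | 9.752 | 11.299  |
| GiB     | 83.159 | 87.716  | 16.722    | 8.954   | 8.818 | 7.221   |
| SSP     | 85.046 | 87.245  | 17.911    | 5.923   | 9.744 | 9.503   |
| ConvCRF | 86.089 | 87.397  | 18.989    | 6.429   | 4.825 | 4.176   |
| cGAN    | 87.447 | 88.873  | 18.811    | 5.024   | 5.564 | 5.536   |
| DSN     | 88.037 | 90.812  | 18.943    | 6.278   | 4.473 | 3.213   |
| UNet    | 89.290 | 90.534  | 21.319    | 5.577   | 3.286 | 1.651   |
| FD-Net  | 95.254 | 96.648  | 22.836    | 3.224   | 1.219 | 0.201   |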

Figure 3: Binarization results of all evaluation techniques for CATEGORY2_20 in DIBCO 2019 dataset


fd-net's Issues

How to use your models

Dear Wei Xiong,

I would like to test your code and use the results of your algorithm in some of our studies.
Your repository for DP-LinkNet has the test code, but this one does not.

Would you mind providing an example of how to load and use your models?
