kohulan / decimer-image-segmentation

Chemical structure detection and segmentation tool for Journal articles.

Home Page: https://decimer.ai

License: MIT License

Python 16.76% Jupyter Notebook 83.24%
decimer-segmentation chemical-structure segmented-structure-depictions deep-learning segmented-images

decimer-image-segmentation's Introduction

DECIMER-Image-Segmentation

Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature.

The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs.
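To make the idea of expanding an incomplete mask concrete, here is a toy sketch. It is not the DECIMER post-processing algorithm, which this page does not describe in detail. It assumes a grayscale page stored as nested lists, a hypothetical `dark_threshold`, and simple 4-connected region growing from the mask pixels that already cover dark ink.

```python
from collections import deque


def expand_mask(image, mask, dark_threshold=128):
    """Grow `mask` over 4-connected dark pixels of a grayscale `image`.

    Both arguments are lists of rows with identical dimensions.
    Toy illustration only; names and threshold are assumptions.
    """
    height, width = len(image), len(image[0])
    expanded = [row[:] for row in mask]
    # Seeds: pixels that are already masked and dark (i.e. part of a drawing).
    queue = deque(
        (y, x)
        for y in range(height)
        for x in range(width)
        if mask[y][x] and image[y][x] < dark_threshold
    )
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            inside = 0 <= ny < height and 0 <= nx < width
            if inside and not expanded[ny][nx] and image[ny][nx] < dark_threshold:
                expanded[ny][nx] = True
                queue.append((ny, nx))
    return expanded
```

In this sketch, a mask that covers only part of a connected line drawing spreads along the dark pixels until it reaches white space, which is the intuition behind completing a partially detected structure.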

By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a pdf file and retrieve the segmented structure depictions.

Usage

  • To use DECIMER Segmentation, clone the repository to your local disk. Mask-RCNN runs on either a GPU or a CPU; if you use a GPU, make sure the necessary drivers are installed.
We recommend using DECIMER-Segmentation inside a Conda environment to simplify the installation of the dependencies.
  • Conda can be downloaded as part of the Anaconda or Miniconda platforms (Python 3). We recommend installing Miniconda3. On Linux, you can get it with:
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

How to install DECIMER-Segmentation

$ git clone https://github.com/Kohulan/DECIMER-Image-Segmentation
$ cd DECIMER-Image-Segmentation
$ conda create --name DECIMER_IMGSEG python=3.10
$ conda activate DECIMER_IMGSEG
$ conda install pip
$ python -m pip install -U pip #Upgrade pip
$ pip install .
$ conda install -c conda-forge poppler

# Alternatively, install the released package from PyPI
$ pip install decimer-segmentation

The Mask-RCNN Model is available at: DOI

How to use DECIMER-Segmentation

  • The repository contains a script that segments chemical structures from an image of a scanned page or from a PDF document. Pass the path of the image or PDF as file_name:
$ python3 segment_structures_in_document.py file_name
  • Segmented images are saved in the output folder (which has the name of the pdf file).

  • Alternatively, you can integrate DECIMER Segmentation into your Python code:

from decimer_segmentation import segment_chemical_structures, segment_chemical_structures_from_file
import cv2

# Segment structures in scanned page image (np.array)
page = cv2.imread(scanned_page_file_path)
segments = segment_chemical_structures(page, expand=True)

# Segment structures from file (pdf or image)
# Windows users may need to specify the location of their poppler installation with the poppler_path argument if they want to process pdf files
segments = segment_chemical_structures_from_file(path, expand=True, poppler_path=None)
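When many documents need to be processed, the file-based call above can be wrapped in a small batch helper. The sketch below is only an illustration: `iter_documents`, `segment_folder` and the `SUPPORTED` extension set are hypothetical names that are not part of the package, and the segmentation function is passed in as a parameter so the helper itself has no heavy dependencies.

```python
from pathlib import Path

# Assumption: these are the input types of interest; adjust as needed.
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}


def iter_documents(folder):
    """Yield supported files in `folder` in sorted order."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


def segment_folder(folder, segment_fn):
    """Apply `segment_fn(path_string)` to every supported file in `folder`.

    Returns a dict mapping file names to whatever `segment_fn` returns.
    """
    return {path.name: segment_fn(str(path)) for path in iter_documents(folder)}
```

With the package installed, `segment_fn` could be something like `lambda p: segment_chemical_structures_from_file(p, expand=True, poppler_path=None)`, using the call shown above.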

Notes for Windows users:

  • Execute DECIMER_Segmentation.py in the Anaconda Powershell Prompt

  • If you run into an error with the PDF conversion on Windows, you need to download poppler and extract the archive.

  • The method segment_chemical_structures_from_file() takes a 'poppler_path' argument where the user can specify the path of their poppler installation ('PATH/TO/POPPLER/bin').
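A small helper can make the choice of 'poppler_path' explicit. This is a sketch under assumptions: `resolve_poppler_path` and `poppler_on_path` are hypothetical helpers, and the check relies on `pdftoppm`, one of the poppler command-line tools, being present in the poppler 'bin' directory.

```python
import shutil
from pathlib import Path


def poppler_on_path():
    """Return True if the poppler tool pdftoppm can be found on PATH."""
    return shutil.which("pdftoppm") is not None


def resolve_poppler_path(candidates=()):
    """Return the first candidate directory that contains pdftoppm, else None.

    None means "do not pass an explicit location and rely on PATH".
    """
    for candidate in candidates:
        folder = Path(candidate)
        if (folder / "pdftoppm.exe").is_file() or (folder / "pdftoppm").is_file():
            return str(folder)
    return None
```

The result can then be passed on, for example `segment_chemical_structures_from_file(path, expand=True, poppler_path=resolve_poppler_path(["PATH/TO/POPPLER/bin"]))`.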

Authors

decimer.ai

Citation

Rajan, K., Brinkhaus, H.O., Sorokina, M. et al. DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature. J Cheminform 13, 20 (2021). https://doi.org/10.1186/s13321-021-00496-1


decimer-image-segmentation's People

Contributors

adhardy, dependabot[bot], github-actions[bot], hanzlika, kohulan, obrink, zmahnoor14

decimer-image-segmentation's Issues

Notebook Example - Performance Not Replicable

Hi There,

I've followed the instructions and installed your model in a mounted Colab drive. Previous tests on the arm64 architecture failed (not surprising).

See below how the test notebook runs. Are the uploaded model weights consistent with the paper?

Screenshot 2022-03-14 at 21 18 08

Could you offer the dataset (and the code of training process)?

Dear sir, I am currently working with this code. Although the model's predictions are wonderful, there are still some cases in which structures are not completely segmented. I have read the original paper but, unfortunately, could not find the datasets. Could you provide the dataset and, if possible, the code for the training process as well?

Issues with mrcnn parallel model

Hi,

When running across multiple GPUs, I get a ModuleNotFoundError for the line from mrcnn.parallel_model import ParallelModel in model.py.

I added the parallel model from https://github.com/matterport/Mask_RCNN but then got the error 'ParallelModel' object has no attribute '_base_model_initialized'

Would anyone be able to help with this?

Thanks

How do I train the model with my own data and moldetect.py ?

Hi, I'm very interested in this project. I tried to train a new model starting from the pre-trained mask_rcnn_molecule.h5 weights and ran into some difficulties.

  1. Using moldetect.py with the mask_rcnn_molecule.h5 weights, I started training with the following command and encountered an error:
python mrcnn/moldetect.py train --dataset=/path/to/train_data/  --weights=/path/to/mask_rcnn_molecule.h5

error message:

Traceback (most recent call last):
File "./mrcnn/moldetect.py", line 366, in
model.load_weights(weights_path, by_name=True)
File "/DECIMER-Image-Segmentation-master/mrcnn/model.py", line 2140, in load_weights
hdf5_format.load_weights_from_hdf5_group_by_name(f, layers)
File "/miniconda3/envs/tf23/lib/python3.8/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 767, in load_weights_from_hdf5_group_by_name
raise ValueError('Layer #' + str(k) + ' (named "' + layer.name +
ValueError: Layer #362 (named "anchors") expects 1 weight(s), but the saved weights have 0 element(s).

  2. Using moldetect.py with the mask_rcnn_balloon.h5 weights (from matterport/Mask_RCNN), no errors appeared, but the program stalled on the first epoch. I waited for 3 hours and it made no progress.

I would appreciate instructions for training a model with the moldetect.py script. Thank you very much.

Can't reproduce the original results with the example page

If you have a look at the Jupyter Notebook, the results differ from the results that we originally got. I tried it with different resolutions but the issue remains the same. I am really curious why it is behaving this way. The model is the same, the test page is the same. I will look for the virtual environment on my old laptop and compare it with the environment that is built when installing the package. I am a bit puzzled about this.

My program got killed

I am working on image segmentation for a couple of PDF files. Before executing the code below, I split the PDF into individual files based on page numbers. However, during the execution of the code, the process gets terminated after processing the seventh file.

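import os
import numpy as np
from pdf2image import convert_from_path
from decimer_segmentation import segment_chemical_structures
# `files`, `output_folder_path` and `save_images` come from the rest of the reporter's script
for file in files:
    file_path = os.path.join(output_folder_path, file)
    pages = convert_from_path(file_path, 300)  # render the page at 300 dpi
    # Fixed: np.array(...) is closed before the keyword arguments are passed
    raw_segments = segment_chemical_structures(np.array(pages[0]), expand=True, visualization=False)
    save_images(raw_segments, output_folder_path, name=f"{file[:-4]}")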
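A process that is killed without a Python traceback is often a sign that the operating system ran out of memory, although the report does not confirm this. Under that assumption, one workaround is to process each file in a fresh interpreter so that all memory is returned when the child exits. The helpers `run_isolated` and `process_files` below are hypothetical and use only the standard library.

```python
import subprocess
import sys


def run_isolated(code, *args):
    """Run `code` in a new Python interpreter with `args` as sys.argv[1:]."""
    return subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True,
        text=True,
    )


def process_files(files, code):
    """Run `code` once per file and return the files whose child process failed."""
    failed = []
    for file_path in files:
        result = run_isolated(code, file_path)
        if result.returncode != 0:
            failed.append(file_path)
    return failed
```

Here `code` would contain the segmentation of a single file (reading its path from `sys.argv[1]`). Reloading the model for every file is slower, so this trades speed for a bounded memory footprint.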

could you publish your training data sets?

Thanks for your impressive research work. I think it would be more helpful to other researchers if you published your training data and described in detail how you curated it. If possible, please also supply a Jupyter notebook for the image annotation and training process.

Thanks

Available validation set?

Hi! I'm interested in your work. Is any validation set (PDF files from jnatprod, molecules and phytochem) available? Waiting for your answer. Thanks.

Poppler error also in Linux

Hi guys,

Just so you know, the error regarding the poppler installation path also appears on Linux. It has an easy fix using conda:

conda install -c conda-forge poppler

Best,
Isa.

Segmentation error on page without structure


ValueError                                Traceback (most recent call last)
c:\Users\Otto Brinkhaus\OneDrive - Friedrich-Schiller-Universität Jena\Dokumente\Arbeit\testetst\DECIMER-Image-Segmentation\DECIMER_Segmentation_notebook.ipynb Cell 7 line 1
----> 1 segments = segment_chemical_structures(
      2     np.array(pages[0]), expand=True, visualization=True
      3 )

File c:\Users\Otto Brinkhaus\OneDrive - Friedrich-Schiller-Universität Jena\Dokumente\Arbeit\testetst\DECIMER-Image-Segmentation\decimer_segmentation\decimer_segmentation.py:101, in segment_chemical_structures(image, expand, visualization)
     99     masks, bboxes, _ = get_mrcnn_results(image)
    100 else:
--> 101     masks = get_expanded_masks(image)
    103 segments, bboxes = apply_masks(image, masks)
    105 if visualization:

File c:\Users\Otto Brinkhaus\OneDrive - Friedrich-Schiller-Universität Jena\Dokumente\Arbeit\testetst\DECIMER-Image-Segmentation\decimer_segmentation\decimer_segmentation.py:230, in get_expanded_masks(image)
    228 # Structure detection with MRCNN
    229 masks, bboxes, _ = get_mrcnn_results(image)
--> 230 size = determine_depiction_size_with_buffer(bboxes)
    231 # Mask expansion
    232 expanded_masks = complete_structure_mask(
    233     image_array=image,
    234     mask_array=masks,
    235     max_depiction_size=size,
    236 )
...
     83 else:
     84     return reduction(axis=axis, out=out, **passkwargs)
---> 86 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

ValueError: zero-size array to reduction operation maximum which has no identity
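Until pages without detections are handled inside the package, a caller-side guard is one possible workaround. The sketch below assumes that the failure always surfaces as the ValueError shown above. `segment_or_empty` is a hypothetical wrapper, and matching on the message text is admittedly fragile.

```python
def segment_or_empty(image, segment_fn):
    """Call `segment_fn(image)`; return [] if it fails on a page without structures.

    Only the "zero-size array" ValueError from the report is swallowed;
    any other error is re-raised unchanged.
    """
    try:
        return segment_fn(image)
    except ValueError as error:
        if "zero-size array" in str(error):
            return []
        raise
```

A call could look like `segment_or_empty(np.array(pages[0]), lambda img: segment_chemical_structures(img, expand=True))`.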

There is a question

Thank you for your code. I have a question: on your website I can get the resolved SMILES representation and a picture that can be modified, but I can't find the corresponding code in this project. Is that code open source?

IndexError: index 0 is out of bounds for axis 0 with size 0

x_center = np.where(mask_array[y_center] == True)[0][0]

IndexError: index 0 is out of bounds for axis 0 with size 0.
The error occurs in the find_mask_center() function in complete_structure.py:
'''
if mask_array[y_center, x_center]:
    return x_center, y_center
else:
    # If the global mask center is not placed in the mask, take the center on the x-axis and the first-best y-coordinate that lies in the mask
    x_center = np.where(mask_array[y_center] == True)[0][0]
    return x_center, y_center
'''
That is, when mask_array[y_center] contains no True value, np.where(mask_array[y_center] == True)[0][0] fails. What should I do about this?
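One way to avoid the IndexError is to fall back to the mask pixel nearest to the center instead of searching a single row. The sketch below is not the project's fix. `robust_mask_center` is a hypothetical replacement, and it assumes a 2-D boolean mask whose center is taken as the mean of its pixel coordinates (the original center computation is not shown in the report).

```python
import numpy as np


def robust_mask_center(mask_array):
    """Return (x, y) of a pixel that is guaranteed to lie inside the mask."""
    ys, xs = np.nonzero(mask_array)
    if ys.size == 0:
        raise ValueError("mask contains no pixels")
    y_center = int(round(float(ys.mean())))
    x_center = int(round(float(xs.mean())))
    if mask_array[y_center, x_center]:
        return x_center, y_center
    # The center lies outside the mask (e.g. ring-shaped or split masks):
    # pick the mask pixel with the smallest squared distance to it.
    distances = (ys - y_center) ** 2 + (xs - x_center) ** 2
    nearest = int(np.argmin(distances))
    return int(xs[nearest]), int(ys[nearest])
```

Because the fallback searches all mask pixels, it also covers the case from the report in which the row at y_center contains no mask pixel at all.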

Deprecated datatypes/args on new install of 1.1.1

Hi Guys,

I've had problems with a fresh install of 1.1.1 from pip.

  1. NumPy deprecated np.bool in 1.20 and removed it in 1.24, so the installed version (1.24.2) fails wherever the project source uses it. Downgrading numpy to 1.19.5 caused problems with the version of tensorflow I was pulling ("...compiled for different version of numpy...") but changing all instances of np.bool to bool in the source seems to fix it.

  2. The binary_erosion function from skimage.morphology uses the selem kwarg which looks like it was changed to footprint at some point and was not working with the version of skimage that was installed (0.20.0). Replacing the selem arg with footprint seems to work.

Happy to create a pull request with these fixes, unless there is another way you'd want to approach it - either way looks like you need to tighten up your version requirements.
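For the second point, a small shim can pick whichever keyword the installed function accepts. This is a sketch, not the fix that was merged: `call_with_footprint` is a hypothetical helper, and it assumes the function's signature exposes either a `footprint` or a `selem` parameter. For the first point no shim is needed, since the builtin `bool` works as a dtype in every NumPy version.

```python
import inspect


def call_with_footprint(func, image, footprint):
    """Call `func(image, footprint=...)` or `func(image, selem=...)`.

    The keyword is chosen by inspecting the signature of `func`, so the same
    call works with both the old and the new parameter name.
    """
    parameters = inspect.signature(func).parameters
    keyword = "footprint" if "footprint" in parameters else "selem"
    return func(image, **{keyword: footprint})
```

With scikit-image installed this would be used as `call_with_footprint(binary_erosion, mask, kernel)`, where `mask` and `kernel` stand for the caller's own arrays.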

Error with expanded mask function

Hi,

I was getting an error on line 451 from decimer_segmentation/complete_structure.py:

arrays to stack must be passed as a "sequence" type such as list or tuple.

It is fixed by changing line 451 to:
mask_array = np.stack(list(expanded_split_mask_arrays), -1)

Best,
Isa.
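The fix can be reproduced in isolation. The sketch below assumes that the object passed to np.stack was a lazy iterable (here a generator), which recent NumPy versions reject; the toy masks are made up for illustration.

```python
import numpy as np

# A generator of three 2x2 boolean masks: iterable, but not a sequence.
masks = (np.full((2, 2), value, dtype=bool) for value in (True, False, True))

# Materialising the iterable as a list gives np.stack the sequence it expects.
mask_array = np.stack(list(masks), -1)
```

Stacking along the last axis yields one channel per mask, which matches the axis argument of -1 used in the reported fix.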

Terminal atoms or atom groups are not included in segmented images

Hi guys,

In some images, terminal atoms or atom groups are cut off in the segmented output. I tried both expand=True and expand=False, and even with expand=False some atoms are still cut out. Could you please explain where the problem comes from and how to avoid it?

Here is the output I got when using visualization=True.

Example output with expand=True:
US-20220048929-A1_image_1674_output_expand_True

Example output with expand=False:
US-20220048929-A1_image_1674_output_expand_False

The segmented saved files and original image are in the archive:
US-20220048929-A1_image_1674.zip

Best regards,
Aleksei

Error with single page PDF at the get_mask_center function

Hi,

the following error appears with the attached PDF document, do you know why?

Best,
Isa.

Traceback (most recent call last):
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/segment_structures_in_document.py", line 45, in <module>
    main()
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/segment_structures_in_document.py", line 22, in main
    raw_segments = segment_chemical_structures_from_file(sys.argv[1])
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/decimer_segmentation/decimer_segmentation.py", line 75, in segment_chemical_structures_from_file
    segments = segment_chemical_structures(images[0])
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/decimer_segmentation/decimer_segmentation.py", line 100, in segment_chemical_structures
    masks = get_expanded_masks(image)
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/decimer_segmentation/decimer_segmentation.py", line 157, in get_expanded_masks
    expanded_masks = complete_structure_mask(image_array=image, mask_array=masks)
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/decimer_segmentation/complete_structure.py", line 451, in complete_structure_mask
    mask_array = np.stack(expanded_split_mask_arrays, -1)
  File "<__array_function__ internals>", line 180, in stack
  File "/Users/mag/opt/anaconda3/envs/DECIMER_IMGSEG/lib/python3.10/site-packages/numpy/core/shape_base.py", line 420, in stack
    arrays = [asanyarray(arr) for arr in arrays]
  File "/Users/mag/opt/anaconda3/envs/DECIMER_IMGSEG/lib/python3.10/site-packages/numpy/core/shape_base.py", line 420, in <listcomp>
    arrays = [asanyarray(arr) for arr in arrays]
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/decimer_segmentation/complete_structure.py", line 399, in expansion_coordination
    seed_pixels = get_seeds(image_array, mask_array)
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/decimer_segmentation/complete_structure.py", line 297, in get_seeds
    x_center, y_center = get_mask_center(mask_array)
  File "/Users/mag/UCMI/DECIMER-Image-Segmentation/decimer_segmentation/complete_structure.py", line 280, in get_mask_center
    x_center = np.where(mask_array[y_center] == True)[0][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

105.pdf

Dataset issues

Hi, I am a student fascinated by computer vision, and I want to train my own model on the same problem. Since I have no manpower, I can't annotate the segmentations myself.
Could you kindly provide a link to the dataset?

Including molecule label in an output image

Hi,

Awesome package! I was wondering whether there is an easy way to include a third image in the output that covers a slightly expanded area around a molecule, in order to capture the molecule's label.

For example, many of the papers I'm dealing with include a figure showing a panel of substrates, each one given a number or letter. I'd like to use DECIMER to segment the molecules, use OSRA to do image recognition on each segmented molecule, and then also include the source image so that a human user can go through and label the molecules / SMILES appropriately. This is of course possible just from the original PDF, but it can be a little time consuming matching the OSRA output to the original molecule label in each case.

Thanks!

Will
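The expanded area asked for here can be sketched as a bounding box grown by a fixed margin and clamped to the page. This is only an illustration: `expand_bbox` is a hypothetical helper, and the (y0, x0, y1, x1) box layout is an assumption that should be checked against the bounding boxes the package actually returns.

```python
def expand_bbox(bbox, margin, image_shape):
    """Grow a (y0, x0, y1, x1) box by `margin` pixels, clamped to the image.

    `image_shape` is (height, width) or (height, width, channels).
    """
    y0, x0, y1, x1 = bbox
    height, width = image_shape[:2]
    return (
        max(0, y0 - margin),
        max(0, x0 - margin),
        min(height, y1 + margin),
        min(width, x1 + margin),
    )
```

The enlarged crop is then `page[y0:y1, x0:x1]` on the page array, which keeps nearby text such as a compound number next to the structure.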

Need to downgrade protobuf following the installation

Hi,

after following the installation steps with conda, when running DECIMER Segmentation there is an error related to the protobuf version. It can be fixed by downgrading it to 3.19.0:
pip install protobuf==3.19.0

Just a comment.

Best!
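Whether the downgrade is needed can be checked before running the tool. The sketch below uses only the standard library and takes the 3.19.0 threshold from this report. The helper names are hypothetical, and the comparison considers only the leading numeric components of a version string.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Parse leading numeric components, e.g. '3.19.0' -> (3, 19, 0)."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def is_newer_than(installed, reference="3.19.0"):
    """Return True if `installed` is a later version than `reference`."""
    return version_tuple(installed) > version_tuple(reference)


def installed_version(package="protobuf"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

If `installed_version()` returns a version for which `is_newer_than(...)` is true and the error from this report appears, the pip command above is the fix described here.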
