tomashubelbauer / pdf-scrape

Demonstrating PDF text and image extraction with correct bounds

Home Page: https://tomashubelbauer.github.io/pdf-scrape

pdf-scrape's Introduction

  1. Print demo.html to demo.pdf or use your own document
  2. Go to https://mozilla.github.io/pdf.js/getting_started
  3. Download Stable
  4. Extract pdf.js and pdf.worker.js and their corresponding *.map here
  5. Make index.html and reference PDF.js:

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
  </head>
  <body>

  </body>
</html>
  6. Create index.js and reference it from index.html:

index.js (empty for now)

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>

  </body>
</html>
  7. Update index.js with code to load the document and get its first page:

index.js

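// Run immediately; nothing here touches the DOM yet
void async function () {
  // getDocument returns a loading task; its promise resolves to the document
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  // Pages are numbered from 1
  const page = await document.getPage(1);
}()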
  8. Add a canvas element to index.html where the page will be rendered:

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>
    <canvas id="pageCanvas"></canvas>
  </body>
</html>
  9. Extend the code to render the page to the canvas context:

index.js

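// Wait for the load event so the canvas element exists before it is used
window.addEventListener('load', async () => {
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
  const viewport = page.getViewport({ scale: 1 });
  // The local `document` shadows the global one, hence `window.document`
  const canvas = window.document.getElementById('pageCanvas');
  // Size the canvas to the page so the whole page fits
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  const context = canvas.getContext('2d');
  // render() returns a task; await its promise so rendering has finished
  await page.render({ canvasContext: context, viewport }).promise;
});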
  10. Hook up code to extract the text and to highlight the text and images (see this repo)
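
As a starting point for the text part of that last step, here is a minimal sketch of how a text item's bounds might be mapped onto the canvas. It is not the code from this repo. The helper names `multiply`, `textItemBounds` and `highlightText` are hypothetical, and the sketch assumes horizontal, unrotated text and that each item returned by `page.getTextContent()` carries a `transform` matrix and a `width` in unscaled page units. The box is measured from the baseline, so descenders fall slightly outside it.

```javascript
// Multiply two PDF-style affine matrices [a, b, c, d, e, f].
// Written out by hand so the sketch does not depend on PDF.js being loaded.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Hypothetical helper: approximate canvas-space box of one text item.
// Assumption: the text is horizontal and unrotated.
function textItemBounds(item, viewportTransform, scale) {
  // Combine the page-to-canvas transform with the item's own transform
  const tx = multiply(viewportTransform, item.transform);
  // Font height in canvas pixels, taken from the vertical column of the matrix
  const height = Math.hypot(tx[2], tx[3]);
  return {
    x: tx[4],
    // tx[5] is the baseline; the canvas y axis points down, so move up
    y: tx[5] - height,
    width: item.width * scale,
    height,
  };
}

// Hypothetical helper: outline every text item on an already rendered page.
async function highlightText(page, viewport, context) {
  const { items } = await page.getTextContent();
  context.strokeStyle = 'red';
  for (const item of items) {
    const { x, y, width, height } = textItemBounds(item, viewport.transform, viewport.scale);
    context.strokeRect(x, y, width, height);
  }
}
```

To try it, call `await highlightText(page, viewport, context)` inside the load handler once the page has finished rendering, so the outlines are drawn on top of the page rather than painted over.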
