web-crawler

A web crawler used to obtain information about books from https://www.bookdepository.com/ when a barcode (isbn13) is given.

The input and output files are located in the data/ directory. The input file (inputIsbn13.txt) contains isbn13 barcodes, one per line. The output file (output.txt) contains all of the results retrieved from https://www.bookdepository.com/.
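
As a rough illustration of the file layout described above, the sketch below reads barcodes from the input file and writes results to the output file as JSON. The helper names (`parseIsbnList`, `readIsbns`, `writeResults`) are hypothetical and not taken from the repository; only the two file names come from this README.

```javascript
const fs = require('fs');
const path = require('path');

// File locations named in this README.
const INPUT_FILE = path.join('data', 'inputIsbn13.txt');
const OUTPUT_FILE = path.join('data', 'output.txt');

// Split newline-separated text into barcodes.
// Assumption: blank lines and entries that are not exactly 13 digits are skipped.
function parseIsbnList(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => /^\d{13}$/.test(line));
}

// Read the input file and return the list of barcodes.
function readIsbns(file = INPUT_FILE) {
  return parseIsbnList(fs.readFileSync(file, 'utf8'));
}

// Write an array of result objects as pretty-printed JSON.
function writeResults(results, file = OUTPUT_FILE) {
  fs.writeFileSync(file, JSON.stringify(results, null, 2), 'utf8');
}
```

A caller could then use `readIsbns()` to obtain the barcodes and pass the collected result objects to `writeResults(results)`.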

Frameworks/packages used:

Setup:

  1. Ensure that Node.js is installed

    Install v12.16.1 or higher, as this is the oldest active LTS version. Only releases that are or will become an LTS release are officially supported.

  2. Clone the repo

  3. Navigate to the cloned repo and run the following command in the terminal:

    npm i
    npm start
    
  4. The current web crawler works with the Chrome browser.
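
If it helps to verify the Node.js requirement from step 1 before running the crawler, a check along these lines could be used. This is only a sketch: `compareVersions` and `isSupportedNode` are hypothetical helpers, not part of the repository.

```javascript
// Minimum version stated in the Setup section.
const MIN_NODE_VERSION = '12.16.1';

// Compare two dotted version strings numerically, part by part.
// Returns a negative number, zero, or a positive number.
function compareVersions(a, b) {
  const partsA = a.split('.').map(Number);
  const partsB = b.split('.').map(Number);
  for (let i = 0; i < Math.max(partsA.length, partsB.length); i++) {
    const diff = (partsA[i] || 0) - (partsB[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// process.versions.node holds the running version without a leading "v".
function isSupportedNode(version = process.versions.node) {
  return compareVersions(version, MIN_NODE_VERSION) >= 0;
}

if (!isSupportedNode()) {
  console.warn(`Node.js ${process.versions.node} is older than ${MIN_NODE_VERSION}`);
}
```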

Storage of information

The following information is stored in JSON format:

  1. Barcode (isbn13)
  2. Format
  3. Dimensions
  4. Publication Date
  5. Publisher
  6. Imprint
  7. Publication Country
  8. Language
  9. Edition Statement
  10. isbn10
  11. isbn13
  12. Bestseller Rank
  13. Description
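
A stored record might look like the sketch below. The key names are assumptions chosen for illustration, since this README does not show the actual keys used in output.txt. The values are the sample strings from the parsing examples in this README, with the description shortened to its second sentence.

```javascript
// Hypothetical key names; the real output.txt may use different ones.
const exampleRecord = {
  barcode: '9780241200131',
  format: '560 pages',
  dimensions: '129 x 198 x 24mm | 383g',
  publicationDate: '01 Sep 2015',
  publisher: 'Penguin Books Ltd',
  imprint: 'PENGUIN CLASSICS',
  publicationCountry: 'London, United Kingdom',
  language: 'English',
  editionStatement: 'UK ed.',
  isbn10: '024120013X',
  isbn13: '9780241200131',
  bestsellerRank: '7,918',
  // Shortened sample text, not the full description.
  description: "An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture."
};

// Serialize with two-space indentation for readability.
const exampleJson = JSON.stringify([exampleRecord], null, 2);
console.log(exampleJson);
```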

Example of how information is parsed from https://www.bookdepository.com/ using regex

Format:

format = 'Format Paperback | 560 pages'
format.match(new RegExp(/\d+\s.+/))
>> Array [ "560 pages" ]

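The Format pattern above keeps only the page count. If the binding is also needed, one possible variation is to capture both parts with named groups. This is a sketch that assumes every value has the "Format <binding> | <n> pages" shape shown in the example; `parseFormat` is a hypothetical helper.

```javascript
// Capture the binding and the page count separately.
// Assumes the "Format <binding> | <n> pages" shape from the example above.
const FORMAT_PATTERN = /^Format\s+(?<binding>[^|]+?)\s*\|\s*(?<pages>\d+)\s+pages$/;

function parseFormat(text) {
  const match = text.trim().match(FORMAT_PATTERN);
  if (!match) return null; // unexpected shape: let the caller decide what to do
  return { binding: match.groups.binding, pages: Number(match.groups.pages) };
}

console.log(parseFormat('Format Paperback | 560 pages'));
// { binding: 'Paperback', pages: 560 }
```

Returning `null` for an unrecognised shape keeps the helper from silently storing a partial value.
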
Dimensions:

dimensions = 'Dimensions 129 x 198 x 24mm | 383g'
dimensions.match(new RegExp(/\d+.+/))
>> Array [ "129 x 198 x 24mm | 383g" ]

Publication Date

publicationDate = 'Publication date 01 Sep 2015'
publicationDate.match(new RegExp(/\d{2} \w{3} \d{4}/))
>> Array [ "01 Sep 2015" ]

Publisher

publisher = 'Publisher Penguin Books Ltd'
publisher.match(new RegExp(/[^Publisher].+/))
>> Array [ " Penguin Books Ltd" ]

Imprint

imprint = 'Imprint PENGUIN CLASSICS'
imprint.match(new RegExp(/[^Imprint].+/i))
>> Array [ " PENGUIN CLASSICS" ]

Publication Country

publicationCountry = 'Publication City/Country London, United Kingdom'
publicationCountry.match(new RegExp(/[^(?!Publication City\/Country )].+/))
>> Array [ "London, United Kingdom" ]

Language

language = 'Language English'
language.match(new RegExp(/[^(?!language)].+/i))
>> Array [ " English" ]

Edition Statement

editionStatement = 'Edition Statement UK ed.'
editionStatement.match(new RegExp(/[^(?!edition statement)].+/i))
>> Array [ "UK ed." ]

isbn10

isbn10 = 'ISBN10 024120013X'
isbn10.match(new RegExp(/[^(?!isbn10)].+/i))
>> Array [ " 024120013X" ]

isbn13

isbn13 = 'ISBN13 9780241200131'
isbn13.match(new RegExp(/[^(?!isbn13)].+/i))
>> Array [ " 9780241200131" ]

Bestseller Rank

bestsellerRank = 'Bestsellers rank 7,918'
bestsellerRank.match(new RegExp(/[^(?!bestsellers rank)].+/i))
>> Array [ "7,918" ]

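One caveat about the label-based patterns above: `[^...]` is a negated character class, so a pattern such as `/[^(?!Publication City\/Country )].+/` starts matching at the first character that is not in that set; it does not act as a negative lookahead. It produces the right result for the sample strings, but a value that begins with characters from the label can lose them. For a hypothetical input `'Publication City/Country Paris, France'` that pattern returns `"s, France"`. A possible alternative, sketched below with the hypothetical helpers `escapeRegExp` and `stripLabel`, is to anchor on the label itself and capture what follows.

```javascript
// Escape regex metacharacters so a label can be embedded in a pattern literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Remove a leading label (case-insensitive) and surrounding whitespace.
// Returns null when the text does not start with the label.
function stripLabel(text, label) {
  const pattern = new RegExp('^\\s*' + escapeRegExp(label) + '\\s*(.*)', 'i');
  const match = text.match(pattern);
  return match ? match[1].trim() : null;
}

console.log(stripLabel('Publisher Penguin Books Ltd', 'Publisher'));
// Penguin Books Ltd
console.log(stripLabel('Publication City/Country Paris, France', 'Publication City/Country'));
// Paris, France
```

Because the result is trimmed, the leading space seen in some of the outputs above (for example `" English"`) is removed as well.
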
Description

Collapsing runs of two or more whitespace characters (including newlines) into a single newline, then removing the trailing "show more" text:

results.forEach(result => {
   result.description = result.description.replace(/[\r\n\s]{2,}/g, "\n");
   result.description = result.description.replace(/[\r\n\s]*(show more)[\r\n\s]*$/, "");
   result.description = result.description.trim();
});
description
>> "
                Description


                    With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture.
                    show more

            ";

description.match(new RegExp(/[^(?!\n\sdescription)].+/i))
>> Array [ "With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture.    " ]
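
The description steps above could also be combined into a single helper. The sketch below is one possible arrangement, not the repository's implementation: `cleanDescription` is a hypothetical name, and it assumes the raw text starts with a "Description" heading and may end with "show more", as in the sample.

```javascript
// Hypothetical helper combining the clean-up steps shown above.
function cleanDescription(raw) {
  return raw
    .replace(/^\s*Description\s*/i, '') // drop the leading "Description" heading
    .replace(/\s*show more\s*$/i, '')   // drop the trailing "show more" text
    .replace(/\s*[\r\n]+\s*/g, '\n')    // collapse blank lines and indentation
    .trim();
}

const rawDescription = `
                Description


                    First paragraph of the text.

                    Second paragraph of the text.
                    show more

            `;

console.log(JSON.stringify(cleanDescription(rawDescription)));
// "First paragraph of the text.\nSecond paragraph of the text."
```

One limitation of this sketch: a description whose own text begins with the word "Description" would lose that word.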

References

How to Setup WebdriverIO

WebdriverIO Selectors

Regular Expressions (RegEx) in 100 Seconds

Will Brock - 09 Selecting elements on a page - WebdriverIO

WebdriverIO setTimeout

Node.js - How do i write files in Node.js?

Contributors

jeremyloh

web-crawler's Issues

Formatting of book description

To check different types of books and handle their text formatting

e.g. https://www.bookdepository.com/The-Book-of-Disquiet-Fernando-Pessoa/9780241200131

description: "With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture. \n" +
' show more'
