Novel Scraper

A tool built to scrape sites that present web novels as web pages, designed with flexibility in mind.

Usage:

Installation:

You only need to download the main.py script.

Needed Packages:

  • requests
  • BeautifulSoup4
  • time (standard library)
  • json (standard library)
  • os (standard library)

Preparation:

Before executing the script, you need to go to the site of choice and open the inspector. There, you will have to look for the following elements:

  • The URL of the first chapter (or whichever chapter was last downloaded)
  • The base URL to which the (mostly relative) links are appended
  • The element containing the title
  • The link that leads to the next chapter
  • The element(s) that contain(s) the text

Filters for the needed elements

You can use any of these to get an element in the page.

  • class name
  • id name
  • element name

Using the id or class is recommended over the element name, though, because the whole DOM is searched and an element name usually matches many elements.

You can also use multiple key-value pairs, for example for the link, so that you can check for multiple variants with just one more line in the .json file.
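As a rough illustration of how such key-value pairs could be read, here is a minimal sketch. It assumes that each key is a filter kind ("class", "id" or "element") followed by a number, as in the example file, and that other keys such as "find_all" are options rather than filters. The function name parse_finder and the class "next-button" are hypothetical and not taken from main.py.

```python
import re

# Assumed key format: a filter kind followed by a number, e.g. "id1" or "class2".
KEY_PATTERN = re.compile(r"^(class|id|element)(\d+)$")


def parse_finder(finder):
    """Return the (kind, value) filters of a finder dict in file order."""
    filters = []
    for key, value in finder.items():
        match = KEY_PATTERN.match(key)
        if match is None:
            # Keys such as "find_all" are options, not filters, so skip them.
            continue
        filters.append((match.group(1), value))
    return filters


# Two variants for the next-chapter link: an id and a (hypothetical) class.
next_page_finder = {"id1": "next_chap", "class1": "next-button"}
print(parse_finder(next_page_finder))
# → [('id', 'next_chap'), ('class', 'next-button')]
```

Each pair could then be tried in turn against the page until one of them matches.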

Example

For example, this novel on readnovelfull.com.

The .json file will look something like this:

{
  "last": "/world-domination-system/chapter-c-1-power-levels-and-abbreviations/",
  "last_write_to": null,
  "base_url": "https://readnovelfull.me",
  "chapters_done": 0,
  "next_page_finder": {
    "id1": "next_chap"
  },
  "title_finder": {
    "class1": "chr-text"
  },
  "text_finder": {
    "find_all": true,
    "element1": "p"
  },
  "skip": false
}
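Before starting a run, it can help to check that the file contains every key. The following is a minimal sketch using only the json module; the helper name missing_keys is hypothetical, and the list of required keys is simply taken from the example above, not from main.py.

```python
import json

# Keys that appear in the example file above.
REQUIRED_KEYS = (
    "last", "last_write_to", "base_url", "chapters_done",
    "next_page_finder", "title_finder", "text_finder", "skip",
)


def missing_keys(config):
    """Return the required keys that are absent from the parsed config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# A deliberately incomplete file to show what the check reports.
incomplete = """
{
  "last": "/world-domination-system/chapter-1-daneel/",
  "last_write_to": null,
  "base_url": "https://readnovelfull.me",
  "chapters_done": 0
}
"""
config = json.loads(incomplete)
print(config["last_write_to"])  # JSON null is loaded as Python None
print(missing_keys(config))
```

Note that json.loads turns null into None and true/false into True/False, which is what the script sees when it reads the file.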

"last"

The "last" chapter allows the scraper to resume where it left off. It's important to note that I programmed this under the impression that all novel websites use relative links.
Because of that, the last chapter has to be a relative link, e.g. "/world-domination-system/chapter-1-daneel/". The rest of the URL is built from the base_url parameter. The link must start with a "/"; without it, the program will fail.
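The way the two values combine can be sketched as follows. This assumes the full URL is the base URL with the relative link appended; the function name build_chapter_url is hypothetical, and main.py may build the URL differently.

```python
def build_chapter_url(base_url, last):
    """Join the base URL and a relative chapter link (assumed behaviour)."""
    if not last.startswith("/"):
        # Mirrors the requirement above: the relative link needs a leading "/".
        raise ValueError('"last" must be a relative link starting with "/"')
    # Drop a trailing "/" on the base URL so the result has exactly one.
    return base_url.rstrip("/") + last


print(build_chapter_url(
    "https://readnovelfull.me",
    "/world-domination-system/chapter-1-daneel/",
))
# → https://readnovelfull.me/world-domination-system/chapter-1-daneel/
```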

"last_write_to"

"last_write_to" can be set to null, because nothing has been written to a file yet and the program creates new files automatically.

"base_url"

The previously mentioned "base_url" is "https://readnovelfull.me" in our case, as all of the links to the next chapter are relative (e.g. "/world-domination-system/chapter-1-daneel/").

"chapters_done"

"chapters_done" is used to split the output into files of 100 chapters each. Set it to 0 at the start, as the program checks whether the number of chapters done is divisible by 100.

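The divisibility check for "chapters_done" can be illustrated with a short sketch. The function names and the file numbering are hypothetical; only the figure of 100 chapters per file comes from the description above.

```python
CHAPTERS_PER_FILE = 100  # from the description of "chapters_done"


def starts_new_file(chapters_done):
    """True when the next chapter should open a new output file."""
    return chapters_done % CHAPTERS_PER_FILE == 0


def file_number(chapters_done):
    """1-based number of the file the next chapter goes into (assumed scheme)."""
    return chapters_done // CHAPTERS_PER_FILE + 1


for done in (0, 99, 100, 250):
    print(done, starts_new_file(done), file_number(done))
```

With a starting value of 0 the check is true immediately, which is presumably why the first file gets created on the first chapter.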
"next_page_finder"

The element containing the link that directs to the next chapter is:
<a class="btn btn-success" href="/world-domination-system/chapter-1-daneel/" title="Chapter 1 Daneel" id="next_chap">...</a>
To filter for it in the program, use one of the filters listed earlier: class name, id name, or element name.
Because there are multiple elements with the class "btn" or "btn-success", we will use the id, which is "next_chap".
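To show what filtering by id amounts to, here is a small sketch that pulls the href out of the element shown above. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; it is not how main.py does it, and the class name LinkById is hypothetical.

```python
from html.parser import HTMLParser


class LinkById(HTMLParser):
    """Remember the href of the first element with the wanted id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Keep only the first match, like a single-element lookup would.
        if self.href is None and attributes.get("id") == self.wanted_id:
            self.href = attributes.get("href")


html = (
    '<a class="btn btn-success" '
    'href="/world-domination-system/chapter-1-daneel/" '
    'title="Chapter 1 Daneel" id="next_chap">...</a>'
)
parser = LinkById("next_chap")
parser.feed(html)
parser.close()
print(parser.href)  # → /world-domination-system/chapter-1-daneel/
```

With BeautifulSoup, the comparable lookup would be soup.find(id="next_chap"), followed by reading the "href" attribute of the result.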

"title_finder"

The element containing the title is:
<span class="chr-text">Chapter c-1 Power Levels And Abbreviations</span>
It has a class name, "chr-text", so we will use that for the filter.

"text_finder"

The last thing we need is the elements that contain the text.
The text is spread across many <p> elements, so we filter by the element name "p" and add the "find_all" key, set to true, so that all of them are collected.
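The effect of "find_all" can be sketched as collecting every matching element rather than only the first. The example below again uses the standard library's html.parser rather than BeautifulSoup, the class name ParagraphCollector is hypothetical, and the HTML snippet is made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element in the page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")  # start a new paragraph

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        # Text outside of <p> elements is ignored.
        if self._inside_p:
            self.paragraphs[-1] += data


# Made-up page fragment: two paragraphs and one unrelated element.
html = "<div><p>First paragraph.</p><span>ad</span><p>Second <b>bold</b> one.</p></div>"
collector = ParagraphCollector()
collector.feed(html)
collector.close()
print(collector.paragraphs)
# → ['First paragraph.', 'Second bold one.']
```

With BeautifulSoup, soup.find_all("p") returns all matching elements in a comparable way.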

"skip"

The last key in our .json file is "skip". Set it to false here; it exists so that, when downloading new chapters later, the last chapter is not downloaded again.
