martial-god / benny-scraper

Webnovel and Manga Scraper that stores Webnovels as Epubs, and mangas as either PDFs or Comicbook Archives

Home Page: https://feahnthor.github.io/

License: GNU General Public License v3.0

C# 100.00%
database entity-framework-core novels scraping-websites selenium-csharp manga-scraper web-novels manga manga-downloader sqlite

benny-scraper's Issues

[enhancement] Supporting webnovel.com

Since I was told on Reddit to make a request here, here it is. I don't really know how it would be done, but if you are unable to do it, I'd understand.

Add missed chapters to an already downloaded novel

Problem


A feature is needed to add individual chapters to an already downloaded novel. Whether this will be a standalone command line option will be determined later. This will be useful for cases like the image below, where chapters could not be loaded and an incomplete file was created. It is also useful for those who only want the most recent chapters, or a particular range of chapters, from a series.

image

Possible things to do (may change as I go along):

  1. This needs to be implemented for each of the file types Benny-Scraper currently supports, i.e. Pdf, Epub, and all ComicbookArchives.
  2. Files that will need to be touched in order to get this working:
    a. NovelProcessor.cs - the AddNewNovelAsync and UpdateExistingNovelAsync methods.
    b. ScraperStrategy.cs - the FetchContentByAttribute method, as this is what actually grabs our chapter URLs from the websites. It may be a good idea to add a new attribute; if so, we need to create a way for users to enter a range of chapters or specific URLs when the application runs.
    c. All the Generator files (PdfGenerator, EpubGenerator, ComicBookArchiveGenerator) - ensure they will update or add these chapters without a problem.
    d. Program.cs - the command line region. I think it might be easier to add this as a command line feature first, so a new option will need to be added in CommandlineOptions.cs.
  3. It might be a good idea to create a column that records whether a novel is missing chapters when it is downloaded. From a quick search, I found that, at least for novels with images, chapters are skipped entirely when being added to the database. Notice that chapter number 45 is missing, as the image above shows.
select p.id, p.chapter_id, c.title, c.number, p.image, p.url from novel
inner join chapter as c
on c.novel_id = novel.id
join page as p
on p.chapter_id = c.id
where novel.id = 'ABEA4D61-9685-46FA-AE3A-3E4B48E9CD7B'
order by c.number asc;
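
A check like the one item 3 describes could start from the chapter numbers returned by the query above. The following is only a rough sketch, written in Python for brevity (the project itself is C#). The function name `find_missing_chapters` and the sample numbers are hypothetical, and it assumes chapter numbers are integers that are expected to be contiguous.

```python
from typing import Iterable, List


def find_missing_chapters(numbers: Iterable[int]) -> List[int]:
    """Return chapter numbers absent between the lowest and highest number seen."""
    present = set(numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Hypothetical values of the c.number column, with chapter 45 skipped.
stored = [43, 44, 46, 47]
print(find_missing_chapters(stored))  # prints [45]
```

A non-empty result could be what sets the proposed "missing chapters" column, and the same list could later drive which chapters to re-download.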

image

Task: Fix `Forbidden` Error When Scraping LightNovelWorld.com

This is a task that someone can contribute to that is not too difficult.

Location of Error

LoadHtmlAsync(Uri uri) in ScraperStrategy.cs, which is called by ScrapeAsync() in LightNovelWorldStrategy.cs.

Approaches to Try

  1. Update User Agents: The list of user agents might be outdated. Refresh this list with the latest strings from popular browsers. This is most likely the best way to resolve the error.

  2. Add New Headers: Mimic a legitimate browser session by adding headers such as Referer, Accept-Language, and proper Cookie values based on a real session.

  3. Rate Limiting and IP Rotation: Implement delays between requests and explore using proxy services for IP rotation to avoid triggering rate limits or IP bans.

  4. Use headless Selenium to get cookies: Consider using a headless browser driven by Selenium to programmatically access the site and capture the required cookies. This method simulates a real user's browser session, which can help in bypassing detection mechanisms that rely on the absence of typical browser-generated headers and cookies.

  • Implementation Steps:
    a. Use Selenium with a headless browser configuration (e.g., Chrome or Firefox in headless mode) to navigate to the target website. This will make use of the DriverFactory.cs for creating the driver.
    b. Perform any necessary interactions to initiate a session (e.g., navigating through pages, logging in if required).
    c. Extract cookies from the browser session once it's established.
    d. Include these cookies in the headers of your subsequent HTTP requests made without Selenium.
    Benefits: This approach can dynamically adapt to changes in the site's cookie policy and session management, reducing the likelihood of being blocked due to missing or outdated cookies.
    Considerations: Be aware that using Selenium for frequent or large-scale scraping can be resource-intensive and detectable by more advanced bot detection systems.
  5. Switch to Selenium: As a last resort, consider using Selenium for dynamic page rendering and interaction. This should be carefully evaluated due to its higher resource usage and complexity.
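
Approaches 2 and 4 both come down to sending browser-like headers together with cookies captured from a real session. The following is a minimal sketch of that step, in Python for brevity (the project itself is C#). The function name `build_request_headers`, the cookie values, and the header values are placeholders and are not taken from the project. The cookie shape (a mapping with "name" and "value" keys) is what Selenium's Python bindings return from `driver.get_cookies()`; the C# bindings expose cookies differently.

```python
from typing import Dict, Iterable, Mapping


def build_request_headers(
    cookies: Iterable[Mapping[str, str]],
    user_agent: str,
    referer: str,
) -> Dict[str, str]:
    """Combine browser-like headers with cookies captured from a real session."""
    # Join the cookies into a single "name=value; name=value" Cookie header.
    cookie_header = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
    headers = {
        "User-Agent": user_agent,
        "Referer": referer,
        "Accept-Language": "en-US,en;q=0.9",
    }
    if cookie_header:
        headers["Cookie"] = cookie_header
    return headers


# Placeholder values; real ones would come from a current browser session.
captured = [
    {"name": "session", "value": "abc123"},
    {"name": "token", "value": "xyz"},
]
headers = build_request_headers(
    captured,
    user_agent="Mozilla/5.0 (placeholder user agent)",
    referer="https://www.lightnovelworld.com/",
)
print(headers["Cookie"])  # prints session=abc123; token=xyz
```

Keeping this as a pure function makes it easy to log exactly which headers were sent, which is what the contribution guidelines below ask for.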

Contribution Guidelines

  • Document your attempts, including the headers used and the server's response.
  • Share insights or findings in the discussion, even if they didn't resolve the issue.
  • Respect the website's terms and the legal aspects of web scraping.

I will most likely try to resolve this task today if no one picks it up.

Error when trying to scrape a manga, not sure what counts as a valid URL

Hello, I am a user of a Windows 11 x86-64 machine. I cloned the repo for this project and then turned it into an executable as specified in the readme document. I then tried to have the program scrape this manga using this URL: https://mangakakalot.to/undead-unluck-7025

However, the extraction did not complete, and the program let out this stack trace:

20:18:34 Info Novel with url https://mangakakalot.to/undead-unluck-7025 is not in database, adding it now.
20:18:34 Info Getting novel data for MangaKakalotStrategy
20:18:35 Info Response status code: OK
20:18:36 Error Error occurred while getting novel data from table of contents. Error: System.ArgumentNullException: Value cannot be null. (Parameter 'source')
   at System.Linq.ThrowHelper.ThrowArgumentNullException(ExceptionArgument argument)
   at System.Linq.Enumerable.Select[TSource,TResult](IEnumerable`1 source, Func`2 selector)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.Impl.NovelDataInitializer.FetchContentByAttribute(Attr attr, NovelDataBuffer novelDataBuffer, HtmlDocument htmlDocument, ScraperData scraperData) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 135
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.MangaKakalotInitializer.FetchNovelContentAsync(NovelDataBuffer novelDataBuffer, HtmlDocument htmlDocument, ScraperData scraperData, ScraperStrategy scraperStrategy) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\MangaKakalotStrategy.cs:line 38
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.MangaKakalotStrategy.FetchNovelDataFromTableOfContentsAsync(HtmlDocument htmlDocument) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\MangaKakalotStrategy.cs:line 95
20:18:36 Info Finished populating Novel data for Undead Unluck
20:18:36 Info Getting chapters data
20:18:36 Info Using Selenium to get chapters data
20:18:36 Error Error while getting chapters data. System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 456
20:18:36 Error Exception when trying to process novel. System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 456
   at Benny_Scraper.BusinessLogic.NovelProcessor.AddNewNovelAsync(Uri novelTableOfContentsUri, ScraperStrategy scraperStrategy) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 91
   at Benny_Scraper.BusinessLogic.NovelProcessor.ProcessNovelAsync(Uri novelTableOfContentsUri) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 62
   at Benny_Scraper.Program.RunAsync() in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper\Program.cs:line 105
20:18:36 Info Elapsed time: 00:00:01.8044361

As far as I can understand it, the scraper thinks that the chapter list is empty or something, but I'm not sure. Any tips to get this working?

P.S. When the scraper actually makes an epub or PDF, where does it go? Can I change the output format manually?

Chapters not sorted properly for novels that do not follow the chapter-{chapter #} convention

I recently found that Classroom of the Elite does not sort properly with the current method of sorting chapters when they are added to the database, which affects the order of the files that are created. An example is below, where chapters from three separate volumes were treated as the first three chapters.

https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-2-chapter-0
https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-0-chapter-0
https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-3-chapter-1-0
https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-2-chapter-1-0
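
One possible direction is to sort on a key built from every number in the URL slug, not only the chapter number. The following is a rough sketch in Python for brevity (the project itself is C#). The function name `chapter_sort_key` is hypothetical, and treating the trailing number in slugs such as vol-3-chapter-1-0 as a part within the chapter is an assumption about the site's naming, not something confirmed here.

```python
import re
from typing import Tuple

# Matches "chapter-12", "vol-2-chapter-0" and "vol-3-chapter-1-0" style slugs.
_SLUG = re.compile(r"(?:vol-(\d+)-)?chapter-(\d+)(?:-(\d+))?$")


def chapter_sort_key(url: str) -> Tuple[int, int, int]:
    """Return (volume, chapter, part) for a chapter URL; missing pieces count as 0."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    match = _SLUG.search(slug)
    if match is None:
        raise ValueError(f"unrecognised chapter slug: {slug}")
    volume, chapter, part = (int(g) if g is not None else 0 for g in match.groups())
    return volume, chapter, part


base = "https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/"
urls = [
    base + "vol-2-chapter-0",
    base + "vol-0-chapter-0",
    base + "vol-3-chapter-1-0",
    base + "vol-2-chapter-1-0",
]
for url in sorted(urls, key=chapter_sort_key):
    print(url)
```

With this key the four URLs above come out ordered by volume first, then by chapter within each volume, while plain chapter-{chapter #} slugs still sort by chapter number because their volume defaults to 0.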
