martial-god / benny-scraper

Webnovel and Manga Scraper that stores Webnovels as Epubs, and mangas as either PDFs or Comicbook Archives

License: GNU General Public License v3.0

C# 100.00%

database entity-framework-core novels scraping-websites selenium-csharp manga-scraper web-novels manga manga-downloader sqlite

benny-scraper's People

Forkers

davidalphafox yonkyunior

benny-scraper's Issues

[enhancement] Supporting webnovel.com

I was told on Reddit to make the request here, so here it is. I don't really know how it would be done, but if you are unable to, I'd understand.

Pdfs throwing "The process cannot access the file because it is being used by another process" on update

This issue is caused by the `using` statement not disposing of file objects properly when updating the file. Switched back to `using` with brackets.
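The general pattern behind the fix is to make sure every handle is released before the file is replaced. The project is C#, so the following is only an analogous sketch in Python, where a `with` block plays the role of `using` with brackets. The helper name `update_file` and the temp-file approach are assumptions for illustration, not the project's code.

```python
import os
import tempfile


def update_file(path: str, extra: bytes) -> None:
    """Append `extra` to the file at `path` via a temp file (hypothetical helper)."""
    # Scope 1: the read handle is closed as soon as this block ends.
    with open(path, "rb") as src:
        existing = src.read()

    # Scope 2: the write handle is also closed before the replace below.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as dst:
        dst.write(existing + extra)

    # Neither handle is open at this point, so this code no longer holds
    # a lock on the file it is about to overwrite.
    os.replace(tmp_path, path)
```

Keeping each handle in its own explicit scope makes it obvious where the file is released, which is the property the update step depends on.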

Switched boolean result for Saved As Single File; it should be YES when the file is not split.

Handle Mangareader's new privacy settings page, which prevents chapter pages from loading.

Restructure code Architecture

Restructure the abomination so that others can easily understand the code architecture.

Mangakalot and related sites scramble images of some mangas

Some mangas have scrambled images, such as the ones above. I need to figure out what algorithm was used and undo it.

Priority on this is not too high as https://mangakatana.com/ does not have that issue.

Add missed chapters to an already downloaded novel

Problem

A feature needs to be added to add individual chapters; whether or not this will be a standalone command-line option will be determined later. This will be useful for cases like the image below, where chapters could not be loaded and an incomplete file was created. It would also be useful for those who want only the most recent chapters, or a particular range of chapters, from a series.
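As a rough idea of how a user-entered chapter selection could be interpreted, here is a minimal sketch in Python (the project itself is C#). The spec format `"1-3,45"` and the function name `parse_chapter_selection` are assumptions; nothing like this exists in the codebase yet.

```python
def parse_chapter_selection(spec: str) -> list[int]:
    """Turn a spec such as "1-3,45" into a sorted list of chapter numbers."""
    chapters: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range start exceeds end: {part!r}")
            chapters.update(range(start, end + 1))
        else:
            chapters.add(int(part))
    return sorted(chapters)


print(parse_chapter_selection("1-3,45"))  # [1, 2, 3, 45]
```

Returning a plain sorted list keeps the scraping code independent of how the selection was typed, so the same list could later come from a command-line option or from a missing-chapter check.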

Possible things to do (may change as I go along)

This needs to be implemented for each of the file types Benny-Scraper currently supports, i.e. Pdf, Epub, and all Comicbook Archives.
Files that will need to be touched in order to get this working:
a. NovelProcessor.cs - the AddNewNovelAsync and UpdateExistingNovelAsync methods.
b. ScraperStrategy.cs - the FetchContentByAttribute method, as this is what actually grabs our chapter urls from the websites. It may be a good idea to add a new attribute; if so, we need to create a way for users to enter a range of chapters or specific urls when the application runs.
c. All the Generator files (PdfGenerator, EpubGenerator, ComicBookArchiveGenerator) - ensure they will update or add these chapters without a problem.
d. Program.cs command line region - I think it might be easier to add this as a command-line feature first, so a new option will need to be made in `CommandlineOptions.cs`.
It might be a good idea to create a column that records whether a novel is missing chapters once it has been downloaded. When doing a quick search I found that, at least for novels with images, chapters are skipped entirely when being added to the database. Notice how chapter number 45 is missing, as the image above shows.

select p.id, p.chapter_id, c.title, c.number, p.image, p.url from novel
inner join chapter as c
on c.novel_id = novel.id
join page as p
on p.chapter_id = c.id
where novel.id = 'ABEA4D61-9685-46FA-AE3A-3E4B48E9CD7B'
order by c.number asc;
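Once the chapter numbers from a query like the one above are in hand, spotting gaps is straightforward. A minimal sketch in Python (the project is C#), assuming chapter numbers are integers and a series is numbered contiguously; the function name `find_missing_chapters` is hypothetical.

```python
def find_missing_chapters(stored_numbers: list[int]) -> list[int]:
    """Return chapter numbers absent between the lowest and highest stored ones."""
    if not stored_numbers:
        return []
    present = set(stored_numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


print(find_missing_chapters([43, 44, 46, 47]))  # [45]
```

This only finds gaps between the first and last stored chapter; chapters missing from the very start or end of a series would need the site's chapter count for comparison.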

Task: Fix `Forbidden` Error When Scraping LightNovelWorld.com

This is a task that is not too difficult for someone to contribute to.

Location of Error

ScraperStrategy.cs.LoadHtmlAsync(Uri uri), which is called by LightNovelWorldStrategy.cs.ScrapeAsync()

Approaches to Try

Update User Agents: The list of user agents might be outdated. Refresh this list with the latest strings from popular browsers. This is most likely the best way to resolve the error.
Add New Headers: Mimic a legitimate browser session by adding headers such as Referer, Accept-Language, and proper Cookie values based on a real session.
Rate Limiting and IP Rotation: Implement delays between requests and explore using proxy services for IP rotation to avoid triggering rate limits or IP bans.
Use headless Selenium to get cookies: Consider implementing a headless browser solution like Selenium to programmatically access the site and capture the required cookies. This method simulates a real user's browser session, which can help in bypassing detection mechanisms that rely on the absence of typical browser-generated headers and cookies.

Implementation Steps:
a. Use Selenium with a headless browser configuration (e.g., Chrome or Firefox in headless mode) to navigate to the target website. This will make use of the DriverFactory.cs for creating the driver.
b. Perform any necessary interactions to initiate a session (e.g., navigating through pages, logging in if required).
c. Extract cookies from the browser session once it's established.
d. Include these cookies in the headers of your subsequent HTTP requests made without Selenium.
Benefits: This approach can dynamically adapt to changes in the site's cookie policy and session management, reducing the likelihood of being blocked due to missing or outdated cookies.
Considerations: Be aware that using Selenium for frequent or large-scale scraping can be resource-intensive and detectable by more advanced bot detection systems.

Switch to Selenium: As a last resort, consider using Selenium for dynamic page rendering and interaction. This should be carefully evaluated due to its higher resource usage and complexity.
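To make the first few approaches concrete, here is a minimal sketch of the request-building side, written in Python for illustration (the project is C#). The user agent strings, the `Accept-Language` value, and all function names are assumptions. Cookies are taken as a list of dicts with `name` and `value` keys, which is the shape Selenium's Python bindings return from `get_cookies()`. Whether any of this actually clears the `Forbidden` response still has to be tested against the site.

```python
import random
from typing import Optional

# Example strings only; refresh these from current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
]


def cookie_header(cookies: list[dict]) -> str:
    """Join browser cookies (dicts with "name" and "value") into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def build_headers(
    referer: str,
    cookies: list[dict],
    rng: Optional[random.Random] = None,
) -> dict[str, str]:
    """Build browser-like headers with a randomly chosen user agent."""
    rng = rng or random.Random()
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": referer,
        "Accept-Language": "en-US,en;q=0.9",
    }
    if cookies:  # only send a Cookie header when a session was captured
        headers["Cookie"] = cookie_header(cookies)
    return headers


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling each time."""
    return min(cap, base * (2 ** attempt))
```

When documenting attempts, logging the exact dict returned by `build_headers` next to the server's status code gives others something reproducible to compare against.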

Contribution Guidelines

Document your attempts, including the headers used and the server's response.
Share insights or findings in the discussion, even if they didn't resolve the issue.
Respect the website's terms and the legal aspects of web scraping.

I will most likely try to resolve this task today if no one picks it up.

Please add noveldrama.com website

Error when trying to scrape a manga, not sure what counts as a valid URL

Hello, I am a user of a Windows 11 x86-64 machine. I cloned the repo for this project and then turned it into an executable as specified in the readme document. I then tried to have the program scrape this manga using this URL: https://mangakakalot.to/undead-unluck-7025

However, the extraction did not complete, and the program let out this stack trace:

20:18:34 Info Novel with url https://mangakakalot.to/undead-unluck-7025 is not in database, adding it now.
20:18:34 Info Getting novel data for MangaKakalotStrategy
20:18:35 Info Response status code: OK
20:18:36 Error Error occurred while getting novel data from table of contents. Error: System.ArgumentNullException: Value cannot be null. (Parameter 'source')
   at System.Linq.ThrowHelper.ThrowArgumentNullException(ExceptionArgument argument)
   at System.Linq.Enumerable.Select[TSource,TResult](IEnumerable`1 source, Func`2 selector)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.Impl.NovelDataInitializer.FetchContentByAttribute(Attr attr, NovelDataBuffer novelDataBuffer, HtmlDocument htmlDocument, ScraperData scraperData) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 135
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.MangaKakalotInitializer.FetchNovelContentAsync(NovelDataBuffer novelDataBuffer, HtmlDocument htmlDocument, ScraperData scraperData, ScraperStrategy scraperStrategy) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\MangaKakalotStrategy.cs:line 38
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.MangaKakalotStrategy.FetchNovelDataFromTableOfContentsAsync(HtmlDocument htmlDocument) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\MangaKakalotStrategy.cs:line 95
20:18:36 Info Finished populating Novel data for Undead Unluck
20:18:36 Info Getting chapters data
20:18:36 Info Using Selenium to get chapters data
20:18:36 Error Error while getting chapters data. System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 456
20:18:36 Error Exception when trying to process novel. System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 456
   at Benny_Scraper.BusinessLogic.NovelProcessor.AddNewNovelAsync(Uri novelTableOfContentsUri, ScraperStrategy scraperStrategy) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 91
   at Benny_Scraper.BusinessLogic.NovelProcessor.ProcessNovelAsync(Uri novelTableOfContentsUri) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 62
   at Benny_Scraper.Program.RunAsync() in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper\Program.cs:line 105
20:18:36 Info Elapsed time: 00:00:01.8044361

As far as I can understand it, the scraper thinks that the chapter list is empty or something, but I'm not sure. Any tips to get this working?

P.S. When the scraper actually makes an epub or PDF, where does it go? Can I change the output format manually?

Chapters not sorted properly when encountering novels that do not follow the convention of chapter-{chapter #}

I recently found that Classroom of the Elite is not sorted properly by the current method used to order chapters when they are added to the database, which affects the order of the files that are created. An example of this is below, where three separate volumes were considered the first three chapters.

https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-2-chapter-0
https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-0-chapter-0
https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-3-chapter-1-0
https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/vol-2-chapter-1-0
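One possible direction is to sort on every number in the chapter slug rather than on a single chapter number. A minimal sketch in Python (the project is C#), assuming the numbers in the last path segment appear in the order volume, chapter, sub-part; `chapter_sort_key` is a hypothetical name and this is not how the project currently sorts.

```python
import re
from urllib.parse import urlparse


def chapter_sort_key(url: str) -> tuple[int, ...]:
    """Use every number in the last path segment as a sort key, left to right."""
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return tuple(int(n) for n in re.findall(r"\d+", slug))


base = "https://www.lightnovelworld.com/novel/classroom-of-the-elite-547/"
urls = [
    base + "vol-2-chapter-0",
    base + "vol-0-chapter-0",
    base + "vol-3-chapter-1-0",
    base + "vol-2-chapter-1-0",
]
for url in sorted(urls, key=chapter_sort_key):
    print(url.rsplit("/", 1)[-1])
# vol-0-chapter-0
# vol-2-chapter-0
# vol-2-chapter-1-0
# vol-3-chapter-1-0
```

Because only the last path segment is inspected, the numeric id in the novel slug (547 here) does not affect the order. Novels that mix slug styles within one series would still need extra handling.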

Lightnovelworld and its associated sites change the url for novels, making it hard to update

The id of

On 2023-09-07 16:19:01.3383848 the url for Supremacy Games was https://www.lightnovelworld.com/novel/supremacy-games-30071448. Today that url redirects to https://www.lightnovelworld.com/novel/supremacy-games-16091309. Since the url is what we use to identify whether a novel is already stored in the database, a better way to handle sites like this needs to be made.
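One option would be to identify a novel by a key that ignores the changing numeric suffix. A minimal sketch in Python (the project is C#), assuming the trailing `-<digits>` part of the slug is the only piece that changes; `stable_novel_key` is a hypothetical name.

```python
import re
from urllib.parse import urlparse


def stable_novel_key(url: str) -> str:
    """Build a lookup key from the host and the slug without its trailing numeric id."""
    parsed = urlparse(url)
    slug = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    slug = re.sub(r"-\d+$", "", slug)  # drop the id that the site regenerates
    return f"{parsed.netloc.lower()}/{slug}"


old = "https://www.lightnovelworld.com/novel/supremacy-games-30071448"
new = "https://www.lightnovelworld.com/novel/supremacy-games-16091309"
print(stable_novel_key(old) == stable_novel_key(new))  # True
```

Two caveats: a title that genuinely ends in a number could collide with another title once the suffix is stripped, and the associated sites use different hosts, so matching across them would need the host left out of the key or mapped to a common name.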