issung / gchan

Scrape boards & threads from 4chan. Download images, videos and HTML if desired.

License: GNU General Public License v3.0

C# 100.00%
scraper 4chan 4chan-downloader daemon winforms csharp dotnet scrape gchan 4chan-scraper

gchan's Introduction

Hey, I'm Issung

  • I'm a full time Software Engineer building mainly .NET Web APIs.
  • I maintain a couple open source projects, namely GChan and SorterExpress.
  • I enjoy game dev and experiment with it in my spare time.
  • Lots of ideas, not enough time.

gchan's People

Contributors

dependabot[bot], fugimuffi, issung, mhetralla, mistressashai, orkhanag, ricardo1991, royaljackal

gchan's Issues

Cancel downloads in progress when threads removed

New download managers from bugfix/imagelink-download-duplicates-raceconditions-threadsafety branch will allow cancellation of downloads in progress, and stopping future downloads too.

A few things are needed:

  • Need to use a download technique that supports cancellation. Probably need to swap to using HttpClient.
    • Will also need to adjust all download exception handling.
  • When user changes "SaveHTML" setting, cancel thread html downloads if false.
  • When a thread is removed, cancel image downloads and cancel html download.
  • Cancel when user initiates removal, so no more exceptions happen regarding renaming folders.

Search code for comments labeled with this issue's number.
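The HttpClient approach mentioned above could look roughly like the following. This is a minimal sketch, not GChan's actual download manager: the class name CancellableDownloader and the method DownloadAsync are hypothetical, and real code would need the adjusted exception handling the issue calls for.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not part of GChan.
public static class CancellableDownloader
{
    private static readonly HttpClient Client = new HttpClient();

    /// <summary>
    /// Downloads url to path. Returns false if the operation was cancelled
    /// (or timed out, since HttpClient reports timeouts as cancellation).
    /// </summary>
    public static async Task<bool> DownloadAsync(string url, string path, CancellationToken token)
    {
        try
        {
            // Bail out before doing any work if removal was already requested.
            token.ThrowIfCancellationRequested();

            using (var response = await Client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead, token))
            {
                response.EnsureSuccessStatusCode();

                using (var source = await response.Content.ReadAsStreamAsync())
                using (var target = File.Create(path))
                {
                    // The token is observed while copying, so a large file can stop midway.
                    await source.CopyToAsync(target, 81920, token);
                }
            }

            return true;
        }
        catch (OperationCanceledException)
        {
            // The streams are disposed by now, so a partial file can be removed safely.
            if (File.Exists(path))
            {
                File.Delete(path);
            }

            return false;
        }
    }
}
```

Under this design each tracked thread would own a CancellationTokenSource, and removing the thread (or turning off SaveHTML) would call Cancel() on it before any folder is renamed.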

Unable to scrape 4chan /b/ board

I have a load of threads being scraped, but none of the threads from /b/ appear to be scraping. The threads themselves appear in the list, but FileCount never updates and stays at zero, and the designated folder stays empty.

8kun makes GChan crash and it marks them as gone

I got this in the logs

[29/12/2020 00:07:30] - AppDomain_UnhandledException - FormatException - The input string is not in the correct format.
   in System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal)
   in System.Number.ParseInt64(String value, NumberStyles options, NumberFormatInfo numfmt)
   in GChan.Trackers.Thread_8Kun.GetImageLinks()
   in GChan.Trackers.Thread_8Kun.Download()
   in GChan.Trackers.Thread.Download(Object callback)
   in System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   in System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   in System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   in System.Threading.ThreadPoolWorkQueue.Dispatch()

GChan doesn't download any image, it marks everything as gone.
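The trace shows a FormatException raised from Number.ParseInt64 inside Thread_8Kun.GetImageLinks, which suggests a long.Parse-style call received a value that was not a plain integer. A defensive sketch, assuming the IDs arrive as strings (the class and method names below are hypothetical and the real parsing code is not shown in this report):

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper; illustrates TryParse instead of Parse.
public static class SafeParsing
{
    /// <summary>Returns only the values that parse as 64-bit integers, skipping the rest.</summary>
    public static List<long> ParseIds(IEnumerable<string> rawValues)
    {
        var ids = new List<long>();

        foreach (var raw in rawValues)
        {
            // TryParse returns false for null, empty, or non-numeric input instead of throwing.
            if (long.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out var id))
            {
                ids.Add(id);
            }
        }

        return ids;
    }
}
```

Skipping a malformed entry keeps one bad value from taking down the whole thread download, although the skipped values should probably be logged so the underlying format change can be found.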

Improve performance by removing JSON -> XML conversion.

This is a headache from YChan (which got forked into GChan by me).
The 4chan API returns JSON, which is then converted into XML so that XPath can be used to query it.

XPath example:
image

Apparently there is a JSON alternative in Newtonsoft, which we already have as a dependency: https://www.newtonsoft.com/json/help/html/QueryJsonSelectTokenJsonPath.htm

This will improve performance and also make the code less convoluted.

NOTE: This is less complicated for board searching and thread imagelink searching, the html page scraping is a bit more tricky. Try the prior options first.
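As a rough illustration of the JSONPath route, the sketch below queries thread JSON directly with Newtonsoft's SelectTokens. The helper name ThreadJson.GetImageFileNames is hypothetical, and the field names (posts, tim, ext) are assumed from the public 4chan API documentation rather than taken from GChan's code.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

// Hypothetical helper; queries the JSON directly with no XML conversion step.
public static class ThreadJson
{
    /// <summary>Returns "{tim}{ext}" for every post that carries an attachment.</summary>
    public static List<string> GetImageFileNames(string threadJson)
    {
        var root = JObject.Parse(threadJson);
        var names = new List<string>();

        // "$.posts[*]" selects every element of the top-level "posts" array.
        foreach (var post in root.SelectTokens("$.posts[*]"))
        {
            var tim = post["tim"];
            var ext = post["ext"];

            // Posts without an attachment have neither field.
            if (tim == null || ext == null)
            {
                continue;
            }

            names.Add((long)tim + (string)ext);
        }

        return names;
    }
}
```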

Migrate data persistence to EFCore

This application saves data the user inputs like threads/boards to scrape across application closes/opens.

Originally it was done in a .txt file, very nasty.

Sometime in mid 2021 I rewrote that to store the data in an SQLite database in a new class, DataController.cs. This is okay, but the implementation is a bit shit and there's no easy way to do migrations (changes to the database schema), which I would like to have for the future.

I want to stick with SQLite but use EFCore over the top, it gives us migration ease "for free" and makes working with the database easier. https://learn.microsoft.com/en-us/ef/core/get-started/overview/first-app?tabs=netcore-cli

Need to decide if we try putting the existing Board/Thread classes into the DB as is, or if we have separate classes for database storage and map them back and forth. Needs discussion.
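To make the "separate classes" option concrete, here is a minimal sketch of what an EF Core context over SQLite might look like. ThreadRecord, GChanContext, the column choices and the gchan.db file name are all hypothetical; the sketch assumes the Microsoft.EntityFrameworkCore.Sqlite package is referenced.

```csharp
using Microsoft.EntityFrameworkCore;

// Hypothetical storage-only entity, kept separate from the UI-facing Thread class.
public class ThreadRecord
{
    public int Id { get; set; }          // Primary key by EF Core convention.
    public string Url { get; set; }
    public string Subject { get; set; }
    public int FileCount { get; set; }
}

// Hypothetical context; the database file name is an assumption.
public class GChanContext : DbContext
{
    public DbSet<ThreadRecord> Threads { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlite("Data Source=gchan.db");
}
```

With a context like this, schema changes are captured with "dotnet ef migrations add <Name>" and applied with "dotnet ef database update", as described in the linked tutorial.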

Refactor CreateNewTracker to improve efficiency when loading data from DB

In GChan/Utils.cs, CreateNewTracker(LoadedData data) makes a tracker from the URL stored in the DB and then sets its properties one by one. This looks nasty and also spams NotifyPropertyChanged events to the UI, one per property. It is likely hurting startup time badly when the user has a lot of threads saved.

There is a TODO in the code:

// TODO: Should be making trackers based on the LoadedData (pass loadeddata to constructor).
// Rather than making them and then loading them with more data.
// This would help app-startup ui responsiveness as it would reduce the notify property changed spam greatly.

Will require some refactoring of how trackers get made, not much.

  • Make LoadedData (the base class) abstract.
  • Check the type of the loaded data in CreateNewTracker to see if it's a thread or a board.
  • Maybe need an overload for each case that makes a thread/board tracker without checking if it's a thread or board.
  • Pass the loaded data into the constructor of the thread/board.
  • Set the properties instantly in the constructor instead of one by one outside the ctor.
  • Clean up remaining code/comments.
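The steps above could be sketched as follows. Every type and member name here except LoadedData and CreateNewTracker is hypothetical, and the real trackers carry far more state; the point is only that values are assigned once, inside the constructor, before anything is bound to the UI.

```csharp
using System;

// Hypothetical shapes for the loaded data; only LoadedData is named in the issue.
public abstract class LoadedData
{
    public string Url { get; set; }
}

public sealed class LoadedThreadData : LoadedData
{
    public string Subject { get; set; }
    public int FileCount { get; set; }
}

public sealed class LoadedBoardData : LoadedData
{
    public long GreatestThreadId { get; set; }
}

public abstract class Tracker
{
    public string Url { get; }

    protected Tracker(LoadedData data)
    {
        Url = data.Url;
    }
}

public sealed class ThreadTracker : Tracker
{
    public string Subject { get; }
    public int FileCount { get; }

    // Everything is set here in one go, so no property-changed events fire per field.
    public ThreadTracker(LoadedThreadData data) : base(data)
    {
        Subject = data.Subject;
        FileCount = data.FileCount;
    }
}

public sealed class BoardTracker : Tracker
{
    public long GreatestThreadId { get; }

    public BoardTracker(LoadedBoardData data) : base(data)
    {
        GreatestThreadId = data.GreatestThreadId;
    }
}

public static class TrackerFactory
{
    // Dispatches on the runtime type of the loaded data.
    public static Tracker CreateNewTracker(LoadedData data) => data switch
    {
        LoadedThreadData thread => new ThreadTracker(thread),
        LoadedBoardData board => new BoardTracker(board),
        null => throw new ArgumentNullException(nameof(data)),
        _ => throw new ArgumentException("Unknown loaded data type.", nameof(data))
    };
}
```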

Tidy up/improve code.

  • Use var where possible.
  • Use using on disposable objects where possible so resources are released promptly, e.g. WebClient occurrences.
  • Remove redundant branches.
  • Fix incorrect naming conventions.
  • Add xdoc comments for methods/properties where needed.
  • GetThreadSubject() is terrible and is copied in multiple places, see if it can be improved.

Change directory select option to use full version dialog

At the moment the directory select setting dialog uses this stupid little window:
image

I hate this version of the windows directory select dialog, I much prefer the "full" one, you know the one that looks like this:
image

I did it once before in my other project SorterExpress, here's a link to the source code of the method that opened it: https://github.com/Issung/SorterExpress/blob/develop/src/SorterExpress/Utilities.cs#L171

From memory it required installing a NuGet package because for some reason it's a direct Windows API reference or something.

Fix ImageLink GenerateNewFilename()

GenerateNewFilename in ImageLink.cs looks a bit scuffed: it has a switch block assigning to result, then right below it a big chunk of if/elses assigning to result as well, then some commented-out code. Looks like the switch is missing a case too? Update this to use the newer, preferred C# switch expression with pattern matching, which is much shorter and easier to read.
https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/switch-expression

Example of switch expression being used in GChan here:

var idCodeMatch = board.SiteName switch

and here
var destinationDirectory = folderNameFormat switch
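For reference, a filename generator written as a single switch expression might look like the sketch below. The enum ImageFileNameFormat, its members and the separator style are hypothetical stand-ins, not the formats GChan actually defines.

```csharp
using System;

// Hypothetical set of formats; GChan's real enum may differ.
public enum ImageFileNameFormat
{
    IdOnly,
    OriginalNameOnly,
    IdAndOriginalName,
    OriginalNameAndId
}

public static class FilenameExample
{
    public static string GenerateFilename(ImageFileNameFormat format, long id, string originalName, string extension)
        => format switch
        {
            ImageFileNameFormat.IdOnly => $"{id}{extension}",
            ImageFileNameFormat.OriginalNameOnly => $"{originalName}{extension}",
            ImageFileNameFormat.IdAndOriginalName => $"{id} - {originalName}{extension}",
            ImageFileNameFormat.OriginalNameAndId => $"{originalName} - {id}{extension}",
            // An explicit default arm surfaces a missing case immediately.
            _ => throw new ArgumentOutOfRangeException(nameof(format), format, "Unknown filename format.")
        };
}
```

Every arm yields the result directly, so there is no result variable to assign twice.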

Enhancement request / feature propositions, and a minor bug.

I am not good with programming, so I can't judge how complex the following ideas would be to implement. However, I have some ideas that I think could improve the functionality and usability of your software:

  1. Have a small box (or another "cell") to the left of each thread, filled with Green for "thread is alive" and Yellow/Red if a thread has 404'd (I guess this is easy to check since 4chan redirects you when a thread has 404'd) or has been archived (I guess one could check whether the word "Archived" is present in the relevant element of the .html file).
    Why: So you can tell when a thread is ready to be deleted from the list, instead of checking thread in a web browser.

  2. An option to "append" new posts to the html file, instead of "rewriting" the html file for each new post. I think this is a good addition to the program because some users delete their posts, and when the html file is "rewritten" after the post is deleted their post is gone. This can cause some confusion when reading the locally saved thread.
    In short: Persistence of deleted posts.

Also, I noticed a minor bug:
If you rename a thread that has any title, such as "no subject", to "Cool thread" and click OK, it is saved.
If you then rename "Cool thread" to "Magic thread" but click "cancel", it does not stay as "Cool thread"; it reverts to "no subject" or whatever the thread's original name from 4chan was.
In short: clicking cancel in the "rename thread" dialog box reverts the thread to its original name from 4chan instead of keeping the last name you gave it.
Instinctively, to revert the thread name to the original, I would delete the name I gave the thread (leaving the text box empty) and click OK. Cancel would simply mean "no change to thread name, close dialog box".
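The behaviour requested above can be written down as a small pure function. This is only a sketch of the rule, with hypothetical names (RenameLogic, ResolveSubject); it assumes the dialog reports a cancel as null and an OK as the text-box contents.

```csharp
// Hypothetical helper capturing the requested rename rules.
public static class RenameLogic
{
    /// <param name="currentSubject">The name shown right now (possibly user-given).</param>
    /// <param name="originalSubject">The subject as it came from 4chan.</param>
    /// <param name="userInput">Text-box contents on OK, or null if the dialog was cancelled.</param>
    public static string ResolveSubject(string currentSubject, string originalSubject, string userInput)
    {
        // Cancel: change nothing.
        if (userInput == null)
        {
            return currentSubject;
        }

        // OK with an empty box: revert to the original 4chan subject.
        if (string.IsNullOrWhiteSpace(userInput))
        {
            return originalSubject;
        }

        // OK with text: use the new name.
        return userInput.Trim();
    }
}
```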

I want to say again that you have made a very good program, and I use it a lot! Thank you for making this.

Save post number in ImageLink, for ease of use in GenerateNewFilename

Couple of things:

  • Give the Tim property some documentation; explain what it is based on the documentation here: https://github.com/4chan/4chan-API/blob/master/pages/Threads.md
  • 1 location still uses the old constructor; remove the old constructor and update the calling location to use the new one.
  • Update the now only constructor to take no (the post number) as a parameter, and save it in a new public property (give the new property some good documentation too).
  • Rename GenerateNewFilename method to GenerateFilename.
  • Rename URL property to Url (don't forget to update the ToString too).
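After those changes the class might have roughly the shape below. It is named ImageLinkSketch to make clear it is an illustration, not the real ImageLink; the property descriptions paraphrase the public 4chan API documentation for the tim and no fields, and the filename rule shown is a placeholder.

```csharp
// Illustrative shape only; the real ImageLink has more members.
public class ImageLinkSketch
{
    /// <summary>
    /// API field "tim": Unix timestamp plus microtime of when the image was uploaded.
    /// 4chan also uses it as the stored file name on its image server.
    /// </summary>
    public long Tim { get; }

    /// <summary>API field "no": the numeric ID of the post the image belongs to.</summary>
    public long No { get; }

    /// <summary>Full URL the image is downloaded from.</summary>
    public string Url { get; }

    public ImageLinkSketch(long tim, long no, string url)
    {
        Tim = tim;
        No = no;
        Url = url;
    }

    /// <summary>Placeholder rule: post number followed by the upload timestamp.</summary>
    public string GenerateFilename(string extension) => $"{No} - {Tim}{extension}";

    public override string ToString() => $"ImageLink {{ Tim: {Tim}, No: {No}, Url: '{Url}' }}";
}
```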

Create test project and some basic tests

Nowadays writing code without tests makes me nervous.
Tests are a great way to:

  • Define the contract of what your code should do.
  • Ensure you don't break anything.
  • Drive your development (in certain cases).
  • Make intended behaviour clearer to peers.

My preferred testing framework is xunit.

  • You will need to make a new project using this short guide: https://xunit.net/docs/getting-started/netfx/visual-studio.
  • Call the project GChan.Test.
  • Add the original GChan project as a dependency.
  • Add some basic tests for the ImageLink class, test that generating with each different format (and an unexpected format) gives the expected output.
  • Look for other classes that can be easily tested.
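A first xunit test file for the filename logic could be sketched like this. FileNamer and NameFormat are hypothetical stand-ins included only so the example is self-contained; in GChan.Test the tests would target the real ImageLink class instead.

```csharp
using System;
using Xunit;

// Hypothetical stand-in for the code under test.
public enum NameFormat { IdOnly, OriginalNameOnly }

public static class FileNamer
{
    public static string Generate(NameFormat format, long id, string originalName, string extension)
        => format switch
        {
            NameFormat.IdOnly => $"{id}{extension}",
            NameFormat.OriginalNameOnly => $"{originalName}{extension}",
            _ => throw new ArgumentOutOfRangeException(nameof(format))
        };
}

public class FileNamerTests
{
    // One theory row per supported format.
    [Theory]
    [InlineData(NameFormat.IdOnly, "123.jpg")]
    [InlineData(NameFormat.OriginalNameOnly, "cat.jpg")]
    public void Generate_KnownFormat_ReturnsExpectedName(NameFormat format, string expected)
    {
        Assert.Equal(expected, FileNamer.Generate(format, 123, "cat", ".jpg"));
    }

    // The "unexpected format" case from the list above.
    [Fact]
    public void Generate_UnknownFormat_Throws()
    {
        Assert.Throws<ArgumentOutOfRangeException>(
            () => FileNamer.Generate((NameFormat)999, 123, "cat", ".jpg"));
    }
}
```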

CICD:

  • If doing this after making the GitHub Actions pipeline, add testing of this project to the pipeline.
  • If doing this before that issue, update the other issue to say "add unit testing to pipeline".

URL list not saving on closing and thread list info not updating until getting clicked and scrolling issue

Title is self-explanatory: the URLs aren't saved when GChan exits with the appropriate option ticked.
Also, the thread list doesn't update itself until I click each thread. So if I add several threads separated by commas, they are all added to the queue but all stay at FileCount 0 until I click them, and then they reload.

Edit: Something more I was forgetting about, when the queue is long enough there is no scrollbar and it can't be scrolled with the mouse wheel either. It can be scrolled with the keyboard arrows though.

First time using this app so sorry if these are known problems, and thanks, it works great.

Application freezes and needs to be force closed when adding multiple boards

Issue:

GChan will freeze, become unresponsive, stop downloading, and need to be closed manually using Task Manager when adding multiple boards from 4chan. The next launch carries over the saved URLs if "Save URLs on exit" is enabled, so the only temporary fix is clearing the saved files.

Steps to reproduce

Add a board, e.g. /w/. It will work and start scraping. Add a second board, e.g. /wg/, and GChan will freeze and not respond to any further commands, so it has to be force closed manually with Task Manager. GChan will then fail to open again if "Save URLs on exit" is enabled.

Fix

Delete "boards.dat" and "threads.dat" from the ProgramData folder. This restores normal use until the next attempt to add multiple boards.

System:
Edition Windows 10 Enterprise
Version 20H2
OS build 19042.928

Clearing list won't rename folders.

Not sure if it is intentional, but clearing the list with the "clear" button does not rename folders according to the options set.
Maybe that is intentional, but I thought it was odd, since the alternative is right-clicking and choosing "remove" for each item.
Right-clicking and choosing remove does rename folders according to options.

Great program though, good work!
