page source code (html-agility-pack issue, closed)

 avatar commented on June 16, 2024
page source code

from html-agility-pack.

Comments (19)

elgonzo commented on June 16, 2024

Disclaimer: I am just a user of HAP and not associated with the project nor its maintainers


Can you tell me how this is possible?

Because what you see in the web browser includes page elements dynamically created by JavaScript or WASM running in your web browser.
What your web browser presents as "source code" is not necessarily the page code as provided by the web server. It is actually an HTML rendering of the page content (the DOM of the page), which can include parts of the original HTML page delivered by the web server as well as page elements dynamically created by JavaScript or WASM.

The reason why you see a page not found error page for https://www.aboveandbeyond.nu/radio/abgt484 is simply because the web server has no page to serve for this URL. If you enter that URL in your web browser or follow the respective link in your report above, you will see this error page as well. So, why don't you see the error page when clicking on the link with this URL on the aboveandbeyond site? Again the answer is Javascript. The click on the link will be intercepted by some Javascript click handler and the resulting page dynamically created in your web browser -- there is actually no request with the URL https://www.aboveandbeyond.nu/radio/abgt484 being sent to the web server...

Familiarize yourself with the web developer tools of the web browser you are using. The web developer tools of web browsers like Chrome and Firefox include a monitor for the network traffic a page is causing (both requests and responses). This also allows you to inspect the data these requests and responses contain, thus enabling you to tell what parts of the page are created how with which data and where that data is coming from.


As for HAP: HAP is only an HTML parser. It is not a browser engine, and thus does not feature a JavaScript engine backed by a DOM. So when you process the HTML page from the web server in HAP, you only get the original page as delivered by the web server, without any dynamically created page content.
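A minimal sketch of that point, using a made-up page rather than the real site: the `<div>` is empty in the markup and would only be filled in by the script if a browser executed it. HAP parses the markup and never runs the script, so the `<div>` stays empty. The class name, the HTML and the element id are illustrative assumptions.

```csharp
using HtmlAgilityPack;

public static class StaticParseDemo
{
    // Made-up page: the <div> is empty in the markup; only a browser
    // that executes the <script> would ever put content into it.
    public const string Html =
        "<html><body><div id=\"tracks\"></div>" +
        "<script>document.getElementById('tracks').innerHTML = '<p>Track One</p>';</script>" +
        "</body></html>";

    // Returns the inner HTML of the <div> as HAP sees it after parsing.
    public static string TracksInnerHtml()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.GetElementbyId("tracks").InnerHtml;
    }
}
```

Calling `StaticParseDemo.TracksInnerHtml()` gives back an empty string, because the script is treated as plain text and is never executed.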


If you need to capture dynamically created page content, you will need to enlist the help of web browsers or browser engines that can execute JavaScript and possibly WASM. It's possible to drive standalone web browsers for this using something like Selenium WebDriver. If you don't want to or cannot rely on separate standalone web browsers, there are also browser-based automatable UI controls such as Microsoft's own Edge-based WebView2, or entire embeddable browser engines such as CEF (Chromium Embedded Framework). The latter can be used in .NET with wrappers like CefSharp or CefGlue, but that is an enormous topic of its own, and HAP's issue tracker would not be the right place for it.

JonathanMagnan commented on June 16, 2024

Therefore, to materialize the HTML with the track list, you have to trigger/call this click event handler somehow. I mean, how would you do this with HtmlWeb.LoadFromBrowser?

LoadFromBrowser really does use a web browser, so technically you can do everything. I would never recommend this solution over Selenium WebDriver, but here is an example that clicks the Download button on our page and loads the source of the download page instead of the home page:

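// Requires the .NET Framework build of HtmlAgilityPack and a reference to System.Windows.Forms.
using System.Threading;
using System.Windows.Forms;

var url = "https://html-agility-pack.net/";

var web1 = new HtmlAgilityPack.HtmlWeb();
var isLinkClicked = false;

var doc1 = web1.LoadFromBrowser(url, o =>
{
	var webBrowser = (WebBrowser)o;

	if (!isLinkClicked)
	{
		HtmlElementCollection links = webBrowser.Document.GetElementsByTagName("a");

		foreach (HtmlElement link in links)
		{
			if (link.InnerText == "  Download   ")
			{
				link.InvokeMember("click");
				isLinkClicked = true;
				Thread.Sleep(2000); // Give the browser some time to start navigating
				break; // Stop the loop after clicking the right element
			}
		}
	}

	// `title` was not defined in the original snippet. ASSUMPTION: it is the text of the
	// first <h1> on the page; adjust the lookup to whichever element you are waiting for.
	HtmlElementCollection headings = webBrowser.Document.GetElementsByTagName("h1");
	var title = headings.Count > 0 ? headings[0].InnerText : null;

	// WAIT until the link has been clicked and an element you want on the new page is now accessible
	return isLinkClicked && title == "Download HTML Agility Pack \r\nHTML parser to read/write DOM\r\n ";
});

// HAP has loaded the source of the `/download/` page, not the home page
var text = doc1.DocumentNode.InnerHtml;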

Off topic, but Above and Beyond was amazing when I saw them a few months ago in Quebec ;)

 avatar commented on June 16, 2024 1

Off topic, but Above and Beyond was amazing when I saw them a few months ago in Quebec ;)

Above & Beyond, Gabriel & Dresden and Stoneface & Terminal are the MASTERS of trance. Also their remixes are pure gold!

JonathanMagnan commented on June 16, 2024

In addition to @elgonzo's answer:

It's somewhat possible to get dynamic text through HAP if you still use the .NET Framework (it requires a `WebBrowser`): https://html-agility-pack.net/from-browser

However, I highly recommend Selenium. Here is a short tutorial about Selenium WebDriver. Learning it will allow you to push everything further, such as if you need something that logs automatically into a website and grabs some text after.

Best Regards,

Jon

elgonzo commented on June 16, 2024

TL;DR: Just don't bother with HTML parsing in this case. Json ftw!

Since i was bored (boredom be praised!), i took a look at what data in which form the page gets and processes, using the network monitor of the web development tools of my web browser. Doing that often gets you a lot of insight into how a web site operates, which in turn can (and in this case does) make your life much easier.

Basically, all the dynamic page content is generated based on json data fetched from the web server. And it should be much easier and less cumbersome to process the json data using System.Text.Json (or Newtonsoft.Json, if you must) compared to trying to extract the same information from some dynamically created HTML structures...

The list of the shows is fetched from https://www.aboveandbeyond.nu/api/abgt/abgt.json. Each show entry in this json features an "identifier". The identifier value is being used to create the URL for the json data with the track list of the respective show based on the URL template https://www.aboveandbeyond.nu/api/abgt/<identifier>.json. So, if the show identifier is for example abgt473, the URL with the track list json would be https://www.aboveandbeyond.nu/api/abgt/abgt473.json.
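As a rough sketch of the URL template described above, the following builds the track list URLs from the identifiers in the show list. The URL template and the "identifier" property name come from this comment; the overall layout of the show list (a JSON array of objects) and the class and method names are assumptions made for illustration, so check the real data in the network monitor first.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class AbgtUrls
{
    // URL template as described in the comment above.
    private const string Template = "https://www.aboveandbeyond.nu/api/abgt/{0}.json";

    // Builds the track list URL for a single show identifier.
    public static string TrackListUrl(string identifier)
    {
        return string.Format(Template, Uri.EscapeDataString(identifier));
    }

    // ASSUMPTION: the show list is a JSON array of objects, each carrying
    // a string property named "identifier". Entries without one are skipped.
    public static List<string> TrackListUrls(string showListJson)
    {
        var urls = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(showListJson))
        {
            foreach (JsonElement show in doc.RootElement.EnumerateArray())
            {
                if (show.TryGetProperty("identifier", out JsonElement id)
                    && id.ValueKind == JsonValueKind.String)
                {
                    urls.Add(TrackListUrl(id.GetString()));
                }
            }
        }
        return urls;
    }
}
```

For example, `AbgtUrls.TrackListUrl("abgt473")` yields https://www.aboveandbeyond.nu/api/abgt/abgt473.json, matching the example above.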

(Be mindful if you intend to crawl the whole archive. The web server might decide to temporarily block your IP or some such if it has guards against unusually high numbers of requests coming from the same IP within very short time spans.)
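One way to stay on the polite side is to pause between requests. The sketch below takes the download function as a parameter so it does not depend on any particular HTTP client; the class name, the method name and whatever pause length you pass in are assumptions, since the server's actual limits are not known.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PoliteFetcher
{
    // Fetches the URLs one after another, waiting `pause` between
    // consecutive requests. `fetch` is whatever performs the download.
    public static async Task<List<string>> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        TimeSpan pause)
    {
        var bodies = new List<string>();
        bool first = true;
        foreach (string url in urls)
        {
            if (!first)
            {
                await Task.Delay(pause); // no delay before the very first request
            }
            first = false;
            bodies.Add(await fetch(url));
        }
        return bodies;
    }
}
```

With an `HttpClient` named `client`, a call could look like `await PoliteFetcher.FetchAllAsync(urls, u => client.GetStringAsync(u), TimeSpan.FromSeconds(2))`. The two seconds are an arbitrary guess, not a value the site documents.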

 avatar commented on June 16, 2024

@elgonzo : thanks so much! I was just trying but using the example from https://html-agility-pack.net/from-browser , I get this error even in .net framework 4.8:
Error CS1061 'HtmlWeb' does not contain a definition for 'LoadFromBrowser' and no accessible extension method 'LoadFromBrowser' accepting a first argument of type 'HtmlWeb' could be found (are you missing a using directive or an assembly reference?)

I will wait between reading different JSON files.

THANKS SO MUCH!!!

elgonzo commented on June 16, 2024

Can you give me an example for reading the JSON in .NET framework?

Depends on which .NET version you are targeting. If you are using .NET 5 or newer, use System.Text.Json (which is included in modern .NET versions). If you are still on an old .NET Framework 4.x build target, i would strongly recommend migrating to .NET 8, unless there are road-blocks preventing you from doing that. Otherwise, if your build target is .NET Framework 4.6 or newer, you can still use System.Text.Json, but you need to add it to your project as a nuget package (https://www.nuget.org/packages/System.Text.Json/8.0.3#supportedframeworks-body-tab).

If your build target is older than .NET Framework 4.6, run for the hills ;-P The only practical choice for a Json processor/serializer on such old .NET frameworks is Json.NET (which is also often called Newtonsoft.Json). You can of course also use Json.NET/Newtonsoft.Json when targeting newer .NET versions, but why would you...? ;-P

Regardless of whether you use STJ or Json.NET/Newtonsoft.Json, you have two approaches available to consume the json data:

  1. Create model classes with properties that match the logical data structures and data types of the values in the fetched json exactly. The json (de)serializer is then able to create instances of these model classes automatically, filled with the values from the json data. Easy peasy; the only thing you need to be careful about is that your model classes really match the json data layout and value types exactly.

  2. Consume data from the json documents without deserializing the json into instances of model classes. Both STJ and Json.NET allow you to do that relatively easily by providing types that represent the data structure and values in a json document directly.

(There is also a third approach: reading the json token stream sequentially, which requires you to track nested data structures in the json data yourself. But that doesn't make any sense for what you are trying to do, so just ignore that third option as impractical.)
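A minimal sketch of the first two approaches with System.Text.Json. The sample json and the model classes are made up for illustration; the real track list json will almost certainly be laid out differently, so the property names here ("identifier", "tracks", "artist", "title") are assumptions to be replaced after looking at the actual data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical model classes matching the made-up sample json below.
public class Track
{
    public string Artist { get; set; }
    public string Title { get; set; }
}

public class Show
{
    public string Identifier { get; set; }
    public List<Track> Tracks { get; set; }
}

public static class JsonApproaches
{
    // Made-up sample data, NOT the real layout served by the site.
    public const string SampleJson =
        "{\"identifier\":\"abgt473\",\"tracks\":[{\"artist\":\"Artist A\",\"title\":\"Track One\"}]}";

    // Approach 1: deserialize into model classes.
    public static Show WithModelClasses(string json)
    {
        // Lets the lower-case json names match the PascalCase properties.
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<Show>(json, options);
    }

    // Approach 2: walk the document directly, without model classes.
    public static List<string> WithJsonDocument(string json)
    {
        var titles = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement track in doc.RootElement.GetProperty("tracks").EnumerateArray())
            {
                titles.Add(track.GetProperty("title").GetString());
            }
        }
        return titles;
    }
}
```

The first variant fails quietly (properties stay null) when the classes do not match the json, while the second throws as soon as a property it asks for is missing, which is worth keeping in mind when choosing between them.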

There should be tutorials about either approach with STJ or Json.NET/Newtonsoft.Json out there -- the world is at your fingertips. If you can't get a handle on how to do that correctly, i'd suggest seeking help on sites like stackoverflow.com. In case you are unfamiliar with stackoverflow.com: Please make a research effort before asking a question, pay attention to how you ask, and format the question in a readable manner that presents the information pertinent to your problem. The people helping there are unpaid volunteers, and they often unceremoniously close questions that give the impression the asker wants to delegate their work and has not put effort into either the work or the question. To be clear, i am not trying to insinuate that you do, but i want to emphasize that your chances of getting a good answer there often depend to a high degree on how the question appears to the reader: how you ask, how you format the question, and whether you stick to their rules.

elgonzo commented on June 16, 2024

I was just trying but using the example from https://html-agility-pack.net/from-browser , I get this error even in .net framework 4.8: Error CS1061 'HtmlWeb' does not contain a definition for 'LoadFromBrowser' and no accessible extension method 'LoadFromBrowser' accepting a first argument of type 'HtmlWeb' could be found (are you missing a using directive or an assembly reference?)

Regardless of that error, i wouldn't know how HtmlWeb.LoadFromBrowser could help you in your particular case. Remember, the dynamic HTML with the track list is created by a JavaScript click event handler. Therefore, to materialize the HTML with the track list, you have to trigger/call this click event handler somehow. I mean, how would you do this with HtmlWeb.LoadFromBrowser?

 avatar commented on June 16, 2024

I'm a 12 year old girl beginning programmer. I have no idea how to read the JSON.
Can you maybe do it for me so I can see it as an example? I use .NET Framework 4.8 and have installed System.Text.Json
For this specific: https://www.aboveandbeyond.nu/api/abgt/abgt473.json

HtmlWeb.LoadFromBrowser was just in the example the other person gave me.

elgonzo commented on June 16, 2024

My apologies, but i am not going to tutor you here, as this is the issue tracker for HAP and therefore not the right place for that. My comments re json were just meant to point out a more direct way to get the data without needing to jump through the hoops of capturing dynamically created HTML and parsing this HTML for information that is readily available in a different format (json).

I suggested that you seek out tutorials and/or ask question(s) on stackoverflow.com, and gave you pointers regarding the general approaches available with both STJ and Json.NET/Newtonsoft.Json, so you can do a rough initial calibration of which approach feels better to you and therefore what kind of tutorials to look for. It's entirely up to you what you take from my suggestions...

 avatar commented on June 16, 2024

Thanks, I will do that!

  • Do you know how I can find out how much time to wait between reading different JSON files on the server so I don't get blocked?

  • can you tell me where in the source code of the page https://www.aboveandbeyond.nu/abgt you found it's using JSON?
    I try to understand it myself. Where do I need to look? (I use firefox webbrowser)

 avatar commented on June 16, 2024

For some reason LoadFromBrowser() method is not found here.
I can only choose following methods on a HtmlWeb variable.
What could be wrong? I use .NET Framework 4.8 and latest html agility pack

image

JonathanMagnan commented on June 16, 2024

Hello @trance-babe ,

Could you provide the reference path of your HtmlAgilityPack dll such as: C:\Repos\HtmlAgilityPack\packages\HtmlAgilityPack.1.11.60\lib\Net45\HtmlAgilityPack.dll

One reason you might not see it is that your reference points to a .NET Standard build of the library instead of the Net45 build, as mine does.

Best Regards,

Jon

 avatar commented on June 16, 2024

Could you provide the reference path of your HtmlAgilityPack dll such as: C:\Repos\HtmlAgilityPack\packages\HtmlAgilityPack.1.11.60\lib\Net45\HtmlAgilityPack.dll

I don't know anything about it. I just installed HtmlAgilityPack using nuget.
What to do to find this path?

JonathanMagnan commented on June 16, 2024

Here is an example:

image

 avatar commented on June 16, 2024

my path: F:\VISUAL STUDIO PROJECT\Lost In Trance\packages\HtmlAgilityPack.CssSelectors.1.0.2\lib\net45\

I also have HtmlAgilityPack.CssSelectors
maybe I need to remove this?

JonathanMagnan commented on June 16, 2024

The path you provided is for the HtmlAgilityPack.CssSelectors library, not for HtmlAgilityPack itself.

Could you show the path of HtmlAgilityPack itself?

 avatar commented on June 16, 2024

This IS the correct path. Both have this same path. I don't know why.
I can make a screenshot if you don't believe me.

JonathanMagnan commented on June 16, 2024

In this case, you probably don't have the latest version of HtmlAgilityPack. You probably have a reference to v1.4.9, which doesn't contain this feature yet.

You must add the latest version of the https://www.nuget.org/packages/HtmlAgilityPack/ package
