
systemguuh / html-deep-text-extractor


From a URL, retrieve the text snippet contained at the deepest level of the HTML structure of its content.

License: GNU General Public License v3.0

Java 100.00%


HTML Deep Text Extractor

Objective

Retrieve the text snippet contained at the deepest level of the HTML structure from a given URL.

For example:

http://hiring.axreng.com/internship/example1.html

<html>
	<head>
		<title>
			This is the title.
		</title>
	</head>
<body>
	This is the body.
</body>
</html>
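
As an illustration of the objective, the sketch below finds the deepest text in one pass by tracking the current nesting depth. It is not the repository's DeepText code: the class name DeepTextSketch and the method deepestText are hypothetical, and the sketch assumes each line holds exactly one opening tag, one closing tag, or one piece of text.

```java
import java.util.List;

// Hypothetical sketch; not the repository's DeepText class.
class DeepTextSketch {

    // Returns the first text line found at the greatest nesting depth,
    // or null when the input contains no text at all.
    // Assumption: every line is one opening tag, one closing tag, or text.
    static String deepestText(List<String> lines) {
        int depth = 0;        // number of currently open tags
        int maxDepth = -1;    // depth of the deepest text seen so far
        String deepest = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("</")) {
                depth--;                 // closing tag: leave one level
            } else if (line.startsWith("<")) {
                depth++;                 // opening tag: enter one level
            } else if (depth > maxDepth) {
                maxDepth = depth;        // text deeper than anything seen before
                deepest = line;
            }
        }
        return deepest;
    }

    public static void main(String[] args) {
        List<String> page = List.of(
                "<html>", "<head>", "<title>", "This is the title.",
                "</title>", "</head>", "<body>", "This is the body.",
                "</body>", "</html>");
        System.out.println(deepestText(page)); // prints "This is the title."
    }
}
```

For the example page, the title text sits three tags deep and the body text two tags deep, so the title text is returned. When several texts share the greatest depth, this sketch keeps the first one; that tie-breaking rule is an assumption.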

Execution instructions

  • Compile the file using: 'javac HtmlAnalyzer.java'
  • Run with 'java HtmlAnalyzer insert-url-here'

If you want to check the project in its current state:

About the complexity of the program:

  • Best-case complexity: O(n/2), which is still O(n) because Big-O notation drops constant factors
  • Worst-case complexity: O(n)

About the functions:

  • Conn: function to connect and return the HTML of a page
  • DeepText: function to search for the deepest text
  • HtmlAnalyzer: main structure of the code
  • htmlTag: encapsulation of tags
  • Validation: validation of the HTML structure
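
The role described for Conn (connect and return the HTML of a page) could look roughly like the sketch below. The names ConnSketch, readLines, and fetch are hypothetical and not taken from the repository, and UTF-8 is an assumed encoding. Reading is split from connecting so the reading step can be exercised without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; not the repository's Conn class.
class ConnSketch {

    // Reads every line from the given source into a list.
    static List<String> readLines(Reader source) {
        return new BufferedReader(source).lines().collect(Collectors.toList());
    }

    // Opens the URL and returns the page content line by line.
    // Assumption: the page is encoded as UTF-8.
    static List<String> fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return readLines(new InputStreamReader(in, StandardCharsets.UTF_8));
        }
    }
}
```

An unreachable host surfaces here as an IOException from fetch, which a caller could translate into the connection error listed below.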

Thrown errors:

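  • Malformed: if the HTML has an opening tag without a matching closing tag, or a closing tag without a matching opening tag
  • Connection error: if the URL cannot be reached (for example, no internet connection)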
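
The malformed check can be pictured as a stack of open tags. The sketch below is a minimal illustration, not the repository's Validation code: the names ValidationSketch and isWellFormed are hypothetical, and it assumes one tag or one text per line and tags without attributes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; not the repository's Validation class.
class ValidationSketch {

    // Returns true when every opening tag is closed by a closing tag of the
    // same name, in the reverse order of opening.
    // Assumption: one tag or one text per line, and tags carry no attributes.
    static boolean isWellFormed(List<String> lines) {
        Deque<String> open = new ArrayDeque<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("</") && line.endsWith(">")) {
                String name = line.substring(2, line.length() - 1);
                // A closing tag must match the most recently opened tag.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (line.startsWith("<") && line.endsWith(">")) {
                open.push(line.substring(1, line.length() - 1));
            }
            // Anything else is treated as text and ignored.
        }
        // Tags left on the stack were never closed.
        return open.isEmpty();
    }
}
```

A caller could report the malformed error whenever isWellFormed returns false.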

Possible improvements or next steps:

  • Throw errors for invalid inputs, incorrect HTML, or input/output errors
  • Handle HTML structures that contain no text, and therefore have no deepest text
  • It might be possible to improve the worst-case complexity with a hashmap
  • Create a dynamic array for each tag and its content
  • Regex for self-closing tags
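
For the self-closing tag idea, one possible pattern is sketched below. It is an assumption about how such a regex could look, not code from the repository; SelfClosingSketch and isSelfClosing are hypothetical names. It only recognizes tags written with a trailing slash, such as `<br/>`, so void elements written as `<br>` are not detected, and an attribute value containing `>` would defeat it.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed improvement.
class SelfClosingSketch {

    // "<", a tag name, optional attributes, then "/" and ">".
    private static final Pattern SELF_CLOSING =
            Pattern.compile("<\\s*([A-Za-z][A-Za-z0-9]*)\\b[^>]*/\\s*>");

    // Returns true when the whole (trimmed) line is a single self-closing tag.
    static boolean isSelfClosing(String line) {
        return SELF_CLOSING.matcher(line.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSelfClosing("<br/>"));   // prints "true"
        System.out.println(isSelfClosing("<div>"));   // prints "false"
    }
}
```

A depth-tracking parser could skip lines for which isSelfClosing returns true, so they neither open nor close a nesting level.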
