hypertext's Introduction

Hypertext

A PHP HTML-to-pure-text transformer that gracefully handles varied and malformed HTML.


Hypertext excels at pulling text content out of any HTML-based document, and it automatically:

  • Removes CSS
  • Removes scripts
  • Removes headers
  • Removes non-HTML-based content
  • Preserves spacing
  • Preserves links (optional)
  • Preserves new lines (optional)

It is intended for producing output to use in LLM-related tasks, such as prompts and embeddings.

Installation

composer require stevebauman/hypertext

Usage

use Stevebauman\Hypertext\Transformer;

$transformer = new Transformer();

// (Optional) Filter out specific elements by their XPath.
$transformer->filter("//*[@id='some-element']");

// (Optional) Retain new line characters.
$transformer->keepNewLines();

// (Optional) Retain anchor tags and their href attribute.
$transformer->keepLinks();

$text = $transformer->toText($html);

Example

For larger examples, please view the tests/Fixtures directory.

Input:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <p>This is a paragraph of text on my webpage.</p>
    <a href="https://blog.com/posts">Click here</a> to view my posts.
</body>
</html>

Output (Pure Text):

echo (new Transformer)->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. Click here to view my posts.

Output (Keep New Lines):

echo (new Transformer)->keepNewLines()->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
Click here to view my posts.

Output (Keep Links):

echo (new Transformer)->keepLinks()->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. <a href="https://blog.com/posts">Click here</a> to view my posts.

Output (Keep Both):

echo (new Transformer)
    ->keepLinks()
    ->keepNewLines()
    ->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
<a href="https://blog.com/posts">Click here</a> to view my posts.

hypertext's People

Contributors

stevebauman, mrk-j, peterfox


hypertext's Issues

[Question/bug?] Transformed HTML-to-text still includes "&amp;"

Hi @stevebauman, thanks for developing this. I'm trying it out on a new project. The transformer does mostly what I'd like it to do, but although it decodes some HTML entities, such as &nbsp; and &ldquo;, it leaves &amp; behind. Is there a reason this one is excluded?

example string: <p>Here's some &nbsp;text that is a bit &ldquo;rough &amp; ready&rdquo;</p>
output: Here's some text that is a bit “rough &amp; ready”

I think this is probably related to using HTMLPurifier, but since it seems the goal is to get to plain text, I'm wondering if maybe an extra step is needed in the transformation pipeline.

[To clarify: I'm using this in the context of preparing text for a Meilisearch index, within a Laravel app.]
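The "extra step" suggested above could be sketched as a post-processing pass over the transformer's output using PHP's built-in html_entity_decode. This is only an illustration of a possible workaround, not the library's own behavior or an official fix: the helper name decodeRemainingEntities is hypothetical, and the input string is the output quoted in this issue.

```php
<?php

// Hypothetical helper, NOT part of stevebauman/hypertext: decodes any HTML
// entities that are still present in already-transformed text.
function decodeRemainingEntities(string $text): string
{
    // ENT_QUOTES also decodes quote entities such as &quot; and &#039;.
    // ENT_HTML5 selects the HTML5 named-entity table.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// The output reported in this issue, with "&amp;" left behind.
$output = "Here's some text that is a bit “rough &amp; ready”";

echo decodeRemainingEntities($output), PHP_EOL;
// Prints: Here's some text that is a bit “rough & ready”
```

Note that decoding turns entities such as &lt; back into literal characters. That is usually what a search index wants, but the result should be escaped again before it is rendered as HTML.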

php8.3 support

The ezyang/htmlpurifier dependency needs to be at version 4.17.0 or later to support PHP 8.3.

ezyang/htmlpurifier v4.16.0 requires PHP ~5.6.0 || ~7.0.0 || ~7.1.0 || ~7.2.0 || ~7.3.0 || ~7.4.0 || ~8.0.0 || ~8.1.0 || ~8.2.0 -> your php version (8.3.3) does not satisfy that requirement.
