html2article's Introduction

Html2Article

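An efficient tool for extracting the article body from Html on the .NET platform.
Body extraction uses a text-density-based algorithm and supports extracting the body from compressed (minified) Html documents. The average extraction time is about 30 ms per page, with an accuracy above 95%.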

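Html2Article Features

  • Tag-independent: body extraction does not rely on specific tags;
  • Supports extracting body content from compressed html documents;
  • Supports outputting the original body with its tags;
  • The core algorithm is simple and efficient, with an average extraction time of about 30 ms.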

Add Html body extraction to your project

  • PM> Install-Package Html2Article
  • Import the namespace: using StanSoft;
  • Add the following code:
// html is the html text you want to extract from
string html = "<html>....</html>";
// The Article object has four properties: Title, PublishDate, Content (body text) and ContentWithTags (body with tags)
Article article = Html2Article.GetArticle(html);

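The Html2Article class

  • Html2Article is the core class for body extraction
  • Html2Article configuration
    • AppendMode: whether to use body append mode. Defaults to false; setting it to true adds more qualifying text to the body.
    • Depth: the analysis depth. Defaults to 5; increase it for pages with large gaps between lines.
    • LimitCount: the character threshold. Once the amount of analyzed text reaches this threshold, the analyzer treats it as the start of the body. Defaults to 180 characters.
    • GetArticle(string html): gets an Article from Html text.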
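
The Depth and LimitCount settings suggest a scan over windows of consecutive text lines. The following is a minimal sketch of that general idea, not the library's actual implementation: the class DensitySketch, its methods, and the exact windowing rule are assumptions made purely for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical illustration of a text-density scan; not part of Html2Article.
public static class DensitySketch
{
    // Strips tags from each line and returns the remaining visible text per line.
    public static string[] ToTextLines(string html)
    {
        return html.Split('\n')
                   .Select(line => Regex.Replace(line, "<[^>]*>", "").Trim())
                   .ToArray();
    }

    // Returns the index of the first non-empty line where a window of `depth`
    // consecutive lines holds at least `limitCount` characters of text,
    // or -1 if no window reaches the threshold.
    public static int FindBodyStart(string[] textLines, int depth, int limitCount)
    {
        for (int start = 0; start < textLines.Length; start++)
        {
            // Skip empty lines so the body never starts on a blank line.
            if (textLines[start].Length == 0) continue;

            int end = Math.Min(start + depth, textLines.Length);
            int chars = 0;
            for (int i = start; i < end; i++)
            {
                chars += textLines[i].Length;
            }
            if (chars >= limitCount) return start;
        }
        return -1;
    }
}
```

In this sketch, short navigation lines contribute few characters to a window, so the threshold is first reached where dense paragraphs begin. A larger depth lets a window span more blank or sparse lines before giving up on a candidate start.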

License

Apache 2.0

html2article's People

Contributors

stanzhai

html2article's Issues

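Image handling

I have written a similar algorithm in JS, but mine is more complex and relies on many checks based on tag semantics. Comparing the two, I found that Html2Article's code is indeed very concise and very efficient, but its handling of images is weak. Do you have any ideas in this area that we could discuss?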

Minor bug in the extraction result

The extraction result begins with a closing tag, which should not happen.
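
One possible caller-side workaround is to post-process the output. This is a minimal sketch, assuming the report refers to ContentWithTags and that the orphaned closing tags appear only at the very start of the string. TagCleanup and StripLeadingClosingTags are hypothetical names, not part of Html2Article.

```csharp
using System.Text.RegularExpressions;

// Hypothetical post-processing helper; not part of Html2Article.
public static class TagCleanup
{
    // Matches one or more closing tags (with optional whitespace) at the very start.
    private static readonly Regex LeadingClosingTags =
        new Regex(@"^\s*(?:</[^>]+>\s*)+", RegexOptions.Compiled);

    // Returns the content with any leading orphaned closing tags removed.
    // Content that does not start with a closing tag is returned unchanged.
    public static string StripLeadingClosingTags(string contentWithTags)
    {
        if (string.IsNullOrEmpty(contentWithTags)) return contentWithTags;
        return LeadingClosingTags.Replace(contentWithTags, "");
    }
}
```

This only trims the start of the string; it does not attempt to balance tags elsewhere in the content.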

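Does it support multithreading?

I am crawling with multiple threads and found a data.txt file in the root directory. When the crawl runs fast, the file gets locked.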
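
If the contention comes from concurrent calls within a single process, one possible workaround is to serialize the extraction calls behind a lock. This is a sketch under that assumption only: the cause of the data.txt locking is not confirmed here, and SerializedExtractor is a hypothetical wrapper, not part of Html2Article.

```csharp
using System;

// Hypothetical wrapper; not part of Html2Article.
public sealed class SerializedExtractor<TResult>
{
    private readonly object _gate = new object();
    private readonly Func<string, TResult> _extract;

    public SerializedExtractor(Func<string, TResult> extract)
    {
        if (extract == null) throw new ArgumentNullException("extract");
        _extract = extract;
    }

    // Runs the wrapped extraction while holding the lock, so only one
    // thread at a time executes it.
    public TResult Extract(string html)
    {
        lock (_gate)
        {
            return _extract(html);
        }
    }
}
```

A shared instance could be created once as `new SerializedExtractor<Article>(Html2Article.GetArticle)` and used from every crawler thread. Downloads still run in parallel; only the extraction step is serialized, which trades some throughput for safety.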

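Thanks for open-sourcing this, some small feedback

Thank you very much for open-sourcing this. I tested a few web pages and found some minor issues:
A comment is wrong, and I suggest also matching H2 and H3 for the title.
Also, compressed pages can still contain line breaks.
The content of H1 sometimes cannot be set as the title.
I made a small adjustment: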

    /// <summary>
    /// Gets the page title.
    /// </summary>
    /// <param name="html">Page source</param>
    /// <returns>The cleaned-up title</returns>
    private static string GetTitle(string html)
    {
        // Requires: using System.Text.RegularExpressions;
        string titleFilter = @"<title>[\s\S]*?</title>";
        string clearFilter = @"<.*?>";

        string title = "";
        Match match = Regex.Match(html, titleFilter, RegexOptions.IgnoreCase);
        if (match.Success)
        {
            title = Regex.Replace(match.Groups[0].Value, clearFilter, "");
        }

        // The article title is usually in an h1, h2 or h3 and is cleaner than the <title> text
        for (int i = 1; i < 4; i++)
        {
            string headingFilter = @"<h" + i + @".*?>[\s\S]*?</h" + i + ">";
            MatchCollection mcs = Regex.Matches(html, headingFilter, RegexOptions.IgnoreCase);
            // Only consider a heading level that occurs exactly once on the page
            if (mcs.Count == 1)
            {
                string h = Regex.Replace(mcs[0].Groups[0].Value, clearFilter, "").Trim();
                // Accept the heading only if it is non-empty and contained in the current title
                if (!String.IsNullOrEmpty(h) && title.Trim().Contains(h))
                {
                    title = h;
                }
            }
        }

        return title;
    }
