budoux's Issues

Clarification on the unicode used on the keys

Hello, I'm porting budoux to Zig (and C).
I wonder what exactly the Unicode format of the keys in the model is. Are they normalized?
When constructing a key, should it be built from a range of code points or from graphemes?
I'm asking because currently all the tests pass except for the Thai model, and this might be the issue I'm hitting.
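
To make the distinction in the question concrete, here is a minimal Python sketch (not taken from the budoux sources) of how the same Thai text splits differently by code point and by a rough grapheme approximation, and how NFC normalization can change a string. The helper names `code_points` and `rough_graphemes` are hypothetical, and the grapheme logic is a simplification rather than full UAX #29 segmentation. Which unit the model keys actually use should be checked against the reference implementation.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the Unicode code points of text (what iterating a Python str yields)."""
    return list(text)


def rough_graphemes(text: str) -> list[str]:
    """Rough grapheme approximation: attach combining marks to the preceding character.

    This is NOT full UAX #29 segmentation; it only illustrates the difference.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Thai syllable: THO THAHAN (U+0E17) + SARA II (U+0E35) + MAI EK (U+0E48).
thai = "\u0e17\u0e35\u0e48"
print(len(code_points(thai)))      # 3 code points
print(len(rough_graphemes(thai)))  # 1 user-perceived character

# Hiragana KA + combining voiced sound mark composes to GA under NFC.
decomposed = "\u304b\u3099"
print(unicodedata.is_normalized("NFC", decomposed))          # False
print(unicodedata.normalize("NFC", decomposed) == "\u304c")  # True
```

If a port builds keys from graphemes while the model was trained on code points (or the reverse), lookups for scripts with many combining marks, such as Thai, would miss, while Japanese and Chinese would be mostly unaffected.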

[quality] "のみ"

input: ここから先はチケットを購入されたお客様のみ入ることができます。
actual: ここから/先は/チケットを/購入された/お客様の/み入る/ことができます。
expected: ここから/先は/チケットを/購入された/お客様のみ/入ることが/できます。

input: 基本ギアパワーのみ有効だ
actual: 基本ギアパワーの/み有効だ
expected: 基本ギアパワーのみ/有効だ

[java] `HTMLProcessor.getText()` collapses whitespaces

HTMLProcessor.getText() calls:

    return Jsoup.parseBodyFragment(html).text();

It looks like this collapses whitespaces. Example:

    String html = " H      e ";
    String result = HTMLProcessor.getText(html);

The result becomes "H e": consecutive spaces are collapsed into one space, and leading and trailing spaces are removed.
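
For comparison, the whitespace-preserving behavior can be sketched in Python with the standard library's `html.parser`. This is only an illustration of the expected output, not a proposed patch for the Java code; `TextCollector` and `get_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes verbatim, without collapsing whitespace."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Text is appended exactly as it appears in the source.
        self.parts.append(data)


def get_text(html: str) -> str:
    """Return the concatenated text content of an HTML fragment."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered trailing text
    return "".join(collector.parts)


print(repr(get_text(" H      e ")))  # ' H      e '
```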

/cc @tushuhei

[quality] あなたの意図したとおりに情報を伝えることができます。

Input: あなたの意図したとおりに情報を伝えることができます。
Actual:
あなたの/意図したと/おりに/情報を/伝える/ことができます。
Expected:
あなたの/意図したとおりに/情報を/伝える/ことができます。
あなたの/意図した/とおりに/情報を/伝える/ことが/できます。

Found in Blink web_tests.

[Java] Java version emits close tag for self-closing tags

Input: <img>abcdef
Expected: <img>abcdef
Actual: <img></img>abcdef

Unlike Python's HTMLParser, the Java version uses Jsoup.parseBodyFragment, which implements the HTML parsing algorithm, so we don't have to worry about issues like #355.

But the serialization code should be aware of self-closing tags.
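
A serializer that is aware of such tags can be sketched as follows. This is a language-neutral illustration written in Python, not the Java fix itself: the `Element` type and `serialize` function are hypothetical, attributes are omitted for brevity, and the set lists the void elements from the HTML standard (the issue calls them self-closing tags).

```python
from dataclasses import dataclass, field
from html import escape
from typing import Union

# Void elements per the HTML standard: they never have an end tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


@dataclass
class Element:
    """Hypothetical minimal element node (attributes omitted)."""
    tag: str
    children: list["Node"] = field(default_factory=list)


Node = Union[Element, str]


def serialize(node: Node) -> str:
    """Serialize a node, emitting no end tag for void elements."""
    if isinstance(node, str):
        return escape(node, quote=False)
    if node.tag in VOID_ELEMENTS:
        return f"<{node.tag}>"
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


print("".join(serialize(n) for n in [Element("img"), "abcdef"]))  # <img>abcdef
```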

[quality] いよいよ

Input: いよいよはじまる
Expected: いよいよ/はじまる
Actual: いよいよは/じまる

Unopened HTML tag causes exception in budoux 0.6

With budoux 0.6.0

budoux --html "foo</p>"

Traceback (most recent call last):
  File "/home/johnc/.local/bin/budoux", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/main.py", line 187, in main
    print(_main(test))
          ^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/main.py", line 171, in _main
    res = parser.translate_html_string(inputs_html)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/parser.py", line 102, in translate_html_string
    return resolve(chunks, html)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/html_processor.py", line 124, in resolve
    resolver.feed(html)
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
        ^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 413, in parse_endtag
    self.handle_endtag(elem)
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/html_processor.py", line 84, in handle_endtag
    self.to_skip = self.element_stack.get_nowait()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/queue.py", line 199, in get_nowait
    return self.get(block=False)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/queue.py", line 168, in get
    raise Empty
_queue.Empty
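
The traceback shows `get_nowait()` being called on an empty stack when an end tag has no matching start tag. One possible guard is sketched below. It is a simplified stand-in for the resolver in html_processor.py, not the actual budoux code: the class `TolerantResolver` and its `output` attribute are assumptions, and only the `element_stack` and `to_skip` names come from the traceback.

```python
import queue
from html.parser import HTMLParser


class TolerantResolver(HTMLParser):
    """Echoes the HTML back while tracking a per-element "skip" flag on a stack."""

    def __init__(self) -> None:
        super().__init__()
        self.element_stack = queue.LifoQueue()
        self.to_skip = False
        self.output = ""

    def handle_starttag(self, tag, attrs):
        # Remember the state of the enclosing element before entering this one.
        self.element_stack.put(self.to_skip)
        self.output += self.get_starttag_text() or f"<{tag}>"

    def handle_endtag(self, tag):
        self.output += f"</{tag}>"
        try:
            self.to_skip = self.element_stack.get_nowait()
        except queue.Empty:
            # Stray end tag with no matching start tag: keep the current state
            # instead of letting queue.Empty propagate to the caller.
            pass

    def handle_data(self, data):
        self.output += data


def resolve(html: str) -> str:
    resolver = TolerantResolver()
    resolver.feed(html)
    resolver.close()
    return resolver.output


print(resolve("foo</p>"))  # foo</p>
```

Whether a stray end tag should be echoed, dropped, or reported is a design choice; the sketch echoes it so that the output stays as close to the input as possible.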

Issue with custom model

Description

Hi there,
First, thanks for the lib. The results are impressive for such a small footprint 😄

The results were not exactly what I wanted for Japanese tokenization, so I decided to train my own model, which was quite simple and straightforward. Sadly, after importing the generated model in JavaScript, it doesn't work.

import { Parser, loadDefaultJapaneseParser } from 'budoux'
import model from './mymodel.json'

// obviously the following works
const parser = loadDefaultJapaneseParser()
console.log(parser.parse('今日は天気です。'))

// but this doesn't
const customParser = new Parser(model)
console.log(customParser.parse('今日は天気です。'))

Uncaught TypeError: this.model.values is not a function or its return value is not iterable
at Parser.parse (parser.js:120:47)

`mypy` error does not stop GitHub Actions

As shown in https://github.com/google/budoux/runs/6555048469, our "Style Check" GitHub Action is not interrupted even if mypy finds an error. Any mypy error should fail the workflow so that contributors are alerted to the type error.

Run sasanquaneuf/mypy-github-action@a0c442aa252655d7736ce6696e06227ccdd62870
/opt/hostedtoolcache/Python/3.9.12/x64/bin/mypy .
build/lib/budoux/__init__.py: error: Duplicate module named "budoux" (also at
"./budoux/__init__.py")
build/lib/budoux/__init__.py: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#mapping-file-paths-to-modules for more info
build/lib/budoux/__init__.py: note: Common resolutions include: a) using `--exclude` to avoid checking one of them, b) adding `__init__.py` somewhere, c) using `--explicit-package-bases` or adjusting MYPYPATH
Found 1 error in 1 file (errors prevented further checking)

Consider using DocumentFragment

Have you considered building a DocumentFragment directly instead of returning strings from your parser?
This would allow you to avoid unnecessary serialization/parsing/sanitization steps.

Something like this:

// in dom.ts
function parseFromString(html: string): Node {
  const template = document.createElement('template');
  template.innerHTML = html;
  return template.content;
}

// parser.ts
function translateHTMLString(html: string): Node {
  if (html === '') return new DocumentFragment();
  const fragment = parseFromString(html);
  if (Parser.hasChildTextNode(fragment)) {
    const wrapper = document.createElement('span');
    wrapper.append(...fragment.childNodes);
    fragment.appendChild(wrapper);
  }
  this.applyElement(fragment.firstChild as HTMLElement);
  return fragment;
}

// in budoux-base.ts
sync() {
  const translated = this.parser.translateHTMLString(this.innerHTML);
  this.shadow.textContent = '';
  this.shadow.appendChild(translated);
}

You can even avoid having to parse anything at all if you clone the existing nodes instead of grabbing this.innerHTML.

// in budoux-base.ts
sync() {
  let translated: HTMLElement;
  if (Parser.hasChildTextNode(this)) {
    translated = document.createElement('span');
    // childNodes is a NodeList, which has no map(); convert it to an array first.
    translated.append(...Array.from(this.childNodes, node => node.cloneNode(true)));
  } else {
    translated = this.firstElementChild!.cloneNode(true) as HTMLElement;
  }

  this.parser.applyElement(translated);

  this.shadow.textContent = '';
  this.shadow.appendChild(translated);
}

This is also likely to be more performant than parsing and serializing the tree multiple times.

Originally posted by @engelsdamien in google/safevalues#256 (comment)

禁則処理 (kinsoku shori, Japanese line-breaking rules)

I've noticed a few rare examples where "禁則処理" (kinsoku shori, the Japanese line-breaking rules) is not applied properly in BudouX. I suspect this is because such cases are not covered by the training data.

I used the latest main branch https://github.com/google/budoux/tree/cb21dadb92fce3ee21157457539e061d9f04d99a

Input:
Adobe Illustratorとは?デザイン・レイアウトの決定版
Actual:
Adobe Illustratorとは/?デザイン・レイアウトの/決定版
Expected:
Adobe Illustratorとは?/デザイン・レイアウトの/決定版

Input:
『バクマン。』に影響を受け、漫画家を目指す
Actual:
『バクマン。/』に/影響を/受け、/漫画家を/目指す
Expected:
『バクマン。』に/影響を/受け、/漫画家を/目指す

Input:
[動画で見る]魚眼レンズで撮影したような動画を作成する方法
Actual:
[動画で/見る/]魚眼レンズで/撮影したような/動画を/作成する/方法
Expected:
[動画で/見る]/魚眼レンズで/撮影したような/動画を/作成する/方法

Input:
電子サインの法的な効力は?
Actual:
電子サインの/法的な/効力は/?
Expected:
電子サインの/法的な/効力は?
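
Independently of retraining, a rule-based post-processing step could enforce the simplest of these rules. The sketch below is an assumption about one possible approach, not something BudouX implements: it moves leading characters that must not start a line onto the previous chunk. `NO_LINE_START` is a small illustrative subset, and `merge_prohibited_line_starts` is a hypothetical name.

```python
# Characters that must not start a line (small illustrative subset).
NO_LINE_START = frozenset("?!？！、。」』）]")


def merge_prohibited_line_starts(chunks: list[str]) -> list[str]:
    """Move leading line-start-prohibited characters onto the previous chunk."""
    merged: list[str] = []
    for chunk in chunks:
        while merged and chunk and chunk[0] in NO_LINE_START:
            merged[-1] += chunk[0]
            chunk = chunk[1:]
        if chunk:  # drop chunks that became empty after moving characters
            merged.append(chunk)
    return merged


print(merge_prohibited_line_starts(["電子サインの", "法的な", "効力は", "?"]))
# ['電子サインの', '法的な', '効力は?']
```

This only guarantees that no chunk after the first starts with a prohibited character. For the 『バクマン。』 example it would yield 『バクマン。』/に rather than the expected 『バクマン。』に, so it does not replace better training data.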

ZWSP / WBR insertion causes unintended space trimming on line breaks

When a ZWSP or WBR element appears at the end of a line in the source HTML, the space that the line break should introduce may be removed. The behavior may vary by browser. A possible solution on the BudouX side is to avoid inserting a separator right before \n.

Demo: https://codepen.io/tushuhei/pen/GRbraYN

HTML:

<p>
  これは
  <b>テスト</b>
  です。
</p>
<p class="zwsp" style="word-break: keep-all; overflow-wrap: anywhere;">
  これは&ZeroWidthSpace;
  <b>テスト</b>
  です。&ZeroWidthSpace;
</p>
<p class="wbr" style="word-break: keep-all; overflow-wrap: anywhere;">
  これは<wbr>
  <b>テスト</b>
  です。<wbr>
</p>
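
The proposed rule (no separator right before \n) can be sketched on plain strings as follows. This is a simplified illustration under the assumption that the text has already been split into chunks and that a chunk following a source line break starts with "\n"; `join_chunks` is a hypothetical helper, not a BudouX API.

```python
ZWSP = "\u200b"


def join_chunks(chunks: list[str], separator: str = ZWSP) -> str:
    """Join chunks with a separator, except right before a newline."""
    out: list[str] = []
    for i, chunk in enumerate(chunks):
        # Skip the separator when the next character would be "\n", so the
        # whitespace produced by the source line break is left untouched.
        if i > 0 and not chunk.startswith("\n"):
            out.append(separator)
        out.append(chunk)
    return "".join(out)


print(repr(join_chunks(["これは", "\n  テスト", "です。"])))
# 'これは\n  テスト\u200bです。'
```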

Deduplicate separators when processing HTML

Demo
https://codepen.io/tushuhei/pen/VwqMywj

Setup

<p>xyz<wbr>abcabc</p>

const parser = new HTMLProcessingParser({
  UW4: {a: 1001}, // means "should separate right before 'a'".
});
const paragraph = document.querySelector('p');
parser.applyElement(paragraph);
console.log(paragraph.innerHTML);

Expected

<p>xyz<wbr>abc<wbr>abc</p>

Actual

<p>xyz<wbr><wbr>abc<wbr>abc</p>

We may want to remove duplicated separators in case we need to apply BudouX to the same element multiple times (e.g. Web Components that reuse their Light DOM).

@kojiishi Could you take a look at what changes should be applied to html_processor.ts?
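
As a rough illustration of the deduplication idea, the following Python sketch collapses runs of adjacent <wbr> separators in a serialized string. It is a string-level stand-in under stated assumptions, not the DOM-level change that html_processor.ts would need, and `dedupe_wbr` is a hypothetical name.

```python
import re

# Two or more directly adjacent <wbr> tags (optionally written as <wbr/>).
_DUPLICATE_WBR = re.compile(r"(?:<wbr\s*/?>){2,}", re.IGNORECASE)


def dedupe_wbr(html: str) -> str:
    """Collapse runs of adjacent <wbr> separators into a single one."""
    return _DUPLICATE_WBR.sub("<wbr>", html)


print(dedupe_wbr("<p>xyz<wbr><wbr>abc<wbr>abc</p>"))
# <p>xyz<wbr>abc<wbr>abc</p>
```

A DOM-based fix could instead check whether the node preceding the insertion point is already a separator and skip the insertion in that case.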

[quality] あのイーハトーヴォのすきとおった風、夏でも底に冷たさをもつ青いそら、うつくしい森で飾られたモリーオ市、郊外のぎらぎらひかる草の波。

Input:
あのイーハトーヴォのすきとおった風、夏でも底に冷たさをもつ青いそら、うつくしい森で飾られたモリーオ市、郊外のぎらぎらひかる草の波。
Actual:
あの/イーハトーヴォの/すきと/おった/風、/夏でも/底に/冷たさを/もつ青い/そら、/うつくしい/森で/飾られた/モリーオ市、/郊外の/ぎらぎら/ひかる/草の/波。
Expected:
あの/イーハトーヴォの/すきとおった/風、/夏でも/底に/冷たさを/もつ/青い/そら、/うつくしい/森で/飾られた/モリーオ市、/郊外の/ぎらぎら/ひかる/草の/波。

Use overflow-wrap: anywhere; instead of overflow-wrap: break-word;

Motivation

If display: flex; is applied to the parent element, overflow-wrap: break-word; becomes ineffective, and the text overflows.

overflow-wrap: anywhere; resolves the issue. It's supported on modern web browsers.

https://developer.mozilla.org/en-US/docs/Web/CSS/overflow-wrap

Key points

The primary difference between overflow-wrap: anywhere; and overflow-wrap: break-word is reflected in the way soft wrap opportunities introduced by word-break are handled when calculating min-content intrinsic sizes.

https://developer.mozilla.org/en-US/docs/Web/CSS/overflow-wrap

anywhere
To prevent overflow, an otherwise unbreakable string of characters — like a long word or URL — may be broken at any point if there are no otherwise-acceptable break points in the line. No hyphenation character is inserted at the break point. Soft wrap opportunities introduced by the word break are considered when calculating min-content intrinsic sizes.

break-word
The same as the anywhere value, with normally unbreakable words allowed to be broken at arbitrary points if there are no otherwise acceptable break points in the line, but soft wrap opportunities introduced by the word break are NOT considered when calculating min-content intrinsic sizes.

Steps to reproduce

Go to the example https://codepen.io/tamanyan/pen/KKyyxMj

<div class="flex">
  <div>
    <h3>5. word-break: keep-all; + overflow-wrap: break-word; + &lt;wbr&gt;  + flex</h3>
    <p class="keep-all-break-word box">
      グレートブリテン<wbr>および<wbr>北アイルランド連合王国という<wbr>言葉は<wbr>本当に<wbr>長い言葉<wbr>ですね
    </p>
  </div>
</div>

.flex {
  display: flex;
}

.keep-all-break-word {
  word-break: keep-all;
  overflow-wrap: break-word;
}

.box {
  padding: 10px;
  border: 1px solid;
  font-size: 40px;
}

Proposal

Use overflow-wrap: anywhere; instead of overflow-wrap: break-word;

Discussion: does it have any negative impact for SEO?

I have a blog site focused on interviews, and budoux is a great tool for its readability. However, I would like to know whether it has any negative impact on SEO when it is applied to each article body, because the text is divided into many parts by span tags. This may be a bit out of scope for budoux development, but users might face this issue when using it, so I am posting it here. I would like to hear your opinions.

Thanks.

`javascript/data/models/zh-hans` missing

npm test

results in:

> [email protected] test
> ts-node node_modules/jasmine/bin/jasmine tests/*.ts

src/parser.ts:20:36 - error TS2307: Cannot find module './data/models/zh-hans' or its corresponding type declarations.

20 import {model as zhHansModel} from './data/models/zh-hans';
                                      ~~~~~~~~~~~~~~~~~~~~~~~

[configuration] Usage on browser's web worker

The module is not usable in a web worker without significant patching. IMO, this can be solved in one of these ways:

  1. Let the user dynamically inject the needed window.
  2. Fall back to jsdom if window is undefined.
  3. Make the Parser class more independent from HTML/DOM, since I think the main functionality is still the "parse" function. I don't know how feasible this is, and I may be wrong here.

[quality] お問い合わせ​

input: お気軽にお問い合わせください。
actual: お気軽に​お問い​/合わせください。
expected: お気軽に​/​お問い合わせ​/ください。

Numbers become <a href="tel:"> links even if <meta name="format-detection" content="telephone=no"> is set on iOS

By setting <meta name="format-detection" content="telephone=no">, numbers such as phone numbers become text on iOS.

However, the "telephone=no" setting does not work for text in <budoux>, and numbers such as phone numbers become <a href="tel:">.

Is this an iOS issue or a budoux issue?

Sample:
https://codepen.io/rhsk/full/RwEVLBa

<h2>normal</h2>
<p>090-1234-5678</p>
<p>テキストテキストテキストテキストテキスト 090-1234-5678 テキストテキストテキストテキスト</p>

<h2>budoux</h2>
<p><budoux-ja>090-1234-5678</budoux-ja></p>
<p><budoux-ja>テキストテキストテキストテキストテキスト 090-1234-5678 テキストテキストテキストテキスト</budoux-ja></p>

[quality] まとめる

Input: 要点をまとめる必要がある。
Expected: 要点を/まとめる/必要が/ある。
Actual: 要点を/まと/める/必要が/ある。
