
Comments (4)

skjang54 commented on June 29, 2024

Background

  • It seems there is some non-Japanese or low-quality text (URLs, addresses, repetitions of meaningless nouns), for example:

920,"{""text"": ""DOMASTA 3-8-7 KITASHINJUKU SHINJUKU-KU TOKYO 〒169-0074 東京都新宿区北新宿3-8-7 ドマスタ内 TEL 050-5539-7630 FAX 03-3365-8250""}"
1029,"{""text"": ""[http://wiki.basercms.net/ver4/関数リファレンス/blogPosts](http://wiki.basercms.net/ver4/%E9%96%A2%E6%95%B0%E3%83%AA%E3%83%95%E3%82%A1%E3%83%AC%E3%83%B3%E3%82%B9/blogPosts)""}"
1526,"{""text"": ""セクシー | 鳥 | かわいい | アート | ROCK | 自然 | 車 | 三角 | ロック | アメカジ | プロレス | シュール | 音楽 | 花 | パロディ | クライミング | パンダ | シンプル | 宇宙""}"

We also require that 80% of words in a document contain at least one alphabetic character, and apply a "stop word" filter, to remove documents that do not contain at least two of the following English words: the, be, to, of, and, that, have, with; this adequately deals with ostensibly English documents that contain no coherent English text.

  • We might adapt their method to Japanese so that we can reduce low-quality data
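The quoted English heuristic could be prototyped roughly as follows. This is only a sketch, not the actual reference implementation; the function name, tokenization, and parameter names are illustrative:

```python
import re

# Stop words from the quoted heuristic above.
ENGLISH_STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def is_low_quality_english(text: str,
                           min_alpha_word_ratio: float = 0.8,
                           min_stop_words: int = 2) -> bool:
    """Flag a document as low quality when fewer than 80% of its words
    contain an alphabetic character, or when fewer than two of the
    English stop words appear in it (sketch only)."""
    words = text.split()
    if not words:
        return True
    # Fraction of words containing at least one alphabetic character.
    alpha_words = sum(any(c.isalpha() for c in w) for w in words)
    if alpha_words / len(words) < min_alpha_word_ratio:
        return True
    # Count how many distinct stop words occur in the document.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & ENGLISH_STOP_WORDS) < min_stop_words
```

A document of pure numbers or keyword spam fails the alphabetic-ratio or stop-word check, while ordinary English prose passes.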

TODOs

  • remove texts that do not contain at least N of the following frequent patterns (stopwords, postpositions): "は", "を", "に", "と", "の", "て", "へ", "です", "だ"
  • test on sample data and find an appropriate hyperparameter
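The first TODO could be sketched as a removal predicate like the one below, where N is the hyperparameter to tune on sample data in the second TODO. The name and default value are illustrative, not the actual dps code:

```python
# Frequent Japanese stopwords / postpositions from the TODO above.
JAPANESE_STOP_PATTERNS = ["は", "を", "に", "と", "の", "て", "へ", "です", "だ"]

def should_remove(text: str, n: int = 2) -> bool:
    """Remove `text` when fewer than `n` of the frequent
    stopword/postposition patterns appear in it (sketch only)."""
    hits = sum(pat in text for pat in JAPANESE_STOP_PATTERNS)
    return hits < n
```

For instance, the keyword-list sample above ("セクシー | 鳥 | かわいい | ...") contains none of the patterns and would be removed, while a normal Japanese sentence would be kept.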

from dps.

fujiki-1emon commented on June 29, 2024

Thanks!
I think you coded it this way (with >) to test and see what kinds of sentences get filtered out.
Either freq or cnt is fine, but I thought the comparison should have been flipped (to <=, i.e. less than or equal to).

And given Kevin's opinion that we'd rather not filter out too many texts, I think it's enough if there is at least one such frequent char.
We can tune the parameter later, though. Thanks for your work.


fujiki-1emon commented on June 29, 2024

@skjang54 I wonder if we need to flip this comparison?

# return freq_char_ratio <= (
return freq_char_ratio > (  # return bad text
    sum(re.search(ch, text) is not None for ch in JAPANESE_FREQ_CHAR_LIST)
    / len(JAPANESE_FREQ_CHAR_LIST)
)

I think if a sentence has at least one of those frequent chars, we can keep it.
By contrast, if a sentence doesn't contain any of them, it is most likely not a coherent Japanese sentence.


skjang54 commented on June 29, 2024

@fujiki-1emon Yeah, you are right. I fixed it to the right comparison, thanks!
And I changed the threshold ratio to an int value.
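Putting the two fixes together (the flipped comparison and the integer threshold), the final check might look like the sketch below. The function name and default threshold are illustrative assumptions, not the exact dps code:

```python
import re

JAPANESE_FREQ_CHAR_LIST = ["は", "を", "に", "と", "の", "て", "へ", "です", "だ"]

def japanese_frequent_char_filter(text: str, min_freq_chars: int = 1) -> bool:
    """Return True (keep the text) when at least `min_freq_chars` of the
    frequent particles above occur in it; otherwise return False
    (filter it out). Sketch only."""
    matched = sum(
        re.search(ch, text) is not None for ch in JAPANESE_FREQ_CHAR_LIST
    )
    return matched >= min_freq_chars
```

With the default of 1, an ordinary Japanese sentence is kept while a bare keyword list with no particles is dropped, matching the intent discussed above.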

