
Comments (4)

skjang54 commented on June 29, 2024

Background

  • It seems there is some non-Japanese or low-quality text (URLs, addresses, repetitions of meaningless nouns), for example:

920,"{""text"": ""DOMASTA 3-8-7 KITASHINJUKU SHINJUKU-KU TOKYO 〒169-0074 東京都新宿区北新宿3-8-7 ドマスタ内 TEL 050-5539-7630 FAX 03-3365-8250""}"
1029,"{""text"": ""[http://wiki.basercms.net/ver4/関数リファレンス/blogPosts](http://wiki.basercms.net/ver4/%E9%96%A2%E6%95%B0%E3%83%AA%E3%83%95%E3%82%A1%E3%83%AC%E3%83%B3%E3%82%B9/blogPosts)""}"
1526,"{""text"": ""セクシー | 鳥 | かわいい | アート | ROCK | 自然 | 車 | 三角 | ロック | アメカジ | プロレス | シュール | 音楽 | 花 | パロディ | クライミング | パンダ | シンプル | 宇宙""}"

We also require that 80% of words in a document contain at least one alphabetic character, and apply a "stop word" filter, to remove documents that do not contain at least two of the following English words: the, be, to, of, and, that, have, with; this adequately deals with ostensibly English documents that contain no coherent English text.

  • We might adapt their method to Japanese so that we can reduce low-quality data
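The quoted English heuristic could be prototyped roughly as follows. This is only a sketch, not the actual reference implementation; the function name, tokenization, and parameter names are illustrative:

```python
import re

# Stop words from the quoted heuristic above.
ENGLISH_STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def is_low_quality_english(text: str,
                           min_alpha_word_ratio: float = 0.8,
                           min_stop_words: int = 2) -> bool:
    """Flag a document as low quality when fewer than 80% of its words
    contain an alphabetic character, or when fewer than two of the
    English stop words appear in it (sketch only)."""
    words = text.split()
    if not words:
        return True
    # Fraction of words containing at least one alphabetic character.
    alpha_words = sum(any(c.isalpha() for c in w) for w in words)
    if alpha_words / len(words) < min_alpha_word_ratio:
        return True
    # Count how many distinct stop words occur in the document.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & ENGLISH_STOP_WORDS) < min_stop_words
```

A document of pure numbers or keyword spam fails the alphabetic-ratio or stop-word check, while ordinary English prose passes.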

TODOs

  • remove texts that do not contain at least N of the following frequent patterns (stopwords, postpositions): "は", "を", "に", "と", "の", "て", "へ", "です", "だ"
  • test on sample data and find an appropriate hyperparameter
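The first TODO could be sketched as a removal predicate like the one below, where N is the hyperparameter to tune on sample data in the second TODO. The name and default value are illustrative, not the actual dps code:

```python
# Frequent Japanese stopwords / postpositions from the TODO above.
JAPANESE_STOP_PATTERNS = ["は", "を", "に", "と", "の", "て", "へ", "です", "だ"]

def should_remove(text: str, n: int = 2) -> bool:
    """Remove `text` when fewer than `n` of the frequent
    stopword/postposition patterns appear in it (sketch only)."""
    hits = sum(pat in text for pat in JAPANESE_STOP_PATTERNS)
    return hits < n
```

For instance, the keyword-list sample above ("セクシー | 鳥 | かわいい | ...") contains none of the patterns and would be removed, while a normal Japanese sentence would be kept.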

from dps.

fujiki-1emon commented on June 29, 2024

Thanks!
I think you coded it this way (with >) to test and see what kinds of sentences get filtered out.
Either freq or cnt is fine, but I thought the comparison should have been flipped (to <=, i.e. less than or equal to).

And given Kevin's opinion that we'd rather not filter out too many texts, I think it's enough if there is at least one such frequent char.
We can tune the parameter later, though. Thanks for your work.


fujiki-1emon commented on June 29, 2024

@skjang54 I wonder if we need to flip this comparison?

# return freq_char_ratio <= (
return freq_char_ratio > (  # return bad text
    sum(re.search(ch, text) is not None for ch in JAPANESE_FREQ_CHAR_LIST)
    / len(JAPANESE_FREQ_CHAR_LIST)
)

I think if a sentence has at least one of those frequent chars, we can keep it.
By contrast, if a sentence doesn't contain any of them, it is most likely not a coherent Japanese sentence.


skjang54 commented on June 29, 2024

@fujiki-1emon Yeah, you are right. I fixed it to the right comparison, thanks!
And I changed the threshold ratio to an int value.
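Putting the two fixes together (the flipped comparison and the integer threshold), the final check might look like the sketch below. The function name and default threshold are illustrative assumptions, not the exact dps code:

```python
import re

JAPANESE_FREQ_CHAR_LIST = ["は", "を", "に", "と", "の", "て", "へ", "です", "だ"]

def japanese_frequent_char_filter(text: str, min_freq_chars: int = 1) -> bool:
    """Return True (keep the text) when at least `min_freq_chars` of the
    frequent particles above occur in it; otherwise return False
    (filter it out). Sketch only."""
    matched = sum(
        re.search(ch, text) is not None for ch in JAPANESE_FREQ_CHAR_LIST
    )
    return matched >= min_freq_chars
```

With the default of 1, an ordinary Japanese sentence is kept while a bare keyword list with no particles is dropped, matching the intent discussed above.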

