How about adding some heuristic filters similar to MassiveText's Quality Filtering?

MassiveText Quality Filtering about dps HOT 3 CLOSED

jayseok-park commented on June 29, 2024

MassiveText Quality Filtering

from dps.

Comments (3)

Taekyoon commented on June 29, 2024

@jayseok-park Are there any benchmarks that were used to define these constraints?

jayseok-park commented on June 29, 2024

The paper reports validation loss for quality filtering at a smaller scale (1.4B), but there is no ablation for each individual constraint. I have no idea how they arrived at these numbers.

https://arxiv.org/abs/2112.11446

Or maybe, for now, it would be fine to add filters like these and adjust the numbers later by looking at the distribution of our own data during data exploration.
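
As a rough sketch of what "looking at the distribution" could mean for the document-length thresholds, the snippet below computes percentiles of whitespace-separated word counts over a sample of documents. The function name `words_per_doc_percentiles` and the chosen percentiles are hypothetical, not part of dps.

```python
import statistics
from typing import Dict, Iterable, Sequence


def words_per_doc_percentiles(
    docs: Iterable[str],
    percentiles: Sequence[int] = (1, 5, 50, 95, 99),
) -> Dict[int, float]:
    """Return selected percentiles of the per-document word count.

    Words are whitespace-separated tokens (an assumption; for Korean this
    counts space-delimited chunks rather than morphemes). Needs at least
    two documents.
    """
    counts = [len(doc.split()) for doc in docs]
    # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(counts, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percentiles}
```

The low and high percentiles could then serve as starting points for `min_doc_len` and `max_doc_len`, to be revisited once the filtered data has been inspected.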

Taekyoon commented on June 29, 2024

Reconsidering the filter functions

doc_len_filter: Filter any doc that does not contain between min_doc_len and max_doc_len words
- This function is necessary for filtering a text dataset
mean_word_len_filter: Filter any doc whose mean word length is outside the range of min_word_len to max_word_len characters
- Korean text can be understood even without spaces, so this may not be needed
symbol_to_word_ratio_filter: Filter any doc with a symbol-to-word ratio greater than symbol_to_word_ratio for either the hash symbol or the ellipsis
- Can follow the currently defined setting
bullet_ellipsis_filter: Filter any doc with more than bullet_point_ratio of lines starting with a bullet point, or more than ellipsis_ratio ending with an ellipsis
- Can follow the currently defined setting
alphabetic_word_ratio_filter: Filter any doc in which more than alphabetic_word_ratio of the words contain no alphabetic character
- Can follow the currently defined setting
least_k_essential_words_filter: Filter any doc that does not contain at least k of the following English word_list: the, be, to, of, and, that, have, with (language-specific words may be needed)
- This function needs discussion to pick the k words
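
A minimal sketch of how the six filters listed above might look in Python. The function and parameter names follow the list; everything else is an assumption rather than the dps implementation: the default thresholds are placeholders, the `BULLETS` tuple is hypothetical, words are split on whitespace, each function returns `True` when the doc should be kept, and `alphabetic_word_ratio` is read as the maximum allowed share of words without any alphabetic character.

```python
from typing import Sequence

BULLETS = ("-", "*", "•")  # hypothetical set of bullet markers
ENGLISH_WORDS = ("the", "be", "to", "of", "and", "that", "have", "with")


def doc_len_filter(text: str, min_doc_len: int = 50, max_doc_len: int = 100_000) -> bool:
    # Keep docs whose word count lies within [min_doc_len, max_doc_len].
    return min_doc_len <= len(text.split()) <= max_doc_len


def mean_word_len_filter(text: str, min_word_len: float = 3, max_word_len: float = 10) -> bool:
    words = text.split()
    if not words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return min_word_len <= mean_len <= max_word_len


def symbol_to_word_ratio_filter(text: str, symbol_to_word_ratio: float = 0.1) -> bool:
    words = text.split()
    if not words:
        return False
    hashes = text.count("#")
    ellipses = text.count("...") + text.count("…")
    # Drop the doc if either symbol exceeds the ratio.
    return max(hashes, ellipses) / len(words) <= symbol_to_word_ratio


def bullet_ellipsis_filter(text: str, bullet_point_ratio: float = 0.9, ellipsis_ratio: float = 0.3) -> bool:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    bullets = sum(ln.startswith(BULLETS) for ln in lines)
    ellipses = sum(ln.endswith(("...", "…")) for ln in lines)
    return (bullets / len(lines) <= bullet_point_ratio
            and ellipses / len(lines) <= ellipsis_ratio)


def alphabetic_word_ratio_filter(text: str, alphabetic_word_ratio: float = 0.2) -> bool:
    words = text.split()
    if not words:
        return False
    # str.isalpha() is Unicode-aware, so Hangul counts as alphabetic.
    non_alpha = sum(not any(ch.isalpha() for ch in w) for w in words)
    return non_alpha / len(words) <= alphabetic_word_ratio


def least_k_essential_words_filter(text: str, k: int = 2, word_list: Sequence[str] = ENGLISH_WORDS) -> bool:
    words = set(text.lower().split())
    return sum(w in words for w in word_list) >= k
```

Because each function takes its thresholds as parameters, the numbers can be tuned per language later, and `word_list` can be swapped for a Korean list once the k words are agreed on.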