sklearn-chinese-keyword-extractor's Introduction

Chinese Keyword Extractor

The model runs a brute-force search to find the most frequently repeated substrings in a string, and these substrings are treated as keywords. The frequency of each keyword is then calibrated against its parent strings.
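The idea can be illustrated with a minimal, self-contained sketch. It is not the library's implementation. The function names (`count_substrings`, `calibrate`, `top_keywords`) are hypothetical, and the calibration rule is an assumption: a "parent" is taken to be a counted substring one character longer that contains the candidate, and the most frequent parent's count is subtracted from the candidate's count. The sketch also ranks substrings over the whole corpus rather than per input string.

```python
from collections import Counter


def count_substrings(texts, min_len=2, max_len=10):
    """Brute force: count every substring of length min_len..max_len."""
    counts = Counter()
    for text in texts:
        for start in range(len(text)):
            longest = min(max_len, len(text) - start)
            for length in range(min_len, longest + 1):
                counts[text[start:start + length]] += 1
    return counts


def calibrate(counts):
    """Assumed rule: subtract the count of the most frequent parent.

    A parent is a counted substring exactly one character longer that
    contains the candidate as a prefix or a suffix.
    """
    max_parent = {}
    for parent, freq in counts.items():
        for child in (parent[1:], parent[:-1]):
            if child in counts:
                max_parent[child] = max(max_parent.get(child, 0), freq)
    return {s: freq - max_parent.get(s, 0) for s, freq in counts.items()}


def top_keywords(texts, n_keyword=5, min_len=2, max_len=10):
    """Rank substrings by calibrated frequency, longer strings first on ties."""
    calibrated = calibrate(count_substrings(texts, min_len, max_len))
    ranked = sorted(calibrated.items(), key=lambda kv: (-kv[1], -len(kv[0])))
    return [s for s, score in ranked[:n_keyword] if score > 0]


# Toy product names, made up for illustration only.
names = ["白蘭洗衣精", "白蘭洗衣粉", "白蘭洗碗精"]
print(top_keywords(names, n_keyword=3))
```

Under this rule, a fragment that only ever appears inside one longer repeated string ends up with a calibrated frequency of zero, so the longer string is preferred over its pieces.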

from __future__ import unicode_literals
from text.KeywordExtractor import KeywordExtractor

Import Data

The sample data are Chinese product names crawled from the TreeMall online shopping website.

data = [
    '缺圖<箱購>【熊寶貝】衣物柔軟精純淨溫和 3.2L x 4入組',
    '<超值7件組>【白蘭】超濃縮洗衣精1瓶+6 補充包(2.7Kg x1瓶+1.6Kg x6包)(蘆薈清淨)',
    '熊寶貝 柔軟護衣精(2018新包裝)(沁藍海洋香3.2L)',
    '【白蘭】含熊寶貝馨香精華大自然馨香洗衣粉 4.25kg(2入)',
    '【LUX 麗仕】柔亮絲滑潤髮乳 NEW 750ml',
    '【1/2短效期】【DOVE 多芬】清爽水嫩潔膚塊 4入',
    '【白蘭】茶樹除菌超濃縮洗衣精 2. 7Kg',
    '<TreeMall 來店禮獨家組>【白蘭】含熊寶貝超濃縮洗衣精 1+3件組(純淨溫和)',
    '【白蘭】茶樹除菌洗衣粉4.25kg',
    '【白蘭】強效除?過敏洗衣粉 4.25kg',
    '【白蘭】動力配方洗碗精(鮮柚)1kg',
    '【熊寶貝】衣物香氛袋清香21g',
    '<4入瓶裝箱購 贈購物袋>【熊寶貝】衣物柔軟精3.0L/3.2Lx4(七款選)(沁藍海洋香 3.2L)',
    '<超值組>【DOMESTOS 多霸道】多功能除菌清潔劑500ml x 2',
    '<箱購> 【立頓】黃牌精選紅茶 200G x 36入組',
    '【LUX 麗仕】絲蛋白精華沐浴乳水嫩柔膚1L',
    '<TreeMall 來店禮獨家組>【白蘭】含熊寶貝超濃縮洗衣精 1+3件組(花漾清新)',
    '【超值任選】【DOVE 多芬】水潤植萃潤髮乳  玫瑰精華 500ml',
    '<超值7件組>【白蘭】超濃縮洗衣精1瓶+6 補充包(2.7Kg x1瓶+1.6Kg x6包)(強效潔淨除蹣)',
    '【熊寶貝】衣物香氛袋薰衣21g',
    '<TreeMall 來店禮獨家組>【白蘭】含熊寶貝超濃縮洗衣精 1+3件組(大自然馨香)',
    '<超值12件組>【白蘭】不含熊濃縮洗衣精12送6組(1.6kg x12包)送衛生紙6包(蘆薈親膚)',
    '【蒂沐蝶 歐洲天然有機洗沐獨家組】深層純淨1+4超值組+新款植萃皂 贈荷木方形梳(玫瑰保濕皂)',
    '<超值7件組>【白蘭】超濃縮洗衣精1瓶+6 補充包(2.7Kg x1瓶+1.6Kg x6包)(茶樹除菌)',
    '<箱購>【白蘭】強效潔淨除\uee80超濃縮洗衣粉1.9kg x 9入',
    '<超值組>【白蘭】超濃縮洗衣精1+6件組(2.7Kg x1瓶+1.6Kg x6包)',
    '【DOVE 多芬】滋養柔膚沐浴乳 舒敏溫和配方 補充包 2017版  650g',
    '<1+2組>【白蘭】超濃縮洗衣精1+2補(2.7kg x1+1.6kg x2包)(強效潔淨除蹣)',
    '【白蘭】蘆薈親膚超濃縮洗衣精 2. 7Kg',
    '<超值組>【白蘭】蘆薈親膚超濃縮洗衣精1+6件組(2.7Kg x1瓶+1.6Kg x6包)',
    '【康寶】濃湯-自然原味銀魚海帶芽 2*37g',
    '箱購【白蘭】陽光馨香超濃縮洗衣精補充包 1.6Kgx8入',
    '獨家贈【DOVE 多芬】滋養柔膚沐浴乳 滋養柔嫩配方(1Lx1+650mlx5)',
    '【Timotei 蒂沐蝶 】深層純淨護髮乳 500g',
    '【白蘭】含熊寶貝馨香呵護精華純淨溫和洗衣精補充包 1.65kg',
    '東森獨家【白蘭】陽光馨香洗衣粉超值組(東森獨規)',
    '【諾淨】酵素低敏濃縮洗衣精 (護色) 1.5L',
    '<超值7件組>【白蘭】含熊寶貝超濃縮洗衣精 1+6件組 (大自然馨香)',
    '【LUX 麗仕】煥膚香皂煥活冰爽 6入 85g',
    '<箱購>【白蘭】陽光馨香超濃縮洗衣精 2.7Kg  x 4入組',
    '<1+2組>【白蘭】超濃縮洗衣精1+2補(2.7kg x1+1.6kg x2包)(陽光馨香)',
    '【Simple】清妍清新舒緩潔面露 50ML+ 清妍旅行組(2x50ml)',
    '【白蘭】動力配方洗碗精(檸檬)2.8kg',
    '<超值12件組>【白蘭】不含熊濃縮洗衣精12送6組(1.6kg x12包)送衛生紙6包(錯誤)',
    '<箱購>【白蘭】強效除\uee80過敏洗衣粉 4.25kg x 4入組',
    '【潔而亮】特強去污液(清新檸檬芬芳)500ml',
    '【LUX 麗仕】精油香氛沐浴乳迷醉甜香1L',
    '<超值7件組>【白蘭】含熊寶貝超濃縮洗衣精 1+6件組 (2.8Kg x1瓶+1.65 x6包)(森林晨露)',
    '【白蘭】含熊寶貝馨香精華大自然馨香超濃縮洗衣精 2.8kg*1 +補充包1.65kg*2',
    '熊寶貝 柔軟護衣精(2018新包裝)(淡雅櫻花香3.0L)',
    '<箱購>【AXE】黯黑經典香體噴霧150ml x 6入',
    '【白蘭】含熊寶貝馨香精華純淨溫和超濃縮洗衣精 1+9件組(2.8Kg x1瓶+1.65Kg x9包)',
    '【DOVE 多芬】舒柔水嫩沐浴乳(新)  1000ML',
    '<箱購> 【立頓】奶茶粉原味罐裝 450g x 12入組',
    '【LUX 麗仕】日本極致修護髮膜 200g'
]

Fit the model

ke = KeywordExtractor()
ke.fit(data)
KeywordExtractor(enable_english=True, min_ch_keyword_len=2,
                 min_en_keyword_len=3, n_keyword=5, ngram_range=(0, 10),
                 to_lowercase=False)

Extract keywords

ke.transform(data)
array([['熊寶貝', '箱購', '入組', '純淨溫和', '衣物'],
       ['白蘭', '超濃縮洗衣精', '件組', '超值', '補充包'],
       ['衣精', '熊寶貝', '沁藍海洋香', '柔軟護衣精', '柔軟'],
       ['白蘭', '洗衣', '馨香', '含熊寶貝', '洗衣粉'],
       ['麗仕', 'LUX', '髮乳', '潤髮乳', '柔亮絲滑潤髮乳'],
       ['多芬', 'DOVE', '短效期', '清爽水嫩潔膚塊', '水嫩'],
       ['白蘭', '超濃縮洗衣精', '濃縮洗衣精', '衣精', '茶樹除菌'],
       ['白蘭', '件組', '超濃縮洗衣精', '含熊寶貝超濃縮洗衣精', '純淨溫和'],
       ['白蘭', '洗衣', '洗衣粉', '茶樹除菌', '茶樹除菌洗衣粉'],
       ['白蘭', '洗衣', '洗衣粉', '強效', '過敏洗衣粉'],
       ['白蘭', '動力配方洗碗精', '鮮柚', '', ''],
       ['熊寶貝', '衣物', '衣物香氛袋清香', '衣物香氛袋', ''],
       ['熊寶貝', '箱購', '衣物柔軟精', '衣物', '沁藍海洋香'],
       ['超值', '超值組', '除菌', '多霸道', '多功能除菌清潔劑'],
       ['箱購', '入組', '立頓', '黃牌精選紅茶', ''],
       ['麗仕', 'LUX', '精華', '沐浴乳', '柔膚'],
       ['白蘭', '件組', '超濃縮洗衣精', '含熊寶貝超濃縮洗衣精', '含熊寶貝'],
       ['超值', '精華', '多芬', 'DOVE', '髮乳'],
       ['白蘭', '超濃縮洗衣精', '件組', '超值', '補充包'],
       ['熊寶貝', '衣物', '衣物香氛袋薰衣', '衣物香氛袋', ''],
       ['白蘭', '件組', '超濃縮洗衣精', '馨香', '含熊寶貝超濃縮洗衣精'],
       ['白蘭', '濃縮洗衣精', '件組', '超值', '含熊'],
       ['超值', '超值組', '純淨', '獨家組', '蒂沐蝶'],
       ['白蘭', '超濃縮洗衣精', '件組', '超值', '補充包'],
       ['白蘭', '超濃縮洗衣', '箱購', '洗衣粉', '強效潔淨除'],
       ['白蘭', '超濃縮洗衣精', '件組', '超值', '超值組'],
       ['補充包', '多芬', 'DOVE', '配方', '溫和'],
       ['白蘭', '超濃縮洗衣精', '濃縮洗衣精', '衣精', '強效潔淨除蹣'],
       ['白蘭', '超濃縮洗衣精', '濃縮洗衣精', '衣精', '蘆薈親膚超濃縮洗衣精'],
       ['白蘭', '超濃縮洗衣精', '件組', '超值', '超值組'],
       ['自然', '自然原味銀魚海帶芽', '濃湯', '康寶', ''],
       ['白蘭', 'Kgx', '超濃縮洗衣精', '箱購', '馨香'],
       ['獨家', '多芬', 'DOVE', '配方', '沐浴乳'],
       ['純淨', '髮乳', '蒂沐蝶', '深層純淨護髮乳', '深層純淨'],
       ['白蘭', '洗衣精', '馨香', '補充包', '洗衣'],
       ['白蘭', '洗衣', '超值', '馨香', '獨家'],
       ['濃縮洗衣精', '衣精', '酵素低敏濃縮洗衣精', '護色', '諾淨'],
       ['白蘭', '件組', '超濃縮洗衣精', '超值', '馨香'],
       ['麗仕', 'LUX', '煥膚香皂煥活冰爽', '', ''],
       ['白蘭', '超濃縮洗衣精', '箱購', '馨香', '入組'],
       ['白蘭', '超濃縮洗衣精', '馨香', '陽光馨香', '濃縮洗衣精'],
       ['清新', '清妍清新舒緩潔面露', '清妍旅行組', '清妍', 'Simple'],
       ['白蘭', '檸檬', '動力配方洗碗精', '', ''],
       ['白蘭', '濃縮洗衣精', '件組', '超值', '含熊'],
       ['白蘭', '洗衣', '箱購', '入組', '洗衣粉'],
       ['清新', '特強去污液', '潔而亮', '清新檸檬芬芳', ''],
       ['麗仕', 'LUX', '沐浴乳', '香氛', '精油香氛沐浴乳迷醉甜'],
       ['白蘭', '件組', '超濃縮洗衣精', '超值', '含熊寶貝超濃縮洗衣精'],
       ['白蘭', '超濃縮洗衣精', '補充包', '馨香', '含熊寶貝'],
       ['衣精', '熊寶貝', '柔軟護衣精', '柔軟', '新包裝'],
       ['箱購', '黯黑經典香體噴霧', 'AXE', '', ''],
       ['白蘭', '超濃縮洗衣精', '件組', '馨香', '含熊寶貝'],
       ['多芬', 'DOVE', '沐浴乳', '舒柔水嫩沐浴乳', '水嫩'],
       ['箱購', '入組', '立頓', '奶茶粉原味罐裝', ''],
       ['麗仕', 'LUX', '日本極致修護髮膜', '', '']], dtype='<U32')
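In the output above, every row has five entries (matching `n_keyword=5`), and rows with fewer keywords appear to be padded with empty strings. A small sketch of stripping that padding before further use is shown here. The helper name `strip_padding` is hypothetical and not part of the library, and the rows are plain Python lists copied from the output above.

```python
def strip_padding(rows):
    """Drop empty-string padding from each row of extracted keywords."""
    return [[kw for kw in row if kw] for row in rows]


# Two rows copied from the transform output shown above.
rows = [
    ['白蘭', '動力配方洗碗精', '鮮柚', '', ''],
    ['麗仕', 'LUX', '日本極致修護髮膜', '', ''],
]
cleaned = strip_padding(rows)
for row in cleaned:
    print(row)
```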

sklearn-chinese-keyword-extractor's People

Contributors

x01963815
