Giter Site home page Giter Site logo

what-s-the-novel-type's Introduction

通过小说名字判断小说类型

通过FastText来进行文本分类

使用爬虫爬取了小说网站的收藏榜(因为此榜单人气高),扔到FastText里去训练。

同时使用 jieba 分词库来提高准确率。

网站爬虫

  • 起点(男频)
  • 创世(男频)
  • 晋江(女频)
  • 笔趣阁(综合)

起点、创世爬的都是收藏榜,长期人气高。

爬虫代码(以起点玄幻为例):

full_lines = []  # 结果的每一行


# 获取网页源码
def get_html(url):
    headers = {
        'User-Agent': 'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    req = urllib.request.Request(url=url, headers=headers)
    html = urllib.request.urlopen(req)
    return html.read().decode('utf-8', 'ignore')


# 保存到文本文件
def save_text_file(file_name, contents):
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write(contents)


# 提取起点每一个类型的书名数组
def add_qidian_novel_type(type_id, type_name):
    global full_lines
    for page in range(1, 25):  # 遍历每一页
        html = get_html("https://www.qidian.com/rank/collect?chn=" + type_id + "&page=" + str(page))
        names = re.findall(
            '<h4><a href="//book.qidian.com/info/\\d+" target="_blank" data-eid="qd_C40" data-bid="\\d+">(.+?)</a></h4>'
            , html)
        print(type_name, "(", str(page), ") : ", len(full_lines), " : ", names)
        for name in names:
            full_lines += ["__label__" + type_name + " " + " ".join(jieba.cut(name))]

add_qidian_novel_type("21", "玄幻")
add_qidian_novel_type("1", "玄幻")  # 奇幻
add_qidian_novel_type("2", "武侠")
add_qidian_novel_type("22", "修真")  # 仙侠
add_qidian_novel_type("4", "都市")  # 言情?
add_qidian_novel_type("15", "现实")
add_qidian_novel_type("6", "军事")
add_qidian_novel_type("5", "历史")
add_qidian_novel_type("7", "游戏")
add_qidian_novel_type("8", "体育")
add_qidian_novel_type("9", "科幻")
add_qidian_novel_type("10", "灵异")
add_qidian_novel_type("12", "二次元")

random.shuffle(full_lines)  # 随机打乱顺序
fulltext = "\n".join(full_lines)  # 组成字符串
save_text_file('data/qidian_names.txt', fulltext)  # 保存到文件

其中晋江用的是本地,因为那个排行榜不分页,而且用上面的代码获取过来会乱码(试了几个编码照样乱码),就存到本地再读取了。

笔趣阁有一部分页面会乱码,只能获取一部分,但是数量是真的多。

神经网络训练

网上说的那些词向量什么的我也不懂,一堆语料库扔进去训练就是了。

调整了一下几个参数,不知道准确性如何。

训练过程

classifier = fastText.train_supervised("data/novel_names.txt", lr=0.1, wordNgrams=1, loss="hs")
model = classifier.save_model("data/novel_names.model")
# classifier = fastText.load_model("data/novel_names.model")

运行示例

运行之前,先合并需要用到的语料库

# 保存到文本文件
def save_text_file(file_name, contents):
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write(contents)


# 判断哪些素材要进行训练
sources = ['biquge', 'qidian', 'chuangshi', 'jinjiang']
fulltext = ""
for source in sources:
    fulltext += open("data/"+source+"_names.txt", encoding='utf-8').read() + "\n"
save_text_file("data/novel_names.txt", fulltext)

进行判断

# 判断小说类型的调用语句
def judge(novel):
    print(novel, " : ", classifier.predict(" ".join(jieba.cut(novel))))

judge("机械囚宠:元帅亲点爱")
judge("指掌浩瀚")
judge("医世倾城")
judge("夏蝉冬雪")

运行结果

机械囚宠:元帅亲点爱  :  (('__label__都市',), array([0.52540648]))
指掌浩瀚  :  (('__label__修真',), array([0.40694034]))
医世倾城  :  (('__label__玄幻',), array([0.61438417]))
夏蝉冬雪  :  (('__label__都市',), array([0.15081741]))

what-s-the-novel-type's People

Contributors

iwxyi avatar

Stargazers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.