embedding_study's Introduction

bert

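Notes

Based on bert-as-service, keeping only the code that generates embeddings.

  • For details of the BERT implementation itself, see google-bert.

  • For pretrained models, see the model link.

  • bert_classification: a BERT-based classification model. The downstream task can be adapted to binary, multi-class, or multi-label classification.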
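The three downstream task types differ mainly in how the classifier's output logits are turned into predictions. The following is a minimal sketch of that difference, not the code in bert_classification: the function names and the 0.5 threshold are assumptions made for illustration. Multi-class (and binary, as the two-class case) picks a single label via softmax; multi-label scores each label independently with a sigmoid.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Squash one logit into an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multiclass(logits: List[float]) -> int:
    """Exactly one label: the index with the highest softmax probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def predict_multilabel(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Any number of labels: every index whose sigmoid clears the threshold."""
    return [i for i, x in enumerate(logits) if sigmoid(x) >= threshold]


logits = [2.0, -1.0, 0.5]  # toy classifier output for three labels
print(predict_multiclass(logits))  # 0
print(predict_multilabel(logits))  # [0, 2]
```

In a real model the same split usually shows up in the loss as well: softmax cross-entropy for single-label tasks and per-label sigmoid cross-entropy for multi-label tasks.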

ELMO-tf

ELMO : Deep contextualized word representations https://arxiv.org/abs/1802.05365

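Notes

Based on ELMO-tf, with parts of the code modified to handle Chinese corpora. The Korean author's code is very clear and far more readable than the original implementation, which you would have to reorganize yourself and extend with your own Chinese processing pipeline.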

word2vec

Chinese word vectors: https://github.com/Embedding/Chinese-Word-Vectors
Tencent word vectors: https://pan.baidu.com/s/1meeKUBKbGMyTGrx664F4Ng (password: xfh1)

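Notes

Run word2vec_embedding.py in the word2vec directory.

This project uses the Tencent word vectors to compute word vectors and sentence vectors.

  • Word vectors: table lookup; a word missing from the table falls back to its characters.
  • Sentence vectors: word vectors weighted by part of speech, then averaged.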
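The two rules above can be sketched as follows. This is an illustrative sketch, not the contents of word2vec_embedding.py: the function names, the toy table, the part-of-speech tags and their weights are all hypothetical. It also assumes that "weighted, then averaged" means scaling each word vector by its weight and taking the plain mean; dividing by the sum of the weights instead would be an equally plausible reading.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def average(vectors: Sequence[Vector], dim: int) -> Vector:
    """Element-wise mean; a zero vector if there is nothing to average."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def word_vector(word: str, table: Dict[str, Vector], dim: int) -> Vector:
    """Table lookup; an out-of-vocabulary word falls back to its characters."""
    if word in table:
        return table[word]
    char_vectors = [table[ch] for ch in word if ch in table]
    return average(char_vectors, dim)


def sentence_vector(
    tagged_words: Sequence[Tuple[str, str]],
    table: Dict[str, Vector],
    pos_weights: Dict[str, float],
    dim: int,
) -> Vector:
    """Scale each word vector by its part-of-speech weight, then average."""
    scaled = []
    for word, pos in tagged_words:
        weight = pos_weights.get(pos, 1.0)  # assumption: unknown tags weigh 1.0
        scaled.append([weight * x for x in word_vector(word, table, dim)])
    return average(scaled, dim)


# Toy 2-dimensional table; the real Tencent vectors have many more dimensions.
table = {"我": [1.0, 0.0], "中": [2.0, 2.0], "国": [4.0, 0.0]}
print(word_vector("中国", table, 2))  # [3.0, 1.0], averaged from 中 and 国
print(sentence_vector([("我", "r"), ("中国", "ns")], table, {"ns": 2.0}, 2))  # [3.5, 1.0]
```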

Usage:

Download the Tencent word vectors; tencent_45000.txt should be enough.

Other

#!/usr/bin/env bash

# How to reset the git history (step 2 needs adjusting: it is best to check manually what git add will stage)
# 1. Check out a new orphan branch with no history
git checkout --orphan latest_branch
# 2. Add all the files
git add -A
# 3. Commit the changes
git commit -am "commit message"
# 4. Delete the old master branch
git branch -D master
# 5. Rename the current branch to master
git branch -m master
# 6. Finally, force-update the remote repository
git push -f origin master

embedding_study's People

Contributors

yc-wind

embedding_study's Issues

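Question about the ELMO code

The accuracy computation in the ELMO model looks wrong to me; see the screenshot below for details.
[image]

The loss is computed from forward_output, but forward_pred should be computed from forward_projection: the argmax over the softmax has to pick the most likely word id along the vocabulary dimension, whereas the code picks the most likely hidden id along the elmo_hidden dimension.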
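The point raised here can be illustrated with a toy example. This sketch does not use the repository's tensors: `hidden` stands in for a hidden-state output like forward_output and `logits` for a vocabulary projection like forward_projection, and the sizes (hidden size 3, vocabulary size 5) and weights are made up. Taking the argmax of the hidden state can only ever return an index below the hidden size, so it cannot be compared against word ids.

```python
from typing import List


def project(hidden: List[float], weights: List[List[float]]) -> List[float]:
    """Map a hidden state [hidden_dim] to vocabulary logits [vocab_size]."""
    vocab_size = len(weights[0])
    return [
        sum(h * row[j] for h, row in zip(hidden, weights))
        for j in range(vocab_size)
    ]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


hidden = [0.2, 0.9, -0.1]  # toy hidden state, hidden_dim = 3
weights = [                # toy projection matrix, shape [3][5]
    [0.1, 0.0, 0.3, 0.0, 1.0],
    [0.0, 0.2, 0.0, 0.5, 0.1],
    [0.4, 0.0, 0.0, 0.0, 0.0],
]
logits = project(hidden, weights)

# Softmax is monotonic, so the argmax of the logits equals the argmax of the softmax.
word_id = argmax(logits)    # a valid word id in [0, vocab_size)
hidden_id = argmax(hidden)  # only an index in [0, hidden_dim), not a word id
print(word_id, hidden_id)   # 3 1
```

If accuracy is computed by comparing `hidden_id`-style predictions against target word ids, it will stay near zero even while the loss decreases normally.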

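Problem training Chinese word vectors

Hello, the Chinese data I am testing with is formatted as follows:
[image]
Each line is a sentence that has already been segmented into words.

The word vocabulary has 90,000 entries and the char vocabulary has 3,600.
The ELMo batch_size is 128 (the original configuration uses 1024, but my machine cannot handle that, so I changed it to 128). Everything else is unchanged. The results look like this:
[image]

The loss looks fine, but the accuracy peaks at only one point something. What could be causing this?

Following your code, this is what I wrote:
[image]

Also, once the model is trained, how do I get the Chinese word vectors out of it?

Many thanks!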

A question about the result

Hi, if I change the input to ["好难啊", "怎么办呢"], the shape of the result is still (10, 768).
[image]
Could you explain why I get this result? Thank you!

BERT is very slow

I compared the speed of this project with bert-as-service when generating a single sentence embedding, and this one is a lot slower. Do you have suggestions for keeping the model in memory, or other ways to speed this up?
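One general way to keep a model in memory is to load it once per process and reuse the loaded object for every request, instead of reloading it on each call. The sketch below shows that pattern with `functools.lru_cache`. It is not this repository's code: `get_model` and `embed` are hypothetical placeholders, and the returned dictionary and vector stand in for a real model object and a real embedding.

```python
import functools
from typing import Dict, List

LOAD_COUNT = 0  # counts how often the expensive load actually runs


@functools.lru_cache(maxsize=None)
def get_model(model_dir: str) -> Dict[str, str]:
    """Hypothetical loader: the body stands in for building the graph and
    restoring the checkpoint, which is the slow part."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"model_dir": model_dir}  # placeholder for a real model object


def embed(sentence: str, model_dir: str = "path/to/model") -> List[float]:
    """Hypothetical entry point: reuses the cached model on every call."""
    model = get_model(model_dir)  # loaded on the first call, cached afterwards
    assert model["model_dir"] == model_dir
    return [float(len(sentence))]  # placeholder for the real embedding


embed("好难啊")
embed("怎么办呢")
print(LOAD_COUNT)  # 1: the second call reused the cached model
```

The cache only lives as long as the process, so this helps when embeddings are requested from a long-running script or service rather than from a fresh process per sentence.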

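How to return a sentence vector

Given a sentence made up of several words as input, this returns a vector for each word. How can I get a sentence vector instead?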
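A common way to turn per-token vectors into a single sentence vector is mean pooling: average the vectors of the real tokens and ignore padding positions. The sketch below assumes the model output is a list of token vectors plus a 0/1 mask marking real tokens. `mean_pool` is a hypothetical helper, not a function from this repository, and other pooling choices (such as taking the first token's vector) are equally possible.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[List[float]], mask: Sequence[int]) -> List[float]:
    """Average the vectors whose mask entry is non-zero (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        # No real tokens: return a zero vector of the right size.
        dim = len(token_vectors[0]) if token_vectors else 0
        return [0.0] * dim
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding position, with toy 2-d vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```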
