
baidu / familia


A Toolkit for Industrial Topic Modeling

License: BSD 3-Clause "New" or "Revised" License

Makefile 2.60% Shell 2.27% C++ 58.80% Python 36.09% Dockerfile 0.24%
topic-modeling topic-models lda sentence-lda twe nlp

familia's Introduction


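The Familia open-source project includes a document topic inference tool, a semantic matching tool, and three topic models trained on industrial-scale corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE). Users can apply these out of the box to research and applications such as text classification, text clustering, and personalized recommendation. Because training topic models is costly and open-source topic model resources are scarce, we will gradually release topic models for several vertical domains trained on industrial-scale corpora, along with typical ways these models are used in industry, to support both research on topic modeling and its deployment.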

News!!!

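We recently released the LDA models from Familia in PaddleHub 1.8. They come in three variants according to the training dataset: lda_news, lda_novel, and lda_webpage.

PaddleHub is easy to use. The example below uses lda_news.

  1. Before using PaddleHub, install the PaddlePaddle deep learning framework. For more installation instructions, see the PaddlePaddle quick installation guide.

  2. Install PaddleHub: pip install paddlehub

  3. Install the lda_news model: hub install lda_news

  4. Usage: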

import paddlehub as hub

lda_news = hub.Module(name="lda_news")
jsd, hd = lda_news.cal_doc_distance(doc_text1="今天的天气如何,适合出去游玩吗", doc_text2="感觉今天的天气不错,可以出去玩一玩了")
# jsd = 0.003109, hd = 0.0573171

lda_sim = lda_news.cal_query_doc_similarity(query='百度搜索引擎', document='百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。')
# LDA similarity = 0.06826

results = lda_news.cal_doc_keywords_similarity('百度是全球最大的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。')
# [{'word': '百度', 'similarity': 0.12943492762349573}, 
#  {'word': '信息', 'similarity': 0.06139783578769882}, 
#  {'word': '找到', 'similarity': 0.055296603463188265}, 
#  {'word': '搜索', 'similarity': 0.04270794098349327}, 
#  {'word': '全球', 'similarity': 0.03773627056367886}, 
#  {'word': '超过', 'similarity': 0.03478658388202199}, 
#  {'word': '相关', 'similarity': 0.026295857219683725}, 
#  {'word': '获取', 'similarity': 0.021313585287833996}, 
#  {'word': '中文', 'similarity': 0.020187103312009513}, 
#  {'word': '搜索引擎', 'similarity': 0.007092890537169911}]

For a more detailed introduction and usage instructions, see: https://www.paddlepaddle.org.cn/hublist?filter=en_category&value=SemanticModel
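As a usage sketch, the distances returned by cal_doc_distance can be used to rank candidate documents against a reference document. The helper below is hypothetical (rank_by_jsd is not part of PaddleHub or Familia); it only assumes a callable that accepts the doc_text1/doc_text2 keyword arguments and returns a (jsd, hd) pair, as the lda_news.cal_doc_distance call above does.

```python
from typing import Callable, List, Tuple


# Hypothetical helper, not part of PaddleHub or Familia.
# `distance_fn` is assumed to behave like lda_news.cal_doc_distance:
# it takes doc_text1/doc_text2 keyword arguments and returns (jsd, hd).
def rank_by_jsd(
    reference: str,
    candidates: List[str],
    distance_fn: Callable[..., Tuple[float, float]],
) -> List[Tuple[str, float]]:
    """Return (candidate, jsd) pairs sorted by ascending Jensen-Shannon divergence."""
    scored = []
    for text in candidates:
        jsd, _hd = distance_fn(doc_text1=reference, doc_text2=text)
        scored.append((text, jsd))
    # A smaller divergence means the two topic distributions are closer.
    return sorted(scored, key=lambda pair: pair[1])
```

With the module loaded as shown above, this would be called as rank_by_jsd(doc, docs, lda_news.cal_doc_distance).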

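Applications

For the papers behind the topic models currently included in Familia, see Related Papers.

Industrial uses of topic models can be grouped into two broad paradigms: semantic representation and semantic matching.

  • Semantic Representation

    Reduce a document to a low-dimensional topic representation. This semantic representation can then feed downstream applications such as text classification, text content analysis, and CTR prediction.

  • Semantic Matching

    Compute the degree of semantic match between texts. Two kinds of similarity computation are provided:

    • Short text vs. long text: use cases include document keyword extraction and scoring the similarity between a search engine query and a web page.
    • Long text vs. long text: use cases include scoring the similarity between two documents, or between a user profile and a news article.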
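For the long text vs. long text case, the PaddleHub example reports a Jensen-Shannon divergence (jsd) and a Hellinger distance (hd) between two documents. The sketch below shows the textbook definitions of both on two topic distributions. It assumes natural logarithms and normalized inputs of equal length; Familia's own implementation may use different conventions, so the numbers need not match its output.

```python
import math
from typing import Sequence


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hellinger_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * s)


# Illustrative topic distributions over three topics (made-up values).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
print(jensen_shannon_divergence(doc_a, doc_b), hellinger_distance(doc_a, doc_b))
```

Both measures are symmetric and equal zero for identical distributions, which makes them convenient as document distances.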

For more details and industrial case studies, see the Familia Wiki. To build a web-based visualization of the application paradigms above, see Familia-Visualization.

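Building

The third-party dependencies are gflags-2.0, glog-0.3.4, and protobuf-2.5.0. The compiler must support C++11 (g++ >= 4.8). Linux and Mac are supported. By default, the following script fetches and installs the dependencies automatically.

$ sh build.sh # also fetches and installs the third-party dependencies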

Model Download

$ cd model
$ sh download_model.sh

We will gradually release more topic models for different domains to cover more scenarios.

Demo

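The demos bundled with Familia cover the following:

  • Semantic representation

    Run topic inference on an input document with a topic model to obtain the document's reduced topic representation.

  • Semantic matching

    Compute the similarity between texts, for both short text vs. long text and long text vs. long text.

  • Model inspection

    Display a model's topic words and nearest-neighbor words so that users can get an intuitive feel for its topics.

For detailed demo usage, see the usage documentation.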

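Notes

  • If you get errors about missing shared libraries such as libglog.so or libgflags.so, add third_party to the LD_LIBRARY_PATH environment variable.

    export LD_LIBRARY_PATH=./third_party/lib:$LD_LIBRARY_PATH

  • The code ships with a simple FMM (forward maximum matching) word segmenter that only matches against the vocabulary of the topic model. If you need higher segmentation and semantic accuracy, we recommend using a commercial segmenter and importing the topic model's vocabulary through its custom-dictionary feature.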
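To make the segmenter note concrete, here is a minimal sketch of forward maximum matching against a fixed vocabulary. It is not Familia's tokenizer: the function name, the max_word_len parameter, and the choice to skip characters that start no vocabulary word are assumptions made for illustration.

```python
from typing import List, Set


def fmm_tokenize(text: str, vocab: Set[str], max_word_len: int = 0) -> List[str]:
    """Forward maximum matching: at each position, take the longest vocabulary word."""
    if max_word_len <= 0:
        max_word_len = max((len(w) for w in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                tokens.append(candidate)
                i += length
                matched = True
                break
        if not matched:
            # Assumption: characters that start no vocabulary word are skipped.
            i += 1
    return tokens


# Toy vocabulary; a real one would come from the topic model's vocabulary file.
vocab = {"百度", "搜索", "引擎", "搜索引擎"}
print(fmm_tokenize("百度是搜索引擎", vocab))
```

Because the longest match wins, "搜索引擎" is kept as one token instead of being split into "搜索" and "引擎", which is also why a greedy matcher can mis-segment ambiguous text.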

Questions

Please submit any questions and bug reports to Github Issues, or send email to { familia } at baidu.com.

Docker

docker run -d \
    --name familia \
    -e MODEL_NAME=news \
    -p 5000:5000 \
    orctom/familia

MODEL_NAME can be one of news/novel/webpage/weibo

API

http://localhost:5000/swagger/

Citation

The following article describes the Familia project and industrial cases powered by topic modeling. It bundles and translates the Chinese documentation of the website. We recommend citing this article as default.

Di Jiang, Yuanfeng Song, Rongzhong Lian, Siqi Bao, Jinhua Peng, Huang He, Hua Wu. 2018. Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering. arXiv preprint arXiv:1808.03733.

@article{jiang2018familia,
  author = {Di Jiang and Yuanfeng Song and Rongzhong Lian and Siqi Bao and Jinhua Peng and Huang He and Hua Wu},
  title = {{Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering}},
  journal = {arXiv preprint arXiv:1808.03733},
  year = {2018}
}

Further Reading: Federated Topic Modeling

Copyright and License

Familia is provided under the BSD-3-Clause License.

familia's People

Contributors

desmonday, jiangjiajun, jxfruit, lianrzh, liyiheng, maverickhelo, mzyhappy, orctom, rollroll90, songyf, zeyuchen


familia's Issues

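A question about my understanding of query_doc_sim

Below is my understanding. I am not sure it is correct and I have a few questions, so I would appreciate a reply. Thanks.

  1. Given a query and a document, LDA is used to infer the document's topics, which yields topic IDs and their probabilities.
  2. For every word in the query, cosine similarity is computed between the word vector and the topic vectors trained with TWE, and the final query-document similarity is obtained from the corresponding formula. (I could not paste the formula here, but I know it, so this part should be fine.)

My questions:
  3. Does the LDA topic inference for the document use the algorithm from this paper: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf ?
  4. From earlier issues I learned that you use the TWE-1 model. TWE-1 first trains word vectors and topic vectors, then combines them into topical word embeddings. So in step 2 above, are the vectors used for the cosine similarity just the word vectors and topic vectors trained by TWE-1, with no connection to the so-called topical word embeddings? Judging from the source code, each query word has a single fixed vector rather than a different representation under each topic (because a query is too short to provide enough context for topic inference).
  5. I have read the TWE paper, but the training procedure is still unclear to me. Does it first train word vectors with skip-gram, then initialize each topic vector with the average of the word vectors assigned to that topic, and then learn the topic vectors while keeping the word vectors fixed? If so, are the word vectors from TWE-1 identical to those from skip-gram? And is the novelty of TWE-1 that it also learns topic vectors, from which topical word embeddings can then be derived?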
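One possible reading of steps 1 and 2 can be sketched in code. The formula itself is not reproduced in the question, so the weighting used here (the average over query words of the cosine similarity to each topic vector, weighted by the document's topic probability) is an assumption, as are all function and parameter names. This is not taken from the Familia source.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def twe_query_doc_similarity(
    query_words: List[str],
    word_emb: Dict[str, Sequence[float]],   # word -> word vector (assumed layout)
    topic_emb: List[Sequence[float]],       # topic id -> topic vector (assumed layout)
    doc_topic_dist: Dict[int, float],       # topic id -> P(topic | document) from LDA
) -> float:
    """Average, over known query words, of sum_k P(z_k | d) * cos(v_w, v_{z_k})."""
    known = [w for w in query_words if w in word_emb]
    if not known:
        return 0.0
    total = 0.0
    for word in known:
        for topic_id, prob in doc_topic_dist.items():
            total += prob * cosine(word_emb[word], topic_emb[topic_id])
    return total / len(known)
```

In this reading each query word has one fixed vector, which matches the observation in question 4 that no per-topic word representation is involved.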

Build error on Ubuntu

Build error on Ubuntu 16.04.4

Target build/libfamilia.a
ar crv build/libfamilia.a build/vose_alias.o build/inference_engine.o build/model.o build/vocab.o build/document.o build/sampler.o build/config.o build/util.o build/semantic_matching.o build/tokenizer.o build/demo/inference_demo.o build/demo/doc_distance_demo.o build/demo/query_doc_sim_demo.o build/demo/word_distance_demo.o build/demo/topic_word_demo.o build/demo/show_topic_demo.o
a - build/vose_alias.o
a - build/inference_engine.o
a - build/model.o
a - build/vocab.o
a - build/document.o
a - build/sampler.o
a - build/config.o
a - build/util.o
a - build/semantic_matching.o
a - build/tokenizer.o
a - build/demo/inference_demo.o
a - build/demo/doc_distance_demo.o
a - build/demo/query_doc_sim_demo.o
a - build/demo/word_distance_demo.o
a - build/demo/topic_word_demo.o
a - build/demo/show_topic_demo.o
g++ -I./include/ -I./include/familia -I./third_party/include -I/usr/include/python3.5 -pipe -W -Wall -fPIC -std=c++11 -fno-omit-frame-pointer -fpermissive -O3 -ffast-math -c python/cpp/familia_wrapper.cpp -o python/cpp/familia_wrapper.o
python/cpp/familia_wrapper.cpp:429:1: warning: missing initializer for member ‘PyModuleDef::m_slots’ [-Wmissing-field-initializers]
};
^
python/cpp/familia_wrapper.cpp:429:1: warning: missing initializer for member ‘PyModuleDef::m_traverse’ [-Wmissing-field-initializers]
python/cpp/familia_wrapper.cpp:429:1: warning: missing initializer for member ‘PyModuleDef::m_clear’ [-Wmissing-field-initializers]
python/cpp/familia_wrapper.cpp:429:1: warning: missing initializer for member ‘PyModuleDef::m_free’ [-Wmissing-field-initializers]
g++ -I./include/ -I./include/familia -I./third_party/include -I/usr/include/python3.5 -pipe -W -Wall -fPIC -std=c++11 -fno-omit-frame-pointer -fpermissive -O3 -ffast-math -shared python/cpp/familia_wrapper.o -L/home/karvn/Project/Familia-1.1.2/third_party/lib -L/usr/lib -L./build/ -lfamilia -lprotobuf -lglog -lgflags -l -o python/demo/familia.so
g++: error: python/demo/familia.so: No such file or directory
Makefile:103: recipe for target 'python/demo/familia.so' failed
make: *** [python/demo/familia.so] Error 1

Failed to build

$git:(master) sudo sh build.sh
rm -rf glog-0.3.4.tar.gz glog-0.3.4
wget --no-check-certificate http://raw.githubusercontent.com/ZeyuChen/third_party/master/package//glog-0.3.4.tar.gz && tar -zxf glog-0.3.4.tar.gz
--2017-07-25 14:16:37-- http://raw.githubusercontent.com/ZeyuChen/third_party/master/package//glog-0.3.4.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.72.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.72.133|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://raw.githubusercontent.com/ZeyuChen/third_party/master/package//glog-0.3.4.tar.gz [following]
--2017-07-25 14:16:37-- https://raw.githubusercontent.com/ZeyuChen/third_party/master/package//glog-0.3.4.tar.gz
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.72.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 522508 (510K) [application/octet-stream]
Saving to: ‘glog-0.3.4.tar.gz’

glog-0.3.4.tar.gz 100%[======================================================================================================>] 510.26K 1.01MB/s in 0.5s

2017-07-25 14:16:38 (1.01 MB/s) - ‘glog-0.3.4.tar.gz’ saved [522508/522508]

cd glog-0.3.4 && export CFLAGS=-fPIC && export CXXFLAGS=-fPIC && ./configure -prefix=/mnt/shared/Familia/third_party --with-gflags=/mnt/shared/Familia/third_party && make && make install
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
/mnt/shared/Familia/glog-0.3.4/missing: Unknown `--is-lightweight' option
Try `/mnt/shared/Familia/glog-0.3.4/missing --help' for more information
configure: WARNING: 'missing' script is too old or missing
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking how to run the C preprocessor... gcc -E
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for mt... mt
checking if mt is a manifest tool... no
checking for ANSI C header files... no
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... yes
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for ANSI C header files... (cached) no
checking for stdint.h... (cached) yes
checking for sys/types.h... (cached) yes
checking for inttypes.h... (cached) yes
checking for unistd.h... (cached) yes
checking syscall.h usability... yes
checking syscall.h presence... yes
checking for syscall.h... yes
checking sys/syscall.h usability... yes
checking sys/syscall.h presence... yes
checking for sys/syscall.h... yes
checking execinfo.h usability... yes
checking execinfo.h presence... yes
checking for execinfo.h... yes
checking libunwind.h usability... no
checking libunwind.h presence... no
checking for libunwind.h... no
checking ucontext.h usability... yes
checking ucontext.h presence... yes
checking for ucontext.h... yes
checking sys/utsname.h usability... yes
checking sys/utsname.h presence... yes
checking for sys/utsname.h... yes
checking pwd.h usability... yes
checking pwd.h presence... yes
checking for pwd.h... yes
checking syslog.h usability... yes
checking syslog.h presence... yes
checking for syslog.h... yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking glob.h usability... yes
checking glob.h presence... yes
checking for glob.h... yes
checking unwind.h usability... yes
checking unwind.h presence... yes
checking for unwind.h... yes
checking windows.h usability... no
checking windows.h presence... no
checking for windows.h... no
checking size of void *... 8
checking for uint16_t... yes
checking for u_int16_t... yes
checking for __uint16... no
checking for sigaltstack... yes
checking for sigaction... yes
checking for dladdr... no
checking for fcntl... yes
checking for pread... yes
checking for pwrite... yes
checking for __attribute__... yes
checking for __builtin_expect... yes
checking for __sync_val_compare_and_swap... yes
checking for the pthreads library -lpthreads... no
checking whether pthreads work without any flags... no
checking whether pthreads work with -Kthread... no
checking whether pthreads work with -kthread... no
checking for the pthreads library -llthread... no
checking whether pthreads work with -pthread... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking if more special flags are required for pthreads... no
checking whether to check for GCC pthread/shared inconsistencies... yes
checking whether -pthread is sufficient with -shared... yes
checking for pthread_self in -lpthread... yes
checking for main in -lgflags... no
checking for gtest-config... no
checking for main in -lgtest... no
checking support for pthread_rwlock_* functions... yes
checking whether the compiler implements namespaces... yes
checking what namespace STL code is in... std
checking whether compiler supports using ::operator<<... 1
checking for ucontext.h... (cached) yes
checking sys/ucontext.h usability... yes
checking sys/ucontext.h presence... yes
checking for sys/ucontext.h... yes
checking how to access the program counter from a struct ucontext... uc_mcontext.gregs[REG_RIP]
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/glog/logging.h
config.status: creating src/glog/raw_logging.h
config.status: creating src/glog/vlog_is_on.h
config.status: creating src/glog/stl_logging.h
config.status: creating libglog.pc
config.status: creating src/config.h
config.status: executing depfiles commands
config.status: executing libtool commands
make[1]: Entering directory '/mnt/shared/Familia/glog-0.3.4'
CDPATH="${ZSH_VERSION+.}:" && cd . && aclocal-1.14 -I m4
/bin/bash: aclocal-1.14: command not found
Makefile:957: recipe for target 'aclocal.m4' failed
make[1]: *** [aclocal.m4] Error 127
make[1]: Leaving directory '/mnt/shared/Familia/glog-0.3.4'
depends.mk:32: recipe for target '/mnt/shared/Familia/third_party/include/glog/logging.h' failed
make: *** [/mnt/shared/Familia/third_party/include/glog/logging.h] Error 2
rm -rf inference_demo
rm -rf doc_distance_demo
rm -rf query_doc_sim_demo
rm -rf word_distance_demo
rm -rf topic_word_demo
rm -rf show_topic_demo
rm -rf build
rm -rf python/cpp/*.o
rm -rf python/demo/*.so
rm -rf python/demo/*.pyc
find src -name "*.pb.[ch]*" -delete
/mnt/shared/Familia/third_party/bin/protoc --cpp_out=./src --proto_path=./proto proto/config.proto
make: /mnt/shared/Familia/third_party/bin/protoc: Command not found
Makefile:98: recipe for target 'include/config.pb.h' failed
make: *** [include/config.pb.h] Error 127

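Request: expose more low-level interfaces in the Python API

I would like to use TWE for things like keyword extraction and word-to-word similarity, but the current Python interface does not seem to expose the model's underlying data, such as the word vectors (word_emb) and topic vectors (topic_emb).

Installation fails

On Ubuntu 17, running sh build.sh fails with the following errors: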
python/cpp/familia_wrapper.cpp:517:1: warning: missing initializer for member ‘PyModuleDef::m_traverse’ [-Wmissing-field-initializers]
python/cpp/familia_wrapper.cpp:517:1: warning: missing initializer for member ‘PyModuleDef::m_clear’ [-Wmissing-field-initializers]
python/cpp/familia_wrapper.cpp:517:1: warning: missing initializer for member ‘PyModuleDef::m_free’ [-Wmissing-field-initializers]
g++ -I./include/ -I./include/familia -I./third_party/include -I/usr/include/python3.6 -pipe -W -Wall -fPIC -std=c++11 -fno-omit-frame-pointer -fpermissive -O3 -ffast-math -shared python/cpp/familia_wrapper.o -L/home/zyb/dev/Familia/Familia/third_party/lib -L/usr/lib -L./build/ -lfamilia -lprotobuf -lglog -lgflags -lpython3.6 -o python/demo/familia.so
/usr/bin/ld: cannot find -lpython3.6
collect2: error: ld returned 1 exit status
Makefile:106: recipe for target 'python/demo/familia.so' failed
make: *** [python/demo/familia.so] Error 1

How can I fix this? Thanks.
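A possible cause, not confirmed in this thread: the link line asks for -lpython3.6, but many Python 3.6 builds name the library with an ABI suffix (libpython3.6m), and the shared library is often only present once the Python development package is installed. The sketch below asks the interpreter itself which library directory and version suffix it was built with; python_link_flags is an illustrative name, and the output is a hint for adjusting the Makefile rather than a guaranteed fix.

```python
import sysconfig


def python_link_flags() -> str:
    """Build -L/-l linker flags for the interpreter running this script."""
    libdir = sysconfig.get_config_var("LIBDIR")
    # LDVERSION includes ABI flags (for example "3.6m"); fall back to the
    # plain version string when it is not defined on this platform.
    ldversion = (
        sysconfig.get_config_var("LDVERSION")
        or sysconfig.get_config_var("VERSION")
        or sysconfig.get_python_version()
    )
    parts = []
    if libdir:
        parts.append("-L" + libdir)
    parts.append("-lpython" + str(ldversion))
    return " ".join(parts)


if __name__ == "__main__":
    print(python_link_flags())
```

Run it with the same interpreter the build targets (for example python3.6) and compare the printed flags with the -L and -l options in the failing g++ command.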

Python demo: computing keyword-document relevance with TWE

sh run_doc_keywords_demo.sh

Enter Keywords: 分析
Enter Document: 在分析文档时,我们往往会抽取一些文档的关键词做标签(tag),这些tag在用户画像和推荐任务中扮演着重要角色。从文档中抽取关键词,常用的方法是利用词的TF和IDF信息,此外,还可利用主题模型,估计一个文档产生单词的概率作为该单词的重要度指标:
----------------------------
分析	0.0

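Why is the output similarity 0?

Misspelled external link URL in the Wiki

Wiki: https://github.com/baidu/Familia/wiki/语义匹配应用介绍

For the harder task of short text vs. short text semantic matching, consider introducing supervised signals and using more complex neural network models such as the Deep Structured Semantic Model (DSSM) [5] or the Convolutional Latent Semantic Model (CLSM) [9] to compute semantic relevance.

The link URL for Convolutional Latent Semantic Model (CLSM) [9] is misspelled: microsoft.com is written as microsoft.uom

Demo problem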

./inference_demo --model_dir="./model/news" --conf_file="lda.conf"
ERROR: unknown command line flag 'conf_file'
ERROR: unknown command line flag 'model_dir'

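What causes this?

Is the model format open?

Hello,
I tried Familia and found that the models provided with the project work well.
Familia does not yet offer model training.
So, are there plans to open up the model format so that developers can train their own models?

Could a model training interface be provided?

Hello, could you provide an interface for training models?
I would like to train on a domain-specific corpus and see how it performs.

libglog.so.0 missing on Ubuntu 16.04

I am using Familia on Ubuntu 16.04. I added export LD_LIBRARY_PATH=./third_party/lib:$LD_LIBRARY_PATH to /etc/profile and sourced it, but running the Python demo under python2 still fails with ImportError: libglog.so.0: cannot open shared object file: No such file or directory. How can I fix this? Thanks!
Also, how can I call Familia from python3?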
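One thing worth checking here: ./third_party/lib is a relative path, and the dynamic loader resolves relative LD_LIBRARY_PATH entries against the current working directory of the process, not against the Familia checkout. The sketch below lists where a library name is actually found for a given search path and working directory; the function name and parameters are illustrative, and it only approximates the loader's lookup.

```python
import os
from typing import List, Optional


def locate_shared_library(
    name: str,
    search_path: Optional[str] = None,
    cwd: Optional[str] = None,
) -> List[str]:
    """Return paths of `name` found in LD_LIBRARY_PATH-style entries."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    if cwd is None:
        cwd = os.getcwd()
    hits: List[str] = []
    for entry in search_path.split(":"):
        if not entry:
            continue  # empty entries are ignored in this sketch
        # Relative entries are resolved against the working directory.
        directory = entry if os.path.isabs(entry) else os.path.join(cwd, entry)
        candidate = os.path.normpath(os.path.join(directory, name))
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    print(locate_shared_library("libglog.so.0"))
```

If the list is empty when the script is run from the demo directory, using an absolute path to third_party/lib in LD_LIBRARY_PATH is a reasonable next step.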

What do the last two columns in the vocabulary file vocab_info.txt mean?

word ** 0 64993051.0 14733379.0
word 公司 1 63616379.0 10172354.0
word 发展 2 47508805.0 12369856.0
word 市场 3 46457573.0 10721555.0
word 工作 4 44673705.0 13794068.0
word 企业 5 41719656.0 9119845.0
word 记者 6 33865364.0 16632006.0
word 服务 7 29299537.0 8780211.0
word 经济 8 28382905.0 8445786.0
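For experimenting with the file, rows like the ones above can be parsed generically. This sketch assumes whitespace-separated fields in the order shown (a literal word tag, the term, an integer ID, and two numeric columns). The last two fields are deliberately given neutral names because their meaning is exactly the open question here; VocabRow and parse_vocab_lines are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class VocabRow(NamedTuple):
    term: str
    term_id: int
    col4: float  # meaning not documented in this thread
    col5: float  # meaning not documented in this thread


def parse_vocab_lines(lines: Iterable[str]) -> List[VocabRow]:
    """Parse 'word <term> <id> <number> <number>' rows; skip anything else."""
    rows: List[VocabRow] = []
    for line in lines:
        fields = line.split()
        # Expect exactly five fields: tag, term, id, value, value.
        if len(fields) != 5 or fields[0] != "word":
            continue
        rows.append(
            VocabRow(fields[1], int(fields[2]), float(fields[3]), float(fields[4]))
        )
    return rows


sample = [
    "word 公司 1 63616379.0 10172354.0",
    "word 发展 2 47508805.0 12369856.0",
]
for row in parse_vocab_lines(sample):
    print(row.term_id, row.term, row.col4, row.col5)
```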

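Extracting topic words in Python

Are there plans for the Python API to expose an interface that takes a topic ID and returns the topic's words? Currently there is only nearest_words_around_topic, but its results are not as good as getting the topic words directly.
Also, when will the e-commerce domain model be released?
Thanks!

Error when running sh build.sh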

build/vose_alias.o build/inference_engine.o build/model.o build/vocab.o build/document.o build/sampler.o build/config.o build/util.o build/semantic_matching.o build/tokenizer.o build/demo/inference_demo.o build/demo/doc_distance_demo.o build/demo/query_doc_sim_demo.o build/demo/word_distance_demo.o build/demo/topic_word_demo.o build/demo/document_keywords_demo.o build/demo/show_topic_demo.o
g++ -pipe -W -Wall -fPIC -std=c++11 -fno-omit-frame-pointer -fpermissive -O3 -ffast-math -I./include/ -I./include/familia -I./third_party/include -I/home/hadoop/anaconda2/include/python2.7 build/demo/inference_demo.o -L/home/hadoop/lg/Familia/third_party/lib -L/home/hadoop/anaconda2/lib -L./build/ -lfamilia -lprotobuf -lglog -lgflags -o inference_demo
./build//libfamilia.a(model.o): In function ‘familia::TopicModel::topic_sum(int) const’:
model.cpp:(.text+0x22a): undefined reference to ‘std::__throw_out_of_range_fmt(char const*, ...)’
./build//libfamilia.a(sampler.o): In function ‘familia::MHSampler::construct_alias_table()’:
sampler.cpp:(.text+0xabe): undefined reference to ‘std::__throw_out_of_range_fmt(char const*, ...)’
./build//libfamilia.a(util.o): In function ‘familia::split(std::vector<std::string, std::allocator<std::string> >&, std::string const&, char)’:
util.cpp:(.text+0x1ec): undefined reference to ‘std::__throw_out_of_range_fmt(char const*, ...)’
/home/hadoop/lg/Familia/third_party/lib/libprotobuf.a(common.o): In function ‘std::vector<void (*)(), std::allocator<void (*)()> >::_M_range_check(unsigned long) const’:
common.cc:(.text._ZNKSt6vectorIPFvvESaIS1_EE14_M_range_checkEm[_ZNKSt6vectorIPFvvESaIS1_EE14_M_range_checkEm]+0x4a): undefined reference to ‘std::__throw_out_of_range_fmt(char const*, ...)’
collect2: error: ld returned 1 exit status
make: *** [all] Error 1

It keeps failing. What is the cause?

Wiki demo values and default_random_engine?

Whether I compile under the Linux Bash on Windows 10 or under Ubuntu, I run run_doc_distance_demo.sh with the following input:
请输入文档1:
在人工智能发展得最为系统化的硅谷,AI工程师们的薪水远高于其他领域的同行。随着人工智能概念的不断深入人心,人工智能的人才愈发的紧俏,时至今日,大学刚毕业的博士也能坐拥八九十万的年薪,与资深的硅谷工程师相媲美。
请输入文档2:
在国内,部分企业早已瞄准人才的短板,走在了业界的前面。百度是最早进行AI的人才培养布局的,他们同国内诸多高校开展合作,共建工程实验室,在数据开放和资源共享上进行各种合作。这种方式类似美国在人工智能教育领域推行的“硅谷-斯坦福”校企联动模式,一方面斯坦福大学为硅谷提供了人才和科研成果,另一方面硅谷为斯坦福大学提供资金支持和大数据,以助力他们的科研能有更大的突破。

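In both cases the result is
Jensen-Shannon Divergence = 0.0533303
Hellinger Distance = 0.241383
which differs from the expected values given in the demo write-up:
Jensen Shannon Divergence = 0.0283928
Hellinger Distance = 0.179698
(run_query_doc_sim_demo.sh also shows discrepancies). Is the demo wrong, or is there a problem with how I built or ran it?
In Visual Studio, default_random_engine turns out to be mt19937, while on Linux it appears to behave like minstd_rand0. Would it be better to name the engine explicitly instead of relying on default_random_engine?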
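The C++ standard leaves default_random_engine implementation-defined, which is consistent with the observation above: sampling-based inference can then produce different numbers on different platforms even with the same seed. To show how simple and distinctive minstd_rand0 is, here is a pure-Python sketch of that engine (a Lehmer generator with multiplier 16807 and modulus 2^31 - 1). The class name is illustrative and this is not Familia code.

```python
class MinStdRand0:
    """Python sketch of C++ std::minstd_rand0: x <- 16807 * x mod (2**31 - 1)."""

    MULTIPLIER = 16807
    MODULUS = 2**31 - 1

    def __init__(self, seed: int = 1) -> None:
        # The C++ engine maps a seed congruent to 0 to 1; mirror that here.
        state = seed % self.MODULUS
        self.state = state if state != 0 else 1

    def next(self) -> int:
        """Advance the state and return the next output."""
        self.state = (self.MULTIPLIER * self.state) % self.MODULUS
        return self.state


engine = MinStdRand0()
for _ in range(9999):
    engine.next()
# The C++ standard requires the 10000th output of a default-constructed
# minstd_rand0 to be 1043618065.
print(engine.next())
```

A Mersenne Twister such as mt19937 yields an entirely different stream from the same seed, so results that depend on sampling are only comparable across platforms if the engine type is fixed explicitly.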

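Error running build.sh

Running build.sh under Cygwin on Windows 10 produces the error shown in the screenshot.
Why does this happen, and how can I fix it? Thanks!!
[Screenshot: _20180412002619]

TWE implementation details and how the values in the model display are computed

The original TWE paper describes three variants. Which one does this package implement?
Also, the TWE model display shows two columns of results. The multinomial distribution results come directly from LDA, but how are the per-topic embedding results computed? According to the paper, the embeddings in TWE simply make use of each word's topic assignment and do not seem to affect the original LDA results. Could you explain further?

flag 'flagfile' was defined more than once

Without exporting the third-party libraries I get the shared library error, but after adding them I run into this problem:
ERROR: flag 'flagfile' was defined more than once (in files 'src/gflag.cc' and ' ')
The two files are an absolute path and a relative path, respectively.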

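Semantic representation demo fails after switching to the webpage model

With the default news model everything runs fine, but switching to webpage triggers an out-of-range error.

Command
./inference_demo --model_dir="./model/webpage" --conf_file="lda.conf"

Output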

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0725 12:37:52.894461 2739172288 inference_engine.cpp:19] Inference Engine initializing...
I0725 12:37:52.895038 2739172288 util.h:58] Loading prototxt: ./model/webpage/lda.conf
I0725 12:37:52.895468 2739172288 model.cpp:31] Loading model: ./model/webpage/webpage_lda.model
I0725 12:37:52.895481 2739172288 model.cpp:32] Loading vocab: ./model/webpage/vocab_info.txt
I0725 12:37:53.145177 2739172288 vocab.cpp:45] Load vocabulary success! #vocabulary size = 283827
I0725 12:37:53.148116 2739172288 model.cpp:47] Loading word topic from ./model/webpage/webpage_lda.model
F0725 12:37:55.325737 2739172288 model.cpp:60] Check failed: term_id < vocab_size() (283827 vs. 283827) Term id out of range!
*** Check failure stack trace: ***
[1]    51265 abort      ./inference_demo --model_dir="./model/webpage" --conf_file="lda.conf"

Environment:
macOS

Python 3 not supported

With python2 and python3 installed side by side, the familia.so produced by build.sh works in python2 but not in python3.

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import familia
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dynamic module does not define module export function (PyInit_familia)

Error when making Familia: This file was generated by an older version of protoc

Familia/third_party/bin/protoc --cpp_out=./src --proto_path=./proto proto/config.proto
mv src/config.pb.h ./include/familia
mv src/config.pb.cc ./src/config.cpp
g++ -I./include/ -I./include/familia -I./third_party/include -I/usr/include/python2.7 -pipe -W -Wall -fPIC -std=c++11 -fno-omit-frame-pointer -fpermissive -O3 -ffast-math -MM -MT build/inference_engine.o src/inference_engine.cpp >build/inference_engine.d
In file included from ./include/familia/inference_engine.h:11:0,
from src/inference_engine.cpp:5:
./include/familia/config.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
#error This file was generated by an older version of protoc which is
^
./include/familia/config.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
#error incompatible with your Protocol Buffer headers. Please
^
./include/familia/config.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
#error regenerate this file with a newer version of protoc.
^
make: *** [build/inference_engine.o] Error 1
