hit-scir / pyltp

This project forked from huangfj/pyltp


pyltp: the python extension for LTP

C++ 56.21% Python 41.32% CMake 2.47%

python chinese-nlp

pyltp's Introduction

pyltp

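pyltp is the Python wrapper for the Language Technology Platform (LTP).

Before using pyltp, take a moment to check whether the Language Technology Platform (LTP) can actually help with your problem.

LTP 4, which is based on PyTorch, has been released, and pyltp will receive only very limited maintenance from now on. Please move to LTP 4.

Supported environments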

Python 2.7, 3.x, and PyPy (PyPy2.7 >= 5.7)

A simple example

The following example uses pyltp for word segmentation:

# -*- coding: utf-8 -*-
from pyltp import Segmentor
segmentor = Segmentor("/path/to/your/cws/model")
words = segmentor.segment("元芳你怎么看")
print("|".join(words))
segmentor.release()

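Besides word segmentation, pyltp also provides part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling.

For detailed usage, see example

Installation

Step 1: install pyltp

Install with pip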
```
 $ pip install pyltp
```
or install from source
```
 $ git clone https://github.com/HIT-SCIR/pyltp
 $ cd pyltp
 $ git submodule init
 $ git submodule update
 $ python setup.py install
```
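- If you hit a version problem on macOS, use MACOSX_DEPLOYMENT_TARGET=10.7 python setup.py install
- Compilation takes a while (about 5 minutes); please be patient

Step 2: download the model files

Qiniu Cloud; the current model version is 3.4.0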

Building a wheel package

git submodule init
git submodule update
python setup.py bdist_wheel

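Version compatibility

pyltp version: 0.4.0
LTP version: 3.4.0
Model version: 3.4.0

Authors

冯云龙 << [email protected] >> 2020-7-30 Rewrote the code and switched to Pybind11
徐梓翔 << [email protected] >> 2015-01-20 Fixed cross-platform issues
刘一佳 << [email protected] >> 2014-06-12 Reorganized the project
HuangFJ << [email protected] >> Original author of this project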


pyltp's Issues

New issue

Word segmentation works for me, but sentence splitting fails:
Boost.Python.ArgumentError: Python argument types in
SentenceSplitter.split(unicode)
did not match C++ signature:
split(class std::basic_string<char,struct std::char_traits,class std::allocator >)

Running example.py prints "Segmentor: Model not loaded!"

Loading cws.model does not seem to have succeeded.

Compilation error with pip on Mac

Error message:

ltp/src/utils/unordered_map.hpp:8:12: fatal error: 'tr1/unordered_map' file not found
#include <tr1/unordered_map>
^
1 error generated.

Installation fails on Mac OS

Installation fails on Mac OS with the following message:

clang: error: invalid deployment target for -stdlib=libc++ (requires OS X 10.7 or later)
clang: error: invalid deployment target for -stdlib=libc++ (requires OS X 10.7 or later)
error: command 'clang++' failed with exit status 1

----------------------------------------

Command "/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -u -c "import setuptools, tokenize;file='/private/var/folders/dq/15h1sz4j16502ttfyfvs09tw0000gn/T/pip-build-sNJWOa/pyltp/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /var/folders/dq/15h1sz4j16502ttfyfvs09tw0000gn/T/pip-kTPFza-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/dq/15h1sz4j16502ttfyfvs09tw0000gn/T/pip-build-sNJWOa/pyltp/

could not process directory files

Dear All,

Thanks for your work. I recently tried to use pyltp to process some new data and found that it no longer works. The program appears to stop while iterating over the second file in the directory. A few months ago I processed more than two thousand files with pyltp.
I have also tried using a file pointer, but still got only the result for the first file. Could you have a look at my rough Python code?
Best regards,
Y.

srllabeller4.txt

Provide a sentence-splitting interface

Word segmentation output is empty

>>> from pyltp import Segmentor
>>> segmentor = Segmentor()
>>> segmentor.load("d:/data/ltp/cws.model")
>>> words = segmentor.segment("元芳你怎么看")
>>> print "|".join(words)

>>>
My environment is 64-bit Windows 7. Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32

Fail to load cws.model

After installing pyltp with the following commands:

$ git clone https://github.com/HIT-SCIR/pyltp
$ git submodule init
$ git submodule update
$ python setup.py install

whether or not I replace ltp_data, I always get the following output and Python crashes:

$ python example.py
Segmentor: Model not loaded!

��P��S��
Python(10114,0x7fff7f118000) malloc: *** error for object 0x10c721bd8: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

After commenting out postagger, parser, recognizer, and labeller, Python no longer crashes, but it still prints "Segmentor: Model not loaded!"

Python process and machine information:

Process: Python [10123]
Path: /usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
Identifier: Python
Version: 2.7.11 (2.7.11)
Code Type: X86-64 (Native)
Parent Process: bash [9210]
Responsible: iTerm [7398]

Model: MacBookPro12,1, BootROM MBP121.0167.B16, 2 processors, Intel Core i7, 3.1 GHz, 16 GB, SMC 2.28f7
Graphics: Intel Iris Graphics 6100, Intel Iris Graphics 6100, Built-In
Memory Module: BANK 0/DIMM0, 8 GB, DDR3, 1867 MHz, 0x80CE, 0x4B3445424533303445422D45474346202020
Memory Module: BANK 1/DIMM0, 8 GB, DDR3, 1867 MHz, 0x80CE, 0x4B3445424533303445422D45474346202020
Thunderbolt Bus: MacBook Pro, Apple Inc., 27.1

windows long_description UnicodeDecodeError

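The problem is described here: https://groups.google.com/forum/#!topic/ltp-cloud/CF4ARAzr9bE

Another user has reported it as well: https://drive.google.com/file/d/0B-oE63u432AkaWltdDhLM3hwdnhUYXNxUTVoSEVFVTRPZHpn/view?usp=sharing

How to load a custom dictionary into pyltp

Sorry to bother you! Can pyltp load my own supplementary word dictionary so that segmentation better fits my use case?

Multiprocess calls to LTP fail

I want to call LTP from Python through a multiprocessing pool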

from multiprocessing import Pool

if __name__ == '__main__':
    p = Pool(int(arguments[u'--count']))
    for page in range(start, end):
        p.apply_async(segement_task, args=(arguments[u'LTP_DATA_MODEL'], page, ))
    print u'Waiting for all subprocesses done...'
    p.close()
    p.join

The models are loaded in the child process with the following code

def segement_task(model_path, page):
    segmentor = Segmentor()
    segmentor.load(os.path.join(model_path, "cws.model"))
    print u'load segmentor'
    postagger = Postagger()
    postagger.load(os.path.join(model_path, "pos.model"))
    print u'load postagger'
    parser = Parser()
    parser.load(os.path.join(model_path, "parser.model"))
    print u'load parser'
    recognizer = NamedEntityRecognizer()
    recognizer.load(os.path.join(model_path, "ner.model"))
    print u'load recognizer'
    labeller = SementicRoleLabeller()
    labeller.load(os.path.join(model_path, "srl/"))
    print u'load labeller'

The console prints only "load segmentor" and then exits

How should I call it from multiple processes?
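
Two details in the snippet above are worth noting: `p.join` is written without parentheses, so the parent process never actually waits for the workers, and `apply_async` discards exceptions raised in a worker unless `.get()` is called on its result. The sketch below shows one possible arrangement, assuming each worker process loads its models once in a pool initializer. `load_models`, `process_page`, and `run_pool` are hypothetical stand-ins (no pyltp calls are made), so only the structure is illustrated.

```python
from multiprocessing import Pool

_models = None  # per-process cache, filled once by init_worker


def load_models(model_dir):
    # Hypothetical stand-in: a real worker would construct and load the
    # pyltp Segmentor, Postagger, etc. from model_dir here.
    return {"model_dir": model_dir}


def init_worker(model_dir):
    # Runs once in every worker process, so models are not reloaded per task.
    global _models
    _models = load_models(model_dir)


def process_page(page):
    # Hypothetical task body; a real one would run the loaded models.
    if _models is None:
        raise RuntimeError("models not loaded in this process")
    return (page, _models["model_dir"])


def run_pool(model_dir, pages, processes=2):
    pool = Pool(processes, initializer=init_worker, initargs=(model_dir,))
    try:
        pending = [pool.apply_async(process_page, (page,)) for page in pages]
        # .get() re-raises any exception that happened inside a worker.
        results = [job.get() for job in pending]
    finally:
        pool.close()
        pool.join()  # note the parentheses: this call actually waits
    return results
```

To try it, call `run_pool("/path/to/ltp_data", range(3))` from under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that spawn fresh interpreter processes.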

pip install pyltp fail.

Anaconda virtualenv, Python 2.7.11

(venv) $ pip install pyltp
Collecting pyltp
  Downloading pyltp-0.1.8.tar.gz (3.4MB)
Installing collected packages: pyltp
  Running setup.py install for pyltp: started
    Running setup.py install for pyltp: finished with status 'error'
    running install
    running build
    running build_ext
    building 'pyltp' extension
    creating build
    creating build/temp.macosx-10.5-x86_64-2.7
    creating build/temp.macosx-10.5-x86_64-2.7/src
    creating build/temp.macosx-10.5-x86_64-2.7/ltp
    creating build/temp.macosx-10.5-x86_64-2.7/ltp/thirdparty
    creating build/temp.macosx-10.5-x86_64-2.7/ltp/thirdparty/boost

    ... ...

    ltp/src/ner/decoder.cpp:27:7: error: use of undeclared identifier 'rep'
          rep.insert(from * T + to);
          ^
    ltp/src/ner/decoder.cpp:38:10: error: use of undeclared identifier 'rep'
      return rep.find(code) != rep.end();
             ^
    ltp/src/ner/decoder.cpp:38:28: error: use of undeclared identifier 'rep'
      return rep.find(code) != rep.end();
                               ^
    In file included from ltp/src/ner/decoder.cpp:1:
    In file included from ltp/src/ner/decoder.h:7:
    ltp/src/utils/smartmap.hpp:72:5: warning: field '_cap_entries' will be initialized after field '_num_buckets' [-Wreorder]
        _cap_entries(INIT_CAP_ENTRIES),
        ^
    ltp/src/utils/smartmap.hpp:589:3: note: in instantiation of member function 'ltp::utility::SmartMap<int, ltp::utility::__Default_CharArray_HashFunction, ltp::utility::__Default_CharArray_EqualFunction>::SmartMap' requested here
      IndexableSmartMap() : entries(0), cap_entries(0) {}
      ^
    ltp/src/utils/smartmap.hpp:75:5: warning: field '_len_key_buffer' will be initialized after field '_hash_buckets' [-Wreorder]
        _len_key_buffer(0),
        ^
    ltp/src/utils/smartmap.hpp:79:5: warning: field '_val_buffer' will be initialized after field '_hash_buckets_volumn' [-Wreorder]
        _val_buffer(0),
        ^
    In file included from ltp/src/ner/decoder.cpp:1:
    In file included from ltp/src/ner/decoder.h:6:
    In file included from ltp/src/framework/decoder.h:5:
    ltp/src/utils/math/sparsevec.h:154:10: warning: private field 'norm' is not used [-Wunused-private-field]
      double norm;
             ^
    12 warnings and 8 errors generated.
    error: command 'gcc' failed with exit status 1

$ gcc -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

release pyltp 0.1.9 to pypi

@endyul

Problem loading a local dictionary

segmentor = Segmentor()
segmentor.load_with_lexicon('E:/code/LTP/ltp_data/cws.model','E:/code/python_code/extractEventsTestApi/searchTrigger/dict/mydict.txt')
mydict.txt is UTF-8 encoded, in the following format:
恒生
农银信用债
and so on
I get "Segmentor: Model not loaded!". What is going on?
I am not sure whether I am using it incorrectly.
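
For reference, a lexicon in the format described above (one word per line, UTF-8) can be produced from Python as in this small sketch. The helper `write_lexicon` and the temporary file location are illustrative only, and the snippet does not call pyltp, so it says nothing about whether the model itself loads. Writing the file this way also avoids the byte order mark that some editors prepend, which is one thing worth ruling out; that is a guess, not a confirmed cause.

```python
import io
import os
import tempfile


def write_lexicon(path, words):
    # One entry per line, UTF-8 encoded, skipping blank entries.
    with io.open(path, "w", encoding="utf-8") as handle:
        for word in words:
            word = word.strip()
            if word:
                handle.write(word + u"\n")


lexicon_path = os.path.join(tempfile.gettempdir(), "mydict.txt")
write_lexicon(lexicon_path, [u"恒生", u"农银信用债"])

# Read the file back to confirm what was written.
with io.open(lexicon_path, encoding="utf-8") as handle:
    entries = handle.read().splitlines()
print(entries)
```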

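Installation failed, could you take a look?

Python 2.7.11 on Mac and Python 2.7.6 on Linux (Ubuntu): both installation methods failed on both systems, and the errors all seem to be compiler problems. Any guidance would be appreciated!

Can pyltp run on its own once installed?

On Windows 7 I installed pyltp successfully with pip alone, and after commenting out the line that loads the lib file, example.py ran successfully. My question: with only pyltp installed, can all LTP features run correctly?

Indexing in dependency parsing

In pyltp, arc.head starts from 0 (0 denotes the root), but semantic role labeling and LTP itself both seem to use -1 for the root.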
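
If the two conventions need to be reconciled on the caller's side, the conversion is a single subtraction. A minimal sketch, assuming (as described above) that pyltp reports 1-based head positions with 0 for the root, and that the target convention is 0-based with -1 for the root. `to_zero_based_heads` is a hypothetical helper name and the head values are sample data.

```python
def to_zero_based_heads(heads):
    # pyltp style: 0 = root, k = the k-th word (1-based).
    # Target style: -1 = root, k = the word at list index k (0-based).
    return [head - 1 for head in heads]


pyltp_heads = [3, 3, 6, 5, 3, 0, 6]
print(to_zero_based_heads(pyltp_heads))
```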

pyltp 0.1.9.1 conflict with anaconda 4.2.0

I installed pyltp with pip in an Anaconda 4.2.0 environment on Ubuntu 16.04 LTS, and I get a problem when I run import pyltp:

... undefined symbol: _ZTVNSt7__cxx1119basic_istringstreamIcSt11char_traitsIcESaIcEEE

Then I installed from source and still had the same problem. I tried sudo python setup.py install and added the package path to sys.path like this:

import sys
sys.path.append('/usr/local/lib/python2.7/dist-packages/pyltp-0.1.9.1-py2.7-linux-x86_64.egg')
import pyltp

The error message becomes:

 /home/?/anaconda2/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /usr/local/lib/python2.7/dist-packages/pyltp-0.1.9.1-py2.7-linux-x86_64.egg/pyltp.so)

So I think the problem comes from the libstdc++ in the Anaconda 4.2.0 environment. I ran strings /home/ldy/anaconda2/bin/../lib/libstdc++.so.6 | grep GLIBCXX

GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_FORCE_NEW
GLIBCXX_DEBUG_MESSAGE_LENGTH

There is no GLIBCXX_3.4.20 in the Anaconda 4.2.0 libstdc++.

Solution:
My solution is simple:

cd ~/anaconda2/lib
rm libstdc++.so.6.0.19
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6 libstdc++.so.6.0.19

Check with strings libstdc++.so.6 | grep GLIBCXX and you will find the following:

...
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_3.4.21
GLIBCXX_3.4.22
GLIBCXX_DEBUG_MESSAGE_LENGTH

I am writing this down to help other people who run into the same problem.
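
The same check can be scripted. The sketch below assumes the output of `strings libstdc++.so.6 | grep GLIBCXX` has already been captured as text, and picks out the highest GLIBCXX version so it can be compared with the one named in the error message. `max_glibcxx` is a hypothetical helper and `sample` is a shortened excerpt of the listing above.

```python
import re


def max_glibcxx(strings_output):
    # Collect every GLIBCXX_x.y[.z] tag and compare them numerically,
    # so that 3.4.20 sorts above 3.4.9.
    versions = []
    for match in re.finditer(r"GLIBCXX_(\d+(?:\.\d+)*)", strings_output):
        versions.append(tuple(int(part) for part in match.group(1).split(".")))
    return max(versions) if versions else None


sample = "GLIBCXX_3.4\nGLIBCXX_3.4.9\nGLIBCXX_3.4.19\nGLIBCXX_FORCE_NEW\n"
newest = max_glibcxx(sample)
# The second value tells whether the library satisfies a 3.4.20 requirement.
print(newest, newest >= (3, 4, 20))
```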

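Loading models from Python fails.

Environment: Windows 10, Python 2.7

segmentor = Segmentor()
segmentor.load('E:/LTP/3.3.0/ltp_data/cws.model')

postagger = Postagger()
postagger.load('E:/LTP/3.3.0/ltp_data/pos.model')

parser = Parser()
parser.load('E:/LTP/3.3.0/ltp_data/parser.model')

When I run the code above, the program fails with this error:
Process finished with exit code -1073741819 (0xC0000005)

I have searched around and really cannot find where the error comes from. I installed through Anaconda; version 0.1.8, which I used before, worked fine. A couple of days ago I switched to a new machine and installed with pip install. I just checked, and the installed version is: Anaconda2\Lib\site-packages\pyltp-0.1.9.1.dist-info.

What could be the cause? Thanks!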

The following error occurs when running on Mac

Traceback (most recent call last):
File "example/example.py", line 11, in
from pyltp import Segmentor, Postagger, Parser, NamedEntityRecognizer, SementicRoleLabeller
ImportError: dlopen(/Library/Python/2.7/site-packages/pyltp-0.1.3-py2.7-macosx-10.10-intel.egg/pyltp.so, 2): Symbol not found: __ZNSbIwSt11char_traitsIwESaIwEE4_Rep11_S_terminalE
Referenced from: /Library/Python/2.7/site-packages/pyltp-0.1.3-py2.7-macosx-10.10-intel.egg/pyltp.so
Expected in: flat namespace
in /Library/Python/2.7/site-packages/pyltp-0.1.3-py2.7-macosx-10.10-intel.egg/pyltp.so

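How should I read the dependency parse result?

I got the result below from the example, but how should the dependency parse on the third line be read? Could someone explain each item?
For example:
**, 3:ATT: does this mean that '**' and '与' are in an ATT relation, where ATT means attribute (modifier-head)? ** <- 与?
与, 5:LAD: does this mean that '与' and '加强' are in an LAD relation, where LAD is the left adjunct relation? 与 <- 加强?

Is my understanding right? The result does not look quite right to me, though: shouldn't '与' be the left adjunct of '**银行'?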

** 进出口银行与 **银行加强合作
ns v n c ni v v
3:ATT 3:ATT 6:SBV 5:LAD 3:COO 0:HED 6:VOB
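
Going by the convention described elsewhere in these issues (head 0 is the root, other heads are 1-based word positions), each `head:relation` pair attaches the word at that position in the list to the word numbered `head`. The sketch below pairs the words and arcs from the output above under that assumption. `describe_arcs` is a hypothetical helper, the split into seven words is inferred from the seven tags, and the `**` placeholders are kept exactly as they appear in the post.

```python
def describe_arcs(words, arcs):
    # arcs: "head:relation" strings, one per word; head 0 means the root.
    lines = []
    for word, arc in zip(words, arcs):
        head, relation = arc.split(":")
        head = int(head)
        parent = "ROOT" if head == 0 else words[head - 1]
        lines.append("%s --%s--> %s" % (word, relation, parent))
    return lines


words = ["**", "进出口", "银行", "与", "**银行", "加强", "合作"]
arcs = "3:ATT 3:ATT 6:SBV 5:LAD 3:COO 0:HED 6:VOB".split()
for line in describe_arcs(words, arcs):
    print(line)
```

Read this way, 与 attaches to word 5 (`**银行`) as LAD, which is the left-adjunct attachment the poster expected, and 加强 is the root of the sentence.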

Several features return empty results, but no error is raised.

Named entity recognition and semantic role labeling return empty results when called. Word segmentation, part-of-speech tagging, and dependency parsing still work.
I am using the Spyder editor that ships with Anaconda, with Python 3.5.

from pyltp import Parser,Postagger,Segmentor,NamedEntityRecognizer,SementicRoleLabeller

segmentor = Segmentor() # word segmentation
segmentor.load('E:\ltp-data-v3.3.1(1)\ltp_data\cws.model') # load the model
words = segmentor.segment('运用几个功能返回结果为空，但是并未报错。') # segment
print ('\t'.join(words))
segmentor.release() # release the model

postagger=Postagger() # part-of-speech tagging
postagger.load('E:\ltp-data-v3.3.1(1)\ltp_data\pos.model')
postages=postagger.postag(words)
print( '\t'.join(postages))
postagger.release()

parser = Parser() # dependency parsing
parser.load('E:\ltp-data-v3.3.1(1)\ltp_data\parser.model') # load the model
arcs = parser.parse(words, postages) # parse
print ("\t".join("%d:%s" % (arc.head, arc.relation) for arc in arcs))
parser.release() # release the model

# named entity recognition

recognizer = NamedEntityRecognizer() # named entity recognition
recognizer.load('E:\ltp-data-v3.3.1(1)\ltp_data\ner.model') # load the model
netags = recognizer.recognize(words, postages) # recognize named entities
print ('\t'.join(netags))
recognizer.release()

labeller = SementicRoleLabeller() # semantic role labeling
labeller.load('E:\ltp-data-v3.3.1(1)\ltp_data\srl') # load the model
roles = labeller.label(words, postages, netags, arcs) # label semantic roles
for role in roles:
    print (role.index, "".join(
        ["%s:(%d,%d)" % (arg.name, arg.range.start, arg.range.end) for arg in role.arguments]))
labeller.release() # release the model

The output is:
运用几个功能返回结果为空，但是并未报错。
v m q n v n v a wp c d d v wp
0:HED 3:ATT 4:ATT 1:VOB 1:COO 5:VOB 5:COO 7:VOB 1:WP 13:ADV 13:ADV 13:ADV 1:COO 1:WP
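
One thing worth checking in the script above, offered as a possible cause rather than a confirmed diagnosis: the model paths are ordinary (non-raw) string literals, and in the path ending in `\ltp_data\ner.model` the sequence `\n` is a newline escape, so that particular string no longer names the file. The paths in this sketch are shortened, made-up examples.

```python
import os

broken = "ltp_data\ner.model"    # "\n" is parsed as a newline character
raw = r"ltp_data\ner.model"      # a raw string keeps the backslash
joined = os.path.join("ltp_data", "ner.model")  # separator chosen by the OS

print(repr(broken))
print(repr(raw))
print("\n" in broken, "\n" in raw, "\n" in joined)
```

Raw strings, forward slashes, or `os.path.join` all avoid the problem; other escapes such as `\t` behave the same way as `\n`.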

Running the example from README.rst fails: Segmentor: Model not loaded!

# -*- coding: utf-8 -*-

import os
from pyltp import Segmentor
segmentor = Segmentor()
filename = r'/home/jingjin/code/py_code/ltp_data/cws.model'
if os.path.exists(filename):
    print 'cws.model exists'
else:
    print 'sorry'
segmentor.load(filename)
words = segmentor.segment("元芳你怎么看")
print "|".join(words)

Output:
cws.model exists
Segmentor: Model not loaded!

Environment:
Ubuntu 14.04 LTS
32-bit
Python 2.7.6

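Tracing paths in the dependency parse result

Given the result above, does anyone know how to find the word that '手机' (phone) corresponds to, i.e. '有瑕疵' (is defective)? Which paths should I start from?

Thank you very much!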

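How do I get the index of a word in the segmentation result?

words = segmentor.segment(sentence)

In Python, words is of type VectorOfString. For example, I want to get the position of the word '**'; how do I do that? Mainly I want to figure out how to work with this type.

For example, if it were a list, I could get the index like this:
words.index('**')

I just tried it and got an error.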
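
A workaround that does not depend on the container type: since the result can be iterated (the README example passes it to `"|".join`), it can be copied into a plain list or scanned with `enumerate`. A minimal sketch with a tuple standing in for the segmenter output; `word_positions` is a hypothetical helper and the sample segmentation is invented for illustration.

```python
def word_positions(words, target):
    # Works for any iterable of strings, not only list.
    return [index for index, word in enumerate(words) if word == target]


segmented = ("元芳", "你", "怎么", "看")  # stand-in for segmentor.segment(...)
print(word_positions(segmented, "怎么"))
print(list(segmented).index("你"))
```

`word_positions` returns every matching index (an empty list when the word is absent), whereas `list(...).index(...)` returns only the first and raises `ValueError` if there is none.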

pyltp cannot be installed

clang: error: invalid deployment target for -stdlib=libc++ (requires OS X 10.7 or later)
error: command 'clang++' failed with exit status 1

Failed building wheel for pyltp
Running setup.py clean for pyltp
Failed to build pyltp
Installing collected packages: pyltp
Running setup.py install for pyltp ... error
Complete output from command /Users/wangshang1011/anaconda/bin/python -u -c "import setuptools, tokenize;file='/private/var/folders/0y/9qm2kz4s1jz3ghpbhqt6chg00000gn/T/pip-build-oagyNZ/pyltp/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /var/folders/0y/9qm2kz4s1jz3ghpbhqt6chg00000gn/T/pip-Wljke6-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_ext
building 'pyltp' extension
creating build
creating build/temp.macosx-10.6-x86_64-2.7
creating build/temp.macosx-10.6-x86_64-2.7/src
creating build/temp.macosx-10.6-x86_64-2.7/ltp
creating build/temp.macosx-10.6-x86_64-2.7/ltp/thirdparty
creating build/temp.macosx-10.6-x86_64-2.7/ltp/thirdparty/boost
creating build/temp.macosx-10.6-x86_64-2.7/ltp/thirdparty/boost/libs
creating build/temp.macosx-10.6-x86_64-2.7/ltp/thirdparty/boost/libs/regex
creating build/temp.macosx-10.6-x86_64-2.7/ltp/thirdparty/boost/libs/regex/src
creating build/temp.macosx-10.6-x86_64-2.7/ltp/thirdparty/maxent
creating build/temp.macosx-10.6-x86_64-2.7/ltp/src
creating build/temp.macosx-10.6-x86_64-2.7/ltp/src/splitsnt
creating build/temp.macosx-10.6-x86_64-2.7/ltp/src/segmentor
creating build/temp.macosx-10.6-x86_64-2.7/ltp/src/postagger
creating build/temp.macosx-10.6-x86_64-2.7/ltp/src/ner
creating build/temp.macosx-10.6-x86_64-2.7/ltp/src/parser.n
creating build/temp.macosx-10.6-x86_64-2.7/ltp/src/srl
creating build/temp.macosx-10.6-x86_64-2.7/patch
creating build/temp.macosx-10.6-x86_64-2.7/patch/libs
creating build/temp.macosx-10.6-x86_64-2.7/patch/libs/python
creating build/temp.macosx-10.6-x86_64-2.7/patch/libs/python/src
creating build/temp.macosx-10.6-x86_64-2.7/patch/libs/python/src/object
creating build/temp.macosx-10.6-x86_64-2.7/patch/libs/python/src/converter
clang++ -fno-strict-aliasing -I/Users/wangshang1011/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Iltp/include/ -Iltp/thirdparty/boost/include/ -Iltp/thirdparty/eigen-3.2.4 -Iltp/thirdparty/maxent/ -Iltp/src/ -Iltp/src/splitsnt -Iltp/src/segmentor/ -Iltp/src/postagger/ -Iltp/src/ner/ -Iltp/src/parser.n/ -Iltp/src/srl/ -Iltp/src/utils/ -Iltp/src/srl/ -Ipatch/include/ -I/Users/wangshang1011/anaconda/include/python2.7 -c src/pyltp.cpp -o build/temp.macosx-10.6-x86_64-2.7/src/pyltp.o -std=c++11 -Wno-c++11-narrowing -stdlib=libc++
clang: error: invalid deployment target for -stdlib=libc++ (requires OS X 10.7 or later)
error: command 'clang++' failed with exit status 1

Garbled output on Python 2.7

Here is an example:

sentence = "**进出口银行与**银行加强合作"
segmentor = Segmentor()
segmentor.load(os.path.join(MODELDIR, "cws.model"))
words = segmentor.segment(sentence)
print "\t".join(words)

The output is: 涓浗杩涘嚭鍙� 閾惰 涓� 涓浗閾惰 鍔犲己鍚堜綔

If I use:
sentence = u"**进出口银行与**银行加强合作"

the program raises an error instead.

How should this be used on 2.7? Thanks!
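
The garbled text looks like UTF-8 bytes being displayed in a different encoding such as GBK, which is common on a Chinese-locale Windows console. This is an interpretation of the symptom, not something confirmed by the maintainers. The sketch below reproduces the effect with a fragment of the sentence from the post and shows that decoding the same bytes as UTF-8 restores the text.

```python
text = u"银行加强合作"
utf8_bytes = text.encode("utf-8")

# Interpreting UTF-8 bytes as GBK yields unrelated characters (mojibake).
misread = utf8_bytes.decode("gbk", errors="replace")

# Decoding with the encoding that produced the bytes recovers the text.
restored = utf8_bytes.decode("utf-8")

print(misread)
print(restored)
```

On Python 2, the corresponding step would presumably be to call `.decode('utf-8')` on each returned byte string before printing, so that the interpreter can re-encode it for the console.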

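Update the documentation

Detailed text documentation
The release method
External dictionaries
Add model download instructions to the README

Character encoding: could pyltp take care of the encoding logic, so that both input and output can be unicode strings?

I need to plug pyltp into scikit-learn's CountVectorizer as the tokenizer, and inside that call path the UTF-8 strings returned by pyltp cannot be decoded with decode('utf8'). I hope the encoding conversion can be built into pyltp, that is, the encode('utf8') and decode('utf8') handling would happen inside pyltp. That would be a great convenience. Thanks!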
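
Until something like this is built in, the conversion can live in a thin wrapper on the caller's side. The sketch below assumes a segmenter that takes and returns UTF-8 byte strings, as the post describes for Python 2. `fake_segment` is a hypothetical stand-in that just splits on spaces, and `unicode_tokenizer` is an invented name, not part of pyltp.

```python
def unicode_tokenizer(segment_bytes):
    # Wrap a bytes-in/bytes-out segmenter so callers only see text strings.
    def tokenize(text):
        encoded = text.encode("utf-8")
        return [token.decode("utf-8") for token in segment_bytes(encoded)]
    return tokenize


def fake_segment(sentence_bytes):
    # Hypothetical stand-in for a real segmenter call.
    return sentence_bytes.split(b" ")


tokenize = unicode_tokenizer(fake_segment)
print(tokenize(u"加强 合作"))
```

The resulting `tokenize` callable takes text and returns a list of text tokens, which is the shape scikit-learn's `CountVectorizer(tokenizer=...)` expects.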

Word segmentation fails

Installing pyltp with pip fails on Python 2.7.6

In file included from ltp/src/ner/decoder.cpp:1:
In file included from ltp/src/ner/decoder.h:8:
ltp/src/utils/unordered_set.hpp:75:27: error: redefinition of '__gnu_cxx::hash<unsigned long long>'
        template<> struct hash<unsigned long long> {
                          ^~~~~~~~~~~~~~~~~~~~~~~~
ltp/src/utils/unordered_map.hpp:75:27: note: previous definition is here
        template<> struct hash<unsigned long long> {
                          ^
In file included from ltp/src/ner/decoder.cpp:1:
In file included from ltp/src/ner/decoder.h:8:
ltp/src/utils/unordered_set.hpp:80:37: error: redefinition of 'hash<type-parameter-0-0 *>'
        template<typename T> struct hash<T *> {
                                    ^~~~~~~~~
ltp/src/utils/unordered_map.hpp:80:37: note: previous definition is here
        template<typename T> struct hash<T *> {
                                    ^
In file included from ltp/src/ner/decoder.cpp:1:
In file included from ltp/src/ner/decoder.h:8:
ltp/src/utils/unordered_set.hpp:85:27: error: redefinition of '__gnu_cxx::hash<std::string>'
        template<> struct hash<std::string> {
                          ^~~~~~~~~~~~~~~~~
ltp/src/utils/unordered_map.hpp:85:27: note: previous definition is here
        template<> struct hash<std::string> {
                          ^
In file included from ltp/src/ner/decoder.cpp:1:
ltp/src/ner/decoder.h:15:8: error: no template named 'unordered_set' in namespace 'std'; did you mean 'unordered_map'?
  std::unordered_set<size_t> rep;
  ~~~~~^~~~~~~~~~~~~
       unordered_map
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__hash_table:217:85: note: 'unordered_map' declared here
    template <class, class, class, class, class> friend class _LIBCPP_TYPE_VIS_ONLY unordered_map;
                                                                                    ^
In file included from ltp/src/ner/decoder.cpp:1:
ltp/src/ner/decoder.h:15:8: error: too few template arguments for class template 'unordered_map'
  std::unordered_set<size_t> rep;
       ^

(a pile of warnings and notes omitted here...)

56 warnings and 5 errors generated.
error: command 'cc' failed with exit status 1
Complete output from command /usr/bin/python -c "import setuptools, tokenize;__file__='/private/tmp/pip_build_root/pyltp/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().

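About a segmentation problem

For example, for 阿里联手工商破获重庆亿元刷单案 the segmentation result is 阿里|联手|工商|破获|重庆亿|元|刷单|案. How can I fix the incorrect merge "重庆亿"? I added a custom dictionary, but the problem still occurs.

Compilation fails under Cygwin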

Env: Win7 + cygwin x64
Python: [GCC 4.8.3] on cygwin
GCC: 4.9.2

Both the pip install and the development-version install fail to compile, with the following error:
running build
running build_ext
building 'pyltp' extension
gcc -fno-strict-aliasing -ggdb -O2 -pipe -Wimplicit-function-declaration -fdebug-prefix-map=/usr/src/ports/python/python-2.7.8-1.x86_64/build=/usr/src/debug/python-2.7.8-1 -fdebug-prefix-map=/usr/src/ports/python/python-2.7.8-1.x86_64/src/Python-2.7.8=/usr/src/debug/python-2.7.8-1 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Iltp/include/ -Iltp/thirdparty/boost/include/ -Iltp/thirdparty/maxent/ -Iltp/src/ -Iltp/src/segmentor/ -Iltp/src/postagger/ -Iltp/src/ner/ -Iltp/src/parser -Iltp/src/srl/ -Iltp/src/utils/ -Iltp/src/__util/ -Iltp/src/srl/ -Ipatch/include/ -I/usr/include/python2.7 -c src/pyltp.cpp -o build/temp.cygwin-1.7.35-x86_64-2.7/src/pyltp.o
cc1plus: warning: command line option ‘-Wimplicit-function-declaration’ is valid for C/ObjC but not for C++
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from patch/include/boost/python/detail/prefix.hpp:13:0,
from patch/include/boost/python/args.hpp:8,
from patch/include/boost/python.hpp:11,
from src/pyltp.cpp:14:
patch/include/boost/python/detail/wrap_python.hpp:88:0: warning: "SIZEOF_LONG" redefined

# define SIZEOF_LONG 4

^
In file included from patch/include/boost/python/detail/wrap_python.hpp:50:0,
from patch/include/boost/python/detail/prefix.hpp:13,
from patch/include/boost/python/args.hpp:8,
from patch/include/boost/python.hpp:11,
from src/pyltp.cpp:14:
/usr/include/python2.7/pyconfig.h:1013:0: note: this is the location of the previous definition
#define SIZEOF_LONG 8
^
In file included from /usr/include/python2.7/Python.h:58:0,
from patch/include/boost/python/detail/wrap_python.hpp:142,
from patch/include/boost/python/detail/prefix.hpp:13,
from patch/include/boost/python/args.hpp:8,
from patch/include/boost/python.hpp:11,
from src/pyltp.cpp:14:
/usr/include/python2.7/pyport.h:886:2: error: #error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
#error "LONG_BIT definition appears wrong for platform (bad gcc/glibc config?)."
^
In file included from patch/include/boost/python/object/make_instance.hpp:9:0,
from patch/include/boost/python/object/make_ptr_instance.hpp:8,
from patch/include/boost/python/to_python_indirect.hpp:11,
from patch/include/boost/python/converter/arg_to_python.hpp:10,
from patch/include/boost/python/call.hpp:15,
from patch/include/boost/python/object_core.hpp:14,
from patch/include/boost/python/args.hpp:25,
from patch/include/boost/python.hpp:11,
from src/pyltp.cpp:14:
patch/include/boost/python/object/instance.hpp:14:36: warning: type attributes ignored after type is already defined [-Wattributes]
struct BOOST_PYTHON_DECL_FORWARD instance_holder;
^
error: command 'gcc' failed with exit status 1

How can I fix this? Thanks.

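The Python version cannot segment "元芳你怎么看"

Hi, I ran the homepage demo "元芳你怎么看" to check segmentation quality, using ltp_data/cws.model, and found that the words are not split: the whole sentence comes back as one piece. Why is that?

Also, "**进出口银行与**银行加强合作。" in example.py is segmented correctly, but when I replace it with "元芳你怎么看", it is not split either.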

Python2.7, gcc 4.4.6

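"Command terminated" aborts

A single sentence whose length is 0 or greater than 1024 aborts with "Command terminated". I suggest mentioning this limit in the documentation to make the error easier to locate.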
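
Until the limit is documented, callers can screen sentences themselves before handing them to the library. A minimal sketch, assuming the limit reported above (empty input, or more than 1024) is counted in characters; the post does not say whether characters or bytes are meant, and `is_safe_sentence` is a hypothetical helper.

```python
MAX_SENTENCE_LENGTH = 1024  # limit reported in the issue above


def is_safe_sentence(sentence, max_length=MAX_SENTENCE_LENGTH):
    # Reject empty input and anything longer than the reported limit.
    return 0 < len(sentence) <= max_length


candidates = [u"", u"元芳你怎么看", u"看" * 1025]
print([is_safe_sentence(s) for s in candidates])
```

If the limit turns out to be in bytes, the same check can be applied to `sentence.encode("utf-8")` instead.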

A problem when using pyltp after installing it as documented

Error message: Boost.Python.ArgumentError: Python argument types in
Segmentor.segment(Segmentor, unicode)
did not match C++ signature:
segment(struct Segmentor {lvalue}, class std::basic_string<char,struct std::char_traits,class std::allocator >)
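Could you explain what is going on?

Personalized segmentation interface

Is multithreading supported?

If so, where do I configure it? The ltp command line seems to provide it?

Dependency parsing result differs from the Language Cloud; the model is version 3.3

Does this Python interface produce the same analysis as the Language Cloud when the 3.3 model is used? Sorry to bother you, and thanks!

update test model to 3.3.1

On Mac OS the install reports success, but from pyltp import Segmentor says no module named pyltp

On Mac OS, pip install succeeded, but pyltp cannot be imported in my program. I also tried installing from git, and it still does not work. Why is that?

Custom dictionary interface for segmentation and part-of-speech tagging

Provide readthedocs documentation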

upgrade to ltp 3.3.1

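Error installing pyltp on Python 3, help wanted

Environment: Ubuntu 14.04 x64, download
The system default python is python2.7
cmake was also installed with apt

I installed python3 and python3-dev and changed the default python: under /usr/bin, I copied python3 and renamed the copy to python, and copied python3-config and renamed the copy to python-config
1. I tested the following commands, and all of them match the description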
python --version
Python 3.4.0

python-config --includes
-I/usr/include/python3.4m -I/usr/include/python3.4m

2. Running cmake -DLTP_HOME=/home/he/ltp-3.1.2 fails with the following message

CMake Error at /usr/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:108 (message):
Could NOT find PythonLibs (missing: PYTHON_LIBRARIES PYTHON_INCLUDE_DIRS)
Call Stack (most recent call first):
/usr/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:315 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-2.8/Modules/FindPythonLibs.cmake:208 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
boost_python/CMakeLists.txt:1 (find_package)

How can I fix this? Thanks

Both pip and source builds fail on Mac.

running install
running bdist_egg
running egg_info
creating pyltp.egg-info
writing pyltp.egg-info/PKG-INFO
writing top-level names to pyltp.egg-info/top_level.txt
writing dependency_links to pyltp.egg-info/dependency_links.txt
writing manifest file 'pyltp.egg-info/SOURCES.txt'
reading manifest file 'pyltp.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '.hpp' under directory 'ltp/src/framework'
warning: no files found matching '.hpp' under directory 'ltp/src/segmentor'
warning: no files found matching '.hpp' under directory 'ltp/src/postagger'
warning: no files found matching '.hpp' under directory 'ltp/src/ner'
warning: no files found matching '.hpp' under directory 'ltp/src/parser.n'
warning: no files found matching '.hpp' under directory 'ltp/src/srl'
writing manifest file 'pyltp.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.6-intel/egg
running install_lib
running build_ext
building 'pyltp' extension
creating build
creating build/temp.macosx-10.6-intel-2.7
creating build/temp.macosx-10.6-intel-2.7/src
creating build/temp.macosx-10.6-intel-2.7/ltp
creating build/temp.macosx-10.6-intel-2.7/ltp/thirdparty
creating build/temp.macosx-10.6-intel-2.7/ltp/thirdparty/boost
creating build/temp.macosx-10.6-intel-2.7/ltp/thirdparty/boost/libs
creating build/temp.macosx-10.6-intel-2.7/ltp/thirdparty/boost/libs/regex
creating build/temp.macosx-10.6-intel-2.7/ltp/thirdparty/boost/libs/regex/src
creating build/temp.macosx-10.6-intel-2.7/ltp/thirdparty/maxent
creating build/temp.macosx-10.6-intel-2.7/ltp/src
creating build/temp.macosx-10.6-intel-2.7/ltp/src/splitsnt
creating build/temp.macosx-10.6-intel-2.7/ltp/src/segmentor
creating build/temp.macosx-10.6-intel-2.7/ltp/src/postagger
creating build/temp.macosx-10.6-intel-2.7/ltp/src/ner
creating build/temp.macosx-10.6-intel-2.7/ltp/src/parser.n
creating build/temp.macosx-10.6-intel-2.7/ltp/src/srl
creating build/temp.macosx-10.6-intel-2.7/patch
creating build/temp.macosx-10.6-intel-2.7/patch/libs
creating build/temp.macosx-10.6-intel-2.7/patch/libs/python
creating build/temp.macosx-10.6-intel-2.7/patch/libs/python/src
creating build/temp.macosx-10.6-intel-2.7/patch/libs/python/src/object
creating build/temp.macosx-10.6-intel-2.7/patch/libs/python/src/converter
clang++ -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Iltp/include/ -Iltp/thirdparty/boost/include/ -Iltp/thirdparty/eigen-3.2.4 -Iltp/thirdparty/maxent/ -Iltp/src/ -Iltp/src/splitsnt -Iltp/src/segmentor/ -Iltp/src/postagger/ -Iltp/src/ner/ -Iltp/src/parser.n/ -Iltp/src/srl/ -Iltp/src/utils/ -Iltp/src/srl/ -Ipatch/include/ -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/pyltp.cpp -o build/temp.macosx-10.6-intel-2.7/src/pyltp.o -std=c++11 -Wno-c++11-narrowing -stdlib=libc++
clang: error: invalid deployment target for -stdlib=libc++ (requires OS X 10.7 or later)
clang: error: invalid deployment target for -stdlib=libc++ (requires OS X 10.7 or later)
error: command 'clang++' failed with exit status 1

System Version: macOS 10.12 (16A323)
Kernel Version: Darwin 16.0.0

windows ci failed

https://ci.appveyor.com/project/Oneplus/ltp4j/build/14

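The vc9 (Python 2.7) build environment does not have cstdint.

One solution is to change every cstdint in ltp to boost/cstdint.hpp
Another solution is to add a cstdint patch to pyltp

@endyul I prefer the first option.

Postagger.postag cannot accept a Python list of str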

In [1]: from pyltp import Postagger
In [2]: postagger = Postagger()
In [3]: postagger.load("/data/ltp/ltp-models/3.2.0-server/ltp_data/pos.model")
In [4]: postagger.postag(["A", "B", "C"])
---------------------------------------------------------------------------
ArgumentError                             Traceback (most recent call last)
<ipython-input-4-8af9244afe40> in <module>()
----> 1 postagger.postag(["A", "B", "C"])

ArgumentError: Python argument types in
    Postagger.postag(Postagger, list)
did not match C++ signature:
    postag(Postagger {lvalue}, std::vector<std::string, std::allocator<std::string> >)

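The main reason is that Boost.Python did not provide an interface for a Python list of str when we wrote the wrapper.

Can ltp use a custom vocabulary?

Some segmentations are wrong because they differ from our own requirements. Can a custom vocabulary be added so that segmentation suits a particular domain better? If so, how do I add one? Thanks!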

Author in setup.py

梓翔, please add yourself 😄