bigdata-ustc / edunlp

A library for advanced Natural Language Processing towards multi-modal educational items.

License: Apache License 2.0

Languages: Python 99.88%, Makefile 0.12%

Topics: python deep-learning machine-learning word2vec doc2vec tokenizer katex formula natural-language-processing

edunlp's Introduction

EduNLP


EduNLP is a library for advanced Natural Language Processing in Python and is one of the projects of the EduX plan of BDAA. It is built on the very latest research and was designed from day one to be used in real educational products.

EduNLP now comes with pretrained pipelines and currently supports segmentation, tokenization, and vectorization. It supports a variety of preprocessing steps for NLP in educational scenarios, such as formula parsing and multi-modal segmentation.

EduNLP is commercial open-source software, released under the Apache-2.0 license.

Quickstart

Installation

Clone the repository and install with pip:

# basic installation
pip install .

# full installation
pip install .[full]

or install from pypi:

# basic installation
pip install EduNLP

# full installation
pip install EduNLP[full]

Usage

from EduNLP import get_pretrained_i2v

# Download (on first call) and load a pretrained item-to-vector model.
i2v = get_pretrained_i2v("d2v_all_300", "./model")
# Returns one vector per item plus per-token vectors.
item_vector, token_vector = i2v(["the content of item 1", "the content of item 2"])
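
Tokenization can also be used on its own. A minimal sketch using PureTextTokenizer (the same interface that appears in the issues below); the example item is illustrative:

from EduNLP.Tokenizer import PureTextTokenizer

tokenizer = PureTextTokenizer()
items = ["如图,若$x,y$满足约束条件,则$z=x+7 y$的最大值为$\\SIFBlank$"]
# The tokenizer is a generator: one token sequence per input item.
tokens = next(tokenizer(items))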

Tutorial

For more details, please refer to the full documentation (latest | stable).

Resource

We will continuously publish new datasets in the Standard Item Format (SIF) to encourage related research. The data resources can be accessed via another EduX project, EduData.

Contribute

EduNLP is still under development. More algorithms and features will be added, and we always welcome contributions to help make EduNLP better. If you would like to contribute, please follow this guideline (Development Guide).

Citation

If this repository is helpful to you, please cite our work:

@misc{bigdata2021edunlp,
  title        = {EduNLP},
  author       = {bigdata-ustc},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  year         = {2021},
  howpublished = {\url{https://github.com/bigdata-ustc/EduNLP}},
}

edunlp's People

Contributors

baooooom, fannazya, karin0018, kenelmqlh, nnnyt, pingzhili, shangzixue, tswsxk, wintermelon008


edunlp's Issues

Download Pretrained Model

When I use the model stored at http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip, I download it via the network interface EduNLP provides. However, this did not work until I read the source code and downloaded the model directly with wget http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip. The main error information is:

zipfile.BadZipFile: Bad magic number for file header

(The full traceback is identical to the one in the "Download Model Error" issue below.)

[Bug] BertModel `add_special_tokens` and `resize_token_embeddings` problem

🐛 Description

@nnnyt @KenelmQLH
After I added TALEduBERT to our project and ran some tests, I found that the current get_pretrained_i2v function returns a mismatched BertTokenizer and BertT2V (with respect to special tokens).
More specifically:

  1. In the initialization of BertTokenizer, tokens such as [FIGURE] and [TAG] are added to self.tokenizer (the huggingface tokenizer). In my case this increases the size of the tokenizer, since TALEduBERT did not contain these tokens, so they are tokenized to ids outside the range of the embedding layer.
  2. Usually we have to call model.resize_token_embeddings(len(tokenizer)) after tokenizer.add_special_tokens(), and indeed there is such a call in Vector/t2v.BertModel (if tokenizer: self.model.resize_token_embeddings(len(tokenizer.tokenizer))). However, as @KenelmQLH required, "T2V has to be separated from tokenizer".
  3. Another point: even if I upload a resized TALEduBERT to model-hub and pass the test, the '[FIGURE]' etc. tokens are still meaningless to the model.

I have two solutions here: 1) simply treat these tokens as [UNK] in TALEduBERT, which may require some changes in BertTokenizer; 2) apply resize_token_embeddings to the original TALEduBERT, then save and upload the new model to model-hub.
😢 But neither seems quite proper; what do you think?
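
For reference, a minimal sketch of the standard huggingface pattern this mismatch runs into; "bert-base-uncased" merely stands in for TALEduBERT here:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Adding domain tokens grows the vocabulary beyond the checkpoint's size.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[FIGURE]", "[TAG]"]}
)

# Without this resize, the new ids index past the embedding matrix and
# trigger "IndexError: index out of range in self" at lookup time.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))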

Error Message

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

To Reproduce

http://base.ustc.edu.cn/data/model_zoo/modelhub/bert_pub/1/tal_edu_bert.zip
I haven't pushed the commits yet; you may download it and try yourself :)

What have you tried to solve it?

I've stated this in the Description.

Environment

This is independent of the environment.

[Feature] Optimize tokenization, including multi-modal problems, Parser and Formula optimization

Description


  • Handle multi-modal problems
    • AST graph
    • Image
  • Handle noise problems when identifying $...$ in Parser (needs better rules)
  • Handle Formula AST problems when identifying $AB=BC$ and $123$ (consider preprocessing)


[Feature] Add pretrained RNN model

Description

Add a pretrained RNN framework that can easily be used to train an ELMo-like language model.
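
Not an implementation of the feature, just an illustrative PyTorch sketch of the forward half of an ELMo-like language model (ELMo pairs it with a backward copy); all names here are hypothetical:

import torch
from torch import nn

class ForwardRnnLM(nn.Module):
    # Hypothetical module, not part of EduNLP.
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); returns next-token logits per position.
        hidden, _ = self.rnn(self.embedding(token_ids))
        return self.head(hidden)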


PureTextTokenizer: inconsistent results when tokenizing the same sentences

🐛 Description

I tried to tokenize an item twice using PureTextTokenizer, but got inconsistent results.

Error Message

To Reproduce

The code I use:

from EduNLP.Tokenizer import PureTextTokenizer
tokenizer = PureTextTokenizer()
item = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
next(tokenizer(item))
next(tokenizer(item))

What have you tried to solve it?

Environment

Environment Information

Operating System: MacOS

Python Version: anaconda/python3.7


Download Model Error

🐛 Description

When I call get_pretrained_i2v() for the first time, EduNLP downloads the model for me automatically. However, the model .zip file does not work when the program tries to extract it.

Error Message

downloader, INFO http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip is saved as /root/.EduNLP/model/general_all_256.zip
downloader, INFO /root/.EduNLP/model/general_all_256.zip already exists. Send resume request after 3.56KB
Downloading /root/.EduNLP/model/general_all_256.zip 100.00%: 5.15GB | 5.15GB
downloader, INFO /root/.EduNLP/model/general_all_256.zip is unzip to /root/.EduNLP/model/general_all_256
Traceback (most recent call last):
File "Annoy.py", line 70, in
Annoy_inst = Annoy('d2v_all_256',10)
File "Annoy.py", line 27, in init
self.i2v = get_pretrained_i2v(i2v)
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/I2V/i2v.py", line 122, in get_pretrained_i2v
return _class.from_pretrained(*params, model_dir=model_dir)
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/I2V/i2v.py", line 93, in from_pretrained
return cls("text", name, pretrained_t2v=True, model_dir=model_dir)
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/I2V/i2v.py", line 35, in init
self.t2v = get_t2v_pretrained_model(t2v, kwargs.get("model_dir", MODEL_DIR))
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/Vector/t2v.py", line 54, in get_pretrained_t2v
model_path = get_data(url, model_dir)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/download_data.py", line 223, in get_data
return download_data(url, data_dir, override)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/download_data.py", line 190, in download_data
_data_dir = download_file(url, save_path, override)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/download_data.py", line 161, in download_file
return decompress(save_path)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/utils.py", line 17, in decompress
return un_zip(file)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/utils.py", line 37, in un_zip
zip_file.extract(name, uz_path)
File "/usr/lib/python3.6/zipfile.py", line 1507, in extract
return self._extract_member(member, path, pwd)
File "/usr/lib/python3.6/zipfile.py", line 1577, in _extract_member
with self.open(member, pwd=pwd) as source, \
File "/usr/lib/python3.6/zipfile.py", line 1396, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header

To Reproduce

Steps to reproduce


1. Delete the model in ~/.EduNLP/model/
2. Rerun get_pretrained_i2v()

What have you tried to solve it?

Run the following commands in bash:

1. wget http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip
2. mv general_all_256.zip ~/.EduNLP/model

Then extracting the zip file succeeds.

Environment

Environment Information

Operating System: Ubuntu 18.04.5 LTS

Python Version: python3.6.9

[Bug] EduNLP/tests/test_vec/test_vec.py PyTest failed when CUDA is available

🐛 Description

When running PyTest in a CUDA-enabled environment, a runtime error occurs. The problem does not appear when CUDA is unavailable.
@KenelmQLH

Error Message


stem_tokens = [['已知', '集合', 'mathord', '=', 'mathord', '\mid', ...], ['复数', 'mathord', '=', 'textord', '+', 'textord', ...], ['埃及',...hord', ...], ['某校', '课外', '学习', '小组', '研究', '作物', ...], ['已知', '圆', 'mathord', 'textord', '{ }', '\supsub', ...], ...]
tmpdir = local('/tmp/pytest-of-pingzhili/pytest-16/test_rnn0')

def test_rnn(stem_tokens, tmpdir):
    method = "sg"
    filepath_prefix = str(tmpdir.mkdir(method).join("stem_tf_"))
    filepath = train_vector(
        stem_tokens,
        filepath_prefix,
        10,
        method=method,
        train_params=dict(min_count=0)
    )
    w2v = W2V(filepath, method=method)
    with pytest.raises(TypeError):
        RNNModel("Error", w2v, 20)
    for rnn_type in ["Rnn", "lstm", "GRU"]:
        rnn = RNNModel(rnn_type, w2v, 20, device="cpu")
        # print('DEBUG', next(rnn.rnn.parameters()).device)
        tokens = rnn.infer_tokens(stem_tokens[:1])

tests/test_vec/test_vec.py:152:


EduNLP/Vector/rnn/rnn.py:81: in infer_tokens
tokens = self(items, **kwargs)[0]
EduNLP/Vector/rnn/rnn.py:67: in __call__
tokens, item = self.rnn(torch.LongTensor(seq_idx), torch.LongTensor(seq_len))
../../.local/lib/python3.6/site-packages/torch/nn/modules/module.py:1051: in _call_impl
return forward_call(*input, **kwargs)


self = DataParallel(
(module): LM(
(embedding): Embedding(96, 10)
(rnn): RNN(10, 20)
)
)
inputs = (tensor([[28, 34, 2, 8, 2, 37, 2, 3, 4, 5, 10, 3, 2, 10, 3, 39, 3, 13,
6, 2, 8, 13, 10, 3, 6, 3, 6, 3, 6, 3, 40, 6, 2, 42, 2, 8]]), tensor([36]))
kwargs = {}
t = Parameter containing:
tensor([[-1.3489e+00, 1.0146e+00, -6.5648e-01, 1.3789e+00, 8.4049e-01,
-4.5570e-01, ...01, 2.2516e+00,
-9.9800e-01, -1.1827e+00, 1.8549e+00, 5.5470e-01, -6.5661e-01]],
requires_grad=True)

def forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function("DataParallel.forward"):
        if not self.device_ids:
            return self.module(*inputs, **kwargs)
        for t in chain(self.module.parameters(), self.module.buffers()):
            if t.device != self.src_device_obj:
                raise RuntimeError("module must have its parameters and buffers "
                                   "on device {} (device_ids[0]) but found one of "
                                   "them on device: {}".format(self.src_device_obj, t.device))
E RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
../../.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:156: RuntimeError

To Reproduce


Steps to reproduce


  1. Clone the code in a CUDA-available environment, check out the dev branch, and cd into the EduNLP directory
  2. Run python3.6 -m pytest

What have you tried to solve it?

  1. Added self.embedding.set_device(device) to this line. I believe this is another bug, though PyTest still failed after fixing it.
  2. I guess the key might be in EduNLP/Vector/embedding.py; however, I haven't located it yet.

Environment

Environment Information: I actually ran it on sis18.

Operating System: Linux sis18.ustcdm.org 3.10.0-1160.el7.x86_64

Python Version: python3.6


[Feature] Build an automatic mapping table for downloading pretrained models.

Description


  • Currently, the name-to-url mapping table for our pretrained models is defined as a static dict() in T2V, which is too inflexible.
  • We need to maintain a dynamic mapping table in modelhub, providing functions such as the following (see the sketch below):
    - add a model's config (url and T2V) to the mapping table when uploading a model from Modelhub
    - get a model's config (url and T2V) from the mapping table when downloading a model from EduNLP
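
A minimal sketch of the idea, assuming a JSON file hosted by modelhub; the endpoint URL and layout are hypothetical, not an existing API:

import json
from urllib.request import urlopen

# Hypothetical endpoint; the real modelhub location is not decided yet.
MAPPING_URL = "http://base.ustc.edu.cn/data/model_zoo/modelhub/mapping.json"

def get_model_config(name: str) -> dict:
    # Fetch the name-to-(url, T2V) table at call time instead of
    # relying on the static dict in T2V.
    with urlopen(MAPPING_URL) as resp:
        mapping = json.load(resp)
    return mapping[name]  # e.g. {"url": "https://...zip", "t2v": "d2v"}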


[Feature] Add Bert

I think the first step is to reuse popular packages such as transformers from huggingface; the filepath handling is expected to change so that it stays consistent with get_pretrained_i2v. A sketch of such a wrapper is shown below.
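
A minimal sketch of wrapping transformers as a T2V backend; the class name and [CLS] pooling are illustrative choices, not the final EduNLP design:

import torch
from transformers import AutoModel, AutoTokenizer

class BertT2V:
    # Hypothetical wrapper, not an existing EduNLP class.
    def __init__(self, path):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModel.from_pretrained(path)

    def __call__(self, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        with torch.no_grad():
            output = self.model(**batch)
        # Use the [CLS] hidden state as the item vector.
        return output.last_hidden_state[:, 0]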


[Bug] some hidden error when using sif4sci or GensimWordTokenizer

🐛 Description

sif4sci may return None; similarly, GensimWordTokenizer may return None.

Error Message


To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce


import json
from EduNLP.SIF import sif4sci, is_sif, to_sif
# GensimWordTokenizer is used below but was missing from the imports;
# it is assumed to live in EduNLP.Pretrain.
from EduNLP.Pretrain import GensimWordTokenizer

def load_items2():
    items = []
    with open("OpenLUNA.json", encoding="utf-8") as f:
        for line in f:
            items.append(json.loads(line))
    return items

items = load_items2()

# ----------------------------------------- #
tokenization_params1 = {
    "formula_params": {
        "method": "linear",
        "symbolize_figure_formula": True
    }
}

tokenizer = GensimWordTokenizer(symbol="fgm")

# ----------------------------------------- #
wrong_num = 0
for item in items:
    res = sif4sci(item["stem"], symbol="gm", tokenization_params=tokenization_params1, errors="ignore")
    # res = tokenizer(item["stem"])

    if res is None:
        wrong_num += 1

print(f"There are {wrong_num} / {len(items)} wrong cases!")
# There are 156 / 792 wrong cases!

What have you tried to solve it?

Actually, I figured out that this is caused by the way we handle raised errors, which is "ignore" in GensimWordTokenizer.

But when I look at the specific errors, I find that one main type is related to the SIF Parser. So I wonder whether we need to handle this problem.

For example, the Parser cannot identify "n=" and "p=":

(1)

s1 = "执行右面的程序框图,则输出的n=$\\FigureID{3bf20b93-8af1-11eb-b205-b46bfc50aa29}$$\\FigureID{59b88b3f-8af1-11eb-9450-b46bfc50aa29}$$\\FigureID{63116570-8b75-11eb-b694-b46bfc50aa29}$$\\FigureID{6a006177-8b76-11eb-9ac0-b46bfc50aa29}$$\\FigureID{088f15e9-8b7c-11eb-959f-b46bfc50aa29}$"
is_sif(s1)
RecursionError                            Traceback (most recent call last)
<ipython-input-3-a8de420882df> in <module>
     11 
     12 # ----------------------------------------- #
---> 13 is_sif(s1)
     14 
     15 # ----------------------------------------- #

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\sif.py in is_sif(item, check_formula, return_parser)
     50     """
     51     item_parser = Parser(item, check_formula)
---> 52     item_parser.description_list()
     53     if item_parser.fomula_illegal_flag:
     54         raise ValueError(item_parser.fomula_illegal_message)

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description_list(self)
    344         """
    345         # print('call description_list')
--> 346         self.description()
    347         if self.error_flag:
    348             # print("Error")

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description(self)
    304         #         if self.error_flag:
    305         #             return
--> 306         self.txt_list()
    307         if self.error_flag:
    308             return

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
    298             return
    299         if self.lookahead != self.empty:
--> 300             self.txt_list()
    301 
    302     def description(self):

... last 1 frames repeated, from the frame below ...

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
    298             return
    299         if self.lookahead != self.empty:
--> 300             self.txt_list()
    301 
    302     def description(self):

RecursionError: maximum recursion depth exceeded in comparison

Environment

Operating System: Windows

Python Version: Python 3.6


[Feature] Build pipeline

Description

Make I2V available in a clearer pipeline form for application users (an illustrative sketch follows the list).

  • Build a pipeline for tokenization (sif, seg and tokenize) and vectorization (t2v).
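
An illustrative sketch of the composition only; the Pipeline class and the step functions it chains are hypothetical placeholders:

class Pipeline:
    # Hypothetical helper, not an existing EduNLP class.
    def __init__(self, steps):
        # e.g. steps = [to_sif, seg, tokenize, t2v]
        self.steps = steps

    def __call__(self, item):
        # Feed each step's output into the next one.
        for step in self.steps:
            item = step(item)
        return item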


[FIX] Address the "_pickle.UnpicklingError: could not find MARK" problem

🐛 Description

It raises the error "_pickle.UnpicklingError: could not find MARK" when I call the I2V program.

Error Message

Traceback (most recent call last):
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 857, in load
return super(Doc2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 1230, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 602, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 436, in load
obj._load_specials(fname, mmap, compress, subname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 481, in _load_specials
setattr(self, attrib, val)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 1461, in new_func1
return func(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 791, in syn1neg
self.trainables.syn1neg = value
AttributeError: 'Doc2Vec' object has no attribute 'trainables'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bmk/8.3test/test.py", line 17, in <module>
ll=get_pretrained_i2v("d2v_sci_256", model_dir="./data")
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 122, in get_pretrained_i2v
return _class.from_pretrained(*params, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 93, in from_pretrained
return cls("text", name, pretrained_t2v=True, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 35, in __init__
self.t2v = get_t2v_pretrained_model(t2v, kwargs.get("model_dir", MODEL_DIR))
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 57, in get_pretrained_t2v
return T2V(model_name, model_path, *args)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 29, in __init__
self.i2v: Vector = MODELS[model](*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/gensim_vec.py", line 118, in __init__
self.d2v = Doc2Vec.load(filepath)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 861, in load
return load_old_doc2vec(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/doc2vec.py", line 91, in load_old_doc2vec
old_model = Doc2Vec.load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/word2vec.py", line 1617, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
obj = unpickle(fname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 379, in unpickle
return _pickle.loads(file_bytes, encoding='latin1')
_pickle.UnpicklingError: could not find MARK
(base) [bmk@sis10 8.3test]$ /home/bmk/anaconda3/bin/python /home/bmk/8.3test/test.py
EduNLP, INFO Use pretrained t2v model d2v_sci_256
downloader, INFO http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_science_256.zip is saved as data/general_science_256.zip
downloader, INFO file existed, skipped
Traceback (most recent call last):
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 857, in load
return super(Doc2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 1230, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 602, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 436, in load
obj._load_specials(fname, mmap, compress, subname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 481, in _load_specials
setattr(self, attrib, val)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 1461, in new_func1
return func(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 791, in syn1neg
self.trainables.syn1neg = value
AttributeError: 'Doc2Vec' object has no attribute 'trainables'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bmk/8.3test/test.py", line 17, in <module>
ll=get_pretrained_i2v("d2v_sci_256", model_dir="./data")
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 122, in get_pretrained_i2v
return _class.from_pretrained(*params, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 93, in from_pretrained
return cls("text", name, pretrained_t2v=True, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 35, in __init__
self.t2v = get_t2v_pretrained_model(t2v, kwargs.get("model_dir", MODEL_DIR))
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 57, in get_pretrained_t2v
return T2V(model_name, model_path, *args)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 29, in __init__
self.i2v: Vector = MODELS[model](*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/gensim_vec.py", line 118, in __init__
self.d2v = Doc2Vec.load(filepath)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 861, in load
return load_old_doc2vec(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/doc2vec.py", line 91, in load_old_doc2vec
old_model = Doc2Vec.load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/word2vec.py", line 1617, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
obj = unpickle(fname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 379, in unpickle
return _pickle.loads(file_bytes, encoding='latin1')
_pickle.UnpicklingError: could not find MARK

To Reproduce


Steps to reproduce


1. M=get_pretrained_i2v("d2v_sci_256", model_dir="./data")
2. X=I2V("text","d2v",filepath='/home/bmk/8.3test/test/model/general_science_256.bin')

What have you tried to solve it?

Use a different Python version, such as python3.6.8 on CentOS.

Environment

Environment Information

Operating System: Linux

Python Version: python 3.8.5 64-bit ('base': conda)


[Feature] Add an option for checking formulas in is_sif

Description

Add an option to SIF/is_sif to decide whether to check formulas in the parser before segmentation.
(The current option provided in SIF/sif4sci only decides whether to use is_sif within sif4sci.)
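
A sketch of the proposed usage, assuming the option lands as a keyword argument (the is_sif signature shown in the traceback of the sif4sci issue above already carries a check_formula parameter):

from EduNLP.SIF import is_sif

item = "则$z=x+7 y$的最大值为$\\SIFBlank$"
# Proposed: skip formula checking in the parser before segmentation.
is_sif(item, check_formula=False)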


Handling formula problems in English math exercises

Code:

tokenizer = BertTokenizer.from_pretrained('D:\python\bert\liuyu\Transformer-Bert\bert-base-uncased')
t2v = BertModel('D:\python\bert\liuyu\Transformer-Bert\bert-base-uncased')
s=to_sif(text)
print(s)
item=tokenizer(s,return_tensors='pt')

Output:

[[ $video$ $1$]] $What$'$s$ $shown$ $above$ $is$ $the$ $graph$ $of$ $the$ $function$ $The$ $graph$ $as$ $been$ $sliced$ $along$ $the$ $plane$ $x = 1$ ($pictured$ $in$ $white$), $which$ $gives$ $a$ $curve$ $in$ $space$ $corresponding$ $to$ $all$ $points$ $on$ $the$ $graph$ $where$ $x=1$ ($pictured$ $in$ $red$). What is the slope of the tangent line to this curve at the point $(1, 1, 0)$? [[ numeric-input 1]]

Why are the preceding words also wrapped in $?
