bigdata-ustc / edunlp

A library for advanced Natural Language Processing towards multi-modal educational items.

License: Apache License 2.0

Languages: Python 99.88%, Makefile 0.12%

Topics: python deep-learning machine-learning word2vec doc2vec tokenizer katex formula natural-language-processing

edunlp's Introduction

EduNLP


EduNLP is a library for advanced Natural Language Processing in Python and is one of the projects of the EduX plan of BDAA. It is built on the very latest research and was designed from day one to be used in real educational products.

EduNLP now comes with pretrained pipelines and currently supports segmentation, tokenization, and vectorization. It supports a variety of preprocessing steps for NLP in educational scenarios, such as formula parsing and multi-modal segmentation.

EduNLP is commercial open-source software, released under the Apache-2.0 license.

Quickstart

Installation

Clone the repository and install with pip:

# basic installation
pip install .

# full installation
pip install .[full]

or install from pypi:

# basic installation
pip install EduNLP

# full installation
pip install EduNLP[full]

Usage

from EduNLP import get_pretrained_i2v

# Download (on first call) and load a pretrained item-to-vector model.
i2v = get_pretrained_i2v("d2v_all_300", "./model")
# Returns one vector per item plus per-token vectors.
item_vector, token_vector = i2v(["the content of item 1", "the content of item 2"])
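
Tokenization can also be used on its own. A minimal sketch using PureTextTokenizer (the same interface that appears in the issues below); the example item is illustrative:

from EduNLP.Tokenizer import PureTextTokenizer

tokenizer = PureTextTokenizer()
items = ["如图,若$x,y$满足约束条件,则$z=x+7 y$的最大值为$\\SIFBlank$"]
# The tokenizer is a generator: one token sequence per input item.
tokens = next(tokenizer(items))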

Tutorial

For more details, please refer to the full documentation (latest | stable).

Resource

We will continuously publish new datasets in the Standard Item Format (SIF) to encourage related research. The data resources can be accessed via another EduX project, EduData.

Contribute

EduNLP is still under development. More algorithms and features will be added, and we always welcome contributions to help make EduNLP better. If you would like to contribute, please follow this guideline (Development Guide).

Citation

If this repository is helpful to you, please cite our work:

@misc{bigdata2021edunlp,
  title        = {EduNLP},
  author       = {bigdata-ustc},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  year         = {2021},
  howpublished = {\url{https://github.com/bigdata-ustc/EduNLP}},
}

edunlp's People

Contributors

baooooom, fannazya, karin0018, kenelmqlh, nnnyt, pingzhili, shangzixue, tswsxk, wintermelon008


edunlp's Issues

Download Pretrained Model

When I use the model stored at http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip, I download it via the network interface EduNLP provides. However, this did not work until I read the source code and downloaded the model directly with wget http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip. The main error information is:

zipfile.BadZipFile: Bad magic number for file header

(The full traceback is identical to the one in the "Download Model Error" issue below.)

[Bug] BertModel `add_special_tokens` and `resize_token_embeddings` problem

🐛 Description

@nnnyt @KenelmQLH
After I added TALEduBERT to our project and ran some tests, I found that the current get_pretrained_i2v function returns a mismatched BertTokenizer and BertT2V (with respect to special tokens).
More specifically:

  1. In the initialization of BertTokenizer, tokens such as [FIGURE] and [TAG] are added to self.tokenizer (the huggingface tokenizer). In my case this increases the size of the tokenizer, since TALEduBERT did not contain these tokens, so they are tokenized to ids outside the range of the embedding layer.
  2. Usually we have to call model.resize_token_embeddings(len(tokenizer)) after tokenizer.add_special_tokens(), and indeed there is such a call in Vector/t2v.BertModel (if tokenizer: self.model.resize_token_embeddings(len(tokenizer.tokenizer))). However, as @KenelmQLH required, "T2V has to be separated from tokenizer".
  3. Another point: even if I upload a resized TALEduBERT to model-hub and pass the test, the '[FIGURE]' etc. tokens are still meaningless to the model.

I have two solutions here: 1) simply treat these tokens as [UNK] in TALEduBERT, which may require some changes in BertTokenizer; 2) apply resize_token_embeddings to the original TALEduBERT, then save and upload the new model to model-hub.
😢 But neither seems quite proper; what do you think?
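
For reference, a minimal sketch of the standard huggingface pattern this mismatch runs into; "bert-base-uncased" merely stands in for TALEduBERT here:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Adding domain tokens grows the vocabulary beyond the checkpoint's size.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[FIGURE]", "[TAG]"]}
)

# Without this resize, the new ids index past the embedding matrix and
# trigger "IndexError: index out of range in self" at lookup time.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))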

Error Message

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

To Reproduce

http://base.ustc.edu.cn/data/model_zoo/modelhub/bert_pub/1/tal_edu_bert.zip
I haven't pushed the commits yet; you may download it and try yourself :)

What have you tried to solve it?

I've stated this in the Description.

Environment

This is independent of the environment.

[Feature] Optimize tokenization, including multi-modal problems, Parser and Formula optimization

Description


  • Handle multi-modal problems
    • AST graph
    • Image
  • Handle noise problems when identifying $...$ in Parser (needs better rules)
  • Handle Formula AST problems when identifying $AB=BC$ and $123$ (consider preprocessing)


[Feature] Add pretrained RNN model

Description

Add a pretrained RNN framework that can easily be used to train an ELMo-like language model.
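
Not an implementation of the feature, just an illustrative PyTorch sketch of the forward half of an ELMo-like language model (ELMo pairs it with a backward copy); all names here are hypothetical:

import torch
from torch import nn

class ForwardRnnLM(nn.Module):
    # Hypothetical module, not part of EduNLP.
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); returns next-token logits per position.
        hidden, _ = self.rnn(self.embedding(token_ids))
        return self.head(hidden)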


PureTextTokenizer: inconsistent results when tokenizing the same sentences

🐛 Description

I tried to tokenize an item twice using PureTextTokenizer, but got inconsistent results.

Error Message

To Reproduce

The code I use:

from EduNLP.Tokenizer import PureTextTokenizer
tokenizer = PureTextTokenizer()
item = ["有公式$\\FormFigureID{wrong1?}$,如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$,则$z=x+7 y$的最大值为$\\SIFBlank$"]
next(tokenizer(item))
next(tokenizer(item))

What have you tried to solve it?

Environment

Environment Information

Operating System: MacOS

Python Version: anaconda/python3.7


Download Model Error

🐛 Description

When I call get_pretrained_i2v() for the first time, EduNLP downloads the model for me automatically. However, the model .zip file does not work when the program tries to extract it.

Error Message

downloader, INFO http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip is saved as /root/.EduNLP/model/general_all_256.zip
downloader, INFO /root/.EduNLP/model/general_all_256.zip already exists. Send resume request after 3.56KB
Downloading /root/.EduNLP/model/general_all_256.zip 100.00%: 5.15GB | 5.15GB
downloader, INFO /root/.EduNLP/model/general_all_256.zip is unzip to /root/.EduNLP/model/general_all_256
Traceback (most recent call last):
File "Annoy.py", line 70, in
Annoy_inst = Annoy('d2v_all_256',10)
File "Annoy.py", line 27, in init
self.i2v = get_pretrained_i2v(i2v)
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/I2V/i2v.py", line 122, in get_pretrained_i2v
return _class.from_pretrained(*params, model_dir=model_dir)
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/I2V/i2v.py", line 93, in from_pretrained
return cls("text", name, pretrained_t2v=True, model_dir=model_dir)
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/I2V/i2v.py", line 35, in init
self.t2v = get_t2v_pretrained_model(t2v, kwargs.get("model_dir", MODEL_DIR))
File "/usr/local/lib/python3.6/dist-packages/EduNLP-0.0.5-py3.6.egg/EduNLP/Vector/t2v.py", line 54, in get_pretrained_t2v
model_path = get_data(url, model_dir)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/download_data.py", line 223, in get_data
return download_data(url, data_dir, override)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/download_data.py", line 190, in download_data
_data_dir = download_file(url, save_path, override)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/download_data.py", line 161, in download_file
return decompress(save_path)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/utils.py", line 17, in decompress
return un_zip(file)
File "/usr/local/lib/python3.6/dist-packages/EduData/DataSet/download_data/utils.py", line 37, in un_zip
zip_file.extract(name, uz_path)
File "/usr/lib/python3.6/zipfile.py", line 1507, in extract
return self._extract_member(member, path, pwd)
File "/usr/lib/python3.6/zipfile.py", line 1577, in _extract_member
with self.open(member, pwd=pwd) as source, \
File "/usr/lib/python3.6/zipfile.py", line 1396, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header

To Reproduce

Steps to reproduce


1. Delete the model in ~/.EduNLP/model/
2. Rerun get_pretrained_i2v()

What have you tried to solve it?

Run the following commands in bash:

1. wget http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_all_256.zip
2. mv general_all_256.zip ~/.EduNLP/model

Then extracting the zip file succeeds.

Environment

Environment Information

Operating System: Ubuntu 18.04.5 LTS

Python Version: python3.6.9

[Bug] EduNLP/tests/test_vec/test_vec.py PyTest failed when CUDA is available

🐛 Description

When running PyTest in a CUDA-enabled environment, a runtime error occurs. The problem does not appear when CUDA is unavailable.
@KenelmQLH

Error Message


stem_tokens = [['已知', '集合', 'mathord', '=', 'mathord', '\mid', ...], ['复数', 'mathord', '=', 'textord', '+', 'textord', ...], ['埃及',...hord', ...], ['某校', '课外', '学习', '小组', '研究', '作物', ...], ['已知', '圆', 'mathord', 'textord', '{ }', '\supsub', ...], ...]
tmpdir = local('/tmp/pytest-of-pingzhili/pytest-16/test_rnn0')

def test_rnn(stem_tokens, tmpdir):
    method = "sg"
    filepath_prefix = str(tmpdir.mkdir(method).join("stem_tf_"))
    filepath = train_vector(
        stem_tokens,
        filepath_prefix,
        10,
        method=method,
        train_params=dict(min_count=0)
    )
    w2v = W2V(filepath, method=method)
    with pytest.raises(TypeError):
        RNNModel("Error", w2v, 20)
    for rnn_type in ["Rnn", "lstm", "GRU"]:
        rnn = RNNModel(rnn_type, w2v, 20, device="cpu")
        # print('DEBUG', next(rnn.rnn.parameters()).device)
        tokens = rnn.infer_tokens(stem_tokens[:1])

tests/test_vec/test_vec.py:152:


EduNLP/Vector/rnn/rnn.py:81: in infer_tokens
tokens = self(items, **kwargs)[0]
EduNLP/Vector/rnn/rnn.py:67: in __call__
tokens, item = self.rnn(torch.LongTensor(seq_idx), torch.LongTensor(seq_len))
../../.local/lib/python3.6/site-packages/torch/nn/modules/module.py:1051: in _call_impl
return forward_call(*input, **kwargs)


self = DataParallel(
(module): LM(
(embedding): Embedding(96, 10)
(rnn): RNN(10, 20)
)
)
inputs = (tensor([[28, 34, 2, 8, 2, 37, 2, 3, 4, 5, 10, 3, 2, 10, 3, 39, 3, 13,
6, 2, 8, 13, 10, 3, 6, 3, 6, 3, 6, 3, 40, 6, 2, 42, 2, 8]]), tensor([36]))
kwargs = {}
t = Parameter containing:
tensor([[-1.3489e+00, 1.0146e+00, -6.5648e-01, 1.3789e+00, 8.4049e-01,
-4.5570e-01, ...01, 2.2516e+00,
-9.9800e-01, -1.1827e+00, 1.8549e+00, 5.5470e-01, -6.5661e-01]],
requires_grad=True)

def forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function("DataParallel.forward"):
        if not self.device_ids:
            return self.module(*inputs, **kwargs)
        for t in chain(self.module.parameters(), self.module.buffers()):
            if t.device != self.src_device_obj:
                raise RuntimeError("module must have its parameters and buffers "
                                   "on device {} (device_ids[0]) but found one of "
                                   "them on device: {}".format(self.src_device_obj, t.device))
E RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
../../.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:156: RuntimeError

To Reproduce


Steps to reproduce


  1. Clone the code in a CUDA-available environment, check out the dev branch, and cd into the EduNLP directory
  2. Run python3.6 -m pytest

What have you tried to solve it?

  1. Added self.embedding.set_device(device) to this line. I believe this is another bug, though PyTest still failed after fixing it.
  2. I guess the key might be in EduNLP/Vector/embedding.py; however, I haven't located it yet.

Environment

Environment Information: I actually ran it on sis18.

Operating System: Linux sis18.ustcdm.org 3.10.0-1160.el7.x86_64

Python Version: python3.6


[Feature] Build an automatic mapping table for downloading pretrained models.

Description


  • Currently, the name-to-url mapping table for our pretrained models is defined as a static dict() in T2V, which is too inflexible.
  • We need to maintain a dynamic mapping table in modelhub, providing functions such as the following (see the sketch below):
    - add a model's config (url and T2V) to the mapping table when uploading a model from Modelhub
    - get a model's config (url and T2V) from the mapping table when downloading a model from EduNLP
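
A minimal sketch of the idea, assuming a JSON file hosted by modelhub; the endpoint URL and layout are hypothetical, not an existing API:

import json
from urllib.request import urlopen

# Hypothetical endpoint; the real modelhub location is not decided yet.
MAPPING_URL = "http://base.ustc.edu.cn/data/model_zoo/modelhub/mapping.json"

def get_model_config(name: str) -> dict:
    # Fetch the name-to-(url, T2V) table at call time instead of
    # relying on the static dict in T2V.
    with urlopen(MAPPING_URL) as resp:
        mapping = json.load(resp)
    return mapping[name]  # e.g. {"url": "https://...zip", "t2v": "d2v"}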


[Feature] Add Bert

I think the first step is to reuse popular packages such as transformers from huggingface; the filepath handling is expected to change so that it stays consistent with get_pretrained_i2v. A sketch of such a wrapper is shown below.
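
A minimal sketch of wrapping transformers as a T2V backend; the class name and [CLS] pooling are illustrative choices, not the final EduNLP design:

import torch
from transformers import AutoModel, AutoTokenizer

class BertT2V:
    # Hypothetical wrapper, not an existing EduNLP class.
    def __init__(self, path):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModel.from_pretrained(path)

    def __call__(self, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        with torch.no_grad():
            output = self.model(**batch)
        # Use the [CLS] hidden state as the item vector.
        return output.last_hidden_state[:, 0]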


[Bug] some hidden error when using sif4sci or GensimWordTokenizer

🐛 Description

sif4sci may return None; similarly, GensimWordTokenizer may return None.

Error Message


To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce


import json
from EduNLP.SIF import sif4sci, is_sif, to_sif
# GensimWordTokenizer is used below but was missing from the imports;
# it is assumed to live in EduNLP.Pretrain.
from EduNLP.Pretrain import GensimWordTokenizer

def load_items2():
    items = []
    with open("OpenLUNA.json", encoding="utf-8") as f:
        for line in f:
            items.append(json.loads(line))
    return items

items = load_items2()

# ----------------------------------------- #
tokenization_params1 = {
    "formula_params": {
        "method": "linear",
        "symbolize_figure_formula": True
    }
}

tokenizer = GensimWordTokenizer(symbol="fgm")

# ----------------------------------------- #
wrong_num = 0
for item in items:
    res = sif4sci(item["stem"], symbol="gm", tokenization_params=tokenization_params1, errors="ignore")
    # res = tokenizer(item["stem"])

    if res is None:
        wrong_num += 1

print(f"There are {wrong_num} / {len(items)} wrong cases!")
# There are 156 / 792 wrong cases!

What have you tried to solve it?

Actually, I figured out that this is caused by the way we handle raised errors, which is "ignore" in GensimWordTokenizer.

But when I look at the specific errors, I find that one main type is related to the SIF Parser. So I wonder whether we need to handle this problem.

For example, the Parser cannot identify "n=" and "p=":

(1)

s1 = "执行右面的程序框图,则输出的n=$\\FigureID{3bf20b93-8af1-11eb-b205-b46bfc50aa29}$$\\FigureID{59b88b3f-8af1-11eb-9450-b46bfc50aa29}$$\\FigureID{63116570-8b75-11eb-b694-b46bfc50aa29}$$\\FigureID{6a006177-8b76-11eb-9ac0-b46bfc50aa29}$$\\FigureID{088f15e9-8b7c-11eb-959f-b46bfc50aa29}$"
is_sif(s1)
RecursionError                            Traceback (most recent call last)
<ipython-input-3-a8de420882df> in <module>
     11 
     12 # ----------------------------------------- #
---> 13 is_sif(s1)
     14 
     15 # ----------------------------------------- #

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\sif.py in is_sif(item, check_formula, return_parser)
     50     """
     51     item_parser = Parser(item, check_formula)
---> 52     item_parser.description_list()
     53     if item_parser.fomula_illegal_flag:
     54         raise ValueError(item_parser.fomula_illegal_message)

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description_list(self)
    344         """
    345         # print('call description_list')
--> 346         self.description()
    347         if self.error_flag:
    348             # print("Error")

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description(self)
    304         #         if self.error_flag:
    305         #             return
--> 306         self.txt_list()
    307         if self.error_flag:
    308             return

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
    298             return
    299         if self.lookahead != self.empty:
--> 300             self.txt_list()
    301 
    302     def description(self):

... last 1 frames repeated, from the frame below ...

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
    298             return
    299         if self.lookahead != self.empty:
--> 300             self.txt_list()
    301 
    302     def description(self):

RecursionError: maximum recursion depth exceeded in comparison

Environment

Operating System: Windows

Python Version: Python 3.6


[Feature] Build pipeline

Description

Make I2V available in a clearer pipeline form for application users (an illustrative sketch follows the list).

  • Build a pipeline for tokenization (sif, seg and tokenize) and vectorization (t2v).
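
An illustrative sketch of the composition only; the Pipeline class and the step functions it chains are hypothetical placeholders:

class Pipeline:
    # Hypothetical helper, not an existing EduNLP class.
    def __init__(self, steps):
        # e.g. steps = [to_sif, seg, tokenize, t2v]
        self.steps = steps

    def __call__(self, item):
        # Feed each step's output into the next one.
        for step in self.steps:
            item = step(item)
        return item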


[FIX] Address the "_pickle.UnpicklingError: could not find MARK" problem

🐛 Description

It raises the error "_pickle.UnpicklingError: could not find MARK" when I call the I2V program.

Error Message

Traceback (most recent call last):
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 857, in load
return super(Doc2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 1230, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 602, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 436, in load
obj._load_specials(fname, mmap, compress, subname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 481, in _load_specials
setattr(self, attrib, val)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 1461, in new_func1
return func(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 791, in syn1neg
self.trainables.syn1neg = value
AttributeError: 'Doc2Vec' object has no attribute 'trainables'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bmk/8.3test/test.py", line 17, in <module>
ll=get_pretrained_i2v("d2v_sci_256", model_dir="./data")
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 122, in get_pretrained_i2v
return _class.from_pretrained(*params, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 93, in from_pretrained
return cls("text", name, pretrained_t2v=True, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 35, in __init__
self.t2v = get_t2v_pretrained_model(t2v, kwargs.get("model_dir", MODEL_DIR))
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 57, in get_pretrained_t2v
return T2V(model_name, model_path, *args)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 29, in __init__
self.i2v: Vector = MODELS[model](*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/gensim_vec.py", line 118, in __init__
self.d2v = Doc2Vec.load(filepath)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 861, in load
return load_old_doc2vec(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/doc2vec.py", line 91, in load_old_doc2vec
old_model = Doc2Vec.load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/word2vec.py", line 1617, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
obj = unpickle(fname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 379, in unpickle
return _pickle.loads(file_bytes, encoding='latin1')
_pickle.UnpicklingError: could not find MARK
(base) [bmk@sis10 8.3test]$ /home/bmk/anaconda3/bin/python /home/bmk/8.3test/test.py
EduNLP, INFO Use pretrained t2v model d2v_sci_256
downloader, INFO http://base.ustc.edu.cn/data/model_zoo/EduNLP/d2v/general_science_256.zip is saved as data/general_science_256.zip
downloader, INFO file existed, skipped
Traceback (most recent call last):
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 857, in load
return super(Doc2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 1230, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 602, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 436, in load
obj._load_specials(fname, mmap, compress, subname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 481, in _load_specials
setattr(self, attrib, val)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/utils.py", line 1461, in new_func1
return func(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/base_any2vec.py", line 791, in syn1neg
self.trainables.syn1neg = value
AttributeError: 'Doc2Vec' object has no attribute 'trainables'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/bmk/8.3test/test.py", line 17, in <module>
ll=get_pretrained_i2v("d2v_sci_256", model_dir="./data")
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 122, in get_pretrained_i2v
return _class.from_pretrained(*params, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 93, in from_pretrained
return cls("text", name, pretrained_t2v=True, model_dir=model_dir)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/I2V/i2v.py", line 35, in __init__
self.t2v = get_t2v_pretrained_model(t2v, kwargs.get("model_dir", MODEL_DIR))
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 57, in get_pretrained_t2v
return T2V(model_name, model_path, *args)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/t2v.py", line 29, in __init__
self.i2v: Vector = MODELS[model](*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/EduNLP-0.0.5-py3.8.egg/EduNLP/Vector/gensim_vec.py", line 118, in __init__
self.d2v = Doc2Vec.load(filepath)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 861, in load
return load_old_doc2vec(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/doc2vec.py", line 91, in load_old_doc2vec
old_model = Doc2Vec.load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/word2vec.py", line 1617, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
obj = unpickle(fname)
File "/home/bmk/anaconda3/lib/python3.8/site-packages/gensim/models/deprecated/old_saveload.py", line 379, in unpickle
return _pickle.loads(file_bytes, encoding='latin1')
_pickle.UnpicklingError: could not find MARK

To Reproduce


Steps to reproduce


1. M=get_pretrained_i2v("d2v_sci_256", model_dir="./data")
2. X=I2V("text","d2v",filepath='/home/bmk/8.3test/test/model/general_science_256.bin')

What have you tried to solve it?

Use a different Python version, such as python3.6.8 on CentOS.

Environment

Environment Information

Operating System: Linux

Python Version: python 3.8.5 64-bit ('base': conda)


[Feature] Add an option for checking formulas in is_sif

Description

Add an option to SIF/is_sif to decide whether to check formulas in the parser before segmentation.
(The current option provided in SIF/sif4sci only decides whether to use is_sif within sif4sci.)
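
A sketch of the proposed usage, assuming the option lands as a keyword argument (the is_sif signature shown in the traceback of the sif4sci issue above already carries a check_formula parameter):

from EduNLP.SIF import is_sif

item = "则$z=x+7 y$的最大值为$\\SIFBlank$"
# Proposed: skip formula checking in the parser before segmentation.
is_sif(item, check_formula=False)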


Handling formula problems in English math exercises

Code:

tokenizer = BertTokenizer.from_pretrained('D:\python\bert\liuyu\Transformer-Bert\bert-base-uncased')
t2v = BertModel('D:\python\bert\liuyu\Transformer-Bert\bert-base-uncased')
s=to_sif(text)
print(s)
item=tokenizer(s,return_tensors='pt')

Output:

[[ $video$ $1$]] $What$'$s$ $shown$ $above$ $is$ $the$ $graph$ $of$ $the$ $function$ $The$ $graph$ $as$ $been$ $sliced$ $along$ $the$ $plane$ $x = 1$ ($pictured$ $in$ $white$), $which$ $gives$ $a$ $curve$ $in$ $space$ $corresponding$ $to$ $all$ $points$ $on$ $the$ $graph$ $where$ $x=1$ ($pictured$ $in$ $red$). What is the slope of the tangent line to this curve at the point $(1, 1, 0)$? [[ numeric-input 1]]

Why are the preceding words also wrapped in $?
