hazyresearch / bootleg

Self-Supervision for Named Entity Disambiguation at the Tail

Home Page: http://hazyresearch.stanford.edu/bootleg

License: Apache License 2.0

Python 85.88% Jupyter Notebook 13.90% Shell 0.14% Makefile 0.08%

machine-learning named-entity-disambiguation ai self-supervision

bootleg's Issues

AssertionError: After eval, some sentences had left over mentions {0: {0}}

I have been attempting to run the notebook end2end_ned_tutorial on my MacBook, and it throws this error. The data (entity_db and nq) downloaded correctly; however, when I run the model, it fails with the error in the title after evaluation.

What might the cause be?

Question: Wikipedia / Wikidata Aliases

While experimenting with bootleg on some literature, we noticed that some entities we expected to find were not being picked up, for example "apoptosis". I did a little digging in the aliases file (specifically alias2qids_wiki.json, which is referenced in the example configs) and the term was not present. The word appears as part of other terms but not by itself. There is a Wikipedia page for the concept with the same title (https://en.wikipedia.org/wiki/Apoptosis) and a corresponding Wikidata entry (https://www.wikidata.org/wiki/Q14599311). The Wikidata entry has a different title, but it does list apoptosis as a label several times.

I noted there is also the file alias2qids.json, which is slightly bigger and does include the term. However, using that file instead produced many more mention tags, but for a lot of irrelevant terms (including basic words like "of", "also", "from", etc.). I wondered whether you were aware of the loss of terms like apoptosis, and whether it was the result of filtering them out at some stage or of a generation process that dropped them for some other reason (I couldn't tell whether you used a particular source from another project).
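One way to keep the broader alias map without tagging function words might be to filter it before use. The sketch below assumes the alias file maps each alias string to a list of [QID, score] candidate pairs; that layout, the stop-word list, and the helper name filter_alias_map are assumptions for illustration and are not taken from Bootleg's documentation.

```python
import json

# Hypothetical stop-word list; extend it for your corpus.
STOP_WORDS = {"of", "also", "from", "the", "and", "in"}


def filter_alias_map(alias2qids, stop_words=STOP_WORDS, min_chars=3):
    """Drop aliases that are stop words or shorter than min_chars."""
    kept = {}
    for alias, candidates in alias2qids.items():
        normalized = alias.strip().lower()
        if normalized in stop_words or len(normalized) < min_chars:
            continue
        kept[alias] = candidates
    return kept


# Toy data in the assumed layout: alias -> [[qid, score], ...].
# The QIDs for "of" and "also" are placeholders, not real Wikidata IDs.
toy = json.loads(
    '{"apoptosis": [["Q14599311", 1.0]],'
    ' "of": [["Q_PLACEHOLDER_1", 1.0]],'
    ' "also": [["Q_PLACEHOLDER_2", 1.0]]}'
)
filtered = filter_alias_map(toy)
print(sorted(filtered))
```

Whether a filtered map can be swapped in without retraining is exactly the open question raised in this issue; the sketch only shows the filtering step.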

On a side note, where is the best explanation of each source dataset, its derived forms, and what is subsequently produced from it (models, embeddings, etc.), and of when each one is used? There are a lot of files and it's not always clear when one is required or used (particularly those in the prep subfolders), when you could substitute or modify one, and which steps it would then need feeding into (for example, can you change the alias terms to some extent without having to retrain everything?).

Thanks

Tony

Feasibility of adding new types to disambiguate

Hi there,

I read your paper on Bootleg, and I must say I was quite impressed with the results you achieved on Named Entity Disambiguation to Wikidata terms.

I'm trying to understand how it all ties together and have a few questions. I was looking for which entity types are supported for NED and ran into this file: data/sample_emb_data/type_vocab.json. Are these all the types that are supported for NED?

The reason for asking is that I am considering using bootleg to recognize and disambiguate all entities in running text that are subclasses or instances of computer science terms or technical terms. These two types are, however, not in the vocabulary. Do you think it would be feasible to extend the vocabulary with these kinds of types? Would this need additional training?

Thank you for your time.

Kind regards,
Cas

Answer gets significantly wrong when input is long

When I run the following code on long articles:

from bootleg.end2end.bootleg_annotator import BootlegAnnotator
ann = BootlegAnnotator()
result = ann.label_mentions(text)

The results get significantly worse as the article gets longer. Is this expected?
You can use the text below to test:
And yet, 100 years from now, Obama’s presidency will be hailed as the most transformative of our lifetimes, and Donald Trump’s will be viewed with the same scorn that followed the Dred Scott decision. Like that pre-Civil War Supreme Court case, Trump will forever be condemned as a racial reprobate whose words and actions inspired white supremacists and neo-Nazis.\nLast week’s slaughter of Muslims in New Zealand was allegedly committed by a fascist who claimed to draw inspiration from President Trump, among others. It was the latest in a long line of tragedies that our president failed to clearly condemn. After the 2017 riots in Charlottesville, Trump proclaimed a moral equivalence between neo-Nazis and their opponents. Following the killings in Christchurch, the president dismissed the threat of white supremacy while claiming the rising tide of violence coming from the far right was limited to a few troublemakers with “very serious problems.”\nTrump’s acting chief of staff appeared on national television to declare that “the president is not a white supremacist.” White House counselor Kellyanne Conway once again shamed herself by dismissing the fascist mass murderer from Australia as an “eco-terrorist.” The president’s apologists denied that the current commander in chief was inspiring right-wing violence. But the Center for Strategic and International Studies reported in November that far-right attacks rose in Europe by 43 percent since 2016, while right-wing terrorist attacks have quadrupled in the United States over the same time. Hate crimes rose 17 percent in 2017.\nThis troubling chapter in U.S. history has one author — and his name is Donald Trump. This sputtering reality star has created a political identity and corrupt presidency inspired by the wave of racism that followed Obama’s. 
The Manhattan multimillionaire’s 2016 calls for a Muslim ban and creation of a Muslim registry; the claim of ignorance toward former Ku Klux Klan leader David Duke; the attack on a Hispanic judge’s integrity; the callousness shown toward a Muslim Gold Star family; and the anti-Semitic tweet featuring a Star of David and piles of $100 bills next to Hillary Clinton’s face. These are just a few of the racially charged offenses that Trump committed even before Americans elected him president.\nThe shocking conclusion to the 2016 campaign made millions of Americans, including me, look foolish for believing that Obama’s victories in 2008 and 2012 had proved that the United States had emerged from the scourge of racism infecting it for more than four centuries. I remain shocked that this strain of bigotry still fuels the political careers of Trump and his enablers on Capitol Hill.\nThe Rev. Martin Luther King Jr. often said that “the arc of the moral universe is long, but it bends toward justice.” That is but one reason why the rise in bigotry shown to Muslims, Jews, Hispanics, blacks and “others” has been so discouraging in the age of Trump. Like those who believed these racists were relics from a bygone age, I had also convinced myself that my Republican colleagues were so repelled by racism that they would never support a leader who provided inspiration to neo-Nazis and white supremacists; the manifesto of the New Zealand killer and the words of David Duke after Charlottesville showed just how wrong I was.\nThat’s why any policy differences I had with Obama now seem so insignificant. Americans who still have faith in the upward arc of King’s moral universe should be grateful for Obama’s presidency and the way his election exposed the white racism that is still at large in our land. 
If changing the Constitution and reelecting Obama two more times would break the fever that now ravages Trump’s Washington, I would cheerfully champion the passage of that constitutional amendment, slap a “Hope and Change” sticker on my shirt, and race to the nearest voting booth to support the man historians will remember as the most significant president since Abraham Lincoln.\nRead more:
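A possible workaround to try, assuming the degradation comes from the model's input window (a config elsewhere on this page sets max_seq_len: 128), is to annotate the article in smaller pieces. The sketch below only does the splitting; chunk_text and the word budget are hypothetical, the sentence splitter is deliberately naive, and whether chunking restores accuracy has not been verified.

```python
import re


def chunk_text(text, max_words=100):
    """Group sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Obama was president. Trump followed him. Lincoln came much earlier."
for chunk in chunk_text(sample, max_words=6):
    print(chunk)
```

Each chunk could then be passed to ann.label_mentions(chunk) as in the snippet from the issue. If the results include character or token offsets, they would be relative to the chunk rather than to the full article.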

Batch processing on label_mentions is not working

Hi @lorr1 ,

Is it possible to have batch processing for the label_mentions method? It currently takes a list of sentences, but internally I can see it processes one sentence at a time. Could this be done batchwise to speed up the process?
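Until batching is handled internally, callers can at least group their inputs themselves. The sketch below is a generic batching helper; annotate_batch is a hypothetical stand-in for whatever call accepts a list of sentences, and nothing here changes how label_mentions processes that list internally.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def annotate_all(sentences, annotate_batch, batch_size=32):
    """Call annotate_batch (hypothetical) once per batch and collect results."""
    results = []
    for batch in batched(sentences, batch_size):
        results.extend(annotate_batch(batch))
    return results


# Stand-in annotator that just upper-cases each sentence.
print(annotate_all(["a b", "c d", "e f"], lambda b: [s.upper() for s in b], batch_size=2))
```

A real speed-up would still depend on the model running a whole batch through one forward pass, which is what this issue asks for.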

Installation error

Hi,

I am trying to install the latest Bootleg via

git clone git@github.com:HazyResearch/bootleg bootleg
cd bootleg
python3 setup.py install

After that, when I run the annotation-on-the-fly example, it fails because many packages are missing, e.g., urllib3, idna, charset_normalizer, etc.

Could you please check the current installation guide?
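While the guide is being checked, a quick way to see which of the reported packages are actually missing from the current environment is to probe for them without importing them. This is a minimal sketch using only the standard library; the module list is just the three names from this report, and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Top-level import names reported missing in this issue.
REPORTED = ["urllib3", "idna", "charset_normalizer"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REPORTED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All reported modules are importable.")
```

Anything listed as missing can then be installed with pip before retrying the example.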

Thanks a lot.

Is there any way to replace the current NER?

Hi,
Thanks a lot for the project. It is indeed wonderful.

However, I would like to replace the NER engine. I want to use Flair instead of spaCy.

Can I do that?

Can't handle unknown aliases at test time

Hi,

Thanks for your work.
I found that one potential problem with the current pipeline is that it can't handle unknown aliases at test time. As shown in this code,

bootleg/bootleg/datasets/dataset.py

Line 686 in 2189b60

assert entitysymbols.alias_exists(

it assumes the alias is in the alias map.

However, in practice, we may already have a test set with detected mentions and want to run bootleg over it in dump_preds mode. It is then very likely to encounter unknown aliases. Could you please consider replacing the assertion with a PAD alias so that the run can proceed?
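The requested behaviour could look roughly like the sketch below, which falls back to a padding alias instead of asserting. The PAD_ALIAS value and both function names are hypothetical; this is not Bootleg's code, and the real change would have to live where the assertion in dataset.py currently sits.

```python
PAD_ALIAS = "_pad_"  # hypothetical placeholder name, not taken from Bootleg


def resolve_alias(alias, known_aliases, pad_alias=PAD_ALIAS):
    """Return alias if it is known, else the padding alias (no assertion)."""
    return alias if alias in known_aliases else pad_alias


def resolve_all(aliases, known_aliases):
    """Map each alias and report which ones fell back to padding."""
    resolved = [resolve_alias(a, known_aliases) for a in aliases]
    unknown = [a for a in aliases if a not in known_aliases]
    return resolved, unknown


known = {"lincoln", "bob dylan"}
resolved, unknown = resolve_all(["lincoln", "desire"], known)
print(resolved, unknown)
```

Returning the list of unknown aliases alongside the result keeps the fallback visible, so silently padded mentions can still be logged or counted.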

Consider Benchmarking Bootleg?

Hi,

Thanks for your great work!
I saw that Bootleg has changed its architecture to a bi-encoder.
I am curious about the following:

How does the latest Bootleg model perform in comparison with the initial Bootleg model?
How does Bootleg perform on popular benchmark datasets (e.g., the datasets used in GENRE)?

Are you planning to add the official evaluation results of Bootleg on these benchmarks?

Thanks!

Details about the development set

Hi,

Thanks for your great work!
I want to conduct some analysis for Bootleg using the Wikipedia development set you used in the paper.
I downloaded the wiki data using this script. However, I didn't find a readme explaining each file's usage. It would be great if you could explain a bit more about them.

Specifically, in Section B.1 of the paper, I saw there are a total of 5.7M sentences.
In the downloaded data:
train.jsonl consists of 51056341 sentences, dev.jsonl consists of 4859374 sentences, test.jsonl consists of 4880947 sentences.
The total number of sentences is 60,796,662. Has this data been updated?

Do train.jsonl, dev.jsonl and test.jsonl follow an 80 / 10 / 10 split with respect to the number of mentions?
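For reference, the sentence counts quoted above can be turned into split fractions directly. Measured by sentences the files come out at roughly 84 / 8 / 8; this says nothing about the split by mentions, which would require counting mentions in each file.

```python
# Sentence counts quoted in the issue.
counts = {"train": 51_056_341, "dev": 4_859_374, "test": 4_880_947}

total = sum(counts.values())
fractions = {name: n / total for name, n in counts.items()}

print(f"total sentences: {total:,}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```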

Also, what is the meaning of merged.jsonl and merged_sample.jsonl?

Also, I saw there are three Bootleg models: two on full Wikipedia data (one for ablation and one for the benchmarks) and one on micro data. Is the released model the ablation model trained with the 80 / 10 / 10 partition (rather than the 96 / 2 / 2 split)?

What do the various slices mean in the jsonl data? I saw that the documentation explains the 5 unif_* slices; what do the remaining ones mean?

Thanks very much for your help and time!

AssertionError: The last row of the alias table isn't -1, something wasn't loaded right.

While running the notebook end2end_ned_tutorial on a MacBook (16 GB 2400 MHz DDR4), it throws this error. The data (entity_db and nq) downloaded correctly. Please let me know why this may be happening.

Installation guide is insufficient

I tried working in Google Colab and followed the instructions in the installation guide. It gave me a ModuleNotFoundError for emmental and huggingface-hub. I installed them via pip. However, it then threw a new error:

---------------------------------------------------------------------------
ZipImportError                            Traceback (most recent call last)
<ipython-input-7-2e0c8f4f74b8> in <cell line: 1>()
----> 1 from bootleg.end2end.bootleg_annotator import BootlegAnnotator
      2 ann = BootlegAnnotator()
      3 ann.label_mentions("How many people are in Lincoln")["titles"]

1 frames
/usr/local/lib/python3.10/dist-packages/bootleg-1.1.1.dev0-py3.10.egg/bootleg/dataset.py in <module>
     18 
     19 from bootleg import log_rank_0_debug, log_rank_0_info
---> 20 from bootleg.layers.alias_to_ent_encoder import AliasEntityTable
     21 from bootleg.symbols.constants import ANCHOR_KEY, PAD_ID, STOP_WORDS
     22 from bootleg.symbols.entity_symbols import EntitySymbols

<frozen zipimport> in get_code(self, fullname)

<frozen zipimport> in _get_module_code(self, fullname)

<frozen zipimport> in _get_data(archive, toc_entry)

ZipImportError: bad local file header: '/usr/local/lib/python3.10/dist-packages/bootleg-1.1.1.dev0-py3.10.egg'

Can you fix the repo and the installation guide? Maybe adding a requirements.txt would be useful.

Do you update the knowledge graph periodically?

Hi @lorr1 / @chanind / @xiaoling / @stephenbach / @senwu / @vincentschen,

When you push a new version of bootleg, do you also publish a new dataset (the KG dataset, to be precise)?

Basically, how frequently do you update the entity_db file? Do you scrape Wikidata periodically and append to the entity_db?

Version comparison between bootleg 1.0.0 and bootleg 1.1.0

Dear author,
First of all, thanks for developing and open sourcing this amazing tool!

I have a quick question about the difference between bootleg 1.0.0 and 1.1.0.
Is bootleg 1.1.0 strictly better than 1.0.0, in that it achieves at least the same accuracy with better performance? Some concrete numbers would be really helpful.

Also, in the change log, I noticed that:

bootleg 1.0.0 You will need at least 130 GB of disk space, 12 GB of GPU memory, and 40 GB of CPU memory to run our model.
bootleg 1.1.0 You will need at least XXX GB of disk space, 12 GB of GPU memory, and XXX GB of CPU memory to run our model.

Could you provide the numbers for XXX, or have they not been measured yet? Also, I was wondering why bootleg 1.1.0 has the same GPU memory requirement as bootleg 1.0.0. (From my understanding, the huge entity embedding matrix doesn't need to be stored in memory.) My guess is that in bootleg 1.1.0 the entity embeddings generated by the BERT-based entity encoder are also stored in memory?
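As a rough back-of-envelope for how much memory a dense entity embedding matrix would take, the sketch below multiplies entity count by embedding width. The entity count (5,310,039) comes from a log line in another issue on this page and the width (200) from a hidden_size setting in a config on this page; neither is confirmed to match bootleg 1.0.0 or 1.1.0, and 4-byte float32 storage is an assumption.

```python
def embedding_matrix_gib(num_entities, dim, bytes_per_value=4):
    """Size in GiB of a dense num_entities x dim matrix."""
    return num_entities * dim * bytes_per_value / 2**30


# Assumed figures, see the caveats above.
size = embedding_matrix_gib(5_310_039, 200)
print(f"{size:.2f} GiB")
```

Under these assumptions the matrix alone is a few GiB, so it could account for a sizeable share of a 12 GB budget if it is kept on the GPU; only the maintainers can say whether that is what happens.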

Any comments and feedback are appreciated!

Conda installation instructions result in ResolvePackageNotFound

Output is as follows:

$ conda env create --name bootleg_dev --file conda_requirements.yml 
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound: 
  - torch[version='1.6.0.*,1.6.0.*']
  - tagme=0.1.3
  - transformers=3.0.1
  - marisa-trie=3.1.1

Conda version: 4.9.2
Python version (Anaconda): 3.8.5
CUDA version: 10.2
Output of nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

Installing using pip install -r requirements.txt also results in the following error message, but it seems to build everything, so python setup.py develop completes without error.

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

python-language-server 0.35.1 requires ujson>=3.0.0; python_version > "3", but you'll have ujson 1.35 which is incompatible.
python-jsonrpc-server 0.4.0 requires ujson>=3.0.0, but you'll have ujson 1.35 which is incompatible.

Annotations using the entity_emb_file parameter are fast but less accurate

Hi,
Without the entity_emb_file parameter, the model takes too long on articles of about 1000 words.

With the entity_emb_file parameter, inference time drops significantly, but accuracy takes a big hit. It is not able to identify the entities as well.
I am using this code:

annotator_object = BootlegAnnotator(
    model_name="bootleg_uncased",
    device=-1,
    cache_dir="./cache",
    entity_emb_file="./cache/entity_embeddings.npy",
    extract_method="custom",
)

One point to mention: I am using a custom NER (Flair), which is merged in the custom_module_extractor branch.

Is there something I am missing here?

Maybe a bug in 'bootleg_annotator.py'

Hi, I found a bug in the path used to load the entitydb data when I use this tool ;)
In the function 'create_config()' of 'bootleg_annotator.py':
config_args["data_config"]["alias_cand_map"] = "alias2qids.json"
should be:
config_args["data_config"]["alias_cand_map"] = "alias2qids"

Bug in Annotator.py

Hi,

I am just going through the tutorial notebooks.

I got an IndentationError, and it seems to me these lines are not indented properly.

https://github.com/HazyResearch/bootleg/blob/master/bootleg/annotator.py#L107-L111

Entity embedding training is not using GPU on Google Colab Pro+

Hi @lorr1,

The entity_embedding_tutorial file is not using the GPU on Google Colab, even though a GPU is available.

I am using Google Colab Pro+.
Colab Pro+ has 51 GB RAM.

"Building entity data from scratch": this step fails every time with an out-of-memory (RAM) error. It is supposed to use the GPU, right?

Here are the config parameters:

data_config:
  context_mask_perc: 0.0
  data_dir: /content/bootleg/tutorials/data
  data_prep_dir: prep
  dev_dataset:
    file: merged_sample.jsonl
    use_weak_label: true
  entity_dir: /content/bootleg/tutorials/data/entity_db
  entity_kg_data:
    kg_symbols_dir: kg_mappings
    use_entity_kg: true
  entity_type_data:
    type_symbols_dir: type_mappings/wiki
    use_entity_types: true
  eval_slices:
  - unif_all
  - unif_NS_all
  - unif_HD
  - unif_TO
  - unif_TL
  - unif_TS
  max_ent_len: 128
  max_seq_len: 128
  max_seq_window_len: 64
  overwrite_preprocessed_data: false
  test_dataset:
    file: merged_sample.jsonl
    use_weak_label: true
  train_dataset:
    file: train.jsonl
    use_weak_label: true
  train_in_candidates: true
  use_entity_desc: true
  word_embedding:
    bert_model: bert-base-uncased
    cache_dir: /content/bootleg/tutorials/data/pretrained_bert_models
    context_layers: 6
    entity_layers: 6
emmental:
  checkpoint_all: true
  checkpoint_freq: 1
  checkpoint_metric: NED/Bootleg/dev/final_loss/acc_boot:max
  checkpointing: true
  clear_intermediate_checkpoints: false
  counter_unit: batch
  evaluation_freq: 21432
  fp16: true
  grad_clip: 1.0
  gradient_accumulation_steps: 1
  l2: 0.01
  log_path: /content/bootleg/tutorials/data/bootleg_wiki
  lr: 2e-5
  lr_scheduler: linear
  n_steps: 428648
  online_eval: false
  dataparallel: false
  use_exact_log_path: true
  warmup_percentage: 0.1
  write_loss_per_step: true
  writer: json
model_config:
  hidden_size: 200
  normalize: true
  temperature: 0.10
run_config:
  #dataloader_threads: 2
  dataset_threads: 20
  eval_batch_size: 32
  log_level: DEBUG
  spawn_method: forkserver
train_config:
  batch_size: 32

Languages Supported

Hey Guys! EPIC Project🥇

Consider this a suggestion for improvement rather than a bug...

It would be nice if you could state somewhere in your documentation/readme which languages are supported by this package (even if training and testing would still be required of anyone wishing to add a language). This is important for anyone coming from low-resource languages such as Hebrew or Arabic. It's hard to know whether a package has language-specific requirements that cannot be met for a given language.

Thanks.

model_eval hangs

Hello, I'm working in Colab with a GPU (Tesla T4), following the end-to-end tutorial https://github.com/HazyResearch/bootleg/blob/252efc64a0a830b6fdd41a716a7abc00bb0dbc5b/tutorials/end2end_ned_tutorial.ipynb. When I get to

bootleg_label_file, bootleg_emb_file = run.model_eval(args=config_args, mode="dump_embs", logger=logger, is_writer=True)

the program hangs on

2021-02-01 20:44:09,465 Loading entity_symbols...
2021-02-01 20:45:11,355 Loaded entity_symbols with 5310039 entities.
2021-02-01 20:45:12,633 Loading slices...

In previous attempts to run the same code using only the CPU, the Colab notebook ran overnight and was stuck at the same point in the code.

Thank you

Question about the train slices, eval slices and train heads

Hi,

Thanks for your great work.
When reading the code, I found it a bit difficult to understand the design logic behind the following concepts: train heads, train slices and eval slices.
I found the main setup code and comments in this section.

bootleg/bootleg/utils/train_utils.py

Line 114 in 252efc6

    
           # We have eval slices, train slices, and train heads. The train heads are used in the slice heads module.

Could you please elaborate on:

What are train heads, train slices and eval slices?
There seem to be two slice_method options (Normal and SBL); what do they stand for?
Could you please explain more about the taxonomy and why it is designed like that?

If SBL (has base head loss):
    train_heads must include BASE_SLICE and not include FINAL_LOSS
    eval_slices must include BASE_SLICE and FINAL_LOSS
If Normal:
    train_heads can include those from the args (these train heads will be used to evaluate filter/ranking capabilities)
    eval_slices must include FINAL_LOSS and can include BASE_SLICE
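Read literally, the quoted rules can be written as a small validation function. The sketch below is only a restatement of those four lines; the string values given to BASE_SLICE and FINAL_LOSS are placeholders, check_slice_config is a hypothetical name, and none of it explains the design rationale the issue asks about.

```python
BASE_SLICE = "base"        # placeholder values; the real constants
FINAL_LOSS = "final_loss"  # are defined in the Bootleg source


def check_slice_config(slice_method, train_heads, eval_slices):
    """Return a list of violations of the quoted constraints."""
    problems = []
    if slice_method == "SBL":
        if BASE_SLICE not in train_heads:
            problems.append("SBL: train_heads must include BASE_SLICE")
        if FINAL_LOSS in train_heads:
            problems.append("SBL: train_heads must not include FINAL_LOSS")
        for required in (BASE_SLICE, FINAL_LOSS):
            if required not in eval_slices:
                problems.append(f"SBL: eval_slices must include {required}")
    elif slice_method == "Normal":
        # BASE_SLICE is optional here; train_heads come from the args.
        if FINAL_LOSS not in eval_slices:
            problems.append("Normal: eval_slices must include FINAL_LOSS")
    else:
        problems.append(f"unknown slice_method: {slice_method}")
    return problems


print(check_slice_config("SBL", ["base"], ["base", "final_loss"]))
print(check_slice_config("Normal", ["unif_all"], ["base"]))
```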

Thanks a lot.

No such file

I installed bootleg with
git clone git@github.com:HazyResearch/bootleg bootleg
cd bootleg
python3 setup.py install
but when I run ann = BootlegAnnotator(), it reports

[Errno 2] No such file or directory: 'bootleg\\data\\entity_db\\entity_mappings\\alias2qids.json'
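A quick pre-flight check can confirm whether the files the annotator expects are present before constructing it. The sketch below uses only the standard library; the relative path is the one from the error message (written with forward slashes), missing_paths is a hypothetical helper, and the temporary directory merely stands in for the real data directory.

```python
from pathlib import Path
import tempfile


def missing_paths(root, relative_paths):
    """Return the relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in relative_paths if not (root / rel).exists()]


# Path taken from the error message above.
REQUIRED = ["entity_db/entity_mappings/alias2qids.json"]

with tempfile.TemporaryDirectory() as tmp:
    print(missing_paths(tmp, REQUIRED))  # file not present yet
    target = Path(tmp) / REQUIRED[0]
    target.parent.mkdir(parents=True)
    target.write_text("{}")
    print(missing_paths(tmp, REQUIRED))  # file now present
```

Pointing the same check at the actual data directory shows whether the entity_db download is incomplete or simply located somewhere other than where the annotator looks.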

Published PyPI module is out of date

The most recent version of bootleg on PyPI is version 1.0.5 from August, 2021. However, this project has changed a lot since then, including a large architecture change (according to the readme). It also appears that the version of bootleg on PyPI no longer works with the latest models and data for this project. Is it possible to update the PyPI release of this project to use the most recent code from this repo?

bug of example

from bootleg.end2end.bootleg_annotator import BootlegAnnotator

ann = BootlegAnnotator()
t = ann.label_mentions("Bob Dylan release Desire")["titles"]

The bug is:
usage: process_STS_dev.py [-h] [--emmental.seed EMMENTAL.SEED]
[--emmental.verbose EMMENTAL.VERBOSE]
[--emmental.log_path EMMENTAL.LOG_PATH]
[--emmental.use_exact_log_path EMMENTAL.USE_EXACT_LOG_PATH]
[--emmental.min_data_len EMMENTAL.MIN_DATA_LEN]
[--emmental.max_data_len EMMENTAL.MAX_DATA_LEN]
[--emmental.model_path EMMENTAL.MODEL_PATH]
[--emmental.device EMMENTAL.DEVICE]
[--emmental.dataparallel EMMENTAL.DATAPARALLEL]
[--emmental.distributed_backend {nccl,gloo}]
[--emmental.fp16 EMMENTAL.FP16]
[--emmental.fp16_opt_level EMMENTAL.FP16_OPT_LEVEL]
[--emmental.local_rank EMMENTAL.LOCAL_RANK]
[--emmental.n_epochs EMMENTAL.N_EPOCHS]
[--emmental.train_split EMMENTAL.TRAIN_SPLIT [EMMENTAL.TRAIN_SPLIT ...]]
[--emmental.valid_split EMMENTAL.VALID_SPLIT [EMMENTAL.VALID_SPLIT ...]]
[--emmental.test_split EMMENTAL.TEST_SPLIT [EMMENTAL.TEST_SPLIT ...]]
[--emmental.ignore_index EMMENTAL.IGNORE_INDEX]
[--emmental.online_eval EMMENTAL.ONLINE_EVAL]
[--emmental.optimizer {asgd,adadelta,adagrad,adam,adamw,adamax,lbfgs,rms_prop,r_prop,sgd,sparse_adam,bert_adam,None}]
[--emmental.lr EMMENTAL.LR]
[--emmental.l2 EMMENTAL.L2]
[--emmental.grad_clip EMMENTAL.GRAD_CLIP]
[--emmental.gradient_accumulation_steps EMMENTAL.GRADIENT_ACCUMULATION_STEPS]
[--emmental.asgd_lambd EMMENTAL.ASGD_LAMBD]
[--emmental.asgd_alpha EMMENTAL.ASGD_ALPHA]
[--emmental.asgd_t0 EMMENTAL.ASGD_T0]
[--emmental.adadelta_rho EMMENTAL.ADADELTA_RHO]
[--emmental.adadelta_eps EMMENTAL.ADADELTA_EPS]
[--emmental.adagrad_lr_decay EMMENTAL.ADAGRAD_LR_DECAY]
[--emmental.adagrad_initial_accumulator_value EMMENTAL.ADAGRAD_INITIAL_ACCUMULATOR_VALUE]
[--emmental.adagrad_eps EMMENTAL.ADAGRAD_EPS]
[--emmental.adam_betas EMMENTAL.ADAM_BETAS [EMMENTAL.ADAM_BETAS ...]]
[--emmental.adam_eps EMMENTAL.ADAM_EPS]
[--emmental.adam_amsgrad EMMENTAL.ADAM_AMSGRAD]
[--emmental.adamw_betas EMMENTAL.ADAMW_BETAS [EMMENTAL.ADAMW_BETAS ...]]
[--emmental.adamw_eps EMMENTAL.ADAMW_EPS]
[--emmental.adamw_amsgrad EMMENTAL.ADAMW_AMSGRAD]
[--emmental.adamax_betas EMMENTAL.ADAMAX_BETAS [EMMENTAL.ADAMAX_BETAS ...]]
[--emmental.adamax_eps EMMENTAL.ADAMAX_EPS]
[--emmental.lbfgs_max_iter EMMENTAL.LBFGS_MAX_ITER]
[--emmental.lbfgs_max_eval EMMENTAL.LBFGS_MAX_EVAL]
[--emmental.lbfgs_tolerance_grad EMMENTAL.LBFGS_TOLERANCE_GRAD]
[--emmental.lbfgs_tolerance_change EMMENTAL.LBFGS_TOLERANCE_CHANGE]
[--emmental.lbfgs_history_size EMMENTAL.LBFGS_HISTORY_SIZE]
[--emmental.lbfgs_line_search_fn EMMENTAL.LBFGS_LINE_SEARCH_FN]
[--emmental.rms_prop_alpha EMMENTAL.RMS_PROP_ALPHA]
[--emmental.rms_prop_eps EMMENTAL.RMS_PROP_EPS]
[--emmental.rms_prop_momentum EMMENTAL.RMS_PROP_MOMENTUM]
[--emmental.rms_prop_centered EMMENTAL.RMS_PROP_CENTERED]
[--emmental.r_prop_etas EMMENTAL.R_PROP_ETAS [EMMENTAL.R_PROP_ETAS ...]]
[--emmental.r_prop_step_sizes EMMENTAL.R_PROP_STEP_SIZES [EMMENTAL.R_PROP_STEP_SIZES ...]]
[--emmental.sgd_momentum EMMENTAL.SGD_MOMENTUM]
[--emmental.sgd_dampening EMMENTAL.SGD_DAMPENING]
[--emmental.sgd_nesterov EMMENTAL.SGD_NESTEROV]
[--emmental.sparse_adam_betas EMMENTAL.SPARSE_ADAM_BETAS [EMMENTAL.SPARSE_ADAM_BETAS ...]]
[--emmental.sparse_adam_eps EMMENTAL.SPARSE_ADAM_EPS]
[--emmental.bert_adam_betas EMMENTAL.BERT_ADAM_BETAS [EMMENTAL.BERT_ADAM_BETAS ...]]
[--emmental.bert_adam_eps EMMENTAL.BERT_ADAM_EPS]
[--emmental.lr_scheduler {linear,exponential,plateau,step,multi_step,cyclic,one_cycle,cosine_annealing}]
[--emmental.lr_scheduler_step_unit {batch,epoch}]
[--emmental.lr_scheduler_step_freq EMMENTAL.LR_SCHEDULER_STEP_FREQ]
[--emmental.warmup_steps EMMENTAL.WARMUP_STEPS]
[--emmental.warmup_unit {batch,epoch}]
[--emmental.warmup_percentage EMMENTAL.WARMUP_PERCENTAGE]
[--emmental.min_lr EMMENTAL.MIN_LR]
[--emmental.reset_state EMMENTAL.RESET_STATE]
[--emmental.exponential_lr_scheduler_gamma EMMENTAL.EXPONENTIAL_LR_SCHEDULER_GAMMA]
[--emmental.plateau_lr_scheduler_metric EMMENTAL.PLATEAU_LR_SCHEDULER_METRIC]
[--emmental.plateau_lr_scheduler_mode {min,max}]
[--emmental.plateau_lr_scheduler_factor EMMENTAL.PLATEAU_LR_SCHEDULER_FACTOR]
[--emmental.plateau_lr_scheduler_patience EMMENTAL.PLATEAU_LR_SCHEDULER_PATIENCE]
[--emmental.plateau_lr_scheduler_threshold EMMENTAL.PLATEAU_LR_SCHEDULER_THRESHOLD]
[--emmental.plateau_lr_scheduler_threshold_mode {rel,abs}]
[--emmental.plateau_lr_scheduler_cooldown EMMENTAL.PLATEAU_LR_SCHEDULER_COOLDOWN]
[--emmental.plateau_lr_scheduler_eps EMMENTAL.PLATEAU_LR_SCHEDULER_EPS]
[--emmental.step_lr_scheduler_step_size EMMENTAL.STEP_LR_SCHEDULER_STEP_SIZE]
[--emmental.step_lr_scheduler_gamma EMMENTAL.STEP_LR_SCHEDULER_GAMMA]
[--emmental.step_lr_scheduler_last_epoch EMMENTAL.STEP_LR_SCHEDULER_LAST_EPOCH]
[--emmental.multi_step_lr_scheduler_milestones EMMENTAL.MULTI_STEP_LR_SCHEDULER_MILESTONES [EMMENTAL.MULTI_STEP_LR_SCHEDULER_MILESTONES ...]]
[--emmental.multi_step_lr_scheduler_gamma EMMENTAL.MULTI_STEP_LR_SCHEDULER_GAMMA]
[--emmental.multi_step_lr_scheduler_last_epoch EMMENTAL.MULTI_STEP_LR_SCHEDULER_LAST_EPOCH]
[--emmental.cyclic_lr_scheduler_base_lr EMMENTAL.CYCLIC_LR_SCHEDULER_BASE_LR [EMMENTAL.CYCLIC_LR_SCHEDULER_BASE_LR ...]]
[--emmental.cyclic_lr_scheduler_max_lr EMMENTAL.CYCLIC_LR_SCHEDULER_MAX_LR [EMMENTAL.CYCLIC_LR_SCHEDULER_MAX_LR ...]]
[--emmental.cyclic_lr_scheduler_step_size_up EMMENTAL.CYCLIC_LR_SCHEDULER_STEP_SIZE_UP]
[--emmental.cyclic_lr_scheduler_step_size_down EMMENTAL.CYCLIC_LR_SCHEDULER_STEP_SIZE_DOWN]
[--emmental.cyclic_lr_scheduler_mode EMMENTAL.CYCLIC_LR_SCHEDULER_MODE]
[--emmental.cyclic_lr_scheduler_gamma EMMENTAL.CYCLIC_LR_SCHEDULER_GAMMA]
[--emmental.cyclic_lr_scheduler_scale_mode {cycle,iterations}]
[--emmental.cyclic_lr_scheduler_cycle_momentum EMMENTAL.CYCLIC_LR_SCHEDULER_CYCLE_MOMENTUM]
[--emmental.cyclic_lr_scheduler_base_momentum EMMENTAL.CYCLIC_LR_SCHEDULER_BASE_MOMENTUM [EMMENTAL.CYCLIC_LR_SCHEDULER_BASE_MOMENTUM ...]]
[--emmental.cyclic_lr_scheduler_max_momentum EMMENTAL.CYCLIC_LR_SCHEDULER_MAX_MOMENTUM [EMMENTAL.CYCLIC_LR_SCHEDULER_MAX_MOMENTUM ...]]
[--emmental.cyclic_lr_scheduler_last_epoch EMMENTAL.CYCLIC_LR_SCHEDULER_LAST_EPOCH]
[--emmental.one_cycle_lr_scheduler_max_lr EMMENTAL.ONE_CYCLE_LR_SCHEDULER_MAX_LR [EMMENTAL.ONE_CYCLE_LR_SCHEDULER_MAX_LR ...]]
[--emmental.one_cycle_lr_scheduler_pct_start EMMENTAL.ONE_CYCLE_LR_SCHEDULER_PCT_START]
[--emmental.one_cycle_lr_scheduler_anneal_strategy {cos,linear}]
[--emmental.one_cycle_lr_scheduler_cycle_momentum EMMENTAL.ONE_CYCLE_LR_SCHEDULER_CYCLE_MOMENTUM]
[--emmental.one_cycle_lr_scheduler_base_momentum EMMENTAL.ONE_CYCLE_LR_SCHEDULER_BASE_MOMENTUM [EMMENTAL.ONE_CYCLE_LR_SCHEDULER_BASE_MOMENTUM ...]]
[--emmental.one_cycle_lr_scheduler_max_momentum EMMENTAL.ONE_CYCLE_LR_SCHEDULER_MAX_MOMENTUM [EMMENTAL.ONE_CYCLE_LR_SCHEDULER_MAX_MOMENTUM ...]]
[--emmental.one_cycle_lr_scheduler_div_factor EMMENTAL.ONE_CYCLE_LR_SCHEDULER_DIV_FACTOR]
[--emmental.one_cycle_lr_scheduler_final_div_factor EMMENTAL.ONE_CYCLE_LR_SCHEDULER_FINAL_DIV_FACTOR]
[--emmental.one_cycle_lr_scheduler_last_epoch EMMENTAL.ONE_CYCLE_LR_SCHEDULER_LAST_EPOCH]
[--emmental.cosine_annealing_lr_scheduler_last_epoch EMMENTAL.COSINE_ANNEALING_LR_SCHEDULER_LAST_EPOCH]
[--emmental.task_scheduler EMMENTAL.TASK_SCHEDULER]
[--emmental.sequential_scheduler_fillup EMMENTAL.SEQUENTIAL_SCHEDULER_FILLUP]
[--emmental.round_robin_scheduler_fillup EMMENTAL.ROUND_ROBIN_SCHEDULER_FILLUP]
[--emmental.mixed_scheduler_fillup EMMENTAL.MIXED_SCHEDULER_FILLUP]
[--emmental.counter_unit {epoch,batch}]
[--emmental.evaluation_freq EMMENTAL.EVALUATION_FREQ]
[--emmental.writer {json,tensorboard}]
[--emmental.checkpointing EMMENTAL.CHECKPOINTING]
[--emmental.checkpoint_path EMMENTAL.CHECKPOINT_PATH]
[--emmental.checkpoint_freq EMMENTAL.CHECKPOINT_FREQ]
[--emmental.checkpoint_metric EMMENTAL.CHECKPOINT_METRIC]
[--emmental.checkpoint_task_metrics EMMENTAL.CHECKPOINT_TASK_METRICS]
[--emmental.checkpoint_runway EMMENTAL.CHECKPOINT_RUNWAY]
[--emmental.checkpoint_all EMMENTAL.CHECKPOINT_ALL]
[--emmental.clear_intermediate_checkpoints EMMENTAL.CLEAR_INTERMEDIATE_CHECKPOINTS]
[--emmental.clear_all_checkpoints EMMENTAL.CLEAR_ALL_CHECKPOINTS]
[--run_config.spawn_method RUN_CONFIG.SPAWN_METHOD]
[--run_config.eval_batch_size RUN_CONFIG.EVAL_BATCH_SIZE]
[--run_config.dataloader_threads RUN_CONFIG.DATALOADER_THREADS]
[--run_config.log_level RUN_CONFIG.LOG_LEVEL]
[--run_config.dataset_threads RUN_CONFIG.DATASET_THREADS]
[--run_config.result_label_file RUN_CONFIG.RESULT_LABEL_FILE]
[--run_config.result_emb_file RUN_CONFIG.RESULT_EMB_FILE]
[--train_config.dropout TRAIN_CONFIG.DROPOUT]
[--train_config.batch_size TRAIN_CONFIG.BATCH_SIZE]
[--model_config.attn_class MODEL_CONFIG.ATTN_CLASS]
[--model_config.hidden_size MODEL_CONFIG.HIDDEN_SIZE]
[--model_config.num_heads MODEL_CONFIG.NUM_HEADS]
[--model_config.ff_inner_size MODEL_CONFIG.FF_INNER_SIZE]
[--model_config.num_model_stages MODEL_CONFIG.NUM_MODEL_STAGES]
[--model_config.num_fc_layers MODEL_CONFIG.NUM_FC_LAYERS]
[--model_config.custom_args MODEL_CONFIG.CUSTOM_ARGS]
[--data_config.eval_slices DATA_CONFIG.EVAL_SLICES]
[--data_config.train_in_candidates DATA_CONFIG.TRAIN_IN_CANDIDATES]
[--data_config.data_dir DATA_CONFIG.DATA_DIR]
[--data_config.data_prep_dir DATA_CONFIG.DATA_PREP_DIR]
[--data_config.entity_dir DATA_CONFIG.ENTITY_DIR]
[--data_config.entity_prep_dir DATA_CONFIG.ENTITY_PREP_DIR]
[--data_config.entity_map_dir DATA_CONFIG.ENTITY_MAP_DIR]
[--data_config.alias_cand_map DATA_CONFIG.ALIAS_CAND_MAP]
[--data_config.emb_dir DATA_CONFIG.EMB_DIR]
[--data_config.max_seq_len DATA_CONFIG.MAX_SEQ_LEN]
[--data_config.max_aliases DATA_CONFIG.MAX_ALIASES]
[--data_config.overwrite_preprocessed_data DATA_CONFIG.OVERWRITE_PREPROCESSED_DATA]
[--data_config.type_prediction.use_type_pred DATA_CONFIG.TYPE_PREDICTION.USE_TYPE_PRED]
[--data_config.type_prediction.file DATA_CONFIG.TYPE_PREDICTION.FILE]
[--data_config.type_prediction.num_types DATA_CONFIG.TYPE_PREDICTION.NUM_TYPES]
[--data_config.type_prediction.dim DATA_CONFIG.TYPE_PREDICTION.DIM]
[--data_config.train_dataset.file DATA_CONFIG.TRAIN_DATASET.FILE]
[--data_config.train_dataset.use_weak_label DATA_CONFIG.TRAIN_DATASET.USE_WEAK_LABEL]
[--data_config.dev_dataset.file DATA_CONFIG.DEV_DATASET.FILE]
[--data_config.dev_dataset.use_weak_label DATA_CONFIG.DEV_DATASET.USE_WEAK_LABEL]
[--data_config.test_dataset.file DATA_CONFIG.TEST_DATASET.FILE]
[--data_config.test_dataset.use_weak_label DATA_CONFIG.TEST_DATASET.USE_WEAK_LABEL]
[--data_config.word_embedding.bert_model DATA_CONFIG.WORD_EMBEDDING.BERT_MODEL]
[--data_config.word_embedding.use_sent_proj DATA_CONFIG.WORD_EMBEDDING.USE_SENT_PROJ]
[--data_config.word_embedding.layers DATA_CONFIG.WORD_EMBEDDING.LAYERS]
[--data_config.word_embedding.freeze DATA_CONFIG.WORD_EMBEDDING.FREEZE]
[--data_config.word_embedding.cache_dir DATA_CONFIG.WORD_EMBEDDING.CACHE_DIR]
[--data_config.ent_embeddings DATA_CONFIG.ENT_EMBEDDINGS]
process_STS_dev.py: error: unrecognized arguments: --data_config.context_mask_perc 0.0 --data_config.entity_kg_data {"kg_labels":"kg_mappings/qid2relations.json","kg_vocab":"kg_mappings/relation_vocab.json","use_entity_kg":true} --data_config.entity_type_data {"type_labels":"type_mappings/wiki/qid2typeids.json","type_vocab":"type_mappings/wiki/type_vocab.json","use_entity_types":true} --data_config.max_ent_len 128 --data_config.max_seq_window_len 64 --data_config.use_entity_desc True --data_config.word_embedding.context_layers 6 --data_config.word_embedding.entity_layers 6 --emmental.n_steps 428648 --emmental.write_loss_per_step True --model_config.normalize True --model_config.temperature 0.01

Process finished with exit code 2
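
The "unrecognized arguments" message and exit code 2 are argparse's standard response to flags the parser was not built with. This usually means the config and the installed code come from different versions, though that is not confirmed here. A minimal sketch of how to list the offending flags without exiting, using a toy parser with two flag names copied from the usage text above (this is not Bootleg's real parser, and `split_known_unknown` is a hypothetical helper):

```python
import argparse


def split_known_unknown(argv):
    """Parse argv with a toy parser; return (namespace, unrecognized args)."""
    # Toy parser: only two of the flags from the usage text are registered.
    parser = argparse.ArgumentParser(prog="toy")
    parser.add_argument("--data_config.max_seq_len", type=int, default=None)
    parser.add_argument("--data_config.max_aliases", type=int, default=None)
    # parse_known_args() collects unknown flags instead of exiting with code 2.
    return parser.parse_known_args(argv)


args, unknown = split_known_unknown(
    ["--data_config.max_seq_len", "64", "--data_config.max_ent_len", "128"]
)
print(vars(args))   # the recognised flags and their values
print(unknown)      # the flags this parser does not know about
```

Any flag that ends up in `unknown` is a key the config file contains but the parser does not define.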

Error in the end2end module

When I run the end2end module, extract_mentions() needs to load the config.json file in the entity_db folder, but the downloaded entity_db folder does not contain this file.

Traceback (most recent call last):
  File "D:/pycharm1/bootleg-master/tutorials/end2end.py", line 33, in <module>
    extract_mentions(
  File "D:\pycharm1\bootleg-master\bootleg\end2end\extract_mentions.py", line 187, in extract_mentions
    entity_symbols: EntitySymbols = EntitySymbols.load_from_cache(entity_db_dir)
  File "D:\pycharm1\bootleg-master\bootleg\symbols\entity_symbols.py", line 263, in load_from_cache
    config = utils.load_json_file(filename=os.path.join(load_dir, "config.json"))
  File "D:\pycharm1\bootleg-master\bootleg\utils\utils.py", line 85, in load_json_file
    with open(filename, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\pycharm1\bootleg-master\root_bootleg\data\entity_db\config.json'
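
The traceback shows that `load_from_cache` reads `config.json` from the directory it is given. A small pre-flight check, sketched below, can confirm what is missing before `extract_mentions` is called. `missing_entity_db_files` is a hypothetical helper, the relative path is only an example, and `config.json` is the only required file known from the traceback; other files may be needed as well.

```python
from pathlib import Path


def missing_entity_db_files(entity_db_dir, required=("config.json",)):
    """Return the names in `required` that are absent from entity_db_dir."""
    root = Path(entity_db_dir)
    return [name for name in required if not (root / name).is_file()]


# Example path only; point this at your own entity_db folder.
missing = missing_entity_db_files("root_bootleg/data/entity_db")
if missing:
    print("entity_db is missing:", ", ".join(missing))
```

If the file is reported missing, the downloaded entity_db and the checked-out code may not match.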

Sentence formatting and tokenisation

Could you provide some detail on the pre-tokenisation that appears to have been applied to the sentence text in the jsonl files? I couldn't find any documentation or code that deals with that step. Feeding in a raw paragraph produced a key error in the initial entity mention detection step, caused by an escaped newline in the text. After removing the newline it got past that step, but it then seemed to fail when merging subsentences. I tried adding spaces between various tokens, which still resulted in "AssertionError: Sent -1, Al -1".

I noted there is code to window over the "sentence"; however, it seems that anything over a certain length causes it to fail. Otherwise, the text appears to have punctuation split off and to be all lowercased, but I would like to know if there is anything else to take care of.

Could you also comment on how you expect it to behave when given longer passages of text (for predictions against your models or for future training)?

Thanks

Tony
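
For reference, the normalisation described in this question (no newlines, lowercased text, punctuation separated from words) can be sketched as follows. This only mirrors what was observed in the jsonl files and is not confirmed to be Bootleg's actual preprocessing. `normalize_text`, `chunk_words`, and the `max_words` limit are hypothetical.

```python
import re


def normalize_text(text):
    """Remove newlines, lowercase, and put spaces around punctuation."""
    text = text.replace("\\n", " ")              # literal backslash-n escapes
    text = re.sub(r"\s+", " ", text).strip().lower()
    text = re.sub(r"([^\w\s])", r" \1 ", text)   # space out punctuation marks
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text, max_words=50):
    """Split whitespace-tokenised text into chunks of at most max_words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


clean = normalize_text("Apoptosis is\na process, also called PCD.")
print(clean)
print(chunk_words(clean, max_words=5))
```

Chunking by a fixed word count is the simplest way to keep each input short; splitting on sentence boundaries would probably preserve more context per chunk.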

Static embeddings are similar

When I extract the static embeddings using the code in entity_embedding_tutorial.ipynb, I get almost the same embedding for every entity (cell 14, in embedding_as_tensor). By "almost the same" I mean that the cosine similarity between any two entity embeddings is higher than 0.99. I suspect I may be misunderstanding something.

If I want to use entity embeddings from bootleg, is this the correct way to extract them?

Any help is appreciated. Thank you.
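
One general reason for cosine similarities this close to 1 is a large component shared by every vector, which dominates the dot product. Whether that is what happens here is not confirmed. A quick diagnostic, sketched below on toy vectors rather than real Bootleg embeddings, is to subtract the mean vector and compare again. `cosine` and `center` are hypothetical helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(vectors):
    """Subtract the per-dimension mean from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]


# Toy vectors sharing a large common offset (illustrative values only).
raw = [[10.0, 10.1], [10.1, 10.0], [10.0, 10.0]]
centered = center(raw)
print(cosine(raw[0], raw[1]))            # close to 1 because of the offset
print(cosine(centered[0], centered[1]))  # the offset no longer dominates
```

If the centred similarities spread out, the embeddings still carry distinguishing information even though the raw cosine values look identical.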

The embeddings cannot be downloaded!

Question about AIDA-CoNLL data

Hi,

Is there any possibility to get your processed AIDA-CoNLL dataset used in the experiment?
Thanks!

Question about alias map

Hi,

Thanks for your great work! Super cool.

I have one question about the alias map generation.
As I have seen in the alias2qids_kore50.json and alias2qids_rss500.json files, the same alias in different examples may have different candidate sets.
e.g. in alias2qids_kore50.json, the candidate lists of "david_0" and "david_1" are not identical.

In detail:
In "david_1" but not in "david_0":
{'Q5240660', 'Q5236763', 'Q5240530', 'Q16079082', 'Q27827705', 'Q17318723', 'Q18632066', 'Q5238957', 'Q768479', 'Q3017915', 'Q5239424', 'Q20684456', 'Q5234065', 'Q5234667', 'Q5230766', 'Q10264386', 'Q672856', 'Q1174097', 'Q5240118', 'Q583264'}

In "david_0" but not in "david_1":
{'Q312649', 'Q1173922', 'Q19668637', 'Q5236091', 'Q178517', 'Q2420499', 'Q353983', 'Q24248231', 'Q5239917', 'Q336640', 'Q5241350', 'Q184903', 'Q338628', 'Q2071', 'Q1175688', 'Q41564', 'Q1173934', 'Q214601', 'Q5236705', 'Q1177021'}

So I am wondering how the candidates are generated. Are they context dependent?

By the way, could you please explain the score (e.g. ["Q8016", 5947]) associated with each candidate entity? What does it mean and how is it calculated?

Thanks a lot.
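
For anyone wanting to reproduce the comparison above, a minimal sketch follows. It assumes the alias map is a JSON object mapping each alias to a list of [QID, score] pairs, as the ["Q8016", 5947] example suggests. `candidate_diff` is a hypothetical helper, and the toy map below uses QIDs from this question with made-up scores apart from that one pair.

```python
def candidate_diff(alias_map, alias_a, alias_b):
    """Return (QIDs only under alias_a, QIDs only under alias_b)."""
    qids_a = {qid for qid, _score in alias_map[alias_a]}
    qids_b = {qid for qid, _score in alias_map[alias_b]}
    return qids_a - qids_b, qids_b - qids_a


# Toy stand-in for json.load(open("alias2qids_kore50.json")); scores are
# illustrative only.
toy_map = {
    "david_0": [["Q8016", 5947], ["Q312649", 1]],
    "david_1": [["Q8016", 5947], ["Q5240660", 1]],
}
only_0, only_1 = candidate_diff(toy_map, "david_0", "david_1")
print(sorted(only_0))
print(sorted(only_1))
```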
