microsoft / kc

Knowledge Computing group - MSRA

License: MIT License

Python 70.51% Shell 2.44% Jsonnet 0.39% CSS 7.51% SCSS 7.39% HTML 2.55% JavaScript 9.23%

kc's Introduction

This repository contains code, datasets, and links related to the Knowledge Computing (KC) group at Microsoft Research Asia (MSRA).

Our group is hiring both research interns and full-time employees! If you are interested, please take a look at:

Internship opportunities in KC (PDF);
Researcher or RSDE positions (select "China" in the left-side "Country/Region" menu).

News:

2023-Sep: The Recognizers-Text project reached over 9 million package downloads (across NuGet/npm/PyPI)!
2023-May: Three papers accepted by ACL'23, including MLKD OOD, CoLaDa, and TACR.
2022-Aug: The Recognizers-Text project reached over 5 million package downloads (across NuGet/npm/PyPI)!
2022-May: TIARA (ReTraCk v2), KC's new knowledge base question answering (KBQA) system, has reached #1 in all Generalizable Question Answering (GrailQA) evaluation categories, including Overall, Compositional Generalization, and Zero-Shot.
2022-Apr: We have now open-sourced the latest version of the LinkingPark system for automatic semantic table interpretation. This new version includes improved performance, stability, flexibility, and overall results. Contributions and collaboration are very welcome!
2022-Mar: The Recognizers-Text project reached over 4 million package downloads (across NuGet/npm/PyPI)!
2021-Jul: The Recognizers-Text project reached over 3 million package downloads (across NuGet/npm/PyPI)!
2021-May: ReTraCk has reached #1 in the Generalizable Question Answering (GrailQA) leaderboard for knowledge base QA (KBQA).
2020-Dec: The Recognizers-Text project reached over 2 million package downloads (across NuGet/npm/PyPI)!
2020-Nov: The LinkingPark system, developed in partnership between the Knowledge Computing group at MSRA and our collaborators in MSR Cambridge, has gotten 2nd place in the SemTab 2020 challenge (Semantic Web Challenge on Tabular Data to Knowledge Graph Matching)!

Recent Papers:

Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text, Qianhui Wu, Huiqiang Jiang, Haonan Yin, Börje F. Karlsson, Chin-Yew Lin, ACL 2023.
Repository: https://github.com/microsoft/KC/tree/main/papers/MLKD_OOD
CoLaDa: A Collaborative Label Denoising Framework for Cross-lingual Named Entity Recognition, Tingting Ma, Qianhui Wu, Huiqiang Jiang, Börje F. Karlsson, Tiejun Zhao, Chin-Yew Lin, ACL 2023.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/CoLaDa
TACR: A Table-alignment-based Cell-selection and Reasoning Model for Hybrid Question-Answering, Jian Wu, Yicheng Xu, Yan Gao, Jian-Guang Lou, Börje F. Karlsson, Manabu Okumura, Findings of the Association for Computational Linguistics: ACL 2023.
TIARA: Multi-grained Retrieval for Robust Question Answering over Large Knowledge Bases, Yiheng Shu, Zhiwei Yu, Yuhan Li, Börje F. Karlsson, Tingting Ma, Yuzhong Qu, Chin-Yew Lin, EMNLP 2022.
Repository: https://github.com/microsoft/KC/tree/master/papers/TIARA
LinkingPark: An Automatic Semantic Table Interpretation System, Shuang Chen, Alperen Karaoglu, Carina Negreanu, Tingting Ma, Jin-Ge Yao, Jack Williams, Feng Jiang, Andy Gordon, Chin-Yew Lin, Journal of Web Semantics, 2022.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/LinkingPark
Rows from Many Sources: Enriching row completions from Wikidata with a pre-trained Language Model, Carina Negreanu, Alperen Karaoglu, Jack Williams, Shuang Chen, Daniel Fabian, Andrew Gordon, Chin-Yew Lin, Wiki Workshop 2022.
On the Effectiveness of Sentence Encoding for Intent Detection Meta-Learning, Tingting Ma, Qianhui Wu, Zhiwei Yu, Tiejun Zhao, Chin-Yew Lin, NAACL 2022.
Repository: https://github.com/microsoft/KC/tree/master/papers/IDML
Decomposed Meta-Learning for Few-Shot Named Entity Recognition, Tingting Ma, Huiqiang Jiang, Qianhui Wu, Tiejun Zhao, Chin-Yew Lin, Findings of the ACL 2022.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/DecomposedMetaNER
AdvPicker: Effectively Leveraging Unlabeled Data via Adversarial Discriminator for Cross-Lingual NER, Weile Chen, Huiqiang Jiang, Qianhui Wu, Börje F. Karlsson, Yi Guan, ACL 2021.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/AdvPicker
ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering, Shuang Chen, Qian Liu, Zhiwei Yu, Chin-Yew Lin, Jian-Guang Lou, Feng Jiang, ACL 2021. (demo paper)
Repository: https://github.com/microsoft/KC/tree/master/papers/ReTraCk
Issues with Entailment-based Zero-shot Text Classification, Tingting Ma, Jin-Ge Yao, Chin-Yew Lin, Tiejun Zhao, ACL 2021. (short paper)
Repository: https://github.com/microsoft/KC/tree/master/papers/Entailment-Issues
BoningKnife: Joint Entity Mention Detection and Typing for Nested NER via prior Boundary Knowledge, Huiqiang Jiang, Guoxin Wang, Weile Chen, Chengxi Zhang, Börje F. Karlsson, arXiv:2107.09429 - 2020/2021.
LinkingPark: An integrated approach for Semantic Table Interpretation, Shuang Chen, Alperen Karaoglu, Carina Negreanu, Tingting Ma, Jin-Ge Yao, Jack Williams, Andy Gordon, Chin-Yew Lin, Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) at ISWC 2020.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/LinkingPark
UniTrans: Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data, Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, Jian-Guang Lou, IJCAI 2020.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/UniTrans
Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language, Qianhui Wu, Zijia Lin, Börje F. Karlsson, Jian-Guang Lou, Biqing Huang, ACL 2020.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/SingleMulti-TS
Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources, Qianhui Wu, Zijia Lin, Guoxin Wang, Hui Chen, Börje F. Karlsson, Biqing Huang, Chin-Yew Lin, AAAI 2020.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/Meta-Cross
Improving Entity Linking by Modeling Latent Entity Type Information, Shuang Chen, Jinpeng Wang, Feng Jiang, Chin-Yew Lin, AAAI 2020.
Exploring Word Representations on Time Expression Recognition, Sanxing Chen, Guoxin Wang, Börje Karlsson, Technical Report - Microsoft Research Asia, 2019.
Towards Improving Neural Named Entity Recognition with Gazetteers, Tianyu Liu, Jin-Ge Yao, Chin-Yew Lin, ACL 2019.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/SubTagger
CAN-NER: Convolutional Attention Network for Chinese Named Entity Recognition, Yuying Zhu, Guoxin Wang, Börje F. Karlsson, NAACL-HLT 2019.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/CAN-NER
GRN: Gated Relation Network to Enhance Convolutional Neural Network for Named Entity Recognition, Hui Chen, Zijia Lin, Guiguang Ding, Jian-Guang Lou, Yusen Zhang, Börje F. Karlsson, AAAI 2019.
Repository: https://github.com/microsoft/vert-papers/tree/master/papers/GRN-NER

Related Projects:

VERT (Versatile Entity Recognition & Disambiguation Toolkit) - Open-source repository including code and datasets for the KC papers related to entity extraction/disambiguation/understanding;
microsoft/Recognizers-Text - Open-source library that provides recognition and normalization/resolution of numbers, units, date/time, and sequences (e.g., phone numbers, URLs) expressed in multiple languages.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

kc's Issues

Missing information on README file and trained models

Missing Information

It seems that the "Redis dump files" and "Model checkpoints" sections of the ReTraCk README are missing some information.
I would like to know whether there is a way to make those parts available.

Trained Models

Following up on the models question: is there a trained model available for public use? If not, what is the correct way to train one?

404 - page not found

I am very interested in the paper Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text, but when I opened the code link provided in the paper, I got a 404 error. Could you please provide the source code? Thanks.

Exemplary Logical Form Retrieval problem

Thanks for your contribution.
When I got to step two and ran enumerate_candidates.py, something went wrong.
I have reviewed the code and data you provided and it does not seem to contain a file in.expr.json format. May I ask if this file is generated by additional operations?

In the instructions you said to generate an Exemplary Logical Form as per rng-kbqa. Does that mean I can ignore your way of generating and just do it his way?

I am anticipating your reply

[ReTraCk] File doesn't exist

Dear author:

Thank you for your work!
I am just wondering why the relevant files mentioned in the README.md don't exist.

python ./tests/debug/launch_schema_retriever.py

For examples on how to use ReTraCkRetriever in your code base, please check /tests/debug/launch_schema_retriever.py and TRAIN.MD.

[TIARA] Get unexpected results

Hi
This article is very creative.
When I execute the following code,

python algorithm/grailqa_generation.py --prompt lf_schema

I get the following results:

>> root@train-grailqa-0:/data1/Projects/KC/papers/TIARA/src# python utils/statistics/grailqa_evaluate.py ../dataset/GrailQA/grailqa_v1.0_dev.json ../logs/grailqa_dev_2023_02_01_14_59_42_log.json

{'em': 0.6590270589974864, 'f1': 0.7318337303152712, 'em_iid': 0.770872567482737, 'f1_iid': 0.8097763582287697, 'em_comp': 0.5977542932628798, 'f1_comp': 0.6808620585344852, 'em_zero': 0.6356673960612691, 'f1_zero': 0.7189804767074774}

Could you please upload your dev result file? I am not sure whether my Freebase setup is correct.

Thanks.
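
A minimal sketch for reading scores like the ones above: it groups the exact-match (EM) and F1 values by split and prints them as percentages, which makes a side-by-side comparison with another run easier. The helper name summarize is hypothetical and not part of the TIARA code; the numbers are copied from the output quoted in this issue.

```python
# Metrics copied verbatim from the evaluation output quoted in this issue.
metrics = {
    'em': 0.6590270589974864, 'f1': 0.7318337303152712,
    'em_iid': 0.770872567482737, 'f1_iid': 0.8097763582287697,
    'em_comp': 0.5977542932628798, 'f1_comp': 0.6808620585344852,
    'em_zero': 0.6356673960612691, 'f1_zero': 0.7189804767074774,
}


def summarize(metrics):
    """Group EM/F1 by split and convert to percentages (hypothetical helper)."""
    # Key suffixes follow the naming used in the dict above.
    splits = {'overall': '', 'iid': '_iid', 'comp': '_comp', 'zero': '_zero'}
    return {
        name: (round(100 * metrics['em' + suffix], 1),
               round(100 * metrics['f1' + suffix], 1))
        for name, suffix in splits.items()
    }


summary = summarize(metrics)
for split, (em, f1) in summary.items():
    print(f"{split:8s} EM={em:5.1f}  F1={f1:5.1f}")
```

The same function can be applied to a second result dict, so that two runs can be compared line by line.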

Missing file for Entity Disambiguation

Hi.

I want to run PURE with GrailQA for entity retrieval, as you mentioned in README.
When I execute $ sh retriever/scripts/entity_retrieval.sh for entity disambiguation, I find that the file 'retriever/results/disamb/grail_dev/predictions.json',
referenced in https://github.com/microsoft/KC/blob/main/papers/TIARA/src/utils/disamb_to_entity.py#L18 , does not exist.

Could you help me get that file?

[ReTrack Issue] Unable to reproduce the evaluate results using the demo script

Hello! There currently seem to be two ways to run the system: (1) use the processed GrailQA file with evaluate.py under the parser directory, or (2) set up the demo pipeline following the demo section of the README.

Currently I'm getting different results for the same questions. I have set all the flags mentioned in this README note:

For the best possible results, please enable the complete checker (use_beam_check, use_virtual_forward, use_type_checking, and use_entity_anchor) in the demo overrides.

The Redis cache seems to be up and running. For a sample of 100 questions, there is a loss of roughly 2-5% in F1 score and EM.

Can you please help with this issue?
Please let me know if you need any further information
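
One way to narrow down such a gap is to list the questions on which the two paths disagree. The sketch below assumes each run's predictions have already been loaded into a dict mapping a question id to its predicted logical form; that layout, the helper name, and the toy data are assumptions for illustration, not the format ReTraCk actually writes.

```python
def differing_questions(run_a, run_b):
    """Return the sorted question ids whose predictions differ between two runs.

    A question that is missing from one run counts as a difference.
    """
    all_ids = set(run_a) | set(run_b)
    return sorted(qid for qid in all_ids if run_a.get(qid) != run_b.get(qid))


# Toy predictions, invented for illustration only.
evaluate_run = {"q1": "(JOIN r1 e1)", "q2": "(JOIN r2 e2)", "q3": "(JOIN r3 e3)"}
demo_run = {"q1": "(JOIN r1 e1)", "q2": "(JOIN r9 e2)"}

mismatches = differing_questions(evaluate_run, demo_run)
print(mismatches)
```

Inspecting only the mismatching questions usually shows faster whether the difference comes from a checker flag, the cache, or the input preprocessing.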

[TIARA] Unable to download TIARA_DATA.zip

Hello,

Thank you for sharing your great work!
I've been attempting to follow the instructions in the README.md to run TIARA, but I'm encountering difficulties downloading the TIARA_DATA.zip file.

On Windows, I've tried to download the file by clicking the provided hyperlink multiple times, but without success. Additionally, I attempted to download it on Linux using the following command:
wget https://kcpapers.blob.core.windows.net/tiara-emnlp2022/TIARA_DATA.zip

Unfortunately, I continue to face issues. Can you please provide some assistance or guidance on how to resolve this problem? Thank you in advance!

Training Steps

Hello,
Can you please update the README with instructions for training the model from scratch?

When will the code for "Layout Generation as Intermediate Action Sequence Prediction" be open sourced?

TIARA: Are the items retrieved by the Schema Retriever the classes and relations of the entire KG?

This article is very creative and I am very interested.
Are the items retrieved by the Schema Retriever the classes and relations of the entire KG?
Are there too many retrieved item candidates?
When will the code for this project be released?

When will the code about TIARA be open sourced?

After reading your article, I find it very innovative, but there are still some details that I don't understand well and would like to understand through the code. When will the code be open sourced?

[Retrack] file download

When I click on the download link, the webpage displays: This XML file does not appear to have any style information associated with it. The document tree is shown below.
I tried using wget, but it also reports: ERROR 409: Public access is not permitted on this storage account.
May I ask if there is a problem with the download link?

TIARA data problems

Thanks for your work.
TIARA requires some pre-processed data. It can be downloaded from Azure Storage, but this download address may not be open to the public. Could you provide another way to access it?
When I click the link, the result will be like this.

[ReTraCk] Regeneration of input data files

Hello,

Thank you for updating the training details for ReTraCk. Can you please point to the scripts for regenerating the files uploaded to the Azure storage (Parser, Dataset, KBSchema)?

[TIARA] Entity Linker.

Hello, it is mentioned in the readme file to use the PURE project for training mention detection model.
Command:
python run_entity.py \
    --do_train --do_eval --eval_test \
    --learning_rate=5e-6 --task_learning_rate=5e-6 \
    --train_batch_size=32 \
    --eval_batch_size=108 \
    --context_window 0 \
    --max_span_length 15 \
    --task grailqa \
    --data_dir grailqa_data/json/ \
    --model bert-base-uncased \
    --output_dir grailqa_models/checkpoint \
    --num_epoch 10 \
    --seed 42

But the PURE project has no "grailqa" task as given in the arguments.
Could you also describe the code changes required in PURE to make it run on GrailQA?

[TIARA] Schema Retriever: Not able to reproduce results

Thank you for the code.

I'm trying to retrain the schema retriever according to README's instructions.

For class: I was able to reproduce the results (by changing batch size to 128).
For relations: during training the eval loss decreases and eval accuracy reaches around 95%, but prediction on the dev set yields 0 hits@k for all k (I am not changing any hyperparameters in the code).

Am I missing something? Please help with the issue.
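
For reference, a minimal sketch of one common definition of hits@k: the fraction of examples whose top-k ranked predictions contain at least one gold label. This is not the TIARA evaluation code; the function name and the toy relation names are assumptions for illustration. The last lines show one possible (unconfirmed) way to end up with 0 hits at every k despite good training accuracy: predictions and gold labels written in different string formats.

```python
def hits_at_k(ranked_predictions, gold_labels, k):
    """Fraction of examples with at least one gold label in the top-k predictions."""
    if not gold_labels:
        return 0.0
    hits = 0
    for preds, gold in zip(ranked_predictions, gold_labels):
        # Only the first k predictions of each ranked list are considered.
        if set(preds[:k]) & set(gold):
            hits += 1
    return hits / len(gold_labels)


# Toy data, invented for illustration only.
preds = [
    ["film.film.director", "film.film.producer"],
    ["music.album.artist", "music.artist.album"],
]
gold = [["film.film.director"], ["music.artist.album"]]

print(hits_at_k(preds, gold, 1))
print(hits_at_k(preds, gold, 2))

# If gold labels use another naming format, every comparison fails.
gold_other_format = [[g.replace(".", "/") for g in labels] for labels in gold]
print(hits_at_k(preds, gold_other_format, 2))
```

Printing a few raw predictions next to the corresponding gold labels is a quick way to rule this kind of mismatch in or out.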

I'm curious about the reason, because intuitively I think it would be right to get the start point of relation_trie from self.relation_trie.
