yahoo / fel

Fast Entity Linker Toolkit for training models to link entities to KnowledgeBase (Wikipedia) in documents and queries.

License: Apache License 2.0

Java 93.68% PigLatin 3.77% Shell 2.55%
wikipedia entity-links java web

fel's Introduction

ARCHIVED

FEL

Fast Entity Linker Core

This library performs query segmentation and entity linking to a target reference knowledge base (i.e., Wikipedia). The current version is tailored towards query entity linking (or, alternatively, short fragments of text). The main goal was an extremely fast linker (under 1-2 ms/query on average on a standard laptop) that is completely unsupervised, so that more sophisticated approaches can work on top of it with a decent time budget still available. A side effect is that the datapack used by the linker occupies less than 3 GB, making it suitable to run on the grid and keeping its footprint on server machines low.

Install

Please install Maven before building this project. The project ships with a pom.xml that installs all dependencies when you run mvn install.

What does this tool do?

The library performs query and document entity linking. It implements several algorithms that return a confidence score (roughly a log-likelihood) that is more or less comparable across pieces of text of different lengths, so one can use a global threshold for linking.

The program operates on two data structures: one big hash, and compressed word and entity vectors. The hash is generated from a datapack that records counts of phrases and entities that co-occur. These counts might come from different sources, for instance anchor text and query logs. For anchor text, whenever there is a link to an entity page we store the anchor and entity counts. For a query log, whenever there is a click on an entity page, we update the query and entity counts.

The word and entity vector files are compressed vector representations that account for the contexts in which a word or entity appears. The library provides a way to learn the entity vectors. Word vectors can be generated with general tools like word2vec, or you can reuse pre-trained word vectors such as those available in Facebook's fastText project.
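As a minimal sketch of the datapack idea (hypothetical code, not FEL's actual format or API), accumulating alias-entity co-occurrence counts from anchor text or query-log clicks amounts to:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: aggregate (alias, entity) co-occurrence counts
// from anchor-text or query-log observations. Class and method names
// are hypothetical; FEL's real datapack pipeline differs.
public class AliasEntityCounts {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Record one observation of an alias (anchor text or query) linking to an entity.
    public void observe(String alias, String entity) {
        counts.computeIfAbsent(alias, k -> new HashMap<>())
              .merge(entity, 1, Integer::sum);
    }

    // Number of times the alias was observed linking to the entity.
    public int count(String alias, String entity) {
        return counts.getOrDefault(alias, Map.of()).getOrDefault(entity, 0);
    }

    public static void main(String[] args) {
        AliasEntityCounts c = new AliasEntityCounts();
        c.observe("jaguar", "Jaguar_Cars");
        c.observe("jaguar", "Jaguar_Cars");
        c.observe("jaguar", "Jaguar"); // the animal
        System.out.println(c.count("jaguar", "Jaguar_Cars")); // 2
    }
}
```

Normalizing such counts per alias is what yields the unsupervised link probabilities the hash stores.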

The library also comes with tools for generating the hash and the vector files:

  • Generating a datapack that stores counts of phrases and entities. We provide tools for mining a Wikipedia dump and creating the datapack out of it.
  • Generating a hash structure out of a datapack.
  • Generating entity vectors out of a set of entity descriptions, which can be extracted from Wikipedia pages.
  • Compressing word vectors (typical compression ratios are around 10x).

If you use this library, please cite the following papers:

@inproceedings{Blanco:WSDM2015,
  Address = {New York, NY, USA},
  Author = {Blanco, Roi and Ottaviano, Giuseppe and Meij, Edgar},
  Booktitle = {Proceedings of the Eighth ACM International Conference on Web Search and Data Mining},
  Location = {Shanghai, China},
  Numpages = {10},
  Publisher = {ACM},
  Series = {WSDM '15},
  Title = {Fast and Space-Efficient Entity Linking in Queries},
  Year = {2015}
}

@inproceedings{Pappu:WSDM2017,
  Address = {New York, NY, USA},
  Author = {Pappu, Aasish and Blanco, Roi and Mehdad, Yashar and Stent, Amanda and Thadani, Kapil},
  Booktitle = {Proceedings of the Tenth ACM International Conference on Web Search and Data Mining},
  Location = {Cambridge, UK},
  Numpages = {10},
  Publisher = {ACM},
  Series = {WSDM '17},
  Title = {Lightweight Multilingual Entity Extraction and Linking},
  Year = {2017}
}

Stand-alone query entity linking

There are a number of different rankers/linkers that use different conceptual models. The overall description of the algorithm with some implementation details is at:

Fast and space efficient entity linking for queries

The main class to use is com.yahoo.semsearch.fastlinking.FastEntityLinker

The class can be called with --help to list the available options. It provides interactive linking through stdin (edit the code or extend the class if you need a custom output format).

First, download the dataset from Webscope following the links provided below.

Example usage call:

mvn exec:java -Dexec.mainClass=com.yahoo.semsearch.fastlinking.FastEntityLinker \
              -Dexec.args="en/english-nov15.hash"

Coherent Entity Linking for Documents

The CoherentEntityLinker class takes entity mentions and an n-best list of candidate entity links for each mention as input. It constructs a lattice from the n-best lists and runs the forward-backward algorithm over it.

  • J. Binder, K. Murphy and S. Russell. Space-Efficient Inference in Dynamic Probabilistic Networks. International Joint Conference on Artificial Intelligence, 1997.

More coherency algorithms are under experimentation. They will be added in future versions of the code.

mvn clean compile exec:java \
  -Dexec.mainClass=com.yahoo.semsearch.fastlinking.CoherentEntityLinkerWrapper \
  -Dexec.args="en/enwiki.wiki2vec.d300.compressed en/english-nov15.hash test.txt" \
  -Dexec.classpathScope=compile

You can include a mapping file in the entity linker arguments that maps integer entity-category IDs to human-readable entity categories.
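The forward-backward pass over the candidate lattice can be sketched as follows (an illustrative implementation of the general algorithm, not FEL's code; emission scores and transition weights are assumptions):

```java
// Illustrative forward-backward over a small candidate lattice.
// emission[t][i]: score of candidate i for mention t.
// trans[t][i][j]: coherence weight between candidate i at mention t
//                 and candidate j at mention t+1.
public class ForwardBackward {
    public static double[][] posteriors(double[][] emission, double[][][] trans) {
        int T = emission.length;
        double[][] fwd = new double[T][], bwd = new double[T][];

        // Forward pass: accumulate path scores left to right.
        fwd[0] = emission[0].clone();
        for (int t = 1; t < T; t++) {
            fwd[t] = new double[emission[t].length];
            for (int j = 0; j < fwd[t].length; j++)
                for (int i = 0; i < fwd[t - 1].length; i++)
                    fwd[t][j] += fwd[t - 1][i] * trans[t - 1][i][j] * emission[t][j];
        }

        // Backward pass: accumulate path scores right to left.
        bwd[T - 1] = new double[emission[T - 1].length];
        java.util.Arrays.fill(bwd[T - 1], 1.0);
        for (int t = T - 2; t >= 0; t--) {
            bwd[t] = new double[emission[t].length];
            for (int i = 0; i < bwd[t].length; i++)
                for (int j = 0; j < bwd[t + 1].length; j++)
                    bwd[t][i] += trans[t][i][j] * emission[t + 1][j] * bwd[t + 1][j];
        }

        // Posterior of each candidate = forward * backward, normalized per mention.
        double[][] post = new double[T][];
        for (int t = 0; t < T; t++) {
            post[t] = new double[emission[t].length];
            double z = 0;
            for (int i = 0; i < post[t].length; i++) z += fwd[t][i] * bwd[t][i];
            for (int i = 0; i < post[t].length; i++) post[t][i] = fwd[t][i] * bwd[t][i] / z;
        }
        return post;
    }
}
```

Picking the highest-posterior candidate per mention then yields a set of entities that are individually plausible and mutually coherent.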

Grid based linking

The following command would run the linker on a Hadoop grid:

hadoop jar FEL-0.1.0-fat.jar \
com.yahoo.semsearch.fastlinking.utils.RunFELOntheGrid \
-Dmapred.map.tasks=100 \
-Dmapreduce.map.java.opts=-Xmx3g \
-Dmapreduce.map.memory.mb=3072 \
-Dmapred.job.queue.name=adhoc \
-files en/english-nov15.hash#hash,src/main/bash/id-type.tsv#mapping \
<inputfile> \
<outputfile>

The class reads files that have one query per line; it splits each line and takes the first field. The output format is:

entity_type <TAB> query <TAB> modifier <TAB> entity_id

where

  • entity_type is given in the datapack
  • query is the original query
  • modifier is the query string with the entity alias removed
  • entity_id is the retrieved entity

In general you should rely on thresholding, and possibly keep only the top-1 retrieved entity, but this depends on how you are going to use the output.
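One way to apply such a global threshold is sketched below (the Result type and the cutoff value are illustrative, not part of FEL; the example scores mimic the linker's negative log-likelihood-style output):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch: keep only the top-scoring candidate, and only if its
// confidence (~log-likelihood, so negative) passes a global cutoff.
public class TopCandidate {
    record Result(String entity, double score) {}

    static Optional<Result> link(List<Result> candidates, double threshold) {
        return candidates.stream()
                .max(Comparator.comparingDouble(Result::score))
                .filter(r -> r.score() >= threshold);
    }

    public static void main(String[] args) {
        List<Result> c = List.of(new Result("Donald_Trump", -3.69),
                                 new Result("Dinner", -3.81));
        System.out.println(link(c, -4.0)); // top candidate passes the cutoff
        System.out.println(link(c, -3.0)); // too strict: nothing is linked
    }
}
```

A stricter threshold trades recall for precision; the right value depends on your downstream task.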

Fiddling with word embeddings

This package also provides code to

  1. quantize word2vec vectors for uni/bigrams,
  2. compress quantized vectors,
  3. generate word vectors for entities, given a set of words that describe them.

More on this can be found in the w2v package.
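The core idea behind step 1 can be sketched with uniform scalar quantization (illustrative only; the actual Quantizer in the w2v package may use a different scheme). Storing one byte instead of a 32-bit float per dimension gives a 4x saving from this step alone; the ~10x ratios mentioned above come from further coding on top:

```java
// Sketch of uniform scalar quantization of a word vector:
// each float is divided by a fixed scale, rounded, and clamped
// to a signed byte; dequantization multiplies back by the scale.
public class Quantize {
    static byte[] quantize(float[] v, float scale) {
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++)
            q[i] = (byte) Math.max(-127, Math.min(127, Math.round(v[i] / scale)));
        return q;
    }

    static float[] dequantize(byte[] q, float scale) {
        float[] v = new float[q.length];
        for (int i = 0; i < q.length; i++) v[i] = q[i] * scale;
        return v;
    }
}
```

For values within range, the round-trip error is bounded by half the scale, which is why similarity computations survive quantization largely intact.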

Mine Wikipedia and Extract Graph-Based Counts

The tool makes use of a datapack that stores counts and aliases (mentions) of entities from different sources; originally, we used anchor text and query logs. The following describes how to mine and compute the anchor text from a public Wikipedia dump using a Hadoop cluster (or, if one is not available, Hadoop running on a single machine). This is based on code from the Cloud9 toolkit.

More on this can be found in the io package.

Creating a quasi-succinct entity-features hash

The datapack will contain two files: one with the per-entity counts and one with the entity to id mapping. Then, you can hash it using:

com.yahoo.semsearch.fastlinking.hash.QuasiSuccinctEntityHash \
  -i <datapack_file> -e <entity2id_file> -o <output_file>

Models

The following pre-trained models are provided to perform entity linking with the toolkit and are available through the Yahoo! Webscope program for research purposes. These models are trained on Wikipedia and distributed under the Creative Commons BY-SA 4.0 license (see MODELS_LICENSE).

English

Spanish

Chinese (Simplified)

Contact

Roi Blanco, Aasish Pappu

fel's People

Contributors

ageron, bigbluehat, r-andrew-dev, roicho, titsuki


fel's Issues

Getting the dataset

Hi,

I would like to use the entity linker and I am trying to get the datasets from the links given in the readme file.

If I log in with my Yahoo account and try to submit a request to get the L30 dataset, nothing happens. I am redirected to the "My dataset selection" page again, and the request does not seem to be sent.

How can I get the dataset?

GC overhead limit when mining wikipedia and extracting anchor text

Hi

I am following the steps provided here to train my model.

I have pre-processed the datapack. But when I am trying to "Build Data Structures and extract anchor text", I am having this GC overhead issue.

(screenshot: GC overhead limit exceeded error, 2018-05-29)

I have even increased the MAPRED and HADOOP memory to 15 GB and provided opts for
-Dmapreduce.reduce.java.opts and -Dmapreduce.reduce.memory.mb.

My system has 8 cores and 32 GB of RAM, using Java 8. This is the command that I am running:

hadoop \
jar target/FEL-0.1.0-fat.jar \
com.yahoo.semsearch.fastlinking.io.ExtractWikipediaAnchorText \
-Dmapreduce.map.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapreduce.reduce.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dyarn.app.mapreduce.am.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapred.job.map.memory.mb=15144 \
-Dmapreduce.map.memory.mb=15144 \
-Dmapreduce.reduce.memory.mb=15144 \
-Dmapred.child.java.opts="-Xmx15g" \
-Dmapreduce.map.java.opts='-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC' \
-Dmapreduce.reduce.java.opts="-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC" \
-input wiki/${WIKI_MARKET}/${WIKI_DATE}/pages-articles.block \
-emap wiki/${WIKI_MARKET}/${WIKI_DATE}/entities.map \
-amap wiki/${WIKI_MARKET}/${WIKI_DATE}/anchors.map \
-cfmap wiki/${WIKI_MARKET}/${WIKI_DATE}/alias-entity-counts.map \
-redir wiki/${WIKI_MARKET}/${WIKI_DATE}/redirects

Could you please suggest why this might be happening?

Pardon me, as I am a novice to Hadoop and Java.

about the integrated entity linking

Dear Aasish,
Thanks for your contribution on entity linking. It is a good tool.
After reading your paper and the code, I find that the FastEntityLinker class can get candidate mentions and entities, and the CoherentEntityLinkerWrapper class takes in mentions and produces the coherent entities. However, I don't find an integrated entity linker in the code. For example, inputting the sentence "Yahoo is a company headquartered in Sunnyvale, CA with Marissa Mayer as CEO" would return the linked Wikipedia entities "Yahoo", "Sunnyvale,_California" and "Marissa_Mayer". Could you please provide integrated entity linking?
Thank you very much.

Surface form on Coherent Entity Linker

This could be a follow-up from #7:

I'm wondering if it's possible to do the same for coherentEntityLinker. Here is what i'm currently at:

changed EntityResult to add a Span variable
including variable in the map as such:

List<EntityResult> candidates = felCandidates.stream().map(felResult -> {
        String wikiId = ...;
        return new EntityResult(wikiId, felResult.score, felResult.type, felResult.s);
}).collect(Collectors.toList());

but the Span variable seems to hold the entire query, instead of only the surface form. Is it the expected behaviour? Or is there a step I'm missing?

Thanks!

I cannot find the class CoherentEntityLinkerWrapper

Dear Author,

mvn clean compile exec:java -Dexec.mainClass=com.yahoo.semsearch.fastlinking.CoherentEntityLinkerWrapper -Dexec.args="en/enwiki.wiki2vec.d300.compressed en/english-nov15.hash test.txt" -Dexec.classpathScope=compile

When I run the command, I find there is no class named CoherentEntityLinkerWrapper.
I downloaded your models english-nov15.hash and enwiki.wiki2vec.d300.compressed, but I can't find the entity file. How can I use EntityContextFastEntityLinker with your trained models?

thank you~

Question about Chinese entity linking

Is mvn exec:java -Dexec.mainClass=com.yahoo.semsearch.fastlinking.FastEntityLinker -Dexec.args="zh/chinese-dec15.hash" the right command to do fast linking for Chinese?

I ran that command and got into the interactive shell, but when I input a sentence, it does not show any entities. I tried Spanish, and the same thing happened. What could be the problem? Thanks a lot!

Output hash file size surprisingly small when mining Wikipedia to train our model

Thank you for providing the code.

We were trying to mine Wikipedia using this shell script for our entity linker, using the dump from 2018/05/01. We were able to generate the hash file, but surprisingly its size was 284 MB. In contrast, the pre-trained English hash built from the November 2015 Wikipedia has a file size of 1.3 GB.

@aasish, could you suggest what might be going wrong? Is it because of the compression, or are we missing some entities? Also, is there a way we could combine both hash files so that we can take the more recent entities into account?

Could not find or load main class

I'm fairly new to the Java/Maven ecosystem, so, I'm sorry if this is completely unrelated to FEL itself.

After cloning the repository and running mvn install, running java -Xmx10G com.yahoo.semsearch.fastlinking.FastEntityLinker --help always returns the same error:
Error: Could not find or load main class com.yahoo.semsearch.fastlinking.FastEntityLinker
I tried running it on both a Linux machine and a Mac.

Is there any step that I'm missing?
Thanks!

P.S.: In my Maven local repository, I can find both /it/unimi/dsi/fastutil/ and /com/yahoo/FEL/FEL/

Regarding the returning ID

Hi,
I've come up with the following question: when running FastEntityLinker on a query, it returns entity mentions with a score and an id. What is the id referring to?
I've tested against Wikidata QIDs and Wikipedia page IDs, and found that the id matches neither. Thanks!

class not found error

Dear Authors,

When I try to run your tool, I encounter a "java.lang.NoClassDefFoundError: it/unimi/dsi/fastutil/io/BinIO".

I found that the code for this class is missing from your GitHub repository. I wonder if I used the wrong command to compile the project (mvn compile).

Regards,
zphuang

How could we link the entity candidates to their surface form in the query?

Hey, Dear aasish:

I am very curious about how to link the candidates in the result list to their surface forms.

Now, my stand-alone result looks like:

>Trump had dinner with me tonight.
Trump had dinner with me tonight. Donald_Trump -3.6968905396651404 1331552
Trump had dinner with me tonight. Dinner -3.811408020070193 1300672
Trump had dinner with me tonight. Come_Dine_with_Me -3.871200967594029 1082062
.......

However, it seems that we cannot get the surface form of Donald_Trump directly from your code. In other words, could I get output like this?
Trump(Donald_Trump) had dinner with me tonight.
If it is possible, this would be very helpful for our research.
Thank you very much in advance.

Missing ENTITIES.PHRASE.model and PHRASE.model when using CoherentEntityLinker

I have read the related WSDM 2017 paper and the code. When using the CoherentEntityLinker class, the args given in README.md are "en/enwiki.wiki2vec.d300.compressed en/english-nov15.hash", and I have obtained the wiki2vec and hash files. However, there are additional models to load, ENTITIES.PHRASE.model and PHRASE.model. Would you please provide ENTITIES.PHRASE.model and PHRASE.model for English and Chinese?

about EntityEmbeddings -files parameter

Hello,
I want to do some entity embedding with the command below, but I don't understand the -files parameter:

hadoop jar FEL-0.1.0-fat.jar com.yahoo.semsearch.fastlinking.w2v.EntityEmbeddings -Dmapreduce.job.queuename=adhoc -files word_vectors#vectors E2W entity.embeddings

I think it may be the input or output of the word-embedding command below, but it is not:

java com.yahoo.semsearch.fastlinking.w2v.Quantizer -i <word_embeddings> -o -h

Can you give me some suggestions? Thanks!

about the model

Dear author:
Thank you for your contribution. I used the Chinese hash model to get entities with the com.yahoo.semsearch.fastlinking.FastEntityLinker class and obtained results. However, the results are not good enough. Could you help me improve the model? I have two questions: 1. Are the results related to the hash model? 2. Can I train a model using my own data? Thank you very much!

about EntityContextFastEntityLinker - input parameter

Hi,
I want to know whether the EntityContextFastEntityLinker arguments "Word vectors file" and "Entities word vectors file" refer to the same file:

new FlaggedOption( "hash", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'h', "hash", "quasi succint hash" ),
new FlaggedOption( "vectors", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'v', "vectors", "Word vectors file" ),
new FlaggedOption( "entities", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'e', "entities", "Entities word vectors file" )

Thanks!

How to use it?

Can I run the FEL without Hadoop, if yes, how can I run it?
