
yahoo / fel

Fast Entity Linker: a toolkit for training models that link entities in documents and queries to a knowledge base (Wikipedia).

License: Apache License 2.0

Languages: Java 93.68%, Pig Latin 3.77%, Shell 2.55%
Topics: wikipedia, entity-links, java, web

fel's Issues

How to use it?

Can I run FEL without Hadoop? If so, how?

Surface form on Coherent Entity Linker

This could be a follow-up to #7:

I'm wondering whether the same is possible for the CoherentEntityLinker. Here is where I currently am:

changed EntityResult to add a Span field
included the field in the map step as follows:

List<EntityResult> candidates = felCandidates.stream().map(felResult -> {
        String wikiId = ...;
        return new EntityResult(wikiId, felResult.score, felResult.type, felResult.s);
}).collect(Collectors.toList());

but the Span field seems to hold the entire query instead of only the surface form. Is that the expected behaviour, or is there a step I'm missing?
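For reference, the kind of offset-based extraction being discussed could be sketched like this. This is a minimal illustration, not FEL's actual API: it assumes a span carries character offsets into the original query (FEL's Span type may represent this differently).

```java
// Minimal sketch (not FEL's API): if a span holds character offsets into
// the query, the surface form is just a substring of the query.
public class SurfaceFormSketch {
    static String surfaceForm(String query, int start, int end) {
        return query.substring(start, end);
    }

    public static void main(String[] args) {
        String query = "Trump had dinner with me tonight.";
        // A span covering only the mention yields "Trump"; a span of
        // (0, query.length()) would return the whole query instead,
        // which is the symptom described in this issue.
        System.out.println(surfaceForm(query, 0, 5));
    }
}
```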

Thanks!

about the integrated entity linking

Dear aasish,
Thanks for your work on entity linking; this is a good tool.
After reading the paper and the code, I see that the FastEntityLinker class produces candidate mentions and entities, and the CoherentEntityLinkerWrapper class takes mentions and returns the coherent entities. However, I cannot find an integrated, end-to-end entity-linking entry point in the code. For example, given the sentence "Yahoo is a company headquartered in Sunnyvale, CA with Marissa Mayer as CEO", it would return the linked Wikipedia entities "Yahoo", "Sunnyvale, California" and "Marissa_Mayer". Could you please provide the integrated entity linking?
Thank you very much.

lack of ENTITIES.PHRASE.model and PHRASE.model when using CoherentEntityLinker

I have read the related WSDM 2017 paper and the code. When using the CoherentEntityLinker class, the args given in README.md are "en/enwiki.wiki2vec.d300.compressed en/english-nov15.hash", and I have obtained the wiki2vec and hash files. However, there are additional models to load, namely ENTITIES.PHRASE.model and PHRASE.model. Could you please provide the ENTITIES.PHRASE.model and PHRASE.model files for English and Chinese?

class not found error

Dear Authors,

When I try to run your tool, I encounter a "java.lang.NoClassDefFoundError: it/unimi/dsi/fastutil/io/BinIO".

I found that the code for this class is missing from your GitHub repository. I wonder if I am using the wrong command to compile the project (mvn compile).

Regards,
zphuang

Question about Chinese entity linking

Is `mvn exec:java -Dexec.mainClass=com.yahoo.semsearch.fastlinking.FastEntityLinker -Dexec.args="zh/chinese-dec15.hash"` the right command for fast linking of Chinese?

I ran that command and got into the interactive shell, but when I input a sentence, it does not show any entities. I tried Spanish and the same thing happened. What could be the problem? Thanks a lot!

about EntityContextFastEntityLinker - input parameter

Hi,
I would like to know whether the "Word vectors file" and the "Entities word vectors file" arguments of the EntityContextFastEntityLinker class refer to the same file:

new FlaggedOption( "hash", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'h', "hash", "quasi succint hash" ),
new FlaggedOption( "vectors", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'v', "vectors", "Word vectors file" ),
new FlaggedOption( "entities", JSAP.STRING_PARSER, JSAP.NO_DEFAULT, JSAP.REQUIRED, 'e', "entities", "Entities word vectors file" )

Thanks!

GC overhead limit when mining wikipedia and extracting anchor text

Hi

I am following the steps provided here to train my model.

I have pre-processed the datapack, but when I try to "Build Data Structures and extract anchor text", I run into a GC overhead limit error.

(screenshot of the "GC overhead limit exceeded" error, 2018-05-29)

I have even increased the MAPRED and HADOOP memory to 15 GB, and provided opts for -Dmapreduce.reduce.java.opts and -Dmapreduce.reduce.memory.mb.

My system has 8 cores and 32 GB of RAM, and I am using Java 8. This is the command that I am running:

hadoop \
jar target/FEL-0.1.0-fat.jar \
com.yahoo.semsearch.fastlinking.io.ExtractWikipediaAnchorText \
-Dmapreduce.map.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapreduce.reduce.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dyarn.app.mapreduce.am.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapred.job.map.memory.mb=15144 \
-Dmapreduce.map.memory.mb=15144 \
-Dmapreduce.reduce.memory.mb=15144 \
-Dmapred.child.java.opts="-Xmx15g" \
-Dmapreduce.map.java.opts='-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC' \
-Dmapreduce.reduce.java.opts="-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC" \
-input wiki/${WIKI_MARKET}/${WIKI_DATE}/pages-articles.block \
-emap wiki/${WIKI_MARKET}/${WIKI_DATE}/entities.map \
-amap wiki/${WIKI_MARKET}/${WIKI_DATE}/anchors.map \
-cfmap wiki/${WIKI_MARKET}/${WIKI_DATE}/alias-entity-counts.map \
-redir wiki/${WIKI_MARKET}/${WIKI_DATE}/redirects

Could you please suggest why this might be happening?

Pardon me, as I am a novice with Hadoop and Java.
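As a hedged aside, one general YARN rule of thumb (an assumption on my part, not something from the FEL documentation) is that setting -Xmx equal to the container size, as the command above does with -Xmx15g against 15144 MB containers, leaves no headroom for off-heap memory; the heap is commonly sized at roughly 80% of mapreduce.*.memory.mb:

```java
// Rule-of-thumb sketch (assumption, not from FEL's documentation):
// leave ~20% headroom between the YARN container size and the JVM heap.
public class HeapSizing {
    static int heapMbForContainer(int containerMb) {
        return (int) (containerMb * 0.8);
    }

    public static void main(String[] args) {
        // For 15144 MB containers this suggests an -Xmx of ~12115 MB
        // rather than -Xmx15g.
        System.out.println(heapMbForContainer(15144));
    }
}
```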

Output hash file size surprisingly small when mining Wikipedia to train our model

Thank you for providing the code.

We were trying to mine Wikipedia with this shell script for our entity linker, using the dump from 2018/05/01. We were able to generate the hash file, but surprisingly it was only 284 MB. In contrast, the pre-trained English hash provided, trained on the November 2015 Wikipedia, is 1.3 GB.

@aasish, could you suggest what might be going wrong? Is it because of compression, or are we missing some entities? Is there a way to combine both hash files so that we can take the recent entities into account?

Could not find or load main class

I'm fairly new to the Java/Maven ecosystem, so, I'm sorry if this is completely unrelated to FEL itself.

After cloning the repository and running mvn install, running java -Xmx10G com.yahoo.semsearch.fastlinking.FastEntityLinker --help always returns the same error:
Error: Could not find or load main class com.yahoo.semsearch.fastlinking.FastEntityLinker
I tried running this on both a Linux machine and a Mac.

Is there any step that I'm missing?
Thanks!

P.S.: In my Maven local repository, I can find both /it/unimi/dsi/fastutil/ and /com/yahoo/FEL/FEL/

How can we link the entity candidates to their surface form in the query?

Hey, dear aasish:

I am very curious about how to link the candidates in the result list back to their surface forms.

Currently, my stand-alone result looks like this:
>Trump had dinner with me tonight.

Trump had dinner with me tonight. Donald_Trump -3.6968905396651404 1331552

Trump had dinner with me tonight. Dinner -3.811408020070193 1300672

Trump had dinner with me tonight. Come_Dine_with_Me -3.871200967594029 1082062
.......

However, it seems that we cannot get the surface form for Donald_Trump directly from your code. In other words, could I get output like the following?
Trump(Donald_Trump) had dinner with me tonight.
If it is possible, this would be very helpful for our research.
Thank you very much in advance.
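The inline annotation being asked for could be sketched as follows. This is a minimal illustration under the assumption that the mention's character offsets in the query are available; FEL's result objects are not guaranteed to expose them this way.

```java
// Minimal sketch (not FEL's API): inserts "(EntityId)" after a mention,
// given the mention's character offsets within the query.
public class AnnotateSketch {
    static String annotate(String query, int start, int end, String entityId) {
        String mention = query.substring(start, end);
        return query.substring(0, start) + mention + "(" + entityId + ")" + query.substring(end);
    }

    public static void main(String[] args) {
        String query = "Trump had dinner with me tonight.";
        // -> Trump(Donald_Trump) had dinner with me tonight.
        System.out.println(annotate(query, 0, 5, "Donald_Trump"));
    }
}
```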

I cannot find the class CoherentEntityLinkerWrapper

Dear Author,

mvn clean compile exec:java -Dexec.mainClass=com.yahoo.semsearch.fastlinking.CoherentEntityLinkerWrapper -Dexec.args="en/enwiki.wiki2vec.d300.compressed en/english-nov15.hash test.txt" -Dexec.classpathScope=compile

When I execute this command, I find there is no class named CoherentEntityLinkerWrapper.
I downloaded your models english-nov15.hash and enwiki.wiki2vec.d300.compressed, but I can't find the entity file. How can I use EntityContextFastEntityLinker with your trained models?

Thank you!

about EntityEmbeddings -files parameter

Hello,

I want to compute entity embeddings with the command below, but I don't understand the -files parameter:

hadoop jar FEL-0.1.0-fat.jar com.yahoo.semsearch.fastlinking.w2v.EntityEmbeddings -Dmapreduce.job.queuename=adhoc -files word_vectors#vectors E2W entity.embeddings

I thought it might be the input or output of the word-embedding command, but it is not:

java com.yahoo.semsearch.fastlinking.w2v.Quantizer -i <word_embeddings> -o -h

Can you give me some suggestions? Thanks!

Getting the dataset

Hi,

I would like to use the entity linker, and I am trying to get the datasets from the links given in the readme file.

If I log in with my Yahoo account and try to submit a request for the L30 dataset, nothing happens: I am redirected back to the "My dataset selection" page, and the request does not seem to be sent.

How can I get the dataset?

Regarding the returned ID

Hi,
I've come up with the following question: when running FastEntityLinker on a question, it returns entity mentions with a score and an id. What is the id referring to?
I've tested Wikidata QIDs and Wikipedia page ids, and found that the id refers to neither. Thanks!

about the model

Dear author,
Thank you for your contribution. I used the Chinese hash model to get entities with the com.yahoo.semsearch.fastlinking.FastEntityLinker class, and I do get results; however, the results are not good enough. Could you do me a favor and help me improve the model? I have two questions: 1. Is the result quality tied to the hash model? 2. Can I train a model on my own data? Thank you very much!
