
relation_extration's Introduction

Usage Guide

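  1. deal_corpus.py
  • Replaces every chemical and disease mention in the articles with MeSH@ID.

  • Counts the chemicals, diseases, and CID (chemical-induced disease) relations in the training and development sets.

  • Input files: data/raw_data/CDR_TrainingSet.txt; data/raw_data/CDR_DevelopmentSet.txt

  • Output files: data/corpus/train.txt; data/corpus/develop.txt

  • The training set contains 500 articles and the development set contains 504.

  • The training set is annotated with 665 distinct chemicals and 649 distinct diseases; the development set with 660 distinct chemicals and 589 distinct diseases.

  • The training set is annotated with 344 distinct CID relations and the development set with 347.

  • Across both sets there are 998 distinct chemicals, 882 distinct diseases, and 527 distinct CID relations.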
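The replacement step can be sketched as follows. This is a minimal illustration, not the code in deal_corpus.py: it assumes each annotation is available as a (start offset, end offset, MeSH ID) triple, and the placeholder form `MeSH@<ID>` as well as the function name `replace_mentions` are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical annotation shape: (start offset, end offset, MeSH ID).
Annotation = Tuple[int, int, str]


def replace_mentions(text: str, annotations: List[Annotation]) -> str:
    """Replace each annotated character span with a MeSH@<ID> placeholder."""
    # Process spans from last to first so that earlier offsets stay valid
    # after a replacement changes the length of the text.
    for start, end, mesh_id in sorted(annotations, reverse=True):
        text = text[:start] + "MeSH@" + mesh_id + text[end:]
    return text


# Illustrative sentence and offsets, not taken from the corpus.
example_text = "Lithium can cause tremor."
example_annotations = [(0, 7, "D008094"), (18, 24, "D014202")]
replaced = replace_mentions(example_text, example_annotations)
print(replaced)
```

Working backwards through the spans avoids having to recompute offsets after each substitution.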

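  1. StanfordTest.java/Passage
  • Splits the articles into sentences.

  • Input files: data/corpus/train.txt; data/corpus/develop.txt

  • Output files: data/corpus/TrainSet.txt; data/corpus/DevelopSet.txt

  • The training set contains 4600 sentences and the development set contains 4577.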

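  1. inner_sentence.py
  • Reads every sentence of each article and checks whether it contains an entity pair. For each sentence that does, all entities in the sentence are extracted.
    For every entity pair it computes each entity's position in the sentence, the distance between the entities, their order, whether the pair appears in the knowledge base, and whether a CID relation holds.
    It also computes the sentence length and the total keyword score of the sentence.
    The keyword feature was added later.

  • Input files:
    data/corpus/TrainSet.txt; data/corpus/DevelopSet.txt
    data/keyword.txt: the extracted keywords and their weights
    data/knowledge.txt: one million chemical-disease causal pairs from the CTD database

  • Output files:
    data/corpus/TrainSentence.txt; data/corpus/DevelopSentence.txt
    Format: sentence, chemical, disease, chemical position, disease position, distance, sentence length, order, keyword, knowledge, title, CID

  • The training set yields 3602 sentences and the development set 3833.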
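As a rough sketch of the per-pair features listed above, the function below computes them for one tokenized sentence. It assumes that entities appear as single placeholder tokens, that positions are token indices, that the keyword score is the sum of the weights of the keywords in the sentence, and that the knowledge base is a set of (chemical, disease) pairs. The function and field names are hypothetical and may differ from what inner_sentence.py actually does.

```python
from typing import Dict, List, Set, Tuple


def pair_features(
    tokens: List[str],
    chemical: str,
    disease: str,
    keyword_weights: Dict[str, float],
    knowledge: Set[Tuple[str, str]],
) -> Dict[str, float]:
    """Compute sentence-level features for one chemical-disease pair."""
    chem_pos = tokens.index(chemical)  # token index of the chemical
    dis_pos = tokens.index(disease)    # token index of the disease
    return {
        "chemical_position": chem_pos,
        "disease_position": dis_pos,
        "distance": abs(chem_pos - dis_pos),
        "sentence_length": len(tokens),
        # 1 if the chemical comes first, 0 otherwise (assumed encoding).
        "order": 1 if chem_pos < dis_pos else 0,
        "keyword_score": sum(keyword_weights.get(t.lower(), 0.0) for t in tokens),
        "in_knowledge": int((chemical, disease) in knowledge),
    }


# Illustrative inputs, not taken from the corpus.
tokens = "MeSH@D008094 can induce MeSH@D014202 in patients".split()
features = pair_features(
    tokens,
    "MeSH@D008094",
    "MeSH@D014202",
    keyword_weights={"induce": 2.0},
    knowledge={("MeSH@D008094", "MeSH@D014202")},
)
print(features)
```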

  1. keyword_sentence.py
  • Extracts the sentences of all positive examples for keyword statistics.

  • Input files: data/corpus/TrainSentence.txt; data/corpus/DevelopSentence.txt

  • Output file: data/corpus/keyword_sentence.txt

  • 3048 sentences in total.

  1. StanfordTest.java/keyword
  • Tokenizes the sentences and lemmatizes each token, keeping all words.

  • Input file: data/corpus/keyword_sentence.txt

  • Output file: data/corpus/keyword.txt

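  1. keyword_count.py
  • Counts keyword candidates: keeps words that occur more than 100 times and consist only of letters, then manually filters out pronouns, prepositions, conjunctions, and similar function words.

  • Input file: data/corpus/keyword.txt

  • Output file: data/keyword.txt

  • 31 words in total.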
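The counting rule above can be sketched with `collections.Counter`. This is only an illustration under assumptions: the input is taken to be a flat list of lemmas, and the manual filtering of function words is approximated by a stopword set passed in by the caller. The name `count_keywords` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_keywords(
    lemmas: Iterable[str],
    min_count: int = 100,
    stopwords: Set[str] = frozenset(),
) -> Dict[str, int]:
    """Keep purely alphabetic words that occur more than min_count times."""
    counts = Counter(w.lower() for w in lemmas if w.isalpha())
    return {
        word: n
        for word, n in counts.items()
        if n > min_count and word not in stopwords
    }


# Small illustrative input with a lowered threshold.
lemmas = ["induce"] * 3 + ["the"] * 5 + ["cause"] * 2 + ["D008094"] * 4
keywords = count_keywords(lemmas, min_count=2, stopwords={"the"})
print(keywords)
```

In the example, "the" is dropped as a stopword, "cause" does not exceed the threshold, and "D008094" is rejected because it is not purely alphabetic.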

  1. deal_sentence.py
  • Keeps only the sentence text from each record, as input for dependency parsing.

  • Input files: data/corpus/TrainSentence.txt; data/corpus/DevelopSentence.txt

  • Output files: data/corpus/TrainDependSentence.txt; data/corpus/DevelopDependSentence.txt

  1. StanfordTest.java/dependency
  • Parses every sentence and generates its dependency tree.

  • Input files: data/corpus/TrainDependSentence.txt; data/corpus/DevelopDependSentence.txt

  • Output files: data/corpus/train_dependecny.txt; data/corpus/develop_dependecny.txt

  1. deal_dependency.py
  • Computes the dependency distance and the dependency labels for each entity pair.

  • Input files:
    data/corpus/train_dependency.txt; data/corpus/develop_dependency.txt
    data/corpus/TrainSentence.txt; data/corpus/DevelopSentence.txt

  • Output files:
    data/corpus/TrainDependResult.txt; data/corpus/DevelopDependResult.txt
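One common way to obtain a dependency distance is the length of the shortest path between the two entity tokens in the dependency tree, treated as an undirected graph. The sketch below uses a breadth-first search for that. It assumes the parser output has already been read into (head index, dependent index, label) triples; the input format and the name `dependency_path` are assumptions, and deal_dependency.py may define the distance differently.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, str]  # (head token index, dependent token index, label)


def dependency_path(
    edges: List[Edge], source: int, target: int
) -> Optional[Tuple[int, List[str]]]:
    """Return (distance, labels along the path) or None if unreachable."""
    # Build an undirected adjacency list from the dependency edges.
    graph: Dict[int, List[Tuple[int, str]]] = {}
    for head, dep, label in edges:
        graph.setdefault(head, []).append((dep, label))
        graph.setdefault(dep, []).append((head, label))

    # Breadth-first search finds the path with the fewest edges.
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, labels = queue.popleft()
        if node == target:
            return len(labels), labels
        for neighbor, label in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, labels + [label]))
    return None


# Illustrative parse of "Lithium induced tremor": 1=Lithium, 2=induced, 3=tremor.
edges = [(2, 1, "nsubj"), (2, 3, "dobj")]
result = dependency_path(edges, 1, 3)
print(result)
```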

  1. annotate.py
  • Collects statistics on the dependency labels of the positive examples.

  • Input files: data/corpus/TrainDependResult.txt; data/corpus/DevelopDependResult.txt

  • Output file: data/annotate.txt

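  1. combine.py
  • Generates the final intra-sentence features.

  • Format: dependency distance, dependency tree length, dependency label score, sentence, chemical, disease, chemical position, disease position, distance, sentence length, order, keyword, knowledge, title, CID

  • Input files:
    data/corpus/TrainSentence.txt; data/corpus/DevelopSentence.txt
    data/corpus/TrainDependResult.txt; data/corpus/DevelopDependResult.txt

  • Output files:
    data/result/train_in_result.txt; data/result/develop_in_result.txt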
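Since the output format places the dependency features in front of the sentence features, the combination can be pictured as a row-by-row join. The sketch below assumes that both inputs hold one tab-separated record per entity pair in the same order; neither the separator nor the alignment is stated in this guide, and `combine_rows` is a hypothetical name.

```python
from typing import List


def combine_rows(depend_rows: List[str], sentence_rows: List[str]) -> List[str]:
    """Prefix each sentence record with its dependency features."""
    if len(depend_rows) != len(sentence_rows):
        # The join is positional, so a length mismatch means misaligned inputs.
        raise ValueError("inputs must contain the same number of records")
    return [
        dep.rstrip("\n") + "\t" + sent.rstrip("\n")
        for dep, sent in zip(depend_rows, sentence_rows)
    ]


# Illustrative records, not taken from the corpus.
combined = combine_rows(["2\t3\t1.5\n"], ["a sentence\tD008094\tD014202\n"])
print(combined)
```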

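  1. out_sentence.py
  • Cross-sentence analysis.

  • For each article's abstract, takes two sentences at a time and looks for entity pairs that span them. For every sentence pair that contains such entity pairs, all of the entity pairs are extracted.
    For each entity pair it computes the distance between the entities, the cross-sentence length, the position of the chemical in its sentence, the position of the disease in its sentence, the number of sentences spanned,
    the number of other entities between the two entities, whether the pair is in the knowledge base, the keyword score, and the entity order.

  • Input files: data/corpus/TrainSet.txt; data/corpus/DevelopSet.txt

  • Output files: data/result/train_out_result.txt; data/result/develop_out_result.txt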
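A minimal sketch of the cross-sentence pairing is given below. It assumes tokenized sentences in which each entity is a single token, measures distance in tokens over the whole abstract, and takes the cross-sentence length to be the total token count from the first to the last sentence involved. These definitions, the field names, and the function name are assumptions; the knowledge and keyword features are left out for brevity.

```python
from typing import Dict, List, Set


def cross_sentence_features(
    sentences: List[List[str]], chemicals: Set[str], diseases: Set[str]
) -> List[Dict[str, object]]:
    """List features for chemical-disease pairs found in different sentences."""
    # Flatten the abstract, remembering sentence index and in-sentence position.
    flat = [
        (tok, s_idx, pos)
        for s_idx, sent in enumerate(sentences)
        for pos, tok in enumerate(sent)
    ]
    entities = chemicals | diseases
    rows = []
    for c_idx, (c_tok, c_sent, c_pos) in enumerate(flat):
        if c_tok not in chemicals:
            continue
        for d_idx, (d_tok, d_sent, d_pos) in enumerate(flat):
            if d_tok not in diseases or d_sent == c_sent:
                continue  # only pairs that span two sentences
            lo, hi = sorted((c_idx, d_idx))
            first, last = sorted((c_sent, d_sent))
            rows.append({
                "chemical": c_tok,
                "disease": d_tok,
                "distance": hi - lo,
                "span_length": sum(len(sentences[k]) for k in range(first, last + 1)),
                "chemical_position": c_pos,
                "disease_position": d_pos,
                "sentences_apart": last - first,
                "entities_between": sum(
                    1 for tok, _, _ in flat[lo + 1:hi] if tok in entities
                ),
                "order": 1 if c_idx < d_idx else 0,
            })
    return rows


# Illustrative abstract with placeholder entity tokens C1, D1, D2.
sentences = [
    ["C1", "was", "given", "."],
    ["Patients", "developed", "D1", "and", "D2", "."],
]
rows = cross_sentence_features(sentences, {"C1"}, {"D1", "D2"})
for row in rows:
    print(row)
```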

relation_extration's People

Contributors

whiplashzeb
