MAG2RDF

MAG2RDF contains the code for generating the Microsoft Academic Knowledge Graph (MAKG) in RDF.

For more information, see The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data.

The main class is MAG2RDF.

Input

Generating the MAKG RDF data requires the data dump files of the Microsoft Academic Graph (MAG). Specifically, the following files are needed:

  • Affiliations.txt

  • Authors.txt

  • ConferenceInstances.txt

  • ConferenceSeries.txt

  • FieldOfStudyChildren.txt

  • RelatedFieldOfStudy.txt

  • FieldsOFStudy.txt

  • Journals.txt

  • PaperAbstractsInvertedIndex.txt

  • PaperAuthorAffiliations.txt

  • PaperCitationContexts.txt

  • PaperFieldsOfStudy.txt

  • PaperLanguages.txt

  • PaperReferences.txt

  • Papers.txt
  • PaperUrls.txt

To obtain these files, please follow the instructions at https://docs.microsoft.com/en-us/academic-services/graph/get-started-setup-provisioning.
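
Before starting a run, it can help to confirm that every dump file is in place. The following is a minimal sketch, not part of MAG2RDF: the class name `InputFileCheck` and the idea of passing the dump directory as a command-line argument are assumptions made for illustration. The file names are copied from the list above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of MAG2RDF): reports which dump files are missing.
class InputFileCheck {

    // File names copied verbatim from the input list above.
    static final List<String> REQUIRED = List.of(
            "Affiliations.txt", "Authors.txt", "ConferenceInstances.txt",
            "ConferenceSeries.txt", "FieldOfStudyChildren.txt",
            "RelatedFieldOfStudy.txt", "FieldsOFStudy.txt", "Journals.txt",
            "PaperAbstractsInvertedIndex.txt", "PaperAuthorAffiliations.txt",
            "PaperCitationContexts.txt", "PaperFieldsOfStudy.txt",
            "PaperLanguages.txt", "PaperReferences.txt", "Papers.txt",
            "PaperUrls.txt");

    // Returns the names of all required files that are not regular files in dumpDir.
    static List<String> missingFiles(Path dumpDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dumpDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the dump directory is the first argument, else the current directory.
        Path dumpDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(dumpDir);
        if (missing.isEmpty()) {
            System.out.println("All required dump files are present.");
        } else {
            missing.forEach(name -> System.out.println("Missing: " + name));
        }
    }
}
```

Running the check first avoids discovering a missing file only after a long conversion has already started.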

Processing

Compile MAG2RDF.java, package it into a jar file with MAG2RDF as the main class, and run:

java -jar MAG2RDF.jar

Output

For each input file, the program creates a corresponding output file in the RDF format:

  • Affiliations.txt.nt
  • Authors.txt.nt
  • ConferenceInstances.txt.nt
  • ConferenceSeries.txt.nt
  • FieldOfStudyChildren.txt.nt
  • RelatedFieldOfStudy.txt.nt
  • FieldsOFStudy.txt.nt
  • Journals.txt.nt
  • PaperAbstractsInvertedIndex.txt.nt
  • PaperAuthorAffiliations.txt.nt
  • PaperCitationContexts.txt.nt
  • PaperFieldsOfStudy.txt.nt
  • PaperLanguages.txt.nt
  • PaperReferences.txt.nt
  • Papers.txt.nt
  • PaperUrls.txt.nt
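
The naming convention above and the shape of a single output line can be sketched as follows. This is only an illustration, not the code used by MAG2RDF: the class `NTriplesSketch`, its methods, and the `http://example.org/` IRIs are placeholders, and the actual MAKG vocabulary is not reproduced here. The escaping covers the characters that must be escaped inside a quoted N-Triples literal, plus the tab character.

```java
// Hypothetical sketch (not part of MAG2RDF): output naming and one N-Triples line.
class NTriplesSketch {

    // Output naming convention from the list above: input file name plus ".nt".
    static String outputName(String inputName) {
        return inputName + ".nt";
    }

    // Escapes backslash, double quote, line feed, carriage return, and tab
    // so the value can be placed inside a quoted N-Triples literal.
    static String escapeLiteral(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds one triple whose object is a plain string literal.
    static String triple(String subjectIri, String predicateIri, String literal) {
        return "<" + subjectIri + "> <" + predicateIri + "> \""
                + escapeLiteral(literal) + "\" .";
    }

    public static void main(String[] args) {
        System.out.println(outputName("Affiliations.txt"));
        // Placeholder IRIs; the real MAKG namespaces are not assumed here.
        System.out.println(triple("http://example.org/entity/1",
                "http://example.org/property/name",
                "Example \"University\""));
    }
}
```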

Contact & More Information

More information can be found in my ISWC'19 paper The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data.

Feel free to reach out to me in case of questions or comments:

Michael Färber, [email protected]

How to Cite

Please cite my work (described in this paper) as follows:

@inproceedings{DBLP:conf/semweb/Farber19,
  author    = {Michael F{\"{a}}rber},
  title     = "{The Microsoft Academic Knowledge Graph: {A} Linked Data Source with
               8 Billion Triples of Scholarly Data}",
  booktitle = "{Proceedings of the 18th International Semantic Web Conference}",
  series    = "{ISWC'19}",
  location  = "{Auckland, New Zealand}",
  pages     = {113--129},
  year      = {2019},
  url       = {https://doi.org/10.1007/978-3-030-30796-7\_8},
  doi       = {10.1007/978-3-030-30796-7\_8}
}

Last Major Updates

  • 2020-07-09
  • 2019-07-15

Changes for Version 2020-07-09

A. MAG Dump-Files:

	1. Affiliations.txt.nt
		i.   Wikipedia article URLs were expected to use https, so the conversion from Wikipedia to DBpedia links failed. FIXED
		ii.  Created date is now in column 12 (previously 10)
		
	2. ConferenceInstances.txt.nt
		i.   Created date is now in column 17 (previously 15)
		
	3. FieldOfStudyRelationship.txt.nt
		i.   Entity two is now in column 3 (previously 4)
		ii.  Type one is now in column 2 (previously 3)
		iii. Type two is now in column 4 (previously 6)
		
	4. Journals.txt.nt
		i.   Created date is now in column 23 (previously 22)
		
	5. PaperLanguages.txt.nt
		i.   File doesn't exist anymore. If a paper has a tagged language, the language can be found in column 4 of PaperUrls.txt.
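
Since column positions shift between dump versions, reading fields through a single index-based helper keeps such changes in one place. The sketch below is an assumption-laden illustration, not MAG2RDF code: it assumes tab-separated lines and 1-based column numbers (the changelog does not state the base), and the class `ColumnSketch` and the sample line are made up.

```java
import java.util.Optional;

// Hypothetical sketch (not part of MAG2RDF): reads one column from a dump line.
class ColumnSketch {

    // Returns the value at the given 1-based column, or empty if the column
    // is out of range or the field is blank. Assumes tab-separated input.
    static Optional<String> column(String line, int oneBasedIndex) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (oneBasedIndex < 1 || oneBasedIndex > fields.length) {
            return Optional.empty();
        }
        String value = fields[oneBasedIndex - 1];
        return value.isEmpty() ? Optional.empty() : Optional.of(value);
    }

    public static void main(String[] args) {
        // Made-up line in the style of PaperUrls.txt; only column 4 matters here.
        String line = "123\t1\thttps://example.org/paper.pdf\ten";
        System.out.println(column(line, 4).orElse("(no language tagged)"));
    }
}
```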
		
B. MAG2RDF Code:

	1. textannotation/TullTextAnnotationClientXML
		i.   Line 113: Changed 'new ByteArrayInputStream(xmlstring.getBytes())' to 'new ByteArrayInputStream(xmlstring.getBytes(Charsets.UTF_8))'
		ii.  Line 263: Changed '.type(MediaType.APPLICATION_XML)' to '.type(MediaType.APPLICATION_XML + "; charset=UTF-8")'
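
The reason for passing an explicit charset can be seen in a small, self-contained example. Without an argument, `getBytes()` uses the platform default charset, which may differ between machines. The sketch below is not taken from MAG2RDF; it uses the JDK's `StandardCharsets` rather than the `Charsets` class named in the changelog to stay dependency-free, and the class `CharsetSketch` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (not part of MAG2RDF): why encode and decode charsets must match.
class CharsetSketch {

    // Encodes with an explicit charset instead of the platform default.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if encoding with one charset and decoding with another restores the text.
    static boolean roundTrips(String s, Charset encode, Charset decode) {
        return new String(s.getBytes(encode), decode).equals(s);
    }

    public static void main(String[] args) {
        String name = "F\u00e4rber"; // contains one non-ASCII character
        System.out.println("UTF-8 byte count: " + utf8Bytes(name).length);
        System.out.println("UTF-8 -> UTF-8 round trip: "
                + roundTrips(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        System.out.println("UTF-8 -> ISO-8859-1 round trip: "
                + roundTrips(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

When the encoding and decoding charsets differ, non-ASCII characters such as the umlaut come back garbled, which is the kind of problem an explicit UTF-8 charset on both the byte stream and the media type guards against.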

mag2rdf's Issues

Why are there some duplicate fields in MAKG?

First, thanks for your outstanding MAKG. While using the data set, I found some duplicate fields in it. For example, the paper with id 116022 has two types, "http://www.w3.org/2001/XMLSchema#date" and "Paper", and two titles, "Algorithms for the Construction of Digital Convex Fuzzy Hulls." and "EUSFLAT Conf. (2)". I wonder what it means when an entity has duplicate fields.
I use the latest version of the data set on Zenodo.

http://ma-graph.org/ is down

Hi there, thank you for developing and maintaining ma-graph. We use it as a resource for our research, but one of our pipelines is now down because it appears that http://ma-graph.org/ is offline.

If this is a known issue, would you be able to indicate how long this might be out-of-service? Apologies if this is the wrong forum to raise this issue.

Thanks again for this contribution to the open data community.

Reification

What reification pattern did you use to represent ternary relations such as Author-Affiliation-Publication from MAG?
