Artic is a metadata extractor for scientific papers based on a two-layer CRF (conditional random field). The project is not yet ready for use. We have open-sourced it so that anyone who wants a better understanding of what is being developed can follow along.
We are currently working on turning this project into a usable tool and releasing it as version 1.0.
Along the way, we will introduce changes aimed at reducing the size of the project and making it more functional.
This is a Master's project developed at the Universidade Federal do Rio Grande do Sul.
A presentation is available here.
Please find below the list of 100 papers we used to test Artic:
1 - Sesame: informing user security decisions with system visualization
2 - TALC: Using Desktop Graffiti to Fight Software Vulnerability
3 - You've been warned: an empirical study of the effectiveness of web browser phishing warnings
5 - A frequency-based and a poisson-based definition of the probability of being informative
6 - A new statistical formula for Chinese text segmentation incorporating contextual information
7 - A pseudo random coordinated scheduling algorithm for Bluetooth scatternets
8 - A similarity measure for motion stream segmentation and recognition
9 - Analysis of soft handover measurements in 3G network
10 - Conversation pivots and double pivots
11 - An expressive aspect language for system applications with Arachne
12 - A taxonomy of ambient information systems: four patterns of design
13 - Exploring the role of the reader in the activity of blogging
15 - A geometric constraint library for 3D graphical applications
16 - A resilient packet-forwarding scheme against maliciously packet-dropping nodes in sensor networks
17 - Looking at, looking up or keeping up with people?: motives and use of facebook
18 - A new approach to intranet search based on information extraction
19 - Ambient Social TV: Drawing People into a Shared Experience
20 - A computational approach to reflective meta-reasoning about languages with bindings
21 - Accelerated focused crawling through online relevance feedback
22 - Harvesting with SONAR: the value of aggregating social network information
23 - An intensional approach to the specification of test cases for database applications
24 - A Dependability Perspective on Emerging Technologies
25 - A Machine Learning Based Approach for Table Detection on The Web
26 - A two-phase sampling technique for information extraction from hidden web databases
27 - The Adaptation of Visual Search Strategy to Expected Information Gain
28 - Automatic extraction of titles from general documents using machine learning
29 - Heterogeneous Transfer Learning for Image Clustering via the Social Web
30 - Unsupervised Multilingual Grammar Induction
31 - Unsupervised Argument Identification for Semantic Role Labeling
32 - Automated Rich Presentation of a Semantic Topic
33 - Investigations on Word Senses and Word Usages
34 - A Comparative Study on Generalization of Semantic Roles in FrameNet
35 - Exploiting Heterogeneous Treebanks for Parsing
36 - Cross Language Dependency Parsing using a Bilingual Lexicon
37 - Topological Field Parsing of German
38 - Reinforcement Learning for Mapping Instructions to Actions
39 - A Distributed 3D Graphics Library
40 - Brutus: A Semantic Role Labeling System Incorporating CCG, CFG, and Dependency Features
41 - Temporal Summaries of News Topics
42 - Generating Event Storylines from Microblogs
43 - Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining
44 - Temporal Web Page Summarization
45 - A Cross-Collection Mixture Model for Comparative Text Mining
46 - Temporal Corpus Summarization Using Submodular Word Coverage
47 - From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
48 - Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews
49 - Large-Scale Sentiment Analysis for News and Blogs
50 - Mining and Summarizing Customer Reviews
51 - Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment
52 - Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews
53 - QoS Guaranteed Resource Block Allocation Algorithm in LTE Downlink
55 - Downlink Packets Scheduling in Enterprise WLAN
56 - Cross-layer Scheduling with Secrecy Demands in Delay-aware OFDMA Network
57 - Computational Analysis and Efficient Algorithms for Micro and Macro OFDMA Scheduling
58 - Efficiently Locating Collections of Web Pages to Wrap
59 - Growing Parallel Paths for Entity-Page Discovery
60 - Crawling Deep Web Entity Pages
61 - Semi-Supervised Learning of Attribute-Value Pairs from Product Descriptions
62 - The Role of Query Sessions in Extracting Instance Attributes from Web Search Queries
63 - Understanding Deep Web Search Interfaces: A Survey
64 - Structured Databases on the Web: Observations and Implications
65 - Harnessing the Deep Web: Present and Future
66 - Using Latent-Structure to Detect Objects on the Web
67 - Supporting the Automatic Construction of Entity Aware Search Engines
68 - Example Based Entity Search in the Web of Data
69 - Object Search: Supporting Structured Queries in Web Search Engines
70 - Ad-hoc Object Ranking in the Web of Data
71 - Gulliver in the land of data warehousing: practical experiences and observations of a researcher
72 - Deciding the Physical Implementation of ETL Workflows
73 - Defining ETL Worfklows using BPMN and BPEL
74 - A Model-Driven Framework for ETL Process Development
75 - Modeling How Students Learn to Program
76 - The WEKA Data Mining Software: An Update
77 - GraphLab: A New Framework For Parallel Machine Learning
80 - Towards energy-aware scheduling in data centers using machine learning
81 - Introduction to Probabilistic Topic Models
82 - A Few Useful Things to Know about Machine Learning
83 - EnsembleMatrix: Interactive Visualization to Support Machine Learning with Multiple Classifiers
84 - You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users
85 - Large-Scale Machine Learning with Stochastic Gradient Descent
86 - Cellular Traffic Offloading through Opportunistic Communications: A Case Study
87 - Learning Behavior Styles with Inverse Reinforcement Learning
88 - Uncovering Social Spammers: Social Honeypots + Machine Learning
89 - The Tradeoffs of Large Scale Learning
90 - Bob: A Free Signal Processing and Machine Learning Toolbox for Researchers
92 - Using Scalable Game Design to Teach Computer Science From Middle School to Graduate School
93 - Expressing Computer Science Concepts Through Kodu Game Lab
95 - A Geographical Analysis of Knowledge Production in Computer Science
96 - VLFeat - An open and portable library of computer vision algorithms
97 - The CS10K Project: Mobilizing the Community to Transform High School Computing
98 - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
100 - Coupled Semi-Supervised Learning for Information Extraction
Annotations for the first-level CRF of these papers can be found here. Author Information CRF annotations can be found here. Footnote annotations can be found here. Finally, the JSON gold-standard (expected output) is available here.
Papers 1 to 40 come from the SectLabel project.
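
For readers who want to check extractor output against the JSON gold standard mentioned above, the following is a minimal sketch of a field-by-field comparison. It is not part of Artic: the function name `field_accuracy`, the field names (`title`, `email`), and the sample values are assumptions made for illustration, and the real gold-standard schema may differ.

```python
import json


def field_accuracy(predicted: dict, gold: dict) -> dict:
    """For each field in the gold record, report whether the predicted
    value matches after collapsing whitespace and ignoring case."""

    def norm(value) -> str:
        return " ".join(str(value).split()).lower()

    return {
        field: norm(predicted.get(field, "")) == norm(expected)
        for field, expected in gold.items()
    }


# Hypothetical records; field names and values are illustrative only.
gold = json.loads(
    '{"title": "Temporal Summaries of News Topics", "email": "author@example.org"}'
)
predicted = json.loads(
    '{"title": "temporal summaries of  news topics", "email": "autor@example.org"}'
)

result = field_accuracy(predicted, gold)
print(result)  # {'title': True, 'email': False}
```

Exact matching after normalization is a deliberately strict choice; a real evaluation might prefer token-level precision and recall, especially for multi-valued fields such as author lists.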