
exercise-cora

An exercise using the Cora dataset

Introduction

  • Raw data usually have to be re-formatted before they can be analyzed with standard analysis methods.
  • This exercise provides experience in re-formatting a raw dataset.
  • A popular dataset, the Cora dataset cora-classify.tar.gz.2003141524, shall be used in this exercise.
  • The Matlab script package prot489-demo111-grpasgn.2003192044.tgz is supposed to be used for group assignment.
  • In general, you need to think about what steps are required for the experiment and to design what each script does. Insufficient designs tend to yield buggy, spaghetti-like code that cannot be maintained. In this exercise, the design is given. Through this exercise you can learn how to design scripts before implementation.

Overview of the dataset Cora

Consider the task of building a classifier that predicts the field of a paper from the word occurrences in its title. Here, the task is limited to binary classification, and the two categories are Networking and Machine Learning (ML). In this exercise, you extract only the papers classified into Networking or ML in the file “citations.withauthors” and exclude all the other papers.
Consider simulating a situation in which there are two datasets, A and C. To simulate this situation, you divide the Cora dataset into three groups, A, B, and C.
The feature set consists of word occurrences in the titles. You extract features from the titles in A and C. You train an SVM on dataset A and examine the performance on dataset C.

Code for overviewing the dataset

  • fukan010.py: Display the number of papers in each category of major classifications. Display the number of papers in each category of minor classifications.

Code for making the frequency features

  • compfreq010.py: From the file "citations.withauthors", extract the paper ids classified into "Networking" or "Machine Learning" (hereinafter abbreviated as "ML"), and save them in a file named "k.compfreq010.ids.tsv" in the following format:

    <paperid> <category>

    Therein, set <category> to +1 for Networking and -1 for ML.
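
A minimal sketch of this extraction step in Python. The tab-separated record layout of "citations.withauthors" assumed here is hypothetical; adapt the parsing to the actual file, since only the labeling logic matters:

```python
# Sketch of compfreq010.py: extract paper ids labeled Networking (+1) or ML (-1).
# ASSUMPTION: each record is "<paperid>\t<category-path>\t..." -- the real
# layout of "citations.withauthors" may differ.

def label_of(category_path):
    """Map a category path to +1 (Networking), -1 (ML), or None (excluded)."""
    if "Networking" in category_path:
        return 1
    if "Machine_Learning" in category_path:
        return -1
    return None

def extract_ids(lines):
    """Collect (paperid, category) pairs for papers in the two target classes."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        paperid, category_path = fields[0], fields[1]
        lab = label_of(category_path)
        if lab is not None:
            out.append((paperid, lab))
    return out
```

Writing "k.compfreq010.ids.tsv" is then a single loop emitting one `<paperid> <category>` line per pair.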

  • compfreq020.py: Exclude papers in "k.compfreq010.ids.tsv" that are not contained in the file "papers", and save the remaining papers in a file named "k.compfreq020.ttl.tsv" in the following format:

    <paperid> <category> <title>

    Every letter contained in <title> should be transformed to lowercase. Symbols such as colons should be removed. Stop words listed in stopwprds.2003141842.txt should be removed. In addition, the papers that are not in the file "papers" are recorded to a file "k.compfreq020.del.txt" in the following format:

    <paperid> <category>
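
The title clean-up in compfreq020.py can be sketched as follows; the stop-word set passed in stands in for the contents of stopwprds.2003141842.txt:

```python
# Sketch of the title normalization in compfreq020.py: lowercase,
# strip symbols, drop stop words.
import re

def normalize_title(title, stopwords):
    title = title.lower()
    # keep only letters, digits, and spaces; symbols such as ':' are removed
    title = re.sub(r"[^a-z0-9\s]", " ", title)
    words = [w for w in title.split() if w not in stopwords]
    return " ".join(words)
```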
  • compfreq030.py: Enumerate all words in the set of papers. Sort them in alphabetical order. Assign a wordid to each word using the natural numbers 1, 2, .... Save them in a file "k.compfreq030.words.tsv" in the following format:

    <wordid> <word>
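
The word enumeration in compfreq030.py is essentially a sorted-vocabulary construction; a minimal sketch over the cleaned titles:

```python
# Sketch of compfreq030.py: enumerate all words over the cleaned titles,
# sort alphabetically, and assign 1-based word ids.
def build_word_ids(titles):
    vocab = sorted({w for t in titles for w in t.split()})
    return {w: i for i, w in enumerate(vocab, start=1)}
```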
  • compfreq040.py: Read "k.compfreq020.ttl.tsv" and count the frequency of each word listed in "k.compfreq030.words.tsv". Make a matrix in which each row is defined as follows.

    <category> <paperid> <word0001freq> <word0002freq> ...

    Save the matrix in a file "k.compfreq040.datatab.mat" in the binary Matlab format. The format may be replaced with the SVM-light format specified in SVMlight-format-ArgCV.pdf.
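
One row of this matrix can be assembled like this (a pure-Python sketch of the counting; the real script then writes the binary Matlab file):

```python
# Sketch of one row of the compfreq040 matrix: word frequencies of a title,
# ordered by word id. Counter gives the raw counts; missing words get 0.
from collections import Counter

def freq_row(category, paperid, title, word_ids):
    counts = Counter(title.split())
    ordered = sorted(word_ids.items(), key=lambda kv: kv[1])  # by wordid
    freqs = [counts.get(w, 0) for w, _ in ordered]
    return [category, paperid] + freqs
```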

  • compfreq045.m: Generate a random matrix: each entry is drawn from the uniform distribution $\boldsymbol{U}(0,1)$; the number of rows is 20; the number of columns is equal to the number of papers. Save the matrix in a file "k.compfreq045.mat".

  • compfreq050.m: Divide the dataset into three subsets, A, B, and C, as follows. For each of the two categories, "Networking" and "ML", the data are divided into three subsets with ratio $\verb|ratio_abc| = \lbrack0.1,0.3,0.6\rbrack$. Use the Matlab function ts_assign_grps.m in the package prot489-demo111-grpasgn.*.tgz to assign each paper to one of the three groups. The assignments to the three subsets are saved in a file named "k.compfreq050.cv01.010.tsv" with the following format.

    <paperid> <subset_index>

    Give 1 to <subset_index> for subset A, 2 for B, and 3 for C. Repeat this procedure 20 times to generate "k.compfreq050.cv01.010.tsv", "k.compfreq050.cv02.010.tsv", ..., "k.compfreq050.cv20.010.tsv". The three digits "010" in the filename mean that the first number in $\verb|ratio_abc|$ is equal to $010/100$. In what follows, the assignment named "cv01" is assumed, but apply the same procedure to the other assignments. In addition, generate a file "k.compfreq050.cv01.010.rich.tsv" with the following format, for the sake of easier verification of your code.

    <paperid> <category> <subset_index> <title>
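
The per-category splitting done by ts_assign_grps.m can be approximated in pure Python as below; the shuffling and rounding details of the actual Matlab function may differ. Call it once per category and merge the results:

```python
# Python stand-in for the group assignment of ts_assign_grps.m:
# within one category, split paper ids into subsets A(1), B(2), C(3)
# with the given ratio_abc, e.g. [0.1, 0.3, 0.6].
import random

def assign_groups(paperids, ratio_abc, seed=0):
    rng = random.Random(seed)
    ids = list(paperids)
    rng.shuffle(ids)
    n = len(ids)
    n_a = round(n * ratio_abc[0])
    n_b = round(n * ratio_abc[1])
    assignment = {}
    for k, pid in enumerate(ids):
        assignment[pid] = 1 if k < n_a else (2 if k < n_a + n_b else 3)
    return assignment
```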
  • compfreq060.py: For each word in "k.compfreq040.datatab.mat", count its frequency in each category of each of the three subsets, and save the frequencies in a file named "k.compfreq060.cv01.010.words.tsv" in the following format:

    <wordid> <word> <freq1p> <freq1n> <freq2p> <freq2n> <freq3p> <freq3n>

    The value freq1p is the frequency of the word in the positive class of subset A; freq1n is the frequency in the negative class of A; freq2p in the positive class of B; freq2n in the negative class of B; freq3p in the positive class of C; freq3n in the negative class of C.
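
A sketch of this counting, assuming each matrix row carries (category, paperid, frequency vector) and a paperid-to-subset map produced by compfreq050:

```python
# Sketch of compfreq060.py: for each word, accumulate its frequency per
# (subset, class) pair. `rows` are (category, paperid, freqs) tuples and
# `subset_of` maps paperid -> 1/2/3.
def per_subset_freqs(rows, subset_of, n_words):
    # totals[wordid-1] = [freq1p, freq1n, freq2p, freq2n, freq3p, freq3n]
    totals = [[0] * 6 for _ in range(n_words)]
    for category, paperid, freqs in rows:
        col = 2 * (subset_of[paperid] - 1) + (0 if category == 1 else 1)
        for w, f in enumerate(freqs):
            totals[w][col] += f
    return totals
```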

  • compfreq080.m: Read "k.compfreq040.datatab.mat", divide each word's frequency by (freq1p+freq1n+epsilon), where epsilon=1, and write the result in a file "k.compfreq080.cv01.010.datatab.mat".

  • compfreq090.m: Read "k.compfreq080.cv01.010.datatab.mat" and perform L2-normalization for each paper. Write the result in a file "k.compfreq090.cv01.010.datatab.mat".
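
The two rescaling steps of compfreq080.m and compfreq090.m amount to a column-wise division followed by a row-wise L2-normalization; a pure-Python sketch of the logic, not the Matlab scripts themselves:

```python
# Sketch of the rescaling in compfreq080.m / compfreq090.m:
# divide each word's entry by (freq1p + freq1n + epsilon), then
# L2-normalize each paper's feature vector.
import math

def rescale(freqs, freq1p_plus_1n, epsilon=1):
    """freqs and freq1p_plus_1n are aligned per-word lists for one paper."""
    return [f / (c + epsilon) for f, c in zip(freqs, freq1p_plus_1n)]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```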

Code for verification

  • verify110.py: Read the matrix in "k.compfreq090.cv01.010.datatab.mat". Check whether every word with non-zero occurrence in each paper is actually found in the title of the corresponding paper stored in "k.compfreq020.ttl.tsv". Check whether every word with zero occurrence in each paper does not exist in the title of the corresponding paper.
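
The core consistency check of verify110.py can be stated as: a word has a non-zero entry in a paper's row if and only if it occurs in that paper's cleaned title. A sketch for one row, with `word_list` ordered by wordid:

```python
# Sketch of verify110.py's per-paper check: non-zero entries must
# correspond exactly to the words occurring in the cleaned title.
def check_row(freqs, title, word_list):
    title_words = set(title.split())
    for wordid0, f in enumerate(freqs):
        word = word_list[wordid0]
        if (f > 0) != (word in title_words):
            return False
    return True
```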

  • verify120.py: Read the matrix in "k.compfreq090.cv01.010.datatab.mat" and extract the sub-matrix that contains only the rows corresponding to subset B. Compute the sum of each column, and check whether the sum is equal to freq1p+freq1n+epsilon.

  • verify130.py: Read the matrix in "k.compfreq090.cv01.010.datatab.mat". Check whether the category label of each paper in the matrix coincides with the category defined in the file "citations.withauthors".

  • Apply the above scripts for verification both to "k.compfreq090.cv01.010.datatab.mat" generated by your own code and to the one generated by another student's code.

Do the following:

  • Use subset A to train a kernel SVM. Use the following kernel function to compute the kernel matrix: $K_p(\boldsymbol{x},\boldsymbol{x}') := \exp(-0.5(\Vert \boldsymbol{x}-\boldsymbol{x}' \Vert / \sigma)^p)$.

  • Examine three kernels $K _ {0.5}$, $K_1$, $K_2$. The last one $K_2$ is known as the RBF kernel.

  • The kernel parameter $\sigma$ and the regularization parameter are estimated by cross validation within the training set.

  • Vary the size of the training subset by setting the ratio of the training set to $10\%$, $20\%$, $30\%$, and $40\%$. That is done by changing $\verb|ratio_abc|$ to $\lbrack 0.1,0.3,0.6 \rbrack$, $\lbrack 0.2,0.2,0.6 \rbrack$, $\lbrack 0.3,0.1,0.6 \rbrack$, and $\lbrack 0.4,0.0,0.6 \rbrack$ in the script compfreq050.m.

  • Make a table describing the pattern recognition performance.

    • The rows correspond to the $10\%$, $20\%$, $30\%$, and $40\%$ training subsets, respectively.

    • The columns correspond to $K _ {0.5}$, $K_1$, $K_2$, respectively.

    • In each cell, the average and the standard deviation over the 20 repetitions cv01, ..., cv20 are written.

    • The best performance among the three kernels is bold-faced.

    • The performance values that are not significantly different from the best one are underlined.

  • Use a one-sample t-test for examining statistical significance.
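
The kernel defined above can be computed directly; a pure-Python sketch for $p \in \{0.5, 1, 2\}$ (an SVM library would then consume the precomputed kernel matrix):

```python
# Sketch of K_p(x, x') = exp(-0.5 * (||x - x'|| / sigma)^p)
# for p in {0.5, 1, 2}; p = 2 is the RBF (Gaussian) kernel.
import math

def kernel_p(x, y, sigma, p):
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-0.5 * (dist / sigma) ** p)

def kernel_matrix(X, sigma, p):
    """Gram matrix over a list of feature vectors X."""
    return [[kernel_p(xi, xj, sigma, p) for xj in X] for xi in X]
```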

exercise-cora's People

Contributors

kroobose
