
exercise-cora

An exercise using the Cora dataset

Introduction

  • Raw data usually have to be re-formatted before they can be analyzed with standard analysis methods.
  • This exercise provides experience in re-formatting a raw dataset.
  • A popular dataset, the Cora dataset cora-classify.tar.gz.2003141524, shall be used in this exercise.
  • The Matlab script package prot489-demo111-grpasgn.2003192044.tgz is supposed to be used for group assignment.
  • In general, you need to think about what steps are required for the experiment and to design what each script does. Insufficient designs tend to yield buggy, spaghetti-like code that cannot be maintained. In this exercise, the design is given. Through this exercise you can learn how to design scripts before implementation.

Overview of the dataset Cora

Consider the task of building a classifier that predicts the field of a paper from the word occurrences in its title. Here, the task is limited to binary classification, and the two categories are Networking and Machine Learning (ML). In this exercise, you extract only the papers classified into Networking or ML in the file “citations.withauthors” and exclude all the other papers.
Consider simulating a situation in which there are two datasets, A and C. To simulate this situation, you divide the Cora dataset into three groups, A, B, and C.
The feature set consists of word occurrences in the titles. You extract features from the titles in A and C. You train an SVM on dataset A and examine the performance on dataset C.

Code for overviewing the dataset

  • fukan010.py: Display the number of papers in each category of major classifications. Display the number of papers in each category of minor classifications.

Code for making the frequency features

  • compfreq010.py: From the file "citations.withauthors", extract the paper ids classified into "Networking" or "Machine Learning" (hereinafter abbreviated as "ML"), and save them in a file named "k.compfreq010.ids.tsv" in the following format:

    <paperid> <category>

    Therein, set <category> to +1 for Networking and -1 for ML.
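
A minimal sketch of this extraction step in Python. The tab-separated record layout of "citations.withauthors" assumed here is hypothetical; adapt the parsing to the actual file, since only the labeling logic matters:

```python
# Sketch of compfreq010.py: extract paper ids labeled Networking (+1) or ML (-1).
# ASSUMPTION: each record is "<paperid>\t<category-path>\t..." -- the real
# layout of "citations.withauthors" may differ.

def label_of(category_path):
    """Map a category path to +1 (Networking), -1 (ML), or None (excluded)."""
    if "Networking" in category_path:
        return 1
    if "Machine_Learning" in category_path:
        return -1
    return None

def extract_ids(lines):
    """Collect (paperid, category) pairs for papers in the two target classes."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        paperid, category_path = fields[0], fields[1]
        lab = label_of(category_path)
        if lab is not None:
            out.append((paperid, lab))
    return out
```

Writing "k.compfreq010.ids.tsv" is then a single loop emitting one `<paperid> <category>` line per pair.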

  • compfreq020.py: Exclude papers in "k.compfreq010.ids.tsv" that are not contained in the file "papers", and save the remaining papers in a file named "k.compfreq020.ttl.tsv" in the following format:

    <paperid> <category> <title>

    Every letter contained in <title> should be transformed to lowercase. Symbols such as colons should be removed. Stop words listed in stopwprds.2003141842.txt should be removed. In addition, the papers that are not in the file "papers" are recorded to a file "k.compfreq020.del.txt" in the following format:

    <paperid> <category>
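
The title clean-up in compfreq020.py can be sketched as follows; the stop-word set passed in stands in for the contents of stopwprds.2003141842.txt:

```python
# Sketch of the title normalization in compfreq020.py: lowercase,
# strip symbols, drop stop words.
import re

def normalize_title(title, stopwords):
    title = title.lower()
    # keep only letters, digits, and spaces; symbols such as ':' are removed
    title = re.sub(r"[^a-z0-9\s]", " ", title)
    words = [w for w in title.split() if w not in stopwords]
    return " ".join(words)
```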
  • compfreq030.py: Enumerate all words in the set of papers. Sort them in alphabetical order. Assign a wordid to each word using the natural numbers 1, 2, .... Save them in a file "k.compfreq030.words.tsv" in the following format:

    <wordid> <word>
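
The word enumeration in compfreq030.py is essentially a sorted-vocabulary construction; a minimal sketch over the cleaned titles:

```python
# Sketch of compfreq030.py: enumerate all words over the cleaned titles,
# sort alphabetically, and assign 1-based word ids.
def build_word_ids(titles):
    vocab = sorted({w for t in titles for w in t.split()})
    return {w: i for i, w in enumerate(vocab, start=1)}
```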
  • compfreq040.py: Read "k.compfreq020.ttl.tsv" and count the frequency of each word listed in "k.compfreq030.words.tsv". Make a matrix in which each row is defined as follows.

    <category> <paperid> <word0001freq> <word0002freq> ...

    Save the matrix in a file "k.compfreq040.datatab.mat" in the binary Matlab format. The format may be replaced with the SVM-light format specified in SVMlight-format-ArgCV.pdf.
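
One row of this matrix can be assembled like this (a pure-Python sketch of the counting; the real script then writes the binary Matlab file):

```python
# Sketch of one row of the compfreq040 matrix: word frequencies of a title,
# ordered by word id. Counter gives the raw counts; missing words get 0.
from collections import Counter

def freq_row(category, paperid, title, word_ids):
    counts = Counter(title.split())
    ordered = sorted(word_ids.items(), key=lambda kv: kv[1])  # by wordid
    freqs = [counts.get(w, 0) for w, _ in ordered]
    return [category, paperid] + freqs
```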

  • compfreq045.m: Generate a random matrix: each entry is drawn from the uniform distribution $\boldsymbol{U}(0,1)$; the number of rows is 20; the number of columns is equal to the number of papers. Save the matrix in a file "k.compfreq045.mat".

  • compfreq050.m: Divide the dataset into three subsets, A, B, and C, as follows. For each of the two categories, "Networking" and "ML", the data are divided into three subsets with ratio $\verb|ratio_abc| = \lbrack0.1,0.3,0.6\rbrack$. Use the Matlab function ts_assign_grps.m in the package prot489-demo111-grpasgn.*.tgz to assign each paper to one of the three groups. The assignments to the three subsets are saved in a file named "k.compfreq050.cv01.010.tsv" with the following format.

    <paperid> <subset_index>

    Give 1 to <subset_index> for subset A, 2 for B, and 3 for C. Repeat this procedure 20 times to generate "k.compfreq050.cv01.010.tsv", "k.compfreq050.cv02.010.tsv", ..., "k.compfreq050.cv20.010.tsv". The three digits "010" in the filename mean that the first number in $\verb|ratio_abc|$ is equal to $010/100$. In what follows, the assignment named "cv01" is assumed, but apply the same procedure to the other assignments. In addition, generate a file "k.compfreq050.cv01.010.rich.tsv" with the following format, for the sake of easier verification of your code.

    <paperid> <category> <subset_index> <title>
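
The per-category splitting done by ts_assign_grps.m can be approximated in pure Python as below; the shuffling and rounding details of the actual Matlab function may differ. Call it once per category and merge the results:

```python
# Python stand-in for the group assignment of ts_assign_grps.m:
# within one category, split paper ids into subsets A(1), B(2), C(3)
# with the given ratio_abc, e.g. [0.1, 0.3, 0.6].
import random

def assign_groups(paperids, ratio_abc, seed=0):
    rng = random.Random(seed)
    ids = list(paperids)
    rng.shuffle(ids)
    n = len(ids)
    n_a = round(n * ratio_abc[0])
    n_b = round(n * ratio_abc[1])
    assignment = {}
    for k, pid in enumerate(ids):
        assignment[pid] = 1 if k < n_a else (2 if k < n_a + n_b else 3)
    return assignment
```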
  • compfreq060.py: For each word in "k.compfreq040.datatab.mat", count its frequency in each category of each of the three subsets, and save the frequencies in a file named "k.compfreq060.cv01.010.words.tsv" in the following format:

    <wordid> <word> <freq1p> <freq1n> <freq2p> <freq2n> <freq3p> <freq3n>

    The value freq1p is the frequency of the word in the positive class of subset A; freq1n is the frequency in the negative class of A; freq2p in the positive class of B; freq2n in the negative class of B; freq3p in the positive class of C; freq3n in the negative class of C.
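
A sketch of this counting, assuming each matrix row carries (category, paperid, frequency vector) and a paperid-to-subset map produced by compfreq050:

```python
# Sketch of compfreq060.py: for each word, accumulate its frequency per
# (subset, class) pair. `rows` are (category, paperid, freqs) tuples and
# `subset_of` maps paperid -> 1/2/3.
def per_subset_freqs(rows, subset_of, n_words):
    # totals[wordid-1] = [freq1p, freq1n, freq2p, freq2n, freq3p, freq3n]
    totals = [[0] * 6 for _ in range(n_words)]
    for category, paperid, freqs in rows:
        col = 2 * (subset_of[paperid] - 1) + (0 if category == 1 else 1)
        for w, f in enumerate(freqs):
            totals[w][col] += f
    return totals
```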

  • compfreq080.m: Read "k.compfreq040.datatab.mat", divide each word's frequency by (freq1p+freq1n+epsilon), where epsilon=1, and write the result in a file "k.compfreq080.cv01.010.datatab.mat".

  • compfreq090.m: Read "k.compfreq080.cv01.010.datatab.mat" and perform L2-normalization for each paper. Write the result in a file "k.compfreq090.cv01.010.datatab.mat".
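
The two rescaling steps of compfreq080.m and compfreq090.m amount to a column-wise division followed by a row-wise L2-normalization; a pure-Python sketch of the logic, not the Matlab scripts themselves:

```python
# Sketch of the rescaling in compfreq080.m / compfreq090.m:
# divide each word's entry by (freq1p + freq1n + epsilon), then
# L2-normalize each paper's feature vector.
import math

def rescale(freqs, freq1p_plus_1n, epsilon=1):
    """freqs and freq1p_plus_1n are aligned per-word lists for one paper."""
    return [f / (c + epsilon) for f, c in zip(freqs, freq1p_plus_1n)]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```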

Code for verification

  • verify110.py: Read the matrix in "k.compfreq090.cv01.010.datatab.mat". Check whether every word with non-zero occurrence in each paper is actually found in the title of the corresponding paper stored in "k.compfreq020.ttl.tsv". Check whether every word with zero occurrence in each paper does not exist in the title of the corresponding paper.
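
The core consistency check of verify110.py can be stated as: a word has a non-zero entry in a paper's row if and only if it occurs in that paper's cleaned title. A sketch for one row, with `word_list` ordered by wordid:

```python
# Sketch of verify110.py's per-paper check: non-zero entries must
# correspond exactly to the words occurring in the cleaned title.
def check_row(freqs, title, word_list):
    title_words = set(title.split())
    for wordid0, f in enumerate(freqs):
        word = word_list[wordid0]
        if (f > 0) != (word in title_words):
            return False
    return True
```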

  • verify120.py: Read the matrix in "k.compfreq090.cv01.010.datatab.mat" and extract the sub-matrix that contains only the rows corresponding to subset B. Compute the sum of each column, and check whether the sum is equal to freq1p+freq1n+epsilon.

  • verify130.py: Read the matrix in "k.compfreq090.cv01.010.datatab.mat". Check whether the category label of each paper in the matrix coincides with the category defined in the file "citations.withauthors".

  • Apply the above scripts for verification both to "k.compfreq090.cv01.010.datatab.mat" generated by your own code and to the one generated by another student's code.

Do the following:

  • Use subset A to train a kernel SVM. Use the following kernel function to compute the kernel matrix: $K_p(\boldsymbol{x},\boldsymbol{x}') := \exp(-0.5(\Vert \boldsymbol{x}-\boldsymbol{x}' \Vert / \sigma)^p)$.

  • Examine three kernels $K _ {0.5}$, $K_1$, $K_2$. The last one $K_2$ is known as the RBF kernel.

  • The kernel parameter $\sigma$ and the regularization parameter are estimated by cross validation within the training set.

  • Vary the size of the training subset by setting the ratio of the training set to $10\%$, $20\%$, $30\%$, and $40\%$. That is done by changing $\verb|ratio_abc|$ to $\lbrack 0.1,0.3,0.6 \rbrack$, $\lbrack 0.2,0.2,0.6 \rbrack$, $\lbrack 0.3,0.1,0.6 \rbrack$, and $\lbrack 0.4,0.0,0.6 \rbrack$ in the script compfreq050.m.

  • Make a table describing the pattern recognition performance.

    • The rows correspond to the $10\%$, $20\%$, $30\%$, and $40\%$ training subsets, respectively.

    • The columns correspond to $K _ {0.5}$, $K_1$, $K_2$, respectively.

    • In each cell, the average and the standard deviation over the 20 repetitions cv01, ..., cv20 are written.

    • The best performance among the three kernels is bold-faced.

    • The performance values that are not significantly different from the best one are underlined.

  • Use a one-sample t-test for examining statistical significance.
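
The kernel defined above can be computed directly; a pure-Python sketch for $p \in \{0.5, 1, 2\}$ (an SVM library would then consume the precomputed kernel matrix):

```python
# Sketch of K_p(x, x') = exp(-0.5 * (||x - x'|| / sigma)^p)
# for p in {0.5, 1, 2}; p = 2 is the RBF (Gaussian) kernel.
import math

def kernel_p(x, y, sigma, p):
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-0.5 * (dist / sigma) ** p)

def kernel_matrix(X, sigma, p):
    """Gram matrix over a list of feature vectors X."""
    return [[kernel_p(xi, xj, sigma, p) for xj in X] for xi in X]
```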

exercise-cora's People

Contributors

kroobose
