Giter Site home page Giter Site logo

c99-sentencebert's Introduction

HOW TO USE THE CODE

  1. You need these packages: SentenceTransformer, LoggingHandler, numpy, logging, copy, statistics
  2. Unzip the folder raw_conversation in the orginal place.
  3. Run python code.
  4. Segment points are stored at file file_result.txt.

C99-SentenceBert

For linear text segmentation

This algorithm is from Freddy Y.Y.Choi in his paper Advances in domain independent linear text segmentation in 2000. (really too old). I feel it strange that no implementation of C99 algorithm is found to the best of my knowledge. I can only find an api in Java. But due to class sun.misc.Compare was deleted from java since java 8, the api didn't work.

In this repo, I implemented C99 algorithm.

  1. Load sentence-bert model from URL and calculate embeddings of the whole document.
  2. Calculate cosine similarities of each sentences and put them in the matrix: sim_matrix
  3. Since an additional occurence of a common word causes a disproportionate increase in sim_matrix in short text segments, we present a ranking scheme which is an adaptation. Each value in the similarity matrix is replaced by its rank in the local region. The rank is the number of neighbouring elements with a lower similarity value. An example of image ranking using a 3x3 rank mask with output range {0,8}:

pic1

Notice the contrast has been improved significantly currently. In this repo, I set mask size = 11 and save values of r(x) into matrix: ranked_matrix. r(x) = number of elements with a lower value / number of elements examined.

  1. VERSION 1(This part is different from Freddy' paper).

    In his paper, the number of divided segments is m. m = u + 1.2* v**(1/2). u is mean of gadient od D(n) and v is variance of D(n). D(n) means inside dencity of n segments. D(n) = sum of m number of sk / sum of m number of ak. Sk refers to the sum of rank values in one segment and a(start, end) = (end - start +1) ** 2. Optimal value of m occurs when gradient of D(n) decreases greatly.

    pic 2

    In his paper, optimal value of segments should be n.

    But I found it boring to perform so much steps. Since the data of mine is short conversation, I decided to set n=3.

  2. VERSION 2 (same as Freddy) The value of segments are calculated automatically.

c99-sentencebert's People

Contributors

lu-tan avatar

Stargazers

omgwwcd avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.