Giter Site home page Giter Site logo

vickumar1981 / stringdistance Goto Github PK

View Code? Open in Web Editor NEW
75.0 6.0 15.0 1.3 MB

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..

Home Page: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html

License: Other

Scala 87.24% Java 12.64% Shell 0.06% Batchfile 0.06%
levenshtein-distance levenshtein ngram jaro-distance jaro jaro-winkler jaro-winkler-distance dice-coefficient hamming-distance sorensen-dice-distance

stringdistance's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

stringdistance's Issues

Restructure/Refactor test cases

The class that stores test case information needs to be refactored:

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/fixtures/TestCases.scala#L6

TestCase and TestSoundCase are case classes with value members: Option[T] where T is either an Int or Double depending on whether the test is for a distance or fuzzy score metric. These classes need to be generalized to support adding more test cases with greater flexibility.

String and String sound tests may need to be refactored as well to support any new model:

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/TestStringDistance.scala

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/TestStringSoundScore.scala

Bug in Jaro Winkler score after 1.1.5 version

Jaro Winkler algorithm error after 1.1.5 version update

sample code produces different result

  import com.github.vickumar1981.stringdistance.StringDistance.JaroWinkler
  println(JaroWinkler.score("kkk_k", "kkk"))

version 1.1.5

0.9066666666666667

newer versions

0.30000000000000004

Looks like problem in getCommonChars function of CommonStringDistanceAlgo interface

Using javadoc.io for documentation

Published maven artifacts, that contain documentation, can usually be found on javadoc.io.

For example, the same documentation hosted on the github.io page and linked from the README.md, can be found here as well, https://www.javadoc.io/doc/com.github.vickumar1981/stringdistance_2.12/1.0.7

Is javadoc.io a better location/main place to host the documentation and link to there from the README.md, or continue hosting on a github.io page?

Advantages of javadoc.io:

  • Relatively no maintenance, publishing the artifact updates the documentation.
  • Docs are automatically versioned, whereas on the github.io, you can only see the latest version of the docs.

Disadvantages of javadoc.io:

  • Maven central takes some time to sync. there's some delay between publishing an artifact, and it showing up on the search. even if it doesn't show up on the search doesn't mean it's not available, but it's the process is not instantaneous.

Advantages of using github.io page:

  • More control over when the documentation is updated. Can possibly tie the task to the travis build and automatically update documentation on any check-in to a snapshot or master branch.

Disadvantages of github.io page:

  • Need to maintain the documentation yourself.
  • Version of documentation always points to latest version of jar

Add guidelines for contributing

Need to add a CONTRIBUTING.md file that outlines:

Alternatively, could include the sbt-launch.jar file in the project, and the only dependency would be having Java 8. (https://github.com/sbt/launcher)

  • Need to include instructions about forking/branching from latest snapshot branch and opening PR's against the snapshot branch (currently, 1.0.8-SNAPSHOT is the branch)

  • Instructions on how to run unit tests (sbt test runs unit tests) and the ./test.sh command will run the linter (scalastyle), unit tests, generate a coverage report, and check that code coverage is adequate (100%).

  • Instructions on how to generate and check documentation (sbt doc will create the documentation in your local /target folder of the project)

  • Code style and conventions - Could use https://github.com/excellalabs/intro-to-functional-scala/blob/master/scala-style-guide.md or https://github.com/databricks/scala-style-guide

  • PR etiquette and general guidelines

Push deployed artifact to sonatype on tagged release

Should automate pushing to the sonatype repo - should be tied into the git tag.

i.e., if the current version of the software is 99.99.0, then:

git commit -a -m "v99.99.1"
git tag -a 99.99.1 -m "v99.99.1"
git push origin master
git push --tags origin

would trigger a new deploy artifact.

Artifacts are deployed using the sbt-sonatype plugin.

Published JAR contains Scala Standard Library

The published JAR of the library (see https://mvnrepository.com/artifact/com.github.vickumar1981/stringdistance_2.12/1.1.1) should not contain the Scala Standard Library. Clients using the library will now have two, not necessarily identical Standard Libraries on the classpath. While this might work just fine 95% of the time, it might cause serious headaches in the remaining 5%. See https://mlangc.wordpress.com/2013/07/04/good-objects-breaking-bad/ and https://github.com/sbt/sbt-assembly#publishing-not-recommended for further details.

Add Junit dependency and tests around .util Java wrapper package

Though the .util Java package wrapper class, StringDistince, calls the same underlying implementation as the Scala version, there is a chance of missing parameters/incorrect syntax with the wrapper that could go uncaught. sbt will run on Scala sources, but can add Junit, and map to an sbt task to run both sets of tests.

  • Add Junit as a dependency
  • Create a task for running the java/junit tests
  • Include java sources in coverage report (Is this even possible?)

Upgrade to Scala 3

  • Upgrade for use with Scala 3
  • Scala 3 has new keywords for doing type classes. Old Scala 2 way, should still work, but re-writing for Scala 3 style might break backwards compatibility with Scala 2.

The method is bug -> CommonStringDistanceAlgo.getCommonChars

def getCommonChars(s1: String, s2: String, halfLen: Int): String = {
val commonChars = new StringBuilder()
val strCopy = new StringBuilder(s2)
var n = s1.length
val m = s2.length
s1.zipWithIndex.foreach{
case (ch, chIndex) => {
var foundIt = false
var j = math.max(0, chIndex - halfLen)
while (!foundIt && j <= Math.min(chIndex + halfLen, m - 1)) {
if (strCopy(j) == ch) {
foundIt = true
commonChars.append(ch)
strCopy.setCharAt(j, '\0')
}
j += 1
}
}}
commonChars.toString
}

Benchmarking: Add benchmarks for java utils package

Current guidance on benchmarks are in the CONTRIBUTING.MD: https://github.com/vickumar1981/stringdistance/blob/master/CONTRIBUTING.md#6-running-benchmarks

These benchmark the Scala implementations only. An open question/concern that I have had is if there is a hidden boxing cost when using the java wrapper types Double and Int, instead of the primitive alternatives.

https://dzone.com/articles/java-performance-notes-autoboxing-unboxing

The 1st step, I think, would be to setup the tantamount benchmark tests for the java classes and compare if using the primitive types int and double would be faster.

Finish API documentation

Need to put annotated comments in more files. The docs are incomplete and could use more explanation of what each of the methods and parameters do.

NGram and Overlap score bug

While two identical strings get a score of 1, in every other cases both NGram and Overlap return a score where 0 = most similar and 1 = completely different:

NGram.score("karolin", "karolin", 1)  // 1.0
NGram.score("karolin", "karolyn", 1)  // 0.143
NGram.score("karolin", "kathrin", 1)  // 0.286
NGram.score("karolin", "bcdefgh", 1)  // 1.0

Overlap.score("karolin", "karolin", 1)  // 1.0
Overlap.score("karolin", "karalin", 1)  // 0.143
Overlap.score("karolin", "kathrin", 1)  // 0.286
Overlap.score("karolin", "bcdefgh", 1)  // 1.0

I suspect that's because interfaces.NGramTokenizer.foldNGram expects its fourth argument to be a function taking both token sequences and the number of common tokens in both, while impl.NGramImpl.nGram and impl.OverlapImpl.overlap pass a function with both sequences and the distance between them.

Allow access to underlying n-gram tokenizer

The N-gram implementation: https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/impl/NGramImpl.scala

Generalize StringDistance to ArrayDistance

Provide a more general interface that works with Array[T] where T is an arbitrary type.

String is the specific case of an Array[Char], and most of the underlying implementations can still work with more generalized vectors.

  • The implicits packages and the main classes can still convert to the underlying implementation using .toCharArray

  • The implicits packages should also provide conversions from String to Array[Char] and back.

  • Can introduce an ArrayDistance class with Java wrapper in util.ArrayDistance to complement the String classes, and extend the implementations to Array[T].

longestCommonSeq implementation has exponential complexity

longestCommonSeq is not efficient, it takes a very long time if the strings to be compared have a large distance.

The implemented algorithm has exponential complexity O(2^(n+m)).

Expected Result

predictable performance. algorithm should have O(n*m) complexity as is possible.

c.f. https://www.techiedelight.com/longest-common-subsequence/

--> translated to SCALA:

def LCSLength(X: String, Y: String): Int = {
val m = X.length
val n = Y.length
// lookup table stores solution to already computed subproblems,
// i.e., T[i][j] stores the length of LCS of substring
// X[0…i-1] and Y[0…j-1]
val T = Array.ofDim[Int](m + 1, n + 1)
// fill the lookup table in a bottom-up manner
for (i <- 1 to m) {
for (j <- 1 to n) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// if the current character of X and Y matches
T(i)(j) = T(i - 1)(j - 1) + 1
}
else {
// otherwise, if the current character of X and Y don't match
T(i)(j) = Integer.max(T(i - 1)(j), T(i)(j - 1))
}
}
}
// LCS will be the last entry in the lookup table
T(m)(n)
}

Actual Result

"unpredictable" runtime

Reproduction Steps

Compare "RUE NELSON MANDELA" with "RUE ALBERT UNGEHEUER" -> it just takes way to long to find out that these are not the same.

Increase tests for SoundEx algorithm

Current test coverage is ~86.2% for the SoundEx algorithm.

This can be increased by addressing a few edge cases:

The missing lines are shown in the scoverage report.

https://coveralls.io/builds/34087846/source?filename=src/main/scala/com/github/vickumar1981/stringdistance/interfaces/sound/SoundexAlgo.scala

Cross build for java 7 and scala 2.11

Currently, the jar produced is for java 8 and scala 2.12 b/c that's what I'm using.

Should be able to add build flags to compile for java 7 and scala 2.11 as well.

Need to add more tests/testing around string sequencing algorithms

The following are the sequencing algorithms which are implemented:

Needleman-Wunsh: https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm

Smith-Waterman & Smith-Waterman-Gotoh:
https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

Need to add more tests/examples to make sure these implementations are working as expected.

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/TestStringDistance.scala#L132

Scala issue: " object vickumar1981 is not a member of package com.github"

Summary.

import com.github.vickumar1981.stringdistance.StringDistance._
should not give errors

What you expected.

no error messages

What happened instead.

Scala (through Spark DataBricks) complains during importing libraries:
object vickumar1981 is not a member of package com.github

import com.github.vickumar1981.stringdistance.StringDistance._

Add a CHANGELOG.md

  • Add a CHANGELOG.md file that tracks changes between revisions
  • Investigate if there's an automated way to do this.

Add Damerau variation of Levenshtein

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

Damerau is a variation of Levenshtein where transposition is included as an operation. Normal levenshtein accounts for additions, removals, and substitutions of characters. damerau is additions, removals, substitutions, and transpositions. the upper bound on the damerau distance is the levenshtein distance, and correspondingly the lower bound on the damerau similarity score is the levenshtein similarity score.

Benchmarking: Update benchmarks/performance

This is very open-ended, but at a start, deciding upon a way to run benchmark tests would be a great improvement, and would allow code changes to be tested from the perspective of performance.

** To do **

  • Evaluate bench marking tools available
  • Decide upon bench marking approach
  • Implement benchmarks in test suite.

Update License.txt

Currently, the license is simply a link to the Apache License website. Need to add a License.txt file to the project with the specific project name and other variables in the template filled out.

Will also want to change the links in the README.md to point the license to the new text file.

Implicits not found while calling StringDistance

I'm creating a simple Test program to test ur Scala code, I am facing implicit issues, I tried various thing but is working, here is my sample code

import com.github.vickumar1981.stringdistance.StringDistance._

object test extends App {

val res = Cosine.score("aamir", "rimaa")
}
Its expecting implicit in scope, I simply copied from example from https://github.com/vickumar1981/stringdistance

this is the error message its giving:
Error:(8, 38) ambiguous reference to overloaded definition,
both method score in trait StringMetric of type (s1: String, s2: String)(implicit algo: com.github.vickumar1981.stringdistance.SoundScoringAlgorithm[com.github.vickumar1981.stringdistance.CosineAlgorithm])Boolean
and method score in trait StringMetric of type (s1: String, s2: String)(implicit algo: com.github.vickumar1981.stringdistance.ScoringAlgorithm[com.github.vickumar1981.stringdistance.CosineAlgorithm])Double
match argument types (String,String)
val cosSimilarity: Double = Cosine.score("hello", "chello")() // 0.935

there you haven't mentioned any implicit usage or importing any implicit.
Please look into this.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.