vickumar1981 / stringdistance Goto Github PK

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..

Home Page: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html

License: Other

Scala 87.24% Java 12.64% Shell 0.06% Batchfile 0.06%

levenshtein-distance levenshtein ngram jaro-distance jaro jaro-winkler jaro-winkler-distance dice-coefficient hamming-distance sorensen-dice-distance

stringdistance's People

Stargazers

Watchers

Forkers

tonyle9 ikr0m obrenckle wzjmail zjiaksmc monarchwise jd557 tg44 gps-dev-acc andredurazzo rrramiro xg86 christianwitte bartoll scala-popular-repos

stringdistance's Issues

Increase tests for Metaphone algorithm

Current test coverage is ~79.4% for the Metaphone algorithm.

This can be increased by addressing more of the edge cases:

The missing lines are shown in the scoverage report.

https://coveralls.io/builds/34087846/source?filename=src/main/scala/com/github/vickumar1981/stringdistance/interfaces/sound/MetaphoneAlgo.scala

Restructure/Refactor test cases

The class that stores test case information needs to be refactored:

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/fixtures/TestCases.scala#L6

TestCase and TestSoundCase are case classes with value members: Option[T] where T is either an Int or Double depending on whether the test is for a distance or fuzzy score metric. These classes need to be generalized to support adding more test cases with greater flexibility.

String and String sound tests may need to be refactored as well to support any new model:

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/TestStringDistance.scala

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/TestStringSoundScore.scala

Bug in Jaro Winkler score after 1.1.5 version

Jaro Winkler algorithm error after 1.1.5 version update

sample code produces different result

  import com.github.vickumar1981.stringdistance.StringDistance.JaroWinkler
  println(JaroWinkler.score("kkk_k", "kkk"))

version 1.1.5

0.9066666666666667

newer versions

0.30000000000000004

Looks like problem in getCommonChars function of CommonStringDistanceAlgo interface

Add scoring metric: Tversky index

Add scoring metric for Tversky index.

https://en.wikipedia.org/wiki/Tversky_index

Note: the index score is assymetric - Tversky.score(s1, s2) will not always equal Tversky.score(s2, s1)

Only supported starting with Android O

the library id very usefull, but it is only supported starting with Android O (--min-api 26) !

Add auto-formatting with scalafmt

Can automatically format the code when running sbt compile by using the scalafmt plugin: https://scalameta.org/scalafmt/

Using javadoc.io for documentation

Published maven artifacts, that contain documentation, can usually be found on javadoc.io.

For example, the same documentation hosted on the github.io page and linked from the README.md, can be found here as well, https://www.javadoc.io/doc/com.github.vickumar1981/stringdistance_2.12/1.0.7

Is javadoc.io a better location/main place to host the documentation and link to there from the README.md, or continue hosting on a github.io page?

Advantages of javadoc.io:

Relatively no maintenance, publishing the artifact updates the documentation.
Docs are automatically versioned, whereas on the github.io, you can only see the latest version of the docs.

Disadvantages of javadoc.io:

Maven central takes some time to sync. there's some delay between publishing an artifact, and it showing up on the search. even if it doesn't show up on the search doesn't mean it's not available, but it's the process is not instantaneous.

Advantages of using github.io page:

More control over when the documentation is updated. Can possibly tie the task to the travis build and automatically update documentation on any check-in to a snapshot or master branch.

Disadvantages of github.io page:

Need to maintain the documentation yourself.
Version of documentation always points to latest version of jar

Add new phonetic similarity: Double Metaphone

Add implementation and tests for double metaphone: https://en.wikipedia.org/wiki/Metaphone#Double_Metaphone

Add guidelines for contributing

Need to add a CONTRIBUTING.md file that outlines:

How and where to install scala and sbt (simple build tool, aka maven, but for scala and also much better)
Links: https://www.scala-sbt.org/1.0/docs/Setup.html, https://www.scala-lang.org/download/2.12.7.html

Alternatively, could include the sbt-launch.jar file in the project, and the only dependency would be having Java 8. (https://github.com/sbt/launcher)

Need to include instructions about forking/branching from latest snapshot branch and opening PR's against the snapshot branch (currently, 1.0.8-SNAPSHOT is the branch)
Instructions on how to run unit tests (sbt test runs unit tests) and the ./test.sh command will run the linter (scalastyle), unit tests, generate a coverage report, and check that code coverage is adequate (100%).
Instructions on how to generate and check documentation (sbt doc will create the documentation in your local /target folder of the project)
Code style and conventions - Could use https://github.com/excellalabs/intro-to-functional-scala/blob/master/scala-style-guide.md or https://github.com/databricks/scala-style-guide
PR etiquette and general guidelines

Update documentation for package: com.github.vickumar1981.stringdistance.util

Documentation can be viewed here: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/util/index.html

Add tests around phonetic similarity

The tests for the phonetic similarities - soundex and metaphone are sparse. Need to add more tests/examples to make sure those are working as expected.

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/fixtures/TestSoundCases.scala#L8

Can also enable/use (currently, disabled for phonetic similarities only) the coverage report to see what part of the algorithms is untested.

Push deployed artifact to sonatype on tagged release

Should automate pushing to the sonatype repo - should be tied into the git tag.

i.e., if the current version of the software is 99.99.0, then:

git commit -a -m "v99.99.1"
git tag -a 99.99.1 -m "v99.99.1"
git push origin master
git push --tags origin

would trigger a new deploy artifact.

Artifacts are deployed using the sbt-sonatype plugin.

Add distance metric: Tanimoto variant of Jaccard score

Add an implementation for the tanimoto variant of the jaccard distance/score.

https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_(extended_Jaccard_coefficient)

Note: For the tanimoto algorithm, the score is unnecessary b/c it will/outhgt to be the same as the jaccard score - the distances should be different.

Published JAR contains Scala Standard Library

The published JAR of the library (see https://mvnrepository.com/artifact/com.github.vickumar1981/stringdistance_2.12/1.1.1) should not contain the Scala Standard Library. Clients using the library will now have two, not necessarily identical Standard Libraries on the classpath. While this might work just fine 95% of the time, it might cause serious headaches in the remaining 5%. See https://mlangc.wordpress.com/2013/07/04/good-objects-breaking-bad/ and https://github.com/sbt/sbt-assembly#publishing-not-recommended for further details.

Add Junit dependency and tests around .util Java wrapper package

Though the .util Java package wrapper class, StringDistince, calls the same underlying implementation as the Scala version, there is a chance of missing parameters/incorrect syntax with the wrapper that could go uncaught. sbt will run on Scala sources, but can add Junit, and map to an sbt task to run both sets of tests.

Add Junit as a dependency
Create a task for running the java/junit tests
Include java sources in coverage report (Is this even possible?)

Upgrade to Scala 3

Upgrade for use with Scala 3
Scala 3 has new keywords for doing type classes. Old Scala 2 way, should still work, but re-writing for Scala 3 style might break backwards compatibility with Scala 2.

Update documentation for base package: com.github.vickumar1981.stringdistance

Update API documentation for package: com.github.vickumar1981.stringdistance

Update documentation under the root package only.
Documentation page can be viewed here: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html

Add new scoring metric: Needleman score

Should add an algorithm/implementation for Needleman-Wunsh similarity.

https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm#Basic_scoring_schemes

Sort of a generalized Levenshtein with a penalty weighted in.

Update documentation for package: com.github.vickumar1981.stringdistance.impl

Documentation can be viewed here: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/impl/index.html

The method is bug -> CommonStringDistanceAlgo.getCommonChars

def getCommonChars(s1: String, s2: String, halfLen: Int): String = {
val commonChars = new StringBuilder()
val strCopy = new StringBuilder(s2)
var n = s1.length
val m = s2.length
s1.zipWithIndex.foreach{
case (ch, chIndex) => {
var foundIt = false
var j = math.max(0, chIndex - halfLen)
while (!foundIt && j <= Math.min(chIndex + halfLen, m - 1)) {
if (strCopy(j) == ch) {
foundIt = true
commonChars.append(ch)
strCopy.setCharAt(j, '\0')
}
j += 1
}
}}
commonChars.toString
}

Add new metric: Smith Waterman

Add the smith waterman similarity, and possibly it's variant smith-waterman gotah.

https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

Update documentation for package: com.github.vickumar1981.stringdistance.implicits

Documentation for package can be viewed here: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/implicits/index.html

Benchmarking: Add benchmarks for java utils package

Current guidance on benchmarks are in the CONTRIBUTING.MD: https://github.com/vickumar1981/stringdistance/blob/master/CONTRIBUTING.md#6-running-benchmarks

These benchmark the Scala implementations only. An open question/concern that I have had is if there is a hidden boxing cost when using the java wrapper types Double and Int, instead of the primitive alternatives.

https://dzone.com/articles/java-performance-notes-autoboxing-unboxing

The 1st step, I think, would be to setup the tantamount benchmark tests for the java classes and compare if using the primitive types int and double would be faster.

Add back compatibility with Scala 2.11

Spark EMR currently still uses the 2.11 compiler.

This commit: 7dfbb76 dropped support for 2.11.

Add back in compatibility with Scala 2.11

Add new sound metric: soundex variants

I think the soundex score has a few variants that can be added, something like a RefinedSoundEx score.

https://en.wikipedia.org/wiki/Soundex

Finish API documentation

Need to put annotated comments in more files. The docs are incomplete and could use more explanation of what each of the methods and parameters do.

NGram and Overlap score bug

While two identical strings get a score of 1, in every other cases both NGram and Overlap return a score where 0 = most similar and 1 = completely different:

NGram.score("karolin", "karolin", 1)  // 1.0
NGram.score("karolin", "karolyn", 1)  // 0.143
NGram.score("karolin", "kathrin", 1)  // 0.286
NGram.score("karolin", "bcdefgh", 1)  // 1.0

Overlap.score("karolin", "karolin", 1)  // 1.0
Overlap.score("karolin", "karalin", 1)  // 0.143
Overlap.score("karolin", "kathrin", 1)  // 0.286
Overlap.score("karolin", "bcdefgh", 1)  // 1.0

I suspect that's because interfaces.NGramTokenizer.foldNGram expects its fourth argument to be a function taking both token sequences and the number of common tokens in both, while impl.NGramImpl.nGram and impl.OverlapImpl.overlap pass a function with both sequences and the distance between them.

Allow access to underlying n-gram tokenizer

The N-gram implementation: https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/impl/NGramImpl.scala

can allow access to the tokens, such that a function: Ngram.tokens("something", 2)
willresult in an Array[String] which are the n-grams themselves.
the function to do this is here: https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/interfaces/NGramTokenizer.scala#L9

Generalize StringDistance to ArrayDistance

Provide a more general interface that works with Array[T] where T is an arbitrary type.

String is the specific case of an Array[Char], and most of the underlying implementations can still work with more generalized vectors.

The implicits packages and the main classes can still convert to the underlying implementation using .toCharArray
The implicits packages should also provide conversions from String to Array[Char] and back.
Can introduce an ArrayDistance class with Java wrapper in util.ArrayDistance to complement the String classes, and extend the implementations to Array[T].

longestCommonSeq implementation has exponential complexity

longestCommonSeq is not efficient, it takes a very long time if the strings to be compared have a large distance.

The implemented algorithm has exponential complexity O(2^(n+m)).

Expected Result

predictable performance. algorithm should have O(n*m) complexity as is possible.

c.f. https://www.techiedelight.com/longest-common-subsequence/

--> translated to SCALA:

def LCSLength(X: String, Y: String): Int = {
val m = X.length
val n = Y.length
// lookup table stores solution to already computed subproblems,
// i.e., T[i][j] stores the length of LCS of substring
// X[0…i-1] and Y[0…j-1]
val T = Array.ofDim[Int](m + 1, n + 1)
// fill the lookup table in a bottom-up manner
for (i <- 1 to m) {
for (j <- 1 to n) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// if the current character of X and Y matches
T(i)(j) = T(i - 1)(j - 1) + 1
}
else {
// otherwise, if the current character of X and Y don't match
T(i)(j) = Integer.max(T(i - 1)(j), T(i)(j - 1))
}
}
}
// LCS will be the last entry in the lookup table
T(m)(n)
}

Actual Result

"unpredictable" runtime

Reproduction Steps

Compare "RUE NELSON MANDELA" with "RUE ALBERT UNGEHEUER" -> it just takes way to long to find out that these are not the same.

Update documentation for package: com.github.vickumar1981.stringdistance.interfaces

Documentation can be viewed here: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/interfaces/index.html

Increase tests for SoundEx algorithm

Current test coverage is ~86.2% for the SoundEx algorithm.

This can be increased by addressing a few edge cases:

Test when length is 0: https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/interfaces/sound/SoundexAlgo.scala#L8
Test lines 32, 36, & 37 in implementation:

The missing lines are shown in the scoverage report.

https://coveralls.io/builds/34087846/source?filename=src/main/scala/com/github/vickumar1981/stringdistance/interfaces/sound/SoundexAlgo.scala

Cross build for java 7 and scala 2.11

Currently, the jar produced is for java 8 and scala 2.12 b/c that's what I'm using.

Should be able to add build flags to compile for java 7 and scala 2.11 as well.

Need to add more tests/testing around string sequencing algorithms

The following are the sequencing algorithms which are implemented:

Needleman-Wunsh: https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm

Smith-Waterman & Smith-Waterman-Gotoh:
https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

Need to add more tests/examples to make sure these implementations are working as expected.

https://github.com/vickumar1981/stringdistance/blob/master/src/test/scala/TestStringDistance.scala#L132

Scala issue: " object vickumar1981 is not a member of package com.github"

Summary.

import com.github.vickumar1981.stringdistance.StringDistance._
should not give errors

What you expected.

no error messages

What happened instead.

Scala (through Spark DataBricks) complains during importing libraries:
object vickumar1981 is not a member of package com.github

import com.github.vickumar1981.stringdistance.StringDistance._

403 Error for package-list of com.nimbusds javadocs

https://javadoc.io/doc/com.nimbusds/nimbus-jose-jwt/latest/index.html
Access to package-list ist denied. Could you so anything here?

Add a CHANGELOG.md

Add a CHANGELOG.md file that tracks changes between revisions
Investigate if there's an automated way to do this.

Add Damerau variation of Levenshtein

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

Damerau is a variation of Levenshtein where transposition is included as an operation. Normal levenshtein accounts for additions, removals, and substitutions of characters. damerau is additions, removals, substitutions, and transpositions. the upper bound on the damerau distance is the levenshtein distance, and correspondingly the lower bound on the damerau similarity score is the levenshtein similarity score.

Release for Scala 2.13

is thread-safe or not?

Hey, I just want to confirm - is the library thread-safe or not?

Thanks

Benchmarking: Update benchmarks/performance

This is very open-ended, but at a start, deciding upon a way to run benchmark tests would be a great improvement, and would allow code changes to be tested from the perspective of performance.

~~** To do **~~

~~Evaluate bench marking tools available~~
~~Decide upon bench marking approach~~
~~Implement benchmarks in test suite.~~

update tests with scalamock and scalacheck

the tests just run a few hard-coded values to makes sure the inputs and outputs match.

on top of those surface level tests, i'd like to add the scalamock and scalacheck libraries and use a set of property-based tests to make sure that algorithms work as intended. the sound/phonetic algorithms are currently not really tested b/c of this

https://scalamock.org/
https://www.scalacheck.org/

Update License.txt

Currently, the license is simply a link to the Apache License website. Need to add a License.txt file to the project with the specific project name and other variables in the template filled out.

Will also want to change the links in the README.md to point the license to the new text file.

Implicits not found while calling StringDistance

I'm creating a simple Test program to test ur Scala code, I am facing implicit issues, I tried various thing but is working, here is my sample code

import com.github.vickumar1981.stringdistance.StringDistance._

object test extends App {

val res = Cosine.score("aamir", "rimaa")
}
Its expecting implicit in scope, I simply copied from example from https://github.com/vickumar1981/stringdistance

this is the error message its giving:
Error:(8, 38) ambiguous reference to overloaded definition,
both method score in trait StringMetric of type (s1: String, s2: String)(implicit algo: com.github.vickumar1981.stringdistance.SoundScoringAlgorithm[com.github.vickumar1981.stringdistance.CosineAlgorithm])Boolean
and method score in trait StringMetric of type (s1: String, s2: String)(implicit algo: com.github.vickumar1981.stringdistance.ScoringAlgorithm[com.github.vickumar1981.stringdistance.CosineAlgorithm])Double
match argument types (String,String)
val cosSimilarity: Double = Cosine.score("hello", "chello")() // 0.935

there you haven't mentioned any implicit usage or importing any implicit.
Please look into this.