almondtools / stringsearchalgorithms

String matching algorithms for searching a single or multiple strings in large texts

Home Page: http://stringsearchalgorithms.amygdalum.net/

License: GNU Lesser General Public License v3.0

Java 100.00%
algorithms string-search string

stringsearchalgorithms's People

Contributors

almondtools, dependabot[bot], sebastianarnold

stringsearchalgorithms's Issues

Wildcard matching

None of the search algorithms supports wildcard matching yet.

Matching a huge set of patterns

We currently support:

  • matching one pattern in text
  • matching multiple patterns in text

None of the algorithms has yet been tested with huge sets of patterns (containing thousands or millions of patterns).

Find previous

Consider supporting the very common need of searching backwards in a string.

For example, consider adding a method findPrevious() as a sibling to StringFinder.findNext().

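StringFinder {

	// existing: searches forward for the next match
	public abstract StringMatch findNext();
	// proposed: searches backward for the previous match
	public abstract StringMatch findPrevious();
}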
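
To make the request concrete, here is a minimal sketch of a finder that can move in both directions. It is an illustration only: `NaiveFinder` is a hypothetical class, not part of the library, it returns plain start offsets instead of `StringMatch`, and it uses `String.indexOf` and `String.lastIndexOf` as stand-ins for a real search algorithm.

```java
// Hypothetical illustration only; this is not the library's StringFinder API.
final class NaiveFinder {

    private final String text;
    private final String pattern;
    // Cursor: forward searches start here, backward matches must end at or before it.
    private int position;

    NaiveFinder(String text, String pattern, int position) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        this.text = text;
        this.pattern = pattern;
        this.position = position;
    }

    /** Returns the start of the first match at or after the cursor, or -1; moves the cursor past the match. */
    int findNext() {
        int start = text.indexOf(pattern, position);
        if (start < 0) {
            return -1;
        }
        position = start + pattern.length();
        return start;
    }

    /** Returns the start of the last match ending at or before the cursor, or -1; moves the cursor to the match start. */
    int findPrevious() {
        // lastIndexOf returns -1 when the second argument is negative
        int start = text.lastIndexOf(pattern, position - pattern.length());
        if (start < 0) {
            return -1;
        }
        position = start;
        return start;
    }
}
```

With this cursor convention, calling `findNext()` directly after `findPrevious()` returns the same match again. Whether that is desirable is a design decision the real API would have to make.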

Support of regex in StringAndChars

Hello

Do you have an idea when regex support might be possible?
I'm very interested in something similar to https://code.google.com/p/esmre/ but implemented in Java.

We have a lot of regexes to run against tons of text. It is hard to parse a regex to find the relevant words that would tell us which regexes could match in a given text, but in general doing so could greatly reduce the number of regexes we have to execute.
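
The prefiltering idea described above could be sketched as follows. This is an assumption-laden illustration, not an existing feature: `RegexPrefilter` is a hypothetical class, the required literal for each regex is supplied by hand (extracting it automatically is the hard part mentioned above), and `String.contains` stands in for a multi-pattern search over all literals at once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of literal prefiltering; not part of stringsearchalgorithms.
final class RegexPrefilter {

    // Maps a literal that every match must contain to the regexes that require it.
    private final Map<String, List<Pattern>> regexesByLiteral = new LinkedHashMap<>();

    /** Registers a regex together with a hand-picked literal that any match must contain. */
    void register(String requiredLiteral, String regex) {
        regexesByLiteral
            .computeIfAbsent(requiredLiteral, k -> new ArrayList<>())
            .add(Pattern.compile(regex));
    }

    /** Returns the source strings of all registered regexes that match somewhere in the text. */
    List<String> matching(String text) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Pattern>> entry : regexesByLiteral.entrySet()) {
            // Stand-in for one multi-pattern pass (e.g. Aho-Corasick) over the text.
            if (!text.contains(entry.getKey())) {
                continue; // literal absent, so none of these regexes can match
            }
            for (Pattern pattern : entry.getValue()) {
                if (pattern.matcher(text).find()) {
                    result.add(pattern.pattern());
                }
            }
        }
        return result;
    }
}
```

Only regexes whose literal occurs in the text are executed, so the saving depends on how selective the chosen literals are.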

Invalid LONGEST_MATCH results from SetBackwardOracleMatching

I am using SetBackwardOracleMatching with MatchOption.LONGEST_MATCH to match multiple patterns on a StringCharProvider.

For example: "Die Krankheit ist seit dem Aufkommen wirksamer Antibiotika selten geworden".

  • When searching for patterns ("Antibiotik", "Antibiotika"), this configuration returns "Antibiotik", which is not the longest match that I expected.
  • AhoCorasick and WuManber return correct results, but they are too slow or memory-intensive for my applications.
  • When I reverse the pattern order, the result is correct, so maybe a solution requires some sort of longest-first sorting?

Please see the unit test provided at #8
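
The expected result can be pinned down with a brute-force reference that does not depend on pattern order. This is a sketch for comparison in tests, not the library's implementation: `LongestMatchSketch` is a hypothetical name, and it assumes that LONGEST_MATCH means the longest pattern among those matching at the same start position.

```java
import java.util.List;

// Hypothetical brute-force reference; not part of stringsearchalgorithms.
final class LongestMatchSketch {

    /** Returns the longest pattern matching at the leftmost matching position, or null if none matches. */
    static String leftmostLongest(String text, List<String> patterns) {
        for (int i = 0; i < text.length(); i++) {
            String best = null;
            for (String pattern : patterns) {
                boolean matchesHere = !pattern.isEmpty() && text.startsWith(pattern, i);
                if (matchesHere && (best == null || pattern.length() > best.length())) {
                    best = pattern; // keep the longest candidate at this position
                }
            }
            if (best != null) {
                return best;
            }
        }
        return null;
    }
}
```

For the sentence above, this reference yields "Antibiotika" for both ("Antibiotik", "Antibiotika") and the reversed order.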

More efficient Tries/DAWGs

The current trie actually is a directed acyclic word graph (DAWG), yet we have algorithms that do not depend on this:

  • Set-Horspool and Wu-Manber need simple tries (can be implemented more efficiently as double array)
  • Aho-Corasick needs a trie with support links (can also be implemented as double array, but needs an additional structure for support links)
  • (Set) Backward Oracle Matching uses a factor oracle based on a DAWG, so we will need this implementation at least here

Consequently each algorithm should use the best optimized variant instead of all depending on a DAWG (which is currently labeled trie).
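
For comparison, a plain trie without any suffix sharing can be sketched in a few lines. This is a map-based illustration under stated assumptions, not the double-array variant proposed above and not the library's current code; `SimpleTrie` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plain trie; not the library's trie/DAWG and not a double-array layout.
final class SimpleTrie {

    private final Map<Character, SimpleTrie> children = new HashMap<>();
    private String terminal; // the pattern ending at this node, or null

    /** Inserts a pattern, creating one node per character that is not yet present. */
    void insert(String pattern) {
        SimpleTrie node = this;
        for (char c : pattern.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new SimpleTrie());
        }
        node.terminal = pattern;
    }

    /** Returns the longest inserted pattern that is a prefix of text at the given offset, or null. */
    String longestPrefix(String text, int offset) {
        SimpleTrie node = this;
        String best = node.terminal;
        for (int i = offset; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break; // no pattern continues with this character
            }
            if (node.terminal != null) {
                best = node.terminal;
            }
        }
        return best;
    }
}
```

A double-array layout would replace the per-node maps with two shared integer arrays, trading construction effort for faster and more compact lookups.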
