lucene-grep

Grep-like utility based on Lucene Monitor compiled with GraalVM native-image.

Features

  • Supports Lucene query syntax as described here
  • Multiple queries can be provided
  • Queries can be loaded from a file
  • Supports Lucene text analysis configuration for:
    • char filters
    • tokenizers
    • token filters
    • stemmers for multiple languages
    • predefined analyzers
  • Supports multiple query parsers (classic, complex phrase, standard, simple, and surround)
  • Text output is colored or separated with customizable tags
  • Supports printing file names as hyperlinks for click-to-open (check support for your terminal here)
  • Text output supports templates
  • Scoring mode (disables highlighting for now)
  • Output can be formatted as JSON or EDN
  • Supports text input from STDIN
  • Supports filtering files with GLOB file pattern
  • Supports excluding files from processing with GLOB
  • Compiled with GraalVM native-image tool
  • Supports Linux, MacOS, and Windows
  • Fast startup, which makes it usable as a CLI utility

Startup and memory as measured with the time utility on my Linux laptop: [Startup time and memory usage chart]

The default output format is: [FILE_PATH]:[LINE_NUMBER]:[LINE_WITH_A_COLORED_HIGHLIGHT]

NOTE: lmgrep is not compatible with grep. Compared with grep, its functionality is limited in most aspects.

Quickstart

Brew

MacOS and Linux binaries are provided via brew.

Install:

brew install dainiusjocas/brew/lmgrep

Upgrade:

brew upgrade lmgrep

Docker

lmgrep images are published to Docker Hub:

echo "Lucene is awesome" | docker run -i dainiusjocas/lmgrep /lmgrep lucene

Windows

On Windows you can install using scoop and the scoop-clojure bucket.

Or just follow these concrete steps:

# Note: if you get an error you might need to change the execution policy (i.e. allow running PowerShell scripts) with
# Set-ExecutionPolicy RemoteSigned -scope CurrentUser
Invoke-Expression (New-Object System.Net.WebClient).DownloadString('https://get.scoop.sh')

scoop bucket add scoop-clojure https://github.com/littleli/scoop-clojure
scoop bucket add extras
scoop install lmgrep

Other platforms

Just grab a binary from GitHub releases, extract it, and place it anywhere on the $PATH.

In case you're running MacOS, remove the quarantine attribute so that the binary is allowed to run:

sudo xattr -r -d com.apple.quarantine lmgrep

Then run it:

echo "Lucene is awesome" | ./lmgrep "Lucene"

Examples

An example of lmgrep in action:

./lmgrep "main" "*.{clj,edn}"
=>
./src/core.clj:44:(defn -main [& args]
./deps.edn:22:   :main-opts   ["-m" "cognitect.test-runner"]}
./deps.edn:24:  {:main-opts  ["-m" "clj-kondo.main --lint src test"]
./deps.edn:28:  {:main-opts  ["-m clj.native-image core"

The default output is somewhat similar to grep's, for example:

grep -n -R --include=\*.{edn,clj} "main" ./
=>
./deps.edn:22:   :main-opts   ["-m" "cognitect.test-runner"]}
./deps.edn:24:  {:main-opts  ["-m" "clj-kondo.main --lint src test"]
./deps.edn:26:   :jvm-opts   ["-Dclojure.main.report=stderr"]}
./deps.edn:28:  {:main-opts  ["-m clj.native-image core"

Supports input from STDIN:

cat README.md | ./lmgrep "monitor lucene"

TIP: write your Lucene query within double quotes.

An example of various options with a GLOB file pattern:

./lmgrep --case-sensitive\?=false --ascii-fold\?=true --stem\?=true --tokenizer=whitespace "lucene" "**/*.md"

TIP: write GLOB file patterns within double quotes.

Files can also be excluded with a GLOB pattern:

./lmgrep "lucene" "**/*.md" --excludes="README.md"

TIP: a GLOB pattern is treated as recursive if it contains "**", otherwise the GLOB is matched only against the file name.
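
For instance, a sketch illustrating the tip above:

./lmgrep "lucene" "**/*.md"   # recursive, because the pattern contains "**"
./lmgrep "lucene" "*.md"      # matched only against the file name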

Provide multiple queries:

echo "Lucene is\n awesome" |  lmgrep --query=lucene --query=awesome
=>
*STDIN*:1:Lucene is
*STDIN*:2: awesome

Provide Lucene queries in a file:

echo "The quick brown fox jumps over the lazy dog" | ./lmgrep --queries-file=test/resources/queries.json --format=json
=>
{"line-number":1,"line":"The quick brown fox jumps over the lazy dog"}

The Lucene queries file contains JSON, e.g.:

[
  {
    "query": "fox"
  },
  {
    "query": "dog"
  }
]

NOTE: when Lucene queries are specified as a positional argument, with the -q or --query params, or with --queries-file, all the queries are concatenated into one list.
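
For example, a query passed with -q and the queries from the file are matched together. A sketch, reusing the queries.json with "fox" and "dog" from above; both "brown" and "fox" should be highlighted in the single output line:

echo "The quick brown fox" | ./lmgrep -q brown --queries-file=test/resources/queries.json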

Deviations from Lucene query syntax

  • Field names are not supported, because a plain line of text has no fields.

Supported options

Usage: lmgrep [OPTIONS] LUCENE_QUERY [FILES]
Supported options:
  -q, --query QUERY                                        Lucene query string(s). If specified then all the positional arguments are interpreted as files.
      --query-parser QUERY_PARSER                          Which query parser to use, one of: [classic complex-phrase simple standard surround]
      --queries-file QUERIES_FILE                          A file path to the Lucene query strings with their config. If specified then all the positional arguments are interpreted as files.
      --queries-index-dir QUERIES_INDEX_DIR                A directory where Lucene Monitor queries are stored.
      --tokenizer TOKENIZER                                Tokenizer to use, one of: [keyword letter standard unicode-whitespace whitespace]
      --case-sensitive? CASE_SENSITIVE                     If text should be case sensitive
      --ascii-fold? ASCII_FOLDED                           If text should be ascii folded
      --stem? STEMMED                                      If text should be stemmed
      --stemmer STEMMER                                    Which stemmer to use for token stemming, one of: [arabic armenian basque catalan danish dutch english estonian finnish french german german2 hungarian irish italian kp lithuanian lovins norwegian porter portuguese romanian russian spanish swedish turkish]
      --presearcher PRESEARCHER              no-filtering  Which Lucene Monitor Presearcher to use, one of: [multipass-term-filtered no-filtering term-filtered]
      --with-score                                         If the matching score should be computed
      --format FORMAT                                      How the output should be formatted, one of: [edn json string]
      --template TEMPLATE                                  The template for the output string, e.g.: file={{file}} line-number={{line-number}} line={{line}}
      --pre-tags PRE_TAGS                                  A string that the highlighted text is wrapped in, use in conjunction with --post-tags
      --post-tags POST_TAGS                                A string that the highlighted text is wrapped in, use in conjunction with --pre-tags
      --excludes EXCLUDES                                  A GLOB that filters out files that were matched with a GLOB
      --skip-binary-files                                  If a file that is detected to be binary should be skipped. Available for Linux and MacOS only.
      --[no-]hidden                                        Search in hidden files. Default: true.
      --max-depth N                                        In case of a recursive GLOB, how deep to search for input files.
      --with-empty-lines                                   When provided on the input that does not match write an empty line to STDOUT.
      --with-scored-highlights                             ALPHA: Instructs to highlight with scoring.
      --[no-]split                                         If a file (or STDIN) should be split by newline.
      --hyperlink                                          If a file should be printed as hyperlinks.
      --with-details                                       For JSON and EDN output adds raw highlights list.
      --word-delimiter-graph-filter WDGF                   WordDelimiterGraphFilter configurationFlags as per https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html
      --show-analysis-components                           Just print-out the available analysis components in JSON.
      --only-analyze                                       When provided output will be analyzed text.
      --explain                                            Modifies --only-analyze. Output is detailed token info, similar to Elasticsearch Analyze API.
      --graph                                              Modifies --only-analyze. Output is a string that can be fed to the `dot` program.
      --analysis ANALYSIS                    {}            The analysis chain configuration
      --query-parser-conf CONF                             The configuration for the query parser.
      --concurrency CONCURRENCY              8             How many concurrent threads to use for processing.
      --queue-size SIZE                      1024          Number of lines read before being processed
      --reader-buffer-size BUFFER_SIZE                     Buffer size of the BufferedReader in bytes.
      --writer-buffer-size BUFFER_SIZE                     Buffer size of the BufferedWriter in bytes.
      --[no-]preserve-order                                If the input order should be preserved.
      --config-dir DIR                                     A base directory from which to load text analysis resources, e.g. synonym files. Default: current dir.
      --analyzers-file FILE                                A file that contains definitions of text analyzers. Works in combinations with --config-dir flag.
      --query-update-buffer-size NUMBER                    Number of queries to be buffered in memory before being committed to the queryindex. Default 100000.
      --streamed                                           Listens on STDIN for json with both query and a piece of text to be analyzed
  -h, --help

NOTE: question marks in the zsh shell must be escaped, e.g. --case-sensitive\?=true, or the whole flag must be put within double quotes, e.g. "--case-sensitive?=true"

Text Analysis

The text analysis pipeline can be declaratively specified with the --analysis flag, e.g.:

echo "<p>foo bars baz</p>" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "char-filters": [
      {"name": "htmlStrip"},
      {
        "name": "patternReplace",
         "args": {
           "pattern": "foo",
           "replacement": "bar"
        }
      }
    ],
    "tokenizer": {"name": "standard"},
    "token-filters": [
      {"name": "englishMinimalStem"},
      {"name": "uppercase"}
    ]
  }
  '
=>
["BAR","BAR","BAZ"]

The processing inside lmgrep is as follows:

  • char filters are applied in order:
    • htmlStrip is applied, which removes <p> and </p> from the string (i.e. foo bars baz)
    • patternReplace is applied, which replaces foo with bar (i.e. bar bars baz)
  • tokenization is performed (i.e. [bar bars baz])
  • token filters are applied in order:
    • englishMinimalStem which stems the tokens (i.e. [bar bar baz])
    • uppercase is applied (i.e. [BAR BAR BAZ])
  • The resulting list of tokens is printed to STDOUT.

You can peel off the analysis config layer by layer and inspect the intermediate results.
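
For example, dropping the patternReplace char filter and all token filters shows the tokens right after tokenization (a sketch based on the pipeline above):

echo "<p>foo bars baz</p>" | \
  ./lmgrep \
  --only-analyze \
  --analysis='{"char-filters": [{"name": "htmlStrip"}], "tokenizer": {"name": "standard"}}'
=>
["foo","bars","baz"]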

For the full list of supported analysis components see the documentation.

Default text analysis

If analysis is not specified, then the default analysis pipeline is used, which looks like this:

--analysis='
{
  "tokenizer": {
    "name": "standard"
  },
  "token-filters": [
    {
      "name": "lowercase"
    },
    {
      "name": "asciifolding"
    },
    {
      "name": "englishMinimalStem"
    }
  ]
}
'

Predefined analyzers

echo "dogs and cats" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "English"}}'
=>
["dog","cat"]

Note that stopwords were removed and stemming was applied.

The full list of predefined analyzers can be found here.

Tips on Analysis Configuration

The analysis configuration must be valid JSON, and for your use case it might make sense to store it in a file.

Store analysis in a file:

echo '{"analyzer": {"name": "English"}}' > analysis-conf.json

Run the text analysis:

echo "dogs and cats" | ./lmgrep --only-analyze --analysis="$(cat analysis-conf.json)"

If your JSON spans multiple lines, ask jq for a little help:

echo "dogs and cats" | ./lmgrep --only-analyze --analysis=$(jq -c . analysis-conf.json)

What about resources for analyzers?

Some token filters require a file as an argument, e.g. StopFilterFactory requires a words file. By default, Lucene would load the file specified under words from the classpath. However, lmgrep is a single binary, where the notion of a classpath makes little sense. To support analysis components that expect files, the Lucene class that loads files was patched to load arbitrary files.

E.g. create a stopwords file:

echo "foo\nbar" > my-stopwords.txt

Run the analysis:

echo "foo bar baz" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "token-filters": [
      {
        "name": "stop",
        "args": {
          "words": "my-stopwords.txt"
        }
      }
    ]
  }
  '
=>
["baz"]

Note that the custom stopwords were removed.

Creating files in arbitrary places might be OK for one-off scripts, but it creates some mess. Therefore, consider creating a dedicated folder for your analysis component resources, such as $HOME/.lmgrep.

export LMGREP_HOME=$HOME/.lmgrep
mkdir -p "$LMGREP_HOME"
echo "foo\nbar" > $LMGREP_HOME/my-stopwords.txt
echo "foo bar baz" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "token-filters": [
      {
        "name": "stop",
        "args": {
          "words": "'$LMGREP_HOME'/my-stopwords.txt"
        }
      }
    ]
  }
  '
=>
["baz"]

Analysis in the queries file

Every query in the queries file can provide its own configuration, e.g.:

[
  {
    "id": "0",
    "query": "dogs",
    "analysis": {
      "analyzer": {
        "name": "English"
      }
    }
  },
  {
    "id": "1",
    "query": "dogs",
    "analysis": {
      "tokenizer": {
        "name": "whitespace"
      }
    }
  }
]

For each unique analysis configuration, a pair of a Lucene Analyzer and an internal field name is created. Then, for every text input, Lucene Monitor runs all the queries against their respective fields, each with its own analyzer.
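
For example, assuming the file above is saved as per-query-analysis.json (a hypothetical name), only query "0" should match the input below: the English analyzer stems both "dogs" and "dog" to the same token, while the whitespace-tokenized query "1" requires the literal token "dogs":

echo "dog" | ./lmgrep --queries-file=per-query-analysis.json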

WordDelimiterGraphFilter

Using the WordDelimiterGraphFilter might help to tokenize text in various ways, e.g.:

echo "test class" | ./lmgrep "TestClass" --word-delimiter-graph-filter=99
=>
*STDIN*:1:test class
echo "TestClass" | ./lmgrep "test class" --word-delimiter-graph-filter=99
=>
*STDIN*:1:TestClass

The number 99 is a sum of options as described here.
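
For reference, 99 decomposes into the following flag constants (values as documented for Lucene's WordDelimiterGraphFilter):

# 99 = 1  (GENERATE_WORD_PARTS)
#    + 2  (GENERATE_NUMBER_PARTS)
#    + 32 (PRESERVE_ORIGINAL)
#    + 64 (SPLIT_ON_CASE_CHANGE)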

Phrase Matching with Slop

To match a phrase you need to put it in double quotes:

echo "GraalVM is awesome" | ./lmgrep "\"graalvm is\""
=>
*STDIN*:1:GraalVM is awesome

By default, when phrase terms are not exactly one after another there is no match, e.g.:

echo "GraalVM is awesome" | ./lmgrep "\"graalvm awesome\""
=>

We can provide a slop parameter, e.g. ~2, to allow a number of positional "edits" between the phrase terms and the document text, e.g.:

echo "GraalVM is awesome" | ./lmgrep "\"graalvm awesome\"~2"
=>
*STDIN*:1:GraalVM is awesome

As a side effect, when the slop is big enough terms can match out of order, e.g.:

echo "GraalVM is awesome" | ./lmgrep "\"awesome graalvm\"~3"
=>
*STDIN*:1:GraalVM is awesome

However, if order is important, there is no way to enforce it with the Lucene query syntax.

Lucene query parsers

Currently, 5 Lucene query parsers are supported: classic, complex-phrase, simple, standard, and surround.
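
For example, the complex-phrase parser supports wildcards inside phrases, which the classic parser does not. A sketch, assuming the parser behaves as Lucene's ComplexPhraseQueryParser; this should match:

echo "GraalVM is awesome" | ./lmgrep --query-parser=complex-phrase '"graal* awesome"~2'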

Query parser configuration

Additional configuration to query parsers can be passed with the --query-parser-conf flag, e.g.:

./lmgrep "query" --query-parser-conf='{"allow-leading-wildcard": false}'

The value must be a JSON string. For the supported configuration values consult the documentation.

Development

Requirements:

  • Clojure CLI
  • Babashka
  • Maven
  • GraalVM with the native-image tool installed and on $PATH
  • GNU Make
  • Docker (just for rebuilding the Linux native image).

Build executable for your platform:

make build

It creates an executable binary named lmgrep in the root directory of the repository.

Run the tests:

make test

Lint the code with clj-kondo:

bb lint

Print results with a custom format

./lmgrep --template="FILE={{file}} LINE_NR={{line-number}} LINE={{highlighted-line}}" "test" "**.md"
Template Variable Notes
{{file}} File name
{{line-number}} Line number where the text matched the query
{{highlighted-line}} Line that matched the query with highlighters applied
{{line}} Line that matched the query
{{score}} Score of the match (summed)

When {{highlighted-line}} is used then --pre-tags and --post-tags options are available, e.g.:

echo "some text to to match" | lmgrep "text" --pre-tags="<em>" --post-tags="</em>" --template="{{highlighted-line}}"
=>
some <em>text</em> to to match
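
A template can also combine several variables, e.g. (a sketch; the exact score value will vary):

echo "Lucene is awesome" | lmgrep "lucene" --with-score --template="score={{score}} line={{line}}"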

Scoring

The main thing to understand is that every line is scored separately, in the context of that one line as the whole corpus.

Another consideration is that the scores of all matches on a line are summed up. E.g. the query "one two" is rewritten by Lucene into two term queries, and their scores are added together.

Each individual score is computed with BM25, which is the default similarity in Lucene.
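
A minimal sketch of the summing behaviour (the exact score depends on the line):

echo "one two" | ./lmgrep "one two" --with-score --format=json

The reported score should be the sum of the scores of the two term queries.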

--only-analyze

Great for debugging.

The output is a list of tokens after analyzing the text, e.g.:

echo "Dogs and CAt" | ./lmgrep --only-analyze     
# => ["dog","and","cat"]

In combination with the --explain flag, it outputs detailed token info, similar to the Elasticsearch Analyze API, e.g.:

echo "Dogs and CAt" | ./lmgrep --only-analyze --explain | jq
# => [
  {
    "token": "dog",
    "position": 0,
    "positionLength": 1,
    "type": "<ALPHANUM>",
    "end_offset": 4,
    "start_offset": 0
  },
  {
    "end_offset": 8,
    "positionLength": 1,
    "position": 1,
    "start_offset": 5,
    "type": "<ALPHANUM>",
    "token": "and"
  },
  {
    "position": 2,
    "token": "cat",
    "positionLength": 1,
    "end_offset": 12,
    "type": "<ALPHANUM>",
    "start_offset": 9
  }
]

To draw a token graph you can use the --graph flag, e.g.:

echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph
# =>
digraph tokens {
  graph [ fontsize=30 labelloc="t" label="" splines=true overlap=false rankdir = "LR" ];
  // A2 paper size
  size = "34.4,16.5";
  edge [ fontname="Helvetica" fontcolor="red" color="#606060" ]
  node [ style="filled" fillcolor="#e8e8f0" shape="Mrecord" fontname="Helvetica" ]

  0 [label="0"]
  -1 [shape=point color=white]
  -1 -> 0 []
  0 -> 2 [ label="foobar / FooBar"]
  0 -> 1 [ label="foo / Foo"]
  1 [label="1"]
  1 -> 2 [ label="bar / Bar"]
  2 [label="2"]
  2 -> 3 [ label="baz / Baz"]
  -2 [shape=point color=white]
  3 -> -2 []
}

The --graph flag turns the text analysis output into a valid GraphViz program that can be fed to dot, which draws a picture out of the text. Magic.

If you have GraphViz installed on your machine, here's a one-liner to save the image of the text graph:

echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph | dot -Tpng -o token-graph.png

The output image should look like this: [Token Graph]

If you also have ImageMagick installed, you can preview the token graph with this one-liner on Ubuntu:

echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph | dot -Tpng | display

Or on MacOS:

echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph | dot -Tpng | open -a Preview.app -f

Streamed matching

Start the lmgrep process once and let it wait for input from STDIN that includes both the text and the query. This technique avoids "cold start" issues in streaming setups where the query becomes known only together with the text.

Example:

echo '{"query": "nike~", "text": "I am selling nikee"}' | ./lmgrep --streamed --with-score --format=json --query-parser=simple
#=> {"line-number":1,"line":"I am selling nikee","score":0.09807344}

This is equivalent to:

echo  "I am selling nikee" | ./lmgrep --query="nike~" --with-score --format=json --query-parser=simple
#=> {"line-number":1,"line":"I am selling nikee","score":0.09807344}

All other options are also applicable.
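
Because the process stays alive, many query/text pairs can be piped through one session, e.g. (a sketch):

printf '%s\n' \
  '{"query": "foo", "text": "foo bar"}' \
  '{"query": "baz", "text": "baz quux"}' \
  | ./lmgrep --streamed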

Custom Builds

Raudikko or Voikko stemming for Finnish Language

NOTE: The project was re-architected so that the Raudikko token filter definition lives in a subdirectory and is enabled with a deps.edn alias. However clever this change is, the uberjar builder has a hard time with it. The current solution is to modify the deps.edn file so that the Raudikko dependency is put under the top-level :deps. Tools.build also has a hard time building an uberjar.

(export LMGREP_FEATURE_RAUDIKKO=true && bb generate-reflection-config && make build)

Environment variables

Check the docs.

Future work

License

Copyright © 2022 Dainius Jocas.

Distributed under The Apache License, Version 2.0.


lucene-grep's Issues

Text split pattern

Splitting is currently hardcoded to a single newline; maybe it is worth providing a custom pattern, like two newlines?

Custom separators

Instead of coloring the highlight, add HTML tags or any custom character sequence.

Tokenizer should split on period

The standard tokenizer doesn't split on period '.' and this creates unexpected results.

The WordDelimiterGraphFilter could do the job.

Text analysis details flag

Similar to Elasticsearch

DELETE index_example

PUT index_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_graph_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_graph_filter": {
          "type": "word_delimiter_graph",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": true,
          "split_on_numerics": false,
          "stem_english_possessive": true,
          "preserve_original": true
        }
      }
    }
  }
}

GET index_example/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_custom_word_delimiter_graph_filter"], 
  "text": ["pre BestClass post"]
}
{
  "tokens" : [
    {
      "token" : "pre",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "BestClass",
      "start_offset" : 4,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "Best",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "Class",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "post",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

Errors when matching directory

The new multi-files system may pass in directories, like lmgrep 'x' * (using bash expansion, rather than lmgrep's own). This results in an error that hangs:

/tmp/foo
❯ tree -p
.
├── [drwxr-xr-x]  bar
├── [drwxr-xr-x]  baz
└── [-rw-r--r--]  foo

2 directories, 1 file

/tmp/foo
❯ grep 'x' *  
grep: bar: Is a directory
grep: baz: Is a directory

/tmp/foo
❯ lmgrep 'x' *
Exception in thread "main" java.io.FileNotFoundException: baz (Is a directory)
	at com.oracle.svm.jni.JNIJavaCallWrappers.jniInvoke_VA_LIST:Ljava_io_FileNotFoundException_2_0002e_0003cinit_0003e_00028Ljava_lang_String_2Ljava_lang_String_2_00029V(JNIJavaCallWrappers.java:0)
	at java.io.FileInputStream.open0(FileInputStream.java)
	at java.io.FileInputStream.open(FileInputStream.java:219)
	at java.io.FileInputStream.<init>(FileInputStream.java:157)
	at clojure.java.io$fn__11522.invokeStatic(io.clj:229)
	at clojure.java.io$fn__11522.invoke(io.clj:229)
	at clojure.java.io$fn__11435$G__11428__11442.invoke(io.clj:69)
	at clojure.java.io$fn__11534.invokeStatic(io.clj:258)
	at clojure.java.io$fn__11534.invoke(io.clj:254)
	at clojure.java.io$fn__11435$G__11428__11442.invoke(io.clj:69)
	at clojure.java.io$fn__11496.invokeStatic(io.clj:165)
	at clojure.java.io$fn__11496.invoke(io.clj:165)
	at clojure.java.io$fn__11448$G__11424__11455.invoke(io.clj:69)
	at clojure.java.io$reader.invokeStatic(io.clj:102)
	at clojure.java.io$reader.doInvoke(io.clj:86)
	at clojure.lang.RestFn.invoke(RestFn.java:410)
	at lmgrep.grep$grep.invokeStatic(grep.clj:100)
	at lmgrep.core$_main.invokeStatic(core.clj:21)
	at lmgrep.core$_main.doInvoke(core.clj:12)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at lmgrep.core.main(Unknown Source)
^C

Flag to skip binary files

On Linux, shell out to file -ib FILE_NAME and skip everything that has charset=binary.
MacOS should support something like file.

What to do on Windows?

Output the score of the match

Depends on the formatting:

[SCORE]:[FILE_PATH]:[LINE_NUMBER]:[LINE_WITH_A_COLORED_HIGHLIGHT]

Or EDN per line:

 :file, :line, :column, the line :text (optionally) and :score

Search whole file

Maybe related to #3?

I'm using this to perform topical searches in a wiki folder of markdown files (show me articles where code review and clojure are both mentioned).

Here's a dumb example of something that doesn't work that I'd like to have work:

❯ echo 'foo qux\nbar' | lmgrep 'foo && bar'

In this case, I'm less concerned about having line numbers than I am about just knowing which files matched.

Examples in README

  • Tailing logs
  • JSON logs data
  • Grep static site post files
  • Kafka data via the Kafka console consumer

Hyperlink to line number

Hey, just got around to trying out the hyperlink feature. Works nicely; it's great to be able to jump straight to the file. One little bonus would be to emit <file>#{{line-number}}, as it allows opening the file directly on a particular line number (that of the match).

[Bug] Exception when misusing hyperlink

First of all, let me thank you for this tool.
I just discovered it a few days ago, and it is really helpful.

The execution of this command:

echo "this will generate an exception" | lmgrep "exception" --hyperlink

Generates an exception:

Exception in thread "async-thread-macro-1" java.lang.NullPointerException
	at lmgrep.formatter$file_string.invokeStatic(formatter.clj:39)
	at lmgrep.formatter$string_output.invokeStatic(formatter.clj:59)
	at lmgrep.grep$matcher_fn$fn__11355.invoke(grep.clj:54)
	at clojure.core$map$fn__5880$fn__5881.invoke(core.clj:2746)
	at clojure.core.async.impl.channels$chan$fn__3302.invoke(channels.clj:300)
	at clojure.core.async.impl.channels.ManyToManyChannel.put_BANG_(channels.clj:143)
	at clojure.core.async$fn__7869.invokeStatic(async.clj:172)
	at clojure.core.async$pipeline_STAR_$process__8053.invoke(async.clj:531)
	at clojure.core.async$pipeline_STAR_$fn__8183.invoke(async.clj:549)
	at clojure.core.async$thread_call$fn__7976.invoke(async.clj:484)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(Thread.java:829)
	at com.oracle.svm.core.thread.JavaThreads.threadStartRoutine(JavaThreads.java:553)
	at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:192)

Thank you.

Version: v2021.04.26
Package: lmgrep-v2021.04.26-linux-amd64.zip

Investigate "lovinssnowballstem" "concatenategraph" token filters

These two fail when compiled with GraalVM.

./lmgrep test/resources/test.txt --only-analyze --analysis="$(cat test/resources/binary/tokenfilters/lovinssnowballstem.json)"   
java.lang.invoke.BoundMethodHandle$Species_LL cannot be cast to java.lang.invoke.SimpleMethodHandle

oracle/graal#3341

org.apache.lucene.analysis.tokenattributes.BytesTermAttributeImpl cannot be cast to org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter$BytesRefBuilderTermAttribute

This one also fails in the REPL:

(cast org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter$BytesRefBuilderTermAttribute
      (org.apache.lucene.analysis.tokenattributes.BytesTermAttributeImpl.))
Execution error (ClassCastException) at java.lang.Class/cast (Class.java:3818).
Cannot cast org.apache.lucene.analysis.tokenattributes.BytesTermAttributeImpl to org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter$BytesRefBuilderTermAttribute

[Q] Web frontend?

I wonder if you know any projects that have a ready-made web frontend to do full-text search on some files on disk. I want this to be able to search my notes on mobile.

Support multiple files

Globbing seems to be "built-in", but my terminal already does globbing. Would be useful to support:

lmgrep foo **/*

Instead of requiring

lmgrep foo '**/*'

to fall back onto the custom globbing. This way I can do funky stuff like utilize extglob or fd/find

fd laser | xargs lmgrep 'laser'

GraalVM 21 native-image compilation

When compiled without the Lucene Quarkus extension:

  • Compilations succeeds
  • Exception at runtime
✗ make build
clojure -M:native
Compiling lmgrep.cli
Compiling lmgrep.core
Compiling lmgrep.fs
Compiling lmgrep.grep
Compiling lmgrep.lucene
[lmgrep:769177]    classlist:   3,196.36 ms,  1.18 GB
[lmgrep:769177]        (cap):     515.88 ms,  1.18 GB
[lmgrep:769177]        setup:   1,928.48 ms,  1.18 GB
[lmgrep:769177]     (clinit):     623.88 ms,  2.72 GB
[lmgrep:769177]   (typeflow):  17,160.38 ms,  2.72 GB
[lmgrep:769177]    (objects):  12,770.39 ms,  2.72 GB
[lmgrep:769177]   (features):   1,214.95 ms,  2.72 GB
[lmgrep:769177]     analysis:  32,817.79 ms,  2.72 GB
[lmgrep:769177]     universe:   2,403.92 ms,  2.72 GB
[lmgrep:769177]      (parse):   5,801.50 ms,  2.72 GB
[lmgrep:769177]     (inline):  10,977.66 ms,  3.26 GB
[lmgrep:769177]    (compile):  50,000.95 ms,  4.44 GB
[lmgrep:769177]      compile:  69,816.24 ms,  4.44 GB
[lmgrep:769177]        image:   4,027.30 ms,  4.44 GB
[lmgrep:769177]        write:     719.09 ms,  4.44 GB
[lmgrep:769177]      [total]: 115,119.38 ms,  4.44 GB
➜  lucene-grep git:(main) ✗ ./lmgrep "test" README.md 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" java.lang.ClassCastException: java.lang.invoke.BoundMethodHandle$Species_L cannot be cast to java.lang.invoke.SimpleMethodHandle
        at com.oracle.svm.methodhandles.MethodHandleIntrinsic.execute(MethodHandleIntrinsic.java:254)
        at java.lang.invoke.MethodHandle.invokeBasic(MethodHandle.java:80)
        at java.lang.invoke.LambdaForm$NamedFunction.invokeWithArguments(LambdaForm.java:74)
        at java.lang.invoke.LambdaForm.interpretName(LambdaForm.java:981)
        at java.lang.invoke.LambdaForm.interpretWithArguments(LambdaForm.java:958)
        at java.lang.invoke.MethodHandle.invokeBasic(MethodHandle.java:171)
        at java.lang.invoke.MethodHandle.invokeBasic(MethodHandle.java:0)
        at java.lang.invoke.Invokers$Holder.invokeExact_MT(Invokers$Holder)
        at org.apache.lucene.util.AttributeFactory$1.createInstance(AttributeFactory.java:148)
        at org.apache.lucene.util.AttributeFactory$StaticImplementationAttributeFactory.createAttributeInstance(AttributeFactory.java:111)
        at org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:213)
        at org.apache.lucene.analysis.standard.StandardTokenizer.<init>(StandardTokenizer.java:132)
        at beagle.text_analysis$tokenizer.invokeStatic(text_analysis.clj:51)
        at beagle.text_analysis$analyzer_constructor$fn__1377.invoke(text_analysis.clj:67)
        at beagle.text_analysis.proxy$org.apache.lucene.analysis.Analyzer$ff19274a.createComponents(Unknown Source)
        at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:136)
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199)
        at org.apache.lucene.document.Field.tokenStream(Field.java:513)
        at org.apache.lucene.index.memory.MemoryIndex.addField(MemoryIndex.java:380)
        at org.apache.lucene.monitor.DocumentBatch$SingletonDocumentBatch.<init>(DocumentBatch.java:112)
        at org.apache.lucene.monitor.DocumentBatch$SingletonDocumentBatch.<init>(DocumentBatch.java:105)
        at org.apache.lucene.monitor.DocumentBatch.of(DocumentBatch.java:60)
        at org.apache.lucene.monitor.Monitor.match(Monitor.java:256)
        at org.apache.lucene.monitor.Monitor.match(Monitor.java:276)
        at lmgrep.lucene$match_text.invokeStatic(lucene.clj:34)
        at lmgrep.lucene$match_monitor.invokeStatic(lucene.clj:78)
        at lmgrep.lucene$highlighter$fn__1730.invoke(lucene.clj:89)
        at lmgrep.grep$match_lines.invokeStatic(grep.clj:75)
        at lmgrep.grep$grep$fn__1671.invoke(grep.clj:97)
        at lmgrep.grep$grep.invokeStatic(grep.clj:97)
        at lmgrep.core$_main.invokeStatic(core.clj:21)
        at lmgrep.core$_main.doInvoke(core.clj:12)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at lmgrep.core.main(Unknown Source)

Emit hyperlinks

There are VTE escape codes for hyperlinks. These allow you to mark out files & their line numbers.

ls supports this with the --hyperlink=auto flag. Kitty has an example script that wraps ripgrep to enable hyperlink support: https://sw.kovidgoyal.net/kitty/kittens/hyperlinked_grep.html?highlight=hyperlink

This uses OSC 8, https://iterm2.com/documentation-escape-codes.html https://github.com/mintty/mintty/wiki/CtrlSeqs#hyperlinks

This would allow clicking matches to open them straight in whatever is configured. I'm searching markdown files, so it would use vim :)

Extending analyzers with eg. Finnish stemmer Voikko

Hi.

As per https://news.ycombinator.com/item?id=26931774 my note about Finnish stemming in lucene-grep.

Let me first say that Finnish support in a tool like this would probably be of value to only that small percentage of the Finnish-speaking population that would use this tool. Which probably means only me :) So consider this issue as information on my very particular use case and not a request for implementation.

That said, I did test using English stemmer as well. I seem to recall that Lucene can do men -> man and mice -> mouse with some analyser combinations. I don't have Lucene nor Solr/Elasticsearch at the moment to test with. If possible an extended English stemmer could be nice to have.

English test as in readme:

❯ echo "dogs and cats" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "English"}}'
["dog","cat"]
❯ echo "men and mice" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "English"}}'
["men","mice"]

Finnish with lucene-grep

❯ echo "kauppias" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "Finnish"}}'
["kauppias"]
❯ echo "kauppiaan" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "Finnish"}}'
["kaupia"]
❯ echo "kauppiaan" | ./lmgrep --analysis='{"analyzer": {"name": "Finnish"}}' kauppias
❯ echo "kauppiaan" | ./lmgrep --analysis='{"analyzer": {"name": "Finnish"}}' kaupia
❯ echo "kauppiaan" | ./lmgrep --analysis='{"analyzer": {"name": "Finnish"}}' kauppiaan
*STDIN*:1:kauppiaan

Finnish stemmer Voikko analyser results are below. I don't have Voikko configured with Lucene/Solr/Elasticsearch at the moment, so here I'm using Python with libvoikko. Additional platforms, like Solr and Elasticsearch, can be found here: https://github.com/voikko/corevoikko/wiki

Voikko("fi").analyze("kauppiaan")
[{'BASEFORM': 'kauppias',
  'CLASS': 'nimisana',
  'FSTOUTPUT': '[Ln][Xp]kauppias[X]kauppiaa[Sg][Ny]n',
  'NUMBER': 'singular',
  'SIJAMUOTO': 'omanto',
  'STRUCTURE': '=ppppppppp',
  'WORDBASES': '+kauppias(kauppias)'}]

Voikko("fi").analyze("kauppias")
[{'BASEFORM': 'kauppias',
  'CLASS': 'nimisana',
  'FSTOUTPUT': '[Ln][Xp]kauppias[X]kauppia[Sn][Ny]s',
  'NUMBER': 'singular',
  'SIJAMUOTO': 'nimento',
  'STRUCTURE': '=pppppppp',
  'WORDBASES': '+kauppias(kauppias)'}]

BASEFORM is what Voikko provides for stemming. Voikko gets both correct, whereas Snowball gives different base words, of which one, kaupia, doesn't exist. Though this has been the case for years in Lucene if I remember correctly. Snowball does get another word, auton -> auto, correct and search works with base and stemmed words. For some reason I couldn't get kauppias to match to its Snowball stem with lucene-grep.

Voikko is originally C++ code. I don't know what Graal's story is with supporting libraries that aren't purely Java.

Support other output formats

Try just inverting the grep options for line number and file name mention; maybe turn off highlighting colors.

[Q] incomplete word search

One usecase where Lucene’s tokenizing approach tends to work less well than something like grep is when for some reason we want to query for a substring of a token, e.g. if the text is “I walked through the town” and I want to search for “oug”.

Does lmgrep offer a performant solution for this kind of case, or would it be a situation where it’s better to stick with regular grep?
