nishihatapalmer / byteseek Goto Github PK

View Code? Open in Web Editor NEW

39.0 3.0 11.0 29.42 MB

A Java library for byte pattern matching and searching

License: BSD 3-Clause "New" or "Revised" License

Java 100.00%

search-algorithm searching-algorithms pattern-matching regular-expression

byteseek's People

Contributors

Stargazers

Watchers

Forkers

brian-o-tna jibaro lijian8 rhubner cestella m35 pinkonio andy-wagner jayanand1 prateekm edwardraff

byteseek's Issues

Prefix syntax for binary

v3 has syntax for matching binary bytes and hex bytes. The normal prefix for binary bytes is 0b. For hex it is 0x.

However, hex bytes are first class citizens in byteseek, so don't need the 0x prefix. The problem is that "0b" is a valid hex byte in its own right, so creates an ambiguous syntax. Does 0b01010101 match a single byte, or does it match a sequence of 5 hex bytes?

Because of this ambiguity, I'm proposing that the prefix for binary is "0i" instead of "0b".

0i01010101 matches a single byte with the binary value given.
0b01010101 matches a sequence of 5 byte values.

I've tried all the other letter in "binary" for an alternative to "0b", and "0i" seems to be the right one.

"n" sounds like a generic "number", not "binary": "0n01010101".
"a" doesn't sound like anything: "0a01010101"
"r" is just strange: "0r01010101"
"y" kind of works as the last letter since "x" is for hex, so "y" could be binary: 0y01010101.

I think "0i" works better than "0y", but not entirely sure. "0y" is visually more distinct than "0i", and has the virtue of being the last letter like the standard hex syntax.

MutableState has non-mutable initializations resulting in errors on deepCopy calls

The initialization of these lists uses the emptyList collection that is hard-coded to be an always empty list, and so will throw an error if anything is added to these lists.

byteseek/src/main/java/net/byteseek/automata/MutableState.java

Lines 94 to 98 in 2600c05

    
           public MutableState(final boolean isFinal) { 
        
           	this.isFinal = isFinal; 
        
           	this.transitions = Collections.emptyList(); 
        
           	this.associations = Collections.emptyList(); // = new ArrayList<T>(0); 
        
           }

In most of the code base this results in several isEmpty() checks to create a new ArrayList as necessary to add to it.

However, this does not occur in the deepCopy method, resulting in an error when I attempt to build a rather large regex (it has 10000 matchers, not yet at a minimum viable example, but I think this bug is pretty easy/clear fix).

byteseek/src/main/java/net/byteseek/automata/MutableState.java

Lines 418 to 431 in 2600c05

    
           public MutableState<T> deepCopy(Map<DeepCopy, DeepCopy> oldToNewObjects) { 
        
           	@SuppressWarnings("unchecked") 
        
           	// if there is a copy of this in the map, it will be of the same type. 
        
           	MutableState<T> stateCopy = (MutableState<T>) oldToNewObjects.get(this); 
        
           	if (stateCopy == null) { 
        
           		stateCopy = new MutableState<T>(this.isFinal); 
        
           		oldToNewObjects.put(this, stateCopy); 
        
           		for (Transition<T> transition : transitions) { 
        
           			final Transition<T> transitionCopy = transition.deepCopy(oldToNewObjects); 
        
           			stateCopy.transitions.add(transitionCopy); 
        
           		} 
        
           	} 
        
           	return stateCopy; 
        
           }

This is blocking me from being able to create a new TrieMultiSequenceMatcher

Support mark / reset in InputStreamReader

InputStreamReader gives up if it can't find a Window in the cache from a position which has already been read, and throws a WindowMissingException. This is because you can't rewind streams, so if it isn't in the cache, we can't get the data.

If the underlying InputStream supported mark / reset, we have the possibility of rewinding the stream.

It still may fail, but then we just might get another IOException - instead of the guaranteed WindowMissingException (which is an IOException).

If it succeeds, we can read forward again in the stream to obtain the missing data.

fails FileReaderTest

Appears to be reading unsigned bytes as signed:

testReadByte(net.byteseek.io.reader.FileReaderTest) Time elapsed: 0.016 sec <<< FAILURE!
java.lang.AssertionError: Reader FileReader[file:/home/dev/github/byteseek2/byteseek/target/test-classes/TestASCII.txt length: 109161 cache:MostRecentlyUsedCache[size: 0 capacity: 33]] reading at position 112122 should have value 80 expected:<80> but was:<-1>

This is the first of multiple test failures which appear similar though not as simple. Compilation attempted with jdk1.8.0_20 and openjdk-1.7.0 on CentOs

Design: should use Java charset names directly in regular expression language?

Byteseek 3.0 will have support for different character set encodings, which can be expressed in the reg ex language directly.

For example, to say that a string should be encoded as UTF-8, we can write:

(*UTF-8) 'A string encoded in UTF-8'

This uses the Java charset names directly. The upside for this is that byteseek users can specify whatever charset they want which is supported on their system.

The downside is that it lets users control what objects are loaded (possible security issue), and it ties the byteseek language directly to Java. It's only implemented in Java right now, and I have no plans to implement it in anything else - but it feels bad to tie the regex language to the underlying platform it is written in.

ByteMatcherCompiler inefficient

Seems to create the bytematcher only to take the bytes it matches and most often just re-create them. Use of OptimalByteMatcherFactory in ByteMatcherCompiler.

Only sets of bytes need to be optimised - the other matchers have definite values and cannot themselves be optimised further.
Should only optimise once.

Implement SBNDM search

SBNDM is about the fastest search for short patterns. Implement it in byteseek.

Has some nice properties, like ShiftOR, it doesn't really care during search how complex the thing you're looking for is, although pre-processing time rises to process all the complexity beforehand.

It's essentially a factor search like WFR or QF, but for single bytes. Doesn't use much memory (just an index of 256).

Silent replacement of algorithms or throw an error?

Some search algorithms have minimum length constraints on what they can search for (e.g. algorithms that work on q-grams (multiple bytes) in a pattern).

Currently, byteseek 2.1 uses the idea of a fallback searcher - if you instantiate an algorithm that doesn't support a pattern of length 2 say, it reverts back to a simpler algorithm that is usually faster for short patterns.

But this means that the algorithm you asked for is silently replaced (even if by something better).

The alternative is to throw an IllegalArgumentException if a pattern is passed in which doesn't meet the minimum pattern length requirements. This at least guarantees that you always get what you asked for - but sometimes what you ask for isn't viable and you are told that's the case.

In reality, most programmers shouldn't generally directly select the algorithm to use (as there are multiple reasons why particular algorithms should be preferred in different circumstances, including the length of the pattern and the complexity of it and where the complexity exists in the pattern). So searcher factories should be used to obtain a search algorithm, and the factories can embody this logic.

InputStreamReader incorrect window after end of stream

Create an InputStreamReader backed by a byte array, of say 1024 bytes.
read a byte using readByte at position 4096
A window is created by the createWindow method and length not set correctly.

Doesn't build on other machines due to nexus deployment in pom

Report that the code doesn't build on other machines due to set up specific to me.

pom specifies nexus deployment and code signing. This works for me, but doesn't and can't work on other machines. Should limit those activities to deployment only in the pom.

Made some tweaks, but need to test it.

Regular expression syntax definition

The regular expression syntax definition file (in the net.byteseek.parser.regex package) is out of date.

Update so it's correct - and probably add a web page describing the syntax too.

Parse search strings as byteseek regexes, or just convert to byte arrays?

Search algorithms usually provide a String constructor.

Currently, that just lets us match the sequence of bytes encoded by that string (either in the default Charset, or one that's provided).

But since byteseek is byte oriented, a hex string (or full byteseek regex syntax) might be more useful. One person has already asked whether they can pass in hex string bytes directly to the search algorithms. I had to say that wasn't true, and they needed to use the SequenceMatcherCompiler to create a SequenceMatcher for the search algorithm to look for.

The same search algorithms also have a byte[] array constructor, if you want to explicitly search for a byte pattern. It's easy to convert a string to a byte array if that's what needs to be searched for.

So I guess the string constructor is essentially redundant - unless we support byteseek regex construction directly, just get rid of those constructors.

If we did support byteseek regex syntax in the search algorithm String constructors, what do we do with Compiler / Parse Exceptions?

Returning the match including thr wildcards

Hello
First thanks for this library, it could make my task a lot easier.

Im trying to search through a large binary file and match a header (in hex) pattern. Following that header are 512 bytes of data that im actually interested in.

Thats how far i got. Sadly i couldnt get the pattern from the regexParser working, and the regular ByteMatcher only gives me a boolean, if the pattern is included in the file.

InputStreamReader inputStreamReader = new InputStreamReader(new FileInputStream(new File("test.bin")));

            //RegexParser regexParser = new RegexParser();
            //ParseTree tree = regexParser.parse("64 84 71 94 00 18 .{512});

            byte[] bytes = new byte[6];
            bytes[0] = (byte) 0x64;
            bytes[1] = (byte) 0x84;
            bytes[2] = (byte) 0x71;
            bytes[3] = (byte) 0x94;
            bytes[4] = (byte) 0x00;
            bytes[5] = (byte) 0x18;
            //...

            ByteMatcherCompiler byteMatcherCompiler = new ByteMatcherCompiler();
            ByteMatcher byteMatcher = ByteMatcherCompiler.compileFrom(bytes);

            System.out.println(byteMatcher.matches(inputStreamReader, 0));

Is it possible to search through the file with the commented regex and return all the matches (as byte[] or char[]) found in the binary including the wildcard ( .{512} )data?

Feature - return data matched.

byteseek currently only says at what position a match was found in a stream or a file, it doesn't return the data.

Add a feature to return data at positions in a WindowReader (byte[], char[]?) including using longs, search and match results as parameters.

StringReader broken

The StringReader class has a bug in the createWindow() method, which causes it to keep creating Windows even after the entire string has been read. This leads to near infinite searches.

Fix is to only create a window when the windowPosition is zero.

SequenceMatcherCompiler fails to compile case insensitive strings.

Only produces case sensitive matchers.

Prove OptimiseByteMatcherFactory is optimal.

The OptimiseByteMatcherFactory in the compiler/matcher package makes assumptions about the most appropriate matcher to instantiate. It optimises for matcher speed, rather than compile time (this is OK, that's what it should do).

Need to profile different set byte matchers to see when it should instantiate one over the other. At the moment, it's just a finger in the air.

Search and match variable length wildcards

There is a middle ground between fixed length sequences that byteseek currently supports, and full regular expressions (which needs a lot more dev and test). We can support wildcards between sequence elements.

.* wildcards: A sequence like 01 02 .* 03 04 can be matched easily (01 02, then search for 03 04).
.{1,4} variable length wildcards. A sequence like 01 02 03 .{1,4} 04 05 can be matched as 01 02 03, then search for 04 05.

Also need to support backtracking (optionally).

Regular expression of InvertedByteMatcher incorrect

The regular expression generated by the InvertedByteMatcher is incorrect. It puts the ^ followed by a space (if pretty printing) then the value. There should be no space, and a set is not necessary.

SetBinarySearchMatcher does not match.

Bug in code doesn't match on matchers(byte[]) and matchNoBoundsCheck(byte[]).

Tries to match the array being searched, instead of its own array of bytes.

Need test coverage for all byte matchers really. These matching methods were added later and the existing tests don't cover them.

New Set ByteMatcher type idea.

Possibility of a new ByteMatcherSetMatcher - a Set composed of other ByteMatchers, rather than collapsing all the bytes into a single matcher. Simply iterate over the bytematchers, and return true if any of them match.

Could well be more efficient than translating to a single matcher backed by a bitset or an array with a binary search. e.g. a bitmask and another byte = [&5e 7f] == two tests at most.
Doesn't work well with the current compiler, which discards the parse tree before the optimiser kicks in, and turns it into collections of Bytes instead.

Better documentation

Better documentation is needed to explain how to use byteseek, with examples.

Support matching integer values or ranges > 8 bits

Matching integer values (or ranges of them) for more than a byte, e.g. a 16 bit or 32 bit value.

SequenceMatchers
It is trivial to create a SequenceMatcher which matches 2, 4 or 8 bytes. The unit of matching is then the length of the sequence. To match a single value, it's not really necessary to have such a matcher, as the specific byte sequence could just be inserted as a fixed set of bytes. To match a range of integer values would require the logic in the matcher to accurately assess whether the number observed fell in the integer range.

Byte Dependencies
When matching an integer range, different bytes may match or not depending on the values of others. Deconstructing an IntRange SequenceMatcher into a sequence of ByteMatchers no longer gives identical results.

The four ByteMatchers which could be returned for an Int32RangeMatcher can only give the superset of all possible bytes that might match, which means they will match things the int32rangematcher will not (as they aren't embodying the int range logic).

It is possible that some optimisation algorithms might split up SequenceMatchers differently, operating on the assumption that it's always possible to deconstruct a SequenceMatcher into a list of ByteMatchers, and that running each of them will give identical results.

Search Algorithm Issues
If these are plugged into a search algorithm, some algorithms will have false positive matches. This is because they all assume that the bytes which can match at a particular position are fixed and always the same - but if we have > 8 bit numbers, a byte which matches in one position becomes dependant on the value of a byte in another position.

Any search algorithm which always validates a match by running the Matchers will be accurate. However, other algorithms like ShiftOR "compile" the pattern down to a lookup table of byte matches, so they don't actually run the SequenceMatcher to validate a match at all (and that is one reason they are fast).

Bounds checking does not look right

Bounds checking condition in the file ListMultiSequenceMatcher.java(Line no:169) doesn't look good

Typo: "first" should be "last" in syntax.md page

On the https://github.com/nishihatapalmer/byteseek/blob/master/syntax.md page where it is written:

 &7f                 # 0x7f = 01111111. Match bytes with all first seven bits set.

I think "first" should be "last", i.e.:

 &7f                 # 0x7f = 01111111. Match bytes with all last seven bits set.

High level interface needed

Byteseek so far has focussed on providing matcher, searcher, parser and compiler primitives that compose and give essential core features.
However, it's hard to use effectively or safely. There are "gotcha" search patterns, e.g. wildcards at the end of a pattern that can render a search algorithm painfully slow.

It needs to be possible to use byteseek without requiring specialist knowledge on how to use it safely. Documentation can help here, for those who want to use the primitives directly, but it should be possible to match or search any pattern reasonably efficiently over any data source without having to understand anything more than that.

Implement HashChain search

HashChain is pretty much the fastest search for longer patterns. Implement it for byteseek.

It works well for low alphabets and high alphabets, and is tunable for different memory requirements. It's a factor search, most similar to WFR in that it uses a rolling hash construction. It creates chains of hashes in the hash table with a fingerprint that allows verification that the next hash is correct before doing another index lookup.

Need to figure out where its worst cases get triggered. Search time rises dramatically (like QF, and WFR). In practice it seems quite hard to trigger if the hash table is sufficiently empty or we aren't dealing with pathological low alphabet or repetitive data. It still does better than WFR or QF in these situations though.

Nevertheless, byteseek needs to know if it is appropriate to select this algorithm or whether another slower but without the same extreme worst case behaviour needs to be used instead. Or if the pattern can be modified (use a substring) that would avoid the pathological case? A pattern composed only of zeros might not do well with hash chain.

Performance Issue with small alphabets and long texts

I tried to benchmark your library with StringBench.

Yet there seems to be a performance issue with all of your implementations of AbstractSequenceSearcher.

You can reproduce this by

checking out StringBench
selecting BSBoyerMooreHorspoolIncubationTest.java
removing the @Ignore
starting the test

I yet did not test it with any other parameters, so perhaps this performance issue is related to other scenarios, too.

Maybe this issue is related to an incorrect test setup. If so please show me how to correct it.

	public MutableState(final boolean isFinal) {
	this.isFinal = isFinal;
	this.transitions = Collections.emptyList();
	this.associations = Collections.emptyList(); // = new ArrayList<T>(0);
	}

	public MutableState<T> deepCopy(Map<DeepCopy, DeepCopy> oldToNewObjects) {
	@SuppressWarnings("unchecked")
	// if there is a copy of this in the map, it will be of the same type.
	MutableState<T> stateCopy = (MutableState<T>) oldToNewObjects.get(this);
	if (stateCopy == null) {
	stateCopy = new MutableState<T>(this.isFinal);
	oldToNewObjects.put(this, stateCopy);
	for (Transition<T> transition : transitions) {
	final Transition<T> transitionCopy = transition.deepCopy(oldToNewObjects);
	stateCopy.transitions.add(transitionCopy);
	}
	}
	return stateCopy;
	}