ssdeep-project / ssdeep
Fuzzy hashing API and fuzzy hashing tool
Home Page: https://ssdeep-project.github.io/ssdeep/index.html
License: GNU General Public License v2.0
To whom it may concern,
This is Tsukasa OI, a maintainer of ssdeep.
Sorry for not maintaining the project for a long time while I was busy with my job. I'm now reviewing the original C source code again and looking for some improvements. However, there is a major issue: preserving portability in C is hard. Per-OS code spreads everywhere. Some tools / fragments are old and we don't even know which platforms/tools to support.
(even if we don't rewrite it in Rust, we definitely need some cleaning)
Then, a Rust guy recommended that I try rewriting it in Rust. Well... (about 2 weeks later) the result looks... promising.
I ported libfuzzy and a part of ssdeep (the CLI) to Rust and... it performs faster than libfuzzy when comparing fuzzy hashes, even without any unsafe blocks (on fuzzy hash generation, the safe Rust version was about 15% slower). With unsafe Rust, it's definitely faster than libfuzzy (in both comparison and hash generation) and, surprisingly, when I enabled an LTO build it got faster than ffuzzy++, my C++ port of libfuzzy (which is generally faster than libfuzzy and has a specialized API for large-scale clustering). I haven't implemented all the features of ssdeep (the CLI) but the port seems more readable.
In the process of doing this, I found a bug inside fuzzy.c (I am struggling to find a failing test case because it seems very hard to reproduce) and will fix it later (probably next week).
Anyway, back to Rust. It looks promising, but I'm not sure whether this is the future we (as a project) should pursue. At the least, we should discuss it.
In a few weeks, I will release a Rust port of the original ssdeep (at least, most features) and libfuzzy on my GitHub (not in ssdeep-project), and I would like to hear your thoughts.
Could you please add an option to skip symlinks during recursive scanning?
With fixes for building on Windows.
How is that possible? Could you enable pushing branches?
Hello @jessek ,
I have been using LSH (TLSH, ssdeep) methods for malware detection for a while now.
I have hit a point where I need to do faster inference/search.
Is there some paper or document I can refer to that can help me convert a hash to its vector form? I need to do this in order to build an ANN index on ssdeep cluster centroids.
I've looked at a lot of resources on the web, but none of them help in extracting the actual bucket value. Most of them talk about using the actual ssdeep hash (plain text) to implement solutions similar to edit distances.
If you could point me to any helpful resources, I'd be really grateful. :)
I want a static library for Windows, including x86 and x64 builds.
Can you provide a compilation method?
Hi,
When building ssdeep 2.14 in Debian, I can see the following messages:
main.cpp: In function ‘void generate_filename(state*, char*, char*, char*)’:
main.cpp:242:15: warning: ignoring return value of ‘char* realpath(const char*, char*)’, declared with attribute warn_unused_result [-Wunused-result]
realpath(input, fn);
~~~~~~~~^~~~~~~~~~~
engine.cpp: In function ‘int hash_file(state*, char*)’:
engine.cpp:58:5: warning: ‘%s’ directive output truncated writing 79 bytes into a region of size 68 [-Wformat-truncation=]
int hash_file(state *s, TCHAR *fn) {
^~~~~~~~~
In file included from /usr/include/stdio.h:938:0,
from main.h:27,
from engine.cpp:3:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:65:44: note: ‘__builtin___snprintf_chk’ output 89 or more bytes into a destination of size 77
__bos (__s), __fmt, __va_arg_pack ());
^
engine.cpp:58:5: warning: ‘%s’ directive output truncated writing 79 bytes into a region of size 68 [-Wformat-truncation=]
int hash_file(state *s, TCHAR *fn) {
^~~~~~~~~
In file included from /usr/include/stdio.h:938:0,
from main.h:27,
from engine.cpp:3:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:65:44: note: ‘__builtin___snprintf_chk’ output 89 or more bytes into a destination of size 77
__bos (__s), __fmt, __va_arg_pack ());
^
[...]
cycles.cpp: In function ‘int done_processing_dir(char*)’:
cycles.cpp:51:11: warning: ignoring return value of ‘char* realpath(const char*, char*)’, declared with attribute warn_unused_result [-Wunused-result]
realpath(fn,d_name);
~~~~~~~~^~~~~~~~~~~
cycles.cpp: In function ‘int processing_dir(char*)’:
cycles.cpp:107:11: warning: ignoring return value of ‘char* realpath(const char*, char*)’, declared with attribute warn_unused_result [-Wunused-result]
realpath(fn,d_name);
~~~~~~~~^~~~~~~~~~~
cycles.cpp: In function ‘int have_processed_dir(char*)’:
cycles.cpp:158:11: warning: ignoring return value of ‘char* realpath(const char*, char*)’, declared with attribute warn_unused_result [-Wunused-result]
realpath(fn,d_name);
~~~~~~~~^~~~~~~~~~~
[...]
ar: `u' modifier ignored since `D' is the default (see `U')
Thanks!
Regards,
Eriberto
ssdeep needs modernization. Because this program has a long history, some parts of the code are getting old. As part of a major program restructuring, I'm doing some modernization.
Alternative operator keywords (and, or and not) will be replaced with their symbolic forms (&&, || and !). Exact-width integer types (uint8_t and uint64_t) will be changed to make the program portable. Note that the uint32_t argument of fuzzy_hash_buf will be preserved for now.

I have to compare two slightly different binary files with ssdeep.
But when I submit the command in my terminal, I obtain a blank response.
I've tried the online tool provided by the official site, but the outcome was the same.
Is this normal, or am I doing something wrong?
Is the libfuzzy part also released under GPL, instead of LGPL?
My Python code hashes/compares millions of items and takes a while to complete. We have installed an NVIDIA GPU and we want to utilise it to do the hashing and comparison.
Some forums say that ssdeep was designed to run on the CPU. I am not sure whether this is true; can anyone here confirm it? Also, does anyone have suggestions on how I can utilize the GPU?
I'm not a pro in Python.
ssdeep does not match low-entropy files as similar, even when they differ only by trivial changes.
For example, a long low-entropy text followed by some additional character(s):
python -c 'print(1000000 * "asdfghjkl" + "a")' > master
python -c 'print(1000000 * "asdfghjkl" + "b")' > similar1
python -c 'print(1000000 * "asdfghjkl" + 1000 * "b")' > similar2
python -c 'print(1000001 * "asdfghjkl")' > similar3
# With the naked eye you can see that the ssdeep hashes are similar to a large extent
$ ssdeep -b master similar*
ssdeep,1.1--blocksize:hash:hash,filename
96:zrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrS:G,"master"
96:zrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrb:P,"similar1"
96:zrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrj:v,"similar2"
96:zrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrn:r,"similar3"
# But all the similarity matches are only 0% ... which is wrong.
$ ssdeep -a -d master similar*
similar1 matches master (0)
similar2 matches master (0)
similar2 matches similar1 (0)
similar3 matches master (0)
similar3 matches similar1 (0)
similar3 matches similar2 (0)
# You get the same result with even more trivial tests like:
$ python -c 'print(1000000 * "xy" + "a")' > xmaster
$ python -c 'print(1000000 * "xy" + "b")' > xsimilar
$ ssdeep -b xmaster xsimilar
ssdeep,1.1--blocksize:hash:hash,filename
3:Ccdcb:Cceb,"xmaster"
3:CcdcS:CceS,"xsimilar"
$ ssdeep -a -d xmaster xsimilar
xsimilar matches xmaster (0)
# The change can actually be inside the file; it doesn't have to be at the end of the file
$ python -c 'print(100000 * "asdfghjkl" + "a" + 1000000 * "qwertyuiop")' > xmaster
$ python -c 'print(100000 * "asdfghjkl" + "b" + 1000000 * "qwertyuiop")' > xsimilar
$ ssdeep -b xmaster xsimilar
ssdeep,1.1--blocksize:hash:hash,filename
96:zrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrS:G,"xmaster"
96:zrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrb:P,"xsimilar"
$ ssdeep -b -a -d xmaster xsimilar
xsimilar matches xmaster (0)
# It is interesting that data with much higher entropy does not have this issue;
# it seems that the similarity comparison only fails when there is not enough
# entropy at the end of the file to spread over a bigger number of ssdeep hash characters
dd if=/dev/urandom of=rmaster bs=1M count=1
cp rmaster rsimilar
echo "a" >> rmaster
echo "b" >> rsimilar
$ ssdeep -b rmaster rsimilar
ssdeep,1.1--blocksize:hash:hash,filename
24576:J5WVyy+jicHjXWrzojYzsQIqEmD+PR3VOt1TRclj5aWhpth:JLDXWQRqqtVP5lhzh,"rmaster"
24576:J5WVyy+jicHjXWrzojYzsQIqEmD+PR3VOt1TRclj5aWhptJ:JLDXWQRqqtVP5lhzJ,"rsimilar"
$ ssdeep -a -d rmaster rsimilar
rsimilar matches rmaster (99)
It is not straightforward to install this on a Mac M1. Please fix this issue, or at least incorporate the solution here into the documentation: https://stackoverflow.com/a/75386653/16434675
When processing many files for similarity, it would be helpful to have something like a Bloom filter, but for ssdeep: something where the algorithm could answer "have I seen a file like this before?" without having to store and check every single prior ssdeep hash.
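A minimal sketch of one possible approach (entirely hypothetical, not an existing ssdeep feature): index the 7-character windows of each hash's block-hash part, so a cheap membership check can rule out files that could never share a common substring with anything seen before. A real implementation would use a fixed-size Bloom filter with k hash functions rather than a Python set:

```python
# Hypothetical sketch: a Bloom-filter-style "have I seen a similar file?"
# check for ssdeep hashes. We index every 7-character window of the
# block-hash part; a set stands in for a real fixed-size Bloom filter.
WINDOW = 7

def windows(block_hash: str):
    """All substrings of length WINDOW in the block hash."""
    return {block_hash[i:i + WINDOW]
            for i in range(max(0, len(block_hash) - WINDOW + 1))}

class SeenFilter:
    def __init__(self):
        self.seen = set()

    def add(self, ssdeep_hash: str):
        """Index the block-hash windows of a 'blocksize:hash:hash' string."""
        _, h1, _ = ssdeep_hash.split(":", 2)
        self.seen |= windows(h1)

    def maybe_seen(self, ssdeep_hash: str) -> bool:
        """True if any window of this hash was indexed before,
        so a full pairwise comparison might be worthwhile."""
        _, h1, _ = ssdeep_hash.split(":", 2)
        return bool(windows(h1) & self.seen)
```

As with a real Bloom filter, a positive answer only means "possibly seen" and must be confirmed with real comparisons; a negative answer safely skips the file.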
When I run this line:
C:> gcc -Wall -Ic:\path\to\includes sample.c fuzzy.dll
I get the error: gcc.exe: error: fuzzy.dll: No such file or directory
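A sketch of two common fixes (assumptions from the error message, with placeholder paths): the error means gcc cannot find fuzzy.dll where it was named, so either give the DLL's actual location or point the linker at its directory with -L/-l:

```shell
# Hypothetical build-command sketch (placeholder paths).
# Option 1: name the DLL by its real path instead of a bare filename:
gcc -Wall -Ic:/path/to/includes sample.c c:/path/to/fuzzy.dll -o sample.exe
# Option 2: let the linker search for it:
gcc -Wall -Ic:/path/to/includes sample.c -Lc:/path/to -lfuzzy -o sample.exe
```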
Hi
I apologize if this is off topic. I am trying to include the ssdeep header files and call the fuzzy hashing functions in my Xcode project. Apart from including the "fuzzy.h" header file, I think I need to configure other settings as well, as I am getting this error:
Undefined symbols for architecture x86_64:
"_fuzzy_hash_filename", referenced from:
_main in main.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Would anyone know how to resolve this issue?
An input file such as this, as created and minimized with AFL:
ssdeep,1.0--blocksize:hash:hash,filename
"
crashes ssdeep when provided as input to ssdeep -m $FILE:
% ./inst/bin/ssdeep -m ssdeep.crash
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 2) > this->size() (which is 1)
zsh: abort ./inst/bin/ssdeep -m ssdeep.crash
%
On fuzzy.dll (in the prebuilt Win32 archive), the fuzzy_hash_file and fuzzy_hash_stream functions will not work properly if the program that uses fuzzy.dll is built with Visual C++ in the usual way.
Because struct FILE is managed by a separate instance of the Microsoft CRT, mixing multiple CRTs (multiple versions, or Debug/Release builds) causes problems. Internally, a POSIX file descriptor is managed by __pioinfo, and its entry (struct __crt_lowio_handle_data on UCRT) holds the corresponding Win32 handle and other information. Since every instance of the Microsoft CRT has its own __pioinfo (note: on the universal CRT [VS2015 or later], __pioinfo is no longer exported), mixing CRTs can cause serious inconsistency problems: for example, a FILE opened by fopen in one CRT is not considered open by another.
See: Potential Errors Passing CRT Objects Across DLL Boundaries: https://msdn.microsoft.com/en-us/library/ms235460.aspx
fuzzy.dll references msvcrt.dll, and there's no problem if the Win32 program also uses msvcrt.dll. However, this is very unlikely. Programs built with Visual C++ are normally linked against a version-specific CRT and/or the universal CRT.
Although it's possible to link against a version-specific CRT (but not the universal CRT; as of September 2017), it's much safer to build fuzzy.dll with Visual C++. At least, linking against the universal CRT (the new CRT for Windows; ucrtbase.dll and API sets) is required to resolve this issue, because there will be no more version-related issues (there will, however, be Debug/Release DLL issues).
Planned changes:
- Update the README regarding CRT object sharing (adding the /MD or /MDd option; will be committed later)
- Make fuzzy.c, edit_distn.c and find-file-size.c possible to compile on Visual C++
After these changes, the Windows build pipeline may be separated from this repository.
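Given the CRT-mixing problem described above, one way for client code to sidestep it (a sketch, not the project's official guidance) is to pass a filename across the DLL boundary instead of a FILE*; fuzzy_hash_filename is part of the public fuzzy.h API and lets the DLL open the file with its own CRT:

```c
/* Sketch: avoid passing a CRT FILE* into fuzzy.dll.
 * With fuzzy_hash_filename, fuzzy.dll opens and closes the file using
 * its own CRT instance, so no FILE* crosses the DLL boundary.
 * Assumes fuzzy.h and the fuzzy library are available to the build. */
#include <stdio.h>
#include <fuzzy.h>

int main(void) {
    char result[FUZZY_MAX_RESULT];

    /* Problematic on Windows when fuzzy.dll uses a different CRT:
     *   FILE *fp = fopen("sample.bin", "rb");
     *   fuzzy_hash_file(fp, result);
     */

    /* Safer: the DLL does its own I/O. */
    if (fuzzy_hash_filename("sample.bin", result) == 0)
        printf("%s\n", result);
    return 0;
}
```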
Hi,
I am in the process of completing an assignment for a university course and I am having issues comparing two images.
I am using a jpg image and changing the original by adding small text to the image. Whenever ssdeep is run, it comes back with a 0% match when only a small change has occurred. My commands are:
ssdeep * > ss-hash-all-jpg.txt
ssdeep -a -m ss-hash-all-jpg.txt *
Can you please advise as to whether these commands are correct or whether there are any advanced commands I need to use?
Thanks
FB
Alpine Linux is becoming an increasingly popular distro for secure, light, Docker-based deployments. It would be great to see this reversing library in there as well.
Hi, when I run ssdeep recursively on a directory, the program will crash if the target dir contains any file at the 7th level or deeper.
More observations:
C:\
OS: Windows 10 Pro
Reproducibility: 100 %
ssdeep -V: 2.14.1
# won't crash
mkdir a\b\c\d\e\f
type nul > a\b\c\d\e\f\6th-level.txt
ssdeep -r a
# will crash
mkdir a\b\c\d\e\f\g
type nul > a\b\c\d\e\f\g\7th-level.txt
ssdeep -r a
# won't crash
hashdeep -r a
md5deep -r a
sha1deep -r a
sha256deep -r a
tigerdeep -r a
whirlpooldeep -r a
Hi,
I am attempting the installation of ssdeep. I cloned the git repository but did not find any configure file, as described in the INSTALL instructions file.
I am not sure whether the documentation is inaccurate or I am missing something, as this seems to be a rather standard installation.
Looking forward to your reply.
Best,
Shaman
Please see the files in this archive.
They are quite similar, but ssdeep returns 0 similarity on 2 of them. This is caused by the has_common_substring(_pa) check in the score_strings function. Without it, I get a score of 60. By the way, there is a negligible performance impact if you remove it.
0.txt ssdeep: 24:+RPhWeV1T3rQ+QV1T3rQXV1T3rQ62H2M3kpann:uPAefDM+QfDMXfDM62H2Mkp6
1.txt ssdeep: 24:+RPy6qV1T3rQcV1T3rQ64V1T3rQqV1T3rQ0pa5:uPUfDMcfDM64fDMqfDM0pe
2.txt ssdeep: 24:+RP4zpirdqV1T3rQxunV1T3rQ8V1T3rQxV1T3rQTpa5:uP4zgr4fDMYnfDM8fDMxfDMTpe
0.txt vs 1.txt: 65
0.txt vs 2.txt: 0
1.txt vs 2.txt: 77
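A minimal sketch of the rule at issue (in Python; the real implementation lives in fuzzy.c, and this assumes the common-substring requirement uses the rolling window length of 7 characters):

```python
# Sketch (not the fuzzy.c implementation): ssdeep only scores two block
# hashes if they share a common substring of at least ROLLING_WINDOW
# characters; otherwise the score is forced to 0 regardless of how small
# the edit distance between them is.
ROLLING_WINDOW = 7

def has_common_substring(s1: str, s2: str) -> bool:
    """True if s1 and s2 share any substring of length ROLLING_WINDOW."""
    if len(s1) < ROLLING_WINDOW or len(s2) < ROLLING_WINDOW:
        return False
    wins = {s1[i:i + ROLLING_WINDOW]
            for i in range(len(s1) - ROLLING_WINDOW + 1)}
    return any(s2[i:i + ROLLING_WINDOW] in wins
               for i in range(len(s2) - ROLLING_WINDOW + 1))
```

The hashes of 0.txt and 2.txt above share repeated 6-character runs such as "V1T3rQ", but may fail to share any 7-character substring after sequence elimination, which would explain the 0 score despite the obvious similarity.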
When I issue the command pip install ssdeep
, I receive the following error:
----- Installing 'ssdeep' -----
Collecting ssdeep
Using cached ssdeep-3.2.tar.gz
Complete output from command python setup.py egg_info:
running egg_info
creating pip-egg-info\ssdeep.egg-info
writing pip-egg-info\ssdeep.egg-info\PKG-INFO
writing dependency_links to pip-egg-info\ssdeep.egg-info\dependency_links.txt
writing requirements to pip-egg-info\ssdeep.egg-info\requires.txt
writing top-level names to pip-egg-info\ssdeep.egg-info\top_level.txt
writing manifest file 'pip-egg-info\ssdeep.egg-info\SOURCES.txt'
warning: manifest_maker: standard file '-c' not found
_ssdeep_cffi_b2f2ace7x627c7d55.c
ssdeep_pycache__ssdeep_cffi_b2f2ace7x627c7d55.c(213): fatal error C1083: Cannot open include file: 'fuzzy.h': No such file or directory
Traceback (most recent call last):
File "C:\Users\emili\AppData\Local\Programs\Python\Python36-32\lib\distutils_msvccompiler.py", line 423, in compile
self.spawn(args)
File "C:\Users\emili\AppData\Local\Programs\Python\Python36-32\lib\distutils_msvccompiler.py", line 542, in spawn
return super().spawn(cmd)
File "C:\Users\emili\AppData\Local\Programs\Python\Python36-32\lib\distutils\ccompiler.py", line 909, in spawn
spawn(cmd, dry_run=self.dry_run)
File "C:\Users\emili\AppData\Local\Programs\Python\Python36-32\lib\distutils\spawn.py", line 38, in spawn
_spawn_nt(cmd, search_path, dry_run=dry_run)
File "C:\Users\emili\AppData\Local\Programs\Python\Python36-32\lib\distutils\spawn.py", line 81, in _spawn_nt
"command %r failed with exit status %d" % (cmd, rc))
distutils.errors.DistutilsExecError: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\cl.exe' failed with exit status 2
Thanks!
Hi
When I compute a fuzzy hash for a file, I see output such as the following:
48:<rest of fuzzy hash>
There is a number at the front of the hash. What does that number mean? Is it part of the fuzzy hash?
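For what it's worth, the CSV header the tool prints elsewhere in these issues ("ssdeep,1.1--blocksize:hash:hash,filename") suggests the leading number is the block size. A small sketch of splitting a hash on that reading:

```python
# Sketch: splitting an ssdeep hash into its three colon-separated parts.
# Per the tool's own CSV header ("blocksize:hash:hash"), the leading
# number is the block size; the other two parts are the block hash and
# the double-block-size hash.
def split_ssdeep_hash(h: str):
    blocksize, hash1, hash2 = h.split(":", 2)
    return int(blocksize), hash1, hash2

print(split_ssdeep_hash("3:Ccdcb:Cceb"))  # the xmaster hash from above
```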
seq 10000 > 1-col
par < 1-col > multicol
seq 10010 > 1-col-ver2
The files 1-col, multicol and 1-col-ver2 are more than 70% identical. But ssdeep sees multicol and 1-col as 0% identical.
I have the feeling it is due to the fuzzy hashing looking at too big a chunk.
I hit this problem when comparing articles. Article 1 had line numbers and 80 chars per line; article 2 had no line numbers and 60 chars per line. Apart from this, the two articles were identical.
The similarity of these ssdeep hashes is 36, but as you can see they are nearly identical.
char s0[] = "3:FEROlMk3/DXO2EXhIWAlvgulM4jIL2Q:FEROik3guWe9i4jIL2Q";
char s1[] = "3:FEROlMk3/DXO2EXhIWAlvgulM4jILdMQ:FEROik3guWe9i4jI2Q";
s01 = fuzzy_compare(s0, s1); // 36
The problem is in fuzzy.c, in the score_strings function. At the end of the function there is code that can adjust the score:
if (score > block_size/MIN_BLOCKSIZE * MIN(s1len, s2len)) {
    score = block_size/MIN_BLOCKSIZE * MIN(s1len, s2len);
}
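The effect of that cap can be sketched in isolation (a Python sketch; MIN_BLOCKSIZE is 3 in fuzzy.c, the other values below are illustrative):

```python
# Sketch of the score cap at the end of score_strings in fuzzy.c:
# at small block sizes the score cannot exceed
# (block_size / MIN_BLOCKSIZE) * min(hash lengths).
MIN_BLOCKSIZE = 3  # smallest block size ssdeep uses

def cap_score(score: int, block_size: int, s1len: int, s2len: int) -> int:
    cap = block_size // MIN_BLOCKSIZE * min(s1len, s2len)
    return min(score, cap)

# At the minimum block size the cap equals the shorter hash length,
# so short hashes are prevented from ever reaching a high score.
print(cap_score(100, 3, 10, 12))  # capped to 10
```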
Hi,
I am working on a Python machine learning course, and ssdeep file comparison is one of the assigned tasks.
I downloaded the compiled binaries and unzipped the files on my Windows 10 system. Then I double-clicked the files, but nothing happened: just a blinking cursor on a black command prompt. How do I use ssdeep on Windows for a Python environment?
Running pip install ssdeep in my env failed miserably. How do I install ssdeep so that it works in Python via import ssdeep?
Same ssdeep.exe,
Same ssdeep fuzzy hash list,
Same detect directory
I get different results:
Scanning a PHP directory gives 137 results in one run and 317 in another. Different platforms (Windows 7, Windows 10, Ubuntu) give different results, and even different Windows 10 machines get different result items. So, what happened? I am very confused.
What's wrong?
Hi all,
This component is listed as GPL 2 in the license file. Some of the headers state 'GPL 2 or later'; however, Main.cpp states that it is licensed as GPL 2 only.
Would it be possible to correct all headers to 'GPL 2 or later'?
Additionally, ssdeep.h contains a Facebook copyright. Facebook does not normally license any of their code under the GPL. Please could you confirm there is not a missing Facebook license in the header?
Thank you.
MSVC does not have fseeko/ftello, causing compilation of the unmodified fuzzy.c file to fail on VS2022 (17.9.34728.123).
Relevant log entries (irrelevant paths stripped):
.\ssdeep\fuzzy.c(572,10): warning C4013: 'ftello' undefined; assuming extern returning int
.\ssdeep\fuzzy.c(575,7): warning C4013: 'fseeko' undefined; assuming extern returning int
fuzzy.obj : error LNK2001: unresolved external symbol fseeko
fuzzy.obj : error LNK2001: unresolved external symbol ftello
fatal error LNK1120: 2 unresolved externals