brandondahler / data.hashfunction

C# library to create a common interface to non-cryptographic hash functions.

License: MIT License

C# 98.90% Smalltalk 0.14% PowerShell 0.95%
hash-algorithm hash-functions non-cryptographic-hash-functions dotnet-library

data.hashfunction's Introduction

Deskasoft International has officially taken over maintenance and ongoing development of this library at https://github.com/Deskasoft/Data.HashFunction.


Data.HashFunction

Data.HashFunction is a C# library to create a common interface to non-cryptographic hash functions and provide implementations of public hash functions. It is licensed under the permissive and OSI approved MIT license.

All functionality of the library is tested using xUnit. A primary requirement for each release is 100% code coverage by these tests.

All code within the library is commented using Visual Studio-compatible XML comments.

NuGet

  • Data.HashFunction.Interfaces
  • Data.HashFunction.Core
  • Data.HashFunction.BernsteinHash
  • Data.HashFunction.Blake2
  • Data.HashFunction.Buzhash
  • Data.HashFunction.CityHash
  • Data.HashFunction.CRC
  • Data.HashFunction.ELF64
  • Data.HashFunction.FNV
  • Data.HashFunction.HashAlgorithm
  • Data.HashFunction.Jenkins
  • Data.HashFunction.MurmurHash
  • Data.HashFunction.Pearson
  • Data.HashFunction.SpookyHash
  • Data.HashFunction.xxHash

Implementations

All implementation packages depend on the Data.HashFunction.Interfaces and Data.HashFunction.Core NuGet packages.

The following hash functions have been implemented from the most reliable reference that could be found.

  • Bernstein Hash
    • BernsteinHash - Original
    • ModifiedBernsteinHash - Minor update that is said to result in better distribution
  • Blake2
    • Blake2b
  • BuzHash
    • BuzHashBase - Abstract implementation, there is no authoritative implementation
    • DefaultBuzHash - Concrete implementation, uses 256 random 64-bit integers
  • CityHash
  • CRC
    • CRC - Generalized implementation to allow any CRC parameters between 1 and 64 bits.
    • CRCStandards - 71 implementations on top of CRC that use the parameters defined by their respective standard. Standards and their parameters provided by CRC RevEng's catalogue.
  • ELF64
  • FNV
    • FNV1Base - Abstract base of the FNV-1 algorithms
    • FNV1 - Original
    • FNV1a - Minor variation of FNV-1
  • Hash Algorithm Wrapper
    • HashAlgorithmWrapper - Wraps existing instance of a .Net HashAlgorithm
    • HashAlgorithmWrapper<HashAlgorithmT> - Wraps a managed instance of a .Net HashAlgorithm
  • Jenkins
    • JenkinsOneAtATime - Original
    • JenkinsLookup2 - Improvement upon One-at-a-Time hash function
    • JenkinsLookup3 - Further improvement upon Jenkins' Lookup2 hash function
  • Murmur Hash
    • MurmurHash1 - Original
    • MurmurHash2 - Improvement upon MurmurHash1
    • MurmurHash3 - Further improvement upon MurmurHash2, addresses minor flaws
  • Pearson hashing
    • PearsonBase - Abstract implementation, there is no authoritative implementation
    • WikipediaPearson - Concrete implementation, uses values from Wikipedia article
  • SpookyHash
    • SpookyHashV1 - Original
    • SpookyHashV2 - Improvement upon SpookyHashV1, fixes bug in original specification
  • xxHash
    • xxHash - Original and 64-bit version.

Each family of hash functions is contained within its own project and NuGet package.

Usage

The usage of all hash functions is standardized and accessible via the System.Data.HashFunction.IHashFunction and System.Data.HashFunction.IHashFunctionAsync interfaces. The core package, Data.HashFunction.Core, contains only abstract hash function implementations and base functionality for the library. To use a specific hashing algorithm, you will need to reference its implementation package.

IHashFunction implementations should be immutable and stateless. All IHashFunction methods and members should be thread safe.

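using System;
using System.Data.HashFunction;
using System.Data.HashFunction.Jenkins;
using System.Text;

public class Program
{
    // Implementations are immutable and thread safe, so one shared instance is enough.
    private static readonly IJenkinsOneAtATime _jenkinsOneAtATime = JenkinsOneAtATimeFactory.Instance.Create();

    public static void Main()
    {
        // ComputeHash takes a byte array, so encode the string first.
        var hashValue = _jenkinsOneAtATime.ComputeHash(Encoding.UTF8.GetBytes("foobar"));

        Console.WriteLine(hashValue.AsHexString());
    }
}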

Release Notes

See Release Notes wiki page.

Contributing

Feel free to propose changes, notify of issues, or contribute code using GitHub! Submit issues and/or pull requests as necessary.

There are no special requirements for change proposal or issue notifications.

Code contributions should follow existing code's methodologies and style, along with XML comments for all public and protected namespaces, classes, and functions added.

License

Data.HashFunction is released under the terms of the MIT license. See LICENSE for more information or see http://opensource.org/licenses/MIT.

data.hashfunction's People

Contributors

awalsh128, brandondahler, dbckr, thebigb

data.hashfunction's Issues

Re-namespace and rename packages

While this project started out under the System.Data.HashFunction namespace and package name, it is no longer allowed for non-Microsoft packages to be pushed under the System.* package name.

To keep consistency, we'll want to update the namespace along with the package name.

add t1ha (Fast Positive Hash)

Fast Positive Hash - just the fastest portable hash function.

Briefly, it is a portable 64-bit hash function:

  1. Intended for 64-bit little-endian platforms, predominantly for Elbrus and x86_64,
    but portable and without penalties it can run on any 64-bit CPU.
  2. In most cases up to 15% faster than StadtX hash, xxHash, mum-hash, metro-hash, etc.
    and all others portable hash-functions (which do not use specific hardware tricks).
  3. Provides a set of terraced hash functions.
  4. Currently not suitable for cryptography.
  5. Licensed under Zlib License.

Result of CityHash64 doesn't match the reference implementation on strings of 64*n (n > 1) lengths

using System.Collections.Generic;
using System.Data.HashFunction.CityHash;
using System;
using System.Text;

namespace HashTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var symbols = new List<char>{'a', 'b', 'x', 'y'};
            var hasher = CityHashFactory.Instance.Create(new CityHashConfig{ HashSizeInBits = 64 });

            foreach (var symbol in symbols)
            {
                for (var length = 2; length < 4; ++length)
                {
                    var s = new string(symbol, length*64);
                    var bytes = Encoding.ASCII.GetBytes(s);
                    var hash = BitConverter.ToUInt64(hasher.ComputeHash(bytes).Hash, 0);
                    Console.WriteLine($"{symbol}, {length}: 0x{hash:x}");
                }
            }

            Console.ReadKey();
        }
    }
}

vs python (uses reference implementation, via cityhash module)

import cityhash

for symbol in ('a', 'b', 'x', 'y'):
    for length in (2, 3):
        hash = cityhash.CityHash64(symbol*length*64)
        print("{}, {}: {:#x}".format(symbol, length, hash))

vs C++ (using https://github.com/google/cityhash via conan)

#include <string>
#include <cstdio>

#include "city.h"

int main(int, char*[]) {
	for (const auto symbol : {'a', 'b', 'x', 'y'}) {
		for (const auto length : {2, 3}) {
			const auto s = std::string(length*64, symbol);
			const auto hash = CityHash64(s.data(), s.size());
			printf("%c, %d: 0x%lx\n", symbol, length, hash);
		}
	}

	return 0;
}

C#:

a, 2: 0x17eb9429608efa10
a, 3: 0xd173291f9db2d8d1
b, 2: 0xd7f220816e41070d
b, 3: 0x36074be8fc81c410
x, 2: 0x77f3f0a5f76761d5
x, 3: 0xfe9c5c96274e4df9
y, 2: 0x85b294ba426c41c7
y, 3: 0xd7dcefe6faea4424

Python and C++:

a, 2: 0x8732752111926e2c
a, 3: 0xf7b22b0a38b54ca8
b, 2: 0x9f0c541d796fd1f1
b, 3: 0x453fb3d655153452
x, 2: 0x87e1532e643b0d29
x, 3: 0x128cf8134a32840
y, 2: 0xdf18ce2fbf974758
y, 3: 0x28e79fea5420f5f5

It looks like there is a difference in tail symbol processing

Endianness issue?

Hi

While looking at your library I noticed that for CRC16 (ARC in your standards enumerator), the library returns {0x3D,0xBB} when in fact it should be reversed: {0xBB,0x3D}.
Look at http://www.lammertbies.nl/comm/info/crc-calculation.html and other online calculators on the web. They all produce 0xBB3D, and yours is the only one producing the wrong endianness.
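
The two results in this report may be the same value written in different byte orders. The sketch below is a hand-written bitwise CRC-16/ARC, not the library's implementation, run over the conventional check input "123456789". It prints the value as a number and as a least-significant-byte-first array. That the library emits its bytes in that order is an assumption drawn from the report, not something confirmed here.

```csharp
using System;
using System.Text;

// Bitwise CRC-16/ARC: reflected polynomial 0xA001, initial value 0, no final XOR.
ushort Crc16Arc(byte[] data)
{
    ushort crc = 0;
    foreach (byte b in data)
    {
        crc ^= b;
        for (int i = 0; i < 8; i++)
        {
            crc = (crc & 1) != 0
                ? (ushort)((crc >> 1) ^ 0xA001)
                : (ushort)(crc >> 1);
        }
    }
    return crc;
}

ushort value = Crc16Arc(Encoding.ASCII.GetBytes("123456789"));

// The same 16-bit value, laid out least significant byte first.
byte[] littleEndian = { (byte)(value & 0xFF), (byte)(value >> 8) };

Console.WriteLine($"value: 0x{value:X4}");                          // value: 0xBB3D
Console.WriteLine($"bytes: {BitConverter.ToString(littleEndian)}"); // bytes: 3D-BB
```

Read this way, {0x3D,0xBB} and 0xBB3D agree; only the order in which the bytes are displayed differs.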

How to get hash code

Hi,

I am trying to use your library to get the hash code of a given string. I'd like to use the hash code for load balancing and I'm looking for a consistent hash code (for a given string, I always get the same hash code).

Here is what it looks like:

int nodeIndex = GetHashCode(sessionId) % nodeCount;
Nodes[nodeIndex].DoWork();

Does this library provide such functionality?
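
A minimal sketch of the idea in this question, assuming any deterministic 32-bit hash will do. Fnv1a32 and PickNode are hypothetical helpers written here so the example runs without a package reference. With this library, the 32-bit value would instead come from the bytes in IHashValue.Hash, for example via BitConverter.ToUInt32 as in other snippets on this page. The built-in string.GetHashCode() is not suitable because .NET Core randomizes it per process.

```csharp
using System;
using System.Text;

// FNV-1a, 32-bit: a stand-in for a library hash function.
uint Fnv1a32(string text)
{
    uint hash = 2166136261;                 // FNV offset basis
    foreach (byte b in Encoding.UTF8.GetBytes(text))
    {
        hash ^= b;
        hash *= 16777619;                   // FNV prime; overflow wraps around
    }
    return hash;
}

// Taking the modulus on the unsigned value keeps the index non-negative.
int PickNode(string sessionId, int nodeCount) =>
    (int)(Fnv1a32(sessionId) % (uint)nodeCount);

int nodeCount = 4;
int first = PickNode("session-42", nodeCount);
int second = PickNode("session-42", nodeCount);

Console.WriteLine(first == second);                  // True
Console.WriteLine(first >= 0 && first < nodeCount);  // True
```

Plain modulo assignment remaps most keys whenever nodeCount changes; if that matters, a consistent-hashing ring is the usual alternative.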

Support Span<byte> as Hash-function input

It would be very useful for the hash functions to accept a ReadOnlySpan<byte> instead of only Stream and byte[]. More and more .NET APIs are enabled for use with the various Span data types, so it happens quite often that instead of a byte[] you are dealing with a Span<byte>, which cannot be used directly with this library.

Another advantage is the possibility to cast between various Span types with the help of MemoryMarshal. This makes it possible to cast a string to a ReadOnlySpan<byte> without having to encode or copy the string first. As I often want to compute hashes from strings, this would drastically improve performance and reduce allocations.

I would also be happy if this would only be supported in the block-transformer interfaces in the beginning, where you already support ArraySegment.
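
The cast described in this request can be sketched with the base class library alone. MemoryMarshal.AsBytes reinterprets the string's characters as bytes without encoding or copying. No library overload accepting a span is assumed or shown here.

```csharp
using System;
using System.Runtime.InteropServices;

string text = "abc";

// Reinterpret the string's UTF-16 code units as raw bytes: no encoding step, no copy.
ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(text.AsSpan());

Console.WriteLine(bytes.Length); // 6 (three chars, two bytes each)
```

The bytes seen this way are UTF-16 in the platform's byte order, so a hash computed over them will differ from one computed over Encoding.UTF8.GetBytes(text).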

xxHash gives incorrect value when using AsHexString method.

I had to create a custom as hex string method to get the correct value.

Either the value is being stored under the hood with the bytes reversed, or just when using the AsHexString method the value is being reversed. Looking at the source, it's probably the storage format since AsHexString didn't appear to be reversing anything.

I had to generate my own helper method to get the hex string correctly. This was required for me as I'm trying to match up xxhashes from different languages and the c++ version I'm using (and several others I checked) did not have this issue.

// My own helper method I had to use to work around the ordering issue.
public static string AsHexString(byte[] hash, bool uppercase)
{
    // Note this reverse. This is currently required to work correctly.
    // Reverse a copy so the caller's array is left untouched.
    byte[] reversed = (byte[])hash.Clone();
    Array.Reverse(reversed);

    StringBuilder stringBuilder = new StringBuilder(reversed.Length * 2);
    string format = uppercase ? "X2" : "x2";
    foreach (byte num in reversed)
        stringBuilder.Append(num.ToString(format));
    return stringBuilder.ToString();
}

CRC Hash Error

I tried using the CRC hash, but I get this error:

"ICRC does not contain a definition for Value" and no extension method 'Value' accepting a first argument of type 'ICRC' could be found (are you missing a using directive or an assembly reference?)

My Code is:

var c = CRCFactory.Instance.Create(new CRCConfig()
{
    HashSizeInBits = 32,
});
Console.WriteLine("Enter:");
var result = c.ComputeHash(Encoding.UTF8.GetBytes(Console.ReadLine()));
var dt = result.AsHexString();
Console.WriteLine(dt);
Console.ReadKey();

Audit all test vectors

Issue #30 identified an issue in SpookyHash that ultimately was not caught because the test vectors that were originally defined for the SpookyHash implementation were just plain wrong.

The obvious follow up is: did I screw up any other test vectors and therefore implementations?

Example in README is wrong signature

In the README code example, ComputeHash accepts a string:

var hashValue = _jenkinsOneAtATime.ComputeHash("foobar");

...however, ComputeHash actually accepts a byte array.

Let me know and I'll gladly create a pull request.

Support for stream block processing

Provide support for stream block processing, like a default .NET HashAlgorithm:

var sourceStream = ... // From anywhere
var hashAlgorithm = ... // HashFunction

var bufferSize = 8192;
long blobLength = 0; // Total number of bytes processed
using (Stream stream = new MemoryStream())
{
    var buffer = new byte[bufferSize];
    int bytesRead;
    while ((bytesRead = sourceStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Feed each block to the hash as it is read
        hashAlgorithm.TransformBlock(buffer, 0, bytesRead, null, 0);

        stream.Write(buffer, 0, bytesRead);

        blobLength += bytesRead;
    }

    // Finalize the hash with an empty last block
    hashAlgorithm.TransformFinalBlock(new byte[0], 0, 0);
}

Potential .NET Core Migration

My company is currently considering porting some of our projects over to .NET Core and your MurmurHash library (and its dependencies) are in use all over the place (requiring us to either swap out the hashing algorithm in use or to migrate your libraries to move forward).

Do you have any plans to prepare these libraries for use with .NET Core/Standard? If not, would you be open to a PR that does so?

Performance

Which is the most performant algorithm? Is there a table of measurements somewhere?

AsHexstring() function in CityHash gives reverse output

public static readonly ICityHash _cityHash = CityHashFactory.Instance.Create(new CityHashConfig() { HashSizeInBits = 64 });

var hashTest = _cityHash.ComputeHash(Encoding.ASCII.GetBytes("Hello World"));
string hashHexTest = hashTest.AsHexString();

The output of hashHexTest is "d3c5587c645895fe".
The actual result should be "fe9558647c58c5d3".

I investigated further, and other implementations also produce the expected value.

If anyone can help me find a way to correct the output, that would be helpful.
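
The two strings in this report contain the same eight bytes in opposite order. As a possible workaround, sketched with the base class library only, the bytes can be read as a little-endian 64-bit integer and formatted as a number. The byte values below are copied from the reported output. Whether hash bytes are always stored least significant first is an assumption.

```csharp
using System;
using System.Buffers.Binary;

// Bytes in the order AsHexString() printed them in the report above.
byte[] hash = { 0xd3, 0xc5, 0x58, 0x7c, 0x64, 0x58, 0x95, 0xfe };

// Interpret the first byte as the least significant one.
ulong value = BinaryPrimitives.ReadUInt64LittleEndian(hash);

Console.WriteLine(value.ToString("x16")); // fe9558647c58c5d3
```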

Different strings, but the xxHash calculation result is the same

Calculating the hash of two different, unique strings with the xxHash algorithm gives the same result.
C# code:

using System;
using System.Text;

using System.Data.HashFunction;
using System.Data.HashFunction.xxHash;

namespace DemoNS
{
    public class Demo
    {
        static void Main(string[] args)
        {
            uint v1 = CalcXXHash("cdkey-637302380103173928-f8830392-f3bf-4e92-aa73-6d8e9e6c0260-1199177810");
            Console.WriteLine("".PadLeft(50, '-'));
            uint v2 = CalcXXHash("cdkey-637302378177363195-42ce23ac-282f-4a8e-96c7-13a12d58c153-589858");
            return;
        }

        public static uint CalcXXHash(string origin)
        {
            IxxHash ixxHash = xxHashFactory.Instance.Create(new xxHashConfig()
            {
                HashSizeInBits = 32,
                Seed = 0
            });
            Console.WriteLine("origin: {0}", origin);
            byte[] byteData = Encoding.UTF8.GetBytes(origin);
            IHashValue hashValue = ixxHash.ComputeHash(byteData);
            Console.WriteLine("AsBase64String(): {0}", hashValue.AsBase64String());
            Console.WriteLine("AsHexString(): {0}", hashValue.AsHexString());
            byte[] hash = hashValue.Hash;
            string hashArray = string.Join(", ", Array.ConvertAll(hash, (byte item) => item.ToString()));
            Console.WriteLine("hashArray: {0}", hashArray);
            return BitConverter.ToUInt32(hash, 0);
        }
    }
}

result:

origin: cdkey-637302380103173928-f8830392-f3bf-4e92-aa73-6d8e9e6c0260-1199177810
AsBase64String(): 1Rt69Q==
AsHexString(): d51b7af5
hashArray: 213, 27, 122, 245
--------------------------------------------------
origin: cdkey-637302378177363195-42ce23ac-282f-4a8e-96c7-13a12d58c153-589858
AsBase64String(): 1Rt69Q==
AsHexString(): d51b7af5
hashArray: 213, 27, 122, 245

Outdated CityHash NuGet package with too slow 128-bit implementation

It seems that the System.Data.HashFunction.CityHash NuGet package is outdated. We ran a performance test, and the repository's version of CityHash128 (from the current master branch) performs very well.

But the NuGet package's version of CityHash128 performs very badly. It is even slower than MD5 and also consumes a large amount of memory.

I used the benchmark source code below to compare the performance and memory usage:

using System;
using System.Data.HashFunction.CityHash;
using System.Security.Cryptography;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace TestCityHash
{
    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<CityHash128vs64vs32>();
        }
    }

    [ShortRunJob]
    [MemoryDiagnoser]
    public class CityHash128vs64vs32
    {
        private readonly byte[] _data = new byte[1000];

        private readonly MD5 _md5 = MD5.Create();
        private readonly ICityHash _cityHash128 = HashFactory.CreateCityHash128();
        private readonly ICityHash _cityHash64 = HashFactory.CreateCityHash64();
        private readonly ICityHash _cityHash32 = HashFactory.CreateCityHash32();

        public CityHash128vs64vs32()
        {
            new Random().NextBytes(_data);
        }

        [Benchmark(Baseline = true)]
        public byte[] Md5() => _md5.ComputeHash(_data);

        [Benchmark]
        public byte[] CityHash128() => _cityHash128.ComputeHash(_data).Hash;

        [Benchmark]
        public byte[] CityHash64() => _cityHash64.ComputeHash(_data).Hash;

        [Benchmark]
        public byte[] CityHash32() => _cityHash32.ComputeHash(_data).Hash;
    }

    public class HashFactory
    {
        public static ICityHash CreateCityHash128()
        {
            var config = new CityHashConfig
            {
                HashSizeInBits = 128
            };

            return CityHashFactory.Instance.Create(config);
        }

        public static ICityHash CreateCityHash64()
        {
            var config = new CityHashConfig
            {
                HashSizeInBits = 64
            };

            return CityHashFactory.Instance.Create(config);
        }

        public static ICityHash CreateCityHash32()
        {
            var config = new CityHashConfig
            {
                HashSizeInBits = 32
            };

            return CityHashFactory.Instance.Create(config);
        }
    }
}

I have compared the source code (package version vs branch version) and have to conclude that the CityHash128 implementation from the NuGet package is definitely outdated. Maybe it is possible to publish a new version of the NuGet package that contains implementation from the current master branch? E.g. from v2.0.0 to 2.0.1. That would be great. Thank you!

Cancel ComputeHashAsync

It would be nice if you could pass a CancellationToken into ComputeHashAsync so you can cancel the operation. This would be useful, for example, if you are computing the hash of a huge file and want to provide a way to cancel within your application.
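
A rough sketch of what cancellable hashing could look like from the caller's side. HashAsync is a hypothetical helper, not a library API: it reads the stream in chunks with the framework's Stream.ReadAsync overload that takes a CancellationToken and folds the bytes into a hand-written FNV-1a hash so the example is self-contained.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: chunked read that honours the token on every iteration.
async Task<uint> HashAsync(Stream input, CancellationToken token)
{
    uint hash = 2166136261;                 // FNV-1a offset basis
    var buffer = new byte[8192];
    int read;
    while ((read = await input.ReadAsync(buffer, 0, buffer.Length, token)) > 0)
    {
        token.ThrowIfCancellationRequested();
        for (int i = 0; i < read; i++)
        {
            hash ^= buffer[i];
            hash *= 16777619;               // FNV prime; overflow wraps around
        }
    }
    return hash;
}

using var cts = new CancellationTokenSource();
cts.Cancel(); // simulate the user cancelling before the work starts

try
{
    await HashAsync(new MemoryStream(new byte[1 << 20]), cts.Token);
}
catch (OperationCanceledException)
{
    Console.WriteLine("cancelled");
}
```

Checking the token once per chunk keeps the overhead small while bounding how long a cancel request can go unnoticed.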
