multiformats / multihash Goto Github PK

View Code? Open in Web Editor NEW

879.0 60.0 123.0 942 KB

Self describing hashes - for future proofing

Home Page: https://multiformats.io/multihash/

License: MIT License

Makefile 15.88% Shell 63.55% Python 20.57%

multiformats hash-functions varint protocol multihash

multihash's Introduction

multihash

Self identifying hashes

Multihash is a protocol for differentiating outputs from various well-established cryptographic hash functions, addressing size + encoding considerations.

It is useful to write applications that future-proof their use of hashes, and allow multiple hash functions to coexist. See jbenet/random-ideas#1 for a longer discussion.

Table of Contents
Example
Format
Implementations
Table for Multihash
- Other Tables
Notes
Visual Examples
Maintainers
Contribute
License

Example

Outputs of <encoding>.encode(multihash(<digest>, <function>)):

# sha1 - 0x11 - sha1("multihash")
111488c2f11fb2ce392acb5b2986e640211c4690073e # sha1 in hex
CEKIRQXRD6ZM4OJKZNNSTBXGIAQRYRUQA47A==== # sha1 in base32
5dsgvJGnvAfiR3K6HCBc4hcokSfmjj # sha1 in base58
ERSIwvEfss45KstbKYbmQCEcRpAHPg== # sha1 in base64

# sha2-256 0x12 - sha2-256("multihash")
12209cbc07c3f991725836a3aa2a581ca2029198aa420b9d99bc0e131d9f3e2cbe47 # sha2-256 in hex
CIQJZPAHYP4ZC4SYG2R2UKSYDSRAFEMYVJBAXHMZXQHBGHM7HYWL4RY= # sha256 in base32
QmYtUc4iTCbbfVSDNKvtQqrfyezPPnFvE33wFmutw9PBBk # sha256 in base58
EiCcvAfD+ZFyWDajqipYHKICkZiqQgudmbwOEx2fPiy+Rw== # sha256 in base64

Note: You should consider using multibase to base-encode these hashes instead of base-encoding them directly.

Format

<varint hash function code><varint digest size in bytes><hash function output>

Binary example (only 4 bytes for simplicity):

fn code  dig size hash digest
-------- -------- -----------------------------------
00010001 00000100 10110110 11111000 01011100 10110101
sha1     4 bytes  4 byte sha1 digest

Why have digest size as a separate number?

Because otherwise you end up with a function code really meaning "function-and-digest-size-code". Makes using custom digest sizes annoying, and is less flexible.

Why isn't the size first?

Because aesthetically I prefer the code first. You already have to write your stream parsing code to understand that a single byte already means "a length in bytes more to skip". Reversing these doesn't buy you much.

Why varints?

So that we have no limitation on functions or lengths.

What kind of varints?

A Most Significant Bit unsigned varint (also called base-128 varints), as defined by the multiformats/unsigned-varint.

Don't we have to agree on a table of functions?

Yes, but we already have to agree on functions, so this is not hard. The table even leaves some room for custom function codes.

Implementations

clj-multihash
cpp-multihash
dart-multihash
elixir-multihash
- elixir-multihash
- elixir-multihashing
go-multihash
haskell-multihash
js-multihash
- js-multiformats
- js-multihash (archived)
java-multihash
- multiformats/java-multihash
- comodal/hash-overlay
kotlin-multihash
- kotlin-multihash
- multiformat
net-multihash
nim-libp2p
ocaml-multihash
php-multihash
python-multihash
- multiformats/py-multihash
- ivilata/pymultihash
- multihash sub-module of Python module multiformats
ruby-multihash
rust-multihash
- by @multiformats
- by @google
scala-multihash
swift-multihash
- by @multiformats
- by @yeeth

Table for Multihash

We use a single Multicodec table across all of our multiformat projects. The shared namespace reduces the chances of accidentally interpreting a code in the wrong context. Multihash entries are identified with a multihash value in the tag column.

The current table lives here

Other Tables

Cannot find a good standard on this. Found some different IANA ones:

They disagree. :(

Notes

Multihash and randomness

Obviously multihash values bias the first two bytes. Do not expect them to be uniformly distributed. The entropy size is len(multihash) - 2. Skip the first two bytes when using them with bloom filters, etc. Why not _ap_pend instead of _pre_pend? Because when reading a stream of hashes, you can know the length of the whole value, and allocate the right amount of memory, skip it, or discard it.

Insecure / obsolete hash functions

Obsolete and deprecated hash functions are included in this list. MD4, MD5 and SHA-1 should no longer be used for cryptographic purposes, but since many such hashes already exist they are included in this specification and may be implemented in multihash libraries.

Non-cryptographic hash functions

Multihash is intended for "well-established cryptographic hash functions" as non-cryptographic hash functions are not suitable for content addressing systems. However, there may be use-cases where it is desireable to identify non-cryptographic hash functions or their digests by use of a multihash. Non-cryptographic hash functions are identified in the Multicodec table with a tag hash value in the tag column.

Visual Examples

These are visual aids that help tell the story of why Multihash matters.

Consider these 4 different hashes of same input

Same length: 256 bits

Different hash functions

Idea: self-describe the values to distinguish

Multihash: fn code + length prefix

Multihash: a pretty good multiformat

Multihash: has a bunch of implementations already

Contribute

Contributions welcome. Please check out the issues.

Check out our contributing document for more information on how we work, and about contributing in general. Please be aware that all interactions related to multiformats are subject to the IPFS Code of Conduct.

Small note: If editing the README, please conform to the standard-readme specification.

License

multihash's People

Contributors

Stargazers

Watchers

Forkers

silky greglook tehmaze zabirauf neoteo neocities subtly moreati dignifiedquire candeira ivilata fil btrask parkan mcgppeters pombredanne richardlitt jonnycrunch chriscool benjaminbollen kumavis wires celeduc tabrath richardschneider yangwao jeremybanks domsteil kuldeepkeshwar pgte donaldtsang sramos30 pawel-dubiel nanomobile jiyuanwang johncoffee monadicus jamesray1 smartcontract andreiamatuni ivan386 changjiashuai alvarlaigna ntninja rcy1122 hantengyue mukang dalavancloud holochain decanus itsn1x jayd2446 jasnell arnetheduck arronmabrey quantiply z-figaro leogr syyunn mattnite cnxtech meluwu ogennadi pawelwludyka1981 geoffreyporto mekkanikka terrorizer1980 hussinfaik sts0mrg0 mys-github stevenans66 nachtwaffen crypt0r3n3g4d3 murd4m1773n-m4f14 usapp-stack chevdor isabella232 marcelraschke techiesachin-r pakistanecn1 pikejun cryptomaniak0 laplacekorea crazycupcakes40 bafu fishseabowl positive-spaces-io gema-arta testlabor sg495 fieryswampshire newrain7803 hub500 bellyfat patricoferris luchoturtle nelsonic pinkdiamond1 andre-beautrait beautrait

multihash's Issues

multicodec table and multihash table are out of sync

https://github.com/multiformats/multicodec/blob/master/table.csv has a lot more entries, which includes relevant entries for example for the keccak family

Moving this repo to https://github.com/multiformats/

This repo should probably be moved to: https://github.com/multiformats/

And maybe it should be merged into https://github.com/multiformats/specs

RFC: How should a digest_size > hash function's output be handled?

How should a multihash implementation handle a call where the digest is longer than the specified output of that hash function? E.g. (pseudocode)

// The provided digest is 26 bytes long,
// but SHA1 can't produce a digest longer than 20 bytes
mh = multihash.encode(digest='abcdefghijklmnopqrstuvwxyz', code=Codes['sha1'])

Conversely, is the following multihash valid? (hex encoded, spaces added for legibility)

11 1a 6162636465666768696a6b6c6d6e6f707172737475767778797a

The code field declares the hash function is SHA-1 (0x11). The length field declares the digest is 26 bytes long, and the received digest field is 26 bytes long. However SHA-1 can't produce 26-byte digest.

When a multihash library is called to encode/decode such an input it could:

a. Signal an error, and stop encoding/decoding?
b. Set the length field to len(digest), then continue with processing?
c. Truncate the digest field, before continuing with processing?

The behaviour is currently unspecified, i.e. implementations can do whatever they want. AFAICT go-multihash does b. I'd like to propose that a. as a required behaviour for standards conforming implementations.

What do you think? Have I missed a specification document? Would you prefer I sent this to a mailing list (e.g. ipfs-users)?

RFC: Multihash with size != length(digest)

Hi, this is somehow related with #17, but without regard to the correct digest length produced by a hash function.

The question is how to handle multihash-encoded digests where the encoded length doesn't match the length of the string of bytes coming afterwards. I mostly see that this should be an invalid multihash:

\0x11\0x06abc

since there are only 3 digest bytes where 6 are expected. Maybe we could pad with \x00\x00\x00 but it looks a little bit forced to me.

The real question is how to interpret something like this:

\0x11\0x03abcdef

Is this the multihash \0x11\0x03abc plus trailing debris, and thus invalid as a whole? Should otherwise the whole string be moved around as a complete multihash with 3 useless characters?

The implication here is whether to explicitly and separately store a length with languages (such as Python) where the length of a digest string is readily available and keeping a (matching) length would be redundant.

Question: trying to match hashes using sha256 (apparently I don't have sha2-256)

Simple question. What is the unix command to create this hash from you example?

# sha2-256 0x12 - sha2-256("multihash")
12209cbc07c3f991725836a3aa2a581ca2029198aa420b9d99bc0e131d9f3e2cbe47 # sha2-256 in hex

I'm having trouble finding a match:

echo -n multihash | sha256sum 
9cbc07c3f991725836a3aa2a581ca2029198aa420b9d99bc0e131d9f3e2cbe47  -

Even the length is different. I assume sha256sum is not sha2 ?? The ultimate goal is to hash in JavaScript and match an IPFS link. Thank you..

Numbering of hashes matching IANA

I would very much prefer to have the numbering of the hashes the same as in the IANA registry. If there are some missing, it is easy to add them to the registry.

This would also make it much easier to publish an internet-draft on multihashes, without creating yet another registry.

Fix the table

The hashtable.csv here needs to be updated with the values here: https://github.com/multiformats/multicodec/blob/master/table.csv

Support for Keccak

We need support for base Keccak as well as the already added SHA-3 standard subset of Keccak.
Ethereum calls its hash algo of choice "SHA-3" but was chosen before the standard was settled, and is more accurately called "Keccak-256".

Instance	Definition
eth_sha3(M)	Keccak[256](M, 256)
SHA3-224(M)	Keccak[448](M
SHA3-256(M)	Keccak[512](M
SHA3-384(M)	Keccak[768](M
SHA3-512(M)	Keccak[1024](M
SHAKE128(M, d)	Keccak[256](M
SHAKE256(M, d)	Keccak[512](M

table from sha-3 wiki

Specification of format and behaviours

Thinking about #16 I began writing a spec to document multihash in more detail.

The WIP is in https://github.com/moreati/multihash/tree/spec

Base32 encoding issues

The base32 implementations that you are using does not seem to be the same as https://tools.ietf.org/rfc/rfc4648.txt

Can you give us more informations on the base32 specification that you are following? I would like to add it to https://github.com/mseri/crypto-multihash

please help me-silly or is there a typo?

New to multihash (sounds sound ;) ) so pardon my ignorance and possibly RTFM question.
https://github.com/multiformats/multihash/blob/master/README.md#idea-self-describe-the-values-to-distinguish
has an example for sha3 256 which starts with 1420caad, but according to the multihash table (https://github.com/multiformats/multihash/blob/master/README.md#table-for-multihash-v100-rc-semver) 0x16 sha3-256. So why example doesn't start with 16? (I guessed that those examples are autogenerated so there must be no mistakes/bug, thus I guess it is all my misunderstanding)

Keccak 256 algorithm issue

hi ,
I am using keccak-256 algorithm for generating multi-hash for string "Hello world". It supposed give output as ed6c11b0b5b808960df26f5bfc471d04c1995b0ffd2055925ad1be28d6baadfd ,but it is giving
1b20ed6c11b0b5b808960df26f5bfc471d04c1995b0ffd2055925ad1be28d6baadfd.

1b20 value is prefixed with the expected has-code. i want to know is this intentional and if it so,i want to know where exactly the concatenation is happening.

Two questions: Size is de facto in first byte sometimes? Formal standard?

At ascribe, we'd like to make BigchainDB interoperate with IPFS. One step will be identifying BigchainDB transactions by their multihash. (Currently we assume all hashes are sha3-256, but that's easy enough to change, and it would be nice to allow for future hash functions.)

We currently have two questions about multihash:

The first byte identifies the hash function, so for example 0x16 means sha3-256. The second byte is the digest size in bytes, but wait, isn't that already encoded in the first byte (at least in this case)? Is the second byte as digest size just there for hash functions where the output size must be specified independently (as with SHAKE128)?
Is multihash going to be proposed as a formal standard with some standards body (such as an RFC)?

We created a related issue on the BigchainDB repository: bigchaindb/bigchaindb#100

Why isn't encoding included in the prefix?

This may be a n00b question but shouldn't we include the encoding (hex, Base32, Base58, Base64 etc) too in the prefix to make it fully self-describing? The same hash often needs to be communicated in different encodings because of various constraints (URL safety, storage limits, easier typing by humans, disambiguity while communicating over phone).

Why was encoding format left out? Will this be covered by variants?

Loggable stringified version

In writing CID i wanted a way to expand a CID into a human readable log version, for development. So i wrote this:

This means multihash needs an equivalent version.

@greglook already wrote one over at https://github.com/multiformats/clj-multihash it looks like:

hash:sha2-256:dbd318c1c462aee872f41109a4dfd3048871a03dedd0fe0e757ced57dad6f2d7

My comments on it there: multiformats/clj-multihash#7

Standard Implementation Test Suite

we should have a standard test suite in this repo, to be used across implementations.

Perhaps:

require an implementation can produce an executable we can call to exercise an implementation like the go-multihash bin. (options below)
setup a sharness test suite with clear examples that can test any bin
make it easy for this suite to be included in an implementation to test it.
setup a sharness suite to test all known implementations in this repo.

@chriscool maybe this is something you could help with? o/

Proposed options of the bin:

> multihash --help
usage: multihash [options] [FILE]
Print or check multihash checksums.
With no FILE, or when FILE is -, read standard input.

Options:
  -a, --algorithm string  one of: sha1, sha2-256, sha2-512, sha3 (default: sha2-256)
  -c, --check string      check checksum matches
  -e, --encoding string   one of: raw, hex, base58, base64 (default: base58)
  -l, --length int        checksum length in bits (truncate). -1 is default (default: -1)
  -q, --quiet             quiet output (no newline on checksum, no error text)

Skein hash (for fine-grained uses)

Skein hashes can have arbitrary lengths.
Which means that the same algorithm can be used to generate 256, 512 and 1024 bit hashes.
A longer hash means files are less likely to have hash collisions.
They can also produce shorter hashes for convenience.
It is as fast as BLAKE, faster than all other SHA3 finalists and SHA2.

C# implementation

I've created a new implementation in C# here. Supports all suggested hash algorithms.

Add MD5 (and MD4?)

We should add MD5 and MD4 just because tons of people are still using these and we should help them upgrade. these are non cryptographic hashes now. As per http://multiformats.io/multihash/#what-about-non-cryptographic-hashes-like-murmur3-cityhash-etc

Argon2

Can we have argon2d?

First byte used to represent the encoding or settle on base58bitcoin-multihash?

There probably aren't going to be that many different encodings, however consider if the first byte (shown as an ASCII char in '') represented the encoding:
e.g.

'1' or 0x31 = Binary #Just put this here for fun really
'2' or 0x32 = Hex
'3' or 0x33 = Base32
'4' or 0x34 = Base36 #Case insensitive letters and numbers
'5' or 0x35 = Base58Bitcoin #Guessing bitcoin is the alphabet (e.g. not Ripple)?
'6' or 0x36 = Base64

etc.

So the examples from the page would change to be

# sha1 - 0x11
211140beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 # sha1 in hex
324a0qvp7pqn3y3yvt5egvn3z7hdw4xeuh8tg # sha1 in base32
55dqx43zNtUUbPj97vJhpHyUUPyrmXG # sha1 in base58
6ERQL7se16j8P28ldDdR/PFvCddqKMw== # sha1 in base64

# sha2-256 0x12
212202c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae # sha2-256 in hex
328g2r9nmddmfzhmfz6dmaf0x610k84u25nr690xzm3wrmqm8c9kefbg # sha256 in base32
5QmRJzsvyCQyizr73Gmms8ZRtvNxmgqumxc2KUp71dfEmoj # sha256 in base58
6EiAsJrRraP/Gj/mbRTwdMEE0E0ItcGSDv6D5il6IYmbnrg== # sha256 in base64

Then the multihash would be truly self-describing, instead of settling on what appears to be a de-facto base58 encoding. It would also eliminate the need for the -e parameter to the hashpipe program.

I didn't work through it, but looking over all the currently generated multihashes in existence, it might be possible to implement this in a backwards compatible way, if clever characters are chosen for the different encodings such that they don't intersect with possible numbers in the wild (e.g. 1 is out because hex has it for SHAs, Q is out etc).

That of multihash might just mean base58 in any case either the encoding should be put into the multihash or different encodings probably ought to be dropped altogether to avoid confusion and have a universal scheme, maybe some superset of the multihash called the unihash (then the multihash could be used for hash name that really do contain multiple hashes, something this doesn't quite support, but which could easily).

Support for Tiger Tree Hash and Merkle Hash Tree

@jbenet
The Tiger tree hash is a widely used form of hash tree. It uses a binary hash tree (two child nodes under each node), usually has a data block size of 1024 bytes and uses the cryptographically secure Tiger hash.

Tiger tree hashes are used in Gnutella, Gnutella2, and Direct Connect P2P file sharing protocols and in file sharing applications such as Phex, BearShare, LimeWire, Shareaza, DC++ and Valknut.

Example: RBOEI7UYRYO5SUXGER5NMUOEZ5O6E4BHPP2MRFQ

Short name: TTH
Alternative name: Tree Tiger Hash
URN: urn:tree:tiger:[TTH in base32]
Magnet: magnet:?xt=urn:tree:tiger:[TTH in base32]
Default size: 192 bits
Default encoding: Base32

Guidelines Question: make it easier to upgrade the table

STUB issue. cc @diasdavid @mikolalysenko @chriscool @Kubuxu @lgierth

Haskell imp

By @lukehoersten: https://github.com/LukeHoersten/multihash

Hackage link: https://hackage.haskell.org/package/multihash-0.1.1

varint limitations of multihash

As multihash plans to introduce varints can we agree on some limitations that would allow many implementations to be much simpler (I am looking here at for example C).

My proposal for future:
Use MSB Varint (1 on most significant bit means that there is still data) with limitation of 4 bytes, and MSB of 4th byte unused (and 1 there is invalid). This means that whole varint can be fitted into 31 bits which is nice feature for low level languages.

This gives us: 127^4 == 260,144,641 possible hashing functions and 260144641/2^20 = 248MiB maximum hash length.

cc @jbenet

Mapping to openssl implementations

How exactly does this table hashtable.csv map to the implementation in openssl, as used by say node:
https://stackoverflow.com/questions/14168703/crypto-algorithm-list

[ 'DSA',
  'DSA-SHA',
  'DSA-SHA1',
  'DSA-SHA1-old',
  'RSA-MD4',
  'RSA-MD5',
  'RSA-MDC2',
  'RSA-RIPEMD160',
  'RSA-SHA',
  'RSA-SHA1',
  'RSA-SHA1-2',
  'RSA-SHA224',
  'RSA-SHA256',
  'RSA-SHA384',
  'RSA-SHA512',
  'dsaEncryption',
  'dsaWithSHA',
  'dsaWithSHA1',
  'dss1',
  'ecdsa-with-SHA1',
  'md4',
  'md4WithRSAEncryption',
  'md5',
  'md5WithRSAEncryption',
  'mdc2',
  'mdc2WithRSA',
  'ripemd',
  'ripemd160',
  'ripemd160WithRSA',
  'rmd160',
  'sha',
  'sha1',
  'sha1WithRSAEncryption',
  'sha224',
  'sha224WithRSAEncryption',
  'sha256',
  'sha256WithRSAEncryption',
  'sha384',
  'sha384WithRSAEncryption',
  'sha512',
  'sha512WithRSAEncryption',
  'shaWithRSAEncryption',
  'ssl2-md5',
  'ssl3-md5',
  'ssl3-sha1',
  'whirlpool' ]

I'm not sure which hash function would give a correct hash to label it sha2-256 for instance,
is it RSA-SHA256 sha256 ? etc...

Could you provide any guidance on this?
thnx :-)

Why not use the existing crypt format? $.$

Is this moot for functions whose output already includes these extra params?

Like bcrypt for example where the output already includes the work factor:

$2a$10$...

is it too late for a suffix version?

some things sort by prefixes (i.e. leveldb), and you would maybe rather keep the hashes uniformly distributed (at least you can reason about it).

For example, this would play much nicer with leveldb if it was a suffix not a prefix.

Reference test cases

We've split out our Scala multihash implementation (from L-SPACE) into a little lib: https://github.com/mediachain/scala-multihash

I'd love to add some tests before pushing this up to maven, but it seems silly to reinvent the wheel. Are there any reference test cases I can use to verify our implementation? A csv of input bytes, algo, output bytes seems ideal.

Non-cryptographic hashes

What do we do about non-cryptographic hashes, such as these? Do we include them in the same table?

I'm leaning towards yes.

sha3/keccak variants and clarification

Only one sha3 is in the multihash list and it's size cannot be used to infer it's type. sha3-256 can be used and truncated to 128-bits and sha3-512 truncated to 256-bits. Here is a snippet from Mar-2014 NIST document:

The four SHA-3 hash functions are defined from the KECCAK[c] function specified in Sec. 5.2 by appending two bits to the message and by specifying the length of the output, as follows:
SHA3-224(M) = KECCAK[448] (M || 01, 224); 
SHA3-256(M) = KECCAK[512] (M || 01, 256); 
SHA3-384(M) = KECCAK[768] (M || 01, 384); 
SHA3-512(M) = KECCAK[1024] (M || 01, 512).
In each case, the capacity is double the digest length, i.e., c = 2d. The two bits that are appended to the message (i.e., 01) support domain separation; i.e., they distinguish the messages for the SHA-3 hash functions from messages for the SHA-3 XOFs discussed in Sec. 6.2, as well as other domains that may be defined in the future.

Java 9 & Bouncy Castle Digests <> Multihash Supported Library

I've added decoding support to my Java message digest library hash-overlay. There are no complete encoding utilities (intentional), but there are functions for varint encoding, getters for multihash function codes, and getters for digest lengths.

You can quickly visualize all of the supported digest functions here. And the intersection of multihash functions here.

Opening this issue to see if there is any interest in adding this implementation to this projects readme.

Implementations should allow locking hashes

Multihash already allows users to give their own multihash table, and thus they can already "lock" the library to only accept a set of them. But perhaps this isn't messaged very clearly.

Multihash implementations should allow locking to a certain set of hashes, and make this very clear for developers so that they do it. Otherwise -- in some kinds of applications that accept things from the wire and just check for self-consistent validity -- people may be able to force weak/non-crypto hashes.

We should recommend that implementations have something like:

multihash.Configure(multihash.Config{
  Hashes: map[string]int{ // int is minimum size allowed.
    "sha2-256":    224
    "sha2-512":    224,
    "sha3-512":    224,
    "sha3-224":    224,
    "sha3-256":    224,
    "sha3-384":    224,
    "sha3-512":    224,
    "blake2b-256": 256,
    "blake2b-512": 256,
})

Or could even take a set of ranges

Examples are incorrect/incorrectly labeled

QmRJzsvyCQyizr73Gmms8ZRtvNxmgqumxc2KUp71dfEmoj # sha256 in base58

That's the base58 hash of "foo" not "sha256"
Perhaps the intention was it's a sha256 hash of "foo" in base58.

Nicer Readme + github pages?

Would be nice to have a nicer readme + github pages-- like http://jbenet.github.io/hashpipe/

Parameterized hashing

Different types of hashing may require parameters not currently specified, for example Argon2 has parameters for iterations, memory usage and parallelism. On the one hand, such parameters could be viewed like a salt and treated out-of-band. However, swapping out Argon2 with a hypothetical Argon3 would be much easier if such parameters were treated in-band - similar to length.

As one cannot anticipate the needs of future hash functions, I believe this would require a new arbitrary length <params> such fields would require a new <params> section. Thoughts?

Support bcrypt

bcrypt is currently one of the best options for storing passwords. Coda Hale explains it better than I could.

Can multi hash be compatible with RFC 6920

Just wondering if the devs here have seen the work at the IETF:

https://tools.ietf.org/html/rfc6920

Could the two efforts be compatible?

Shouldn't type (and perhaps size) be extensible in order to future proof?

It seems that the current table of types https://github.com/multiformats/multihash/blob/master/hashtable.csv has started to grow quite rapidly. If the choice of 0xb2 is reversed then it would be possible to use the high-bit with a UTF-8-like scheme to extend to multiple bytes as necessary, but obviously many other extension schemes would be possible.

fncode with size of the hash output is not unique

https://github.com/multiformats/multihash/blob/master/hashtable.csv says:
blake2s-256 0xb260
blake2b-256 0xb220

blake2s-256 and blake2b-256 for the same input generate different hashcodes of the same size but have the same fncode 0xb2 and different digsize than their lengths which are both 0x20.

This may be ok for people who store a table of 2 byte codes that each map to a function name and output size, but it does not work for my usecase of only fncode and store size in my own data format which is more general than hashes. I therefore cant use fncode with the actual output size because its not unique and will have to make my own table of fncodes.

Do the hash functions have more parameters than I'm considering? Why is it done this way? Or is the csv file wrong?

Add the "Identity" hash function

sometimes the values to hash are smaller than a cryptographic hash digest, and merits just actually storing the data itself as the hash value. (Example use case: a multihash of an ed25519 256bit pub key [0]).

Implementation:

add a code for the "identity" hash function. (Could use 0x10 and don't add sha0, or 0x50)

Pros:

can store ECC pub keys easier and use them in projects that use multihashes as IDs (IPFS)
can store values smaller than the hash ("a type")

Cons:

this worsens the "keep things uniform" expectation people have from hashes

[0]:

"On the other hand, elliptic-curve points are typically sent as 256 bits, and then a 256-bit hash doesn't save any space. Sending R instead of H(R,M) also has the advantage of allowing batch signature verification, which is typically twice as fast as verifying one signature at a time. That's why Ed25519 doesn't use this compression." -- from http://blog.cr.yp.to/20140323-ecdsa.html

SHAKE128/256 are not fully specified

According to the Wikipedia page on SHA-3 SHAKE128/256 are parameterised functions that need to have an output digest length specified. As of now I'm unsure how they are actually meant to work with multihash (~~I will try and look into existing libraries and list what they do soon~~ done); potentially the digest length given with the multihash could be used, but that should probably be specified explicitly since it's a special case compared to how the length would be used with other hash types.

Similarly adding support for other Keccac variants as proposed in #54 may have the same issue since Keccac is designed to have an arbitrary length output.

Clojure implementation

Hi @jbenet, I've implemented multihash in Clojure if you'd like to add it to the list in the README:
https://github.com/greglook/clj-multihash

The motivation was another project of mine which shares some ideas with IPFS. After reading through the spec and learning more about it, I think I may just drop the blob layer and use IPFS instead. :)

Future proofing length

Right now the multihash standard can only handle hash lengths of up to 255 bytes (2040 bits).

If we use a varint format to store the length of the hash, we can handle hashes of any lengths (essentially future-proofing the standard against increases in hash length)

using a "top bit = 1 -> more bytes follow && top bit = 0 -> last byte" approach means that all current hashes would still be valid despite the change in standard.

Swift multihash reimplementation/update

The current SwiftMultihash implementation looks to be out of date (e.g. no varint encoding). It also doesn't seem to use the Swift Package Manager.

I actually had started writing my own over at https://github.com/cloutiertyler/swift-multihash before I found that one. I've tried to use the most idiomatic Swift possible. I will be actively developing mine further and I was hoping I might be able to have it moved under multiformats if it makes sense.

I'll be adding more test cases and a README.md tomorrow.

Murmur3-128 output differs between 32-bit and 64-bit implementations by design

The current version is MurmurHash3, which yields a 32-bit or 128-bit hash value. When using 128-bits, the x86 and x64 versions do not produce the same values, as the algorithms are optimized for their respective platforms.

Source: Wikipedia

I think this is a major issue for a hashing function for multihash. I believe the idea is for the hash to be reproducible anywhere.

This could be fixed by splitting the Murmur3-128 multihash code into two, whether the 32 bit or 64 bit implementation is to be used.

Alternatively, Murmur could be removed, as it's an unsafe, non-cryptographic hash function, and I am not aware of anything that depends on a multihash library and uses murmur.

http://bench.cr.yp.to/primitives-hash.html