well-typed / cborg

Binary serialisation in the CBOR format

Home Page: https://hackage.haskell.org/package/cborg
At https://github.com/well-typed/binary-serialise-cbor/blob/master/Data/Binary/Serialise/CBOR/Class.hs#L202 the recent PR to add more strictness (#5) failed to add the Left and Right constructors back in when decoding Eithers. I haven't tested it, but I expect that at best this will always cause decoding to fail, and at worst it could cause an infinite loop, though that's probably unlikely (the Either instance relies on its own instance recursively, and both x's have type Either a b).
Example from Travis:
https://travis-ci.org/well-typed/binary-serialise-cbor/jobs/117276121
The culprit seems to be the introduction of orI# (which wasn't there in GHC 7.6) and changes to comparisons (e.g., ># used to return Bool in GHC 7.6, but now returns Int#). I guess the main question is: do we want to support GHC 7.6? (Since the 8.0 release is soon, the last three GHC versions would be 7.8, 7.10 and 8.0.) If yes, then we'll probably need a bunch of ifdefs...
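If we do go that route, here's a minimal sketch of the kind of compatibility shim it would involve; the Compat module name, the version cut-off and the GHC 7.6 fallback below are assumptions for illustration, not code from the repository:

```haskell
{-# LANGUAGE CPP, MagicHash #-}
module Compat (isTrue#) where

-- On GHC >= 7.8 primop comparisons such as (>#) return Int#, and GHC.Exts
-- provides isTrue# to turn that back into Bool. On GHC 7.6 they still return
-- Bool, so we supply an identity fallback under the same name.
#if __GLASGOW_HASKELL__ >= 708
import GHC.Exts (isTrue#)
#else
isTrue# :: Bool -> Bool
isTrue# b = b
#endif
```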
Even though these are just demos, they can be useful utilities in their own right. Names like dump-cbor and json2cbor would make this more apparent.
There are a few CBOR spec extensions we might want to consider writing.
The purpose would be to let generic CBOR tools decode some of the things we might want to use.
Before we release this package publicly, I think @dcoutts and I both agree the name is kind of a mouthful. It would be nice to settle on a shorter name and module namespace before the public release, rather than break everyone and make them mad later.
This should be considered low priority (since, as stated, the eventual plan is for this to become binary itself), but it would be nice to think about.
The only reason the library internally depends on binary is for the Decoder type. Given that we eventually want to replace binary altogether, I'm guessing we should probably get rid of this and use our own inline version. This will force users to have a separate (or rewritten) code path, but I imagine that's an overall small cost given the large API change. I'd guess we might as well force people into it: if they're going to use the new interface, then they need a new Decoder.
/cc @dcoutts
We have an incremental API already but it's incremental in this sense:
Note that for output, we still need to supply the whole value to serialise (though it need not be fully evaluated) and for input we only get the whole value back at the end, not bit by bit.
In the general case it's a bit tricky to do much better than this (think big complicated tree-shaped data), but for files that are basically sequences of values then we should be able to get decoded elements one by one, or supply output elements one by one.
This is indeed possible and people are doing this already. The goal here is to provide something in the library that makes this more convenient.
See these two existing examples https://gist.github.com/dcoutts/798812e040a61ad969c27a45549943c0
One issue is putting a proper CBOR list header and footer in the file, so that it's not just a sequence of top level CBOR values (which is technically allowed by the standard but isn't well supported by existing tools). Another question is if there's any way to support variations like file headers as many real-world use cases would need some header info before a sequence, or perhaps multiple sequences.
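As a rough sketch of the "list header and footer" half of this, assuming the encodeListLenIndef and encodeBreak combinators from the Encoding module and toLazyByteString from the Write module (serialiseSequence is a hypothetical name, not part of the library):

```haskell
import Data.Monoid ((<>))
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Serialise.CBOR.Class (Serialise(..))
import Data.Binary.Serialise.CBOR.Encoding (encodeListLenIndef, encodeBreak)
import Data.Binary.Serialise.CBOR.Write (toLazyByteString)

-- Wrap the whole (lazily produced) sequence in one indefinite-length CBOR
-- list, so the file is a single top-level CBOR value rather than a bare
-- sequence of values.
serialiseSequence :: Serialise a => [a] -> BL.ByteString
serialiseSequence xs =
  toLazyByteString (encodeListLenIndef <> foldMap encode xs <> encodeBreak)
```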
As discussed at the Haskell Boston meetup, the code contains both spellings of the word and there needs to be an issue and a pull request to fix that.
Currently only the first one is dumped, without an indication that the file has more content.
One of the nicest things about this library is that the Encoding type is really just a "deep embedding", or as we like to call them: a syntax tree. This allows a variety of 'fun' things. When you get an Encoding, that's really something like a function Tokens -> Tokens, and you apply a TkEnd to get a Tokens you can traverse over recursively.
It would be neat if we had a way to 'pretty print' this Encoding into a representation of what will actually be a CBOR value. This would be quite useful for visualising how a particular data type would be encoded into CBOR, or merely how it's structurally represented internally.
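As a very rough sketch of the idea (not the library's API: the Tokens constructor names and field layouts below, and the constructor export of Encoding, are assumptions about the internal representation), a pretty-printer could look something like:

```haskell
import Data.Binary.Serialise.CBOR.Encoding (Encoding(..), Tokens(..))

-- Walk the token stream behind an Encoding and render one line per token.
prettyTokens :: Tokens -> [String]
prettyTokens ts = case ts of
  TkInt     n rest -> ("int " ++ show n)            : prettyTokens rest
  TkListLen n rest -> ("list of length " ++ show n) : prettyTokens rest
  TkString  s rest -> ("text " ++ show s)           : prettyTokens rest
  TkEnd            -> []
  _                -> ["<other token>"]  -- remaining constructors elided here

prettyEncoding :: Encoding -> [String]
prettyEncoding (Encoding f) = prettyTokens (f TkEnd)
```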
There's a skeleton module in the repository right now. It should be filled out with many examples and wonderful prose.
The code below fails with the message Data.Binary.Serialise.CBOR.deserialise: failed at offset 0 : expected null
deserialise $ serialise (TInt 1)
For testing I replaced my Binary instances with Serialise from this package and serialization performance was 10-20x slower. I narrowed it down to UTCTime. Replacing that with a dummy empty serialiser speeds up my code from 3 seconds to 150 milliseconds.
Instance in question:
Hey guys!
I'm trying (perhaps with a bit of foolish Xmas spirit!) to install IRIS' big-kitchen-sink-restful-server using GHCJS, and as it depends upon binary-serialise-cbor I'm hitting an issue similar to the hashable one, which has to do with unboxed constructors:
Data/Binary/Serialise/CBOR/ByteOrder.hs:202:46:
Couldn't match expected type ‘Word#’ with actual type ‘Word64#’
In the first argument of ‘wordToFloat64#’, namely ‘w#’
In the first argument of ‘D#’, namely ‘(wordToFloat64# w#)’
Data/Binary/Serialise/CBOR/ByteOrder.hs:209:40:
Couldn't match expected type ‘Word64#’ with actual type ‘Word#’
In the third argument of ‘writeWord64Array#’, namely ‘w#’
In the expression: writeWord64Array# mba# 0# w# s'
I suspect the problem here lies in the different word sizes between the native and JS worlds. Do you guys think it's possible to issue a patch with a sane dose of CPP to make the package buildable on GHCJS? 😉 Is it even possible at all?
Thanks a ton!
A.
According to the RFC, encoded UTC time values can use tag 0 to represent a serialized string in ISO-8601 format, or tag 1 to represent a numeric value: the number of seconds elapsed since the UNIX epoch.
As of today, we don't handle tag 1. That should be fixed before release (and is probably fairly easy).
See also #51 - it's important to make sure this case is efficient as well (and that's probably easier than the tag 0 case).
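A hedged sketch of what handling both tags could look like (this is not the shipped instance; decodeTag, decodeString and decodeInteger are the Decoding-module primitives, time >= 1.5 is assumed, and the ISO-8601 format string and integer-only tag 1 handling are simplifications):

```haskell
import qualified Data.Text as T
import Data.Time (UTCTime, defaultTimeLocale, parseTimeM)
import Data.Time.Clock.POSIX (posixSecondsToUTCTime)
import Data.Binary.Serialise.CBOR.Decoding
  (Decoder, decodeTag, decodeString, decodeInteger)

decodeUTCTime :: Decoder UTCTime
decodeUTCTime = do
    tag <- decodeTag
    case tag of
      -- Tag 0: ISO-8601 text (the format string here is illustrative only).
      0 -> do txt <- decodeString
              case parseTimeM False defaultTimeLocale "%FT%T%QZ" (T.unpack txt) of
                Just t  -> return t
                Nothing -> fail "invalid ISO-8601 date"
      -- Tag 1: seconds since the UNIX epoch (the RFC also allows a float
      -- here; only the integer case is sketched).
      1 -> posixSecondsToUTCTime . fromInteger <$> decodeInteger
      _ -> fail "expected tag 0 or 1 for UTCTime"
```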
Every instance we have should have a nice little microbenchmark to test things like encoding and decoding speed for individual cases.
Larger macro benchmarks will help catch real regressions (even when a change might otherwise look good locally), but the per-instance micro benchmarks are still useful for guiding local optimizations.
/cc #15
The Decoder type in the .Decoding module has an unfortunate naming conflict with the Decoder type from binary, which is mighty confusing. It's definitely confusing to me, considering they have nothing in common.
It wouldn't be a big deal if this type wasn't exposed, but it needs to be exposed for the tests.
It would probably be worth renaming this to avoid any potential confusion or whatever (hopefully with a low amount of bikeshedding).
I'm a bit confused by the CBOR.Term module. It seems to be used only for testing that the optimised and reference implementations do the same thing. But then I'd expect it to be inside the tests/ directory (instead of being exposed). If it's supposed to be used for testing by users too, then it's still confusing, because there's CBOR.FlatTerm, which seems to do that.
Another competitor has appeared!
(more along the lines of read/show, but intriguing nonetheless!)
We have
when (fromIntegral n /= nF + 1) $
  fail $ "Wrong number of fields: expected="++show nF++" got="++show n
leading to error messages such as
Wrong number of fields: expected=14 got=14
because the check compares n against nF + 1 while the message prints nF, so the two numbers shown can match even though the check failed; printing nF + 1 as the expected count would make the message consistent with the check.
These properties need to be moved out of the test suite, since they're not really dependent on QuickCheck, and moved into the actual package itself.
These three properties are very convenient for any client user, because they can immediately add them as QuickCheck tests to their own test suite, for their own data type.
A serialise roundtrip will help catch any kind of weird bugs they may want to report to us. This should be fixed, and a note added to the tutorial encouraging users of this library to add these properties to their own test suites.
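For instance, a hedged sketch of the kind of roundtrip property meant here, as a user might add it for their own type (prop_roundtrip is a hypothetical name, not one of the properties being moved):

```haskell
import Test.QuickCheck (Property, (===))
import Data.Binary.Serialise.CBOR (serialise, deserialise)
import Data.Binary.Serialise.CBOR.Class (Serialise)

-- Serialising and deserialising any value should give the value back.
prop_roundtrip :: (Serialise a, Eq a, Show a) => a -> Property
prop_roundtrip x = deserialise (serialise x) === x
```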
Demonstrations are cool. Everyone loves demonstrations. We should have a lot more of them.
In particular, almost all of the user-facing API should either have Haddocks or, if it's not generally useful, be made private to the module. I've mostly gotten us there, but it'll need a few passes to finish up.
Several of the APIs we have return or consume lazy ByteStrings, but many common cases or APIs involve one-shot (de)serialization of values with strict ones. Having to import and use toStrict from bytestring is a really common annoyance for a lot of people (even if I can live with it), so it would be nice to avoid it.
We might want to change some of the naming of the exposed APIs a bit in order to accommodate this, too.
This is just a suggestion, and not at all an area I'm an authority on. But I wonder what a deserialization library would look like if implemented using streaming-bytestring, whose README describes it as "lazy ByteString done right". Basically, it's implemented as a monad transformer, and thus readFile can actually perform (strict) IO rather than being hidden under unsafeInterleaveIO.
After a chat with @dcoutts earlier, there could probably be a bit of restructuring done in the package and renaming to make things a bit more fluid and consistent (exposing lazy vs strict interfaces, module hierarchy and module naming, etc).
We'll probably talk a bit more about this later; consider this issue a place holder.
This should be considered a meta ticket for tracking the inevitable replacement of binary, and what our plans should be in accomplishing that.
This package needs some love in the form of more instances for the Serialise class, which is currently somewhat lackluster. Adding a billion instances is what I like to call an "Instance Bonanza", as it goes on for a while.
Essentially, anything within the scope of the Haskell Platform is probably fair game.
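For illustration, here's a hedged sketch of what one such instance could look like, written against the Encoding/Decoding primitives (whether Ordering specifically is already covered isn't checked here, and a real instance would validate the decoded value):

```haskell
import Data.Binary.Serialise.CBOR.Class (Serialise(..))
import Data.Binary.Serialise.CBOR.Encoding (encodeInt)
import Data.Binary.Serialise.CBOR.Decoding (decodeInt)

-- Encode an Ordering as a small int via its Enum instance.
instance Serialise Ordering where
  encode = encodeInt . fromEnum
  decode = toEnum <$> decodeInt  -- a real instance should range-check this
```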
I'm able to build the library fine using both the cabal and stack standalone build commands, but when trying to load the library into a GHCi session I see the following problem with the C preprocessor. Not sure if this is a binary-serialise-cbor problem or a problem upstream.
/home/sdiehl/Git/alpha-sheets/backend/server/bench/serialise/.stack-work/downloaded/5f34802a2e7a77d7e5b4b1c421cd4b93d91919f32c47b88fa95e3355911096b5.git/.stack-work/dist/x86_64-linux/Cabal-1.22.5.0/build/autogen/cabal_macros.h:178:0:
warning: "CURRENT_PACKAGE_KEY" redefined [enabled by default]
#define CURRENT_PACKAGE_KEY "binar_0SlfD4kPaIaKT7PcGaBUM0"
^
In file included from <command-line>:10:0:
/home/sdiehl/Git/alpha-sheets/backend/server/bench/serialise/.stack-work/dist/x86_64-linux/Cabal-1.22.5.0/build/autogen/cabal_macros.h:157:0:
note: this is the location of the previous definition
#define CURRENT_PACKAGE_KEY "bench_2neKwoSHCjAHYmwpaCnoJF"
^
/home/sdiehl/Git/alpha-sheets/backend/server/bench/serialise/.stack-work/downloaded/5f34802a2e7a77d7e5b4b1c421cd4b93d91919f32c47b88fa95e3355911096b5.git/Data/Binary/Serialise/CBOR/Class.hs:47:0:
error: missing binary operator before token "("
#if MIN_VERSION_time(1,5,0)
^
/home/sdiehl/Git/alpha-sheets/backend/server/bench/serialise/.stack-work/downloaded/5f34802a2e7a77d7e5b4b1c421cd4b93d91919f32c47b88fa95e3355911096b5.git/Data/Binary/Serialise/CBOR/Class.hs:260:0:
error: missing binary operator before token "("
#if MIN_VERSION_time(1,5,0)
^
phase `C pre-processor' failed (exitcode = 1)
<no location info>:
Could not find module ‘Data.Binary.Serialise.CBOR’
It is a member of the hidden package ‘binary-serialise-cbor-0.1.1.0@binar_0SlfD4kPaIaKT7PcGaBUM0’.
The linked code fails deterministically with GHC 7.10.3 on both MacOS and Linux. It seems very much like a GHC bug, probably something to do with fusion, but I'm neither an expert on fusion nor on your library, so you might have a better idea of where to search.
The generic serialisation and deserialisation have a special case for single-constructor-single-field data types (the instance for GSerialiseEncode (K1 i a)), but do not introduce a special case for single-constructor-multiple-field types (the instance for GSerialiseEncode (f :+: g)). This means that if you have something like
data Foo = Foo {
      some   :: ..
    , record :: ..
    , type   :: ..
    , with   :: ..
    , lots   :: ..
    , of     :: ..
    , fields :: ..
    }
and we serialize a bunch of these, every one will have an unnecessary extra tag field.
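For comparison, a hedged sketch of a hand-written instance for a single-constructor record that omits that extra tag; Foo' and its fields are hypothetical stand-ins for the record above, and the combinators are the usual Encoding/Decoding primitives:

```haskell
import Control.Monad (when)
import Data.Monoid ((<>))
import Data.Text (Text)
import Data.Binary.Serialise.CBOR.Class (Serialise(..))
import Data.Binary.Serialise.CBOR.Encoding (encodeListLen)
import Data.Binary.Serialise.CBOR.Decoding (decodeListLen)

data Foo' = Foo' { fooName :: Text, fooCount :: Int }

-- Just a two-element list of the fields; no constructor tag is needed,
-- because there is only one constructor.
instance Serialise Foo' where
  encode (Foo' n c) = encodeListLen 2 <> encode n <> encode c
  decode = do
    len <- decodeListLen
    when (len /= 2) $ fail "Foo': expected a 2-element list"
    Foo' <$> decode <*> decode
```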
We've been using CBOR in our projects for some time with good success, but one thing @Oblosys noted is that demo-dump-cbor could be much, much more generally useful for all kinds of stuff.
This is a meta-ticket to keep track of various improvements (notably the ones originally filed by @Oblosys):
and more. Please submit useful suggestions here and we can divvy them up or discuss them; for now I just wanted a place to keep track of overall improvements.
Test case:
module Main where

import Control.Monad
import qualified Data.Vector.Storable as S
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Serialise.CBOR as CBOR

main :: IO ()
main = do
  -- Split strict bytestring into chunks and return as lazy one
  let evilChunker i bs = let (b1,b2) = BS.splitAt i bs
                         in BL.fromChunks [b1,b2]
  -- Test case
  let ann = [S.replicate 128 (0::Double)]
      bs  = (BS.concat . BL.toChunks . CBOR.serialise) ann
  forM_ [1 .. BS.length bs - 1] $ \i -> do
    print i
    print $ ann == CBOR.deserialise (evilChunker i bs)
Output
GHCi, version 7.10.3: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( testcase.hs, interpreted )
Ok, modules loaded: Main.
*Main> :main
1
True
2
*** Exception: DeserialiseFailure 3 "expected list len"
*Main>
I think it happens when the list header gets split between chunks. I ran into this bug when trying to compress serialized data using gzip.
What it says on the tin. I'll make notes of things as I touch them up, but they should all roughly be addressed before the initial release.
This will make it easier to use on tfidf .tfs files.
I initially had several UTCTimes and ran into #51, observing cbor 10x slower than cereal. After removing them from the test case, the performance of this package and cereal appears about the same (~17 us for a roundtrip, with cereal being faster at deserializing and this package faster at serializing).
Here's my test case (which is a cleaned-up version of one of the ADTs we serialize in our app, with UTCTimes replaced by Text).
instance CBOR.Serialise PPTS
instance CBOR.Serialise AMs
instance CBOR.Serialise AM
instance CBOR.Serialise SM
instance CBOR.Serialise CH
instance CBOR.Serialise RMs
instance CBOR.Serialise RM
instance CBOR.Serialise Im
instance CBOR.Serialise VDs
newtype SM = SM { _sm :: (HS.HashSet Text) }
  deriving (NFData, Eq, Show, Generic)
newtype CH = CH { _ch :: Maybe Text }
  deriving (NFData, Eq, Show, Generic)
newtype RMs = RMs { _rm :: [RM] }
  deriving (NFData, Eq, Show, Generic)
data Im = Im (HM.HashMap Text Text) VDs
  deriving (NFData, Eq, Show, Generic)
newtype VDs = VDs { _vlsL :: HM.HashMap Text (Text,Text) }
  deriving (NFData, Eq, Show, Generic)
data RM = RM Text [Text] [Text] Text
  deriving (NFData, Show, Eq, Generic)
data PPTS = PPTS SM CH Im RMs AMs (Maybe Text)
  deriving (NFData, Eq, Show, Typeable, Generic)
data AM = AM Text Text [Text] Text Int
  deriving (NFData, Show, Eq, Generic)
newtype AMs = AMs [AM]
  deriving (NFData, Eq, Show, Generic)
-- make this UTCTime for a real test case:
fakeTime = "asdf-2345234-sasdf UTC"
ppts =
PPTS
(SM (HS.fromList ["asdf", "2345234 23452345", "asdfasdf", "2345"]))
(CH $ Just "he dfdfdfdf dfdfddf llp")
(Im (HM.fromList [("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f"),("sasd 5555555987","dff f")]) (VDs (HM.fromList [ ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")), ("sasd 55 344455555987",("dfffffffffff f","asdfasdfdddf")) ])))
(RMs [
RM (fakeTime) ["hello", "world"] ["hedddddddddddddddddddddddddddlp"] "somenaml"
, RM (fakeTime) ["asdfasdfhello", "world"] ["hedddddddddddddddddddddddddddlp"] "somenaml"
, RM (fakeTime) ["hello", "wasdfdforld"] ["hedddaddfdfdddddddddddddddddddddddddlp"] "somenaml"
, RM (fakeTime) ["he444444444444444asdkhjasdfllo", "world"] ["hedddddddddddddddaddfdfdddddddddddddlp"] "somenaml"
])
(AMs [
AM (fakeTime) "fffffffffffffffffffffffff" ["s","adfasdfasdfasdf"] "hellpasd akjha" 9874587484845
, AM (fakeTime) "fffffffffffffffffffffffff" ["s","adfasdfasdfasdf"] "hellpasd akjha" 9874345484845
, AM (fakeTime) "fffffff asdf ffffffffffffffffff" [] "hellpasd akjha" 3434345484845
])
Nothing
-- and instance for `cereal`:
instance C.Serialize PPTS
instance C.Serialize AMs
instance C.Serialize AM
instance C.Serialize SM
instance C.Serialize CH
instance C.Serialize RMs
instance C.Serialize RM
instance C.Serialize Im
instance C.Serialize VDs
instance (Eq a, Hashable a, C.Serialize a) => C.Serialize (HS.HashSet a) where
  put = C.put . HS.toList
  get = HS.fromList <$> C.get
instance (Eq a, Hashable a, C.Serialize a, C.Serialize b) => C.Serialize (HM.HashMap a b) where
  put = C.put . HM.toList
  get = HM.fromList <$> C.get
instance C.Serialize Text where
  put = C.put . T.encodeUtf8
  get = T.decodeUtf8 <$> C.get
If I change the HashMap and HashSet to lists in the usual way, performance looks like this (for serialization + deserialization):
cereal: 23.15 μs + 19.54 μs
cbor: 17.17 μs + 19.30 μs
That's as far as I could justify tweaking the type. We're struggling with serialization performance but don't have the time to write and test definitions like these by hand: https://github.com/well-typed/binary-serialise-cbor/blob/master/bench/versus/Macro/CBOR.hs
Let me know if I'm missing something obvious, but otherwise I hope the above is a useful test case. Thanks for your work on this package!
E.g. JSON, hexadecimal, ..
See here: https://github.com/ondrap/binary-serialise-cbor/commit/c7c21405b25c54a5babdf7d0521d31f1dc1f0b30 - the code quite obviously doesn't compile on 32 bit platforms yet.
I was running the serial-bench tests with CBOR commit ab0f193, and this turned up:
test/Spec.hs:22:
1) cbor/cbor
Falsifiable (after 27 tests and 4 shrinks):
expected: Just [SomeData 53169 70 55.3817683321392]
but got: Just [SomeData (-12367) 70 55.3817683321392]
[ArbSomeData {toSomeData = SomeData 53169 70 55.3817683321392}]
As noted on the tin. This will help prevent issues like #67 from cropping up in the future.
Unfortunately I don't think we'll be able to do this easily on AppVeyor, meaning we can't cover the 32-bit/LLVM codegen combo. But the number of 32-bit-specific cases is relatively small, so this might be OK.
This is mostly a brain dump for an idea in the interpreter that I want to try out.
The goal is to support faster boxed Vector/Array (and similar types) and also unboxed Vector/Array.
For the boxed case we need to be able to do certain key things in the ST monad. This can be purely in the interpreter; it does not have to leak into the decoders. We need to be able to allocate an ST array, write elements into it and freeze the result. The idea is to add Decoder constructors to instruct the interpreter to do this. The interpreter would probably use a list-like stack structure to hold the arrays. The interpreter would have to move into ST.
For the unboxed case while it would be possible to take a similar approach, it may be better to instead tell the interpreter "please decode 37 floats now", and have the interpreter do that directly. In much the same way as it currently decodes byte strings or text strings (which are unbounded length encodings obviously).
One interesting challenge is efficiently decoding unboxed Vector-style parallel arrays, i.e. when you've got something like Vector (Int, Float), it's really represented as an array of Int and an array of Float. It would be faster to encode these things similarly in CBOR as a pair of arrays, each containing just the one type. This would then let us use the above fast decoding of arrays of primitive types.
For example one of the micro-benchmarks is to encode/decode 1000 records with the structure (Int64, Word8, Double). If we're not trying to stream this then the best rep would of course be the parallel array style, and we could likely beat a fast implementation of the regular sequence-of-records style.
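A hedged sketch of the pair-of-arrays idea for the Vector (Int, Float) case; encodeVectorOfPairs is a hypothetical name, the two-element outer list is just one possible framing, and the usual Encoding combinators (encodeListLen, encodeInt, encodeFloat) are assumed:

```haskell
import Data.Monoid ((<>))
import qualified Data.Vector.Unboxed as U
import Data.Binary.Serialise.CBOR.Encoding
  (Encoding, encodeListLen, encodeInt, encodeFloat)

-- Encode a vector of pairs as two homogeneous arrays (all the Ints, then
-- all the Floats) instead of one array of heterogeneous pairs.
encodeVectorOfPairs :: U.Vector (Int, Float) -> Encoding
encodeVectorOfPairs v =
     encodeListLen 2
  <> encodeListLen len <> U.foldr (\(i, _) e -> encodeInt   i <> e) mempty v
  <> encodeListLen len <> U.foldr (\(_, f) e -> encodeFloat f <> e) mempty v
  where
    len = fromIntegral (U.length v)
```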
I noticed something like 2GB resident and a minute to compile a single module, which I believe was https://github.com/well-typed/binary-serialise-cbor/blob/master/bench/Macro/PkgCereal.hs (I'll double check this).
It's still pretty bad with GHC 7.8, but the compile times for vs-other-libs seem noticeably quicker. We should quantify these slowdowns and investigate a little before reporting bugs upstream.
Given that there are Generic default methods for the Serialise typeclass, the MINIMAL pragma is useless and only serves to generate warnings (if there isn't a Generic instance, it will fail with a type error). The same problem existed in aeson: haskell/aeson#290 or https://ghc.haskell.org/trac/ghc/ticket/10959
These apparently exist but aren't used.
As of 40909fc, on Windows 10/GHC 7.8.4 64bit:
$ cabal configure --enable-tests --enable-coverage
$ cabal test
$ hpc report ...
74% expressions used (4702/6332)
31% boolean coverage (30/95)
27% guards (23/83), 30 always True, 1 always False, 29 unevaluated
58% 'if' conditions (7/12), 2 always True, 1 always False, 2 unevaluated
100% qualifiers (0/0)
65% alternatives used (606/931)
79% local declarations used (49/62)
76% top-level declarations used (234/305)
Or:
The binary package has two major use cases:
Currently, this library implements 1. We have not implemented 2, nor have we put much thought into a design for a faster implementation of 2 than what binary offers. It might look something like a .Read module in the CBOR case. Or not.
CBOR directly supports decimal and binary fractions. These are numbers represented as x*10^e or x*2^e. The mantissa x can be a positive or negative small int or a CBOR big int. The exponent e can only be a positive or negative small int (i.e. up to 64-bit, but not a big int).
And there is also an extension for rationals, i.e. x/y where both x and y can be small or big ints.
- Data.Fixed support. Currently we just encode the Integer mantissa and leave the exponent implicit in the type. It'd be better for debugging and data recovery if the exponent was also stored.
- The Rational type.
- The Scientific type from the scientific package exactly corresponds to a CBOR decimal fraction, and we should encode it as such. (Of course, where that instance should live is a good question.)

It's not clear that we have any standard types that are best represented by binary fractions. We don't have a standard big float type.
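As a hedged sketch of the Scientific point: a decimal fraction is tag 4 wrapping a two-element array [exponent, mantissa] (RFC 7049), so an encoder might look like the following; encodeScientific is a hypothetical name and not necessarily where such code should live:

```haskell
import Data.Monoid ((<>))
import Data.Scientific (Scientific, coefficient, base10Exponent)
import Data.Binary.Serialise.CBOR.Encoding
  (Encoding, encodeTag, encodeListLen, encodeInt, encodeInteger)

-- Tag 4 = decimal fraction: [e, m] represents m * 10^e.
encodeScientific :: Scientific -> Encoding
encodeScientific s =
     encodeTag 4
  <> encodeListLen 2
  <> encodeInt (base10Exponent s)
  <> encodeInteger (coefficient s)
```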