Rummage is a Clojure client library for Amazon’s SimpleDB (SDB). It depends upon the standard AWS SDK for Java, and provides a Clojure-idiomatic API for the SDB-related functionality therein.
This is a fork of Rich Hickey’s original SDB client library implementation that was later maintained in miscellaneous ways by various contributors on various forks. This is a nearly complete rewrite, made necessary by my needing/wanting various changes (linked for historical/reference purposes only) compared to Rich’s original implementation.
rummage 1520s, "act of arranging cargo in a ship," aphetic of M.Fr. arrumage "arrangement of cargo," from arrumer "to stow goods in the hold of a ship," from a- "to" + rumer, probably from Germanic (cf. O.N. rum "compartment in a ship," O.H.G. rum "space," O.E. rum, see room). Meaning "to search (the hold of a ship) thoroughly" first recorded 1620s. Rummage sale (1858) originally was a sale at docks of unclaimed goods.
This library is functionally very new, even though it was forked from an established project. There will be changes.
-
Add/flesh out docstrings for the entire API + the stuff in encoding
-
Add notes on offline / local use (documented on rhickey’s readme)
-
Complete documentation in sections marked as TODO here
-
The default encoding for URLs doesn’t seem great — they don’t sort lexicographically in a particularly useful way. Maybe something like "TLD domain [sub-domain sub-domain2 …] path". I vaguely recall there being some standard encoding like this for e.g. handling of dates in hadoop, etc. Not in a rush to run into the pit that is URL encoding/parsing, though.
-
determine whether a default encoding configuration should be provided; that is, if a bare client is provided to e.g. query or put-attrs, should we just use one of the encoding schemes in cemerick.rummage.encoding? If so, which one? I’m all for simplicity, good defaults, and convention allowing for configuration, but I don’t think any reasonable default exists (or perhaps I just haven’t thought of one yet).
Rummage is available in Maven central. Add it to your Maven project’s pom.xml:
<dependency>
  <groupId>com.cemerick</groupId>
  <artifactId>rummage</artifactId>
  <version>0.0.3</version>
</dependency>
or your leiningen/cake project.clj:
[com.cemerick/rummage "0.0.3"]
I strongly recommend squelching the AWS SDK’s very verbose logging before using Rummage (the former spews a variety of stuff out on INFO that I personally think should be in DEBUG or TRACE). You can do this with this snippet:
(.setLevel (java.util.logging.Logger/getLogger "com.amazonaws") java.util.logging.Level/WARNING)
Translate as necessary if you’re using log4j, etc.
You should be familiar with SDB itself before sensibly using this library; in particular, you’ll need to understand its data model and the semantics of the operations underlying Rummage’s implementation, which are all documented here.
You’ll first need to load the library and create a SDB client object to do anything:
(require '[cemerick.rummage :as sdb])
(def client (sdb/create-client "your aws id" "your aws secret-key"))
client here is an instance of com.amazonaws.services.simpledb.AmazonSimpleDBClient.
Once you have a client object, you can use all of Rummage’s administrative operations:
=> (doseq [n ["foo" "bar"]] (create-domain client n))
nil
=> (list-domains client)
("bar" "foo")
=> (domain-metadata client "foo")
{:itemNamesSizeBytes 0, :attributeNamesSizeBytes 0, :attributeValueCount 0, :itemCount 0, :attributeNameCount 0, :attributeValuesSizeBytes 0, :timestamp 1299688823}
=> (doseq [n (list-domains client)] (delete-domain client n))
nil
=> (list-domains client)
()
SimpleDB stores all data as strings, so you have to specify how Rummage should encode and decode your data. This is done by providing a configuration map as the first argument to each of Rummage’s functions; this configuration map must contain a set of functions that encode data into strings for storage in SDB, and decode strings retrieved from SDB into (presumably) richer data types.
The cemerick.rummage.encoding namespace provides a number of configuration maps implementing encoding strategies suitable for (hopefully) most datasets that you can use without modification to encode your data for storage in SDB. All you need to do is assoc your client object into your desired baseline configuration map:
=> (require '[cemerick.rummage.encoding :as enc])
nil
=> (def config (assoc enc/keyword-strings :client client))
#'user/config
Note
|
See Data Encoding Schemes for a detailed discussion of what functions must be found in
a configuration map, an enumeration of the canned configurations available in the
|
Most of the examples here use the keyword-strings base configuration, which naively converts all item values to strings when putting, does nothing to those string values upon retrieval, but does restore keys to keywords.
Note that all of the administrative functions accept either a bare client or a configuration map containing a com.amazonaws.services.simpledb.AmazonSimpleDBClient object mapped to :client:
=> (create-domain config "demo")
nil
Rummage supports SDB’s operations for putting, getting, and deleting items, as well as batch-put and batch-delete operations.
"Item" is a SimpleDB term that refers to a basket of key/value pairs identified by an "item name". Translated to Clojure/Rummage terms, each item is a multimap that contains:
-
the "item name" (interchangeably referred to as the item’s id here) mapped to :cemerick.rummage/id
-
a slot for each attribute key in the item, the value for which is either a scalar (e.g. string, number, keyword, date, etc) or a set of scalars
Rummage reserves the key ::sdb/id (which expands to :cemerick.rummage/id assuming you’ve aliased the cemerick.rummage namespace to sdb) to identify the ID of items (called itemName() in the SDB documentation). It will expect every item provided to put-attrs or batch-put-attrs to contain an ::sdb/id slot, and items loaded via get-attrs, query, and query-all will have their IDs mapped to ::sdb/id.
Nothing here should be surprising or particularly interesting:
=> (create-domain config "demo")
nil
=> (put-attrs config "demo" {::sdb/id "foo" :name "value" :key #{50 60 65}})
nil
=> (get-attrs config "demo" "foo")
{:key #{"60" "50" "65"}, :name "value", :cemerick.rummage/id "foo"}
You can optionally specify a limited set of keys to delete, or a limited mapping of key/value pairs to delete:
=> (delete-attrs config "demo" "foo" :attrs {:key #{60}})
nil
=> (get-attrs config "demo" "foo")
{:key #{"50" "65"}, :name "value", :cemerick.rummage/id "foo"}
=> (delete-attrs config "demo" "foo" :attrs #{:key})
nil
=> (get-attrs config "demo" "foo")
{:name "value", :cemerick.rummage/id "foo"}
=> (delete-attrs config "demo" "foo")
nil
=> (get-attrs config "demo" "foo")
nil
You can attach conditions to puts and deletes; see Conditional puts and deletes for details.
batch-put-attrs and batch-delete-attrs each accept any number of items or delete specs, respectively. (SimpleDB supports batch puts and deletes of only 25 items at a time; Rummage transparently makes as many requests as are necessary to complete each batch put or batch delete operation.)
=> (batch-put-attrs config "demo" [{::sdb/id "foo" :name "value" :key 50}
                                   {::sdb/id "bar" :name "value" :key #{60 65}}
                                   {::sdb/id "baz" :name "value" :key 70}])
nil
=> (get-attrs config "demo" "baz")
{:key "70", :name "value", ::sdb/id "baz"}
batch-delete-attrs accepts a collection of "delete specs": vectors that contain an item ID as their first element, and an optional set or map as a second element. When a set is provided, then only attributes with names corresponding to keys in that set are deleted; when a map is provided, only attributes with names and values corresponding to pairs in that map are deleted:
=> (batch-delete-attrs config "demo" [["foo" #{:key}] ["bar" {:key 60}] ["baz"]])
nil
=> (get-attrs config "demo" "foo")
{:name "value", :cemerick.rummage/id "foo"}
=> (get-attrs config "demo" "bar")
{:key "65", :name "value", :cemerick.rummage/id "bar"}
=> (get-attrs config "demo" "baz")
nil
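The transparent batching mentioned above (25 items per SimpleDB request) can be pictured with partition-all. This is only a sketch of the chunking idea and makes no claim about how Rummage implements it; batch-size and chunk-batches are hypothetical names that are not part of the library.

```clojure
;; Hypothetical helper illustrating 25-item chunking; not part of Rummage.
(def batch-size 25) ; SimpleDB's per-request limit for batch puts and deletes

(defn chunk-batches
  "Splits items into consecutive groups of at most batch-size elements,
  one group per SimpleDB batch request."
  [items]
  (partition-all batch-size items))

;; 60 items would need three requests: 25 + 25 + 10
(map count (chunk-batches (range 60)))
;; => (25 25 10)
```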
All put operations replace existing item values for the same keys by default. If you would like to add/append values for an existing item key, put-attrs and batch-put-attrs optionally accept an :add-to? argument: a set of item keys for which item values should be appended, rather than replaced:
=> (put-attrs config "demo" {::sdb/id "appending" :name 50})
nil
=> (get-attrs config "demo" "appending")
{:name "50", ::sdb/id "appending"}
=> (put-attrs config "demo" {::sdb/id "appending" :name 60})
nil
=> (get-attrs config "demo" "appending")
{:name "60", ::sdb/id "appending"}
=> (put-attrs config "demo" {::sdb/id "appending" :name 70} :add-to? #{:name})
nil
=> (get-attrs config "demo" "appending")
{:name #{"70" "60"}, ::sdb/id "appending"}
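The replace-versus-append behaviour shown above can be modelled locally with plain maps. The following is a rough in-memory sketch, assuming values have already been encoded to strings; simulate-put and its helpers are hypothetical names and do not talk to SDB or use Rummage at all.

```clojure
;; In-memory model of put semantics; hypothetical, not Rummage code.
(defn- as-set [v] (if (set? v) v #{v}))

(defn- unwrap
  "A single remaining value is shown as a scalar, several as a set."
  [s]
  (if (= 1 (count s)) (first s) s))

(defn simulate-put
  "Merges new-attrs into existing. Keys in the add-to set have their values
  appended to whatever is already stored; all other keys are replaced."
  [existing new-attrs add-to]
  (reduce-kv
    (fn [item k v]
      (if (and (contains? add-to k) (contains? item k))
        (assoc item k (unwrap (into (as-set (get item k)) (as-set v))))
        (assoc item k v)))
    existing
    new-attrs))

(simulate-put {:name "50"} {:name "60"} #{})       ; replaced: {:name "60"}
(simulate-put {:name "60"} {:name "70"} #{:name})  ; appended: {:name #{"60" "70"}}
```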
Both delete-attrs and put-attrs can be provided with values defining conditions under which their corresponding requests should fail:
=> (put-attrs config "demo" {::sdb/id "conditional" :name 70})
nil
=> (put-attrs config "demo" {::sdb/id "conditional" :name 100} :not-expecting :name)
#<CompilerException Status Code: 409, AWS Request ID: a5e71a72-76f2-7d42-e10c-958a773df53b, AWS Error Code: ConditionalCheckFailed, AWS Error Message: Conditional check failed. Attribute (name) value exists (NO_SOURCE_FILE:0)>
=> (put-attrs config "demo" {::sdb/id "conditional" :name 100} :expecting [:other-name 100])
#<CompilerException Status Code: 404, AWS Request ID: 61939a96-f79f-678e-1e4f-7d29ebbe8e02, AWS Error Code: AttributeDoesNotExist, AWS Error Message: Attribute (other-name) does not exist (NO_SOURCE_FILE:0)>
=> (put-attrs config "demo" {::sdb/id "conditional" :name 100} :expecting [:name 50])
#<CompilerException Status Code: 409, AWS Request ID: 6bac4305-6877-c1bb-b8b5-c07b08e83d07, AWS Error Code: ConditionalCheckFailed, AWS Error Message: Conditional check failed. Attribute (name) value is (70) but was expected (50) (NO_SOURCE_FILE:0)>
=> (put-attrs config "demo" {::sdb/id "conditional" :name 100} :expecting [:name 70])
nil
=> (get-attrs config "demo" "conditional")
{:name "100", ::sdb/id "conditional"}
=> (delete-attrs config "demo" "conditional" :not-expecting :name)
#<CompilerException Status Code: 409, AWS Request ID: ca98837b-8be1-9ec9-66fe-5989776fb3bf, AWS Error Code: ConditionalCheckFailed, AWS Error Message: Conditional check failed. Attribute (name) value exists (NO_SOURCE_FILE:0)>
=> (delete-attrs config "demo" "conditional" :expecting [:name 100])
nil
=> (get-attrs config "demo" "conditional")
nil
You can issue ad-hoc queries over data you’ve stored in SimpleDB. SDB’s canonical representation of these queries is textual, and vaguely resembles SQL:
=> (batch-put-attrs config "demo" [{::sdb/id "foo" :name "Claremont" :key 50}
                                   {::sdb/id "bar" :name "Burlington" :key #{60 65}}
                                   {::sdb/id "baz" :name "Keene" :key 70}])
nil
=> (query config "select key from demo where key is not null")
({:key "50", :cemerick.rummage/id "foo"} {:key #{"60" "65"}, :cemerick.rummage/id "bar"} {:key "70", :cemerick.rummage/id "baz"})
As when retrieving items using get-attrs, query uses the encoding functions in the configuration map provided as its first argument.
Note
|
You can optionally use SDB’s consistent read semantics when querying. |
Using strings to query SDB works (and may be necessary if you already have canned SDB queries in a "legacy" codebase or are integrating with a system that somehow produces SDB queries dynamically), but doing so leaves you to remember SDB’s quoting rules and replicate the encoding that was used to store attribute names and values. Rummage provides a Clojure map-based DSL for querying SDB:
=> (query config '{select count from demo})
3
=> (query config '{select id from demo})
("bar" "baz" "foo")
=> (query config '{select [:key] from demo where (> :key 60)})
({:key #{"60" "65"}, :cemerick.rummage/id "bar"} {:key "70", :cemerick.rummage/id "baz"})
=> (query config '{select [:key] from demo where (like ::sdb/id "ba%")})
({:key #{"60" "65"}, :cemerick.rummage/id "bar"} {:key "70", :cemerick.rummage/id "baz"})
=> (query config '{select [:key] from demo where (and (like ::sdb/id "ba%") (< :key 70))})
({:key #{"60" "65"}, :cemerick.rummage/id "bar"})
=> (query config '{select [:name] from demo where (!= :name "Keene")})
({:name "Burlington", :cemerick.rummage/id "bar"} {:name "Claremont", :cemerick.rummage/id "foo"})
=> (query config '{select * from demo where (not-null :key) order-by [:key]})
({:key "50", :name "Claremont", :cemerick.rummage/id "foo"} {:key #{"60" "65"}, :name "Burlington", :cemerick.rummage/id "bar"} {:key "70", :name "Keene", :cemerick.rummage/id "baz"})
=> (query config '{select * from demo order-by [:key desc] where (not-null :key)})
({:key "70", :name "Keene", :cemerick.rummage/id "baz"} {:key #{"60" "65"}, :name "Burlington", :cemerick.rummage/id "bar"} {:key "50", :name "Claremont", :cemerick.rummage/id "foo"})
=> (query config '{select * from demo order-by [:key desc] where (not-null :key) limit 1})
({:key "70", :name "Keene", :cemerick.rummage/id "baz"})
Since this query style is map-based, you can generate it dynamically. Additionally, you can easily interpolate values – parameters, keys, domain names, comparison values, etc – into query map literals using syntax-quote:
=> (let [domain-name "demo"
         key-values [50 70]]
     (query config `{select [:name] from ~domain-name where (in :key ~key-values)}))
({:name "Claremont", :cemerick.rummage/id "foo"} {:name "Keene", :cemerick.rummage/id "baz"})
Note
|
Rummage’s select-string function returns the SDB query string generated for a query map:
=> (select-string config '{select [:key] from demo where (and (like ::sdb/id "ba%") (< :key 70))})
"select `key` from `demo` where (itemName() like 'ba%') and (`key` < '70')" |
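The generated string above shows SDB's quoting conventions at work: attribute and domain names are wrapped in backticks, values in single quotes, and an embedded quote character is escaped by doubling it. If you do write query strings by hand, a small sketch of that quoting might look like the following; quote-name and quote-value are hypothetical helpers, not functions provided by Rummage, and they do not perform any of the value encoding a configuration map would apply.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helpers for hand-built query strings; not part of Rummage.
(defn quote-name
  "Backtick-quotes an attribute or domain name, doubling embedded backticks."
  [s]
  (str "`" (string/replace s "`" "``") "`"))

(defn quote-value
  "Single-quotes an attribute value, doubling embedded single quotes."
  [s]
  (str "'" (string/replace s "'" "''") "'"))

(str "select " (quote-name "key") " from " (quote-name "demo")
     " where " (quote-name "name") " = " (quote-value "O'Brien"))
;; => "select `key` from `demo` where `name` = 'O''Brien'"
```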
The examples above demonstrate the query map DSL reasonably well, but one should refer to the docstring for the query function for an authoritative list of comparisons, information about return values, etc., and to the SDB query documentation for details on semantics.
query performs a single request to SDB, which can potentially return only a portion of a query’s results. If you want to obtain all of the results matching a query, use the query-all function, which will lazily page through results of a query for you as you consume them, using the :next-token metadata provided by the seqs returned by query:
=> (batch-put-attrs config "demo" (for [x (range 5000)] {::sdb/id x :key x}))
nil
=> (count (query-all config `{select id from demo}))
5000
Note that query-all will automatically bump the :limit of a query up to the SDB maximum of 2500 (the default is 100) to minimize the number of network requests to obtain the full resultset.
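The lazy paging idea behind query-all can be sketched without touching SDB. The example below is an illustration under stated assumptions, not Rummage's implementation: page-seq is a hypothetical function, and fake-fetch is a stand-in for a single query request that returns one page of results carrying :next-token metadata when more pages remain.

```clojure
;; Hypothetical sketch of token-based lazy paging; not Rummage code.
(defn page-seq
  "Returns a lazy seq of every item across all pages. fetch-page is called
  with nil for the first page, then with each page's :next-token metadata."
  [fetch-page]
  (letfn [(step [token]
            (lazy-seq
              (let [page (fetch-page token)
                    next-token (:next-token (meta page))]
                (concat page (when next-token (step next-token))))))]
    (step nil)))

;; Stand-in data source: ten items served four at a time.
(def all-items (vec (range 10)))

(defn fake-fetch
  "Pretends to be one request; the token is simply the next offset."
  [token]
  (let [offset (or token 0)
        end (min (+ offset 4) (count all-items))]
    (with-meta (vec (take 4 (drop offset all-items)))
      (when (< end (count all-items)) {:next-token end}))))

(page-seq fake-fetch)
;; => (0 1 2 3 4 5 6 7 8 9)
```

Because each further page is only requested when the seq is consumed that far, taking a few items from the front does not fetch the whole result set.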
The configuration map you provide as the first argument to most of Rummage’s functions defines how data is encoded to strings for storage in SDB and how strings retrieved from SDB are decoded to non-string item keys and values. The encoding portion of the configuration map is also used to encode keys and values to strings when constructing string queries from the query maps accepted by query.
Configuration maps should contain the following encoding-related functions:
-
:encode-id encodes item IDs (values in item maps mapped to ::sdb/id) to strings
-
:decode-id is the dual of the :encode-id function; it decodes string item IDs retrieved from SDB, potentially to some other type of value
-
:encode encodes item keys and values to attribute name and value strings. It must provide a 1-arg arity that will receive item keys and return a corresponding encoded string, and a 2-arg arity that will receive an item key and value, and return a vector containing the corresponding encoded strings
-
:decode is the dual of the :encode function; it decodes string item names and values retrieved from SDB
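To make the shape of these four functions concrete, here is a minimal hand-written sketch. It is illustrative only: name-keyed-config is a hypothetical name, and the sketch assumes that :decode mirrors :encode with a 1-arg arity for names and a 2-arg arity for name/value pairs, which the description above implies but does not spell out.

```clojure
;; Hypothetical configuration map; keyword keys, string values.
(def name-keyed-config
  {:encode-id str        ; any ID becomes its string form
   :decode-id identity   ; IDs come back as plain strings
   :encode (fn
             ([k] (name k))                ; attribute name only
             ([k v] [(name k) (str v)]))   ; attribute name and value
   ;; Assumption: decode has the same two arities as encode.
   :decode (fn
             ([k] (keyword k))
             ([k v] [(keyword k) v]))})

((:encode name-keyed-config) :age 26)
;; => ["age" "26"]
((:decode name-keyed-config) "age" "26")
;; => [:age "26"]
```

As with the canned configurations, such a map would be combined with a client via (assoc name-keyed-config :client client) before use. Note that name drops keyword namespaces, so this sketch would not round-trip namespaced keys.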
If your needs warrant it, you can write your own encoding and decoding functions and use them in configuration maps with Rummage. However, the cemerick.rummage.encoding namespace provides a number of configuration maps (and functions that return configuration maps) implementing encoding strategies suitable for (hopefully) most datasets that you can use without modification to encode your data for storage in SDB. These are described in detail here:
The simplest possible encoding scheme, all-strings, converts all outgoing data to strings using str, and passes through all retrieved item strings unchanged.
=> (def config (assoc enc/all-strings :client client))
#'user/config
=> (put-attrs config "demo" {::sdb/id "all-strings" :keyword 42 "name" :value :date (java.util.Date.)})
nil
=> (get-attrs config "demo" "all-strings")
{":keyword" "42", ":date" "Wed Mar 09 12:39:57 EST 2011", "name" ":value", ::sdb/id "all-strings"}
This can be very useful when working with SDB data that has been stored / needs to be accessible from other SDB clients.
This encoding scheme stores all attribute values as strings just like all-strings, but provides for round-tripping of keywords as attribute names.
=> (def config (assoc enc/keyword-strings :client client))
#'user/config
=> (put-attrs config "demo" {::sdb/id "keyword-strings" :keyword 42 :name :value :date (java.util.Date.)})
nil
=> (get-attrs config "demo" "keyword-strings")
{:date "Wed Mar 09 12:41:31 EST 2011", :keyword "42", :name ":value", ::sdb/id "keyword-strings"}
Note that an error will occur if you attempt to store items that have non-keyword keys using this configuration.
all-prefixed-config is a function that returns a configuration map that uses a defined set of prefixed type tags along with roundtrippable string encodings to store item data of many different types, and restore that item data with its original types upon retrieval.
=> (def config (assoc (enc/all-prefixed-config) :client client))
#'user/config
=> (put-attrs config "demo" {::sdb/id "prefixed-data" :keyword 42 :ns/name :value :date (java.util.Date.) true false "float value" 108.6})
nil
=> (get-attrs config "demo" "prefixed-data")
{:keyword 42, true false, "float value" 108.6, :date #<Date Thu Mar 10 10:01:04 EST 2011>, :ns/name :value, :cemerick.rummage/id "prefixed-data"}
Notice that the retrieved data has the same types as the stored data; in fact, the map returned here by get-attrs is equal to the map stored in SDB by put-attrs (as in, (= stored-map retrieved-map)). To see what the attributes look like in their encoded form, let’s take a look at that item using the all-strings config (which, remember, does no decoding of string data retrieved from SDB):
=> (get-attrs (assoc enc/all-strings :client client) "demo" "s:prefixed-data")
{"k:keyword" "i:4611686018427387945", "z:true" "z:false", "s:float value" "f:5 002 1.0860000000000000", "k:date" "D:2011-03-10T15:04:05.419+0000", "k:ns/name" "k:value", :cemerick.rummage/id "s:prefixed-data"}
The supported types and prefixes used by configurations produced by all-prefixed-config are controlled by the formatting map provided as an argument to all-prefixed-config; a default formatting map is used if none is explicitly provided.
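The encoded integer in the example above (42 stored as i:4611686018427387945) is consistent with adding a fixed offset of half of Long/MAX_VALUE, so that negative and positive numbers both become non-negative strings of equal length that sort lexicographically in numeric order. The sketch below reproduces that value under this assumption; it is inferred from the sample output only and is not taken from Rummage's source, and offset, encode-long and decode-long are hypothetical names.

```clojure
;; Assumption inferred from the sample output; not Rummage's actual code.
(def offset (quot Long/MAX_VALUE 2)) ; 4611686018427387903

(defn encode-long
  "Shifts n by offset and zero-pads to 19 digits so that string order
  matches numeric order. Valid for n between (- offset) and offset."
  [n]
  (format "%019d" (+ n offset)))

(defn decode-long
  "Reverses encode-long."
  [s]
  (- (Long/parseLong s) offset))

(encode-long 42)
;; => "4611686018427387945"
(map decode-long (sort (map encode-long [100 -5 3])))
;; => (-5 3 100)
```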
Warning
|
Since all stored values have a type prefix, and |
name-typed-values-config is a function that returns a configuration map. This scheme is very similar to all-prefixed-config, but restricts the use of type prefixes to attribute names (specifically, to namespaces of keywords used as attribute keys) and requires that all values for each attribute are of the same type, indicated by the attribute name’s prefix. Within that structure, all values are stored using roundtrippable string encodings and are restored to their original types upon retrieval.
=> (def config (assoc (enc/name-typed-values-config) :client client))
#'user/config
=> (put-attrs config "demo" {::sdb/id "name-typed-values" :i/keyword 42 :k/name :value :D/date (java.util.Date.) :z/true false :f/float-value 108.6})
nil
=> (get-attrs config "demo" "name-typed-values")
{:i/keyword 42, :D/date #<Date Thu Mar 10 10:20:15 EST 2011>, :k/name :value, :z/true false, :f/float-value 108.6, :cemerick.rummage/id "name-typed-values"}
Notice that, in contrast to all-prefixed-config, values have no prefixes. The type metadata for each attribute is stored instead in the namespaces of the keywords used as attribute keys. (An exception will occur if you attempt to store an attribute that does not use a namespaced keyword for a key.) Let’s take a look at how this translates into the strings stored in SDB:
=> (get-attrs (assoc enc/all-strings :client client) "demo" "s:name-typed-values")
{"k:i/keyword" "4611686018427387945", "k:D/date" "2011-03-10T15:20:15.806+0000", "k:k/name" "value", "k:z/true" "false", "k:f/float-value" "5 002 1.0860000000000000", :cemerick.rummage/id "s:name-typed-values"}
Because no prefixes are used in the encoded values, prefixed like queries will work, in contrast to all-prefixed-config. The tradeoff is that item keys must all be namespaced keywords, with namespaces corresponding to the prefixes specified in the formatting map.
The supported types and prefixes used by configurations produced by name-typed-values-config are controlled by the formatting map provided as an argument to name-typed-values-config; a default formatting map is used if none is explicitly provided.
If:
-
your dataset has attributes whose values are of constant types, and
-
you can specify those attribute names and their corresponding types ahead of time,
then you can use configuration maps produced by the fixed-domain-schema function, which avoids all of the shortcomings of all-prefixed-config-derived configurations (e.g. breaks prefixed like queries) and name-typed-values-config-derived configurations (e.g. requiring attribute keys to be namespaced keywords).
=> (def config (assoc (enc/fixed-domain-schema {:name String
                                                :birthday java.util.Date
                                                "age" Integer
                                                true Boolean})
                      :client client))
#'user/config
=> (put-attrs config "demo" {::sdb/id "fixed-domain-schema" :birthday (java.util.Date.) "age" 26 true false})
nil
=> (get-attrs config "demo" "fixed-domain-schema")
{"age" 26, true false, :birthday #<Date Thu Mar 10 10:45:25 EST 2011>, :cemerick.rummage/id "fixed-domain-schema"}
Again, the items are retrieved and decoded such that their types are the same as those in the map provided to put-attrs. Let’s take a look at how those attributes are encoded for storage:
=> (get-attrs (assoc enc/all-strings :client client) "demo" "s:fixed-domain-schema")
{"s:age" "4611686018427387929", "z:true" "false", "k:birthday" "2011-03-10T15:45:25.925+0000", :cemerick.rummage/id "s:fixed-domain-schema"}
As you can see, item keys and the item ID itself are encoded with prefixes as in all-prefixed-config and name-typed-values-config, but the encoded attribute values have no prefixes.
Of course, since we’re using SDB, you can add new "columns" to the data that can be encoded by a fixed-domain-schema configuration by obtaining a new configuration map using a different "schema" map provided to the fixed-domain-schema function:
=> (put-attrs config "demo" {::sdb/id "fixed-domain-schema" :unknown-key 42})
#<java.lang.IllegalArgumentException: No formatter available for prefix :unknown-key>
=> (def config (assoc (enc/fixed-domain-schema {:name String
                                                :birthday java.util.Date
                                                "age" Integer
                                                true Boolean
                                                ;; adding new slot ("column"?) to schema
                                                :unknown-key Integer})
                      :client client))
#'user/config
=> (put-attrs config "demo" {::sdb/id "fixed-domain-schema" :unknown-key 42})
nil
=> (get-attrs config "demo" "fixed-domain-schema")
{"age" 26, :unknown-key 42, true false, :birthday #<Date Thu Mar 10 10:45:25 EST 2011>, :cemerick.rummage/id "fixed-domain-schema"}
Depending on your dataset, application requirements, and personal preferences, this process of mapping attribute keys (:name, "age", etc) to value types – similar to how tables are defined in relational databases – is likely either unreasonably restrictive or comfortingly familiar.
;; TODO
cemerick.rummage.encoding/prefix-formatting is the default formatting map used by the configuration-map-producing functions in that namespace (name-typed-values-config, all-prefixed-config, and fixed-domain-schema). It contains mappings for the following types (with prefixes for those encoding schemes that use them):
Type | Prefix | Example value | Example encoded value (does not include prefix) | Notes |
---|---|---|---|---|
 |  |  |  |  |
 |  |  |  |  |
 |  |  |  | Maximum absolute value: |
 |  |  |  | Always decodes to a concrete long. |
 |  |  | "5 002 1.0860000000000000" |  |
 |  |  | "5 002 1.0860000000000000" | Always decodes to a concrete double. |
 |  |  |  |  |
 |  |  |  |  |
 |  | (Date.) |  | Dates are always normalized to UTC (which ensures that their encoded forms sort properly when compared lexicographically) and encoded using ISO 8601 |
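The date handling described in the table (normalize to UTC, encode as ISO 8601) can be sketched with the JDK's SimpleDateFormat. The pattern below was chosen to match the encoded dates shown in the examples earlier (e.g. 2011-03-10T15:04:05.419+0000); it is an assumption for illustration rather than Rummage's confirmed implementation, and encode-date and decode-date are hypothetical names.

```clojure
(import '(java.text SimpleDateFormat)
        '(java.util Date TimeZone))

;; Hypothetical sketch; the pattern is inferred from the sample output.
(defn- utc-format []
  ;; SimpleDateFormat is not thread-safe, so build a fresh one per call.
  (doto (SimpleDateFormat. "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    (.setTimeZone (TimeZone/getTimeZone "UTC"))))

(defn encode-date
  "Formats d in UTC so that encoded strings sort chronologically."
  [^Date d]
  (.format (utc-format) d))

(defn decode-date
  "Parses a string produced by encode-date back into a java.util.Date."
  [^String s]
  (.parse (utc-format) s))

(encode-date (Date. 0))
;; => "1970-01-01T00:00:00.000+0000"
```

Because every encoded value carries the same +0000 offset and fixed-width fields, comparing the strings lexicographically gives the same order as comparing the dates themselves.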
Have maven. From the command line:
$ mvn clean install
The tests are all live, so:
-
They create and delete domains (though with unique names).
-
They aren’t written to be particularly efficient w.r.t. SDB usage. If you do decide to run the tests, the associated fees should be trivial (or nonexistent if your account is under the SDB free usage cap).
In any case, you are so warned. Make a new AWS account dedicated to testing if you’re concerned on either count.
Since the tests are live, you either need to add your AWS credentials to your ~/.m2/settings.xml file as properties, or specify them on the command line using -D switches:
$ mvn -Daws.id=XXXXXXX -Daws.secret-key=YYYYYYY clean install
Or, you can skip the tests entirely:
$ mvn -Dmaven.test.skip=true clean install
In any case, you’ll find a built .jar file in the target directory, and in its designated spot in ~/.m2/repository (assuming you ran install rather than e.g. package).
Ping cemerick on freenode irc or twitter if you have questions or would like to contribute patches.