Giter Site home page Giter Site logo

jsonance's Introduction

jsonance - WIP / library for analyzing JSON for metadata

jsonance, rhymes with "resonance", is a ... [todo]

j := new(options)
j := open(previousAnalysis, options)

r := j.reader()

r2 := r.prevReader() // Navigate to past

b := j.createBranch(branchName, options)

b2 := b.fork(options)

b.close()

b.summary()

batch.setOpaque(opaqueKey, opaqueVal)

a := j.analyze(doc)

batch.add(vbucketId, seq, key, a)
batch.delete(vbucketId, seq, key)
batch.commit()
batch.close()

j.analysis()

from cbdatasource (or any data source)...

get-opaque
  j.reader.getOpaque()

rollbackEx(vbucketID uint16, vbucketUUID uint64, rollbackSeq uint64) error

onSnapshotStart(vbucketID uint16, snapStartSeq, snapEndSeq uint64, snapType uint32)

set-opaque(vbucketID uint64, []byte)
  b.setOpaque

get-opaque(vbucketID uint64) ([]byte, lastSeq, err)
  b.getOpaque

DataUpdate(vbucketID uint16, key []byte, seq uint64, r *gomemcached.MCRequest)
  b.onMutation

DataDelete(vbucketID uint16, key []byte, seq uint64, r *gomemcached.MCRequest)

analysis thoughts

want the first time a field shows up
want the first time a "fingerprint" (multi-field schema) shows up
want the last time a field shows up
want the last time a fingerprint shows up

what does "first time" / "last time" mean?

  idea: treat the the vbId->seqNum pairs as a vector clock

do missing fields mean it's a different fingerprint?

are brand new additional field(s) associated like inheritance relationship?

  ABCD "contains-a" / has-a ABC?

  ABC --> ABCD  ---+--> ABCDE
      --> ABCE  --/

  assumption / heuristic: most fields are additive

  when ABC shows up...

    A ==> t1
    B ==> t1
    C ==> t1

    t1: [A,B,C], parents: nil

  when ABCD shows up...

    t2: [A,B,C,D], parents: t1

    A ==> t2, t1
    B ==> t2, t1
    C ==> t2, t1
    D ==> t2

  when ABCE shows up...

    t3: [A,B,C,E], parents: t1

    A ==> t3, t2, t1
    B ==> t3, t2, t1
    C ==> t3, t2, t1
    D ==>     t2
    E ==> t3

  when ABCDE shows up

    t4: [A,B,C,D,E], parents: t2, t3

    A ==> t4, t3, t2, t1
    B ==> t4, t3, t2, t1
    C ==> t4, t3, t2, t1
    D ==> t4,     t2
    E ==> t4, t3

  when ABX shows up

    t5: [A,B,X], parents: nil

    A ==> t5, t4, t3, t2, t1
    B ==> t5, t4, t3, t2, t1
    C ==>     t4, t3, t2, t1
    D ==>     t4,     t2
    E ==>     t4, t3
    X ==> t5

generate short fieldId's?

what about UUID's degenerate case of a nested map?
or data-time fields degenerate case?

histograms for array lengths?

what about type fields (type: beer, type: brewery)?

pseudocode ideas

inputs: data map[string]interface{} rev rev

kvs := processData(data, rev)

sigs := constructSigs(kvs, rev) // Short for signatures.

mergeSigs(sigsState, sigs) // Track aggregates and superset-of matches of sigs.

example: processData({ "title": "star wars", "genre": "sci-fi" }, "rev-123") => [ { "name": "title", "path": "", "type": "string", // "string", "number", "object", "array", "null", "boolean" "typeEx": null, // "datetime" (rfcXxxx?), "int", "float" "val": "star wars", ==> track aggregates of min, max, count, lenMin, lenMax, lenTot "rev": "rev-123", ==> latch on existence, first write wins, like a min }, { "name": "genre", "path": "", "type": "string", "typeEx": null, "val": "sci-fi", "rev": "rev-123", } ]

sigs is roughly... several kinds of sigs, each with a... unique hash after... group by path+name group by path+name+type group by path+name+type+typeEx

 what about null's?

example analysis

source: {
  sourceName: "..."
},
branches: {
  "": {
  },
  "20180829-234123": {
    parent: ""
    opaque: {
    }
  }
}

example PINDEX_META...

{
  "name": "bs0_5ea163404f446bb6_13aa53f3",
  "uuid": "ad2b4749569cafe4",
  "indexType": "fulltext-index",
  "indexName": "bs0",
  "indexUUID": "5ea163404f446bb6",
  "indexParams": "{\"doc_config\":{\"mode\":\"type_field\",\"type_field\":\"type\"},\"mapping\":{\"default_analyzer\":\"standard\",\"default_datetime_parser\":\"dateTimeOptional\",\"default_field\":\"_all\",\"default_mapping\":{\"dynamic\":false,\"enabled\":true,\"properties\":{\"description\":{\"dynamic\":false,\"enabled\":true,\"fields\":[{\"analyzer\":\"\",\"include_in_all\":false,\"include_term_vectors\":false,\"index\":true,\"name\":\"description\",\"store\":false,\"type\":\"text\"}]}}},\"default_type\":\"_default\",\"index_dynamic\":false,\"store_dynamic\":false},\"store\":{\"kvStoreName\":\"mossStore\"}}",
  "sourceType": "couchbase",
  "sourceName": "beer-sample",
  "sourceUUID": "8f6e4f2e74d953213609fdd59396f6a9",
  "sourceParams": "{}",
  "sourcePartitions": "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170"
}

sourcePoint: "2934" sourcePoints: { "2934": { parent: "2933" } }

verbiage / trying to name the historic data points... srcRev snapshots points versions sha (a'la git sha) token savepoints rollback ref id tag generation / genTag ancestry / ancestor tag birth record fingerprint lineage point pedigree descent source context

population populace colony settlers

// ParseFailOverLog parses a byte array to an array of [vbucketUUID, // seqNum] pairs. func ParseFailOverLog(body []byte) ([][]uint64, error) { flog := make([][]uint64, len(body)/16) for i, j := 0, 0; i < len(body); i += 16 { uuid := binary.BigEndian.Uint64(body[i : i+8]) seqn := binary.BigEndian.Uint64(body[i+8 : i+16]) flog[j] = []uint64{uuid, seqn} j++ } return flog, nil }

failOverLog... vbID => vbUUID => seqNum

MISON parser http://www.vldb.org/pvldb/vol10/p1118-li.pdf

  • fast json parser
  • speculative locations of fields, both logical vs physical locations
  • SIMD popcnt
  • projections pushed down to json parser

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.