
chrisjsewell / jsonextended


Extending the python json package functionality

Home Page: https://jsonextended.readthedocs.io

License: MIT License

Python 94.86% JavaScript 5.09% Shell 0.05%
python-json json physical-quantities

jsonextended's Introduction

JSON Extended


A module to extend the python json package functionality:

  • Treat a directory structure like a nested dictionary:
    • lightweight plugin system: define bespoke classes for parsing different file extensions (in-the-box: .json, .csv, .hdf5) and encoding/decoding objects
    • lazy loading: read files only when they are indexed into
    • tab completion: index as tabs for quick exploration of data
  • Manipulation of nested dictionaries:
    • enhanced pretty printer
    • Javascript rendered, expandable tree in the Jupyter Notebook
    • functions including: filter, merge, flatten, unflatten, diff
    • output to directory structure (of n folder levels)
  • On-disk indexing option for large json files (using the ijson package)
  • Units schema concept to apply and convert physical units (using the pint package)

Documentation: https://jsonextended.readthedocs.io

Installation

From Conda (recommended):

conda install -c conda-forge jsonextended

From PyPI:

pip install jsonextended

jsonextended has no import dependencies on Python 3.x, and only pathlib2 on Python 2.7. For full functionality, however, it is advised to install the following packages:

conda install -c conda-forge ijson numpy pint h5py pandas

Basic Example

from jsonextended import edict, plugins, example_mockpaths

Take a directory structure, potentially containing multiple file types:

datadir = example_mockpaths.directory1
print(datadir.to_string(indentlvl=3,file_content=True))
Folder("dir1")
   File("file1.json") Contents:
    {"key2": {"key3": 4, "key4": 5}, "key1": [1, 2, 3]}
   Folder("subdir1")
     File("file1.csv") Contents:
       # a csv file
      header1,header2,header3
      val1,val2,val3
      val4,val5,val6
      val7,val8,val9
     File("file1.literal.csv") Contents:
       # a csv file with numbers
      header1,header2,header3
      1,1.1,string1
      2,2.2,string2
      3,3.3,string3
   Folder("subdir2")
     Folder("subsubdir21")
       File("file1.keypair") Contents:
         # a key-pair file
        key1 val1
        key2 val2
        key3 val3
        key4 val4

Plugins can be defined for parsing each file type (see Creating Plugins section):

plugins.load_builtin_plugins('parsers')
plugins.view_plugins('parsers')
{'csv.basic': 'read *.csv delimited file with headers to {header:[column_values]}',
 'csv.literal': 'read *.literal.csv delimited files with headers to {header:column_values}, with number strings converted to int/float',
 'hdf5.read': 'read *.hdf5 (in read mode) files using h5py',
 'json.basic': 'read *.json files using json.load',
 'keypair': "read *.keypair, where each line should be; '<key> <pair>'"}
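
Note that file1.literal.csv matches both the *.csv and *.literal.csv patterns; as stated in the interface specifications, a file is parsed by the longest pattern it matches. The selection rule can be sketched as follows. This is a minimal illustration rather than the library's own code: the PATTERNS registry and select_parser function are hypothetical names, and glob-style matching is an assumption based on the pattern strings shown above.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: glob-style pattern -> parser name.
# The names are a subset of those printed by plugins.view_plugins('parsers').
PATTERNS = {
    "*.csv": "csv.basic",
    "*.literal.csv": "csv.literal",
    "*.json": "json.basic",
    "*.keypair": "keypair",
}


def select_parser(filename, patterns=PATTERNS):
    """Return the parser whose pattern is the longest one matching filename."""
    matches = [pattern for pattern in patterns if fnmatchcase(filename, pattern)]
    if not matches:
        return None  # no compatible parser
    return patterns[max(matches, key=len)]
```

With this rule, file1.csv falls to the basic csv parser while file1.literal.csv goes to the literal one.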

LazyLoad then takes a path name, path-like object or dict-like object, which will lazily load each file with a compatible plugin.

lazy = edict.LazyLoad(datadir)
lazy
{file1.json:..,subdir1:..,subdir2:..}

LazyLoad can then be treated like a dictionary, or indexed by tab completion:

list(lazy.keys())
['subdir1', 'subdir2', 'file1.json']
lazy[['file1.json','key1']]
[1, 2, 3]
lazy.subdir1.file1_literal_csv.header2
[1.1, 2.2, 3.3]
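
The idea behind lazy loading can be sketched with the standard library alone. The LazyDir class and build_demo_folder helper below are hypothetical and are not part of jsonextended; the sketch assumes every file in the folder is JSON, whereas LazyLoad dispatches to parser plugins.

```python
import json
import os
import tempfile


class LazyDir(object):
    """Hypothetical stand-in for lazy loading: files in a folder are only
    read (here, always as JSON) the first time they are indexed."""

    def __init__(self, path):
        self._path = path
        self._cache = {}  # entries that have already been read

    def keys(self):
        return sorted(os.listdir(self._path))

    def __getitem__(self, key):
        if key not in self._cache:
            full = os.path.join(self._path, key)
            if os.path.isdir(full):
                self._cache[key] = LazyDir(full)  # sub-folders stay lazy too
            else:
                with open(full) as handle:
                    self._cache[key] = json.load(handle)
        return self._cache[key]

    def loaded(self):
        """Names of the entries read so far."""
        return sorted(self._cache)


def build_demo_folder():
    """Create a temporary folder holding one JSON file and one sub-folder."""
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "file1.json"), "w") as handle:
        json.dump({"key1": [1, 2, 3]}, handle)
    os.mkdir(os.path.join(root, "subdir1"))
    return root
```

Creating LazyDir(build_demo_folder()) reads nothing from disk; file1.json is only opened when it is indexed.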

For pretty printing of the dictionary:

edict.pprint(lazy,depth=2)
file1.json:
  key1: [1, 2, 3]
  key2: {...}
subdir1:
  file1.csv: {...}
  file1.literal.csv: {...}
subdir2:
  subsubdir21: {...}
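
The effect of the depth argument can be illustrated with a short recursive formatter. This is only a sketch of the idea, not the code behind edict.pprint; format_tree is a hypothetical name, and string keys are assumed so that they can be sorted.

```python
def format_tree(dct, depth=None, indent=0):
    """Return the lines of an indented view of dct, collapsing any
    dictionary nested deeper than `depth` levels to '{...}'."""
    lines = []
    pad = " " * indent
    for key in sorted(dct):
        value = dct[key]
        if isinstance(value, dict):
            if depth is not None and depth <= 1:
                # depth budget used up: collapse the sub-dictionary
                lines.append("{0}{1}: {{...}}".format(pad, key))
            else:
                lines.append("{0}{1}:".format(pad, key))
                next_depth = None if depth is None else depth - 1
                lines.extend(format_tree(value, next_depth, indent + 2))
        else:
            lines.append("{0}{1}: {2}".format(pad, key, value))
    return lines
```

Printing "\n".join(format_tree(data, depth=2)) gives the same kind of layout as the output above.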

Numerous functions exist to manipulate the nested dictionary:

edict.flatten(lazy.subdir1)
{('file1.csv', 'header1'): ['val1', 'val4', 'val7'],
 ('file1.csv', 'header2'): ['val2', 'val5', 'val8'],
 ('file1.csv', 'header3'): ['val3', 'val6', 'val9'],
 ('file1.literal.csv', 'header1'): [1, 2, 3],
 ('file1.literal.csv', 'header2'): [1.1, 2.2, 3.3],
 ('file1.literal.csv', 'header3'): ['string1', 'string2', 'string3']}
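
To make the tuple-key convention concrete, here is a minimal pure-Python sketch of flattening and unflattening. It is an illustration under the assumption that only dict instances are descended into; it is not the edict implementation, which may handle further cases.

```python
def flatten(dct, parent=()):
    """Map each leaf value to the tuple of keys leading to it."""
    flat = {}
    for key, value in dct.items():
        path = parent + (key,)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value  # leaf (or empty dict) reached
    return flat


def unflatten(flat):
    """Rebuild the nested dictionary from tuple keys."""
    nested = {}
    for path, value in flat.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested
```

The two functions are inverses of each other for dictionaries without empty sub-dictionaries on shared paths.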

LazyLoad passes the plugins.decode function to the parser plugin's read_file method (as the 'object_hook' keyword). Bespoke decoder plugins can therefore be set up for specific dictionary key signatures:

print(example_mockpaths.jsonfile2.to_string())
File("file2.json") Contents:
{"key1":{"_python_set_": [1, 2, 3]},"key2":{"_numpy_ndarray_": {"dtype": "int64", "value": [1, 2, 3]}}}
edict.LazyLoad(example_mockpaths.jsonfile2).to_dict()
{u'key1': {u'_python_set_': [1, 2, 3]},
 u'key2': {u'_numpy_ndarray_': {u'dtype': u'int64', u'value': [1, 2, 3]}}}
plugins.load_builtin_plugins('decoders')
plugins.view_plugins('decoders')
{'decimal.Decimal': 'encode/decode Decimal type',
 'numpy.ndarray': 'encode/decode numpy.ndarray',
 'pint.Quantity': 'encode/decode pint.Quantity object',
 'python.set': 'decode/encode python set'}
dct = edict.LazyLoad(example_mockpaths.jsonfile2).to_dict()
dct
{u'key1': {1, 2, 3}, u'key2': array([1, 2, 3])}
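
The underlying mechanism is the object_hook argument of the standard json module, which is called on every decoded JSON object. A minimal sketch using only the standard library follows; decode_set is a hypothetical function standing in for a decoder plugin, and matching on an exact key set is an assumption.

```python
import json


def decode_set(dct):
    """object_hook: turn {'_python_set_': [...]} into a set and
    return every other dictionary unchanged."""
    if set(dct.keys()) == {"_python_set_"}:
        return set(dct["_python_set_"])
    return dct


text = '{"key1": {"_python_set_": [1, 2, 3]}, "key2": {"a": 1}}'
decoded = json.loads(text, object_hook=decode_set)
```

Here decoded["key1"] becomes a Python set, while key2 stays a plain dictionary because its keys do not match the signature.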

This process can be reversed, using encoder plugins:

plugins.load_builtin_plugins('encoders')
plugins.view_plugins('encoders')
{'decimal.Decimal': 'encode/decode Decimal type',
 'numpy.ndarray': 'encode/decode numpy.ndarray',
 'pint.Quantity': 'encode/decode pint.Quantity object',
 'python.set': 'decode/encode python set'}
import json
json.dumps(dct,default=plugins.encode)
'{"key2": {"_numpy_ndarray_": {"dtype": "int64", "value": [1, 2, 3]}}, "key1": {"_python_set_": [1, 2, 3]}}'
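
The reverse direction relies on the default argument of json.dumps, which is called for any object the json module cannot serialise itself. A standard-library sketch follows; encode_set is a hypothetical function standing in for an encoder plugin, and it assumes the set elements can be sorted.

```python
import json


def encode_set(obj):
    """default hook: represent a set as {'_python_set_': [...]}."""
    if isinstance(obj, (set, frozenset)):
        return {"_python_set_": sorted(obj)}
    # json expects a TypeError for objects that cannot be handled
    raise TypeError("cannot serialise {0!r}".format(type(obj)))


encoded = json.dumps({"key1": {1, 2, 3}}, default=encode_set, sort_keys=True)
```

The resulting string can be read back with a matching object_hook, giving a round trip.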

Creating and Loading Plugins

from jsonextended import plugins, utils

Plugins are recognised as classes with a minimal set of attributes matching the plugin category interface:

plugins.view_interfaces()
{'decoders': ['plugin_name', 'plugin_descript', 'dict_signature'],
 'encoders': ['plugin_name', 'plugin_descript', 'objclass'],
 'parsers': ['plugin_name', 'plugin_descript', 'file_regex', 'read_file']}
plugins.unload_all_plugins()
plugins.view_plugins()
{'decoders': {}, 'encoders': {}, 'parsers': {}}

For example, a simple parser plugin would be:

class ParserPlugin(object):
    plugin_name = 'example'
    plugin_descript = 'a parser for *.example files, that outputs (line_number:line)'
    file_regex = '*.example'
    def read_file(self, file_obj, **kwargs):
        out_dict = {}
        for i, line in enumerate(file_obj):
            out_dict[i] = line.strip()
        return out_dict

Plugins can be loaded as a class:

plugins.load_plugin_classes([ParserPlugin],'parsers')
plugins.view_plugins()
{'decoders': {},
 'encoders': {},
 'parsers': {'example': 'a parser for *.example files, that outputs (line_number:line)'}}

Or by directory (loading all .py files):

fobj = utils.MockPath('example.py',is_file=True,content="""
class ParserPlugin(object):
    plugin_name = 'example.other'
    plugin_descript = 'a parser for *.example.other files, that outputs (line_number:line)'
    file_regex = '*.example.other'
    def read_file(self, file_obj, **kwargs):
        out_dict = {}
        for i, line in enumerate(file_obj):
            out_dict[i] = line.strip()
        return out_dict
""")
dobj = utils.MockPath(structure=[fobj])
plugins.load_plugins_dir(dobj,'parsers')
plugins.view_plugins()
{'decoders': {},
 'encoders': {},
 'parsers': {'example': 'a parser for *.example files, that outputs (line_number:line)',
  'example.other': 'a parser for *.example.other files, that outputs (line_number:line)'}}

For a more complex example of a parser, see jsonextended.complex_parsers

Interface specifications

  • Parsers:
    • file_regex attribute, a str denoting what files to apply it to. A file will be parsed by the longest regex it matches.
    • read_file method, which takes an (open) file object and kwargs as parameters
  • Decoders:
    • dict_signature attribute, a tuple denoting the keys which the dictionary must have, e.g. dict_signature=('a','b') decodes {'a':1,'b':2}
    • from_... method(s), which take a dict object as a parameter. The plugins.decode function will use the method denoted by the intype parameter, e.g. if intype='json', then from_json will be called.
  • Encoders:
    • objclass attribute, the object class to apply the encoding to, e.g. objclass=decimal.Decimal encodes objects of that type
    • to_... method(s), which take an object of type objclass as a parameter. The plugins.encode function will use the method denoted by the outtype parameter, e.g. if outtype='json', then to_json will be called.
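
As an illustration of these interfaces, the sketch below defines a combined encoder/decoder plugin class for fractions.Fraction. The class and the '_fraction_' key are hypothetical, and the encode_with and decode_with helpers are stand-ins written only for this example; they are not how plugins.encode and plugins.decode are implemented. The sketch assumes that to_json receives the object to encode and that a dictionary matches when it contains all keys of dict_signature.

```python
from fractions import Fraction


class FractionPlugin(object):
    """Hypothetical encoder/decoder plugin for fractions.Fraction."""
    plugin_name = "fractions.Fraction"
    plugin_descript = "encode/decode Fraction type"
    objclass = Fraction               # encoder interface
    dict_signature = ("_fraction_",)  # decoder interface

    def to_json(self, obj):
        return {"_fraction_": [obj.numerator, obj.denominator]}

    def from_json(self, dct):
        numerator, denominator = dct["_fraction_"]
        return Fraction(numerator, denominator)


def encode_with(plugin, obj, outtype="json"):
    """Stand-in dispatch: call the plugin's to_<outtype> method."""
    if not isinstance(obj, plugin.objclass):
        raise TypeError("plugin does not handle {0!r}".format(type(obj)))
    return getattr(plugin, "to_" + outtype)(obj)


def decode_with(plugin, dct, intype="json"):
    """Stand-in dispatch: call from_<intype> if the signature keys are present."""
    if not set(plugin.dict_signature).issubset(dct):
        return dct  # not this plugin's signature: leave untouched
    return getattr(plugin, "from_" + intype)(dct)
```

A class of this shape could then be passed to plugins.load_plugin_classes for the 'encoders' and 'decoders' categories, in the same way as the parser example above.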

Extended Examples

For more information, all functions contain doc-strings with tested examples.

Data Folders JSONisation

from jsonextended import ejson, edict, utils
path = utils.get_test_path()
ejson.jkeys(path)
['dir1', 'dir2', 'dir3']
jdict1 = ejson.to_dict(path)
edict.pprint(jdict1,depth=2)
dir1:
  dir1_1: {...}
  file1: {...}
  file2: {...}
dir2:
  file1: {...}
dir3:
edict.to_html(jdict1,depth=2)

To try the rendered JSON tree, as output in the Jupyter Notebook, go to: https://chrisjsewell.github.io/

Nested Dictionary Manipulation

jdict2 = ejson.to_dict(path,['dir1','file1'])
edict.pprint(jdict2,depth=1)
initial: {...}
meta: {...}
optimised: {...}
units: {...}
filtered = edict.filter_keys(jdict2,['vol*'],use_wildcards=True)
edict.pprint(filtered)
initial:
  crystallographic:
    volume: 924.62752781
  primitive:
    volume: 462.313764
optimised:
  crystallographic:
    volume: 1063.98960509
  primitive:
    volume: 531.994803
edict.pprint(edict.flatten(filtered))
(initial, crystallographic, volume):   924.62752781
(initial, primitive, volume):          462.313764
(optimised, crystallographic, volume): 1063.98960509
(optimised, primitive, volume):        531.994803
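
Wildcard filtering of this kind can be sketched with the standard fnmatch module. The function below is a simplified illustration and not the edict.filter_keys implementation; it assumes string keys and glob-style patterns, and it keeps a branch only if it leads to a matching key.

```python
from fnmatch import fnmatchcase


def filter_keys(dct, patterns):
    """Keep only the branches of dct that end in a key matching a pattern."""
    kept = {}
    for key, value in dct.items():
        if any(fnmatchcase(str(key), pattern) for pattern in patterns):
            kept[key] = value
        elif isinstance(value, dict):
            sub = filter_keys(value, patterns)
            if sub:  # drop branches without any matching key
                kept[key] = sub
    return kept
```

Branches such as meta, which contain no key starting with "vol", disappear from the result entirely.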

Units Schema

from jsonextended.units import apply_unitschema, split_quantities
withunits = apply_unitschema(filtered,{'volume':'angstrom^3'})
edict.pprint(withunits)
initial:
  crystallographic:
    volume: 924.62752781 angstrom ** 3
  primitive:
    volume: 462.313764 angstrom ** 3
optimised:
  crystallographic:
    volume: 1063.98960509 angstrom ** 3
  primitive:
    volume: 531.994803 angstrom ** 3
newunits = apply_unitschema(withunits,{'volume':'nm^3'})
edict.pprint(newunits)
initial:
  crystallographic:
    volume: 0.92462752781 nanometer ** 3
  primitive:
    volume: 0.462313764 nanometer ** 3
optimised:
  crystallographic:
    volume: 1.06398960509 nanometer ** 3
  primitive:
    volume: 0.531994803 nanometer ** 3
edict.pprint(split_quantities(newunits),depth=4)
initial:
  crystallographic:
    volume:
      magnitude: 0.92462752781
      units:     nanometer ** 3
  primitive:
    volume:
      magnitude: 0.462313764
      units:     nanometer ** 3
optimised:
  crystallographic:
    volume:
      magnitude: 1.06398960509
      units:     nanometer ** 3
  primitive:
    volume:
      magnitude: 0.531994803
      units:     nanometer ** 3
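
The conversion above is plain arithmetic: 1 angstrom is 0.1 nm, so 1 angstrom^3 is 0.001 nm^3. The sketch below reproduces it without pint, which the library uses for the real conversion; convert_key is a hypothetical helper that applies a fixed factor to every leaf stored under a given key.

```python
ANGSTROM_IN_NM = 0.1  # 1 angstrom = 0.1 nm


def convert_key(dct, factor, key="volume"):
    """Return a copy of dct in which every leaf stored under `key`
    is multiplied by `factor`."""
    out = {}
    for name, value in dct.items():
        if isinstance(value, dict):
            out[name] = convert_key(value, factor, key)
        elif name == key:
            out[name] = value * factor
        else:
            out[name] = value
    return out


# a volume scales with the cube of the length conversion factor
volume_factor = ANGSTROM_IN_NM ** 3
converted = convert_key({"initial": {"primitive": {"volume": 462.313764}}}, volume_factor)
```

This gives approximately 0.462313764 for the primitive volume, in line with the nm^3 values shown above.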

jsonextended's People

Contributors

chrisjsewell


jsonextended's Issues

consistent dict type check

standardize the test for dict across functions
isinstance and hasattr => is_dict function, probably using hasattr(obj, 'items') to make it more flexible
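
One possible shape for such a helper is sketched below. The name is_dict comes from the issue text; checking for callable items and keys attributes is an assumption about what "dict-like" should mean here.

```python
from collections import OrderedDict


def is_dict(obj):
    """Duck-typed check: anything exposing callable items() and keys()
    is treated as dict-like, whether or not it subclasses dict."""
    return (callable(getattr(obj, "items", None))
            and callable(getattr(obj, "keys", None)))


class Mapped(object):
    """A dict-like object that does not inherit from dict."""

    def keys(self):
        return ["a"]

    def items(self):
        return [("a", 1)]
```

A check of this kind accepts objects such as Mapped above, which isinstance(obj, dict) would reject.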

remove dict_ prefix

have all functions be able to accept dicts and jsons?

This would mean adding a type check to all functions.

LazyLoad on-file multi-indexing

If multiple items are passed to LazyLoad.__getitem__ and they index into a file, have a way of calling plugins.parse with a keys keyword, to allow for on-file indexing before returning the new object.

Mainly for using ijson. There would be two plugins for .json, and whichever one is loaded decides whether ijson is used.

Combine/Split Objects

like jsonextended.units combine/split but for arbitrary object class
dict_combine(class, arg_names=['name'], kwarg_names=['other name']) for class.__init__(*args, **kwargs)
split on class attributes

BasicParser.read_file error handling

have read_file produce dictionary, then only merge at end of read, when no errors have been thrown
then wouldn't have to use self.__init_file_keys
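
The parse-then-merge pattern suggested here can be sketched as follows. read_keypairs is a hypothetical function based on the key-pair file format shown earlier; it is not BasicParser code.

```python
def read_keypairs(lines, target):
    """Parse '<key> <value>' lines into a temporary dictionary and merge it
    into target only if every line parsed without error."""
    parsed = {}
    for number, line in enumerate(lines, 1):
        parts = line.split()
        if len(parts) != 2:
            raise ValueError("line {0}: expected '<key> <value>'".format(number))
        parsed[parts[0]] = parts[1]
    target.update(parsed)  # only reached when no error was raised
    return target
```

If any line is malformed, the exception is raised before the merge, so the target dictionary is left exactly as it was.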

dict_to_html defer to online version

The local version is not found, and the URL version does not render, on Python 2.7.

Check jsonrenderer.js has been found and rendered, else defer to online version

add documentation section about load speed

Parsing 26.5 MB JSON:

index 1 of 1000

get dictionary

on disk:

1 loop, best of 3: 299 ms per loop
maximum of 3: 103.960938 MB per loop

on disk (parsing number as Decimal):

1 loop, best of 3: 2.71 s per loop
maximum of 3: 179.109375 MB per loop

in memory:

10 loops, best of 3: 24.6 ms per loop
maximum of 3: 5.007812 MB per loop

in memory (parsing number as Decimal):

1 loop, best of 3: 18.1 s per loop
maximum of 3: 3.644531 MB per loop

get keys

on disk:

1 loop, best of 3: 294 ms per loop
maximum of 3: 104.574219 MB per loop

in memory:

1 loop, best of 3: 18.1 s per loop
maximum of 3: 3.644531 MB per loop

index 999 of 1000

get dictionary

on disk:

1 loop, best of 3: 303 ms per loop
maximum of 3: 101.046875 MB per loop

in memory:

1 loop, best of 3: 18.1 s per loop
maximum of 3: 3.777344 MB per loop

get keys

on disk:

1 loop, best of 3: 284 ms per loop
maximum of 3: 101.785156 MB per loop

in memory:

1 loop, best of 3: 18 s per loop
maximum of 3: 3.476562 MB per loop

Improve speed of on-disk json key access

json_keys: ijson.parse is really slow; maybe do a partial in_memory load, using ijson.items
or, if a json lazy loader is implemented ( #4 ), use that

     %timeit json_to_dict('test.json',['initial','crystallographic'], in_memory=True)
     100 loops, best of 3: 5.1 ms per loop
     %timeit json_to_dict('test.json',['initial','crystallographic'], in_memory=False)
     100 loops, best of 3: 4.78 ms per loop
     %timeit json_keys('test.json',['initial','crystallographic'], in_memory=True)
     100 loops, best of 3: 10.7 ms per loop
     %timeit json_keys('test.json',['initial','crystallographic'], in_memory=False)
     1 loop, best of 3: 697 ms per loop

BaseParser abstract class

change BasicParser to TextParser,
and have it as subclass of BaseParser, an abstract class that just contains methods:

  • read_file
  • get_dict
