chamkank / hone
Convert CSV to automatically nested JSON
License: MIT License
When using the CLI, it's sometimes useful to work with stdin/stdout instead of files. This allows further processing with e.g. the jq tool.
It would be nice to be able to do this:
$ echo 'name,age (years),weight (kg),birth day,birth month,birth year,adopted,adopted_since
Tommy,5,3.6,11,April,2011,TRUE,2012' | hone > output.json
$ hone input.csv
[
{
"adopted": "TRUE",
"adopted_since": "2012",
"age (years)": "5",
"birth": {
"day": "11",
"month": "April",
"year": "2011"
},
"name": "Tommy",
"weight (kg)": "3.6"
}
]
$ hone input.csv | jq '.[0]["name"]'
"Tommy"
If I use a delimiter | for example, and I have a column name like lvl1|lvl2, the resulting JSON structure looks like "lvl1|lvl2": <value>. I believe it would make more sense if the result were "lvl1": {"lvl2": <value>}. What is the reasoning behind this? As I needed it, I've prepared the code adjustments for this and I can create a pull request for it. Thanks!
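For reference, the proposed nesting can be sketched as follows (a standalone helper, not hone's actual code):

```python
def nest_key(flat_key, value, delimiter="|"):
    # Turn a delimited column name like 'lvl1|lvl2' into a nested dict:
    # {'lvl1': {'lvl2': value}}.
    parts = flat_key.split(delimiter)
    result = current = {}
    for part in parts[:-1]:
        current[part] = {}
        current = current[part]
    current[parts[-1]] = value
    return result
```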
I am trying to use this library to convert CSV to nested JSON, but I get the error below:

line 39, in populate_structure_with_data
    key_path = mapping[column_name]
KeyError: 'studentid'

When I try the script with example_a.csv it works fine; I am not sure what's going on.
Below is my code:

import hone

optional_arguments = {
    "delimiters": [","]
}
Hone = hone.Hone(**optional_arguments)
schema = Hone.get_schema('csvschema.json')  # nested JSON schema for input.csv
result = Hone.convert('data.csv', schema=schema)  # final structure, nested according to schema
Hi!
I'm trying to replicate the examples illustrated in the README, but I'm not able to. I tried to convert example_a.csv with the following code snippet:
import hone

optional_arguments = {
    "delimiters": [" ", ",", ";"]
}
Hone = hone.Hone(**optional_arguments)
schema = Hone.get_schema('./example_a.csv')
and the expected output should be:
[
{
"adopted": "TRUE",
"adopted_since": "2012",
"age (years)": "5",
"birth": {
"day": "11",
"month": "April",
"year": "2011"
},
"name": "Tommy",
"weight (kg)": "3.6"
},
{
"adopted": "FALSE",
"adopted_since": "N/A",
"age (years)": "2",
"birth": {
"day": "6",
"month": "May",
"year": "2015"
},
"name": "Clara",
"weight (kg)": "8.2"
},
{
"adopted": "TRUE",
"adopted_since": "2017",
"age (years)": "6",
"birth": {
"day": "21",
"month": "August",
"year": "2011"
},
"name": "Catnip",
"weight (kg)": "3.3"
},
{
"adopted": "TRUE",
"adopted_since": "2018",
"age (years)": "3",
"birth": {
"day": "18",
"month": "January",
"year": "2015"
},
"name": "Ciel",
"weight (kg)": "3.1"
}
]
but what I get is the following:
{'adopted_since': 'adopted_since',
'adopted': 'adopted',
'birth': {'year': 'birth year', 'month': 'birth month', 'day': 'birth day'},
'weight (kg)': 'weight (kg)',
'age (years)': 'age (years)',
'name': 'name'}
So basically, the cell values are not inserted into the dictionary. Did I do anything wrong or miss something in the code snippet?
Thanks for the help!
I'm trying to create a schema that will put a list of items from separate columns into a nested array:
"data": {
"tags": ["cashier", "burger flipper", "fry cook"]
}
Each item is in a different column in the source CSV. I tried using the same column header, 'data_tags', for each, but that only returns the first column with that header. I then tried creating a custom schema and setting the column headers to data_tags_1, data_tags_2, data_tags_3, but I don't think I understand how to use the schema - I always get a KeyError on the first column of the CSV.
Custom schema:
{"data": {"tags": ["data_tags_1", "data_tags_2", "data_tags_3"]}}
Any suggestions on how to accomplish a nested array?
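As a workaround, the columns could be collected into an array in a post-processing step (a sketch using the hypothetical data_tags_N column names from the question):

```python
def collect_tags(row, prefix="data_tags_"):
    # Gather data_tags_1, data_tags_2, ... columns from a flat CSV row
    # into a single nested array under "data" -> "tags".
    keys = sorted(k for k in row if k.startswith(prefix))
    return {"data": {"tags": [row[k] for k in keys]}}
```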
Is there a way to specify the type of a value when converting to JSON? I have CSV columns containing integers and booleans, but when I use this module to convert CSV to JSON, those values are output as strings.
{
"number": "240",
"isdown": "FALSE"
}
I am expecting:
{
"number": 240,
"isdown": false
}
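Until such a feature exists, a post-processing pass can coerce the strings (a sketch; the TRUE/FALSE spellings match the samples in this thread):

```python
def coerce(value):
    # Map TRUE/FALSE strings to booleans, then try int, then float;
    # anything else is returned unchanged.
    if value == "TRUE":
        return True
    if value == "FALSE":
        return False
    try:
        return int(value)
    except (TypeError, ValueError):
        pass
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

def coerce_record(record):
    # Apply coerce() recursively through nested dicts.
    return {
        k: coerce_record(v) if isinstance(v, dict) else coerce(v)
        for k, v in record.items()
    }
```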
Hi.
Congrats on your work.
Do you think it would be possible to use your conversion tool with our attached Firebase JSON configuration file?
Thanks.
Manuel
Type casting

Currently (v0.1.4), all CSV cell values are assumed to be strings and are encoded as such in the JSON output. For example, a cell containing 4.0 becomes "4.0".
However, according to the JSON standard (STD 90, i.e. RFC 8259), a JSON value can be any one of the following:

- string
- number ([ minus ] int [ frac ] [ exp ])
- null
- true
- false
- object (begin-object [ member *( value-separator member ) ] end-object)
- array (begin-array [ value *( value-separator value ) ] end-array)
A type casting feature was requested in #4, so this document describes how the feature will be implemented.
To start, each cell value can be safely evaluated using either ast.literal_eval (for strings, numbers, null, true, and false) or json.JSONDecoder (for objects and arrays). This will give us the Python representation of that value using the most appropriate type available. From there, we have to encode each type into the appropriate JSON data type. The details on how this will be done are described below.
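A rough sketch of that evaluation step (my own helper, not the final implementation):

```python
import ast
import json

def evaluate_cell(cell):
    # Containers go through json.loads; scalars through ast.literal_eval.
    # Anything that fails both is kept as the original string.
    s = cell.strip()
    if s[:1] in "{[":
        try:
            return json.loads(s)
        except json.JSONDecodeError:
            return cell
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return cell
```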
String

The handling of this type will not change.
Number

JSON supports integers, fractions, and exponents. Converting a number encoded as a string to an integer or float is trivial in Python, but we have to be careful to preserve scientific notation. We want 4E10 to be encoded as 4E10, not 40000000000.0.
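As a quick sketch of the detection step (the helper name is mine, and it assumes the cell already looks numeric):

```python
from decimal import Decimal

def parse_number(cell):
    # Wrap exponent-bearing values in Decimal so the notation survives:
    # str(Decimal("4E10")) == "4E+10", whereas float("4E10") prints 40000000000.0.
    if "e" in cell or "E" in cell:
        return Decimal(cell)
    return float(cell) if "." in cell else int(cell)
```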
Unfortunately, we may lose the original scientific notation when we convert the cell value to a float. And so when the values are being converted to JSON, the default JSONEncoder will not know to keep the number in scientific notation. Here is the solution that I propose:

- Values that parse as a float but also contain an exponent (e or E) will be wrapped in decimal.Decimal.
- JSONEncoder will be extended to support the encoding of decimal.Decimal so that we can preserve the original scientific notation of a value.

Literals (null, true, and false)
Currently, if a cell in a row of the CSV file is left out, its corresponding field is not included in the outputted JSON object. This is a sensible default. However, the user should be able to map values to null if there's a semantic reason for doing so. For example, the string "N/A" would be a good candidate for null replacement.
As I mentioned in #4, there is no standard way to represent a boolean value in a CSV file. Because of this, if the user wants cell values automatically converted to JSON boolean values, they must specify a mapping.
The takeaway from this section is that if the user wants to use null, true, or false in the outputted JSON, they must specify a literal name mapping for those values. For example:
{
    "true": [1, "1", "yes", "YES", "y", "Y", "true", "True", "T"],
    "false": [0, "0", "no", "NO", "n", "N", "false", "False", "F"],
    "null": ["N/A", "None", "none"]
}
The user will be able to express the above using individual flags for true
values, false
values, and null
values (instead of a single map for all literals).
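A sketch of how such a mapping could be applied per cell (flag plumbing omitted; the value sets are copied from the example above):

```python
TRUE_VALUES = {1, "1", "yes", "YES", "y", "Y", "true", "True", "T"}
FALSE_VALUES = {0, "0", "no", "NO", "n", "N", "false", "False", "F"}
NULL_VALUES = {"N/A", "None", "none"}

def cast_literal(cell):
    # Map a cell to True/False/None per the user-supplied literal sets;
    # unmapped values pass through unchanged.
    if cell in TRUE_VALUES:
        return True
    if cell in FALSE_VALUES:
        return False
    if cell in NULL_VALUES:
        return None
    return cell
```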
If type casting for objects is enabled, the assumption will be that any cell value that starts with an opening curly brace and ends with a closing curly brace will contain a JSON object. JSONDecoder
will be used to decode the object.
However, the Python's JSONDecoder
won't work out-of-the-box because of our need to preserve scientific notation in floats, and to parse boolean values as mentioned in the previous sections. Thus, we will extend the decoder similarly to how we will extend the encoder (see number section).
If type casting for arrays is enabled, the assumption will be that any cell value that starts with an opening square brace and ends with a closing square brace will contain a JSON array. We will handle this in the same way we handle objects - by using a custom JSON decoder that takes into account our special handling of boolean and float values.
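A minimal sketch of such a decoder, using the stdlib's parse_float hook rather than a full JSONDecoder subclass (my simplification, not the proposed implementation):

```python
import json
from decimal import Decimal

def decode_cell(cell):
    # json.loads calls parse_float with the raw text of every JSON float,
    # so routing it through Decimal preserves scientific notation.
    return json.loads(cell, parse_float=Decimal)
```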
--type-cast
-t --type-cast ["string", "int", "float", "object", "array", "true", "false", "null"]
- When possible, convert CSV values to one or more specified data types supported by JSON.
- The default is ["string", "int", "float", "object", "array"].
- If "true" is included, the values to convert to true must be specified using --true-values.
- If "false" is included, the values to convert to false must be specified using --false-values.
- If "null" is included, the values to convert to null must be specified using --null-values.
--true-values
-t --true-values [value1, value2, ...]
- Used to specify which values should be converted to true in the JSON output.
- Required when "true" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
--false-values
-t --false-values [value1, value2, ...]
- Used to specify which values should be converted to false in the JSON output.
- Required when "false" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
--null-values
-n --null-values [value1, value2, ...]
- Used to specify which values should be converted to null in the JSON output.
- Required when "null" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
This may change after #1 is implemented, so this will be finalized later.
Hi,
Your work is awesome. I was wondering if you could add support for appending inside the embedded structure.
For example: in the data set below, if the first two rows are otherwise identical, produce only one document and embed both entries into it.
name,age (years),weight (kg),birth day,birth month,birth year,adopted,adopted_since
Tommy,5,3.6,11,April,2011,TRUE,2012
Tommy,5,3.6,12,May,2012,TRUE,2012
Clara,2,8.2,6,May,2015,FALSE,N/A
Catnip,6,3.3,21,August,2011,TRUE,2017
Ciel,3,3.1,18,January,2015,TRUE,2018
Output:
[
{
"adopted_since": "2012",
"weight (kg)": "3.6",
"birth": [{
"year": "2011",
"day": "11",
"month": "April"
},{
"year": "2012",
"day": "12",
"month": "April"
}],
"age (years)": "5",
"name": "Tommy",
"adopted": "TRUE"
},
{
"adopted_since": "N/A",
"weight (kg)": "8.2",
"birth": {
"year": "2015",
"day": "6",
"month": "May"
},
"age (years)": "2",
"name": "Clara",
"adopted": "FALSE"
},
{
"adopted_since": "2017",
"weight (kg)": "3.3",
"birth": {
"year": "2011",
"day": "21",
"month": "August"
},
"age (years)": "6",
"name": "Catnip",
"adopted": "TRUE"
},
{
"adopted_since": "2018",
"weight (kg)": "3.1",
"birth": {
"year": "2015",
"day": "18",
"month": "January"
},
"age (years)": "3",
"name": "Ciel",
"adopted": "TRUE"
}
]
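One way to approximate the requested merge as a post-processing step over hone's output (my own helper, not an existing hone feature):

```python
from itertools import groupby

def merge_rows(rows, nested_key="birth"):
    # Collapse consecutive records that are identical apart from `nested_key`
    # into one record; the differing nested values become a list.
    def identity(row):
        return tuple(sorted((k, v) for k, v in row.items() if k != nested_key))

    merged = []
    for _, group in groupby(rows, key=identity):
        group = list(group)
        record = {k: v for k, v in group[0].items() if k != nested_key}
        values = [g[nested_key] for g in group]
        record[nested_key] = values if len(values) > 1 else values[0]
        merged.append(record)
    return merged
```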
Hi,
Maybe I missed something, but I assume some specific schemas are needed for the transformation, especially with nested elements.
This package currently does not allow multi-character delimiters (e.g. ||). Is there a reason for that? As I needed it, I've prepared a fix and can create a pull request for it.
Currently it looks like the JSON output is keyed by column name but in alphabetical order. Is there a way to keep the keys in the order they appear in the header, left to right?
Hi, is it possible to convert the CSV data below, using '.' as the delimiter, into JSON with an array nested inside an array?
CSV:
Products.name | Products.specifications.cost | Products.discount | Products.specifications.weight | Products.type
Onions | 50 | 0 | 1 | classic
Onions | 200 | 0 | 2 | classic
Tomato | 25 | 1 | 1 | classic
Broccoli | 50 | 0 | 0.5 | classic
JSON needed as output:
{
"Products": [
{
"name": "Onions",
"specification": [
{
"cost": "50",
"weight": "1"
},
{
"cost": "200",
"weight": "2"
}
],
"discount": "0",
"type": "classic"
},
{
"name": "Tomato",
"specification": [
{
"cost": "25",
"weight": "1"
}
],
"discount": "1",
"type": "classic"
},
{
"name": "Broccoli",
"specification": [
{
"cost": "50",
"weight": "0.5"
}
],
"discount": "0",
"type": "classic"
}
]
}
but I am getting this output from the Python library:
[
{
"Products": {
"discount": "0",
"name": "Onions",
"specification": {
"cost": "50",
"weight": "1"
},
"type": "classic"
}
},
{
"Products": {
"discount": "0",
"name": "Onions",
"specification": {
"cost": "200",
"weight": "2"
},
"type": "classic"
}
},
{
"Products": {
"discount": "1",
"name": "Tomato",
"specification": {
"cost": "25",
"weight": "1"
},
"type": "classic"
}
},
{
"Products": {
"discount": "0",
"name": "Broccoli",
"specification": {
"cost": "50",
"weight": "0.5"
},
"type": "classic"
}
}
]
Bug uncovered in #16
The delimiters that are used to generate the nested structure from column names are commas, underscores, and spaces. A user should be allowed to override the default delimiter(s) by passing in a Python list containing their delimiters of choice.
--delimiters
-d --delimiters [delimiter1, delimiter2, ...]
- Override the default delimiters for generating a nested structure from column names.
- The default value is [",", "_", " "].
- Input must be a Python-compatible list of strings.
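The splitting this flag controls can be sketched with a regex over the configured delimiters (a standalone illustration, not hone's internals); since each delimiter is escaped, multi-character delimiters such as || work too:

```python
import re

def split_column(name, delimiters=(",", "_", " ")):
    # Split a column name on any of the configured delimiters,
    # escaping each so regex metacharacters (e.g. "|") are treated literally.
    pattern = "|".join(map(re.escape, delimiters))
    return [part for part in re.split(pattern, name) if part]
```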
Right now only the optional arguments are documented. Positional arguments should also be documented.
from hone import Hone

def process_mapper():
    h = Hone()
    schema = h.get_schema('curlim__table__mapping.csv')  # returns nested JSON schema for input.csv
    result = h.convert('curlim__table__mapping.csv')  # returns converted JSON as Python dictionary

def main():
    process_mapper()

if __name__ == '__main__':
    main()
ERROR -
ImportError: cannot import name 'Hone' from 'hone' (/Users/user1/code-repo/test/datamapper/hone.py)
Disallow installation of module for Python 2
Hi,
I've just followed the install steps and can't seem to get past some missing stuff.
This is the message I get (on Mac, High Sierra 10.13.3):
File "/usr/local/bin/hone", line 7, in <module> from hone.__main__ import main File "/Library/Python/2.7/site-packages/hone/__init__.py", line 1, in <module> from hone.hone import Hone File "/Library/Python/2.7/site-packages/hone/hone.py", line 1, in <module> from hone.utils import csv_utils ImportError: No module named utils
Any ideas what I should do (differently)?
While hone automatically generates the nested schema to use when converting a CSV file to JSON, there are cases where you may want to modify the schema, or provide an entirely different schema for that matter.
This can be an optional parameter for the convert method:
import hone
Hone = hone.Hone()
schema = Hone.get_schema('path/to/input.csv')
result = Hone.convert('path/to/input.csv', schema=schema)
And a flag for the command-line interface:
-s [SCHEMA], --schema [SCHEMA]
Manually specify the schema that defines the structure
of the generated JSON, instead of having it
automatically generated. [SCHEMA] must be a valid JSON
object encoded as a string.
I wrote my own functions for parsing CSV files in /utils/csv_utils.py, but it would be better to use the built-in csv module.
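A sketch of what the csv-based replacement could look like (the function name is mine):

```python
import csv
import io

def parse_csv(text):
    # Parse CSV text into (column_names, rows) using the standard library,
    # replacing a hand-rolled parser; csv.reader handles quoting and
    # embedded delimiters correctly.
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, list(reader)
```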