chamkank / hone
Convert CSV to automatically nested JSON
License: MIT License
When using the CLI, it's sometimes useful to work with stdin/stdout instead of files. This allows further processing with e.g. the jq tool.
It would be nice to be able to do this:
$ echo 'name,age (years),weight (kg),birth day,birth month,birth year,adopted,adopted_since
Tommy,5,3.6,11,April,2011,TRUE,2012' | hone > output.json
$ hone input.csv
[
{
"adopted": "TRUE",
"adopted_since": "2012",
"age (years)": "5",
"birth": {
"day": "11",
"month": "April",
"year": "2011"
},
"name": "Tommy",
"weight (kg)": "3.6"
}
]
$ hone input.csv | jq '.[0]["name"]'
"Tommy"
If I use a delimiter | for example, and I have a column name like lvl1|lvl2, the resulting JSON structure looks like "lvl1|lvl2": <value>. I believe it would make more sense if the result were "lvl1": {"lvl2": <value>}. What is the reasoning behind this? As I needed it, I've prepared the code adjustments for this and I can create a pull request for it. Thanks!
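For reference, the proposed nesting can be sketched as follows (a standalone helper, not hone's actual code):

```python
def nest_key(flat_key, value, delimiter="|"):
    # Turn a delimited column name like 'lvl1|lvl2' into a nested dict:
    # {'lvl1': {'lvl2': value}}.
    parts = flat_key.split(delimiter)
    result = current = {}
    for part in parts[:-1]:
        current[part] = {}
        current = current[part]
    current[parts[-1]] = value
    return result
```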
I am trying to use this library to convert CSV to nested JSON, but I get the error below:

line 39, in populate_structure_with_data
    key_path = mapping[column_name]
KeyError: 'studentid'

When I try the script with example_a.csv it works fine; I am not sure what's going on.
Below is my code:

import hone

optional_arguments = {
    "delimiters": [","]
}
Hone = hone.Hone(**optional_arguments)
schema = Hone.get_schema('csvschema.json')  # nested JSON schema for input.csv
result = Hone.convert('data.csv', schema=schema)  # final structure, nested according to schema
Hi!
I'm trying to replicate the examples illustrated in the README, but I'm not able to. I tried to convert example_a.csv with the following code snippet:
import hone

optional_arguments = {
    "delimiters": [" ", ",", ";"]
}
Hone = hone.Hone(**optional_arguments)
schema = Hone.get_schema('./example_a.csv')
and the expected output should be:
[
{
"adopted": "TRUE",
"adopted_since": "2012",
"age (years)": "5",
"birth": {
"day": "11",
"month": "April",
"year": "2011"
},
"name": "Tommy",
"weight (kg)": "3.6"
},
{
"adopted": "FALSE",
"adopted_since": "N/A",
"age (years)": "2",
"birth": {
"day": "6",
"month": "May",
"year": "2015"
},
"name": "Clara",
"weight (kg)": "8.2"
},
{
"adopted": "TRUE",
"adopted_since": "2017",
"age (years)": "6",
"birth": {
"day": "21",
"month": "August",
"year": "2011"
},
"name": "Catnip",
"weight (kg)": "3.3"
},
{
"adopted": "TRUE",
"adopted_since": "2018",
"age (years)": "3",
"birth": {
"day": "18",
"month": "January",
"year": "2015"
},
"name": "Ciel",
"weight (kg)": "3.1"
}
]
but what I get is the following:
{'adopted_since': 'adopted_since',
'adopted': 'adopted',
'birth': {'year': 'birth year', 'month': 'birth month', 'day': 'birth day'},
'weight (kg)': 'weight (kg)',
'age (years)': 'age (years)',
'name': 'name'}
So basically, the cell values are not inserted into the dictionary. Did I do anything wrong or miss something in the code snippet?
Thanks for the help!
I'm trying to create a schema that will put a list of items from separate columns into a nested array:
"data": {
"tags": ["cashier", "burger flipper", "fry cook"]
}
Each item is in a different column in the source CSV. I tried using the same column header, 'data_tags', for each, but that only returns the first column with that header. I then tried creating a custom schema and setting the column headers to data_tags_1, data_tags_2, data_tags_3, but I don't think I understand how to use the schema - I always get a KeyError on the first column of the CSV.
Custom schema:
{"data": {"tags": ["data_tags_1", "data_tags_2", "data_tags_3"]}}
Any suggestions on how to accomplish a nested array?
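As a workaround, the columns could be collected into an array in a post-processing step (a sketch using the hypothetical data_tags_N column names from the question):

```python
def collect_tags(row, prefix="data_tags_"):
    # Gather data_tags_1, data_tags_2, ... columns from a flat CSV row
    # into a single nested array under "data" -> "tags".
    keys = sorted(k for k in row if k.startswith(prefix))
    return {"data": {"tags": [row[k] for k in keys]}}
```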
Is there a way to specify the type of a value when converting to JSON? I have CSV columns containing integers and booleans, but when I use this module to convert CSV to JSON, those values are output as strings.
{
"number": "240",
"isdown": "FALSE"
}
I am expecting:
{
"number": 240,
"isdown": false
}
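Until such a feature exists, a post-processing pass can coerce the strings (a sketch; the TRUE/FALSE spellings match the samples in this thread):

```python
def coerce(value):
    # Map TRUE/FALSE strings to booleans, then try int, then float;
    # anything else is returned unchanged.
    if value == "TRUE":
        return True
    if value == "FALSE":
        return False
    try:
        return int(value)
    except (TypeError, ValueError):
        pass
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

def coerce_record(record):
    # Apply coerce() recursively through nested dicts.
    return {
        k: coerce_record(v) if isinstance(v, dict) else coerce(v)
        for k, v in record.items()
    }
```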
Hi.
Congrats on your work.
Do you think it would be possible to use your conversion tool with our attached Firebase JSON configuration file?
Thanks.
Manuel
Type casting

Currently (v0.1.4), all CSV cell values are assumed to be strings and are encoded as such in the JSON output. For example, a cell containing 4.0 becomes "4.0".
However, according to the JSON standard (STD 90, i.e. RFC 8259), a JSON value can be any one of the following:

- string
- number ([ minus ] int [ frac ] [ exp ])
- null
- true
- false
- object (begin-object [ member *( value-separator member ) ] end-object)
- array (begin-array [ value *( value-separator value ) ] end-array)
A type casting feature was requested in #4, so this document describes how the feature will be implemented.
To start, each cell value can be safely evaluated using either ast.literal_eval (for strings, numbers, null, true, and false) or json.JSONDecoder (for objects and arrays). This will give us the Python representation of that value using the most appropriate type available. From there, we have to encode each type into the appropriate JSON data type. The details on how this will be done are described below.
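A rough sketch of that evaluation step (my own helper, not the final implementation):

```python
import ast
import json

def evaluate_cell(cell):
    # Containers go through json.loads; scalars through ast.literal_eval.
    # Anything that fails both is kept as the original string.
    s = cell.strip()
    if s[:1] in "{[":
        try:
            return json.loads(s)
        except json.JSONDecodeError:
            return cell
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return cell
```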
String

The handling of this type will not change.
Number

JSON supports integers, fractions, and exponents. Converting a number encoded as a string to an integer or float is trivial in Python, but we have to be careful to preserve scientific notation. We want 4E10 to be encoded as 4E10, not 40000000000.0.
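As a quick sketch of the detection step (the helper name is mine, and it assumes the cell already looks numeric):

```python
from decimal import Decimal

def parse_number(cell):
    # Wrap exponent-bearing values in Decimal so the notation survives:
    # str(Decimal("4E10")) == "4E+10", whereas float("4E10") prints 40000000000.0.
    if "e" in cell or "E" in cell:
        return Decimal(cell)
    return float(cell) if "." in cell else int(cell)
```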
Unfortunately, we may lose the original scientific notation when we convert the cell value to a float. And so when the values are being converted to JSON, the default JSONEncoder will not know to keep the number in scientific notation. Here is the solution that I propose:

- Values that parse as a float but also contain an exponent (e or E) will be wrapped in decimal.Decimal.
- JSONEncoder will be extended to support the encoding of decimal.Decimal so that we can preserve the original scientific notation of a value.

Literals (null, true, and false)
Currently, if a cell in a row of the CSV file is left out, its corresponding field is not included in the outputted JSON object. This is a sensible default. However, the user should be able to map values to null if there's a semantic reason for doing so. For example, the string "N/A" would be a good candidate for null replacement.
As I mentioned in #4, there is no standard way to represent a boolean value in a CSV file. Because of this, if the user wants cell values automatically converted to JSON boolean values, they must specify a mapping.
The takeaway from this section is that if the user wants to use null, true, or false in the outputted JSON, they must specify a literal name mapping for those values. For example:
{
    "true": [1, "1", "yes", "YES", "y", "Y", "true", "True", "T"],
    "false": [0, "0", "no", "NO", "n", "N", "false", "False", "F"],
    "null": ["N/A", "None", "none"]
}
The user will be able to express the above using individual flags for true
values, false
values, and null
values (instead of a single map for all literals).
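A sketch of how such a mapping could be applied per cell (flag plumbing omitted; the value sets are copied from the example above):

```python
TRUE_VALUES = {1, "1", "yes", "YES", "y", "Y", "true", "True", "T"}
FALSE_VALUES = {0, "0", "no", "NO", "n", "N", "false", "False", "F"}
NULL_VALUES = {"N/A", "None", "none"}

def cast_literal(cell):
    # Map a cell to True/False/None per the user-supplied literal sets;
    # unmapped values pass through unchanged.
    if cell in TRUE_VALUES:
        return True
    if cell in FALSE_VALUES:
        return False
    if cell in NULL_VALUES:
        return None
    return cell
```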
If type casting for objects is enabled, the assumption will be that any cell value that starts with an opening curly brace and ends with a closing curly brace will contain a JSON object. JSONDecoder
will be used to decode the object.
However, the Python's JSONDecoder
won't work out-of-the-box because of our need to preserve scientific notation in floats, and to parse boolean values as mentioned in the previous sections. Thus, we will extend the decoder similarly to how we will extend the encoder (see number section).
If type casting for arrays is enabled, the assumption will be that any cell value that starts with an opening square brace and ends with a closing square brace will contain a JSON array. We will handle this in the same way we handle objects - by using a custom JSON decoder that takes into account our special handling of boolean and float values.
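A minimal sketch of such a decoder, using the stdlib's parse_float hook rather than a full JSONDecoder subclass (my simplification, not the proposed implementation):

```python
import json
from decimal import Decimal

def decode_cell(cell):
    # json.loads calls parse_float with the raw text of every JSON float,
    # so routing it through Decimal preserves scientific notation.
    return json.loads(cell, parse_float=Decimal)
```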
--type-cast
-t --type-cast ["string", "int", "float", "object", "array", "true", "false", "null"]
- When possible, convert CSV values to one or more specified data types supported by JSON.
- The default is ["string", "int", "float", "object", "array"].
- If "true" is included, the values to convert to true must be specified using --true-values.
- If "false" is included, the values to convert to false must be specified using --false-values.
- If "null" is included, the values to convert to null must be specified using --null-values.
--true-values
-t --true-values [value1, value2, ...]
- Used to specify which values should be converted to true in the JSON output.
- Required when "true" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
--false-values
-t --false-values [value1, value2, ...]
- Used to specify which values should be converted to false in the JSON output.
- Required when "false" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
--null-values
-n --null-values [value1, value2, ...]
- Used to specify which values should be converted to null in the JSON output.
- Required when "null" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
This may change after #1 is implemented, so this will be finalized later.
Hi,
Your work is awesome. I was wondering if you could add support for appending inside the embedded structure.
For example: in the data set below, if the first two rows are otherwise identical, produce only one document and embed both entries into it.
name,age (years),weight (kg),birth day,birth month,birth year,adopted,adopted_since
Tommy,5,3.6,11,April,2011,TRUE,2012
Tommy,5,3.6,12,May,2012,TRUE,2012
Clara,2,8.2,6,May,2015,FALSE,N/A
Catnip,6,3.3,21,August,2011,TRUE,2017
Ciel,3,3.1,18,January,2015,TRUE,2018
Output:
[
{
"adopted_since": "2012",
"weight (kg)": "3.6",
"birth": [{
"year": "2011",
"day": "11",
"month": "April"
},{
"year": "2012",
"day": "12",
"month": "April"
}],
"age (years)": "5",
"name": "Tommy",
"adopted": "TRUE"
},
{
"adopted_since": "N/A",
"weight (kg)": "8.2",
"birth": {
"year": "2015",
"day": "6",
"month": "May"
},
"age (years)": "2",
"name": "Clara",
"adopted": "FALSE"
},
{
"adopted_since": "2017",
"weight (kg)": "3.3",
"birth": {
"year": "2011",
"day": "21",
"month": "August"
},
"age (years)": "6",
"name": "Catnip",
"adopted": "TRUE"
},
{
"adopted_since": "2018",
"weight (kg)": "3.1",
"birth": {
"year": "2015",
"day": "18",
"month": "January"
},
"age (years)": "3",
"name": "Ciel",
"adopted": "TRUE"
}
]
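One way to approximate the requested merge as a post-processing step over hone's output (my own helper, not an existing hone feature):

```python
from itertools import groupby

def merge_rows(rows, nested_key="birth"):
    # Collapse consecutive records that are identical apart from `nested_key`
    # into one record; the differing nested values become a list.
    def identity(row):
        return tuple(sorted((k, v) for k, v in row.items() if k != nested_key))

    merged = []
    for _, group in groupby(rows, key=identity):
        group = list(group)
        record = {k: v for k, v in group[0].items() if k != nested_key}
        values = [g[nested_key] for g in group]
        record[nested_key] = values if len(values) > 1 else values[0]
        merged.append(record)
    return merged
```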
Hi,
Maybe I missed something, but I assume some specific schemas are needed for the transformation, especially with nested elements.
This package currently does not allow multi-character delimiters (e.g. ||). Is there a reason for that? As I needed it, I've prepared a fix and can create a pull request for it.
Currently it looks like the JSON output is keyed by column name but in alphabetical order. Is there a way to keep the keys in the order they appear in the header, left to right?
Hi, is it possible to convert the CSV data below, using '.' as the delimiter, into JSON with an array nested inside an array?
CSV:
Products.name | Products.specifications.cost | Products.discount | Products.specifications.weight | Products.type
Onions | 50 | 0 | 1 | classic
Onions | 200 | 0 | 2 | classic
Tomato | 25 | 1 | 1 | classic
Broccoli | 50 | 0 | 0.5 | classic
JSON needed as output:
{
"Products": [
{
"name": "Onions",
"specification": [
{
"cost": "50",
"weight": "1"
},
{
"cost": "200",
"weight": "2"
}
],
"discount": "0",
"type": "classic"
},
{
"name": "Tomato",
"specification": [
{
"cost": "25",
"weight": "1"
}
],
"discount": "1",
"type": "classic"
},
{
"name": "Broccoli",
"specification": [
{
"cost": "50",
"weight": "0.5"
}
],
"discount": "0",
"type": "classic"
}
]
}
but I am getting this output from the Python library:
[
{
"Products": {
"discount": "0",
"name": "Onions",
"specification": {
"cost": "50",
"weight": "1"
},
"type": "classic"
}
},
{
"Products": {
"discount": "0",
"name": "Onions",
"specification": {
"cost": "200",
"weight": "2"
},
"type": "classic"
}
},
{
"Products": {
"discount": "1",
"name": "Tomato",
"specification": {
"cost": "25",
"weight": "1"
},
"type": "classic"
}
},
{
"Products": {
"discount": "0",
"name": "Broccoli",
"specification": {
"cost": "50",
"weight": "0.5"
},
"type": "classic"
}
}
]
Bug uncovered in #16
The delimiters that are used to generate the nested structure from column names are commas, underscores, and spaces. A user should be allowed to override the default delimiter(s) by passing in a Python list containing their delimiters of choice.
--delimiters
-d --delimiters [delimiter1, delimiter2, ...]
- Override the default delimiters for generating a nested structure from column names.
- The default value is [",", "_", " "].
- Input must be a Python-compatible list of strings.
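The splitting this flag controls can be sketched with a regex over the configured delimiters (a standalone illustration, not hone's internals); since each delimiter is escaped, multi-character delimiters such as || work too:

```python
import re

def split_column(name, delimiters=(",", "_", " ")):
    # Split a column name on any of the configured delimiters,
    # escaping each so regex metacharacters (e.g. "|") are treated literally.
    pattern = "|".join(map(re.escape, delimiters))
    return [part for part in re.split(pattern, name) if part]
```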
Right now only the optional arguments are documented. Positional arguments should also be documented.
from hone import Hone

def process_mapper():
    h = Hone()
    schema = h.get_schema('curlim__table__mapping.csv')  # returns nested JSON schema for input.csv
    result = h.convert('curlim__table__mapping.csv')  # returns converted JSON as Python dictionary

def main():
    process_mapper()

if __name__ == '__main__':
    main()
ERROR -
ImportError: cannot import name 'Hone' from 'hone' (/Users/user1/code-repo/test/datamapper/hone.py)
Disallow installation of module for Python 2
Hi,
I've just followed the install steps and can't seem to get past some missing stuff.
This is the message I get (on Mac, High Sierra 10.13.3):
File "/usr/local/bin/hone", line 7, in <module> from hone.__main__ import main File "/Library/Python/2.7/site-packages/hone/__init__.py", line 1, in <module> from hone.hone import Hone File "/Library/Python/2.7/site-packages/hone/hone.py", line 1, in <module> from hone.utils import csv_utils ImportError: No module named utils
Any ideas what I should do (differently)?
While hone automatically generates the nested schema to use when converting a CSV file to JSON, there are cases where you may want to modify the schema, or provide an entirely different schema for that matter.
This can be an optional parameter for the convert method:
import hone
Hone = hone.Hone()
schema = Hone.get_schema('path/to/input.csv')
result = Hone.convert('path/to/input.csv', schema=schema)
And a flag for the command-line interface:
-s [SCHEMA], --schema [SCHEMA]
Manually specify the schema that defines the structure
of the generated JSON, instead of having it
automatically generated. [SCHEMA] must be a valid JSON
object encoded as a string.
I wrote my own functions for parsing CSV files in /utils/csv_utils.py, but it would be better to use the built-in csv module.
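A sketch of what the csv-based replacement could look like (the function name is mine):

```python
import csv
import io

def parse_csv(text):
    # Parse CSV text into (column_names, rows) using the standard library,
    # replacing a hand-rolled parser; csv.reader handles quoting and
    # embedded delimiters correctly.
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, list(reader)
```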