
pandavro's Introduction

pandavro


The interface between Apache Avro and pandas DataFrame.

Installation

pandavro is available to install from PyPI.

$ pip install pandavro

Description

It provides pandas-like APIs:

  • read_avro
    • Read the records from an Avro file into a pandas DataFrame, using fastavro.
  • to_avro
    • Write the rows of a pandas DataFrame to an Avro file, inferring the schema automatically.
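For instance, a minimal round-trip sketch using both functions (the file name is illustrative):

import pandas as pd
import pandavro as pdx

df = pd.DataFrame({"x": [1, 2, 3]})
pdx.to_avro("data.avro", df)        # write, inferring the schema
df2 = pdx.read_avro("data.avro")    # read back into a DataFrame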

What can and can't pandavro do?

Avro can represent the following kinds of types:

  • Primitive types (null, bool, int etc.)
  • Complex types (records, arrays, maps etc.)
  • Logical types (annotated primitive/complex type to represent e.g. datetime)

When converting to Avro, pandavro will try to infer the schema. It will output a non-nested schema, without any indexes set on the dataframe, and it will not try to infer whether any column is nullable, so all columns are set as nullable, i.e. a boolean will be encoded in the Avro schema as ['null', 'boolean'].
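For example, a sketch of inspecting the inferred schema with schema_infer (the generated record name in the output is an assumption):

import pandas as pd
import pandavro as pdx

df = pd.DataFrame({"flag": [True, False]})

# Every column is wrapped in a nullable union, e.g. the 'flag' field
# should come out as {'name': 'flag', 'type': ['null', 'boolean']}.
print(pdx.schema_infer(df))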

Pandavro can handle these primitive types:

| Numpy/pandas type                             | Avro primitive type |
|-----------------------------------------------|---------------------|
| np.bool_                                      | boolean             |
| np.float32                                    | float               |
| np.float64                                    | double              |
| np.unicode_                                   | string              |
| np.object_                                    | string              |
| np.int8, np.int16, np.int32                   | int                 |
| np.uint8, np.uint16, np.uint32                | "unsigned" int*     |
| np.uint64                                     | "unsigned" long*    |
| np.int64, pd.Int64Dtype                       | long                |
| pd.Int8Dtype, pd.Int16Dtype, pd.Int32Dtype    | int                 |
| pd.UInt8Dtype, pd.UInt16Dtype, pd.UInt32Dtype | "unsigned" int*     |
| pd.StringDtype**                              | string              |
| pd.BooleanDtype**                             | boolean             |

* We represent the unsigned versions of these integers by adding the non-standard "unsigned" flag as such: {'type': 'int', 'unsigned': True}. Pandas 0.24 added support for nullable integers. Writing pd.UInt64Dtype is not supported by fastavro.

** Pandas 1.0.0 added support for nullable string and boolean datatypes.
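As a quick illustration of the non-standard unsigned flag (a sketch; the exact shape of the union in the output is an assumption):

import numpy as np
import pandas as pd
import pandavro as pdx

df = pd.DataFrame({"count": np.array([1, 2, 3], dtype=np.uint32)})

# The uint32 column should map to something like
# ['null', {'type': 'int', 'unsigned': True}].
print(pdx.schema_infer(df))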

If a boolean column includes empty values, pandas classifies the column as having a dtype of object - this is accounted for in complex column handling.

And these complex types - all complex types other than 'fixed' will be classified by pandas as having a dtype of object, so their underlying Python types are used to determine the Avro type:

| Numpy/Python type              | Avro complex type |
|--------------------------------|-------------------|
| dict, collections.OrderedDict  | record            |
| list                           | array             |
| np.void                        | fixed             |

Record and array types can be arbitrarily nested within each other.

The schema definition of a record requires a unique name for the record, separate from the column itself. This does not map to any concept in pandas, so we simply append '_record' and a number to the original column name, ensuring that there are no duplicate 'name' values.
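A sketch of what this looks like for a nested column (the exact generated record name is an assumption based on the rule above):

import pandas as pd
import pandavro as pdx

df = pd.DataFrame({"point": [{"x": 1, "y": 2}, {"x": 3, "y": 4}]})

# The 'point' field should become a record whose generated name
# combines the column name, '_record', and a disambiguating number.
print(pdx.schema_infer(df))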

The remaining Avro complex types are not currently supported for the following reasons:

  1. Enum: The closest pandas type to Avro's enum type is pd.Categorical, but it still is not a complete match. Possible values of the enum type can only be alphanumeric strings, whereas pd.Categorical values have no such limitation.
  2. Map: No strictly matching concept in Python/pandas - Python dictionaries can have arbitrarily typed keys. The functionality can essentially be achieved with the record type.
  3. Union: Any column with mixed types (other than empty values/NoneType) is treated by pandas as having a dtype of object and will be written as strings. It would be difficult to deterministically infer multiple allowed data types based solely on a column's contents.

And these logical types:

| Numpy/pandas type                               | Avro logical type                  |
|-------------------------------------------------|------------------------------------|
| np.datetime64, pd.DatetimeTZDtype, pd.Timestamp | timestamp-micros/timestamp-millis* |

Note that the timestamp must not contain any timezone (it must be naive) because Avro does not support timezones. Timestamps are encoded as microseconds by default, but can be encoded in milliseconds by passing times_as_micros=False.

* If passed to_avro(..., times_as_micros=False), this has a millisecond resolution.

Due to an inherent design choice in fastavro, it interprets a naive datetime in the system's timezone before serializing it. This has the consequence that your naive datetime will not correctly roundtrip to and from an Avro file. Always indicate a timezone to avoid the system timezone introducing problems.
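For example, a sketch of writing timezone-aware timestamps with millisecond resolution (the file name is illustrative):

import pandas as pd
import pandavro as pdx

df = pd.DataFrame({
    "ts": pd.to_datetime(["2021-01-01 12:00", "2021-06-01 08:30"]).tz_localize("UTC"),
})

# times_as_micros=False encodes timestamps as timestamp-millis
# instead of the default timestamp-micros.
pdx.to_avro("timestamps.avro", df, times_as_micros=False)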

If you don't want pandavro to infer the schema but instead define it yourself, pass it using the schema kwarg to to_avro.
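A sketch of passing an explicit schema (the record name and field list are illustrative and must match your dataframe):

import pandas as pd
import pandavro as pdx

df = pd.DataFrame({"Value": [1.0, None, 3.5]})

schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "Value", "type": ["null", "double"]},
    ],
}

pdx.to_avro("explicit.avro", df, schema=schema)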

Loading Pandas nullable datatypes

The nullable datatypes indicated in the table above are easily written to Avro, but loading them introduces ambiguity, as either the old default datatypes or the new nullable ones could be used. We solve this with a special keyword that forces conversion to the new NA-supporting datatypes when loading:

import pandavro as pdx

# Load datatypes as NA-compatible datatypes where possible
pdx.read_avro(path, na_dtypes=True)

This is different from convert_dtypes: it does not infer the datatype from the actual values, but looks at the Avro schema, so the conversion is deterministic and independent of the values.

Also note that, in "normal" mode, numpy int/uint dtypes are all read back as np.int64 due to how fastavro reads them. (This could be worked around by converting types after loading; PRs welcome.) In na_dtypes=True mode they are loaded correctly as Pandas NA-dtypes, but with no less than 32 bits of resolution (less is not supported by Avro, so we cannot infer it from the schema).
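A sketch of the difference (the exact dtypes shown are assumptions based on the behaviour described above):

import pandavro as pdx

df = pdx.read_avro(path)                     # int columns come back as np.int64
df_na = pdx.read_avro(path, na_dtypes=True)  # e.g. Int32 where the schema says 'int'
print(df.dtypes)
print(df_na.dtypes)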

Example

See tests/pandavro_test.py for more examples.

import os
import numpy as np
import pandas as pd
import pandavro as pdx

OUTPUT_PATH = '{}/example.avro'.format(os.path.dirname(__file__))


def main():
    df = pd.DataFrame({
        "Boolean": [True, False, True, False],
        "pdBoolean": pd.Series([True, None, True, False], dtype=pd.BooleanDtype()),
        "Float64": np.random.randn(4),
        "Int64": np.random.randint(0, 10, 4),
        "pdInt64":  pd.Series(list(np.random.randint(0, 10, 3)) + [None], dtype=pd.Int64Dtype()),
        "String": ['foo', 'bar', 'foo', 'bar'],
        "pdString": pd.Series(['foo', 'bar', 'foo', None], dtype=pd.StringDtype()),
        "DateTime64": [pd.Timestamp('20190101'), pd.Timestamp('20190102'),
                       pd.Timestamp('20190103'), pd.Timestamp('20190104')]
    })

    pdx.to_avro(OUTPUT_PATH, df)
    saved = pdx.read_avro(OUTPUT_PATH)
    print(saved)


if __name__ == '__main__':
    main()

pandavro's People

Contributors

alantaranti, dargueta, floscha, kristofarkas, lordgrenville, marctorsoc, ruben-trdj, slemouzy, streitl, synapticarbors, the-fonz, ynqa, yudetamago


pandavro's Issues

from_records() got an unexpected keyword argument 'na_dtypes'

Hi,

First of all: many thanks for pandavro! It's incredibly useful in day-to-day data operations.

When using read_avro() with na_dtypes=True, I get the following TypeError, using Pandas 1.3.5:

from_records() got an unexpected keyword argument 'na_dtypes'

Will post full trace below.

I'd like to humbly ask whether this is a known issue and whether there is a workaround. If it is a new issue, I'm willing to help fix it. Any pointers to get started are deeply appreciated.

Full command:

df = pdx.read_avro('./test.avro', na_dtypes=True)

Full trace:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_4020/1232288353.py in <module>
----> 1 df = pdx.read_avro('./test.avro', na_dtypes=True)

/opt/conda/lib/python3.7/site-packages/pandavro/__init__.py in read_avro(file_path_or_buffer, schema, **kwargs)
    194     if isinstance(file_path_or_buffer, six.string_types):
    195         with open(file_path_or_buffer, 'rb') as f:
--> 196             return __file_to_dataframe(f, schema, **kwargs)
    197     else:
    198         return __file_to_dataframe(file_path_or_buffer, schema, **kwargs)

/opt/conda/lib/python3.7/site-packages/pandavro/__init__.py in __file_to_dataframe(f, schema, **kwargs)
    177 def __file_to_dataframe(f, schema, **kwargs):
    178     reader = fastavro.reader(f, reader_schema=schema)
--> 179     return pd.DataFrame.from_records(list(reader), **kwargs)
    180 
    181 

TypeError: from_records() got an unexpected keyword argument 'na_dtypes'

Auto-deploy on merge to master?

I see that the config is

deploy:
  provider: pypi
  user: pyncha
  password:
    secure: ****
  on:
    tags: true
    python: 3.9

so it doesn't auto-release https://pypi.org/project/pandavro/#history. Not being an expert in travis, it looks to me like it will only release a new version once you manually add a tag to a branch, as explained in https://docs.travis-ci.com/user/deployment/pypi/#deploying-tags. Wouldn't it be easier just to release every time you merge to master? I.e.:

  on:
    branch: master

as explained here: https://docs.travis-ci.com/user/deployment/pypi/#deploying-specific-branches

WDYT?

allow for process_record() while reading in avro

Feature request:
Could you allow for a process_record function while reading in Avro? Here is a suggestion:

def __file_to_dataframe(f, schema, process_record=None, **kwargs):
    reader = fastavro.reader(f, reader_schema=schema)
    if process_record:
        records = [process_record(r) for r in reader]
    else:
        records = list(reader)
    return pd.DataFrame.from_records(records, **kwargs)

Does the fastavro dependency version need to be pinned?

setup.py has:

    install_requires=[
        # fixed versions.
        'fastavro==1.5.1',
        'pandas>=1.1',
        # https://pandas.pydata.org/pandas-docs/version/1.1/getting_started/install.html#dependencies
        'numpy>=1.15.4',
    ],

This causes a dependency resolution failure for me because I'm using another package that requires fastavro>=1.5.4.

Would it be possible to relax that requirement to 'fastavro>=1.5.1'?

Add support for numpy 2.0

Numpy 2.0 just dropped and came with changes that broke pandavro. When importing pandavro I now get the following error:

AttributeError: `np.unicode_` was removed in the NumPy 2.0 release. Use `np.str_` instead.

There might be other breaking changes affecting this package, I did not investigate it. The full numpy 2.0 release notes can be found here.

Problem with datatype datetime64[ns]

I'm having an issue trying to convert an Apache Parquet file into an Apache Avro file.

This is the code:

import pyarrow.parquet as pq
import pandavro as pdx

table = pq.read_table('/media/sf_AWS/kafka/acciones_postcorte.parq')
pdx.to_avro('opers.avro', table.to_pandas())

This is the schema of the file:

divpol: string
division: string
poliza: string
asignacion: string
num_asignacion: string
f_asignacion: timestamp[ms]
campana: string
campanacontable: string
despacho_empresa: string
municipio: string
deudagestionar: double
deudavencida: double
d_oap: double
fresultado: timestamp[ms]
resultado: string
gestor: string
captura: timestamp[ms]
gpo1: string
toap: string
visitado: int64
visitado_aus: int64
ps_ano: timestamp[ms]
dg_filepath: string
dg_date: timestamp[ms]
dg_schema_version: int64
index_level_0: int64
metadata


{b'pandas': b'{"index_columns": ["index_level_0"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "divpol", "field_name": "divpol", "pandas_type": "uni'
b'code", "numpy_type": "object", "metadata": null}, {"name": "divi'
b'sion", "field_name": "division", "pandas_type": "unicode", "nump'
b'y_type": "object", "metadata": null}, {"name": "poliza", "field_'
b'name": "poliza", "pandas_type": "unicode", "numpy_type": "object'
b'", "metadata": null}, {"name": "asignacion", "field_name": "asig'
b'nacion", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "num_asignacion", "field_name": "num_asig'
b'nacion", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "f_asignacion", "field_name": "f_asignaci'
b'on", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", '
b'"metadata": null}, {"name": "campana", "field_name": "campana", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "campanacontable", "field_name": "campanacontable"'
b', "pandas_type": "unicode", "numpy_type": "object", "metadata": '
b'null}, {"name": "despacho_empresa", "field_name": "despacho_empr'
b'esa", "pandas_type": "unicode", "numpy_type": "object", "metadat'
b'a": null}, {"name": "municipio", "field_name": "municipio", "pan'
b'das_type": "unicode", "numpy_type": "object", "metadata": null},'
b' {"name": "deudagestionar", "field_name": "deudagestionar", "pan'
b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "deudavencida", "field_name": "deudavencida", "pandas'
b'_type": "float64", "numpy_type": "float64", "metadata": null}, {'
b'"name": "d_oap", "field_name": "d_oap", "pandas_type": "float64"'
b', "numpy_type": "float64", "metadata": null}, {"name": "fresulta'
b'do", "field_name": "fresultado", "pandas_type": "datetime", "num'
b'py_type": "datetime64[ns]", "metadata": null}, {"name": "resulta'
b'do", "field_name": "resultado", "pandas_type": "unicode", "numpy'
b'type": "object", "metadata": null}, {"name": "gestor", "field_n'
b'ame": "gestor", "pandas_type": "unicode", "numpy_type": "object"'
b', "metadata": null}, {"name": "captura", "field_name": "captura"'
b', "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "me'
b'tadata": null}, {"name": "gpo1", "field_name": "gpo1", "pandas_t'
b'ype": "unicode", "numpy_type": "object", "metadata": null}, {"na'
b'me": "toap", "field_name": "toap", "pandas_type": "unicode", "nu'
b'mpy_type": "object", "metadata": null}, {"name": "visitado", "fi'
b'eld_name": "visitado", "pandas_type": "int64", "numpy_type": "in'
b't64", "metadata": null}, {"name": "visitado_aus", "field_name": '
b'"visitado_aus", "pandas_type": "int64", "numpy_type": "int64", "'
b'metadata": null}, {"name": "ps_ano", "field_name": "ps_ano", "pa'
b'ndas_type": "datetime", "numpy_type": "datetime64[ns]", "metadat'
b'a": null}, {"name": "dg_filepath", "field_name": "dg_filepath", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "dg_date", "field_name": "dg_date", "pandas_type":'
b' "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, '
b'{"name": "dg_schema_version", "field_name": "dg_schema_version",'
b' "pandas_type": "int64", "numpy_type": "int64", "metadata": null'
b'}, {"name": null, "field_name": "index_level_0", "pandas_typ'
b'e": "int64", "numpy_type": "int64", "metadata": null}], "pandas
'
b'version": "0.22.0"}'}

This is the error:

Traceback (most recent call last):
  File "p2a.py", line 10, in <module>
    pdx.to_avro('opers.avro', table.to_pandas())
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 77, in to_avro
    schema = __schema_infer(df)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 33, in __schema_infer
    fields = __fields_infer(df)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 27, in __fields_infer
    type_avro = __type_infer(type_np)
  File "/home/jpardobl/python_envs/venv_kafka/lib/python3.6/site-packages/pandavro/__init__.py", line 21, in __type_infer
    raise TypeError('Invalid type: {}'.format(t))
TypeError: Invalid type: datetime64[ns]

Release 1.5.x

Overview

Use git tag to upload for PyPI #16.

Release 1.5.x:

  • Fix incorrect type inference, Python compatibility issues, support compression #12 @dargueta
  • Feature/add support append file #13 @AlanTaranti
  • Add section to README that explains how pandavro infers schema #15 @The-Fonz
  • Use tags for deploy #16 @ynqa

Datetime-like values errors

Got some problems with datetime-like values.
Tried with pandas 1.0.3 and 0.25.3; neither works.
fastavro 0.23.4.

date

Traceback (most recent call last):

  File "<ipython-input-180-724a28b4d15a>", line 1, in <module>
    pdx.to_avro('test.avro', df.drop(columns=['event_timestamp']))

  File "/opt/anaconda3/lib/python3.7/site-packages/pandavro/__init__.py", line 151, in to_avro
    records=df.to_dict('records'), codec=codec)

  File "fastavro/_write.pyx", line 628, in fastavro._write.writer

  File "fastavro/_write.pyx", line 581, in fastavro._write.Writer.write

  File "fastavro/_write.pyx", line 335, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 285, in fastavro._write.write_record

  File "fastavro/_write.pyx", line 333, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 249, in fastavro._write.write_union

ValueError: datetime.date(2020, 6, 10) (type <class 'datetime.date'>) do not match ['null', 'string']

NaTType

Traceback (most recent call last):

  File "<ipython-input-182-991911d54074>", line 1, in <module>
    pdx.to_avro('test.avro', df.drop(columns=['event_date','items']))

  File "/opt/anaconda3/lib/python3.7/site-packages/pandavro/__init__.py", line 151, in to_avro
    records=df.to_dict('records'), codec=codec)

  File "fastavro/_write.pyx", line 628, in fastavro._write.writer

  File "fastavro/_write.pyx", line 581, in fastavro._write.Writer.write

  File "fastavro/_write.pyx", line 335, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 285, in fastavro._write.write_record

  File "fastavro/_write.pyx", line 333, in fastavro._write.write_data

  File "fastavro/_write.pyx", line 234, in fastavro._write.write_union

  File "fastavro/_validation.pyx", line 169, in fastavro._validation._validate

  File "fastavro/_validation.pyx", line 178, in fastavro._validation._validate

  File "fastavro/_logical_writers.pyx", line 72, in fastavro._logical_writers.prepare_timestamp_micros

  File "fastavro/_logical_writers.pyx", line 105, in fastavro._logical_writers.prepare_timestamp_micros

  File "pandas/_libs/tslibs/nattype.pyx", line 58, in pandas._libs.tslibs.nattype._make_error_func.f

ValueError: NaTType does not support timestamp

Support for bytes

Hi! first of all, thanks for this very useful package. We use it in our ETL and it's really convenient.

I wonder whether there is support for bytes and I missed it. When I add a column to the dataframe being bytes, I get the error

TypeError: argument of type 'NoneType' is not iterable

which I'm getting with other complex types. This doesn't seem a very complex type, so I wonder if it'd be very difficult to add.

At the moment what I'm doing is this:

schema = pdx.schema_infer(df)
bytes_field_idx = next(
    idx for idx, field in enumerate(schema["fields"]) if field["name"] == "bytes_field"
)
schema["fields"][bytes_field_idx]["type"] = ["null", "bytes"]

pdx.to_avro(str(path), df, schema=schema)

but ofc it would be great if I could delegate everything to schema_infer. Am I missing something? It'd be great to support pathlib.Path as well, but that's not such a big deal :)

Add Python 3.12 support

Hi!
I use pandavro in a couple projects and would love to see Python 3.12 support soon!

Currently, when trying to install Pandavro 1.7.2 under Python 3.12, the build of fastavro 1.5.4 fails:

Compiler crash traceback from this point on:
  File "/tmp/tmprnf1_8vp/.venv/lib64/python3.12/site-packages/Cython/Compiler/Nodes.py", line 2786, in call_self_node
    type_entry = self.type.args[0].type.entry
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PyObjectType' object has no attribute 'entry'

I know that fastavro 1.9.0 has Python 3.12 support so this should be feasible.
Thank you in advance and keep up the good work!

Latest pandas does not have DatetimeTZDtypeType

There was a recent change that added a check for the type pd.core.dtypes.dtypes.DatetimeTZDtypeType. Unfortunately, this no longer exists in the latest version of Pandas, throwing an error:

AttributeError: module 'pandas.core.dtypes.dtypes' has no attribute 'DatetimeTZDtypeType'

Add compatibility with Python 3.11

The current version pandavro==1.7.1 is not compatible with Python 3.11 because of the pinned dependency fastavro==1.5.1. I would like to request to fix this, so any app using pandavro can be upgraded to the latest version of Python.

I understand that this issue is closely related to #39, but I still decided to open it for raising awareness.

Tests failing in master

Hi, I was thinking about contributing to #27, but I just ran the tests in master and they fail for me:

FAILED tests/pandavro_test.py::test_buffer_e2e - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_file_path_e2e - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_delegation - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_append - AssertionError: numpy array are different
FAILED tests/pandavro_test.py::test_dataframe_kwargs - AssertionError: numpy array are different
========================= 5 failed, 4 passed, 5 warnings in 0.60s ===========================

I can see

(Pdb) expect
   Boolean          DateTime64   Float64  Int64 String
0     True 2018-12-31 23:00:00 -0.579613      8    foo
1    False 2019-01-01 23:00:00 -0.922827      3    bar
2     True 2019-01-02 23:00:00 -1.070658      8    foo
3    False 2019-01-03 23:00:00 -0.072218      2    bar
4     True 2019-01-04 23:00:00 -1.604049      3    foo
5    False 2019-01-05 23:00:00 -0.822774      0    bar
6     True 2019-01-06 23:00:00 -0.504930      4    foo
7    False 2019-01-07 23:00:00  1.357435      0    bar
(Pdb) dataframe
   Boolean DateTime64   Float64  Int64 String
0     True 2019-01-01 -0.579613      8    foo
1    False 2019-01-02 -0.922827      3    bar
2     True 2019-01-03 -1.070658      8    foo
3    False 2019-01-04 -0.072218      2    bar
4     True 2019-01-05 -1.604049      3    foo
5    False 2019-01-06 -0.822774      0    bar
6     True 2019-01-07 -0.504930      4    foo
7    False 2019-01-08  1.357435      0    bar

in test_append. Any ideas? There's like a mismatch of 1h, maybe some rounding issue? @ynqa

Unable to infer timestamp type

I have my Avro Schema as below:

schema = { "namespace": "example.avro", "type": "record", "name": "IoTData", "fields": [ {"name": "nodeId", "type": ["null", "string"], "default": None}, {"name": "displayName", "type": ["null", "string"], "default": None}, {"name": "dataType", "type": ["null", "string"], "default": None}, {"name": "statusCode", "type": ["null", "string"], "default": None}, {"name": "timestamp", "type": ["null", {"type": "string", "logicalType": "timestamp-micros"}], "default": None}, {"name": "sourceTimestamp", "type": ["null", {"type": "string", "logicalType": "timestamp-micros"}], "default": None}, {"name": "value", "type": ["null", "double"], "default": None} ] }

I had to put the default values as None, as sometimes these values may be blank.

File content is as below:

[{'nodeId': 'ns=2;s=SCD30_CO2', 'displayName': 'SCD30_CO2', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 61.83}, {'nodeId': 'ns=2;s=SCD30_TEMPERATURE', 'displayName': 'SCD30_TEMPERATURE', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 27.35}, {'nodeId': 'ns=2;s=SCD30_HUMIDITY', 'displayName': 'SCD30_HUMIDITY', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:21.622480', 'sourceTimestamp': '2024-02-26T03:56:20.224859', 'value': 41.49}, {'nodeId': 'ns=2;s=SCD30_CO2', 'displayName': 'SCD30_CO2', 'dataType': 'Double', 'statusCode': 'Good', 'timestamp': '2024-02-25T22:56:22.704777', 'sourceTimestamp': '2024-02-26T03:56:22.250094', 'value': 63.27}]

When I convert this to pandas, the data types are as below:


0 nodeId 60 non-null string
1 displayName 60 non-null string
2 dataType 60 non-null string
3 statusCode 60 non-null string
4 timestamp 60 non-null object
5 sourceTimestamp 60 non-null object
6 value 60 non-null float64

It's unable to infer the timestamp type.

decimal logical type : TypeError: can only concatenate str (not "int") to str

I am trying to convert an existing csv to avro using pandavro.

I am not able to resolve the below error:
File "fastavro/_logical_writers.pyx", line 130, in fastavro._logical_writers.prepare_bytes_decimal
File "fastavro/_logical_writers.pyx", line 143, in fastavro._logical_writers.prepare_bytes_decimal
TypeError: can only concatenate str (not "int") to str

I did check my csv, avsc and pandavro lines of code multiple times, but I am not able to find what the problem is. I am not savvy enough to call it a bug.
Can anyone provide me with some pointers?

Data in the csv: 999.879
The column p_cost in the avsc:
{ "name": "p_cost", "type": {"name": "decimalEntry", "type": "bytes", "logicalType": "decimal", "precision": 15, "scale": 3} },
The lines of code:

def convert_to_decimal(val):
    """
    Convert the string number value to a Decimal
    - Must set precision and scale beforehand
    """
    return Decimal(val)

schema_promotion = load_schema("promotion.avsc")
df_promotion = pd.read_csv('/scratch/tpcds_1/promotion/promotion.dat', delimiter='|', header=None,
                           usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18],
                           names=['p_promo_sk', ..., 'p_cost', ..., 'p_discount_active'],
                           dtype={'p_cost': 'str'})

getcontext().prec = 15  # set precision of all future decimals
type(df_promotion['p_cost'])

df_promotion['p_cost'] = df_promotion['p_cost'].apply(convert_to_decimal)
pdx.to_avro('test_promotion.avro', df_promotion, schema=schema_promotion)

This throws the below error:

Traceback (most recent call last):
  File "perfectlyrandom.py", line 313, in <module>
    promotion()
  File "perfectlyrandom.py", line 262, in promotion
    pdx.to_avro('test_promotion.avro', df_promotion, schema=schema_promotion)
  File "/home/opc/.local/lib/python3.8/site-packages/pandavro/__init__.py", line 322, in to_avro
    fastavro.writer(f, schema=schema,
  File "fastavro/_write.pyx", line 727, in fastavro._write.writer
  File "fastavro/_write.pyx", line 680, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 432, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 422, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 366, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 387, in fastavro._write.write_data
  File "fastavro/_logical_writers.pyx", line 130, in fastavro._logical_writers.prepare_bytes_decimal
  File "fastavro/_logical_writers.pyx", line 143, in fastavro._logical_writers.prepare_bytes_decimal
TypeError: can only concatenate str (not "int") to str

If the full schema definition and pandas df definition are needed, I shall provide them.
pip list:
avro-python3 1.10.2
fastavro 1.5.1
numpy 1.23.3
pandas 1.5.0
pandavro 1.7.1

Make a new release to allow versions of fastavro>1.6.0

The latest release (1.7.2) is restricting fastavro~=1.5.1, which translates to >=1.5.1, <1.6.0.

Currently in trunk this dependency is loosened to 'fastavro>=1.5.1,<2.0.0'

Could we please have a new release with this change? Any version of fastavro<1.8.2 cannot be used on M-series Macs, and this is causing some pain for local development as I need to constantly work around it.

Thanks!
