Cython provides lightweight conversions between Foo *

What's the issue with evaluating npy_format_deor<Fo

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

How to deal with structured/record arrays?,about pybind/pybind11

Comments (36)

wjakob commented on August 19, 2024

Something like this does not currently exist. I suppose it should be possible to create like that using information exposed via the def_property() calls, but it's not a priority for me. Contributions are welcomed though.

from pybind11.

aldanor commented on August 19, 2024

How would def_property help with that though? (given there's no struct field reflection in C++)

from pybind11.

wjakob commented on August 19, 2024

Assuming that the binding code calls def_property for all members, you'll know the types, sizes, names, and the offset within the data structure (which can be used to order them). After that it's just another call to the corresponding numpy function (see numpy.h for a way to do it without requiring inclusion of numpy headers) to set up the custom NumPy type.

from pybind11.

aldanor commented on August 19, 2024

I see what you mean. But it wouldn't make it possible to properly evaluate npy_format_descriptor<Foo> though? You're talking about exposing structs, but how would you take them as arguments? (e.g. passing record arrays from Python into C++?)

from pybind11.

wjakob commented on August 19, 2024

What's the issue with evaluating npy_format_descriptor<Foo> for a trivially copyable member of type Foo during its def_property call? Dealing with tightly packed arrays would require some more precautions, e.g. a special converter in cast.h. It's definitely not impossible but might be a lot of work.

from pybind11.

aldanor commented on August 19, 2024

I guess I'd have to read through entire library source first to see how this could be done, but I'm quite willing to do so :) Would you be so kind as to post a few hypothetical code examples of how this could look in user code? (both binding and passing structured arrays as arguments)

from pybind11.

wjakob commented on August 19, 2024

Sorry, my time for pybind11 support is pretty limited -- since what you are requesting is not an official feature, I fear that you are on your own.

from pybind11.

wjakob commented on August 19, 2024

Closing this ticket for now -- I'm generally interested if you develop something of this sort, so feel free to reopen.

from pybind11.

aldanor commented on August 19, 2024

@wjakob

Ok, I've managed to implement a prototype that mostly works (which took a while...), kind of like so:

struct Foo {
    int x;
    double y;
};

struct Bar {
    Foo a;
    bool b;
} __attribute__((packed));

PYBIND11_DTYPE(Foo, x, y);
PYBIND11_DTYPE(Bar, a, b);

after which npy_format_descriptor and format_descriptor will return valid descriptor and format string which will let you use these types in place of scalars. On the numpy side, you will get a structured arrays (can probably adjust it to return recarrays always), which e.g. can be passed directly to pandas dataframe.

A few points to note:

Using typenums (in npy_format_descriptor) is not sufficient. These are only valid for builtin types, anything more and you need a full format descriptor; this also implies stripping constexpr and making it all runtime. Generally, constexpr for all of the numpy stuff doesn't really matter anyway much because there's always going to be the cost of calling numpy itself which will dominate
For the same reason, format_descriptor::value had to be converted to a method since it now has a runtime component.
The whole existing buffer->array logic was very limited and wouldn't work at all for more complex types, so I had to rewrite all that as well
Currently, format_descriptor works through npy_format_descriptor for structured types (so that we use numpy's logic to generate format string and not do it ourselves), which would imply that in order to use a structured Python buffer, you would still need numpy. I don't see this as a problem though

I've started writing tests and at some point encountered a few segfaults, maybe I'm not releasing some Python objects correctly -- I was wondering if you could help take a look if I open (an unfinished) PR?

from pybind11.

aldanor commented on August 19, 2024

@wjakob I've posted my branch as of its current state here to give an idea -- aldanor#1 -- maybe you could take a look.

Cheers,
-I

from pybind11.

wjakob commented on August 19, 2024

ok, I am reopening this -- I am swamped right now, so it might take me a few days to look at your code in detail.

from pybind11.

aldanor commented on August 19, 2024

No worries, thanks! There's still quite a few things to ponder on and implement in this branch anyway.

from pybind11.

wjakob commented on August 19, 2024

Hello Ivan,

nice! That looks like an great set of changes -- here are a few comments:

The switch from format_descriptor::value to format_descriptor::format() is non-ideal. Such API incompatibilities cause breakage for many projects and also require this patch to wait until a major release due to semantic versioning rules. Since it's all constexpr, can't this be stored in a static variable as before?

I like your PYBIND11_DTYPE macro but wonder if this is something that Visual Studio 2015 can handle? There are slight differences in recursive macro evaluations that require extra care, and the PB11_IMPL_* macros in particular look like something that might cause trouble. Note that it should be prefixed with PYBIND11_ like all other global macros.

You changed the architecture here (https://github.com/aldanor/pybind11/pull/1/files#diff-7174f0e1ad0a58b4622057815d7f8199L102) quite a bit -- can you explain why this was needed? Also, I wonder if the std::calloc call can be avoided? When array(buffer_info &) was called with a nullptr-valued buffer, the indended behavior was to allocate uninitialized memory that will be initialized subsequently.

Minor stuff: wrong brace indentation here: https://github.com/aldanor/pybind11/pull/1/files#diff-7174f0e1ad0a58b4622057815d7f8199R228

Extra braces would be helpful here: https://github.com/aldanor/pybind11/pull/1/files#diff-7174f0e1ad0a58b4622057815d7f8199R228

Thanks,
Wenzel

from pybind11.

aldanor commented on August 19, 2024

Hi Wenzel, thanks for taking time to review this, I'll reply to the points raised in order:

I think that the switch from format_descriptor::value to format_descriptor::format() is quite inevitable unless you want to start copying/pasting larger and larger parts of NumPy's C code into pybind11 (like _buffer_format_string for instance). Since we get the format string for non-basic dtypes at runtime, it's no longer a constexpr and has to become a function I guess.
Addressing your concern re: breakage, I've specifically left it backwards compatible for all types for which it was already implemented, so for ints, floats and bools you can still access the format string as a constexpr format_descriptor::value. I don't think any downstream code can break as a result of these changes.
Re: the macros, I don't have MSVS 2015 since I'm on a Mac but I can try installing one in a virtual machine. It would be a lot easier though if someone with a Windows box tried it out...
I will change the macro prefixes to PYBIND11_ if you want (the idea in naming was that they are internal, the only one that the users should access directly is PYBIND11_DTYPE).
Re: changes in architecture:
- Previously, array(const buffer_info&) was constructing descriptor from a typenum (PyArray_DescrFromType) which it inferred from the first letter of the format string. This wouldn't work at all for non-basic types, so I changed it to PyArray_GetArrayParamsFromObject. Basically, we want some way to convert a generic buffer format string to a numpy array descriptor -- I don't know an easier way that doesn't involve copying large parts of numpy codebase into pybind11. This function wants an array/buffer-like object and constructs an array from it -- in our case we can use a memoryview.
- You can't use a memoryview with a null ptr for its buffer, so we have to allocate it manually. If you look at what PyArray_NewFromDescr does when you give it a nullptr for data, it essentially just calls malloc or calloc, depending on whether it needs to be zeroed; we could do the same thing instead of always caling calloc (check for NPY_NEEDS_INIT). Problem is, we don't have the descriptor yet, only buffer info, so the safe thing is to always zero it out I guess..
- Actually... I've just stumbled upon a PyArray_RegisterDataType API that could in theory let us work with custom data types exclusively through typenums, more like it was originally, I could give it a try to see if it works. Note that this still doesn't solve the problem of going from a generic buffer info to a numpy descriptor (for that we may have to keep using PyArray_GetArrayParamsFromObject).
Re: braces/indentation, will fix. I think your last url is wrong though (it's the same as the second last).

In general, I think, there should be easier ways for users to create numpy arrays without having to create buffer info objects at all (just provide the shape and element type, the rest could be filled out pretty much automatically). Also a few additional constructors could help like constructing from a vector and the like. Same way, there should be an easier way to get the pointer to the data without having to go request() -> .ptr -> cast void* to T*, this could be a quality of life improvement for working with numpy arrays. I have a few ideas but it calls for a different separate PR.

What I haven't done/verified yet is the whole casting thing. Currently you can pass any array to a function taking py::array_t<T> as an argument and the dtypes are not really checked. This has nothing to do with this branch, as in the latest pybind11 you can e.g. pass integer arrays to functions requiring floats. I think we want some sort of typechecking there.

Update: is that because of forcecast being set by default? (which seems like a very strange default to have; I'd be up for having int ExtraFlags = 0, or int ExtraFlags = array::c_style but not force cast; this makes even more sense once you have structured arrays).

from pybind11.

aldanor commented on August 19, 2024

Re: macros, indeed, you were right, it seems to only work properly with gcc (and also with clang on OS X), not with vc++ or the older clang:

That's too bad, any ideas on what exactly is wrong with vc++ macro expansion and how to fix it? Hmm...

from pybind11.

wjakob commented on August 19, 2024

Visual Studio doesn't evaluate preprocessor expressions as often as GCC/Clang, sometimes you need to map it through a dummy define like so:

#define EVAL(e) e
#define COMPLICATED_MACRO_(e1, e2, e3) EVAL(..something involving e1,e2,e3...)

However, I would first ensure that you're actually testing against MSVC2015 (which is hugely improved wrt. older releases). It was not clear to me that this rextester.com website uses the latest version.
It might be easiest to install Windows on VirtualBox or Parallels to debug this.

from pybind11.

aldanor commented on August 19, 2024

Thanks. The #define EVAL(e) e approach doesn't seem to work with gcc though, it looks like you have to do #define EVAL(...) __VA_ARGS__ (which is already defined as PB11_IMPL_EVAL0).

Seems to work on vc++ / gcc with this hack, just one line needed to be fixed (namely, PB11_IMPL_MAP_LIST_NEXT1 definition):

vc++: http://rextester.com/DMVKT17578
gcc: http://rextester.com/QPP93519

It apparently fails on clang 3.7 on rextester, but works fine with the default OS X clang on my box so I think it's probably fine.

Upd: correction, it doesn't always work, on clang for 3 arguments I now get

struct Foo {
    int a;
    float b;
    bool c;
};

int main() {
    ::pybind11::detail::npy_format_descriptor<Foo>::register_dtype 
({::pybind11::detail::field_descriptor { "a", __builtin_offsetof(Foo, a), ::pybind11::detail::npy_format_descriptor<decltype(static_cast<Foo*>(0)->a)>::dtype() } ,
 ::pybind11::detail::field_descriptor { "b", __builtin_offsetof(Foo, b),
 ::pybind11::detail::npy_format_descriptor<decltype(static_cast<Foo*>(0)->b)>::dtype() } ,
 PB11_IMPL_MAP_LIST1 (PB11_IMPL_FIELD_DESCRIPTOR, Foo, c, (), 0)});
}

so it fails to expand it. Ok, this will take a bit longer, will poke around :)

Upd: looks like it has to be done with an #ifdef _MSC_VER since the way expansions work is too different. Now it seems like it works with all 3 compilers.

vc++: http://rextester.com/SFG67680
gcc: http://rextester.com/BXKSQL19297

from pybind11.

wjakob commented on August 19, 2024

Responses to various points:

+1 for format() then. If it's going to be an incompatible change, let's remove value completely and let format() be constexpr whenever possible.
The rationale for PYBIND11_* was to isolate macro pollution to one prefix. It's fine to #define other things as well, but those should be #undef -ed in the same header file.
The PyArray_RegisterDataType approach sounds interesting and potentially simpler. I don't have enough context to assess this though and leave it up to you.
Re: Extra braces, I meant this location: https://github.com/aldanor/pybind11/pull/1/files#diff-7174f0e1ad0a58b4622057815d7f8199R260
Great to hear that you got it to work on all compilers. It would be awesome if you could add a few comments to the macros -- that kind of code is incredibly tedious to read and understand, and right now it's still fresh in your head. You could also link to the github page you mentioned above and then say how it's applied in your specific case.
While making all of these changes, it will be important to keep the basic buffer protocol code working to support basic stuff like passing around floating point matrices.

I hope I didn't miss anything -- please let me know if I did.

Thanks,
Wenzel

from pybind11.

aldanor commented on August 19, 2024

@wjakob I've updated a few things (comments, formatting, added a doc section), I think it should be ok to merge -- could you review please, it's the last 8 commits or so? Re: PyArray_RegisterDataType, it's doable but I don't think it would provide any direct benefit other the fact that we will now store a typenum and not a pyobject pointer for structured arrays; if a use case comes up, it's always possible to switch to that. I didn't remove ::value for format descriptors yet, but we can always remove them later.

Re: the tests, I didn't include any casting tests as of now due to the expected upcoming changes to the whole casting shebang.

I'll also squash the commits when opening a PR to this repo.

from pybind11.

aldanor commented on August 19, 2024

@wjakob Looks like I found a bug in NumPy, not sure how to proceed for now 😞
numpy/numpy#7797

We can require that there is no trailing padding for now, I think, until it's fixed in numpy

from pybind11.

wjakob commented on August 19, 2024

I think this assumption is quite reasonable and not a blocker.

from pybind11.

wjakob commented on August 19, 2024

Hello Ivan,

what's the status of this? Is it ready to be merged (ignoring the issue with NumPy and trailing unspecified space for now).

Best,
Wenzel

from pybind11.

wjakob commented on August 19, 2024

Ah, and I forgot to ask: Your patch adds examples on how to expose POD-style static data structures which are known at compile time.

It would be fantastic to also include an example which shows how to return a format type which is constructed at runtime (using the same kind information -- name of the fields, types, and offsets).

Thanks,
Wenzel

from pybind11.

aldanor commented on August 19, 2024

Hi Wenzel, sorry for the delay, so the trailing byte problem is fairly serious and would appear more often than you'd think -- for example when you have something like struct A { short x; } it may look pretty harmless but it will most likely have trailing padding bytes. I sure wouldn't want to merge this thing in half-baked, so I did a bit of digging.

In a nutshell, we'll need to generate buffer format strings on our own (which is fairly trivial; already done), and also fix dtypes generated from buffer strings by numpy (less trivial but not too hard) -- both of these are related to two issues in numpy which may get fixed in future releases. The good news is that we can now revert array(buffer_info) constructor almost to its previous state which means no more manual callocs/mallocs etc.

So, most of stuff works, however I have a problem with a certain function throwing segfaults (which I'm pretty sure related to refcounts being wrong somewhere), but I can't find what exactly wrong.

Maybe you could take a brief look and help, it's a fairly isolated piece of code? (it could be some really stupid mistake on my side...)

I've pushed the current work-in-progress to a separate branch, the offending function is strip_padding_fields and it's quite small: https://github.com/aldanor/pybind11/blob/recarray-wip/include/pybind11/numpy.h#L164 -- to reproduce the segfault, build and run example18.py

Re: your other suggestions and minor edits, most of them are already implemented, the rest I'll fix after the whole dtype / buffer format conversion is done -- then it's good to merge.

from pybind11.

aldanor commented on August 19, 2024

@wjakob Re: constructing types at runtime (technically, POD dtypes are always constructed at runtime even though the declaration is a rigid compile-time one), I would probably suggest doing this on a separate PR, as this would probably involve:

creating a proper pybind11::dtype wrapper akin to other Python objects
moving all dtype-specific logic (like converting to/from buffer format strings etc) into pybind11:dtype
allow extracting dtype from py::array
allow constructing array from dtype and data pointer (and/or shape/strides)

This is fairly straightforward but still quite a bit of work

from pybind11.

wjakob commented on August 19, 2024

Hi Ivan,

I cloned your repository bug didn't get it to segfault. When running on my machine (OSX, Python 3.5, 64 bit), I get

!!! {'names':['z','x','y'], 'formats':['<f4','?','<u4'], 'offsets':[8,0,4], 'itemsize':12}
!!! {'names':['z','x','y'], 'formats':['<f4','?','<u4'], 'offsets':[5,0,1], 'itemsize':9}
!!! [('a', {'names':['z','x','y'], 'formats':['<f4','?','<u4'], 'offsets':[8,0,4], 'itemsize':12}), ('b', {'names':['z','x','y'], 'formats':['<f4','?','<u4'], 'offsets':[5,0,1], 'itemsize':9})]
!!! {'names':['z','x','y'], 'formats':['<f4','?','<u4'], 'offsets':[8,0,4], 'itemsize':24}
!!! {'names':['a'], 'formats':[{'names':['z','x','y'], 'formats':['<f4','?','<u4'], 'offsets':[8,0,4], 'itemsize':24}], 'offsets':[8], 'itemsize':40}
T{=?:x:3x=I:y:=f:z:}
T{=?:x:=I:y:=f:z:}
T{=T{=?:x:3x=I:y:=f:z:}:a:=T{=?:x:=I:y:=f:z:}:b:}
T{=?:x:3x=I:y:=f:z:12x}
T{8x=T{=?:x:3x=I:y:=f:z:12x}:a:8x}
{'names':['x','y','z'], 'formats':['?','<u4','<f4'], 'offsets':[0,4,8], 'itemsize':12}
[('x', '?'), ('y', '<u4'), ('z', '<f4')]
[('a', {'names':['x','y','z'], 'formats':['?','<u4','<f4'], 'offsets':[0,4,8], 'itemsize':12}), ('b', [('x', '?'), ('y', '<u4'), ('z', '<f4')])]
{'names':['x','y','z'], 'formats':['?','<u4','<f4'], 'offsets':[0,4,8], 'itemsize':24}
{'names':['a'], 'formats':[{'names':['x','y','z'], 'formats':['?','<u4','<f4'], 'offsets':[0,4,8], 'itemsize':24}], 'offsets':[8], 'itemsize':40}
!!! {'names':['z','x','y'], 'formats':['<f4','?','<u4'], 'offsets':[8,0,4], 'itemsize':12}
Traceback (most recent call last):
  File "example/example18.py", line 30, in <module>
    arr = func(0)
RuntimeError: This should never happen: Invalid offset in buffer format string generation. Please report a bug to the Numpy developers.

I didn't quite understand strip_padding_fields. If the problem is that NumPy ignores padding fields, why does removing them from format descriptors help?

from pybind11.

wjakob commented on August 19, 2024

The code in strip_padding_fields seems reasonable by the way -- I didn't see any smoking guns.

from pybind11.

aldanor commented on August 19, 2024

@wjakob Thanks for taking a look :)

Well actually, I forgot, you need PY_DEBUG to be enabled for that (by the way, I highly recommend using debug build of Python for developing things like pybind11 -- lets you catch bugs which would slip otherwise and lldb/gdb right into Python source).

It's an assertion that fails in object.c, only if strip_padding_fields is called (an exception that was missed and wasn't handled):

!!! {'names':['x','y','z'], 'formats':['?','<u4','<f4'], 'offsets':[0,4,8], 'itemsize':24}
!!! {'names':['b','a'], 'formats':[[('x', '?'), ('y', '<u4'), ('z', '<f4')],{'names':['x','y','z'], 'formats':['?','<u4','<f4'], 'offsets':[0,4,8], 'itemsize':12}], 'offsets':[12,0], 'itemsize':21}
Assertion failed: (!PyErr_Occurred()), function PyObject_Str, file Objects/object.c, line 512.
'python ../example/example18.py' terminated by signal SIGABRT (Abort)

So yea, it's the same exception that you see, I think it happens because numpy expects us to sort the entries by offsets (.fields.items() is unordered -- yet another thing that could be improved upstream in numpy)... will look into it and hopefully fix it all today.

Re: why we need to strip padding fields. Basically, numpy will (erroneously) generate dtypes with extra fields of type NPY_VOID and an empty description whenever it sees a buffer string with padding bytes. This is bad for many reasons (again, see numpy/numpy#7797 for a few examples), especially if we consider to start using stricter casting rules -- a dtype with an added padding field is not considered strictly type-equivalent.

from pybind11.

wjakob commented on August 19, 2024

Sounds great! I really look forward to have this feature in pybind11. :).

Regarding PY_DEBUG -- yes, I do that all the time. I find it essential for systematically debugging refcount issues. For better results, I do it on multiple different platforms in VMs (Windows 64 bit seems a particularly good platform for throwing assertion errors when things go wrong)

from pybind11.

wjakob commented on August 19, 2024

Hi Ivan,

I just wanted to check in: do you need anything from my end? (this wasn't 100% clear to me after your last post)

Thanks,
Wenzel

from pybind11.

aldanor commented on August 19, 2024

Hi Wenzel, thanks, I think I got it mostly figured out, just need a bit of time to clean it all up and then rebase onto the current master if everything looks fine, etc. I'm not likely to get a lot of time till the coming weekend though...

// basically, this kind of thing seems to work in Python, now I'll just need to translate it to pybind11:

names, formats, offsets = zip(
    *sorted(
        ((name, dtype, offset) 
         for name, (dtype, offset) in descr.fields.items()
         if not (dtype.kind == 'V' and name == '')), 
    key=lambda v: v[2]))

from pybind11.

wjakob commented on August 19, 2024

That sounds great, thank you. No worries regarding timing -- since it's a API/ABI-breaking change, it's worth taking the time to get everything straightened out.

If you'll be changing the interfaces, then I would (again :)) suggest to also have a simple dynamic wrapper with a API such as:

py::dtype dt;
dt.append("x", format_descriptor<float>(), 0 /* offset */);
...

(which does the necessary sorting and calls to NumPy to turn this into a format descriptor).
The static macro-based approach could then be realized with the help of that.

from pybind11.

wjakob commented on August 19, 2024

A quick request: can you create a PR in this project? I only see one in your personal copy.

Thanks,
Wenzel

from pybind11.

aldanor commented on August 19, 2024

Yep, forgot about it, will do now. Still needs a minor review re: the last changes

Cheers,
-I

from pybind11.

aldanor commented on August 19, 2024

@wjakob Done #308

from pybind11.

aldanor commented on August 19, 2024

Closed in favor of #308

from pybind11.

How to deal with structured/record arrays? about pybind11 HOT 36 CLOSED

Comments (36)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent