
xport's Introduction

Hi there πŸ‘‹

xport's People

Contributors

alfred-b-chan, dcroote, jcushman, mmacpherson, selik, theovincent


xport's Issues

extract metadata

Hi,

If I open an xpt file in Notepad, I can see the column names there. Could you please guide me on how to extract column names from an xpt file using xport?

Specific columns names and column labels.

Currently, when writing to the xpt file with from_columns, the module creates column names from the labels. It would be helpful for our submissions to the FDA to be able to specify both the column names and the column labels.

For example, the first column is the study identifier. The required column name is "STUDYID" and the column label should be "Study Identifier". When I pass "Study Identifier" as the column label using from_columns, the column name "Study_Id" is created, which is not acceptable to the FDA.

When writing from_rows, the column name remains what I had populated in the dataframe, but there is apparently no way to then add column labels.

Michael, Thank you for your excellent work on this project!

Reverse conversion (CSV to XPT)

We have the command line: python -m xport dm_1.xpt > ex.csv (input = xpt, output = csv).

Is there a reverse function we can use (input = csv, output = xpt)? Please advise on a solution.

Thanks,
Sagar

3.2.2 version is missing

(screenshot attached)
Hey, please check: I can't upgrade.

I also tried uninstalling and installing again; your code didn't seem to show up (even though I checked your code on GitHub and it's there).

Host the docs website

It should be easy enough to publish the docs website either on GitHub, ReadTheDocs, or somewhere similar.

Cannot specify name of dataset

There is no way to specify the name of the dataset generated. Looking in the code, I see it always defaults to 'dataset'.

Maybe the from_rows and from_columns functions could take an optional parameter to specify the dataset name.

Edit value of 1 column

Hi, in order to edit the value of one column in a file, do we need to configure the pandas Python package?

What is some example code to access and change the value of one column? Please advise, since the tutorial says xpt is not the same as CSV and only shows a mapping example.

Thanks,
Sagar

Cythonize for speed

Some XPORT files can be quite large, so it'd be nice to make string decoding faster. I suspect Cython could give us a boost.

External call? XPORT V5?

Hello,

I have two questions:

  1. How can I call the script from the command line, with all required data, to create a .xpt file, given that I work in a C# environment? I think I can call the Python executable/script with properly formatted arguments, but what about argument length? I also need to look up how to retrieve the program output (on Windows). Maybe the documentation could be improved for Python newbies like me?

  2. Is the produced .xpt file XPORT version 5 compliant? I'm required to produce such files, not version 6.

Thank you a lot.

Error reading file- xpt (_init, and _read_header)

The code below simply opens an xpt file and reads all rows:

    import xport

    # open the xpt file in binary mode and iterate over its rows
    with open('bg.xpt', 'rb') as f:
        for row in xport.Reader(f):
            print row  # Python 2 print statement

Error:

    Traceback (most recent call last):
      File "test1.py", line 18, in <module>
        for row in xport.Reader(f):
      File "/usr/lib/python2.7/site-packages/xport.py", line 160, in __init__
        version, os, created, modified = self._read_header()
      File "/usr/lib/python2.7/site-packages/xport.py", line 197, in _read_header
        tokens = tuple(t.rstrip() for t in struct.unpack(fmt, raw))
    struct.error: unpack requires a string argument of length 80
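This error usually means the reader asked struct.unpack for a full 80-byte record but the file ended early (a truncated or non-XPORT file). A minimal stdlib sketch of the exact-length requirement (the record size is the only detail taken from the traceback):

```python
import struct

RECORD_LEN = 80  # XPORT files are organized in 80-byte records
fmt = '%ds' % RECORD_LEN

full = b' ' * RECORD_LEN
short = b' ' * 79  # one byte shy of a full record

assert struct.unpack(fmt, full) == (full,)

# struct.unpack demands exactly calcsize(fmt) bytes; anything less raises.
got_error = False
try:
    struct.unpack(fmt, short)
except struct.error:
    got_error = True
assert got_error
```

So a header shorter than 80 bytes, for whatever reason, surfaces as exactly this struct.error.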

having issues while setting metadata

Code extract:

    ds = xport.Dataset(table_df, name='Data', label='Test data')
    for column, variable in ds.items():
        if a_condition:
            variable.format = '10.2'

    with open('my_file.xpt', 'wb') as f:
        xport.v56.dump(ds, f)

That gives me an error when opening my_file.xpt. a_condition is true when the column is numeric.

Please assist, and can you please add more examples on exporting?

All-blank last row is indistinguishable from XPORT file padding

The XPORT format specifies that the file is padded with b' ' to ensure the total file length (in bytes) is a multiple of 80. If there are no numeric columns and the last row consists of only empty strings or strings with only spaces, these are indistinguishable from the XPORT file padding.

This appears to be a defect in the XPORT format specification. Wontfix?
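A stdlib sketch of the ambiguity described above (the 10-byte row width is illustrative; the 80-byte boundary is from the spec):

```python
RECORD_LEN = 80  # XPORT pads the file with ASCII blanks to a multiple of 80

def pad_to_boundary(data: bytes) -> bytes:
    """Append b' ' until len(data) is a multiple of RECORD_LEN."""
    return data + b' ' * (-len(data) % RECORD_LEN)

ROW_WIDTH = 10  # a single 10-byte character column
one_row = b'HELLO'.ljust(ROW_WIDTH)
blank_row = b' ' * ROW_WIDTH

# With only character data, a trailing all-blank row vanishes into padding:
assert pad_to_boundary(one_row + blank_row) == pad_to_boundary(one_row)
```

Both files are byte-for-byte identical after padding, so no reader can recover the blank row.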

from_columns error

Hi, when using the from_columns function from xport, I get the following error for a couple of files. The other 15 work fine. Please advise why this error occurs.

"xport.py", line 666, in from_columns
column[ i ] = value.encode('ISO-8859-1')
UnicodeDecodeError: 'ascii' codec cant decode byte 0xa0 in position 11: ordinal not in range(128).

SAS dataset name

Currently, while writing an xpt file from a dataframe, the from_columns function is called. It writes the member header records and defaults the dataset name to b'dataset'; see the code below for reference:

    # Member header data
    fp.write(b'SAS'
             b'dataset'  # dataset name - customize this field
             b'SASDATA'
             + sas_version
             + os_version
             + 24 * b' '
             + created)

Can you help me make the dataset name customizable, or is there already a function for doing this? We don't want all datasets to have the same name.

from_dataframe malfunctioning

From email: "from_dataframe function bugs out as currently the list is referencing df not the passed in dataframe object"

Read CP-1252 character encoding

Is there any way to handle different character encodings? Most of the SAS files that I have to read are encoded in CP-1252 (gross, I know), and it looks like there isn't a good way to handle that here.
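For context, a stdlib sketch of the difference (the byte value is illustrative): CP-1252 assigns printable characters to bytes that ASCII rejects and Latin-1 maps to control characters.

```python
raw = b'patient\x92s visit'  # 0x92 is a curly apostrophe in CP-1252

try:
    raw.decode('ascii')
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False
assert not decoded_as_ascii  # ASCII cannot represent bytes >= 0x80

assert raw.decode('latin-1') == 'patient\x92s visit'   # C1 control character
assert raw.decode('cp1252') == 'patient\u2019s visit'  # RIGHT SINGLE QUOTE
```

So an `encoding` parameter on the reader, rather than a hard-coded codec, would cover files like these.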

Field names cannot start with an underscore: '_STATE'

I'm attempting to read the Behavioral Risk Factor Surveillance System (BRFSS) report from the CDC. When I attempt to read it in, I get ValueError: Field names cannot start with an underscore: '_STATE'. The obvious solution to me is to remove the underscore, but then I get either ValueError: 256 is not a valid VariableType or ValueError: 0 is not a valid VariableType.

cannot install using setup.py

  1. Clone the repo
  2. python setup.py install (under venv, Python 3.7)

Result:

    Traceback (most recent call last):
      File "setup.py", line 11, in <module>
        import xport
      File "/Users/gorelov/PycharmProjects/xport/xport/__init__.py", line 11, in <module>
        from . import reading
    ImportError: cannot import name 'reading' from 'xport'

Read label of variables

Hi,

I tried, without any success, to read metadata such as the label and format of the xpt variables. Such an option seems not to be available.

Many thanks !

ParseError

Is the parse error you were referring to in your description "ParseError: header -- expected b'HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!', got b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00>\x00\x03\x00\xfe\xff\t\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00'"?

Potentially with custom bytes on the end there? If so, this library could use updating.

value error while loading for more than 9999 values in dict

Traceback (most recent call last):
  File "main.py", line 124, in <module>
    rows =  xport.load(f)
  File "build/bdist.linux-x86_64/egg/xport.py", line 380, in load
  File "build/bdist.linux-x86_64/egg/xport.py", line 164, in __iter__
  File "build/bdist.linux-x86_64/egg/xport.py", line 339, in _read_observations
ValueError: Incomplete record, ''

Whenever I try to load an xpt file with more than 9999 values, it gives me an incomplete-record error.

I am using the method below for generating the xpt file (though I have 15 columns or so):

mapping = {'numbers': [1, 3.14, 42],
           'text': ['life', 'universe', 'everything']}

# as a mapping of labels to columns
with open('answers.xpt', 'wb') as f:
    dump(f, mapping, mode='columns')

But while loading, I am using the method below:

rows = xport.load(f)

Everything works fine if I loop up to 9999 when inserting data, but for 10000 records it gives me the above error.

Also, I am working on debugging the code to identify the problem. Let me know if you find something.

xport.to_numpy() API naming suggestion

Very convenient library for quick serialization of binary XPT files via the xport.to_* functions. However, it would be more natural to name them after the returned object rather than after the NumPy library:

xport.to_dict()
xport.to_ndarray()
xport.to_dataframe()

module 'xport' has no attribute 'XportReader'

I tried to convert NHANES data in xpt format into csv format in a Jupyter notebook, and installed xport with the following code:

    import sys
    !{sys.executable} -m pip install xport

    import xport, csv
    with xport.XportReader('MCQ_J.xpt') as reader:
        with open('MCQ_J.csv', 'w') as out:
            writer = csv.DictWriter(out, [f['name'] for f in reader.fields])
            for row in reader:
                writer.writerow(row)

but I have the error that "module 'xport' has no attribute 'XportReader'", was my download package wrong or do you have advice on how to solve this?

ImportError: cannot import name reading

ImportError: cannot import name reading

I get this error when running the code below:

    import xport

    with open('nsIQTScriptablePlugin.xpt', 'rb') as f:
        for row in xport.Reader(f):
            print row

Guidance on how to edit 1 column in xpt (since xpt not same as csv)

Hi All,

Since xpt files cannot be edited as directly as csv, can you please give guidance on how to write to a column to change its value in an xpt file? I only see an example of mapping. I do not want to map; here is what I need to do:

  1. Locate Column 1 --> change value '83302' to '2018_001'
  2. Save the new file
  3. Open next file --> locate column 1 --> change value to '2018_002'
    ..and so on.

Please help with how to access one column, change its value, and save the file.
Also: can we convert to CSV and then back to XPT? Or is that not needed?

Thanks,
Saga

No module named 'xport.v56'; 'xport' is not a package

I get this error when I attempt to do import xport.v56
I am using the Python standard library virtual environment created via python -m venv .venv, but I note that there may be a conda configuration requirement. Could this be what is causing my code to fail?

ParseError in _read_observations

I'm getting the following error parsing an xpt file -

xport.py, line 188, in __iter__
    for obs in self._read_observations(self._variables):
xport.py, line 367, in _read_observations
    raise ParseError('incomplete record', sentinel, block)
data_integration.utilities.xport.ParseError: incomplete record -- expected b'  ', got b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00

Note there are many spaces between the quotes in "expected b' ' " that I omitted.

It looks like spaces are expected, but it's getting null characters instead. This is a SAS version 9.4 xpt file, but I've parsed many other version 9.4 files using this library without an issue. Also, interestingly, I can read the header/schema of this file fine using this library; I'm just having issues parsing the rows.

I don't think the file itself is corrupt, because it can be parsed correctly in R using haven::read_xpt. I'm using the most recent version of xport published to PyPI.
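A small stdlib sketch of the mismatch being reported: the reader expects blank padding, but this file's tail is NUL bytes. (The 80-byte tail size and the classification below are illustrative, not the library's actual logic.)

```python
def classify_tail(tail: bytes) -> str:
    """Classify the final bytes of an XPORT file."""
    if not tail.strip(b' '):
        return 'blank padding'   # ASCII spaces, per the XPORT spec
    if not tail.strip(b'\x00'):
        return 'NUL padding'     # produced by some writers, rejected here
    return 'data'

assert classify_tail(b' ' * 80) == 'blank padding'
assert classify_tail(b'\x00' * 80) == 'NUL padding'
assert classify_tail(b'ABC'.ljust(80)) == 'data'
```

A lenient reader could accept an all-NUL trailing block as padding rather than raising a ParseError.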

AttributeError: module 'pandas' has no attribute 'NA'

Hello Mr. Selik,

I am trying to read a .XPT file into Python for a class assignment and I came across your library. Using the sample code provided at https://pypi.org/project/xport/ , I am receiving the error:

AttributeError: module 'pandas' has no attribute 'NA'

Here is the code I used, modified from the website:

import xport
import xport.v56

with open('data/DXX_J.XPT','rb') as f:
    library = xport.v56.load(f)

It is past my skill level to look into the source code and try to fix the error myself.

xport_error.zip

Can't copy SAS variable metadata to dataframe

I'm trying to convert an XPT file to CSV, and am getting the error below. I installed xport from pip.

The file ( MGX_H.XPT ) is from this cdc.gov page. A direct link to the file is here.

I'm a bit of a newbie with SAS and XPT files, so I'm sorry if I'm missing anything obvious!

Error

$ xport MGX_H.XPT > mgx_h.csv
Traceback (most recent call last):
  File "/home/user/.local/bin/xport", line 8, in <module>
    sys.exit(cli())
  File "/home/user/.local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/xport/cli.py", line 72, in cli
    library = xport.v56.load(input)
  File "/home/user/.local/lib/python3.9/site-packages/xport/v56.py", line 900, in load
    return loads(bytestring)
  File "/home/user/.local/lib/python3.9/site-packages/xport/v56.py", line 911, in loads
    return Library.from_bytes(bytestring)
  File "/home/user/.local/lib/python3.9/site-packages/xport/v56.py", line 700, in from_bytes
    self = Library(
  File "/home/user/.local/lib/python3.9/site-packages/xport/__init__.py", line 589, in __init__
    for dataset in members:
  File "/home/user/.local/lib/python3.9/site-packages/xport/v56.py", line 607, in from_bytes
    data.copy_metadata(head)
  File "/home/user/.local/lib/python3.9/site-packages/xport/__init__.py", line 412, in copy_metadata
    for k, v in self.items():
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 957, in items
    yield k, self._get_item_cache(k)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/generic.py", line 3542, in _get_item_cache
    res = self._box_col_values(values, loc)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3192, in _box_col_values
    return klass(values, index=self.index, name=name, fastpath=True)
  File "/home/user/.local/lib/python3.9/site-packages/xport/__init__.py", line 310, in __init__
    LOG.debug(f'Initialized {self}')
  File "/home/user/.local/lib/python3.9/site-packages/xport/__init__.py", line 276, in __repr__
    return f'{type(self).__name__}\n{super().__repr__()}\n{", ".join(metadata)}'
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/series.py", line 1327, in __repr__
    self.to_string(
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/series.py", line 1386, in to_string
    formatter = fmt.SeriesFormatter(
  File "/home/user/.local/lib/python3.9/site-packages/pandas/io/formats/format.py", line 261, in __init__
    self._chk_truncate()
  File "/home/user/.local/lib/python3.9/site-packages/pandas/io/formats/format.py", line 285, in _chk_truncate
    series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 274, in concat
    op = _Concatenator(
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 395, in __init__
    axis = sample._constructor_expanddim._get_axis_number(axis)
  File "/home/user/.local/lib/python3.9/site-packages/xport/__init__.py", line 340, in _constructor_expanddim
    raise NotImplementedError("Can't copy SAS variable metadata to dataframe")
NotImplementedError: Can't copy SAS variable metadata to dataframe

Environment

$ python --version
Python 3.9.0
$ pip show pandas
Name: pandas
Version: 1.1.4
...
$ pip show xport
Name: xport
Version: 3.2.1
...

RecursionError while reading an NHANES file

Installed in a fresh conda environment:

# Name                    Version                   Build  Channel
ca-certificates           2020.1.1                      0
certifi                   2020.4.5.1               py38_0
click                     7.1.1                    pypi_0    pypi
libcxx                    4.0.1                hcfea43d_1
libcxxabi                 4.0.1                hcfea43d_1
libedit                   3.1.20181209         hb402a30_0
libffi                    3.2.1                h0a44026_6
ncurses                   6.2                  h0a44026_0
numpy                     1.18.3                   pypi_0    pypi
openssl                   1.1.1g               h1de35cc_0
pandas                    1.0.3                    pypi_0    pypi
pip                       20.0.2                   py38_1
python                    3.8.2                hc70fcce_0
python-dateutil           2.8.1                    pypi_0    pypi
pytz                      2019.3                   pypi_0    pypi
pyyaml                    5.3.1                    pypi_0    pypi
readline                  8.0                  h1de35cc_0
setuptools                46.1.3                   py38_0
six                       1.14.0                   pypi_0    pypi
sqlite                    3.31.1               h5c1f38d_1
tk                        8.6.8                ha441bb4_0
wheel                     0.34.2                   py38_0
xport                     3.1.2                    pypi_0    pypi
xz                        5.2.5                h1de35cc_0
zlib                      1.2.11               h1de35cc_3

Tried to convert one file to another via xport file1.xpt > file2.csv.

Got an enormous error traceback, ending with:

  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 141, in __init__
    self._consolidate_check()
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 656, in _consolidate_check
    ftypes = [blk.ftype for blk in self.blocks]
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 656, in <listcomp>
    ftypes = [blk.ftype for blk in self.blocks]
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 349, in ftype
    return f"{dtype}:{self._ftype}"
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/_dtype.py", line 54, in __str__
    return dtype.name
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/_dtype.py", line 347, in _name_get
    if _name_includes_bit_suffix(dtype):
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/_dtype.py", line 326, in _name_includes_bit_suffix
    elif np.issubdtype(dtype, np.flexible) and _isunsized(dtype):
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/numerictypes.py", line 392, in issubdtype
    if not issubclass_(arg1, generic):
RecursionError: maximum recursion depth exceeded

Write performance is much slower and a much larger result file is produced compared to pandas.to_csv()

Nice library, and I would like to use it. I'm not sure how much work xport is doing, but I'm finding it much slower than pandas .to_csv().
We're trying to process data outside of SAS, then move it back into SAS 9.4.6.

I have a pandas DataFrame of shape (10000000, 10), all string objects.
pdf.to_csv() takes 30.46 seconds (670 MB produced).
xport.v56.dump(ds, f) takes 6.12 minutes (3.5 GB produced).

We're using pandas ==1.0.5 and underlying Pandas dataframe came from an Arrow type data structure.

I'm noticing that most of the time comes after the last Converting column 'column10' from object to string log message.

I don't know what the results for a corresponding .sas7bdat file would be, but sas7bdat would be the real end goal.
Thanks!

Need to work with SAS 9

Hello,

I was trying to change the name of the SAS dataset to more than 8 characters and figured out that this is not supported by this module, which supports only up to V6. You asked me to submit an issue if I want to work with SAS V9. I tried to change the version number from 6.06 to 9.4, but it didn't work.

Can you please help me with changing the name of the dataset so that it can contain more than 8 characters?

Thanks
Vignesh

Library doesn't support compressed XPORT files - Warning message

The library doesn't support compressed XPORT files, i.e. files starting with:

    **COMPRESSED** **COMPRESSED** **COMPRESSED** **COMPRESSED** **COMPRESSED**

Example files can be fetched from https://www.ctti-clinicaltrials.org/aact-database.

Having support for these files would be great, but might be complicated to do, as the format specification is not available.

Can you add an error message stating that the XPORT file is a compressed XPORT (CPORT) file, which is not supported by the library?
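A hedged sketch of such a check, based only on the '**COMPRESSED**' marker shown above (not the library's actual detection logic):

```python
XPORT_MAGIC = b'HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!'
CPORT_MARKER = b'**COMPRESSED**'

def is_cport(first_record: bytes) -> bool:
    """True if the file starts with the CPORT '**COMPRESSED**' banner."""
    return first_record.lstrip().startswith(CPORT_MARKER)

assert is_cport(b'**COMPRESSED** **COMPRESSED** **COMPRESSED**')
assert not is_cport(XPORT_MAGIC)
```

A reader could inspect the first 80-byte record and raise an error naming CPORT explicitly instead of a generic header parse failure.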

Thanks.

Enable SAS Transport Format for SAS v8 and 9

The file format is slightly different.
https://support.sas.com/techsup/technote/ts140_2.pdf

  1. The first header record consists of the following character string, in ASCII:
    HEADER RECORD*******LIBV8 HEADER RECORD!!!!!!!000000000000000000000000000000
  2. The first real header record uses the following layout:
    aaaaaaaabbbbbbbbccccccccddddddddeeeeeeee ffffffffffffffff

where aaaaaaaa and bbbbbbbb are each 'SAS ' and cccccccc is 'SASLIB ', dddddddd is
the version of the SAS system that created the file, and eeeeeeee is the operating system
creating it. ffffffffffffffff is the datetime created, formatted as ddMMMyy:hh:mm:ss.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.

Another way to consider this record is as a C structure:

    struct REAL_HEADER {
        char sas_symbol[2][8];
        char saslib[8];
        char sasver[8];
        char sas_os[8];
        char blanks[24];
        char sas_create[16];
    };
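For illustration, the REAL_HEADER layout above maps directly onto a Python struct format string; this is a sketch with made-up field values, not the library's parser:

```python
import struct

# 8 + 8 + 8 + 8 + 8 + 24 + 16 = 80 bytes, exactly one XPORT record
REAL_HEADER = struct.Struct('8s8s8s8s8s24s16s')
assert REAL_HEADER.size == 80

raw = (b'SAS     ' + b'SAS     ' + b'SASLIB  '
       + b'9.4     ' + b'Linux   ' + b' ' * 24
       + b'16FEB11:14:23:02')
sas1, sas2, saslib, version, osname, _, created = REAL_HEADER.unpack(raw)
assert saslib.rstrip() == b'SASLIB'
assert created == b'16FEB11:14:23:02'
```

Each field is blank-padded to its fixed width, so `.rstrip()` recovers the logical value.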
  3. Second real header record:
    ddMMMyy:hh:mm:ss

where the string is the datetime modified. Most often, the datetime created and datetime
modified will always be the same. Pad with ASCII blanks to 80 bytes.
Note that only a 2-digit year appears. If any program needs to read in this 2-digit year, be
prepared to deal with dates in the 1900s or the 2000s.
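In Python, the ddMMMyy:hh:mm:ss stamp can be parsed with datetime.strptime; the %y directive applies a pivot (00-68 becomes 2000-2068, 69-99 becomes 1969-1999), which is one way to handle the 2-digit year. The timestamp values below are made up:

```python
from datetime import datetime

FMT = '%d%b%y:%H:%M:%S'  # ddMMMyy:hh:mm:ss

recent = datetime.strptime('16FEB11:14:23:02', FMT)
assert recent.year == 2011  # %y pivots 00-68 into the 2000s

old = datetime.strptime('16FEB89:14:23:02', FMT)
assert old.year == 1989     # and 69-99 into the 1900s
```

If the pivot is wrong for your data, parse the two year digits yourself and choose the century explicitly.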

  4. Member header records:
    Both of these occur for every member in the transport file.
    HEADER RECORD*******MEMBV8 HEADER RECORD!!!!!!!000000000000000001600000000140
    HEADER RECORD*******DSCPTV8 HEADER RECORD!!!!!!!000000000000000000000000000000

Note the 0140 that appears in the member header record above. That value is the size of the variable descriptor (NAMESTR) record that is described later in this document.

  5. Member header data:
    aaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccddddddddeeeeeeeeffffffffffffffff

where aaaaaaaa is 'SAS ', bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb is the data set name,
cccccccc is SASDATA (if a SAS data set is being created), dddddddd is the version of
the SAS System under which the file was created, and eeeeeeee is the operating system
name. ffffffffffffffff is the datetime created, formatted as in previous headers. Consider
this C structure:

    struct REAL_HEADER {
        char sas_symbol[8];
        char sas_dsname[32];
        char sasdata[8];
        char sasver[8];
        char sas_osname[8];
        char sas_create[16];
    };

The second header record is

    ddMMMyy:hh:mm:ss aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbb

where the datetime modified appears using DATETIME16. format, followed by blanks
up to column 33, where the a's above correspond to a blank-padded data set label, and
bbbbbbbb is the blank-padded data set type. Note that data set labels can be up to 256
characters as of Version 8 of the SAS System, but only up to the first 40 characters are
stored in the second header record. Note also that only a 2-digit year appears in the
datetime modified value. If any program needs to read in this 2-digit year, be prepared to
deal with dates in the 1900s or the 2000s.

Consider the following C structure:

    struct SECOND_HEADER {
        char dtmod_day[2];
        char dtmod_month[3];
        char dtmod_year[2];
        char dtmod_colon1[1];
        char dtmod_hour[2];
        char dtmod_colon2[1];
        char dtmod_minute[2];
        char dtmod_colon3[1];
        char dtmod_second[2];
        char padding[16];
        char dslabel[40];
        char dstype[8];
    };
  6. Namestr header record:
    One for each member.
    HEADER RECORD*******NAMSTV8 HEADER RECORD!!!!!!!000000xxxxxx000000000000000000
  7. Namestr records:
    Each namestr field is 140 bytes long, but the fields are streamed together and broken in
    80-byte pieces. If the last byte of the last namestr field does not fall in the last byte of the
    80-byte record, the record is padded with ASCII blanks ('20'x) to 80 bytes.

Here is the C structure definition for the namestr record:

    struct NAMESTR {
        short ntype; /* VARIABLE TYPE: 1=NUMERIC, 2=CHAR */
        short nhfun; /* HASH OF NNAME (always 0) */
        short nlng; /* LENGTH OF VARIABLE IN OBSERVATION */
        short nvar0; /* VARNUM */
        char8 nname; /* NAME OF VARIABLE */
        char40 nlabel; /* LABEL OF VARIABLE */
        char8 nform; /* NAME OF FORMAT */
        short nfl; /* FORMAT FIELD LENGTH OR 0 */
        short nfd; /* FORMAT NUMBER OF DECIMALS */
        short nfj; /* 0=LEFT JUSTIFICATION, 1=RIGHT JUST */
        char nfill[2]; /* (UNUSED, FOR ALIGNMENT AND FUTURE) */
        char8 niform; /* NAME OF INPUT FORMAT */
        short nifl; /* INFORMAT LENGTH ATTRIBUTE */
        short nifd; /* INFORMAT NUMBER OF DECIMALS */
        long npos; /* POSITION OF VALUE IN OBSERVATION */
        char longname[32]; /* long name for Version 8-style */
        short lablen; /* length of label */
        char rest[18]; /* remaining fields are irrelevant */
    };

The variable name truncated to 8 characters goes into nname, and the complete name
goes into longname. Use blank padding in either case if necessary. The variable label
truncated to 40 characters goes into nlabel, and the total length of the label goes into
lablen. If your label exceeds 40 characters, you will have the opportunity to write the
complete label in the label section described below.

Note that the length given in the last 4 bytes of the member header record indicates the
actual number of bytes for the NAMESTR structure. The size of the structure listed
above is 140 bytes.
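A quick stdlib sketch of how the 140-byte namestr records stream across 80-byte records, per the rules above:

```python
NAMESTR_SIZE = 140
RECORD_LEN = 80

def namestr_section_length(n_vars: int) -> int:
    """Bytes occupied by n_vars namestrs, blank-padded to an 80-byte boundary."""
    raw = n_vars * NAMESTR_SIZE
    return raw + (-raw % RECORD_LEN)

assert namestr_section_length(1) == 160  # 140 padded up to 160
assert namestr_section_length(4) == 560  # 4 * 140 is already a multiple of 80
```

Because 140 and 80 share a factor of 20, every 4th variable lands the section exactly on a record boundary.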

If you have any labels that exceed 40 characters, they can be placed in this section. The
label records section starts with this header:

    HEADER RECORD*******LABELV8 HEADER RECORD!!!!!!!nnnnn

where nnnnn is the number of variables for which long labels will be defined.

Each label is defined using the following:

    aabbccd.....e.....

where

    aa = variable number
    bb = length of name
    cc = length of label
    d.... = name in bb bytes
    e.... = label in cc bytes

For example, variable number 1 named x with the 43-byte label 'a very long label for x is
given right here' would be provided as a stream of 6 bytes in hex '00010001002B'X
followed by the ASCII characters.

    xa very long label for x is given right here

These are streamed together. The last label descriptor is followed by ASCII blanks
('20'X) to an 80-byte boundary.
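The worked example above can be reproduced with struct (big-endian 2-byte integers are assumed for the aa/bb/cc fields, matching the hex shown):

```python
import struct

name = 'x'
label = 'a very long label for x is given right here'
assert len(label) == 43  # the 43-byte label from the example

# aa = variable number, bb = length of name, cc = length of label
descriptor = struct.pack('>hhh', 1, len(name), len(label))
assert descriptor == bytes.fromhex('00010001002b')

record = descriptor + name.encode('ascii') + label.encode('ascii')
```

The 6 descriptor bytes are followed immediately by the name and label text, exactly as streamed in the file.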

If you have any format or informat names that exceed 8 characters, regardless of the
label length, a different form of label record header is used:

    HEADER RECORD*******LABELV9 HEADER RECORD!!!!!!!nnnnn

where nnnnn is the number of variables for which long format names and any labels will
be defined.

Each label is defined using the following:

aabbccddeef.....g.....h.....i.....

where

    aa=variable number
    bb=length of name in bytes
    cc=length of label in bytes
    dd=length of format description in bytes
    ee=length of informat description in bytes
    f.....=text for variable name
    g.....=text for variable label
    h.....=text for format description
    i.....=text of informat description

Note: The FORMAT and INFORMAT descriptions are in the form used in a FORMAT
or INFORMAT statement. For example, my_long_fmt., my_long_fmt8.,
my_long_fmt8.2. The text values are streamed together and no characters appear for
attributes with a length of 0 bytes.

For example, variable number 1 is named X and has a label of 'ABC,' no attached
format, and an 11-character informat named my_long_fmt with informat length=8 and
informat decimal=0. The data would be

    (hex)      (characters)
    010103000d XABCmy_long_fmt

The last label descriptor is followed by ASCII blanks ('20'X) to an 80-byte boundary.

  8. Observation header:
   HEADER RECORD*******OBSV8 HEADER RECORD!!!!!!!000000000000000000000000000000
  9. Data records:

Data records are streamed in the same way that namestrs are. There is ASCII blank
padding at the end of the last record if necessary. There is no special trailing record.

xport.ParseError: header

Hi,

I am able to read some xpt files correctly, but for some files with the same code I am getting the following error:

xport.ParseError: header -- expected b'HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!' got b'STUDYID,DOMAIN,USUBJID,SUBJID,RFSTDTC,RFENDTC,RF'

The latter part of the error after "got" are the column names in my .xpt file.

I am wondering if this has to do with the xpt file being generated by a newer version of SAS. If so, please advise how best to get around this issue.

Thanks!

SAS Transport V8 character limit compatibility

I followed this wonderful link from the docs that knew just what I wanted

If you want the relative comfort of SAS Transport v8’s limit of 246 characters, please make an enhancement request.

Is this upgrade feasible?

TypeError: data type "string" not understood

Hello,

Thanks for maintaining this package, it's quite helpful.

I'm trying to run it and, while typing exactly what's in the help section, I'm getting a strange error message. I'm pretty sure it used to work. I'm using version 3.1.3 (from Anaconda).


import pandas
import xport
import xport.v56

df = pandas.DataFrame({
    'alpha': [10, 20, 30],
    'beta': ['x', 'y', 'z'],
})

...  # Analysis work ...

ds = xport.Dataset(df, name='DATA', label='Wonderful data')
for k, v in ds.items():
    v.label = k               # Use the column name as SAS label
    v.name = k.upper()[:8]    # SAS names are limited to 8 chars
    if v.dtype == 'object':
        v.format = '$CHAR20.' # Variables will parse SAS formats
    else:
        v.format = '10.2'

library = xport.Library({'DATA': ds})
# Libraries can have multiple datasets.

with open('example.xpt', 'wb') as f:
    xport.v56.dump(library, f)

Getting this log in Jupyter:


Converting column 'alpha' from int64 to float
Converting column 'beta' from object to string
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)

TypeError: data type "string" not understood

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in __bytes__(self)
    613                 try:
--> 614                     self[column] = self[column].astype(dtype)
    615                 except Exception:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5690             # GH 24704: use iloc to handle duplicate column names
-> 5691             results = [
   5692                 self.iloc[:, i].astype(dtype, copy=copy)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
    530                     for b in blocks
--> 531                 ]
    532 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    394 
--> 395         self._consolidate_inplace()
    396 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    533             return self.make_block(nv)
--> 534 
    535         # ndim > 1

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    594             return self.make_block(Categorical(self.values, dtype=dtype))
--> 595 
    596         dtype = pandas_dtype(dtype)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)

TypeError: data type 'string' not understood

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-43-8d39bacd8d51> in <module>
     23 
     24 with open('example.xpt', 'wb') as f:
---> 25     xport.v56.dump(library, f)

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in dump(library, fp)
    905 
    906     """
--> 907     fp.write(dumps(library))
    908 
    909 

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in dumps(library)
    924 
    925     """
--> 926     return bytes(Library(library))

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in __bytes__(self)
    704             b'created': strftime(self.created if self.created else datetime.now()),
    705             b'modified': strftime(self.modified if self.modified else datetime.now()),
--> 706             b'members': b''.join(bytes(Member(member)) for member in self.values()),
    707         }
    708 

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in <genexpr>(.0)
    704             b'created': strftime(self.created if self.created else datetime.now()),
    705             b'modified': strftime(self.modified if self.modified else datetime.now()),
--> 706             b'members': b''.join(bytes(Member(member)) for member in self.values()),
    707         }
    708 

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in __bytes__(self)
    614                     self[column] = self[column].astype(dtype)
    615                 except Exception:
--> 616                     raise TypeError(f'Could not coerce column {column!r} to {dtype}')
    617         header = bytes(MemberHeader.from_dataset(self))
    618         observations = bytes(Observations.from_dataset(self))

TypeError: Could not coerce column 'beta' to string

Any idea what's causing this?

thanks a lot,

Kind regards,
Nicolas
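A likely cause (my assumption, not confirmed in this thread): `xport` v3 coerces text columns to pandas' dedicated `'string'` dtype, which was only introduced in pandas 1.0.0, so `astype('string')` on an older pandas raises exactly this `TypeError: data type "string" not understood`. A minimal check:

```python
import pandas as pd

# The dedicated 'string' (StringDtype) was added in pandas 1.0.0;
# earlier versions only understand 'object' for text columns.
major = int(pd.__version__.split('.')[0])
if major < 1:
    print(f'pandas {pd.__version__} predates the string dtype; '
          'try: pip install "pandas>=1.0"')
else:
    s = pd.Series(['x', 'y', 'z']).astype('string')
    print(s.dtype)
```

If this prints the upgrade hint, updating pandas in the Anaconda environment should let the example from the docs run.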
