
pymarc's Introduction

The pymarc repository has moved to GitLab.

pymarc's People

Contributors

acdha, adjam, anarchivist, cyperus-papyrus, danmichaelo, davidchouinard, dbs, dchud, edsu, edwardbetts, eshellman, godmar, gsf, gugek, jimnicholls, jmtaysom, lbjay, maherma-adg, mbklein, miku, mjgiarlo, nemobis, piskvorky, rlmv, ruebot, sicarrots, souzaluuk, termim, wearp, wooble


pymarc's Issues

Extend Writer with JSONWriter, XMLWriter and TextWriter

The only subclass of Writer we have is MARCWriter.

I would like to create a few more subclasses of Writer.

  • JSONWriter that outputs an array of MARC-in-JSON objects.
  • XMLWriter that outputs a MARCXML collection.
  • TextWriter that outputs str(record), with a blank line between records.

My motivation here is so I can code to the generic Writer interface and plug in different writers as need be. In particular, I want to plug in a TextWriter during development but a MARCWriter in production.
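A TextWriter along these lines could be quite small. This is a sketch only: the class name comes from the proposal above, and the file-handle protocol is assumed rather than taken from pymarc's actual Writer base class.

```python
class TextWriter:
    """Sketch of the proposed TextWriter (hypothetical, not in pymarc):
    writes str(record) for each record, with a blank line between records."""

    def __init__(self, file_handle):
        self.file_handle = file_handle
        self._first = True

    def write(self, record):
        # Separate records with a blank line, but not before the first one.
        if not self._first:
            self.file_handle.write("\n")
        self._first = False
        self.file_handle.write(str(record) + "\n")
```

Because it only relies on str(record), the same object could be swapped in anywhere a Writer is expected during development.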

API for working with XML data isn't very intuitive

Having some xml data,

data = open('test.xml', 'rb')

, I expected from the README example to be able to do something like

from pymarc import MARCReader
for record in MARCReader(data):
    ...

but instead I had to do

import pymarc
for record in pymarc.parse_xml_to_array(data):
    ...

Determining the file type should be quite easy by reading the first characters of the file stream: XML if "<?xml", JSON if "{", plain MARC otherwise.
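That sniffing rule can be sketched in a few lines. The function name and return values here are illustrative, not pymarc API, and it assumes a seekable binary stream:

```python
def sniff_marc_format(stream):
    """Guess the serialization from the first bytes of a seekable binary
    stream: 'xml' for "<?xml", 'json' for "{", plain 'marc' otherwise."""
    head = stream.read(5)
    stream.seek(0)  # rewind so the real reader sees the whole stream
    if head.startswith(b"<?xml"):
        return "xml"
    if head.startswith(b"{"):
        return "json"
    return "marc"
```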

Next I wanted to try to serialize a record to XML. The Record object has methods like as_marc(), as_marc21() and hm, even as_json(), but no as_xml()! Instead:

pymarc.record_to_xml(record)

helper methods for new 3XX RDA fields?

RDA has brought with it three new fields which are mandatory according to the standard. They are

336 - Content Type
337 - Media Type
338 - Carrier Type

As they will (or should) appear in every RDA record, would helper methods in record.py for these be of interest to pymarc? Something along the lines of content_type(self), etc.

Let me know and I'd be happy to submit a pull.
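A sketch of what such a helper might look like, written against pymarc's get_fields/get_subfields style. The helper itself is hypothetical (it is what the issue proposes, not existing pymarc code):

```python
def content_type(record):
    """Sketch of the proposed helper: return the 336 $a values,
    one per Content Type field. (Hypothetical, not in record.py.)"""
    values = []
    for field in record.get_fields("336"):
        values.extend(field.get_subfields("a"))
    return values
```

media_type (337) and carrier_type (338) would be the same pattern with a different tag.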

as_marc() should throw exception on fields that are too big

I use pymarc to generate ISO 2709 records from internal data.
When I read the records generated by pymarc (even with pymarc itself!) I get a directory offset problem.
With yaz-marcdump the errors are:
(Directory offset 204: Bad value for data length and/or length starting (394\x1E##\x1Fa9782))
(Base address not at end of directory, base 194, end 205)
(Directory offset 132: Data out of bounds 51665 >= 15063)

What's wrong?

A permissive MARCReader

Over in #89 there has been a long discussion about working with invalid MARC data in the wild. I must admit I don't work with MARC much these days, so I had no idea people were running into so many problems processing large batches of MARC records.

Currently when pymarc runs into a structural problem in a MARC record it will throw an exception (RecordLeaderInvalid, BaseAddressNotFound, BaseAddressInvalid, RecordDirectoryInvalid, NoFieldsFound) which will also cause record iteration to stop.

@anarchivist offered up a PermissiveMARCReader which he has used to process large amounts of MARC data. PermissiveMARCReader catches all the exceptions thrown by structural problems with the MARC record and moves on to the next record.

Rather than introducing a new class I suggest that a new parameter named strict be added to the MARC.MARCReader constructor. When set to True it will continue to throw these exceptions. When set to False it will catch the exceptions, log them, and move on to the next record. It may be that some of these exceptions need to be relaxed, and the invalid data interpreted in some way. But let's open new issue tickets for those situations as they come up.

Based on the conversation we've been seeing lately I think the default for strict should be set to False. The MARCReader API will be backwards compatible (code that uses pymarc won't need to change). However this will be a significant change in behavior so I think a new minor version release will be needed, v3.2.

What do folks think?
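The strict=False behaviour described above amounts to a catch-log-skip loop around record parsing. A generic sketch (the function and parameter names are illustrative, not pymarc's):

```python
import logging

def iterate_permissively(raw_records, parse, structural_errors):
    """Sketch of the proposed strict=False behaviour: parse each raw record,
    and on a structural error log it and move on to the next record instead
    of letting the exception stop iteration."""
    for raw in raw_records:
        try:
            yield parse(raw)
        except structural_errors as exc:
            logging.warning("skipping invalid record: %s", exc)
```

In pymarc's case, structural_errors would be the tuple of exceptions listed above (RecordLeaderInvalid, BaseAddressNotFound, and so on).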

MARC21 line-format

pymarc can handle standard MARC21 in ISO 2709 (raw MARC).

However I got a strange format of MARC, which is called line-format by the utility yaz-marcdump.

How can I parse this MARC with pymarc?

Following is the MARC record (with some characters in Chinese), also available at https://gist.github.com/mail6543210/6095688. The file can be read as UTF-8.

000    cam
010 0  $a978-986-157-529-2$b平裝$dNT$380
091    $acw$bCIP97006132
100    $a20080401d2008    k  y0chiy50      e
101 0  $achi
102    $acw
200 1  $a危機OFF$e企業危機管理指南$f羅倫斯.巴頓(Laurence Barton)原著$g許瀞予譯
205    $a初版
210    $a臺北市$c麥格羅希爾$d2008.05
211    $a0805
215 0  $a  面$d  公分
225 1  $a經營管理叢書$vBM173
330    $a本書提及企業如何因應威脅、災難、人為破壞以及醜聞,如何控管各種風險。
454  1 $12001 $aCrisis leadership now$ea real-world guide to preparing for threats, disaster, sabotage, and scandal
606    $2cst$a危機管理
606    $2cst$a企業領導
606    $2lc$aLeadership.
606    $2lc$aCrisis management.
676    $a658.4/092
680    $aHD57.7
681    $a494

Side effect bug with Record.as_marc() method

I've been dealing with a problem related to a string vs. byte side-effect in a MARC record leader when using the as_marc() method when processing both new and existing MARC records in Python 3. The bug can be replicated as follows:

>>> new_rec = pymarc.Record()
>>> type(new_rec.leader)
<class 'str'>
>>> new_rec.as_marc()
b'00026     2200025   4500\x1e\x1d'
>>> type(new_rec.leader)
<class 'bytes'>

Any subsequent processing on the record will cause errors, most significantly when trying to convert the record to MARC XML:

>>> pymarc.record_to_xml(new_rec) # Causes a TypeError exception trying to serialize a byte string

The problem occurs at line 367 in record.py, where the leader string (strleader) is encoded to bytes and assigned back to the Record instance's leader. Before suggesting a fix (I can think of a couple, and I can submit a pull request): are there any reasons why this behavior is preferred in the as_marc() method? (My personal bias is that a serialization method like as_marc() should not alter the underlying MARC data structure.)

Thanks,
Jeremy
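One way to avoid the side effect is to encode a local copy of the leader during serialization instead of assigning the bytes back onto the record. This is a sketch of that idea, not pymarc's actual code:

```python
def encode_leader_for_output(leader):
    """Return the leader as bytes for serialization without mutating the
    record's own attribute. (Hypothetical helper illustrating the fix.)"""
    if isinstance(leader, str):
        return leader.encode("utf-8")
    return leader
```

as_marc() would use the returned bytes internally, leaving record.leader a str throughout.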

Problem reading alternative graphic character at end of field

I've hit a problem: pymarc doesn't handle the alternative graphic character set properly if it appears at the end of a field. The example code below gives "IndexError: string index out of range".

from pymarc import marc8_to_unicode

assert marc8_to_unicode('CO\x1bb2\x1bs') == u'CO\u2082'

six import fails

for whatever reason, the following statement fails inside a django 1.4.5 project (python 2.7.5)

from six.moves import zip_longest as izip_longest

(in record.py)

Exception at the end of iterating through a marc file.

When I iterate through a file containing multiple MARC records, the following exception is raised after the last record:
Traceback (most recent call last):
  File "compare_marcs.py", line 320, in <module>
    pprint(compare_marc_files(exp, obs))
  File "compare_marcs.py", line 299, in compare_marc_files
    for exp_marc, obs_marc in izip(expected, observed):
  File "/home/shinto/shinto/virtualenvs/marc/local/lib/python2.7/site-packages/six.py", line 558, in next
    return type(self).__next__(self)
  File "/home/shinto/shinto/virtualenvs/marc/local/lib/python2.7/site-packages/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "/home/shinto/shinto/virtualenvs/marc/local/lib/python2.7/site-packages/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "/home/shinto/shinto/virtualenvs/marc/local/lib/python2.7/site-packages/pymarc/record.py", line 245, in decode_marc
    raise BaseAddressInvalid
pymarc.exceptions.BaseAddressInvalid: Base address exceeds size of record

Record.as_marc() forces output encoding to UTF-8

Looks like this was introduced in 022172c and e1b1f9a by me. :(

Really, Record.as_marc() should do some checking to see if leader/09 is set to "a", or perhaps as_marc() should be passed a boolean argument. Currently, the Record constructor has an optional "force_utf8" boolean argument which is thus far only used by decode_marc(). We could use it to set an attribute on the object.

Either way, currently, pymarc fails when trying to write MARC-8 encoded records. Presumably there are going to be some cases where people want to script a minor change and still output the record in MARC-8.

Add encoding option to marcxml.record_to_xml

Unless I am missing something with respect to the marcxml functionality, the record_to_xml function seems to return text encoded in us-ascii, which causes problems when systems are expecting utf-8 encoding. Tracing this issue to its source revealed that xml.etree.ElementTree.tostring takes an optional encoding parameter, which defaults to us-ascii. I am proposing to be able to pass an optional encoding parameter from marcxml.record_to_xml's invocation of ET.tostring.

In my local fork, I have made the following change:

def record_to_xml(record, quiet=False, namespace=False, encoding='us-ascii'):
  node = record_to_xml_node(record, quiet, namespace)
  return ET.tostring(node, encoding=encoding)

Without the change, my output for record_to_xml on UTF-8 strings that contain diacritics looks like this:

<record>
    <leader>          22        4500</leader>
    <datafield ind1=" " ind2=" " tag="246">
      <subfield code="a">Nouvelles-H&#233;brides, communiqu&#233;s par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
      <subfield code="b">Lois et r&#232;glements promulgu&#233;s pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et r&#233;glementer la distribution des stup&#233;fiants, amend&#233;e par le Protocole du 11 d&#233;cembre 1946</subfield>
    </datafield>
</record>

And the resulting file ends up with a us-ascii encoding, which causes import of the record to fail on the MARC based system we are using.

With the change, I get output that looks like this when I pass the optional encoding:

<record>
	<leader>          22        4500</leader>
	<datafield ind1=" " ind2=" " tag="246">
		<subfield code="a">Nouvelles-Hébrides, communiqués par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
		<subfield code="b">Lois et règlements promulgués pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et réglementer la distribution des stupéfiants, amendée par le Protocole du 11 décembre 1946</subfield>
	</datafield>
</record>

I invoke as follows:

out_file.write(marcxml.record_to_xml(record, encoding='utf-8'))

And the resulting file ends up with a utf-8 encoding.

Note that I tried forcing encoding to utf-8 at each successive level beginning with the open() function and working backward to the record itself. The only thing I found that actually works is to pass an encoding parameter in this particular function. If I am missing something (obvious or not), I'd be interested in correcting my oversight.

The change looks trivial to me and preserves the default functionality, but I don't know if there are tests that depend on it.

MARCReader obscure warning: couldn't find 0xa0 in g0=66 g1=69

Not all, but a good portion of the records in the associated mrc file, when read, produce the warning "couldn't find 0xa0 in g0=66 g1=69". Is this expected?

>>> record.as_marc()
'01182nam0 22003253i 450 001001100000005001700011010001800028100004100046101000800087102000700095181002000102182001100122200008100133205001700214210003400231215001800265225001000283300003100293300004800324410003200372500004800404676004100452700003700493702004000530790004800570801002800618850001900646950017800665977001300843\x1eMIL0864540\x1e20180302002150.0\x1e  \x1fa9788804642091\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faRaccolto di sangue\x1fe[thriller]\x1ffSharon Bolton\x1fgtraduzione di Manuela Faimali\x1e  \x1faEd. speciale\x1e  \x1faMilano\x1fcOscar Mondadori\x1fd2014\x1e  \x1fa453 p.\x1fd20 cm\x1e| \x1faOscar\x1e  \x1faIn copertina: Oscar estate\x1e  \x1faA pagina IV di copertina: ebook disponibile\x1e 0\x1f1001CFI0000102\x1f12001 \x1faOscar\x1e10\x1faHThe Iblood harvest\x1f3UBO3836087\x1f9RAVV580629\x1e  \x1fa823.92\x1f9Narrativa inglese. 2000-\x1fv22\x1e 1\x1faBolton\x1fb, S. J.\x1f3RAVV580629\x1f4070\x1e 1\x1faFaimali\x1fb, Manuela\x1f3LO1V356745\x1f4070\x1e 1\x1faBolton\x1fb, Sharon\x1f3CFIV315469\x1fzBolton, S. J.\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4698\x1fe ELAPE0001648725  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
couldn't find 0xa0 in g0=66 g1=69
couldn't find 0xa0 in g0=66 g1=69
>>> record.as_marc()
'01030nam0 22003013i 450 001001100000005001700011010001800028010001800046100004100064101001300105102000700118181002000125182001100145200004200156210002500198215001800223225001600241300003300257410003800290500004800328517003100376676004200407700004100449801002800490850001900518950017800537977001300715\x1eMIL0864555\x1e20180302002150.0\x1e  \x1fa9788856639339\x1e  \x1fa9788856646948\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1fcita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faTutta mia la citt\xa9 \x1ffCarlotta Pistone\x1e  \x1faMilano\x1fcPiemme\x1fd2014\x1e  \x1fa306 p.\x1fd22 cm\x1e| \x1faPiemme voci\x1e  \x1faIn copertina: Milano in love\x1e 0\x1f1001CAG1804037\x1f12001 \x1faPiemme voci\x1e10\x1faTutta mia la citt\xa9 \x1f3LO11530364\x1f9RMLV077939\x1e1 \x1faMilano in love\x1f9BVE0684571\x1e  \x1fa853.92\x1f9Narrativa italiana. 2000-\x1fv22\x1e 1\x1faPistone\x1fb, Carlotta\x1f3RMLV077939\x1f4070\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4725\x1fe ELAPE0001649025  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record.as_dict()
{'fields': [{'001': u'MIL0864555'}, {'005': u'20180302002150.0'}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856639339'}], 'ind2': u' '}}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856646948'}], 'ind2': u' '}}, {'100': {'ind1': u' ', 'subfields': [{u'a': u'20140730d2014    ||||0itac50      ba'}], 'ind2': u' '}}, {'101': {'ind1': u'|', 'subfields': [{u'a': u'ita'}, {u'c': u'ita'}], 'ind2': u' '}}, {'102': {'ind1': u' ', 'subfields': [{u'a': u'it'}], 'ind2': u' '}}, {'181': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'i '}, {u'b': u'xxxe  '}], 'ind2': u'1'}}, {'182': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'n'}], 'ind2': u'1'}}, {'200': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'f': u'Carlotta Pistone'}], 'ind2': u' '}}, {'210': {'ind1': u' ', 'subfields': [{u'a': u'Milano'}, {u'c': u'Piemme'}, {u'd': u'2014'}], 'ind2': u' '}}, {'215': {'ind1': u' ', 'subfields': [{u'a': u'306 p.'}, {u'd': u'22 cm'}], 'ind2': u' '}}, {'225': {'ind1': u'|', 'subfields': [{u'a': u'Piemme voci'}], 'ind2': u' '}}, {'300': {'ind1': u' ', 'subfields': [{u'a': u'In copertina: Milano in love'}], 'ind2': u' '}}, {'410': {'ind1': u' ', 'subfields': [{u'1': u'001CAG1804037'}, {u'1': u'2001 '}, {u'a': u'Piemme voci'}], 'ind2': u'0'}}, {'500': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'3': u'LO11530364'}, {u'9': u'RMLV077939'}], 'ind2': u'0'}}, {'517': {'ind1': u'1', 'subfields': [{u'a': u'Milano in love'}, {u'9': u'BVE0684571'}], 'ind2': u' '}}, {'676': {'ind1': u' ', 'subfields': [{u'a': u'853.92'}, {u'9': u'Narrativa italiana. 
2000-'}, {u'v': u'22'}], 'ind2': u' '}}, {'700': {'ind1': u' ', 'subfields': [{u'a': u'Pistone'}, {u'b': u', Carlotta'}, {u'3': u'RMLV077939'}, {u'4': u'070'}], 'ind2': u'1'}}, {'801': {'ind1': u' ', 'subfields': [{u'a': u'IT'}, {u'b': u'IT-000000'}, {u'c': u'20140730'}], 'ind2': u'3'}}, {'850': {'ind1': u' ', 'subfields': [{u'a': u'IT-'}, {u'a': u'IT-MI0185'}], 'ind2': u' '}}, {'950': {'ind1': u' ', 'subfields': [{u'a': u'Arch. della  Produzione Editoriale della Lombardia'}, {u'c': u'1 v.'}, {u'd': u' ELAPE-M     F18                     4725'}, {u'e': u' ELAPE0001649025  VMN                       1 v.'}, {u'f': u'B '}, {u'h': u'20141126'}, {u'i': u'20141126'}], 'ind2': u'0'}}, {'977': {'ind1': u' ', 'subfields': [{u'a': u' EL'}, {u'a': u' NB'}], 'ind2': u' '}}], 'leader': u'01030nam0 22003013i 450 '}

IE001_MIL_EL_00017104.zip

readthedocs

It would be nice to get pymarc up on readthedocs.

Creating a Record with a particular leader

I'm exporting records from our library system and I need to replicate the leader field where it makes sense to do so. Record's API does not allow me to do that.

Can I propose the addition of a leader keyword argument to __init__ of Record? This argument would default to ' ' * LEADER_LEN.

Leader positions 5 through 8 and 17 through 19 would be taken from the leader argument. The other leader positions would be set or calculated, as is currently done.

Happy to submit a pull request if this proposal is acceptable.
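The merging rule proposed above can be sketched as a small helper. This is illustrative only: the real implementation would live inside Record, and the positions not taken from the supplied leader would be set or calculated as pymarc already does, rather than left blank.

```python
LEADER_LEN = 24  # standard MARC leader length

def merge_leader(supplied):
    """Sketch of the proposal: take positions 5-8 and 17-19 from the
    caller-supplied leader and leave the rest at the blank default,
    to be set or calculated later. (Hypothetical helper.)"""
    out = list(" " * LEADER_LEN)
    out[5:9] = supplied[5:9]
    out[17:20] = supplied[17:20]
    return "".join(out)
```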

MARC-8 mapping (Eszett, Euro Sign, and some revisions)

I came across some records in the wild which had the Eszett in them and noted that the existing marc8_mapping.py doesn't have a mapping for that character (Unicode: U+00DF).

It looks like the LC Code Tables for MARC-8 mappings were updated in 2004: see https://memory.loc.gov/diglib/codetables/45.html which might explain how the character (and the Euro symbol) are overlooked.

I can provide an updated file in a pull request.

But there are a couple of other changes listed that aren't reflected in the mapping:

See:

Revised June 2004 to add the Eszett (M+C7) and the Euro Sign (M+C8) to the
MARC-8 set.

Revised September 2004 to change the mapping from MARC-8 to Unicode for
the Ligature (M+EB and M+EC) from U+FE20 and U+FE21 to U+0361.

Revised September 2004 to change the mapping from MARC-8 to Unicode for
the Double Tilde (M+FA and M+FB) from U+FE22 and U+FE23 to U+0360.

Revised March 2005 to change the mapping from MARC-8 to Unicode for the
Alif (M+2E) from U+02BE to U+02BC.

So the question is how to handle the revised mappings. Just do the right thing right now? Keep the old behavior? It's easy enough with the new characters, but the changed mappings might be problematic for some users.
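The quoted revisions can be collected into a small table of updates. The dictionary shape here is illustrative (marc8_mapping.py uses its own structure), and mapping the second half of each combining pair to the same code point is an assumption on my part; the code points themselves come from the revision notes above:

```python
# MARC-8 code point -> revised Unicode code point (sketch, not pymarc's table)
MARC8_MAPPING_REVISIONS = {
    0xC7: 0x00DF,  # Eszett, added June 2004
    0xC8: 0x20AC,  # Euro Sign, added June 2004
    0xEB: 0x0361,  # Ligature first half, Sept 2004: was U+FE20
    0xEC: 0x0361,  # Ligature second half: was U+FE21 (assumed same target)
    0xFA: 0x0360,  # Double Tilde first half, Sept 2004: was U+FE22
    0xFB: 0x0360,  # Double Tilde second half: was U+FE23 (assumed same target)
    0x2E: 0x02BC,  # Alif, March 2005: was U+02BE
}
```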

MARC-8 output

I need to output records encoded in MARC-8. I am willing to code this, but I would like to know how you would advise I do it.

Version mismatch.

With version 2.8.2, setup.py indicates the version number correctly (2.8.2) but the pymarc/__init__.py file specifies it as 2.8.1, which is wrong. You've mentioned this in your comment in the setup.py file as well: # remember to update pymarc/__init__.py on release!

JSON serializer multiple subfields bug

The JSON serializer [written by me :|] contains a bug that throws away multiple subfields and keeps only one.

I'll shortly submit a pull request that fixes this issue.

Problem reading alternative graphic character sets

Below is a failing unit test. As I understand it, subscript 2 can be represented in MARC-8 like this: ESC b 2 ESC s.
pymarc doesn't handle this correctly. Is it a bug, or am I doing something wrong?

from pymarc import marc8_to_unicode

assert marc8_to_unicode('CO\x1bb2\x1bs is a gas') == u'CO\u2082 is a gas'

Document use of MARCXML with an example

We are still trying to find out how to parse and serialize MARCXML. I then stumbled upon #73 (make it simpler) but one or two examples in the documentation may be enough to start with:

  • how to read a full MARCXML file
  • how to read a stream of records from a MARCXML file (maybe this helps)
  • how to serialize MARC records to XML

P.S: See also this help request.

YAZ - collecting data and printing them with PYMARC

Hello. I have simple data collected from YAZ commands.

yaz-client -m catalogue.dat

I am connecting to a library which uses the MARC21 format and UTF-8 encoding.
I am saving records to the catalogue.dat file. It's a Czech library, so titles contain special characters, for example Ř or Ě. When I run this code:

def get_books(request):
    with open('catalogue.dat', 'rb') as fh:
        reader = MARCReader(fh)
        for record in reader:
            print(str(record.title()))
    return HttpResponseRedirect('/')

Console will print this:

couldn't find 0xbe in g0=66 g1=69
Zelen©Ł kniha /
couldn't find 0xbe in g0=66 g1=69
Kniha p¿©Łtel /
Kniha ¿©Ưkadel /
Kniha poezie /
Kniha dn©Ư /
Kniha ¿©Ưkadel /
Kniha definic /
Kniha cest /
Kniha Frenesis /
Smoln©Ł kniha /
couldn't find 0xbe in g0=66 g1=69
couldn't find 0xbe in g0=66 g1=69
couldn't find 0xbe in g0=66 g1=69
couldn't find 0xaf in g0=66 g1=69

So basically there are two issues: first, why does it print the "couldn't find" errors, and second, why does it print the data without those special characters? Thank you so much.

record.title() conflates 245a and 245b

The title() method, if called on a record where the main title is in 245 $a and the subtitle in 245 $b, jams the title and subtitle together. They need a colon and space separator.
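The requested fix can be sketched as a small join helper. This is hypothetical (not pymarc's title() implementation), and the rstrip handles the common case where ISBD punctuation already trails $a:

```python
def join_title(subfield_a, subfield_b=None):
    """Sketch of the requested fix: join 245 $a and 245 $b with ': ',
    tolerating ISBD punctuation already trailing $a. (Hypothetical helper.)"""
    if not subfield_b:
        return subfield_a
    return subfield_a.rstrip(" :") + " : " + subfield_b
```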

'module' object has no attribute 'python_2_unicode_compatible'

an email from [email protected]

I've just started to work with python/pymarc on a new computer and used pip to install pymarc (using Python 2.7.10), but when I've just tried to run a python script, I get the following error for the "import pymarc" line:

172-16-245-249:bin hf36$ python aco-1-xml2mrc-oclc-nums.py

Traceback (most recent call last):

  File "aco-1-xml2mrc-oclc-nums.py", line 9, in <module>

    import pymarc

  File "/Library/Python/2.7/site-packages/pymarc/__init__.py", line 62, in <module>

    from .record import *

  File "/Library/Python/2.7/site-packages/pymarc/record.py", line 31, in <module>

    @six.python_2_unicode_compatible

AttributeError: 'module' object has no attribute 'python_2_unicode_compatible'

I assume I have something out of date, but I have no idea how to troubleshoot this error... (I'm not that well versed in installation issues.)

UnicodeEncodeError occurring in 3.0.4 but not 2.9.2?

I’ve run into a problem that appears to be related to upgrading pymarc from version 2.9.2 to version 3.0.4. I have Python version 2.7.5 and version 2.0.4 of PyZ3950 installed on CentOS Linux 7. The error message is:

Traceback (most recent call last):
  File "./recordtest.py", line 15, in <module>
    print "marc_record:[%s]" % marc_record
  File "/path_to/lib/python/site-packages/pymarc/record.py", line 84, in __str__
    text_list.extend([str(field) for field in self.fields])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1024: ordinal not in range(128)

Below is a short test script I used to replicate the problem (works with pymarc 2.9.2 but not 3.0.4):

-BEGIN- test script

#!/_path_to_python/python

from PyZ3950 import zoom
from pymarc import Record

connection = zoom.Connection ('z3950.loc.gov', 7090)
connection.databaseName = 'VOYAGER'
connection.preferredRecordSyntax = 'USMARC'
query = zoom.Query ('CCL', 'isbn=9780415782654')
results = connection.search (query)
for result in results:
    print "result:[%s]" % result
    print
    marc_record = Record(data=result.data)
    print "marc_record:[%s]" % marc_record

connection.close ()

-END- test script

The record I am using to test can be found here:
http://lccn.loc.gov/2011052495

It could very well be that I missed a flag or parameter that is required in pymarc 3.x, but I did not see anything in the documentation.

Thoughts or suggestions?

documentation bug field.del_subfield vs field.delete_subfield

Documentation doesn't match code, creating confusion for programmers:

def delete_subfield(self, code):
    """
    Deletes the first subfield with the specified 'code' and returns 
    its value:

        field.del_subfield('a')

    If no subfield is found with the specified code None is returned.
    """

handling MARC21/ANSEL encoded records with wrong leader[9]

If I have a MARC record where leader[9] == 'a' but that contains ANSEL-encoded data, which is read with a MARCReader where to_unicode=False, then pymarc fails to write the record because the leader[9] value forces UTF-8 encoding, which blows up for ANSEL strings. See mbklein@ff31286

(Of course, if the record is read with to_unicode=True, it'll blow up during reading because the record can't be UTF-8-decoded.)

It seems to me a better strategy might be to remember in the record whether it should be UTF-8-encoded on write, rather than relying on the (possibly erroneous) leader[9] field. In short, code like this shouldn't fail even with defective/misencoded records:

for record in marcreader:
    if somecondition:
        marcwriter.write(record)

ps:

On a related, but separate note, it seems to me that pymarc would double-encode a record that has leader[9] == 'a' on output unless the record was read with to_unicode=True.

How to analyze the bad.dat

I have a file that needs to be parsed, with a structure like test/bad.dat, but I didn't find a demo for this in your code.

Backslash is printed in MARC file if the indicator is a space

I have discovered that if the field object detects that an indicator is a space, it will be converted to '\\':

        for indicator in self.indicators:
            if indicator in (' ','\\'):
                text += '\\'
            else:
                text += '%s' % indicator

The "\" will be printed in the MARC file; other libraries, such as Perl's MARC::Record, will treat it as an invalid indicator.
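The alternative behaviour the report asks for would normalize a blank (or backslash-escaped) indicator to a plain space on output. A minimal sketch, with a hypothetical helper name rather than pymarc's actual code:

```python
def format_indicator_as_marc(indicator):
    """Sketch of the requested behaviour: emit a blank indicator as a plain
    space in transmission output instead of the '\\' placeholder, which
    some consumers reject. (Hypothetical helper.)"""
    return " " if indicator in (" ", "\\") else indicator
```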

Corrupt / Incompatible data: couldn't find 0x20 in g0=83 g1=69

I was getting the error: couldn't find 0x20 in g0=83 g1=69

That error eventually, led me down the right path, to the fact that the data was corrupt / was in the wrong encoding type. It should have shown Pi Squared, but instead it showed some weird characters.

An error that says something about unable to parse characters would have been helpful. I get that may be what it's saying, just in a more esoteric form, but it would be nice if it said something more like:
Unable to parse character 0x20 in g0=83 g1=69.

Field object has no attribute subfields

For a 001 field, which doesn't have any subfields, trying to fetch them raises an exception rather than returning an empty list.

f.get_subfields()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/pymarc-3.0.1-py2.6.egg/pymarc/field.py", line 155, in get_subfields
    for subfield in self:
  File "/usr/lib/python2.6/site-packages/six-1.7.3-py2.6.egg/six.py", line 518, in next
    return type(self).__next__(self)
  File "/usr/lib/python2.6/site-packages/pymarc-3.0.1-py2.6.egg/pymarc/field.py", line 126, in __next__
    while self.__pos < len(self.subfields):
AttributeError: 'Field' object has no attribute 'subfields'

Unable to work with real UTF-8 MARC packets

Here is the error I got:

[2010-10-25 14:17:46,099][test_db5] ERROR:web-services:[24]: 'raw_data': record.as_marc21(),
[2010-10-25 14:17:46,100][test_db5] ERROR:web-services:[25]: File "build/bdist.linux-i686/egg/pymarc/record.py", line 221, in as_marc
[2010-10-25 14:17:46,100][test_db5] ERROR:web-services:[26]: field_data = field_data.encode('utf-8')

[2010-10-25 14:17:46,100][test_db5] ERROR:web-services:[27]: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 13: ordinal not in range(128)

I would attach a file with the real MARC packet that triggers the error, but I do not know how to do that.

Basically the problem is that the subfields are already in UTF-8, yet the as_marc method of the Field class returns strings with raw UTF-8 bytes inside them, which then fail when encoded again.

Py3: can't print record with raw fields

pymarc.field.Field.__str__() uses native str as the arguments to self.data.replace, but in a RawField, data is a bytes object, so we get TypeError: 'str' does not support the buffer interface

RawField should probably have its own implementation of __str__, although it's unclear to me how we should represent a field with an unknown encoding as text...

MARCMaker exception

If you try to load a MARC record in MARCMaker format you get an obscure exception:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    for record in reader:
  File "/usr/local/lib/python2.7/site-packages/six.py", line 558, in next
    return type(self).__next__(self)
  File "/Users/ed/Projects/pymarc/pymarc/reader.py", line 90, in __next__
    length = int(first5)
ValueError: invalid literal for int() with base 10: '=LDR '

Something a bit more descriptive would be useful! pymarc does not support reading records in this format.

Add "code4lib" topic?

In the interest of collecting relevant projects to code4lib, could that be added to the repository as a topic? It'll help people discover software during their GitHub searches.

Add subfields at an arbitrary position

from the mailing list:

But maybe it would be useful if the Field.add_subfield method took an optional 0-based position that defaults to the end of the field, to preserve existing behavior? For example:

field = record['245']
field.add_subfield('a', 'Middlemarch :', pos=0)

This shouldn't be difficult to implement, and seems a lot nicer than manipulating the internal heterogeneous list of code, data, code, data manually.
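The positional insert against that flat [code, data, code, data, ...] list is indeed a one-liner. A sketch with a hypothetical helper name (the real change would go inside Field.add_subfield):

```python
def add_subfield_at(subfields, code, value, pos=None):
    """Sketch of the proposal against the flat [code, data, code, data, ...]
    list: pos is a 0-based subfield position defaulting to the end.
    (Hypothetical helper, not pymarc API.)"""
    if pos is None:
        subfields.extend([code, value])
    else:
        # each subfield occupies two slots, so scale the position by 2
        subfields[pos * 2:pos * 2] = [code, value]
    return subfields
```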

Make leader mutable?

It would be really convenient to be able to mutate the leader somehow.

I find myself writing record.leader = record.leader[0:9] + 'a' + record.leader[10:] when converting MARC-8 to UTF-8 to work around the leader being an immutable string, which is pretty ugly.

(On the other hand, I'm not all that sure this could be done in a way that's backwards-compatible with things that expect record.leader to be a normal string. record.leader[9] = 'a' might be unrealistic to support, but perhaps we could add a method that flips a specific part of the leader?)
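One backwards-compatible option is exactly such a method: leaders stay ordinary immutable strings, and a helper returns a copy with one position changed. The helper name here is hypothetical:

```python
def with_leader_char(leader, pos, char):
    """Sketch of a backwards-compatible helper: return a new leader string
    with one position replaced. (Hypothetical; the issue proposes adding
    something like this to Record.)"""
    return leader[:pos] + char + leader[pos + 1:]
```

So the MARC-8 to UTF-8 case becomes record.leader = with_leader_char(record.leader, 9, 'a').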

Unable to read marc21 file exported from NewGenLib

I have a file exported from NewGenLib (http://verussolutions.biz/web/content/download) software which contains around 1500 records in MARC21 format. When I try to open it using pymarc 3.0.3, the following error gets thrown:

Traceback (most recent call last):
  File "convert_marc21_records.py", line 18, in <module>
    for record in reader:
  File "/Users/alexcorbi/Developing/ODC/venv/lib/python2.7/site-packages/six.py", line 530, in next
    return type(self).__next__(self)
  File "/Users/alexcorbi/Developing/ODC/venv/lib/python2.7/site-packages/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "/Users/alexcorbi/Developing/ODC/venv/lib/python2.7/site-packages/pymarc/record.py", line 72, in __init__
    utf8_handling=utf8_handling)
  File "/Users/alexcorbi/Developing/ODC/venv/lib/python2.7/site-packages/pymarc/record.py", line 321, in decode_marc
    raise NoFieldsFound
pymarc.exceptions.NoFieldsFound: Unable to locate fields in record data

However, if I try the sample files provided with the library, they are read properly. Is there some difference between MARC21 files?

field .indicators, indicator1, indicator2 confusion

A pymarc.field.Field object has both an .indicators attribute (a list of the 2 indicators) and .indicator1 and .indicator2 attributes, which are strings initially set to the same values as those in .indicators.

Field.__str__ uses the indicators list to determine which indicators to display in the textual format, while as_marc uses the individual attributes.

The upshot is that if you want to edit the indicators in the MARC you output, you need to change the individual attributes, but then printing the record as text will make it appear you didn't change anything. Conversely, changing .indicators[1], for example, will appear to edit the indicator when printing for debugging, but when the MARC is written out it will still have the old value.
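The divergence can be modeled with a toy class (this mirrors the behavior described in the issue, not pymarc's actual source): the two attributes are copied from the list once at construction and never synced afterwards.

```python
class ToyField:
    """Toy model of the described Field behavior; not pymarc itself."""

    def __init__(self, indicators):
        self.indicators = list(indicators)
        # Copied once here; later edits to one side never reach the other.
        self.indicator1, self.indicator2 = self.indicators

    def __str__(self):
        # Textual display reads the list...
        return '245 %s%s $a ...' % (self.indicators[0], self.indicators[1])

    def as_marc(self):
        # ...while serialization reads the individual attributes.
        return self.indicator1 + self.indicator2


f = ToyField(['0', '0'])
f.indicators[1] = '4'   # looks changed when printed for debugging...
print(str(f))           # 245 04 $a ...
print(f.as_marc())      # ...but '00' is what would be written out
```

Keeping a single source of truth (e.g. making indicator1/indicator2 properties over the list) would remove the trap.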

unicode leader fields

I ran into an issue when serializing from MARCXML into MARC21 transmission format: I get a UnicodeDecodeError if any of my data is beyond the default encoding (ASCII), because there is a mix of types. This could also be an issue if someone modifies the leader through record.leader and puts in a unicode object instead of an ordinary string object. It doesn't show up in the other as_marc tests because none of them write out both unicode data and a leader.

I've got a test at: https://gist.github.com/1322248

def test_writing_unicode_leader(self):
    """Test serializing a record with a unicode leader"""
    record = Record()
    record.add_field(Field(245, ['1', '0'], ['a', unichr(0x1234)]))
    record.leader = u'         a              ' # unicode leader here.
    writer = MARCWriter(open('test/foo', 'w'))
    writer.write(record)
    writer.close()

    reader = MARCReader(open('test/foo'), to_unicode=True)
    record = reader.next()
    self.assertEqual(record['245']['a'], unichr(0x1234))

    os.remove('test/foo')

Here is my traceback:

Traceback (most recent call last):
  File "/.../marc8.py", line 69, in test_writing_unicode_leader
    writer.write(record)
  File "/...", line 40, in write
    self.file_handle.write(record.as_marc())
  File "/...", line 249, in as_marc
    return self.leader + directory + fields
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 4: ordinal not in range(128)

I can submit a patch, but my problem is I am kind of mystified by where I should be dealing with this:

  • should it be handled on ingest in marc_xml.py? I could sort of backpedal the leader into a plain string there.
  • should it be in the as_marc() serialization, where we can encode the leader if it is unicode?
  • or should it be pulled into the larger unicode issue that some pymarc conversation has talked about: adding unicode support to the field and record objects.
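Of those options, the second is the most localized. In modern (Python 3) terms the equivalent fix is to normalize the leader to bytes before concatenation; a hedged sketch of that idea (the function name is mine, not pymarc's):

```python
def leader_as_bytes(leader):
    """Normalize a leader that may be str or bytes to bytes for concatenation."""
    if isinstance(leader, str):
        return leader.encode('utf-8')
    return leader


# Mixing the two types no longer raises once both sides are bytes:
directory_and_fields = b'...'   # placeholder for the rest of the record
record_bytes = leader_as_bytes('         a              ') + directory_and_fields
```

The same guard applied inside as_marc() would make the serialization tolerant regardless of how the leader was set.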

Move `test` into `pymarc`

Then there will not be a test folder directly in site-packages; it will appear in site-packages/pymarc/test instead.

What you need to do is:
git mv test pymarc/
sed -i "s/'test\//'pymarc\/test\//g" pymarc/test/*.py
sed -i "s/'test'/'pymarc\.test'/" setup.py

Trouble reading Harvard Open Metadata MARC files (UTF-8 related?)

I am trying to use pymarc to read the Harvard Open Metadata MARC files.

Most of the files process ok but some (for example ab.bib.14.20160401.full.mrc) produce errors when processing. The error I am getting is:

Traceback (most recent call last):
  File "domark.py", line 21, in <module>
    for record in reader:
  File "/Library/Python/2.7/site-packages/six.py", line 535, in next
    return type(self).__next__(self)
  File "/Users/markwatkins/Sites/pharvard/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 307, in decode_marc
    code = subfield[0:1].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

The driver code I am using is:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
from pymarc import MARCReader

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)

if len(sys.argv) >= 2:
    files = [sys.argv[1]]

for file in files:
    with open(file, 'rb') as fh:
        reader = MARCReader(fh, utf8_handling='ignore')
        for record in reader:
#            print "%s by %s" % (record.title(), record.author())
            print(record.as_json())

Other MARC processing tools (e.g. MarcEdit) seem to process the file with no issues, so I think the file is legitimate.

Am I doing something wrong? Is there an issue with pymarc, possibly UTF-8 processing related?
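The traceback points at a strict ASCII decode of the subfield code (code = subfield[0:1].decode('ascii') in record.py), which runs regardless of the utf8_handling setting passed to MARCReader. A tolerant decode, shown here as an illustrative snippet rather than an actual pymarc option, would avoid the crash by substituting U+FFFD for the bad byte:

```python
subfield = b'\xc4ome data'   # first byte is 0xc4, not valid ASCII, as in the error
try:
    code = subfield[0:1].decode('ascii')  # the strict decode that raises
except UnicodeDecodeError:
    # Tolerant variant: keep going, marking the bad code as U+FFFD
    code = subfield[0:1].decode('ascii', errors='replace')
```

Whether the subfield code should instead be decoded per the record's stated encoding is a design question; this only demonstrates why utf8_handling='ignore' does not help here.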

Backwards compatibility change between 2.8.9 and 2.9.0

I noticed some tests failed with a very simple test case between PyMARC 2.8.9 and 2.9.0:

with open(sys.argv[1], 'rb') as marc_file:
    reader = MARCReader(marc_file)
    for record in reader:
        print record['752']['a'], record['752']['b'], record['752']['d']

Using this data file: https://github.com/loc-rdc/wdl/blob/master/importers/tests/data/LOT10340.mrc the output looks like this:

2.9+:

None None None
None None None
None None None
None None None
None None None
None None None
None None None
None None None
None None None
None None None
None None None
Russian Federation None None

2.8.9:

Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation None None
Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation Kostroma Oblast Kostroma
Russian Federation None None
