
headerparser's Introduction

Project Status: Active — The project has reached a stable, usable state and is being actively developed. MIT License

Site | GitHub | Issues | Changelog

Packaged projects for the Python programming language are distributed in two main formats: sdists (archives of code and other files that require processing before they can be installed) and wheels (zipfiles of code ready for immediate installation). A project's wheel contains the complete information about what modules, files, & commands the project installs, along with information about what other projects the project depends on, but the Python Package Index (PyPI) (where wheels are distributed) doesn't expose any of this information! This is the problem that Wheelodex is here to solve.

Wheelodex scans PyPI for wheel files, analyzes them, and stores & displays the results. The site allows users to view the complete metadata inside wheels, search for wheels containing a given Python module or file, browse or search for wheels that define a given command or other entry point, and even find out projects' reverse dependencies.

Note that, in order to save disk space, Wheelodex only records data on wheels from the latest version of each PyPI project; wheels from older versions are periodically purged from the database, and projects' long descriptions are not recorded at all.

Suggestions and pull requests are welcome.

headerparser's People

Contributors

dependabot[bot], jwodder


Forkers

pombredanne

headerparser's Issues

Add support for Lists

I'm currently using this lib to parse the metadata of Debian packages and repos, which use RFC 822. There are some comma-separated lists, e.g. Tag: devel::rcs, implemented-in::c, interface::commandline, role::program. It is easy to convert this into a list, but I think it would be nice if headerparser were able to parse a list directly.
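In the meantime, a standalone splitting function (not part of headerparser's API; the name is made up here) can be passed wherever the library accepts a value-converting callable:

```python
def comma_list(value: str) -> list[str]:
    """Split a comma-separated header value into stripped, non-empty tokens."""
    return [item.strip() for item in value.split(",") if item.strip()]

tags = comma_list("devel::rcs, implemented-in::c, interface::commandline, role::program")
# tags == ['devel::rcs', 'implemented-in::c', 'interface::commandline', 'role::program']
```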

Add a new parser API built on `attrs` for defining classes instantiated from scanned stanzas

A parser will be defined via a class decorated with @parsable. Header fields will be mapped to attributes of the class, with non-trivial mappings defined via field declarations of the form fieldname: Annotation = Field(...).

  • Alternative idea: Replace Field with typing.Annotated à la Pydantic 2.0.

  • Field constructs an attr.Attribute with headerparser-specific parameters stored in the attribute metadata under a "headerparser" key

  • @parsable compiles the class's parsing metadata into a ParserSpec instance saved as a class variable, which the actual parse*() functions then use.

  • @parsable can be passed the following arguments:

    • name_decoder — what the v1 parser calls the "normalizer"; defaults to lambda s: re.sub(r'[^\w_]', "_", s.lower())
    • scanner_options: dict[str, Any]
    • **kwargs — passed to attr.define
  • Field — For defining nontrivial multiple=False fields

    • Takes the following arguments:
      • alias
      • decoder — A callable that takes a header name (str) and a value
        • For fields with aliases, this is passed the actual field name, not the alias, as that's what pydantic does with validators.
      • **kwargs — passed to attr.field
  • MultiField: For defining multiple=True fields

    • Takes the same arguments as Field, except that decoder is passed a header name and a list of values
  • ExtraFields: For defining an attribute to store additional fields with multiple=False on

    • Takes the following arguments:
      • decoder — a callable that is passed a list of (name, value) pairs with unique names
      • **kwargs — passed to attr.field
    • Extra fields are allowed in the parsed input iff this or MultiExtraFields is present
    • A class cannot have more than one ExtraFields or MultiExtraFields
  • MultiExtraFields: For defining an attribute to store additional fields with multiple=True on

    • Takes the following arguments:
      • decoder — a callable that is passed a list of (name, value) pairs in which the names need not be unique
      • **kwargs — passed to attr.field
  • BodyField: For defining the attribute on which the body will be stored

    • Takes the following arguments:
      • decoder — a callable that takes just a value
      • **kwargs — passed to attr.field
    • A body is allowed iff such a BodyField is present in the class
    • A class cannot have more than one BodyField
  • Functions:

    • parse(klass: Type[L], data: Union[Iterable[str], str, Scanner]) -> L
    • parse_stanzas(klass: Type[L], data: Union[Iterable[str], str, Scanner]) -> Iterator[L]
    • parse_stream(klass: Type[L], fields: Iterable[Tuple[Optional[str], str]]) -> L
      • There's no point in trying to merge this and parse_stanzas_stream() into the non-stream versions, as either way this function or an equivalent will be needed for the others to call
    • parse_stanzas_stream(klass: Type[L], fields: Iterable[Iterable[Tuple[str, str]]]) -> Iterator[L]
    • There is no parse_next_stanza(); to get this effect, the user should scan the stanza themselves using Scanner and pass the results to parse_stream()
      • Or should parse_next_stanza() exist but only take a Scanner?
    • make_parsable(…) — wraps attr.make_class()
    • is_parsable(Any) -> bool
    • Something (get_scanner()?) for taking a parsable and returning a Scanner initialized with its scanner options?
      • The function would also need to take the data to initialize the Scanner with — unless I give Scanner a feed() method
  • There is a ParserMixin(?) mixin class that implements equivalents of all of the parse*() functions as classmethods that get the klass from cls

  • Supply a premade set of decoders for parsing bools, timestamps, etc.?

  • Supply higher-order functions for converting single-argument functions to (name, value) decoders, converting (name, value) decoders to (name, [value]) decoders, and converting single-argument functions to (name, [value]) decoders

  • Supply one or more equivalents of attrs' pipe() et alii?

  • Add an option for just discarding all extra/unknown fields?
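For concreteness, the default name_decoder proposed above behaves like this (a direct transcription of the lambda in the list, not a committed implementation):

```python
import re

def default_name_decoder(s: str) -> str:
    # Lowercase the header name and replace every non-word character with an
    # underscore, so a header like "Content-Type" maps to a valid attribute name.
    return re.sub(r"[^\w_]", "_", s.lower())

default_name_decoder("Content-Type")   # 'content_type'
default_name_decoder("X-Spam-Score")   # 'x_spam_score'
```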

Improve documentation & examples

  • Contrast handling of multi-occurrence fields with that of the standard library
  • Draw attention to the case-insensitivity of field names when parsing and when retrieving from the dict
  • Give examples of custom normalization (or at least explain what it is and why it's worth having)
  • Add action examples
  • Add example recipes to the documentation of HeaderParser for common mail-like formats
  • Write more user-friendly documentation that goes through HeaderParser feature by feature like attrs' documentation
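On the multi-occurrence point, the contrast with the standard library's email module is easy to demonstrate: plain indexing yields only a single occurrence, and you have to know to reach for get_all():

```python
from email import message_from_string

msg = message_from_string("Received: from a\nReceived: from b\n\nbody\n")

msg["Received"]          # a single occurrence (the first, in CPython)
msg.get_all("Received")  # all occurrences: ['from a', 'from b']
```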

Support converting parsed classes back to mail-like strings

Post #52:

Give Field et alii optional encoder parameters for specifying how to stringify attribute values when dumping with dump(parsable, fp) etc. functions.

  • Should this functionality be called "dumping" or "encoding" or something else?

    • Arguably, the opposite of scanning is printing, but defining a function named "print()" isn't such a good idea.
  • encoders are callables with the following signatures:

    • For Field and MultiField: (name: str, value: Any) -> Any
    • For ExtraFields and MultiExtraFields: (value: Any) -> Sequence[tuple[str, Any]] | Mapping[str, Sequence[Any] | Any]
    • For BodyField: (value: Any) -> Any
  • Encoders must return one of the following:

    • For any field:
      • None — no value will be written
    • For Field and MultiField:
      • Sequence[Any] — will be used as multiple field values
      • Any — will be stringified to be used as the field value
    • For body fields:
      • Any — will be stringified
    • For extra fields:
      • Sequence[tuple[str, Any]]
      • Mapping[str, Sequence[Any] | Any]
  • This will require also adding a name_encoder parameter to @parsable

    • Named fields will also need some argument for specifying the spelling of their encoded name.
  • Functions for "dumping":

    • dump(parsable, fp) -> None
    • dump_stream(fields: Iterable[Tuple[Optional[str], str]], fp: TextIO) -> None
    • dump_stanzas_stream(fields: Iterable[Iterable[Tuple[str, str]]], fp: TextIO) -> None
    • dumps*() functions that return strings
  • Give the "dumping" functions keyword options for the following:

    • separator
    • folding indentation (indent)
    • auto_indent: bool = False (Rethink name) — when True, field values in which all lines after the first are already indented (i.e., folded) are not indented again
  • The string-returning dump functions should be the "core" ones that the others are implemented in terms of, as we don't want to write anything to a file until we're sure that all the return values of the encoders are valid.

  • Line wrapping fields is the caller's job (but maybe add a helper function for that?).

  • None (after serializing/encoding) field values are always skipped when dumping; if the user doesn't want that, they need to set a dumper that serializes Nones to something else.

  • Fields with aliases are dumped using the decoded aliases.
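As a rough illustration of the shape being proposed — the name dump_stream and the separator/indent options come from the lists above, but this body is entirely hypothetical:

```python
import io
from typing import Any, Iterable, Optional, TextIO, Tuple

def dump_stream(
    fields: Iterable[Tuple[Optional[str], Any]],
    fp: TextIO,
    separator: str = ": ",
    indent: str = " ",
) -> None:
    # Fields whose encoded value is None are skipped, as described above.
    for name, value in fields:
        if value is None:
            continue
        first, *rest = str(value).splitlines() or [""]
        fp.write(f"{first}\n" if name is None else f"{name}{separator}{first}\n")
        for line in rest:
            # Fold continuation lines with the given indentation
            fp.write(f"{indent}{line}\n")

buf = io.StringIO()
dump_stream([("Subject", "hello\nworld"), ("Skipped", None)], buf)
# buf.getvalue() == "Subject: hello\n world\n"
```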

Give `HeaderParser` a `dict` factory option

Give parsers a way to store parsed fields in a presupplied arbitrary mapping object (or one created from a dict_factory/dict_cls callable?) instead of creating a new NormalizedDict
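The idea, sketched with a stand-in for the real parsing logic (parse_into and its signature are hypothetical; only the dict_factory concept comes from this issue):

```python
from collections import OrderedDict
from typing import Callable, Iterable, MutableMapping, Tuple

def parse_into(
    pairs: Iterable[Tuple[str, str]],
    dict_factory: Callable[[], MutableMapping[str, str]] = dict,
) -> MutableMapping[str, str]:
    # Store each (already-normalized) field in a mapping produced by the
    # caller-supplied factory instead of always building a NormalizedDict.
    mapping = dict_factory()
    for name, value in pairs:
        mapping[name.lower()] = value
    return mapping

d = parse_into([("Subject", "hi")], dict_factory=OrderedDict)
```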

Add an entry point for converting RFC822-style files/headers to JSON

  • name: mail2json? headers2json?

  • Include options for:

    • parsing multiple stanzas into an array of JSON objects
    • setting the key name for the "message body"
    • handling of multiple occurrences of the same header in a single stanza; choices:
      • raise an error
      • combine multi-occurrence headers into an array of values
      • use an array of values for all headers regardless of multiplicity (default?)
      • output an array of {"header": ..., "value": ...} objects
    • handling of non-ASCII characters and the various ways in which they can be escaped
    • handling of "From " lines (and/or other non-header headers like the first line of an HTTP request or response?)
    • handling of header lettercases?
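A stdlib-only sketch of the core conversion, using the "array of values for all headers" choice as the default (the function name and body key are placeholders, not decided spellings):

```python
import json
from email import message_from_string

def headers2json(text: str, body_key: str = "body") -> str:
    # Parse one RFC 822-style stanza and emit JSON, collecting every header
    # into an array of values regardless of multiplicity.
    msg = message_from_string(text)
    obj: dict = {name: msg.get_all(name) for name in set(msg.keys())}
    obj[body_key] = msg.get_payload()
    return json.dumps(obj, sort_keys=True)

headers2json("Tag: a\nTag: b\nName: x\n\nhello\n")
```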

Add a field type for storing parsing defects

Post #52:

Add a DefectsField field type for collecting errors raised during parsing and decoding

  • By default, errors are stored as a dict that maps header field names to lists of exceptions

    • What should happen for scanner and body errors?
      • Idea: Don't catch scanner errors … for now
      • Idea: Store them in the dict with the key set to a SCANNING or BODY enum or token
      • Idea: Wrap all decoder errors in custom DecoderError instances
        • Subclasses:
          • FieldDecoderError(post-alias-name, value, error)
          • ExtraFieldsDecoderError(value, error)
          • BodyDecoderError(value, error)
  • Non-extra fields can now take a required: bool parameter so that lack of a required field can be caught & registered as a defect

  • Errors are stored after calling .with_traceback(None) on them and their chain of causes (__cause__) & contexts (__context__) in order to reduce memory use

  • Should defects mode be toggleable by an option when parse*() is called?
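The collection mechanics might look like this (try_decode is a hypothetical helper; only the dict-of-lists shape and the traceback stripping come from the issue):

```python
from typing import Any, Callable, Dict, List

def try_decode(
    name: str,
    value: str,
    decoder: Callable[[str], Any],
    defects: Dict[str, List[Exception]],
) -> Any:
    # Run the field's decoder; on failure, register the error as a defect
    # (stripping the traceback, as proposed, to reduce memory use).
    try:
        return decoder(value)
    except Exception as e:
        defects.setdefault(name, []).append(e.with_traceback(None))
        return None

defects: Dict[str, List[Exception]] = {}
age = try_decode("age", "not-a-number", int, defects)
# age is None; defects["age"] holds the ValueError
```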

Add a function or type for parsing Content-Type-style parameterized headers

References:
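Such a parser could be sketched as follows — a hand-rolled illustration that ignores RFC 2231 continuations and most quoting edge cases; parse_params is a hypothetical name:

```python
from typing import Dict, Tuple

def parse_params(value: str) -> Tuple[str, Dict[str, str]]:
    # Split a value like 'text/html; charset="utf-8"' into the main value and
    # a dict of parameters; quoted parameter values have their quotes stripped.
    main, *params = (part.strip() for part in value.split(";"))
    d: Dict[str, str] = {}
    for p in params:
        if not p:
            continue
        k, _, v = p.partition("=")
        d[k.strip().lower()] = v.strip().strip('"')
    return main, d

parse_params('text/html; charset="utf-8"')  # ('text/html', {'charset': 'utf-8'})
```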

Handle "From " lines

  • Give NormalizedDict a from_line attribute

  • Give the scanner a from_line_regex parameter; if the first line of a stanza matches the regex, it is assumed to be a "From" line

  • Create a "SpecialHeader" enum with FromLine and Body values for use as the first element of (header, value) pairs yielded by the scanner representing "From " lines and bodies

    • Use the enum values as keys in NormalizedDicts instead of having dedicated from_line and body attributes?
  • Give the parser an option for requiring a "From " line

  • Export premade regexes for matching Unix mail "From " lines, HTTP request lines, and HTTP response status lines
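Illustrative (and deliberately loose) versions of such premade regexes; the real exported patterns would need to be stricter:

```python
import re

# Unix mbox "From " line: "From sender date"
UNIX_FROM = re.compile(r"^From \S+ .+$")
# HTTP request line: "GET /path HTTP/1.1"
HTTP_REQUEST = re.compile(r"^[A-Z]+ \S+ HTTP/\d(?:\.\d)?$")
# HTTP response status line: "HTTP/1.1 200 OK"
HTTP_STATUS = re.compile(r"^HTTP/\d(?:\.\d)? [1-5]\d\d(?: .*)?$")

HTTP_REQUEST.match("GET /index.html HTTP/1.1")  # matches
HTTP_STATUS.match("HTTP/1.1 200 OK")            # matches
```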

Add more tests

  • Different header name normalizers (identity, hyphens=underscores, titlecase?, etc.)
  • add_additional():
    • Calling add_additional() multiple times (sometimes with allow=False)
    • add_additional(False, extra arguments ...)
    • add_additional when a header has a dest that's just a normalized form of one of its names
  • Calling add_field()/add_additional() on a HeaderParser after a previous call raised an error
  • Scanning & parsing Unicode
  • Normalizer that returns a non-string
  • Non-string keys in NormalizedDict with the default normalizer
  • Equality of HeaderParser objects
  • Passing scanner options to HeaderParser
  • Scanning files not opened in universal newlines mode
