
headerparser's Introduction

Project Status: Active — The project has reached a stable, usable state and is being actively developed. MIT License

Site | GitHub | Issues | Changelog

Packaged projects for the Python programming language are distributed in two main formats: sdists (archives of code and other files that require processing before they can be installed) and wheels (zipfiles of code ready for immediate installation). A project's wheel contains the complete information about what modules, files, & commands the project installs, along with information about what other projects the project depends on, but the Python Package Index (PyPI) (where wheels are distributed) doesn't expose any of this information! This is the problem that Wheelodex is here to solve.

Wheelodex scans PyPI for wheel files, analyzes them, and stores & displays the results. The site allows users to view the complete metadata inside wheels, search for wheels containing a given Python module or file, browse or search for wheels that define a given command or other entry point, and even find out projects' reverse dependencies.

Note that, in order to save disk space, Wheelodex only records data on wheels from the latest version of each PyPI project; wheels from older versions are periodically purged from the database, and projects' long descriptions are not recorded at all.

Suggestions and pull requests are welcome.

headerparser's People

Contributors

dependabot[bot], jwodder


Forkers

pombredanne

headerparser's Issues

Add support for Lists

I'm currently using this lib to parse the metadata of Debian packages and repos, which use RFC 822. There are some comma-separated lists, e.g. Tag: devel::rcs, implemented-in::c, interface::commandline, role::program. It is easy to convert this into a list, but I think it would be nice if headerparser were able to parse a list directly.
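In the meantime, a standalone splitting function (not part of headerparser's API; the name is made up here) can be passed wherever the library accepts a value-converting callable:

```python
def comma_list(value: str) -> list[str]:
    """Split a comma-separated header value into stripped, non-empty tokens."""
    return [item.strip() for item in value.split(",") if item.strip()]

tags = comma_list("devel::rcs, implemented-in::c, interface::commandline, role::program")
# tags == ['devel::rcs', 'implemented-in::c', 'interface::commandline', 'role::program']
```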

Add a new parser API built on `attrs` for defining classes instantiated from scanned stanzas

A parser will be defined via a class decorated with @parsable. Header fields will be mapped to attributes of the class, with non-trivial mappings defined via field declarations of the form fieldname: Annotation = Field(...).

  • Alternative idea: Replace Field with typing.Annotated à la Pydantic 2.0.

  • Field constructs an attr.Attribute with headerparser-specific parameters stored in the attribute metadata under a "headerparser" key

  • @parsable compiles the class's parsing metadata into a ParserSpec instance saved as a class variable, which the actual parse*() functions then use.

  • @parsable can be passed the following arguments:

    • name_decoder — what the v1 parser calls the "normalizer"; defaults to lambda s: re.sub(r'[^\w_]', "_", s.lower())
    • scanner_options: dict[str, Any]
    • **kwargs — passed to attr.define
  • Field — For defining nontrivial multiple=False fields

    • Takes the following arguments:
      • alias
      • decoder — A callable that takes a header name (str) and a value
        • For fields with aliases, this is passed the actual field name, not the alias, as that's what pydantic does with validators.
      • **kwargs — passed to attr.field
  • MultiField: For defining multiple=True fields

    • Takes the same arguments as Field, except that decoder is passed a header name and a list of values
  • ExtraFields: For defining an attribute to store additional fields with multiple=False on

    • Takes the following arguments:
      • decoder — a callable that is passed a list of (name, value) pairs with unique names
      • **kwargs — passed to attr.field
    • Extra fields are allowed in the parsed input iff this or MultiExtraFields is present
    • A class cannot have more than one ExtraFields or MultiExtraFields
  • MultiExtraFields: For defining an attribute to store additional fields with multiple=True on

    • Takes the following arguments:
      • decoder — a callable that is passed a list of (name, value) pairs in which the names need not be unique
      • **kwargs — passed to attr.field
  • BodyField: For defining the attribute on which the body will be stored

    • Takes the following arguments:
      • decoder — a callable that takes just a value
      • **kwargs — passed to attr.field
    • A body is allowed iff such a BodyField is present in the class
    • A class cannot have more than one BodyField
  • Functions:

    • parse(klass: Type[L], data: Union[Iterable[str], str, Scanner]) -> L
    • parse_stanzas(klass: Type[L], data: Union[Iterable[str], str, Scanner]) -> Iterator[L]
    • parse_stream(klass: Type[L], fields: Iterable[Tuple[Optional[str], str]]) -> L
      • There's no point in trying to merge this and parse_stanzas_stream() into the non-stream versions, as either way this function or an equivalent will be needed for the others to call
    • parse_stanzas_stream(klass: Type[L], fields: Iterable[Iterable[Tuple[str, str]]]) -> Iterator[L]
    • There is no parse_next_stanza(); to get this effect, the user should scan the stanza themselves using Scanner and pass the results to parse_stream()
      • Or should parse_next_stanza() exist but only take a Scanner?
    • make_parsable(…) — wraps attr.make_class()
    • is_parsable(Any) -> bool
    • Something (get_scanner()?) for taking a parsable and returning a Scanner initialized with its scanner options?
      • The function would also need to take the data to initialize the Scanner with — unless I give Scanner a feed() method
  • There is a ParserMixin(?) mixin class that implements equivalents of all of the parse*() functions as classmethods that get the klass from cls

  • Supply a premade set of decoders for parsing bools, timestamps, etc.?

  • Supply higher-order functions for converting single-argument functions to (name, value) decoders, converting (name, value) decoders to (name, [value]) decoders, and converting single-argument functions to (name, [value]) decoders

  • Supply one or more equivalents of attrs' pipe() et alii?

  • Add an option for just discarding all extra/unknown fields?
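For concreteness, the default name_decoder proposed above behaves like this (a direct transcription of the lambda in the list, not a committed implementation):

```python
import re

def default_name_decoder(s: str) -> str:
    # Lowercase the header name and replace every non-word character with an
    # underscore, so a header like "Content-Type" maps to a valid attribute name.
    return re.sub(r"[^\w_]", "_", s.lower())

default_name_decoder("Content-Type")   # 'content_type'
default_name_decoder("X-Spam-Score")   # 'x_spam_score'
```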

Improve documentation & examples

  • Contrast handling of multi-occurrence fields with that of the standard library
  • Draw attention to the case-insensitivity of field names when parsing and when retrieving from the dict
  • Give examples of custom normalization (or at least explain what it is and why it's worth having)
  • Add action examples
  • Add example recipes to the documentation of HeaderParser for common mail-like formats
  • Write more user-friendly documentation that goes through HeaderParser feature by feature like attrs' documentation
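On the multi-occurrence point, the contrast with the standard library's email module is easy to demonstrate: plain indexing yields only a single occurrence, and you have to know to reach for get_all():

```python
from email import message_from_string

msg = message_from_string("Received: from a\nReceived: from b\n\nbody\n")

msg["Received"]          # a single occurrence (the first, in CPython)
msg.get_all("Received")  # all occurrences: ['from a', 'from b']
```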

Support converting parsed classes back to mail-like strings

Post #52:

Give Field et alii optional encoder parameters for specifying how to stringify attribute values when dumping with dump(parsable, fp) etc. functions.

  • Should this functionality be called "dumping" or "encoding" or something else?

    • Arguably, the opposite of scanning is printing, but defining a function named "print()" isn't such a good idea.
  • encoders are callables with the following signatures:

    • For Field and MultiField: (name: str, value: Any) -> Any
    • For ExtraFields and MultiExtraFields: (value: Any) -> Sequence[tuple[str, Any]] | Mapping[str, Sequence[Any] | Any]
    • For BodyField: (value: Any) -> Any
  • Encoders must return one of the following:

    • For any field:
      • None — no value will be written
    • For Field and MultiField:
      • Sequence[Any] — will be used as multiple field values
      • Any — will be stringified to be used as the field value
    • For body fields:
      • Any — will be stringified
    • For extra fields:
      • Sequence[tuple[str, Any]]
      • Mapping[str, Sequence[Any] | Any]
  • This will require also adding a name_encoder parameter to @parsable

    • Named fields will also need some argument for specifying the spelling of their encoded name.
  • Functions for "dumping":

    • dump(parsable, fp) -> None
    • dump_stream(fields: Iterable[Tuple[Optional[str], str]], fp: TextIO) -> None
    • dump_stanzas_stream(fields: Iterable[Iterable[Tuple[str, str]]], fp: TextIO) -> None
    • dumps*() functions that return strings
  • Give the "dumping" functions keyword options for the following:

    • separator
    • folding indentation (indent)
    • auto_indent: bool = False (Rethink name) — when True, field values in which all lines after the first are already indented (i.e., folded) are not indented again
  • The string-returning dump functions should be the "core" ones that the others are implemented in terms of, as we don't want to write anything to a file until we're sure that all the return values of the encoders are valid.

  • Line wrapping fields is the caller's job (but maybe add a helper function for that?).

  • None (after serializing/encoding) field values are always skipped when dumping; if the user doesn't want that, they need to set a dumper that serializes Nones to something else.

  • Fields with aliases are dumped using the decoded aliases.
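As a rough illustration of the shape being proposed — the name dump_stream and the separator/indent options come from the lists above, but this body is entirely hypothetical:

```python
import io
from typing import Any, Iterable, Optional, TextIO, Tuple

def dump_stream(
    fields: Iterable[Tuple[Optional[str], Any]],
    fp: TextIO,
    separator: str = ": ",
    indent: str = " ",
) -> None:
    # Fields whose encoded value is None are skipped, as described above.
    for name, value in fields:
        if value is None:
            continue
        first, *rest = str(value).splitlines() or [""]
        fp.write(f"{first}\n" if name is None else f"{name}{separator}{first}\n")
        for line in rest:
            # Fold continuation lines with the given indentation
            fp.write(f"{indent}{line}\n")

buf = io.StringIO()
dump_stream([("Subject", "hello\nworld"), ("Skipped", None)], buf)
# buf.getvalue() == "Subject: hello\n world\n"
```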

Give `HeaderParser` a `dict` factory option

Give parsers a way to store parsed fields in a presupplied arbitrary mapping object (or one created from a dict_factory/dict_cls callable?) instead of creating a new NormalizedDict
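The idea, sketched with a stand-in for the real parsing logic (parse_into and its signature are hypothetical; only the dict_factory concept comes from this issue):

```python
from collections import OrderedDict
from typing import Callable, Iterable, MutableMapping, Tuple

def parse_into(
    pairs: Iterable[Tuple[str, str]],
    dict_factory: Callable[[], MutableMapping[str, str]] = dict,
) -> MutableMapping[str, str]:
    # Store each (already-normalized) field in a mapping produced by the
    # caller-supplied factory instead of always building a NormalizedDict.
    mapping = dict_factory()
    for name, value in pairs:
        mapping[name.lower()] = value
    return mapping

d = parse_into([("Subject", "hi")], dict_factory=OrderedDict)
```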

Add an entry point for converting RFC822-style files/headers to JSON

  • name: mail2json? headers2json?

  • Include options for:

    • parsing multiple stanzas into an array of JSON objects
    • setting the key name for the "message body"
    • handling of multiple occurrences of the same header in a single stanza; choices:
      • raise an error
      • combine multi-occurrence headers into an array of values
      • use an array of values for all headers regardless of multiplicity (default?)
      • output an array of {"header": ..., "value": ...} objects
    • handling of non-ASCII characters and the various ways in which they can be escaped
    • handling of "From " lines (and/or other non-header headers like the first line of an HTTP request or response?)
    • handling of header lettercases?
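A stdlib-only sketch of the core conversion, using the "array of values for all headers" choice as the default (the function name and body key are placeholders, not decided spellings):

```python
import json
from email import message_from_string

def headers2json(text: str, body_key: str = "body") -> str:
    # Parse one RFC 822-style stanza and emit JSON, collecting every header
    # into an array of values regardless of multiplicity.
    msg = message_from_string(text)
    obj: dict = {name: msg.get_all(name) for name in set(msg.keys())}
    obj[body_key] = msg.get_payload()
    return json.dumps(obj, sort_keys=True)

headers2json("Tag: a\nTag: b\nName: x\n\nhello\n")
```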

Add a field type for storing parsing defects

Post #52:

Add a DefectsField field type for collecting errors raised during parsing and decoding

  • By default, errors are stored as a dict that maps header field names to lists of exceptions

    • What should happen for scanner and body errors?
      • Idea: Don't catch scanner errors … for now
      • Idea: Store them in the dict with the key set to a SCANNING or BODY enum or token
      • Idea: Wrap all decoder errors in custom DecoderError instances
        • Subclasses:
          • FieldDecoderError(post-alias-name, value, error)
          • ExtraFieldsDecoderError(value, error)
          • BodyDecoderError(value, error)
  • Non-extra fields can now take a required: bool parameter so that lack of a required field can be caught & registered as a defect

  • Errors are stored after calling .with_traceback(None) on them and their chain of causes (__cause__) & contexts (__context__) in order to reduce memory use

  • Should defects mode be toggleable by an option when parse*() is called?
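The collection mechanics might look like this (try_decode is a hypothetical helper; only the dict-of-lists shape and the traceback stripping come from the issue):

```python
from typing import Any, Callable, Dict, List

def try_decode(
    name: str,
    value: str,
    decoder: Callable[[str], Any],
    defects: Dict[str, List[Exception]],
) -> Any:
    # Run the field's decoder; on failure, register the error as a defect
    # (stripping the traceback, as proposed, to reduce memory use).
    try:
        return decoder(value)
    except Exception as e:
        defects.setdefault(name, []).append(e.with_traceback(None))
        return None

defects: Dict[str, List[Exception]] = {}
age = try_decode("age", "not-a-number", int, defects)
# age is None; defects["age"] holds the ValueError
```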

Add a function or type for parsing Content-Type-style parameterized headers

References:
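Such a parser could be sketched as follows — a hand-rolled illustration that ignores RFC 2231 continuations and most quoting edge cases; parse_params is a hypothetical name:

```python
from typing import Dict, Tuple

def parse_params(value: str) -> Tuple[str, Dict[str, str]]:
    # Split a value like 'text/html; charset="utf-8"' into the main value and
    # a dict of parameters; quoted parameter values have their quotes stripped.
    main, *params = (part.strip() for part in value.split(";"))
    d: Dict[str, str] = {}
    for p in params:
        if not p:
            continue
        k, _, v = p.partition("=")
        d[k.strip().lower()] = v.strip().strip('"')
    return main, d

parse_params('text/html; charset="utf-8"')  # ('text/html', {'charset': 'utf-8'})
```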

Handle "From " lines

  • Give NormalizedDict a from_line attribute

  • Give the scanner a from_line_regex parameter; if the first line of a stanza matches the regex, it is assumed to be a "From" line

  • Create a "SpecialHeader" enum with FromLine and Body values for use as the first element of (header, value) pairs yielded by the scanner representing "From " lines and bodies

    • Use the enum values as keys in NormalizedDicts instead of having dedicated from_line and body attributes?
  • Give the parser an option for requiring a "From " line

  • Export premade regexes for matching Unix mail "From " lines, HTTP request lines, and HTTP response status lines
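Illustrative (and deliberately loose) versions of such premade regexes; the real exported patterns would need to be stricter:

```python
import re

# Unix mbox "From " line: "From sender date"
UNIX_FROM = re.compile(r"^From \S+ .+$")
# HTTP request line: "GET /path HTTP/1.1"
HTTP_REQUEST = re.compile(r"^[A-Z]+ \S+ HTTP/\d(?:\.\d)?$")
# HTTP response status line: "HTTP/1.1 200 OK"
HTTP_STATUS = re.compile(r"^HTTP/\d(?:\.\d)? [1-5]\d\d(?: .*)?$")

HTTP_REQUEST.match("GET /index.html HTTP/1.1")  # matches
HTTP_STATUS.match("HTTP/1.1 200 OK")            # matches
```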

Add more tests

  • Different header name normalizers (identity, hyphens=underscores, titlecase?, etc.)
  • add_additional():
    • Calling add_additional() multiple times (sometimes with allow=False)
    • add_additional(False, extra arguments ...)
    • add_additional when a header has a dest that's just a normalized form of one of its names
  • Calling add_field()/add_additional() on a HeaderParser after a previous call raised an error
  • Scanning & parsing Unicode
  • Normalizer that returns a non-string
  • Non-string keys in NormalizedDict with the default normalizer
  • Equality of HeaderParser objects
  • Passing scanner options to HeaderParser
  • Scanning files not opened in universal newlines mode
