Commit 0b6f6c8

#12586: add provisional email policy with new header parsing and folding.
When the new policies are used (and only when the new policies are explicitly used), headers turn into objects that have attributes based on their parsed values, and can be set using objects that encapsulate the values, as well as set directly from unicode strings. The folding algorithm then takes care of encoding unicode where needed, and folding according to the highest level syntactic objects. With this patch only date and address headers are parsed as anything other than unstructured, but that is all the helper methods in the existing API handle. I do plan to add more parsers, and complete the set specified in the RFC before the package becomes stable.
1 parent 0fa2edd commit 0b6f6c8

16 files changed, with 6994 additions and 116 deletions.

Doc/library/email.policy.rst

Lines changed: 323 additions & 0 deletions
@@ -306,3 +306,326 @@ added matters. To illustrate::
306306
``7bit``, non-ascii binary data is CTE encoded using the ``unknown-8bit``
307307
charset. Otherwise the original source header is used, with its existing
308308
line breaks and any (RFC invalid) binary data it may contain.
309+
310+
311+
.. note::
312+
313+
The remainder of the classes documented below are included in the standard
314+
library on a :term:`provisional basis <provisional package>`. Backwards
315+
incompatible changes (up to and including removal of the feature) may occur
316+
if deemed necessary by the core developers.
317+
318+
319+
.. class:: EmailPolicy(**kw)
320+
321+
This concrete :class:`Policy` provides behavior that is intended to be fully
322+
compliant with the current email RFCs. These include (but are not limited
323+
to) :rfc:`5322`, :rfc:`2047`, and the current MIME RFCs.
324+
325+
This policy adds new header parsing and folding algorithms. Instead of
326+
simple strings, headers are custom objects with custom attributes depending
327+
on the type of the field. The parsing and folding algorithms fully implement
328+
:rfc:`2047` and :rfc:`5322`.
329+
330+
In addition to the settable attributes listed above that apply to all
331+
policies, this policy adds the following additional attributes:
332+
333+
.. attribute:: refold_source
334+
335+
If the value for a header in the ``Message`` object originated from a
336+
:mod:`~email.parser` (as opposed to being set by a program), this
337+
attribute indicates whether or not a generator should refold that value
338+
when transforming the message back into stream form. The possible values
339+
are:
340+
341+
======== ===============================================================
``none`` all source values use original folding

``long`` source values that have any line that is longer than
         ``max_line_length`` will be refolded

``all``  all values are refolded.
======== ===============================================================
349+
350+
The default is ``long``.
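As a brief illustration of the setting above, the following sketch derives a policy that never refolds parsed header values. It assumes a Python release in which the provisional policy is importable as ``email.policy.default``; the name ``no_refold`` is chosen only for this example.

```python
from email import policy

# clone() returns a new policy object with the named attributes overridden;
# the original policy object is left untouched.
no_refold = policy.default.clone(refold_source="none")

print(policy.default.refold_source)  # long
print(no_refold.refold_source)       # none
```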
351+
352+
.. attribute:: header_factory
353+
354+
A callable that takes two arguments, ``name`` and ``value``, where
355+
``name`` is a header field name and ``value`` is an unfolded header field
356+
value, and returns a string-like object that represents that header. A
357+
default ``header_factory`` is provided that understands some of the
358+
:rfc:`5322` header field types. (Currently address fields and date
359+
fields have special treatment, while all other fields are treated as
360+
unstructured. This list will be completed before the extension is marked
361+
stable.)
362+
363+
The class provides the following concrete implementations of the abstract
364+
methods of :class:`Policy`:
365+
366+
.. method:: header_source_parse(sourcelines)
367+
368+
The implementation of this method is the same as that for the
369+
:class:`Compat32` policy.
370+
371+
.. method:: header_store_parse(name, value)
372+
373+
The name is returned unchanged. If the input value has a ``name``
374+
attribute and it matches *name* ignoring case, the value is returned
375+
unchanged. Otherwise the *name* and *value* are passed to
376+
``header_factory``, and the resulting custom header object is returned as
377+
the value. In this case a ``ValueError`` is raised if the input value
378+
contains CR or LF characters.
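The storing behavior described above can be observed roughly as follows. This is a sketch that assumes ``email.policy.default`` is an ``EmailPolicy`` instance and that ``Message`` accepts a ``policy`` keyword; the header names and values are made up for the example.

```python
from email import policy
from email.message import Message

msg = Message(policy=policy.default)

# A plain string is handed to header_factory and stored as a header object.
msg["Subject"] = "Quarterly report"
subject = msg["Subject"]
print(subject.name)  # Subject

# A value containing CR or LF characters is rejected when it is stored.
try:
    msg["X-Note"] = "line one\nline two"
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```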
379+
380+
.. method:: header_fetch_parse(name, value)
381+
382+
If the value has a ``name`` attribute, it is returned unmodified.
383+
Otherwise the *name*, and the *value* with any CR or LF characters
384+
removed, are passed to the ``header_factory``, and the resulting custom
385+
header object is returned. Any surrogateescaped bytes get turned into
386+
the unicode unknown-character glyph.
387+
388+
.. method:: fold(name, value)
389+
390+
Header folding is controlled by the :attr:`refold_source` policy setting.
391+
A value is considered to be a 'source value' if and only if it does not
392+
have a ``name`` attribute (having a ``name`` attribute means it is a
393+
header object of some sort). If a source value needs to be refolded
394+
according to the policy, it is converted into a custom header object by
395+
passing the *name* and the *value* with any CR and LF characters removed
396+
to the ``header_factory``. Folding of a custom header object is done by
397+
calling its ``fold`` method with the current policy.
398+
399+
Source values are split into lines using :meth:`~str.splitlines`. If
400+
the value is not to be refolded, the lines are rejoined using the
401+
``linesep`` from the policy and returned. The exception is lines
402+
containing non-ascii binary data. In that case the value is refolded
403+
regardless of the ``refold_source`` setting, which causes the binary data
404+
to be CTE encoded using the ``unknown-8bit`` charset.
405+
406+
.. method:: fold_binary(name, value)
407+
408+
The same as :meth:`fold` if :attr:`cte_type` is ``7bit``, except that
409+
the returned value is bytes.
410+
411+
If :attr:`cte_type` is ``8bit``, non-ASCII binary data is converted back
412+
into bytes. Headers with binary data are not refolded, regardless of the
413+
``refold_source`` setting, since there is no way to know whether the
414+
binary data consists of single byte characters or multibyte characters.
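To make the folding methods more concrete, here is a small sketch that folds the same long header under two policies. It assumes the ``SMTP`` and ``HTTP`` instances described in this document are available in ``email.policy``; the header text is invented, and the exact positions of the line breaks are left to the library.

```python
from email import policy

text = " ".join(["word"] * 40)  # roughly 200 ASCII characters

# Build header objects through each policy's header_factory, then fold them.
smtp_header = policy.SMTP.header_factory("Subject", text)
smtp_folded = policy.SMTP.fold("Subject", smtp_header)

http_header = policy.HTTP.header_factory("Subject", text)
http_folded = policy.HTTP.fold("Subject", http_header)

print(smtp_folded)                    # several lines, continuation lines indented
print(len(http_folded.splitlines()))  # a single line, since the length is unlimited

# fold_binary performs the same folding but returns bytes.
as_bytes = policy.SMTP.fold_binary("Subject", smtp_header)
print(type(as_bytes).__name__)
```

The ``SMTP`` result is wrapped to fit ``max_line_length``, while the ``HTTP`` result stays on one line.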
415+
416+
The following instances of :class:`EmailPolicy` provide defaults suitable for
417+
specific application domains. Note that in the future the behavior of these
418+
instances (in particular the ``HTTP`` instance) may be adjusted to conform even
419+
more closely to the RFCs relevant to their domains.
420+
421+
.. data:: default
422+
423+
An instance of ``EmailPolicy`` with all defaults unchanged. This policy
424+
uses the standard Python ``\n`` line endings rather than the RFC-correct
425+
``\r\n``.
426+
427+
.. data:: SMTP
428+
429+
Suitable for serializing messages in conformance with the email RFCs.
430+
Like ``default``, but with ``linesep`` set to ``\r\n``, which is RFC
431+
compliant.
432+
433+
.. data:: HTTP
434+
435+
Suitable for serializing headers for use in HTTP traffic. Like
436+
``SMTP`` except that ``max_line_length`` is set to ``None`` (unlimited).
437+
438+
.. data:: strict
439+
440+
Convenience instance. The same as ``default`` except that
441+
``raise_on_defect`` is set to ``True``. This allows any policy to be made
442+
strict by writing::
443+
444+
somepolicy + policy.strict
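The differences between these instances, and the effect of adding ``strict`` to another policy, can be checked with a short sketch like the one below. It assumes the four instances are importable from ``email.policy`` under the names used above; ``strict_smtp`` is a name introduced for the example.

```python
from email import policy

print(repr(policy.default.linesep))  # '\n'
print(repr(policy.SMTP.linesep))     # '\r\n'
print(policy.HTTP.max_line_length)   # None

# Adding policies yields a new policy in which the explicitly set
# attributes of the right-hand operand override those of the left-hand one.
strict_smtp = policy.SMTP + policy.strict
print(repr(strict_smtp.linesep))     # '\r\n'
print(strict_smtp.raise_on_defect)   # True
```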
445+
446+
With all of these :class:`EmailPolicies <.EmailPolicy>`, the effective API of
447+
the email package is changed from the Python 3.2 API in the following ways:
448+
449+
* Setting a header on a :class:`~email.message.Message` results in that
450+
header being parsed and a custom header object created.
451+
452+
* Fetching a header value from a :class:`~email.message.Message` results
453+
in that header being parsed and a custom header object created and
454+
returned.
455+
456+
* Any custom header object, or any header that is refolded due to the
457+
policy settings, is folded using an algorithm that fully implements the
458+
RFC folding algorithms, including knowing where encoded words are required
459+
and allowed.
460+
461+
From the application view, this means that any header obtained through the
462+
:class:`~email.message.Message` is a custom header object with custom
463+
attributes, whose string value is the fully decoded unicode value of the
464+
header. Likewise, a header may be assigned a new value, or a new header
465+
created, using a unicode string, and the policy will take care of converting
466+
the unicode string into the correct RFC encoded form.
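As a rough illustration of the application view described above, the sketch below assigns a unicode string to a header and then serializes the message. It assumes ``email.policy.default`` and the ``policy`` keyword of ``Message``; the subject text is invented, and the exact encoded-word form in the output is chosen by the library.

```python
from email import policy
from email.message import Message

msg = Message(policy=policy.default)
msg["Subject"] = "Café menu"

# The string value of the header object is the decoded unicode text.
decoded = str(msg["Subject"])
print(decoded)

# When the message is serialized, the policy applies RFC 2047 encoding,
# so the output contains only ASCII characters.
wire = msg.as_string()
print(wire)
```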
467+
468+
The custom header objects and their attributes are described below. All custom
469+
header objects are string subclasses, and their string value is the fully
470+
decoded value of the header field (the part of the field after the ``:``).
471+
472+
473+
.. class:: BaseHeader
474+
475+
This is the base class for all custom header objects. It provides the
476+
following attributes:
477+
478+
.. attribute:: name
479+
480+
The header field name (the portion of the field before the ':').
481+
482+
.. attribute:: defects
483+
484+
A possibly empty list of :class:`~email.errors.MessageDefect` objects
485+
that record any RFC violations found while parsing the header field.
486+
487+
.. method:: fold(*, policy)
488+
489+
Return a string containing :attr:`~email.policy.Policy.linesep`
490+
characters as required to correctly fold the header according
491+
to *policy*. A :attr:`~email.policy.Policy.cte_type` of
492+
``8bit`` will be treated as if it were ``7bit``, since strings
493+
may not contain binary data.
494+
495+
496+
.. class:: UnstructuredHeader
497+
498+
The class used for any header that does not have a more specific
499+
type. (The :mailheader:`Subject` header is an example of an
500+
unstructured header.) It does not have any additional attributes.
501+
502+
503+
.. class:: DateHeader
504+
505+
The value of this type of header is a single date and time value. The
506+
primary example of this type of header is the :mailheader:`Date` header.
507+
508+
.. attribute:: datetime
509+
510+
A :class:`~datetime.datetime` encoding the date and time from the
511+
header value.
512+
513+
The ``datetime`` will be a naive ``datetime`` if the value either does
514+
not have a specified timezone (which would be a violation of the RFC) or
515+
if the timezone is specified as ``-0000``. This timezone value indicates
516+
that the date and time is to be considered to be in UTC, but with no
517+
indication of the local timezone in which it was generated. (This
518+
contrasts with ``+0000``, which indicates a date and time that really is in
519+
the UTC ``0000`` timezone.)
520+
521+
If the header value contains a valid timezone that is not ``-0000``, the
522+
``datetime`` will be an aware ``datetime`` having a
523+
:class:`~datetime.tzinfo` set to the :class:`~datetime.timezone`
524+
indicated by the header value.
525+
526+
A ``datetime`` may also be assigned to a :mailheader:`Date` type header.
527+
The resulting string value will use a timezone of ``-0000`` if the
528+
``datetime`` is naive, and the appropriate UTC offset if the ``datetime`` is
529+
aware.
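The timezone rules above can be sketched as follows. The example calls the default policy's ``header_factory`` directly so that each value gets its own header object; it assumes ``email.policy.default`` is available, and the dates are arbitrary.

```python
import datetime
from email import policy

make_header = policy.default.header_factory

# A timezone of -0000 produces a naive datetime.
naive = make_header("Date", "Fri, 09 Nov 2001 01:08:47 -0000").datetime
print(naive.tzinfo)  # None

# Any other valid timezone produces an aware datetime.
aware = make_header("Date", "Fri, 09 Nov 2001 01:08:47 +0100").datetime
print(aware.utcoffset())  # 1:00:00

# A naive datetime may be used as the value; it is rendered with -0000.
from_naive = make_header("Date", datetime.datetime(2001, 11, 9, 1, 8, 47))
print(str(from_naive))  # Fri, 09 Nov 2001 01:08:47 -0000
```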
530+
531+
532+
.. class:: AddressHeader
533+
534+
This class is used for all headers that can contain addresses, whether they
535+
are supposed to be singleton addresses or a list.
536+
537+
.. attribute:: addresses
538+
539+
A list of :class:`.Address` objects listing all of the addresses that
540+
could be parsed out of the field value.
541+
542+
.. attribute:: groups
543+
544+
A list of :class:`.Group` objects. Every address in :attr:`.addresses`
545+
appears in one of the group objects in the list. Addresses that are not
546+
syntactically part of a group are represented by ``Group`` objects whose
547+
``display_name`` is ``None``.
548+
549+
In addition to addresses in string form, any combination of
550+
:class:`.Address` and :class:`.Group` objects, singly or in a list, may be
551+
assigned to an address header.
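A short sketch of how the two attributes relate is given below. It assumes a Python release in which ``Address`` is importable from ``email.headerregistry`` (the import location in the module layout of this commit may differ); the addresses themselves are made up.

```python
from email import policy
from email.headerregistry import Address

make_header = policy.default.header_factory

to = make_header(
    "To",
    "Team: Ann <ann@example.com>, bob@example.com;, Carol <carol@example.org>",
)

# addresses flattens every parsed address, whether or not it is in a group.
print([a.addr_spec for a in to.addresses])

# groups keeps the grouping; the ungrouped address sits in a Group
# whose display_name is None.
print([g.display_name for g in to.groups])  # ['Team', None]

# An Address object can be used as the value instead of a string.
cc = make_header("Cc", Address("Ann", "ann", "example.com"))
print(str(cc))  # Ann <ann@example.com>
```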
552+
553+
554+
.. class:: Address(display_name='', username='', domain='', addr_spec=None)
555+
556+
The class used to represent an email address. The general form of an
557+
address is::
558+
559+
[display_name] <username@domain>
560+
561+
or::
562+
563+
username@domain
564+
565+
where each part must conform to specific syntax rules spelled out in
566+
:rfc:`5322`.
567+
568+
As a convenience *addr_spec* can be specified instead of *username* and
569+
*domain*, in which case *username* and *domain* will be parsed from the
570+
*addr_spec*. An *addr_spec* must be a properly RFC quoted string; if it is
571+
not, ``Address`` will raise an error. Unicode characters are allowed and
572+
will be properly encoded when serialized. However, per the RFCs, unicode is
573+
*not* allowed in the username portion of the address.
574+
575+
.. attribute:: display_name
576+
577+
The display name portion of the address, if any, with all quoting
578+
removed. If the address does not have a display name, this attribute
579+
will be an empty string.
580+
581+
.. attribute:: username
582+
583+
The ``username`` portion of the address, with all quoting removed.
584+
585+
.. attribute:: domain
586+
587+
The ``domain`` portion of the address.
588+
589+
.. attribute:: addr_spec
590+
591+
The ``username@domain`` portion of the address, correctly quoted
592+
for use as a bare address (the second form shown above). This
593+
attribute is not mutable.
594+
595+
.. method:: __str__()
596+
597+
The ``str`` value of the object is the address quoted according to
598+
:rfc:`5322` rules, but with no Content Transfer Encoding of any non-ASCII
599+
characters.
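The two ways of constructing an ``Address`` can be sketched as below. The import from ``email.headerregistry`` is an assumption about where the class lives in a released Python version; the names and addresses are invented.

```python
from email.headerregistry import Address

# Built from the separate parts.
ann = Address("Ann Example", "ann", "example.com")
print(ann.addr_spec)  # ann@example.com
print(str(ann))       # Ann Example <ann@example.com>

# Built from an addr_spec; username and domain are parsed out of it.
bob = Address(addr_spec="bob@example.com")
print(bob.username, bob.domain)  # bob example.com
print(str(bob))                  # bob@example.com

# A display name containing special characters is quoted in the str value.
quoted = Address("Example, Ann", "ann", "example.com")
print(str(quoted))  # "Example, Ann" <ann@example.com>
```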
600+
601+
602+
.. class:: Group(display_name=None, addresses=None)
603+
604+
The class used to represent an address group. The general form of an
605+
address group is::
606+
607+
display_name: [address-list];
608+
609+
As a convenience for processing lists of addresses that consist of a mixture
610+
of groups and single addresses, a ``Group`` may also be used to represent
611+
single addresses that are not part of a group by setting *display_name* to
612+
``None`` and providing a list of the single address as *addresses*.
613+
614+
.. attribute:: display_name
615+
616+
The ``display_name`` of the group. If it is ``None`` and there is
617+
exactly one ``Address`` in ``addresses``, then the ``Group`` represents a
618+
single address that is not in a group.
619+
620+
.. attribute:: addresses
621+
622+
A possibly empty tuple of :class:`.Address` objects representing the
623+
addresses in the group.
624+
625+
.. method:: __str__()
626+
627+
The ``str`` value of a ``Group`` is formatted according to :rfc:`5322`,
628+
but with no Content Transfer Encoding of any non-ASCII characters. If
629+
``display_name`` is ``None`` and there is a single ``Address`` in the
630+
``addresses`` list, the ``str`` value will be the same as the ``str`` of
631+
that single ``Address``.
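The string forms described above can be checked with a sketch like the following. As before, the import from ``email.headerregistry`` is an assumption about a released Python version, and the group and addresses are invented.

```python
from email.headerregistry import Address, Group

ann = Address("Ann", "ann", "example.com")
bob = Address(addr_spec="bob@example.com")

# A named group is rendered as "display_name: address-list;".
team = Group("Team", [ann, bob])
print(str(team))  # Team: Ann <ann@example.com>, bob@example.com;

# A Group without a display name that wraps one Address is rendered
# exactly like that Address.
single = Group(None, [ann])
print(str(single))  # Ann <ann@example.com>
```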
