Commit 5bbec39

gh-67022: Document bytes/str inconsistency in email.header.decode_header()
This function's possible return types have been surprising and error-prone for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]`, of length 1 or more, or
2. `typing.List[typing.Tuple[str, None]]`, of length exactly 1.

This means that any user of this function must be prepared to accept either `bytes` or `str` for the first member of the 2-tuples it returns. That is very surprising behavior in Python 3.x, particularly given that the second member of the tuple is supposed to represent the charset/encoding of the first member.

This patch documents the behavior of this function and adds test cases to demonstrate it. As discussed in bpo-22833, this cannot be changed in a backwards-compatible way, and some users of this function depend precisely on the existing behavior.
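A minimal sketch of how a caller might cope with both return shapes, assuming the goal is a single `str`. The helper name `header_to_text` and the ASCII fallback for chunks whose charset is `None` are illustrative assumptions, not part of this patch:

```python
from email.header import decode_header


def header_to_text(value, fallback_charset="ascii"):
    """Join the chunks from decode_header() into one str.

    Hypothetical helper: it accepts both the [(str, None)] shape and the
    [(bytes, charset), ...] shape described in the commit message.
    """
    pieces = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded-word case: decode with the declared charset, or with
            # the assumed fallback when charset is None.
            pieces.append(chunk.decode(charset or fallback_charset, errors="replace"))
        else:
            # Header without encoded words: the chunk is already a str.
            pieces.append(chunk)
    return "".join(pieces)


print(header_to_text("unencoded_string"))
print(header_to_text("=?iso-8859-1?q?p=F6stal?="))
```

A header that declares a charset name Python does not know would still raise `LookupError` from `bytes.decode()`, so real callers may want to guard against that as well.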
1 parent 742d461 commit 5bbec39

4 files changed: +38 -10 lines changed


Doc/library/email.header.rst

Lines changed: 18 additions & 6 deletions
@@ -173,21 +173,34 @@ Here is the :class:`Header` class description:
 The :mod:`email.header` module also provides the following convenient functions.
 
 
+
 .. function:: decode_header(header)
 
    Decode a message header value without converting the character set. The header
    value is in *header*.
 
-   This function returns a list of ``(decoded_string, charset)`` pairs containing
-   each of the decoded parts of the header. *charset* is ``None`` for non-encoded
-   parts of the header, otherwise a lower case string containing the name of the
-   character set specified in the encoded string.
+   For historical reasons, this function may return either:
+
+   1. A list of pairs containing each of the decoded parts of the header,
+      ``(decoded_bytes, charset)``, where *decoded_bytes* is always an instance of
+      :class:`bytes`, and *charset* is either:
+      - A lower case string containing the name of the character set specified.
+      - ``None`` for non-encoded parts of the header.
+   2. A list of length 1 containing a pair ``(string, None)``, where
+      *string* is always an instance of :class:`str`.
 
-   Here's an example::
+   An :exc:`email.errors.HeaderParseError` may be raised when
+   certain decoding errors occur (e.g. a base64 decoding exception).
+
+   Here are examples:
 
       >>> from email.header import decode_header
       >>> decode_header('=?iso-8859-1?q?p=F6stal?=')
       [(b'p\xf6stal', 'iso-8859-1')]
+      >>> decode_header('unencoded_string')
+      [('unencoded_string', None)]
+      >>> decode_header('bar =?utf-8?B?ZsOzbw==?=')
+      [(b'bar ', None), (b'f\xc3\xb3o', 'utf-8')]
 
 
 .. function:: make_header(decoded_seq, maxlinelen=None, header_name=None, continuation_ws=' ')
@@ -202,4 +215,3 @@ The :mod:`email.header` module also provides the following convenient functions.
    This function takes one of those sequence of pairs and returns a
    :class:`Header` instance. Optional *maxlinelen*, *header_name*, and
    *continuation_ws* are as in the :class:`Header` constructor.
-
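
The documented `HeaderParseError` case could be handled along these lines. This is only a sketch: the wrapper name `try_decode_header` and its `(chunks, error)` return convention are hypothetical, and no particular malformed input is claimed to trigger the exception.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def try_decode_header(value):
    """Return (chunks, None) on success, or (None, error) on a parse failure.

    Hypothetical wrapper, not part of the email package.
    """
    try:
        return decode_header(value), None
    except HeaderParseError as exc:
        # Per the documentation change, raised for certain decoding errors
        # such as a base64 decoding exception.
        return None, exc


chunks, error = try_decode_header("bar =?utf-8?B?ZsOzbw==?=")
if error is None:
    for chunk, charset in chunks:
        # The first member is bytes here because the header has an encoded word.
        print(type(chunk).__name__, charset)
```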

Lib/email/header.py

Lines changed: 7 additions & 4 deletions
@@ -61,10 +61,13 @@
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
-    parts of the header. Charset is None for non-encoded parts of the header,
-    otherwise a lower-case string containing the name of the character set
-    specified in the encoded string.
+    For historical reasons, this function may return either:
+
+    1. A list of length 1 containing a pair (str, None).
+    2. A list of (bytes, charset) pairs containing each of the decoded
+       parts of the header. Charset is None for non-encoded parts of the header,
+       otherwise a lower-case string containing the name of the character set
+       specified in the encoded string.
 
     header may be a string that may or may not contain RFC2047 encoded words,
     or it may be a Header object.
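
The docstring also allows a `Header` object as input. A short sketch of what that looks like from the caller's side; the expectation that every chunk comes back as `bytes` paired with a charset name is an assumption drawn from reading the implementation, not something this patch states:

```python
from email.header import Header, decode_header

# Build a Header from a str with an explicit charset, then decode it again.
header = Header("p\xf6stal", "iso-8859-1")
chunks = decode_header(header)

for chunk, charset in chunks:
    # Assumption: for Header inputs the first member is bytes and the
    # second member is the charset name as a str.
    print(type(chunk).__name__, type(charset).__name__)
```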

Lib/test/test_email/test_email.py

Lines changed: 12 additions & 0 deletions
@@ -2464,6 +2464,18 @@ def test_multiline_header(self):
         self.assertEqual(str(make_header(decode_header(s))),
                          '"Müller T" <T.Mueller@xxx.com>')
 
+    def test_unencoded_ascii(self):
+        # bpo-22833/gh-67022: returns [(str, None)] rather than [(bytes, None)]
+        s = 'header without encoded words'
+        self.assertEqual(decode_header(s),
+            [('header without encoded words', None)])
+
+    def test_unencoded_utf8(self):
+        # bpo-22833/gh-67022: returns [(str, None)] rather than [(bytes, None)]
+        s = 'header with unexpected non ASCII caract\xe8res'
+        self.assertEqual(decode_header(s),
+            [('header with unexpected non ASCII caract\xe8res', None)])
+
 
 # Test the MIMEMessage class
 class TestMIMEMessage(TestEmailBase):
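
Callers who only need the readable text can sidestep the `bytes`/`str` split by passing the result straight to `make_header()`, as the existing `test_multiline_header` context above does. A minimal sketch, with sample inputs taken from the tests and documentation in this commit:

```python
from email.header import decode_header, make_header

samples = (
    "header without encoded words",   # decode_header() yields [(str, None)]
    "=?iso-8859-1?q?p=F6stal?=",      # decode_header() yields [(bytes, charset)]
)

for raw in samples:
    # make_header() accepts either shape; str() gives the decoded text.
    text = str(make_header(decode_header(raw)))
    print(repr(text))
```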
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+The inconsistent return types of :func:`email.header.decode_header` are now documented.
