Bug report
Bug description:
ElementTree.parse() fails with "not well-formed (invalid token)" instead of "unknown encoding" when the XML declares the wrong encoding name utf8, but only if the document contains non-ASCII characters. More details after the repro:
Repro:
import io, xml.etree.ElementTree as ET
s = """\
<?xml version='1.0' encoding='utf8'?>
<outline text="Comentário" />
"""
def parse(s):
    return ET.parse(io.BytesIO(s.encode()))
parse(s)
Output:
>>> parse(s)
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    parse(s)
  File "<python-input-2>", line 2, in parse
    return ET.parse(io.BytesIO(s.encode()))
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/xml/etree/ElementTree.py", line 1214, in parse
    tree.parse(source, parser)
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/xml/etree/ElementTree.py", line 577, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 21
>>>
>>> # utf-8 works fine
>>> parse(s.replace('utf8', 'utf-8')).getroot().get('text')
'Comentário'
>>>
>>> # ascii-only characters work fine, despite the wrong utf8 declared encoding
>>> parse(s.replace('á', 'a')).getroot().get('text')
'Comentario'
>>>
>>> # a truly unknown encoding fails with the correct message
>>> parse(s.replace('utf8', 'xyz'))
Traceback (most recent call last):
...
LookupError: unknown encoding: xyz
>>>
>>> # ascii encoding fails with the same message as utf8
>>> # (perhaps utf8 silently falls back to ascii?)
>>> parse(s.replace('utf8', 'ascii'))
Traceback (most recent call last):
...
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 21
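The cases above can be condensed into a single matrix of declared encoding versus attribute text. This is only a sketch for comparing Python versions: TEMPLATE and outcome() are hypothetical helper names, not part of ElementTree, and the results for utf8 and xyz may differ between versions, so no expected output is shown for them.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical template mirroring the repro document.
TEMPLATE = "<?xml version='1.0' encoding='{enc}'?>\n<outline text=\"{text}\" />\n"


def outcome(enc, text):
    """Return the parsed attribute, or the exception type and message."""
    # The bytes are always UTF-8; only the declared encoding varies.
    data = TEMPLATE.format(enc=enc, text=text).encode()
    try:
        return ET.parse(io.BytesIO(data)).getroot().get("text")
    except Exception as exc:  # broad on purpose: this is a diagnostic helper
        return f"{type(exc).__name__}: {exc}"


for enc in ("utf-8", "utf8", "ascii", "xyz"):
    for text in ("Comentario", "Comentário"):
        print(f"{enc!r:8} {text!r:14} -> {outcome(enc, text)}")
```

Running it on each interpreter of interest shows at a glance whether utf8 behaves like utf-8, like ascii, or like an unknown encoding.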
Per the XML spec and IANA character sets list, the correct (and only) encoding name is utf-8 (works fine with etree).
Whether to accept utf8 was discussed previously in #46531, which was closed as won't fix. In that issue, however, the error message was "unknown encoding", so the current message is a regression. FWIW, LXML does accept utf8 as a valid encoding.
Expected behavior:
- utf8 encoding fails with "unknown encoding", regardless of whether the input contains non-ASCII characters or not ("in the face of ambiguity, refuse the temptation to guess"), or
- treat utf8 as utf-8, even if it's not actually correct (str.encode() and LXML supporting it seems to indicate it is a common (mis)spelling)
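For context on the second option, Python's own codec registry already resolves utf8 and similar spellings to the canonical utf-8 codec. A small sketch using only the standard codecs module (the list of spellings and the canonical dict are chosen here for illustration):

```python
import codecs

# Spellings chosen for illustration; all are aliases known to Python's codec registry.
spellings = ("utf-8", "utf8", "UTF8", "utf_8")

# Map each spelling to the canonical codec name reported by the registry.
canonical = {name: codecs.lookup(name).name for name in spellings}

for name, resolved in canonical.items():
    print(f"{name!r} -> {resolved!r}")

# str.encode() accepts the alias and produces identical bytes.
assert "Comentário".encode("utf8") == "Comentário".encode("utf-8")
```

This only shows what Python's codec machinery accepts; it says nothing about which names the XML parser recognizes in the declaration.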
LXML behavior, for reference:
>>> import lxml.etree as ET
>>>
>>> # lxml accepts wrong encoding utf8
>>> parse(s).getroot().get('text')
'Comentário'
>>>
>>> # unknown encoding fails as expected
>>> parse(s.replace('utf8', 'xyz'))
Traceback (most recent call last):
...
lxml.etree.XMLSyntaxError: Unsupported encoding: xyz, line 1, column 35
CPython versions tested on:
3.14, 3.13, 3.12
Operating systems tested on:
macOS, Linux