ElementTree.parse() fails with "not well-formed" for declared encoding "utf8" #148821

@lemon24

Bug report

Bug description:

ElementTree.parse() fails with "not well-formed (invalid token)" instead of "unknown encoding" when the XML declares the incorrect encoding name utf8, but only if the document contains non-ASCII characters. More details follow the repro.

Repro:

import io, xml.etree.ElementTree as ET

s = """\
<?xml version='1.0' encoding='utf8'?>
<outline text="Comentário" />
"""

def parse(s):
    return ET.parse(io.BytesIO(s.encode()))
    
parse(s)

Output:

>>> parse(s)
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    parse(s)
  File "<python-input-2>", line 2, in parse
    return ET.parse(io.BytesIO(s.encode()))
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/xml/etree/ElementTree.py", line 1214, in parse
    tree.parse(source, parser)
  File "/Library/Frameworks/Python.framework/Versions/3.14/lib/python3.14/xml/etree/ElementTree.py", line 577, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 21
>>>
>>> # utf-8 works fine
>>> parse(s.replace('utf8', 'utf-8')).getroot().get('text')
'Comentário'
>>>
>>> # ascii-only characters work fine, despite the wrong utf8 declared encoding
>>> parse(s.replace('á', 'a')).getroot().get('text')
'Comentario'
>>>
>>> # a truly unknown encoding fails with the correct message
>>> parse(s.replace('utf8', 'xyz'))
Traceback (most recent call last):
  ...
LookupError: unknown encoding: xyz
>>>
>>> # ascii encoding fails with the same message as utf8
>>> # (perhaps utf8 silently falls back to ascii?)
>>> parse(s.replace('utf8', 'ascii'))
Traceback (most recent call last):
  ...
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 21

Per the XML spec and the IANA character sets list, the correct (and only) name for this encoding is utf-8, which works fine with etree.

Whether to accept utf8 was discussed previously in #46531, which was closed as won't fix. In that issue, however, the error message was "unknown encoding", so the current message is a regression. FWIW, LXML does accept utf8 as a valid encoding.
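
Until the behavior is settled, a caller who already knows the bytes are UTF-8 may be able to sidestep the declaration. This is a minimal sketch, not a proposed fix: it relies on the documented `encoding` argument of `ET.XMLParser`, which overrides the encoding declared in the document. The names `DATA` and `parse_forcing_utf8` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Same document as in the repro, encoded as UTF-8 bytes.
DATA = """\
<?xml version='1.0' encoding='utf8'?>
<outline text="Comentário" />
""".encode("utf-8")


def parse_forcing_utf8(data):
    """Parse XML bytes, ignoring the declared encoding in favor of UTF-8.

    Hypothetical helper: only appropriate when the caller is sure the
    input really is UTF-8, since the declaration is no longer consulted.
    """
    parser = ET.XMLParser(encoding="utf-8")
    return ET.parse(io.BytesIO(data), parser=parser)


print(parse_forcing_utf8(DATA).getroot().get("text"))
```

The trade-off is that a document declaring a genuinely different encoding would be decoded incorrectly, so this only fits inputs whose encoding is known out of band.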

Expected behavior:

  • fail with "unknown encoding" for utf8, regardless of whether the input contains non-ASCII characters ("in the face of ambiguity, refuse the temptation to guess"), or
  • treat utf8 as utf-8, even though it is not strictly correct (str.encode() and LXML both accept it, which suggests it is a common (mis)spelling)
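
To illustrate the point about str.encode(): Python's own codec registry resolves utf8 as an alias of utf-8, while a made-up name such as xyz raises LookupError. A small sketch (the helper name `canonical_codec_name` is hypothetical):

```python
import codecs


def canonical_codec_name(label):
    """Return the codec registry's canonical name for label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


# Compare the spellings from this report against the registry.
for label in ("utf8", "utf-8", "xyz"):
    print(label, "->", canonical_codec_name(label))
```

This only shows how the codec registry treats the spelling; it says nothing about which of the two expected behaviors the XML parser should adopt.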

LXML behavior, for reference:

>>> import lxml.etree as ET
>>> 
>>> # lxml accepts wrong encoding utf8
>>> parse(s).getroot().get('text')
'Comentário'
>>>
>>> # unknown encoding fails as expected
>>> parse(s.replace('utf8', 'xyz'))
Traceback (most recent call last):
  ...
lxml.etree.XMLSyntaxError: Unsupported encoding: xyz, line 1, column 35

CPython versions tested on:

3.14, 3.13, 3.12

Operating systems tested on:

macOS, Linux

Metadata

    Labels

    3.13 (bugs and security fixes), 3.14 (bugs and security fixes), 3.15 (new features, bugs and security fixes), extension-modules (C modules in the Modules dir), topic-XML, type-bug (An unexpected behavior, bug, or error)
