Open
28 changes: 23 additions & 5 deletions Doc/library/xml.etree.elementtree.rst
@@ -711,14 +711,14 @@ Functions

.. function:: tostring(element, encoding="us-ascii", method="xml", *, \
xml_declaration=None, default_namespace=None, \
short_empty_elements=True)
validate=False, short_empty_elements=True)

Generates a string representation of an XML element, including all
subelements. *element* is an :class:`Element` instance. *encoding* [1]_ is
the output encoding (default is US-ASCII). Use ``encoding="unicode"`` to
generate a Unicode string (otherwise, a bytestring is generated). *method*
is either ``"xml"``, ``"html"`` or ``"text"`` (default is ``"xml"``).
*xml_declaration*, *default_namespace* and *short_empty_elements* has the same
*xml_declaration*, *default_namespace*, *validate* and *short_empty_elements* have the same
meaning as in :meth:`ElementTree.write`. Returns an (optionally) encoded string
containing the XML data.

@@ -732,17 +732,20 @@ Functions
The :func:`tostring` function now preserves the attribute order
specified by the user.

.. versionchanged:: next
Added the *validate* parameter.


.. function:: tostringlist(element, encoding="us-ascii", method="xml", *, \
xml_declaration=None, default_namespace=None, \
short_empty_elements=True)
validate=False, short_empty_elements=True)

Generates a string representation of an XML element, including all
subelements. *element* is an :class:`Element` instance. *encoding* [1]_ is
the output encoding (default is US-ASCII). Use ``encoding="unicode"`` to
generate a Unicode string (otherwise, a bytestring is generated). *method*
is either ``"xml"``, ``"html"`` or ``"text"`` (default is ``"xml"``).
*xml_declaration*, *default_namespace* and *short_empty_elements* has the same
*xml_declaration*, *default_namespace*, *validate* and *short_empty_elements* have the same
meaning as in :meth:`ElementTree.write`. Returns a list of (optionally) encoded
strings containing the XML data. It does not guarantee any specific sequence,
except that ``b"".join(tostringlist(element)) == tostring(element)``.
@@ -759,6 +762,9 @@ Functions
The :func:`tostringlist` function now preserves the attribute order
specified by the user.

.. versionchanged:: next
Added the *validate* parameter.


.. function:: XML(text, parser=None)

@@ -1186,7 +1192,7 @@ ElementTree Objects

.. method:: write(file, encoding="us-ascii", xml_declaration=None, \
default_namespace=None, method="xml", *, \
short_empty_elements=True)
validate=False, short_empty_elements=True)

Writes the element tree to a file, as XML. *file* is a file name, or a
:term:`file object` opened for writing. *encoding* [1]_ is the output
@@ -1197,6 +1203,15 @@ ElementTree Objects
*default_namespace* sets the default XML namespace (for "xmlns").
*method* is either ``"xml"``, ``"html"`` or ``"text"`` (default is
``"xml"``).

If *validate* is true, check that all characters are legal,
that element and attribute names are valid, and that the content
of comments, processing instructions and HTML elements
like ``<script>`` does not contain illegal sequences according
to the selected *method* (``"xml"`` or ``"html"``).
Raise :exc:`ValueError` if any check fails.
By default, or if *method* is ``"text"``, no validation is performed.

The keyword-only *short_empty_elements* parameter controls the formatting
of elements that contain no content. If ``True`` (the default), they are
emitted as a single self-closed tag, otherwise they are emitted as a pair
@@ -1216,6 +1231,9 @@ ElementTree Objects
The :meth:`write` method now preserves the attribute order specified
by the user.

.. versionchanged:: next
Added the *validate* parameter.

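To illustrate what these checks guard against, here is a minimal sketch that runs without the new parameter. It serializes a comment containing ``--`` (one of the cases the tests in this PR expect to be rejected) and shows that the serializer's own output cannot be parsed back. The wrapper element name ``root`` is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Without validation, the comment text is written out verbatim.
root = ET.Element("root")
root.append(ET.Comment("a--b"))
data = ET.tostring(root)

# "--" is not allowed inside an XML comment, so the parser rejects
# the document that the serializer just produced.
try:
    ET.fromstring(data)
    round_trips = True
except ET.ParseError:
    round_trips = False

print(data, round_trips)
```

With *validate* set to true, the documentation above says the serializer should raise :exc:`ValueError` for such a tree instead of emitting it.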

This is the XML file that is going to be manipulated::

10 changes: 10 additions & 0 deletions Doc/whatsnew/3.16.rst
@@ -209,6 +209,16 @@ tarfile
* The undocumented and unused :attr:`!tarfile.TarFile.tarfile` attribute
has been deprecated since Python 3.13.

xml.etree.ElementTree
---------------------

* Add the *validate* option to the
:func:`~xml.etree.ElementTree.tostring` and
:func:`~xml.etree.ElementTree.tostringlist` functions and the
:meth:`ElementTree.write <xml.etree.ElementTree.ElementTree.write>` method,
which allows validating the element or element tree before serialization.
(Contributed by Serhiy Storchaka in :gh:`149468`.)

.. Add removals above alphabetically, not here at the end.

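A hedged usage sketch for code that has to run both with and without this change: it detects the keyword at run time and only passes it when present. The helper name ``tostring_checked`` and the flag ``HAS_VALIDATE`` are hypothetical and not part of ``xml.etree.ElementTree``.

```python
import inspect
import xml.etree.ElementTree as ET

# The keyword only exists in versions that include this change,
# so look for it in the function signature instead of assuming it.
HAS_VALIDATE = "validate" in inspect.signature(ET.tostring).parameters


def tostring_checked(elem, **kwargs):
    """Serialize elem, requesting validation where it is available.

    Hypothetical helper for illustration only.
    """
    if HAS_VALIDATE:
        kwargs.setdefault("validate", True)
    return ET.tostring(elem, **kwargs)


elem = ET.Element("tag", attrib={"key": "value"})
result = tostring_checked(elem)
print(result)
```

On a version with the new parameter, an invalid tree passed to this helper would be expected to raise ``ValueError``; on older versions it is serialized unchecked, as before.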

223 changes: 223 additions & 0 deletions Lib/test/test_xml_etree.py
@@ -1387,6 +1387,229 @@ def test_attlist_default(self):
{'{http://www.w3.org/XML/1998/namespace}lang': 'eng'})


class XMLValidationTest(unittest.TestCase):

    def check(self, elem):
        self.assertRaises(ValueError,
                          ET.tostring, elem, validate=True)
        ET.tostring(elem)  # no exception

    def check_valid(self, elem, expected):
        self.assertEqual(ET.tostring(elem, validate=True), expected)

    def test_invalid_comment(self):
        self.check(ET.Comment('a--b'))
        self.check(ET.Comment(' B+, B, or B-'))
        self.check(ET.Comment('\x00'))
        self.check(ET.Comment('\x01'))
        self.check(ET.Comment('\ud8ff'))
        self.check(ET.Comment('\ufffe'))

    def test_invalid_processing_instruction(self):
        self.check(ET.PI(''))
        self.check(ET.PI('0'))
        self.check(ET.PI('a/b'))
        self.check(ET.PI('foo\xa0bar'))
        self.check(ET.PI('foo\fbar'))
        self.check(ET.PI('xml'))
        self.check(ET.PI('XML'))
        self.check(ET.PI('xml', 'encoding="UTF-8"'))
        self.check(ET.PI('foo', 'a?>b'))
        self.check(ET.PI('foo', '\x00'))
        self.check(ET.PI('foo', '\x01'))
        self.check(ET.PI('foo', '\ud8ff'))
        self.check(ET.PI('foo', '\ufffe'))

        self.check_valid(ET.PI('foo\tbar'), b'<?foo\tbar?>')
        self.check_valid(ET.PI('foo\nbar'), b'<?foo\nbar?>')
        self.check_valid(ET.PI('foo\rbar'), b'<?foo\rbar?>')

    def test_invalid_tag(self):
        self.check(ET.Element(''))
        self.check(ET.Element('0'))
        self.check(ET.Element('a/b'))
        self.check(ET.Element(ET.QName('')))
        self.check(ET.Element(ET.QName('0')))
        self.check(ET.Element(ET.QName('a/b')))

    def test_invalid_attr_name(self):
        self.check(ET.Element('tag', attrib={'': 'value'}))
        self.check(ET.Element('tag', attrib={'0': 'value'}))
        self.check(ET.Element('tag', attrib={'a/b': 'value'}))
        self.check(ET.Element('tag', attrib={ET.QName(''): 'value'}))
        self.check(ET.Element('tag', attrib={ET.QName('0'): 'value'}))
        self.check(ET.Element('tag', attrib={ET.QName('a/b'): 'value'}))

    def test_invalid_attr_value(self):
        self.check(ET.Element('tag', attrib={'key': '\x00'}))
        self.check(ET.Element('tag', attrib={'key': '\ud8ff'}))
        self.check(ET.Element('tag', attrib={'key': '\ufffe'}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')}))
Comment on lines +1443 to +1449
Member

This and several other methods could use subTests if you think it's an improvement, e.g.:

Suggested change
-    def test_invalid_attr_value(self):
-        self.check(ET.Element('tag', attrib={'key': '\x00'}))
-        self.check(ET.Element('tag', attrib={'key': '\ud8ff'}))
-        self.check(ET.Element('tag', attrib={'key': '\ufffe'}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')}))
+    @support.subTests('value', ('\x00', '\ud8ff', '\ufffe'))
+    def test_invalid_attr_value(self, value):
+        self.check(ET.Element('tag', attrib={'key': value}))
+        self.check(ET.Element('tag', attrib={'key': ET.QName(value)}))

Member Author

I'll think about it.

The drawback is that this can make tests less flexible. It will be difficult to add different assertions in the same method. We also cannot use ET outside the method body, because it is only defined when the tests run.

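One possible middle ground for the trade-off discussed above, not proposed in the thread itself, is the `subTest()` context manager from the standard `unittest` module. The loop stays inside the method body, so other assertions can share the method and module-level names are only touched at run time. In this sketch `is_legal_xml_char` is a hypothetical stand-in for the real check; it follows the `Char` production of XML 1.0 and is not taken from this PR.

```python
import unittest


def is_legal_xml_char(ch):
    """Return True if ch matches the Char production of XML 1.0.

    Hypothetical helper for illustration only.
    """
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


class IllegalCharTest(unittest.TestCase):

    def test_illegal_chars(self):
        # Each value is reported separately if its assertion fails.
        for value in ('\x00', '\ud8ff', '\ufffe'):
            with self.subTest(value=value):
                self.assertFalse(is_legal_xml_char(value))
        # Unrelated assertions can still live in the same method.
        self.assertTrue(is_legal_xml_char('\t'))
```

The decorator form in the suggestion is more compact; the context-manager form trades that for freedom inside the method body.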

    def test_invalid_text(self):
        elem = ET.Element('tag')
        elem.text = '\x00'
        self.check(elem)
        elem.text = '\ud8ff'
        self.check(elem)
        elem.text = '\ufffe'
        self.check(elem)

    def test_invalid_tail(self):
        elem = ET.Element('tag')
        elem.tail = '\x00'
        self.check(elem)
        elem.tail = '\ud8ff'
        self.check(elem)
        elem.tail = '\ufffe'
        self.check(elem)

    def test_invalid_text_without_tag(self):
        elem = ET.Element(None)
        elem.text = '\x00'
        self.check(elem)
        elem.text = '\ud8ff'
        self.check(elem)
        elem.text = '\ufffe'
        self.check(elem)

    def test_invalid_subelements(self):
        elem = ET.Element('tag')
        subelem = ET.SubElement(elem, 'subtag')
        ET.SubElement(subelem, '\x00')
        self.check(elem)
        elem.tag = None
        self.check(elem)

    def test_invalid_namespace_uri(self):
        self.check(ET.Element('{\x00}tag'))
        self.check(ET.Element('{\ud8ff}tag'))
        self.check(ET.Element('{\ufffe}tag'))
        self.check(ET.Element(ET.QName('\x00', 'tag')))
        self.check(ET.Element(ET.QName('\ud8ff', 'tag')))
        self.check(ET.Element(ET.QName('\ufffe', 'tag')))


class HTMLValidationTest(unittest.TestCase):

    def check(self, elem):
        self.assertRaises(ValueError,
                          ET.tostring, elem, method='html', validate=True)
        ET.tostring(elem, method='html')  # no exception

    def test_invalid_comment(self):
        self.check(ET.Comment('>'))
        self.check(ET.Comment('->'))
        self.check(ET.Comment('a-->b'))
        self.check(ET.Comment('a--!>b'))
        self.check(ET.Comment('a\x00b'))
        self.check(ET.Comment('a\ud8ffb'))

    def test_invalid_processing_instruction(self):
        self.check(ET.PI('a>b'))
        self.check(ET.PI('a\x00b'))
        self.check(ET.PI('a\ud8ffb'))

    def test_invalid_tag(self):
        self.check(ET.Element(''))
        self.check(ET.Element('?'))
        self.check(ET.Element('!'))
        self.check(ET.Element('0'))
        self.check(ET.Element(' a'))
        self.check(ET.Element('a b'))
        self.check(ET.Element('a\nb'))
        self.check(ET.Element('a/b'))
        self.check(ET.Element('a>b'))
        self.check(ET.Element('a\x00b'))
        self.check(ET.Element('a\ud8ffb'))
        self.check(ET.Element(ET.QName('')))
        self.check(ET.Element(ET.QName('0')))
        self.check(ET.Element(ET.QName('a/b')))

    def test_invalid_attr_name(self):
        self.check(ET.Element('tag', attrib={'': 'value'}))
        self.check(ET.Element('tag', attrib={'\x00': 'value'}))
        self.check(ET.Element('tag', attrib={'\ud8ff': 'value'}))
        self.check(ET.Element('tag', attrib={'a/b': 'value'}))
        self.check(ET.Element('tag', attrib={'a=b': 'value'}))
        self.check(ET.Element('tag', attrib={'a\x00b': 'value'}))
        self.check(ET.Element('tag', attrib={'a\ud8ffb': 'value'}))
        self.check(ET.Element('tag', attrib={ET.QName(''): 'value'}))
        self.check(ET.Element('tag', attrib={ET.QName('a/b'): 'value'}))

    def test_invalid_attr_value(self):
        self.check(ET.Element('tag', attrib={'key': '\x00'}))
        self.check(ET.Element('tag', attrib={'key': '\ud8ff'}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('a"b')}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('a&b')}))

    def test_invalid_text(self):
        elem = ET.Element('tag')
        elem.text = '\x00'
        self.check(elem)
        elem.text = '\ud8ff'
        self.check(elem)

    def test_invalid_tail(self):
        elem = ET.Element('tag')
        elem.tail = '\x00'
        self.check(elem)
        elem.tail = '\ud8ff'
        self.check(elem)

    def test_invalid_text_without_tag(self):
        elem = ET.Element(None)
        elem.text = '\x00'
        self.check(elem)
        elem.text = '\ud8ff'
        self.check(elem)

    def test_invalid_subelements(self):
        elem = ET.Element('tag')
        subelem = ET.SubElement(elem, 'subtag')
        ET.SubElement(subelem, '\x00')
        self.check(elem)
        elem.tag = None
        self.check(elem)

    def test_invalid_namespace_uri(self):
        self.check(ET.Element('{\x00}tag'))
        self.check(ET.Element('{\ud8ff}tag'))
        self.check(ET.Element(ET.QName('\x00', 'tag')))
        self.check(ET.Element(ET.QName('\ud8ff', 'tag')))

    @support.subTests('tag', ("script", "style", "xmp", "iframe", "noembed", "noframes"))
    def test_invalid_cdata_content(self, tag):
        elem = ET.Element(tag.upper())
        elem.text = 'a</%s>b' % tag.title()
        self.check(elem)
        elem.text = 'a</%s b' % tag.title()
        self.check(elem)
        elem.text = 'a</%s/b' % tag.title()
        self.check(elem)
        elem.text = 'a\x00b'
        self.check(elem)
        elem.text = 'a\ud8ffb'
        self.check(elem)

    @support.subTests('tag', ("script", "style", "xmp", "iframe", "noembed", "noframes"))
    def test_cdata_subelements(self, tag):
        elem = ET.Element(tag)
        ET.SubElement(elem, 'subtag')
        self.check(elem)

    def test_invalid_plaintext_content(self):
        elem = ET.Element('plaintext')
        elem.text = 'a\x00b'
        self.check(elem)
        elem.text = 'a\ud8ffb'
        self.check(elem)

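The raw-text cases in `test_invalid_cdata_content` above (`</Script>`, `</Script ` and `</Script/` inside a `SCRIPT` element) can be read as one rule: the text must not contain something an HTML parser would take for the element's own end tag. Below is a rough sketch of that rule under my reading of the HTML tokenizer (case-insensitive tag name followed by whitespace, `/` or `>`). The function `closes_raw_text` is hypothetical and is not claimed to match the PR's implementation.

```python
import re


def closes_raw_text(tag, text):
    """Return True if text contains a sequence that would end the
    raw text element *tag* early.

    Hypothetical helper for illustration only.
    """
    # "</" + tag name, then a character that terminates the tag name.
    pattern = r"</%s[\t\n\f\r />]" % re.escape(tag)
    return re.search(pattern, text, re.IGNORECASE) is not None


for sample in ("a</Script>b", "a</Script b", "a</Script/b", "a</scripts>b"):
    print(repr(sample), closes_raw_text("script", sample))
```

The last sample is not flagged because `</scripts` continues the tag name rather than terminating it.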

class IterparseTest(unittest.TestCase):
    # Test iterparse interface.
