-
-
Notifications
You must be signed in to change notification settings - Fork 34.5k
gh-135661: Fix CDATA section parsing in HTMLParser #135665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
f7f9f56
816f34e
cf918e3
d346c10
9e1ae33
524cac5
2a1bb46
8cdbc95
e4f13a8
50fd4b3
165fd1e
a5f45b8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -121,6 +121,17 @@ The output will then be: | |||||||||
| attributes can be preserved, etc.). | ||||||||||
|
|
||||||||||
|
|
||||||||||
| .. method:: HTMLParser.support_cdata(flag) | ||||||||||
|
|
||||||||||
| Sets how the parser will parse CDATA declarations. | ||||||||||
| If *flag* is true, then the :meth:`unknown_decl` method will be called | ||||||||||
| for the CDATA section ``<![CDATA[...]]>``. | ||||||||||
| If *flag* is false, then the :meth:`handle_comment` method will be called | ||||||||||
| for ``<![CDATA[...>``. | ||||||||||
|
|
||||||||||
| .. versionadded:: 3.13.6 | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should mention the previous behaviour.
Suggested change
|
||||||||||
|
|
||||||||||
|
|
||||||||||
| The following methods are called when data or markup elements are encountered | ||||||||||
| and they are meant to be overridden in a subclass. The base class | ||||||||||
| implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`): | ||||||||||
|
|
||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -144,6 +144,7 @@ def reset(self): | |
| self.lasttag = '???' | ||
| self.interesting = interesting_normal | ||
| self.cdata_elem = None | ||
| self._support_cdata = False | ||
| super().reset() | ||
|
|
||
| def feed(self, data): | ||
|
|
@@ -174,6 +175,9 @@ def clear_cdata_mode(self): | |
| self.interesting = interesting_normal | ||
| self.cdata_elem = None | ||
|
|
||
| def support_cdata(self, flag=True): | ||
| self._support_cdata = flag | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not convinced this is the best way to handle the issue.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Private means that users shouldn't call it. Given that nothing in It looks like the issue can't be solved in CPython alone -- users need to adapt their code?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This can be solved in CPython, and I hope we will, but it looks like the solution will be too complex to risk backporting it to security-only branches. Providing a method to alter behavior we pass the ball to user side. This is not good, but otherwise the problem will left unsolved. And without closing this hole, all other security fixes for HTMLParser are worthless. I will try other approach, but can't guarantee anything.
Comment on lines
+187
to
+198
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any reason to have a setter method, rather than changing the value of
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can change it in future if we need more complex logic. Of course, in Python we can just add a property. And yes, docstring. |
||
|
|
||
| # Internal -- handle data as far as reasonable. May leave state | ||
| # and data to be processed by a subsequent call. If 'end' is | ||
| # true, force handling all data as if followed by EOF marker. | ||
|
|
@@ -249,7 +253,10 @@ def goahead(self, end): | |
| break | ||
| self.handle_comment(rawdata[i+4:j]) | ||
| elif startswith("<![CDATA[", i): | ||
| self.unknown_decl(rawdata[i+3:]) | ||
| if self._support_cdata: | ||
| self.unknown_decl(rawdata[i+3:]) | ||
| else: | ||
| self.handle_comment(rawdata[i+1:]) | ||
| elif rawdata[i:i+9].lower() == '<!doctype': | ||
| self.handle_decl(rawdata[i+2:]) | ||
| elif startswith("<!", i): | ||
|
|
@@ -325,7 +332,14 @@ def parse_html_declaration(self, i): | |
| # this case is actually already handled in goahead() | ||
| return self.parse_comment(i) | ||
| elif rawdata[i:i+9] == '<![CDATA[': | ||
| return self.parse_marked_section(i) | ||
| if self._support_cdata: | ||
| j = rawdata.find(']]>', i+9) | ||
| if j < 0: | ||
| return -1 | ||
| self.unknown_decl(rawdata[i+3: j]) | ||
| return j + 3 | ||
| else: | ||
| return self.parse_bogus_comment(i) | ||
| elif rawdata[i:i+9].lower() == '<!doctype': | ||
| # find the closing > | ||
| gtpos = rawdata.find('>', i+9) | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,2 @@ | ||||||||||||||||||
| Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to | ||||||||||||||||||
| the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section. | ||||||||||||||||||
|
Comment on lines
+1
to
+2
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should mention the default behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a weak point of such approach. It should be true by default to be able to parse valid HTML (when
<![CDATA[...]]>is only used in foreign content) by default. But secure parsing needs to set it to false at the beginning and the set to true or false after every open or close tag, depending on the complex algorithm.So we should set it to true by default if we keep this approach.