Escape ]]> in CDATA sections
A bug in 2.1.2, submitted by nickdunn on 24 November 2010
Announcement
Symphony's issue tracker has been moved to Github.
Issues are displayed here for reference only and cannot be created or edited.
Browse
Closed#429: Escape ]]> in CDATA sections
Maybe I’m gonna say something really stupid (if that is the case, please insult me!). If I understood correctly, the problem is that there can’t be a sequence of the following characters “]]>” because it would be interpreted as a CDATA closing directive. What does happen if every character in that sequence is replaced with an XML entity? Would that be enough?
You may be right. My understanding of the “unformatted” output however was that it would deliver the raw text of the field, rather than the text-formatted value.
One idea is to htmlspecialchars()
the result which would do the trick. However this is formatting the content when the developer has explicitly asked for the unformatted version. But perhaps this isn’t such a big problem, since they’re getting an XML-safe version, rather than the full text-formatter version.
I’ll give HTML encoding and/or XML entity encoding a try.
I’ve posted a pullrequest.
Not sure I agree with this implementation. What if I have simply quoted ]]>
in my post. I wonder if this might cause other problems. My thinking is that the CDATA section contents should actually be HTML encoded, in fact this would negate the need for CDATA at all.
For example my post might be (Markdown):
When using the formatted output this would be Markdown-ified. However when unformatted your patch would result in:
Which I think would be incorrect.
It would make more sense to end up with:
Don’t you think? We are assuming that the content is always going to be delivered as XML or HTML.
What if I have simply quoted
]]>
in my post.
This is what this issue was about, wasn’t it? It does fix that.
My thinking is that the CDATA section contents should actually be HTML encoded.
No, you are not allowed to do that.
I see this as correct behaviour.
This is incorrect behaviour. You’re not allowed to escape contents of a CDATA section, the >
will be taken literally (or even encoded to >
).
My point about HTML escaping the contents was that we wouldn’t need to worry about CDATA at all (hence I had removed it from the pastie).
I haven’t tested it, but if the behaviour you’ve achieved is correct then that’s great. Maybe it just looks a bit weird. Does this correctly escape all of these?
]]>
and
<![CDATA[
and
<![CDATA[]]>
I’ll try and find time to set up a test with as many possibilities as I can think of to make sure this works as expected.
I can’t help but think that HTML encoding/escaping would make things easier though, running the contents through General::sanitize(...)
.
Does this correctly escape all of these?
It handles all of them correctly, yes.
https://gist.github.com/756141 runs through perfectly in formatted
and unformatted
mode.
The pullrequest has been pulled into integration.
In case there are no more concerns this issue can be marked as “committed”. Otherwise we have to think about reverting the commit.
Thanks Nils, great job :-)
This issue is closed.
As discovered in #96, if you include the characters
]]X>
(without the X) in a Textarea and then choose the unformatted option in a data source, you get an XML/XSLT error:This is because the
unformatted
option of a Textarea wraps the raw contents with a CDATA section. And the only thing that’s parsed from a CDATA section is the closing directive. So if you include two closing directives, you have problems.Now this won’t be a problem for most users, since they almost all will be using the formatted option. But on this site the unformatted option is used when editing things (comments, issues, extension descriptions) as you want raw text rather than HTML.
So what is the solution? One idea is to simply remove of obfuscate any closing CDATA directives in the raw text output:
(note in all instances I have removed one square bracket so I don’t receive XML errors when editing this post!)
What if the user genuinely needs CDATA within the raw text? We can’t remove the outer CDATA since the raw text may contain
>
and<
, which is why the CDATA was added in the first place.My head hurts.