“Escape ]]> in CDATA sections” – Issues – Discuss

Escape ]]> in CDATA sections

A bug in 2.1.2, submitted by nickdunn on 24 November 2010

Announcement

Symphony's issue tracker has been moved to Github.

Issues are displayed here for reference only and cannot be created or edited.

Browse

Closed#429: Escape ]]> in CDATA sections

- nickdunn
- Comment #1
- 24 Nov 10, 11:29 pm

As discovered in #96, if you include the characters ]]X> (without the X) in a Textarea and then choose the unformatted option in a data source, you get an XML/XSLT error:

loadXML(): Sequence ‘]>’ not allowed in content in Entity, line: 41

loadXML(): internal errorExtra content at the end of the document in Entity, line: 41

This is because the unformatted option of a Textarea wraps the raw contents with a CDATA section. And the only thing that’s parsed from a CDATA section is the closing directive. So if you include two closing directives, you have problems.

Now this won’t be a problem for most users, since they almost all will be using the formatted option. But on this site the unformatted option is used when editing things (comments, issues, extension descriptions) as you want raw text rather than HTML.

So what is the solution? One idea is to simply remove of obfuscate any closing CDATA directives in the raw text output:

sprintf('<![CDATA[%s]>', preg_replace('/]>/', ']>', $data['value']))

(note in all instances I have removed one square bracket so I don’t receive XML errors when editing this post!)

What if the user genuinely needs CDATA within the raw text? We can’t remove the outer CDATA since the raw text may contain > and <, which is why the CDATA was added in the first place.

My head hurts.

- eKoeS
- Comment #2
- 02 Dec 10, 9:09 am

Maybe I’m gonna say something really stupid (if that is the case, please insult me!). If I understood correctly, the problem is that there can’t be a sequence of the following characters “]]>” because it would be interpreted as a CDATA closing directive. What does happen if every character in that sequence is replaced with an XML entity? Would that be enough?

- nickdunn
- Comment #3
- 22 Dec 10, 6:04 am

You may be right. My understanding of the “unformatted” output however was that it would deliver the raw text of the field, rather than the text-formatted value.

One idea is to htmlspecialchars() the result which would do the trick. However this is formatting the content when the developer has explicitly asked for the unformatted version. But perhaps this isn’t such a big problem, since they’re getting an XML-safe version, rather than the full text-formatter version.

I’ll give HTML encoding and/or XML entity encoding a try.

- phoque
- Comment #4
- 24 Dec 10, 10:55 am

I’ve posted a pullrequest.

- nickdunn
- Comment #5
- 27 Dec 10, 9:33 am

Not sure I agree with this implementation. What if I have simply quoted ]]> in my post. I wonder if this might cause other problems. My thinking is that the CDATA section contents should actually be HTML encoded, in fact this would negate the need for CDATA at all.

- nickdunn
- Comment #6
- 27 Dec 10, 9:37 am

For example my post might be (Markdown):

http://pastie.org/1407456

When using the formatted output this would be Markdown-ified. However when unformatted your patch would result in:

http://pastie.org/1407459

Which I think would be incorrect.

It would make more sense to end up with:

http://pastie.org/1407462

Don’t you think? We are assuming that the content is always going to be delivered as XML or HTML.

- phoque
- Comment #7
- 27 Dec 10, 9:44 am

What if I have simply quoted ]]> in my post.

This is what this issue was about, wasn’t it? It does fix that.

My thinking is that the CDATA section contents should actually be HTML encoded.

No, you are not allowed to do that.

- phoque
- Comment #8
- 27 Dec 10, 9:47 am

http://pastie.org/1407459

I see this as correct behaviour.

http://pastie.org/1407462

This is incorrect behaviour. You’re not allowed to escape contents of a CDATA section, the > will be taken literally (or even encoded to &gt;).

- nickdunn
- Comment #9
- 27 Dec 10, 11:14 pm

My point about HTML escaping the contents was that we wouldn’t need to worry about CDATA at all (hence I had removed it from the pastie).

I haven’t tested it, but if the behaviour you’ve achieved is correct then that’s great. Maybe it just looks a bit weird. Does this correctly escape all of these?

]]>

and

<![CDATA[

and

<![CDATA[]]>

I’ll try and find time to set up a test with as many possibilities as I can think of to make sure this works as expected.

I can’t help but think that HTML encoding/escaping would make things easier though, running the contents through General::sanitize(...).

- phoque
- Comment #10
- 27 Dec 10, 11:44 pm

Does this correctly escape all of these?

It handles all of them correctly, yes.

https://gist.github.com/756141 runs through perfectly in formatted and unformatted mode.

- phoque
- Comment #11
- 30 Dec 10, 4:27 am

The pullrequest has been pulled into integration.

In case there are no more concerns this issue can be marked as “committed”. Otherwise we have to think about reverting the commit.

- nickdunn
- Comment #12
- 30 Dec 10, 5:50 am

Thanks Nils, great job :-)

This issue is closed.

Symphony.