htmlentities() or htmlspecialchars(), that's the question...

by Eloy Lafuente (stronk7) -
Hi,

A few hours ago I was about to start changing calls from:

htmlentities($text) to

htmlentities($text, ENT_COMPAT, get_string('thischarset'));

to allow non-ISO-8859-1 languages to run a bit better (based on this discussion and as part of the migration to UTF-8).

And then I started "googling" a bit about htmlentities, charsets and so on.

First of all I found that htmlentities() only supports the 'UTF-8' charset from PHP 4.3.0 onwards (oh, oh, not applicable under Moodle 1.5.x! :-( ). Then I found that a lot of people recommend using htmlspecialchars() because it doesn't break non-ISO-8859-1 strings and it's lighter than htmlentities().

A quick search of the Moodle code shows that there are approximately 100 calls to htmlentities() and 100 calls to htmlspecialchars().

I'm sure that the htmlentities() function is sometimes needed, but I'm not able to find out why (or at least a logical pattern to decide when it's required).

Could anybody explain why (and when) we need to convert some characters (apart from the well-known &, <, > and ", which are handled by htmlspecialchars()) to their HTML entity equivalents?

Ciao :-)
In reply to Eloy Lafuente (stronk7)

Re: htmlentities() or htmlspecialchars(), that's the question...

by Petr Skoda -
The way I understand it:
  • htmlentities() is used to convert given text into ASCII encoding;
  • htmlspecialchars() and our improved modification p() and s() are used in forms and displaying of nonhtml text.

IMHO we do not need htmlentities() because we do not know the encoding of the course content (source encoding); we only know the lang pack encoding (target encoding). AFAIK we do not use ASCII anywhere in Moodle. In fact, if you type characters that are outside the lang pack charset (for example, if you type Russian here: здравствуйте), they are converted into entities by the browser itself.

I hope my view is right; if not, please, please correct me ;-)

skodak
In reply to Eloy Lafuente (stronk7)

Re: htmlentities() or htmlspecialchars(), that's the question...

by Martín Langhoff -
htmlspecialchars() is identical to htmlentities() if you only have plain ASCII alphanum. In strings with high-bit characters, htmlspecialchars() will leave them alone... if PHP charset and the HTTP charset are in agreement, the user should see the content correctly.

With htmlentities(), all the high-bit chars get converted to html entities. This is sometimes saner, because it can survive the HTTP charset being wrong. Does that make sense?

htmlspecialchars() leaves ñ alone, and you need the browser to know the right encoding to display it correctly, whereas htmlentities turns it into &ntilde; which has a better chance to get printed correctly if we fsck up the charsets.
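
The difference described above can be checked with a short script. This is only a sketch: it assumes the input is UTF-8 and passes the flags and charset explicitly, since the defaults differ between PHP versions.

```php
<?php
// Assumption: the input is UTF-8 ("\xC3\xB1" is the UTF-8 byte sequence for ñ).
$text = "ni\xC3\xB1o <b> & \"x\"";

// Only & < > and " are escaped (ENT_COMPAT leaves single quotes alone).
$special = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');

// Every character that has a named HTML entity is escaped as well.
$entities = htmlentities($text, ENT_COMPAT, 'UTF-8');

echo $special, "\n";   // niño &lt;b&gt; &amp; &quot;x&quot;
echo $entities, "\n";  // ni&ntilde;o &lt;b&gt; &amp; &quot;x&quot;
```

In the first result the ñ is still a raw two-byte sequence, so the browser must be told the page is UTF-8; in the second it is plain ASCII and survives a wrong HTTP charset.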
In reply to Martín Langhoff

Re: htmlentities() or htmlspecialchars(), that's the question...

by Eloy Lafuente (stronk7) -
Well,

it seems that the only reason to use htmlentities() in a lot of places is to be able to show characters OUT of the charset specified in the http charset, isn't it?

For example, to write some Cyrillic (WINDOWS-1251) chars inside this ISO-8859-1 page, they have to be written as html entities. So, in this case, htmlentities() leaves things unmodified, because the text has been sent, stored and displayed using html entities, isn't it? If it leaves things unmodified, it isn't necessary in this case, is it?

The problem arises when a Russian moodler (whose charset is WINDOWS-1251) writes text in this page. The page is then sent and stored in the DB in that charset and, before being shown, it's filtered by htmlentities(), which simply breaks it, both for other moodlers using ISO-8859-1 (because the encoding is incompatible) and for the same moodler using WINDOWS-1251 (because htmlentities(), as currently used, breaks non-ISO encodings).

This can be solved by one of these:

- Use htmlentities($text, ENT_COMPAT, 'windows-1251') or
- Use htmlspecialchars($text)

The first will allow any user (using any charset) to see the text properly (because the text is converted to charset-independent HTML entities).

The second will allow WINDOWS-1251 users (only!) to see the text properly (without any conversion) simply because both text and page encoding match.
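
A rough sketch of the breakage and of the second option follows. The sample text is an assumption ("да" written as WINDOWS-1251 bytes), and 'ISO-8859-1' is passed explicitly to reproduce the behaviour of a plain htmlentities($text) call on PHP versions of that time; later PHP versions default to UTF-8.

```php
<?php
// "да" encoded as WINDOWS-1251 bytes (assumed sample text).
$cp1251 = "\xE4\xE0";

// Read as ISO-8859-1, the two bytes are taken for ä and à and turned
// into the corresponding Latin entities, so the Cyrillic text is lost.
$broken = htmlentities($cp1251, ENT_COMPAT, 'ISO-8859-1');

// htmlspecialchars() only touches & < > and ", so the bytes pass through
// and display correctly on a page served as WINDOWS-1251.
$intact = htmlspecialchars($cp1251, ENT_COMPAT, 'ISO-8859-1');

echo $broken, "\n";             // &auml;&agrave;
var_dump($intact === $cp1251);  // bool(true)
```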

As Skodak said two posts above, we can't rely on htmlentities() because we don't know how the text was originally encoded, so it's impossible to know the correct charset to specify in the htmlentities() call. :-(

Anyway, there are at least two situations when we are really sure of the encoding used to write text:

- When $CFG->unicode is enabled. Every page is forced to UTF-8.
- When $CFG->courselang is set. Every page inside a course is forced to the encoding used by the lang variable.

Thinking of the future (that wonderful UTF-8 world), I was considering changing all those htmlentities() calls to some sort of $textlib->htmlentities() function (I added the initial version of textlib to CVS some hours ago).

This function will detect whether $CFG->unicode is enabled or $CFG->courselang is defined. If either is present, it will apply htmlspecialchars() (because the charset is forced, the information will always be displayed properly). Otherwise, the standard htmlentities() will be used, trying to convert as many characters as possible while knowing that it won't always be perfect. By doing this, we'll allow sites running under $CFG->unicode and courses with the lang (and charset) forced to run properly, leaving things unchanged for the rest.
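
The decision logic described above might look roughly like this. It is a hypothetical sketch, not the actual textlib code: the function name sketch_htmlentities() and the idea of passing the config object as a parameter are assumptions, and the course language charset is assumed to be an ASCII-compatible single-byte one.

```php
<?php
// Hypothetical sketch of the proposed wrapper; not the real textlib function.
function sketch_htmlentities($text, $cfg) {
    if (!empty($cfg->unicode)) {
        // Every page is forced to UTF-8: escape only & < > and ".
        return htmlspecialchars($text, ENT_COMPAT, 'UTF-8');
    }
    if (!empty($cfg->courselang)) {
        // Charset forced by the course language (assumed single-byte):
        // only the ASCII specials are replaced, other bytes pass through.
        return htmlspecialchars($text, ENT_COMPAT, 'ISO-8859-1');
    }
    // Unknown encoding: fall back to the old behaviour, which is
    // correct only for ISO-8859-1 text.
    return htmlentities($text, ENT_COMPAT, 'ISO-8859-1');
}

$cfg = new stdClass();
$cfg->unicode = true;
echo sketch_htmlentities("ni\xC3\xB1o & co", $cfg), "\n"; // niño &amp; co
```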

In some months everybody will be running UTF-8 (forced), so the previous approach will continue working smoothly (without the need to convert to HTML entities anymore).

Coming back to my original question, I wanted to know if there is any reason preventing me from using htmlspecialchars() everywhere as described above (some place where the conversion to entities is mandatory, security, URL encoding...).

This is the full story! ;-)

Ciao :-)

PS: Finally, to make things a bit more complex, I have only tested htmlspecialchars() under iso-8859-1, windows-1251 and utf-8 (Latin characters only), so it would be great to get some feedback about how the function works with REAL non-iso-8859-1 characters (I've done some basic tests with Vietnamese and Chinese and, visually, htmlspecialchars() always seems to work fine). Any confirmation from native speakers will be welcome. Sure!
In reply to Eloy Lafuente (stronk7)

Re: htmlentities() or htmlspecialchars(), that's the question...

by Petr Skoda -
>it seems that the only reason to use htmlentities() in a lot of places is to be able to show characters OUT of the charset specified in the http charset, isn't?

IMO htmlentities() should not be used at all, because there are many problems:
  1. we do not know the encoding of the database content - it is an important Moodle design limitation
  2. htmlentities does not support all ISO and Windows encodings - this alone is a showstopper

We also have to be careful not to encode text twice, because the '&' in existing entities would be replaced by '&amp;'. As I said before, the browser already converts some characters to HTML entities, which means we should use neither htmlentities() nor htmlspecialchars() directly, but p() and s() instead.
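
The double-encoding problem can be sketched as follows. The function sketch_s() is a hypothetical stand-in that only illustrates the idea of escaping once and then restoring numeric entities sent by the browser; it is not Moodle's actual s() implementation.

```php
<?php
// Escaping twice mangles the entity produced by the first pass.
$once  = htmlspecialchars('Tom & Jerry', ENT_COMPAT, 'UTF-8');
$twice = htmlspecialchars($once, ENT_COMPAT, 'UTF-8');
echo $once, "\n";   // Tom &amp; Jerry
echo $twice, "\n";  // Tom &amp;amp; Jerry

// Hypothetical s()-like helper: escape, then restore numeric entities
// (such as those a browser submits for out-of-charset characters).
function sketch_s($var) {
    $escaped = htmlspecialchars($var, ENT_COMPAT, 'UTF-8');
    return preg_replace('/&amp;(#\d+);/', '&$1;', $escaped);
}

// Assumed sample input: two numeric entities followed by a literal ampersand.
echo sketch_s('&#1076;&#1072; & more'), "\n"; // &#1076;&#1072; &amp; more
```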

I propose:
  1. we do not use any html*() during input
  2. we use only p() and s() for output

I have searched Moodle CVS for htmlentities() too:
  • diagnostics of backup/restore - OK, no need to change it
  • tex - FIX?, s() should work fine here
  • mediaplugin - OK, no need to change it
  • install.php - OK?, s() should work here too, no need to change it
  • adodb and cas - OK, 3rd party library
  • filelib - FIX?, htmlspecialchars seems better here (we do not know encoding, there are no html entities)
  • weblib - FIX?, s() should work fine for FORMAT_PLAIN; the rest is OK
  • rsslib - not sure
  • quiz - not sure
  • wiki - needs a lot of fixing

I have also reviewed the use of htmlspecialchars(); it seems most of the calls could be replaced by p() and s().

skodak
In reply to Petr Skoda

Re: htmlentities() or htmlspecialchars(), that's the question...

by Eloy Lafuente (stronk7) -
Great!

That seems to be a nice solution. As our s() and p() functions use htmlspecialchars(), all the output currently handled by either htmlentities() or htmlspecialchars() should work fine under s().

Today I've also been looking at current uses of htmlentities() across the Moodle code, and the "worst" part to modify seems to be the wiki. Both the rsslib and quiz code use "unhtmlentities" to convert back from XML files (although it would be better to always process them with html_entity_decode() instead).
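
For reference, a minimal sketch of decoding with the built-in function. The sample string and the UTF-8 target charset are assumptions chosen for illustration.

```php
<?php
// Convert named entities back to characters, producing UTF-8 output.
$decoded = html_entity_decode('Espa&ntilde;a &amp; &lt;b&gt;', ENT_COMPAT, 'UTF-8');

echo $decoded, "\n"; // España & <b>
```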

I think that it should be really interesting to apply this simple change to

weblib.php:
1099c1099
< $text = htmlentities($text);
---
> $text = s($text);


and get some feedback from people using other charsets different from ISO-8859-1 (viewing, for example, how plain text forum posts, assignment descriptions... work with the change applied).

If it works fine, I propose to change it both for 1.5 and HEAD. It will make things easier for the UTF-8 transition and non ISO-8859-1 users will get instant benefits.

Ciao :-)
In reply to Eloy Lafuente (stronk7)

Re: htmlentities() or htmlspecialchars(), that's the question...

by Haruhiko Okumura -
Sorry, I haven't followed all the discussions, but please note that Prof. Kita and I suggested use of htmlentities(x, , get_string('thischarset')) for maximal compatibility. In fact I rewrote most of the occurrences of htmlentities to htmlspecialchars when I began to use Moodle a year ago.
In reply to Haruhiko Okumura

Re: htmlentities() or htmlspecialchars(), that's the question...

by Eloy Lafuente (stronk7) -
Hi Haruhiko,

I've followed your discussion and, after a discussion with Martin and Mitsuhiro some days ago, we planned to apply the change you suggested because it seemed a nice solution.

The path to follow was to create a new htmlentities() inside the recently added textlib library and change all the old calls to use the new function.

However, I started reading more and more about the use of htmlentities() on the Internet and found a lot of pages suggesting avoiding it completely and using htmlspecialchars() instead.

And we already have a function inside Moodle that does the htmlspecialchars() job: the s() function, which should be enough to solve the problems produced by htmlentities().

The initial question of this discussion was to determine whether we NEED the htmlentities() function for some hidden feature or whether we can safely change it to the better htmlspecialchars(), and to check whether the s() approach (which internally performs a htmlspecialchars() call) will work for everybody.

If there are no problems, the change to s() will start, making Moodle a bit more compliant with non-ISO charsets.

Ciao :-)