Can shorten be made to shorten in unicode units rather than bytes?

Sean Carney

Friday 06 February 2004 2:30:16 pm

We are really happy with the shorten function and do not mind that it cuts off words. But we do have a problem where it cuts Unicode characters in the middle and produces a garbage character. You can see an example on our page http://nsd.hopetalk.org

We need to find a way to have shorten cut off based on Unicode characters rather than bytes.

Marco Zinn

Saturday 07 February 2004 11:46:09 am

I'm not into Unicode, but splitting a Unicode character in the middle should not happen.
I suggest that you file a bug report.

Marco
http://www.hyperroad-design.com

Sean Carney

Sunday 22 February 2004 9:37:28 pm

Thank you Marco. I filed a bug report. It also seems strange that shorten cuts off some bytes even when the number of characters displayed is less than the number of characters specified.

Jan Borsodi

Wednesday 03 March 2004 7:31:59 am

PHP itself does not support Unicode internally. You can get some support with the mbstring extension and by overriding internal text functions, but not all of PHP will support it.

We also use the mbstring extension (if available) to perform conversion when it's needed (instead of all the time). However, our i18n system does not yet support text operations such as extracting a portion of a string. This means that none of the template operators that modify text will work correctly on Unicode characters.

The reason for the cutoff is the UTF-8 encoding used for the Unicode text: each Unicode character is represented as a sequence of 1 to 4 bytes (the original UTF-8 definition allowed up to 6; 1 to 3 is the most common).
This means that a string of three characters can actually occupy four or more bytes, and since PHP sees each byte as a character, it will cut off at the wrong place.
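
To make the effect concrete, here is a small sketch of the difference between cutting by bytes and cutting by characters. It is written in Python purely for illustration (it is not PHP and not eZ publish code), and the function names shorten_bytes and shorten_chars are made up for this example.

```python
# Illustration only: Python, not PHP, and not the actual shorten operator.
# shorten_bytes and shorten_chars are hypothetical names for this sketch.

def shorten_bytes(text: str, limit: int) -> str:
    """Cut the UTF-8 encoded form after `limit` bytes, as a byte-oriented
    string function would. A character split in the middle becomes U+FFFD."""
    raw = text.encode("utf-8")
    return raw[:limit].decode("utf-8", errors="replace")


def shorten_chars(text: str, limit: int) -> str:
    """Cut after `limit` Unicode characters (code points)."""
    return text[:limit]


sample = "日本語"                       # 3 characters
encoded_length = len(sample.encode("utf-8"))  # each of these takes 3 bytes

broken = shorten_bytes(sample, 4)      # ends inside the second character
clean = shorten_chars(sample, 2)       # ends on a character boundary
```

With this sample the byte-based cut at 4 keeps the first character plus one stray byte of the second, which is what shows up as a garbage character on the page.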

The only way to get support for this is to create all the various text operations that are used in the operators and place them in the i18n library, then change the operators to use that functionality.
However, this is not a small task, especially considering problems such as case mapping (lowercase, uppercase, etc.).
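
As a rough idea of what one such text operation could look like, the sketch below truncates a UTF-8 byte string on a character boundary by counting lead bytes and skipping continuation bytes (those of the form 10xxxxxx). It is again Python for illustration only, the name utf8_shorten is hypothetical, and it assumes the input is valid UTF-8; it is not how the i18n library is or will be implemented.

```python
# Illustration only: a character-aware truncation over raw UTF-8 bytes.
# utf8_shorten is a hypothetical name; the input is assumed to be valid UTF-8.

def utf8_shorten(raw: bytes, max_chars: int) -> bytes:
    """Return the longest prefix of `raw` holding at most `max_chars` characters."""
    count = 0
    for index, byte in enumerate(raw):
        # Continuation bytes look like 10xxxxxx; anything else starts a character.
        if (byte & 0xC0) != 0x80:
            if count == max_chars:
                # This byte starts one character too many: cut just before it.
                return raw[:index]
            count += 1
    # The string has max_chars characters or fewer, so nothing is cut.
    return raw


shortened = utf8_shorten("héllo wörld".encode("utf-8"), 5)
```

This only counts code points. It does not deal with combining characters or with case mapping, which is part of why the full job is larger than it first looks.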

--
Amos

Documentation: http://ez.no/ez_publish/documentation
FAQ: http://ez.no/ez_publish/documentation/faq