HOWTO Convert database from charset latin-1 to utf-8

Author Message

Björn [email protected]

Wednesday 21 December 2005 12:21:23 pm

What is the savest way to convert a hole database into another charset?

I did it with just dumping the data.

Sometimes it seems it does not work right.

I had to run this query since there stupid chars injected.

SELECT UNHEX ( 'C2A0' );

UPDATE ezcontentobject_attribute SET data_text = REPLACE ( data_text, UNHEX ( 'C2A0' ), ' ' );

I have no clue where those chars came from. Does anybody has an idea?

Looking for a new job? http://www.xrow.com/xrow-GmbH/Jobs
Looking for hosting? http://hostingezpublish.com
-----------------------------------------------------------------------------
GMT +01:00 Hannover, Germany
Web: http://www.xrow.com/

Georg Franz

Wednesday 21 December 2005 2:44:03 pm

Hi Björn,

which version of mysql?

The problem is: Content is saved in utf-8 in some table rows if you are using in latin-1. So if you convert the whole dump from latin-1 to utf-8 the content is messed up.

So I wrote some scripts to do the job "manually" ... i search the links after the post.

Best wishes,
Georg.

--
http://www.schicksal.com Horoskop website which uses eZ Publish since 2004

Georg Franz

Wednesday 21 December 2005 2:51:21 pm

Hi,

found the link:
http://ez.no/products/ez_publish_cms/documentation/configuration/configuration/language_and_charset/unicode_with_ez_publish

As I remember you need at least mysql 4.1.11+, because earlier versions have an utf-8 bug.

Best wishes,
Georg.

--
http://www.schicksal.com Horoskop website which uses eZ Publish since 2004

Björn [email protected]

Wednesday 21 December 2005 2:59:21 pm

My versions are of course up to date... I moved the content from a mysql 3.x to a mysql 5 and back to a 4.1

Too said that your script is lost in the old documentation...

I wonder if the new documentation can properly answer this question ;-)

Looking for a new job? http://www.xrow.com/xrow-GmbH/Jobs
Looking for hosting? http://hostingezpublish.com
-----------------------------------------------------------------------------
GMT +01:00 Hannover, Germany
Web: http://www.xrow.com/

Georg Franz

Thursday 22 December 2005 4:59:53 am

Hi Björn,

http://ez.no/community/forum/setup_design/iso_8859_1_to_utf_8_conversion/re_iso_8859_1_to_utf_8_conversion__1

There is a script to change the charset + collation of the whole db at once. But be careful, it only works, if the content of all tables is in the same charset.

Best wishes,
Georg.

--
http://www.schicksal.com Horoskop website which uses eZ Publish since 2004

Xavier Dutoit

Thursday 22 December 2005 7:23:47 am

Let me share with you the pain I had with the conversion.

Dear microsoft has a tendency to put its non standard characters into the latin1 (the usual special quotes that polute the word documents). Are they your stupid caracters bt any chance ?

If you put them into latin1 and retrieve them, it works fine, and for the rest of us not having windows, well who cares about minorities ?

However, when I converted them to utf8, they ended up as plain garbage, and to make funny things even worse, the update to a new mysql version made things even messier (the default encoding was latin1, now the default is utf8).

I solved the problem with iconv, and sed to convert all the mischanged caracters into the dump (and an awful lot of time).

In short, start with utf8 from the start if you can. If you can't be really sure you don't have do to that in a hurry, otherwise you might be in a big problem.

Talking about the encoding: remember that you can't have one siteaccess in latin1 and the other in utf8. It has to be installwide.

Good luck.

http://www.sydesy.com

Yngve Bergheim

Tuesday 03 January 2006 3:58:14 am

As Georg Franz said, you also have to search and replace <?xml version="1.0" encoding="UTF-8"?> (or your current encoding) with <?xml version="1.0" encoding="utf-8"?> in an eZ sql dump file and then reimport the dump.

Powered by eZ Publish™ CMS Open Source Web Content Management. Copyright © 1999-2014 eZ Systems AS (except where otherwise noted). All rights reserved.