What we’re gonna talk about here is character encoding and how to convert from ISO-8859-1 to UTF-8. If this doesn’t seem familiar to you, you might want to start learning a bit about it. ’Cause hell is a scary place and you don’t want to end up there. This is how to avoid hell, and for those of you who are struggling in the darkness, this is how to get your ass out of there.

Why would I need to convert?

  • You are tired of using entities for foreign characters. Hate using things like &#373; just to get a ŵ (← very common character… uhm… )
  • You are going to merge your web site with another website. (Which has a different encoding.)
  • You want to write in more languages. (I’m talking about languages as in the languages people speak NOT programming languages.)
  • When generating emails and sending data you have problems with weird signs.
  • You are mad because ISO-8859-1 doesn’t contain a euro sign (€). (Isn’t that quite funny? I mean, it’s an encoding especially suited for Western Europe. Yea, yea, &euro;, I know. But still.)

There are more reasons. Feel free to add those in the comments.

The biggest reason “back in the days” to use ISO-8859-1 was that it had slightly better support in older user agents. Well, those old agents are pretty much gone by now. Who still uses Netscape 4? Hands up! Anyone? Well, that’s what I thought.
All in all, UTF-8 is the better fit for the future.

Don’t go to hell

So what do I mean by “Encoding Hell”? Well, those who have been there know what I mean. It’s when you just switch encoding without properly converting your data and you end up with letters looking sÃ¥ hÃ¤r (that’s “så här”, meaning “like this” in Swedish). I got these nasty letters by opening a UTF-8 encoded file in my text editor as ISO-8859-1. And guess what? Once a file has been saved in that state, opening it with the right encoding doesn’t switch the letters back. They are stuck looking nasty. I’m in hell!

[Image: Letters in Encoding Hell]

Well, this was only one file, and sure, you can change the letters manually or with some fancy search/replace command. But imagine making this mistake with hundreds or thousands of files. Not nice. Big hell!
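To see where those letters come from, and what a “fancy search/replace” amounts to in code, here is a small PHP sketch. It assumes the mbstring extension is available and that the text was misread exactly once; the variable names are made up for illustration. The sample string is written as raw bytes so the result doesn’t depend on how the script file itself is saved.

```php
<?php
// "så här" as raw UTF-8 bytes: å = C3 A5, ä = C3 A4.
$original = "s\xC3\xA5 h\xC3\xA4r";

// The mistake: treat those UTF-8 bytes as ISO-8859-1 and store them as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
echo $garbled, "\n"; // prints: sÃ¥ hÃ¤r

// The repair: run the same conversion in the opposite direction.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original); // bool(true)
```

This only works as long as every garbled character survived intact. If something along the way replaced unknown bytes with “?”, the original letters are gone for good.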

Do it right from the beginning

The best thing is to avoid Encoding Hell altogether. Choose the right encoding from the start and you are safe.
But if you must change, first thing: be careful! Being careful will save you time. Make a conversion plan first so you know exactly what you are doing. Don’t just start testing. My advice is to always start with a backup. Back up everything you want to convert. If you have a database connected to your webpages, don’t forget to back up that one as well.

Converting your files

Don’t even think of converting your files by using your text editor. It usually turns out nasty. There are tools to help us out with this. You can use the terminal/the command prompt to convert things. Me, myself, I have tried out “Encoding Master”, which I think is simply brilliant. It’s easy to understand and it converted everything very nicely. It also runs on both PC and Mac.
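If you’d rather script the conversion than use a tool, a minimal PHP sketch could look like this. The function names are hypothetical, it assumes the mbstring extension, and it assumes every file that isn’t already valid UTF-8 really is ISO-8859-1. It writes to a new file so the original stays untouched.

```php
<?php
// Convert a string of ISO-8859-1 bytes to UTF-8.
// Input that is already valid UTF-8 (plain ASCII included) is returned
// unchanged, so running the conversion twice does not double-encode it.
function latin1_to_utf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Read $source, convert its contents, and write the result to $target.
function convert_file(string $source, string $target): void
{
    $bytes = file_get_contents($source);
    if ($bytes === false) {
        throw new RuntimeException("Could not read $source");
    }
    if (file_put_contents($target, latin1_to_utf8($bytes)) === false) {
        throw new RuntimeException("Could not write $target");
    }
}

// Example call (hypothetical file names):
// convert_file('index.html', 'index.utf8.html');
```

The “already valid UTF-8” check is a guess, not a guarantee: an ISO-8859-1 file can, in rare cases, happen to look like valid UTF-8. Spot-check the results before replacing anything.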
Don’t forget to add/change the encoding in the headers of the webpages too.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
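The meta tag is only a hint inside the document. If the web server sends its own charset in the HTTP Content-Type header, browsers go by that header instead, so it is worth sending a matching one from PHP. A small sketch, with a hypothetical helper name:

```php
<?php
// Build the Content-Type header line for a UTF-8 page.
function utf8_content_type(string $mimeType = 'text/html'): string
{
    return 'Content-Type: ' . $mimeType . '; charset=utf-8';
}

// header() only has an effect before any output has been sent.
if (!headers_sent()) {
    header(utf8_content_type());
}
```

Put this at the very top of the page, before any HTML or whitespace is printed.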

Converting your database

Be careful here. Don’t start adding UTF-8 encoded data into your old database. Make sure it’s converted and pretty first. Otherwise you might end up with a messy mix and bang! You are back in hell!
To change the encoding of your database you must change the encoding of everything: the database itself, every table in the database, and every column that holds some sort of text value. I’m talking about data of type char, varchar, tinytext, text, mediumtext and longtext.
Since I’ve lately been working with PHP and MySQL, the following applies to them. If you are working with other languages and other databases, you probably need to do some further research to see what the right methods are.

To change the database itself use the following query:
ALTER DATABASE my_database charset=utf8;
To change a table (this sets the default for new columns; existing columns keep their encoding) use:
ALTER TABLE my_table charset=utf8;
To change a column in a table use:
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET utf8;

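Make sure everything in there uses the character set utf8 and collation utf8_unicode_ci. Ok? (One catch: MySQL’s utf8 stores at most three bytes per character. From MySQL 5.5.3 on there is also utf8mb4, with collations like utf8mb4_unicode_ci, which covers all of UTF-8.)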
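With many tables, typing a query per column gets old fast. MySQL also has ALTER TABLE … CONVERT TO CHARACTER SET, which changes a table’s default and converts its text columns in one statement. The PHP sketch below only builds those statements so you can read them before running anything. The function name and the table names are made up, and it assumes the data currently stored really matches the encoding the columns declare.

```php
<?php
// Build one "ALTER TABLE ... CONVERT TO CHARACTER SET" statement per table.
// Nothing is executed here; the statements are returned as strings.
function build_convert_statements(
    array $tables,
    string $charset = 'utf8',
    string $collation = 'utf8_unicode_ci'
): array {
    // Charset and collation names cannot be bound as query parameters,
    // so only allow simple identifiers.
    foreach ([$charset, $collation] as $name) {
        if (!preg_match('/^[A-Za-z0-9_]+$/', $name)) {
            throw new InvalidArgumentException("Unexpected name: $name");
        }
    }

    $statements = [];
    foreach ($tables as $table) {
        // Quote the table name with backticks, doubling any backtick inside it.
        $quoted = '`' . str_replace('`', '``', $table) . '`';
        $statements[] = "ALTER TABLE $quoted CONVERT TO CHARACTER SET $charset COLLATE $collation;";
    }
    return $statements;
}

// Print the statements for two hypothetical tables.
foreach (build_convert_statements(['my_table', 'my_other_table']) as $sql) {
    echo $sql, "\n";
}
```

Be aware that CONVERT TO may change a column to a bigger text type when it needs the room, so compare the table definitions before and after. And as always: backup first.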

Encoding in PHP

If you are coding in PHP you might have problems getting MySQL and PHP to talk UTF-8 to each other. The solution that worked for me was to send the following query right after connecting to the database:
SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';

From Simon comes this handy tip to those of you who are using MySQLI together with UTF-8:
(With error-handling included as a bonus…)

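$mysqli = new mysqli(DB_SERVER, DB_USERNAME, DB_PASSWORD, DB_DATABASE);
// Stop right away if the connection could not be established.
if (mysqli_connect_error()) {
    throw new Exception('Connection failed due to the following error: ' . mysqli_connect_error());
}
// Tell MySQL that this connection sends and receives UTF-8.
if (!$mysqli->set_charset("utf8")) {
    printf("Error loading character set utf8: %s\n", $mysqli->error);
}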

It could also be a good idea to add the code below to the beginning of the PHP-files.
mb_language('uni');
mb_internal_encoding('UTF-8');
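Why bother with the mb_ functions? PHP’s plain string functions count bytes, not characters, and in UTF-8 a letter like å takes two bytes. A short sketch, assuming the mbstring extension is loaded (the sample string is written as raw bytes so it doesn’t depend on how the file is saved):

```php
<?php
mb_language('uni');
mb_internal_encoding('UTF-8');

// "så här" as UTF-8 bytes: å = C3 A5, ä = C3 A4.
$text = "s\xC3\xA5 h\xC3\xA4r";

echo strlen($text), "\n";          // 8: counts bytes
echo mb_strlen($text), "\n";       // 6: counts characters
echo mb_strtoupper($text), "\n";   // SÅ HÄR
echo mb_substr($text, 0, 2), "\n"; // så
```

So wherever your code measures, cuts or changes the case of text, the mb_ version is the one to reach for after the switch to UTF-8.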

Last but not Least…

Always research carefully before changing your encoding, especially if you are a converting-virgin and are changing the encoding for the first time. Never rely solely on one tutorial. Not even this one which, I need to admit, doesn’t cover everything. This is pretty general info. Hey, you could write a book about encoding and the struggles with converting. So remember to look around. Ask people. Start with a backup. Always. And, again, be careful!

Happy Encoding!

/Ida