I’ve recently had to move a massive dataset that includes UTF-8 strings which contains extended set code points (i.e. planes other than the Basic Multilingual Plane, including Emoji) into a MySQL database hosted using Amazon’s RDS.
Even though all of the databases and tables were configured to use purely utf8 character set and the unicode_ci collation, and though SQLAlchemy was also configured to use UTF8, I quickly ran into issues:
Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1
The solution was:
- Read Mathias Bynens’ awesome tutorial: The things we do to store U+01F4A9 PILE OF POO () correctly
- After you created a backup of your current database, change the MySQL server settings via the Parameter Groups section of the RDS console:
- Click “Create DB Parameter Group”
- Choose the correct Group Family (probably mysql5.6)
- Input a group name and description (probably “mysql5.6-utf8mb4” and “MySQL 5.6 using UTF8MB4 charset by default”
- Select this new Parameter Group in the console and click “Edit Parameters”. Set the following parameter values:
character_set_client: utf8mb4 character_set_database: utf8mb4 character_set_results: utf8mb4 character_set_connection: utf8mb4 collation_connection: utf8mb4_unicode_ci collation_server: utf8mb4_unicode_ci character_set_server: utf8mb4
and click “Save Changes”.
- Go to the “Instances” dashboard, right click your RDS instance and “Modify” it, change the “Parameter Group” option to your newly created Parameter Group and click “Continue”, “Modify DB Instance”.
- You can “Reboot” the instance if you want to be extra sure the new configuration was loaded.
- (Optional) Change the MySQL client settings. For the CLI mysql client, edit /etc/mysql/my.cnf and under [client] add:
[client] default-character-set = utf8mb4
This is to allow proper viewing of data using the mysql tool.
- Modify your existing databases, tables and columns to use UTF8MB4, as explained in the tutorial in part 1.
- Modify your SQLAlchemy connection string from:
(whether or not to add use_unicode=0 is left for the programmer’s discretion.)