Alfresco: Migration from MySQL to PostgreSQL

Alfresco is a full featured opensource ECM. It uses opensource software extensively , as a webapp server (tomcat by default), an office suite (openoffice or libreoffice), image manipulation tools (imagemagick), and of course an SGBD! In this article, the former component is the one we will focus on. After having shipped mysql by default, the Alfresco installer now comes woth postgreSQL as a default database server. We could argue about pros and cons of each other, and that would probably not lead to a mutual agreement…
Nevertheless, I had cases where DBAs get rid of their MySQL DB and switch to Postgres.
This article explains how to do this migration for an Alfresco DB. The following has been tested with Alfresco 3.4.x et 4.x. It is also known not to work as is, with version older than 3.4!

The idea here is fairly simple:
The database schema being relatively complex, and mainly managed by hibernate/ibatis… it’s easy to get lost in those muddy waters! So in order to minimimse tthe sources of errors, we’ll let Alfresco deal with the creation of a new schema, and then insert our (almost) raw data…

Prerequisites:
- We will assume destination and source instances of Alfresco are the same version. iF further upgrade is planned t can be done once database migration has been completed successfully.
- We will assume that collations and charsetsare similar. By default Postgres and MySQL should use respectively UTF-8 and utf8_bin.
- downtime has to be scheduled during this migration in order to ensure data consistency across the instances.
A brand new DB:

To start with, we’ll deploy a fresh Alfresco instance, so called out of the box instance, using PostgreSQL as an SGDB. I won’t dive into much details here as I assume you’re already used to lfresco installation if yo plan to migrate from one SL server to another! To quickly summarise, install tomcat, create a postgres database and its associated user, put the alfresco wars in the webapps folder, and configure your alfresco-global.properties fileto point to the new empty DB.
When tomcat deploys Alfresco, and Alfresco finds an empty database, he will create a schema for it, just has onewould expect from a clean install (and also populate it with some data).
Once bootstrap completed, Alfresco has done its part, we can turn it of and start our job! We will use this database as a template to ensure the structure is as Alfresco expect it to be. Let’s dump this schema (and forget about those ootb data we don’t really care about).

~$ pg_dump -h gimly -U alfresco -W -s alfresco > /tmp/pg_schema.sql

In order to avoid problems at restore time, we are going to split this file in two. One file will contain table definitions (/tmp/pg_schema_table.sql), and the second one will contains constraints and indexes (/tmp/pg_schema_fk-idx.sql). Now let’s store those files in a safe place for later use. The rest of the new ootb instance can be deleted.
Production data backup:

We now have to turn off Alfresco (the one running MySQL), in order to do the database backup. mysqldump is a perfect tools for our use case.

~$ mysqldump -u root -p --hex-blob -c --default-character-set=utf8 --compact -n --compatible=postgresql -t alfresco | \ sed -r -e 's/0x00([),])/false\1/g' \ -e 's/0x01([),])/true\1/g' \ -e "s/([(,])0x([0-9a-fA-F]*)([),])/\1decode('\2','hex')\3/g" \ -e 's/INSERT INTO "([a-zA-Z_-]+)" $(.*)$ VALUES \(/INSERT INTO "\1" (\L\2) VALUES (/g' \ > /tmp/data.sql

Here is an explanation of the different options we use:

-n : Tell mysqldump to exclude table creation statements.

-c : Tell mysqldump to incluse table names in the INSERT statements. This is to avoid issues related to column ordering within a table.

–compatible=postgresql : Tell mysqldump produce a PostgreSQL friendly output (This is unfortunately not enough for Postgres to read this file).

–hex-blob : Tell mysqldump to treat binary fields (blolb or boolean) as hexadecimal fields (offer an easier parser for tools like sed.

–default-character-set=utf8 : Set the default charset to UTF-8.

–compact: Tell mysqldump ignore some useless statements.

s/0x00([),])/false\1/g : Map hexadecimal value 0x00, which is not know to Postgres, to false (typical boolean field).

s/0x01([),])/true\1/g : Map hexadecimal value 0x01, which is not know to Postgres, to true (typical boolean field).

s/([(,])0x([0-9a-fA-F]*)([),])/\1decode(‘\2′,’hex’)\3/g : Insert the decode() function for Postgres to flawlessly reècreate binary fields.

s/INSERT INTO “([a-zA-Z_-]+)” $(.*)$ VALUES \( : lower table and columns name case.

The resulting file is almost ready, we will just add the line bellow to the top of the file:

SET statement_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = off; SET check_function_bodies = false; SET client_min_messages = warning; SET escape_string_warning = off;
Restore data to PostgreSQL:

In order to restore we will first create a new database (ensure to delete the one create on postgres at first step as its useless and could be missleading), here is how to delete the previous db from postgres and re-create it empty:

~$ dropdb alfresco ~$ createdb -Oalfresco alfresco

Now let’s restore the bare structure of the db (no index, no constraints):

~$ psql -h gimly -U alfresco -W alfresco < /tmp/pg_schema_table.sql

and then insert data:

~$ psql -h gimly -U alfresco -W alfresco < /tmp/data.sql

Note that at this step you may find some data that doesn’t fit in the new db, this is usually because of the length of thefields and is rather easy to fix by updating the table definition in the file /tmp/pg_schema_table.sql Once data have been inserted, just recreate indexes and foreign contraints:

~$ psql -h gimly -U alfresco -W alfresco < /tmp/pg_schema3411_fkIdx.sql

At this step the database is almost ready to startup Alfresco. The last issue we have to deal with are sequences. Postgres, as most RDBMS do, uses séquences to automatically increment integer values. MySQL, doesn’t provide sequences , and instead uses an integer field option: AUTO_INCREMENT. Once again we could discuss the best approach or the reason of this choice.
But the important bit here is postgreSQL sequences have to be updated in order to have Alfresco working properly.
So we will extract from MySQL the values of the involved fields and put them in a file:

mysql> use information_schema mysql> SELECT TABLE_NAME,AUTO_INCREMENT   FROM TABLES   WHERE TABLE_SCHEMA='alfresco'     AND TABLE_NAME NOT LIKE 'jbpm_%'   INTO OUTFILE '/tmp/AI.sql';

And then update the sequences accordingly in PostgreSQL:

grep -v '\N$' /tmp/AI.sql | \   awk '{print "ALTER SEQUENCE " $1 "_seq RESTART WITH " $2 ";"}' | \   psql -h gimly -U alfresco -W alfresco

One last sequence have to be updated. This one is slightly different from the others as it is shared by several tables (see Alfresco JIRA MNT-7445). To find out the value the sequence has to be updated to, we use the SQL statement bellow:

mysql> SELECT max(AUTO_INCREMENT)   FROM TABLES   WHERE TABLE_SCHEMA='alfresco'     AND TABLE_NAME LIKE 'JBPM%';

We pick the higher value and use it inthe statement bellow:

ALTER SEQUENCE hibernate_sequence RESTART WITH <VALUE> ;
Alfresco Configuration:

Last step before starting Alfresco, update Alfresco configuration to use Postgres instead of MySQL. Ths is done by editing the alfresco-global.properties file (please read official documentation):

db.driver=org.postgresql.Driver db.url=jdbc:postgresql://localhost:5432/${db.name} db.name=alfresco db.username=alfresco db.password=donttellanybody

Make sure your tomcat installation offers the necessary jar files to allow Alfresco to use the postgres jdbc driver, the restart Alfresco.

Please note that this procedure (tested against Alfresco entreprise 3.4.11 followed by an upgrade to 4.2) is not officially supported. So if you subscribed to Alfresco support, contacting them prior to try anything is always a good idea. If you want to use this on a production system, make sure you have all the necessary consistent backup, and test evert aspect of Alfresco thoroughly to make sure everything works as expected.

Posted

19 August 2013

Enterprise Content Management, SGBD, sysadmin

Alexandre Chapellon

Tags:

Comments

14 responses to “Alfresco: Migration from MySQL to PostgreSQL”

Johne774

19 July 2014

I appreciate you sharing this article.Thanks Again. Really Cool.

Reply
alex

18 September 2014

Bonjour,

Comment séparer ce fichier pg_schema.sql en 2 fichiers (pg_schema_table.sql & pg_schema_fk-idx.sql)?

Merci d’avance,

Reply
1. Alexandre Chapellon
  
  4 October 2014
  
  si je me souviens bien, le fichier de dump créé contient les indexes à la fin. Donc normalement, il chercher dans le fichier la première occurence de ‘CREATE INDEX’ ou de ‘ALTER TABLE blablabla ADD CONSTRAINT….’). Le debut du fichier devrait etre pg_schema_table.sql et la fin pg_schema_fk-idx.sql
  
  Reply
guillaume vola

20 October 2014

Bonjour,

Merci pour la procédure qui est claire et qui fonctionne à 99%.
Lors de l’import des données j’ai des erreurs dans certaines tables ACT. (workflow)
ERREUR: la colonne Â« has_start_form_key_ Â» est de type boolean mais l’expression est de type integer
LIGNE 1 : …497-4b49-466c-8774-d3f1e3537f01activitiAdhoc.png’,1,1,’1′,NU…
Je dois donc reprendre mon fichier pour rajouter des ‘1’ au lieu de 1 à certaines valeurs.
Avec l’ordre des colonnes qui change c’est assez fastidieux.
Avez vous rencontrer ce type de problème ?
Je précise que je migre de mysql à postgres sur une version 4.1.8 d’ Alfresco.
Cordialement

Reply
1. Alexandre Chapellon
  
  20 October 2014
  
  Non je n’ai pas eu ce problème sur la 3.4 que j’ai testé. Cependant, si le champ était un booléen en mysql, le cas devrait être géré par l’option ” –hex-blob” et les sed suivants. Avez vous bien utilisé cette option pour l’export? Le champs source en mysql est il bien un booléen?
  
  Reply
guillaume vola

4 November 2014

Bonjour

Voici le script que j’utilise pour transformer le dump mysql.

#!/bin/sh

INPUT_FILE=$1

sed “/./!d” $INPUT_FILE | sed “/^USE*/d” > dumppostgresql1.sql
sed “s/0x00/false/g” dumppostgresql1.sql | sed “s/0x01/true/g” | sed “s/,0x$[0-9A-F]*$/,decode(‘\1′,’hex’)/g” > dumppostgresql2.sql
sed “s/\\\’/\’\’/g” dumppostgresql2.sql > dumppostgresql_tmp.sql
sed “s/,$[0-1]$,/,’\1′,/g” dumppostgresql_tmp.sql > dumppostgresql_tmp2.sql
sed “s/,$[0-1]$,/,’\1′,/g” dumppostgresql_tmp2.sql > dumppostgresql_tmp.sql
sed “s/,$[0-1]$,/,’\1′,/g” dumppostgresql_tmp.sql > dumppostgresql_tmp2.sql
sed “s/,$[0-1]$,/,’\1′,/g” dumppostgresql_tmp2.sql > dumppostgresql_final.sql
rm -f dumppostgresql1.sql dumppostgresql2.sql dumppostgresql_tmp.sql dumppostgresql_tmp2.sql

Reply
1. Alexandre Chapellon
  
  6 November 2014
  
  Ce script ne correspond pas à ce que j’ai publié il me semble… les sed semblent un peu approximatifs et n’utilisent pas les regex étendues. Quel est le problème avec la procédure fournie sur le site? Il est important que le dump SL sur lequel vous travaillez soit fait avec les options présentes dans l’article.
  
  Reply
guillaume vola

4 November 2014

Il me reste des erreurs à cause des noms de colonnes en majuscule.

psql:dumppostgresql_final.sql:1: ERREUR: la colonne « ID_ » de la relation « act_ge_bytearray » n’existe pas
LIGNE 1 : INSERT INTO “act_ge_bytearray” (“ID_”, “REV_”, “NAME_”, “DEP…
^
psql:dumppostgresql_final.sql:2: ERREUR: la colonne « NAME_ » de la relation « act_ge_property » n’existe pas
LIGNE 1 : INSERT INTO “act_ge_property” (“NAME_”, “VALUE_”, “REV_”) VA…
^
psql:dumppostgresql_final.sql:3: ERREUR: la colonne « ID_ » de la relation « act_hi_actinst » n’existe pas
LIGNE 1 : INSERT INTO “act_hi_actinst” (“ID_”, “PROC_DEF_ID_”, “PROC_I…
^
psql:dumppostgresql_final.sql:4: ERREUR: la colonne « ID_ » de la relation « act_hi_comment » n’existe pas
LIGNE 1 : INSERT INTO “act_hi_comment” (“ID_”, “TYPE_”, “TIME_”, “USER…
^
psql:dumppostgresql_final.sql:5: ERREUR: la colonne « ID_ » de la relation « act_hi_detail » n’existe pas
LIGNE 1 : INSERT INTO “act_hi_detail” (“ID_”, “TYPE_”, “PROC_INST_ID_”…
^
psql:dumppostgresql_final.sql:6: ERREUR: la colonne « ID_ » de la relation « act_hi_identitylink » n’existe pas
LIGNE 1 : INSERT INTO “act_hi_identitylink” (“ID_”, “GROUP_ID_”, “TYPE…
^
psql:dumppostgresql_final.sql:7: ERREUR: la colonne « ID_ » de la relation « act_hi_procinst » n’existe pas
LIGNE 1 : INSERT INTO “act_hi_procinst” (“ID_”, “PROC_INST_ID_”, “BUSI…
^
psql:dumppostgresql_final.sql:8: ERREUR: la colonne « ID_ » de la relation « act_hi_taskinst » n’existe pas
LIGNE 1 : INSERT INTO “act_hi_taskinst” (“ID_”, “PROC_DEF_ID_”, “TASK_…
^
psql:dumppostgresql_final.sql:9: ERREUR: la colonne « ID_ » de la relation « act_hi_varinst » n’existe pas
LIGNE 1 : INSERT INTO “act_hi_varinst” (“ID_”, “PROC_INST_ID_”, “EXECU…
^
psql:dumppostgresql_final.sql:10: ERREUR: la colonne « ID_ » de la relation « act_re_deployment » n’existe pas
LIGNE 1 : INSERT INTO “act_re_deployment” (“ID_”, “NAME_”, “DEPLOY_TIM…
^
psql:dumppostgresql_final.sql:11: ERREUR: la colonne « ID_ » de la relation « act_re_procdef » n’existe pas
LIGNE 1 : INSERT INTO “act_re_procdef” (“ID_”, “CATEGORY_”, “NAME_”, “…
^
psql:dumppostgresql_final.sql:12: ERREUR: la colonne « ID_ » de la relation « act_ru_execution » n’existe pas
LIGNE 1 : INSERT INTO “act_ru_execution” (“ID_”, “REV_”, “PROC_INST_ID…
^
psql:dumppostgresql_final.sql:13: ERREUR: la colonne « ID_ » de la relation « act_ru_identitylink » n’existe pas
LIGNE 1 : INSERT INTO “act_ru_identitylink” (“ID_”, “REV_”, “GROUP_ID_…
^
psql:dumppostgresql_final.sql:14: ERREUR: la colonne « ID_ » de la relation « act_ru_task » n’existe pas
LIGNE 1 : INSERT INTO “act_ru_task” (“ID_”, “REV_”, “EXECUTION_ID_”, “…
^
psql:dumppostgresql_final.sql:15: ERREUR: la colonne « ID_ » de la relation « act_ru_variable » n’existe pas
LIGNE 1 : INSERT INTO “act_ru_variable” (“ID_”, “REV_”, “TYPE_”, “NAME…

Reply
1. Alexandre Chapellon
  
  6 November 2014
  
  Ceci devrait être géré par la commande sed -e 's/INSERT INTO "([a-zA-Z_-]+)" $(.*)$ VALUES \(/INSERT INTO "\1" (\L\2) VALUES (/g' \ présente dans l’article. Encore une fois, merci de suivre la procédure.
  
  Reply
Ray

25 August 2015

I am trying to achieve a migration from mysql to postgresql on Alfresco Community 4.0.d prior to proceding with an upgrade to Alfresco One 5.x. I hop you can help me to resolve this difficulty. Ubuntu Server 14.04 LTS in a VirtualBox so we can’t do any damage to the production server!

I’ve installed the fresh postgresql enabled copy, started-up and shut down alfresco server. I have postgresql running per your instructions. Here’s the output from my attempt get the schema out following your instruction above:

pg_dump -h 127.0.0.1 -p 5432 -U alfresco -W -s alfresco > schemadump.sql
Password:
pg_dump: column number -1 is out of range 0..21
pg_dump: column number -1 is out of range 0..21
pg_dump: cannot duplicate null pointer

All help would be greatly, greatly appreciated.

Reply
1. Alexandre Chapellon
  
  29 August 2015
  
  Hi Ray,
  
  From what I can see this is a postgresql bug. Which version of potgresql are you using?
  This bug seems to be related with your issue:
  http://www.postgresql.org/message-id/20140305120520.1918.871@wrigleys.postgresql.org
  Please also note I never tested this procdure with alfresco 5. It may work ootb or need some tweaks.
  
  Reply
  1. Ray
    
    18 September 2015
    
    Hi,
    
    Many thanks for your response. I’ve been on holiday – hence the delay…
    
    I’m only going to attempt the upgrade to Alfresco 5 after I’ve successfully migrated from mysql to postgres within the existing Alfresco 4.0.d version. So my version of postgres is whichever comes rolled in with Alfresco 4.0.d default wizard installation: version 9.0.2 if they matched their pre-requisite notes here https://wiki.alfresco.com/wiki/Alfresco_Community_4.0.d_Release_Notes
    
    I was sticking to your instructions as you seem to have tested with 3.x and 4.x versions of Alfresco, so I was really surpised to hit this problem straight away.
    
    Thansk for further advice orsuggestions.
  2. Alexandre Chapellon
    
    24 September 2015
    
    Actually, iirc, this procedure has only on 3.4. Having said that, it should work on 4. However I would highly recommend deploying yourself tomcat, postgres and other needed component from your package manager and then use Alfresco war (I never tested the procedure with the installer). This is a lot easier to maintain and is definitely the way to go for a production system.
    In your case the postgres shipped with this crappy installer is so old it is certainly impacted by the bug in mentioned earlier.
    To proceed quickly, you can try to install a fresh and decent version of postgres, beside the old one (run it on a different port) and change your alfreaco-global.properties file to point to the new pgsql. Then try to go ahead with the procedure.
Paul M

1 December 2015

THANK YOU! This saved my ass on migrating from Alfresco 4.2 Community MySQL (which I was using from an OOTB bitnami installation) -> Alfresco 5.0d Community Postgres!

It feels good to be on Postgres now, but what a pain to get there!

Reply

Alfresco: Migration from MySQL to PostgreSQL

Comments

14 responses to “Alfresco: Migration from MySQL to PostgreSQL”

Leave a Reply to guillaume vola Cancel reply