Monday, August 15, 2011

Slony failure test cases / Stopping the postgresql service kindly / non-empty tables

Slony node failure test cases

I'll be testing failure scenarios within the slony cluster operation.

Environment:


1 master node, and 2 slave nodes subscribed to the master node.

+Case 1. Master node goes off.

  • Let's stop the postgresql service running on the master node.


Stopping the postgresql service kindly

Beware: while there are active connections being served, the server won't stop with a simple pg_ctl stop (the default "smart" mode waits for every client to disconnect first). You are advised to use one of the built-in shutdown modes instead of simply killing the postgres process, as that could result in unexpected data loss or some sort of corruption.

http://www.question-defense.com/2008/10/17/pg_ctl-server-does-not-shut-down-force-postgres-to-shutdown
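For reference, a minimal sketch of the built-in shutdown modes, assuming $PGDATA points at the master's data directory and pg_ctl is on the PATH (your init scripts may wrap this differently):

# "smart" waits for all clients to disconnect; with slon daemons permanently
# connected it can wait forever.
pg_ctl -D "$PGDATA" stop -m smart

# "fast" disconnects clients and rolls back open transactions -- the usual
# choice when you want the server down now, without corruption.
pg_ctl -D "$PGDATA" stop -m fast

# "immediate" aborts all backends without a clean shutdown; postgres will run
# crash recovery on the next start. Last resort only.
# pg_ctl -D "$PGDATA" stop -m immediate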

Result:

  • The slon daemon enters sleep mode whilst waiting for the master server to be available again.




Bringing the master node back to life restores the cluster's normal operation.



+Case 2. One slave node goes off.

The question here is whether everything else keeps working.

Issuing a simple update in the master node:

update pgbench_history set tid = 15 where id > 1400000;

# there were 100000 rows affected by this update

and then querying "select count(*) from pgbench_history where tid = 15;" on the remaining nodes gives us the expected result.
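A quick way to run that check end to end, assuming the database is called pgbench and the hosts are named master and slave2 (all placeholders for your own setup):

psql -h master -d pgbench -c "update pgbench_history set tid = 15 where id > 1400000;"

# a moment later, on the surviving slave:
psql -h slave2 -d pgbench -c "select count(*) from pgbench_history where tid = 15;"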

+Case 2a. Bringing the broken slave node back to life.

If I start the postgres service and the slon daemon again on the node that was purposely "shut down", it starts to catch up until it is properly synchronized.
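Roughly what that looks like, assuming a cluster named mycluster, a database called pgbench and a dedicated slony user (all of these names are placeholders):

# start postgres on the recovered slave
pg_ctl -D "$PGDATA" start

# restart the slon daemon for that node; it will pick up the pending SYNC events
slon mycluster "dbname=pgbench host=slave1 user=slony" &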



The issue here is: what's the maximum time window within which you can re-attach a "broken" node so it can catch up with the missing SYNC packages? As far as I understand, the synchronization happens through SYNC events whose data is stored... (where the hell is it stored?). I need to figure that out in order to know whether a re-attached node can update its state no matter how long ago it was left behind, or whether it's more convenient to recreate the tables involved in the replication set from the ground up, or truncate them.
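From what I can tell, the pending row changes sit in the sl_log_1 / sl_log_2 tables inside the cluster's own schema on the origin (e.g. "_mycluster"), and they are only trimmed once every subscriber has confirmed them, so a subscribed-but-offline node makes them grow. Assuming a cluster named mycluster and a database called pgbench, something like this shows how far behind each subscriber is:

psql -h master -d pgbench -c 'select * from "_mycluster".sl_status;'

# unconfirmed row changes accumulate here while a subscriber is down:
psql -h master -d pgbench -c 'select count(*) from "_mycluster".sl_log_1;'
psql -h master -d pgbench -c 'select count(*) from "_mycluster".sl_log_2;'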

Gotcha: non-empty tables when subscribing nodes

  • When subscribing a node to a replication set, you'd better make sure the tables involved in the set are empty on that node, otherwise the initial replication can take longer than it should, especially if they're big (see the sketch below).
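One hedged way to clear them before subscribing, assuming the usual pgbench tables in a database called pgbench on the node slave2 (substitute your own table list and host):

psql -h slave2 -d pgbench -c "truncate pgbench_history, pgbench_accounts, pgbench_branches, pgbench_tellers;"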
