When the Replication stops working: analysis and resync of MySQL Replication

Quando la Replication smette di funzionare: analisi, ripristino e resync della MySQL Replication

We already had the opportunity to talk about Replication between MySQL Database in a previous article, where we described it as a great way to increase the security and reliability of data storage without spending a fortune. In this article we will see what to do when the Replication stops working: how to notice, what can I do to restore it and, most importantly, how to ensure that the data is re-synchronized.

Identify causes

The first thing to do is make sure that the Replication is actually broken. Although in most cases we can clearly see it by just looking at the replicated data, we need to check it in an objective way. In order to do this go to the Slave, issue the command – or if you prefer to read the results in human-readable format – and go read the column contents: and : if there’s at least a NO, that means that the Replication is actually broken, otherwise the problem is attributable to other causes.

Once you have confirmed that Replication is broken, second thing to do is try to figure out what caused the fact. Regardless of our configuration these are, in order of probability, the root causes that result in aborting a Replication:

  • Presence of one (or more) query that ran fine on the Master – thus being copied into the binary log – but failed on the Slave for some reason.
  • Sudden shutdown of the Master and/or Slave host, resulting in an inability to connect that eventually led to a Connect Retry Timeout.
  • Stability problems of the Slave or the server machine: corrupted tables, broken indexes, hard-drive full or faulty, other hardware problems & such.

Let’s look at these scenarios and try to identify appropriate strategies to prevent or fix the issue for each one of them.

Presence of query with errors

Do this is very simple: just connect to the Slave, type the command and read the contents of the column. If this column is blank your problem lies elsewhere, otherwise keep reading.

The presence of a query containing errors is without a doubt the most common cause of the Replication failure: we can say that two out of three people among all those who read this article will realize that they have this problem.

Let’s try to understand what happened and why. How did a query cause errors on the Slave whilst being properly processed by the Master, to the point it ended up in the binary log? The possibilities are many, although not always obvious when looking at the query in question. Let’s try to look at the main causes and find some possible workarounds that could be applied to our system before actually restoring the Replication.

Causes

  • If the query is an , it is likely that the error is due to the fact that the specified already exists in the table of the Slave. In many cases this is due to the fact that both the Master and the Slave are hosting a website, application or service that provides for the automatic insertion of data within the DB based on a time schedule such as every night, every week, etc. , or as a result of operations performed by the user during login, logout, when you open a particular page/screen, etc.: these automatic tasks take place both on the Master and the Slave since both environments have one instance of this application which is configured to perform the same operations. To better understand this, let’s take an example: imagine a Web Application that every night does the cron job to insert a row into the table. The table requires a primary key based on the date in question, which means that each record will be created with a unique ID. Such ID will be identical to that generated by any other instance of Web Application: therefore, assuming two or more linked instances connected to different DB in Replication between them, these automatic queries will conflict with those in the Master-generated binary log, potentially knocking our Replication out of work.
  • If the query is an , it is likely that the error is due to the fact that the Slave doesn’t (yet or anymore) contain a record that was actually changed on the Master.  Once again, the reason is almost always due to the presence of some automatic DB task such as a maintenance script, auto-delete cron job etc. which could trigger here and not there. A common example of that could be a maintenance job periodically clearing the DB records of some uploaded media contents if they do not match a file physically present on the HDD. As soon as the job runs on the Slave, which is not in production and thus doesn’t receive the actual files, the “orphan” records will be deleted: any subsequent  in these rows will knock the replication out.

Workaround

Given that each specific scenario has their “best” solutions, we should at least be aware of the following general things:

Insert a switch

As a general rule of thumb, your app should always give you the chance to optionally turn off any cron job that might cause a replication break: a simple switch in the   /   will be enough in most cases. That way you can define which instance will take care of these executions and which not. if you’re building a no-single-point-of-failure environment you can achieve that via a semaphore-based mechanism to ensure that only the first available/running node will perform the cron job, preventing the other ones to do it.

Use the IGNORE directive

If the offending queries do not contain any instance-specific information that is running and therefore can be “skipped” without creating problems, you might as well use the   directive, so any duplicate  or invalid   will be discarded without breaking the replication. You need to be very careful when you do that – expecially if you’re dealing with offending   queries – since such method could easily create a mismatch between Master and Slave.

Exclude queries from the Binary Log

If the offending queries need to be executed locally on each instance because they contain specific information about variables or potentially dependent on individual circumstances, the best solution might be to exclude these queries from the binary log. You can do this using the   MySQL session variable, which can be set to zero by executing    just before one (or more) query. If your application features session recycling or re-using, it is highly recommended that you restore the original value after each set of relevant queries by inserting the   directive at the end of the MySQL statement, like in the following example:

Restoring the Replication

Once we determined the causes of the stop – and applied the appropriate workaround strategies – let’s see how we can restore the Replication by making it working again. If the problem is due to the presence of a limited number of queries belonging to the same type you can try to make the Slave “skip” these blocking queries from the binary-log; if you’re dealing with a large number of queries that belong to different types, it is highly advisable to carry out a full realignment of the Slave: whilst it is a long and potentially tiring process, although almost entirely automatic, it’s often the only way you have to properly resync your data between the Master and the Slave(s).

Skip a single query

This statement can be used on any Slave to skip the first topmost query (the one that’s giving errors) contained inside the binary log:

This statement basically makes the Slave jump  the offending query and resume the execution of the binary log starting with the following one. Immediately after executing this you should rescan the status of Replication with  to see if the issue is resolved or not. You need to look at the following three values:

  •   and   , which should be both set to Yes.
  • , indicating the number of seconds between the most recent query ran to the Master (and thus present into the binary log) and the query processed by the Slave: in other words, the number of seconds separating the Slave from the Master.

If the issue is resolved the third value will decrease at high speed, while the other two values will keep their value of Yes for the whole time. Otherwise you’ll reach a new offending query, meaning you need to repeat these steps. You can see the new problematic query by looking again at the   column from the   command results.

Important: Using the   statement could be troublesome if you’re actively using transactions. If you do, we highly recommend to read the excellent article on Percona boys regarding that topic. If you have never heard of Percona and you’re working a lot with MySQL you should really spend some time to fill this huge gap, starting with their Wikipedia page and continuing with the Percona Toolkit Overview from their official website – which will probably make you regret that you’re not using MySQL on Linux.

Full realignment of the Slave

If the skip of individual queries couldn’t solve your problem, you will need to go for a full realignment between the Master and your Slave(s): the procedure is not complicated but will block your Master DB for a few seconds, so you’ll need to choose the appropriate time to do that. When the time comes, connect to your Master using the MySQL command line tool or with a management GUI software such as SQLyog and run the following commands:

Right after that, while keeping this connection open, issue a full database dump, taking care to specify the parameters that will be used to immediately restore the replication once finished:

Immediately after issuing the  command you can release the LOCK to the tables, so that your DB can be accessed again. You can do that by going back to the first opened connection and typing the following command:

As soon as the dump process is completed, copy the resuling dump file on the PC hosting the Slave DB and import it with the following command:

Then restart the Slave:

The replication will start automatically, also syncing the changes made to the Master DB during the import process.

 

Sudden shutdown of the Master and/or Slave

If the replication breakage is due to a prolonged disconnect one of the two machines, the first thing to do is to try to restart it manually with a simple:

As soon as you do that you can check the Replication status with the usual  and looking at the   and   columns. If they are set to Yes, you only need to check them periodically until the   column value will reach zero, otherwise we will have to proceed with a full Slave Realignment as described above.

Stability problems of the Slave or the host machine

If you are dealing with corrupted tables, indexes, or worse still malfunctioning problems related to hardware failure, you’re Slave data integrity is probably gone: even if you manage to restart the replication by skipping queries or by restarting the Slave, you could end up with some serious data mismatch due to the cause that you won’t be able to really know if the binary log has been properly processed or not. That’s why we strongly advice to proceed with a full Slave Realignment as described above, unless you’re able to determine precisely if you’re data integrity has been preserved or not.

That’s all for now: happy coding!

 

RELATED POSTS

About Ryan

IT Project Manager, Web Interface Architect and Lead Developer for many high-traffic web sites & services hosted in Italy and Europe. Since 2010 it's also a lead designer for many App and games for Android, iOS and Windows Phone mobile devices for a number of italian companies.

View all posts by Ryan