最近跟同事测试了下MHA 作为mysql 高可用方案的一种 MHA具有很多优点 DeNA目前采用这种架构 支撑2亿+PV的访问

MHA
一.MHA介绍

MHA自动化主服务器故障转移,快速将从服务器晋级为主服务器(通常在10-30s),而不影响复制的一致性,不需要花钱买更多的新服务器,不会有性能损耗,容易安装,不必更改现有的部署环境,适用于任何存储引擎。

MHA提供在线主服务器切换,改变先正运行的主服务器到另外一台上,这个过程只需0.5-2s的时间,这个时间内数据无法写入。

MHA Manager通过ssh连接mysql slave服务器。

虽然MHA试图从挡掉的主服务器上保存二进制日志,并不是总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失最新数据。

使用半同步复制,可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志,MHA可以将最新的二进制日志应用于其他所有的slave服务器上,因此他们彼此保持一致性。

MHA有如下特性:

1.主服务器的自动监控和故障转移

MHA监控复制架构的主服务器,一旦检测到主服务器故障,就会自动进行故障转移。即使有些从服务器没有收到最新的relay log,MHA自动从最新的从服务器上识别差异的relay log并把这些日志应用到其他从服务器上,因此所有的从服务器保持一致性了。MHA通常在几秒内完成故障转移,9-12秒可以检测出主服务器故障,7-10秒内关闭故障的主服务器以避免脑裂,几秒中内应用差异的relay log到新的主服务器上,整个过程可以在10-30s内完成。还可以设置优先级指定其中的一台slave作为master的候选人。由于MHA在slaves之间修复一致性,因此可以将任何slave变成新的master,而不会发生一致性的问题,从而导致复制失败。

2.交互式主服务器故障转移

可以只使用MHA的故障转移,而不用于监控主服务器,当主服务器故障时,人工调用MHA来进行故障故障。

3.非交互式的主故障转移

不监控主服务器,但自动实现故障转移。这种特征适用于已经使用其他软件来监控主服务器状态,比如heartbeat来检测主服务器故障和虚拟IP地址接管,可以使用MHA来实现故障转移和slave服务器晋级为master服务器。

4.在线切换主服务器

在许多情况下,需要将现有的主服务器迁移到另外一台服务器上。比如主服务器硬件故障,RAID控制卡需要重建,将主服务器移到性能更好的服务器上等等。维护主服务器引起性能下降,导致停机时间至少无法写入数据。另外,阻塞或杀掉当前运行的会话会导致主主之间数据不一致的问题发生。MHA提供快速切换和优雅的阻塞写入,这个切换过程只需要0.5-2s的时间,这段时间内数据是无法写入的。在很多情况下,0.5-2s的阻塞写入是可以接受的。因此切换主服务器不需要计划分配维护时间窗口(呵呵,不需要你在夜黑风高时通宵达旦完成切换主服务器的任务)。

注:MHA可以应用于任何复制结构

二. MHA所需条件

1.SSH公钥验证

MHA管理节点通过ssh连接mysql服务器,MHA节点通过scp发送最新的relay log到其他slaves服务器上。为了使这些过程自动化,使用SSH公钥验证密码。

2.操作系统

MHA目前只支持Linux系统

3.单台可写主服务器和多台从服务器或只读主服务器

当主服务器当掉时,MHA修复从服务器之间数据一致性。MHA试图从当掉的主服务器上保存尚未发送的二进制日志文件并应用于所有从服务器。如果只有一个从服务器,就不需在意从服务器之间一致性问题。即使只有一个从服务器,MHA也会从当掉的主服务器上保存尚未发送的二进制日志事件只要能通过ssh访问到主服务器。使用半同步复制可以解决当主服务器当掉后,无法ssh到主服务器上保存尚未发送的二进制日志事件。

从MHA Manager0.52版本开始,支持多主复制结构。只允许其中一台主服务器可写,其他主服务器必须设置read-only=1。默认情况下,被管理服务器应该是两层复制结构。

4.在三层或三层以上复制情况下

默认情况下,MHA不支持三层或三层以上的复制结构。如master1—master2—-slave3。MHA故障转移和恢复是直接从从服务器中选择一台作为当前的主主服务器。MHA可以管理master1和master2,当master1当掉后,将master2作为主,MHA不会监控和恢复slave3因为slave3是从不同的主服务器上(master2)复制的。为了使MHA工作在这种架构下,需要做如下设置:

只在MHA配置文件中配置master1和master2

在MHA配置文件中所有主机上设置multi_tier_slave=1

在这种情况下,MHA只管理主主服务器和二层的从服务器,在故障转移过程中,三层从服务器仍然可以正常工作的。

5.mysql版本5.0或更高

MHA支持mysql5.0或以上版本。因为从mysql5.0版本起二进制日志格式(binlog v4格式)改变了。当MHA解析二进制日志来确定目标中继日志位置,是使用v4格式的。MySQL版本不得低于5.0.60。

6.mysqlbinlog版本3.3或更高

MHA在目标从服务器上应用二进制事件使用mysqlbinlog。如果主服务器使用基于行格式复制,基于行格式的事件写入到二进制文件中,这种二进制日志格式的文件只能被MySQL5.1或更高版本的mysqlbinlog解析。MySQL5.0.60以下版本中的mysqlbinlog不支持基于行格式的。

7.候选主服务器log-bin必须开启

如果当前的从服务器没有开启log-bin,那么将不可能成为主服务器。MHA管理节点会检测是否有配置log-bin。如果当前所有从服务器都没有设置log-bin,那么MHA不进行故障转移。

8.所有服务器上的二进制日志和中继日志过滤规则必须相同

binlog-do-db和replicate-ignore-db设置必须相同。MHA在启动时候会检测过滤规则,如果过滤规则不同,MHA不启动监控和故障转移。

9.候选主服务器上的复制用户必须存在

当故障转移后,所有从服务器上将执行change master to命令。

10.保留中继日志和定期清理

默认情况下,从服务器上的中继日志在SQL线程执行完后会被自动删除的。但是这些中继日志在恢复其他从服务器时候可能会被用到,因此需要禁用中继日志的自动清除和定期清除旧的中继日志。定期清除中继日志需要考虑到复制延时的问题。在ext3文件系统下,删除大的文件需要一定的时间,会导致严重的复制延时。为了避免复制延时,暂时为中继日志创建硬链接。

MHA节点包含pure_relay_logs命令工具,它可以为中继日志创建硬链接,执行SET GLOBAL relay_log_purge=1,等待几秒中以便SQL线程切换到新的中继日志,再执行SET GLOBAL relay_log_purge=0。

pure_relay_logs参数如下所示:
–user mysql用户名
–password mysql密码
–host mysql服务器地址
–port 端口号
–workdir 创建和删除中继日志硬链接目录。成功执行脚本后,硬链接的中继日志文件将被删除。默认目录是/var/tmp。

–disable_relay_log_purge 如果relay_log_purge=1,purge_relay_logs脚本将退出不做任何事情。设置–disable_relay_log_purge参数,purge_relay_logs脚本不会退去,且自动设置relay_log_purge=0。

定期执行purge_relay_logs脚本:

Purge_relay_logs脚本删除中继日志不会阻塞SQL线程。因此在每台从服务器上设置计划任务定期清除中继日志。

00 00 * * * /usr/bin/purge_relay_logs –user=root –password=passwd –disable_relay_log_purge >> /data/masterha/log/purge_relay_logs.log 2>&1

最好在每台从服务器上不同时间点执行计划任务。

11. LOAD DATA INFILE不要使用基于语句型的二进制日志

如果使用非事务性存储引擎,在执行完LOAD DATA INFILE基于语句型二进制日志时,主服务器当掉,MHA可能不会产生差异的中继日志事件。使用LOAD DATA INFILE基于语句型二进制日志有一些已知问题,在mysql5.1版本中不建议使用,同时还会引起严重的复制延时,因此没有理由使用它。

三.MHA过程

1.监控和故障转移过程
检测复制设置和确定当前主服务器
监控主服务器
检测主服务器当掉
再次检测从服务器配置
关闭当掉的主服务器(可选)
恢复一个新的主服务器
激活新的主服务器
恢复其余的从服务器
告警(可选)

2.在线切换过程

检测复制设置和确定当前主服务器
确定新的主服务器
阻塞写入到当前主服务器
等待所有从服务器赶上复制
授予写入到新的主服务器
重新设置从服务器

缺点:脚本需要研究,模拟场景处理。

应用要考虑的问题:
应用架构要考虑两个问题(在一个特定的集群下):
1 自动识别master和slave的问题(master机器可能会切换)
2 负载均衡的问题(我们可以定义大概的读写比例,每台机器可承担的负载比例,当有机器离开集群时,需要考虑这个)

比如说下面的场景:
From:
db-181 (current master)
+–db-183
+–db-184

To:
db-183 (new master)
+–db-184

这个方案需要机器至少4台(1 master+ 3 slaves(其中一个专门作为备份))。

————————————————————–

准备,机器安排,db-182为管理机:
db-181 (current master)
+–db-183
+–db-184

配置ssh:
mkdir ~/.ssh
chmod 700 ~/.ssh
ssh-keygen -t rsa
ssh-keygen -t dsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

ssh db-181 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh db-181 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
ssh db-183 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh db-183 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
ssh db-184 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh db-184 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

scp ~/.ssh/authorized_keys db-181:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys db-183:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys db-184:~/.ssh/authorized_keys

1 所有机器安装DBD-mysql-4.020

mkdir /tmp/mysql-static
cp /usr/lib64/mysql/*.a /tmp/mysql-static
perl Makefile.PL –libs=”-L/tmp/mysql-static -lmysqlclient”
make
make test
make install

2 所有机器Installing MHA Node

You can also install MHA Node from source.
## Install DBD::mysql if not installed

tar -zxf mha4mysql-node-0.53.tar.gz
cd mha4mysql-node-0.53
perl Makefile.PL
make
make install

3 管理机Installing MHA Manager
## Install DBD::mysql if not installed
rpm -ivh perl-Config-Tiny-2.12-1.el5.rf.noarch.rpm
rpm -ivh perl-Params-Validate-0.95-1.el5.rf.x86_64.rpm
rpm -ivh perl-Log-Dispatch-2.26-1.el5.rf.noarch.rpm
rpm -ivh perl-Parallel-ForkManager-0.7.5-2.2.el5.rf.noarch.rpm

tar -zxf mha4mysql-manager-0.53.tar.gz
cd mha4mysql-manager-0.53
perl Makefile.PL
make
make install

4 配置文件
[server default]
manager_workdir=/masterha/app1
manager_log=/masterha/app1/manager.log
user=root
password=oracle
ssh_user=root
repl_user=rep
repl_password=rep
ping_interval=1

[server1]
hostname=db-181
master_binlog_dir=/data/mysql
candidate_master=1
[server2]
hostname=db-183
master_binlog_dir=/data/mysql
candidate_master=1
[server3]
hostname=db-184
master_binlog_dir=/data/mysql

5 root 的权限及MASTER_HOST(vip待试验)

CHANGE MASTER TO
MASTER_HOST=’10.2.0.181′,
MASTER_USER=’rep’,
MASTER_PASSWORD=’rep’,
MASTER_LOG_FILE=’mysql-bin.000004′,
MASTER_LOG_POS=107;

db-181
mysql> grant all privileges on *.* to ‘root’@’db-182’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to ‘root’@’db-181’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to ‘root’@’10.2.0.182’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

db-183
mysql> grant all privileges on *.* to ‘root’@’10.2.0.182’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to ‘root’@’db-182’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to ‘root’@’db-183’ identified by ‘oracle’;
Query OK, 0 rows affected (0.01 sec)

db-184
mysql> grant all privileges on *.* to ‘root’@’10.2.0.182’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to ‘root’@’db-182’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to ‘root’@’db-184’ identified by ‘oracle’;
Query OK, 0 rows affected (0.00 sec)

6 测试

db-181 (current master)
+–db-183
+–db-184

a db-183 停止传输日志(模拟日志不一致)
mysql> stop slave io_thread;
Query OK, 0 rows affected (0.00 sec)

b db-184 继续传输日志

c db-181 sysbench(模拟生产)
[root@db-182 bin]# ./sysbench –test=oltp –mysql-table-engine=innodb –oltp-table-size=1000000 –mysql-user=sbtest –mysql-password=123456 –mysql-host=db-181 –mysql-socket=/data/mysql/mysql.sock –num-threads=10 –max-requests=100000 –oltp-test-mode=complex run
sysbench 0.4.12: multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
Running the test with following options:
Number of threads: 10

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using “BEGIN” for starting transactions
Using auto_inc on the id column
Maximum number of requests for OLTP test is limited to 100000
Threads started!
Done.

OLTP test statistics:
queries performed:
read: 1400084
write: 500030
other: 200012
total: 2100126
transactions: 100006 (1261.39 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 1900114 (23966.46 per sec.)
other operations: 200012 (2522.78 per sec.)

Test execution summary:
total time: 79.2822s
total number of events: 100006
total time taken by event execution: 792.0176
per-request statistics:
min: 3.07ms
avg: 7.92ms
max: 344.96ms
approx. 95 percentile: 11.93ms

Threads fairness:
events (avg/stddev): 10000.6000/42.99
execution time (avg/stddev): 79.2018/0.00

d db-183开启io_thread; db-181 kill mysql ,mysql_safe(模拟master crash)

e MHA 日志(参考http://code.google.com/p/mysql-master-ha/wiki/Sequences_of_MHA)
Sat Apr 7 17:22:45 2012 – [info] MHA::MasterMonitor version 0.53.
Sat Apr 7 17:22:45 2012 – [info] Dead Servers:
Sat Apr 7 17:22:45 2012 – [info] db-181(10.2.0.181:3306)
Sat Apr 7 17:22:45 2012 – [info] Alive Servers:
Sat Apr 7 17:22:45 2012 – [info] db-183(10.2.0.183:3306)
Sat Apr 7 17:22:45 2012 – [info] db-184(10.2.0.184:3306)
Sat Apr 7 17:22:45 2012 – [info] Alive Slaves:
Sat Apr 7 17:22:45 2012 – [info] db-183(10.2.0.183:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:45 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:45 2012 – [info] Primary candidate for the new Master (candidate_master is set)
Sat Apr 7 17:22:45 2012 – [info] db-184(10.2.0.184:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:45 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:45 2012 – [warning] MySQL master is not currently alive!
Sat Apr 7 17:22:45 2012 – [info] Checking slave configurations..
Sat Apr 7 17:22:45 2012 – [warning] relay_log_purge=0 is not set on slave db-183(10.2.0.183:3306).
Sat Apr 7 17:22:45 2012 – [warning] relay_log_purge=0 is not set on slave db-184(10.2.0.184:3306).
Sat Apr 7 17:22:45 2012 – [info] Checking replication filtering settings..
Sat Apr 7 17:22:45 2012 – [info] Replication filtering check ok.
Sat Apr 7 17:22:45 2012 – [info] Starting SSH connection tests..
Sat Apr 7 17:22:46 2012 – [info] All SSH connection tests passed successfully.
Sat Apr 7 17:22:46 2012 – [info] Checking MHA Node version..
Sat Apr 7 17:22:47 2012 – [info] Version check ok.
Sat Apr 7 17:22:47 2012 – [info] Getting current master (maybe dead) info ..
Sat Apr 7 17:22:47 2012 – [info] Identified master is db-181(10.2.0.181:3306).
Sat Apr 7 17:22:47 2012 – [info] Checking SSH publickey authentication settings on the current master..
Sat Apr 7 17:22:47 2012 – [info] HealthCheck: SSH to db-181 is reachable.
Sat Apr 7 17:22:47 2012 – [info] Master MHA Node version is 0.53.
Sat Apr 7 17:22:47 2012 – [info] Checking recovery script configurations on the current master..
Sat Apr 7 17:22:47 2012 – [info] Executing command: save_binary_logs –command=test –start_pos=4 –binlog_dir=/data/mysql –output_file=/masterha/app1/save_binary_logs_test –manager_version=0.53 –start_file=mysql-bin.000008
Sat Apr 7 17:22:47 2012 – [info] Connecting to root@db-181(db-181)..
Creating /masterha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data/mysql, up to mysql-bin.000008
Sat Apr 7 17:22:47 2012 – [info] Master setting check done.
Sat Apr 7 17:22:47 2012 – [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sat Apr 7 17:22:47 2012 – [info] Executing command : apply_diff_relay_logs –command=test –slave_user=root –slave_host=db-183 –slave_ip=10.2.0.183 –slave_port=3306 –workdir=/masterha/app1 –target_version=5.5.16-log –manager_version=0.53 –relay_log_info=/data/mysql/relay-log.info –relay_dir=/data/mysql/ –slave_pass=xxx
Sat Apr 7 17:22:47 2012 – [info] Connecting to root@10.2.0.183(db-183:22)..
Checking slave recovery environment settings..
Opening /data/mysql/relay-log.info … ok.
Relay log found at /data/mysql, up to db-183-relay-bin.000003
Temporary relay log file is /data/mysql/db-183-relay-bin.000003
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Sat Apr 7 17:22:47 2012 – [info] Executing command : apply_diff_relay_logs –command=test –slave_user=root –slave_host=db-184 –slave_ip=10.2.0.184 –slave_port=3306 –workdir=/masterha/app1 –target_version=5.5.16-log –manager_version=0.53 –relay_log_info=/data/mysql/relay-log.info –relay_dir=/data/mysql/ –slave_pass=xxx
Sat Apr 7 17:22:47 2012 – [info] Connecting to root@10.2.0.184(db-184:22)..
Checking slave recovery environment settings..
Opening /data/mysql/relay-log.info … ok.
Relay log found at /data/mysql, up to db-184-relay-bin.000002
Temporary relay log file is /data/mysql/db-184-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Sat Apr 7 17:22:48 2012 – [info] Slaves settings check done.
Sat Apr 7 17:22:48 2012 – [info]
db-181 (current master)
+–db-183
+–db-184

Sat Apr 7 17:22:48 2012 – [warning] master_ip_failover_script is not defined.
Sat Apr 7 17:22:48 2012 – [warning] shutdown_script is not defined.
Sat Apr 7 17:22:48 2012 – [error][/usr/lib/perl5/site_perl/5.8.8/MHA/Server.pm, ln445] Checking slave status failed on db-183(10.2.0.183:3306). err=Got error when executing SHOW SLAVE STATUS. MySQL server has gone away
Sat Apr 7 17:22:48 2012 – [info] Set master ping interval 1 seconds.
Sat Apr 7 17:22:48 2012 – [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Sat Apr 7 17:22:48 2012 – [info] Starting ping health check on db-181(10.2.0.181:3306)..
Sat Apr 7 17:22:48 2012 – [warning] Got error on MySQL connect: 2003 (Can’t connect to MySQL server on ‘10.2.0.181’ (111))
Sat Apr 7 17:22:48 2012 – [warning] Connection failed 1 time(s)..
Sat Apr 7 17:22:48 2012 – [info] Executing SSH check script: save_binary_logs –command=test –start_pos=4 –binlog_dir=/data/mysql –output_file=/masterha/app1/save_binary_logs_test –manager_version=0.53 –binlog_prefix=mysql-bin
Sat Apr 7 17:22:48 2012 – [info] HealthCheck: SSH to db-181 is reachable.
Sat Apr 7 17:22:49 2012 – [warning] Got error on MySQL connect: 2003 (Can’t connect to MySQL server on ‘10.2.0.181’ (111))
Sat Apr 7 17:22:49 2012 – [warning] Connection failed 2 time(s)..
Sat Apr 7 17:22:50 2012 – [warning] Got error on MySQL connect: 2003 (Can’t connect to MySQL server on ‘10.2.0.181’ (111))
Sat Apr 7 17:22:50 2012 – [warning] Connection failed 3 time(s)..
Sat Apr 7 17:22:50 2012 – [warning] Master is not reachable from health checker!
Sat Apr 7 17:22:50 2012 – [warning] Master db-181(10.2.0.181:3306) is not reachable!
Sat Apr 7 17:22:50 2012 – [warning] SSH is reachable.
Sat Apr 7 17:22:50 2012 – [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /masterha/app1/app1.cnf again, and trying to connect to all servers to check server status..
Sat Apr 7 17:22:50 2012 – [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Apr 7 17:22:50 2012 – [info] Reading application default configurations from /masterha/app1/app1.cnf..
Sat Apr 7 17:22:50 2012 – [info] Reading server configurations from /masterha/app1/app1.cnf..
Sat Apr 7 17:22:50 2012 – [info] Dead Servers:
Sat Apr 7 17:22:50 2012 – [info] db-181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] Alive Servers:
Sat Apr 7 17:22:50 2012 – [info] db-183(10.2.0.183:3306)
Sat Apr 7 17:22:50 2012 – [info] db-184(10.2.0.184:3306)
Sat Apr 7 17:22:50 2012 – [info] Alive Slaves:
Sat Apr 7 17:22:50 2012 – [info] db-183(10.2.0.183:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:50 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] Primary candidate for the new Master (candidate_master is set)
Sat Apr 7 17:22:50 2012 – [info] db-184(10.2.0.184:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:50 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] Checking slave configurations..
Sat Apr 7 17:22:50 2012 – [warning] relay_log_purge=0 is not set on slave db-183(10.2.0.183:3306).
Sat Apr 7 17:22:50 2012 – [warning] relay_log_purge=0 is not set on slave db-184(10.2.0.184:3306).
Sat Apr 7 17:22:50 2012 – [info] Checking replication filtering settings..
Sat Apr 7 17:22:50 2012 – [info] Replication filtering check ok.
Sat Apr 7 17:22:50 2012 – [info] Master is down!
Sat Apr 7 17:22:50 2012 – [info] Terminating monitoring script.
Sat Apr 7 17:22:50 2012 – [info] Got exit code 20 (Master dead).
Sat Apr 7 17:22:50 2012 – [info] MHA::MasterFailover version 0.53.
Sat Apr 7 17:22:50 2012 – [info] Starting master failover.
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] * Phase 1: Configuration Check Phase..
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] Dead Servers:
Sat Apr 7 17:22:50 2012 – [info] db-181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] Checking master reachability via mysql(double check)..
Sat Apr 7 17:22:50 2012 – [info] ok.
Sat Apr 7 17:22:50 2012 – [info] Alive Servers:
Sat Apr 7 17:22:50 2012 – [info] db-183(10.2.0.183:3306)
Sat Apr 7 17:22:50 2012 – [info] db-184(10.2.0.184:3306)
Sat Apr 7 17:22:50 2012 – [info] Alive Slaves:
Sat Apr 7 17:22:50 2012 – [info] db-183(10.2.0.183:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:50 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] Primary candidate for the new Master (candidate_master is set)
Sat Apr 7 17:22:50 2012 – [info] db-184(10.2.0.184:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:50 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] ** Phase 1: Configuration Check Phase completed.
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] * Phase 2: Dead Master Shutdown Phase..
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] Forcing shutdown so that applications never connect to the current master..
Sat Apr 7 17:22:50 2012 – [warning] master_ip_failover_script is not set. Skipping invalidating dead master ip address.
Sat Apr 7 17:22:50 2012 – [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Apr 7 17:22:50 2012 – [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] * Phase 3: Master Recovery Phase..
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] The latest binary log file/position on all slaves is mysql-bin.000008:364599308
Sat Apr 7 17:22:50 2012 – [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Apr 7 17:22:50 2012 – [info] db-184(10.2.0.184:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:50 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] The oldest binary log file/position on all slaves is mysql-bin.000008:356208745
Sat Apr 7 17:22:50 2012 – [info] Oldest slaves:
Sat Apr 7 17:22:50 2012 – [info] db-183(10.2.0.183:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:50 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:50 2012 – [info] Primary candidate for the new Master (candidate_master is set)
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] * Phase 3.2: Saving Dead Master’s Binlog Phase..
Sat Apr 7 17:22:50 2012 – [info]
Sat Apr 7 17:22:50 2012 – [info] Fetching dead master’s binary logs..
Sat Apr 7 17:22:50 2012 – [info] Executing command on the dead master db-181(10.2.0.181:3306): save_binary_logs –command=save –start_file=mysql-bin.000008 –start_pos=364599308 –binlog_dir=/data/mysql –output_file=/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog –handle_raw_binlog=1 –disable_log_bin=0 –manager_version=0.53
Creating /masterha/app1 if not exists.. ok.
Concat binary/relay logs from mysql-bin.000008 pos 364599308 to mysql-bin.000008 EOF into /masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog ..
Dumping binlog format description event, from position 0 to 107.. ok.
Dumping effective binlog data from /data/mysql/mysql-bin.000008 position 364599308 to tail(364602356).. ok.
Concat succeeded.
Sat Apr 7 17:22:50 2012 – [info] scp from root@10.2.0.181:/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog to local:/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog succeeded.
Sat Apr 7 17:22:50 2012 – [info] HealthCheck: SSH to db-183 is reachable.
Sat Apr 7 17:22:51 2012 – [info] HealthCheck: SSH to db-184 is reachable.
Sat Apr 7 17:22:51 2012 – [info]
Sat Apr 7 17:22:51 2012 – [info] * Phase 3.3: Determining New Master Phase..
Sat Apr 7 17:22:51 2012 – [info]
Sat Apr 7 17:22:51 2012 – [info] Finding the latest slave that has all relay logs for recovering other slaves..
Sat Apr 7 17:22:51 2012 – [info] Checking whether db-184 has relay logs from the oldest position..
Sat Apr 7 17:22:51 2012 – [info] Executing command: apply_diff_relay_logs –command=find –latest_mlf=mysql-bin.000008 –latest_rmlp=364599308 –target_mlf=mysql-bin.000008 –target_rmlp=356208745 –server_id=3 –workdir=/masterha/app1 –timestamp=20120407172250 –manager_version=0.53 –relay_log_info=/data/mysql/relay-log.info –relay_dir=/data/mysql/ :
Opening /data/mysql/relay-log.info … ok.
Relay log found at /data/mysql, up to db-184-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:db-184-relay-bin.000002, start_pos:356208891.
Target relay log FOUND!
Sat Apr 7 17:22:51 2012 – [info] OK. db-184 has all relay logs.
Sat Apr 7 17:22:51 2012 – [info] Searching new master from slaves..
Sat Apr 7 17:22:51 2012 – [info] Candidate masters from the configuration file:
Sat Apr 7 17:22:51 2012 – [info] db-183(10.2.0.183:3306) Version=5.5.16-log (oldest major version between slaves) log-bin:enabled
Sat Apr 7 17:22:51 2012 – [info] Replicating from 10.2.0.181(10.2.0.181:3306)
Sat Apr 7 17:22:51 2012 – [info] Primary candidate for the new Master (candidate_master is set)
Sat Apr 7 17:22:51 2012 – [info] Non-candidate masters:
Sat Apr 7 17:22:51 2012 – [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Apr 7 17:22:51 2012 – [info] Not found.
Sat Apr 7 17:22:51 2012 – [info] Searching from all candidate_master slaves..
Sat Apr 7 17:22:51 2012 – [info] New master is db-183(10.2.0.183:3306)
Sat Apr 7 17:22:51 2012 – [info] Starting master failover..
Sat Apr 7 17:22:51 2012 – [info]
From:
db-181 (current master)
+–db-183
+–db-184

To:
db-183 (new master)
+–db-184
Sat Apr 7 17:22:51 2012 – [info]
Sat Apr 7 17:22:51 2012 – [info] * Phase 3.3: New Master Diff Log Generation Phase..
Sat Apr 7 17:22:51 2012 – [info]
Sat Apr 7 17:22:51 2012 – [info] Server db-183 received relay logs up to: mysql-bin.000008:356208745
Sat Apr 7 17:22:51 2012 – [info] Need to get diffs from the latest slave(db-184) up to: mysql-bin.000008:364599308 (using the latest slave’s relay logs)
Sat Apr 7 17:22:51 2012 – [info] Connecting to the latest slave host db-184, generating diff relay log files..
Sat Apr 7 17:22:51 2012 – [info] Executing command: apply_diff_relay_logs –command=generate_and_send –scp_user=root –scp_host=10.2.0.183 –latest_mlf=mysql-bin.000008 –latest_rmlp=364599308 –target_mlf=mysql-bin.000008 –target_rmlp=356208745 –server_id=3 –diff_file_readtolatest=/masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog –workdir=/masterha/app1 –timestamp=20120407172250 –handle_raw_binlog=1 –disable_log_bin=0 –manager_version=0.53 –relay_log_info=/data/mysql/relay-log.info –relay_dir=/data/mysql/
Sat Apr 7 17:22:52 2012 – [info]
Opening /data/mysql/relay-log.info … ok.
Relay log found at /data/mysql, up to db-184-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:db-184-relay-bin.000002, start_pos:356208891.
Concat binary/relay logs from db-184-relay-bin.000002 pos 356208891 to db-184-relay-bin.000002 EOF into /masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog ..
Dumping binlog format description event, from position 0 to 253.. ok.
Dumping effective binlog data from /data/mysql/db-184-relay-bin.000002 position 356208891 to tail(364599454).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog .
scp db-184:/masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog to root@10.2.0.183(22) succeeded.
Sat Apr 7 17:22:52 2012 – [info] Generating diff files succeeded.
Sat Apr 7 17:22:52 2012 – [info] Sending binlog..
Sat Apr 7 17:22:52 2012 – [info] scp from local:/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog to root@db-183:/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog succeeded.
Sat Apr 7 17:22:52 2012 – [info]
Sat Apr 7 17:22:52 2012 – [info] * Phase 3.4: Master Log Apply Phase..
Sat Apr 7 17:22:52 2012 – [info]
Sat Apr 7 17:22:52 2012 – [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Sat Apr 7 17:22:52 2012 – [info] Starting recovery on db-183(10.2.0.183:3306)..
Sat Apr 7 17:22:52 2012 – [info] Generating diffs succeeded.
Sat Apr 7 17:22:52 2012 – [info] Waiting until all relay logs are applied.
Sat Apr 7 17:22:52 2012 – [info] done.
Sat Apr 7 17:22:52 2012 – [info] Getting slave status..
Sat Apr 7 17:22:52 2012 – [info] This slave(db-183)’s Exec_Master_Log_Pos(mysql-bin.000008:356208571) does not equal to Read_Master_Log_Pos(mysql-bin.000008:356208745). It is likely that relay log was cut during transaction. Need to recover from Exec_Master_Log_Pos.
Sat Apr 7 17:22:52 2012 – [info] Saving local relay logs from exec pos to read pos on db-183: from db-183-relay-bin.000003:279938556 to the end of the relay log..
Sat Apr 7 17:22:52 2012 – [info] Executing command : save_binary_logs –command=save –start_file=db-183-relay-bin.000003 –start_pos=279938556 –output_file=/masterha/app1/relay_from_exec_to_read_db-183_3306_20120407172250.binlog –handle_raw_binlog=1 –disable_log_bin=0 –manager_version=0.53 –relay_log_info=/data/mysql/relay-log.info –binlog_dir=/data/mysql/
Sat Apr 7 17:22:53 2012 – [info]
Creating /masterha/app1 if not exists.. ok.
Concat binary/relay logs from db-183-relay-bin.000003 pos 279938556 to db-183-relay-bin.000003 EOF into /masterha/app1/relay_from_exec_to_read_db-183_3306_20120407172250.binlog ..
Dumping binlog format description event, from position 0 to 253.. ok.
Dumping effective binlog data from /data/mysql/db-183-relay-bin.000003 position 279938556 to tail(279938730).. ok.
Concat succeeded.
Sat Apr 7 17:22:53 2012 – [info] Connecting to the target slave host db-183, running recover script..
Sat Apr 7 17:22:53 2012 – [info] Executing command: apply_diff_relay_logs –command=apply –slave_user=root –slave_host=db-183 –slave_ip=10.2.0.183 –slave_port=3306 –apply_files=/masterha/app1/relay_from_exec_to_read_db-183_3306_20120407172250.binlog,/masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog,/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog –workdir=/masterha/app1 –target_version=5.5.16-log –timestamp=20120407172250 –handle_raw_binlog=1 –disable_log_bin=0 –manager_version=0.53 –slave_pass=xxx
Sat Apr 7 17:23:36 2012 – [info]
Concat all apply files to /masterha/app1/total_binlog_for_db-183_3306.20120407172250.binlog ..
Copying the first binlog file /masterha/app1/relay_from_exec_to_read_db-183_3306_20120407172250.binlog to /masterha/app1/total_binlog_for_db-183_3306.20120407172250.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog.. dumped up to pos 253. ok.
/masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog has effective binlog events from pos 253.
Dumping effective binlog data from /masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog position 253 to tail(8390816).. ok.
Dumping binlog head events (rotate events), skipping format description events from /masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog.. dumped up to pos 107. ok.
/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog has effective binlog events from pos 107.
Dumping effective binlog data from /masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog position 107 to tail(3155).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /masterha/app1/total_binlog_for_db-183_3306.20120407172250.binlog .
Applying differential binary/relay log files /masterha/app1/relay_from_exec_to_read_db-183_3306_20120407172250.binlog,/masterha/app1/relay_from_read_to_latest_db-183_3306_20120407172250.binlog,/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog on db-183:3306. This may take long time…
Applying log files succeeded.
Sat Apr 7 17:23:36 2012 – [info] All relay logs were successfully applied.
Sat Apr 7 17:23:36 2012 – [info] Getting new master’s binlog name and position..
Sat Apr 7 17:23:36 2012 – [info] mysql-bin.000001:359464652
Sat Apr 7 17:23:36 2012 – [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=’db-183 or 10.2.0.183′, MASTER_PORT=3306, MASTER_LOG_FILE=’mysql-bin.000001′, MASTER_LOG_POS=359464652, MASTER_USER=’rep’, MASTER_PASSWORD=’xxx’;
Sat Apr 7 17:23:36 2012 – [warning] master_ip_failover_script is not set. Skipping taking over new master ip address.
Sat Apr 7 17:23:36 2012 – [info] Setting read_only=0 on db-183(10.2.0.183:3306)..
Sat Apr 7 17:23:36 2012 – [info] ok.
Sat Apr 7 17:23:36 2012 – [info] ** Finished master recovery successfully.
Sat Apr 7 17:23:36 2012 – [info] * Phase 3: Master Recovery Phase completed.
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] * Phase 4: Slaves Recovery Phase..
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] — Slave diff file generation on host db-184(10.2.0.184:3306) started, pid: 20100. Check tmp log /masterha/app1/db-184_3306_20120407172250.log if it takes time..
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] Log messages from db-184 …
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] This server has all relay logs. No need to generate diff files from the latest slave.
Sat Apr 7 17:23:36 2012 – [info] End of log messages from db-184.
Sat Apr 7 17:23:36 2012 – [info] — db-184(10.2.0.184:3306) has the latest relay log events.
Sat Apr 7 17:23:36 2012 – [info] Generating relay diff files from the latest slave succeeded.
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] — Slave recovery on host db-184(10.2.0.184:3306) started, pid: 20102. Check tmp log /masterha/app1/db-184_3306_20120407172250.log if it takes time..
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] Log messages from db-184 …
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] Sending binlog..
Sat Apr 7 17:23:36 2012 – [info] scp from local:/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog to root@db-184:/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog succeeded.
Sat Apr 7 17:23:36 2012 – [info] Starting recovery on db-184(10.2.0.184:3306)..
Sat Apr 7 17:23:36 2012 – [info] Generating diffs succeeded.
Sat Apr 7 17:23:36 2012 – [info] Waiting until all relay logs are applied.
Sat Apr 7 17:23:36 2012 – [info] done.
Sat Apr 7 17:23:36 2012 – [info] Getting slave status..
Sat Apr 7 17:23:36 2012 – [info] This slave(db-184)’s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000008:364599308). No need to recover from Exec_Master_Log_Pos.
Sat Apr 7 17:23:36 2012 – [info] Connecting to the target slave host db-184, running recover script..
Sat Apr 7 17:23:36 2012 – [info] Executing command: apply_diff_relay_logs –command=apply –slave_user=root –slave_host=db-184 –slave_ip=10.2.0.184 –slave_port=3306 –apply_files=/masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog –workdir=/masterha/app1 –target_version=5.5.16-log –timestamp=20120407172250 –handle_raw_binlog=1 –disable_log_bin=0 –manager_version=0.53 –slave_pass=xxx
Sat Apr 7 17:23:36 2012 – [info]
Applying differential binary/relay log files /masterha/app1/saved_master_binlog_from_db-181_3306_20120407172250.binlog on db-184:3306. This may take long time…
Applying log files succeeded.
Sat Apr 7 17:23:36 2012 – [info] All relay logs were successfully applied.
Sat Apr 7 17:23:36 2012 – [info] Resetting slave db-184(10.2.0.184:3306) and starting replication from the new master db-183(10.2.0.183:3306)..
Sat Apr 7 17:23:36 2012 – [info] Executed CHANGE MASTER.
Sat Apr 7 17:23:36 2012 – [info] Slave started.
Sat Apr 7 17:23:36 2012 – [info] End of log messages from db-184.
Sat Apr 7 17:23:36 2012 – [info] — Slave recovery on host db-184(10.2.0.184:3306) succeeded.
Sat Apr 7 17:23:36 2012 – [info] All new slave servers recovered successfully.
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] * Phase 5: New master cleanup phease..
Sat Apr 7 17:23:36 2012 – [info]
Sat Apr 7 17:23:36 2012 – [info] Resetting slave info on the new master..
Sat Apr 7 17:23:36 2012 – [info] db-183: Resetting slave info succeeded.
Sat Apr 7 17:23:36 2012 – [info] Master failover to db-183(10.2.0.183:3306) completed successfully.
Sat Apr 7 17:23:36 2012 – [info]

—– Failover Report —–

app1: MySQL Master failover db-181 to db-183 succeeded

Master db-181 is down!

Check MHA Manager logs at db-182:/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
The latest slave db-184(10.2.0.184:3306) has all relay logs for recovery.
Selected db-183 as a new master.
db-183: OK: Applying all logs succeeded.
db-184: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
db-184: OK: Applying all logs succeeded. Slave started, replicating from db-183.
db-183: Resetting slave info succeeded.
Master failover to db-183(10.2.0.183:3306) completed successfully.

MHA的project地址为:https://code.google.com/p/mysql-master-ha/