Wednesday, January 6, 2021

View Transactions from MySQL Binlogs to Resolve Replication Issues

Summary

Sometimes Replication abends need investigation and the first step is to determine the transaction that is failing.  So we can determine the transaction based on logfile and position from the replication abend is the most common reason I go after transactions from the bin logs so I know what we need to do to fix.

Replication Abend Message

2017-03-20T14:16:10.383779Z 287072 [ERROR] Slave SQL for channel 'group_replication_recovery': Worker 0 failed executing

transaction '57671649-2432-4b1a-af66-74f16f43c510:1588295' at master log mynode-bin.000387, end_log_pos 33127552;

Could not execute Delete_rows event on table myschema.tablename; Can't find record in 'tablename',

Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 33127552, Error_code: 1032

 

In the example message we can determine the log file name and position

mynode-bin.000387

33127552

Mysql Binlog Utility Options

--base64-output=DECODE-ROWS

Required for the ROW based replication format

 

--verbose

Required to verbose the output in a readable text format

 

MySQL Bin log to a text file to be able to search for records

Example:

mysqlbinlog --base64-output=DECODE-ROWS --verbose  mynode-bin.000387 > mynode.binlog.00387.txt

 

 

 

Search the binlog text file for Records

Edit the text file created in the previous and then search for the log positon

 

vi mynode.binlog.00387.txt

/33127552

 

#170318  2:42:06 server id 101  end_log_pos 33127552    Delete_rows: table id 595 flags: STMT_END_F

### DELETE FROM `myschema`.`tablename`

### WHERE

###   @1='05d7490c-0971-408f-82f5-d5024922dcd7'

###   @2='received.x12.005010x222a1'

###   @3=526

###   @4=9223372036854775807

###   @5='claims/claims'

###   @6=1489819270

### DELETE FROM `myschema`.`tablename`

### WHERE

###   @1='05d7490c-0971-408f-82f5-d5024922dcd7'

###   @2='received.x12.005010x223a2'

###   @3=1

###   @4=9223372036854775807

###   @5='claims/claims'

###   @6=1489819270

### DELETE FROM `myschema`.`tablename`

### WHERE

###   @1='087f46ce-c8a2-490b-89c7-a4dd77584070'

###   @2='processed'

###   @3=24

###   @4=9223372036854775807

###   @5='claims/claims'

###   @6=1489819270

 

** note in this case 3 delete statements, if you just skipped the position in replication you would skip all 3 delete statements probably no exactly what you would want to do.

 

Resolve the Replication Issue

Based on key value of table find which of the 3 deletes is an issue.

 

After that we can insert a record into the table that would satisfy the delete so that replication can successfully remove the dummy record and continue.


Tuesday, January 5, 2021

Group Replication will not rejoin group due to connection issue

 

Summary

Group replication left as a member of the group and will not rejoin.

Track down reason why it will not rejoin group, example here is it will not reconnect due to a credential issue.

 

1.       Check what error is connection issues is happening trying to rejoin the group in the local database

 

select * from performance_schema.replication_connection_status ;

 

Example:

| group_replication_recovery |                                      |                                      |      NULL | OFF           |                         0 | 0000-00-00 00:00:00.000000 |                                                                                                                                                                                                                      |             13117 | Fatal error: Invalid (empty) username when attempting to connect to the master server. Connection attempt terminated. | 2021-01-05 08:38:15.014830 |                         | 0000-00-00 00:00:00.000000                        | 0000-00-00 00:00:00.000000                         | 0000-00-00 00:00:00.000000                    | 0000-00-00 00:00:00.000000                  |                      | 0000-00-00 00:00:00.000000                     | 0000-00-00 00:00:00.000000                      | 0000-00-00 00:00:00.000000                 |

** In this case it is stating that empty username is being used attempting to connect in.

 

2.       check the channel definition

 

select channel_name, host, port, user, network_interface, auto_position from performance_schema.replication_connection_configuration ;

 

Example:

+----------------------------+--------------------------+------+----------+-------------------+---------------+

| channel_name               | host                     | port | user     | network_interface | auto_position |

+----------------------------+--------------------------+------+----------+-------------------+---------------+

|                            | aglqapmyp01              | 3306 | rpl_user |                   | 0             |

| group_replication_applier  | <NULL>                   |    0 |          |                   | 1             |

| group_replication_recovery | aglqapmyp05.xxxxxxxx.xxx | 3306 |          |                   | 1             |

+----------------------------+--------------------------+------+----------+-------------------+---------------+

3 rows in set (0.00 sec)

 

** Note in this case the user column is missing

 

 

3.       Since it appears that the credentials are missing for the channel we can reset them and give a username and password for the channel.

 

CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='xxxxxxx' FOR CHANNEL 'group_replication_recovery';

 

 

4.       Recheck that our change for the credential took affect

 

select channel_name, host, port, user, network_interface, auto_position from performance_schema.replication_connection_configuration ;

 

Example:

+----------------------------+--------------------------+------+----------+-------------------+---------------+

| channel_name               | host                     | port | user     | network_interface | auto_position |

+----------------------------+--------------------------+------+----------+-------------------+---------------+

|                            | aglqapmyp01              | 3306 | rpl_user |                   | 0             |

| group_replication_applier  | <NULL>                   |    0 |          |                   | 1             |

| group_replication_recovery | aglqapmyp05.xxxxxxxx.xxx | 3306 | rpl_user |                   | 1             |

+----------------------------+--------------------------+------+----------+-------------------+---------------+

3 rows in set (0.00 sec)

 

 

 

 

5.       Restart database to restart group replication, a good way to do that is restart the database with the mysql group replication start on boot on, this way we are completely clean.

 

Example:

/etc/rc.d/init.d/mysql.server stop

/etc/rc.d/init.d/mysql.server start

 

 

6.       Check group replication status

 

SELECT MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE,MEMBER_VERSION FROM performance_schema.replication_group_members order by MEMBER_HOST;

 

Example:

+--------------------------+--------------+-------------+----------------+

| MEMBER_HOST              | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |

+--------------------------+--------------+-------------+----------------+

| aglqapmyp04.xxxxxxxx.xxx | ONLINE       | SECONDARY   | 8.0.22         |

| aglqapmyp05.xxxxxxxx.xxx | ONLINE       | PRIMARY     | 8.0.22         |

| aglqapmyp06.xxxxxxxx.xxx | ONLINE       | SECONDARY   | 8.0.22         |

+--------------------------+--------------+-------------+----------------+