This blog is the fourth in a series. It follows the blog Analyze Pacemaker events in Cloud Logging, which describes how to install and configure the Google Cloud Ops Agent to stream the Pacemaker logs of all your high availability clusters to Cloud Logging, so that you can analyze Pacemaker events from any of your clusters in one central place. But what if you don’t have this agent installed and want to know what happened to your cluster?
Let’s look at logparser, an open source Python script that helps you consolidate the relevant Pacemaker logs from your cluster nodes and filter the log entries for critical events such as fencing or resource failures. It takes the following log files as input and generates an output file that lists the log entries for critical events in chronological order.
System log such as /var/log/messages
Pacemaker logs such as /var/log/pacemaker.log and /var/log/corosync/corosync.log
hb_report on SUSE
sosreport on Red Hat
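To make the idea of consolidating logs chronologically concrete, here is a minimal sketch. It is not the actual logparser implementation. It assumes that every line starts with an ISO-style timestamp and that each input is already sorted by time; the keyword list, the function names, and the sample lines are hypothetical.

```python
import heapq
from datetime import datetime

# Hypothetical keywords; the real logparser uses its own patterns.
CRITICAL_KEYWORDS = ("fencing", "failed")


def parse_line(line):
    """Split 'YYYY-MM-DDTHH:MM:SS message' into (timestamp, message)."""
    stamp, _, message = line.partition(" ")
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), message


def merge_critical(*logs):
    """Merge time-sorted logs and keep only entries with a critical keyword."""
    parsed = ([parse_line(line) for line in log] for log in logs)
    merged = heapq.merge(*parsed, key=lambda entry: entry[0])
    return [
        (stamp, message)
        for stamp, message in merged
        if any(word in message.lower() for word in CRITICAL_KEYWORDS)
    ]


# Made-up sample lines standing in for the logs of two cluster nodes.
node1_log = [
    "2024-01-01T10:00:00 node1 started resource rsc_vip",
    "2024-01-01T10:00:05 node1 monitor operation failed for rsc_vip",
]
node2_log = [
    "2024-01-01T10:00:03 node2 requesting fencing of node1",
    "2024-01-01T10:00:09 node2 resource rsc_vip started",
]

for stamp, message in merge_critical(node1_log, node2_log):
    print(stamp.isoformat(), message)
```

Merging by timestamp rather than concatenating the files is what lets you read the events of several nodes as a single timeline.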
How to use this script?
The script is available to download from this GitHub repository and supports multiple platforms.
The program requires Python 3.6 or later and runs on Linux, Windows, and macOS. First, install or update your Python environment. Then clone the GitHub repository.
Run the script
Run the script with ‘-h’ to see the help text. Specify the input log files and, optionally, a time range or an output file name. By default, the output file is ‘logparser.out’ in the current directory.
The hb_report is a utility provided by SUSE to capture all relevant Pacemaker logs in one package. If passwordless SSH login is set up between the cluster nodes, it should gather the information from all nodes. If not, collect the hb_report on each cluster node.
The sosreport is a similar utility provided by Red Hat to collect system log files, configuration details, and system information, including the Pacemaker logs. Collect the sosreport on each cluster node.
You can also parse single system logs or Pacemaker logs.
On Windows, execute the Python file logparser.py instead.
Next, let’s analyze the output of the log parser.
Understanding the Output Information
The output log may contain a variety of information, including but not limited to fencing actions, resource actions, failures, and Corosync subsystem events.
Fencing action reason and result
The example below shows a fencing (reboot) action targeting a cluster node because the node left the cluster. The subsequent log entry shows the fencing operation is successful (OK).
Pacemaker actions to manage cluster resources
The example below illustrates multiple actions affecting the cluster resources, such as actions moving resources from one cluster node to another, or an action stopping a resource on a specific cluster node.
Failed resource operations
Pacemaker manages cluster resources by calling resource operations such as monitor, start or stop, which are defined in corresponding resource agents (shell or Python scripts). The log parser filters log entries of failed operations. The example below shows a monitor operation that failed because the virtual IP resource is not running.
Resource agent, fence agent warnings and errors
A resource agent or fence agent writes detailed logs for operations. When you observe resource operation failure, the agent logs can help identify the root cause. The log parser filters the ERROR logs for all agents. Additionally, it filters WARNING logs for the SAPHana agent.
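The filtering rule described above can be sketched as follows. This is only an illustration of the rule, not the code of the log parser: the function name and the sample entries, given as (agent, level, message) tuples, are hypothetical.

```python
def keep_agent_entry(agent, level):
    """Return True when an agent log entry belongs in the filtered output."""
    if level == "ERROR":
        return True  # ERROR entries are kept for every agent
    # WARNING entries are kept only for the SAPHana agent
    return level == "WARNING" and agent == "SAPHana"


# Made-up entries for illustration only.
entries = [
    ("IPaddr2", "ERROR", "address check failed"),
    ("IPaddr2", "WARNING", "slow response"),
    ("SAPHana", "WARNING", "replication status unknown"),
    ("SAPHana", "INFO", "monitor finished"),
]

kept = [entry for entry in entries if keep_agent_entry(entry[0], entry[1])]
for agent, level, message in kept:
    print(agent, level, message)
```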
Corosync communication error or failure
Corosync is the messaging layer that the cluster nodes use to communicate with each other. Failure in Corosync communication between nodes may trigger a fencing action.
The example below shows a Corosync message being retransmitted multiple times and eventually reporting an error that the other cluster node left the cluster.
This next example shows that a Corosync TOKEN was not received within the defined time period and eventually Corosync reported an error that the other cluster node left the cluster.
Reach migration threshold and force resource off
When the number of failures of a resource reaches the defined migration threshold (parameter migration-threshold), the resource is forced to migrate to another cluster node.
When a resource fails to start on a cluster node, the number of failures is set to INFINITY, which implicitly reaches the migration threshold and forces a resource migration. If a location constraint prevents the resource from running on the other cluster nodes, or if no other cluster nodes are available, the resource is stopped and cannot run anywhere.
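The decision described above can be modeled in a few lines. The sketch below is a simplified illustration, not Pacemaker's scheduler: the function names and return values are hypothetical, and it assumes that every failure other than a failed start raises the failure count by one. The value 1,000,000 is the score Pacemaker uses for INFINITY.

```python
INFINITY = 1_000_000  # Pacemaker's numeric value for INFINITY


def fail_count_after(fail_count, operation):
    """Return the new failure count after a failed operation."""
    if operation == "start":
        return INFINITY  # a failed start jumps straight to INFINITY
    return min(fail_count + 1, INFINITY)


def placement(fail_count, migration_threshold, other_node_allowed):
    """Decide what happens to the resource on the node where it failed."""
    if fail_count < migration_threshold:
        return "stay"
    # Threshold reached: the resource is forced off this node.
    return "migrate" if other_node_allowed else "stopped"


# Third monitor failure with migration-threshold=3 forces a migration.
print(placement(fail_count_after(2, "monitor"), 3, True))
# A failed start with no other node allowed leaves the resource stopped.
print(placement(fail_count_after(0, "start"), INFINITY, False))
```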
Location constraint added due to manual resource movement
All location constraints with the prefix ‘cli-prefer’ or ‘cli-ban’ are added implicitly when a user triggers a cluster resource move or ban command. These constraints should be cleared after the resource movement, because they restrict the resource to running only on certain nodes. The example below shows that a ‘cli-ban’ location constraint was created and a ‘cli-prefer’ location constraint was deleted.
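Separately from the log output, a quick way to spot such leftover constraints is to check the constraint IDs for these prefixes. The following is a minimal sketch that assumes you already have the IDs as a list of strings; the function name and the sample IDs are hypothetical.

```python
import re

# Matches IDs that start with the prefixes named above.
CLI_CONSTRAINT = re.compile(r"^cli-(prefer|ban)")


def leftover_cli_constraints(constraint_ids):
    """Return the IDs that were created by a manual move or ban command."""
    return [cid for cid in constraint_ids if CLI_CONSTRAINT.match(cid)]


# Made-up constraint IDs for illustration only.
sample_ids = [
    "cli-prefer-rsc_vip",
    "cli-ban-rsc_vip-on-node1",
    "loc_rsc_vip_prefers_node2",
]

print(leftover_cli_constraints(sample_ids))
```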
Cluster/Node/Resource maintenance/standby/manage mode change
The log parser filters log entries when any maintenance commands are issued on the cluster, cluster nodes or resources. The examples below show the cluster maintenance mode was enabled, and a node was set to standby.
This Pacemaker log parser can give you one simplified view of critical events in your High Availability cluster. If further support is needed from the Google Cloud Customer Care Team, follow this guide to collect the diagnostics files and open a support case.
If you are interested in learning more about running SAP on Google Cloud with Pacemaker, read the previous blogs in this series here: