RHCS troubleshooting

2013-12-26 12:15

 

1.1. Viewing cluster status (clustat)

The clustat command displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services. The clustat command displays cluster status only from the viewpoint of the cluster system on which it is running.

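A commonly used option is -i, which sets the refresh interval so that you can watch state transitions as the cluster starts and stops. For example, clustat -i 2 refreshes the clustat output every 2 seconds.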

 

1.2. cman management tool (cman_tool)

cman_tool is a program that manages the cluster management subsystem CMAN. cman_tool can be used to join the node to a cluster, leave the cluster, kill another cluster node or change the value of expected votes of a cluster. Be careful that you understand the consequences of the commands issued via cman_tool as they can affect all nodes in your cluster. Most of the time cman_tool will only be invoked from your startup and shutdown scripts.

The output below shows when db1 was last fenced and which fence device was used.

[root@db1 oradata]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M     96   2010-09-02 15:04:11  db1.fjnet114.com
       Last fenced:   2010-09-02 14:04:11  by ilo1
   2   M    100   2010-09-02 15:04:11  db2.fjnet114.com

-------------------------------------------------------------------------------

[root@db1 home]# cman_tool status
Version: 6.1.0
Config Version: 8
Cluster Name: new_cluster
Cluster Id: 23732
Cluster Member: Yes
Cluster Generation: 104
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 8
Flags: 2node Dirty
Ports Bound: 0 177
Node name: db1.fjnet114.com
Node ID: 1
Multicast addresses: 239.192.92.17 // no multicast address was found on Red Hat 4
Node addresses: 192.168.114.102
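
The vote and quorum fields in this output can also be checked from a script. The following is a minimal sketch, assuming the "Key: value" layout shown above. The helper names cman_field and is_quorate are made up for illustration and are not part of cman_tool.

```shell
# Print the value of one field from `cman_tool status`-style text on stdin.
# cman_field is a hypothetical helper, not a cman_tool subcommand.
cman_field() {
    awk -F': *' -v key="$1" '
        $1 == key { v = $2; sub(/[ \t]+$/, "", v); print v; exit }
    '
}

# Succeed (exit 0) when "Total votes" reaches the "Quorum" threshold.
# Reads the same status text on stdin.
is_quorate() {
    awk -F': *' '
        $1 == "Total votes" { total = $2 + 0 }
        $1 == "Quorum"      { quorum = $2 + 0 }
        END { exit !(quorum > 0 && total >= quorum) }
    '
}
```

On a cluster node this would be used as cman_tool status | cman_field 'Expected votes', or cman_tool status | is_quorate && echo quorate.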

 

1.3. Viewing fence/dlm status (group_tool)

The group_tool program displays the status of fence, dlm and gfs groups. The information is read from the groupd daemon which controls the fenced, dlm_controld and gfs_controld daemons. group_tool will also dump debug logs from various daemons.

This command is not available on Red Hat 4.

[root@db1 oradata]# group_tool ls
type             level name       id       state       
fence            0     default    00010001 none        
[1 2]
dlm              1     rgmanager  00020001 none        
[1 2]
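
In this listing the state column reads none for both groups, which is what a settled group normally shows. The following is a minimal sketch, assuming the five-column layout above, of a filter that prints any fence, dlm or gfs group whose state is something else. The function name groups_not_idle is hypothetical.

```shell
# Read `group_tool ls`-style text on stdin and print "type name state"
# for every group line whose state column is not "none".
# Member lines such as "[1 2]" and the header line are skipped.
groups_not_idle() {
    awk '$1 ~ /^(fence|dlm|gfs)$/ && NF >= 5 && $5 != "none" { print $1, $3, $5 }'
}
```

On a cluster node: group_tool ls | groups_not_idle. Empty output means every listed group is in state none.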

 

1.4. Testing rgmanager resources (rg_test)

To view how cluster resource monitoring is configured, run rg_test rules. /usr/share/cluster holds the default monitoring scripts (resource agents) for a number of applications.

(1) Display the resource rules that rg_test understands:

rg_test rules

Test a configuration (and /usr/share/cluster) for errors or redundant resource agents:

rg_test test /etc/cluster/cluster.conf

(2) Display the start and stop ordering of a service. Display start order:

rg_test noop /etc/cluster/cluster.conf start service servicename

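This command is very useful for testing resource dependencies, although rg_test --help does not list the noop option. In my environment the output is as follows: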

[root@db1 oradata]# rg_test noop /etc/cluster/cluster.conf start service wbdb_service
Running in test mode.
Starting wbdb_service...
[start] service:wbdb_service
[start] fs:oradata
[start] fs:orabackup
[start] ip:192.168.114.108
[start] script:oracle
Start of wbdb_service complete

Display stop order:

rg_test noop /etc/cluster/cluster.conf stop service servicename
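
The ordering printed by rg_test noop can be reduced to a plain list, which makes the start and stop orders easier to compare. The following is a minimal sketch, assuming the "[start] type:name" line format shown in the output above and that stop lines use a matching "[stop]" prefix. resource_order is a hypothetical helper name.

```shell
# Read `rg_test noop` output on stdin and print each resource as type:name,
# in the order it appears. $1 is the phase to extract: start or stop.
resource_order() {
    sed -n "s/^\[$1] *//p"
}
```

For example: rg_test noop /etc/cluster/cluster.conf start service wbdb_service | resource_order start. If rg_test writes its report to stderr on your system, add 2>&1 before the pipe.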

(3) Explicitly start or stop a service.

Important: Only do this on one node, and always disable the service in rgmanager first.

Start a service:

rg_test test /etc/cluster/cluster.conf start service servicename

Stop a service:

rg_test test /etc/cluster/cluster.conf stop service servicename

(4) Calculate and display the resource tree delta between two cluster.conf files. This compares the resource tree structure and the start/stop ordering of the two configuration files.

rg_test delta <cluster.conf file 1> <cluster.conf file 2>

For example:

rg_test delta /etc/cluster/cluster.conf.bak /etc/cluster/cluster.conf

 

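1.5. Following the log in real time (tail -f)

This command is especially useful for watching the cluster log: it shows when the cluster mounts disks, moves IP addresses, starts services, and so on.

Common command:

tail -f /var/log/messages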
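
/var/log/messages also carries everything else the system logs, so it can help to filter the stream down to the cluster daemons. The following is a minimal sketch. The list of daemon names (fenced, clurgmgrd, openais, ccsd, dlm_controld, gfs_controld) is an assumption about what a given release writes to syslog and should be adjusted to your system. cluster_log_filter is a hypothetical helper, and --line-buffered assumes GNU grep.

```shell
# Pass through only syslog lines tagged with a cluster daemon name.
# The daemon list below is an assumption; edit it to match your release.
cluster_log_filter() {
    grep -E --line-buffered \
        '(fenced|clurgmgrd|openais|ccsd|dlm_controld|gfs_controld)(\[[0-9]+\])?:'
}
```

Usage: tail -f /var/log/messages | cluster_log_filter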

 

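1.6. Testing the fence device configuration (fence_node/fence_drac/…)

Use the fence_node command to test the fence configuration; it reads the fence device settings from cluster.conf. Note that fence_node really fences the named node, so run it only when that node is allowed to go down.

Common commands: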

/sbin/fence_node db1.fjnet114.com

/sbin/fence_node db2.fjnet114.com

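For each type of fence device, Red Hat provides a corresponding agent such as fence_drac or fence_ilo, which can be run directly from the command line with the fence device parameters for testing. The -o option specifies the action to perform, such as reboot, off, on or status; see man fence_drac for details.

For example: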

[root@db2 ~]# fence_drac -a 192.168.114.106 -l admin -p wlhmbst@2008 -o status

status: on

 

1.7. Manual cluster failover (clusvcadm)

The clusvcadm command allows you to enable, disable, relocate, and restart high-availability services in a cluster. For more information about this tool, refer to the clusvcadm(8) man page.

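There are many ways to test RHCS failover, such as unplugging a network cable or simulating a crash. During routine maintenance, however, a failover should be carried out with as little disruption to the system as possible, and the clusvcadm command serves that purpose.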

[root@db2 /]# clusvcadm -r wbdb_service -m db2.fjnet114.com
Trying to relocate service:wbdb_service to db2.fjnet114.com...Success
service:wbdb_service is now running on db2.fjnet114.com
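
For planned maintenance the same tool can also disable and re-enable a service. The following is a minimal sketch of a wrapper that only prints the clusvcadm command it would run unless RUN=1 is set. The wrapper name svc and the RUN variable are made up here; -d, -e, -r and -m are the options described in clusvcadm(8).

```shell
# svc ACTION SERVICE [MEMBER]
# Prints the clusvcadm command by default; executes it only when RUN=1.
# svc and RUN are hypothetical names used for this sketch.
svc() {
    action=${1:-} service=${2:-} member=${3:-}
    case "$action" in
        relocate) set -- clusvcadm -r "$service" ;;
        disable)  set -- clusvcadm -d "$service" ;;
        enable)   set -- clusvcadm -e "$service" ;;
        *) echo "usage: svc relocate|disable|enable SERVICE [MEMBER]" >&2
           return 2 ;;
    esac
    # -m names the target member; only meaningful for relocate and enable.
    if [ -n "$member" ] && [ "$action" != disable ]; then
        set -- "$@" -m "$member"
    fi
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}
```

For example, svc relocate wbdb_service db2.fjnet114.com prints the relocate command used above, and RUN=1 svc disable wbdb_service would actually disable the service.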

 

2.      IP port usage

Port Number           Protocol   Component
5404, 5405            UDP        cman (Cluster Manager)
11111                 TCP        ricci (part of Conga remote agent)
14567                 TCP        gnbd (Global Network Block Device)
16851                 TCP        modclusterd (part of Conga remote agent)
21064                 TCP        dlm (Distributed Lock Manager)
50006, 50008, 50009   TCP        ccsd (Cluster Configuration System daemon)
50007                 UDP        ccsd (Cluster Configuration System daemon)
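
If a firewall runs on the cluster nodes, these ports have to be reachable between them. The following is a minimal sketch that prints one iptables rule per port entry listed above. The INPUT chain and the 192.168.114.0/24 source network (derived from the node address shown earlier, with an assumed netmask) are assumptions, and the rules are only printed, not applied.

```shell
# Source network allowed to reach the cluster ports (assumed; override as needed).
CLUSTER_NET=${CLUSTER_NET:-192.168.114.0/24}

# Print one iptables command per protocol/port entry from the port list.
# print_cluster_rules is a hypothetical helper; nothing is applied.
print_cluster_rules() {
    while read -r proto ports; do
        echo "iptables -I INPUT -s $CLUSTER_NET -p $proto -m multiport --dports $ports -j ACCEPT"
    done <<'EOF'
udp 5404,5405
tcp 11111
tcp 14567
tcp 16851
tcp 21064
tcp 50006,50008,50009
udp 50007
EOF
}
```

Review the printed rules first; piping them to sh as root (print_cluster_rules | sh) would apply them.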

 

3.      Common failure analysis

• If a node in your cluster is repeatedly getting fenced, it means that one of the nodes in your cluster is not seeing enough "heartbeat" network messages from the node that is getting fenced. Most of the time, this is a result of flaky or faulty hardware, such as bad cables or bad ports on the network hub or switch. Test your communications paths thoroughly without the cluster software running to make sure your hardware is working correctly.

• If a node in your cluster is repeatedly getting fenced right at startup, it may be due to system activities that occur when a node joins a cluster. If your network is busy, your cluster may decide it is not getting enough heartbeat packets. To address this, you may have to increase the post_join_delay setting in your cluster.conf, for example from 3 to 60.
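
Before changing the value it is useful to see what is currently configured. The following is a minimal sketch that reads post_join_delay from a cluster.conf file, assuming the attribute sits on the fence_daemon element and that the element is written on a single line. get_post_join_delay is a hypothetical helper.

```shell
# Print the post_join_delay value found on the <fence_daemon .../> element.
# $1 is the config file; defaults to /etc/cluster/cluster.conf.
# Prints nothing when the attribute is not set explicitly.
get_post_join_delay() {
    sed -n 's/.*<fence_daemon[^>]*post_join_delay="\([0-9][0-9]*\)".*/\1/p' \
        "${1:-/etc/cluster/cluster.conf}"
}
```

After raising the value, the config_version attribute in cluster.conf also has to be incremented and the file propagated to all nodes.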