
###db01 message

Sep 29 15:29:27 db01 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it

Sep 29 15:29:27 db01 kernel: bonding: bond1: making interface eth2 the new active one.

Sep 29 15:29:31 db01 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Sep 29 15:29:31 db01 kernel: bonding: bond1: link status definitely up for interface eth3.

Sep 29 15:31:28 db01 kernel: igb: eth2 NIC Link is Down

Sep 29 15:31:28 db01 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it

Sep 29 15:31:28 db01 kernel: bonding: bond1: making interface eth3 the new active one.

Sep 29 15:31:28 db01 kernel: igb: eth3 NIC Link is Down

Sep 29 15:31:29 db01 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it

Sep 29 15:31:29 db01 kernel: bonding: bond1: now running without any active interface !

Sep 29 15:31:54 db01 kernel: igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Sep 29 15:31:54 db01 kernel: bonding: bond1: link status definitely up for interface eth2.

Sep 29 15:31:54 db01 kernel: bonding: bond1: making interface eth2 the new active one.

Sep 29 15:31:54 db01 kernel: bonding: bond1: first active interface up!

Sep 29 15:31:54 db01 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Sep 29 15:31:54 db01 kernel: bonding: bond1: link status definitely up for interface eth3.

Sep 29 15:36:10 db01 shutdown[17047]: shutting down for system reboot

Sep 29 15:36:11 db01 gconfd (root-6536): Received signal 15, shutting down cleanly

Sep 29 15:36:11 db01 gconfd (root-6536): Exiting

###db02 message

Sep 29 15:36:54 db02 kernel: igb: eth2 NIC Link is Down

Sep 29 15:36:54 db02 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it

Sep 29 15:36:54 db02 kernel: bonding: bond1: making interface eth3 the new active one.

Sep 29 15:36:55 db02 kernel: igb: eth3 NIC Link is Down

Sep 29 15:36:55 db02 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it

Sep 29 15:36:55 db02 kernel: bonding: bond1: now running without any active interface !

Sep 29 15:37:10 db02 kernel: igb: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX

Sep 29 15:37:10 db02 kernel: igb: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX

Sep 29 15:37:10 db02 kernel: bonding: bond1: link status definitely up for interface eth2.

Sep 29 15:37:10 db02 kernel: bonding: bond1: making interface eth2 the new active one.

Sep 29 15:37:10 db02 kernel: bonding: bond1: first active interface up!

Sep 29 15:37:10 db02 kernel: bonding: bond1: link status definitely up for interface eth3.

Problem analysis:

From the detailed log entries above, the sequence of events is clear: after the shutdown was issued on db01, the ocssd and crsd processes each notified the remote node that the local host was about to shut down, and the individual processes were then stopped.

On node 2, the messages log shows that the private-network interface bond1 had gone Down; then at 15:37, after node 1 had been shut down for its reboot, bond1 unexpectedly came back up on its own. The ocssd and crsd logs then show the cluster processes starting up.
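To follow those daemon logs directly, tailing the cssd and crsd log files is enough. The paths below assume a standard 11gR2 Grid Infrastructure layout with GRID_HOME under /u01/app/11.2.0/grid (an assumption, not taken from these hosts); substitute the real GRID_HOME and hostname:

tail -f /u01/app/11.2.0/grid/log/db02/cssd/ocssd.log   # CSS daemon log on node 2 (path assumed)
tail -f /u01/app/11.2.0/grid/log/db02/crsd/crsd.log    # CRS daemon log on node 2 (path assumed)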

At this point the problem can be narrowed down to the private interconnect, most likely something to do with the NIC bonding.
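Before touching the OS network, it is also worth confirming which subnet Clusterware has registered as the interconnect. Assuming the Grid Infrastructure owner is the grid user (as in the crsctl command used later), something like the following would show whether bond1's 10.10.11.0 network is marked as cluster_interconnect:

su - grid -c "oifcfg getif"    # lists each interface/subnet and whether it is public or cluster_interconnect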

Troubleshooting:

Since the logs point to a network problem, we started troubleshooting from the network. Once node 1 had been rebooted, the first check was a ping across the private network: node 1 came up, but the cluster services still did not start.

Pinging node 2's private interconnect address from node 1 fails:

[root@db01 ~]# ping pri02

PING pri02.xmtvdb (10.10.11.2) 56(84) bytes of data.

From pri01.xmtvdb (10.10.11.1) icmp_seq=193 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=194 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=195 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=197 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=198 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=199 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=201 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=202 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=203 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=204 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=205 Destination Host Unreachable

From pri01.xmtvdb (10.10.11.1) icmp_seq=206 Destination Host Unreachable

The bonding status on both nodes looks healthy, nothing obviously wrong:

###db01

[root@db01 ~]# cat /proc/net/bonding/bond1

Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)

Primary Slave: None

Currently Active Slave: eth2

MII Status: up

MII Polling Interval (ms): 100

Up Delay (ms): 0

Down Delay (ms): 0

Slave Interface: eth2

MII Status: up

Speed: 1000 Mbps

Duplex: full

Link Failure Count: 1

Permanent HW addr: 40:f2:e9:db:c9:c4

Slave Interface: eth3

MII Status: up

Speed: 1000 Mbps

Duplex: full

Link Failure Count: 1

Permanent HW addr: 40:f2:e9:db:c9:c5

###db02

[root@db02 ~]# cat /proc/net/bonding/bond1

Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)

Primary Slave: None

Currently Active Slave: eth2

MII Status: up

MII Polling Interval (ms): 100

Up Delay (ms): 0

Down Delay (ms): 0

Slave Interface: eth2

MII Status: up

Speed: 1000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: 40:f2:e9:db:c9:fc

Slave Interface: eth3

MII Status: up

Speed: 1000 Mbps

Duplex: full

Link Failure Count: 0

Permanent HW addr: 40:f2:e9:db:c9:fd
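The MII status in /proc/net/bonding only reflects carrier on each slave; it says nothing about how the links negotiated. As an extra cross-check (not part of the original session), ethtool can be run on each slave and its reported speed/duplex compared with the 100 Mbps negotiation that showed up in the db02 messages above:

ethtool eth2    # expect Speed: 1000Mb/s, Duplex: Full, Link detected: yes
ethtool eth3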

We then tried bringing down the eth3 slave of bond1 on node 2; after that the ping from node 1 succeeded and the cluster was able to start.

###db02

[root@db02 ~]# ifdown eth3

Sep 29 15:40:55 db02 kernel: bonding: bond1: Removing slave eth3

###db01

[root@db01 ~]# ping pri02

PING pri02.xmtvdb (10.10.11.2) 56(84) bytes of data.

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=1 ttl=64 time=0.071 ms

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=2 ttl=64 time=0.122 ms

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=3 ttl=64 time=0.134 ms

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=4 ttl=64 time=0.098 ms

At this point the cluster services also came up:

[root@db01 ~]# su - grid -c "crsctl status res -t"

--------------------------------------------------------------------------------

NAME          TARGET  STATE        SERVER                  STATE_DETAILS

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.BAK001.dg

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

ora.DATA001.dg

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

ora.FRA001.dg

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

ora.LISTENER.lsnr

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

ora.OCR_VOTE.dg

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

ora.asm

ONLINE  ONLINE      db01                    Started

ONLINE  ONLINE      db02                    Started

ora.gsd

OFFLINE OFFLINE      db01

OFFLINE OFFLINE      db02

ora.net1.network

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

ora.ons

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

ora.registry.acfs

ONLINE  ONLINE      db01

ONLINE  ONLINE      db02

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE      db01

ora.cvu

1        ONLINE  ONLINE      db01

ora.db01.vip

1        ONLINE  ONLINE      db01

ora.db02.vip

1        ONLINE  ONLINE      db02

ora.oc4j

1        ONLINE  ONLINE      db01

ora.scan1.vip

1        ONLINE  ONLINE      db01

ora.xmman.db

1        ONLINE  ONLINE      db01                    Open

2        ONLINE  ONLINE      db02                    Open

ora.xmman.taf.svc

1        ONLINE  ONLINE      db01

2        ONLINE  ONLINE      db02

Bringing eth3 back up afterwards did not break anything:

###db02

[root@db02 ~]# ifup eth3

###db01

[root@db01 ~]# ping pri02

PING pri02.xmtvdb (10.10.11.2) 56(84) bytes of data.

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=1 ttl=64 time=0.161 ms

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=2 ttl=64 time=0.022 ms

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=3 ttl=64 time=0.034 ms

64 bytes from pri02.xmtvdb (10.10.11.2): icmp_seq=4 ttl=64 time=0.196 ms
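After eth3 rejoins the bond, a quick way to confirm that eth2 is still the active slave and eth3 is back as a backup is to re-read the bonding status file shown earlier (a sanity check added here for completeness, not output captured at the time):

grep -E "Currently Active Slave|Slave Interface|MII Status" /proc/net/bonding/bond1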

Following Oracle best practice, we then re-cabled the two directly connected heartbeat links through a switch, after which the problem did not recur. The root cause remains unknown; if you know it, please share.
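For reference, an active-backup bond with a 100 ms MII polling interval, matching the /proc/net/bonding output above, is typically configured on RHEL/OEL along these lines. This is a generic sketch of what ifcfg-bond1 on db01 might look like, with the netmask assumed to be /24; it is not the actual configuration from the hosts:

# /etc/sysconfig/network-scripts/ifcfg-bond1 (illustrative)
DEVICE=bond1
IPADDR=10.10.11.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth2 (each slave configured the same way, illustrative)
DEVICE=eth2
MASTER=bond1
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none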
