1. The active MPU (1/13) of 9312-A suddenly developed a hardware fault, which caused the board to reset continuously.
Jan 19 2012 14:29:07 Quidway %%01CSSM/4/STACKBACKUP(l)[33]:This cluster CSS compete result is backup.
Jan 19 2012 14:29:15 Quidway %%01ALML/4/CLOCKFAULT(l)[50]:The "CLK_33M_CHK" sensor15 of MPU board[1/13] detect clock signal fault
Jan 19 2012 14:29:15 Quidway %%01ALML/4/CLOCKFAULT(l)[51]:The "CLK_125M_CHK" sensor16 of MPU board[1/13] detect clock signal fault
Jan 19 2012 14:29:15 Quidway %%01ALML/4/CLOCKFAULT_RESUME(l)[55]:The "CLK_125M_CHK" sensor16 of MPU board[1/13] detect clock signal fault resume
Jan 19 2012 14:29:15 Quidway %%01ALML/4/CLOCKFAULT(l)[56]:The "CLK_125M_CHK" sensor16 of MPU board[1/13] detect clock signal fault
Jan 19 2012 14:29:15 Quidway %%01ALML/3/CPU_RESET(l)[57]:The canbus node of MPU board[1/13] detects that CPU was reset.
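When working through logs like these, it can help to split each line into timestamp, host, module, severity, and mnemonic before comparing events. The following is a minimal Python sketch, assuming the layout seen in the lines above ("Mon DD YYYY HH:MM:SS host %%01MODULE/severity/MNEMONIC(l)[seq]:text"). The regular expression and the function name parse_line are illustrative assumptions, not part of any Huawei tool.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the log lines quoted in this note:
#   <Mon DD YYYY HH:MM:SS> <host> [%%01]<MODULE>/<severity>/<MNEMONIC>[(l)][[seq]]:<text>
LOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?:%%\d{2})?(?P<module>[A-Z0-9_]+)/(?P<severity>\d)/(?P<mnemonic>[A-Z0-9_]+)"
    r"(?:\([a-zA-Z]\))?(?:\[\d+\])?:\s*(?P<text>.*)$"
)


def parse_line(line):
    """Return a dict of header fields, or None if the line does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%b %d %Y %H:%M:%S")
    rec["severity"] = int(rec["severity"])
    return rec


sample = (
    "Jan 19 2012 14:29:15 Quidway %%01ALML/3/CPU_RESET(l)[57]:"
    "The canbus node of MPU board[1/13] detects that CPU was reset."
)
rec = parse_line(sample)
print(rec["ts"], rec["host"], rec["module"], rec["severity"], rec["mnemonic"])
```

Lines that were wrapped or truncated during capture will not match and come back as None, which makes them easy to spot and repair by hand.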
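2. The reset of that board caused the standby MPU (1/14) of 9312-A to reset abnormally as well. The 1/13 reset is the likely cause: we suspect that 1/13, resetting continuously, automatically rolled back to the old software version, leaving the active and standby MPUs on inconsistent versions, and that this mismatch triggered the reset.
V1R6 and later versions have fixed this problem.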
Jan 19 2012 14:29:41 Quidway %%01ALML/4/ENTRESET(l):MPU frame[1] board[14] is reset, The reason is: VRP reset selfboard because of find exception.
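3. At this point both MPUs in chassis 1 had reset, which split the stack. After the split, board 1/14 booted, and once it had finished starting up the stack merged again.
During the merge the whole of chassis 2 is reset; this is required by the stacking mechanism.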
Jan 19 2012 14:39:05 Quidway %%01ALML/4/ENTRESET(l):LPU frame[2] board[5] is reset, The reason is: Reset for CSS management.
Jan 19 2012 14:39:05 Quidway %%01ALML/4/ENTRESET(l):LPU frame[2] board[8] is reset, The reason is: Reset for CSS management.
Jan 19 2012 14:39:05 Quidway %%01ALML/4/ENTRESET(l):MPU frame[2] board[14] is reset, The reason is: Reset for CSS management.
Jan 19 2012 14:39:06 Quidway %%01ALML/4/ENTRESET(l):MPU frame[2] board[13] is reset, The reason is: Reset for CSS management.
4. In short, the 1/13 fault triggered the reset of board 1/14, and the 1/14 reset in turn triggered the reset of chassis 2.
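The reset chain above can be checked by grouping the ENTRESET entries by their stated reason. This is a minimal sketch that uses only the ENTRESET lines quoted in this note; the regular expression is an assumption based on their wording ("... frame[N] board[N] is reset, The reason is: ...").

```python
import re
from collections import defaultdict

# Assumed wording of ENTRESET entries, taken from the lines quoted in this note.
ENTRESET_RE = re.compile(
    r"(?P<kind>MPU|LPU) frame\[(?P<frame>\d+)\] board\[(?P<board>\d+)\] is reset[,.] "
    r"The reason is: (?P<reason>.+?)\.?$"
)

lines = [
    "Jan 19 2012 14:29:41 Quidway %%01ALML/4/ENTRESET(l):MPU frame[1] board[14] is reset, The reason is: VRP reset selfboard because of find exception.",
    "Jan 19 2012 14:39:05 Quidway %%01ALML/4/ENTRESET(l):LPU frame[2] board[5] is reset, The reason is: Reset for CSS management.",
    "Jan 19 2012 14:39:05 Quidway %%01ALML/4/ENTRESET(l):LPU frame[2] board[8] is reset, The reason is: Reset for CSS management.",
    "Jan 19 2012 14:39:05 Quidway %%01ALML/4/ENTRESET(l):MPU frame[2] board[14] is reset, The reason is: Reset for CSS management.",
    "Jan 19 2012 14:39:06 Quidway %%01ALML/4/ENTRESET(l):MPU frame[2] board[13] is reset, The reason is: Reset for CSS management.",
]

# Map each reset reason to the (board type, frame, board) tuples it affected.
by_reason = defaultdict(list)
for line in lines:
    m = ENTRESET_RE.search(line)
    if m:
        by_reason[m["reason"]].append((m["kind"], int(m["frame"]), int(m["board"])))

for reason, boards in by_reason.items():
    print(reason, "->", boards)
```

Grouped this way, the single exception-driven reset of 1/14 stands apart from the four chassis 2 resets that share the reason "Reset for CSS management", which matches the sequence described in items 2 to 4.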
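5. Upgrading to V1R6 should resolve the problems above. However, the logs show that stacking on the two chassis was not enabled until 01:23:21, and power alarms were raised at 01:49:17, 02:21:42, 02:12:29, and 02:32:05, so we suspect that someone manually powered off the entire chassis. Stacking was disabled at 02:29:13 and never enabled again, so the devices have been running as standalone chassis ever since.
Detailed analysis: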
Chassis B ran standalone; stacking was not enabled on it until 01:23 on Jan 20.
Jan 19 2012 21:13:18 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="startup system-software cfcard:/s9300v100r006c00spc800.cc slave-board")
Jan 19 2012 21:13:22 Quidway %%01SHELL/6/DISPLAY_CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="display startup")
Jan 19 2012 21:13:35 Quidway %%01SHELL/6/DISPLAY_CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="display current-configuration")
Jan 19 2012 21:13:40 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="system-view")
Chassis A: stacking was not enabled until 01:08:53 on Jan 20.
Jan 19 2012 22:01:01 Quidway BASETRAP/4/CPUUSAGERESUME:OID 1.3.6.1.4.1.2011.5.25.129.2.4.2 CPU utilization resumed from exceeding the pre-alarm threshold.(Index=70516745, BaseUsagePhyIndex=0, UsageType=1, UsageIndex=0, Severity=6, ProbableCause=154, EventType=4, PhysicalName=\"MPU Board 13\UsageUnit=1, UsageThreshold=80)
Jan 19 2012 22:01:07 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="get S9300V100R006C00SPC800.CC")
Jan 19 2012 22:02:31 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="quit")
Jan 19 2012 22:02:32 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="dir")
Chassis A log:
Jan 20 2012 01:04:46 Quidway BASETRAP/4/CPUUSAGERESUME:OID 1.3.6.1.4.1.2011.5.25.129.2.4.2 CPU utilization recovered to the normal range.(Index=68419593, BaseUsagePhyIndex=0, UsageType=1, UsageIndex=0, Severity=6, ProbableCause=154, EventType=4,PhysicalName=LPU Board 5, RelativeResource=\"\
Jan 20 2012 01:08:19 Quidway %%01SHELL/6/DISPLAY_CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="display device")
Jan 20 2012 01:08:47 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="system-view")
Jan 20 2012 01:08:53 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="css enable")
Chassis B log:
Jan 20 2012 01:23:21 SwitchB %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="save")
Jan 20 2012 01:23:22 SwitchB %%01HWCM/5/TRAPLOG(l):OID 1.3.6.1.4.1.2011.6.10.2.1 configure changed. (EventIndex=9, CommandSource=1, ConfigSource=2, ConfigDestination=4)
Jan 20 2012 01:23:26 SwitchB %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="system-view")
Jan 20 2012 01:23:28 SwitchB %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="css enable")
Jan 20 2012 01:23:29 SwitchB %%01VFS/5/DEV_UNREG(l):Device slave#flash: unregistration finished.
Jan 20 2012 01:23:29 SwitchB %%01VFS/5/DEV_UNREG(l):Device slave#cfcard: unregistration finished.
Chassis B log:
Jan 20 2012 01:26:08 SwitchA %%01CSSM/4/STACKBACKUP(l)[326]:This cluster CSS compete result is backup.  (note: elected as the backup chassis)
Jan 20 2012 01:56:59 SwitchB %%01CSSM/4/STACKMASTER(l):This cluster CSS compete result is master.
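Chassis A log:
Jan 20 2012 01:25:15 SwitchA %%01CSSM/4/STACKMASTER(l):This cluster CSS compete result is master.  (note: elected as the master chassis)
Self slot:25, CSS status: master
Master:[1,25], backup:[2,27]
At 01:49, slot 25 lost power. A master/backup switchover took place, and chassis B became the master.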
Jan 20 2012 01:49:17 SwitchA %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR1] detects a fault.
Jan 20 2012 01:49:17 SwitchA %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR2] detects a fault.
Jan 20 2012 01:49:18 SwitchA %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR3] detects a fault.
%2012-Jan-20 01:56:29.790.2 SwitchA 01SOURCE/6/TASKREGSUC(D)[64]:Succeed to create framework task LSPMLsp management.
===== current int switch info (slot: 25) =====
Reset reason is power off(after reset), StartKind is Cold Reset.
%2012-Jan-20 01:56:29.790.3 SwitchA 01SOURCE/6/TASKREGSUC(D)[65]:Succeed to create framework task RSVP task.
Jan 20 2012 01:57:52 SwitchB %%01CSSM/4/STACKBACKUP(l)[333]:This cluster CSS compete result is backup.
The power problem caused the chassis to be re-elected as the backup.
Jan 20 2012 02:13:02 SwitchB %%01ALML/4/ENTRESET(l):MPU frame[1] board[13] is reset. The reason is: Reset for no heart.
Jan 20 2012 02:12:29 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR1] detects a fault.
Jan 20 2012 02:12:29 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR2] detects a fault.
Jan 20 2012 02:12:30 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR3] detects a fault.
Jan 20 2012 02:19:02 SwitchB %%01CSSM/4/STACKBACKUP(l):This cluster CSS compete result is backup.
B: power lost
Jan 20 2012 02:21:42 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME2/PWR1] detects a fault.
Jan 20 2012 02:21:43 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME2/PWR2] detects a fault.
Jan 20 2012 02:21:44 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME2/PWR3] detects a fault.
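The suspicion of a manual whole-chassis power-off rests on all three power modules of a frame reporting a fault within a second or two of each other. A minimal sketch of that check follows, using the IOFAULT lines quoted in this note. The 5-second grouping window and the function name power_bursts are assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed wording of IOFAULT entries, taken from the lines quoted in this note.
IOFAULT_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) (?P<host>\S+) .*IOFAULT.*"
    r"\[FRAME(?P<frame>\d+)/PWR(?P<psu>\d+)\] detects a fault"
)


def power_bursts(lines, window=timedelta(seconds=5)):
    """Group PSU faults on the same host and frame that occur close together."""
    events = []
    for line in lines:
        m = IOFAULT_RE.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S")
            events.append((ts, m["host"], int(m["frame"]), int(m["psu"])))
    events.sort()
    bursts = []
    for ts, host, frame, psu in events:
        last = bursts[-1] if bursts else None
        if last and last["host"] == host and last["frame"] == frame and ts - last["end"] <= window:
            last["psus"].add(psu)  # same burst: another PSU of the same frame
            last["end"] = ts
        else:
            bursts.append({"start": ts, "end": ts, "host": host, "frame": frame, "psus": {psu}})
    return bursts


lines = [
    'Jan 20 2012 01:49:17 SwitchA %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR1] detects a fault.',
    'Jan 20 2012 01:49:17 SwitchA %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR2] detects a fault.',
    'Jan 20 2012 01:49:18 SwitchA %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR3] detects a fault.',
    'Jan 20 2012 02:12:29 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR1] detects a fault.',
    'Jan 20 2012 02:12:29 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR2] detects a fault.',
    'Jan 20 2012 02:12:30 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME1/PWR3] detects a fault.',
    'Jan 20 2012 02:21:42 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME2/PWR1] detects a fault.',
    'Jan 20 2012 02:21:43 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME2/PWR2] detects a fault.',
    'Jan 20 2012 02:21:44 SwitchB %%01ALML/4/IOFAULT(l):The "AC MODE PROTEC" sensor3 of [FRAME2/PWR3] detects a fault.',
]

bursts = power_bursts(lines)
for b in bursts:
    print(b["start"], b["host"], "frame", b["frame"], "PSUs", sorted(b["psus"]))
```

Every burst in the quoted lines covers PWR1 to PWR3 of one frame. That pattern is consistent with the whole frame losing input power at once rather than with a single failed module, although the logs alone cannot prove the power-off was manual.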
A: the CSS ports went link-down
Jan 20 2012 02:21:46 SwitchB CSSM/4/STACKLINKDOWN:OID 1.3.6.1.4.1.2011.5.25.183.3.3.2.1 1/13 CSS port 1 down.
Jan 20 2012 02:21:46 SwitchB CSSM/4/STACKLINKDOWN:OID 1.3.6.1.4.1.2011.5.25.183.3.3.2.1 1/13 CSS port 3 down.
Stacking disabled
B:
Jan 20 2012 02:29:13 SwitchB %%01RSVP/7/SND_HA_BATCHBK_OVER(l):Sent batch backup end event to HA.
Jan 20 2012 02:29:15 SwitchB %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="system-view")
Jan 20 2012 02:29:18 SwitchB %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="undo css enable")
A:
Jan 20 2012 02:23:52 Quidway %%01HWCM/5/TRAPLOG(l):OID 1.3.6.1.4.1.2011.6.10.2.1 configure changed. (EventIndex=1, CommandSource=3, ConfigSource=4, ConfigDestination=2)
Jan 20 2012 02:23:54 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="system-view")
Jan 20 2012 02:24:03 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="undo css enable")
From then on, both chassis ran as standalone chassis.
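The enable/disable sequence can be reconstructed from the CMDRECORD entries alone. The sketch below is a minimal example that uses the four "css enable" / "undo css enable" lines quoted in this note; the regular expression assumes the Command="..." field format shown there, and the host names are whatever sysname each log line carries.

```python
import re
from datetime import datetime

# Assumed CMDRECORD layout, taken from the lines quoted in this note.
CMD_RE = re.compile(
    r'^(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) (?P<host>\S+) '
    r'.*CMDRECORD.*Command="(?P<cmd>[^"]*)"'
)


def css_timeline(lines):
    """Return sorted (timestamp, host, enabled) tuples for CSS enable/disable commands."""
    out = []
    for line in lines:
        m = CMD_RE.search(line)
        if m and m["cmd"] in ("css enable", "undo css enable"):
            ts = datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S")
            out.append((ts, m["host"], m["cmd"] == "css enable"))
    return sorted(out)


lines = [
    'Jan 20 2012 01:08:53 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="css enable")',
    'Jan 20 2012 01:23:28 SwitchB %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="css enable")',
    'Jan 20 2012 02:29:18 SwitchB %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="undo css enable")',
    'Jan 20 2012 02:24:03 Quidway %%01SHELL/5/CMDRECORD(l):Record command information. (Task=co0 , Ip=**, User=**, Command="undo css enable")',
]

timeline = css_timeline(lines)

# The last command seen per host gives the CSS state that was left configured.
final_state = {}
for ts, host, enabled in timeline:
    final_state[host] = enabled
    print(ts, host, "css enable" if enabled else "undo css enable")
print(final_state)
```

For the quoted lines, the last command recorded on each host is "undo css enable", which agrees with the conclusion that both chassis ended up running standalone.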