新聞中心
近年來,我們專注于提供全系列企業級性能管理方案和相關的IT服務,在幫助用戶提高業務效率和整體生產力的同時,降低運營和運維成本。
返回列表
首頁 / 新聞資訊 / 行業資訊
干貨 | Controlfile Sequence Number耗盡引起的控制文件損壞案例
來源:   日期:2018-06-26

作者:趙世雄 | 東方龍馬 · 廣州分公司

 

我們通常碰到過的ORACLE數據庫控制文件損壞情況有哪些呢?你是否碰到過controlfile sequence number值被耗盡而引起損壞的情況?下面我來跟大家講述一個東方龍馬廣州團隊曾經碰到過的一個案例。



背景



在2017年國慶小長假快結束的時候,東方龍馬廣州團隊接到了一個客戶的緊急報障電話,客戶電話里急促的說道:"我們數據庫告警日志出現控制文件損壞的報錯",我們的工程師立即遠程連到客戶系統,查看告警日志:




Completed checkpoint up to RBA [0x1dd32.2.10], SCN: 10184774533893

Fri Oct 06 12:08:31 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/xxx/trace/xxx_ora_37874.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Fri Oct 06 12:08:31 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_m000_32185.trc:

ORA-00235: control file read without a lock inconsistent due to concurrent update

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_37874.trc  (incident=4273683):

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Incident

details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4273683/XXX_ora_37874_i4273683.trc

Fri Oct 06 12:08:32 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc  (incident=4275459):

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Incident details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4275459/XXX_ora_70265_i4275459.trc

.........................內容較多,省略中間部分內容

.........................內容較多,省略中間部分內容

Fri Oct 06 13:19:00 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'




告警日志里,我們看到ctrl01.ctl控制文件持續出現損壞報錯,本以為用ctrl02.ctl控制文件copy一份就能修復ctrl01.ctl控制文件。心里嘀咕著:這是個小問題的時候,數據庫突然宕掉了。




Fri Oct 06 13:19:00 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

LGWR (ospid: 49423): terminating the instance due to error 227

Fri Oct 06 13:19:15 2017

opiodr aborting process unknown ospid (71057) as a result of ORA-1092

Fri Oct 06 13:19:17 2017

System state dump requested by (instance=1, osid=49423 (LGWR)), summary=[abnormal instance termination].

Fri Oct 06 13:19:25 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:26 2017

opiodr aborting process unknown ospid (71043) as a result of ORA-1092

System State dumped to trace file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_diag_49394.trc

Fri Oct 06 13:19:27 2017

opiodr aborting process unknown ospid (71045) as a result of ORA-1092

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:27 2017

opiodr aborting process unknown ospid (71047) as a result of ORA-1092

Fri Oct 06 13:22:47 2017

opiodr aborting process unknown ospid (71055) as a result of ORA-1092

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Instance terminated by LGWR, pid = 49423




心里一陣拔涼,問題沒那么簡單,難道ctrl02.ctl控制文件也損壞了?

確實正如我們工程師所料,使用ctrl02.ctl控制文件重啟數據庫的時候,報ctrl02.ctl也損壞。


微信圖片_20180626150309.jpg



解決方案



此系統是客戶的核心系統,先幫客戶恢復系統再分析原因,于是利用備份的控制文件來修復控制文件。




$mv /database/XXX/XXX1/XXX/ctrl01.ctl /database/XXX/XXX1/XXX/ctrl01.ctl_bak

$mv /database/XXX/XXX2/XXX/ctrl02.ctl /database/XXX/XXX2/XXX/ctrl02.ctl_bak

RMAN>restore controlfile from '/backup/ctl.bak';




系統恢復后,開始查找原因。通常碰到控制文件損壞,我們的第一反應就是存儲是不是出問題了。于是查看操作系統日志,沒發現任何異常。在定位原因的過程時,控制文件再次損壞。我們通過把控制文件創建到本地硬盤、切換到備機、禁用HA Cluseter測試、斷開容災鏈路測試等一系列測試,問題依舊。在上面這些測試期間,反復跟原廠溝通確認是否是Bug 20324049引起,原廠回復比較含糊,不能百分百肯定。




Bug 20324049  ORA-227 Controlfile Corruption when reaching Maximum Value for Control Seq kccfhcsq


 This note gives a brief overview of bug 20324049. 

 The content was last updated on: 17-APR-2018

 Click here for details of each of the sections below.



Affects:

Product (Component)

Oracle Server (Rdbms)

Range of versions believed to be affected

(Not specified)

Versions confirmed as being affected

12.1.0.2 (Server Patch Set)

11.2.0.4

10.2.0.5

8.1.7.4

Platforms affected

Generic (all / most platforms affected)


Fixed:

The fix for 20324049 is first included in

18.1.0

12.2.0.1.170919 (Sep 2017) Database Release Update (DB RU)

12.2.0.1.DBOCT2017RUR:180417 (Apr 2018) Database Release Update Revision(DB RUR)

12.2.0.1.DBJAN2018RUR:180417(Apr 2018) Database Release Update Revision(DB RUR)

12.1.0.2.180116 (Jan 2018) Database Patch Set Update (DB PSU)

12.1.0.2.180116 (Jan 2018) Database Proactive Bundle Patch

11.2.0.4.180116 (Jan 2018) Database Patch Set Update (DB PSU)

11.2.0.4.180116 (Jan 2018) Exadata Database Bundle Patch

12.2.0.1.171017 (Oct 2017) Bundle Patch for Windows Platforms

12.1.0.2.180116 (Jan 2018) Bundle Patch for Windows Platforms

11.2.0.4.180116 (Jan 2018) Bundle Patch for Windows Platforms



Interim patches may be available for earlier versions - click here to check.

Symptoms:

Corruption (Logical)

Instance May Crash

HCHECK script reports this issue

Error May Occur

ORA-227 / HCKE-50


Related To:

Instance Startup




Description

The controlfile sequence number is not expected to increment so

 rapidly that it would ever reach the architectural limit during

 the database lifetime, but if the controlfile sequence number

 does reach the limit for some reason, then processes will fail

 with ORA-227 errors causing the instance to terminate, then after

 this, the database cannot be mounted again because the FG process

 doing the mount will always fail with the same ORA-00227 error.

 This fix writes regular warning messages to the alert log in the

 case where the controlfile sequence number approaches the

 architectural limit. And this fix also provides an event which

 can be set to modify the behavior of "CREATE CONTROLFILE" with

 "NORESETLOGS" so that it will reset the controlfile sequence

 number to 1 in the datafiles, the online redo logs, and the

 new controlfile.

 

If hcheck in Note:136697.1 is run, it may detect this issue

with error HCKE-0050, see Note:2128446.1.

 

Rediscovery Notes

Look for all the following:

 - various processes start reporting the following error:

     ORA-227: corrupt block detected in control file: (block 1,# blocks 1)

 - a common stack trace for the ORA-227 errors might be this:

     kcvsursuht kcc_begin_txn_internal kccocx kccchb kcccsi ...

 - then a fatal background process gets an ORA-227 error and terminates the instance

 - then the database cannot be mounted again

 - do a hex dump of the controlfile via

    $ xxd -g4 <controlfile>  > cf.hexdump

   and search forward in the hexdump for the first occurrence

   of the database name, eg "R12B" in the example below:

    ...

    0004000: 15c20000 01000000 00000000 00000104  ................

    0004010: 59320000 00000000 0000200b 24e7e472  Y2........ .$..r

    0004020: 52313242 00000000 00000000 00020000  R12B............

    ...        ^^^^^^^^

   the 3rd 32-byte word in the row with the database name is

   the controlfile sequence number, it will be zero (but it

   should never be zero, and it's the reason for the ORA-227's)

 - an "xxd" hex dump of a restored recent backup of the controlfile

   should show the controlfile sequence number is some high value

   eg e1ffffff (which is 0xffffffe1 after endian conversion).

 

Workaround

There is no workaround for the ORA-227 problem after it has

happened.

It is possible to proactively prevent the ORA-227 problem from

happening (before it happens) in the specific case where the

controlfile seq# is incrementing very quickly (eg many times per

second) due to storing of last scn/time of nologging operations

in the controlfile; in this specific case, the workaround is to

set:  db_unrecoverable_scn_tracking=FALSE

 

Solution:

After installing this fix:

1. verify the controlfile sequence# is at or above 0xFF000000:

set numwidth 15

select max(FHCSQ)

from x$kcvfh;

2. Generate Trace file to recreate the controlfile:

    alter database backup controlfile to trace noresetlogs;

3. shutdown

4. startup nomount

5. alter session set events '20324049 trace name context forever, level 1';

6. execute the commands in the tracefile generated by step#2

7. alter session set events '20324049 trace name context off';

8. confirm the controlfile sequence# is now low with the same query in 1.

 

Take new backups after applying this solution, otherwise recovery will not be possible

failing with these errors:

 

 ORA-00283: recovery session canceled due to errors

 ORA-01122: database file 1 failed verification check

 ORA-01110: data file 1: '/oracle/dbs/t_db1.f'

 ORA-01207: file is more recent than control file - old control file



此系統的數據庫是11.2.0.3版本,Bug 20324049描述影響的版本不涉及到11.2.0.3版本,由于主機、存儲上已經做了各種排查也未能發現原因跟其相關。于是我們建議把數據庫升級到11.2.0.4版本,并打上Bug 20324049補丁,并重建控制文件,把controlfile sequence number的通過此補丁壓回值到1。  



CREATE CONTROLFILE is being performed with event 20324049 set, this will now reset the controlfile sequence number to 1 in the datafiles, the online redo logs, and the new controlfile.

datafile 1 header updated: controlfile sequence# reset to 1

datafile 2 header updated: controlfile sequence# reset to 1

datafile 3 header updated: controlfile sequence# reset to 1

datafile 4 header updated: controlfile sequence# reset to 1

datafile 5 header updated: controlfile sequence# reset to 1

datafile 6 header updated: controlfile sequence# reset to 1

datafile 7 header updated: controlfile sequence# reset to 1

datafile 8 header updated: controlfile sequence# reset to 1

datafile 9 header updated: controlfile sequence# reset to 1

datafile 10 header updated: controlfile sequence# reset to 1

datafile 11 header updated: controlfile sequence# reset to 1

datafile 12 header updated: controlfile sequence# reset to 1

..........................................................

..........................................................

datafile 344 header updated: controlfile sequence# reset to 1

datafile 345 header updated: controlfile sequence# reset to 1

datafile 346 header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo01-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo01-2.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo02-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo02-2.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo03-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo03-2.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo04-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo04-2.log header updated: controlfile sequence# reset to 1 



當然,我們做這個升級是先利用一份存儲鏡像出來的數據進行測試,測試一切都進行順利,且經過應用一個周期的測試,控制文件未出現損壞。于是,我們在國慶小長假結束前的凌晨立即對生產庫進行修復。通過東方龍馬廣州團隊的協力合作,終于在國慶小長假結束前幫客戶恢復了系統的正常使用。


微信圖片_20180626150248.jpg


關于controlfile sequence number 


OERR: ORA-00227 corrupt block detected in controlfile: (block %s, # blocks %s) Master Note / Troubleshooting, Diagnostic and Solution (文檔 ID 48808.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 8.0.6.0 to 12.1.0.2 [Release 8.0.6 to 12.1]

Information in this document applies to any platform.

PURPOSE

This article provides information about error ORA-00227 and possible actions.

SCOPE

 This note is intended for general audience as initial starting point for beginning diagnosis of ORA-00227.

DETAILS


Error:   ORA 227

Text:    corrupt block detected in controlfile: (block %s, # blocks %s) 

..........................................................


Cause:  A block header corruption or checksum error was detected on reading the controlfile. 

Action: Use the CREATE CONTROLFILE or RECOVER DATABASE USING BACKUP CONTROLFILE command.

 

Cause

The controlfile is corrupted. This could be caused by a problem external to Oracle like a Hardware/OS problem or an Oracle Defect.


Identify if Bug 20324049 may be causing the error ORA-227

If Bug 20324049 is causing the error ORA-227; then the maximum sequence number in the Database will be close to the maximum value (4294967295):

set numwidth 15

select max(FHCSQ)

case trunc(max(FHCSQ)/4294967295,1) 

when 0.9 then 'WARNING: Reference Bug 20324049' 

else 'NO Warning' end "Bug 20324049"

from x$kcvfh;


控制文件序列值(controlfile sequence number)最大值為:4294967295(即43億左右,2的32次方減1),可以利用上面的語句檢查你的系統是否存在控制文件序列值超過極限值的風險。

 

Take a backup of the control files as they are now

Take a backup of the existent control files with a regular copy (cp) or any other mechanism.  This can be used for future diagnosis and in case they are needed for additional recovery.

 

Identify which control file is failing

ORA-00227 is normally accompanied by ORA-202 which prints the affected control file name.  Review the alert log for more details.

 

Execute DBVERIFY on all Control Files

DBVERIFY is a tool that is intended to identify corruptions in Datafiles but in some cases may help to identify block corruptions in Control Files. 


Identify control file block size by executing dbfsize. Example:

$ dbfsize /oradata/controlfile/control1.ctl


Database file: /oradata/controlfile/control1.ctl

Database file type: file system

Database file size: 614 16384 byte blocks

In this case the control file block size is 16384.

Alternatively identify the control file block size by querying view V$CONTROLFILE.


Execute DBVERIFY on each Control File copy using the block size identified in 1.1.  Example:

dbv file=/oradata/controlfile/control1.ctl blocksize=16384

 

Solution

Solution 1. Use another mirror copy of the control file

If it is determined that the control file is damaged (the corruption is persistent) and if the database is down, then take one of the other control files from the control_files parameter and copy it over the bad control file noted above or startup the database with one control file at a time.  


Try opening the database and if error persists then go to Solution 2:

Solution 2. Recreate the control file

Use the next article to recreate the control file:

Doc ID 735106.1 How to Recreate a Controlfile

Solution 3. Restore a backup of the control file and apply media recovery

Restore a backup of control file and apply media recovery using RECOVER DATABASE USING BACKUP CONTROLFILE

If the database is on IBM AIX, make sure to review the issue described in Note 2237498.1



觸發控制文件序列值增長的情況有


發生檢查點

日志文件的切換

歸檔online redolog

運行崩潰后的恢復

熱備的開始和結束

DML通過nologging等選項執行對象時

對大象進行直接DML操作等

還有其他你能想到的方面嗎?

歡迎補充








微信圖片_20180626150221.jpg


|  北京    |    上海    |   廣州    |   成都    |


4008-906-960



微信圖片_20180626150229.png



4008-906-960

全國免費咨詢電話
  • 官方微博
  • 官方微信
Copyright 1998-2016 版權所有 北京東方龍馬軟件發展有限公司 京ICP備14000200號-1
上海时时彩加盟 股市对冲基金赚钱 千炮捕鱼达人 麻将上分模式 qq群排名什么赚钱 海南琼崖麻将下载 躺着赚钱怎么弄 宁波七百搭麻将下截 2018什么手游能赚钱人民币 新浪爱彩苹果 手机版大话西游赚钱不 电商微商到底怎么赚钱 大学的食堂怎么赚钱吗 地下城勇士怎么赚钱 2017刷材料赚钱游戏 如意彩票游戏 传世世界3d赚钱