作者:趙世雄 | 東方龍馬 · 廣州分公司
我們通常碰到過的ORACLE數據庫控制文件損壞情況有哪些呢?你是否碰到過controlfile sequence number值被耗盡而引起損壞的情況?下面我來跟大家講述一個東方龍馬廣州團隊曾經碰到過的一個案例。
在2017年國慶小長假快結束的時候,東方龍馬廣州團隊接到了一個客戶的緊急報障電話,客戶電話里急促的說道:"我們數據庫告警日志出現控制文件損壞的報錯",我們的工程師立即遠程連到客戶系統,查看告警日志:
Completed checkpoint up to RBA [0x1dd32.2.10], SCN: 10184774533893
Fri Oct 06 12:08:31 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/xxx/trace/xxx_ora_37874.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Fri Oct 06 12:08:31 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_m000_32185.trc:
ORA-00235: control file read without a lock inconsistent due to concurrent update
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_37874.trc (incident=4273683):
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Incident
details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4273683/XXX_ora_37874_i4273683.trc
Fri Oct 06 12:08:32 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc (incident=4275459):
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Incident details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4275459/XXX_ora_70265_i4275459.trc
.........................內容較多,省略中間部分內容
.........................內容較多,省略中間部分內容
Fri Oct 06 13:19:00 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
告警日志里,我們看到ctrl01.ctl控制文件持續出現損壞報錯,本以為用ctrl02.ctl控制文件copy一份就能修復ctrl01.ctl控制文件。心里嘀咕著:這是個小問題的時候,數據庫突然宕掉了。
Fri Oct 06 13:19:00 2017
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:
ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)
ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'
LGWR (ospid: 49423): terminating the instance due to error 227
Fri Oct 06 13:19:15 2017
opiodr aborting process unknown ospid (71057) as a result of ORA-1092
Fri Oct 06 13:19:17 2017
System state dump requested by (instance=1, osid=49423 (LGWR)), summary=[abnormal instance termination].
Fri Oct 06 13:19:25 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:26 2017
opiodr aborting process unknown ospid (71043) as a result of ORA-1092
System State dumped to trace file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_diag_49394.trc
Fri Oct 06 13:19:27 2017
opiodr aborting process unknown ospid (71045) as a result of ORA-1092
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:27 2017
opiodr aborting process unknown ospid (71047) as a result of ORA-1092
Fri Oct 06 13:22:47 2017
opiodr aborting process unknown ospid (71055) as a result of ORA-1092
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Fri Oct 06 13:19:27 2017
ORA-1092 : opitsk aborting process
Instance terminated by LGWR, pid = 49423
心里一陣拔涼,問題沒那么簡單,難道ctrl02.ctl控制文件也損壞了?
確實正如我們工程師所料,使用ctrl02.ctl控制文件重啟數據庫的時候,報ctrl02.ctl也損壞。
此系統是客戶的核心系統,先幫客戶恢復系統再分析原因,于是利用備份的控制文件來修復控制文件。
$mv /database/XXX/XXX1/XXX/ctrl01.ctl /database/XXX/XXX1/XXX/ctrl01.ctl_bak
$mv /database/XXX/XXX2/XXX/ctrl02.ctl /database/XXX/XXX2/XXX/ctrl02.ctl_bak
RMAN>restore controlfile from '/backup/ctl.bak';
系統恢復后,開始查找原因。通常碰到控制文件損壞,我們的第一反應就是存儲是不是出問題了。于是查看操作系統日志,沒發現任何異常。在定位原因的過程時,控制文件再次損壞。我們通過把控制文件創建到本地硬盤、切換到備機、禁用HA Cluseter測試、斷開容災鏈路測試等一系列測試,問題依舊。在上面這些測試期間,反復跟原廠溝通確認是否是Bug 20324049引起,原廠回復比較含糊,不能百分百肯定。
Bug 20324049 ORA-227 Controlfile Corruption when reaching Maximum Value for Control Seq kccfhcsq
This note gives a brief overview of bug 20324049.
The content was last updated on: 17-APR-2018
Click here for details of each of the sections below.
Affects:
Product (Component) | Oracle Server (Rdbms) |
Range of versions believed to be affected | (Not specified) |
Versions confirmed as being affected | 12.1.0.2 (Server Patch Set) 11.2.0.4 10.2.0.5 8.1.7.4 |
Platforms affected | Generic (all / most platforms affected) |
Fixed:
The fix for 20324049 is first included in | 18.1.0 12.2.0.1.170919 (Sep 2017) Database Release Update (DB RU) 12.2.0.1.DBOCT2017RUR:180417 (Apr 2018) Database Release Update Revision(DB RUR) 12.2.0.1.DBJAN2018RUR:180417(Apr 2018) Database Release Update Revision(DB RUR) 12.1.0.2.180116 (Jan 2018) Database Patch Set Update (DB PSU) 12.1.0.2.180116 (Jan 2018) Database Proactive Bundle Patch 11.2.0.4.180116 (Jan 2018) Database Patch Set Update (DB PSU) 11.2.0.4.180116 (Jan 2018) Exadata Database Bundle Patch 12.2.0.1.171017 (Oct 2017) Bundle Patch for Windows Platforms 12.1.0.2.180116 (Jan 2018) Bundle Patch for Windows Platforms 11.2.0.4.180116 (Jan 2018) Bundle Patch for Windows Platforms |
Interim patches may be available for earlier versions - click here to check.
Symptoms:
Corruption (Logical)
Instance May Crash
HCHECK script reports this issue
Error May Occur
ORA-227 / HCKE-50
Related To:
Instance Startup
Description
The controlfile sequence number is not expected to increment so
rapidly that it would ever reach the architectural limit during
the database lifetime, but if the controlfile sequence number
does reach the limit for some reason, then processes will fail
with ORA-227 errors causing the instance to terminate, then after
this, the database cannot be mounted again because the FG process
doing the mount will always fail with the same ORA-00227 error.
This fix writes regular warning messages to the alert log in the
case where the controlfile sequence number approaches the
architectural limit. And this fix also provides an event which
can be set to modify the behavior of "CREATE CONTROLFILE" with
"NORESETLOGS" so that it will reset the controlfile sequence
number to 1 in the datafiles, the online redo logs, and the
new controlfile.
If hcheck in Note:136697.1 is run, it may detect this issue
with error HCKE-0050, see Note:2128446.1.
Rediscovery Notes
Look for all the following:
- various processes start reporting the following error:
ORA-227: corrupt block detected in control file: (block 1,# blocks 1)
- a common stack trace for the ORA-227 errors might be this:
kcvsursuht kcc_begin_txn_internal kccocx kccchb kcccsi ...
- then a fatal background process gets an ORA-227 error and terminates the instance
- then the database cannot be mounted again
- do a hex dump of the controlfile via
$ xxd -g4 <controlfile> > cf.hexdump
and search forward in the hexdump for the first occurrence
of the database name, eg "R12B" in the example below:
...
0004000: 15c20000 01000000 00000000 00000104 ................
0004010: 59320000 00000000 0000200b 24e7e472 Y2........ .$..r
0004020: 52313242 00000000 00000000 00020000 R12B............
... ^^^^^^^^
the 3rd 32-byte word in the row with the database name is
the controlfile sequence number, it will be zero (but it
should never be zero, and it's the reason for the ORA-227's)
- an "xxd" hex dump of a restored recent backup of the controlfile
should show the controlfile sequence number is some high value
eg e1ffffff (which is 0xffffffe1 after endian conversion).
Workaround
There is no workaround for the ORA-227 problem after it has
happened.
It is possible to proactively prevent the ORA-227 problem from
happening (before it happens) in the specific case where the
controlfile seq# is incrementing very quickly (eg many times per
second) due to storing of last scn/time of nologging operations
in the controlfile; in this specific case, the workaround is to
set: db_unrecoverable_scn_tracking=FALSE
Solution:
After installing this fix:
1. verify the controlfile sequence# is at or above 0xFF000000:
set numwidth 15
select max(FHCSQ)
from x$kcvfh;
2. Generate Trace file to recreate the controlfile:
alter database backup controlfile to trace noresetlogs;
3. shutdown
4. startup nomount
5. alter session set events '20324049 trace name context forever, level 1';
6. execute the commands in the tracefile generated by step#2
7. alter session set events '20324049 trace name context off';
8. confirm the controlfile sequence# is now low with the same query in 1.
Take new backups after applying this solution, otherwise recovery will not be possible
failing with these errors:
ORA-00283: recovery session canceled due to errors
ORA-01122: database file 1 failed verification check
ORA-01110: data file 1: '/oracle/dbs/t_db1.f'
ORA-01207: file is more recent than control file - old control file
此系統的數據庫是11.2.0.3版本,Bug 20324049描述影響的版本不涉及到11.2.0.3版本,由于主機、存儲上已經做了各種排查也未能發現原因跟其相關。于是我們建議把數據庫升級到11.2.0.4版本,并打上Bug 20324049補丁,并重建控制文件,把controlfile sequence number的通過此補丁壓回值到1。
CREATE CONTROLFILE is being performed with event 20324049 set, this will now reset the controlfile sequence number to 1 in the datafiles, the online redo logs, and the new controlfile.
datafile 1 header updated: controlfile sequence# reset to 1
datafile 2 header updated: controlfile sequence# reset to 1
datafile 3 header updated: controlfile sequence# reset to 1
datafile 4 header updated: controlfile sequence# reset to 1
datafile 5 header updated: controlfile sequence# reset to 1
datafile 6 header updated: controlfile sequence# reset to 1
datafile 7 header updated: controlfile sequence# reset to 1
datafile 8 header updated: controlfile sequence# reset to 1
datafile 9 header updated: controlfile sequence# reset to 1
datafile 10 header updated: controlfile sequence# reset to 1
datafile 11 header updated: controlfile sequence# reset to 1
datafile 12 header updated: controlfile sequence# reset to 1
..........................................................
..........................................................
datafile 344 header updated: controlfile sequence# reset to 1
datafile 345 header updated: controlfile sequence# reset to 1
datafile 346 header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo01-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo01-2.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo02-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo02-2.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo03-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo03-2.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo04-1.log header updated: controlfile sequence# reset to 1
logfile /database/xxxx/xxxx/xxxx/redo04-2.log header updated: controlfile sequence# reset to 1
當然,我們做這個升級是先利用一份存儲鏡像出來的數據進行測試,測試一切都進行順利,且經過應用一個周期的測試,控制文件未出現損壞。于是,我們在國慶小長假結束前的凌晨立即對生產庫進行修復。通過東方龍馬廣州團隊的協力合作,終于在國慶小長假結束前幫客戶恢復了系統的正常使用。
OERR: ORA-00227 corrupt block detected in controlfile: (block %s, # blocks %s) Master Note / Troubleshooting, Diagnostic and Solution (文檔 ID 48808.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 8.0.6.0 to 12.1.0.2 [Release 8.0.6 to 12.1]
Information in this document applies to any platform.
PURPOSE
This article provides information about error ORA-00227 and possible actions.
SCOPE
This note is intended for general audience as initial starting point for beginning diagnosis of ORA-00227.
DETAILS
Error: ORA 227
Text: corrupt block detected in controlfile: (block %s, # blocks %s)
..........................................................
Cause: A block header corruption or checksum error was detected on reading the controlfile.
Action: Use the CREATE CONTROLFILE or RECOVER DATABASE USING BACKUP CONTROLFILE command.
Cause
The controlfile is corrupted. This could be caused by a problem external to Oracle like a Hardware/OS problem or an Oracle Defect.
Identify if Bug 20324049 may be causing the error ORA-227
If Bug 20324049 is causing the error ORA-227; then the maximum sequence number in the Database will be close to the maximum value (4294967295):
set numwidth 15
select max(FHCSQ),
case trunc(max(FHCSQ)/4294967295,1)
when 0.9 then 'WARNING: Reference Bug 20324049'
else 'NO Warning' end "Bug 20324049"
from x$kcvfh;
控制文件序列值(controlfile sequence number)最大值為:4294967295(即43億左右,2的32次方減1),可以利用上面的語句檢查你的系統是否存在控制文件序列值超過極限值的風險。
Take a backup of the control files as they are now
Take a backup of the existent control files with a regular copy (cp) or any other mechanism. This can be used for future diagnosis and in case they are needed for additional recovery.
Identify which control file is failing
ORA-00227 is normally accompanied by ORA-202 which prints the affected control file name. Review the alert log for more details.
Execute DBVERIFY on all Control Files
DBVERIFY is a tool that is intended to identify corruptions in Datafiles but in some cases may help to identify block corruptions in Control Files.
Identify control file block size by executing dbfsize. Example:
$ dbfsize /oradata/controlfile/control1.ctl
Database file: /oradata/controlfile/control1.ctl
Database file type: file system
Database file size: 614 16384 byte blocks
In this case the control file block size is 16384.
Alternatively identify the control file block size by querying view V$CONTROLFILE.
Execute DBVERIFY on each Control File copy using the block size identified in 1.1. Example:
dbv file=/oradata/controlfile/control1.ctl blocksize=16384
Solution
Solution 1. Use another mirror copy of the control file
If it is determined that the control file is damaged (the corruption is persistent) and if the database is down, then take one of the other control files from the control_files parameter and copy it over the bad control file noted above or startup the database with one control file at a time.
Try opening the database and if error persists then go to Solution 2:
Solution 2. Recreate the control file
Use the next article to recreate the control file:
Doc ID 735106.1 How to Recreate a Controlfile
Solution 3. Restore a backup of the control file and apply media recovery
Restore a backup of control file and apply media recovery using RECOVER DATABASE USING BACKUP CONTROLFILE
If the database is on IBM AIX, make sure to review the issue described in Note 2237498.1
發生檢查點
日志文件的切換
歸檔online redolog
運行崩潰后的恢復
熱備的開始和結束
DML通過nologging等選項執行對象時
對大象進行直接DML操作等
還有其他你能想到的方面嗎?
歡迎補充
| 北京 | 上海 | 廣州 | 成都 |
4008-906-960