Friday, March 27, 2015

crsd errors after downgrading the Oracle clusterware from 12.1.0.1.6 to 11.2.0.4 on Oracle Linux 6

After an upgrade from Oracle 11.2.0.4 to version 12.1.0.1 we hit a bug described in my post "Errors applying Grid Infrastructure PSU 12.1.0.1.6 (JAN2015)". We then tried to downgrade the Oracle clusterware from 12.1.0.1.6 to 11.2.0.4, but after the downgrade the crsd service would not start.

[root@rac1 crsd]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online 



The crsd daemon writes the following errors to the crsd.log logfile:


2015-02-14 8:10:53.631: [ CLSE][119007008]clse_get_auth_loc: Returning default authloc: /u01/app/11.2.0/grid/auth/crs/rac1
2015-02-14 8:10:53.631: [ default][119007008] AuthLoc /u01/app/11.2.0/grid/auth/crs/rac1
[ default][119007008]Failure 3 in trying to open AV key SYSTEM.version.activeversion





2015-02-14 1:53:59.689: [  OCRSRV][1564342016]th_upgrade: Starting upgrade calculation
[  OCRMAS][1564342016]th_calc_av:8': Failed in vsnupr. Incorrect SV stored in OCR. Key [SYSTEM.version.hostnames.] Value []
2015-02-14 1:53:59.698: [  OCRSRV][1564342016]th_upgrade:9 Shutdown CacheMaster. prev AV [186647552] new calc av [186647552] my sv [186647552]

 ...
2015-02-14 2:36:26.070: [ CRSMAIN][4095674144] CRSD listening on 10 style E2E port (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.1)(PORT=51548))
2015-02-14 2:36:26.070: [ CRSD][4095674144] Created alert : (:CRSD00132:) : Couldn't write E2E key to OCR, error:
2015-02-14 2:36:26.070: [ CRSD][4095674144][PANIC] CRSD exiting: Couldn't write E2E key to OCR, error: 3
2015-02-14 2:36:26.070: [ CRSD][4095674144] Done. 
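The th_upgrade line above reports the active version (AV) and software version (SV) as the integer 186647552. As a cross-check, here is a minimal Python sketch that unpacks that integer. The 8/4/8/4/8 bit layout is my assumption about how the version is packed; it is not stated anywhere in the log. The function name decode_packed_version is made up for this example.

```python
def decode_packed_version(value: int) -> str:
    """Unpack a 32-bit packed version, assuming an 8/4/8/4/8 bit layout."""
    first = (value >> 24) & 0xFF   # highest 8 bits
    second = (value >> 20) & 0xF   # next 4 bits
    third = (value >> 12) & 0xFF   # next 8 bits
    fourth = (value >> 8) & 0xF    # next 4 bits
    fifth = value & 0xFF           # lowest 8 bits
    return ".".join(str(part) for part in (first, second, third, fourth, fifth))


# Value copied from the th_upgrade line in crsd.log.
print(decode_packed_version(186647552))  # 11.2.0.4.0
```

Under that assumption, prev AV, new calc av and my sv all decode to 11.2.0.4.0, so the version integers themselves look consistent with the downgrade target.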


The same error also appears in the crs.out logfile:

CRSD REBOOT
2015-02-14 2:36:24
Changing directory to /u01/app/11.2.0/grid2/log/rac1/crsd
2015-02-14 2:36:24
CRSD REBOOT
CRSD exiting: Couldn't write E2E key to OCR, error: 3
CRSD handling signal 6
Dumping CRSD state
Dumping CRSD stack trace
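Because the same failure shows up in both crsd.log and crs.out, a small filter helps to pull the fatal lines out of either file. This is only a sketch: the patterns are taken from the excerpts in this post and are not a complete list of crsd error markers, and the function name fatal_lines is my own.

```python
import re

# Patterns copied from the log excerpts in this post (not exhaustive).
FATAL = re.compile(r"\[PANIC\]|Failure \d+ in trying|CRSD exiting")


def fatal_lines(log_text: str) -> list:
    """Return the stripped lines of log_text that match a fatal pattern."""
    return [line.strip() for line in log_text.splitlines() if FATAL.search(line)]


sample = """\
2015-02-14 2:36:26.070: [ CRSD][4095674144] Created alert : (:CRSD00132:) : Couldn't write E2E key to OCR, error:
2015-02-14 2:36:26.070: [ CRSD][4095674144][PANIC] CRSD exiting: Couldn't write E2E key to OCR, error: 3
2015-02-14 2:36:26.070: [ CRSD][4095674144] Done.
[ default][119007008]Failure 3 in trying to open AV key SYSTEM.version.activeversion
"""

for line in fatal_lines(sample):
    print(line)
```

For a real logfile you would read the file contents first, for example with open(path).read(), and pass the text to fatal_lines.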



Failure 3 occurs when crsd tries to read the SYSTEM.version.activeversion value from the OCR.
In addition, the key SYSTEM.version.hostnames holds an incorrect software version after the downgrade, which also ends in error 3. Error 3 looks like the clusterware equivalent of an ORA-600. What is strange at this point is that ocrcheck reports the OCR as fine and that SYSTEM.version.activeversion is correct according to ocrdump.bin:

...
[SYSTEM.version.activeversion]
ORATEXT : 11.2.0.4.0
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

...
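To check several keys at once, the text dump can be parsed instead of read by eye. A minimal sketch, assuming the dump has the shape shown above (a [KEY] header line followed by an "ORATEXT : value" line); the function name read_ocrdump_keys is hypothetical and not part of any Oracle tool.

```python
import re


def read_ocrdump_keys(text: str) -> dict:
    """Map each [KEY] header in an OCR text dump to its ORATEXT value."""
    values = {}
    current_key = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            current_key = header.group(1)
        elif current_key and line.startswith("ORATEXT"):
            # Lines look like "ORATEXT : 11.2.0.4.0".
            values[current_key] = line.split(":", 1)[1].strip()
    return values


sample = """\
[SYSTEM.version.activeversion]
ORATEXT : 11.2.0.4.0
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}
"""

print(read_ocrdump_keys(sample)["SYSTEM.version.activeversion"])  # 11.2.0.4.0
```

Keys that carry no ORATEXT line are simply left out of the result, so a missing entry is itself a hint that a key holds no text value.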

A workaround for this problem could be to restore an OCR backup taken before the upgrade to Oracle Clusterware 12.1.0.1.x. Such backups can be listed with the "ocrconfig -showbackup" command, and "ocrconfig -restore" replaces the OCR with an older version. Unfortunately, this did not work either, because the following error appeared:


[root@rac2 oracle]# ocrconfig -restore /u01/app/11.2.0/grid/cdata/oracle-scan/backup01.ocr
Errors in file :
ORA-27091: unable to queue I/O
ORA-17510: Attempt to do i/o beyond file size
ORA-06512: at line 4



Clearly, this is a case for Oracle Support, and they say it is due to bug 17046460. Unfortunately, the bug is not published. In my opinion there is no workaround for the issue at this point: you need to apply the patch or reinstall the whole cluster. To sum up, there was a bug when upgrading the clusterware and applying the most recent PSU, another bug when downgrading the clusterware, and finally a possibly related bug when replacing the OCR with an older version. Because of that, you need to test your upgrades and updates carefully.