2017-07 Let’s make DB2 z/OS IMAGE COPY great again!
War story :
Why is Db2 z/OS IMAGE COPY very useful in the current Db2 z/OS world?
Please excuse the heading, I simply could not resist…
This month is another war story from the trenches of Production DBAs fighting the fight for 24×7 Shops around the world…
This time it was a request to recover some tables to a specific point in time that started the ball rolling…
Staged too soon? Bad updates and deletes
The developers pushed some software to production and then found out, five hours later, that there was a logical error and it was doing really bad updates and deletes in a bunch of tables that it should *not* have done. This is bad news…
Backup system to the Rescue?
This firm uses System Level Backup (SLB), but the DBA group, I kid you not, did not know this little factoid (they had alternate facts to work with). So when the request arrived to restore tables x, y and z to a specific Point in Time (PiT), they simply created the required RECOVER control cards and, after checking that the tables, indexes and tablespaces were not being used, submitted the RECOVER jobs. Now, the first job worked fine, but the second and the third “bought the farm” with *very* weird messages that the DBA group did not really understand:
RECOVER Messages you do not want to see – Part one
DSNU1520I 319 08:26:44.64 DSNUCBRT - THE RECOVERY BASE FOR TABLESPACE ROYDB.ROYTS DSNUM 1 IS THE SYSTEM LEVEL BACKUP WITH DATE = 20161212, TIME 041302, AND TOKEN X'FF..FF'
DSNU1522I 319 08:26:46.85 DSNUCBRT - THE DFSMSHSM CALL TO RESTORE TABLESPACE ROYDB.ROYTS DSNUM 1 FAILED WITH RC = X'0000005D' AND REASON CODE = X'00000042' SEE THE JOB LOG FOR DFSMSHSM MESSAGES INDICATING THE CAUSE OF THE ERROR
DSNU832I )DSJP 319 08:26:44.61 DSNUCARS - INDEX ROYSCHEMA.ROYINDEX PARTITION 1 IS IN REBUILD PENDING STATE
DSNU560I )DSJP 319 08:26:46.87 DSNUGSRX - TABLESPACE ROYDB.ROYTS PARTITION 1 IS IN RECOVER PENDING STATE
DSNU012I 319 08:26:46.88 DSNUGBAC - UTILITY EXECUTION TERMINATED, HIGHEST RETURN CODE=8
(Some names have been changed to protect the innocent!)
Now this message *really* upset everyone… Why? The tablespaces were previously all RW and OK, but now, as DSNU560I and DSNU832I show, the tablespace was in RECOVER Pending status and its index in REBUILD Pending!
So what just happened? Well, Db2 detected that there was actually an SLB that could be used as the base of this recovery, and so “asked” HSM to get it “back”. That failed with Return Code (RC) X’5D’ (decimal 93) and Reason Code X’42’ (decimal 66). I love that 42! And with the great “tip” to “Read the Job Log”. By the way, the DBA group were a “tad” unhappy about the fact that a *failed* RECOVER set the tablespace in question to RECOVER Pending!
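Since Db2 reports these codes in hex and HSM reports them in decimal, a tiny helper makes the conversion less error-prone when you are under pressure. This is only a sketch: the function name `db2_hex_to_int` is my own invention, and the two literals are simply the ones quoted in DSNU1522I.

```python
def db2_hex_to_int(literal: str) -> int:
    """Convert a hex literal written as X'....' into a plain integer."""
    text = literal.strip()
    if not (text.upper().startswith("X'") and text.endswith("'")):
        raise ValueError(f"not a hex literal: {literal!r}")
    # Drop the leading X' and the trailing quote, then parse as base 16.
    return int(text[2:-1], 16)


# The values reported by DSNU1522I in the RECOVER job output.
return_code = db2_hex_to_int("X'0000005D'")
reason_code = db2_hex_to_int("X'00000042'")
print(return_code, reason_code)
```

The decimal form of the return code is the one to look for when searching the HSM side of the house.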
Where on Earth?
So the DBAs started trying to read the Job Log and could not see anything. They have a 16-way data-sharing group running on 14 separate LPARs… There is *lots* of “job log”…
Eventually under one of 14 different STCs all called “HSM” I found this info:
RECOVER Messages you do not want to see – Part two
08.26.44 S0998158 ARC1801I FAST REPLICATION DATA SET RECOVERY IS STARTING FOR DATA SET QA1B.DSNDBC.ROYDB.ROYTS.J0001.A001, AT 08:26:44 ON 2016/12/12
08.26.46 S0998158 ARC0624I PHYSICAL DATA SET COPY OF VOLUME QA1B.DSNDBC.ROYDB.ROYTS.J0001.A001 TERMINATED PRIOR TO COMPLETION, DFSMSDSS FAILING RC = 8
08.26.46 S0998158 ARC1860I THE FOLLOWING 0001 DATA SET(S) FAILED DURING FAST REPLICATION DATA SET RECOVERY:QA1B.DSNDBC.ROYDB.ROYTS.J0001.A001, COPYPOOL=DSN$LOCDS0P$DB, DEVTYPE=DASD, VOLUME=WSPS95, ARC1166, RC=0
08.26.46 S0998158 ARC1802I FAST REPLICATION DATA SET RECOVERY HAS COMPLETED FOR DATA SET QA1B.DSNDBC.ROYDB.ROYTS.J0001.A001, AT 08:26:46 ON 2016/12/12, FUNCTION RC=0008, MAXIMUM DATA SET RC=0093
So here we see that it failed with RC=0093 (now in decimal), but with a DFSMSDSS RC = 8.
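With 14 HSM started tasks to wade through, filtering the log text for the one data set name is much quicker than scrolling. Here is a rough sketch of that idea, assuming you have already pulled the log into a text file with one message per line in the "time, job id, message id, text" layout quoted above. Real logs wrap messages over several lines, so treat the regular expressions and the function names as hypothetical starting points only.

```python
import re

# Two of the HSM messages quoted above, one message per line (an assumption:
# a real log wraps long messages over several lines).
HSM_LOG = """\
08.26.44 S0998158 ARC1801I FAST REPLICATION DATA SET RECOVERY IS STARTING FOR DATA SET QA1B.DSNDBC.ROYDB.ROYTS.J0001.A001, AT 08:26:44 ON 2016/12/12
08.26.46 S0998158 ARC1802I FAST REPLICATION DATA SET RECOVERY HAS COMPLETED FOR DATA SET QA1B.DSNDBC.ROYDB.ROYTS.J0001.A001, AT 08:26:46 ON 2016/12/12, FUNCTION RC=0008, MAXIMUM DATA SET RC=0093
"""

# time, job id, ARC message id, then the free-form message text.
MESSAGE = re.compile(r"^(\d\d\.\d\d\.\d\d) (\S+) (ARC\d{4}[A-Z]) (.*)$")
MAX_RC = re.compile(r"MAXIMUM DATA SET RC=(\d+)")


def find_arc_messages(log_text, data_set):
    """Return (time, message id, text) for ARC messages naming data_set."""
    hits = []
    for line in log_text.splitlines():
        match = MESSAGE.match(line)
        if match and data_set in match.group(4):
            hits.append((match.group(1), match.group(3), match.group(4)))
    return hits


def max_data_set_rc(hits):
    """Return the first MAXIMUM DATA SET RC found in the hits, or None."""
    for _time, _msg_id, text in hits:
        match = MAX_RC.search(text)
        if match:
            return int(match.group(1))
    return None


hits = find_arc_messages(HSM_LOG, "QA1B.DSNDBC.ROYDB.ROYTS.J0001.A001")
for time, msg_id, _text in hits:
    print(time, msg_id)
print("MAXIMUM DATA SET RC:", max_data_set_rc(hits))
```

Comparing that decimal RC with the hex RC in the Db2 utility output is what ties the two job logs together.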
Confused? You will be…
I have not yet found out what an RC = 0093 means. My best guess at the moment is that the data had been “moved” since the SLB, and so the VOLUME swap failed for some reason or other. The tablespace and index were unavailable and the “window of change” was closing fast…
Can I run backwards out the door?
Panic was approaching when I mentioned they could do the “backwards” LOGAPPLY (BACKOUT YES), and so with high hopes we attempted it, only to see that it sadly dies a death if the tablespace is already in RECOVER Pending status. We felt pretty recursive at this point…
So that left us with one last chance, and that was to tell RECOVER to ignore the SLB and go directly to an earlier Image Copy (IC) by using RESTOREBEFORE with TORBA syntax.
We had to wait a while but we got there! The RECOVERs all ran through clean and all was well… <phew>
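For the record, the two flavours of control card we tried look roughly like what the sketch below generates. Everything here is illustrative: the function names are made up, the hex log points are placeholders rather than the real values from that day, and the keyword layout should be checked against the RECOVER utility documentation for your Db2 level. The text above says TORBA; TOLOGPOINT is the other point-in-time keyword, and which one applies depends on your environment, so the sketch takes it as a parameter.

```python
import re

HEX_DIGITS = re.compile(r"^[0-9A-F]+$")
POINT_KEYWORDS = ("TORBA", "TOLOGPOINT")


def _hex_literal(value):
    """Render a string of hex digits as X'....', rejecting anything else."""
    value = value.upper()
    if not HEX_DIGITS.match(value):
        raise ValueError(f"not a hex string: {value!r}")
    return f"X'{value}'"


def _check_keyword(point_keyword):
    if point_keyword not in POINT_KEYWORDS:
        raise ValueError(f"unsupported keyword: {point_keyword!r}")


def recover_backout(tablespace, point, point_keyword="TORBA"):
    """Build a 'backwards' recovery card (BACKOUT YES)."""
    _check_keyword(point_keyword)
    return (f"RECOVER TABLESPACE {tablespace}\n"
            f"  {point_keyword} {_hex_literal(point)}\n"
            f"  BACKOUT YES")


def recover_from_earlier_copy(tablespace, point, restore_before,
                              point_keyword="TORBA"):
    """Build a card that looks for a recovery base before restore_before."""
    _check_keyword(point_keyword)
    return (f"RECOVER TABLESPACE {tablespace}\n"
            f"  {point_keyword} {_hex_literal(point)}\n"
            f"  RESTOREBEFORE {_hex_literal(restore_before)}")


# Placeholder log points, NOT the values used in the real recovery.
print(recover_backout("ROYDB.ROYTS", "0000AAAA2000"))
print(recover_from_earlier_copy("ROYDB.ROYTS", "0000AAAA2000", "0000AAAA1000"))
```

The point of RESTOREBEFORE in our case was to pick a value older than the SLB, so that RECOVER had to fall back to a good old fashioned Image Copy as its base.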
Where’s the Beef?
So “What’s the point?” I hear you muttering… quite simple really:
1- Never rely on SLB to always work
2- Make sure you test it before switching your objects to RECOVER Pending by accident
3- Still take good old fashioned Image Copies
4- Use of BACKOUT YES can save your bacon *if* you know about it
As usual, if you have any comments or queries please feel free to drop me a line!