2020-11 Recover Résumé

This month I hope to answer a reader question: I was asked all about when and how to use the TOLOGPOINT parameter of the RECOVER utility. So here is my RECOVER résumé.

What is it?

RECOVER is one of those utilities that you hear a lot about but normally do not get to use that often! This is what makes it so dangerous and so interesting at the same time of course. We should all be using it on a regular basis so that when the time comes we are not afraid, or left dithering, while desperately trying to come up to speed with a “new” Utility at 03:00 in the morning…

What does it do?

It recovers – what else, stupid! Well, actually it does a ton of stuff, but at its most basic it takes a dataset (this can be any valid type of COPY’ed dataset) and recreates a VSAM Linear Dataset (LDS) from it. Typically, a Full Image Copy (FIC) is used as input but it can, and does, use Incremental Image Copies (IIC) as well. Finally, it can also apply log updates to get the LDS to the point you wish it to be at, but note that this whole process is forward-only in nature. RECOVER also has BACKOUT YES, which starts at the current LDS and UNDOes updates against the object using the log. It could well be that going backwards is actually faster than applying an old image copy and going forwards! There is one small limit to BACKOUT YES:

If you specify BACKOUT YES, the recovery point must be within the most recent Db2 system checkpoints that are recorded in the BSDS for each member. Otherwise, the recovery cannot proceed and returns an error.
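
A minimal sketch of such a backwards recovery (the object name and the LRSN are placeholders):

```
RECOVER TABLESPACE MYDB.MYTS
        TOLOGPOINT X'00D8A09432EF45F1A800'
        BACKOUT YES
```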

That’s all?

Basically yes! It is designed to get data back to a point at which you are happy that it is good and consistent. Note here that *you* must be happy. Db2 does not give a hoot about transactional consistency in this case because you are doing the recover. RECOVER guarantees that a Unit Of Work (UOW) will always be consistent with COMMITted data, and that indexes will also be consistent – this might well mean that you must rebuild or recover your indexes, naturally!

What is recovered?

RECOVER uses database and space name as input, so it is instantly apparent that the “unit of recovery” is *not* a table but a tablespace (which can contain 1000s of tables) and/or an index (remember that an indexspace can only contain one index). There is an exception here: you can also give a page number to recover, but that is very rarely used these days as I/O errors really are a thing of the past.

The phases of Recover

First it accesses SYSIBM.SYSCOPY to see which dataset(s) it must use. (Unless you use the LOGONLY option, in which case RECOVER assumes that some other process got the LDS back, *or* you use the new NOSYSCOPY parameter to give it the name of a dataset that has been deleted from SYSCOPY – presumably by a harsh usage of MODIFY RECOVERY.) Then it checks SYSIBM.SYSLGRNX to build a list of log datasets that possibly must be scanned for recovery info (unless you use LOGRANGES NO, which is not recommended!). This is also what the REPORT RECOVERY utility does, and you should run that utility every now and again to get used to the style and type of data we are talking about.
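
Running it for a single object is as simple as this (the object name is a placeholder):

```
REPORT RECOVERY TABLESPACE MYDB.MYTS
```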

Are you consistent?

If you are using FlashCopy consistent copies, it then runs two extra early phases of work: looking for uncommitted work that was backed out at the point of the FlashCopy, and for work that was in flight at that point but subsequently committed.

The real work

Then comes the LOGAPPLY phase where, using data from SYSLGRNX, it reads and applies all log records from the FIC/IIC up to the recovery point you wished for. Here the Fast Log Apply (FLA) greatly speeds up the entire process.

FLA?

Fast Log Apply uses up to 10MB of memory per RECOVER job to buffer the log being used, which makes the LOGAPPLY phase much quicker. The ZPARM LOGAPSTG (deprecated in Db2 9) had a maximum value of 100MB, so you could run 10 RECOVER jobs in parallel – each containing 100s of RECOVERs – and all would benefit from FLA. Then in Db2 10 the maximum was raised to 510MB by APAR while, at the same time, the ZPARM was removed, meaning you could run 51 parallel jobs.

FLA – What does it do?

As William Favero wrote many years ago: Fast log apply was first delivered in DB2 Version 6 and, as the name implies, aids in processing log records by sorting them by page before applying the changes. It uses one log read task per recovery job and multiple log apply tasks whenever possible. By sorting the log records into page order prior to applying them, fast log apply can almost completely eliminate the need to touch a page set page more than once. This process can drastically reduce log apply times. FLA is not just used by RECOVER, but for today’s blog we can ignore all of its other uses.
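
To see why the page sort helps, here is a toy model in Python. This is purely illustrative and has nothing to do with real Db2 internals – it just shows that grouping log records by page lets you touch each page once instead of once per record:

```python
from collections import defaultdict

def apply_sorted(log_records):
    """Toy model of Fast Log Apply: log records arrive in log (time)
    order but are grouped by page before being applied, so each page
    is touched once instead of once per log record."""
    by_page = defaultdict(list)
    for page, change in log_records:     # one sequential log read task
        by_page[page].append(change)
    page_touches = 0
    applied = {}
    for page in sorted(by_page):         # apply the changes in page order
        page_touches += 1                # each page is fetched exactly once
        applied[page] = by_page[page]
    return applied, page_touches

log = [(7, 'upd1'), (3, 'upd2'), (7, 'upd3'), (3, 'upd4'), (9, 'upd5')]
applied, touches = apply_sorted(log)
print(touches)  # 3 pages touched for 5 log records
```

Without the sort, the five records above would mean five page fetches; with it, only three.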

Enough of FLA – back to the point, or at least to TOLOGPOINT!

Then come two special phases, but only if using TOLOGPOINT – see later for details about this parameter. Here the LOGCSR phase builds info for any inflight, indoubt, inabort, and postponed-abort units of recovery (transactions) to be handled, and the LOGUNDO phase rolls back any uncommitted changes that those transactions made to any of the recovered objects.

What’s in a name?

Recover accepts four (well, actually five…) different parameters telling it what to do at the highest level:

  1. TOCOPY – recover the named object to this IC.
  2. TORBA – recover up to and including this Relative Byte Address (RBA) on the log. These days it is recommended not to use this anymore, but to use the next option instead. See more later.
  3. TOLOGPOINT – recover up to and including this log point (Timestamp).
  4. TOLASTCOPY – recover up to the last copy taken – This can be a FIC or an IIC.
  5. TOLASTFULLCOPY – recover up to the last FIC.
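
For example, two of these options could look like this (the object name and the LRSN are placeholders):

```
RECOVER TABLESPACE MYDB.MYTS TOLASTFULLCOPY

RECOVER TABLESPACE MYDB.MYTS
        TOLOGPOINT X'00D8A09432EF45F1A800'
```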

RBA or LOGPOINT?

If you are running non-datasharing then an RBA is the same as LOGPOINT and so you can use TOLOGPOINT with no worries. If, however, you are in datasharing then you can *only* use TORBA for the time span before this data sharing group was created. Thus, to keep things easier, it makes sense to just use TOLOGPOINT.

Is it safe?

The problem here is the famous saying “A little knowledge is dangerous”. It is recommended by IBM that the use of TOCOPY, TOLOGPOINT (TORBA), TOLASTCOPY and TOLASTFULLCOPY be restricted to senior users who know exactly what they are doing! Why so? Because all of these options are so-called “Point in Time” (PiT) recoveries, and they are all special!

Aren’t we all special?

Nope! If you do a “normal” RECOVER it brings you back to the current time. All of your data is back as if the transactions had actually all executed. This is fine if you have had a media failure etc. The problem is that most of the recoveries done these days are due to bad application code. This means you have to recover to a Point in Time when the bug was not live…

The problems of PiT

The absolute biggest problem is data consistency. I am not even talking about referential integrity (RI) or Check Rules here! Imagine you are a bank and someone withdraws 200 Dollars at 10:01 in the morning. You do a PiT to 10:00 and what is the end result? Your data is possibly ok and consistent but the end user has got 200 bucks and your bank hasn’t… this is traditionally known as “not good”…

Then come all the “other problems” – RI is number one here… You recover a dependent table to a PiT where the parent table has no row… this is “sub-optimal”. Or, even worse IMHO, you do a PiT back past the point where a Check Constraint was added… this is 100% Nightmare time…

The problem is…

We are not alone! The objects that we have and use are all defined and managed in the Db2 Catalog, and any PiT you do on so-called “user data” has no knowledge of, and no interaction with, the Db2 Catalog data. Here is where all the really nasty problems lurk and loiter in the shadows… The basic rule is “If you changed anything in the catalog – check it all before doing the PiT!” It is much easier to check it all beforehand – doing it afterwards can be especially painful!

SLB?

System Level Backup (SLB) came in a while ago, and RECOVER got updated to be able to use it as a basis for recovery as well. It started out with quite a few ifs and buts – e.g. no REORG, REBUILD, LOAD REPLACE etc. – but these all disappeared with z/OS 1.11, as from then on it could handle datasets moving around the disks. There are still two problem areas where SLBs cannot be used as input: when you have done a REORG to remove a dropped column, or a REORG to remove unused partitions at the end. Apart from these two minor irritations, SLBs are just as good as FICs.

In depth per option

All of the TOxxxCOPY options do the same thing. They recover the object to that copy and thus use *no* log access. If the IC is not available, for whatever reason, Db2 falls back to the next earliest IC and then *does* do a LOGAPPLY up to the PiT of the requested IC.

TOLOGPOINT. RECOVER chooses the newest FIC before this point. If any IICs exist, these are merged, and then it does LOGAPPLY up to and including the log record indicated by the LOGPOINT. If there is no FIC or SLB, it will use *just* LOGAPPLY from the creation of the object, or from a LOAD or REORG of the tablespace. Naturally, this LOGAPPLY is only possible as long as you have *not* run a MODIFY RECOVERY that deleted the records from SYSLGRNX. This is why MODIFY RECOVERY actually inserts records into SYSCOPY!

Any uncommitted transactions running at the PiT are rolled back leaving them all in a transactionally consistent state.

Scoping out problems?

New in Db2 12 is the parameter SCOPE, with default value UPDATED. Now this was quite a good idea, as it means that RECOVER only does real work for objects that have really *changed* since the last FIC. It does this little bit of magic by looking in SYSLGRNX to see whether any changes have actually been processed. But this can also be a serious problem! John Campbell tells us all: “Please change all RECOVER jobs to use SCOPE ALL or make sure that PTF PH20056 is applied”.
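
If in doubt, spelling SCOPE ALL out explicitly is cheap insurance (the object name and the LRSN are placeholders):

```
RECOVER TABLESPACE MYDB.MYTS
        TOLOGPOINT X'00D8A09432EF45F1A800'
        SCOPE ALL
```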

Where’s the beef?

It is a well-known fact that the road to disaster is paved with good intentions. SCOPE UPDATED sounds great at first, but what would happen if you are using LISTDEF to do a PiT and the Db2 Catalog is involved? Further, what happens if, purely by bad luck, you RECOVER SYSLGRNX before any other tablespaces? Yep, the “bit of magic” then has no data to work with, is effectively blind, and you get seriously badly recovered data.

Top Tip

If you want to use TOxxxCOPY but *require* consistency, just use TOLOGPOINT instead and simply specify the LOGPOINT of the IC you wished to use anyway. This way, Db2 guarantees that even SHRLEVEL CHANGE copies will not cause a problem!
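
To find the LOGPOINT of a given image copy, SYSCOPY itself is the place to look. A minimal sketch (the DBNAME/TSNAME values are placeholders):

```sql
SELECT HEX(START_RBA) AS LOGPOINT
     , TIMESTAMP
     , DSNAME
FROM SYSIBM.SYSCOPY
WHERE DBNAME = 'MYDB'
  AND TSNAME = 'MYTS'
  AND ICTYPE = 'F'
ORDER BY START_RBA DESC ;
```

The HEX(START_RBA) value can then be plugged straight into TOLOGPOINT as an X'...' string.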

Pain Points

RI – This is the number one bugbear! You must make sure that all objects in the RI chain get the *same* PiT. Failure to do so will leave you in CHECK-pending status and with a possibly nasty mess to sort out!

Check Constraint added – If one was added, try to make sure you hit a quiesce point or FIC; otherwise verify whether the PiT lies before the creation timestamp of the constraint.
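
If a constraint is in play, a quick look in the catalog shows when it arrived. A sketch (the schema and table names are placeholders, and I am assuming the TIMESTAMP column of SYSCHECKS, which records when the constraint was created):

```sql
SELECT CHECKNAME, TIMESTAMP
FROM SYSIBM.SYSCHECKS
WHERE TBOWNER = 'MYSCHEMA'
  AND TBNAME  = 'MYTABLE' ;
```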

Catalog – As mentioned, the catalog is a description of the objects and is *not* automatically updated by RECOVER. This is for the carbon-based beings to sort out. ALTER can be especially annoying here! Plus SYSLGRNX, as mentioned above.

Identity Columns – These will no longer be in sync with SYSSEQUENCES and so care must be taken if they *must* be synchronized.

Compression dictionaries can kill you if you recover a single LDS of a multi-LDS object with a different version of the dictionary – then the data is naturally not readable any more!

How to find a LOGPOINT

We have now seen how it all hangs together so the big question now is “How do you find an acceptable TOLOGPOINT?”

The xxxxMSTR address space contains numerous LOGPOINTs:

06.24.45 STC03367 DSNJ003I -SC10 DSNJOFF3 FULL ARCHIVE LOG VOLUME  
DSNAME=DB2SC1.SC10.ARCHLG1.D20280.T0624437.A0000374,
STARTRBA=00D6CAB49FFF50D10000, ENDRBA=00D6CAB49FFF52ECFFFF,
STARTLRSN=00D8A08DB179EBDBB800, ENDLRSN=00D8A09432EF45F1A800,
UNIT=SYSALLDA, COPY1VOL=SC1L00, VOLSPAN=00, CATLG=YES

From this message you get the ENDLRSN. If you convert the LRSN to a timestamp you get:

SELECT TIMESTAMP(BX'00D8A09432EF45F1A800' CONCAT BX'000000000000')
FROM SYSIBM.SYSDUMMY1 ;
---------+---------+---------+---------+---------+---------+------
2020-10-06-04.24.43.727967
DSNE610I NUMBER OF ROWS DISPLAYED IS 1
DSNE616I STATEMENT EXECUTION WAS SUCCESSFUL, SQLCODE IS 100

So you now have an LRSN-to-timestamp conversion to work with. Be aware that the result is UTC (the LRSN comes from the STCK clock): here it is 04.24, but the MSTR message reported 06.24, because Berlin time is UTC+2 in summer.
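
If you prefer to do the conversion off-host, the same arithmetic can be sketched in Python. This is illustrative only – it assumes the 10-byte extended LRSN layout where bytes 1–8 are a classic STCK value and bit 51 represents one microsecond:

```python
from datetime import datetime, timedelta

def lrsn_to_timestamp(ext_lrsn_hex: str) -> datetime:
    """Convert a 10-byte extended LRSN (20 hex chars) to a UTC timestamp.

    Bytes 1-8 of the extended value form a classic 8-byte STCK clock,
    where bit 51 represents one microsecond since 1900-01-01 UTC.
    """
    stck = int(ext_lrsn_hex[2:18], 16)   # bytes 1-8: the classic STCK value
    micros = stck >> 12                  # shift so bit 51 becomes the unit
    return datetime(1900, 1, 1) + timedelta(microseconds=micros)

# The ENDLRSN from the DSNJ003I message above:
print(lrsn_to_timestamp('00D8A09432EF45F1A800'))
# → 2020-10-06 04:24:43.727967
```

The result matches the SQL TIMESTAMP conversion shown above.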

Another possibly good starting point is to run the DSNJU004 (Print Log Map) utility and look at the CHECKPOINT QUEUE right at the very end of the output:

                    CHECKPOINT QUEUE 
09:20:31 OCTOBER 12, 2020
TIME OF CHECKPOINT 04:24:46 OCTOBER 06, 2020
BEGIN CHECKPOINT RBA 00D6CAB49FFF5308C5A8
END CHECKPOINT RBA 00D6CAB49FFF530A143A
END CHECKPOINT LRSN 00D8A09435A32E201600
TIME OF CHECKPOINT 03:55:38 OCTOBER 06, 2020
BEGIN CHECKPOINT RBA 00D6CAB49FFF511E3612
END CHECKPOINT RBA 00D6CAB49FFF511F5CB8
END CHECKPOINT LRSN 00D8A08DB297403D0E00

Full of RBAs and LRSNs – Great stuff indeed! Plus all the normal Active and Archive log details.

I have a horrible old REXX where I can enter a timestamp and it outputs the LRSN. You must manually edit it, but it does work!

/* REXX */ 
NUMERIC DIGITS 64
RBAFORMAT = 'E' /* Output format (E = extended, B = basic) */
OFFSET = '000000000000' /* Set to STCK offset of data sharing group */
CVT = C2D(STORAGE(10, 4))
CHKBYTE = STORAGE(D2X(CVT + 304), 1)
CVTTZ = STORAGE(D2X(CVT + 304), 4)
IF BITAND(CHKBYTE, '80'X) = '80'X THEN CVTTZ = C2D(CVTTZ, 4)
ELSE CVTTZ = C2D(CVTTZ)
CVTTZ = CVTTZ * 1.048576 / 3600
CVTTZ = FORMAT(CVTTZ, 2, 0)
SAY 'Timezone diff. hours ' CVTTZ
MS = (DATE('BASE') - 693595) * 24 * 60 * 60
MS = MS + TIME('SECONDS')
MS = MS - (CVTTZ * 60 * 60)
MS = MS * 1000000
SAY 'Today MS = ' MS
LRSNBIN = X2D(SUBSTR(D2X(MS), 1, 12))
IF LENGTH(OFFSET) = 20 THEN OFFSET = SUBSTR(OFFSET, 3, 12)
LRSN = D2X(LRSNBIN + X2D(OFFSET))
IF RBAFORMAT = 'E' THEN LRSN = '00'LRSN'000000'
SAY 'Current LRSN (with STCK offset 'OFFSET') = 'LRSN
/* Now do for any time e.g. */
/* 2019-11-08-11.59.21.445591 00D6FE3F55460D3AD600 From syscopy */
/* output 00D6FE3F55460D000000 */
fromdate = '2019-11-08 11.59.21' /* example */
MS = (DATE('BASE','8 Nov 2019') - 693595) /* Here the date */
MS = MS * 24 * 60 * 60
MS = MS + ( 11 * 60 * 60) /* Here the hours */
MS = MS + ( 59 * 60) /* Here the minutes */
MS = MS + 21 /* Here the seconds */
MS = MS - (CVTTZ * 60 * 60)
MS = MS * 1000000
MS = MS + 445591 /* Here the rest seconds */
SAY 'From date MS = ' MS
LRSNBIN = X2D(SUBSTR(D2X(MS), 1, 12))
IF LENGTH(OFFSET) = 20 THEN OFFSET = SUBSTR(OFFSET, 3, 12)
LRSN = D2X(LRSNBIN + X2D(OFFSET))
IF RBAFORMAT = 'E' THEN LRSN = '00'LRSN'000000'
SAY 'From date = ' fromdate
SAY 'From date LRSN (with STCK offset 'OFFSET') = 'LRSN

Feel free to change it to allow the date/time/seconds as a parameter – I never got the time – pardon the pun…
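
For the reverse direction, here is the same arithmetic as the REXX, sketched in Python. Again purely illustrative – it expects a UTC timestamp, so any timezone adjustment (the CVTTZ logic in the REXX) is left to the caller, and a zero STCK offset is assumed:

```python
from datetime import datetime

def timestamp_to_lrsn(ts: datetime) -> str:
    """Convert a UTC timestamp to a 10-byte extended LRSN (hex string).

    A basic 6-byte LRSN is the number of microseconds since
    1900-01-01 UTC divided by 16 (the top 48 bits of a STCK value).
    """
    delta = ts - datetime(1900, 1, 1)
    micros = (delta.days * 86_400_000_000
              + delta.seconds * 1_000_000
              + delta.microseconds)
    # Extended format: epoch byte 00, basic LRSN, zero low-order padding
    return '00' + format(micros >> 4, '012X') + '000000'

print(timestamp_to_lrsn(datetime(2020, 10, 6, 4, 24, 43, 727967)))
# → 00D8A09432EF45000000
```

Note that the low-order bytes are zero-filled, so the result is the earliest LRSN within that microsecond – just like the REXX output.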

SYSCOPY contains LRSNs and TIMESTAMPs and can be a great starting point. SYSLGRNX is obviously nowadays also worth a look, and the RTS (Real-Time Statistics) tables also contain LRSNs and TIMESTAMPs.

REPORT RECOVERY also lists out a ton of info including LRSNs and timestamps of course.

Ready, Steady, RECOVER!

So, with all the sources available and the ability to switch LRSN to timestamp and vice versa, you are now fully prepared to use a PiT. Just remember to think it all through, get the application people on board, and get the plan approved before you start the RECOVER!

Redirected RECOVER?

Testing all this is fraught with danger, of course… You cannot simply do a RECOVER of a production table while *anything* is running in production. So how can you test whether your RECOVER JCL is correct, and whether your Service Level Agreements are met by it? IBM Db2 development has now solved this for us with “Redirected Recovery” – APAR PH27043, PTF UI72057, for Db2 12 FL500 and above. It adds a new parameter to the RECOVER syntax, “FROM xxx.yyy”, so you can RECOVER an unused “dummy” object from a live production object with *no* interference! There are, as usual, a couple of caveats: the “dummy” space must be “the same” as the to-be-recovered space – think table definitions, DSSIZE, BUFFERPOOL etc. here. Finally, only UTS are supported at this time.
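
Based purely on the description above, such a test could look like this (both object names are placeholders – the “dummy” target tablespace is named first, the live source follows FROM):

```
RECOVER TABLESPACE TESTDB.DUMMYTS
        FROM PRODDB.LIVETS
```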

I would love to hear of any of your war stories about RECOVER use or abuse!

Finally, many thanks to Vasu for asking me to dig deeper into RECOVER.

TTFN

Roy Boxwell