2017-08 Anomalies detected in MERGECOPY and MODIFY Recovery

Db2 z/OS Utilities: Identify and avoid some MERGECOPY or MODIFY Recovery anomalies

Just a little note this month about a strange anomaly in the way MERGECOPY and MODIFY RECOVERY seem to work. Let’s begin with MERGECOPY.

1 – MERGECOPY to the death


How many IICs to change a lightbulb?

I was doing some tests, to see how many Incremental Image Copies you can take before RECOVER dies, about 71 by the way, and found this out…

SCRATCHing my head

I did a Full Image Copy, then various updates, and three Incremental Image Copies. Due to bad luck, my Full Image Copy was “accidentally” scratched… Whoops!

MERGECOPY runs

I ran the MERGECOPY and look what happened:

DSNU000I    199 10:37:11.20 DSNUGUTC - OUTPUT START FOR UTILITY, UTILID = DC10MC00MCU012
DSNU1044I   199 10:37:11.21 DSNUGTIS - PROCESSING SYSIN AS EBCDIC
DSNU050I    199 10:37:11.23 DSNUGUTC - MERGECOPY TABLESPACE R510D0DC.R510S81 NEWCOPY YES
DSNU463I    199 10:37:11.30 DSNUYBR3 - THE PRIMARY IMAGE COPY DATA SET SETEMP.R510D0DC.R510S81.P0000.D17195.T1734 WITH
DATE=170714 AND TIME=174140
            IS PARTICIPATING IN MERGECOPY.
DSNU463I    199 10:37:11.37 DSNUYBR3 - THE PRIMARY IMAGE COPY DATA SET SETEMP.R510D0DC.R510S81.P0000.D17198.T1025 WITH
DATE=170717 AND TIME=102429
            IS PARTICIPATING IN MERGECOPY.
DSNU463I    199 10:37:11.43 DSNUYBR3 - THE PRIMARY IMAGE COPY DATA SET SETEMP.R510D0DC.R510S81.P0000.D17198.T1235 WITH
DATE=170717 AND TIME=123501
            IS PARTICIPATING IN MERGECOPY.                                                                   
DSNU463I    199 10:37:11.48 DSNUYBR3 - THE PRIMARY IMAGE COPY DATA SET SETEMP.R510D0DC.R510S81.P0000.D17111.A10756 WITH
DATE=170421 AND TIME=075641
            IS PARTICIPATING IN MERGECOPY.
DSNU030I    199 10:37:11.50 DSNUYBR3 - UNABLE TO ALLOCATE SETEMP.R510D0DC.R510S81.P0000.D17111.A10756, RC=4,
CODE=X'17080002' 
DSNU454I    199 10:37:11.63 DSNUYBR0 - COPY MERGE COMPLETE 
            NUMBER OF COPIES=4        
            NUMBER OF COPIES MERGED=3   
            TOTAL NUMBER OF PAGES MERGED=363 
            ELAPSED TIME=00:00:00  
DSNU010I    199 10:37:11.67 DSNUGBAC - UTILITY EXECUTION COMPLETE, HIGHEST RETURN CODE=4

Look again

Did you see what went wrong? No, nor did I the first time! Review again…

When a Return Code = 04 is actually important

Now do you see it?

DSNU030I    199 10:37:11.50 DSNUYBR3 - UNABLE TO ALLOCATE
SETEMP.R510D0DC.R510S81.P0000.D17111.A10756, RC=4,
CODE=X'17080002'

Hidden amongst all the other output…and “only” a RC=4 !

So? What does that have to do with me?

Well, who checks RC=04? Your data is non-recoverable, but you *think* you are green! Pretty nasty if you ask me! Db2 does get a bonus point though for actually merging the incrementals into another incremental…

2 – MODIFY Recovery


DSNUM ALL gone forever?

Remember the bad old days, when you *had* to take a TS level Image Copy of partitioned objects when you REORGed them?

Wasn’t that a terrible time? Terabytes of disk space pointlessly being filled with needless image copy data. Then along came Db2 11, which enabled TP level copies! Hoorah! Ok, not so hot with tapes etc. but who cares, it worked! Finally, we did not need to keep *huge* DSNUM ALL style copies lounging around on our expensive disks.

Too good to be true?

All sounds good, huh? What did we forget? I will tell you… NPSIs (The indexes formally known as NPIs) What is the problem, I hear you shout! Well, let me walk you through a normal scenario with COPY YES indexes…

Keep it Simple Stupid

Let us imagine a little tablespace with two partitions, one DPSI and two NPSIs. All of the indexes are COPY YES. Due to the fact that we now only do INLINE COPYs at the TP level, SYSCOPY gets lots of records looking like this:

SpacePartType
TP1F
IP1F
TP2F
IP2F
NPSI10F
NPSI20F
Then a day later
TP2F
IP2F
NPSI10F
NPSI20F
Then another day later
TP2F
IP2F
NPSI10F
NPSI20F

Remember that the NPSIs are copied with the DPSIs.

Time to DELETE the old data

Now, using MODIFY RECOVERY, you wish to rid yourself of the oldest data from Part 2.

A nice little

MODIFY RECOVERY TABLESPACE DB.TP DSNUM 2 RETAIN LAST(1)

Is enough. However, as it says in the documentation, this form does *not* delete NPSI data.

COPY PEND is not your friend!

If you use the DSNUM ALL, which *will* delete the NPSI data, then it will most probably put the entire table space into COPY PEND, which you do *not* want!

Space is the problem

So now the problem begins to appear… imagine that this has been happening for a year or more… you now have hundreds of syscopy entries *and* datasets for NPSI data that you cannot simply, or easily, MODIFY “away” anymore!

What I think is needed, is a new parameter, say “INCLUDE NPSI”, which will also get rid of NPSI data. RFE anyone??

I have no fix for this problem either, except to actually take a DSNUM 0 copy, including all indexes, and then do DSNUM ALL style MODIFY RECOVERY. Pretty messy…

As usual, if you have any comments or queries please feel free to drop me a line!

TTFN

 

Roy Boxwell

2016-10 Discovering hidden recovery problems in the SYSLGRNX

A query to read the SYSLGRNX

This month I want to have a look inside the SYSLGRNX – Not just for fun but actually to see if I, and you, have a serious RECOVER problem.

The Problem began with too many TP Image Copy steps

The problems all began when it was noticed by one of our customers that our DB2 Utility generation software (RTDX) was generating Image Copy jobs for “old” PBG partitions that were no longer being updated and they asked us why we were recommending an Image Copy.

RTDX and Image Copy

RTDX (RealTime DBAExpert for z/OS and SAX – Space AssuranceExpert) is threshold based and uses the Real-time Statistics (RTS) tables as the major driver of decisions about REORG, RUNSTATS, and IMAGE COPY. The normal rules for Image Copy are “If the RTS COPY counters have not been updated then there has not been an update and we do not need a Copy”. This is normally extended with a special copy rule “But generate a Full Image Copy every xx days even if no updates”. This is done to guarantee that you can perform a recovery even if your non-DB2 migration/back-up software automatically deletes datasets older than xx days.

Everything green?

So you think you are covered? Not so fast! When the RTS were first introduced back in DB2 V7 they were not that stable and, sometimes, the RTS DB was actually stopped. This meant that the numbers were sometimes “not that reliable” – Now for REORG and RUNSTATS it does not really matter because, at some point, the counters start to count again and all is OK. But for IMAGE COPY it would be a disaster to “miss” an update! RTDX handles this with an extra “belt and braces” or, for the Americans amongst us, “belt and suspenders” check: “CHECK SYSLGRNX? Y/N” setting. If set to “Y” then RTDX generates an extra job that extracts the SYSLGRNX data and loads it into a DB2 table. This can then be queried by RTDX to see when objects were opened for update (Just like DB2 does in RECOVER processing to get the required log ranges.)

Something Old Something New

So now you know how the system works I can explain what happened at the customer site. They use Partition By Growth tablespaces (PBGs) a lot as “containers” to just insert records into. They had PBGs with hundreds of partitions and they wondered why RTDX generated Image Copy steps for partitions 1 – 136 when they only held “old non-changed data”.

Check the code

First thing I did was to check the code for any silly mistakes. Everything was fine. I then got the SYSLGRNX extract file and saw that the Partitions were indeed in there… I then worked with the customer directly using log prints and tracing until we finally saw what was happening.

PBG doing the ripple!

When a new partition was added, all of the previous partitions header pages got an update “max no. of partitions”. Now this update counts as a system page and so is *not* in the RTS Counters as they only count data pages! RECOVER of course does not care! It would demand to look for the log ranges and what would happen if your logs have been deleted? Yep – Unrecoverable data!

APAR PI60104 fixes the problem

So you can apply the APAR and all is good? Not really. You still have to go and check if you have any potentially unrecoverable PBG datasets out there. How do you do that? Well – you ask me!

Reading SYSLGRNX is not so easy

The problem with SYSIBM.SYSLGRNX is that the data is nearly unusable when you want to join to the DB2 Catalog. E.g. Two byte character fields that actually contain SMALLINT values and, sometimes, with the high bit set…

Just creating a table and then a dumb INSERT INTO SELECT FROM dies because of the DSNDB01 Cursor rules. So first you’ll need to create your new version of SYSLGRNX:

-- CREATE NEW VERSION OF SYSLGRNX WITH SMALLINTS --
CREATE TABLE BOXWELL.SYSLGRNX2
   (LGRDBID          SMALLINT          NOT NULL
   ,LGRPSID          SMALLINT          NOT NULL
   ,LGRUCTS          TIMESTAMP         NOT NULL
   ,LGRSRBA          CHAR(10)          FOR BIT DATA NOT NULL
   ,LGRSPBA          CHAR(10)          FOR BIT DATA NOT NULL
   ,LGRPART          SMALLINT          NOT NULL
   ,LGRSLRSN         CHAR(10)          FOR BIT DATA NOT NULL
   ,LGRELRSN         CHAR(10)          FOR BIT DATA NOT NULL
   ,LGRMEMB          SMALLINT          NOT NULL
   ,LGRNEGPSID       CHAR(1)           NOT NULL)
;
COMMIT ;

Cross loader to the rescue!

After trying various tricks to get at the data I hit on the idea of abusing the cross loader to do it for me. Here is my JCL that you must tailor for your site to do the dirty deed:

SELECT COALESCE(                                              
       CASE WHEN ASCII(SUBSTR(A.LGRDBID, 1, 1)) * 256         
               + ASCII(SUBSTR(A.LGRDBID, 2, 1)) > 32767 THEN  
                (ASCII(SUBSTR(A.LGRDBID, 1, 1)) * 256         
               + ASCII(SUBSTR(A.LGRDBID, 2, 1))) - 32768      
            ELSE ASCII(SUBSTR(A.LGRDBID, 1, 1)) * 256         
               + ASCII(SUBSTR(A.LGRDBID, 2, 1))               
       END , 0)         AS LGRDBID                            
      ,COALESCE(                                              
       CASE WHEN ASCII(SUBSTR(A.LGRPSID, 1, 1)) * 256         
               + ASCII(SUBSTR(A.LGRPSID, 2, 1)) > 32767 THEN  
                (ASCII(SUBSTR(A.LGRPSID, 1, 1)) * 256         
               + ASCII(SUBSTR(A.LGRPSID, 2, 1))) - 32768      
            ELSE ASCII(SUBSTR(A.LGRPSID, 1, 1)) * 256         
               + ASCII(SUBSTR(A.LGRPSID, 2, 1))               
       END , 0)         AS LGRPSID                            
      ,TIMESTAMP(                                             
           CASE WHEN SUBSTR(A.LGRUCDT, 5 , 2 ) > '83' THEN '19'
                ELSE '20'                                     
           END CONCAT                                         
           SUBSTR(A.LGRUCDT, 5 , 2 ) CONCAT '-' CONCAT        
           SUBSTR(A.LGRUCDT, 1 , 2 ) CONCAT '-' CONCAT        
           SUBSTR(A.LGRUCDT, 3 , 2 ) CONCAT '-' CONCAT        
           SUBSTR(A.LGRUCTM, 1 , 2 ) CONCAT '.' CONCAT        
           SUBSTR(A.LGRUCTM, 3 , 2 ) CONCAT '.' CONCAT        
           SUBSTR(A.LGRUCTM, 5 , 2 ) CONCAT '.' CONCAT        
           SUBSTR(A.LGRUCTM, 6 , 2 ) CONCAT '0000'            
                 )      AS LGRUCTS                            
      ,A.LGRSRBA                                              
      ,A.LGRSPBA                                              
      ,A.LGRPART                                              
      ,A.LGRSLRSN                                             
      ,A.LGRELRSN                                             
      ,ASCII(SUBSTR(A.LGRMEMB, 2, 1)) AS LGRMEMB              
      ,CASE WHEN ASCII(SUBSTR(A.LGRPSID, 1, 1)) * 256         
               + ASCII(SUBSTR(A.LGRPSID, 2, 1)) > 32767 THEN ‘Y’
            ELSE ‘N’                                          
       END              AS LGRNEGPSID                         
FROM SYSIBM.SYSLGRNX A                                        
ENDEXEC                                                               
LOAD DATA INCURSOR(C1) RESUME NO SHRLEVEL NONE REPLACE                    
  INTO TABLE BOXWELL.SYSLGRNX2                                        
/*
The use of ASCII is to “translate” the character data into smallint and also to detect if we have negative numbers. This happens if you run out of DBIDs and also if you use CLONE objects. The requirement for the two COALESCEs is down to the cross loader syntax check on the CASE statements by the way…

This could load a lot of data of course!

Do not forget the INDEX

Now the Index that we need to speed up the processing (and the RUNSTATS of course!)

-- CREATE INDEX ON THE COLUMNS WE NEED --
CREATE INDEX BOXWELL.SYSLGRNX_IX
          ON BOXWELL.SYSLGRNX2
       (LGRDBID,LGRPSID,LGRPART,LGRMEMB)
        USING STOGROUP SYSDEFLT
             PRIQTY -1
             SECQTY -1
        CLUSTER
        CLOSE YES
;
COMMIT ;

And the RUNSTATS

RUNSTATS  TABLESPACE <yourdatabase>.<yourtablespace>
          TABLE (ALL)
          INDEX(ALL)
          FREQVAL NUMCOLS 1 COUNT 100 BOTH
          FREQVAL NUMCOLS 2 COUNT 100 BOTH
          FREQVAL NUMCOLS 3 COUNT 100 BOTH
          SHRLEVEL CHANGE REPORT NO UPDATE ALL HISTORY NONE

Now we can SQL So finally we have the SYSLGRNX data in a format that we can actually use! Here is my SQL that lists out any PBG spaces that have, according to the RTS, no updates but the last Image Copy timestamp is earlier than the SYSLGRNX record:

-- FINALLY THE QUERY TO SEE IF A PROBLEM EXISTS OR NOT --
--
-- SELECT ALL PBGS IN SYSTABLESPACE THAT HAVE DATA IN THE
-- SYSLGRNX WHICH ALSO HAVE AN ENTRY IN THE RTS WHERE THE
-- COUNTERS ARE NULL OR ZERO BUT THE COPYLASTTIME IS EARLIER
-- THAN THE SYSLGRNX TIMESTAMP.
--  THIS MEANS THAT THE RTS THINKS "NO REASON TO COPY"
--  BUT SYSLGRNX THINKS "SOMETHING WAS UPDATED"
--

SELECT A.DBNAME, A.NAME, B.LGRPART, A.DBID, A.OBID, A.PSID
      ,B.LGRUCTS, B.LGRSRBA, B.LGRSPBA, B.LGRSLRSN
      ,B.LGRELRSN, HEX(B.LGRMEMB) AS MEMBER
      ,C.COPYLASTTIME
      ,MAX(D.START_RBA) AS HIGHEST_IC_RBA
      ,MAX(D.TIMESTAMP) AS HIGHEST_IC_TS
FROM SYSIBM.SYSTABLESPACE A
INNER JOIN
    BOXWELL.SYSLGRNX2    B
 ON  A.DBID     = B.LGRDBID
 AND A.PSID     = B.LGRPSID
 AND A.INSTANCE = B.LGRINST
INNER JOIN
     SYSIBM.SYSTABLESPACESTATS C
 ON  C.DBID     = A.DBID
 AND C.PSID     = A.PSID
 AND C.PARTITION= B.LGRPART
 AND C.INSTANCE = A.INSTANCE
LEFT OUTER JOIN
     SYSIBM.SYSCOPY       D
  ON D.DBNAME    = A.DBNAME
 AND D.TSNAME    = A.NAME
 AND D.DSNUM   IN ( 0 , B.LGRPART)
WHERE A.TYPE   = 'G'
 AND (C.COPYUPDATEDPAGES = 0
  OR C.COPYUPDATEDPAGES IS NULL)
 AND (C.COPYCHANGES = 0
  OR C.COPYCHANGES IS NULL)
 AND (C.COPYLASTTIME < B.LGRUCTS
  OR C.COPYLASTTIME IS NULL)
 AND (D.ICTYPE = 'F'
  OR D.ICTYPE IS NULL)
GROUP BY A.DBNAME, A.NAME, B.LGRPART, A.DBID, A.OBID, A.PSID
     ,B.LGRUCTS, B.LGRSRBA, B.LGRSPBA, B.LGRSLRSN
     ,B.LGRELRSN, HEX(B.LGRMEMB)
     ,C.COPYLASTTIME
ORDER BY 1 , 2 , 3 , 7
WITH UR
;

And the winner is?

I hope you have zero records returned? Even if you do have some records, it is relatively easy to quickly get them image copied so that you can sleep at night. If zero, you still have a great copy of the SYSLGRNX that you can use for other purposes!

A customer replies

Here’s some real data back from a customer:

discovering hidden recovery problems in the SYSLGRNX; DB2 z/OS; PBG partition

The COUNT is the number of SYSLGRNX entries they have – So they have nine DB & TS groups and from a further analysis (removing the data sharing member duplicates) they have 1926 distinct DB, TS and Partitions!

They scheduled image copy processing for all of these and were then done.

I hope you liked this month’s topic.

As always, any questions or comments would be most welcome!

TTFN,

Roy Boxwell