2020-03 Db2 REORG SYSCOPY: DRAIN Delays are Despicable

I heard about an interesting problem the other day. Please remember that “interesting” to me is just that: “interesting”. ”Interesting” for the DBAs and employees of the firm where it happened is, naturally, a bit different.

A normal start

  • Monday morning and all is well until around 07:00, when delays start appearing in online transactions.
  • Soon the delays are gaining the advantage and customers are starting to complain.
  • At around about 07:20 nearly the whole machine just sat there…
  • About 15 minutes later everything started running normally.

Hmmm, interesting.

The stage is set

So, the lead DBAs are off and running, looking for bad SQL that could possibly have caused the disturbance in the force. They were checking whether the coupling facility was under stress, they were checking for parallel running REORG, MODIFY, or QUIESCE in the SYSIBM.SYSCOPY, they were using our WorkLoadExpert (WLX) to see what was happening in the time window involved.

Tuesday arrives

And so do I! As luck would have it, I am at this site to hold a presentation all about BindImpactExpert, which saves you from bad access paths, and RunstatsRescue, which rescues you from badly timed RUNSTATS. Now this site already has these products, but I must present to a new intake of DBA and developer employees.

Check everything

After my presentation we checked everything and found a few timeouts and deadlocks, but nothing serious. Then I got my Deer Stalker hat on, (now there’s an image!), and decided to see where delays were coming in. One of the developers had already done a quick WLX check and had seen very high Drain Lock values.

WLX outputs a summary of what workload has been processed which, here in the labs on my little test system, looks like this:

Wait times in microseconds because of …                                
 latch requests               :                 594                    0
 page latch                   :                   0                    0
 drain locks                  :                   0                    0
 drain lock claims            :                   0                    0
 log writer                   :               32230                    0
 synchronous I/O              :             6840389                 9623
 locks                        :                   0                    0
 synchronous execute          :                   0                    0
 global locks                 :                   0                    0
 other threads read  activity :            28429563                    0
 other threads write activity :               13166                    0 

At the actual customer site I could see a 1000 times increase in wait drain locks between two runs!

Utility versus SQL

Now, as I am sure you are all aware, a drain is *only* used by a command or a utility, so I started thinking:

“There must be a parallel running something somewhere!”

“There must be a parallel running something somewhere!”

So I used WLX to show me the SQLs that had the highest wait drain locks. I took the top two (over 30,000 seconds of delay!) and got their tablespace names from the Db2 Catalog using the first referenced table column.

Horrible Job to do

It is not a pleasant task to search master address space sysouts, but in this case it was the only way. Using the tablespace names from the Db2 Catalog. I just navigated to the date and time in question and did F commands in SDFS on the tablespace names.

BINGO!

After a few minutes I found a strange message (Correlation Id=010.TLPLKNC3) about a drain not being possible for an internal Db2 system task

– This happens to be used by REORG, and it gave me the info about where the drain came from. I looked at that system’s log output in the time range, and sure enough there was a REORG of that very table which kept failing due to not getting the drain!

A retry too far?

At this site they use a 3rd Party software tool to generate REORG, RUNSTATS and COPY and it had a default of RETRY 30. It kept trying 30 times before eventually failing.

This explains the missing SYSCOPY entry as the REORG had failed!

The other one?

So that was one bad boy found – What about the other? That tablespace did not appear in any of the sysouts. So I drilled down to get the full SQL text (Over 8000 bytes long!) and scrolled on down to the FROM lines – and there was the *first* table name! After the dust had settled, I went back and I saw that :

this one table was actually in every single SQL in the top 200 delay candidates! A pretty central table if you ask me!

Who? and Why?

The management level now wanted to know who did it? And why? I left that part up to the customer, of course, as I do not want to get involved in a finger pointing exercise! My feeling is: like most disasters, it was probably a chain of events something like:

  1. REORG generated on Sunday.
  2. Due to some unforeseen problem the JCL was shunted to Monday.
  3. On Monday at 07:00 it started and killed the machine.

Never again?

Best way is to generate jobs straight to the Job Scheduler for instantaneous execution (No waiting or shunting allowed) and guess what? We have the RealTimeDBAExpert (RTDX) that allows you to do just that! You can easily exclude objects from utilities based on days of the week, hours of day etc. If you have a bought-in or home-grown system would it also have caused this disaster?

Console Messages

If the customer had had our WLX Console Message Use Case licensed, it would have also made the detective work much easier, as then you have a central place to go where *all* console messages from *all* members are written and searchable! This would have saved a lot of time and trouble.

Bottom Line

(Removing my Deer Stalker hat and replacing it with a mortar board.)

Look everywhere, trust no-one and remember that a DRAIN is almost definitely nothing to do with SQL or a badly timed RUNSTATS.

As always, any questions or comments would be most welcome!

TTFN,

Roy Boxwell

2020-02 Db2 UPDATE column: UPDATEs for nothing and CPU ain’t free!

A bad misquote of a great Dire Straits song, but it is one great thing I saw last year!

What is an Update?

We all know what an update is, right? You have a column containing some value and you wish to update it to a new – different – value. You code an UPDATE with SET and all is done.

But, what happens “under the covers” when the column value you are updating is exactly the same as the new value?

Suspected real-time abuse

The problem surfaced gradually… as all good problems do… The DBAs were wondering why an SQL that was executed half a million times per day was waiting for other threads for so long. (This was discovered using our WorkLoadExpert (WLX)). The wait times were frighteningly high, and so it was decided that this SQL should be the target of some sort of tuning effort.

It then came out, while gathering basic tablespace statistical data, that:

the related tablespace had not been REORGed for over six months when, according to SEGs WorkLoadExpert, there were half a million updates every day against it!

RTS or SEG Bug?

Naturally the first idea is it must be a bug.

  • Either the Real-time Statistics (Not incrementing the REORGUPDATES counter – It could even be NULL for example) or,
  • heaven forbid, a bug in our WorkLoadExpert.

I took a closer look at the SQL:

UPDATE aaa.bbb
 SET COL1 = ?
 WHERE COL1 = ?
 ;

and just sort of wondered out loud,

“They are not using the same value in both parameter markers are they?”

The “they” in this case was the developer of course…

Oh My Word!

After a quick e-mail discussion, it then came out that, that was indeed the case! Db2 is clever but sometimes not that brilliant! The developer had had the idea of executing this SQL to “see” if the value existed or not… He did not think about what Db2 then actually does…

Under the Covers

Db2 does not “know” what the current value of COL1 is. It used, in this case, Index access to get and obtain an X lock on the target page – (this was then the reason for the very large wait times on the other threads!). Once the lock was held, it could then discover that there was *nothing* to do, and so it did nothing! Then it happily released the lock(s) after doing nothing and returned SQLCODE 0.

No Log data was written as nothing was done, and REORGUPDATES was not incremented as nothing was done, but the CPU/Elapsed overhead was enormous!

The right way

The head DBA has said the SQL should look like:
SELECT ‘A’ FROM aaa.bbb
WHERE COL1 = ?
FETCH FIRST 1 ROW ONLY
WITH UR
;

This is now on its way through change management! Naturally, it is the way the developer should have coded it from the get go!

What can you do?

Now this, of course, caused alarm bells to ring as “cut-and-paste” is your friend. If there is bad code in one place it is probably being copied further, even as you read this! Using SEG’s WorkLoadExpert and the Real-time Statistics, you can easily pull out and analyze any “bad guys”.


Put simply, use the UPDATE count from WLX and correlate it to the REORGUPDATES counter. If they are wildly different, taking into account REORGLASTTIME and WLX_TIMESTAMP, then you have a candidate to further track down!


Now, where are those refrigerators we have to move?

As always, I would be pleased to hear from you and any war stories you have!

TTFN,
Roy Boxwell
Senior Architect


More about WLX

2019-12 Fun with Db2 12 PBR RPN

I have recently enhanced our SpaceAssuranceExpert (SAX) product to automatically help out if partitions, or partitioned indexes, are getting too big for their boots in a productive system.

SAX – What is it?

The idea behind SAX, is to stop any and all chances of getting a dreaded SQLCODE -904 in production, especially “out of the blue”.

Our idea was, in Db2 12, with the new PBR RPN to do an on-the-fly ALTER to the DSSIZE, just like SAX does now with the SECQTY to avoid running out of extents.

RTFM Time

A quick look in the manual tells you that :


it is an immediate ALTER (as long as you make the new DSSIZE larger than the old one!) and there are no package invalidations or REORGS required.


This is fantastic!

So I created a nice little PBR and then ran a horrible Cartesian join SPUFI to flood the first partition with data. This join had a TIME card of (,1) to limit it to one second of CPU before getting an Abend S322.

SAX Monitor

The SAX monitor reacted perfectly and did the TP alter to 513 GB DSSIZE (I had an increment size of 512 as a test), and got an SQLCODE -666 (I just love that SQLCODE…), as the INSERT was still running.

This ALTER was then internally queued to be attempted later, like in any other failure case.


All well and good.

Wham! Nasty errors !

Then I did the same for a DPSI on my PBR RPN… Oh dear!I got a nasty IO Error and then an even nastier ROLLBACK loop, meaning I had to cancel the IRLM to stop Db2… (There is an APAR for this problem PH18977.)

Rollback loop?

The ROLLBACK loop was caused by me choosing to use NOT LOGGED as a tablespace attribute. We have a finite amount of log space, and when the transaction was S322’d after one second of CPU,

the rollback could not find one of the archive logs and then we had to cold start Db2

– Not pretty!

Proper test!

I then wrote a couple of little test programs that actually COMMITted after 5000 inserts and then the ALTERs all worked as designed.


Do I worry too much about extended format and extended accessibility?

Next, I worried about the ominous “extended format and extended addressability“ attributes in the DATACLASS for a PBR RPN, and wondered what would happen if a customer has SAX running and it happily ALTERs a TP to say 6 GB when they can only address 4 GB…


IBM to the rescue!

Luckily for us, Db2 development had thought about this!

  • If you attempt to create a PBR RPN (even with a very small size) and your DATACLASS does not have the two attributes set, you get an error message 00D70008 telling you this detail.

  • If you ALTER an existing tablespace to be a PBR RPN and your DATACLASS does not have the two attributes set, then the ALTER works fine. But remember, this is a pending alter and you *must* do a REORG at the TS level with inline TP level copies. This REORG then fails – also with 00D70008.

So, in other words, SAX cannot hurt you here!


What about PBGs?

After all this we also considered PBGs. They have a limit as well – MAXPARTITIONS in their case. So we added an ability to also ALTER MAXPARTITIONS as well. Here you must be more careful though, as these ALTERs are still immediate *but* they invalidate any referring packages!

ABIND YES or NO?

If you work with ABIND set to YES all is good as the ALTER comes in, Db2 invalidates your packages and the auto rebind happens so fast that you do not even notice it happening, however, if you work with ABIND NO then any packages, even the package actually running and doing the inserts, will fail! If you are just using dynamic SQL then it is 100% OK otherwise – Buyer beware!

Aha!

I have opened an Analytics Idea (DB24ZOS-I-1057) to try and get this loophole closed, as I cannot see what access path change could be affected by going from 32 to, say, 36 MAXPARTITIONS.

Db2 keeps adding the parts dynamically and nothing happens then… Go figure… Anyway, if you would like it – Please vote for it!


There are already some nice comments attached to it:

  • DP commented

    this is just a limit in the catalog.  So how could access path be affected?  The actual growth of a partition doesn’t invalidate the package so how just changing the limit do so ?
  • BW commented

    I opened a Case on this asking why packages are being invalidated in this situation and it is still open waiting for a reply.
  • BD commented

    With only impact to catalog, not sure how Access Path would be impacted.  Seems wasteful and counter productive to invalidate packages.

Remember – You never stop learning!

As always I would be pleased to hear from you!

TTFN,
Roy Boxwell
Senior Architect

2018-01 Db2 GETPAGES by Insert

 

Troubleshooting on site : how a simple Db2 SQL INSERT statement causes a high GETPAGES rate and how to fix it.

One shop I visit had an interesting problem the other day.  I thought I would share it with you, all as it really was cool.

The game is afoot!

A few days into this brand new year the production machine started to die…they swapped in a bunch of CPs and ZiiPs to try and help but it didn’t. So now the hunt was on – Who was killing the machine and how?

The usual suspects?

Using our WLX (WorkLoadExpert) software, it was quickly found that a simple SQL INSERT statement appeared to be the cause. It was never in the old top 10 “bad guys” list, but was currently at around number four – and rising up the charts faster than Elvis!

First thought were…

  • cross member invalidation, when multiple members insert in a data-sharing group,
  • or a bad usage of INSERT into SELECT syntax.

Deerstalker hat switched to “ON”

Deerstalker hat mode was quickly switched on and the DBA group started looking at *why* a simple  INSERT could cause such a headache… Then someone noticed that an ALTER had been done…  At first, it was thought that this could cause the problem but the ALTER timestamp was 03:43 in the morning! No DBA was working at that time. “Aha!” that is the Db2 Database Maintenance size checker and corrector that ALTERs PRIQTY and SECQTY. (Even though they use the sliding scale and allocate with -1 and -1, go figure…)

PBG pains

I happened to be on site for another reason and started looking around. I noticed that the tablespace in question was a PBG with MAXPARTITION 99 and was at the 45th Partition. This partition had been created at 03:43 and was the ALTER event recorded in SYSIBM.SYSCOPY.

Compress me till it hurts!

Now to add some spice to the story, this table space is set to COMPRESS YES. So it would be good to know if the compression dictionary was copied from partition 44 or not… After looking through various documents and also browsing the VSAM cluster, it became apparent that Db2 creates the dictionary “on-the-fly” after a “Db2-determined number of inserts” – I would love to know that number!

It gets fishy…

All of this was actually a red herring in the grand scheme of things as then the Programming Development Team happened to mention that a brand new logging process had been written that was “faster better cheaper.” It had been rolled out to production after passing all tests with flying colours.

The smoking gun…

Using WLX, it was possible to see that for every insert about 500 GETPAGES were being done. Now, on this table, there are *no* indexes. It really is a just a container that gets dumped out every now and again. So I looked at the DDL and saw that the PBG space was indeed created with PCTFREE 0 and FREEPAGE 0 and also with MEMBER CLUSTER, a so-called MC00 space,

but APPEND YES was missing from the table definition and *this* was the reason for the high number of GETPAGES. It was reading all the SPACEMAP pages instead of just inserting at end.

More roads to Rome

Now another way of speeding this all up, would be to ALTER the tablespace to be TRACKMOD NO which would save a lot of time as the SPACEMAP updates are then not done. However, at this firm, they do a ton of Incremental Image Copies on this space, so removing this was not an option.

Listserv Helps

From listserv comes this info:


“APPEND YES avoids ‘scanning’ the table space part for free space before actually extending the dataset beyond its current high used RBA.  How much does this save?  The cost of getpages for each of your table space bit maps.  If this is mostly a busy insert only table then the bitmaps are most likely in the buffer pool and will not require a physical I/O.  APPEND YES will not avoid the get pages to the calculated home page because of the need to check for duplicates.”


Now it was an extremely busy table but the system was also under extreme stress, so it really was rereading the spacemaps. The DBA group scheduled an ALTER to APPEND YES and hope that if this level of INSERTS happens again that the system will not stop!

And finally

Here’s a last bit of info from IBM’s John Campbell all about this:


TRACKMOD

< NO> – is recommended if do not require incremental COPY

  • Db2 does not keep track of updated pages
  • Less space map page updates which will improve performance
  • Less data sharing overhead
  • Can be altered via ALTER TABLESPACE DDL

New APPEND option is provided for INSERT in V9 NFM

  • CREATE/ALTER TABLE … APPEND YES

Can relieve high get pages during space search

  • APPEND search at the end of table space quickly
  • Not going through looking for deleted space
  • Table space size will tend to grow

With high number of concurrent inserts, APPEND could cause bottleneck on the last space map page

  • Using MEMBER CLUSTER option together with APPEND
  • to relieve the contention at the end

 

Hindsight is always good isn’t it?

 

I hope your new year is off to a better start. As usual any queries or criticism gladly accepted!

 

TTFN

Roy Boxwell

2017-02 Why SIZE still matters in Db2 12

What has changed for space management in Db2 12?

How to avoid SIZE limits in Db2 12 like in previous Db2 Releases?

How to monitor the maximum possible SIZE of table and index spaces and table and index partitions?

SIZE in Db2 12: Now that Db2 12 has gone GA, I thought it would be nice to do a quick re-recap of Space management and its problems over the releases. My “old” newsletter 2014-05: Why SIZE matters for Db2 still receives a lot of hits on our website, so I know that this is a big topic of interest for many of you. Some nifty things have been introduced in Db2 12 to make space a lot easier to use and manage.

In the beginning…secondary allocation for tablespaces and indexes since Db2 V7, Db2 V8…

Since Db2 V8, the DBAs of this world have all had the ability to forget about PQTY and SQTY in the DDL for Tablespaces and Indexes. At first, nearly no-one trusted the sliding scale algorithm, and SOFTWARE ENGINEERING’s product Space AssuranceExpert (aka SAX) monitored and reacted instantly to secondary allocations.

However, we now have Db2 12, and I thought it would be interesting to review what was done in Db2 V7 (when our SAX was launched), and the difference nowadays in the Db2 12 world.

IFCID issuing for space extents

Every time a secondary allocation is done in Db2, it can be made to spit out an IFCID. SAX runs as a started task, active 24×7, from Db2 start up until just before Db2 shut down. It catches all of these IFCIDs thrown by Db2, and performs an analysis with six basic questions:

1Can this dataset reach its maximum physical size *before* running out of physical extents? (The actual size is dependent on the “geometry” of the object of course!)
2Will this object run out of datasets? (The number of datasets an object can have is, once again, dependent on the “geometry” of the object)
3Is this partition nearing its maximum size?
4Did Db2 ask for one extent but got more back?
5Are any of my SMS disk storage pools running out of space?
6Are there any SEQUENCES that are about to hit the wall?

(Numbers five and six are actually triggered by a timer, naturally.)

Can this dataset reach its maximum size before running out of extents?

Remember, back in those old days of Db2 V7? We only had 255 extents and 254 partitions, but datasets could still get pretty big pretty fast.

The problem lots of shops had, was that an important dataset would “hit the buffers” of maximum number of extents *way* before it ever ran out of physical space. Thus causing grief, wailing and gnashing of teeth! SAX stopped all this by giving WTO “heads-up” style messages in two flavors. First, a warning message, and then a critical message. This gave DBAs and space managers much needed time to plan for the outage and the, inevitably, long running REORG to actually action the required ALTER, or perhaps even any DROP/RECREATE that had to be done.

IBM also noticed this problem and so introduced in Db2 V8 the “sliding scale” of secondary allocations, as long as the ZPARM OPTIMIZE EXTENT SIZING field (MGEXTSZ) was set to YES (this is the default from Db2 9, by the way). Of course, to really use this, you then had to ALTER all of the existing spaces PQTY and SQTY to be -1, and then remember to delete all PRIQTY and SECQTY lines in your DDL and also rely on the TSQTY and IXQTY ZPARMs giving a big enough “first default”. (By the way, defaults for these two ZPARMS are 0, which is actually translated to be 720k or one cylinder for normal spaces and 7200k or 10 cylinders for LOB spaces). This all probably explains why the take up of this great feature has not been that spectacular and, in fact, Listserv *still* gets questions about “How good is this feature?” This also explains why the primary reason for having SAX is still valid at most shops today!

However, most shops these days tend to ignore the extents problem and only REORG when over 1000 extents have been allocated. This is no problem for SAX, as it knows the SECQTY and the MGEXTSZ ZPARM settings and can decide to “ignore” an IFCID for extent and ALTER SECQTY processing if the SECQTY is -1 and the MGEXTSZ is YES.

Will this object run out of datasets?

Now the problem of running out of datasets is very, very evil indeed… For a non-partitioned space, you can have up to 32 datasets.  Db2 will happily allocate away and you will never know, or even be informed, if, and when, the last possible dataset has just been allocated and, of course, you will not know that the 33rd one cannot be allocated until you get a -904 unavailable resource error! By definition this is “not good”, as you must do a HUGE REORG with a bunch of managers breathing down your neck and *not* make any mistakes with the new allocations. (Again, this is a very good reason to have SAX doing all the monitoring and triggering early warning “heads-up” style messages!)

Is this partition nearing its maximum size?

A partition running out of space is rare, but when it does happen it is, of course, a disaster! The idea in SAX, is to warn when “the end is near” for any of the partitions in use and thus, as before, allow time for the ALTER etc.

Did Db2 ask for one extent but got more back?

Degenerated extents are annoying as well. You have only 255 or 7,257 extents, Db2 requests one but gets up to five back! This is “wasting” your precious supply of extents and so SAX can also warn you if this starts happening. Remedial action can again be planned to correct the problem, (normally a volume defrag in this case). Now in z/OS 1.7 “Extent Constraint Removal” was introduced for the DATACLAS which, if set to “Y”, allows 7,257 extents but still limits you to 123 extents per volume and 59 volumes. So watch out if you are using huge “virtual” disks (E.g. MOD 54 or EAV), as you can end up wasting space because you still cannot exceed 123 extents per volume.

SAX also takes care of duplicate recording – This is where an Extent is registered but SMS “consolidates it into the primary/existing extent – This would normally get logged as an extent but SAX sees this and does not report it as an extent.

Are any of my SMS disk storage pools running out of space?

When an SMS Pool runs out of space, either for sort/work or image copy, it is *not* good! The idea here, is to also give a “heads-up” style alert. The DBA can trigger the space management people to have a look at the state and size of the SMS storage groups this time alerted by percentage used or GBs of space free.

Are there any SEQUENCES that are about to hit the wall?

The usage of SEQUENCES has taken off. Nowadays shops can run into the problem of SEQUENCES hitting the maximum/minimum number for a NOCYCLE defined sequence. SAX tests sequences at the same time as the SMS groups to warn about any encroaching problem with WTO/MSG and reporting.


What was new in Db2 V8?

Db2 V8 introduced a big change – Partitions went up to a maximum of 4,096, and the calculation about how many pieces your NPI can have got “a little bit complex” (see also my previous newsletter: “2014-04 Are you going to pieces”).


What was new in Db2 9?  PBG and UTS spaces

In Db2 9 the next major advance came with UTS spaces. The one that caused the most grief was, of course, PBG. Why? Well, the first problem was that some people went mad and used MAXPARTITIONS 4096 right from the get-go. They then found out this could not simply be changed and ended up being a huge problem. IBM came out with a bunch of fixes for these people, but the recommendation is still true today: “Use the number you expect to use!”

PBGs, however, came with a new set of space management problems:

1By definition every partition is full, and so a TP REORG is “dangerous” -especially if you have VARCHAR, and even more so if compressed.
2ALTER at TP level is not supported for PBG.
3Getting rid of empty partitions was not supported
4Adding partitions dynamically (by command) was not supported.
5What to do if the partition that is “in use” is growing and is

a – The last allowed Partition
b – MAXPARTITIONS is set to one?

Now these are “non trivial” because the Db2 catalog is so defined and you would not want an alert every time someone created a table or index!

The trick here, is to treat these conditions as if it was a normal space and so, instead of warning that you are using the last part, it waits until you are using, e.g. 80% of that part. Then, e.g. at 90% comes the critical threshold warning.

Big changes happened here in Db2 12.


What was new in Db2 10?

With Db2 10 came the ability to ALTER PBGs to add parts which made using DSN1COPY to clone data around a lot better!


What was new in Db2 11?

In Db2 11 the REORG utility can be used to remove any empty parts in PBGs by the use of the ZPARM REORG_DROP_PBG_PARTS being set to ENABLE (DISABLE is the default).


What is new in Db2 12?

Now in Db2 12 there is partition independence for DSSIZE. Before, all partitions had to have the same maximum size (DSSIZE). Now you can have different sizes for different parts. This requires either making a new tablespace (UTS Relative Page Numbering), or an ALTER and TS level reorg of an existing UTS space. The tablespace goes relative page numbering and the RID increases in size to seven bytes hence the need for a TS level REORG. The Partitioning indexes also get DSSIZE so they can vary in size as well. Once you are there, all of the Partitions can then be ALTERed up in size with no outage! This is really, really good!

REORG of a PBG can “spill” into a new partition. This is also really good, as it was the major problem with PBG TP level reorgs. The chance of LOB data going into COPYP during the log apply phase has been stopped – Thankfully! Finally, delete of empty partitions is controlled with a utility DROP PART syntax.


The SAX way for Space monitoring

The SAX tool way of processing all this info is neatly summarised in the help panel of the tool itself:

SUPERVISE LPS

Supervise linear pagesets. If specified, a warning is issued
in case of high allocated reaches this percentage of the
maximum data set size for partitioned objects.

For non-partitioned objects, a warning is issued for every
newly allocated data set as soon as the data set number
reaches this percentage of the maximum number of data sets:

Two different values may be entered for warning and critical
values with different message ids. This may be useful for
automation reasons (see below).

Object type: TABLESPACE      ! Maximum number of data sets
-----------------------------+----------------------------
LOB tablespaces              ! 254
-----------------------------+----------------------------
Non-partitioned tablespaces  ! 32
-----------------------------+----------------------------
Partitioned tablespaces      ! 1 (Percent used check)
-----------------------------+----------------------------
Partitioned By Growth        ! MAXPARTITIONS. LPS check if
tablespaces                  ! more than one. If on last
                             ! partition then percent used.
-----------------------------+----------------------------
Object type: INDEX           ! Maximum number of data sets
-----------------------------+----------------------------
Non-partitioned indexes on   ! MIN ( 4096 , 2 power 32 /
tablespace with LARGE,       !      ( DSSIZE / TS PGSIZE))
DSSIZE, or more than 64      ! Eg: 128 GB DSSIZE with
Partitions                   !       8 KB Tablespace Page
                             ! gives 256 Pieces (datasets)
                             ! Or    4 GB DSSIZE with
                             !       4 KB Tablespace Page
                             ! gives 4096 Pieces (datasets)
-----------------------------+----------------------------
Non-partitioned indexes      ! 32
otherwise                    !
-----------------------------+----------------------------
Partitioned indexes          ! 1 (Percent used check)
-----------------------------+----------------------------
To support automation based on WTO ids two different
thresholds may be specified:
Field (1) specifies a warning threshold using WTO ids
O2RTSU04 - 12W  (non-partitioned spaces)
O2RTSU04 - 14W  (partitioned spaces)
O2RTSU04 - 16W  (partition by growth spaces)
Field (2) specifies a critical threshold using WTO ids
O2RTSU04 - 13W  (non-partitioned spaces)
O2RTSU04 - 15W  (partitioned spaces)
O2RTSU04 - 17W  (partition by growth spaces)

AUDIT DEGENERATED XTS
Audit secondary quantity for de-generated extents. If
specified, a warning is issued in case of the last extent
does not reach this percentage of the SECQTY specified
in the Db2 catalog. If this field is left blank, no
auditing is performed.

AUDIT SMS STOGROUPS
Should the Space AssuranceExpert audit SMS stogroups. Y/N
If Y is entered, a pop-up window will allow you to enter up
to 24 SMS storage groups which will be audited.
If WARN IF % ALLOC > or WARN IF GB FREE < is specified and
exceeded, a warning (WTO) will be issued.

CHECK SYSSEQUENCES
Should the Monitor also check for SYSIBM.SYSSEQUENCES that
are running out of room every PING minutes?

N  - do nothing.  This is the default.
I  - check Identity Columns and Doc Ids for XML.
S  - check User Defined Sequences.
B  - do both.

PERCENT USED
If checking of SEQUENCES is desired then a threshold
percentage must be given from 1 to 99. If this percentage of
the available sequences is exceeded then an action is
triggered.

EXCEEDED ACTION
When a percentage is exceeded this specifies what type and
and which style of message should be externalized.

N  - do nothing.  This is the default.
W  - to write out a WTO.
M  - to write a message to the job log.
B  - do both.

To support automation based on WTO ids the following
messages are output:

O2RTS000 - 20W  (SEQUENCES MAXVALUE)
O2RTS000 - 21W  (SEQUENCES MINVALUE)
O2RTS000 - 22W  (IDENTITY MAXVALUE)
O2RTS000 - 23W  (IDENTITY MINVALUE)

So now you know why size still matters for Db2 12! The big question now is: “Are your space management and monitoring tools up-to-date, or are they still Db2 V7?”

As usual, if you have any comments or queries please feel free to drop me a line!

TTFN

Roy Boxwell

Rotten Results from RUNSTATS Require Rescue

Do you know the basic rules to ensure access path stability when using RUNSTATS?

Time for another of my “I noticed something strange at a customer site recently” newsletters. Enjoy!

 

RUNSTATS are good aren’t they?

At this particular site, the RUNSTATS methodology of RUNSTATS was, shall we say, “sub-optimal.” They use an ancient system to decide when to RUNSTATS, and they do tablespace’s and index’s *never* at the same time. Just to complicate matters even more, they never use inline RUNSTATS because “if the REORG abends, the statistics in the DB2 catalog are dead”. Now you are all probably well aware of the scale of the disaster at this site?

 

Daily fire fighting

Nearly every day, some access path somewhere goes horribly wrong… the under- manned and over-worked DBA group are tasked to find and fix ASAP. Cures range from a quick INDEX create or change, or perhaps even a really needed RUNSTATS or REORG.

 

Why do the Access Paths go “wrong”?

The real goal is to stop firefighting and to investigate the root cause. Why do so many access paths go wrong on such a regular basis? The answer is the systemic horribleness of RUNSTATS collection. Dynamic SQL is, obviously, very very sensitive to RUNSTATS. For one thing, the statements are kicked out of the cache! The very next time they come back, the DB2 Optimizer redrives the cost calculations and “Hey Presto!” you have a bad access path. Terry Purcell and Pat Bossmann have often said that about 90% of DB2 performance problems stem from bad RUNSTATS. The old adage “garbage in – garbage out” is still true!

 

Timing is everything

The timing of the RUNSTATS is critical for stable access paths.

Basic rules are:

 1 Only do a RUNSTATS if you really really need to!
a. RUNSTATS are not cheap!
b. The Dynamic Statement cache gets wiped
c. Locks on the Catalog can occur
2Avoid doing RUNSTATS even if RTS says to run one!
a. Lots of people use the incorrect counters to trigger a RUNSTATS. Use the correct ones for the correct Object type
b. Never RUNSTATS LOB spaces – completely pointless work!
c. Even if a MASSDELETE has occurred do you really want to “reset” the DB2 catalog statistics?
d. VOLATILE tables must be handled with *extreme* care!
3Choose your RUNSTATS parameters wisely!
a. Doing a blind “RUNSTATS the world” is just as bad as running an empty RUNSTATS!
b. HISTOGRAM should be used with caution
c. More than a hundred COLGROUPs should start alarm bells ringing

Quite a list here, and it really only shows some “Rules of Thumb”. I’ll bet you all have you own?

Is there a way back from the abyss?

But what happens if you have 1000’s of partitions with terabytes of data and the RUNSTATS was, shall we say, ill-advised or badly timed? Can you go back in time? Hands up those who wants to do a PiT recovery on the production catalog! No takers???

Yes! There is a way back from the abyss

I’ll bet you are all well ahead of me here, but the way to do this is pretty straightforward. You simply acquire our latest tool, RUNSTATS Rescue, to handle it all for you. Or, you could try and reset the data in the DB2 catalog from off-line backups that you happen to have taken before the RUNSTATS that is now killing you. …You did do that, right?

Why a tool?

Apart from the fact that this tool is from us, my firm, just trying to “roll your own” can be a real nightmare. Why?

  • Because you must first find out all of the objects that were touched by the badly performing SQL.
  • Then you must get all of the DB2 Optimizer relevant data back from a point in time before the RUNSTATS executed, and/or the last REBIND(s),
  • and then you must flush the dynamic statement cache and REBIND any static SQL.

Sounds like a lot of work.

What else must you do?

You also have to be transparent and so log what you do. You must allow for the ability to back-out your changes as perhaps you make another access path even worse. And it would be really cool if you could do “on the fly” explains to check that the RUNSTATS really *is* the root of all that evil. Remember that ZPARMS and BUFFERPOOLS also have a major influence on access paths. Even the speed of your machine! It is also a must to then be able to go even further back in time – perhaps as much as a year?

Hang on – What about PLAN STABILITY?

Doesn’t plan stability save you? I hear you all cry. Well, “No” is the short answer! If your package is invalidated by a Schema change (the classics are index drop and recreate or VIEW change), then plan stability does not work anymore. Further, in DB2 12, Dynamic Plan Stability has been announced. Sadly it *also* fails right here as there is no SWITCH PREVIOUS/ORIGINAL support!

It all works together

So, for the static SQL case where the package is not invalided, Plan Stability is good. If not: – RUNSTATS Rescue to the rescue. For Dynamic SQL – RUNSTATS Rescue is the answer.

As always, any questions or comments would be most welcome!

TTFN,

Roy Boxwell