

Problem(Abstract)

After disabling Self Tuning Memory (i.e., SELF_TUNING_MEM = OFF in DB CFG), the STMM log file is still updated with messages every 3 minutes.

Symptom

Note from DB CFG that STMM is disabled:


$ db2 get db cfg |grep -i self_tuning_mem
Self tuning memory                    (SELF_TUNING_MEM) = OFF


From stmm.log file we can see events like the following still being dumped:

2013-05-24-09.34.38.373583-180 I195993E593         LEVEL: Event
PID     : 2961                 TID  : 139972652820224  PROC : db2sysc 0
INSTANCE: db2inst2             NODE : 000          DB   : INTERSEC
APPHDL  : 0-8                  APPID: *LOCAL.DB2.130524032237
AUTHID  : DB2INST2
EDUID   : 25                   EDUNAME: db2stmm (INTERSEC) 0
FUNCTION: DB2 UDB, Self tuning memory manager, stmmMemoryTunerMain, probe:0
MESSAGE : ZRC=0x0000067F=1663
DATA #1 : String, 115 bytes
Going to sleep for 180000 milliseconds.
Interval = 185, State = 1, intervalsBeforeStateChange = 10, lost4KPages = 0

2013-05-24-09.37.38.365898-180 I196587E454         LEVEL: Event
PID     : 2961                 TID  : 139972652820224  PROC : db2sysc 0
INSTANCE: db2inst2             NODE : 000          DB   : INTERSEC
APPHDL  : 0-8                  APPID: *LOCAL.DB2.130524032237
AUTHID  : DB2INST2
EDUID   : 25                   EDUNAME: db2stmm (INTERSEC) 0
FUNCTION: DB2 UDB, Self tuning memory manager, stmmMemoryTunerMain, probe:1528
MESSAGE : Starting New Interval



Cause

Turning off STMM via the DB CFG parameter disables the STMM feature; however, it does not disable STMM logging.


Diagnosing the problem

Check whether the SELF_TUNING_MEM parameter is disabled in the DB CFG configuration:

db2 get db cfg |grep -i self_tuning_mem

Confirm that the SELF_TUNING_MEM parameter is set to OFF, as below:


Self tuning memory                    (SELF_TUNING_MEM) = OFF

Check the last events in the stmm.log file to confirm that it is still being updated.


Resolving the problem

To completely disable STMM, including its logging, set the following DB2 registry variable:

db2set DB2STMM=OFF

Run db2stop / db2start 

The stmm.log file will no longer be updated with new messages.


http://www-01.ibm.com/support/docview.wss?uid=swg21640056



Question

Why does the DB2 optimizer sometimes choose an access plan which uses an index scan followed by a full table scan instead of just a full table scan?

After gathering a query explain, you may see in certain cases that the access plan shows an index scan that is joined with a full table scan. 

For example (with a table that has only 300 rows):

             300
           TBSCAN
           (   4)
           26.0089
              |
             300
            SORT
           (   5)
           26.0086
              2
              |
             300
            FETCH
           (   6)
           25.9339
              2
         /---+----\
       300         300
     IXSCAN    TABLE: TBL1
     (   7)
    0.152007
       0
       |
      300
    INDEX: TBL1_INDEX1

Answer

In certain cases the Optimizer will choose to do a full index scan and join it with the table scan, because there may be 'EMPTY' pages in the table. Performing an index scan can help avoid these empty pages, as opposed to a full table scan.

To calculate the number of empty pages in a table, query the FPAGES and NPAGES columns in the SYSCAT.TABLES view and then subtract the NPAGES value (the number of pages that contain rows) from the FPAGES value (the total number of pages in use). Empty pages can occur when entire ranges of rows are deleted.
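As a quick illustration of that arithmetic, here is a minimal sketch; the FPAGES/NPAGES values are hypothetical, not taken from a real catalog:

```python
def empty_pages(fpages, npages):
    """Empty pages = total pages in use (FPAGES) minus pages containing rows (NPAGES)."""
    return fpages - npages

# Hypothetical catalog values for a table where whole ranges of rows were deleted
print(empty_pages(fpages=1000, npages=640))  # 360 empty pages
```

A large result relative to FPAGES suggests the table is a candidate for REORG.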

As the number of empty pages increases, so does the need for table reorganization. Reorganizing a table reclaims empty pages and reduces the amount of space that a table uses. In addition, because empty pages are read into the buffer pool during a table scan, reclaiming unused pages can improve scan performance. The quickest solution is to REORG the table to remove the empty pages.


http://www-01.ibm.com/support/docview.wss?uid=swg21656021


 

Problem(Abstract)

Slow transaction commit times may be seen occasionally when the recovery history file grows very large.

Cause

Slow transaction commit times may sometimes be caused by a slowdown when writing to the history file.

If the database history file has grown very large (symptoms may start appearing at 10MB and up), it can degrade commit performance by a few seconds.

In general, the mechanism of commit is as follows:
1) The thread doing the commit puts in the request to write the log record and waits.
2) If MINCOMMIT=1, the db2loggw thread takes the record and writes it to disk.
3) If it is the end of the log file, it requests a new log file to be set up, and db2loggr fulfills the request.
4) db2loggr has to write into the history file that the log file has been cut.
5) It also writes the same information into the backup history file.
6) It sends a request to db2logmgr to archive the log file.
7) db2loggw resumes writing.

If the history file has grown so large that it can no longer be held in memory, DB2 must update it on disk. Because this update is serial, reading all the records in order and appending to the history file, it can cause a slowdown.

Resolving the problem

1) Prune the history file as much as possible using the PRUNE HISTORY command and see if performance becomes consistent.

2) Consider lowering the value of REC_HIS_RETENTN.
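As a sketch of one way to build the PRUNE HISTORY timestamp, assuming a 30-day retention window (both the window and the current date here are illustrative assumptions):

```python
from datetime import datetime, timedelta

# Keep only the last 30 days of history entries (retention window is an assumption).
cutoff = datetime(2013, 5, 24) - timedelta(days=30)

# PRUNE HISTORY accepts a timestamp prefix such as yyyymmdd.
print('db2 "prune history %s"' % cutoff.strftime("%Y%m%d"))
```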

 

http://www-01.ibm.com/support/docview.wss?uid=swg21646712&myns=swgimgmt&mynp=OCSSEPGG&mync=E


Question

Sometimes users might notice their sessions are stuck in "Commit Active" state for a period of time before the state goes away. What could be the possible reason for that?

Cause

A session in "Commit Active" state is performing the commit of a unit of work but has not yet been able to finish it for some reason.

Answer

Among the many possible reasons, the following are some common ones:

- The database is using HADR and the standby side is hitting a momentary HADR log receive buffer full situation.

- Network issues affecting HADR communications between primary and standby.

- The database has a very large history file.

- A disk bottleneck on the storage side where the database transaction logs are kept.

 

 

To identify an HADR receive buffer full situation, run "db2pd -hadr" on the standby side while the "Commit Active" state is present on the primary side, and check the StandByRcvBufUsed field in the output. If it is near 100%, it can cause the "Commit Active" situation on the primary side. Consider increasing the DB2_HADR_BUF_SIZE registry setting to address the issue.
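A small sketch of scanning that field programmatically; note that the sample text here is a simplified, hypothetical rendering of `db2pd -hadr` output (the exact layout varies by DB2 version):

```python
import re

def standby_rcv_buf_used(db2pd_output):
    """Return the StandByRcvBufUsed percentage from db2pd -hadr output, or None."""
    m = re.search(r"StandByRcvBufUsed\s*=?\s*(\d+)\s*%", db2pd_output)
    return int(m.group(1)) if m else None

sample = "Role = Standby\nStandByRcvBufUsed = 97%\n"   # hypothetical layout
pct = standby_rcv_buf_used(sample)
if pct is not None and pct >= 95:
    print("Receive buffer nearly full; consider raising DB2_HADR_BUF_SIZE")
```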

If a network issue is suspected between the two sides of the HADR pair, it should be taken to the network administrator to check for possible network bottlenecks or other kinds of network issues.

To check whether the history file has become too big, look for the db2rhist.asc file under the database directory. If it is big (for example, more than 100MB or so), try to prune it down.
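A minimal sketch for that size check, assuming you know the database directory path (the 100MB threshold mirrors the rough figure above):

```python
import os

def history_file_size_mb(db_dir):
    """Size of the recovery history file (db2rhist.asc) in megabytes."""
    return os.path.getsize(os.path.join(db_dir, "db2rhist.asc")) / (1024 * 1024)

# Example usage (path is hypothetical):
# size = history_file_size_mb("/home/db2inst1/db2inst1/NODE0000/SQL00001")
# if size > 100:  # rough threshold from the discussion above
#     print("History file is %.1f MB; consider PRUNE HISTORY" % size)
```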

A bottleneck in the disk containing the active log directory or the mirror log path is the most common cause of the "Commit Active" situation. Usually, iostat output collected while "Commit Active" is in place can show whether a disk bottleneck exists. For example, on the AIX operating system, collecting "iostat -TDla 1 30" might show the "%tm" column reaching near 100% for the disk containing the log paths. If %tm is 100% and all the data flow on that disk shows zero, that indicates a momentarily stalled disk.


 

http://www-01.ibm.com/support/docview.wss?uid=swg21650932&myns=swgimgmt&mynp=OCSSEPGG&mync=E



Question

You might encounter locking issues that require you to monitor the status of the database. Why don't snapshot monitoring commands and administrative views, such as SYSIBMADM.LOG_UTILIZATION, work in these locking situations?

Cause

Some snapshot monitoring commands or administrative views acquire DB2 internal latches for resource serialization. These latches might be owned by EDUs on the problematic database.

Answer

The MON_GET_* table functions, such as MON_GET_TABLESPACE, and the MONREPORT module don't require a DB2 internal latch.

Unfortunately, the MON_GET_TRANSACTION_LOG table function is not available in releases prior to DB2 Version 10.1. However, you can calculate used log space with the db2pd -db DBNAME -dbcfg -logs output.

The following represents the calculation of the total available log space.

LOGDATAPAGE = 4096 (LOGPAGE) - 20 (OVERHEADBYTES) = 4076

Total available log space
= LOGFILSIZ * LOGDATAPAGE * ( LOGPRIMARY + LOGSECOND )


The following represents the calculation of the total used log space.

Total used log space
= Current LSO - StartLSO of First active log file + Compensation log space


Where "StartLSO of First active log file" is the StartLSO of the log file named by "First active log file" in the db2pd -dbcfg output, and "Compensation log space" is the total of all "LogSpace" values in the db2pd -transactions output.

The following is an example of how to query the "Total available log space" and "Total used log space" with the SYSIBMADM.LOG_UTILIZATION administrative view.

$ date; db2 "select char(DB_NAME,16) DB_NAME, LOG_UTILIZATION_PERCENT, TOTAL_LOG_USED_KB, TOTAL_LOG_AVAILABLE_KB, TOTAL_LOG_USED_TOP_KB, DBPARTITIONNUM from SYSIBMADM.LOG_UTILIZATION"

Wed Mar 6 02:38:04 EST 2013

DB_NAME          LOG_UTILIZATION_PERCENT TOTAL_LOG_USED_KB TOTAL_LOG_AVAILABLE_KB TOTAL_LOG_USED_TOP_KB DBPARTITIONNUM
---------------- ----------------------- ----------------- ---------------------- --------------------- --------------
DB01                               90.80              7402                    749                  7402              0

1 record(s) selected.



In the following example, based on the db2pd output shown below, the Total available log space calculation equates to 8152K and the Total used log space equates to 7402K.

LOGDATAPAGE = 4096 - 20 = 4076

Total available log space
= LOGFILSIZ * LOGDATAPAGE * ( LOGPRIMARY + LOGSECOND )
= 256 * 4076 * 8
= 8347648 = 8152K

Total used log space
= Current LSO - StartLSO of First active log file + Compensation log space
= Current LSO 46082460 - StartLSO 41803457 + all LogSpace (1320458 + 1980634)
= 4279003 + (1320458 + 1980634) = 7580095 = 7402.4365234375K = 7402K

TOTAL_LOG_AVAILABLE_KB = 8152K - 7402K = 750K
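The same arithmetic, as a small script using the numbers taken from the db2pd output below:

```python
LOGPAGE, OVERHEADBYTES = 4096, 20
LOGDATAPAGE = LOGPAGE - OVERHEADBYTES           # 4076 usable bytes per log page

# Values from the db2pd -dbcfg / -logs / -transactions output below
logfilsiz, logprimary, logsecond = 256, 5, 3
current_lso = 46082460
start_lso_first_active = 41803457               # StartLSO of S0000001.LOG
compensation = 1320458 + 1980634                # sum of all LogSpace values

total_available = logfilsiz * LOGDATAPAGE * (logprimary + logsecond)
total_used = current_lso - start_lso_first_active + compensation

print(total_available // 1024)   # 8152 (K)
print(round(total_used / 1024))  # 7402 (K)
```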


$ date; db2pd -db DB01 -dbcfg -logs -transactions

Wed Mar 6 02:38:06 EST 2013
...

Database Member 0 -- Database DB01 -- Active -- Up 0 days 00:11:51 -- Date 03/06/2013 02:38:07

Database Configuration Settings:
...

LOGBUFSZ (4KB) 64 64
LOGFILSIZ (4KB) 256 256
LOGPRIMARY 5 5
LOGSECOND 3 3
First active log file S0000001.LOG S0000001.LOG
...

Database Member 0 -- Database DB01 -- Active -- Up 0 days 00:11:52 -- Date 03/06/2013 02:38:08

Logs:
Current Log Number 5
Pages Written 25
Cur Commit Disk Log Reads 0
Cur Commit Total Log Reads 0
Method 1 Archive Status Success
Method 1 Next Log to Archive 5
Method 1 First Failure n/a
Method 2 Archive Status n/a
Method 2 Next Log to Archive n/a
Method 2 First Failure n/a
Log Chain ID 0
Current LSO 46082460
Current LSN 0x0000000000046772

Address StartLSN StartLSO State Size Pages Filename
0x0700000063BB0940 000000000003A183 41803457 0x00000000 256 256 S0000001.LOG
0x0700000063BB11C0 000000000003D0A4 42846913 0x00000000 256 256 S0000002.LOG
0x0700000063BB1A40 0000000000040146 43890369 0x00000000 256 256 S0000003.LOG
0x0700000063BB22C0 00000000000431E9 44933825 0x00000000 256 256 S0000004.LOG
0x0700000063BB00C0 000000000004628B 45977281 0x00000000 256 256 S0000005.LOG
0x0700000070FBC580 0000000000000000 47020737 0x00000000 256 256 S0000006.LOG
0x0700000070F44100 0000000000000000 48064193 0x00000000 256 256 S0000007.LOG
0x0700000070FD00C0 0000000000000000 49107649 0x00000000 256 256 S0000008.LOG

Database Member 0 -- Database DB01 -- Active -- Up 0 days 00:11:52 -- Date 03/06/2013 02:38:08

Transactions:
Address AppHandl [nod-index] TranHdl Locks State Tflag Tflag2 Firstlsn Lastlsn Firstlso Lastlso LogSpace SpaceReserved TID AxRegCnt GXID ClientUserID ClientWrkstnName ClientApplName ClientAccntng
0x07000000606AF380 28 [000-00028] 3 20002 WRITE 0x00000000 0x00000000 0x000000000003A2C0 0x000000000003F16D 41862316 43550312 1320458 3008538 0x000000000139 1 0 n/a n/a n/a n/a
0x07000000606B4480 29 [000-00029] 4 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000F2 1 0 n/a n/a n/a n/a
0x07000000606B9580 30 [000-00030] 5 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x000000000150 1 0 n/a n/a n/a n/a
0x07000000606BE680 31 [000-00031] 6 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000F4 1 0 n/a n/a n/a n/a
0x07000000606C3780 32 [000-00032] 7 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000F6 1 0 n/a n/a n/a n/a
0x07000000606C8880 33 [000-00033] 8 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000F7 1 0 n/a n/a n/a n/a
0x07000000606CD980 34 [000-00034] 9 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000F8 1 0 n/a n/a n/a n/a
0x07000000606D2A80 35 [000-00035] 10 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000F9 1 0 n/a n/a n/a n/a
0x07000000606D7B80 36 [000-00036] 11 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000FA 1 0 n/a n/a n/a n/a
0x07000000606DCC80 37 [000-00037] 12 0 READ 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0 0 0x0000000000FC 1 0 n/a n/a db2evml_DB2DETAILDEADLOCK n/a
0x07000000606E1D80 39 [000-00039] 13 30003 WRITE 0x00000000 0x00000000 0x000000000003F16E 0x0000000000046771 43550396 46082376 1980634 4512698 0x000000000101 1 0 n/a n/a n/a n/a

References:
Space requirements for log files http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.admin.dbobj.doc/doc/c0004930.html

db2pd - Monitor and troubleshoot DB2 database command http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0011729.html


http://www-01.ibm.com/support/docview.wss?uid=swg21626153



Problem(Abstract)

I receive the below error when applying the DB2 Enterprise Server Edition license on a 32-bit Linux system:

LIC1449N The license was not installed due to a platform restriction. 

Explanation: 
This DB2 product is only supported in trial mode, also known as "Try and Buy" mode, on this platform. 

User response: 
Continue to use this product in trial mode, or install one which is fully supported on this platform. 

Cause

DB2 Enterprise Server Edition v9.5 and higher is only supported in Test and Development mode on 32-bit Linux systems.

Resolving the problem

A DB2 Enterprise Server Edition permanent license may only be applied on a 64-bit Linux system.

Here are your options: 

  • Upgrade the OS to 64-bit Linux and then install Enterprise Server Edition and license.
  • Purchase another DB2 edition such as DB2 Workgroup Server Edition or DB2 Express Edition. You will need to also obtain and apply the corresponding permanent license.
  • Purchase DB2 Developer's Edition for 32-bit Linux. The Developer's Edition license certificate (db2dede.lic) can be applied on top of an Enterprise Server Edition install without any problems.


http://www-01.ibm.com/support/docview.wss?uid=swg21421213


DB2 maintains a recovery history file (db2rhist.asc) that records utility-type operations such as table space creation and deletion and log archiving. When this file grows beyond a certain size, commits can take a long time to complete.

The post below describes a transaction hang that occurred in a DPF environment when the history file was about 240MB.



The history file in DB2 is a critical file that is usually overlooked.


Earlier this week I worked with a customer that had been experiencing hangs on certain partitions (yes, this is DPF) daily since upgrading to a later fixpack, and intermittently on the earlier fixpack.


We found that DB2 was stuck in uninterruptible kernel code. As this was Linux, the usual tools to get a kernel trace were not available. Looking at the db2support data provided by the customer, I saw that the db2rhist.asc (aka history file) was > 240MB.


The customer was advised to prune the history file so that it only contained 10 days of history. This resulted in the db2rhist.asc being only 8MB (which is still reasonably big). Since the pruning of the history file, no further hangs have been experienced to date.


There are numerous cases where a large history file causes a slowdown and even hang-like symptoms. In a majority of cases with high load activity, the history file can grow quickly; there are DB2 internal options so that loads are not logged in the history file - see DB2 support for details.


It is good practice to keep the history file as small as possible.



In more severe cases, the file has grown to as much as 600MB.


Dear all,
I wrote some weeks ago about a problem with all the committing transactions blocked for many seconds in a Commit Active state. Well, finally I solved it and I just would like to share this information.

As I suspected, there was a lock. I suspected log writing due to the commit active state, and I was right. Using truss on the db2loggw process I found that when db2 was blocked it was waiting on a Unix semaphore. I didn't find an AIX command that gives you the pid of the process holding the semaphore (you only get the pid of the last operating process, which was the db2loggw itself). This semaphore lock appeared just after a statx on the db2rhist.asc file, so I tried to understand what was happening with that file.

Oh well, the file was pretty big: 600MB. I suspected the db2logmgr and in fact it was accessing the file during the lock. Once again, truss helped me. It was reading/seeking into the file... when it finished, it did a semop on the same semaphore (an unlock, obviously) and the db2loggw started to write again and everything was unblocked and working correctly.

I removed the db2rhist.asc and the db2rhist.bak and reduced the REC_HIS_RETENTN variable to 7 days (we keep backup for less than a week) while the default is 366...

Anyway, I have some questions for any DB2 folks around here. First of all, everything happened just when there was an asnapply or asncap committing transaction on the database (the replication processes). Second, I do not understand why the db2logmgr must read so much data from the db2rhist.asc: is it responsible for the history rotation process or whatever?

Thanks a lot to everybody and bye.

Matteo Gelosa
I.NET S.p.A.



In any case, IBM documents confirm that commits slow down while the history file is being accessed, but there is no stated size threshold at which performance becomes a problem. The official position is that it is best to keep the file as small as possible.


However, starting with the recently released version 10.1 Fix Pack 2, it appears the behavior has changed so that a large history file no longer affects performance.



http://publib.boulder.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.wn.doc/doc/c0060645.html

http://veryuglyboxer.blogspot.kr/2010/01/history-file.html

http://www.dbforums.com/db2/1652165-applicaiton-staying-long-time-commit-active-state.html

http://database.ittoolbox.com/groups/technical-functional/db2-l/application-in-commit-active-status-4428765

http://dbaspot.com/ibm-db2/195591-commit-active-problem-solved-db2rhist-asc.html

http://www-01.ibm.com/support/docview.wss?uid=swg21575692







Problem(Abstract)

A very large db2rhist.asc file can cause performance degradation to DML that modifies tables.

Symptom

Poor performance during inserts, updates, and deletes. Application snapshots will show applications in COMMIT-ACTIVE state longer than they should be.


Cause

During a log archive, DB2 writes to the db2rhist.asc file. As the file gets larger, the writes take longer than usual. During this time, any inserts, updates, and deletes attempting to COMMIT during a log switch have to wait for the db2rhist.asc update to complete. This can momentarily 'freeze' the system for a few seconds.

This problem usually affects customers who use flash copy to back up the database. As DB2 backup is not invoked directly, the db2rhist.asc does not get pruned despite the RETENTION setting.


Diagnosing the problem

Disable HADR for isolation, as it may cause COMMIT-ACTIVE as well.

 

Resolving the problem

Periodically reduce the size of the db2rhist.asc file via the PRUNE HISTORY command with the force option.


http://www-01.ibm.com/support/docview.wss?uid=swg21575692

http://dbaspot.com/ibm-db2/195591-commit-active-problem-solved-db2rhist-asc.html




Technote (FAQ)


Question

Are there recommended AIX® Virtual Memory Manager settings for DB2® database product?

Answer

While it is difficult to prescribe a standard recommendation for all virtual memory manager tunables on AIX systems running DB2, there are some important specific settings where this is possible. This document focuses primarily on parameters that have non-default settings and are considered "Best Practice". These settings apply to all versions of the DB2 product running on AIX 5.3, 6.1, and 7.1, and to systems using either cached or non-cached (Direct I/O) file system I/O.

Recommendations:

The "Best Practice" recommendation for DB2 products running on AIX systems is:

maxperm%=90
maxclient%=90
minperm%=3
lru_file_repage=0

You can implement these settings with the following command, which is dynamic and takes effect immediately: 

vmo -p -o maxperm%=90 -o maxclient%=90 -o minperm%=3 -o lru_file_repage=0


Optional: 
page_steal_method=1 

Note that maxperm%, maxclient%, lru_file_repage, and page_steal_method are restricted tunables on AIX 6.1 and 7.1, and you should not tune these restricted tunables without first consulting AIX support.

You can view the current settings by executing the command "vmo -a". On AIX 6.1 and 7.1, you can view the current settings for restricted tunables by executing the command "vmo -a -F". You can view a subset of settings, such as minperm%, maxperm%, and maxclient%, by executing the command "vmstat -v"; you do not require root authority to issue this command.

Definitions:

Computational memory - Application memory (such as DB2), kernel memory

Permanent memory - File cache, such as JFS, JFS2, NFS, and GPFS.

Client memory - JFS2, NFS, and GPFS, which means all permanent memory except JFS.

maxperm% - the threshold above which permanent memory pages will be evicted in favor of computational memory pages. The default value of this setting for AIX 5L is 80%, and for AIX 6.1 and 7.1 is 90% but it is now a restricted setting for AIX 6.1 and 7.1.

maxclient% - the threshold above which client memory pages will be evicted in favor of computational memory pages. The default value of this setting for AIX 5L is 80% and for AIX 6.1 and 7.1 is 90% but it is now a restricted setting for AIX 6.1 and 7.1.

minperm% - the threshold below which computational memory pages will be evicted in favor of permanent memory pages. The default value of this setting for AIX 5L is 20% and for AIX 6.1 and 7.1 is 3%.

lru_file_repage - when the number of permanent memory pages (numperm) falls between minperm and maxperm, or the number of client memory pages falls between minperm and maxclient, this setting indicates whether repaging rates are considered when deciding to evict permanent or computational memory pages. Setting this to 0 tells AIX to ignore repaging rates and favor evicting permanent memory pages, thus keeping more computational memory in RAM. The default value for AIX 5L is 1 (true), which means that by default it considers the repaging rate. The default value for AIX 6.1 and 7.1 is 0 (false), but it is now a restricted setting.

page_steal_method - selects the AIX LRU scanning method. "0" is the default value for AIX 5.3 and tells AIX to use the legacy page frame table method. "1" is the default value for AIX 6.1 and 7.1 and tells AIX to use a list-based scanning method. This is a restricted tunable on AIX 6.1 and 7.1.

Explanation:

Since the lru_file_repage tunable and related AIX behavior changes were introduced, the recommended settings of key virtual memory manager parameters have changed. The older strategy for DB2 product running on AIX systems was to reduce file cache usage with a combination of moderately low maxperm or maxclient settings, such as 15-30%, and a much lower minperm setting, such as 5-10%. This strategy was required to prevent the paging out of computational memory on systems with a very active file cache.

The new tuning strategy takes advantage of lru_file_repage, and it includes settings which apply to all DB2 products running on AIX systems. These settings allow for a large filesystem cache where desirable, but not at the expense of paging out computational memory. Following these guidelines allows DB2 to take maximum advantage of available system memory without risking the effects of unnecessary paging due to file cache activity. It should be noted that the "lru_file_repage=0" setting is especially important for the optimal performance of DB2's Self-Tuning Memory Manager (STMM). STMM assumes that file pages above minperm can be used for other purposes, such as bufferpools, without paging out computational memory.

The new virtual memory manager settings are a general AIX recommendation and are not restricted to only systems running DB2 product. These settings are also the default settings on AIX 6.1 and 7.1. They have been adopted by the IBM Smart Analytics System and InfoSphere Balanced Warehouse and are now the recommended best practice for all DB2 products running on AIX systems.

An optional setting, "page_steal_method=1", has been included as an acceptable setting for DB2 product running on AIX systems. It can improve AIX performance by switching to list-based LRU scanning, and this is also the default on AIX 6.1 and 7.1. However, the benefits for DB2 product running on AIX systems are not as widespread as the other tunables, and it is not strictly a "Best Practice".

Other Resources:

For details on the AIX concepts discussed in this technote, please see the following links :
http://www.ibm.com/developerworks/aix/library/au-vmm/
http://publib.boulder.ibm.com/infocenter/systems/index.jsp
http://www.redbooks.ibm.com/redbooks/pdfs/sg246039.pdf


http://www-01.ibm.com/support/docview.wss?uid=swg21328602


Question

If you think that the DB2® index manager might be the origin of a problem, collect diagnostic data that you or IBM® Software Support can use to diagnose and resolve the problem.

Cause

Background

There are many flavours of index-related corruption problems, for instance:

  • Unique index contains duplicates with different RID(s) or same RID(s)
  • Multiple index entries pointing to the same RID
  • Index key is out of place (wrong index key order)
  • Row exists, but index keys do not exist in any or some of the indexes
  • Index entry pointing to an empty data slot or unused data slot or RID is invalid
  • Incorrect previous or next index page pointers, incorrect high key or other corruptions on index pages

Even if you see the same return code from the same function or probe point, the underlying problem can vary. For instance, when an SQLI_NOKEY error is received from index function "procLeaf2Del" or "sqlischd" during a delete or update, you could be hitting problems 3, 4, or 6.

If "row not found" or "bad page" type of error is received when an index is used to fetch the RID for a delete, update or select, then you could be hitting problems 3 or 5. 

In short, for each problem, a DB2 support analyst will need to look at the data collected and analyze it to determine root cause. Collecting as much data as possible at the first point of failure, as well as preserving the necessary log files, is often critical to root cause determination.

Since index corruption or index data inconsistency type of errors could be caused by DB2 defects or by other issues such as hardware problems, it is necessary to collect the proper diagnostic data in order to determine the potential cause of the error. 


DB2 tools to detect these problems 

DB2 products offer a variety of tools to detect index or data corruption problems and index or data mismatch problems. 

  1. An off-line db2dart /t (table inspection) would be able to detect corruption symptoms 1, 2, 3, 6 of the above list. For example:

    The table resides in table space id 4, the object id of the table is 2, the database name is T30, and the output file is "inspTsi4oi2.rpt":
    db2dart T30 /t /tsi 4 /oi 2 /scr n /rptn inspTsi4oi2.rpt

  2. The RUNSTATS command with a special registry variable set will collect statistics as well as detect corruption symptoms 4, 5 of the above list. For example:

    db2stop
    db2set DB2_KEYDATACHECK=IXSTATDATADEBUG
    db2start

    db2 connect to <database>
    db2 -v "runstats on table <tableschema>.<tabname> and detailed indexes all allow read access"

    NOTE: After each RUNSTATS, you need to save the contents of the sqllib/db2dump directory. You can do so by moving the entire contents to another location. If the RUNSTATS does not return with "DB2 completed successfully", you will need to report this to DB2 support.

Some enhancements are currently being investigated. The goal is to improve the existing tools for serviceability and usability in future releases in the above areas.

Answer

You often need advanced knowledge of DB2 in order to determine whether the problem is an index corruption or an index or data inconsistency problem. Therefore you should always contact the IBM support organization to help you recover from these problems.



For index-related corruption or index or data inconsistency problems, the following information is typically required:

  • the contents of the DB2DUMP directory
  • the on-disk version of the related index pages (collected via the db2dart tool)
  • related data page(s) for the index or data object (collected via the db2dart tool)
  • database transaction log files, which are used to understand the past history of the index or table object that encountered the error.

The following steps are the recommended best practice for data collection (and database recovery) when an error related to index corruption occurs. NOTE: DO NOT perform these steps unless you have been asked to do so by an IBM DB2 Support analyst. 

  1. Collect the standard diagnostic information using the db2support command. This should include all the files in DB2DUMP directory (or the path specified by DIAGPATH in the dbm configuration).

    There are cases where the corruptions are caused by a disk problem, and you do not care about root cause analysis of the problem. In these situations, skip step 2, and go directly to step 3.

  2. Run db2dart as indicated below (note: if this is a multi-partition DPF environment, then you will need to run db2dart on the partition where failure occurred):

    db2dart <dbname> /t /tsi <pool id> /oi <object id> /rptn inspTsi<poolid>oi<objectid>_bcrashrec.rpt

    In general, this inspection should be fairly quick. If the database is mission critical and needs to be brought online immediately, then you may skip this step at this time.

  3. Restart your database and verify crash recovery completes successfully.


    If crash recovery completes successfully and you cannot take the time needed to collect this data for root cause analysis of the problem, skip steps 4, 5 and 6 and go directly to step 7.


    If the restart fails, and it is determined to be a problem redoing or undoing an index record, then you can mark an index invalid (e.g. db2dart <dbname> /mi /oi <objectID> /tsi <tablespaceID>). INDEXREC will control when the invalid index object will be recreated.
  4. To prevent the previous failing application from accessing the table again before the problem on the related index or table is resolved, you can lock the table in exclusive mode. Use the following command:


    db2 +c "LOCK TABLE <SCHEMA NAME>.<TABLENAME> IN EXCLUSIVE MODE"


    The "+c" turns off the auto commit, since LOCK TABLE is only effective within the unit of work.


    If you are concerned about the availability of the rest of the tables in the same table space as the problem table, you may instead want to consider quiescing the table spaces for the table object. For example:


    db2 "QUIESCE TABLESPACES FOR TABLE <SCHEMA NAME>.<TABLE NAME> EXCLUSIVE"


    NOTE: QUIESCE command will also quiesce the index table space if the index object is in a separate table space. In a partitioned database environment, this command will only quiesce the portion on current node.

  5. Run db2dart against the index and table object.


    If the problem is an index NOKEY error, then a number of db2dart commands will be printed in the db2diag.log file. Search for these commands (using either grep on UNIX® or find on Windows®) and save them in a file. Edit the file and replace "DBNAME" with the database name. If the problem has been hit multiple times, there could be some duplicate entries; you only need to keep the latest set of db2dart commands.

    If it is the data manager that reports a row not found or a bad page, then NO db2dart commands are printed out. You can determine the pool ID, object ID, and data page number from the db2diag.log entries. For example:


    2006-06-17-11.32.59.941406+000 I235565A460 LEVEL: Severe
    PID : 9224396 TID : 1 PROC : db2agent (SAMPLE) 0
    INSTANCE: db2inst NODE : 000 DB : OB8
    APPHDL : 0-78 APPID: ....
    FUNCTION: DB2 UDB, data management, sqldDeleteRow, probe:1329
    RETCODE : ZRC=0x87040001=-2029780991=SQLD_BADPAGE "Bad Data Page"
    DIA8500C A data file error has occurred, record id is "".


    2006-06-17-11.32.59.964241+000 I236403A380 LEVEL: Severe
    PID : 9224396 TID : 1 PROC : db2agent (SAMPLE) 0
    INSTANCE: db2inst NODE : 000 DB : OB8
    APPHDL : 0-78 APPID: ...
    MESSAGE : PAGE OBJECT IDENTIFIERS:
    DATA #1 : String, 54 bytes
    Tablespace ID = 15, Object ID = 8737, Object Type = 0


    2006-06-17-11.32.59.964448+000 I236784A350 LEVEL: Severe
    PID : 9224396 TID : 1 PROC : db2agent (SAMPLE) 0
    INSTANCE: db2inst NODE : 000 DB : OB8
    APPHDL : 0-78 APPID: ...
    MESSAGE : PAGE NUMBERS:
    DATA #1 : String, 35 bytes
    Obj Page = 297, Pool Page = 787527


    In this case, the data page is 787527, pool id is 15 and object id is 8737.
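    The three identifiers can be pulled out of a saved db2diag.log excerpt with standard text tools. A minimal sketch, assuming the relevant entries have been copied into a file (the file name diag_excerpt.log is hypothetical; the sample lines are taken from the example above):

```shell
# Extract the table space ID, object ID, and pool-relative page number
# from the "PAGE OBJECT IDENTIFIERS" / "PAGE NUMBERS" entries.
# diag_excerpt.log is a hypothetical file holding the relevant db2diag.log lines.
cat > diag_excerpt.log <<'EOF'
Tablespace ID = 15, Object ID = 8737, Object Type = 0
Obj Page = 297, Pool Page = 787527
EOF

pool_id=$(sed -n 's/.*Tablespace ID = \([0-9]*\).*/\1/p' diag_excerpt.log)
obj_id=$(sed -n 's/.*Object ID = \([0-9]*\).*/\1/p' diag_excerpt.log)
pool_page=$(sed -n 's/.*Pool Page = \([0-9]*\).*/\1/p' diag_excerpt.log)

echo "pool id = $pool_id, object id = $obj_id, data page = $pool_page"
# → pool id = 15, object id = 8737, data page = 787527
```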

    The commands that you need are:

    1. db2dart <dbname> /dd /tsi <pool id> /oi <object id> /ps <data page number>p /np 1 /v y /scr n /rptn dTsi<poolid>oi<objectid>.rpt

      ...where <data page number> is 787527 in the example above; the trailing "p" indicates a pool-relative page number, which is important for DMS table spaces.

    2. db2dart <dbname> /di /tsi <pool id> /oi <object id> /ps 0 /np <large enough value to dump all index pages> /v y /scr n /rptn iTsi<poolid>oi<objectid>_0-<pagenum>.rpt


      Note: The command might take longer to run if a large /np value is specified. Ideally, you should collect as much data as possible; at minimum, collect no fewer than 100 pages.

      If the table is not too big, also dump the whole table. If the table is too big to dump, select a reasonably large number so that at least part of the table can be dumped, for example:

      db2dart <dbname> /dd /tsi <pool id> /oi <object id> /ps <data page to start> /np <large enough number> /v y /scr n /rptn dTsi<pool id>oi<object id>_0-<page num>.rpt

      where <data page to start> would be "0" if you want to dump the whole table, or <data page number - 100>p if the table is too big to dump (using the example above, that would be "787427p"). Replace <pool id> and <object id> with the actual numbers.
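      Substituting the example values into the templates above, a small sketch that only assembles and prints the command lines (the database name SAMPLE and the /np value of 1000 are assumptions; nothing is executed against DB2):

```shell
# Build the db2dart dump commands from the identifiers found in db2diag.log.
# dbname is a placeholder; the commands are printed here, not executed.
dbname=SAMPLE
pool_id=15
obj_id=8737
pool_page=787527
start_page=$((pool_page - 100))   # back up 100 pages when the table is too big to dump fully

single_page_cmd="db2dart $dbname /dd /tsi $pool_id /oi $obj_id /ps ${pool_page}p /np 1 /v y /scr n /rptn dTsi${pool_id}oi${obj_id}.rpt"
range_cmd="db2dart $dbname /dd /tsi $pool_id /oi $obj_id /ps ${start_page}p /np 1000 /v y /scr n /rptn dTsi${pool_id}oi${obj_id}_0-1000.rpt"

echo "$single_page_cmd"
echo "$range_cmd"
```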

  6. Run RUNSTATS to check for any index or data inconsistency. For example:


    db2stop
    db2set DB2_KEYDATACHECK=IXSTATDATADEBUG
    db2start
    db2 connect to <db>
    db2 -v "SELECT INDSCHEMA, INDNAME FROM SYSCAT.INDEXES WHERE TABSCHEMA='<TABLE SCHEMA NAME>' AND TABNAME='<TABLENAME>' "


    Then use the index schema name and index name from the above SELECT output for the following RUNSTATS command:

    db2 -v "runstats on table <tableschema>.<tabname> and index <index schema>.<index name> allow read access"


    NOTE: You need to perform the RUNSTATS on all of the indexes defined on the offending table; after each RUNSTATS command, you need to save the contents of the sqllib/db2dump directory (or DIAGPATH, if you're not using the default path).


    RUNSTATS on a large table could take a while, and the table will not be available for updates while RUNSTATS is running. In a multi-partition DPF environment, RUNSTATS is done on a per-partition basis.
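    The per-index RUNSTATS can be scripted; a minimal sketch that only generates the commands from a saved index list (the file indexes.txt, the schema DB2INST2, the table CUSTOMER, and the index names are all hypothetical stand-ins for the SYSCAT.INDEXES query output):

```shell
# Generate one RUNSTATS command per index on the offending table.
# indexes.txt stands in for the INDSCHEMA/INDNAME pairs returned by the
# SYSCAT.INDEXES query above; schema, table, and index names are placeholders.
cat > indexes.txt <<'EOF'
DB2INST2 IDX_CUST_PK
DB2INST2 IDX_CUST_NAME
EOF

while read -r ischema iname; do
  echo "db2 -v \"runstats on table DB2INST2.CUSTOMER and index ${ischema}.${iname} allow read access\""
done < indexes.txt > runstats_cmds.txt

cat runstats_cmds.txt
```

    Remember to save the contents of the diagnostic directory after each generated command is run.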


    If you quiesced the table spaces in step 4, unquiesce them before you terminate the database connection. The RESET must be issued under the same user ID that ran the QUIESCE command:


    db2 -v "QUIESCE TABLESPACES FOR TABLE <SCHEMA NAME>.<TABLE NAME> RESET" 


    Once all the RUNSTATS commands are completed, you need to collect the db2diag.log along with the content of sqllib/db2dump (or content of DIAGPATH).

  7. If it is determined that the index is corrupted (that is, if either RUNSTATS or db2dart /t inspection returns with an error), then proceed with rebuilding the index as follows:


    db2 "REORG INDEXES ALL FOR TABLE <SCHEMA NAME>.<TABLE NAME> ALLOW NO ACCESS"


    Applications cannot access your table while the above REORG is running. If you need the table to be available immediately, you can choose to run the above REORG INDEXES command with ALLOW READ ACCESS or ALLOW WRITE ACCESS options. Be aware, however, that the database is at risk of being brought down again if the previous failing application is run against the table before the REORG completes.


    If you are not concerned with root cause analysis of the problem, you do not need to collect any additional information and can skip the subsequent steps.

  8. Collect table or index DDLs. You can use the db2look utility to do so. For example: 

    db2look -d <database> -l -e -a -t <table name> -o table.ddl


  9. Collect transaction log files

    You will only know which log files to collect after a DB2 Support analyst examines the data collected thus far. In general, they will ask you to send a raw log file. In some special cases, they may ask you to format the file and only send the relevant log files. The number of log files required will depend on when the problem was introduced, so it might be necessary to retrieve older log files from archive. 

  10. Provide additional information that may be important to root cause analysis or known APAR identification. For example:
    • What type of workload is typically running against the table?
    • Did you perform any actions that you don't normally do on this table, for instance any REORG INDEXES/TABLE commands, load or import operations, a high utilities workload, or restore or rollforward operations?
    • Were there any other unusual occurrences? For example, if there were recent hardware problems, provide the operating system error log. This would involve collecting the errpt -a output on AIX® or /var/adm/messages file on Solaris.


http://www-01.ibm.com/support/docview.wss?uid=swg21244563
