Monday 10 March 2014

What’s Up With dbms_stats?

What’s Up With dbms_stats?

Introduction
For years Oracle has been telling us to use the DBMS_STATS package instead of the ANALYZE command to gather statistics to be used by the Oracle optimizer. Many of us have heeded this advice and switched from using ANALYZE for scheduled statistics-gathering jobs to using DBMS_STATS. Still, others have said, “ANALYZE has always worked, so I see no reason to change.” And clients often ask, “What settings should I use to analyze my tables?” The options available for the ANALYZE command were essentially limited to “compute” and “estimate”, with some histogram options, and people tended to choose based on how long the ANALYZE job took. But the DBMS_STATS package covers a lot of territory, and within its many procedures there are a multitude of options available. The choice of which options to use can dramatically affect your results, both in the accuracy of statistics and the performance of the statistics gathering operation itself. So, what’s a DBA to do?
The objective of this paper is to examine some of the statistics-gathering options and their impact. The focus will be on actual experience, measured performance, and detailed examples—as opposed to what the documentation says. We can’t cover every option possible (or this would be a very long paper), but we will focus on the options which seem most useful and/or problematic. Contrary to what the folks at Redwood Shores would prefer, most of you are probably not yet using Oracle 10g for all your production databases. Therefore, much of the focus of this paper will be on Oracle 9i issues. But we will address some of the changes and differences in 10g.
When we’re done, hopefully you’ll have a better idea why you might want to choose various options when gathering statistics on your databases. And you’ll probably also think of additional testing you’ll want to perform on your own data.
DBMS_STATS Procedures
The DBMS_STATS package contains over 40 procedures. These include procedures to: delete existing statistics for a table, schema, or database; set statistics to desired values; export and import statistics; gather statistics for a schema or entire database; and monitor tables for changes. We will focus on DBMS_STATS.GATHER_TABLE_STATS and DBMS_STATS.GATHER_INDEX_STATS, because they are the starting point for getting statistics so the Oracle optimizer can make informed decisions. These two procedures gather statistics on the data in a specific table or index, respectively.
The parameters of DBMS_STATS.GATHER_TABLE_STATS are:
DBMS_STATS.GATHER_TABLE_STATS (
   ownname          VARCHAR2, 
   tabname          VARCHAR2, 
   partname         VARCHAR2 DEFAULT NULL,
   estimate_percent NUMBER   DEFAULT NULL, 
   block_sample     BOOLEAN  DEFAULT FALSE,
   method_opt       VARCHAR2 DEFAULT 'FOR ALL COLUMNS SIZE 1',
   degree           NUMBER   DEFAULT NULL,
   granularity      VARCHAR2 DEFAULT 'DEFAULT', 
   cascade          BOOLEAN  DEFAULT FALSE,
   stattab          VARCHAR2 DEFAULT NULL, 
   statid           VARCHAR2 DEFAULT NULL,
   statown          VARCHAR2 DEFAULT NULL,
   no_invalidate    BOOLEAN  DEFAULT FALSE);
The parameters of DBMS_STATS.GATHER_INDEX_STATS are:
DBMS_STATS.GATHER_INDEX_STATS (
   ownname          VARCHAR2, 
   indname          VARCHAR2, 
   partname         VARCHAR2 DEFAULT NULL,
   estimate_percent NUMBER   DEFAULT NULL,
   stattab          VARCHAR2 DEFAULT NULL, 
   statid           VARCHAR2 DEFAULT NULL,
   statown          VARCHAR2 DEFAULT NULL,
   degree           NUMBER   DEFAULT NULL,
   granularity      VARCHAR2 DEFAULT 'DEFAULT',
   no_invalidate    BOOLEAN  DEFAULT FALSE);
For the purposes of our testing, the interesting parameters are estimate_percent, block_sample, method_opt, and cascade, because these are the parameters which address statistics accuracy and performance. We won’t address degree and granularity, which can be useful in improving performance for very large databases. And most of the other parameters deal with non-performance choices.
These options together are one big advantage that DBMS_STATS has over the old ANALYZE command, because they give us additional flexibility in how the statistics gathering is done. Other advantages will be addressed below.
estimate_percent
This parameter is similar to the old “estimate statistics sample x percent” parameter of the ANALYZE command. The value for estimate_percent is the percentage of rows to estimate, with NULL meaning compute. You can use the constant DBMS_STATS.AUTO_SAMPLE_SIZE to have Oracle determine the best sample size for good statistics.
block_sample
The value for block_sample determines whether or not to use random block sampling instead of random row sampling. According to the documentation, “Random block sampling is more efficient, but if the data is not randomly distributed on disk, then the sample values may be somewhat correlated.”
method_opt
This parameter determines whether to collect histograms to help in dealing with skewed data. FOR ALL COLUMNS or FOR ALL INDEXED COLUMNS with a SIZE value determines which columns and how many histogram buckets to use. Instead of an integer value for SIZE, you can specify SKEWONLY to have Oracle determine the columns on which to collect histograms based on their data distribution. Or you can specify AUTO to have Oracle determine the columns on which to collect histograms based on data distribution and workload.
cascade
The value for cascade determines whether to gather statistics on indexes as well.
Our approach was to try different values for these parameters and look at the impact on accuracy of statistics and performance of the statistics gathering job itself.
Our Data
For our tests, we will use two tables, FILE_HISTORY and PROP_CAT:
FILE_HISTORY [1,951,673 rows  28507 blocks, 223MB ]
 Column Name              Null?    Type          Distinct Values
 ------------------------ -------- ------------- ---------------
 FILE_ID                  NOT NULL NUMBER                1951673
 FNAME                    NOT NULL VARCHAR2(240)         1951673
 STATE_NO                          NUMBER                      6
 FILE_TYPE                NOT NULL NUMBER                      7
 PREF                              VARCHAR2(100)           65345
 CREATE_DATE              NOT NULL DATE
 TRACK_ID                 NOT NULL NUMBER
 SECTOR_ID                NOT NULL NUMBER
 TEAMS                             NUMBER
 BYTE_SIZE                         NUMBER
 START_DATE                        DATE
 END_DATE                          DATE
 LAST_UPDATE                       DATE
 CONTAINERS                        NUMBER

PROP_CAT    [11,486,321 rows  117705 blocks, 920MB ]
 Column Name              Null?    Type          Distinct Values
 ------------------------ -------- ------------- ---------------
 LINENUM                  NOT NULL NUMBER(38)           11486321
 LOOKUPID                          VARCHAR2(64)            40903
 EXTID                             VARCHAR2(20)         11486321
 SOLD                     NOT NULL NUMBER(38)                  1
 CATEGORY                          VARCHAR2(6)
 NOTES                             VARCHAR2(255)
 DETAILS                           VARCHAR2(255)
 PROPSTYLE                         VARCHAR2(20)            48936
The indexes on these tables are as follows:
Table        Unique?    Index Name                               Column Name
------------ ---------- ---------------------------------------- ---------------
FILE_HISTORY NONUNIQUE  TSUTTON.FILEH_FNAME                      FNAME

             NONUNIQUE  TSUTTON.FILEH_FTYPE_STATE                FILE_TYPE
                                                                 STATE_NO

             NONUNIQUE  TSUTTON.FILEH_PREFIX_STATE               PREF
                                                                 STATE_NO

             UNIQUE     TSUTTON.PK_FILE_HISTORY                  FILE_ID


PROP_CAT     NONUNIQUE  TSUTTON.PK_PROP_CAT                      EXTID
                                                                 SOLD

             NONUNIQUE  TSUTTON.PROPC_LOOKUPID                   LOOKUPID

             NONUNIQUE  TSUTTON.PROPC_PROPSTYLE                  PROPSTYLE
The number of distinct values for the various columns were calculated using
select count (distinct col_name) from table_name;
rather than from gathering statistics.
When we performed our tests, we used two queries to find out the values for the statistics gathered.
  1. index.sql
    select
      ind.table_name,
      ind.uniqueness,
      col.index_name,
      col.column_name,
      ind.distinct_keys,
      ind.sample_size
    from
      dba_ind_columns  col,
      dba_indexes      ind
    where
      ind.table_owner = 'TSUTTON'
        and
      ind.table_name in ('FILE_HISTORY','PROP_CAT')
        and
      col.index_owner = ind.owner 
        and
      col.index_name = ind.index_name
        and
      col.table_owner = ind.table_owner
        and
      col.table_name = ind.table_name
    order by
      col.table_name,
      col.index_name,
      col.column_position;
  2. tabcol.sql
    select table_name, column_name, data_type, num_distinct,
     sample_size, to_char(last_analyzed, '     HH24:MI:SS') last_analyzed,
     num_buckets buckets 
    from dba_tab_columns 
    where table_name in ('FILE_HISTORY','PROP_CAT')
    order by table_name, column_id;
Occasionally we chose to omit columns or rows from the output to improve readability.
The Tests
Our initial testing was on an Oracle 9.2.0.4 Enterprise Edition database on Sun Solaris 8.
ANALYZE
We performed a couple of quick tests using the old analyze command, just to see where we came from. Let’s look at a “quick and dirty estimate statistics” run.
SQL> analyze table file_history estimate statistics;

Table analyzed.

Elapsed: 00:00:08.26


SQL> analyze table prop_cat estimate statistics;

Table analyzed.

Elapsed: 00:00:14.76
Let’s say this is how we’ve been analyzing our tables nightly. Then someone performs the query:
SQL> SELECT FILE_ID, FNAME, TRACK_ID, SECTOR_ID FROM file_history WHERE FNAME = 'SOMETHING';
and it takes “forever.” So we dutifully repeat the query with autotrace on:
SQL> SELECT FILE_ID, FNAME, TRACK_ID, SECTOR_ID FROM file_history WHERE FNAME = 'SOMETHING';

no rows selected

Elapsed: 00:00:08.66

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=2743 Card=10944 Bytes=678528)
   1    0   TABLE ACCESS (FULL) OF 'FILE_HISTORY' (Cost=2743 Card=10944 Bytes=678528)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      28519  consistent gets
      28507  physical reads
          0  redo size
        465  bytes sent via SQL*Net to client
        460  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed
So we have the old favorite “it’s not using our index” complaint; there is an index on file_history(FNAME). Let’s look at the index statistics:
SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,937,490         1106

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               12         1266
                                             STATE_NO                12         1266

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                65,638         1053
                                             STATE_NO            65,638         1053

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,952,701         1347


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,429,046         1356
                                             SOLD            11,429,046         1356

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID           197,743         1237

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE              172         1317
Hmm, it says our index has 1,937,490 distinct keys—pretty close to the actual value of 1,951,673. But this is what we see when we look at DBA_TAB_COLUMNS:
SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1948094        1034      13:42:42       1
                FNAME           VARCHAR2            178        1034      13:42:42       1
                STATE_NO        NUMBER                1        1034      13:42:42       1
                FILE_TYPE       NUMBER                7        1034      13:42:42       1
                PREF            VARCHAR2            478        1034      13:42:42       1
                CREATE_DATE     DATE            1948094        1034      13:42:42       1
                TRACK_ID        NUMBER                9        1034      13:42:42       1
                SECTOR_ID       NUMBER                6        1034      13:42:42       1
                TEAMS           NUMBER              101        1034      13:42:42       1
                BYTE_SIZE       NUMBER                0                  13:42:42       1
                START_DATE      DATE                  0                  13:42:42       1
                END_DATE        DATE                  0                  13:42:42       1
                LAST_UPDATE     DATE             419317        1034      13:42:42       1
                CONTAINERS      NUMBER              101        1034      13:42:42       1

PROP_CAT        LINENUM         NUMBER         11476127        1048      13:43:17       1
                LOOKUPID        VARCHAR2          19954        1048      13:43:17       1
                EXTID           VARCHAR2       11476127        1048      13:43:17       1
                SOLD            NUMBER                1        1048      13:43:17       1
                CATEGORY        VARCHAR2            705        1048      13:43:17       1
                NOTES           VARCHAR2              0                  13:43:17       1
                DETAILS         VARCHAR2            704        1048      13:43:17       1
                PROPSTYLE       VARCHAR2          21576        1048      13:43:17       1
Yikes! DBA_TAB_COLUMNS shows only 178 distinct values of FNAME. The optimizer uses this value and concludes that it would be fastest to do a full table scan. We can also see that it thinks we only have 478 distinct values for PREF, rather than the 65,345 we actually have. In the prop_cat table, the PROPSTYLE column shows 21576 distinct values in DBA_TAB_COLUMNS and its index shows 172 distinct values, while there are actually 48,936 distinct values. So much for “quick and dirty.”
So, let’s get some better statistics. The 1000 rows sampled by “estimate statistics” is not very many (0.05%). Let’s try 5% of the rows.
SQL> analyze table file_history estimate statistics sample 5 percent;

Table analyzed.

Elapsed: 00:00:36.21

SQL> analyze table prop_cat estimate statistics sample 5 percent;

Table analyzed.

Elapsed: 00:02:35.11
How does our query look now?
SQL> SELECT FILE_ID, FNAME, TRACK_ID, SECTOR_ID FROM file_history WHERE FNAME = 'SOMETHING';

no rows selected

Elapsed: 00:00:00.54

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=54 Card=110 Bytes=6820)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'FILE_HISTORY' (Cost=54 Card=110 Bytes=6820)
   2    1     INDEX (RANGE SCAN) OF 'FILEH_FNAME' (NON-UNIQUE) (Cost=3 Card=110)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads
          0  redo size
        465  bytes sent via SQL*Net to client
        460  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed
Much better. Let’s see how the stats look.
SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,926,580       101179

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE                8       102128
                                             STATE_NO                 8       102128

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                74,935        98709
                                             STATE_NO            74,935        98709

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,952,701       101025


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,478,536       589350
                                             SOLD            11,478,536       589350

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID            68,643       589125

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE           72,154       582261



SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1951673       93010      13:59:20
                FNAME           VARCHAR2          11133       93010      13:59:20
                STATE_NO        NUMBER                5       93010      13:59:20
                FILE_TYPE       NUMBER                7       93010      13:59:20
                PREF            VARCHAR2          23744       93010      13:59:20

PROP_CAT        LOOKUPID        VARCHAR2          19462      541472      14:00:28
                EXTID           VARCHAR2       11486321      541472      14:00:28
                SOLD            NUMBER                1      541472      14:00:28
                PROPSTYLE       VARCHAR2          19633      541472      14:00:28
The number of distinct values for the FNAME column still only shows as 11133, but that’s a far cry better than 178. And at least our query works. Also, the number of distinct values for the PREF column now shows 23744, which is a big improvement. The PROPSTYLE column in prop_cat now shows 19,633 distinct values in DBA_TAB_COLUMNS and 72,154 in DBA_INDEXES.
Out of curiosity, let’s see what results we get if we take the time to do a full “compute statistics.”
SQL> analyze table file_history compute statistics;

Table analyzed.

Elapsed: 00:07:38.32

SQL> analyze table prop_cat compute statistics;

Table analyzed.

Elapsed: 00:29:15.29

SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,951,673      1951673

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               23      1951673
                                             STATE_NO                23      1951673

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                65,390      1951673
                                             STATE_NO            65,390      1951673

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,951,673      1951673


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,486,321     11486321
                                             SOLD            11,486,321     11486321

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID            40,903     11486321

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE           48,936     11486321




SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1951673     1951673      14:26:26
                FNAME           VARCHAR2          78692     1951673      14:26:26
                STATE_NO        NUMBER                6     1951673      14:26:26
                FILE_TYPE       NUMBER                7     1951673      14:26:26
                PREF            VARCHAR2          65345     1951673      14:26:26

PROP_CAT        LOOKUPID        VARCHAR2          40903    11486321      14:34:04
                EXTID           VARCHAR2       11486321    11486321      14:34:04
                SOLD            NUMBER                1    11486321      14:34:04
                PROPSTYLE       VARCHAR2          48936    11486321      14:34:04
We spent nearly 37 minutes analyzing these two tables with “compute statistics,” and we still only show 78692 distinct values for FNAME. That’s enough for our simple query. But, it’s hard to believe that after examining every row in the table, the statistics are off by more than an order of magnitude for this column—especially since every row has a different value for FNAME. At least the values are accurate for the PROPSTYLE column in prop_cat.
DBMS_STATS
Maybe it’s time to take Oracle’s recommendation, and use DBMS_STATS. Let’s start with a “quick and dirty” run, as a starting point. We’ll use a 1% estimate and cascade=true, so that the indexes will be analyzed also.
SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY',estimate_percent=>1,cascade=>true)

PL/SQL procedure successfully completed.

Elapsed: 00:01:41.70

SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT',estimate_percent=>1,cascade=>true)

PL/SQL procedure successfully completed.

Elapsed: 00:01:44.29


SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,926,511       125515

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               11       473488
                                             STATE_NO                11       473488

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                14,207       422476
                                             STATE_NO            14,207       422476

             UNIQUE     PK_FILE_HISTORY      FILE_ID          2,035,824       537004


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           10,994,527       373922
                                             SOLD            10,994,527       373922

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,886       446130

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            2,978       506796



SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1937700       19377      15:20:30
                FNAME           VARCHAR2        1937700       19377      15:20:30
                STATE_NO        NUMBER                3       19377      15:20:30
                FILE_TYPE       NUMBER                7       19377      15:20:30
                PREF            VARCHAR2           6522       14342      15:20:30

PROP_CAT        LOOKUPID        VARCHAR2          16959      115150      15:23:16
                EXTID           VARCHAR2       11515000      115150      15:23:16
                SOLD            NUMBER                1      115150      15:23:16
                PROPSTYLE       VARCHAR2          16984      115150      15:23:16
Gathering statistics on the two tables and their indexes took 3 minutes, 26 seconds (3:26). This compares to 23 seconds for the “analyze estimate statistics” and 3:11 for “analyze estimate 5%.” The statistics are very accurate for the column FNAME, but PREF is still off a bit (showing 6,522 rows rather than the correct value of 65,345). And PROPSTYLE shows 2978 distinct index values and 16,984 distinct column values (vs. the correct value of 48,936).
If we increase the estimate percent to 5%, what will we get?
SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY',estimate_percent=>5,cascade=>true)
 
PL/SQL procedure successfully completed.

Elapsed: 00:01:23.52

SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT',estimate_percent=>5,cascade=>true)

PL/SQL procedure successfully completed.

Elapsed: 00:03:08.75


SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1955420       97771      16:40:12
                FNAME           VARCHAR2        1955420       97771      16:40:12
                PREF            VARCHAR2          24222       72377      16:40:12

PROP_CAT        LOOKUPID        VARCHAR2          19860      573201      16:51:20
                EXTID           VARCHAR2       11464020      573201      16:51:20
                SOLD            NUMBER                1      573201      16:51:20
                PROPSTYLE       VARCHAR2          20095      573201      16:51:20
Gathering statistics on the two tables and their indexes now takes 4:32, but PREF now shows 24,222 distinct values (65,345 is the correct value) and PROPSTYLE shows 20,095 (48,936 is the correct value). While these are not super-accurate, they’re probably close enough for most queries. So a DBMS_STATS estimate of 5% took 4:32 vs. 3:11 for an ANALYZE estimate of 5%, but the statistics generated were much better for FNAME, and roughly equivalent for PREF and PROPSTYLE.
If we increase to a full compute, we get:
SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY',estimate_percent=>null,cascade=>true)

PL/SQL procedure successfully completed.

Elapsed: 00:09:35.13

SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT',estimate_percent=>null,cascade=>true)

PL/SQL procedure successfully completed.

Elapsed: 00:29:09.46
SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            2,019,679       131585

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               14       456182
                                             STATE_NO                14       456182

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                16,365       428990
                                             STATE_NO            16,365       428990

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,951,673      1951673


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           10,995,615       373959
                                             SOLD            10,995,615       373959

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,678       469772

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,434       504580


SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1951673     1951673      14:19:30
                FNAME           VARCHAR2        1951673     1951673      14:19:30
                PREF            VARCHAR2          65345     1448208      14:19:30

PROP_CAT        LOOKUPID        VARCHAR2          40903    11486321      14:50:19
                EXTID           VARCHAR2       11486321    11486321      14:50:19
                PROPSTYLE       VARCHAR2          48936    11486321      14:50:19
The total time taken is slightly longer than for the “analyze compute,” but at least the statistics are right on. We see another interesting behavior here. Look at the sample size for the various indexes. Even though we’re doing a “compute” with “cascade,” the sample sizes for the indexes vary from 3% to 100%.
The results for various tests will be displayed in a table below, but it looks like an estimate percentage of 5% gets us reasonably accurate statistics without taking too much time.
block_sample = true
What about the block_sample option? From the description, one might expect it to be faster. The thinking is, if we’re sampling 5% of the rows, and there are 20 or more rows per data block, then you’re likely to visit almost every block in the table anyway. If you only sample 5% of the blocks, it should be faster.
We did the test setting block_sample=true. The results were slightly less accurate, but not too much so. But, surprisingly, it didn’t save a noticeable amount of time. The small time savings doesn’t seem worth the loss of accuracy. The results are included in the table below.
cascade=false
Rather than using the cascade option, some people have suggested doing a small estimate percentage for the table, then a compute for the indexes. We tried this with a 1% estimate for the tables, using both a compute for the indexes, and a 20% estimate for the indexes. Both options saved time over the cascade option, with no significant difference in accuracy. The drawback of this option is that you have to run the procedure for each individual index, rather than once for each table. So if your indexes change, you might miss one.
estimate_percentblock_sampleElapsed Time# Rows FNAME (1951673)# Rows PREF (65345)#Rows PROPSTYLE (48936)
1%, cascadeFalse3:261937700652216984
1% table, 20% indexesFalse2:441950100683516932
1% table, compute indexesFalse2:591941600679716917
5%, cascadeFalse4:3219554202422220095
5%, cascadeTrue4:1120245401592710092
5% table, 20% indexesFalse4:0019561802418820084
10%, cascadeFalse6:0419628403617223547
20%, cascadeFalse9:2119507504827129108
20%, cascadeTrue9:1318852604065321897
50%, cascadeFalse20:2419504726080439708
null (compute) , cascadeFalse38:4519516736534548936
Other Options- What About All that “Auto” Stuff?
It gets confusing sometimes, because there are two commonly-referenced “auto options” in the DBMS_STATS procedures we’re using. One “auto” option is comes into play when using the method_opt parameter to collect histograms. More on this option later. The other “auto option” is the DBMS_STATS.AUTO_SAMPLE_SIZE option for estimate_percent. This seems like a logical choice to use. Why should we spend all this effort figuring out the correct sample size to use? Of course, we dutifully tested it:
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>dbms_stats.auto_sample_size,
                           cascade=>true);
end;
/

Elapsed: 00:08:19.19


begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>dbms_stats.auto_sample_size,
                           cascade=>true);
end;
/

Elapsed: 00:22:55.99
SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,977,281       143457

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               23      1951673
                                             STATE_NO                23      1951673

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                14,094       444518
                                             STATE_NO            14,094       444518

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,951,673      1951673


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,538,628       530200
                                             SOLD            11,538,628       530200

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             3,909       692382

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,257       533670


SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1969341        5841      13:39:41       1
                FNAME           VARCHAR2        1951673     1951673      13:39:41       1
                STATE_NO        NUMBER                6     1951673      13:39:41       1
                FILE_TYPE       NUMBER                7     1951673      13:39:41       1
                PREF            VARCHAR2          65345     1448208      13:39:41       1
                CREATE_DATE     DATE             960724     1951673      13:39:41       1
                TRACK_ID        NUMBER                9     1951673      13:39:41       1
                SECTOR_ID       NUMBER                6     1951673      13:39:41       1
                TEAMS           NUMBER             1206     1605130      13:39:41       1
                BYTE_SIZE       NUMBER                0                  13:39:41       1
                START_DATE      DATE                  0                  13:39:41       1
                END_DATE        DATE                  0                  13:39:41       1
                LAST_UPDATE     DATE             837279     1951673      13:39:41       1
                CONTAINERS      NUMBER             1206     1605706      13:39:41       1

PROP_CAT        LINENUM         NUMBER         11486321    11486321      14:03:31       1
                LOOKUPID        VARCHAR2          40903    11486321      14:03:31       1
                EXTID           VARCHAR2       11486321    11486321      14:03:31       1
                SOLD            NUMBER                1    11486321      14:03:31       1
                CATEGORY        VARCHAR2            843    11486321      14:03:31       1
                NOTES           VARCHAR2              0                  14:03:31       1
                DETAILS         VARCHAR2            873    11486321      14:03:31       1
                PROPSTYLE       VARCHAR2          48936    11486321      14:03:31       1
So it took 31:15 to gather statistics for our two tables and their indexes, resulting in 7 ½ minutes less than a “compute, cascade,” but over 10 minutes longer than a “50%, cascade” which gathered statistics nearly as accurate. It also took three times as long as “20%, cascade” and seven times as long as a “5% cascade,” both of which generated serviceable statistics. From the sample_size statistics above, it looks like the job sampled nearly all of the rows in the table anyway.
Method_opt
We tested different options for histogram collection. The first was the commonly-used option that many have used with the old ANALYZE command, FOR ALL INDEXED COLUMNS. We acknowledge the caveat that you are unlikely to need histograms on all your indexed columns, and you may well want them on some non-indexed columns. But it’s a good start for a test.
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>10,
                           method_opt=>'for all indexed columns size 30',
                           cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:01:35.63


begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>10,
                           method_opt=>'for all indexed columns size 30',
                           cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:06:01.14


SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            2,039,648       132886

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               10       483109
                                             STATE_NO                10       483109

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                13,674       423169
                                             STATE_NO            13,674       423169

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,903,053       501982


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,501,674       391170
                                             SOLD            11,501,674       391170

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,741       468441

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,038       498096




SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1950360      195036      18:05:22      30
                FNAME           VARCHAR2          18725      195036      18:05:22      10
                STATE_NO        NUMBER                5      195036      18:05:22       4
                FILE_TYPE       NUMBER                7      195036      18:05:22       6
                PREF            VARCHAR2          36039      144909      18:05:22      14
                CREATE_DATE     DATE
                TRACK_ID        NUMBER
                SECTOR_ID       NUMBER
                TEAMS           NUMBER
                BYTE_SIZE       NUMBER
                START_DATE      DATE
                END_DATE        DATE
                LAST_UPDATE     DATE
                CONTAINERS      NUMBER

PROP_CAT        LINENUM         NUMBER
                LOOKUPID        VARCHAR2          22959     1148726      18:11:54      30
                EXTID           VARCHAR2       11487260     1148726      18:11:54      30
                SOLD            NUMBER                1     1148726      18:11:54       1
                CATEGORY        VARCHAR2
                NOTES           VARCHAR2
                DETAILS         VARCHAR2
                PROPSTYLE       VARCHAR2          23583     1148726      18:11:54      30
We see that it took 7:37 to gather statistics with histograms for our two tables and their indexes. This compares to 6:04 for statistics gathering without histograms, which is not unreasonable if you actually need the histograms. But one problem that we see when we use this option is that statistics aren’t gathered for the non-indexed columns. This could result in problems down the road.
Instead of specifying on which columns to collect histograms, we can specify SKEWONLY and have Oracle determine the columns, based on their data distribution. So we’ll try that option.
begin
   dbms_stats.gather_table_stats(
                         ownname=>'TSUTTON', 
                         tabname=>'FILE_HISTORY',
                         estimate_percent=>10,
                         method_opt=>'for all indexed columns size skewonly',
                         cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:01:43.32

begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>10,
                           method_opt=>'for all indexed columns size skewonly',
                         cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:06:02.38

SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,943,994       126654

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               17       518642
                                             STATE_NO                17       518642

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                15,738       434460
                                             STATE_NO            15,738       434460

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,892,839       499288


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,842,752       402770
                                             SOLD            11,842,752       402770

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,424       476052

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,120       481642


SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1954060      195406      18:37:53     200
                FNAME           VARCHAR2          18792      195406      18:37:53      38
                STATE_NO        NUMBER                5      195406      18:37:53       4
                FILE_TYPE       NUMBER                7      195406      18:37:53       6
                PREF            VARCHAR2          36272      145146      18:37:53      93
                CREATE_DATE     DATE
                TRACK_ID        NUMBER
                SECTOR_ID       NUMBER
                TEAMS           NUMBER
                BYTE_SIZE       NUMBER
                START_DATE      DATE
                END_DATE        DATE
                LAST_UPDATE     DATE
                CONTAINERS      NUMBER

PROP_CAT        LINENUM         NUMBER
                LOOKUPID        VARCHAR2          22848     1148229      18:47:32     200
                EXTID           VARCHAR2       11482290     1148229      18:47:32     200
                SOLD            NUMBER                1     1148229      18:47:32       1
                CATEGORY        VARCHAR2
                NOTES           VARCHAR2
                DETAILS         VARCHAR2
                PROPSTYLE       VARCHAR2          23498     1148229      18:47:32     200
This takes 7:45. This is about the same as when we specified the number of buckets, but again we have no statistics on the non-indexed columns. So we’ll try it using FOR ALL COLUMNS.
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>10,
                           method_opt=>'for all columns size skewonly',
                           cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:02:27.02


begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>10,
                           method_opt=>'for all columns size skewonly',
                             cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:12:57.15


SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,933,618       125978

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               13       498382
                                             STATE_NO                13       498382

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                13,907       433992
                                             STATE_NO            13,907       433992

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,962,629       517697


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,442,368       389153
                                             SOLD            11,442,368       389153

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,488       468009

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,216       490564


SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1951550      195155      12:58:41     200
                FNAME           VARCHAR2          18908      195155      12:58:41      39
                STATE_NO        NUMBER                5      195155      12:58:41       4
                FILE_TYPE       NUMBER                7      195155      12:58:41       6
                PREF            VARCHAR2          35913      144624      12:58:41      93
                CREATE_DATE     DATE             489501      195155      12:58:41     200
                TRACK_ID        NUMBER                9      195155      12:58:41       8
                SECTOR_ID       NUMBER                6      195155      12:58:41       5
                TEAMS           NUMBER              971      160436      12:58:41      46
                BYTE_SIZE       NUMBER                0                  12:58:41       1
                START_DATE      DATE                  0                  12:58:41       1
                END_DATE        DATE                  0                  12:58:41       1
                LAST_UPDATE     DATE             525993      195155      12:58:41     200
                CONTAINERS      NUMBER              971      160483      12:58:41      46

PROP_CAT        LINENUM         NUMBER         11498490     1149849      13:11:53       1
                LOOKUPID        VARCHAR2          22795     1149849      13:11:53     200
                EXTID           VARCHAR2       11498490     1149849      13:11:53     200
                SOLD            NUMBER                1     1149849      13:11:53       1
                CATEGORY        VARCHAR2            820     1149849      13:11:53     200
                NOTES           VARCHAR2              0                  13:11:53       1
                DETAILS         VARCHAR2            834     1149849      13:11:53     200
                PROPSTYLE       VARCHAR2          23414     1149849      13:11:53     200
This gets us statistics on all the columns, but takes 15:24, twice as long as the other histogram tests. In addition to taking much longer, some interesting choices are made for the columns on which to collect histograms. For instance, there are 200 buckets used for the primary key! It’s kind of unusual to have a skewed primary key. Fortunately there is an explanation for this. It’s bug #3929552, detailed in Metalink Note 284917.1. The subject of the note says it all: “DBMS_STATS WITH SKEWONLY GENERATES HISTOGRAMS FOR UNIQUE KEY COLUMN.” The note says this bug is in 9.2.0.5, but it exists on our 9.2.0.4 database. There is a one-off patch available for it, but the note doesn’t say if the bug is fixed in later patch sets. So the SKEWONLY option is out of the question currently.
Let’s see if the AUTO option for SIZE does any better.
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>10,
                           method_opt=>'for all columns size auto',
                     cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:01:40.49

begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>10,
                           method_opt=>'for all columns size auto',
                           cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:03:53.78

SQL> @index

Table        Uniqueness      Index Name      Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,892,759       123316

             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE                7       499648
                                             STATE_NO                 7       499648

             NONUNIQUE  FILEH_PREFIX_STATE   PREF                14,649       431568
                                             STATE_NO            14,649       431568

             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,874,115       494349


PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,272,123       383363
                                             SOLD            11,272,123       383363

             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,874       479087

             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,540       490096


SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1946580      194658      13:25:14       1
                FNAME           VARCHAR2          18905      194658      13:25:14      38
                STATE_NO        NUMBER                4      194658      13:25:14       1
                FILE_TYPE       NUMBER                7      194658      13:25:14       1
                PREF            VARCHAR2          36144      144197      13:25:14       1
                CREATE_DATE     DATE             485546      194658      13:25:14       1
                TRACK_ID        NUMBER                9      194658      13:25:14       1
                SECTOR_ID       NUMBER                6      194658      13:25:14       1
                TEAMS           NUMBER              966      160023      13:25:14       1
                BYTE_SIZE       NUMBER                0                  13:25:14       1
                START_DATE      DATE                  0                  13:25:14       1
                END_DATE        DATE                  0                  13:25:14       1
                LAST_UPDATE     DATE             523461      194658      13:25:14       1
                CONTAINERS      NUMBER              967      160096      13:25:14       1

PROP_CAT        LINENUM         NUMBER         11487560     1148756      13:30:07       1
                LOOKUPID        VARCHAR2          22904     1148756      13:30:07       1
                EXTID           VARCHAR2       11487560     1148756      13:30:07       1
                SOLD            NUMBER                1     1148756      13:30:07       1
                CATEGORY        VARCHAR2            822     1148756      13:30:07       1
                NOTES           VARCHAR2              0                  13:30:07       1
                DETAILS         VARCHAR2            840     1148756      13:30:07       1
                PROPSTYLE       VARCHAR2          23558     1148756      13:30:07       1
This time it only takes 5:34 to gather our statistics, the best yet for a 10% estimate. But the statistics leave something to be desired. Oracle thinks there are 18,905 distinct values of FNAME, and there are 38 buckets used for it. We know from the table’s data that FNAME has all unique values. (It doesn’t have a unique constraint, but it is the case that all the values are different.) So there’s no reason to have histograms on it at all.
So far we’ve determined that collecting histograms doesn’t seem to be as efficient or accurate as it should be with DBMS_STATS.
Is 10g Any Better?
The performance of gathering and accuracy of statistics is reasonable in Oracle 9i for the “standard” choices, such as estimate and cascade, but there are serious deficiencies with AUTO_SAMPLE_SIZE and histogram collection. Is 10g any better in these areas? Let’s do some testing on an Oracle 10.1.0.3 database on the same Sun server.
First we’ll try the SKEWONLY option for SIZE in histogram collection.
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>10,
                           method_opt=>'for all columns size skewonly',
                           cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:02:45.05

begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>10,
                           method_opt=>'for all columns size skewonly',
                             cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:12:28.17


SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1947030      194703      21:24:46     200
                FNAME           VARCHAR2          26167      194703      21:24:46     200
                STATE_NO        NUMBER                6      194703      21:24:46       6
                FILE_TYPE       NUMBER                7      194703      21:24:46       7
                PREF            VARCHAR2          45092      144582      21:24:46     200
                CREATE_DATE     DATE             488202      194703      21:24:46     200
                TRACK_ID        NUMBER                9      194703      21:24:46       9
                SECTOR_ID       NUMBER                6      194703      21:24:46       6
                TEAMS           NUMBER              960      160244      21:24:46     200
                BYTE_SIZE       NUMBER                0                  21:24:46       0
                START_DATE      DATE                  0                  21:24:46       0
                END_DATE        DATE                  0                  21:24:46       0
                LAST_UPDATE     DATE             525201      194703      21:24:46     200
                CONTAINERS      NUMBER              960      160300      21:24:46     200

PROP_CAT        LINENUM         NUMBER         11458820     1145882      21:35:54       1
                LOOKUPID        VARCHAR2          22891     1145882      21:35:54     200
                EXTID           VARCHAR2       11458820     1145882      21:35:54     200
                SOLD            NUMBER                1     1145882      21:35:54       1
                CATEGORY        VARCHAR2            823     1145882      21:35:54     200
                NOTES           VARCHAR2              0                  21:35:54       0
                DETAILS         VARCHAR2            829     1145882      21:35:54     200
                PROPSTYLE       VARCHAR2          23566     1145882      21:35:54     200
This statistics gathering took 15:13, which is about the same as in Oracle 9i. Unfortunately, the results are just as bad. We’re still collecting histograms on the primary key (file_id) and on FNAME, and the number of distinct values of FNAME shows as 26,167, approximately two orders of magnitude off the correct value.
So let’s try the AUTO for SIZE.
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>10,
                           method_opt=>'for all columns size auto',
                     cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:03:22.81

begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>10,
                           method_opt=>'for all columns size auto',
                           cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:08:15.00

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1947700      194770      22:02:39       1
                FNAME           VARCHAR2        1947700      194770      22:02:39       1
                STATE_NO        NUMBER                4      194770      22:02:39       1
                FILE_TYPE       NUMBER                7      194770      22:02:39       1
                PREF            VARCHAR2          36073      144920      22:02:39       1
                CREATE_DATE     DATE             490502      194770      22:02:39       1
                TRACK_ID        NUMBER                9      194770      22:02:39       1
                SECTOR_ID       NUMBER                6      194770      22:02:39       1
                TEAMS           NUMBER              978      160229      22:02:39       1
                BYTE_SIZE       NUMBER                0                  22:02:39       0
                START_DATE      DATE                  0                  22:02:39       0
                END_DATE        DATE                  0                  22:02:39       0
                LAST_UPDATE     DATE             522845      194770      22:02:39       1
                CONTAINERS      NUMBER              979      160296      22:02:39       1

PROP_CAT        LINENUM         NUMBER         11495240     1149524      22:10:18       1
                LOOKUPID        VARCHAR2          22955     1149524      22:10:18       1
                EXTID           VARCHAR2       11495240     1149524      22:10:18       1
                SOLD            NUMBER                1     1149524      22:10:18       1
                CATEGORY        VARCHAR2            827     1149524      22:10:18       1
                NOTES           VARCHAR2              0                  22:10:18       0
                DETAILS         VARCHAR2            847     1149524      22:10:18       1
                PROPSTYLE       VARCHAR2          23658     1149524      22:10:18       1
This statistics gathering took 11:38, twice as long as the equivalent 9i test. But the important thing to note in this test is that the statistics are reasonable, and histograms are not collected on the unique columns. So it looks like, in Oracle 10g, the AUTO option for SIZE in histogram collection solves some of the problems seen in 9i. However, it’s not perfect. Let’s look at the STATE_NO column:
SQL> select STATE_NO, COUNT(*) from FILE_HISTORY group by STATE_NO;

  STATE_NO    COUNT(*)
---------- ----------
         0         95
        20        569
        30    1950957
        40         39
       999          4
      9999          9
 
6 rows selected.
This column certainly seems like a candidate for a histogram, yet the AUTO option for SIZE doesn’t create a histogram for it.
Similarly, the file_type column looks like a candidate for a histogram:
SQL> select file_type, count(*) from file_history group by file_type;
 
 FILE_TYPE   COUNT(*)
---------- ----------
         1     670950
         2      83799
         3      58925
         4      48241
         5     777258
         6      62681
         8     249819

7 rows selected.
So let’s try some queries:
select count(*) from file_history where file_type = 5;
select count(*) from file_history where file_type = 4;
select count(*) from file_history where fname = 'SOMETHING';
And gather stats again:
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>10,
                           method_opt=>'for all columns size auto',
                     cascade=>true);
end;
/

PL/SQL procedure successfully completed.

SQL> @tabcol

TABLE_NAME   COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE BUCKETS
------------ --------------- ---------- ------------ ----------- ---
FILE_HISTORY FILE_ID         NUMBER          1952650      195265       1
             FNAME           VARCHAR2          26179      195265     254
             STATE_NO        NUMBER                5      195265       1
             FILE_TYPE       NUMBER                7      195265       7
             PREF            VARCHAR2          36154      144848       1
Now dbms_stats has taken the workload into account and created histograms on the columns we queried (though it’s again creating a histogram on FNAME).
Our last test uses the DBMS_STATS.AUTO_SAMPLE_SIZE setting for estimate_percent.
begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'FILE_HISTORY',
                           estimate_percent=>dbms_stats.auto_sample_size,
                           cascade=>true);
end;
/

Elapsed: 00:02:15.95


begin
   dbms_stats.gather_table_stats(
                           ownname=>'TSUTTON', 
                           tabname=>'PROP_CAT',
                           estimate_percent=>dbms_stats.auto_sample_size,
                           cascade=>true);
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:04:01.00

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1951112        5764      22:20:32       1
                FNAME           VARCHAR2        1951112       57166      22:20:32       1
                STATE_NO        NUMBER                2        5764      22:20:32       1
                FILE_TYPE       NUMBER                7        5764      22:20:32       1
                PREF            VARCHAR2           2200        4246      22:20:32       1
                CREATE_DATE     DATE             593203      487778      22:20:32       1
                TRACK_ID        NUMBER                9        5764      22:20:32       1
                SECTOR_ID       NUMBER                6        5764      22:20:32       1
                TEAMS           NUMBER              275        4687      22:20:32       1
                BYTE_SIZE       NUMBER                0                  22:20:32       0
                START_DATE      DATE                  0                  22:20:32       0
                END_DATE        DATE                  0                  22:20:32       0
                LAST_UPDATE     DATE             577486      487778      22:20:32       1
                CONTAINERS      NUMBER              275        4688      22:20:32       1

PROP_CAT        LINENUM         NUMBER         11476655      520243      22:30:16       1
                LOOKUPID        VARCHAR2          16505       52192      22:30:16       1
                EXTID           VARCHAR2       11476655      520243      22:30:16       1
                SOLD            NUMBER                1        5264      22:30:16       1
                CATEGORY        VARCHAR2            739        5264      22:30:16       1
                NOTES           VARCHAR2              0                  22:30:16       0
                DETAILS         VARCHAR2            738        5264      22:30:16       1
                PROPSTYLE       VARCHAR2          16512       52192      22:30:16       1
This statistics gathering took 6:17, an enormous improvement over the Oracle 9i test (which took 31:15). But the accuracy of the results may not be as good as desired. The column PREF only shows having 2200 distinct values (the actual number is 65345). Other values look reasonable, but the results aren’t as good as those of a 5% sample in Oracle 9i, which took 1/3 less time.
Summary
We’ve run quite a few tests here. From the results, it appears that:
  1. The sweet spot for balancing performance of the gathering process and the accuracy of the statistics lies between 5 and 20 for estimate_percent. Five percent will probably work for most cases, but 20% may be advisable if really accurate statistics are needed.
  2. Gathering statistics separately for indexes is faster than the cascade=true option while gathering table statistics.
  3. Block_sample=true doesn’t appear to appreciably speed up statistics gathering, while at the same time delivering somewhat less-accurate statistics.
  4. Using AUTO_SAMPLE_SIZE takes nearly as long as COMPUTE in Oracle 9i. It is much faster in 10gR1, but the accuracy of the statistics is not as good as with an estimate_percent of 5, which gathers the statistics faster.
  5. Using the SKEWONLY option to size histograms is inadvisable in both Oracle 9i and 10gR1.
  6. Using the AUTO option to size histograms is inadvisable in Oracle 9i; it seems to work better in 10gR1, though it still has issues.
These results should help you decide what options you want to use in gathering statistics on your databases. And they may also inspire you to test different options on your own data. While this paper doesn’t address some of the additional benefits of DBMS_STATS, such as using parallel processing to gather statistics, it is clear that DBMS_STATS provides greater power and flexibility than ANALYZE, and it is improving with each version.

No comments: