Introduction
For years Oracle has been telling us to use the DBMS_STATS package instead of the ANALYZE command to gather statistics to be used by the Oracle optimizer. Many of us have heeded this advice and switched from using ANALYZE for scheduled statistics-gathering jobs to using DBMS_STATS. Still, others have said, “ANALYZE has always worked, so I see no reason to change.” And clients often ask, “What settings should I use to analyze my tables?” The options available for the ANALYZE command were essentially limited to “compute” and “estimate”, with some histogram options, and people tended to choose based on how long the ANALYZE job took. But the DBMS_STATS package covers a lot of territory, and within its many procedures there are a multitude of options available. The choice of which options to use can dramatically affect your results, both in the accuracy of statistics and the performance of the statistics gathering operation itself. So, what’s a DBA to do?
The objective of this paper is to examine some of the statistics-gathering options and their impact. The focus will be on actual experience, measured performance, and detailed examples—as opposed to what the documentation says. We can’t cover every option possible (or this would be a very long paper), but we will focus on the options which seem most useful and/or problematic. Contrary to what the folks at Redwood Shores would prefer, most of you are probably not yet using Oracle 10g for all your production databases. Therefore, much of the focus of this paper will be on Oracle 9i issues. But we will address some of the changes and differences in 10g.
When we’re done, hopefully you’ll have a better idea why you might want to choose various options when gathering statistics on your databases. And you’ll probably also think of additional testing you’ll want to perform on your own data.
DBMS_STATS Procedures
The DBMS_STATS package contains over 40 procedures. These include procedures to: delete existing statistics for a table, schema, or database; set statistics to desired values; export and import statistics; gather statistics for a schema or entire database; and monitor tables for changes. We will focus on DBMS_STATS.GATHER_TABLE_STATS and DBMS_STATS.GATHER_INDEX_STATS, because they are the starting point for getting statistics so the Oracle optimizer can make informed decisions. These two procedures gather statistics on the data in a specific table or index, respectively.
The parameters of DBMS_STATS.GATHER_TABLE_STATS are:

DBMS_STATS.GATHER_TABLE_STATS (
   ownname          VARCHAR2,
   tabname          VARCHAR2,
   partname         VARCHAR2 DEFAULT NULL,
   estimate_percent NUMBER   DEFAULT NULL,
   block_sample     BOOLEAN  DEFAULT FALSE,
   method_opt       VARCHAR2 DEFAULT 'FOR ALL COLUMNS SIZE 1',
   degree           NUMBER   DEFAULT NULL,
   granularity      VARCHAR2 DEFAULT 'DEFAULT',
   cascade          BOOLEAN  DEFAULT FALSE,
   stattab          VARCHAR2 DEFAULT NULL,
   statid           VARCHAR2 DEFAULT NULL,
   statown          VARCHAR2 DEFAULT NULL,
   no_invalidate    BOOLEAN  DEFAULT FALSE);

The parameters of DBMS_STATS.GATHER_INDEX_STATS are:

DBMS_STATS.GATHER_INDEX_STATS (
   ownname          VARCHAR2,
   indname          VARCHAR2,
   partname         VARCHAR2 DEFAULT NULL,
   estimate_percent NUMBER   DEFAULT NULL,
   stattab          VARCHAR2 DEFAULT NULL,
   statid           VARCHAR2 DEFAULT NULL,
   statown          VARCHAR2 DEFAULT NULL,
   degree           NUMBER   DEFAULT NULL,
   granularity      VARCHAR2 DEFAULT 'DEFAULT',
   no_invalidate    BOOLEAN  DEFAULT FALSE);

For the purposes of our testing, the interesting parameters are estimate_percent, block_sample, method_opt, and cascade, because these are the parameters which address statistics accuracy and performance. We won't address degree and granularity, which can be useful in improving performance for very large databases. Most of the other parameters deal with non-performance choices.
These options together are one big advantage that DBMS_STATS has over the old ANALYZE command, because they give us additional flexibility in how the statistics gathering is done. Other advantages will be addressed below.
estimate_percent
This parameter is similar to the old “estimate statistics sample x percent” parameter of the ANALYZE command. The value for estimate_percent is the percentage of rows to estimate, with NULL meaning compute. You can use the constant DBMS_STATS.AUTO_SAMPLE_SIZE to have Oracle determine the best sample size for good statistics.
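For illustration, the three common ways to set estimate_percent look like this (a sketch only; the SCOTT.EMP owner and table names are placeholders, not from our tests):

```sql
-- Sample a fixed 10% of the rows
EXECUTE dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'EMP', estimate_percent=>10)

-- NULL means compute: examine every row
EXECUTE dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'EMP', estimate_percent=>NULL)

-- Let Oracle choose the sample size
EXECUTE dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'EMP', estimate_percent=>dbms_stats.auto_sample_size)
```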
block_sample
The value for block_sample determines whether or not to use random block sampling instead of random row sampling. According to the documentation, “Random block sampling is more efficient, but if the data is not randomly distributed on disk, then the sample values may be somewhat correlated.”
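Combined with an estimate, the call would look something like this (again a sketch with placeholder names):

```sql
-- Sample roughly 5% of the blocks rather than 5% of the rows
EXECUTE dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'EMP', estimate_percent=>5, block_sample=>TRUE)
```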
method_opt
This parameter determines whether to collect histograms to help in dealing with skewed data. FOR ALL COLUMNS or FOR ALL INDEXED COLUMNS with a SIZE value determines which columns and how many histogram buckets to use. Instead of an integer value for SIZE, you can specify SKEWONLY to have Oracle determine the columns on which to collect histograms based on their data distribution. Or you can specify AUTO to have Oracle determine the columns on which to collect histograms based on data distribution and workload.
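Some representative method_opt settings, sketched with placeholder names:

```sql
-- Up to 30 histogram buckets on every indexed column
EXECUTE dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'EMP', method_opt=>'FOR ALL INDEXED COLUMNS SIZE 30')

-- Let Oracle pick histogram columns from data distribution alone
EXECUTE dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'EMP', method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY')

-- Let Oracle pick histogram columns from data distribution and workload
EXECUTE dbms_stats.gather_table_stats(ownname=>'SCOTT', tabname=>'EMP', method_opt=>'FOR ALL COLUMNS SIZE AUTO')
```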
cascade
The value for cascade determines whether to gather statistics on indexes as well.
Our approach was to try different values for these parameters and look at the impact on accuracy of statistics and performance of the statistics gathering job itself.
Our Data
For our tests, we will use two tables, FILE_HISTORY and PROP_CAT:

FILE_HISTORY [1,951,673 rows, 28,507 blocks, 223MB]
Column Name              Null?    Type          Distinct Values
------------------------ -------- ------------- ---------------
FILE_ID                  NOT NULL NUMBER                1951673
FNAME                    NOT NULL VARCHAR2(240)         1951673
STATE_NO                          NUMBER                      6
FILE_TYPE                NOT NULL NUMBER                      7
PREF                              VARCHAR2(100)           65345
CREATE_DATE              NOT NULL DATE
TRACK_ID                 NOT NULL NUMBER
SECTOR_ID                NOT NULL NUMBER
TEAMS                             NUMBER
BYTE_SIZE                         NUMBER
START_DATE                        DATE
END_DATE                          DATE
LAST_UPDATE                       DATE
CONTAINERS                        NUMBER

PROP_CAT [11,486,321 rows, 117,705 blocks, 920MB]
Column Name              Null?    Type          Distinct Values
------------------------ -------- ------------- ---------------
LINENUM                  NOT NULL NUMBER(38)           11486321
LOOKUPID                          VARCHAR2(64)            40903
EXTID                             VARCHAR2(20)         11486321
SOLD                     NOT NULL NUMBER(38)                  1
CATEGORY                          VARCHAR2(6)
NOTES                             VARCHAR2(255)
DETAILS                           VARCHAR2(255)
PROPSTYLE                         VARCHAR2(20)            48936

The indexes on these tables are as follows:

Table        Unique?    Index Name                               Column Name
------------ ---------- ---------------------------------------- ---------------
FILE_HISTORY NONUNIQUE  TSUTTON.FILEH_FNAME                      FNAME
             NONUNIQUE  TSUTTON.FILEH_FTYPE_STATE                FILE_TYPE
                                                                 STATE_NO
             NONUNIQUE  TSUTTON.FILEH_PREFIX_STATE               PREF
                                                                 STATE_NO
             UNIQUE     TSUTTON.PK_FILE_HISTORY                  FILE_ID
PROP_CAT     NONUNIQUE  TSUTTON.PK_PROP_CAT                      EXTID
                                                                 SOLD
             NONUNIQUE  TSUTTON.PROPC_LOOKUPID                   LOOKUPID
             NONUNIQUE  TSUTTON.PROPC_PROPSTYLE                  PROPSTYLE

The number of distinct values for the various columns was calculated using

select count(distinct col_name) from table_name;

rather than from gathering statistics.
When we performed our tests, we used two queries to find out the values for the statistics gathered.
- index.sql
select ind.table_name, ind.uniqueness, col.index_name, col.column_name,
       ind.distinct_keys, ind.sample_size
  from dba_ind_columns col, dba_indexes ind
 where ind.table_owner = 'TSUTTON'
   and ind.table_name in ('FILE_HISTORY','PROP_CAT')
   and col.index_owner = ind.owner
   and col.index_name = ind.index_name
   and col.table_owner = ind.table_owner
   and col.table_name = ind.table_name
 order by col.table_name, col.index_name, col.column_position;
- tabcol.sql
select table_name, column_name, data_type, num_distinct, sample_size,
       to_char(last_analyzed, ' HH24:MI:SS') last_analyzed,
       num_buckets buckets
  from dba_tab_columns
 where table_name in ('FILE_HISTORY','PROP_CAT')
 order by table_name, column_id;
The Tests
Our initial testing was on an Oracle 9.2.0.4 Enterprise Edition database on Sun Solaris 8.
ANALYZE
We performed a couple of quick tests using the old ANALYZE command, just to see where we came from. Let's look at a "quick and dirty estimate statistics" run:

SQL> analyze table file_history estimate statistics;
Table analyzed.
Elapsed: 00:00:08.26
SQL> analyze table prop_cat estimate statistics;
Table analyzed.
Elapsed: 00:00:14.76

Let's say this is how we've been analyzing our tables nightly. Then someone performs the query:

SQL> SELECT FILE_ID, FNAME, TRACK_ID, SECTOR_ID FROM file_history WHERE FNAME = 'SOMETHING';

and it takes "forever." So we dutifully repeat the query with autotrace on:

SQL> SELECT FILE_ID, FNAME, TRACK_ID, SECTOR_ID FROM file_history WHERE FNAME = 'SOMETHING';

no rows selected
Elapsed: 00:00:08.66

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=2743 Card=10944 Bytes=678528)
   1    0   TABLE ACCESS (FULL) OF 'FILE_HISTORY' (Cost=2743 Card=10944 Bytes=678528)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      28519  consistent gets
      28507  physical reads
          0  redo size
        465  bytes sent via SQL*Net to client
        460  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed

So we have the old favorite "it's not using our index" complaint; there is an index on file_history(FNAME). Let's look at the index statistics:

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,937,490         1106
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               12         1266
                                             STATE_NO                12         1266
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                65,638         1053
                                             STATE_NO            65,638         1053
             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,952,701         1347
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,429,046         1356
                                             SOLD            11,429,046         1356
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID           197,743         1237
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE              172         1317

Hmm, it says our index has 1,937,490 distinct keys, pretty close to the actual value of 1,951,673. But this is what we see when we look at DBA_TAB_COLUMNS:

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1948094        1034 13:42:42            1
                FNAME           VARCHAR2            178        1034 13:42:42            1
                STATE_NO        NUMBER                1        1034 13:42:42            1
                FILE_TYPE       NUMBER                7        1034 13:42:42            1
                PREF            VARCHAR2            478        1034 13:42:42            1
                CREATE_DATE     DATE            1948094        1034 13:42:42            1
                TRACK_ID        NUMBER                9        1034 13:42:42            1
                SECTOR_ID       NUMBER                6        1034 13:42:42            1
                TEAMS           NUMBER              101        1034 13:42:42            1
                BYTE_SIZE       NUMBER                0             13:42:42            1
                START_DATE      DATE                  0             13:42:42            1
                END_DATE        DATE                  0             13:42:42            1
                LAST_UPDATE     DATE             419317        1034 13:42:42            1
                CONTAINERS      NUMBER              101        1034 13:42:42            1
PROP_CAT        LINENUM         NUMBER         11476127        1048 13:43:17            1
                LOOKUPID        VARCHAR2          19954        1048 13:43:17            1
                EXTID           VARCHAR2       11476127        1048 13:43:17            1
                SOLD            NUMBER                1        1048 13:43:17            1
                CATEGORY        VARCHAR2            705        1048 13:43:17            1
                NOTES           VARCHAR2              0             13:43:17            1
                DETAILS         VARCHAR2            704        1048 13:43:17            1
                PROPSTYLE       VARCHAR2          21576        1048 13:43:17            1

Yikes! DBA_TAB_COLUMNS shows only 178 distinct values of FNAME. The optimizer uses this value and concludes that it would be fastest to do a full table scan. We can also see that it thinks we only have 478 distinct values for PREF, rather than the 65,345 we actually have. In the prop_cat table, the PROPSTYLE column shows 21576 distinct values in DBA_TAB_COLUMNS and its index shows 172 distinct values, while there are actually 48,936 distinct values. So much for "quick and dirty."
So, let's get some better statistics. The roughly 1000 rows sampled by "estimate statistics" are not very many (about 0.05%). Let's try 5% of the rows:

SQL> analyze table file_history estimate statistics sample 5 percent;
Table analyzed.
Elapsed: 00:00:36.21
SQL> analyze table prop_cat estimate statistics sample 5 percent;
Table analyzed.
Elapsed: 00:02:35.11

How does our query look now?

SQL> SELECT FILE_ID, FNAME, TRACK_ID, SECTOR_ID FROM file_history WHERE FNAME = 'SOMETHING';

no rows selected
Elapsed: 00:00:00.54

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=54 Card=110 Bytes=6820)
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'FILE_HISTORY' (Cost=54 Card=110 Bytes=6820)
   2    1     INDEX (RANGE SCAN) OF 'FILEH_FNAME' (NON-UNIQUE) (Cost=3 Card=110)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads
          0  redo size
        465  bytes sent via SQL*Net to client
        460  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed

Much better. Let's see how the stats look.

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,926,580       101179
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE                8       102128
                                             STATE_NO                 8       102128
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                74,935        98709
                                             STATE_NO            74,935        98709
             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,952,701       101025
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,478,536       589350
                                             SOLD            11,478,536       589350
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID            68,643       589125
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE           72,154       582261

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1951673       93010 13:59:20
                FNAME           VARCHAR2          11133       93010 13:59:20
                STATE_NO        NUMBER                5       93010 13:59:20
                FILE_TYPE       NUMBER                7       93010 13:59:20
                PREF            VARCHAR2          23744       93010 13:59:20
PROP_CAT        LOOKUPID        VARCHAR2          19462      541472 14:00:28
                EXTID           VARCHAR2       11486321      541472 14:00:28
                SOLD            NUMBER                1      541472 14:00:28
                PROPSTYLE       VARCHAR2          19633      541472 14:00:28

The number of distinct values for the FNAME column still only shows as 11133, but that's a far cry better than 178. And at least our query works. Also, the number of distinct values for the PREF column now shows 23744, which is a big improvement. The PROPSTYLE column in prop_cat now shows 19,633 distinct values in DBA_TAB_COLUMNS and 72,154 in DBA_INDEXES.
Out of curiosity, let's see what results we get if we take the time to do a full "compute statistics":

SQL> analyze table file_history compute statistics;
Table analyzed.
Elapsed: 00:07:38.32
SQL> analyze table prop_cat compute statistics;
Table analyzed.
Elapsed: 00:29:15.29

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,951,673      1951673
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               23      1951673
                                             STATE_NO                23      1951673
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                65,390      1951673
                                             STATE_NO            65,390      1951673
             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,951,673      1951673
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,486,321     11486321
                                             SOLD            11,486,321     11486321
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID            40,903     11486321
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE           48,936     11486321

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1951673     1951673 14:26:26
                FNAME           VARCHAR2          78692     1951673 14:26:26
                STATE_NO        NUMBER                6     1951673 14:26:26
                FILE_TYPE       NUMBER                7     1951673 14:26:26
                PREF            VARCHAR2          65345     1951673 14:26:26
PROP_CAT        LOOKUPID        VARCHAR2          40903    11486321 14:34:04
                EXTID           VARCHAR2       11486321    11486321 14:34:04
                SOLD            NUMBER                1    11486321 14:34:04
                PROPSTYLE       VARCHAR2          48936    11486321 14:34:04

We spent nearly 37 minutes analyzing these two tables with "compute statistics," and we still only show 78692 distinct values for FNAME. That's enough for our simple query. But it's hard to believe that after examining every row in the table, the statistics are off by more than an order of magnitude for this column, especially since every row has a different value for FNAME. At least the values are accurate for the PROPSTYLE column in prop_cat.
DBMS_STATS
Maybe it's time to take Oracle's recommendation and use DBMS_STATS. Let's start with a "quick and dirty" run as a starting point. We'll use a 1% estimate and cascade=>true, so that the indexes will be analyzed also:

SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY', estimate_percent=>1, cascade=>true)
PL/SQL procedure successfully completed.
Elapsed: 00:01:41.70
SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT', estimate_percent=>1, cascade=>true)
PL/SQL procedure successfully completed.
Elapsed: 00:01:44.29

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,926,511       125515
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               11       473488
                                             STATE_NO                11       473488
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                14,207       422476
                                             STATE_NO            14,207       422476
             UNIQUE     PK_FILE_HISTORY      FILE_ID          2,035,824       537004
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           10,994,527       373922
                                             SOLD            10,994,527       373922
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,886       446130
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            2,978       506796

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1937700       19377 15:20:30
                FNAME           VARCHAR2        1937700       19377 15:20:30
                STATE_NO        NUMBER                3       19377 15:20:30
                FILE_TYPE       NUMBER                7       19377 15:20:30
                PREF            VARCHAR2           6522       14342 15:20:30
PROP_CAT        LOOKUPID        VARCHAR2          16959      115150 15:23:16
                EXTID           VARCHAR2       11515000      115150 15:23:16
                SOLD            NUMBER                1      115150 15:23:16
                PROPSTYLE       VARCHAR2          16984      115150 15:23:16

Gathering statistics on the two tables and their indexes took 3 minutes, 26 seconds (3:26). This compares to 23 seconds for the "analyze estimate statistics" and 3:11 for "analyze estimate 5%." The statistics are very accurate for the column FNAME, but PREF is still off a bit (showing 6,522 distinct values rather than the correct 65,345). And PROPSTYLE shows 2978 distinct index values and 16,984 distinct column values (vs. the correct value of 48,936).
If we increase the estimate percent to 5%, what will we get?

SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY', estimate_percent=>5, cascade=>true)
PL/SQL procedure successfully completed.
Elapsed: 00:01:23.52
SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT', estimate_percent=>5, cascade=>true)
PL/SQL procedure successfully completed.
Elapsed: 00:03:08.75

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1955420       97771 16:40:12
                FNAME           VARCHAR2        1955420       97771 16:40:12
                PREF            VARCHAR2          24222       72377 16:40:12
PROP_CAT        LOOKUPID        VARCHAR2          19860      573201 16:51:20
                EXTID           VARCHAR2       11464020      573201 16:51:20
                SOLD            NUMBER                1      573201 16:51:20
                PROPSTYLE       VARCHAR2          20095      573201 16:51:20

Gathering statistics on the two tables and their indexes now takes 4:32, but PREF now shows 24,222 distinct values (65,345 is the correct value) and PROPSTYLE shows 20,095 (48,936 is the correct value). While these are not super-accurate, they're probably close enough for most queries. So a DBMS_STATS estimate of 5% took 4:32 vs. 3:11 for an ANALYZE estimate of 5%, but the statistics generated were much better for FNAME, and roughly equivalent for PREF and PROPSTYLE.
If we increase to a full compute, we get:

SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY', estimate_percent=>null, cascade=>true)
PL/SQL procedure successfully completed.
Elapsed: 00:09:35.13
SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT', estimate_percent=>null, cascade=>true)
PL/SQL procedure successfully completed.
Elapsed: 00:29:09.46

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            2,019,679       131585
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               14       456182
                                             STATE_NO                14       456182
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                16,365       428990
                                             STATE_NO            16,365       428990
             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,951,673      1951673
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           10,995,615       373959
                                             SOLD            10,995,615       373959
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,678       469772
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,434       504580

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED
--------------- --------------- ---------- ------------ ----------- -------------
FILE_HISTORY    FILE_ID         NUMBER          1951673     1951673 14:19:30
                FNAME           VARCHAR2        1951673     1951673 14:19:30
                PREF            VARCHAR2          65345     1448208 14:19:30
PROP_CAT        LOOKUPID        VARCHAR2          40903    11486321 14:50:19
                EXTID           VARCHAR2       11486321    11486321 14:50:19
                PROPSTYLE       VARCHAR2          48936    11486321 14:50:19

The total time taken is slightly longer than for the "analyze compute," but at least the statistics are right on. We see another interesting behavior here: look at the sample sizes for the various indexes. Even though we're doing a "compute" with "cascade," the sample sizes for the indexes vary from about 3% to 100%.
The results for various tests will be displayed in a table below, but it looks like an estimate percentage of 5% gets us reasonably accurate statistics without taking too much time.
block_sample = true
What about the block_sample option? From the description, one might expect it to be faster. The thinking is: if we're sampling 5% of the rows, and there are 20 or more rows per data block, then we're likely to visit almost every block in the table anyway. For FILE_HISTORY, which averages about 68 rows per block (1,951,673 rows in 28,507 blocks), the chance that a given block contains no sampled row is roughly 0.95^68, or about 3%, so a 5% row sample should touch about 97% of the blocks. If we sample only 5% of the blocks instead, it should be faster.
We did the test setting block_sample=true. The results were slightly less accurate, but not too much so. Surprisingly, though, it saved very little time (4:11 vs. 4:32 at a 5% estimate), which doesn't seem worth the loss of accuracy. The results are included in the table below.
cascade=false
Rather than using the cascade option, some people have suggested doing a small estimate percentage for the table, then a compute for the indexes. We tried this with a 1% estimate for the tables, using both a compute for the indexes, and a 20% estimate for the indexes. Both options saved time over the cascade option, with no significant difference in accuracy. The drawback of this option is that you have to run the procedure for each individual index, rather than once for each table. So if your indexes change, you might miss one.
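One way to reduce that risk is to drive the index pass from the data dictionary, so any new index is picked up automatically. This is a sketch of the "1% table, 20% indexes" variant using the table and owner names from our tests; it is not the exact script we ran:

```sql
BEGIN
   -- Small estimate on the table itself, no cascade
   dbms_stats.gather_table_stats(
      ownname          => 'TSUTTON',
      tabname          => 'FILE_HISTORY',
      estimate_percent => 1,
      cascade          => FALSE);

   -- Then gather each index found in the dictionary
   FOR i IN (SELECT owner, index_name
               FROM dba_indexes
              WHERE table_owner = 'TSUTTON'
                AND table_name  = 'FILE_HISTORY')
   LOOP
      dbms_stats.gather_index_stats(
         ownname          => i.owner,
         indname          => i.index_name,
         estimate_percent => 20);
   END LOOP;
END;
/
```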
estimate_percent            block_sample  Elapsed Time  Distinct FNAME    Distinct PREF    Distinct PROPSTYLE
                                                        (actual 1951673)  (actual 65345)   (actual 48936)
--------------------------  ------------  ------------  ----------------  ---------------  ------------------
1%, cascade                 false         3:26          1937700           6522             16984
1% table, 20% indexes       false         2:44          1950100           6835             16932
1% table, compute indexes   false         2:59          1941600           6797             16917
5%, cascade                 false         4:32          1955420           24222            20095
5%, cascade                 true          4:11          2024540           15927            10092
5% table, 20% indexes       false         4:00          1956180           24188            20084
10%, cascade                false         6:04          1962840           36172            23547
20%, cascade                false         9:21          1950750           48271            29108
20%, cascade                true          9:13          1885260           40653            21897
50%, cascade                false         20:24         1950472           60804            39708
null (compute), cascade     false         38:45         1951673           65345            48936
It gets confusing sometimes, because there are two commonly referenced "auto" options in the DBMS_STATS procedures we're using. One comes into play when using the method_opt parameter to collect histograms; more on that option later. The other is the DBMS_STATS.AUTO_SAMPLE_SIZE option for estimate_percent. This seems like a logical choice. Why should we spend all this effort figuring out the correct sample size to use? Of course, we dutifully tested it:

begin
  dbms_stats.gather_table_stats(
    ownname=>'TSUTTON',
    tabname=>'FILE_HISTORY',
    estimate_percent=>dbms_stats.auto_sample_size,
    cascade=>true);
end;
/
Elapsed: 00:08:19.19

begin
  dbms_stats.gather_table_stats(
    ownname=>'TSUTTON',
    tabname=>'PROP_CAT',
    estimate_percent=>dbms_stats.auto_sample_size,
    cascade=>true);
end;
/
Elapsed: 00:22:55.99

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,977,281       143457
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               23      1951673
                                             STATE_NO                23      1951673
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                14,094       444518
                                             STATE_NO            14,094       444518
             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,951,673      1951673
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,538,628       530200
                                             SOLD            11,538,628       530200
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             3,909       692382
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,257       533670

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1969341        5841 13:39:41            1
                FNAME           VARCHAR2        1951673     1951673 13:39:41            1
                STATE_NO        NUMBER                6     1951673 13:39:41            1
                FILE_TYPE       NUMBER                7     1951673 13:39:41            1
                PREF            VARCHAR2          65345     1448208 13:39:41            1
                CREATE_DATE     DATE             960724     1951673 13:39:41            1
                TRACK_ID        NUMBER                9     1951673 13:39:41            1
                SECTOR_ID       NUMBER                6     1951673 13:39:41            1
                TEAMS           NUMBER             1206     1605130 13:39:41            1
                BYTE_SIZE       NUMBER                0             13:39:41            1
                START_DATE      DATE                  0             13:39:41            1
                END_DATE        DATE                  0             13:39:41            1
                LAST_UPDATE     DATE             837279     1951673 13:39:41            1
                CONTAINERS      NUMBER             1206     1605706 13:39:41            1
PROP_CAT        LINENUM         NUMBER         11486321    11486321 14:03:31            1
                LOOKUPID        VARCHAR2          40903    11486321 14:03:31            1
                EXTID           VARCHAR2       11486321    11486321 14:03:31            1
                SOLD            NUMBER                1    11486321 14:03:31            1
                CATEGORY        VARCHAR2            843    11486321 14:03:31            1
                NOTES           VARCHAR2              0             14:03:31            1
                DETAILS         VARCHAR2            873    11486321 14:03:31            1
                PROPSTYLE       VARCHAR2          48936    11486321 14:03:31            1

So it took 31:15 to gather statistics for our two tables and their indexes: 7 1/2 minutes less than a "compute, cascade," but over 10 minutes longer than a "50%, cascade," which gathered statistics nearly as accurate. It also took three times as long as "20%, cascade" and seven times as long as "5%, cascade," both of which generated serviceable statistics. From the sample sizes above, it looks like the job sampled nearly all of the rows in the tables anyway.
method_opt
We tested different options for histogram collection. The first was the commonly used option that many have used with the old ANALYZE command: FOR ALL INDEXED COLUMNS. We acknowledge the caveat that you are unlikely to need histograms on all your indexed columns, and you may well want them on some non-indexed columns. But it's a good start for a test.

begin
  dbms_stats.gather_table_stats(
    ownname=>'TSUTTON',
    tabname=>'FILE_HISTORY',
    estimate_percent=>10,
    method_opt=>'for all indexed columns size 30',
    cascade=>true);
end;
/
PL/SQL procedure successfully completed.
Elapsed: 00:01:35.63

begin
  dbms_stats.gather_table_stats(
    ownname=>'TSUTTON',
    tabname=>'PROP_CAT',
    estimate_percent=>10,
    method_opt=>'for all indexed columns size 30',
    cascade=>true);
end;
/
PL/SQL procedure successfully completed.
Elapsed: 00:06:01.14

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            2,039,648       132886
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               10       483109
                                             STATE_NO                10       483109
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                13,674       423169
                                             STATE_NO            13,674       423169
             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,903,053       501982
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,501,674       391170
                                             SOLD            11,501,674       391170
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,741       468441
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,038       498096

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1950360      195036 18:05:22           30
                FNAME           VARCHAR2          18725      195036 18:05:22           10
                STATE_NO        NUMBER                5      195036 18:05:22            4
                FILE_TYPE       NUMBER                7      195036 18:05:22            6
                PREF            VARCHAR2          36039      144909 18:05:22           14
                CREATE_DATE     DATE
                TRACK_ID        NUMBER
                SECTOR_ID       NUMBER
                TEAMS           NUMBER
                BYTE_SIZE       NUMBER
                START_DATE      DATE
                END_DATE        DATE
                LAST_UPDATE     DATE
                CONTAINERS      NUMBER
PROP_CAT        LINENUM         NUMBER
                LOOKUPID        VARCHAR2          22959     1148726 18:11:54           30
                EXTID           VARCHAR2       11487260     1148726 18:11:54           30
                SOLD            NUMBER                1     1148726 18:11:54            1
                CATEGORY        VARCHAR2
                NOTES           VARCHAR2
                DETAILS         VARCHAR2
                PROPSTYLE       VARCHAR2          23583     1148726 18:11:54           30

We see that it took 7:37 to gather statistics with histograms for our two tables and their indexes. This compares to 6:04 for statistics gathering without histograms, which is not unreasonable if you actually need the histograms. But one problem we see with this option is that statistics aren't gathered for the non-indexed columns. This could result in problems down the road.
Instead of specifying on which columns to collect histograms, we can specify SKEWONLY and have Oracle determine the columns, based on their data distribution. So we'll try that option:

begin
  dbms_stats.gather_table_stats(
    ownname=>'TSUTTON',
    tabname=>'FILE_HISTORY',
    estimate_percent=>10,
    method_opt=>'for all indexed columns size skewonly',
    cascade=>true);
end;
/
PL/SQL procedure successfully completed.
Elapsed: 00:01:43.32

begin
  dbms_stats.gather_table_stats(
    ownname=>'TSUTTON',
    tabname=>'PROP_CAT',
    estimate_percent=>10,
    method_opt=>'for all indexed columns size skewonly',
    cascade=>true);
end;
/
PL/SQL procedure successfully completed.
Elapsed: 00:06:02.38

SQL> @index

Table        Uniqueness Index Name           Column Name  Distinct Keys  Sample Size
------------ ---------- -------------------- ------------ ------------- ------------
FILE_HISTORY NONUNIQUE  FILEH_FNAME          FNAME            1,943,994       126654
             NONUNIQUE  FILEH_FTYPE_STATE    FILE_TYPE               17       518642
                                             STATE_NO                17       518642
             NONUNIQUE  FILEH_PREFIX_STATE   PREF                15,738       434460
                                             STATE_NO            15,738       434460
             UNIQUE     PK_FILE_HISTORY      FILE_ID          1,892,839       499288
PROP_CAT     NONUNIQUE  PK_PROP_CAT          EXTID           11,842,752       402770
                                             SOLD            11,842,752       402770
             NONUNIQUE  PROPC_LOOKUPID       LOOKUPID             2,424       476052
             NONUNIQUE  PROPC_PROPSTYLE      PROPSTYLE            3,120       481642

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1954060      195406 18:37:53          200
                FNAME           VARCHAR2          18792      195406 18:37:53           38
                STATE_NO        NUMBER                5      195406 18:37:53            4
                FILE_TYPE       NUMBER                7      195406 18:37:53            6
                PREF            VARCHAR2          36272      145146 18:37:53           93
                CREATE_DATE     DATE
                TRACK_ID        NUMBER
                SECTOR_ID       NUMBER
                TEAMS           NUMBER
                BYTE_SIZE       NUMBER
                START_DATE      DATE
                END_DATE        DATE
                LAST_UPDATE     DATE
                CONTAINERS      NUMBER
PROP_CAT        LINENUM         NUMBER
                LOOKUPID        VARCHAR2          22848     1148229 18:47:32          200
                EXTID           VARCHAR2       11482290     1148229 18:47:32          200
                SOLD            NUMBER                1     1148229 18:47:32            1
                CATEGORY        VARCHAR2
                NOTES           VARCHAR2
                DETAILS         VARCHAR2
                PROPSTYLE       VARCHAR2          23498     1148229 18:47:32          200

This takes 7:45, about the same as when we specified the number of buckets, but again we have no statistics on the non-indexed columns. So we'll try it using FOR ALL COLUMNS.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 10,
    method_opt       => 'for all columns size skewonly',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:02:27.02

begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'PROP_CAT',
    estimate_percent => 10,
    method_opt       => 'for all columns size skewonly',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:12:57.15

SQL> @index

Table        Uniqueness Index Name         Column Name Distinct Keys Sample Size
------------ ---------- ------------------ ----------- ------------- -----------
FILE_HISTORY NONUNIQUE  FILEH_FNAME        FNAME           1,933,618      125978
             NONUNIQUE  FILEH_FTYPE_STATE  FILE_TYPE              13      498382
                                           STATE_NO               13      498382
             NONUNIQUE  FILEH_PREFIX_STATE PREF               13,907      433992
                                           STATE_NO           13,907      433992
             UNIQUE     PK_FILE_HISTORY    FILE_ID         1,962,629      517697
PROP_CAT     NONUNIQUE  PK_PROP_CAT        EXTID          11,442,368      389153
                                           SOLD           11,442,368      389153
             NONUNIQUE  PROPC_LOOKUPID     LOOKUPID            2,488      468009
             NONUNIQUE  PROPC_PROPSTYLE    PROPSTYLE           3,216      490564

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1951550      195155 12:58:41          200
                FNAME           VARCHAR2          18908      195155 12:58:41           39
                STATE_NO        NUMBER                5      195155 12:58:41            4
                FILE_TYPE       NUMBER                7      195155 12:58:41            6
                PREF            VARCHAR2          35913      144624 12:58:41           93
                CREATE_DATE     DATE             489501      195155 12:58:41          200
                TRACK_ID        NUMBER                9      195155 12:58:41            8
                SECTOR_ID       NUMBER                6      195155 12:58:41            5
                TEAMS           NUMBER              971      160436 12:58:41           46
                BYTE_SIZE       NUMBER                0             12:58:41            1
                START_DATE      DATE                  0             12:58:41            1
                END_DATE        DATE                  0             12:58:41            1
                LAST_UPDATE     DATE             525993      195155 12:58:41          200
                CONTAINERS      NUMBER              971      160483 12:58:41           46
PROP_CAT        LINENUM         NUMBER         11498490     1149849 13:11:53            1
                LOOKUPID        VARCHAR2          22795     1149849 13:11:53          200
                EXTID           VARCHAR2       11498490     1149849 13:11:53          200
                SOLD            NUMBER                1     1149849 13:11:53            1
                CATEGORY        VARCHAR2            820     1149849 13:11:53          200
                NOTES           VARCHAR2              0             13:11:53            1
                DETAILS         VARCHAR2            834     1149849 13:11:53          200
                PROPSTYLE       VARCHAR2          23414     1149849 13:11:53          200

This gets us statistics on all the columns, but takes 15:24, twice as long as the other histogram tests. Besides the longer runtime, some interesting choices are made about which columns get histograms. For instance, 200 buckets are used for the primary key! It's kind of unusual to have a skewed primary key. Fortunately there is an explanation: bug #3929552, detailed in Metalink Note 284917.1. The subject of the note says it all: "DBMS_STATS WITH SKEWONLY GENERATES HISTOGRAMS FOR UNIQUE KEY COLUMN." The note says the bug is in 9.2.0.5, but it exists on our 9.2.0.4 database as well. A one-off patch is available, but the note doesn't say whether the bug is fixed in later patch sets. So for now, the SKEWONLY option is out of the question.
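If SKEWONLY has already left a useless histogram on a unique column, one workaround is to regather statistics for just that column with a single bucket, which replaces the histogram. This is a sketch using the column-level method_opt syntax and the test tables from this paper; adjust names to your own schema:

```sql
-- Regather statistics for FILE_ID only, forcing a single bucket
-- (SIZE 1), which effectively removes the histogram on that column.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 10,
    method_opt       => 'for columns file_id size 1');
end;
/
```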
Let’s see if the AUTO option for SIZE does any better.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 10,
    method_opt       => 'for all columns size auto',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:01:40.49

begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'PROP_CAT',
    estimate_percent => 10,
    method_opt       => 'for all columns size auto',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:03:53.78

SQL> @index

Table        Uniqueness Index Name         Column Name Distinct Keys Sample Size
------------ ---------- ------------------ ----------- ------------- -----------
FILE_HISTORY NONUNIQUE  FILEH_FNAME        FNAME           1,892,759      123316
             NONUNIQUE  FILEH_FTYPE_STATE  FILE_TYPE               7      499648
                                           STATE_NO                7      499648
             NONUNIQUE  FILEH_PREFIX_STATE PREF               14,649      431568
                                           STATE_NO           14,649      431568
             UNIQUE     PK_FILE_HISTORY    FILE_ID         1,874,115      494349
PROP_CAT     NONUNIQUE  PK_PROP_CAT        EXTID          11,272,123      383363
                                           SOLD           11,272,123      383363
             NONUNIQUE  PROPC_LOOKUPID     LOOKUPID            2,874      479087
             NONUNIQUE  PROPC_PROPSTYLE    PROPSTYLE           3,540      490096

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1946580      194658 13:25:14            1
                FNAME           VARCHAR2          18905      194658 13:25:14           38
                STATE_NO        NUMBER                4      194658 13:25:14            1
                FILE_TYPE       NUMBER                7      194658 13:25:14            1
                PREF            VARCHAR2          36144      144197 13:25:14            1
                CREATE_DATE     DATE             485546      194658 13:25:14            1
                TRACK_ID        NUMBER                9      194658 13:25:14            1
                SECTOR_ID       NUMBER                6      194658 13:25:14            1
                TEAMS           NUMBER              966      160023 13:25:14            1
                BYTE_SIZE       NUMBER                0             13:25:14            1
                START_DATE      DATE                  0             13:25:14            1
                END_DATE        DATE                  0             13:25:14            1
                LAST_UPDATE     DATE             523461      194658 13:25:14            1
                CONTAINERS      NUMBER              967      160096 13:25:14            1
PROP_CAT        LINENUM         NUMBER         11487560     1148756 13:30:07            1
                LOOKUPID        VARCHAR2          22904     1148756 13:30:07            1
                EXTID           VARCHAR2       11487560     1148756 13:30:07            1
                SOLD            NUMBER                1     1148756 13:30:07            1
                CATEGORY        VARCHAR2            822     1148756 13:30:07            1
                NOTES           VARCHAR2              0             13:30:07            1
                DETAILS         VARCHAR2            840     1148756 13:30:07            1
                PROPSTYLE       VARCHAR2          23558     1148756 13:30:07            1

This time it only takes 5:34 to gather our statistics, the best yet for a 10% estimate. But the statistics leave something to be desired. Oracle thinks there are 18,905 distinct values of FNAME, and 38 buckets are used for it. We know from the table's data that FNAME contains all unique values. (It doesn't have a unique constraint, but all of its values are in fact distinct.) So there's no reason to have a histogram on it at all.
So far we’ve determined that collecting histograms doesn’t seem to be as efficient or accurate as it should be with DBMS_STATS.
Is 10g Any Better?
Gathering performance and statistics accuracy are reasonable in Oracle 9i for the "standard" choices, such as estimate and cascade, but there are serious deficiencies with AUTO_SAMPLE_SIZE and histogram collection. Is 10g any better in these areas? Let's do some testing on an Oracle 10.1.0.3 database on the same Sun server.
First we’ll try the SKEWONLY option for SIZE in histogram collection.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 10,
    method_opt       => 'for all columns size skewonly',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:02:45.05

begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'PROP_CAT',
    estimate_percent => 10,
    method_opt       => 'for all columns size skewonly',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:12:28.17

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1947030      194703 21:24:46          200
                FNAME           VARCHAR2          26167      194703 21:24:46          200
                STATE_NO        NUMBER                6      194703 21:24:46            6
                FILE_TYPE       NUMBER                7      194703 21:24:46            7
                PREF            VARCHAR2          45092      144582 21:24:46          200
                CREATE_DATE     DATE             488202      194703 21:24:46          200
                TRACK_ID        NUMBER                9      194703 21:24:46            9
                SECTOR_ID       NUMBER                6      194703 21:24:46            6
                TEAMS           NUMBER              960      160244 21:24:46          200
                BYTE_SIZE       NUMBER                0             21:24:46            0
                START_DATE      DATE                  0             21:24:46            0
                END_DATE        DATE                  0             21:24:46            0
                LAST_UPDATE     DATE             525201      194703 21:24:46          200
                CONTAINERS      NUMBER              960      160300 21:24:46          200
PROP_CAT        LINENUM         NUMBER         11458820     1145882 21:35:54            1
                LOOKUPID        VARCHAR2          22891     1145882 21:35:54          200
                EXTID           VARCHAR2       11458820     1145882 21:35:54          200
                SOLD            NUMBER                1     1145882 21:35:54            1
                CATEGORY        VARCHAR2            823     1145882 21:35:54          200
                NOTES           VARCHAR2              0             21:35:54            0
                DETAILS         VARCHAR2            829     1145882 21:35:54          200
                PROPSTYLE       VARCHAR2          23566     1145882 21:35:54          200

This statistics gathering took 15:13, about the same as in Oracle 9i. Unfortunately, the results are just as bad. We're still collecting histograms on the primary key (FILE_ID) and on FNAME, and the number of distinct values of FNAME shows as 26,167, roughly two orders of magnitude below the correct value.
So let’s try the AUTO for SIZE.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 10,
    method_opt       => 'for all columns size auto',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:03:22.81

begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'PROP_CAT',
    estimate_percent => 10,
    method_opt       => 'for all columns size auto',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:08:15.00

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1947700      194770 22:02:39            1
                FNAME           VARCHAR2        1947700      194770 22:02:39            1
                STATE_NO        NUMBER                4      194770 22:02:39            1
                FILE_TYPE       NUMBER                7      194770 22:02:39            1
                PREF            VARCHAR2          36073      144920 22:02:39            1
                CREATE_DATE     DATE             490502      194770 22:02:39            1
                TRACK_ID        NUMBER                9      194770 22:02:39            1
                SECTOR_ID       NUMBER                6      194770 22:02:39            1
                TEAMS           NUMBER              978      160229 22:02:39            1
                BYTE_SIZE       NUMBER                0             22:02:39            0
                START_DATE      DATE                  0             22:02:39            0
                END_DATE        DATE                  0             22:02:39            0
                LAST_UPDATE     DATE             522845      194770 22:02:39            1
                CONTAINERS      NUMBER              979      160296 22:02:39            1
PROP_CAT        LINENUM         NUMBER         11495240     1149524 22:10:18            1
                LOOKUPID        VARCHAR2          22955     1149524 22:10:18            1
                EXTID           VARCHAR2       11495240     1149524 22:10:18            1
                SOLD            NUMBER                1     1149524 22:10:18            1
                CATEGORY        VARCHAR2            827     1149524 22:10:18            1
                NOTES           VARCHAR2              0             22:10:18            0
                DETAILS         VARCHAR2            847     1149524 22:10:18            1
                PROPSTYLE       VARCHAR2          23658     1149524 22:10:18            1

This statistics gathering took 11:38, twice as long as the equivalent 9i test. But the important thing to note is that the statistics are reasonable, and histograms are not collected on the unique columns. So it looks like, in Oracle 10g, the AUTO option for SIZE in histogram collection solves some of the problems seen in 9i. However, it's not perfect. Let's look at the STATE_NO column:
SQL> select state_no, count(*) from file_history group by state_no;

  STATE_NO   COUNT(*)
---------- ----------
         0         95
        20        569
        30    1950957
        40         39
       999          4
      9999          9

6 rows selected.

This column certainly seems like a candidate for a histogram, yet the AUTO option for SIZE doesn't create a histogram for it.
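When AUTO declines to build a histogram on a column we know is skewed, we can always name the column explicitly in method_opt. A sketch, using the STATE_NO column from these tests:

```sql
-- Request a histogram on STATE_NO explicitly. Asking for more
-- buckets than the column has distinct values (6 here) yields a
-- frequency histogram with one bucket per value.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 10,
    method_opt       => 'for columns state_no size 254');
end;
/
```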
Similarly, the file_type column looks like a candidate for a histogram:
SQL> select file_type, count(*) from file_history group by file_type;

 FILE_TYPE   COUNT(*)
---------- ----------
         1     670950
         2      83799
         3      58925
         4      48241
         5     777258
         6      62681
         8     249819

7 rows selected.

So let's try some queries:
select count(*) from file_history where file_type = 5;
select count(*) from file_history where file_type = 4;
select count(*) from file_history where fname = 'SOMETHING';

And gather stats again:
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 10,
    method_opt       => 'for all columns size auto',
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.

SQL> @tabcol

TABLE_NAME   COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE BUCKETS
------------ --------------- ---------- ------------ ----------- -------
FILE_HISTORY FILE_ID         NUMBER          1952650      195265       1
             FNAME           VARCHAR2          26179      195265     254
             STATE_NO        NUMBER                5      195265       1
             FILE_TYPE       NUMBER                7      195265       7
             PREF            VARCHAR2          36154      144848       1

Now dbms_stats has taken the workload into account and created histograms on the columns we queried (though it's again creating a histogram on FNAME).
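For the curious, the AUTO option bases these decisions on column usage that the optimizer records in the data dictionary. One way to peek at what it has recorded is to query SYS.COL_USAGE$; note that this is an undocumented internal table (so its columns may change between releases) and the query below is only a sketch, requiring access to the SYS-owned dictionary tables:

```sql
-- Show recorded predicate usage for FILE_HISTORY's columns.
-- SYS.COL_USAGE$ is undocumented; treat this as a curiosity.
select c.name          column_name,
       u.equality_preds,
       u.equijoin_preds,
       u.range_preds
from   sys.col_usage$ u,
       sys.col$       c,
       sys.obj$       o
where  o.name    = 'FILE_HISTORY'
and    u.obj#    = o.obj#
and    c.obj#    = o.obj#
and    c.intcol# = u.intcol#;
```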
Our last test uses the DBMS_STATS.AUTO_SAMPLE_SIZE setting for estimate_percent.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => dbms_stats.auto_sample_size,
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:02:15.95

begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'PROP_CAT',
    estimate_percent => dbms_stats.auto_sample_size,
    cascade          => true);
end;
/

PL/SQL procedure successfully completed.
Elapsed: 00:04:01.00

SQL> @tabcol

TABLE_NAME      COLUMN_NAME     DATA_TYPE  NUM_DISTINCT SAMPLE_SIZE LAST_ANALYZED BUCKETS
--------------- --------------- ---------- ------------ ----------- ------------- -------
FILE_HISTORY    FILE_ID         NUMBER          1951112        5764 22:20:32            1
                FNAME           VARCHAR2        1951112       57166 22:20:32            1
                STATE_NO        NUMBER                2        5764 22:20:32            1
                FILE_TYPE       NUMBER                7        5764 22:20:32            1
                PREF            VARCHAR2           2200        4246 22:20:32            1
                CREATE_DATE     DATE             593203      487778 22:20:32            1
                TRACK_ID        NUMBER                9        5764 22:20:32            1
                SECTOR_ID       NUMBER                6        5764 22:20:32            1
                TEAMS           NUMBER              275        4687 22:20:32            1
                BYTE_SIZE       NUMBER                0             22:20:32            0
                START_DATE      DATE                  0             22:20:32            0
                END_DATE        DATE                  0             22:20:32            0
                LAST_UPDATE     DATE             577486      487778 22:20:32            1
                CONTAINERS      NUMBER              275        4688 22:20:32            1
PROP_CAT        LINENUM         NUMBER         11476655      520243 22:30:16            1
                LOOKUPID        VARCHAR2          16505       52192 22:30:16            1
                EXTID           VARCHAR2       11476655      520243 22:30:16            1
                SOLD            NUMBER                1        5264 22:30:16            1
                CATEGORY        VARCHAR2            739        5264 22:30:16            1
                NOTES           VARCHAR2              0             22:30:16            0
                DETAILS         VARCHAR2            738        5264 22:30:16            1
                PROPSTYLE       VARCHAR2          16512       52192 22:30:16            1

This statistics gathering took 6:17, an enormous improvement over the Oracle 9i test (which took 31:15). But the accuracy of the results may not be as good as desired. The column PREF shows only 2,200 distinct values (the actual number is 65,345). Other values look reasonable, but the results aren't as good as those of a 5% sample in Oracle 9i, which took a third less time.
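When a sampled value is this badly off and regathering with a larger sample isn't practical, the statistic can be corrected manually with DBMS_STATS.SET_COLUMN_STATS. A sketch, using the PREF numbers from this test (the actual distinct count is the value we know from the table's data):

```sql
-- Overwrite the sampled NDV for PREF (2,200) with the known
-- correct value (65,345). Only the distinct count is set here;
-- other column statistics are left as gathered.
begin
  dbms_stats.set_column_stats(
    ownname => 'TSUTTON',
    tabname => 'FILE_HISTORY',
    colname => 'PREF',
    distcnt => 65345);
end;
/
```

Manually set statistics will, of course, be overwritten by the next gather, so this is a stopgap rather than a substitute for accurate sampling.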
Summary
We’ve run quite a few tests here. From the results, it appears that:
- The sweet spot balancing the performance of the gathering process against the accuracy of the statistics lies between 5 and 20 percent for estimate_percent. Five percent will probably work in most cases, but 20% may be advisable if very accurate statistics are needed.
- Gathering index statistics separately is faster than using the cascade=>true option when gathering table statistics.
- Setting block_sample=>true doesn't appreciably speed up statistics gathering, and it delivers somewhat less-accurate statistics.
- Using AUTO_SAMPLE_SIZE takes nearly as long as COMPUTE in Oracle 9i. It is much faster in 10gR1, but the accuracy of the statistics is not as good as with an estimate_percent of 5, which gathers the statistics faster.
- Using the SKEWONLY option to size histograms is inadvisable in both Oracle 9i and 10gR1.
- Using the AUTO option to size histograms is inadvisable in Oracle 9i; it seems to work better in 10gR1, though it still has issues.
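Putting these points together, a reasonable starting point for a 9i gather job might look like the following sketch. The histogram column list and the single index shown are illustrative; substitute your own skewed columns and loop over your own indexes:

```sql
-- 5% row sampling (not block sampling), histograms only on columns
-- we have chosen deliberately, and index statistics gathered
-- separately rather than via cascade.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 5,
    block_sample     => false,
    method_opt       => 'for columns size 254 state_no file_type',
    cascade          => false);

  dbms_stats.gather_index_stats(
    ownname          => 'TSUTTON',
    indname          => 'PK_FILE_HISTORY',
    estimate_percent => 5);
end;
/
```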