Collecting statistics (Data Virtualization)

To optimize query performance, make sure that you are collecting statistics on the data that is being queried.

The decisions that are made by the cost-based optimizer are crucial to determine the performance of queries. The optimizer makes its decisions based on statistical information that it has about the data that is being queried. Accurate and up-to-date statistics on the data are required to ensure optimal query performance. Statistics should be collected whenever the following conditions apply:

  • A new table is created and populated with data.
  • An existing table’s data undergoes significant changes such as the following conditions:
    • New data is added.
    • Old data is removed.
    • Existing data is updated.

For the optimizer to have enough information to build efficient execution plans, statistics should be gathered for:

  • Any table that is referenced in a query.
  • All columns that are referenced in predicates (including join predicates) and aggregation functions that are referenced in queries.
  • All columns that are referenced in informational constraints.
  • Column group statistics for columns with implied relationships.

You don't need to gather statistics for columns that appear only in the first SELECT list of a query.

To gather statistics on virtualized objects, you can use the SYSPROC.NNSTAT procedure. The statistics are saved in the system catalog. For more information, see the SYSPROC.NNSTAT stored procedure.

Note: Gathering statistics is critical for good query performance, so it is important to have up-to-date statistics. However, the gathering of statistics can be expensive so consideration should be given to gathering statistics when the system is less busy.
Virtualized Tables
When a table is virtualized and the remote data source supports statistics collection locally, the statistics at the remote data source are used to update the local statistics.
Virtualized table statistics are only as good as what is stored on remote data source. You must ensure that statistics are collected on remote system before you virtualize the table. If the remote table undergoes significant changes, the virtualized table statistics should be updated using the SYSPROC.NNSTAT procedure using the data-based collection method.
Virtualized Files
When a file is virtualized, it is important to gather statistics on the virtualized object for improving the performance of queries that access that data. The SYSPROC.NNSTAT procedure should be called to collect the key statistics for this new table by using the data-based method. If the file undergoes a significant change, the SYSPROC.NNSTAT procedure should be called again to collect the new statistics.
Views
When you create a virtual view, it is important to make sure that statistics are collected for all the tables that are referenced in the view, as described in the previous sections.

Checking for missing statistics

To determine whether table level statistics are missing for a table, you can use the following query:

select cast(TABNAME as varchar(25)),  TYPE, COLCOUNT,  STATS_TIME from SYSCAT.TABLES 
           where TABSCHEMA='<schema>'   
                   and CARD = -1;

Replace <schema>with the schema name of the virtual table. The command lists all the objects in the schema that have a CARD of -1, which is an indication that table level statistics have not been collected.

A similar query can be used to check for missing column statistics:

select cast (COLNAME as varchar(25)) from SYSSTAT.COLUMNS 
          where TABSCHEMA = '<schema>' 
                  and TABNAME = '<table_name>' 
                  and COLCARD = -1; 

Replace <schema>with the schema name and <table_name> with the name of the virtual table. The command lists all the objects in the schema that have a COLCARD of -1, which is an indication that column level statistics have not been collected.

Note: Not all columns in a table require column level statistics.

Collecting missing statistics

The following example demonstrates a call to gather statistics with the data-based SYSPROC.NNSTAT procedure.

Important: Do not use the catalog-based statistics collection method of SYSPROC.NNSTAT on a virtualized table. The catalog-based method deletes any previously collected statistics, which has an adverse impact on query performance.
call SYSPROC.NNSTAT(NULL, 
          '<schema>',
          '<table_name>',
          '<list_of_columns_commonly_used_in_predicates>',
          '',
          2,
          '/tmp/<table_name>.log',
          ?,
          1);

This call collects table level statistics and column level statistics for the specified columns in <list_of_columns_commonly_used_in_predicates>. If the table has a small number of columns, you can collect statistics for all columns in the table by passing NULL for the column list argument. If you don't want to collect column statistics, specify the empty string, '', as the fourth parameter.

For more information, see the SYSPROC.NNSTAT stored procedure.