Flattening (en)

Translated by Martin Šmejkal

Flattening is one of the few pitfalls in PostgreSQL for beginners (and sometimes for advanced users too). The inspiration for this article was a conversation with a colleague who could not make PostgreSQL use one specific index; in this respect PostgreSQL is not an accommodating program. If we want to understand the reasons why PostgreSQL stops finding the optimal execution plan, we have to understand the process of selecting that plan. Anyone who tries variations of a problematic SQL query (such as wrapping it in a derived table) at the moment PostgreSQL gets into trouble is very close to meeting flattening. The workaround is easy, but you first have to notice that it is flattening that causes the problem.

This problem is not frequent, and I have heard about it only in mailing list discussions. Anyone who follows pgsql-performance can identify and resolve it immediately, but who has the time to follow every discussion? Under normal circumstances flattening causes no problems; it is a very useful technique that speeds up many queries. Without it we could not use views in PostgreSQL effectively.

When prediction fails

The basic precondition for correct behavior of the optimizer is an accurate estimate of the effect of each predicate. The estimate is based on column statistics: every column of a table has an associated histogram of the frequencies of value classes. If a predicate can be matched to one concrete class, its effect can be predicted as a number of rows of the table. Column statistics are updated by the ANALYZE command; until it is run, PostgreSQL uses default assumptions. The default row count is inversely proportional to the row width (for three integer columns 1770 rows). The next assumption is a uniform distribution of values in the columns. As in Pratchett, the probability that both assumptions hold is about one in a million. The ANALYZE command updates the statistics, and then we can make credible estimates:

root=# create table t1(a integer, b integer);
CREATE TABLE

root=# explain analyze select * from t1;
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..29.40 rows=1940 width=8) (actual time=0.004..0.004 rows=0 loops=1)
 Total runtime: 0.055 ms
(2 rows)

root=# insert into t1 values(10,20);
INSERT 0 1

root=# ANALYZE t1;
ANALYZE

root=# explain analyze select * from t1;
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..1.01 rows=1 width=8) (actual time=0.009..0.012 rows=1 loops=1)
 Total runtime: 0.066 ms
(2 rows)

The estimated number of rows (1940) and the real number of rows differ. After running ANALYZE the estimate matches the real number of rows.
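
Whether the statistics are fresh can also be checked directly; a small sketch, assuming version 8.2 or later, where pg_stat_user_tables reports the time of the last (auto)analyze:

root=# select relname, last_analyze, last_autoanalyze from pg_stat_user_tables where relname = 't1';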

Very interesting statistical data can be seen by displaying the contents of the system view pg_stats (here after inserting ten more random rows into t1 and running ANALYZE):

root=# select * from pg_stats where tablename = 't1';
 schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals |    most_common_freqs    | histogram_bounds | correlation 
------------+-----------+---------+-----------+-----------+------------+------------------+-------------------------+------------------+-------------
 root       | t1        | a       |         0 |         4 |       -0.4 | {13,10,30}       | {0.466667,0.2,0.133333} | {11,12,20}       |    0.803571
 root       | t1        | b       |         0 |         4 |  -0.533333 | {30,20,91}       | {0.4,0.133333,0.133333} | {40,45,66,87,90} |    0.996429
(2 rows)
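
These fractions are what the planner multiplies by the table's total row count; a minimal sketch of that arithmetic (reltuples is maintained by ANALYZE):

root=# select reltuples from pg_class where relname = 't1';
-- estimated rows for the predicate "a = 13":
--   most_common_freqs[1] * reltuples = 0.466667 * reltuples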

A similar algorithm is implemented in the RDBMS Firebird 2.x and MySQL 5.x. In the following cases, however, the estimate is unreliable:

  • use of the LIKE operator (usable estimates exist only since version 8.2; see the sketch after this list)
  • use of a set returning function (SRF); the planner assumed every function returns 1000 rows (fixed in 8.3, where we can set the execution cost (attribute COST) and the expected number of returned rows (attribute ROWS)):
root=# create or replace function foo() returns setof t1 as $$begin return; end; $$ language plpgsql rows 100;
CREATE FUNCTION
root=# explain select * from foo();
                        QUERY PLAN                         
-----------------------------------------------------------
 Function Scan on foo  (cost=0.00..26.00 rows=100 width=8)
(1 row)
  • the data has a character that cannot be accurately approximated by a histogram with n classes, where n is the default setting of exactly 10. Imagine you have 1000 collection points. 80% of them are in towns, and each of these generates at least one record a day. The remaining 20% are located in the unpopulated parts of the country and together generate one record per month. We cannot be surprised that a histogram with ten classes does not cover these 20% of places: queries on the 80% of collection points are planned optimally, queries on the 20% minority are not. It is not a mistake of PostgreSQL; we are simply unable to represent reality accurately at this level of resolution. Increasing the number of classes can help. Technically the number of classes is limited to one thousand, but a realistic value is somewhere around three hundred. A need for more classes signals that something in the input data is out of order and that it is necessary to change the data model or partition the data.
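
The first case can be checked directly; a minimal sketch with a hypothetical table words (per the note above, only 8.2 and later produce a usable row estimate for the pattern):

root=# create table words(w text);
root=# insert into words select md5(i::text) from generate_series(1,1000) i;
root=# analyze words;
root=# explain select * from words where w like 'ab%';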

An example: I will randomly generate pairs from the space [1..800, 1..100]. That is, eight hundred collection points, each measuring a value between 1 and 100. To simulate non-uniformity in the data, I then remove 90% of the records for the collection points in the interval 50..100:

root=# insert into t1 select (random()*800)::int, (random()*100)::int from generate_series(1,15000);
INSERT 0 15000
root=# delete from t1 where a between 50 and 100 and b between 10 and 100;
DELETE 827
root=# analyze t1;
ANALYZE
root=# explain analyze select * from t1 where a = 120; -- almost exact
                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..244.16 rows=18 width=8) (actual time=0.799..11.135 rows=20 loops=1)
   Filter: (a = 120)
 Total runtime: 11.250 ms
(3 rows)

root=# explain analyze select * from t1 where a = 55; -- PostgreSQL fails
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..244.16 rows=18 width=8) (actual time=6.151..11.145 rows=1 loops=1)
   Filter: (a = 55)
 Total runtime: 11.218 ms
(3 rows)

root=# ALTER TABLE t1 ALTER a SET statistics 1000;
ALTER TABLE
root=# analyze t1;
ANALYZE
root=# explain analyze select * from t1 where a = 120;
                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..244.16 rows=20 width=8) (actual time=0.771..12.432 rows=20 loops=1)
   Filter: (a = 120)
 Total runtime: 12.546 ms
(3 rows)

root=# explain analyze select * from t1 where a = 55; -- better, but that is the maximum
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..244.16 rows=10 width=8) (actual time=6.081..10.537 rows=1 loops=1)
   Filter: (a = 55)
 Total runtime: 10.611 ms
(3 rows)

root=# explain analyze select * from t1 where a  between 130 and 150 and b between 0 and 10;
                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..350.46 rows=46 width=8) (actual time=1.106..15.428 rows=47 loops=1)
   Filter: ((a >= 130) AND (a <= 150) AND (b >= 0) AND (b <= 10))
 Total runtime: 15.605 ms
(3 rows)

-- with a condition on both attributes PostgreSQL fails (statistics at the maximum)
root=# explain analyze select * from t1 where a  between 65 and 85 and b between 0 and 10;
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..350.46 rows=5 width=8) (actual time=0.066..14.882 rows=51 loops=1)
   Filter: ((a >= 65) AND (a <= 85) AND (b >= 0) AND (b <= 10))
 Total runtime: 15.076 ms
(3 rows)
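
The raised per-column target can be verified in the catalog; a small sketch using pg_attribute (attstattarget is -1 for columns that use the default_statistics_target):

root=# select attname, attstattarget from pg_attribute where attrelid = 't1'::regclass and attnum > 0;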

One of the reasons why PostgreSQL fails here is that the randomly generated values of the two columns are independent, and the test set is not very large. Real data, fortunately, has a different character.

Increasing the statistics target does not always help, and then we have to rewrite the query: either prepare two variants of the query, or split the query up and write a stored procedure (in our example we should avoid the predicate on "a"). Should we increase the statistics to hundreds of classes automatically? Absolutely not, and for the same reason we do not create indexes in advance: it would be premature optimization. In SQL we should solve problems when they break out, not before and not after. We can log slow queries; checking this log regularly is a normal habit of a good DBA.
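
Logging slow queries needs a single setting; a minimal sketch (the 200 ms threshold is my arbitrary choice):

root=# set log_min_duration_statement = 200;  -- log statements running over 200 ms
-- or permanently for the whole cluster in postgresql.conf:
--   log_min_duration_statement = 200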

Now, I think, we know the areas where query optimization can fail. Flattening can cause problems only in these areas, so if everything works well we do not have to care about the details. These really are quite atypical incidents. We know that optimization sometimes causes problems, but nobody knows a solution that could estimate the effect of a predicate without exceptions.

Flattening

What is flattening? Flattening is a transformation of an SQL query containing a derived table into an equivalent SQL query without the derived table. For example:

select * from (select * from data where a = 10) s where s.b = 20; --> select * from data where a = 10 and b = 20;

proof:

root=# explain select * from (select * from t1 where a = 20) s where s.b =  20;
                     QUERY PLAN                     
----------------------------------------------------
 Seq Scan on t1  (cost=0.00..279.60 rows=1 width=8)
   Filter: ((a = 20) AND (b = 20))
(2 rows)

root=# explain select * from t1 where a = 20 and b = 20;
                     QUERY PLAN                     
----------------------------------------------------
 Seq Scan on t1  (cost=0.00..279.60 rows=1 width=8)
   Filter: ((a = 20) AND (b = 20))
(2 rows)

Where is the problem? In the understanding of the parentheses. In ordinary programming languages parentheses change the order of evaluation, so we assume the predicate a = 10 is evaluated first (using an index on a) and only then is the predicate b = 20 applied to the result. That is a mistake. Here the parentheses only make the query syntactically correct; they do not determine the order of evaluation. So if PostgreSQL prefers the index on b, we can rewrite the query a thousand times and PostgreSQL will still prefer the index on b. Because of flattening the nested query disappears, PostgreSQL chooses between the indexes on a and b, and it still chooses the index on b. So we cannot force a preferred index this way. One step back: why a nested query at all? We use a nested query when we believe it will make PostgreSQL do exactly what we want, when the data is not in order and we do not want to lose control. And that simply cannot be done this way.

Back to my example. Who would write such a contrived query? Nobody. But everybody who works with views uses it implicitly. The query from the example corresponds to:

create view v as select * from data where a = 10;
select * from v where b = 20;

This is why flattening is so important in PostgreSQL: thanks to it, views can be used easily, without any extra work and without an extra query against the underlying table.
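
You can verify this yourself; a short sketch over the table t1 from above (the plan should be the same Seq Scan with the combined filter as in the proof above):

root=# create view v1 as select * from t1 where a = 20;
root=# explain select * from v1 where b = 20;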

We have to compel PostgreSQL to respect the parentheses. Flattening cannot be applied when the nested query contains a LIMIT or OFFSET clause. LIMIT is not usable for this purpose because it changes the meaning of the query, but OFFSET 0 is a no-op, and that is why it is the idiom that is used. The introduction of a hint was once considered; it was not accepted, because of the problems a flattening hint could cause - and introducing hints for indexes would bring the same problems. Besides, OFFSET 0 is a public secret: you can find it. So why invent a new, suboptimal syntax? If this topic interests you, search the pgsql-performance archives for the keywords flattening and "OFFSET 0". In the plan, the OFFSET 0 shows up as a Limit node that blocks the flattening:

root=# explain select * from (select * from t1 where a = 20 offset 0) s where s.b =  20;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Subquery Scan s  (cost=0.00..244.42 rows=1 width=8)
   Filter: (b = 20)
   ->  Limit  (cost=0.00..244.16 rows=21 width=8)
         ->  Seq Scan on t1  (cost=0.00..244.16 rows=21 width=8)
               Filter: (a = 20)
(5 rows)

I do not think there are any other problems in this area. PostgreSQL simply works, and tricks are not the right solution. I remember that to compel PostgreSQL 7.4 to use an index we sometimes had to add explicit casts between numeric types. On the other hand, 7.4 was a reliable database, although it is retired now (7.4 was the first really usable version). The 8.x versions are much better and let programmers concentrate on their work - without tricks.