PostgreSQL SQL Tricks I
SQL is a language in which one task can be solved in multiple ways with different efficiency. On this page you can see how to use PostgreSQL's possibilities effectively.
Slow queries in stored procedures
Sometimes users report that a query is slow inside a stored procedure while direct execution of the same query is fast. The reason is simple: stored procedures use prepared statements, and prepared statements are planned without knowledge of the real parameter values. This behaviour ensures safety against SQL injection, but it carries a problem - the planner is blind. The problem isn't too frequent, and the solution is simple: use dynamic queries (but then we must not forget to carefully guard against SQL injection). We can detect this kind of problem by analyzing the prepared statement.
t2=# PREPARE pp(integer) AS SELECT count(*) FROM foo WHERE a BETWEEN $1 + 100 AND $1 + 200;
t2=# EXPLAIN ANALYZE EXECUTE pp(100);
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=438.60..438.61 rows=1 width=0) (actual time=16.149..16.150 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=13.39..437.35 rows=500 width=0) (actual time=1.391..5.458 rows=1033 loops=1)
         Recheck Cond: ((a >= ($1 + 100)) AND (a <= ($1 + 200)))
         ->  Bitmap Index Scan on fx  (cost=0.00..13.26 rows=500 width=0) (actual time=1.131..1.131 rows=1033 loops=1)
               Index Cond: ((a >= ($1 + 100)) AND (a <= ($1 + 200)))
 Total runtime: 16.340 ms
(6 rows)

t2=# EXPLAIN ANALYZE SELECT count(*) FROM foo WHERE a BETWEEN 100 + 100 AND 100 + 200;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=432.21..432.22 rows=1 width=0) (actual time=15.930..15.932 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=22.21..429.78 rows=971 width=0) (actual time=1.179..13.338 rows=1033 loops=1)
         Recheck Cond: ((a >= 200) AND (a <= 300))
         ->  Bitmap Index Scan on fx  (cost=0.00..21.97 rows=971 width=0) (actual time=0.917..0.917 rows=1033 loops=1)
               Index Cond: ((a >= 200) AND (a <= 300))
 Total runtime: 16.058 ms
(6 rows)
Execution plans are nearly identical, so we can use prepared statements.
PostgreSQL 9.2 has a redesigned implementation of prepared statements - the performance issues described here should be history.
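For illustration, a dynamic variant of the query above in PL/pgSQL might look like this (a minimal sketch; the function name is made up, and EXECUTE ... USING requires 8.4 or later). EXECUTE plans the statement with the actual parameter values, and passing them with USING rather than string concatenation avoids SQL injection:

CREATE OR REPLACE FUNCTION count_foo(p integer)
RETURNS bigint AS $$
DECLARE result bigint;
BEGIN
  -- the statement is planned at execution time with the real value of p
  EXECUTE 'SELECT count(*) FROM foo WHERE a BETWEEN $1 + 100 AND $1 + 200'
     INTO result
    USING p;
  RETURN result;
END;
$$ LANGUAGE plpgsql;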
Slow DISTINCT
The current implementation of DISTINCT needs a sort, which can be slow. We should use a GROUP BY clause instead - it can use hash aggregation, which is significantly faster than DISTINCT.
postgres=# select count(*) from (select distinct i from g) a;
 count
-------
 10001
(1 row)

Time: 1563,109 ms

postgres=# select count(*) from (select i from g group by i) a;
 count
-------
 10001
(1 row)

Time: 594,481 ms
This topic is related to PostgreSQL 8.2 and older. PostgreSQL 8.3 and higher can use hash aggregation for DISTINCT.
Taking quarter from interval
Interval arithmetic is a little bit of alchemy. Internally the interval type carries days, months and years. The input value isn't normalized; it stays in the entered units (i.e. 200 days are stored as 0 years, 0 months and 200 days). Some functions don't work well with such values (e.g. EXTRACT QUARTER returns the first quarter). We have to normalize first with the justify_interval function (so the interval of 200 days will be stored as 0 years, 6 months and 20 days). From this value we get the correct quarter:
postgres=# SELECT interval '200 days', EXTRACT(QUARTER FROM interval '200 days');
 interval | date_part
----------+-----------
 200 days |         1
(1 row)

postgres=# SELECT justify_interval(interval '200 days'),
                  EXTRACT(QUARTER FROM justify_interval('200 days'));
 justify_interval | date_part
------------------+-----------
 6 mons 20 days   |         3
(1 row)
Author: Tom Lane from pgsql-bugs
Select random row
Hubert Lubaczewski showed in his [blog] the problem of selecting a random row from a table. During one database audit I found the typical but slow solution (from the FAQ): ORDER BY random() LIMIT 1. This method never uses an index, and every such query does a complete scan and a sort of the table. The next possible method, WHERE id >= (random()*C+1)::int LIMIT 1, is much faster, but when the series of ids has gaps it prefers the first rows after a gap. I have used with success a method based on a set of random numbers. The disadvantage of this method is the risk of returning zero rows (then we have to run the statement again):
SELECT id, ...
  FROM data
 WHERE id = ANY(ARRAY(
         SELECT (random()*max_id)::int
           FROM generate_series(1,20)))
 LIMIT 1;
-- max_id is a constant equal to max(id) from the table + 20% reserve
-- id is the primary key of table data
Use exceptions carefully
Trapping exceptions has some overhead, so don't use it too often. In the majority of cases this overhead is insignificant, but that can change when you trap exceptions inside loops. The next two procedures add a new item into a table of unique records. If the item already exists, they return false, otherwise true.
CREATE OR REPLACE FUNCTION addnew1(integer)
RETURNS boolean AS $$
BEGIN
  -- test(a) .. primary key
  INSERT INTO test(a) VALUES($1);
  RETURN true;
EXCEPTION WHEN OTHERS THEN
  RETURN false;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION addnew2(integer)
RETURNS boolean AS $$
BEGIN
  IF NOT EXISTS(SELECT a FROM test WHERE a = $1) THEN
    INSERT INTO test(a) VALUES($1);
    RETURN true;
  END IF;
  RETURN false;
END;
$$ LANGUAGE plpgsql;
Procedure addnew1 is about two times slower than addnew2, but addnew1 is more robust (addnew2's check-then-insert has a race window under concurrent use). Every protected block in PL/pgSQL creates a new subtransaction, and in this case the time for creating a new subtransaction plus the INSERT is greater than the SELECT plus the INSERT.
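A quick check of the expected behaviour:

SELECT addnew2(1);  -- true, the row was inserted
SELECT addnew2(1);  -- false, the row already exists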
Fast comparison of NEW and OLD variables in a trigger's body
A trigger can run efficiently if its body is executed only when some columns are actually changed. This test can be simple and fast with the row operator IS DISTINCT FROM (new in 8.2).
IF ROW(NEW.*) IS DISTINCT FROM ROW(OLD.*) THEN
  -- continue only if something has changed
  IF NEW.a1 IS DISTINCT FROM OLD.a1
     OR NEW.a2 IS DISTINCT FROM OLD.a2
  THEN
    -- continue only if important columns have changed
    ...
  END IF;
END IF;
RETURN NEW;
or, more efficiently:
IF ROW(NEW.a1, NEW.a2) IS DISTINCT FROM ROW(OLD.a1, OLD.a2) THEN
  -- important columns have changed
  ...
END IF;
RETURN NEW;
From 8.4 you can compare record values simply:
IF NEW IS DISTINCT FROM OLD THEN
  -- continue only if something has changed
  ...
END IF;
RETURN NEW;
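For context, a minimal sketch of a complete trigger built around this test (the table and the updated column are made up for the example):

CREATE OR REPLACE FUNCTION data_changed() RETURNS trigger AS $$
BEGIN
  IF NEW IS DISTINCT FROM OLD THEN
    -- do the real work only when the row really changed
    NEW.updated := current_timestamp;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_changed_trg
  BEFORE UPDATE ON datatable
  FOR EACH ROW EXECUTE PROCEDURE data_changed();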
Fast removal of the first n rows
PostgreSQL doesn't support LIMIT for UPDATE or DELETE statements. Sometimes we would like to use it (for queue emulation, ...). The classic solution with IN and a subquery isn't very effective:
postgres=# explain analyze delete from del where ctid in (select ctid from del limit 10 offset 0);
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Merge IN Join  (cost=804.79..854.94 rows=10 width=6) (actual time=63.392..63.464 rows=10 loops=1)
   Merge Cond: (public.del.ctid = "IN_subquery".ctid)
   ->  Sort  (cost=804.39..829.39 rows=10000 width=6) (actual time=63.205..63.217 rows=11 loops=1)
         Sort Key: public.del.ctid
         Sort Method:  quicksort  Memory: 646kB
         ->  Seq Scan on del  (cost=0.00..140.00 rows=10000 width=6) (actual time=0.045..27.366 rows=9970 loops=1)
   ->  Sort  (cost=0.41..0.43 rows=10 width=6) (actual time=0.172..0.187 rows=10 loops=1)
         Sort Key: "IN_subquery".ctid
         Sort Method:  quicksort  Memory: 17kB
         ->  Limit  (cost=0.00..0.14 rows=10 width=6) (actual time=0.021..0.127 rows=10 loops=1)
               ->  Seq Scan on del  (cost=0.00..140.00 rows=10000 width=6) (actual time=0.016..0.030 rows=10 loops=1)
 Total runtime: 63.661 ms
(12 rows)
Tom Lane's proposal is based on using an array:
postgres=# explain analyze delete from del where ctid = any (array(select ctid from del limit 10));
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Tid Scan on del  (cost=0.14..40.37 rows=10 width=6) (actual time=0.200..0.237 rows=10 loops=1)
   TID Cond: (ctid = ANY ($0))
   InitPlan
     ->  Limit  (cost=0.00..0.14 rows=10 width=6) (actual time=0.046..0.139 rows=10 loops=1)
           ->  Seq Scan on del  (cost=0.00..140.00 rows=10000 width=6) (actual time=0.040..0.065 rows=10 loops=1)
 Total runtime: 0.376 ms
The core of this trick is avoiding the join, which in this case is no more effective than a sequential scan with a filter. The trick is usable in other cases too. Generally we can say that for a small N (N < 100) the predicate = ANY is better than the predicate IN (for PostgreSQL 8.2 and 8.3). But real efficiency depends on indexes.
Getting a list of comma-separated items
D. Dante Lorenso sent a note about building such a list from the content of a table:
SELECT a.id, a.name,
       array_to_string(ARRAY(SELECT b.name
                               FROM b
                              WHERE b.id = a.id
                              ORDER BY b.name ASC), ',') AS b_names
  FROM a
 ORDER BY a.id ASC;

[table a]
 id | name
----+-------
  1 | one
  2 | two
  3 | three
  4 | four

[table b]
 id | name
----+---------
  1 | pizza
  1 | hot dog
  2 | gorilla
  2 | monkey
  3 | apple
  4 | cheese
  4 | milk
  4 | eggs

-- result
 id | name  | b_names
----+-------+------------------
  1 | one   | pizza,hot dog
  2 | two   | gorilla,monkey
  3 | three | apple
  4 | four  | cheese,milk,eggs
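From PostgreSQL 9.0 the built-in string_agg aggregate can produce the same list directly; a sketch of the equivalent query:

SELECT a.id, a.name,
       (SELECT string_agg(b.name, ',' ORDER BY b.name)
          FROM b
         WHERE b.id = a.id) AS b_names
  FROM a
 ORDER BY a.id;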
Using functions pgstattuple and pgstatindex
You can find the nice contrib package pgstattuple in PostgreSQL 8.1 and higher. This contrib module contains diagnostic functions for detecting the fraction of dead tuples in a table and the fragmentation of indexes. Using these functions is more practical with the next two table generating functions:
CREATE OR REPLACE FUNCTION print_table_dead_tp(OUT table_name varchar,
                                               OUT tuple_count int8,
                                               OUT table_len varchar,
                                               OUT free_space varchar,
                                               OUT dead_tuple_percent numeric(5,2),
                                               OUT dead_tuple_count integer)
RETURNS SETOF RECORD AS $$
DECLARE
  r record;
  s record;
BEGIN
  FOR r IN SELECT c.oid, n.nspname || '.' || c.relname AS "tablename"
             FROM pg_catalog.pg_class c
                  LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
            WHERE c.relkind = 'r'
              AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
              AND pg_catalog.pg_table_is_visible(c.oid)
            ORDER BY 2
  LOOP
    s := pgstattuple(r.oid);
    table_name := r.tablename;
    tuple_count := s.tuple_count;
    dead_tuple_percent := s.dead_tuple_percent;
    dead_tuple_count := s.dead_tuple_count;
    table_len := pg_size_pretty(s.table_len);
    free_space := pg_size_pretty(s.free_space);
    RETURN NEXT;
  END LOOP;
  RETURN;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION print_indexes_info(OUT index_name varchar,
                                              OUT indexsize varchar,
                                              OUT leaf_pages integer,
                                              OUT deleted_pages integer,
                                              OUT leaf_fragmentation double precision,
                                              OUT avg_leaf_density double precision)
RETURNS SETOF RECORD AS $$
DECLARE
  r record;
  o pgstatindex_type;
BEGIN
  FOR r IN SELECT schemaname || '.' || indexname AS indexname FROM pg_indexes
  LOOP
    BEGIN
      o := pgstatindex(r.indexname);
      index_name := r.indexname;
      indexsize := pg_size_pretty(o.index_size);
      deleted_pages := o.deleted_pages;
      leaf_pages := o.leaf_pages;
      avg_leaf_density := o.avg_leaf_density;
      leaf_fragmentation := o.leaf_fragmentation;
      RETURN NEXT;
    EXCEPTION WHEN OTHERS THEN
      indexsize := NULL;
      deleted_pages := NULL;
      leaf_fragmentation := NULL;
      index_name := r.indexname;
      RETURN NEXT;
    END;
  END LOOP;
  RETURN;
END;
$$ LANGUAGE plpgsql;
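Usage is then simple:

SELECT * FROM print_table_dead_tp();
SELECT * FROM print_indexes_info();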
Removing diacritics
We can simply remove diacritics by calling the to_ascii function. There is a small problem: this function doesn't support UTF8 encoding, so we have to use the convert function first.
postgres=# select to_ascii(convert('Příliš žlutý kůň' using utf8_to_iso_8859_2), 'latin2');
     to_ascii
------------------
 Prilis zluty kun
(1 row)
Note for 7.4: the name of the conversion is "utf_8_to_iso_8859_2". For 8.3 and 9.0 we have to use a small workaround:
CREATE FUNCTION to_ascii(bytea, name)
RETURNS text AS 'to_ascii_encname' LANGUAGE internal;

SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'), 'latin2');
Predicate IN optimization
I found a note on the pgsql-performance list about a possibility of optimizing the IN predicate when the list of values is longer than about eighty numbers. For longer lists it is better to create a constant subquery using multiple VALUES:
SELECT * FROM tab WHERE x IN (1,2,3,..n); -- n > 70

-- faster case
SELECT * FROM tab WHERE x IN (VALUES(10),(20));
Using VALUES is faster for a bigger number of items, so don't use it for small sets of values.
Barcode validation
SQL guru Joe Celko described on DBAzine an interesting technique for barcode validation with multiple VALUES.
CREATE FUNCTION barcode_sum(text)
RETURNS bigint AS $$
  SELECT ABS(SUM(CAST(SUBSTRING($1 FROM Weights.seq FOR 1) AS INTEGER) * Weights.wgt))
    FROM (VALUES (CAST(1 AS INTEGER), CAST(-1 AS INTEGER)),
                 (2, +1), (3, -1), (4, +1), (5, -1), (6, +1),
                 (7, -1), (8, +1), (9, -1), (10, +1), (11, -1), (12, +1)
         ) AS Weights(seq, wgt)
$$ LANGUAGE sql;
Our function can be used like this:
CREATE TABLE products(
  barcode char(13) NOT NULL
    CONSTRAINT valid_checkdigit
      CHECK (MOD(barcode_sum(barcode), 10)
             = CAST(substring(barcode FROM 13 FOR 1) AS integer))
    CONSTRAINT all_numeric_checkdigit
      CHECK (barcode NOT SIMILAR TO '%[^0-9]%')
);

INSERT INTO products VALUES('2837232811227');
This trick is based on the behaviour of the char type: text shorter than 13 characters is padded with spaces, so the NOT SIMILAR TO test also ensures the correct number of digits.
Using hash functions for ensuring uniqueness of texts
A classic unique index isn't a good solution for longer texts. It's natural to use a hash function instead. The output of such a function is a 32 character long hexadecimal number, which we can shrink by decoding into the bytea type. Attention: a hash doesn't ensure 100% uniqueness.
root=# CREATE TABLE test(a text);
CREATE TABLE
root=# CREATE UNIQUE INDEX uidx ON test((decode(md5(a),'hex')));
CREATE INDEX
root=# CREATE UNIQUE INDEX uidx ON test((decode(md5(lower(replace(a,' ',''))),'hex')));
CREATE INDEX
root=# INSERT INTO test VALUES('příliš žluťoučký kůň');
INSERT 0 1
root=# INSERT INTO test VALUES('příliš žluťoučký kůň ');
ERROR:  duplicate key violates unique constraint "uidx"
root=# INSERT INTO test VALUES('Příliš žluťoučký kůň');
ERROR:  duplicate key violates unique constraint "uidx"

-- less risk, using two hash functions
-- CREATE UNIQUE INDEX uidx ON test((decode(md5(a),'hex')), (hashtext(a)));
Getting OID of function
Every database object in PostgreSQL has an OID - a unique identifier. We can find it in the system catalogs, or we can get it with a cast:
CREATE OR REPLACE FUNCTION a(integer, varchar)
RETURNS void AS $$
BEGIN
END;
$$ LANGUAGE plpgsql;

root=# SELECT 'a'::regproc::int;
ERROR:  more than one function named "a"
root=# SELECT 'a(integer, varchar)'::regprocedure::int;
 int4
-------
 57507
(1 row)

root=# SELECT 57507::regprocedure;
         regprocedure
------------------------------
 a(integer,character varying)
(1 row)
We can get a table's OID with a cast to the regclass type:
SELECT 'oo'::regclass::int;
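The reverse direction works as well (the OID below is illustrative; the real value differs per database):

SELECT 16385::regclass;  -- oo, assuming 16385 is the table's OID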
General array sort
A very nice exhibition of PostgreSQL's features is this general procedure for sorting arrays by David Fetter.
CREATE OR REPLACE FUNCTION array_sort(ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL AS $$
  SELECT ARRAY(
    SELECT $1[s.i] AS "foo"
      FROM generate_series(array_lower($1,1), array_upper($1,1)) AS s(i)
     ORDER BY foo
  );
$$;
A little faster and shorter modernized variant:
CREATE OR REPLACE FUNCTION array_sort(ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL AS $$
  SELECT ARRAY(SELECT unnest($1) ORDER BY 1)
$$;
or
-- by Craig Ringer
CREATE OR REPLACE FUNCTION array_sort(ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL AS $$
  SELECT array_agg(x ORDER BY x) FROM unnest($1) x;
$$;
If you need a really fast sort of int arrays, then you should look at the intarray contrib module.
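For illustration, with the intarray module installed its sort function can be used directly (integer arrays only):

-- intarray must be installed: CREATE EXTENSION intarray; on 9.1+,
-- or run the contrib SQL script on older versions
SELECT sort(ARRAY[3,1,2]);  -- {1,2,3}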
Array transformation in PL/Perl
PL/Perl doesn't support OUT arrays. We can use the undocumented function _plperl_to_pg_array:
CREATE OR REPLACE FUNCTION foo(OUT a integer[], OUT b integer[]) AS $$
  return { a => _plperl_to_pg_array([1,2,3]),
           b => _plperl_to_pg_array([4,5,6]) };
$$ LANGUAGE plperl;
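Calling the function returns one row with both arrays:

SELECT * FROM foo();
-- returns a = {1,2,3}, b = {4,5,6}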
UPC code validation
This is a sample validation of a code with a check digit, written in Perl (PL/Perl). I used this function for comparison with the PL/pgSQL and C languages. Perl is about 20% faster than PL/pgSQL, but C is 10 times faster than Perl.
CREATE OR REPLACE FUNCTION ean_perl(TEXT)
RETURNS boolean AS $_$
  my $ean = length $_[0] == 12 ? "0$_[0]" : $_[0];

  # Make sure we really have an EAN.
  return 'false' unless $ean =~ /^\d{13}$/;

  my @nums = split '', $ean;
  return 10 - (
      # Sum even numerals.
      ( ( $nums[1] + $nums[3] + $nums[5] + $nums[7] + $nums[9] + $nums[11] )
        * 3  # Multiply total by 3.
      # Add odd numerals except for checksum (12).
      ) + $nums[0] + $nums[2] + $nums[4] + $nums[6] + $nums[8] + $nums[10]
  # Compare to the checksum.
  ) % 10 == $nums[12] ? 'true' : 'false';
$_$ LANGUAGE plperl IMMUTABLE;
I found an interesting solution by Joe Celko in an article. After rewriting it to PostgreSQL:
SELECT (SELECT mod(sum(CAST(substring('2837232811227' FROM s.seq FOR 1) AS integer)
                       * CASE mod(s.seq, 2) WHEN 0 THEN 1 ELSE -1 END), 10)
          FROM generate_series(1,12) AS s(seq))
       = CAST(substring('2837232811227' FROM 13 FOR 1) AS integer);
Show only every n-th row
When we want to skip some rows from a table, we can use a temporary sequence and the modulo function:
CREATE TEMP SEQUENCE number;

SELECT *
  FROM (SELECT *, nextval('number') AS number
          FROM datatable) foo
 WHERE foo.number % 5 = 0;
With 8.4 this can be done without a temporary sequence:
SELECT *
  FROM (SELECT *, row_number() OVER () AS number
          FROM datatable) foo
 WHERE foo.number % 5 = 0;
Array to table
We can simply unpack an array with the generate_series function:
SELECT (ARRAY[1,3,4,6,7])[idx.n] FROM generate_series(1,5) idx(n);

CREATE OR REPLACE FUNCTION unpack(anyarray)
RETURNS SETOF anyelement AS $$
  SELECT $1[i]
    FROM generate_series(array_lower($1,1), array_upper($1,1)) g(i);
$$ LANGUAGE sql STRICT IMMUTABLE;

postgres=# select unpack(array['a','b','c']);
 unpack
--------
 a
 b
 c
(3 rows)
With 8.4, we can use the unnest() function:
test=*# select unnest(array['a','b','c']);
 unnest
--------
 a
 b
 c
(3 rows)
Additionally, we can extend the unnest() function to return one row with a default value if the array is empty or null:
create or replace function unnest(anyarray, anyelement)
returns setof anyelement as $$
  select unnest(case when array_upper($1, 1) > 0 then $1 else array[$2] end);
$$ language sql immutable;

gp=> create temp table foo (x text[], y int) distributed by (y);
CREATE TABLE
gp=> insert into foo values (array['a','b','c'], 1), (null, 2), ('{}'::text[], 3);
INSERT 0 3
gp=> select *, array_upper(x,1) as size from foo;
    x    | y | size
---------+---+------
 {}      | 3 |
         | 2 |
 {a,b,c} | 1 |    3
(3 rows)

gp=> select unnest(x), y from foo;
 unnest | y
--------+---
 a      | 1
 b      | 1
 c      | 1
(3 rows)

gp=> select unnest(x,'empty'), y from foo;
 unnest | y
--------+---
 empty  | 3
 empty  | 2
 a      | 1
 b      | 1
 c      | 1
(5 rows)
LIKE optimization
PostgreSQL uses an index for LIKE only when:
- the pattern doesn't start with the % or _ symbols,
- the database cluster is initialized with the C locale.
We can get around the last constraint with a special index:
CREATE INDEX like_index ON people(surname varchar_pattern_ops);
When we work with a pattern like %some (for internet domain searching), we can use the following trick:
- create a functional index on mirrored values,
- search with a mirrored pattern.
PostgreSQL doesn't provide a reverse (mirror) function. We can use the rvrs function from the Orafunc package, or we can write one in PL/Perl.
CREATE OR REPLACE FUNCTION reverse(varchar)
RETURNS varchar AS $$
  $reversed = reverse $_[0];
  return $reversed;
$$ LANGUAGE plperlu IMMUTABLE;

CREATE INDEX rev_email ON users((reverse(email)) varchar_pattern_ops);

SELECT * FROM users WHERE reverse(email) LIKE reverse('%.cz');
Without PL/Perl:
CREATE OR REPLACE FUNCTION reverse(TEXT)
RETURNS TEXT AS $$
  SELECT array_to_string(ARRAY(SELECT substring($1, s.i, 1)
                                 FROM generate_series(length($1), 1, -1) AS s(i)),
                         '');
$$ LANGUAGE SQL IMMUTABLE;
Note: PostgreSQL 9.1 has a built-in reverse function for strings.
Deferred constraints
PostgreSQL checks all constraints immediately. The SQL standard knows deferred constraints, which allow postponing the checks until just before commit. PostgreSQL doesn't have full support: only referential integrity constraints can be deferred. But PostgreSQL supports (undocumented) deferred constraint triggers, which we can use instead. The syntax is similar to a trigger definition:
CREATE CONSTRAINT TRIGGER sc
  AFTER INSERT ON fieldtrip_students
  INITIALLY DEFERRED
  FOR EACH ROW
  EXECUTE PROCEDURE field_trip();
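The trigger function itself is ordinary PL/pgSQL; a hypothetical sketch of field_trip, which raises an exception at commit time when the check fails (the rule and column names are made up for the example):

CREATE OR REPLACE FUNCTION field_trip() RETURNS trigger AS $$
BEGIN
  -- hypothetical rule: a student may be signed up for at most one trip
  IF (SELECT count(*)
        FROM fieldtrip_students s
       WHERE s.student_id = NEW.student_id) > 1 THEN
    RAISE EXCEPTION 'student % has more than one field trip', NEW.student_id;
  END IF;
  RETURN NULL;  -- the result of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;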
Removing duplicate rows
Any table created without the WITHOUT OIDS parameter has a unique row identifier, the OID. We can use it for removing duplicate rows:
DELETE FROM people WHERE NOT oid IN (SELECT MIN(oid) FROM people GROUP BY name, surname);
But nowadays new tables are usually created with this parameter, so we can't use the OID. Every PostgreSQL row has another unique identifier: ctid. There are no aggregate functions defined for the ctid type, so we have to use a correlated subquery. I found this trick in Greg Sabino Mullane's presentation.
DELETE FROM lidi
 WHERE (ctid, tableoid) <> (SELECT l.ctid, l.tableoid
                              FROM lidi l
                             WHERE lidi.prijmeni = l.prijmeni
                               AND lidi.jmeno = l.jmeno
                             LIMIT 1);
Select first n rows from group
A correlated subquery can pick the n (here two) youngest persons from every category:

SELECT *
  FROM people
 WHERE id IN (SELECT id
                FROM people s
               WHERE people.category = s.category
               ORDER BY age
               LIMIT 2)
 ORDER BY category, age;
Another solution, without a correlated subquery, is a self join. We have to look at the problem from a different perspective: ask for the persons for whom at most one person in the same group is older.
SELECT s1.*
  FROM people s1
       LEFT JOIN people s2
       ON s1.category = s2.category AND s1.age < s2.age
 GROUP BY s1.id, s1.category
HAVING COUNT(s2.id) <= 1
 ORDER BY s1.category, COUNT(s2.id);
Cast to varchar
Every data type in PostgreSQL has input and output functions. We can use these functions to support casting to varchar (if such a cast is missing). Don't use this trick on PostgreSQL 8.3 and higher.
testdb011=# SELECT '(1,1)'::point::varchar;
ERROR:  cannot cast type point to character varying
but
testdb011=# SELECT textin(point_out('(1,1)'::point))::varchar;
 textin
--------
 (1,1)
(1 row)
Create Aggregates
I needed to be able to do a fast
SELECT count(*) FROM sometable WHERE field='x'
instead of a simple
SELECT count(*) FROM sometable
Both of the above queries initiate sequential scans on a table (which takes too much time). Here is what I created:
CREATE TABLE aggregate AS
  (SELECT result, count(result) AS count
     FROM original_table
    GROUP BY result
    ORDER BY count DESC);
Further, my data contains many one-offs which I don't want to put into my aggregate table (I assume that if a value isn't in my aggregate table that it's too small to be useful)... here is what I do to leave those values out:
CREATE TABLE aggregate AS
  (SELECT *
     FROM (SELECT result, count(result) AS count
             FROM original_table
            GROUP BY result
            ORDER BY count DESC) AS countquery
    WHERE count > 3);
This method is fast enough. Because I have an index on the "result" field, it runs in under a minute against 3 million records. It takes approximately the same amount of time to generate counts for all the result values as it does to count the total number of records.
Now I can do a simple (very fast) query to get a count:
SELECT * FROM aggregate where result='xxxxx'
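Because the aggregate table is only a snapshot, it has to be rebuilt from time to time; a minimal refresh might look like this (a sketch, reusing the query from above):

BEGIN;
TRUNCATE aggregate;
INSERT INTO aggregate
  SELECT result, count(result) AS count
    FROM original_table
   GROUP BY result
  HAVING count(result) > 3;
COMMIT;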