PostgreSQL SQL Tricks II
SQL is a language in which one task can be solved in multiple ways with different efficiency. On this page you can see effective uses of PostgreSQL's possibilities.
Subscript shift to zero
Shifting an array's subscripts so that they start at zero is simple to implement in C, but a little slow in PL/pgSQL (when based on iterated assignment). However, we can use the SQL parser's ability to recognize an array range in the input literal:
CREATE OR REPLACE FUNCTION shift_idx(anyarray)
RETURNS anyarray AS $$
DECLARE
  mi int := array_upper($1,1);
  lb int := array_lower($1,1);  -- renamed from "offset", which is a reserved word
BEGIN
  RETURN '[0:' || mi - lb || '] = ' || $1::varchar;
END
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
CREATE FUNCTION

postgres=# SELECT shift_idx(array[1,2,3]);
   shift_idx
---------------
 [0:2]={1,2,3}
(1 row)
Author: Pavel Stěhule
Record evolution
When we have functions that return a record but cannot use derived tables, we must use the following syntax:
CREATE FUNCTION foo(OUT a int, OUT b int, IN c int)
RETURNS record AS $$
BEGIN
  a := c + 1;
  b := c + 2;
  RETURN;
END;
$$ LANGUAGE plpgsql;

CREATE TABLE t(c int);
INSERT INTO t VALUES(10),(20);

postgres=# SELECT c, foo(c) FROM t;
 c  |   foo
----+---------
 10 | (11,12)
 20 | (21,22)
(2 rows)

postgres=# SELECT c, (foo(c)).* FROM t;
 c  | a  | b
----+----+----
 10 | 11 | 12
 20 | 21 | 22
(2 rows)
Conversion between hex and dec numbers
source: http://www.varlena.com/GeneralBits/104.php.
postgres=# SELECT to_hex(11);
 to_hex
--------
 b
(1 row)

postgres=# CREATE OR REPLACE FUNCTION to_dec(text)
RETURNS integer AS $$
DECLARE r int;
BEGIN
  EXECUTE E'SELECT x\''||$1|| E'\'::integer' INTO r;
  RETURN r;
END
$$ LANGUAGE plpgsql;
CREATE FUNCTION

postgres=# SELECT to_dec('ff');
 to_dec
--------
 255
(1 row)
Faster variant (Michael Glaesemann):
CREATE FUNCTION hex2dec(in_hex TEXT)
RETURNS INT
IMMUTABLE STRICT LANGUAGE sql AS $body$
  -- note: bit(8) limits the input to two hex digits
  SELECT CAST(CAST(('x' || CAST($1 AS text)) AS bit(8)) AS INT);
$body$;
A variant that is able to handle input of arbitrary length (output as varchar; adapt if needed):
CREATE OR REPLACE FUNCTION hex2dec(varchar)
RETURNS varchar AS $$
DECLARE
  _x numeric;
  _count int;
  _digit int;
BEGIN
  _x := 0;
  FOR _count IN 1..length($1) LOOP
    EXECUTE E'SELECT x\''||substring($1 FROM _count FOR 1)|| E'\'::integer' INTO _digit;
    _x := _x * 16 + _digit;
  END LOOP;
  RETURN _x::varchar;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Fast interval (of time or ip addresses) searching with spatial indexes
Searching intervals is usually slow because the optimizer doesn't use an index; the reason lies in the dependency between the start and end columns. One solution is based on spatial indexes, which allow working with two dependent values as if they were a single value:
postgres=# EXPLAIN ANALYZE SELECT * FROM testip WHERE 19999999 BETWEEN startip AND endip;
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on testip  (cost=0.00..19902.00 rows=200814 width=12) (actual time=3.457..434.218 rows=1 loops=1)
   Filter: ((19999999 >= startip) AND (19999999 <= endip))
 Total runtime: 434.299 ms
(3 rows)

Time: 435.865 ms

postgres=# CREATE INDEX ggg ON testip USING gist
             ((box(point(startip,startip),point(endip,endip))) box_ops);
CREATE INDEX
Time: 75530.079 ms

postgres=# EXPLAIN ANALYZE SELECT * FROM testip
             WHERE box(point(startip,startip),point(endip,endip))
                @> box(point(19999999,19999999), point(19999999,19999999));
                           QUERY PLAN
----------------------------------------------------------------
 Bitmap Heap Scan on testip  (cost=60.50..2550.14 rows=1000 width=12) (actual time=0.169..0.172 rows=1 loops=1)
   Recheck Cond: (box(point((startip)::double precision, (startip)::double precision), point((endip)::double precision, (endip)::double precision)) @> '(19999999,19999999),(19999999,19999999)'::box)
   ->  Bitmap Index Scan on ggg  (cost=0.00..60.25 rows=1000 width=0) (actual time=0.152..0.152 rows=1 loops=1)
         Index Cond: (box(point((startip)::double precision, (startip)::double precision), point((endip)::double precision, (endip)::double precision)) @> '(19999999,19999999),(19999999,19999999)'::box)
 Total runtime: 0.285 ms
(5 rows)

Time: 2.805 ms
Short form for inserting ROW variables into a table
postgres=# CREATE TABLE foo(a integer, b integer);
CREATE TABLE
postgres=# CREATE OR REPLACE FUNCTION fx()
RETURNS void AS $$
DECLARE r foo;
BEGIN
  SELECT INTO r * FROM foo;
  INSERT INTO foo VALUES(r.*);
  RETURN;
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION
postgres=# SELECT fx();
 fx
----

(1 row)
ALTER TABLE ALTER COLUMN USING
I hadn't known about the USING clause in ALTER TABLE ... ALTER COLUMN. I thought the old type had to be castable to the new type, but that's not true. Czech boolean literals cannot be cast directly to PostgreSQL's bool type, but with the USING clause I can do an ALTER COLUMN TYPE from varchar to bool:
postgres=# CREATE TABLE foo(a varchar);
CREATE TABLE
postgres=# INSERT INTO foo VALUES ('ano'); -- Czech "yes"
INSERT 0 1
postgres=# ALTER TABLE foo ALTER COLUMN a TYPE boolean;
ERROR:  column "a" cannot be cast to type "pg_catalog.bool"
postgres=# ALTER TABLE foo ALTER COLUMN a TYPE boolean
             USING CASE a WHEN 'ano' THEN true ELSE false END;
ALTER TABLE
postgres=# SELECT * FROM foo;
 a
---
 t
(1 row)
Quote_ident for schema.name
Quoting identifiers is one protection against SQL injection. The function quote_ident checks its argument and, when the argument contains problematic characters, it returns the argument surrounded by double quotes. It's quite easy and useful. The problem is with a schema.name pair, because the dot is a significant character:
postgres=# SELECT quote_ident('public.foo');
 quote_ident
--------------
 "public.foo"
(1 row)

postgres=# SELECT * FROM "public.foo";
ERROR:  relation "public.foo" does not exist
postgres=# SELECT * FROM public."foo";
 a
---
(0 rows)

postgres=# SELECT * FROM "public"."foo";
 a
---
(0 rows)
We can get around this limitation with custom functions:
CREATE OR REPLACE FUNCTION quote_array(text[])
RETURNS text AS $$
  SELECT array_to_string(ARRAY(SELECT quote_ident($1[i])
                                 FROM generate_series(1, array_upper($1,1)) g(i)),
                         '.')
$$ LANGUAGE SQL IMMUTABLE;

CREATE OR REPLACE FUNCTION quote_schema_ident(text)
RETURNS text AS $$
  SELECT quote_array(string_to_array($1,'.'))
$$ LANGUAGE SQL IMMUTABLE;

postgres=# SELECT quote_schema_ident('public.foo tab');
 quote_schema_ident
--------------------
 public."foo tab"
(1 row)
Author: Pavel Stehule.
Fast searching of the longest prefix
Searching for the longest prefix is a common task, mainly in communications: we search a set of prefixes for the longest prefix of some number. I solved this task some years ago with a stored procedure and a modified interval-halving method. A maybe slightly slower, but simple and robust, method was described by Hubert Lubaczewski (Depesz).
-- the table contains about 5000 prefixes
postgres=# SELECT * FROM prefixesx LIMIT 5;
   costcode_name    | costcode
--------------------+----------
 Afghanistan        | 93
 Afghanistan Mobile | 9370
 Afghanistan Mobile | 9379
 Afghanistan Mobile | 9380
 Alaska (USA)       | 1907
(5 rows)
A simple but slow solution with LIKE:
postgres=# SELECT * FROM prefixesx
             WHERE '420724181000' LIKE costcode || '%'
             ORDER BY length(costcode) DESC
             LIMIT 1;
          costcode_name          | costcode
---------------------------------+----------
 Czech Republic Mobile - EuroTel | 42072
A fast solution is to generate all possible prefixes of the number and search for those prefixes:
CREATE OR REPLACE FUNCTION prefixes(varchar)
RETURNS varchar[] AS $$
  SELECT ARRAY(SELECT substring($1 FROM 1 FOR i)
                 FROM generate_series(1, length($1)) g(i))::varchar[];
$$ LANGUAGE sql IMMUTABLE;
and use the query:
postgres=# SELECT * FROM prefixesx
             WHERE costcode = ANY (prefixes('420724191000'))
             ORDER BY length(costcode) DESC
             LIMIT 1;
          costcode_name          | costcode
---------------------------------+----------
 Czech Republic Mobile - EuroTel | 42072
On 5000 rows this query is about 4 times faster than LIKE.
A little bit slower is the LIKE query combined with a first-character filter on costcode:
SELECT * FROM prefixesx
 WHERE '420724181000' LIKE costcode || '%'
   AND costcode LIKE substring('420724181000' FROM 1 FOR 1) || '%'
 ORDER BY length(costcode) DESC
 LIMIT 1;
Get the count of substrings in a string
This function is based on a popular trick: using the replace and length functions without a loop, which is typical for interpreted languages, where built-in functions are faster than iteration:
CREATE OR REPLACE FUNCTION CountInString(text, text)
RETURNS integer AS $$
  SELECT (length($1) - length(replace($1, $2, ''))) / length($2);
$$ LANGUAGE SQL IMMUTABLE;
Sent by Rodrigo E. De León Plicet. Another solution of mine, based on arrays:
CREATE OR REPLACE FUNCTION CountInString(text, text)
RETURNS integer AS $$
  SELECT array_upper(string_to_array($1, $2), 1) - 1;
$$ LANGUAGE SQL IMMUTABLE;
Timestamp recalculation for different timezone and its indexing
The simplest way to express a time in some time zone is the AT TIME ZONE operator. We can't index a timestamp with time zone directly, because the current time zone setting can change; we can, however, create an index on the timestamp evaluated at one fixed time zone. Source: pgsql-general, Tom Lane.
CREATE INDEX created_tz_paris ON data((created_on AT TIME ZONE 'Europe/Paris'));
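A query can then use that index as long as it repeats the same expression. A minimal sketch (the table `data` and the date range are hypothetical, following the index definition above):

```sql
-- uses the created_tz_paris functional index when the expression matches exactly
SELECT *
  FROM data
 WHERE created_on AT TIME ZONE 'Europe/Paris'
       BETWEEN '2008-01-01 00:00:00' AND '2008-01-31 23:59:59';
```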
Using IMMUTABLE functions as hints for the optimizer
For highly specific data distributions, the planner's estimates can be wrong and the generated execution plans suboptimal. We can solve these problems by breaking one SQL query into several dependent queries and evaluating them step by step: each subsequent query then works with real results (not estimates) of the preceding ones, and the queries are executed optimally. Tom Lane's proposal is based on IMMUTABLE functions, which are evaluated before the execution plan is generated (note that labeling a table-reading function IMMUTABLE is a deliberate lie to the planner, which is exactly what makes the trick work):
CREATE OR REPLACE FUNCTION getTypesLieuFromTheme(codeTheme text)
RETURNS text[] AS $f$
  SELECT ARRAY(SELECT codetylieu::text FROM rubtylieu WHERE codeth = $1);
$f$ LANGUAGE SQL IMMUTABLE;

Used in a query:

SELECT ... WHERE ... AND gl.codetylieu = ANY(getTypesLieuFromTheme('RES'))

Look at the execution plan: it contains neither the function call nor the query on the table rubtylieu:

 ->  Bitmap Heap Scan on genrelieu gl  (cost=2.23..6.22 rows=88 width=4) (actual time=0.022..0.075 rows=88 loops=1)
       Recheck Cond: ((codetylieu)::text = ANY ('{RES}'::text[]))
       ->  Bitmap Index Scan on ind_genrelieu2  (cost=0.00..2.21 rows=88 width=0) (actual time=0.016..0.016 rows=88 loops=1)
             Index Cond: ((codetylieu)::text = ANY ('{RES}'::text[]))
Use this trick only as a last resort. Please test first whether increasing the statistics target helps.
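Raising the statistics target can be done per column. A minimal sketch, reusing the rubtylieu table from the example above (the target value 1000 is arbitrary):

```sql
-- collect more detailed statistics for the filtered column, then re-analyze
ALTER TABLE rubtylieu ALTER COLUMN codeth SET STATISTICS 1000;
ANALYZE rubtylieu;
```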
Autonomous transaction in PostgreSQL
Autonomous transactions are a usable feature, mainly for logging. This feature isn't supported in PostgreSQL yet, but with untrusted languages we are able to emulate it. Below is a solution by Jon Roberts (using dblink):
CREATE OR REPLACE FUNCTION fn_log_error(p_function varchar, p_location int, p_error varchar)
RETURNS void AS $$
DECLARE
  v_sql varchar;
  v_return varchar;
  v_error varchar;
BEGIN
  PERFORM dblink_connect('connection_name', 'dbname=...');
  v_sql := 'INSERT INTO error_log (function_name, location, error_message, error_time) '
        || 'VALUES (''' || p_function || ''', ' || p_location || ', '''
        || p_error || ''', clock_timestamp())';
  SELECT INTO v_return * FROM dblink_exec('connection_name', v_sql, false);
  -- get the error message
  SELECT INTO v_error * FROM dblink_error_message('connection_name');
  IF position('ERROR' in v_error) > 0 OR position('WARNING' in v_error) > 0 THEN
    RAISE EXCEPTION '%', v_error;
  END IF;
  PERFORM dblink_disconnect('connection_name');
EXCEPTION
  WHEN others THEN
    PERFORM dblink_disconnect('connection_name');
    RAISE EXCEPTION '(%)', SQLERRM;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
Optimization of repetitive assignment to varchar variable with arrays
plpgsql isn't a good language for intensive non-SQL operations. It's best as glue for SQL statements or for prototyping; sometimes plperl or plpython can be significantly faster. plpgsql particularly dislikes cumulative iteration over varchar or array variables. When we can't use Perl, we can often use plain SQL instead:
-- slow function, usable for N <= 100
CREATE OR REPLACE FUNCTION SlowList(int)
RETURNS varchar AS $$
DECLARE s varchar = '';
BEGIN
  FOR i IN 1..$1 LOOP
    s := s || '<item>' || i || '</item>';  -- cumulative s := s || ... is slow
  END LOOP;
  RETURN s;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- fast function
CREATE OR REPLACE FUNCTION FastList(int)
RETURNS varchar AS $$
BEGIN
  RETURN array_to_string(ARRAY(SELECT '<item>' || i || '</item>'
                                 FROM generate_series(1, $1) g(i)), '');
END;
$$ LANGUAGE plpgsql IMMUTABLE;
For N < 100 the execution time is under 4 ms (so there's no reason for the trick). For bigger N the execution times diverge (SlowList(10000): 4000 ms, FastList(10000): 52 ms). Author: Pavel Stěhule.
Use IS DISTINCT FROM without COALESCE
Sim Zacks reported slow queries after a migration to 8.2.4. This version has a problem with estimating the result of the expression COALESCE(column, false) = false, so the execution plan is suboptimal. After rewriting the expression, PostgreSQL chooses the optimal execution plan.
SELECT ... FROM some_tab
  LEFT JOIN some_tab2 ON some_tab2.col IS DISTINCT FROM true; -- joins false or NULL
So don't use COALESCE in the WHERE clause.
A week later somebody reported a similar problem:
-- execution plan is suboptimal
((COALESCE((at_type)::integer, 1) = 1) AND (COALESCE(at_language, 0::numeric) = 0::numeric))

-- execution plan is optimal
(at_type = 1 OR at_type IS NULL) AND (at_language = 0 OR at_language IS NULL)
Port to Oracle, problems with to_date function
Peter Eisentraut reported an incompatibility in the to_date function between PostgreSQL and Oracle.
SQL> select to_date('31-DEC-200700:00:00', 'dd-mon-yyyy hh24:mi:ss') from dual;

TO_DATE('
---------
31-DEC-07

On PostgreSQL:

select to_date('31-DEC-200700:00:00', 'dd-mon-yyyy hh24:mi:ss');
   to_date
--------------
 200700-12-31
Oracle ignores spaces in the format string, but PostgreSQL requires an exact format. The Oracle behavior can be emulated with a custom function that drops the spaces (author: Jon Roberts):
CREATE OR REPLACE FUNCTION fn_to_date(p_date varchar, p_format varchar)
RETURNS timestamp AS $$
  SELECT to_timestamp(replace($1, ' ', ''), replace($2, ' ', ''))::timestamp;
$$ LANGUAGE SQL STRICT IMMUTABLE;

postgres=# SELECT fn_to_date('31-DEC-200700:00:00', 'dd-mon-yyyy hh24:mi:ss');
     fn_to_date
---------------------
 2007-12-31 00:00:00
(1 row)
Identification and correction of wrong chars in PostgreSQL dump
8.2 is stricter than 8.1 about valid characters in a dump. The reason is a SQL-injection security fix (attacks via invalid UTF-8 characters). This change of behavior can cause problems during migration. The fix is simple: find the wrong character and correct it.
$ iconv -f utf-8 -t utf-8 dump.sql > /dev/null
iconv: illegal input sequence at position 1000

$ head -c 1010 dump.sql | tail -c 20 | od -c
0000000   h   .   B   u   t       i   s   n 222   t       i   t
0000020   h   i   s   ?
0000024
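When the broken characters aren't worth fixing by hand, iconv's -c option simply drops every illegal sequence. A possible cleanup step, demonstrated here on a small sample file (the file names are examples; with a real dump you lose the dropped bytes, so inspect the diff afterwards):

```shell
# create a sample file containing one invalid UTF-8 byte (octal 222)
printf 'ok\222ok\n' > dump.sql
# -c drops all byte sequences that are not valid UTF-8
iconv -f utf-8 -t utf-8 -c dump.sql > dump-clean.sql
cat dump-clean.sql
```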
Source: http://lca2007.linux.org.au/att_data/Miniconfs(2f)PostgreSQL/attachments/professional.pdf
Be careful when using the NOT IN operator with NULL
Some SQL developers dislike NULL; others like it. I like it too. NULL is necessary and saves a lot of work, and a NULL in a result expression clearly signals some problem in the input data. People who dislike NULL often don't understand it and cite the following "anomaly" as an argument:
postgres=# SELECT * FROM (VALUES(10),(20),(30)) v(a) WHERE a IN (10, 20, NULL);
 a
----
 10
 20
(2 rows)
-- all is ok

postgres=# SELECT * FROM (VALUES(10),(20),(30)) v(a) WHERE a NOT IN (10, 20, NULL);
 a
---
(0 rows)
-- "anomaly": 30 was expected
But it isn't an anomaly. The NOT IN operator checks each item of a list, because the left-hand value must be different than all of them. Since we can never have a comparison with NULL that results in anything but NULL, no value can fulfill this predicate.
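When the list can contain NULL and we really do want the "30" answer, we can filter the NULLs out or use NOT EXISTS. A sketch of both NULL-safe variants (the VALUES lists stand in for a real subquery):

```sql
-- variant 1: exclude NULLs from the list explicitly
SELECT * FROM (VALUES(10),(20),(30)) v(a)
 WHERE a NOT IN (SELECT x
                   FROM (VALUES(10),(20),(NULL)) s(x)
                  WHERE x IS NOT NULL);

-- variant 2: NOT EXISTS treats a NULL comparison as "no match"
SELECT * FROM (VALUES(10),(20),(30)) v(a)
 WHERE NOT EXISTS (SELECT 1
                     FROM (VALUES(10),(20),(NULL)) s(x)
                    WHERE s.x = v.a);
```

Both queries return the row with 30.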
Fast listing of the first characters of some table field
My task is specific: I have an address book with 100,000 items and I have to get all first characters of the surnames. This is one of the few tasks where a correlated subquery is faster than anything else. A seq scan plus an aggregate function must scan all 100,000 records, but the correlated subquery needs only 32 probes of a functional index:
SELECT *
  FROM (SELECT substring('abcdefghijklmnopqrstuvwxyzěščřžýáíé' FROM i FOR 1)
          FROM generate_series(1,32) g(i)) p(onechar)
 WHERE EXISTS(SELECT 1
                FROM address_book
               WHERE substring(surname FROM 1 FOR 1) = p.onechar
                 AND active);
A version supporting the Czech digraph CH (and some other characters: úůňľó):
SELECT *
  FROM (SELECT substring('abcdefghijklmnopqrstuvwxyzěščřžýáíéúůňľó' FROM i FOR 1)
          FROM generate_series(1,40) g(i)
        UNION
        SELECT 'ch') p(onechar)
 WHERE EXISTS(SELECT 1
                FROM a_osoba
               WHERE lower(substring(prijmeni FROM 1 FOR char_length(p.onechar))) = p.onechar);
An employee with the highest compensation
It's a typical school query: select the highest-paid employee in each group. This query can be solved with a derived table or with a correlated subquery. In PostgreSQL we can also use the nonstandard DISTINCT ON clause:
postgres=# SELECT * FROM employees;
   name    |  surname  | department | payment
-----------+-----------+------------+---------
 Pavel     | Stehule   |          1 |   10000
 Zdenek    | Stehule   |          1 |    9000
 Vladimira | Stehulova |          2 |    9000
(3 rows)

postgres=# SELECT DISTINCT ON (department) *
             FROM employees
            ORDER BY department, payment DESC;
   name    |  surname  | department | payment
-----------+-----------+------------+---------
 Pavel     | Stehule   |          1 |   10000
 Vladimira | Stehulova |          2 |    9000
(2 rows)
Attention: DISTINCT ON isn't portable. A big disadvantage of this solution is that it shows only one person from each department even when several employees share the same highest payment, so the result can be incomplete.
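A portable variant that also keeps ties is the classic correlated subquery over the same employees table:

```sql
-- returns every employee whose payment equals the department maximum
SELECT *
  FROM employees e
 WHERE payment = (SELECT max(payment)
                    FROM employees
                   WHERE department = e.department);
```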
Any other session variables
PostgreSQL doesn't support server-side session variables. This means we have to write stored procedures more often, because those do have variables. A partial substitute is custom configuration variables, which have to be enabled in the configuration. Sent by Andreas Kretschmer:
-- define in your postgresql.conf:
custom_variable_classes = 'myvar'

-- use within psql:
test=# set myvar.benutzer = 'foo';
SET
test=*# select * from foo;
 id | name
----+------
  1 | foo
  2 | bar
(2 rows)

test=*# select * from foo where name = current_setting('myvar.benutzer');
 id | name
----+------
  1 | foo
(1 row)

test=*# set myvar.benutzer = 'none';
SET
test=*# select * from foo where name = current_setting('myvar.benutzer');
 id | name
----+------
(0 rows)
xpath function indexing
8.3 has an integrated xpath function. There is a gap in XML support: the XML type isn't supported by GiST or GIN indexes, and the xpath function returns an array of xml values. But we can write a custom cast to an int array:
CREATE OR REPLACE FUNCTION xml_list_to_int(xml[])
RETURNS int[] AS $$
  SELECT ARRAY(SELECT to_number(($1[i])::text, '999999.99')::int
                 FROM generate_series(1, array_upper($1,1)) g(i))
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (xml[] AS int[]) WITH FUNCTION xml_list_to_int(xml[]);

-- integer arrays are supported by GIN
CREATE INDEX fx ON foo USING GIN((xpath('//id/text()', order_in_xml)::int[]));
psql together with less
psql has paging support, but it isn't comparable with the options offered by less. Here is an example of how to set up psql to use less as the pager. Author: Merlin Moncure.
# to profile
export PAGER=less
export LESS="-iMSx4 -FX"

# to .psqlrc
\timing
\pset pager always
Intersection of arrays
A very nice trick by David Fetter. It's based on the well-known technique of unpacking an array into rows (also used for general array sorting):
CREATE OR REPLACE FUNCTION array_intersect(ANYARRAY, ANYARRAY)
RETURNS ANYARRAY
LANGUAGE SQL AS $$
  SELECT ARRAY(
    SELECT $1[i] AS "the_intersection"
      FROM generate_series(array_lower($1,1), array_upper($1,1)) AS i
    INTERSECT
    SELECT $2[j] AS "the_intersection"
      FROM generate_series(array_lower($2,1), array_upper($2,1)) AS j
  );
$$;
An alternative and possibly faster way:
CREATE OR REPLACE FUNCTION array_intersect(ANYARRAY, ANYARRAY)
RETURNS ANYARRAY
LANGUAGE SQL AS $$
  SELECT ARRAY(SELECT UNNEST($1) INTERSECT SELECT UNNEST($2))
$$;
VOID function in SQL language
SQL functions can contain only SQL statements, and the result of the last statement is the function's result. The type of that result has to be compatible with the function's declared type. Because VOID is a fictive type, no statement can return it, with one exception: we can cast NULL to VOID.
Note: void-returning SQL functions cannot be called through the binary protocol.
CREATE OR REPLACE FUNCTION none()
RETURNS VOID AS $$
  SELECT NULL::VOID;
$$ LANGUAGE SQL;
Fast select of next row
For older applications (ISAM applications work on a series of rows) we have to be able to find the next row after some combination of values (a1, b1, c1). The traditional solution:
SELECT * FROM data
 WHERE (a > a1)
    OR (a = a1 AND b > b1)
    OR (a = a1 AND b = b1 AND c > c1)
 ORDER BY a, b, c
 LIMIT 1;
Merlin Moncure noted a simplified form that uses row comparison. This form is more readable and allows use of a multicolumn index:
SELECT * FROM data
 WHERE (a, b, c) > (a1, b1, c1)
 ORDER BY a, b, c
 LIMIT 1;
ISAM-style access degrades SQL server performance. Effective SQL operates over sets, but when we cannot rewrite the application, we have no choice.