Introduction to PostgreSQL SQL

Translated by Jiří Novák

PostgreSQL supports the complete repertoire of ANSI SQL:1992 statements and partly also ANSI SQL:2003. This article presents a simple introduction to SQL, including basic terminology and several pieces of advice. Examples of useful SQL statements can be found in the article SQL Triky (SQL Tricks).

SQL

Structured Query Language

SQL is nowadays the most widespread query (non-procedural) programming language, designed for the definition, maintenance and searching of data in relational databases. In 1970 Dr. E. F. Codd published his fundamental theoretical work on the relational model of data. His work became the basis of the research project called System/R, which took place in the second half of the seventies in the laboratories of IBM, and within which the SQL language was created. At the turn of the seventies and eighties the first commercial versions of today's RDBMSs from Oracle and IBM already existed.

In 1986 the first ANSI standard of the SQL language was created. The most recent ANSI standard is ANSI SQL:2003; its requirements are nevertheless implemented in only a few RDBMSs. So far the most widespread RDBMSs follow ANSI SQL:1999, or the older ANSI SQL:1992. PostgreSQL supports ANSI SQL:1999 completely and ANSI SQL:2003 partly.

Database normalization rules

The SQL language is intended for databases which use the so-called relational database model. In such a model, data are stored in tables with a clearly defined and relatively stable structure. One data entry corresponds to a row of one or several tables. A data attribute then corresponds to the intersection of a specific column and a specific row of a table. If we omit several unimportant exceptions, we can say that every single cell of a table stores one value, and that data in tables are stored with a specified format and type. By separating data into independent tables we try to prevent data redundancy. The following table can serve as an example of a badly designed one (it doesn't respect the rules of the so-called normal forms, listed below):

  • 0NF (zero normal form): A table is in the zero normal form if and only if there exists at least one field which contains more than one value.
  • 1NF (first normal form): A table is in the first normal form if and only if every field can hold only a simple data type (that is, values that are further indivisible).
  • 2NF (second normal form): A table is in the second normal form if and only if it is in the first normal form and, moreover, there exists a key and every non-key field is a function of the whole key (not only of its parts).
  • 3NF (third normal form): A table is in the third normal form if and only if no non-key attribute is transitively dependent on any key of the scheme; in other words, if it is in the second normal form and simultaneously there exists no dependency between the non-key columns of the table.
postgres=# select * from family;
    parent      |      child
----------------+----------------
 Jindřich Suchý | Jindřich Suchý
 Jindřich Suchý | Lenka Suchá
(2 rows)

On the contrary, the following can serve as an example of a well designed set of tables:

postgres=# SELECT * FROM parent;
 id |   name   | surname
----+----------+----------
  1 | Jindřich | Suchý
(1 row)

postgres=# SELECT * FROM child;
 id | parent |   name   | surname
----+--------+----------+---------
  1 |      1 | Lenka    | Suchá
  2 |      1 | Jindřich | Suchý
(2 rows)

postgres=# SELECT parent.*, child.* FROM parent JOIN child ON parent.id = child.parent;
 id |   name   | surname | id | parent |   name   | surname
----+----------+---------+----+--------+----------+---------
  1 | Jindřich | Suchý   |  1 |      1 | Lenka    | Suchá
  1 | Jindřich | Suchý   |  2 |      1 | Jindřich | Suchý
(2 rows)

By respecting 3NF we achieve the separation of data into independent tables. The number of table columns should typically range between 2 and 8. Under ordinary circumstances (99% of cases) all tables we work with should be in 3NF. Do not use wide tables: what you save on joining tables you will lose on slower reading of the table, because as the record length grows, the number of records per page decreases and the number of reads (accesses to the hard disk) increases. The fact that not all of the columns are displayed doesn't mean that they are not loaded from the hard disk.

Cardinality

The relational model of data allows us to work simply with tables whose mutual relation, which we call cardinality, is either 1:1 (to one entry in the first table corresponds one entry in the second table) or 1:n (to one entry in the first table correspond n (none or more) entries in the second table). In SQL systems this relation is explicitly determined by the equality of values of certain columns (keys) in the tables. When joining tables, we have to state this equality explicitly as a logical expression in the SQL statement:

SELECT * FROM parent JOIN child ON parent.id = child.parent_id;

The key from the independent table (parent) is called the primary key; the key from the dependent table is called the foreign key. A primary key has to be unique (no value may repeat). A foreign key has to stay consistent with the primary key (values may repeat, but no value may appear that is not present in the primary key). We expect SQL systems to be able to enforce these requirements; this is called the guarantee of referential integrity.
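
A minimal sketch of how the server enforces this, using the parent and child tables from the examples above (the inserted values are only illustrative):

INSERT INTO parent(id, name, surname) VALUES (1, 'Jindřich', 'Suchý');
INSERT INTO parent(id, name, surname) VALUES (1, 'Jan', 'Novák');       -- fails: duplicate primary key
INSERT INTO child(parent, name, surname) VALUES (99, 'Pavel', 'Novák'); -- fails: no parent with id 99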

Beyond the relations 1:1 and 1:n there exists one more relation, n:m. To express such a relation we need a so-called auxiliary table, which in common cases contains only two foreign keys; a sketch of its declaration follows the example query below. On the example of the relation between the entities AUTHOR and BOOK I will show all possible interpretations:

  • 1:1 - every author wrote one book and cannot write any other,
  • 1:n - every author can publish several books (the limit case being no book),
  • n:m - every author can publish several books, and several authors can participate in one book.
SELECT a.*, b.*
  FROM author a
    JOIN author_book ab ON a.id = ab.author_id
    JOIN book b ON ab.book_id = b.id;
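
The auxiliary table itself can be declared, for example, as follows (a sketch, assuming the author and book tables from the query above; the composite primary key prevents duplicate author-book couples):

CREATE TABLE author_book (
  author_id integer NOT NULL REFERENCES author(id),
  book_id integer NOT NULL REFERENCES book(id),
  PRIMARY KEY (author_id, book_id)
);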

DDL

The SQL language consists of several statements divided into two basic groups: statements for the definition of data structures (Data Definition Language) and statements for working with the content of tables (Data Manipulation Language).

DDL has a declarative character and allows us to create, modify and remove database objects such as tables, views, indexes, name spaces, table spaces, databases, stored procedures and functions, triggers and users. Objects are created by the statement CREATE and deleted by the statement DROP. The statement GRANT adds access privileges and the statement REVOKE removes them. An object's properties are modified by the statement ALTER. Every SQL statement consists of several optional parts. Nevertheless, we have to respect the logical relations between those parts and write them in the right order; e.g. the WHERE clause has to be written after the FROM clause and before the clauses GROUP BY and ORDER BY.

CREATE TABLE child (
  id SERIAL PRIMARY KEY,
  parent INTEGER NOT NULL REFERENCES parent(id),
  name VARCHAR(15) NOT NULL CHECK (name <> ''),
  surname VARCHAR(15) NOT NULL CHECK (surname <> '')
);

ALTER TABLE child ADD COLUMN age INTEGER NOT NULL CHECK (age > 0);
GRANT SELECT ON TABLE child TO public;

DML

DML statements serve for manipulating the data in the database. The ANSI SQL:1999 standard defines the statements INSERT, UPDATE, DELETE and SELECT. In ANSI SQL:2003 this essential set is extended with the statement MERGE (not implemented in PostgreSQL).

  • The statement INSERT is used for adding new entries into a table.
  • The statement UPDATE is used for updating existing entries in a table.
  • The statement DELETE is used for removing entries from a table.
  • The statement SELECT is used for displaying and searching entries in a table.
INSERT INTO child(parent,name,surname,age) VALUES(1,'Jindřich','Suchý',12);
UPDATE child SET age = 13 WHERE id = 2;
DELETE FROM child WHERE age > 26;
SELECT * FROM child WHERE parent = 1;

The most frequently used DML statement is probably SELECT. It consists of the following obligatory and optional clauses (when writing this statement it's necessary to respect the order of the clauses):

SELECT <list of displayed columns> 
FROM <source table or tables>
WHERE <selective criteria>
[GROUP BY <criteria for associating of data records>]
[HAVING <criteria for filtering of associated records>]
[ORDER BY <data sorting method>]

As an example, here is a query which displays all parents that have more than one child:

SELECT parent.* FROM parent 
  WHERE id IN (SELECT parent FROM child GROUP BY parent HAVING count(*) > 1)

Transactions

Transactions have an irreplaceable role in every database system. They secure the consistency of data in the presence of failures of hardware, the operating system, the application or the database system. A transaction joins a set of statements into one unit. The changes in data caused by the individual statements become permanent only if every single statement of the transaction executes successfully. The first unsuccessful statement interrupts the transaction and revokes all of its data changes. A typical example is the transfer of a specific amount from one account to another:

BEGIN;
UPDATE accounts SET amount = amount + 100 WHERE id = 122;
UPDATE accounts SET amount = amount - 100 WHERE id = 133;
COMMIT;

If we did not wrap these two SQL statements in a transaction, it could happen (in an extreme case) that user 122 would receive the amount of 100 on his account without it being subtracted from user 133's account. The second thing a transaction does is isolation of the users. Unclosed transactions are isolated from other users: they still see the original content. Until the transaction is confirmed they won't see the updated values; the update of all values then becomes visible at one moment. Thanks to that, another transaction can never see the data in an inconsistent state, for example with the total amount over all accounts higher by 100. This level of isolation is called READ COMMITTED, and in PostgreSQL it is the default for all transactions.

Isolation of transactions should ensure the subjective impression that the user works with the database alone. So, for example, if we repeat some SQL query without modifying the data, we expect the same result. This presumption cannot be guaranteed at the level READ COMMITTED: anybody can add a row to the table and COMMIT during our transaction. Therefore the level SERIALIZABLE exists. At this level the user sees only those changes in data which were confirmed before the first SELECT statement of the transaction. Thanks to this mechanism, repeated SQL queries return the same values. The fact that the application sees the same data doesn't mean that they were not changed. If it tries to modify records which were changed by a concurrent transaction, it will wait until the concurrent transaction finishes. If that transaction ends with a rollback, the current transaction continues normally. But if it, on the contrary, commits its changes, the current transaction ends with the following error message:

ERROR:  could not serialize access due to concurrent update
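
Switching to this level is simple; a sketch of the transfer example run under SERIALIZABLE:

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET amount = amount + 100 WHERE id = 122;
UPDATE accounts SET amount = amount - 100 WHERE id = 133;
COMMIT;  -- may fail with the error above; the application should then retry the transaction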

SQL/PSM

SQL was designed as a simple non-procedural language serving primarily for storing data into a database and retrieving data from it. In the course of time it was found to be too simple: we often have to write additional external applications in order to implement the necessary operations over the data. It was also found that in complicated enterprise applications we are often not able to guarantee referential and domain integrity using the simple built-in resources. External applications in the client-server architecture have one principal disadvantage: they require relatively intensive communication over the network between client and server. At the moment when you work with several thousands of records, this can cause a performance problem. It can be solved by transferring part of the so-called business logic onto the database server. For this reason SQL was extended with several basic constructions known from classical programming languages: variables, loops, conditions, procedures and functions, and trapping of exceptions. Using these constructions we are able to implement separate tasks whose code is logically stored in the database and executed within the database context.

Stored procedures serve particularly for:

  • Generating outputs which could not be simply or effectively implemented by the statement SELECT.
  • Transferring data processing from client to server.
  • Implementation of more complicated access rules and of referential and domain integrity which could not be enforced by standard resources.
  • Extension of the fundamental set of built-in functions with the functions of other RDBMSs, so that porting a database becomes easier.

ANSI SQL:1999 defines these language extensions in the chapter PSM (Persistent Stored Modules). This syntax is used by DB2 and MySQL 5. Similar are also Transact SQL from Microsoft and Procedural SQL of the RDBMS Firebird. Oracle differs and goes its own way with the PL/SQL language: a procedural language based on the programming language Ada and extended with SQL. PostgreSQL took over this language and modified it. Oracle implemented PL/SQL in Oracle 6 in 1988; in PostgreSQL, PL/pgSQL appeared in version 6.3 ten years later. By 2005 PL/pgSQL was already a full-fledged procedural programming language.

The integration of PL/pgSQL into the system is complete: it shares data types and functions, and conversely we can use functions created in PL/pgSQL inside SQL statements. In PL/pgSQL we are even able to create functions whose result is a table.

CREATE OR REPLACE FUNCTION init(IN _from integer, _to integer) RETURNS void AS $$
DECLARE _r integer = 1;
BEGIN
  FOR _i IN _from .. _to LOOP
    _r := _r + _r;
    INSERT INTO data(value) VALUES (_r);
  END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT init(1, 10);

Using the prefix _ for local variables is a convention preventing collisions between variable names and the names of table columns, or with SQL keywords.

PostgreSQL doesn't offer any simple resource for numbering output lines. A function which performs pagination and adds the row number could look like the following:

CREATE OR REPLACE FUNCTION page_of_the_phone_book(
  IN _page integer, IN _chpp integer, 
  OUT _rn integer,
  OUT _name varchar(20), OUT _surname varchar(20), OUT _telephone numeric(9,0) 
) RETURNS SETOF RECORD AS $$
DECLARE 
  _r phone_book%ROWTYPE;  -- row variable; _rn is already declared as an OUT parameter
BEGIN 
  _rn := (_page - 1) * _chpp;
  FOR _r IN SELECT * FROM phone_book LIMIT _chpp OFFSET (_page - 1) * _chpp LOOP
    _name := _r.name; _surname := _r.surname; _telephone := _r.telephone;
    _rn := _rn + 1;
    RETURN NEXT;
  END LOOP;
  RETURN;
END;
$$ LANGUAGE plpgsql; 
SELECT * FROM page_of_the_phone_book(2,30);

Let's suppose that we have numbered data in the database and, as far as possible, we don't want a "hole" in the sequence of numbers. We have two possibilities: either check all the data before the operation so that we can be certain the operation will succeed, or execute the operation tentatively and, only if it ends successfully, ask for the number from our sequence and update the numeric value of the inserted entry.

CREATE OR REPLACE FUNCTION new_entry(IN _v varchar) RETURNS void AS $$
DECLARE 
  _id integer; 
  _tmp_code integer; _true_code integer;
BEGIN
  /* I don't know the code of the purchase order yet; in the tentative attempt I replace it
     by a random number. This may raise a UNIQUE_VIOLATION exception; then I try again. */
  FOR i IN 1..20 LOOP /* maximum is 20 attempts */
    BEGIN
      SELECT INTO _tmp_code MAX(code) FROM data;
      _tmp_code := COALESCE(_tmp_code, 1) + 20 + CAST(random()*100 AS integer);
      INSERT INTO data(code, v) VALUES(_tmp_code, _v);
      _id := lastval(); _true_code := nextval('code');
      UPDATE data SET code = _true_code WHERE id = _id;
      RETURN;
    EXCEPTION
      WHEN unique_violation THEN 
        NULL; -- don't do anything
      WHEN others THEN
        RAISE EXCEPTION '%', SQLERRM;
    END;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

The benefit of using stored procedures is not only a lower load on the network, but also quicker and safer execution of SQL statements. All of the SQL statements are prepared in advance; that means that before their first execution an execution plan is built, and that plan is then reused on every subsequent call of the SQL statement. All of the prepared SQL statements are parameterized, so there is no danger of SQL injection.

CREATE FUNCTION safe_display(_par varchar) RETURNS SETOF varchar AS $$
DECLARE _v varchar;
BEGIN
  IF _par = '%' THEN
    RAISE EXCEPTION 'Use of the forbidden character';
  END IF;
  FOR _v IN SELECT surname FROM list WHERE surname LIKE _par LOOP
    RETURN NEXT _v;
  END LOOP;
  RETURN;
END;
$$ LANGUAGE plpgsql;
SELECT * FROM safe_display('P%');

In which respect is this procedure safer than a comparable SQL query composed on the client side? In that we can be sure that, no matter what the value of the argument _par is, only the following query will be executed:

SELECT surname FROM list WHERE surname LIKE _par;

On the client side it is no problem to pass as a parameter the value "%' UNION SELECT usename FROM pg_user". When the SQL query is composed, e.g. in PHP, the following string will be generated and processed:

SELECT surname FROM list WHERE surname LIKE '%' 
  UNION 
  SELECT usename FROM pg_user;

This query shows the list of database users. By using smarter injections an attacker can find out the structure of the database and subsequently read all data to which the client application has permission. You should at least try to limit the length of the string and to forbid the characters '"%.

In the RDBMS Oracle there are two very useful functions: next_day and last_day. If we are writing an application simultaneously for PostgreSQL, we either cannot use them (in PostgreSQL they are not implemented) or we have to write them ourselves:

CREATE OR REPLACE FUNCTION ERROR(IN msg VARCHAR) RETURNS varchar AS $$
BEGIN
  RAISE EXCEPTION '%', msg;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION last_day(IN d date) RETURNS date AS $$
BEGIN
  RETURN CAST(date_trunc('month', d + interval '1 month') AS date) - 1;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION next_day(IN _d date, IN _day varchar) RETURNS date AS $$
DECLARE _id integer; _dow integer;
BEGIN 
  _dow := EXTRACT(dow FROM _d);
  _id := CASE lower(_day)
      WHEN 'sunday'    THEN 0
      WHEN 'monday'    THEN 1
      WHEN 'tuesday'   THEN 2
      WHEN 'wednesday' THEN 3
      WHEN 'thursday'  THEN 4
      WHEN 'friday'    THEN 5
      WHEN 'saturday'  THEN 6
      ELSE
        CAST(ERROR(E'Wrong identifier for day \''||_day||E'\'') AS integer)
    END;
  RETURN CASE _id <= _dow
      WHEN true  THEN _d + (_id - _dow + 7)
      WHEN false THEN _d + (_id - _dow)
    END;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

Similarly we can extend PostgreSQL with functions supported by other RDBMSs.

Using stored procedures forces some changes in the application. It can have both a negative and a positive influence on the portability of the application. SP code is generally not portable (it depends on from where to where you want to port the application). On the other hand, the platform-dependent code is concentrated in one or a few modules. From my experience I would recommend using SPs: thanks to them the code gets simpler and, as a rule, the application runs far more rapidly.

Recommendations for designing databases

  • Do not use SQL keywords in the names of tables and columns.
  • Try to declare the columns which form the primary key at the beginning of the table.
  • If some data is not available, use the value NULL. Do not use 0 or an empty string.
  • Do not use prefixes (e.g. t_processes).
  • Do not use CamelCase identifiers (e.g. isoCode).
  • Names of columns should be apt, of reasonable length and written in lowercase.
  • Name tables in the plural and capitalize the first letter (write the other characters in lowercase).
  • Write keywords consistently, either always in lowercase or always in uppercase.
  • Use a pre-agreed set of suffixes (_id, _code, _date, _nbr, _name, _size, _tot, _cat, _class, _type).
  • Do not use the suffixes PK, FK.
  • Do not use the name of the table in the name of a column.
  • Do not use special symbols (the character "_" is an exception) or diacritics in the identifiers of tables and columns.
  • Always write a space or start a new line after a comma or semicolon. Do not begin a new line with a comma.
  • Indent: 1 space for the beginning of a clause, 2 spaces for the continuation of a clause, 3 spaces for the continuation of an expression or of a list.
  • Always begin a subquery on a new line. Keep the parentheses tight to the subquery.
  • Search for natural codes.
  • Keep together attributes which factually and chronologically relate to each other.
    • Separating data into tables FemalePersonel, MalePersonel, sale2001, sale2002, etc. is a fault.
    • If you have a date in three columns (day, month, year), it's a fault.
    • If you separate data needlessly into two rows, e.g. the start of an event and the end of the event, it's a fault.
  • If you are designing your own listing code, define a specific representation for NULL (0 Unknown, 1 Male, 2 Female, 9 Not applicable (doesn't make sense)).
  • Store codes in the database in special tables (see the sketch after this list). Using one multi-purpose table for all code classes is a fault.
  • If it's possible, use standardized constructions and functions.
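
A sketch of such a special code table (the table and column names are only illustrative, following the suffix recommendations above):

CREATE TABLE sex_codes (
  sex_code integer PRIMARY KEY,   -- 0, 1, 2, 9 as in the listing code above
  sex_name varchar(15) NOT NULL
);
INSERT INTO sex_codes VALUES (0, 'Unknown');
INSERT INTO sex_codes VALUES (1, 'Male');
INSERT INTO sex_codes VALUES (2, 'Female');
INSERT INTO sex_codes VALUES (9, 'Not applicable');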

An example of a suitably formatted SQL subquery:

SELECT DISTINCT pilot 
  FROM PilotSkills AS PS1
 WHERE NOT EXISTS
       (SELECT *
          FROM Hangar
         WHERE NOT EXISTS
               (SELECT * 
                  FROM PilotSkills AS PS2
                 WHERE PS1.pilot = PS2.pilot 
                   AND PS2.plane = Hangar.plane
               )
       );       

An example of a table join:

SELECT O1.order_nbr, ..
  FROM Orders AS O1
       INNER JOIN 
       OrderDetails AS D1
       ON O1.order_nbr = D1.order_nbr
 WHERE D1.dept = 'mens wear';

An example of a CASE construction:

CASE WHEN foo = 1 THEN 'bar'
     WHEN foo IS NULL THEN 'no_bar'
     ELSE NULL END

Nested subqueries and derived tables

SQL allows us to insert a query recursively into another query. If the query is inserted into the WHERE clause, we call it a nested subquery. If it is inserted into the FROM clause, we call it a derived table or, in some cases, a temporary view. A subquery can be connected with the outer query through attribute values; then it's called a constrained or correlated subquery.

An example of a nested query (searching for the Finnish cities which are more populated than every city in the Czech Republic):

SELECT cities FROM database_of_cities 
  WHERE number_of_inhabitants > ALL 
    (SELECT number_of_inhabitants FROM database_of_cities WHERE country = 'CZ')
    AND country = 'FI';

An example of a correlated subquery (displaying the employees with the highest salary within the individual departments):

SELECT surname, department, salary FROM employees ea
  WHERE salary = 
    (SELECT MAX(salary) FROM employees eb WHERE eb.department = ea.department);

PostgreSQL also supports multi-column predicates:

SELECT surname, salary, department FROM employees
  WHERE (salary, department) IN 
    (SELECT MAX(salary), department FROM employees GROUP BY department);

An example of a derived table (again displaying the employees with the highest salary within the individual departments):

SELECT surname, dt.ms AS salary, dt.department FROM employees e1,
  (SELECT MAX(salary) AS ms, department FROM employees GROUP BY department) dt
  WHERE e1.salary = dt.ms AND e1.department = dt.department;

Some nested queries can be written in several differently efficient ways. A typical example is the replacement of "> ALL" by an aggregate function in the subquery (the second query is more suitable):

SELECT * FROM a WHERE a > ALL(SELECT v FROM b);
SELECT * FROM a WHERE a > (SELECT MAX(v) FROM b);

Inner and outer table joins

There exist two principal ways to write a join of tables. The older alternative, based on listing the tables in the FROM clause, you should rather forget. Firstly, it supports only the inner join, that is, it displays only the matching entries; and secondly, thanks to its imperfect notation, it can be the source of serious errors with an impact on server performance. Only a few years ago this variant was justified by the fact that not all RDBMSs had the JOIN construction implemented. Now they all have...

Find a mistake in the following example:

SELECT DISTINCT a.* FROM a, b, c 
  WHERE a.id = b.pid;

You should see the error at first glance. On the other hand, it's very difficult to find in a complicated query. The use of the phrase DISTINCT generally warns us of such an error: here table c is neither joined nor omitted from the list of sources. Because of that, the number of rows of the result is multiplied by the number of rows of table c, and the duplicates are then removed by DISTINCT. If this table is small, the performance loss will be minimal, but it grows with the table size. This is how errors arise that start causing difficulties only after several years of the system's operation. Those difficulties usually come at a time when nobody who knows anything about the system is still in the company, and the system then slows down dramatically, at irregular intervals.

Joining tables with the JOIN phrase is now supported by all the widespread SQL databases, so there is no reason to keep using the older variant. The way of writing it is simple:

SELECT .. FROM a [LEFT|RIGHT|CROSS|FULL] JOIN b ON logical expression

Conceptually, the JOIN operation generates the Cartesian product of the tables a and b, and the logical expression selects the subset of this product which makes sense. Usually it is the equality of a primary and a foreign key. If we had, for example, tables of parents and children, the Cartesian product of those tables would assign to each person from the parents table every single child from the children table, which is logically nonsense. Valid are only those couples which satisfy the condition of equality of the primary key in the table parents and the foreign key in the table children.

SELECT * FROM parents JOIN children ON parents.id = children.parent_id

An inner join displays only the matching entries from both tables. An outer join displays either all entries from table a (LEFT JOIN) plus the matching entries from table b; or the matching entries from table a plus all entries from table b (RIGHT JOIN); or all the entries from both tables a and b (FULL JOIN). The result of an SQL query can only be a table, so if an outer join finds no matching record for one side, it fills that side with NULL values. This construction is often used as a search criterion. An example is the query for all parents who don't have any child:

SELECT * FROM parents LEFT JOIN children ON parents.id = children.parent_id
  WHERE children.id IS NULL;

The primary key from the table children will have a NULL value in the join only if the given parent has no child; otherwise, by the requirements on a primary key, it is always NOT NULL.

Composing queries

Using the operation UNION we can join the returned sets of entries. The variant UNION ALL returns all of the records, including possible duplicates. Since it doesn't perform the elimination of duplicates, it is evaluated faster. We can use it in connection with the IN predicate, which removes duplicates internally, or where we know that we are joining disjoint sets.

SELECT 'maximum' AS type, surname, dt.ms AS salary, dt.department FROM employees e1,
  (SELECT max(salary) AS ms, department FROM employees GROUP BY department) dt
  WHERE e1.salary = dt.ms AND e1.department = dt.department
UNION ALL
SELECT 'minimum' AS type, surname, dt.ms AS salary, dt.department FROM employees e1,
  (SELECT min(salary) AS ms, department FROM employees GROUP BY department) dt
  WHERE e1.salary = dt.ms AND e1.department = dt.department
ORDER BY department, type;

This query returns the maximum and minimum salaries and, thanks to the ORDER BY clause, lists the employees with the maximum and the minimum salary within each department next to each other. Beyond the operation UNION we can also use the operations INTERSECT and EXCEPT.
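
For completeness, a sketch of the two remaining set operations over the employees table from the example (the salary limit 30000 is only illustrative):

-- departments having employees both above and below the limit
SELECT department FROM employees WHERE salary >= 30000
INTERSECT
SELECT department FROM employees WHERE salary < 30000;

-- departments where nobody earns the limit or more
SELECT department FROM employees
EXCEPT
SELECT department FROM employees WHERE salary >= 30000;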

Using conditional expressions

Without conditional expressions life would be difficult. Until they became available, even trivial transformations of data had to be processed on the client side of the application, or special functions had to be written. An example: suppose we have a chronological sequence of values from which we would like to aggregate sums over 15-minute intervals. Using the CASE construction, this task is easy:

SELECT date_trunc('hour',time) + 
  CASE 
      WHEN EXTRACT(minute FROM time) BETWEEN  0 AND 14 THEN  0
      WHEN EXTRACT(minute FROM time) BETWEEN 15 AND 29 THEN 15
      WHEN EXTRACT(minute FROM time) BETWEEN 30 AND 44 THEN 30
      WHEN EXTRACT(minute FROM time) BETWEEN 45 AND 59 THEN 45
  END * interval '1 minute' AS ttime, 
  SUM(value) FROM data
    GROUP BY ttime ORDER BY ttime;
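
The same bucketing can be written more tersely with integer division of the minute (a sketch equivalent to the CASE above):

SELECT date_trunc('hour',time) + 
  (EXTRACT(minute FROM time)::int / 15) * interval '15 minutes' AS ttime, 
  SUM(value) FROM data
    GROUP BY ttime ORDER BY ttime;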

CASE

A conditional expression has two principal forms:

CASE expression WHEN value THEN value .. ELSE value END
CASE WHEN expression THEN value .. ELSE value END

All the branches are evaluated in sequence; the first true condition terminates the evaluation. If no branch is true and the ELSE branch is missing, the expression evaluates to NULL. The statement can be used for translating codes, as a protection against division by zero, etc.:

SELECT CASE WHEN is_man THEN 'Man' ELSE 'Woman' END AS sex FROM employees;
SELECT CASE WHEN attr <> 0 THEN 100/attr*100 ELSE NULL END FROM data;

COALESCE

For a long time the NULL constant was taboo: if it appears in an expression, the result of the whole expression is NULL, and moreover there were fears about the efficiency of its implementation. The use of substitute constants such as 0, -1 or an empty string spread. PostgreSQL stores a NULL value in 1 bit, so it takes far less space than a substitute constant: a varchar of zero length needs 4 bytes. So I wouldn't be afraid of this constant; the only place where it cannot appear is the primary key.

Nevertheless, we sometimes need to write an expression in which NULL can appear but which we don't want to evaluate to NULL. An example is putting a name and surname together into one string when the name is not obligatory. ANSI SQL and PostgreSQL solve this situation with the function COALESCE.

SELECT COALESCE(name||' '||surname, surname,'');

The result of this function is its first parameter that is not NULL.

LEAST and GREATEST

These two functions can serve similarly to the function COALESCE: they skip NULL values and select the minimal (LEAST) or maximal (GREATEST) value from the list of parameters. E.g. I need to filter out all the NULL values and the values smaller than some specific limit, but I want to keep the rows (they contain other data too, or the number of records is important for me):

  SELECT GREATEST(measurement, 0) FROM data;

Variations of the SELECT statement

In SQL it is common that one task can be solved in several different ways. In the following example I have a table of employees with 1000 records and a table of workplaces with 5 entries (a code table). I want to solve the classical task of finding the employees with the highest salary within the individual workplaces. The first variant uses a nested correlated subquery:

SELECT label, surname, salary FROM employees e, workplace w
  WHERE salary = (SELECT MAX(salary) FROM employees WHERE workplace = e.workplace)
  AND e.workplace = w.id;

This query requires an index on the column salary. Without it the execution of this query takes 1360 ms, with it only 54 ms. The other alternatives are based on joins and derived tables. In principle, from the performance aspect, they are the same solution: they don't require an index on the column salary and the processing time is approximately 18 ms.

SELECT label, surname, MAX FROM employees e JOIN
    (SELECT MAX(salary), workplace FROM employees GROUP BY workplace) w 
  ON w.max = e.salary AND e.workplace = w.workplace
    JOIN workplace w2 ON w2.id = e.workplace;

SELECT label, surname, max FROM employees e JOIN
  (SELECT MAX(salary), workplace, label FROM employees e JOIN workplace w 
     ON w.id = e.workplace GROUP BY workplace, label) s
  ON e.workplace = s.workplace AND e.salary = max;

Between PostgreSQL versions there are changes in the algorithm which searches for the optimal strategy of evaluating a query. I haven't been bitten by it so far, but I have to take into account that after upgrading to a newer version I may have to re-optimize the database. If I had at my disposal some tool supporting automatic testing of the code, it would simplify my work a lot. A necessary precondition is to keep the SQL statements separated from the application code. The following can serve as an illustration (fortunately a positive one): in versions 8.0 and lower there was a big performance difference between the two variants below; in version 8.1 both variants are processed in the same time.

In the WHERE clause, prefer a union of queries (UNION ALL) to compound expressions: the optimizer is prepared for optimizing queries, not arithmetic expressions. Let's suppose that I extend our example by the table mobile_tel, where I store the telephone numbers of the employees who received a mobile phone. For some reason I need to list the employees who have a telephone from a specific group, plus one particular employee. (The column group has to be quoted, because GROUP is an SQL keyword.) The first, non-optimized variant:

SELECT * FROM employees 
  WHERE id = 10 OR id = ANY (SELECT emp_id FROM mobile_tel WHERE "group" = 4);

and the second, optimized one (for versions 8.0 and lower):

SELECT * FROM employees
  WHERE id IN (SELECT emp_id FROM mobile_tel WHERE "group" = 4 
               UNION ALL 
               SELECT 10)

Where is the difference? I replaced the compound expression id = 10 OR ... by the simpler id IN (... and removed the binary operator OR. It would be impractical to search for all the variants of an SQL query and test each of them. You should nevertheless test every SQL statement, and when you aren't satisfied with its efficiency, try to find another form of it.

Optimization

The basis is a suitably designed database:

  • a suitably designed and documented structure of the tables, which we should extend and maintain,
  • suitably chosen data types, e.g. beware of using varchar instead of timestamp, etc.,
  • suitably selected and maintained constraints for domain integrity (a database can be cleaned up later only with great difficulty),
  • regular processing of a database audit: removal of unused indexes, working tables and views,
  • expressions containing a comparison written, whenever possible, so that one side of the comparison is a bare column:
SELECT * FROM test WHERE a+1 = 100 OR b+1 = 100; -- wrong
SELECT * FROM test WHERE a = 100-1 OR b = 100-1; -- right

One SQL query can often be written in several ways. If possible, try to avoid nested, correlated or derived queries and use the JOIN clause instead, even though it won't always be possible and the JOIN may even appear more expensive. Usually the contrary is true, and there is a lower risk that a really unsuitable processing plan will be used; that can make a really striking difference when the indexes are badly tuned.

An example of converting a nested query into a JOIN clause. If I wanted to display all the people who have authored a mimeographed publication, I could write several semantically equivalent SQL queries:

SELECT * FROM people WHERE EXISTS(SELECT id FROM mimeographed WHERE people.id = author);
SELECT * FROM people WHERE id IN (SELECT author FROM mimeographed);
SELECT DISTINCT people.* FROM people JOIN mimeographed ON people.id = author;
SELECT people.* FROM people JOIN (SELECT DISTINCT author FROM mimeographed) s ON s.author = people.id;

These are differently fast depending on the ratio between the sizes of the tables people and mimeographed and on the data distribution. The fastest would probably be the second and the fourth example. The first one suffers from sequential reading of the table people and repeated execution of the subquery. The flaw of the third example is the elimination of a great number of duplicate rows, which we can prevent by placing the DISTINCT clause into a derived table, as in the fourth query. For illustration, the execution times of the individual queries are: 3000, 27, 436 and 36 ms.

Always try to find a single SQL query that solves your task. Reduce the assembling of results on the client side. Reduce the number of queries that you send to the SQL server. If a query would be too complicated and unreadable, write a stored SRF (set returning) function. If you can, use prepared statements.
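
At the SQL level, prepared statements look like the following sketch (the list table is the one from the safe_display example above): the plan is built once by PREPARE and reused by every EXECUTE, and the parameter is never pasted into the query text.

PREPARE display_by_pattern(varchar) AS
  SELECT surname FROM list WHERE surname LIKE $1;
EXECUTE display_by_pattern('P%');
DEALLOCATE display_by_pattern;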

The fundamental tools for optimizing queries are: your imagination (it's necessary to generate at least one more variant of the query), the meta-statement \timing, and the statements VACUUM ANALYZE and EXPLAIN. In principle you don't have to understand the output of the EXPLAIN statement at all; the key phrase is "Seq Scan", which (if it appears next to a table with more than 100 rows) signals a problem. If possible, test the queries on data whose volume and distribution are close to the real data (and count on the database growing). None of the common queries (report generation is usually handled separately) should run longer than 200 ms. If I knew that the application included some longer-running queries, I would try to move their execution to non-working hours and cache the results.

postgres=# explain SELECT label, surname, salary FROM employees e, workplace w
postgres-#   WHERE salary = (SELECT MAX(salary) FROM employees WHERE workplace = e.workplace)
postgres-#   AND e.workplace = w.id;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..20037.56 rows=5 width=24)
   Join Filter: ("outer".workplace = "inner".id)
   ->  Seq Scan on employees e  (cost=0.00..20032.00 rows=5 width=18)
         Filter: (salary = (subplan))
         SubPlan
           ->  Aggregate  (cost=20.00..20.01 rows=1 width=4)
                 ->  Seq Scan on employees  (cost=0.00..19.50 rows=200 width=4)
                       Filter: (workplace = $0)
   ->  Seq Scan on workplace w  (cost=0.00..1.05 rows=5 width=14)
(9 rows)

CREATE INDEX idx_salaries ON employees(salary);

postgres=# explain SELECT label, surname, salary FROM employees e, workplace w
postgres-#   WHERE salary = (SELECT MAX(salary) FROM employees WHERE workplace = e.workplace)
postgres-#   AND e.workplace = w.id;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.06..263.14 rows=5 width=24)
   Hash Cond: ("outer".workplace = "inner".id)
   ->  Seq Scan on employees e  (cost=0.00..262.00 rows=5 width=18)
         Filter: (salary = (subplan))
         SubPlan
           ->  Result  (cost=0.23..0.24 rows=1 width=0)
                 InitPlan
                   ->  Limit  (cost=0.00..0.23 rows=1 width=4)
                         ->  Index Scan Backward using idx_salaries on employees  (cost=0.00..46.50 rows=200 width=4)
                               Filter: ((workplace = $0) AND (salary IS NOT NULL))
   ->  Hash  (cost=1.05..1.05 rows=5 width=14)
         ->  Seq Scan on workplace w  (cost=0.00..1.05 rows=5 width=14)

Older RDBMS versions preferred the EXISTS operator over IN. Today this is no longer true. On a very similar example with tables A (27000 rows) and B (21000 rows), I am looking for the rows that are in A and at the same time not in B. Once again this can be solved in different ways (in brackets is the execution time without an index):

SELECT a FROM A WHERE id NOT IN (SELECT id FROM B); -- 133 ms (836)
SELECT a FROM A WHERE NOT EXISTS (SELECT id FROM B WHERE A.id = B.id); -- 353 ms (81716)
SELECT a FROM A LEFT JOIN B ON A.id = B.id WHERE B.id IS NULL; -- 143 ms (1244)

The suitability of a query is conditioned by the character and volume of the data. On a different testing set, where the number of rows in the table was 100000, NOT IN was the slowest query and JOIN the fastest one. Nevertheless, the NOT EXISTS variant wasn't significantly slower than JOIN.

The query optimizer assumes minimal utilization of the server when selecting the optimal variant. Therefore in some cases an ostensibly non-optimal plan may be chosen (particularly when you are debugging queries on an unloaded development server). If you don't expect a big server load, or you have an above-average fast hard disk, try modifying the configuration variable random_page_cost.
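
For example (a sketch; 4.0 is the default value, lower values make the planner more willing to use indexes):

SET random_page_cost = 2.0;  -- affects only the current session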

Optimization of GROUP BY and DISTINCT

Try to minimize the volume of data on which you execute these operations; use derived tables or nested subqueries. Do not use GROUP BY for removing duplicate entries or for simplifying a query. An example of confining GROUP BY to a derived table is the following query:

SELECT orders.*, factories.*, customer.*, s.c, s.m FROM
  orders JOIN customer ON orders.customer_id = customer.id
             JOIN factories ON customer.factory_id = factories.id
  JOIN ( SELECT orders.id, count(order_items.id) as c, max(order_history.date) as m
    FROM orders JOIN order_items ON orders.id = order_items.order_id
                  JOIN order_history ON orders.id = order_history.order_id
    WHERE orders.status_id = 321
    GROUP BY orders.id ) s ON orders.id = s.id
  WHERE factory_id IN (1,10,11,2,3,4,5,6,7,8,9) ORDER BY s.m;

Use DISTINCT only if you have a real reason for the existence of duplicate entries which you don't want to display. If you have problems with duplicate entries in a query without an obvious reason for them, it is likely an error in the design of the SQL query; check whether you have declared the join conditions for all the tables in the query.

Optimization of LIKE queries

PostgreSQL will use an index for a LIKE condition only if the following conditions are fulfilled:

  • the mask doesn't start with the symbols % and _,
  • the database cluster is initialized with the 'C' locale.

We can get around the second condition by creating a special index:

CREATE INDEX like_index ON people(surname varchar_pattern_ops);

National characters don't matter; we only have to be careful about the fact that LIKE is case sensitive. If you meet these conditions, you will remove one sequential reading of the table and shorten the query execution from hundreds of milliseconds to milliseconds.
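
A sketch of the difference, using the people table from above:

SELECT * FROM people WHERE surname LIKE 'Nov%';  -- left-anchored mask, the index can be used
SELECT * FROM people WHERE surname LIKE '%ová';  -- mask starts with %, sequential scan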

If a table has less than 100 rows, the index probably won't be used even if it exists: for such a small table the cost of reading the whole table is less than the costs connected with using the index. For bigger tables it is already worth designing indexes, especially if they have more than 10000 rows. The use of indexes needs to be monitored, and an unused index should (if possible) be removed. Every additional index brings specific costs to the operations INSERT, UPDATE and DELETE, apart from the disk space usage. Use indexes for columns with sufficient selectivity; that is, do not use columns with relatively few unique values (e.g. sex, department). If you want to do something like that in spite of this, first try whether a partial index or partitioning would help.

Think differently

PostgreSQL offers relatively strong resources for the enforcement of referential, domain and application integrity. If we use them, we can distinctly reduce the load on the database. An example: we have a table of the prices of products of some specific class. This table also includes archive data; valid is the price with the highest id (the last one added). If we want to look up the price of some specific product, we can use several queries:

SELECT how_much FROM price 
  WHERE id = (SELECT MAX(id) FROM price WHERE pid = 1);

or the more economical variant:

SELECT how_much FROM price WHERE pid = 1 ORDER BY id DESC LIMIT 1;

If we wanted to execute other operations over the current prices, it would be worth changing tactics. We introduce an attribute actual and over this attribute a partial index. At this moment, however, we complicate the operation INSERT: when adding a row we also have to clear the attribute actual of the previously current price. This we can do easily using a trigger.

CREATE OR REPLACE FUNCTION actual_constr_trg_function() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'UPDATE' THEN 
    IF NEW.id <> OLD.id OR NEW.how_much <> OLD.how_much THEN
      RAISE EXCEPTION 'Forbidden operation UPDATE on the table prices';
    ELSIF NEW.actual = false AND OLD.actual = true THEN 
      RETURN NEW;      
    ELSE
      RAISE EXCEPTION 'Forbidden operation UPDATE on the table prices';
    END IF;
  ELSIF TG_OP = 'DELETE' THEN
    RAISE EXCEPTION 'Forbidden operations DELETE on the table prices';
  ELSE
    UPDATE price SET actual = false WHERE actual = true AND pid = NEW.pid;
    NEW.actual = true;
    RETURN NEW;
  END IF;
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER actual_constr_trg BEFORE INSERT OR UPDATE OR DELETE 
  ON price FOR EACH ROW EXECUTE PROCEDURE actual_constr_trg_function();

CREATE INDEX actual ON price(pid) WHERE actual = true;

The index actual is an example of a partial index: it contains only the records where the value of the column actual is true. When using partial indexes a frequent mistake occurs, namely indexing the very column by which the records are filtered instead of putting it into the index's WHERE condition. Now we have easy access to all current prices. Thanks to the trigger protection, an inconsistency of the database cannot occur: neither that no entry has the attribute actual, nor that more than one entry for one product has it. Every function executes inside a transaction, and triggers are no exception; therefore it cannot happen that I would clear the attribute and not finish the INSERT.
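
Getting the current price of one product is then trivial and fast (a sketch; the partial index covers exactly these rows):

SELECT how_much FROM price WHERE pid = 1 AND actual = true;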

A matter of interest: searching for all current prices of 4000 items takes 1 ms with the attribute actual and the partial index, 15 ms with the attribute actual and without the partial index, and 20 ms without the attribute actual (using the following query):

SELECT * FROM price WHERE id IN
  (SELECT MAX(id) FROM price GROUP BY pid);

Effective use of the database is conditioned by a good knowledge of all the resources and possibilities which the RDBMS offers us. In the case of PostgreSQL that is:

  • knowledge of the SELECT statement (nested queries, joining of tables),
  • knowledge of the types and usage of indexes: B-tree, Hash, GiST; unique, partial and functional indexes. PostgreSQL does not automatically create indexes on columns with foreign keys; try creating them and test for yourself whether they bring a sufficient benefit (see the sketch after this list),
  • knowledge of the data types: integer, numeric, varchar, date, timestamp, bytea, boolean, and of NULL and NOT NULL,
  • knowledge of the guarantees of referential and domain integrity: PRIMARY KEY, FOREIGN KEY, CHECK, triggers,
  • knowledge of partitioning: excluding some tables from the processing plan thanks to restrictive rules.
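
For instance, for the child table from the DDL example above, an index on the foreign key column has to be created by hand (a sketch):

CREATE INDEX child_parent_idx ON child(parent);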
