This document explains the installation and configuration of the PostgreSQL database engine.
Install PostgreSQL
sudo apt install postgresql postgresql-client
Debian wiki explains how to install and set up PostgreSQL.
In case you need to change the listening port (for example when I had
to test postgresql on docker containers while keeping postgresql running
locally) Edit the configuration file
/etc/postgresql/13/main/postgresql.conf
and change the port
definition as follows
port = 5433
sudo service postgresql start
sudo service postgresql stop
Add system users at the regular bash shell before you can add them in postgresql. I call this user “rdb” because I plan to use R to connect to the database and “R” cannot be used as a user name according to system policy.
sudo adduser rdb --disabled-password
Login as the postgres user!
sudo -i -u postgres
# In case postgresql is not on the default port 5432
export PGPORT=5433
Now logged in as postgres@machine_name
in the regular
bash shell:
Create a user with the same name as the system user created above
createuser –pwprompt rdb # Create a user for myself createuser paul
See also ~/.pgpass
to store the password.
Create a database owned by that user
createdb -O rdb tradeflows createdb -O rdb biotrade createdb -O paul tradeflows_migrated
Note: createdb is a wrapper around the SQL command
CREATE DATABASE
. At the postgreSQL command prompt, you can
use
create database biotrade;
Grant rights to another user. First connect to the database
psql -d biotrade -h localhost -U rdb
Then grand privileges
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA raw_comtrade TO paul;
This is the preferred way to connect to the database for scripts.
Create a ~/.pgpass
file to store the connection details for
that user:
echo 'localhost:*:tradeflows:rdb:localhost' >> ~/.pgpass
Secure the file:
chmod 600 ~/.pgpass
The ~/.pgpass
file should contain lines of the following
format:
hostname:port:database:username:password
“Each of the first four fields can be a literal value, or
*
, which matches anything.”
Therefore I use *
for the port so that it matches any
ports.
The postgresql
service file enables “connection parameters to be associated with a
single service name”. You can specify the database connection parameters
in a file called ~/.pg_service.conf
in you home.
# Trade database
[biotrade]
host=localhost
port=5432
user=rdb
dbname=biotrade
password=localhost
And you would then connect to the database with:
psql service=biotrade
Using the credentials stored in ~/.pgpass
psql -d tradeflows -h localhost -U rdb
psql -d biotrade -h localhost -U rdb
Specifying the port (normally not necessary if it’s specified in
~/.pgpass
)
psql -d biotrade -h localhost -U rdb -p 5433
It’s possible to specify a connection URI of the form:
psql postgresql://rdb@localhost:5432/biotrade
# And since I have a URL set as an environmental variable
psql $BIOTRADE_DATABASE_URL
“It is possible to specify multiple host components, each with an optional port component, in a single URI. A URI of the form postgresql://host1:port1,host2:port2,host3:port3/ is equivalent to a connection string of the form host=host1,host2,host3 port=port1,port2,port3. As further described below, each host will be tried in turn until a connection is successfully established.”
Login as the user who is the owner of that database
sudo -i -u rdb
Connect to the database
psql tradeflows
Name | Description |
---|---|
character varying(n), varchar(n) | variable-length with limit |
character(n), char(n) | fixed-length, blank padded |
text | variable unlimited length |
Tip:
“There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.”
“Generally, there is no downside to using text in terms of performance/memory. On the contrary: text is the optimum. Other types have more or less relevant downsides. text is literally the”preferred” type among string types in the Postgres type system, which can affect function or operator type resolution.”
differences between text and varchar
To sum it all up: - char(n) – takes too much space when dealing with values shorter than n (pads them to n), and can lead to subtle errors because of adding trailing spaces, plus it is problematic to change the limit - varchar(n) – it’s problematic to change the limit in live environment (requires exclusive lock while altering table) - varchar – just like text - text – for me a winner – over (n) data types because it lacks their problems, and over varchar – because it has distinct name
In SQLite as well, character(n) doesn’t seem to have any performance advantage.
Note that numeric arguments in parentheses that following the type name (ex: “VARCHAR(255)”) are ignored by SQLite - SQLite does not impose any length restrictions (other than the large global SQLITE_MAX_LENGTH limit) on the length of strings, BLOBs or numeric values.
Keep the table structure but delete it’s content. Delete oil products from the Comtrade monthly table:
# Check results
select distinct(product_code) from raw_comtrade.monthly where product_code like '15%';
# Delete
DELETE FROM raw_comtrade.monthly WHERE product_code like '15%';
DELETE FROM raw_comtrade.monthly WHERE product_code like '44%';
Drop a database
postgres=# drop database databasename;
Drop a schema
psql -d tradeflows -h localhost -U rdb -c "DROP SCHEMA raw_comext"
ERROR: cannot drop schema raw_comext because other objects depend on it
DETAIL: table raw_comext.monthly depends on schema raw_comext
HINT: Use DROP ... CASCADE to drop the dependent objects too.
Drop a schema and all its table with a drop in cascade
psql -d tradeflows -h localhost -U rdb -c "DROP SCHEMA raw_comext CASCADE"
Drop a table
drop table raw_comtrade.yearly;
Create a partial index on the first 4 characters of the product code
CREATE INDEX ix_raw_comtrade_yearly_product_code_prefix_4
ON raw_comtrade.yearly (LEFT(product_code, 4));
The database structure is defined as SQLAlchemy objects, then later translated in either SQLite or PostGreSQL dialect. Show the database structure as defined in PostGreSQL:
pg_dump --schema-only -t raw_comtrade.yearly $BIOTRADE_DATABASE_URL > table_structure_dump.sql
Create an additional index to speed up query on the first 4 digits of the product code (it doesn’t actually speed things up, left here for information purposes).
psql $BIOTRADE_DATABASE_URL
CREATE INDEX ix_raw_comtrade_yearly_product_code_prefix_4
ON raw_comtrade.yearly (LEFT(product_code, 4));
Remove an index
SET search_path TO raw_comtrade, public;
DROP INDEX ix_raw_comtrade_yearly_product_code_prefix_4;
Rename a table to another name with _backup
at the
end
ALTER TABLE raw_comtrade.yearly RENAME TO yearly_backup;
ALTER TABLE raw_comtrade.monthly RENAME TO monthly_backup;
ALTER TABLE raw_comtrade.product RENAME TO product_backup;
Rename a schema
ALTER SCHEMA raw_comtrade RENAME TO raw_comtrade_backup;
Examples copied from man pg_dump
. To dump a database
called mydb into a SQL-script file:
$ pg_dump mydb > db.sql
To reload such a script into a (freshly created) database named newdb:
$ psql -d newdb -f db.sql
The dump contained the following two statements which I removed:
SET default_tablespace = '';
SET default_with_oids = false;
I removed default table space because the official documentation for table spaces says it doesn’t make sense to have more than one table space per system.
I removed default_with_oids because this posrt explains that it’s used to enable a legacy feature.
Duplicate a table:
create table raw_comtrade.monthly_copy as (select * from raw_comtrade.monthly);
Help about psql commands:
?
Help about SQL commands:
\h
\h select
List databases
\l
Connect to a database
\c db_name
List schemas
\dn
List tables
\dt
List tables for a particular schema ilist all tables in schema
\dt raw_comext.*;
\dt raw_comtrade.*;
\dt raw_faostat*;
Describe a table to show the table structure, including column names and data types
\d raw_comtrade.yearly;
\d raw_comtrade.monthly;
Read commands from a file with the -f
argument:
psql -d biotrade -h localhost -U rdb -f ~/rp/biotrade/biotrade/config_data/comtrade.sql
Compare common commands between MariaDB and PostgreSQL.
MariaDB/MySQL | PostgreSQL |
---|---|
show databases; | \l |
connect db_name; | \c db_name |
show tables; | \dt |
Transfer data from MySQL to PostgreSQL:
pgloader mysql://user@localhost/sakila postgresql:///pagila
For my own database I used:
sudo su postgres
createdb -O paul tradeflows_migrated
exit # Go back to user "paul"
pgloader mysql://paul@localhost/tradeflows postgresql:///tradeflows_migrated
Connect to the database (with the credentials defined in
~/.pgpass
)
psql biotrade
postgresql.org the WHERE Clause
Here are some examples of WHERE clauses:
SELECT ... FROM fdt WHERE c1 > 5
SELECT ... FROM fdt WHERE c1 IN (1, 2, 3)
SELECT ... FROM fdt WHERE c1 IN (SELECT c1 FROM t2)
SELECT ... FROM fdt WHERE c1 IN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10)
SELECT ... FROM fdt WHERE c1 BETWEEN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10) AND 100
SELECT ... FROM fdt WHERE EXISTS (SELECT c1 FROM t2 WHERE c2 > fdt.c1)
When using string values, single quotes are required (otherwise the
ERROR: column "02" does not exist
is returned)
select distinct(reporter) from raw_comtrade.yearly_hs2 where commodity_code = '02';
Select product codes that start with "15"
select distinct(product_code) from raw_comtrade.monthly where product_code like '15%';
Select product codes within a list, then group by and count
biotrade=> SELECT product_code, count(*) FROM raw_comtrade.yearly WHERE
product_code IN ('230400', '150710', '120190') group by product_code;
product_code | count
--------------+-------
120190 | 29727
150710 | 33790
230400 | 49398
postgresql.org Roadmap
Create a schema
create schema raw_comtrade;
List schemas
\dn
Official documentation postgresql.org schemas
“A database contains one or more named schemas, which in turn contain tables. Schemas also contain other kinds of named objects, including data types, functions, and operators. The same object name can be used in different schemas without conflict; for example, both schema1 and myschema can contain tables named mytable. Unlike databases, schemas are not rigidly separated: a user can access objects in any of the schemas in the database they are connected to, if they have privileges to do so.”
“There are several reasons why one might want to use schemas:”
“Schemas are analogous to directories at the operating system level, except that schemas cannot be nested.”
List tables for a particular schema:
\dt raw_comtrade.*;
Export the create statement for an existing table
pg_dump -t 'raw_comtrade.yearly_2_digit' --schema-only biotrade >> /tmp/comtrade_yearly.sql
Dimitri Fontaine 2014 Why is pgloader faster ?
“In searching for a modern programming language the best candidate I found was actually Common Lisp.”
See also a more detailed quote under the Lisp page.
Wikipedia paragraph on the pronunciation
“The original standard declared that the official pronunciation for”SQL” was an initialism: /ˌɛsˌkjuːˈɛl/ (“ess cue el”). Regardless, many English-speaking database professionals (including Donald Chamberlin himself) use the acronym-like pronunciation of /ˈsiːkwəl/ (“sequel”), mirroring the language’s prerelease development name, “SEQUEL”.”