AA Packages

Install with Pip

From the pip page on pypi.org:

“pip is the package installer for Python. You can use pip to install packages from the Python Package Index and other indexes.”

Run at the command line or from an ipython prompt:

pip install packagename

What is the difference between a python module and a python package?

“A package is a collection of modules in directories that give a package hierarchy.”

What is setup.py?

Check the version number of a package

In Python, check the version of a package with the __version__ attribute; note that it is not always available.

>>> import pandas
>>> print(pandas.__version__)

You can also use importlib.metadata:

>>> from importlib.metadata import version
>>> version('pandas')

An SO answer suggests a command line way to display the same information:

pip freeze | grep pandas

Location of a package

Check the location of a package

import eu_cbm_hat
eu_cbm_hat.__file__

Install an old version

For example to install pandas 0.24.2

python3 -m pip install --user pandas==0.24.2 

or

pip3 install --user pandas==0.24.2

Sometimes you need to overwrite the existing version with the -I option:

pip install -I  package==version

Install a local version

To install the local version of a package with pip

pip install -e /develop/MyPackage

According to man pip, the -e option “installs a project in editable mode (i.e. setuptools “develop mode”) from a local project path or a VCS url”.

Uninstall a local version

When uninstalling a package installed locally, you might get this error message:

pip uninstall localpackage
# Found existing installation: localpackage 0.0.1
# Can't uninstall 'localpackage'. No files were found to uninstall.

You can show the location of the package with

pip show localpackage

Then remove it manually with

rm -rf ~/.local/lib/python3.9/site-packages/localpackage*

And maybe this as well

rm -rf ~/.local/lib/python3.9/site-packages/build/lib/localpackage*

Install from a git repository

Install from the dev branch of a private repo on gitlab using ssh

pip install git+ssh://git@gitlab.com/bioeconomy/forobs/biotrade.git@dev

Install from the dev branch of a private repo on gitlab using an authentication token

pip install git+https://gitlab+deploy-token-833444:ByW1T2bJZRtYhWuGrauY@gitlab.com/bioeconomy/forobs/biotrade.git@dev

Install from the compressed tar.gz version of a repository that doesn’t require git to be installed on your laptop:

pip install --force-reinstall https://github.com/ytdl-org/youtube-dl/archive/refs/heads/master.tar.gz

pip install --force-reinstall https://github.com/mwaskom/seaborn/archive/refs/heads/master.tar.gz

Installing from Pypi

Install with Anaconda

Use the conda update command to check whether a new update is available. If conda tells you an update is available, you can then choose whether or not to install it.

  • conda vs pip vs virtualenv commands

    “If you have used pip and virtualenv in the past, you can use conda to perform all of the same operations. Pip is a package manager and virtualenv is an environment manager. conda is both.”

Channels

In case a python package is not available in the default conda channel, you can change the channel to conda-forge as follows:

conda install -c conda-forge <package_name>

conda-forge:

> - "The conda team, from Anaconda, Inc., packages a multitude of packages and
> provides them to all users free of charge in their default channel."

> "conda-forge is a community effort that tackles these issues:
>  - All packages are shared in a single channel named conda-forge.
>  - Care is taken that all packages are up-to-date.
>  - Common standards ensure that all packages have compatible versions.
>  - By default, we build packages for macOS, Linux AMD64 and Windows
>    AMD64."

Install

Documentation conda.io installing packages

To install a specific package such as SciPy into an existing environment “myenv”:

conda install --name myenv scipy

If you do not specify the environment name, which in this example is done by --name myenv, the package installs into the current environment:

conda install scipy

To install a specific version of a package such as SciPy:

conda install scipy=0.15.0

To install multiple packages at once, such as SciPy and cURL:

conda install scipy curl

Note: It is best to install all packages at once, so that all of the dependencies are installed at the same time.

Pip and conda

Update

Documentation conda.io updating packages

Use the terminal or an Anaconda Prompt for the following steps.

To update a specific package:

conda update biopython

To update Python:

conda update python

To update conda itself:

conda update conda

Remove

Remove the package ‘scipy’ from the currently-active environment:

conda remove scipy

Remove a list of packages from an environment ‘myenv’:

conda remove -n myenv scipy curl wheel

Conda environments

Environment file

Creating an environment file manually

You can create an environment file (environment.yml) manually to share with others.

EXAMPLE: A simple environment file:

name: stats
dependencies:
  - numpy
  - pandas

EXAMPLE: A more complex environment file:

name: stats2
channels:
  - javascript
dependencies:
  - python=3.6   # or 2.7
  - bokeh=0.9.2
  - numpy=1.9.*
  - nodejs=0.10.*
  - flask
  - pip:
    - Flask-Testing

Note

Note the use of the wildcard * when defining the patch version number. Defining the version number by fixing the major and minor version numbers while allowing the patch version number to vary allows us to use our environment file to update our environment to get any bug fixes whilst still maintaining consistency of software environment.

Mamba

The mamba solver can speed up the dependency resolution process. It doesn’t require a special mamba installation; you can switch the default solver in a normal conda installation:

conda install -n base -c defaults conda-libmamba-solver
conda config --set solver libmamba

Install with the OS’s package manager

Some packages can be installed with the OS’s package manager, for example on Debian:

sudo apt install python3-pip

Discussion pip, conda, apt

Create a package

Publish to pypi

To upload a package to pypi, you need a pypi account. The instructions on uploading distribution archives explain how to upload the package to test.pypi:

python3 -m twine upload --repository testpypi dist/*

I updated the following packages before running this:

pip install --upgrade build
pip install --upgrade twine

I built the package with

cd forobs/biotrade
python3 -m build

twine uses KDE Wallet to store the password. Press cancel if you can’t use KDE Wallet; it will then ask for the password at the command line. There is a twine issue related to the use of keyring.

Register an account on pypi (it’s a different server than test.pypi). Create a token under account settings. Then upload to pypi itself

cd repository
python3 -m build
twine upload dist/*

To use the API token:

Set your username to __token__
Set your password to the token value, including the pypi- prefix

Test install in a virtual environment

In bash, create a virtual environment to test the installation, and empty the PYTHONPATH, otherwise my local version is seen:

mkdir /tmp/biotrade_env/
cd /tmp/biotrade_env/
python3 -m venv /tmp/biotrade_env/
source /tmp/biotrade_env/bin/activate
PYTHONPATH=""
python3

In Python, check that the package is not already available (these imports should fail):

>>> import biotrade
>>> import pandas

Back in the shell, test the installation from test.pypi:

pip install -i https://test.pypi.org/simple/ biotrade
# ERROR: Could not find a version that satisfies the requirement pandas (from biotrade)
# ERROR: No matching distribution found for pandas

Installing biotrade’s dependencies generates an error because pandas is not available in the test repository. You can install it from pypi directly with pip install pandas.

Install from a wheel

cd ~/repos/forobs/biotrade/dist
pip install biotrade-0.2.2-py3-none-any.whl
# Or 
pip install biotrade-0.2.2.tar.gz

Publish to Conda Forge

On https://conda-forge.org/docs/maintainer/adding_pkgs.html conda recommends https://github.com/conda-incubator/grayskull to create the recipe

“Presently Grayskull can generate recipes for Python packages available on PyPI and also those not published on PyPI but available as GitHub repositories.”

Authorship

It’s only possible to specify one author field in setup.py. The recommendation is to use a mailing list when there are multiple authors and to set separate files for attribution.

How to specify multiple authors in setup.py?

Add non code files

The Python packaging documentation on adding non code files

“The mechanism that provides this is the MANIFEST.in file. This is relatively quite simple: MANIFEST.in is really just a list of relative file paths specifying files or globs to include”:

include README.rst
include docs/*.txt
include funniest/data.json

“In order for these files to be copied at install time to the package’s folder inside site-packages, you’ll need to supply include_package_data=True to the setup() function.”

“Files which are to be used by your installed library (e.g. data files to support a particular computation method) should usually be placed inside of the Python module directory itself. E.g. in our case, a data file might be at funniest/funniest/data.json. That way, code which loads those files can easily specify a relative path from the consuming module’s __file__ variable.”
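
For example, a sketch of loading such a data file relative to the consuming module’s __file__ (the funniest package and data.json names come from the quote above):

import json
import pathlib

# Inside a module of the funniest package, locate data.json shipped alongside it
data_path = pathlib.Path(__file__).parent / "data.json"
data = json.loads(data_path.read_text())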

The Python packaging documentation on MANIFEST.in commands gives the syntax of the recursive-include and graft commands.

Add all files under directories matching dir-pattern that match any of the listed patterns

recursive-include dir-pattern pat1 pat2

Add all files under directories matching dir-pattern

graft dir-pattern

The Python packaging documentation on source dist gives an example of the patterns

include *.txt
recursive-include examples *.txt *.py
prune examples/sample?/build

“The meanings should be fairly clear: include all files in the distribution root matching *.txt, all files anywhere under the examples directory matching *.txt or *.py, and exclude all directories matching examples/sample?/build.”

Version

SO What is the correct way to share package version with setup.py and the package?

The version of a package has to be set both in setup.py and __init__.py; it’s crazy how many options people have thought about. This answer summarizes the state of the art in 7 options, including a link to the python packaging user guide.
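
One of those options, sketched here under the assumption that the package is installed, is to keep the version only in the package metadata and read it in packagename/__init__.py with importlib.metadata:

# packagename/__init__.py
from importlib.metadata import version, PackageNotFoundError

try:
    __version__ = version("packagename")
except PackageNotFoundError:
    # The package is not installed, e.g. running from a source checkout
    __version__ = "unknown"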

Bump version

Install bumpversion

pip install bumpversion

Increment the version number both in setup.py and __init__.py with the command line tool bumpversion. First create a configuration file .bumpversion.cfg where current_version matches the versions in setup.py and packagename/__init__.py:

[bumpversion]
current_version = 0.0.5
commit = True
tag = True

[bumpversion:file:setup.py]
[bumpversion:file:biotrade/__init__.py]

Increment the version number in all files and the git tag with:

bumpversion patch
# Or to increment minor or major versions
bumpversion minor
bumpversion major

Push the corresponding tags to the remote repository

git push origin --tags

Check the updated version in setup.py

python setup.py --version 

Start an ipython prompt to test the package version

ipython
import packagename
packagename.__version__

Documentation

Pdoc

Generate the documentation of a package with pdoc:

pdoc -o public ./biotrade

This can be added to a .gitlab-ci.yml file in order to generate the documentation on a Continuous Integration system:

pages:
  stage: document
  script:
  # GitLab Pages will only publish files in the public directory
  - pdoc -o public ./biotrade
  artifacts:
    paths:
    - public
  only:
  - main
  interruptible: true

Packaging tools

Setup.py (legacy)

  • https://pip.pypa.io/en/stable/reference/build-system/setup-py/

    “Prior to the introduction of pyproject.toml-based builds (in PEP 517 and PEP 518), pip had only supported installing packages using setup.py files that were built using setuptools.”

    “The interface documented here is retained currently solely for legacy purposes, until the migration to pyproject.toml-based builds can be completed.”

Pyproject-toml

Location or path

Of a package

The location of a package can be obtained from package_name.__file__.

Of the python executable

Get the location of the python executable with

>>> import sys
>>> print(sys.executable)

In a virtual env, it can return a symlink to another folder. In that case, the path can be deduced from:

>>> import os
>>> os.__file__

Virtual environments

  • venv is available by default in Python 3.3+

    • Installation

      sudo apt install python3-venv

    • Usage

      mkdir /tmp/testenv
      python3 -m venv /tmp/testenv
      source /tmp/testenv/bin/activate
  • Pipenv makes pip and virtual environments work together.

    “There is a subtle but very important distinction to be made between applications and libraries. This is a very common source of confusion in the Python community.”

    “Libraries provide reusable functionality to other libraries and applications (let’s use the umbrella term projects here). They are required to work alongside other libraries, all with their own set of sub-dependencies. They define abstract dependencies. To avoid version conflicts in sub-dependencies of different libraries within a project, libraries should never ever pin dependency versions. Although they may specify lower or (less frequently) upper bounds, if they rely on some specific feature/fix/bug. Library dependencies are specified via install_requires in setup.py.”

    “Libraries are ultimately meant to be used in some application. Applications are different in that they usually are not depended on by other projects. They are meant to be deployed into some specific environment and only then should the exact versions of all their dependencies and sub-dependencies be made concrete. To make this process easier is currently the main goal of Pipenv.”

    • Install on Debian

      sudo apt install pipenv

  • pyenv makes it possible to manage different python versions. Within each pyenv environment you can also install different packages with pip.

  • Illustration of the complementarity between pyenv and pipenv.

Remove environment variables

This is useful to test a fresh install of a package, or to test it in conditions where some environment variables are not defined.

For example, remove an environment variable with unset:

unset BIOTRADE_DATABASE_URL
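
The same can be done from within Python before importing the package, a small sketch reusing the same variable name:

import os
# Remove the variable if it is defined, do nothing otherwise
os.environ.pop("BIOTRADE_DATABASE_URL", None)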

Applications

Streamlit

Example application with select boxes and a slider. Use the index argument to select a default value.

import streamlit
reporter = streamlit.sidebar.selectbox(
    "Select a reporter Country", options=df["reporter"].unique()
)
products = streamlit.sidebar.multiselect(
    "Select some products", options=df["product_name"].unique()
)
element = streamlit.sidebar.selectbox(
    "Select a variable for the Y Axis", options=["net_weight", "price", "trade_value"]
)
flow = streamlit.sidebar.selectbox(
    "Select a flow direction", options=["import", "export"]
)
n_partners = streamlit.sidebar.slider(
    "Select N First Partners", min_value=1, max_value=10, value=5
)

Compilers

Numba just-in-time compiler

Numba User Manual

“When a call is made to a Numba-decorated function it is compiled to machine code “just-in-time” for execution and all or part of your code can subsequently run at native machine code speed!”
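
A minimal sketch of a Numba-decorated function (assumes numba and numpy are installed; the function name is illustrative):

import numpy
from numba import njit

@njit
def sum_of_squares(values):
    """Sum of squares computed in a plain loop, compiled by Numba on first call"""
    total = 0.0
    for x in values:
        total += x * x
    return total

sum_of_squares(numpy.arange(1e6))  # first call compiles, later calls run at native speed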

Pythran Ahead of time compiler

https://github.com/serge-sans-paille/pythran

“Pythran is an ahead of time compiler for a subset of the Python language, with a focus on scientific computing. It takes a Python module annotated with a few interface descriptions and turns it into a native Python module with the same interface, but (hopefully) faster.”
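
A minimal sketch of an annotated module, loosely based on the example in the Pythran README (treat the details as indicative):

# dprod.py -- compile with: pythran dprod.py
#pythran export dprod(int list, int list)
def dprod(l0, l1):
    """Dot product of two lists, compiled ahead of time by Pythran"""
    return sum(x * y for x, y in zip(l0, l1))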

Control flow

If

dotnetperls: not python

Sample code with a function and if conditions:

def function(condition):
    if condition:
        print("Hi")
    if not condition:
        print("Bye")
function(True)
function(False)
function('')
function('lalala')

Elif

Example from https://stackoverflow.com/a/16287793/2641825

Using two if conditions

# According to the UN Convention of the Rights of the Child
ADULT_AGE = 18
def analyze_age(age):
   if age < ADULT_AGE and age > 0:
       print("You are a child")
   if age >= ADULT_AGE:
       print("You are an adult")
   else:
       print("The age must be a positive integer!")

analyze_age(16)
>You are a child
>The age must be a positive integer!

“The elif fixes this and makes the two if statements ‘stick together’ as one:”

def analyze_age(age):
   if age < ADULT_AGE and age > 0:
       print("You are a child")
   elif age >= ADULT_AGE:
       print("You are an adult")
   else:
       print("The age must be a positive integer!")

analyze_age(16)
>You are a child

For loop

A for loop with a continue statement

print("I print all numbers in the range except 2.")
for i in range(5):
    if i==2:
        continue
    print(i)

Command line

Parsing arguments with argparse

The Argparse Tutorial explains how to create a python program that processes command line arguments. Save the following in prog.py:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
                    help="display a square of a given number")
parser.add_argument("-v", "--verbosity", type=int,
                    help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
    print(f"the square of {args.square} equals {answer}")
elif args.verbosity == 1:
    print(f"{args.square}^2 == {answer}")
else:
    print(answer)

Usage

$ python3 prog.py 4
16
$ python3 prog.py 4 -v
usage: prog.py [-h] [-v VERBOSITY] square
prog.py: error: argument -v/--verbosity: expected one argument
$ python3 prog.py 4 -v 1
4^2 == 16
$ python3 prog.py 4 -v 2
the square of 4 equals 16
$ python3 prog.py 4 -v 3
16

SO answer explains that when using ipython, you need to separate ipython arguments from your script arguments using --.

Databases

SQL Alchemy

SQL Alchemy is a database abstraction layer. Interaction with the database is built upon metadata objects:

The core of SQLAlchemy’s query and object mapping operations are supported by database metadata, which is comprised of Python objects that describe tables and other schema-level objects. These objects are at the core of three major types of operations - issuing CREATE and DROP statements (known as DDL), constructing SQL queries, and expressing information about structures that already exist within the database. Database metadata can be expressed by explicitly naming the various components and their properties, using constructs such as Table, Column, ForeignKey and Sequence, all of which are imported from the sqlalchemy.schema package. It can also be generated by SQLAlchemy using a process called reflection, which means you start with a single object such as Table, assign it a name, and then instruct SQLAlchemy to load all the additional information related to that name from a particular engine source.

Reflecting database objects

from sqlalchemy import MetaData
from sqlalchemy import Table
meta = MetaData(schema = "raw_comtrade")
meta.bind = comtrade.database.engine
yearly_hs2 = Table('yearly_hs2', meta, autoload_with=comtrade.database.engine)

SQL Alchemy has an automap feature which generates mapped classes and relationships from a database schema.
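
A minimal sketch of automap, assuming an existing SQLite database file that already contains a user table with a primary key (file and table names are illustrative):

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///existing.db")  # hypothetical existing database
Base = automap_base()
Base.prepare(engine, reflect=True)  # reflect the schema and generate mapped classes
User = Base.classes.user            # mapped class for the "user" table
with Session(engine) as session:
    users = session.query(User).all()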

I used sqlacodegen to automatically generate python code from an existing PostgreSQL database table as follows:

sqlacodegen --schema raw_comtrade --tables yearly_hs2 postgresql://rdb@localhost/biotrade

Check for table existence

Paul’s SO Answer. SQL Alchemy’s recommended way to check for the presence of a table is to create an inspector object and use its has_table() method. The following example was copied from sqlalchemy.engine.reflection.Inspector.has_table, with the addition of an SQLite engine to make it reproducible:

from sqlalchemy import create_engine, inspect
from sqlalchemy import MetaData, Table, Column, Text
engine = create_engine('sqlite://')
meta = MetaData()
meta.bind = engine
user_table = Table('user', meta, 
                   Column("name", Text),
                   Column("full_name", Text))
user_table.create()
inspector = inspect(engine)
inspector.has_table('user')

You can also use the name of the user_table metadata object to check whether the table exists:

inspector.has_table(user_table.name)

Connection

Create a connection and execute a select statement; this is a read-only operation.
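
A minimal sketch, reusing the engine and user_table objects defined in the example above:

from sqlalchemy import select

with engine.connect() as conn:
    result = conn.execute(select(user_table))
    for row in result:
        print(row)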

Create a connection and execute a create statement followed by a commit:

from sqlalchemy.schema import CreateSchema

with engine.connect() as conn:
    if not engine.dialect.has_schema(conn, schema):
        conn.execute(CreateSchema(schema))
        conn.commit()

Migration to version 2

  • https://docs.sqlalchemy.org/en/14/changelog/migration_20.html

    “As a means of both proving the 2.0 architecture as well as allowing a fully iterative transition environment, the entire scope of 2.0’s new APIs and features are present and available within the 1.4 series;”

  • https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-implicit-execution

    “For schema level patterns, explicit use of an Engine or Connection is required.”

    with engine.connect() as connection:
        # create tables, requires explicit begin and/or commit:
        with connection.begin():
            metadata_obj.create_all(connection)

        # reflect all tables
        metadata_obj.reflect(connection)

        # reflect individual table
        t = Table("t", metadata_obj, autoload_with=connection)

        # execute SQL statements
        result = connection.execute(t.select())

ORM querying guide

Select where

SQL Alchemy Object Relational Model Querying Guide

from sqlalchemy import select
stmt = select(user_table).where(user_table.c.name == 'spongebob')
print(stmt)

Since version 1.4 .where() is a synonym of .filter() as explained in sqlalchemy.orm.Query.where.

To select only one column you can use Select.with_only_columns:

from sqlalchemy import MetaData, Table, Column, Text
meta = MetaData()
table = Table('user', meta, 
              Column("name", Text),
              Column("full_name", Text))
stmt = (table.select()
        .with_only_columns([table.c.name])
       )
print(stmt)

Entering columns in the select method returns an error, although it should be valid according to the documentation.

print(table.select([table.c.name]))
# ArgumentError: SQL expression for WHERE/HAVING role expected, 
# got [Column('name', Text(), table=<user>)].

Insert

Insert some data into the user table

from sqlalchemy import insert
from sqlalchemy.orm import Session
stmt = (
    insert(user_table).
    values(name='Bob', full_name='Sponge Bob')
)
with Session(engine) as session:
    result = session.execute(stmt)
    session.commit()

ORM query to pandas

The pandas DataFrame.to_sql method uses sqlalchemy to write a data frame to a PostgreSQL database.

“The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default.”

table.select(), select() or a session

Repeating the example table defined above, read the result of a select statement into a pandas data frame:

import pandas
from sqlalchemy import create_engine
from sqlalchemy import MetaData, Table, Column, Text
from sqlalchemy.orm import Session
# Define metadata and create the table
engine = create_engine('sqlite://')
meta = MetaData()
meta.bind = engine
user_table = Table('user', meta,
                   Column("name", Text),
                   Column("full_name", Text))
user_table.create()
# Insert data into the user table
stmt = user_table.insert().values(name='Bob', full_name='Sponge Bob')
with Session(engine) as session:
    result = session.execute(stmt)
    session.commit()
# Select data into a pandas data frame
stmt = user_table.select().where(user_table.c.name == 'Bob')
df = pandas.read_sql_query(stmt, engine)

Another way importing the select statement:

from sqlalchemy import select
stmt = select(user_table).where(user_table.c.name == 'Bob')
df = pandas.read_sql_query(stmt, engine)

Another way using a session

with Session(engine) as session:
    stmt = session.query(user_table).filter(user_table.c.name == "Bob").statement
    df2 = pandas.read_sql(stmt, session.bind)

Read the whole table into pandas

df3 = pandas.read_sql_table("user", engine)

Stack Overflow Answer

Define and insert the iris dataset

Define an ORM structure for the iris dataset, then use pandas to insert the data into an SQLite database. Pandas inserts with the if_exists="append" argument so that it keeps the structure defined in SQL Alchemy.

import seaborn
import pandas
from sqlalchemy import create_engine
from sqlalchemy import MetaData, Table, Column, Text, Float
from sqlalchemy.orm import Session

Define metadata and create the table

engine = create_engine('sqlite://')
meta = MetaData()
meta.bind = engine
iris_table = Table('iris',
                   meta,
                   Column("sepal_length", Float),
                   Column("sepal_width", Float),
                   Column("petal_length", Float),
                   Column("petal_width", Float),
                   Column("species", Text))
iris_table.create()

Load data into the table

iris = seaborn.load_dataset("iris")
iris.to_sql(name="iris",
            con=engine,
            if_exists="append",
            index=False,
            chunksize=10 ** 6,
            )

Unique values

The SQL Alchemy iris_table from above can be used to build a select statement that extracts unique values:

from sqlalchemy import distinct, select
stmt = select(distinct(iris_table.c.species))
df = pandas.read_sql_query(stmt, engine)

PostgreSQL

Create a database engine with SQLAlchemy:

from sqlalchemy import create_engine
engine = create_engine('postgresql://myusername:mypassword@myhost:5432/mydatabase')

Blogs and Stackoverflow

SQLite

Create an SQLite in-memory database and add a table to it.

In [17]: from sqlalchemy import create_engine, inspect
    ...: from sqlalchemy import MetaData, Table, Column, Text
    ...: engine = create_engine('sqlite://')
    ...: meta = MetaData()
    ...: meta.bind = engine
    ...: user_table = Table('user', meta, Column("name", Text))
    ...: user_table.create()
    ...: inspector = inspect(engine)
    ...: inspector.has_table('user')
Out[17]: True

Create a file based database at a specific path:

# absolute path
e = create_engine('sqlite:////path/to/database.db')

Editors

Spyder

I have set the following shortcuts to be similar to RStudio:

  • Ctrl+H find and replace dialog

  • Ctrl+R run selection or current line

  • Ctrl+Shift+C comment/uncomment code block

  • F1 inspect current object (i.e. display function and classes documentation)

  • F2 go to function definition

  • Spyder has a data frame explorer https://docs.spyder-ide.org/current/panes/variableexplorer.html#dataframes

“DataFrames, like Numpy arrays, display in a viewer where you can show or hide “heatmap” colors, change the format and resize the rows and columns either manually or automatically”

Vim

I use Vim to edit python code and vim-slime to send the code to an ipython interpreter that runs inside a tmux pane. For more information, see my page on vim.html.

Reload a module

Auto reload a module in ipython

%load_ext autoreload
%autoreload 2

The following uses importlib.reload to illustrate the functionality and compares it with auto reload. Create a sample function and load it

import sys
import pathlib
from importlib import reload
tmp_dir = pathlib.Path("/tmp/this_dir")
tmp_dir.mkdir(exist_ok=True)
sys.path.append(str(tmp_dir))
f =  open(tmp_dir / "script.py",'w')
print("def compute_sum(i,j):\n    return i+j", file=f)
f.close()
from script import compute_sum
compute_sum(1,2)

Change the function and reload it using importlib.reload

f =  open(tmp_dir / "script.py",'w')
print("def compute_sum(i,j):\n    print('blabla')\n    return i+j", file=f)
f.close()
reload(sys.modules['script'])
from script import compute_sum
compute_sum(1,2)

Change the function and reload it using auto reload in ipython

%load_ext autoreload
%autoreload 2
f =  open(tmp_dir / "script.py",'w')
print("def compute_sum(i,j):\n    print('blibli')\n    return i+j", file=f)
f.close()
compute_sum(1,2)

Input Output

Comparison of IO files formats

  • CSV: the minimal common denominator, works everywhere. Great for small datasets to be shared across many languages and platforms.

  • NetCDF: supports rich metadata, complex data types, and is especially good at handling large datasets efficiently. Also supports various types of compression.

  • Parquet: columnar storage, efficient compression, and encoding schemes. Optimized for query performance.

CSV

Read and write csv
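
A minimal round trip with pandas (the file path is illustrative):

import pandas
df = pandas.DataFrame({'x': range(0, 3), 'y': ['a', 'b', 'c']})
df.to_csv("/tmp/df.csv", index=False)
df1 = pandas.read_csv("/tmp/df.csv")
df.equals(df1)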

Compressed csv

Write a compressed csv file as a gzip archive

import pandas
df = pandas.DataFrame({'x':range(0,3), 'y':['a','b','c']})
df.to_csv("/tmp/df.csv.gz", index=False, compression="gzip")

Write a compressed csv file as a zip archive, using a dict with the option “archive_name” (works only for the zip format)

compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('/tmp/df.csv.zip', index=False, compression=compression_opts)

Read compressed csv files

df1 = pandas.read_csv("/tmp/df.csv.gz")
df.equals(df1)
df2 = pandas.read_csv("/tmp/df.csv.zip")
df.equals(df2)

Many files in one zip archive

pandas.read_csv can only read a zip archive that contains a single file. If there is more than one file in the archive, you can use a ZipFile object to provide access to the correct file inside the archive, see this SO answer.

import zipfile
import pandas
zf = zipfile.ZipFile("archive_name.zip")
print("Files in the archive:", zf.namelist())
df = pandas.read_csv(zf.open("file_name.csv"))

pyarrow.csv

From an API

pandas.read_csv can read CSV files directly from the Comtrade data API. For example, using the default API URL for all countries:

import pandas
df1 = pandas.read_csv('http://comtrade.un.org/api/get?max=500&type=C&freq=A&px=HS&ps=2020&r=all&p=0&rg=all&cc=TOTAL&fmt=csv')

df2 = pandas.read_csv('http://comtrade.un.org/api/get?max=500&type=C&freq=A&px=HS&ps=2020&r=all&p=0&rg=all&cc=01&fmt=csv',
                       # Force the id column to remain a character column,
                       # otherwise str "01" becomes an int 1.
                       dtype={'Commodity Code': str, 'bli': str})

Then use df.to_csv to write the data frame to a csv file

 df1.to_csv("/tmp/comtrade.csv")

Data sources

Eurostat

Load Eurostat population projection data. Eurostat tab separated values files are, peculiarly, a mix of tab separated and comma separated values, which is annoying when loading the data into pandas.

Here is how to load the population projection dataset available at https://ec.europa.eu/eurostat/databrowser/view/PROJ_23NP/ into pandas:
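
A sketch of one way to handle this, assuming the dataset was downloaded from the bulk download facility as a tab separated file (the proj_23np.tsv.gz file name is an assumption); the first column packs several comma separated dimensions, and the remaining columns are years:

import pandas

df = pandas.read_csv("proj_23np.tsv.gz", sep="\t")
first_col = df.columns[0]                    # e.g. "freq,projection,sex,age,unit,geo\TIME_PERIOD"
dims = first_col.split("\\")[0].split(",")   # names of the comma separated dimensions
df[dims] = df[first_col].str.split(",", expand=True)
df = df.drop(columns=first_col)
# Reshape to a long format with one row per year
df = df.melt(id_vars=dims, var_name="year", value_name="value")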

Excel

This reads only one sheet:

pandas.read_excel("file_name.xlsx", "sheet_name")

Open all sheets in an excel file, and concatenate them to a single data frame with an additional column that contains the sheet name.

import pandas as pd
sheets_dict = pd.read_excel("file_name.xlsx", sheet_name=None)
all_data = pd.concat(
    [df.assign(sheet_name=s) for s, df in sheets_dict.items()],
    ignore_index=True
)
print(all_data)

Feather

Load a sample data frame and save it to a feather file

import pandas
import seaborn
iris = seaborn.load_dataset("iris")
iris.to_feather("/tmp/iris.feather")

Load the data from the feather file

iris2 = pandas.read_feather("/tmp/iris.feather")
iris2.equals(iris)

GAMS GDX

GDX files store data for the GAMS modelling platform. They can be loaded into pandas data frames with the gdxpds package as explained in the gdxpds documentation:

import gdxpds
gdx_file = r'C:\path_to_my_gdx\data.gdx'
dataframes = gdxpds.to_dataframes(gdx_file)
for symbol_name, df in dataframes.items():
    print("Doing work with {}.".format(symbol_name))

Markdown

Print a data frame to markdown, without the scientific notation https://stackoverflow.com/questions/66713432/suppress-scientific-notation-in-to-markdown-in-pandas

import pandas
import numpy as np
df = pandas.DataFrame({"x" : [1,1e7, 2], "y":[1e-5,100, np.nan]})
print(df.to_markdown())
print(df.to_markdown(floatfmt='.0f', index=False))

Print missing values as a minus sign https://stackoverflow.com/a/71165631/2641825

import pandas
import numpy as np
from tabulate import tabulate
df = pandas.DataFrame({"x": [1, 2], "y": [0, np.nan]})
print(tabulate(df,floatfmt=".0f", missingval="-",tablefmt="grid"))

print(tabulate(df.replace(np.nan, None),floatfmt=".0f", missingval="-",tablefmt="grid"))

In practice I use the data frame .to_markdown() method, which calls tabulate in the background, as explained in the pandas documentation:

print(df.to_markdown(floatfmt=".0f", index=False, missingval="-"))
print(df.replace(np.nan, None).to_markdown(floatfmt=".0f", index=False, missingval="-"))

See also

  • the jupyter section on markdown

Netcdf

You can use the command line tool ncdump to view the content of netcdf files

sudo apt install netcdf-bin
ncdump fuel.nc
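
To open the same file from Python, one option is xarray (a sketch, assuming xarray and a netCDF backend such as netCDF4 are installed, and that fuel.nc exists):

import xarray
ds = xarray.open_dataset("fuel.nc")
print(ds)               # dimensions, coordinates and variables
df = ds.to_dataframe()  # convert to a pandas data frame if needed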

Open a text file

Open a text file and print lines containing “error”

with open('filename.txt', 'r') as file:
    for line in file:
        if "error" in line.lower():
            print(line)

Parquet

Writing to one file defaults to snappy compression (https://en.wikipedia.org/wiki/Snappy_(compression)):

import pandas
import seaborn
iris = seaborn.load_dataset("iris")
iris.to_parquet("/tmp/iris.parquet")

Read back the file

iris3 = pandas.read_parquet("/tmp/iris.parquet")
iris3.equals(iris)

You can also use gzip compression for a smaller file size (but slower read and write times)

iris.to_parquet("/tmp/iris.parquet.gzip", compression='gzip') 

Partition column

Write to multiple files along a column used as partition variable

iris.to_parquet("/tmp/iris",partition_cols="species") 

The partitioned dataset is saved under a sub directory for each unique value of the partition variable. For example, there is a sub directory for each species in the /tmp/iris directory:

iris
├── species=setosa
│   └── 1609afe5535d4e2b94e65f1892210269.parquet
├── species=versicolor
│   └── 18dd7ae6d0794fd48dad37bf8950d813.parquet
└── species=virginica
    └── e0a9786251f54eed9f16380c8f5c3db3.parquet

One can read a single partition into memory:

virginica = pandas.read_parquet("/tmp/iris/species=virginica")

Note that it has lost the species column.

Read all files in memory

iris4 = pandas.read_parquet("/tmp/iris")

Note that the data frame is slightly different. Values are the same but the species column has become a categorical variable.

iris4.equals(iris)
# False
iris4.species
# ...
# Name: species, Length: 150, dtype: category
# Categories (3, object): ['setosa', 'versicolor', 'virginica']

Changing it back to strings makes the two data frames equal again.

iris4["species"] = iris4["species"].astype("str")
iris4.equals(iris)
# True

Filters

Read only part of the content from parquet files with a filter. See help(pyarrow.parquet.read_pandas) for arguments concerning the pyarrow engine. Reusing example files from the previous section:

selection = [("species", "in", ["versicolor","virginica"])]
iris5 = pandas.read_parquet("/tmp/iris", filters=selection)

In fact, the filter variable doesn’t have to be a partition variable.

selection = [("species", "in", ["versicolor","virginica"]), 
             ("petal_width", ">", 2.4)]
iris6 = pandas.read_parquet("/tmp/iris", filters=selection)

This works as well on the single file version

iris7 = pandas.read_parquet("/tmp/iris.parquet", filters=selection)
# Change column type for the comparison
iris6["species"] = iris6["species"].astype("str")
iris7.equals(iris6)

Depending on whether or not the filter is on the partition variable, read time can change by a lot. See the experiment in the next section.

Experiment with the parquet format using filters and partition columns.

Note that the dataset used to perform these comparisons is not made available here. I keep these for information purposes.

Compare a read of 2 countries with the read of the whole dataset

# start_time = timeit.default_timer()
# selection = [("reporter", "in", ["France","Germany"])]
# ft_frde = pandas.read_parquet(la_fo_data_dir / "comtrade_forest_footprint.parquet",
#                                             filters=selection)
# print("Reading 2 countries took:",timeit.default_timer() - start_time)
#
# start_time = timeit.default_timer()
# ft2 = pandas.read_parquet(la_fo_data_dir / "comtrade_forest_footprint.parquet")
# print("Reading the whole dataset took:",timeit.default_timer() - start_time)
#

Time comparison when the reporter is used as a partition column: it’s about 10 times faster!

# ft.to_parquet("/tmp/ft", partition_cols="reporter")
# start_time = timeit.default_timer()
# selection = [("reporter", "in", ["France","Germany"])]
# ft_frde2 = pandas.read_parquet("/tmp/ft", filters=selection)
# print("Reading 2 countries took:",timeit.default_timer() - start_time)
#
# # Save to a compressed csv file in biotrade_data
# # file_path = la_fo_data_dir / "comtrade_forest_footprint.csv.gz"
# # ft.to_csv(file_path, index=False, compression="gzip")

Also try the feather format.

# # Save to a feather file
# ft.to_feather(la_fo_data_dir / "comtrade_forest_footprint.feather")
#
# # Read time of a feather file
# start_time = timeit.default_timer()
# ft_frde2 = pandas.read_feather(la_fo_data_dir / "comtrade_forest_footprint.feather")
# print("Reading a feather file took:",timeit.default_timer() - start_time)

What is the difference between Apache Arrow and Apache Parquet?

Apache Arrow FAQ

“Parquet is a storage format designed for maximum space efficiency, using advanced compression and encoding techniques. It is ideal when wanting to minimize disk usage while storing gigabytes of data, or perhaps more. This efficiency comes at the cost of relatively expensive reading into memory, as Parquet data cannot be directly operated on but must be decoded in large chunks.

Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed (or only lightly so, when using dictionary encoding) but laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed.

Therefore, Arrow and Parquet complement each other and are commonly used together in applications. Storing your data on disk using Parquet and reading it into memory in the Arrow format will allow you to make the most of your computing hardware.”

What about “Arrow files” then?

Apache Arrow defines an inter-process communication (IPC) mechanism to transfer a collection of Arrow columnar arrays (called a “record batch”). It can be used synchronously between processes using the Arrow “stream format”, or asynchronously by first persisting data on storage using the Arrow “file format”.

The Arrow IPC mechanism is based on the Arrow in-memory format, such that there is no translation necessary between the on-disk representation and the in-memory representation. Therefore, performing analytics on an Arrow IPC file can use memory-mapping, avoiding any deserialization cost and extra copies.
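
A small sketch of writing and memory-mapping an Arrow IPC file with pyarrow (the file name and column are illustrative):

import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})
# Write the table to an Arrow IPC file
with pa.OSFile("/tmp/data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
# Memory-map the file and read it back without deserialization
with pa.memory_map("/tmp/data.arrow") as source:
    table2 = pa.ipc.open_file(source).read_all()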

Some things to keep in mind when comparing the Arrow IPC file format and the Parquet format:

  • Parquet is designed for long-term storage and archival purposes, meaning if you write a file today, you can expect that any system that says they can “read Parquet” will be able to read the file in 5 years or 10 years. While the Arrow on-disk format is stable and will be readable by future versions of the libraries, it does not prioritize the requirements of long-term archival storage.

  • Reading Parquet files generally requires efficient yet relatively complex decoding, while reading Arrow IPC files does not involve any decoding because the on-disk representation is the same as the in-memory representation.

  • Parquet files are often much smaller than Arrow IPC files because of the columnar data compression strategies that Parquet uses. If your disk storage or network is slow, Parquet may be a better choice even for short-term storage or caching.

One large parquet file or many smaller files?

Is it better to have one large parquet file or lots of smaller parquet files?

“Notice that Parquet files are internally split into row groups https://parquet.apache.org/documentation/latest/ So by making parquet files larger, row groups can still be the same if your baseline parquet files were not small/tiny. There is no huge direct penalty on processing, but opposite, there are more opportunities for readers to take advantage of perhaps larger/ more optimal row groups if your parquet files were smaller/tiny for example as row groups can’t span multiple parquet files.”

“Also larger parquet files don’t limit parallelism of readers, as each parquet file can be broken up logically into multiple splits (consisting of one or more row groups).”

“The only downside of larger parquet files is it takes more memory to create them. So you can watch out if you need to bump up Spark executors’ memory.”
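
To see how a given file is split into row groups, pyarrow can inspect the Parquet metadata; a sketch reusing the /tmp/iris.parquet file written above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("/tmp/iris.parquet")
print(pf.num_row_groups)                  # number of row groups in the file
print(pf.metadata.row_group(0).num_rows)  # rows in the first row group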

Pickle

Store a dictionary to a pickle file

import pickle
d = {"lkj":1}
with open('/tmp/d.pickle', 'wb') as file:
    pickle.dump(d, file)

Read from a pickle file

with open("/tmp/d.pickle", "rb") as file:
    e = pickle.load(file)
d == e

Neural Networks

Pytorch

Print the size of the output layer

import torch
import torch.nn as nn
x = torch.randn(28,28).view(-1,1,28,28)
model = nn.Sequential(
      nn.Conv2d(1, 32, (3, 3)),
      nn.ReLU(),
      nn.MaxPool2d((2, 2)),
      nn.Conv2d(32, 64, (3, 3)),
)
print(model(x).shape)

Objects

AA object types

type() displays the type of an object.

i = 1
print(type(i))
# <class 'int'>
x = 1.2
print(type(x))
# <class 'float'>
t = (1,2)
print(type(t))
# <class 'tuple'>
l = [1,2]
print(type(l))
# <class 'list'>

Check object types

Check if a variable is a string, int or float

isinstance("a", str)
isinstance(1,  int)
isinstance(1.2, float)

Convert between object types

Character to numeric

int("3")
float("3.33")
int("3.33")

Numeric to character

str(2)

Convert a list to a comma separated string

",".join(["a","b","c"])

Another example with the list of the last 5 years

import datetime
year = datetime.datetime.today().year
# Create a numeric list of years
YEARS = [year - i for i in range(1,6)]
# Convert each element of the list to a string
YEARS = [str(x) for x in YEARS]
",".join(YEARS)

Dictionary

Create a dictionary with curly braces

ceci = {'x':1, 'y':2, 'z':3}

Convert 2 lists into a dictionary with the dict built-in function:

dict(zip(['x', 'y', 'z'], [1, 2, 3]))

Dictionary comprehension

d = {n: True for n in range(5)}

Loop over the key and values of a dictionary

for key, value in ceci.items():
    print(key, "has the value", value)

Invert keys and values

{value:key for key,value in ceci.items()}

Iterator

The map function makes an iterator object of type map

it = map(lambda x: x + 1, range(3))  # avoid naming it iter, which would shadow the built-in
type(it)
[i for i in it]

List

Create a list

l1 = [1, 2, 3]
l2 = ["a", "b", "c"]

Create a list of strings using split (seen in this answer)

"slope, intercept, r_value, p_value, std_err".split(", ")

Remove an item from a list

Remove an element from a list of strings

li = ['a', 'b', 'c', 'd']
li.remove('c')
li
['a', 'b', 'd']

Reverse a list

Reverse a list with the reverse iterator

list(reversed(range(0,15)))

Reverse a list in place

bli = list(range(5))
print(bli)
bli.reverse()
print(bli)

List of tuples

How to flatten a list of tuples

nested_list = [(1, 2, 4), (0, 9)]

Using reduce:

from functools import reduce
reduce(lambda x, y: x + y, map(list, nested_list))
[1, 2, 4, 0, 9]

Using itertools.chain:

import itertools
list(itertools.chain.from_iterable(nested_list))

Using extend:

flat_list = []
for a_tuple in nested_list:
    flat_list.extend(list(a_tuple))                                                                                                                                     
flat_list
[1, 2, 4, 0, 9]

Set operations

Difference between sets

Difference between two sets:

set1 = {1,2,3}
set2 = {2,3,4}
set1 - set2
# {1}
set2 - set1
# {4}

Intersection and common set elements

Return a new set with elements common to the set and all others.

intersection(*others)
set & other & ...

bli = {1,2,3}
bli.intersection({1,2})
# {1, 2}
bli.intersection({1,2}, {1})

difference(*others)
set - other - ...

    Return a new set with elements in the set that are not in the others.

symmetric_difference(other)
set ^ other

    Return a new set with elements in either the set or other but not both.

For example check whether country names are all the same in 2 data frames

country_differences = set(df1["country"].unique()) ^ set(df2["country"].unique())
assert country_differences == set()

Subset and superset

Instances of set provide the following operations:

issubset(other)
set <= other

Test whether every element in the set is in other. For example, an SO answer using issubset:

l = [1,2,3]
m = [1,2]
set(m).issubset(l)
# True

seta = {1,2,3}
setb = {1,2}
setb.issubset(seta)

set < other

    Test whether the set is a proper subset of other, that is, set <= other and set != other.

issuperset(other)
set >= other

    Test whether every element in other is in the set.

set > other

    Test whether the set is a proper superset of other, that is, set >= other and set != other.
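
For example:

{1, 2, 3}.issuperset({1, 2})
# True
{1, 2, 3} > {1, 2}
# True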

Union of sets

Return a new set with elements from the set and all others.

union(*others)
set | other | ...

Example

{1,2}.union({3,4}, {10})

Note the following perform a union:

set(range(3,10)).union(set(range(5)))
set(range(3,10)) | set(range(5))

But this is not a union:

set(range(3,10)) or set(range(5))

Type hints

For example

from typing import Union
from pathlib import Path
import pandas

def csv_to_df(path: Union[str, Path]) -> pandas.DataFrame:
    return pandas.read_csv(path)

Type hints of pandas data frames

Meta programming

Class decorators

See also function decorators in another section below.

Meta class

Programming objects

Inheritance and composition

Below is an example of object inheritance where the Car and Boat classes inherit from a Vehicle class.

class Vehicle(object):

    def __init__(self, color, speed_max, garage=None):
        self.color = color
        self.speed_max = speed_max
        self.garage = garage

    def paint(self, new_color):
        self.color = new_color

    def go_back_home(self):
        self.position = self.go_to(self.garage.location)

class Car(Vehicle):

    def open_door(self):
        pass

class Boat(Vehicle):

    def open_balast(self):
        pass

honda = Car('bleu', 60)
gorgeoote = Boat('rouge', 30)
honda.paint('purple')

Note that the object can access its parent class’s attributes and methods through super().

Below is an example of object composition where the Garage class contains many Vehicle objects.

class Garage(object):

    def __init__(self, all_vehicles):
        self.all_vehicles = all_vehicles

    def mass_paint(self, new_color):
        for v in self.all_vehicles: v.paint(new_color)

    def build_car(self, color):
        new_car = Car(color, 90, self)
        self.all_vehicles.append(new_car)
        return new_car

    @property
    def location(self):
        return '10, 18'


mike = Garage([honda, gorgeoote])

mike.mass_paint('vert')

sport_car = mike.build_car('rouge')

Why do Python classes inherit object?

Why do Python classes inherit object?

So, what should you do?

In Python 2: always inherit from object explicitly. Get the perks.

In Python 3: inherit from object if you are writing code that tries to be Python agnostic, that is, it needs to work both in Python 2 and in Python 3. Otherwise don’t, it really makes no difference since Python inserts it for you behind the scenes.
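
A minimal illustration in Python 3, where the two definitions are equivalent:

class Foo(object):
    """Explicit inheritance from object, needed for new-style classes in Python 2"""

class Bar:
    """In Python 3 this implicitly inherits from object as well"""

Foo.__bases__ == Bar.__bases__ == (object,)
# True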

Git

Get the active branch name in a git repository with GitPython:

import git
hat = git.Repo(path="~/repos/eu_cbm/eu_cbm_hat")
hat.active_branch.name

Find the location of git repositories for libcbm_py and eu_cbm_hat, then create git repository objects with them:

import sys
import git
def find_sys_path(path_contains):
    """Find path that contains the given characters.
    Raise an error if there's not exactly one matching path"""
    matching_paths = [path for path in sys.path if path_contains in path]
    if len(matching_paths) != 1:
        msg = f"Expected one path containing {path_contains}, "
        msg += f"found {len(matching_paths)}\n"
        msg += f"{matching_paths}"
        raise ValueError(msg)
    return matching_paths[0]
repo_libcbm_py = git.Repo(find_sys_path("libcbm_py"))  # assumes the path contains "libcbm_py"
repo_eu_cbm_hat = git.Repo(find_sys_path("eu_cbm_hat"))

Checkout a branch if the repository is clean (no changes)

def checkout_branch(git_repo:git.repo.base.Repo, branch_name:str):
    """Check if a repository has any changes and checkout the given branch
    """
    if git_repo.is_dirty(untracked_files=True):
        msg = f"There are changes in {git_repo}.\n"
        msg += f"Not checking out the '{branch_name}' branch."
        raise RuntimeError(msg)
    git_repo.git.checkout(branch_name)
    print(f"Checked out branch: {branch_name} of {git_repo}.")

#Usage
checkout_branch(repo_libcbm_py, "2.x")

HTTP

File download

Zipped csv files

The following example uses urllib.request.urlopen to download a zip file containing Oceania’s crop production data from the FAO statistical database. In that example, it is necessary to define a minimal header, otherwise FAOSTAT throws an Error 403: Forbidden. It was posted as a StackOverflow Answer.

import shutil
import urllib.request
import tempfile

# Create a request object with URL and headers    
url = "http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_Oceania.zip"
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '}
req = urllib.request.Request(url=url, headers=header)

# Define the destination file
dest_file = tempfile.gettempdir() + '/' + 'crop.zip'
print(f"File located at:{dest_file}")

# Create an http response object
with urllib.request.urlopen(req) as response:
    # Create a file object
    with open(dest_file, "wb") as f:
        # Copy the binary content of the response to the file
        shutil.copyfileobj(response, f)

Based on https://stackoverflow.com/a/48691447/2641825 and https://stackoverflow.com/a/66591873/2641825, see also the documentation at https://docs.python.org/3/howto/urllib2.html

JSON files

The following loads a JSON file into a pandas data frame from the Comtrade API.

import urllib.request
import json
import pandas

url_reporter = "https://comtrade.un.org/Data/cache/reporterAreas.json"
url_partner = "https://comtrade.un.org/Data/cache/partnerAreas.json"

# attempt with pandas.io, with an issue related to nested json
pandas.io.json.read_json(url_reporter, encoding='utf-8-sig')
pandas.io.json.read_json(url_partner)
# `results` is a character column containing {'id': '4', 'text': 'Afghanistan'}.
# Is there a way to tell read_json to load the id and text columns directly instead?

SO answer

“Since the whole processing is done in the pd.io.json.read_json method, we cannot select the keys to direct to the actual data that we are after. So you need to run this additional code to get your desired results:”

df = pandas.io.json.read_json(url_reporter, encoding='utf-8-sig')
df2 = pandas.json_normalize(df.results.to_list())

Other attempt using lower level packages

req = urllib.request.Request(url=url_reporter)
with urllib.request.urlopen(req) as response:
    json_content = json.load(response)
    df = pandas.json_normalize(json_content['results'])

In [17]: df
Out[17]:
      id                    text
0    all                     All
1      4             Afghanistan
2      8                 Albania
3     12                 Algeria
4     20                 Andorra
..   ...                     ...
252  876  Wallis and Futuna Isds
253  887                   Yemen
254  894                  Zambia
255  716                Zimbabwe
256  975                   ASEAN

ipython

Add these options at the ipython command line to reload objects automatically while you are coding:

%load_ext autoreload   
%autoreload 2         

Autoindent

When pasting from another place, turn off auto indentation in ipython

%autoindent off

Debugging in ipython

Once an error occurs at the ipython command line, enter the %debug magic. Then you can move up the stack trace with:

`u`

Move down the stack trace with:

`d` 

Show code context of the error:

`l` 

Show available variable in the current context:

`a` 

To enter interactive mode and paste more than one line of code at a time:

interact

Breakpoint

To break at every step in a loop, use the breakpoint() function in any part of the code as explained in step by step debugging with ipython.
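
A minimal sketch:

for i in range(3):
    breakpoint()  # drops into the debugger at each iteration
    print(i)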

Type continue at the debugger prompt to resume execution until the next breakpoint.

Profiling in ipython

See also the main section profiling and measuring time.

Magic prun:

%prun statement
# Store the profiler output to a file
%prun -T /tmp/profiler.txt statement

Run a script with profiling enabled from the ipython console

%run -i -p script.py

Running from ipython

Run a file from the ipython console

%run -i test.py

https://github.com/ipython/ipython/issues/1001

Jupyter notebooks and lab

Call bash from a notebook

Prefix the bash call with an exclamation mark, for example:

!df -h

In fact the exclamation mark also works from an ipython shell. See also Difference between ! and % in Jupyter Notebooks.

Cloud platforms for Jupyter notebooks

Convert and execute notebooks programmatically

At the shell

To work from the ipython command line, it’s useful to load and execute the whole notebook inside the ipython shell with:

ipython -c "%run notebook.ipynb"

It’s also possible to convert the long notebooks to a python script with:

jupyter nbconvert --to script notebook.ipynb

Then run the whole notebook and start an interactive shell with:

ipython -i notebook.py

Otherwise I also sometimes open the synchronized markdown version of the notebook and execute a few cells using Vim slime to send them to a tmux pane where ipython is running.

Convert notebooks to python scripts or to html

Convert the long notebooks to a python script with:

jupyter nbconvert --to script notebook.ipynb

Notebooks can be converted from the File / Save and Export Notebook As / HTML menu. Or at the command line with nbconvert

jupyter nbconvert --to html notebook.ipynb

From python

Run an ipython notebook from python using nbconvert’s execute API:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import jupytext

####################
# Run one notebook #
####################
filename = 'notebook.ipynb'
with open(filename) as ff:
    nb_in = nbformat.read(ff, nbformat.NO_CONVERT)

# Read a notebook from the markdown file synchronized by jupytext
nb_md = jupytext.read('notebook.md')

# Run the notebook
ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
nb_out, resources = ep.preprocess(nb_in)  # preprocess returns a (notebook, resources) tuple

# Save the output notebook
with open(filename, 'w', encoding='utf-8') as f:
    nbformat.write(nb_out, f)

Saving fails in my case.

Dashboards and widgets

Interactive widgets

Documentation of interactive widgets

Create a text box

from ipywidgets import interact

def print_name(name):
    return("Name: " + name)

interact(print_name, name="Paul")

Create a drop down list for an interactive plot

import matplotlib.pyplot as plt
import seaborn
from ipywidgets import interact
iris = seaborn.load_dataset("iris").set_index("species")

def plot_iris(species):
    """Plot the given species"""
    df = iris.loc[species]
    ax = df.plot.scatter(x='petal_length', y='petal_width', title=species)
    ax.set_xlim(0,8)
    ax.set_ylim(0,4)

interact(plot_iris, species=list(iris.index.unique()))

Use the @interact decorator

@interact(species=list(iris.index.unique()))
def plot_iris(species):
    """Plot the given species"""
    df = iris.loc[species]
    ax = df.plot.scatter(x='petal_length', y='petal_width', title=species)
    ax.set_xlim(0,8)
    ax.set_ylim(0,4)

Display options for pandas data frames

https://pandas.pydata.org/docs/user_guide/options.html#frequently-used-options

Round all numbers

pandas.set_option('display.precision', 0)

Precision with 2 digits

pandas.set_option('display.precision', 2)

Scientific notation with 2 significant digits after the dot

pandas.set_option('display.float_format', '{:.2e}'.format)

Display all rows and columns of a data frame

Display all columns

pandas.options.display.max_columns = None

Display max rows

pandas.set_option('display.max_rows', 500)

With a context manager as in this answer

with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    ...  # some pandas display code

with pandas.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display(large_prod)

Download data from a jupyter notebook

I wrote this csv download function in an SO answer

def csv_download_link(df, csv_file_name, delete_prompt=True):
    """Display a download link to load a data frame as csv from within a Jupyter notebook"""
    df.to_csv(csv_file_name, index=False)
    from IPython.display import FileLink
    display(FileLink(csv_file_name))
    if delete_prompt:
        a = input('Press enter to delete the file after you have downloaded it.')
        import os
        os.remove(csv_file_name)

To get a link to a csv file, enter the above function and the code below in a jupyter notebook cell:

csv_download_link(df, 'df.csv')

Documentation with Jupyter

Using jupyter to write documentation

Help in a jupyter notebook

To get help on a function, enter function_name? in a cell. Quick help can also be obtained by pressing SHIFT + TAB.
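For example, in an ipython or notebook cell:

# Show the docstring and signature
sum?
# Show the source code as well, when available
import pandas
pandas.DataFrame.head??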

Install Jupyter

To install Jupyter notebooks on python3:

pip3 install jupyter notebook

Then start the notebook server as such:

jupyter notebook

Plots in notebooks

It is sometimes necessary to add the following at the beginning of a jupyter notebook so that plots are displayed inline

%matplotlib inline

Change the size of a plot displayed in a notebook

import seaborn
p = seaborn.lineplot(x="year", y="value", hue="source", data=df1)
p.figure.set_figwidth(15)

Security and authentication on a public server

Jupyter notebook with authentication

TOC Table of contents in your notebooks

Install jupyter_contrib_nbextensions

python3 -m pip install --user jupyter_contrib_nbextensions
python3 -m jupyter contrib nbextension install --user

Activate the table of contents extension:

python3 -m jupyter nbextension enable toc2/main

There are many other extensions available in this package. Optionally you can install the jupyter notebook extension configurator (not needed)

python3 -m pip install --user jupyter_nbextensions_configurator
jupyter nbextensions_configurator enable --user

This will make a configuration interface available at:

http://localhost:8888/nbextensions

Using the old jupyter table of contents extension

jupyter nbconvert --to markdown mynotebook.ipynb
jupyter nbconvert --to html mynotebook.ipynb

For a colleague using Anaconda, the Installing jupyter_contrib_nbextensions documentation specifies that

“There are conda packages for the notebook extensions and the jupyter_nbextensions_configurator available from conda-forge. You can install both using”

conda install -c conda-forge jupyter_contrib_nbextensions

Jupyter and git version control

Jupytext with markdown and git

Convert notebooks to markdown so they are easier to track in git.

Install https://github.com/mwouts/jupytext

python3 -m pip install --user jupytext

More commands:

python3 -m jupyter notebook --generate-config
vim ~/.jupyter/jupyter_notebook_config.py

Add this line:

c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"

And also this line if you always want to pair notebooks with their markdown counterparts:

c.ContentsManager.default_jupytext_formats = "ipynb,md"

More commands:

python3 -m jupyter nbextension install jupytext --py --user
python3 -m jupyter nbextension enable  jupytext --py --user

Add syncing to a given notebook:

# Markdown sync
jupytext --set-formats ipynb,md --sync ~/repos/example_repos/notebooks/test.ipynb
# Python sync
jupytext --set-formats ipynb,py --sync ~/repos/example_repos/notebooks/test.ipynb

Clear cell output before git commit

As an alternative to Jupytext, you can also clear the output of all cells before committing the notebook. That way the notebooks only contain code and not the output of tables and plots (which can sometimes take several megabytes of data).

nbstripout

  • nbstripout https://github.com/kynan/nbstripout

    “This does mostly the same thing as the Clear All Output command in the notebook UI.”

    • In pre-commit mode

      “nbstripout is used as a git hook to strip any .ipynb files before committing. This also modifies your working copy!”

    • In regular mode

      “In its regular mode, nbstripout acts as a filter and only modifies what git gets to see for committing or diffing. The working copy stays intact.”

    • It’s probably better to use the regular filter mode.

    • Install nbstripout

      pip install --upgrade nbstripout

    • Configure a git repository to use nbstripout

      cd git_repos
      nbstripout --install

    • Uninstall nbstripout from the current repository “(remove the git filter and attributes)”

      cd git_repos
      nbstripout --uninstall

  • nbstripout-fast https://pypi.org/project/nbstripout-fast/ is a 200x faster implementation in Rust that avoids Python startup times. They advertise that git status can take 40s on large repos with nbstripout, while their tool brings it down to about 1s.

Errors, exceptions and logging

Handling Exceptions with try and except statements

Python documentation on Handling Exceptions.

while True:
    try:
        x = int(input("Please enter a number: "))
        break
    except ValueError:
        print("Oops!  That was no valid number.  Try again...")

Re-raise the exception using from to track the original exception

for i in [1,0]:
    try:
        print(1/i)
    except Exception as e:
        msg = f"failed to compute: 1/{i} {str(e)}"
        raise ValueError(msg) from e

The error message will contain:

"The above exception was the direct cause of the following exception"

The same example without “from”

for i in [1,0]:
    try:
        print(1/i)
    except Exception as e:
        msg = f"failed to compute: {str(e)}"
        raise ValueError(msg)

The error message will contain:

"During handling of the above exception, another exception occurred"

Simple Exception message capturing with a print statement

for i in [1,0]:
    try:
        print(1/i)
    except Exception as e:
        print("Failed to compute:", str(e))

Data exceptions

Handle empty data in pandas

import pandas
try:
    df = gfpmx_data[s]
    columns = df.columns
except pandas.errors.EmptyDataError:
    print(f"no data in file {s}")
    columns = []

Capture the first few lines of an exception and re-raise it using “from”:

from numpy.testing import assert_allclose

try:
    assert_allclose(
        ds[var].loc[COUNTRIES, t],
        ds_ref[var].loc[COUNTRIES, t],
        rtol=rtol,
    )
except AssertionError as e:
    first_line_of_error = ", ".join(str(e).split('\n')[:3])
    msg = f"{ds.product}, {var}: {first_line_of_error}"
    raise AssertionError(msg) from e

Raising Exceptions

Python documentation on Raising Exceptions

raise Exception('spam')
raise ValueError('Not an acceptable value')
raise NameError("Wrong name: %s" % "quack quack quack")

Display variables in the error message:

raise ValueError("This is wrong: %s" % "wrong_value")
msg = "Value %s and value %s have problems."
raise ValueError(msg % (1, 2))

Default Exception

Built-in Exceptions

  • KeyError is raised when a value is missing, for example if an environment variable is undefined.

    import os
    os.environ["avarthatdoesntexist"]

  • ValueError can be used to raise exceptions about data issues.

Warnings

Send a warning to the user

import warnings
warnings.warn("there is no data")

Ignore warnings

Do not display the following warnings:

import warnings
warnings.filterwarnings("ignore", message="option is deprecated")
warnings.filterwarnings("ignore", ".*layout has changed to tight.*", category=UserWarning)
# related to https://github.com/mwaskom/seaborn/issues/3462
warnings.filterwarnings("ignore", "is_categorical_dtype") 
warnings.filterwarnings("ignore", "use_inf_as_na")

Logging

docs.python.org logging cookbook

Pylint error: Use lazy % formatting in logging functions

Answer to Lazy evaluation of strings in python logging: comparing % with .format

The documentation https://docs.python.org/2/library/logging.html suggests the following for lazy evaluation of string interpolation:

logging.getLogger().debug('test: %i', 42)
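A slightly fuller sketch; the format arguments are passed separately so the string is only interpolated if the record is actually emitted:

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
# Lazy %-style interpolation: arguments are passed separately to the logging call
logger.debug("test: %i and %s", 42, "spam")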

Functions

Functions in python can be defined with

def add_one(x):
    return x + 1
add_one(1)

# 2

Annotations

PEP 3107

Annotations for parameters take the form of optional expressions that follow the parameter name:

def foo(a: expression, b: expression = 5):
    ...

Annotations for a function’s return value are done like so:

def sum() -> expression:
    ...

SO example

def kinetic_energy(m:'in KG', v:'in M/S') -> 'Joules':
    return 1/2*m*v**2

kinetic_energy.__annotations__
# {'m': 'in KG', 'v': 'in M/S', 'return': 'Joules'}

The pandas code base doesn’t use it everywhere, there are functions that use the standard sphinx type of documentation timedeltas.py#L1094. I have the impression that the annotations are used for the package internal functions, while the sphinx documentation is used for the functions that are exposed to the outside users. And in the same script, they use both sphinx documentation and type annotations timedeltas.py#L952.

Type checking in python

There are several things to know about up front when it comes to type hinting in Python. Let’s look at the pros of type hinting first:

  • Type hints are a nice way to document your code in addition to docstrings
  • Type hints can make IDEs and linters give better feedback and better autocomplete
  • Adding type hints forces you to think about types, which may help you make good decisions during the design of your applications.

Adding type hinting isn’t all rainbows and roses though. There are some downsides:

  • The code is more verbose and arguably harder to write
  • Type hinting adds development time
  • Type hints only work in Python 3.5+. Before that, you had to use type comments
  • Type hinting can have a minor start up time penalty in code that uses it, especially if you import the typing module.
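A minimal sketch of type hints on a small function (Optional comes from the typing module mentioned above):

from typing import Optional

def add_one(x: int, label: Optional[str] = None) -> int:
    """Add one to x, optionally printing a label first."""
    if label is not None:
        print(label)
    return x + 1

add_one(1)
# 2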

Call by reference or call by value

When using numpy arrays, python displays a call-by-reference behaviour

import numpy as np

a = np.array([1,2])

def changeinput(x, scalar):
    x[0] = scalar

changeinput(a,3)

a
# array([3, 2])

This is really weird coming from R, which has a copy-on-modify principle.

The R Language Definition says this (in section 4.3.3 Argument Evaluation)

“The semantics of invoking a function in R argument are call-by-value. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame. [Emphasis added]”

Decorators

Decorators are a way to wrap a function around another function. They are useful for repeating a pattern of behaviour around a function.

I have used decorators to cache the function output along a data processing pipeline.
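A minimal sketch of a decorator that wraps a function to print its name before each call (the log_call name is just for illustration):

import functools

def log_call(func):
    """Print the wrapped function's name before calling it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_call
def add_one(x):
    return x + 1

add_one(1)
# Calling add_one
# 2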

Property and cached property

Since python 3.8 there is also a @cached_property decorator functools.cached_property

“Transform a method of a class into a property whose value is computed once and then cached as a normal attribute for the life of the instance. Similar to property(), with the addition of caching. Useful for expensive computed properties of instances that are otherwise effectively immutable.”

Example (by https://www.perplexity.ai/search/5cc7a6e1-ef72-418d-b7ae-d9049815b6f8?s=c#5cc7a6e1-ef72-418d-b7ae-d9049815b6f8):

from functools import cached_property

class MyClass:
    def __init__(self):
        self._data = [1, 2, 3, 4, 5]

    @cached_property
    def sum(self):
        print("Computing sum...")
        return sum(self._data)

Usage:

obj = MyClass()
print(obj.sum)  # prints "Computing sum... 15"
print(obj.sum)  # prints "15"

There is also a @cache decorator functools.cache that creates:

“a thin wrapper around a dictionary lookup for the function arguments. Because it never needs to evict old values, this is smaller and faster than lru_cache() with a size limit.”
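A minimal sketch using functools.cache (available since python 3.9):

from functools import cache

@cache
def fib(n):
    """Naive recursive Fibonacci, made fast because previous calls are cached."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(35)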

Deprecate arguments

Deprecate the old name of a function argument

import warnings

def agg_trade_eu_row(df, grouping_side="partner", index_side=None):
    if index_side is not None:
        warnings.warn("index_side is deprecated; use grouping_side", DeprecationWarning, 2)
        grouping_side = index_side

This SO question asks how to create an argument alias without changing the number of arguments to the function; one possible approach is sketched below.
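A possible sketch (not the original SO answer): accept the old name through **kwargs so the visible signature keeps a single argument.

import warnings

def agg_trade_eu_row(df, grouping_side="partner", **kwargs):
    """Aggregate trade flows (body omitted); accepts the deprecated alias index_side."""
    if "index_side" in kwargs:
        warnings.warn("index_side is deprecated; use grouping_side", DeprecationWarning, 2)
        grouping_side = kwargs.pop("index_side")
    ...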

Docstring Documentation

Document python functions with the sphinx convention SO Answer

def send_message(sender, recipient, message_body, priority=1) -> int:
   """
   Send a message to a recipient

   :param str sender: The person sending the message
   :param str recipient: The recipient of the message
   :param str message_body: The body of the message
   :param priority: The priority of the message, can be a number 1-5
   :type priority: integer or None
   :return: the message id
   :rtype: int
   :raises ValueError: if the message_body exceeds 160 characters
   :raises TypeError: if the message_body is not a basestring
   """

Mapping

Cartopy drawing maps

Raster

See the sections on

  • Rasterio
  • xarray

Math

How to do maths in python 3 with operators

2 to the power of 3

2**3
# 8

Floor division and modulo

Floor division

5//3
# 1
# Use it to extract the year of a Comtrade period
202105 // 100

Modulo

5%3
# 2
# Use it to extract the last 2 digits of an integer, e.g. the month of a Comtrade period
202105 % 100
# 5

Both at the same time

divmod(5,3)
# (1, 2)

Sympy

https://www.sympy.org/en/index.html

“SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.”
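A minimal example, assuming sympy is installed:

import sympy

x = sympy.symbols("x")
expr = (x + 1)**2
print(sympy.expand(expr))   # x**2 + 2*x + 1
print(sympy.diff(expr, x))  # 2*x + 2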

Modelling and statistics

Optimization with Pyomo

  • Wikipedia pyomo

  • Pyomo documentation

  • Pyomo Cookbook

    “Pyomo is well suited to modeling simple and complex systems that can be described by linear or nonlinear algebraic, differential, and partial differential equations and constraints.”
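A minimal sketch of a linear programme in Pyomo, assuming pyomo and the glpk solver are installed:

from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, maximize, SolverFactory)

# Maximise 3x + 2y subject to x + y <= 4
model = ConcreteModel()
model.x = Var(domain=NonNegativeReals)
model.y = Var(domain=NonNegativeReals)
model.obj = Objective(expr=3 * model.x + 2 * model.y, sense=maximize)
model.budget = Constraint(expr=model.x + model.y <= 4)

# Requires a solver such as glpk to be installed on the system
SolverFactory("glpk").solve(model)
print(model.x.value, model.y.value)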

OpenAI chat completion

Ask Chat GPT to complete a message

import openai
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What are the trade-offs around deadwood in forests?"}]
)
print(response)

Print available models

models = openai.Model.list()
print([m["id"] for m in models.data])

Econometrics

Panel data regressions

Numpy vectors and matrices (arrays)

All examples below are based on the numpy package being imported as np :

import numpy as np

Logical operators and binary operations

I mostly use binary operators on boolean arrays for index selections in pandas data frames.

Bitwise and

np.array([True, True]) & np.array([False, True])

Bitwise not

~np.array([True, False])

They are equivalent to logical operators numpy.logical_and, numpy.logical_not for logical arrays.

A SO answer quotes the NumPy v1.15 Manual

> If you know you have boolean arguments, you can get away with using
> NumPy’s bitwise operators, but be careful with parentheses, like this:
> `z = (x > 1) & (x < 2)`. The absence of NumPy operator forms of
> `logical_and` and `logical_or` is an unfortunate consequence of Python’s
> design.

So one can also use ~ for logical_not and | for logical_or.

Bitwise and

“The number 13 is represented by 00001101. Likewise, 17 is represented by 00010001. The bit-wise AND of 13 and 17 is therefore 000000001, or 1”

np.bitwise_and(13, 17)
# 1

The & operator can be used as a shorthand for np.bitwise_and on ndarrays.

x1 = np.array([2, 5, 255])
x2 = np.array([3, 14, 16])
x1 & x2

Indexing Multi-dimensional arrays and masks

Numpy array indexing

“Basic slicing extends Python’s basic concept of slicing to N dimensions. Basic slicing occurs when obj is a slice object (constructed by start:stop:step notation inside of brackets), an integer, or a tuple of slice objects and integers.” […] “The basic slice syntax is i:j:k where i is the starting index, j is the stopping index, and k is the step (\(k\neq0\)).”

“Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).”

“Integer array indexing allows selection of arbitrary items in the array based on their N-dimensional index. Each integer array represents a number of indexes into that dimension.”

x[0:3,0:2]
# array([[0.64174957, 0.18540429],
#        [0.97558697, 0.69314058],
#        [0.51646795, 0.71055115]])

In this case because every row is selected, it is the same as:

x[:,0:2]

Examples modified from https://docs.scipy.org/doc/numpy/user/basics.indexing.html

y = np.arange(35).reshape(5,7)
print(y[np.array([0,2,4]), np.array([0,1,2])])

print('With slice 1:3')
print(y[np.array([0,2,4]),1:3])
print('is equivalent to')
print(y[np.array([[0],[2],[4]]),np.array([[1,2]])])
# This one is the same but transposed, which is weird
print(y[np.array([[0,2,4]]),np.array([[1],[2]])])
# Notice the difference with the following
print(y[np.array([0,2,4]),np.array([1,2,3])])

Masks: we wish to mark the fourth entry as invalid. The easiest is to create a masked array (numpy.ma.masked_array):

x = np.array([1, 2, 3, -1, 5])
mx = np.ma.masked_array(x, mask=[0, 0, 0, 1, 0])
print(x.sum(), mx.sum())
# 10 11

Matrix creation and shapes

Create a vector

a = np.array([1,2,3])

Create a matrix

b = np.array([[1,2,3],[5,6,6]])

Shape

a.shape
# (3,)
b.shape
# (2, 3)

Matrix of zeroes

np.zeros([2,2])
#array([[0., 0.],
#       [0., 0.]])

Create a matrix with an additional dimension

np.zeros(b.shape + (2,))
array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

Transpose

b.transpose()
# array([[1, 5],
#        [2, 6],
#        [3, 6]])
c = b.transpose()

Math functions in numpy:

np.cos()
np.sin()
np.tan()
np.exp()

min and max

x = np.array([1,2,3,4,5,-7,10,-8])
x.max()
# 10
x.min()
# -8

Matrix multiplication

Matrix multiplication matmul

np.matmul(a,c)
# array([14, 35])

# Can also be written as
a @ c
# array([14, 35])

Otherwise the multiplication symbol implements an element-wise multiplication, also called the Hadamard product. It only works on 2 matrices of the same dimensions. Element-wise multiplication is used for example in convolution kernels.

b * b
# array([[ 1,  4,  9],
#        [25, 36, 36]])

So here is again an example showing the difference between element-wise and matrix multiplication:

m = np.array([[0,1],[2,3]])

Element wise multiplication :

m * m 
# array([[0, 1],
#        [4, 9]])

Matrix multiplication :

m @ m
# array([[ 2,  3],
#        [ 6, 11]])

Norm of a matrix

Linear algebra functionalities are provided by numpy.linalg For example the norm of a matrix or vector:

np.linalg.norm(x)
# 16.3707055437449
np.linalg.norm(np.array([3,4]))
# 5.0
np.linalg.norm(a)
# 3.7416573867739413

Norm of the matrix for the regularization parameter in a machine learning model

bli = np.array([[1, 1, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, 1, 0.0, 0.0, 0.0]])
sum(np.linalg.norm(bli, axis=0)**2)
# 3.0000000000000004
sum(np.linalg.norm(bli, axis=1)**2)
# 3.0000000000000004
np.linalg.norm(bli)**2
# 2.9999999999999996

Append vs concatenate

x = np.array([1,2])
print(np.append(x,x))
# [1 2 1 2]
print(np.concatenate((x,x),axis=None))
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
print(np.concatenate((a, b), axis=0))
print(np.concatenate((a, b.T), axis=1))
print(np.concatenate((a, b), axis=None))

Power

Power of an array

import numpy as np
a = np.arange(4).reshape(2, 2)
print(a)
print(a**2)
print(a*a)
np.power(a, 2)

Broadcast the power operator

np.power(a, a)

Random vector or matrices

x = np.random.random([3,4])
x
# array([[0.64174957, 0.18540429, 0.7045183 , 0.44623567],
#        [0.97558697, 0.69314058, 0.32469324, 0.82612627],
#        [0.51646795, 0.71055115, 0.74864751, 0.2142459 ]])

Random choice with a given probability: choose zero with probability 0.1 and one with probability 0.9.

for i in range(10):
    print(np.random.choice(2, p=[0.1, 0.9]))
    
print(np.random.choice(2, 10, p=[0.1, 0.9]))
print(np.random.choice(2, (10,10), p=[0.1, 0.9]))

[[1 1 1 1 1 1 1 1 1 0]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 0]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 0 1 1]
 [1 1 1 0 1 1 1 1 1 1]
 [1 1 0 1 1 1 1 1 0 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 0 1 1 0]
 [1 1 1 1 1 1 1 1 1 1]]

Error if probabilities do not sum up to one

print(np.random.choice(2, p=[0.1, 0.8]))

# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)
# <ipython-input-31-8a8665287968> in <module>
# ----> 1 print(np.random.choice(2, p=[0.1, 0.8]))

# mtrand.pyx in numpy.random.mtrand.RandomState.choice()

# ValueError: probabilities do not sum to 1

Pandas data frames

All code below assumes you have imported pandas

import pandas 

Assign values

Create a data frame

You can create a data frame by passing a dictionary of lists with column names as keys

df = pandas.DataFrame({'x':range(0,3), 
                  'y':['a','b','c']})

#      x  y
#   0  0  a
#   1  1  b
#   2  2  c

By passing data as a list of lists and specifying the name of the columns in the columns argument.

data = [['APOLLOHOSP', 8, 6, 'High', 'small'],
        ['COROMANDEL', 9, 9, 'High', 'small'],
        ['SBIN', 10, 3, 'Medium', 'large']]
pandas.DataFrame(data=data, columns=["code", "Growth", "Value", "Risk", 
"Mcap"])

Or by passing a list of tuples and defining the columns argument

pandas.DataFrame(
    list(zip(range(0,3), ['a','b','c'])), 
    columns=["x", "y"]
)

Random numbers

import numpy as np
df = pandas.DataFrame({'x':np.random.random(100)})

Use the assign method

Create a new column based on another one

df = pandas.DataFrame({'a':range(0,3), 
                       'b':['p','q','r'], 
                       'c':['m','n','o']})
df["d"] = df["a"] * 2

Use the assign method

df.assign(e = lambda x: x["a"] * 3)
df.assign(e = lambda x: x["a"] / 1e3)

Sum columns together and compute a share

Sum all columns in an assign and use it to compute a share

import seaborn
iris = seaborn.load_dataset("iris").set_index("species")
iris.assign(sp_sum = lambda x: x.sum(axis=1),
            sl_share = lambda x: x.sepal_length / x.sp_sum)

Aggregated sum by group and compute a share

inv_agg = inventory_all.groupby(["mgmt_strategy"]).agg(area = ("area","sum"))
inv_agg = inv_agg.assign(share = lambda x: x.area / x.area.sum())

Recursive computation x(t) depends on x(t-1)

A recursive function is difficult to vectorize because each input at time t depends on the previous input at time t-1. When possible use a year index for shorter selection with .loc().

import pandas
df = pandas.DataFrame({'year':range(2020,2024),'a':range(3,7)})
df1 = df.copy()
# Set the initial value
t0 = min(df1.year)
df1.loc[df1.year==t0, "x"] = 0

# Doesn't work when the right side of the equation is a pandas.core.series.Series
for t in range (min(df1.year)+1, max(df1.year)+1):
    df1.loc[df1.year==t, "x"] = df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]
print(df1)
#    year  a    x
# 0  2020  3  0.0
# 1  2021  4  NaN
# 2  2022  5  NaN
# 3  2023  6  NaN
print(type(df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]))
# <class 'pandas.core.series.Series'>

# Works when the right side of the equation is a numpy array
for t in range (min(df1.year)+1, max(df1.year)+1):
    df1.loc[df1.year==t, "x"] = (df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]).unique()
    #break
print(df1)
#    year  a     x
# 0  2020  3   0.0
# 1  2021  4   3.0
# 2  2022  5   7.0
# 3  2023  6  12.0
print(type((df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]).unique()))
# <class 'numpy.ndarray'>

# Assignement works directly when the .loc() selection is using a year index
df2 = df.set_index("year").copy()
# Set the initial value
df2.loc[df2.index.min(), "x"] = 0
for t in range (df2.index.min()+1, df2.index.max()+1):
    df2.loc[t, "x"] = df2.loc[t-1, "x"] + df2.loc[t-1,"a"]
    #break
print(df2)
#       a     x
# year
# 2020  3   0.0
# 2021  4   3.0
# 2022  5   7.0
# 2023  6  12.0
print(type(df2.loc[t-1, "x"] + df2.loc[t-1,"a"]))
# <class 'numpy.float64'>

#SO answer using cumsum

Our real problem is more complicated since there is a multiplicative and an additive component

import pandas
df3 = pandas.DataFrame({'year':range(2020,2024),'a':range(3,7), 'b':range(8,12)})
df3 = df3.set_index("year").copy()
# Set the initial value
initial_value = 1
df3.loc[df3.index.min(), "x"] = initial_value
# Use a loop
for t in range (df3.index.min()+1, df3.index.max()+1):
    df3.loc[t, "x"] = df3.loc[t-1, "x"] * df3.loc[t-1, "a"] + df3.loc[t-1, "b"]
# Use cumsum and cumprod
# Closed form: x(t) = cumprod_a(t) * (x0 + sum over s<t of b(s) / a.cumprod()[s])
df3["cumprod_a"] = df3.a.cumprod().shift(1).fillna(1)
df3["cumsum_cumprod_a_b"] = (df3.b / df3.a.cumprod()).cumsum().shift(1).fillna(0) * df3.cumprod_a
df3["x2"] = df3.cumprod_a * initial_value + df3.cumsum_cumprod_a_b
print(df3)
  • type(df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]) is a pandas series while type(df2.loc[t-1, "x"] + df2.loc[t-1,"a"]) is a numpy float. Why are types different?

  • Is there a better way to write a recursive .loc() assignment than to use .unique()?

See also:

“It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation will invalidate the iterator, causing unexpected behavior.” […] “To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.”

Set values with .loc

Create an example data frame

import pandas
df = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
                      index=['cobra', 'viper', 'sidewinder'],
                      columns=['max_speed', 'shield'])

Set value for all items matching the list of labels

df.loc[['viper', 'sidewinder'], ['shield']] = 50

#             max_speed  shield
# cobra               1       2
# viper               4      50
# sidewinder          7      50

Map values with a dictionary

df = pandas.DataFrame({'lettre':['p','q','r','r','s','v','p']})
mapping = {'p':'pour','q':'quoi','r':'roi'}
df["mot"] = df["lettre"].map(mapping)

Categorical data types

Set categorical data type

import pandas
df = pandas.DataFrame({'id':["a", "b", "c"], 'x':range(3)})
list_of_ids = ["b", "c", "a"]
df['id'] = pandas.Categorical(df['id'], categories=list_of_ids, ordered=True)
df.sort_values('id', inplace=True)
df["id"]
# Out:
# 0    a
# 1    b
# 2    c
# Name: id, dtype: category
# Categories (3, object): ['b' < 'c' < 'a']

Remove a category

 df["element"].cat.remove_categories(["nai_merch"])

Convert data frames to other objects or types

See also the IO section to convert data frames to other files.

Convert 2 columns to a dictionary

df = pandas.DataFrame({'a':range(0,3), 
                       'b':['p','q','r'], 
                       'c':['m','n','o']})
df.set_index('b').to_dict()['c']

Convert a column to numeric or string and check

Convert a string to a numeric type using the argument errors="coerce":

s = pandas.Series(["1", "2", "a"])
pandas.to_numeric(s, errors="coerce")

Check if a column is of numeric or string type

pandas.api.types.is_numeric_dtype(s)
pandas.api.types.is_string_dtype(s)

The following would return an error

s.astype(float)
s.astype(int)

And using the df[col].astype() method with errors="ignore" would not convert at all:

s.astype(float, errors="ignore")
pandas.to_numeric(s, errors="ignore")

Convert an integer to a string type

s = pandas.Series(range(3))
s.astype(str)

Convert one value to a scalar

SO question Convert to scalar

  • iat()

    “Access a single value for a row/column pair by integer position. Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or set a single value in a DataFrame or Series.”

  • squeeze()

    “Squeeze 1 dimensional axis objects into scalars. Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged. This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.”
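A minimal example of both:

import pandas
df = pandas.DataFrame({"a": [42]})
df.iat[0, 0]       # 42, by integer position
df["a"].squeeze()  # 42, the single element Series squeezed to a scalar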

Compare data frames

Pure equality example from pandas.DataFrame.equals

import pandas
df = pandas.DataFrame({1: [10], 2: [20]})
exactly_equal = pandas.DataFrame({1: [10], 2: [20]})
df.equals(exactly_equal)
different_column_type = pandas.DataFrame({1.0: [10], 2.0: [20]})
df.equals(different_column_type)
different_data_type = pandas.DataFrame({1: [10.0], 2: [20.0]})
df.equals(different_data_type)

Numpy assert all close

Testing closeness (for example with floating point results computed in another software)

import numpy as np
np.testing.assert_allclose([1,2,3], [1.001,2,3],rtol=1e-2)
np.testing.assert_allclose([1,2,3], [1.001,2,3],rtol=1e-6)

Other examples

df.equals(df+1e-6)
np.testing.assert_allclose(df,df+1e-7)
np.testing.assert_allclose(df,df+1e-3)

Concatenate and merge

Concatenate data frames

Example based on https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects modified

df1 = pandas.DataFrame(
    {
        "A": range(3),
        "B": range(3,6)
    }
)

df2 = pandas.DataFrame(
    {
        "A": range(4),
        "B": range(4,8),
        "C": ["w", "x", "y", "z"]
    }
)

df3 = pandas.DataFrame(
    {
        "A": range(10,13),
        "B": range(13,16)
    }
)

result = pandas.concat([df1, df2, df3]).reset_index(drop=True)

Concatenate series

Concatenate two series (SO answer). Notice the difference between the default axis=0, which concatenates on the index, and axis=1, which concatenates on the columns.

import pandas
s1 = pandas.Series([1, 2, 3], index=['A', 'B', 'c'], name='s1')
s2 = pandas.Series([4, 5, 6], index=['A', 'B', 'D'], name='s2')
pandas.concat([s1, s2], axis=0)
pandas.concat([s1, s2], axis=1)

Merge or join

Stackoverflow Pandas merging

pandas merge right_on do not keep variable name Stack Overflow

Proposes 3 solutions (sketched in the code below):

  • rename the original data frame to merge on variables that have the same name

  • merge and drop the redundant column with a different name

  • set the merge column as an index in the right data frame and use right_index=True
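A sketch of the three options on hypothetical reporter_code / country_code columns:

import pandas

left = pandas.DataFrame({"reporter_code": [1, 2], "value": [10, 20]})
right = pandas.DataFrame({"country_code": [1, 2], "country": ["a", "b"]})

# 1. Rename so both sides share the merge variable name
left.merge(right.rename(columns={"country_code": "reporter_code"}), on="reporter_code")

# 2. Merge on differently named columns, then drop the redundant one
left.merge(right, left_on="reporter_code", right_on="country_code").drop(columns="country_code")

# 3. Set the merge column as the index of the right data frame and use right_index=True
left.merge(right.set_index("country_code"), left_on="reporter_code", right_index=True)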

Columns

Data types

Three example data types

df = pandas.DataFrame({"a":range(0,3),
                       "b": ["a", "b", "c"],
                       "c": [0.1, 0.2, 0.3]})
df.info()
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   a       3 non-null      int64
#  1   b       3 non-null      object
#  2   c       3 non-null      float64

Each data type has specific methods attached to it. For example the string accessors methods

df.b.str.contains("c")

Categorical data

Reusing the example from above

df["b"] = pandas.Categorical(
      df["b"], categories=["b", "c", "a"], ordered=True
  )
df.info()
df.sort_values("b")

List Columns

List columns as an index object

df = pandas.DataFrame({'a':range(0,3),'b':range(3,6)})
df.columns

List columns as a list

df.columns.tolist()

Select only certain columns in a list

df['bla'] = 0
cols = df.columns.tolist()
[name for name in cols if 'a' in name]

NA values in columns

Select rows where at least one column is NA

  • https://datatofish.com/rows-with-nan-pandas-dataframe/

    import pandas
    import numpy as np
    df = pandas.DataFrame({'i' : ['a', 'b', 'c', 'd', 'e'],
                           'y' : [np.nan, '2', '2', '4', '1'],
                           'z' : ['2', '2', '4', '1', np.nan]})
    df[df.isna().any(axis=1)]

Number and proportion of NA values

A function that prints the number and proportion of NA values:

def nrows_available(df, var):
    """Number of rows where this variables is not NA"""
    avail = sum(df[var] == df[var])
    not_avail = sum(df[var] != df[var])
    assert(not_avail + avail == len(df))
    print(f"{var} is available in {avail} rows",
          f"and NA in the other {not_avail} rows",
          f"{round(avail/len(df)*100)}% are available.")
nrows_available(placette, "tpespar1")
nrows_available(placette, "tpespar2")

Remove empty columns

Remove empty columns where values are all NA

import pandas
import numpy as np
df = pandas.DataFrame({'A' : ['bli', 'bla', 'bla', 'bla', 'bla'],
                       'B' : [np.nan, '2','2', '4', '1'],
                       'C' : np.nan})
columns_to_keep = [x for x in df.columns if not all(df[x].isna())]
df = df[columns_to_keep].copy()

Rename columns

Rename the ‘a’ column to ‘new’

df.rename(columns={'a':'new'})

Rename columns to snake case using a regular expression

import re
df.rename(columns=lambda x: re.sub(r" ", "_", str(x)).lower(), inplace=True)
# Another regexp that replaces all non alphanumeric characters by an
# underscore
df.rename(columns=lambda x: re.sub(r"\W+", "_", str(x)).lower(), inplace=True)

Remove parenthesis and dots in column names

df.rename(columns=lambda x: re.sub(r"[()\.]", "", x), inplace=True)

Replace the content of the columns, see below:

iris["species"].replace("setosa","x")

Rename and select at the same time

You can use a selector data frame to select and rename at the same time.
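For example, a sketch with an illustrative selector data frame on the iris data:

import pandas
import seaborn

iris = seaborn.load_dataset("iris")
# Selector data frame mapping original column names to new names
selector = pandas.DataFrame({"original": ["sepal_length", "petal_length"],
                             "new": ["sl", "pl"]})
# Select only the listed columns and rename them in one step
iris[selector["original"]].rename(columns=dict(zip(selector["original"], selector["new"])))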

Multiple column headers

Load a csv file which has headers on 2 lines, merge the headers, convert to lower case, remove the “unnamed_1_” part of the column name:

csv_file_name = self.data_dir /  "names.csv"
df = pandas.read_csv(csv_file_name, header=[0, 1])
df.columns = [str('_'.join(col)).lower() for col in df.columns]
df.rename(columns=lambda x: re.sub(r"unnamed_\d+_", "", str(x)).lower(), inplace=True)

Rename a series

You can also rename a series with

iris["species"].rename("bla")

Reorder columns

Place the last column first

  cols = df.columns.to_list()
  cols = [cols[-1]] + cols[:-1]
  df = df[cols]

This SO Answer provide 6 different ways to reorder columns.

Place the last 3 columns first

  cols = list(df.columns)
  cols = cols[-3:] + cols[:-3]
  df = df[cols]

Replace the content of columns

See also string operations in pandas.

Replace Comtrade product code by the FAOSTAT product codes

import seaborn
iris = seaborn.load_dataset("iris")
iris["species"].replace("setosa","x")
# Create a dictionary from 2 columns of a data frame
product_dict = product_mapping.set_index('comtrade_code').to_dict()['faostat_code']
df_comtrade["product_code"] = df_comtrade["product_code"].replace(product_dict)

Variable Type

To change the type of a column use astype:

s = pandas.Series(range(3))
s.to_list()
s.astype(str).to_list()
s.astype(float).to_list()

Note that NA values are not possible with the base integer type; they require the nullable Int64 type, as explained in this SO answer.
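For example (sketch):

import pandas
s = pandas.Series([1, 2, None], dtype="Int64")
s
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64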

Check the type of a column

https://stackoverflow.com/questions/22697773/how-to-check-the-dtype-of-a-column-in-python-pandas/45568211#45568211

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

Find mixed data types

When loading data, sometimes more than one type is detected per column, with a warning such as this one:

arbre = pandas.read_csv(zf.open("ARBRE.csv"), sep=";")
DtypeWarning: Columns (4,5,9,14,21,36) have mixed types.
Specify dtype option on import or set low_memory=False.

Create sample data with a column that has 2 data types

import seaborn
iris = seaborn.load_dataset("iris")
# Change one row to another type
iris.loc[0,"sepal_length"] = iris.loc[0,"sepal_length"].astype(str)
iris.loc[1,"sepal_length"] = iris.loc[1,"sepal_length"].astype(str)

Find columns that use more than one type

for col in iris.columns:
    unique_types = iris[col].apply(type).unique()
    if len(unique_types) > 1:
        print(col, unique_types)

Memory usage

To display the memory usage of each column in a pandas data frame

import pandas
df = pandas.DataFrame({'x':range(0,3), 'y':['a','b','c']})
print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum())
df.info()

Using sys.getsizeof:

import sys
print(sys.getsizeof(df))

Changing a repeated data series to a categorical can help reduce memory usage, although this may no longer be true in recent pandas versions. Categorical variables also come with additional annoyances (such as the memory blow-up bug with observed=False in groupby operations).

import seaborn
iris = seaborn.load_dataset("iris")
print(iris["species"].memory_usage(deep=True))
print(iris["species"].astype('category').memory_usage(deep=True))
iris2 = iris.copy()
iris2["species"] = iris["species"].astype('category')
print(sys.getsizeof(iris2))
print(sys.getsizeof(iris))

Copy data frame indices and data

Use copy() to make a copy of a data frame’s indices and data.

import seaborn
iris1 = seaborn.load_dataset("iris")
iris2 = iris1.copy()
iris2["x"] = 0
print(iris2.head(1))
print(iris1.head(1))
iris2.equals(iris1)

If you don’t make a copy, modifying a newly assigned data frame also modifies the original data frame

iris3 = iris1
iris3["x"] = 0
print(iris3.head(1))
print(iris1.head(1))
iris3.equals(iris1)

Datetime operations

Create date time columns from a character column

import pandas
pandas.to_datetime('2020-01-01', format='%Y-%m-%d')
pandas.to_datetime('2020-01-02')
pandas.to_datetime('20200103')

Extract the year

s = pandas.Series(pandas.date_range("2000-01-01", periods=3, freq="Y"))
print(s)
print(s.dt.year)

Convert integer years to a time series

s = pandas.Series([2020, 2021, 2022])
pandas.to_datetime(s, format="%Y")

Convert to date time

Convert UN Comtrade dates in the format 202201 to a datetime type

df = pandas.DataFrame({'period':[202201, 202202]})
df["period2"] = pandas.to_datetime(df['period'], format='%Y%m')
df.info()

Rolling sum and mean

Rolling mean over a 5 year window for the whole data frame (provided that year is the index variable)

df.rolling(window=5).mean()

Plot the difference to a 5 years rolling mean

(df - df.rolling(window=5).mean()).plot.bar()

Example “rolling sum with a window length of 2 observations.”

df = pandas.DataFrame({'B': [0, 1, 2, np.nan, 4, 5, 6, 7]})
df.rolling(2).sum()

Yearly rolling of a monthly time series:

.transform(lambda x: x.rolling(13, min_periods=1).mean())

Note you might actually not need the transform in this case.

df["x"].rolling(13, min_periods=1).mean()

also works.

Groupby operations

Compute the sum of sepal length grouped by species

import seaborn
iris = seaborn.load_dataset("iris")
# Aggregate one value
iris.groupby('species')["sepal_length"].agg(sum).reset_index()
# Aggregate multiple values
iris.groupby('species')[["sepal_length", "petal_length"]].agg(sum).reset_index()
# Aggregate multiple values and give new names
iris.groupby('species').agg(sepal_length_sum = ('sepal_length', sum),
                            petal_length_sum = ('petal_length', sum))

Compute the sum but repeated for every original row

iris['sepal_sum'] = iris.groupby('species')['sepal_length'].transform('sum')
iris

This is useful to compute the share of total in each group for example.

Compute the cumulative sum of the sepal length

iris['cumsum'] = iris.groupby('species').sepal_length.cumsum()
iris['cumsum'].plot()
from matplotlib import pyplot
pyplot.show()

Compute a lag

iris['cumsum_lag'] = iris.groupby('species')['cumsum'].transform('shift', fill_value=0)
iris[['cumsum', 'cumsum_lag']].plot()
pyplot.show()

Aggregate by decades

Aggregate a trade data frame by decades

bins = range(2000, 2031, 10)
tf_agg["decade"] = pandas.cut(
    tf_agg["year"], bins=bins, include_lowest=True, labels=range(2000, 2021, 10)
)
index = ["reporter", "partner", "flow", "product_code_4d", "decade"]
tf_decade = (
    tf_agg.groupby(index)[["net_weight", "trade_value"]]
    .agg(sum)
    .reset_index()
)

Compute with a lambda function

Beyond standard function such as sum and mean, it’s possible to use a self defined lambda function as follows

import numpy as np
(iris
 .groupby(["species"])
 .agg(pw_sum = ("petal_width", sum),
      pw_sum_div_by_10 = ("petal_width", lambda x: x.sum()/10),
      n = ("petal_width", len),
      mean1 = ("petal_width", np.mean))
 .assign(mean2 = lambda x: x.pw_sum / x.n)
)

Describe mean, std, min, max

Display mean, std, min, 25%, 50%, 75%, max across group by variables

df.groupby(["status"])["diff"].describe()

Different aggregation functions

Example aggregating some variables with a sum and taking the unique value (first) for other variables (input coefficients). The code below passes a dictionary of variables and aggregation functions to the df.groupby().agg() method.

    # Aggregate product codes from the 6 to the 4 digit level
    index = [
        "year",
        "period",
        "reporter_code",
        "reporter",
        "reporter_iso",
        "partner_code",
        "partner",
        "partner_iso",
        "product_code_4d",
        "unit_code",
        "unit",
    ]
    agg_dict = {'quantity': 'sum',
                'net_weight': 'sum',
                'trade_value': 'sum',
                'vol_eqrwd_ub': 'sum',
                'vol_eqrwd_ob': 'sum',
                'la_fo': 'sum',
                'conversion_factor_m3_mt':'first',
                'bark_factor': 'first',
                'nai': 'first'}
    ft4d = (
        ft
        .groupby(index)
        .agg(agg_dict)
        .reset_index()
    )

Unique values

The first element of the aggregation dictionary shows how to simply compute all the unique values

agg =  {
    # All unique values in a list
    "country_iso2":lambda x: x.unique(),
    # Concatenate a list of strings into a string
    # (the "comment" key below is a placeholder column name)
    "comment": lambda x: "".join(x.unique()),
    # The first value if the value is repeated and only present once
    'primary_eq':lambda x: x.unique()[0] if x.nunique() == 1 else np.nan,
    'import_quantity':lambda x: x.unique()[0] if x.nunique() == 1 else np.nan,
    # Sum the values
    'primary_eq_imp_1':"sum"
}
df_agg = df.groupby(index)[selected_columns].agg(agg).reset_index()

Proportion within groups

Compute proportion within groups:

df = pandas.DataFrame({
    'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'value': [10, 20, 30, 40, 50, 60, 70]
})
df['proportion'] = df.groupby('category')['value'].transform(lambda x: x / x.sum())

Lag or shift a grouped variable

Load the flights dataset and for each month, display the passenger value in the same month of the previous year. Compare the passengers and pass_year_minus_one columns by displaying the tables for January and December.

import seaborn
flights = seaborn.load_dataset("flights")
flights['pass_year_minus_one'] = flights.groupby(['month']).passengers.shift()
flights.query("month=='January'")
flights.query("month=='December'")

Compute a lag

iris['cumsum_lag'] = iris.groupby('species')['cumsum'].transform('shift', fill_value=0)
iris[['cumsum', 'cumsum_lag']].plot()
pyplot.show()

Min or max in group

Extract the min in each group

df.loc[df.groupby('A')['B val'].idxmin()]

Sort by max in each group

  df.groupby('reporter')["value"].max().sort_values(ascending=False)

Number of unique combinations

Number of unique combinations of one or 2 columns

df = pandas.DataFrame({'A' : ['bla', 'bla', 'bli', 'bli', 'bli'],
                       'B' : ['1', '2', '2', '4', '2']})
df.groupby(["A"]).nunique()
df.groupby(["B"]).nunique()
df.groupby(["A", "B"]).nunique()

Slice, get the first elements of each group

  • How do I select the first row in each group in groupby

    import pandas
    import numpy as np
    df = pandas.DataFrame({'A' : ['bla', 'bla', 'bli', 'bli', 'bli'],
                           'B' : ['1', '2', '2', '4', '1'],
                           'C' : [np.nan, 'X', 'Y', 'Y', 'Y']})
    df.sort_values('B').groupby('A').nth(0)
    df.sort_values('B').groupby('A').nth(list(range(2)))
    df.sort_values('B').groupby('A').head(2)

Transform

Sum by groups

import pandas
df = pandas.DataFrame({'i' : ['a', 'a', 'b', 'b', 'b'],
                       'x' : range(1,6)})
df["y"] = df.groupby("i")["x"].transform("sum")

Yearly rolling of a monthly time series:

import pandas
from matplotlib import pyplot as plt
li = list(range(15))
df = pandas.DataFrame({'x' : li  + list(reversed(li)) + li})
df["y"] = df["x"].transform(lambda x: x.rolling(13, min_periods=1).mean())
df.plot()
plt.show()

Interpolate

import pandas
import numpy as np
df = pandas.DataFrame({'i' : ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                       'x' : [1,np.nan, np.nan, 4, 1, 2, np.nan]})
df["y"] = df.groupby("i")["x"].transform(pandas.Series.interpolate)

See also

Compute a diff through time

Compute the area diff

  df["area_diff"] = df.groupby(groupby_area_diff)["area"].transform(
      lambda x: x.diff()
  )

Concatenate strings in the same group

Based on SO answer

import pandas
df = pandas.DataFrame({'A' : ['a', 'a', 'b', 'c', 'c'], 
                       'B' : ['i', 'j', 'k', 'i', 'j'], 
                       'X' : [1, 2, 2, 1, 3]})
df.groupby("X", as_index=False)["A"].agg(' '.join)
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)

Index

An index can be converted back to a data frame. See also index selection in the “.loc” section.

Drop index levels

Example from pandas DataFrame drop

import pandas
midx = pandas.MultiIndex(levels=[['lama', 'cow', 'falcon'],
                             ['speed', 'weight', 'length']],
                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
                            [0, 1, 2, 0, 1, 2, 0, 1, 2]])
df = pandas.DataFrame(index=midx, columns=['big', 'small'],
                  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
                        [250, 150], [1.5, 0.8], [320, 250],
                        [1, 0.8], [0.3, 0.2]])
df
df.drop(index='cow', columns='small')
df.drop(index='length', level=1)

Flatten multi index columns

For example the result of a pivot operation on multiple value columns returns a multi index. To flatten that multi index, use .to_flat_index() as follows:

df.columns = ["_".join(a) for a in df.columns.to_flat_index()]

Do this column renaming before the reset_index() that you would use after the pivot operation.

Inspired by https://stackoverflow.com/questions/14507794/how-to-flatten-a-hierarchical-index-in-columns
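A small self-contained sketch, pivoting two value columns and flattening the resulting multi index:

import pandas

df = pandas.DataFrame({"year": [2020, 2020, 2021, 2021],
                       "flow": ["imp", "exp", "imp", "exp"],
                       "value": [1, 2, 3, 4],
                       "weight": [5, 6, 7, 8]})
wide = df.pivot(index="year", columns="flow", values=["value", "weight"])
# Columns are now a MultiIndex such as ('value', 'imp'); flatten them
wide.columns = ["_".join(col) for col in wide.columns.to_flat_index()]
wide = wide.reset_index()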

Recursive computation on a index in a loop

Simple index case

Compute in a loop based on the value of the previous year t-1. If there is a single value per year, this is a scalar computation.

df = pandas.DataFrame({'x':range(0,10)})
df.loc[0, "y"] = 2
for t in range(1, len(df)):
    df.loc[t, "y"] = pow(df.loc[t-1, "y"], df.loc[t, "x"]/2)
df

If there are multiple values for each year, this is a vector computation.

import itertools
import pandas
countries = ["a","b","c","d"]
years = range(1990, 2020)
expand_grid = list(itertools.product(countries, years))
df = pandas.DataFrame(expand_grid, columns=('country', 'year'))
df["x"] = 1
df["x"] = df["x"].cumsum()
df.set_index(["year"], inplace=True)
df.loc[min(years), "y"] = 2
for t in range(min(years)+1, max(years)+1):
    df.loc[t, "y"] = pow(df.loc[t-1, "y"], df.loc[t, "x"]/2)
df

Multi index case

I would like to compute the consumption equation of a partial equilibrium model.

for t in range(gfpmx_data.base_year + 1, years.max()+1):
    # TODO: replace this loop by vectorized operations using only the index on years
    for c in countries:
        # Consumption
        swd.loc[(t,c), "cons2"] = (swd.loc[(t, c), "constant"]
                                   * pow(swd.loc[(t-1, c), "price"],
                                         swd.loc[(t, c), "price_elasticity"])
                                   * pow(swd.loc[(t, c), "gdp"],
                                         swd.loc[(t, c), "gdp_elasticity"])
                                  )
swd['comp_prop'] = swd.cons2 / swd.cons -1
print(swd["comp_prop"].abs().max())
swd.query("year >= 2019")

Unique values of a multi index

Display the unique values of the two columns with a count of occurrences

import seaborn
penguins = seaborn.load_dataset("penguins")
penguins.value_counts(["species", "island"])
penguins[["species", "island"]].value_counts()

Lower level method using unique() on a multi index and returning a data frame

penguins.set_index(["species", "island"]).index.unique().to_frame(False)

Query index greater or smaller than

See also the query section for other ways to query data frames.

SO answer

df = pandas.DataFrame({'i':range(0,3), 
                       'j':['a','b','c'],
                       'x':range(22,25)})
df = df.set_index(["i","j"])
df.loc[(df.index.get_level_values('i') > 1)]

Using query instead

df.query("i>1")

Interpolate

pandas interpolate

import pandas
import numpy as np
s = pandas.Series([0, 2, np.nan, 8, np.nan, np.nan])
s.interpolate(method='polynomial', order=2)
s.interpolate(method='linear')
# Also fill NA values at the begining and end of the series
s.interpolate(method='linear', limit_direction="both")

Limit interpolation to the inner NAN values

s.interpolate(limit_area="inside")

Interpolate within a groupby

# Interpolate the whole data frame
df.groupby("a").transform(pandas.DataFrame.interpolate)
# Only one column
df.groupby("a")["b"].transform(pandas.Series.interpolate)

PyArrow

“The Apache Arrow format allows computational routines and execution engines to maximize their efficiency when scanning and iterating large chunks of data. In particular, the contiguous columnar layout enables vectorization using the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors.” “[…] a standardized memory format facilitates reuse of libraries of algorithms, even across languages.” “Arrow libraries for C (Glib), MATLAB, Python, R, and Ruby are built on top of the C++ library.”
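A minimal round trip between pandas and Arrow, assuming pyarrow is installed:

import pandas
import pyarrow as pa

df = pandas.DataFrame({"x": range(3), "y": ["a", "b", "c"]})
table = pa.Table.from_pandas(df)  # convert a pandas DataFrame to an Arrow Table
print(table.schema)
df2 = table.to_pandas()           # and back to pandas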

Pandas IO Input Output

See the general section on IO input output, many of the subsections there refer to pandas IO.

Pivot, reshape and transpose

from wide to long

The Pandas user guide on reshaping gives several examples using melt (easier to rename the “variable” and “value” columns) or stack (designed to work together with MultiIndex objects).

Reshape using melt

import seaborn
iris = seaborn.load_dataset("iris")
iris.melt(id_vars="species", var_name="measurement")

Another example with two index columns

cheese = pandas.DataFrame(
      {
          "first": ["John", "Mary"],
          "last": ["Doe", "Bo"],
          "height": [5.5, 6.0],
          "weight": [130, 150],
      }
)
cheese
cheese.melt(id_vars=["first", "last"], var_name="quantity")

Another example

grading_matrix = pandas.DataFrame({"dbh":["d1", "d2", "d3"],
                                   "abies":["p","q","r"],
                                   "picea":["m","n","o"],
                                   "larix":["m","n","o"]})
grading_long = grading_matrix.melt(id_vars="dbh", 
                                   var_name="species", 
                                   value_name="grading")

Reshape using the wide_to_long convenience function

import numpy as np
dft = pandas.DataFrame(
    {
        "A1970": {0: "a", 1: "b", 2: "c"},
        "A1980": {0: "d", 1: "e", 2: "f"},
        "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
        "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
        "X": dict(zip(range(3), np.random.randn(3))),
        "id":  {0: 0, 1: 1, 2: 2},
    }
)
dft
pandas.wide_to_long(dft, stubnames=["A", "B"], i="id", j="year")

From long to wide

Pivot from long to wide format using pivot:

df = pandas.DataFrame({
    "lev1": [1, 1, 1, 2, 2, 2],
    "lev2": [1, 1, 2, 1, 1, 2],
    "lev3": [1, 2, 1, 2, 1, 2],
    "lev4": [1, 2, 3, 4, 5, 6],
    "values": [0, 1, 2, 3, 4, 5]})
df_wide = df.pivot(columns=["lev2", "lev3"], index="lev1", values="values")
df_wide

# lev2    1         2
# lev3    1    2    1    2
# lev1
# 1     0.0  1.0  2.0  NaN
# 2     4.0  3.0  NaN  5.0

Rename the (sometimes confusing) axis names

df_wide.rename_axis(columns=[None, None])

#         1         2
#         1    2    1    2
# lev1
# 1     0.0  1.0  2.0  NaN
# 2     4.0  3.0  NaN  5.0

Add a prefix to a year columns before pivoting

(df
    .assign(year = lambda x: "net_trade_" + x["year"].astype(str))
    .pivot(columns="year", index=["product", "scenario"], values="net_trade")
    .reset_index()
)

Transpose index and columns

Replace

Python pandas equivalent for replace

import pandas
s = pandas.Series(["ape", "monkey", "seagull"])
s.replace(["ape", "monkey"], ["lion", "panda"])
s.replace("a", "x", regex=True)
`s.replace({"ape": "lion", "monkey": "panda"})`
pandas.Series(["bla", "bla"]).replace("a","i",regex=True)

Replace by the upper case value

s.str.upper()

Replace values where a condition is false

Replace values where the condition is false see help(df.where)

“Where cond is True, keep the original value. Where False, replace with corresponding value from other.”

df = pandas.DataFrame({'a':range(0,3), 
                       'b':['p','q','r'], 
                       'c':['m','n','o']})
df["b"].where(df["c"].isin(["n","o"]),"no")
df.where(df["c"].isin(["n","o"]),"no")

Fill Na values

See also the interpolate section.

Replace NA values by another value

import pandas
import numpy as np
df = pandas.DataFrame([[np.nan, 2, np.nan, 0],
                  [3, 4, np.nan, 1],
                  [np.nan, np.nan, np.nan, 5],
                  [np.nan, 3, np.nan, 4]],
                 columns=list("ABCD"))
# Replace all NaN elements with 0s.
df.fillna(0)
# Replace by 0 and column 2 and by 1 in column B
df.fillna({"A":0, "B":1}, inplace=True)
df

Select with loc, iloc, query, isin and xs

There are many ways to select data in pandas (square brackets, loc, iloc, query, isin). In a first stage, during data preparation, it’s better to keep data out of the index. But in a second stage, when you are doing modelling, multi indexes become useful, and especially slicers to compute on part of the dataset: only some years, only some products, only some countries. For this, tools such as df.xs or https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IndexSlice.html are needed.

loc

.loc is primarily label based, but may also be used with a boolean array.

I copied the examples below from the pandas loc documentation at: pandas.DataFrame.loc

Create an example data frame

import pandas
df = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
                      index=['cobra', 'viper', 'sidewinder'],
                      columns=['max_speed', 'shield'])

List of index labels

In :  df.loc[['viper', 'sidewinder']]
Out:
                    max_speed  shield
        viper               4       5
        sidewinder          7       8

Selecting a cell with 2 lists returns a data frame

df.loc[["viper"], ["shield"]]

Selecting cell with tuples (for multi indexes) or strings returns its value

df.loc[("viper"), ("shield")]
df.loc["viper", "shield"]

Note: in the case of a multi index, use tuples for index selection, see section below on multi index selection with loc.

Conditional that returns a boolean Series

In :  df.loc[df['shield'] > 6]
Out:
                     max_speed  shield
         sidewinder          7       8

Slice with labels for row and labels for columns.

In :  df.loc['cobra':'viper', 'max_speed':'shield']
Out:
               max_speed  shield
        cobra          1       2
        viper          4       5

Set value for all items matching the list of labels

In : df.loc[['viper', 'sidewinder'], ['shield']] = 50

In : df
Out:
                    max_speed  shield
        cobra               1       2
        viper               4      50
        sidewinder          7      50

Another example using integers for the index

df2 = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
                      index=[7, 8, 9],
                      columns=['max_speed', 'shield'])

Slice with integer labels for rows. Note that both the start and stop of the slice are included. Python slices behave differently.

In :  df2.loc[8:9]
Out:
      max_speed  shield
   8          4       5
   9          7       8

index.isin()

Using the same example as above, select rows that are not in [‘cobra’,‘viper’]. Based on a SO answer use isin on the index:

In : df.index.isin(['cobra','viper'])
Out: array([ True,  True, False])

In : df.loc[~df.index.isin(['cobra','viper'])]
Out: 
            max_speed  shield
sidewinder          7       8

Or assign the selector to reuse it:

selector = df.index.isin(['cobra','viper'])
df.loc[selector]
df.loc[~selector]

index conditions

With an index corresponding to years, select all years below or equal to 2050

df.loc[df.index <= 2050]

Multiple conditions

import pandas
df = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
                      index=['cobra', 'viper', 'sidewinder'],
                      columns=['max_speed', 'shield'])
df.loc[(df["max_speed"] > 1) & (df["shield"] < 7)]
df.query("max_speed > 1 & shield < 7")

Multi-index selection with loc

Create a panel data set with a multi index in years and countries

import pandas
import numpy as np
df = pandas.DataFrame(
    {"country": ['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso'] * 2,
     "year": np.repeat(np.array([2020,2021]), 5),
     "value":  np.random.randint(0,1e3,10)
     })
df = df.set_index(["year", "country"])

Use the multi index to select data for 2020 only

idx = pandas.IndexSlice
df.loc[idx[2020, :]]

Use the multi index to select data for Algeria only, in all years

df.loc[idx[:, "Algeria"], :]

Note: it’s better to write df.loc[idx[2020, :], :] than df.loc[(2020,)]. The latter is in fact just equivalent to df.loc[2020]. Note that df.loc[(, "Algeria")] would return a syntax error, so a slicer is needed to select on the second level only.

See also the course material Pandas for panel data.

Sample data copied from help(df.loc):

tuples = [
   ('cobra', 'mark i'), ('cobra', 'mark ii'),
   ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
   ('viper', 'mark ii'), ('viper', 'mark iii')
]
index = pandas.MultiIndex.from_tuples(tuples)
values = [[12, 2], [0, 4], [10, 20],
        [1, 4], [7, 1], [16, 36]]
df = pandas.DataFrame(values, columns=['max_speed', 'shield'], index=index)

Single label. Note this returns a DataFrame with the first index level dropped.

df.loc['cobra']

Single index tuple. Note this returns a Series.

df.loc[('cobra', 'mark ii')]

Single tuple. Note using [[]] returns a DataFrame.

df.loc[[('cobra', 'mark ii')]]

Single label for row and column. Similar to passing in a tuple, this returns a Series.

df.loc['cobra', 'mark i']

Slice from index tuple to single label

df.loc[('cobra', 'mark i'):'viper']

Slice from index tuple to index tuple

df.loc[('cobra', 'mark i'):('viper', 'mark ii')]

Invert a selection on the second index

df.loc[~df.index.isin(["mark i"], level=1)]

df.index.get_level_values

Get index level values to use conditional checks on those values. For example, select years smaller than 2021 (using the year and country panel data frame defined earlier):

selector = df.index.get_level_values("year") < 2021
df.loc[selector]

Multi-index slicers to select the second index element

Using slicers

“You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).”

Other example from a SO question

import pandas
df = pandas.DataFrame(index = pandas.MultiIndex.from_product([range(2010,2020),
                      ['mike', 'matt', 'dave', 'frank', 'larry'], ]))
df['x']=0
df.index.names=['year', 'people']
df.loc[2010]
df.loc[(2010,"mike")]

These two df.loc[2010], df.loc[(2010,"mike")] work, but

df.loc["mike"]

Returns a KeyError: 'mike'. To select on the second index level only, you need a multi index slicer.

idx = pandas.IndexSlice
df.loc[idx[:, "mike"],:]

You can also use df.xs

df.xs("mike", level=1)
df.xs("mike", level="people")

Using loc on just the second index of a multi index

Another example, using the snake and mark multi index data frame created from help(df.loc) above.

idx = pandas.IndexSlice
df.loc[idx[:, "mark i"],:]
df.xs("mark i", level=1)

iloc

.iloc is primarily integer position based (from 0 to length -1 of the axis), but may also be used with a boolean array.

Create a sample data frame:

In : example = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
                {'a': 100, 'b': 200, 'c': 300, 'd': 400},
                {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
      df = pandas.DataFrame(example)

In : df
Out: 
            a     b     c     d
      0     1     2     3     4
      1   100   200   300   400
      2  1000  2000  3000  4000

Index with a slice object. Note that it doesn’t include the upper bound.

In :  df.iloc[0:2]
Out: 
          a    b    c    d
     0    1    2    3    4
     1  100  200  300  400

With lists of integers.

In : df.iloc[[0, 2], [1, 3]]
Out: 
            b     d
      0     2     4
      2  2000  4000

With slice objects.

In : df.iloc[1:3, 0:3]
Out: 
            a     b     c
      1   100   200   300
      2  1000  2000  3000

With a boolean array whose length matches the columns.

In : df.iloc[:, [True, False, True, False]]
Out: 
            a     c
      0     1     3
      1   100   300
      2  1000  3000

Query

Query the columns of a Data Frame with a boolean expression.

df = pandas.DataFrame({'A': range(1, 6),
                       'B': range(10, 0, -2),
                       'C': range(10, 5, -1)})
df.query("A > B")

   A  B  C
4  5  2  6

Two queries

df.query("A < B and B < C")
df.query("A < B or B < C")

Query using a variable

limit = 3
df.query("A > @limit")

   A  B  C
3  4  4  7
4  5  2  6

Query for a variable in a list

df.query("A in [3,6]")

Query for a variable not in a list

df.query("A not in [3,6]")

str.contains and str.startswith

str.contains and str.startswith do not work with the default numexpr engine; you need to set engine="python" as explained in this answer.

Example use on a table of product codes, query products description that contain “oak” but not “cloak” and query sawnwood products starting with “4407”:

comtrade.products.hs.query("product_description.str.contains('oak') and not product_description.str.contains('cloak')", engine="python")
comtrade.products.hs.query("product_code.str.startswith('4407')", engine="python")

isin

Use a list of values to select rows

df = pandas.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
df[df['A'].isin([3, 6])]
df.loc[df['A'].isin([3, 6])]
df.query("A in [3,6]")

Square brackets

Select the second column with square brackets

df[df.columns[1]]

xs cross sections

The key and level arguments specify which part of the multilevel index should be used. Create a sample data frame, copied from help(df.xs):

d = {'num_legs': [4, 4, 2, 2],
     'num_wings': [0, 0, 2, 2],
     'class': ['mammal', 'mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'flies', 'walks']}
df = pandas.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
print(df)

Select with a key following the order in which levels appear in the index:

df.xs('mammal')
df.xs(('mammal', 'dog'))

Select with a key and specify the levels:

df.xs(key='cat', level=1)
df.xs(key=('bird', 'walks'),
      level=[0, 'locomotion'])

Pandas DataFrame.xs “cannot be used to set values.”
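Since xs cannot set values, a hedged sketch of the equivalent assignment with loc and IndexSlice on the same animals data frame (sorting the index first to allow slicing):

idx = pandas.IndexSlice
df = df.sort_index()
# Set num_legs to 0 for all rows where the animal level is "cat"
df.loc[idx[:, "cat", :], "num_legs"] = 0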

Determine whether a column contains a particular value

How to determine whether a pandas column contains a particular value

Using in on a Series checks whether the value is in the index, not in the values:

In : s = pd.Series(list('abc'))
In : 1 in s
Out: True
In : 'a' in s
Out: False

One option is to see if it’s in unique values:

In : 'a' in s.unique()
Out: True

Check if df is empty

A SO question recommends using the boolean value df.empty to test whether a data frame is empty.

import seaborn
iris = seaborn.load_dataset("iris")
selector = iris["species"] == "non_existant"
df = iris[selector]
df.empty

Sort or arrange values

Sort iris by descending order of species and ascending order of petal width

iris.sort_values(by=["species", "petal_width"], ascending=[False,True])

String operations in pandas

See also string operations in python in another section.

String operations in pandas use vectorized string methods of the class StringMethods(pandas.core.base.NoNewAttributesMixin).

df = pandas.DataFrame({'a':['a','b','c']})
help(df.a.str)

Concatenate all values in a character vector:

df['a'].str.cat()

Extract the first 2 or last 2 characters

df = pandas.DataFrame({'a':['bla','bli','quoi?']})
df["a"].str[:2]
df["a"].str[-2:]

Search and replace

Search one element or another in a character vector:

df = pandas.DataFrame({'a':['bla','ble','bli2']})
df[df['a'].str.contains('a|i')]

Replace elements in a character vector:

df['a'].replace('a|i','b',regex=True)

Keep only numbers

df["a"].replace('[a-zA-Z]', '', regex=True)

Strip spaces

Strip spaces in strings

df = pandas.DataFrame({'a':['bla','bli',' bla ']})
print(df.a.unique())
print(df.a.str.strip().unique())

Separate a column in two based on a split pattern

The “too many values to unpack” error can also be returned by the str.split method of pandas data frames.

For example splitting a character vector on the “,” pattern. Split by using both n=1 and expand=True. Then assign to new columns using multiple vector assignment. It is equivalent to tidyr::separate in R.

import pandas
df = pandas.DataFrame({"x": ["a", "a,b", "a,b,c"]})

df[["y", "z"]] = df.x.str.split(",", n=1, expand=True)
df

#        x  y     z
# 0      a  a  None
# 1    a,b  a     b
# 2  a,b,c  a   b,c

# The split data frame returned by the split method
df.x.str.split(",", n=1, expand=True)

#    0     1
# 0  a  None
# 1  a     b
# 2  a   b,c

df.x.str.split(",")

# 0          [a]
# 1      [a,  b]
# 2    [a, b, c]

df.x.str.split(",", expand=True)

#    0     1     2
# 0  a  None  None
# 1  a     b  None
# 2  a     b     c

The following form of assignment works only if each row has exactly 2 splits. In this example, it fails with the error “too many values to unpack (expected 2)”, because of the first row which has only one value instead of two:

df["y"], df["z"] = df.x.str.split(",", n=1)

According to the documentation of pandas.Series.str.split, if n > 0 and expand=True:

“If for a certain row the number of found splits < n, append None for padding up to n if expand=True.”

Extract column content into new columns based on a pattern

Extract the first group of character before the first white space into a new column named product

df = pandas.DataFrame({"raw_content": ["A xyz", "BB xyz lala", "CDE o li"]})
df[["product"]] = df["raw_content"].str.extract(r"^(\S+)")
df

Place product patterns in a capture group for extraction

df = pandas.DataFrame({"x": ["am", "an", "o", "bm", "bn", "cm"]})
product_pattern = "a|b|c"
df[["product", "element"]] = df.x.str.extract(f"({product_pattern})?(.*)")
df

Style format

  • df.style.format?

    “Format the text display value of cells.”

      import pandas
      import numpy as np
      df = pandas.DataFrame({"x":[np.nan, 1.0, "A"], "y":[2.0, np.nan, 3.0]})
      df["z"] = df["y"]
      df.style.format("{:.2f}", na_rep="")
      df.style.format({0: '{:.2f}', 1: '£ {:.1f}'}, na_rep='MISS', precision=1)

Difference between 2 data frames

There are two methods. Using merge:

merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Using drop_duplicates

newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)

Duplicated values

Warn in case the variable x is duplicated

import pandas
df = pandas.DataFrame({"x": ["a", "b", "c", "a"], "y": range(4)})
dup_x = df["x"].duplicated(keep=False)
if any(dup_x):
    msg = "x values are not unique. "
    msg += "The following duplicates are present:\n"
    msg += f"{df.loc[dup_x]}"
    raise ValueError(msg)

Drop duplicates

df["x"].drop_duplicates()
df["x"].drop_duplicates(keep=False)
df["x"].drop_duplicates(keep="last")

Where and mask

where replaces values that do not fit the condition and mask replaces values that fit the condition.

s = pandas.Series(range(5))
s.where(s > 1, 10)
s.mask(s > 1, 10)

On a data frame

import pandas
import numpy as np
df1 = pandas.DataFrame({'x':[0,np.nan, np.nan], 
                        'y':['a',np.nan,'c']})
df2 = pandas.DataFrame({'x':[10, 11, 12], 
                        'y':['x','y', np.nan]})
df1.mask(df1.isna(), df2)
df1.where(df1.isna(), df2)

Paths

Copy or move files

Write to a text file using a context manager, then copy the file somewhere else.

with open("/tmp/bli.md", "w") as f:
    f.write('Hola!')

Copy a file

import shutil
shutil.copy("/tmp/bli.md", "/tmp/bla.md")

Move a file

shutil.move("/tmp/bli.md", "/tmp/bla.md")

Delete files or directories

Create a file and a path object for example purposes

import pathlib
with open("/tmp/bli.md", "w") as f:
    f.write('Hola!')
bli_path = pathlib.Path("/tmp/bli.md")

Delete a file if it exists

if bli_path.exists():
    bli_path.unlink()

Create a directory and path object for example purposes

import pathlib

Delete a directory if it exists
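A minimal sketch covering both steps, with a made-up directory name:

import pathlib
import shutil

# Create a directory for example purposes
dir_path = pathlib.Path("/tmp/bli_dir")
dir_path.mkdir(exist_ok=True)

# Delete the directory and its content if it exists
if dir_path.exists():
    shutil.rmtree(dir_path)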

See also the Pathlib section below.

Pathlib

Pathlib is an object oriented path API for python as explained in PEP 428

Instead of

import os
os.path.join('~','downloads')

You can use:

from pathlib import Path
Path('~') / 'downloads'

Data located in the home folder

 data_dir = Path.home() / "repos/data/"

Check if a directory is empty

Check if a directory is empty using pathlib

import pathlib
p1 = pathlib.Path("/tmp/")
p2 = pathlib.Path("/tmp/thisisempty/")
p2.mkdir()
any(p1.iterdir()) # True: /tmp contains files, it is not empty
any(p2.iterdir()) # False: the directory is empty

Dir name or parent directory

A SO question illustrates the different parent levels

import os
import pathlib
p = pathlib.Path('/path/to/my/file')
p.parents[0]
p.parents[1]
p.parent

“Note that os.path.dirname and pathlib treat paths with a trailing slash differently. The pathlib parent of some/path/ is some, while os.path.dirname on some/path/ returns some/path”:

pathlib.Path('some/path/').parent
os.path.dirname('some/path/')

Home

Cross platform way to refer to the home directory

from pathlib import Path
Path.home()

List all files in a directory

If p is a pathlib object you can list file names corresponding to a file pattern as such:

[x.name for x in p.glob('**/*.csv')]

You can also use the simpler iterdir() method to list all files in the directory

from pathlib import Path
dir_path = Path('/tmp')
for file_path in dir_path.iterdir():
    print(file_path)

Python Path

Temporarily add to the python path (SO question) in order to import scripts

import sys
sys.path.append('/path/to/dir')
# You might want to prepend if you want to overwrite a system package
sys.path.insert(0, "/home/rougipa/eu_cbm/eu_cbm_hat")
# If it's a pathlib object, you want to convert it to string first
sys.path.insert(0, str(path_lib_object))

To permanently add a package under development to the python path, add the following to your .bashrc or .bash_profile:

export PYTHONPATH="$HOME/repos/project_name/":$PYTHONPATH

Temporary directories and files

Docs.python.org tempfile examples using a context manager

import tempfile
# create a temporary directory using the context manager
with tempfile.TemporaryDirectory() as tmpdirname:
    print('created temporary directory', tmpdirname)
# directory and contents have been removed

Using pathlib to facilitate path manipulation on top of tempfile makes it possible to create new paths using the / path operator of pathlib:

import tempfile
from pathlib import Path
with tempfile.TemporaryDirectory() as tmpdirname:
    temp_dir = Path(tmpdirname)
    print(temp_dir, temp_dir.exists())
    file_name = temp_dir / "test.txt"
    file_name.write_text("bla bla bla")
    print(file_name, "contains", file_name.open().read())

Outside the context manager, files have been destroyed

print(temp_dir, temp_dir.exists())
# /tmp/tmp81iox6s2 False
print(file_name, file_name.exists())
# /tmp/tmp81iox6s2/test.txt False

Plot

Python plotting for exploratory analysis is a great gallery of plot examples; each example is written in 5 different plotting libraries: pandas, plotnine, plotly, altair and R ggplot2. There is also one seaborn example.

Image composition

For some complex plots, I directly pasted images of plots together as follows:

from PIL import Image

# composite_plot_dir is a pathlib.Path defined elsewhere
# Load the images
p_hexprov_eu = Image.open(composite_plot_dir / "hexprov_eu.png")
p_sink_eu = Image.open(composite_plot_dir / "sink_eu.png")
p_harea_eu = Image.open(composite_plot_dir / "harea_eu.png")
p_harv_nai_eu = Image.open(composite_plot_dir / "harv_nai_eu.png")
# Get the widths and heights of the images
harea_width, harea_height = p_harea_eu.size
hexprov_width, hexprov_height = p_hexprov_eu.size
sink_width, sink_height = p_sink_eu.size
harv_nai_width, harv_nai_height = p_harv_nai_eu.size

# Create a figure with 2 plot images pasted together
# Change the letter in the sink plot (g_sink is a figure created elsewhere)
g_sink.savefig(composite_plot_dir / "sink_eu.png")
# Load images again
p_hexprov_eu = Image.open(composite_plot_dir / "hexprov_eu.png")
p_sink_eu = Image.open(composite_plot_dir / "sink_eu.png")
# Determine the width of the combined image (the maximum width)
max_width = max(hexprov_width, sink_width)
# Create a new image with the combined height and maximum width
combined_height = hexprov_height + sink_height
# Offset the x axis of the top figure to align both axes
x_offset = 25
combined_image = Image.new("RGB", (max_width+x_offset, combined_height), color="white")
# Paste the individual images into the combined image
combined_image.paste(p_hexprov_eu, (x_offset, 0))
combined_image.paste(p_sink_eu, (0, hexprov_height))
# Save the combined image
combined_image.save(composite_plot_dir / "hexprov_sink.png")

Matplotlib

All matplotlib examples require the following imports:

from matplotlib import pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np

Simple line plot changing the figure size and the axes limit with pyplot

plt.rcParams['figure.figsize'] = [10, 10]
fig = plt.figure()
ax = plt.axes()
x = np.linspace(-1.5, 1.5, 1000)
ax.plot(x, 1-3*x)
ax.set_xlim(-6, 6)
ax.set_ylim(-6, 6)

Scatter plot, using a colour variable and the ‘jet’ colour map.

Y = np.array([1,-1,-1, 1])
X = np.array([
        [-1, -1],
        [ 1, -1],
        [-1,  1],
        [ 1,  1]])
fig = plt.figure()
ax = plt.axes()
ax.scatter(X[:,0], X[:,1],c=Y, cmap='jet')

Use another colour map

ax.scatter(X[:,0], X[:,1],c=Y, cmap='Spectral')

Plot normal distribution

Plot the probability density function of the normal distribution.

\[f(x)=\frac{1}{\sigma{\sqrt {2\pi }}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}\]

With various sigma and mu values displayed in the legend.

fig = plt.figure()
ax = plt.axes()
x = np.linspace(-5, 5, 1000)
def pdensitynormal(x,sigma_squared,mu):
    sigma = np.sqrt(sigma_squared)
    return 1/(sigma*np.sqrt(2*np.pi))*np.exp(-1/2*((x-mu)/sigma)**2)
ax.plot(x, pdensitynormal(x,0.2,0), label="$\sigma^2=0.2, \mu=0$")
ax.plot(x, pdensitynormal(x,1,0), label="$\sigma^2=1, \mu=0$")
ax.plot(x, pdensitynormal(x,5,0), label="$\sigma^2=5, \mu=0$")
ax.plot(x, pdensitynormal(x,0.5,-2), label="$\sigma^2=0.5, \mu=-2$")
ax.legend(loc="upper right")
plt.show()

3D line, contour plot and scatter plot

Plot a 3D surface

from mpl_toolkits import mplot3d # Required for 3d plots
fig = plt.figure()
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
xline = np.linspace(-10, 10, 1000)
yline = np.linspace(-10, 10, 1000)
# Just a line
zline = xline**2 + yline**2
ax.plot3D(xline, yline, zline, 'gray')
# A mesh grid
X, Y = np.meshgrid(xline, yline)
Z = X**2 + Y**2
ax.contour3D(X, Y, Z, 50, cmap='binary')
# Scatter points
ax.scatter(1,2,3)
plt.show()

See how the np.meshgrid objects interact with each other. Note this nested loop is not the optimal way to compute Z. It is better to use X**2 + Y**2 directly as above.

for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        vector = np.array([X[i,j],Y[i,j]])
        Z[i,j] = np.linalg.norm(vector)**2
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')

Axes object

Create an axes object

import pandas
df = pandas.DataFrame({'x':range(0,30), 'y':range(110,140)})
plot = df.plot(x="x", y="y", kind="scatter")
help(plot)

A faceted plot creates another axes object for each facet.

It has the following methods

print([m for m in dir(plot) if not m.startswith("_")])

['ArtistList', 'acorr', 'add_artist', 'add_callback', 'add_child_axes', 
'add_collection', 'add_container', 'add_image', 'add_line', 'add_patch', 
'add_table', 'angle_spectrum', 'annotate', 'apply_aspect', 'arrow', 
'artists', 'autoscale', 'autoscale_view', 'axes', 'axhline', 'axhspan', 
'axis', 'axison', 'axline', 'axvline', 'axvspan', 'bar', 'bar_label', 
'barbs', 'barh', 'bbox', 'boxplot', 'broken_barh', 'bxp', 'callbacks', 
'can_pan', 'can_zoom', 'child_axes', 'cla', 'clabel', 'clear', 'clipbox', 
'cohere', 'collections', 'containers', 'contains', 'contains_point', 
'contour', 'contourf', 'convert_xunits', 'convert_yunits', 'csd', 
'dataLim', 'drag_pan', 'draw', 'draw_artist', 'end_pan', 'errorbar', 
'eventplot', 'figure', 'fill', 'fill_between', 'fill_betweenx', 'findobj', 
'fmt_xdata', 'fmt_ydata', 'format_coord', 'format_cursor_data', 
'format_xdata', 'format_ydata', 'get_adjustable', 'get_agg_filter', 
'get_alpha', 'get_anchor', 'get_animated', 'get_aspect', 
'get_autoscale_on', 'get_autoscalex_on', 'get_autoscaley_on', 
'get_axes_locator', 'get_axisbelow', 'get_box_aspect', 'get_children', 
'get_clip_box', 'get_clip_on', 'get_clip_path', 'get_cursor_data', 
'get_data_ratio', 'get_default_bbox_extra_artists', 'get_facecolor', 
'get_fc', 'get_figure', 'get_frame_on', 'get_gid', 'get_gridspec', 
'get_images', 'get_in_layout', 'get_label', 'get_legend', 
'get_legend_handles_labels', 'get_lines', 'get_mouseover', 'get_navigate', 
'get_navigate_mode', 'get_path_effects', 'get_picker', 'get_position', 
'get_rasterization_zorder', 'get_rasterized', 'get_renderer_cache', 
'get_shared_x_axes', 'get_shared_y_axes', 'get_sketch_params', 'get_snap', 
'get_subplotspec', 'get_tightbbox', 'get_title', 'get_transform', 
'get_transformed_clip_path_and_affine', 'get_url', 'get_visible', 
'get_window_extent', 'get_xaxis', 'get_xaxis_text1_transform', 
'get_xaxis_text2_transform', 'get_xaxis_transform', 'get_xbound', 
'get_xgridlines', 'get_xlabel', 'get_xlim', 'get_xmajorticklabels', 
'get_xminorticklabels', 'get_xscale', 'get_xticklabels', 'get_xticklines', 
'get_xticks', 'get_yaxis', 'get_yaxis_text1_transform', 
'get_yaxis_text2_transform', 'get_yaxis_transform', 'get_ybound', 
'get_ygridlines', 'get_ylabel', 'get_ylim', 'get_ymajorticklabels', 
'get_yminorticklabels', 'get_yscale', 'get_yticklabels', 'get_yticklines', 
'get_yticks', 'get_zorder', 'grid', 'has_data', 'have_units', 'hexbin', 
'hist', 'hist2d', 'hlines', 'ignore_existing_data_limits', 'images', 
'imshow', 'in_axes', 'indicate_inset', 'indicate_inset_zoom', 'inset_axes', 
'invert_xaxis', 'invert_yaxis', 'is_transform_set', 'label_outer', 
'legend', 'legend_', 'lines', 'locator_params', 'loglog', 
'magnitude_spectrum', 'margins', 'matshow', 'minorticks_off', 
'minorticks_on', 'mouseover', 'name', 'patch', 'patches', 'pchanged', 
'pcolor', 'pcolorfast', 'pcolormesh', 'phase_spectrum', 'pick', 'pickable', 
'pie', 'plot', 'plot_date', 'properties', 'psd', 'quiver', 'quiverkey', 
'redraw_in_frame', 'relim', 'remove', 'remove_callback', 'reset_position', 
'scatter', 'secondary_xaxis', 'secondary_yaxis', 'semilogx', 'semilogy', 
'set', 'set_adjustable', 'set_agg_filter', 'set_alpha', 'set_anchor', 
'set_animated', 'set_aspect', 'set_autoscale_on', 'set_autoscalex_on', 
'set_autoscaley_on', 'set_axes_locator', 'set_axis_off', 'set_axis_on', 
'set_axisbelow', 'set_box_aspect', 'set_clip_box', 'set_clip_on', 
'set_clip_path', 'set_facecolor', 'set_fc', 'set_figure', 'set_frame_on', 
'set_gid', 'set_in_layout', 'set_label', 'set_mouseover', 'set_navigate', 
'set_navigate_mode', 'set_path_effects', 'set_picker', 'set_position', 
'set_prop_cycle', 'set_rasterization_zorder', 'set_rasterized', 
'set_sketch_params', 'set_snap', 'set_subplotspec', 'set_title', 
'set_transform', 'set_url', 'set_visible', 'set_xbound', 'set_xlabel', 
'set_xlim', 'set_xmargin', 'set_xscale', 'set_xticklabels', 'set_xticks', 
'set_ybound', 'set_ylabel', 'set_ylim', 'set_ymargin', 'set_yscale', 
'set_yticklabels', 'set_yticks', 'set_zorder', 'sharex', 'sharey', 
'specgram', 'spines', 'spy', 'stackplot', 'stairs', 'stale', 
'stale_callback', 'start_pan', 'stem', 'step', 'sticky_edges', 
'streamplot', 'table', 'tables', 'text', 'texts', 'tick_params', 
'ticklabel_format', 'title', 'titleOffsetTrans', 'transAxes', 'transData', 
'transLimits', 'transScale', 'tricontour', 'tricontourf', 'tripcolor', 
'triplot', 'twinx', 'twiny', 'update', 'update_datalim', 'update_from', 
'use_sticky_edges', 'viewLim', 'violin', 'violinplot', 'vlines', 'xaxis', 
'xaxis_date', 'xaxis_inverted', 'xcorr', 'yaxis', 'yaxis_date', 
'yaxis_inverted', 'zorder']

Figure object

Create a figure object

import pandas
df = pandas.DataFrame({'x':range(0,30), 'y':range(110,140)})
plot = df.plot(x="x", y="y", kind="scatter")
fig = plot.get_figure()
help(fig)

A figure object is the “The top level container for all the plot elements.” It has the following methods:

print([m for m in dir(fig) if not m.startswith("_")])

['add_artist', 'add_axes', 'add_axobserver', 'add_callback', 
'add_gridspec', 'add_subfigure', 'add_subplot', 'align_labels', 
'align_xlabels', 'align_ylabels', 'artists', 'autofmt_xdate', 'axes', 
'bbox', 'bbox_inches', 'callbacks', 'canvas', 'clear', 'clf', 'clipbox', 
'colorbar', 'contains', 'convert_xunits', 'convert_yunits', 'delaxes', 
'dpi', 'dpi_scale_trans', 'draw', 'draw_artist', 'draw_without_rendering', 
'execute_constrained_layout', 'figbbox', 'figimage', 'figure', 'findobj', 
'format_cursor_data', 'frameon', 'gca', 'get_agg_filter', 'get_alpha', 
'get_animated', 'get_axes', 'get_children', 'get_clip_box', 'get_clip_on', 
'get_clip_path', 'get_constrained_layout', 'get_constrained_layout_pads', 
'get_cursor_data', 'get_default_bbox_extra_artists', 'get_dpi', 
'get_edgecolor', 'get_facecolor', 'get_figheight', 'get_figure', 
'get_figwidth', 'get_frameon', 'get_gid', 'get_in_layout', 'get_label', 
'get_layout_engine', 'get_linewidth', 'get_mouseover', 'get_path_effects', 
'get_picker', 'get_rasterized', 'get_size_inches', 'get_sketch_params', 
'get_snap', 'get_tight_layout', 'get_tightbbox', 'get_transform', 
'get_transformed_clip_path_and_affine', 'get_url', 'get_visible', 
'get_window_extent', 'get_zorder', 'ginput', 'have_units', 'images', 
'is_transform_set', 'legend', 'legends', 'lines', 'mouseover', 'number', 
'patch', 'patches', 'pchanged', 'pick', 'pickable', 'properties', 'remove', 
'remove_callback', 'savefig', 'sca', 'set', 'set_agg_filter', 'set_alpha', 
'set_animated', 'set_canvas', 'set_clip_box', 'set_clip_on', 
'set_clip_path', 'set_constrained_layout', 'set_constrained_layout_pads', 
'set_dpi', 'set_edgecolor', 'set_facecolor', 'set_figheight', 'set_figure', 
'set_figwidth', 'set_frameon', 'set_gid', 'set_in_layout', 'set_label', 
'set_layout_engine', 'set_linewidth', 'set_mouseover', 'set_path_effects', 
'set_picker', 'set_rasterized', 'set_size_inches', 'set_sketch_params', 
'set_snap', 'set_tight_layout', 'set_transform', 'set_url', 'set_visible', 
'set_zorder', 'show', 'stale', 'stale_callback', 'sticky_edges', 'subfigs', 
'subfigures', 'subplot_mosaic', 'subplotpars', 'subplots', 
'subplots_adjust', 'suppressComposite', 'suptitle', 'supxlabel', 
'supylabel', 'text', 'texts', 'tight_layout', 'transFigure', 
'transSubfigure', 'update', 'update_from', 'waitforbuttonpress', 'zorder']

XY comparison scatter plot

When x and y are supposed to be the same value but are not necessarily equal. Compare the x and y values on a scatter plot to a y=x line.

def comp_plot(df, x_var, y_var, title):
    """Plot comparison for the given data frame"""
    # Scatter plot
    plt.scatter(df[x_var], df[y_var])
    # 1:1 line
    line = np.linspace(df[x_var].min(), df[x_var].max(), 100)
    plt.plot(line, line, 'r--')
    plt.xlabel(f'{x_var} additional text')
    plt.ylabel(f'{y_var} additional text')
    plt.title(title)
    return plt

Note comparing suggestions from Bard and ChatGPT-4

    # Create the 1:1 line suggested by bard
    line_x = np.linspace(x.min(), x.max(), 100)
    line_y = line_x
    plt.plot(line_x, line_y, 'r--')
    # 1:1 line suggested by GPT4 (wrong: it joins (min(x), min(y)) to (max(x), max(y)),
    # which is only the y=x line if the x and y ranges coincide)
    plt.plot([min(x), max(x)], [min(y), max(y)], 'r')

Save figures to pdf, png or svg files

This works with pandas plots and seaborn plots as well.

With the pyplot object, this only works immediately after building the plot.

plt.savefig("/tmp/bli.pdf")
plt.savefig("/tmp/bli.png")
plt.savefig("file_name.svg", bbox_inches='tight')

Save a plot object to a pdf file

import pandas
df = pandas.DataFrame({'x':range(0,30), 'y':range(110,140)})
plot = df.plot(x="x", y="y", kind="scatter")
fig = plot.get_figure()
fig.savefig('/tmp/output.pdf', format='pdf')

Save a grid plot object to a pdf file

import seaborn
fmri = seaborn.load_dataset("fmri")
g = seaborn.relplot(
    data=fmri, x="timepoint", y="signal", col="region",
    hue="event", style="event", kind="line",
    facet_kws={'sharey': False, 'sharex': False}
)
g.savefig("/tmp/fmri.pdf")

Pandas plots are matplotlib AxesSubplot objects

Add another line to a pandas plot

The function df.plot() returns a matplotlib axes object for a plot of the A and B variables. You can add another line for a different variable C using the plot() method of that axes object.

import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame({"A":range(0,30), "B":range(10,40)})
df["C"] = df["B"] + 2
# Using the plot method
ax = df.plot(x="A", y="B")
ax.plot(df["A"], df["C"])
plt.show()

Arguments of the df.plot() function

Example values for the df.plot() function:

  • figsize=(3,3) change the figure size. SO answer links to the documentation that explains that

    > "plt.figure(figsize=(10,5)) doesn't work because df.plot() creates its
    > own matplotlib.axes.Axes object, the size of which cannot be changed
    > after the object has been created. "
  • title='bla bla' add a plot title

  • colormap change colours (see the sketch below)
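A minimal sketch combining these arguments (the data frame is made up for the example):

import pandas
df = pandas.DataFrame({"y1": range(10), "y2": range(10, 20)})
# Plot both columns against the index with a small figure, a title and a colormap
df.plot(figsize=(3, 3), title="bla bla", colormap="viridis")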

Show pandas plots in ipython

Create some data and change the xticks labels

import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame({'x':range(0,30), 'y':range(10,40)})
df.set_index('x', inplace=True)
plot = df.plot(title='Two ranges')
type(plot)
# help(plot)
plot.set_xticks(range(0,31,10), minor=False)
plt.show()

Colour palette in pandas plots

Simple palette as a dictionary

palette = {'ssp2': 'orange',
           'fair': 'green', 
           'historical_period': 'black'}
df.plot(title = "Harvest Scenarios", ylabel="Million m3", color=palette)

Note: the argument for seaborn would be palette=palette.

Using a list of colours with matplotlib ListedColormap (see also the example in that documentation page), reusing the data frame from the previous section:

from matplotlib.colors import ListedColormap
df["z"] = 39 
df["a"] = 10
df.plot(colormap=ListedColormap(["red","green","orange"]), figsize=(3,3))
plt.show()
plt.savefig("/tmp/plotpalette.png")

Using a seaborn palette with the as_cmap=True argument:

palette = seaborn.color_palette("rocket_r", as_cmap=True)
df.plot(colormap=palette, figsize=(3,3))
# plt.show()
plt.savefig("/tmp/plotpalette2.png")

Histogram

Histogram

iris["petal_width"].hist(bins=20)

Options for title, labels, colours

import pandas
import matplotlib.pyplot as plt
series = pandas.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
series.hist(grid=False, bins=20, rwidth=0.9, color='#607c8e')
plt.title('Title')
plt.xlabel('Counts')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()

Histogram with a log scale
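A minimal sketch reusing the series from the example above; the log argument is passed through to the underlying matplotlib hist call:

series.hist(bins=20, log=True)
plt.show()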

Pandas plots side by side

Using the same df as above, show 2 plots side by side, based on this SO answer

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))
df.plot(title='Two ranges', ax=ax1)
df.plot(title='Two ranges', ax=ax2)
plt.show()

Plotly

The advantage of plotly is that it provides dynamic visualisation inside web pages, such as the possibility to zoom into a graph. It’s the open source component of a commercial project called Dash Enterprise.

For example, this notebook on machine learning used to enhance the localisation of weather forecasts, seen in the blog post What does machine learning have to do with weather.

Bubble chart

  • https://plotly.com/python/bubble-charts/ example:

    import plotly.express as px
    df = px.data.gapminder()
    fig = px.scatter(df.query("year==2007"), x="gdpPercap", y="lifeExp",
                     size="pop", color="continent", hover_name="country",
                     log_x=True, size_max=60)
    fig.show()

Facet chart

Facet chart where the y facet labels were removed and replaced with a common annotation. (There was an issue with the annotation disappearing when going full screen in a streamlit app.)

y_var = f"{flow} {element}"
fig = plotly.express.line(
    # shorten the plot facet titles
    df.rename(columns={"product_name": "p",
                       flow: y_var}),
    x="period",
    y=y_var,
    color="partner",
    facet_row="p",
    line_group="partner",
)
# Remove y facet labels
for axis in fig.layout:
    if type(fig.layout[axis]) == plotly.graph_objects.layout.YAxis:
        fig.layout[axis].title.text = ''
# Update y label, by adding to the existing annotation
fig.layout.annotations += (
    dict(
        x=0,
        y=0.5,
        showarrow=False,
        text=f"{flow} {element}",
        textangle=-90,
        # xanchor='left',
        # yanchor="middle",
        xref="paper",
        yref="paper"
    ), # Keep this comma, this needs to be a tuple
)

Save a figure to an image

Plotly figures are normally rendered as HTML pages, but you can convert a figure to a static image file with the write_image method (this requires the kaleido package):

fig.write_image("/tmp/fig.png")

XY comparison scatter plot

When x and y are supposed to be the same value but are not necessarily equal. Compare the x and y values on a scatter plot to a 1:1 line.

import plotly.graph_objects as go

def comp_plotly(df, x_var, y_var, title):
    """Plot comparison for the given data frame"""
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df[x_var], y=df[y_var], mode='markers', name='Data'))
    # 1:1 line from the minimum to the maximum of the x variable
    line = [df[x_var].min(), df[x_var].max()]
    fig.add_trace(go.Scatter(x=line, y=line, mode='lines', name='1:1 Line'))
    fig.update_layout(
        title=title,
        xaxis_title=x_var,
        yaxis_title=y_var
    )
    # Add the reporter, partner, and year to the tooltip
    fig.update_traces(
        hoverinfo='text',
        hovertext=list(zip(df['reporter'], df['partner'], df['year']))
    )
    return fig

this_primary_product = "rape_or_colza_seed"
selector = comp_2["primary_product"] == this_primary_product
comp_plotly(comp_2.loc[selector],
            x_var = 'primary_crop_eq_re_allocated_2nd_level_imported',
            y_var = 'primary_eq_imp_alloc_1',
            title = f"Step 2 primary crop import {this_primary_product}")

Plotnine

Grammar of graphics for python https://github.com/has2k1/plotnine

Figure size

Change the plotnine figure size

import plotnine
plotnine.options.figure_size = (12, 8)

Facet grid

Create a facet grid plot

from plotnine import ggplot, aes, geom_line, facet_grid, labs
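A minimal sketch of a facet grid line plot with plotnine (the data frame and the formula-style facet specification are assumptions, not from the original notes):

import pandas
df = pandas.DataFrame({
    "year": list(range(2000, 2005)) * 2,
    "value": range(10),
    "country": ["A"] * 5 + ["B"] * 5,
})
# One row of facets per country
p = (ggplot(df, aes(x="year", y="value"))
     + geom_line()
     + facet_grid("country ~ .")
     + labs(y="Value"))
print(p)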

Seaborn

All Seaborn examples below require the following imports and datasets:

import seaborn
iris = seaborn.load_dataset("iris")
tips = seaborn.load_dataset("tips") 
fmri = seaborn.load_dataset("fmri")
from matplotlib import pyplot as plt

Resources

AA seaborn interfaces

Facet grid plots

Use Figure-level interface for drawing plots onto a FacetGrid:

  • catplot for drawing categorical plots

  • relplot for drawing relational plots

    • kind=“scatter” (the default) for scatter plots
    • kind=“line” for line plots

The figure level interfaces return FacetGrid objects which can be reused to add subsequent layers.

Seaborn version 0.12.1 introduced an object interface which can also be used to make facet plots.

Combining many plots together

This hack uses an image library to combine many plots together:

from PIL import Image
# composite_plot_dir is a pathlib.Path defined elsewhere
# Load the images
p_harea_eu = Image.open(composite_plot_dir / "harea_eu.png")
p_hexprov_eu = Image.open(composite_plot_dir / "hexprov_eu.png")
p_sink_eu = Image.open(composite_plot_dir / "sink_eu.png")
# Get the widths and heights of the images
harea_width, harea_height = p_harea_eu.size
hexprov_width, hexprov_height = p_hexprov_eu.size
sink_width, sink_height = p_sink_eu.size
# Determine the width of the combined image (the maximum width)
max_width = max(harea_width, hexprov_width, sink_width)
# Create a new image with the combined height and maximum width
combined_height = harea_height + hexprov_height + sink_height
combined_image = Image.new("RGB", (max_width, combined_height), color="white")
# Paste the individual images into the combined image
combined_image.paste(p_hexprov_eu, (0, 0))
combined_image.paste(p_harea_eu, (0, hexprov_height))
combined_image.paste(p_sink_eu, (0, harea_height + hexprov_height))
# Save the combined image
combined_image.save(composite_plot_dir / "combined_image.png")

Facet title and size

Change row and column labels to display only the content (not “label=”) and change the size to 30.

import seaborn
seaborn.set_theme(style="darkgrid")
df = seaborn.load_dataset("penguins")
g = seaborn.displot(
    df, x="flipper_length_mm", col="species", row="sex",
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)
g.fig.subplots_adjust(top=.9, bottom=0.1, right=0.9)
g.set_titles(row_template="{row_name}", col_template="{col_name}", size=30)

See also the figure size section.

Iterate over facets

Iterate over facet objects

for i, ax in enumerate(g.axes.flatten()):
    print(i, ax.title.get_text())

for ax in g.axes.flatten():
    this_forest_type = ax.title.get_text()

Object interface

Facet line plot example

import seaborn
import seaborn.objects as so
healthexp = seaborn.load_dataset("healthexp")
p = (
    so.Plot(healthexp, x="Year", y="Life_Expectancy")
    .facet("Country", wrap=3)
    .add(so.Line(alpha=.3), group="Country", col=None)
    .add(so.Line(linewidth=3))
)
p.show()

Example from my data

import seaborn.objects as so
(
    so.Plot(df, x="age", y="volume")
    .facet("forest_type", wrap=6)
    .add(so.Line(alpha=.3), group="forest_type", col=None)
    .add(so.Line(linewidth=3))
)

Axis and other labels

https://seaborn.pydata.org/generated/seaborn.objects.Plot.label.html

p = (
    so.Plot(penguins, x="bill_length_mm", y="bill_depth_mm")
    .add(so.Dot(), color="species")
)
p.label(x="Length", y="Depth", color="")

Facets in the object interface

https://seaborn.pydata.org/generated/seaborn.objects.Plot.facet.html

“Use Plot.share() to specify whether facets should be scaled the same way.”

p.facet("clarity", wrap=3).share(x=False)

Save to a file

Save to a file

Plot.save(loc, **kwargs)
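For example, with the plot object p built above (the file path is made up; extra keyword arguments are passed on to matplotlib savefig):

p.save("/tmp/healthexp_facets.png", bbox_inches="tight")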

Axes and labels

Invert axis

Of a single figure

ax.invert_yaxis()

Of a grid figure

g = seaborn.relplot(x='crop', y='ranking', col='intensity',
                      hue='conservation_target', data=df)
for ax in g.axes[0]:
    ax.invert_yaxis()

Axes labels

Use p.set() to set a y label and a title

p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.set(xlabel = "Petal Length", ylabel = "Petal Width", title = "Flower sizes")
plt.show()

Scientific notation on axes

Use scientific notation on the axes of a g FacetGrid object:

for axes in g.axes.flat:
    axes.ticklabel_format(axis='both', style='scientific', scilimits=(0, 0))

Do not use scientific notation on the axes

plt.ticklabel_format(style='plain', axis='y')

Use scientific notation on the y axis labels at every tick, without putting a 1e7 offset at the top that might be overwritten by a facet label.

g = seaborn.relplot(
    data=rp_global.reset_index(), x="step", y="primary_eq", col="primary_product",
    hue="year", kind="line",
    col_wrap=3, height=3,
    facet_kws={'sharey': False, 'sharex': False}
)

def y_fmt(x, pos):
    """function to format the y axis"""
    return f"{x:.0e}"


from matplotlib.ticker import FuncFormatter
#g.set(yticklabels=[])
for ax in g.axes.flat:
    ax.yaxis.set_major_formatter(FuncFormatter(y_fmt))

Rotate axes labels

Rotate index labels

plt.xticks(rotation=70)
plt.tight_layout()
plt.show()

Axes labels on grid plots

Set a common title for grid plots

g = seaborn.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
# Supplementary title
g.fig.suptitle("I don't smoke and I don't tip.")

Change axis label

g.set_ylabels("Y label")

Add larger axis labels for grid plots

g.fig.supxlabel("time in years")
g.fig.supylabel("weight in kg")

In case the title is overwritten on the subplots, you might need to use fig.subplots_adjust() as such:

g.fig.subplots_adjust(top=.95)

Axes limit

Set limits on one axis in a Seaborn plot:

p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.set(ylim=(-2,None))

In a Seaborn facet grid, see the SO question How to set xlim and ylim in seaborn facet grid:

g = seaborn.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
g.set(ylim=(0, None)) 

Year to date time objects

Years are sometimes displayed with commas; convert them to datetime objects to avoid this:

pandas.to_datetime(comp["year"], format="%Y")

Plot title

Use set_title to add a title:

(seaborn
 .scatterplot(x="total_bill", y="tip", data=tips)
 .set_title('Progression of tips along the bill amount')
)

Bar plot and histogram

barplot

Bar plot

import matplotlib.pyplot as plt
import seaborn
iris = seaborn.load_dataset("iris")
iris_agg = iris.groupby("species").agg(sum)
iris_agg_long = iris_agg.melt(ignore_index=False).reset_index()
seaborn.barplot(data=iris_agg_long, x="variable", y="value", hue="species")

Rotate index labels

plt.xticks(rotation=70)
plt.tight_layout()
plt.show()

Other example

seaborn.barplot(df, x="scenario", y="value", hue="variable")

For stacked bar, use df.plot(), which uses matplotlib

p = df.plot.bar(stacked=True)

Grid Bar plot

Draw a facet bar plot from SO for each combination of size and smoker/non smoker

import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
tips=sns.load_dataset("tips")
g = sns.FacetGrid(tips, col = 'size',  row = 'smoker', hue = 'day')
g = (g.map(sns.barplot, 'time', 'total_bill', errorbar = None).add_legend())
plt.show()

Another example https://stackoverflow.com/a/35234137/2641825

times = df.interval.unique()
g = sns.FacetGrid(df, row="variable", hue="segment", palette="Set3", size=4, aspect=2)
g.map(sns.barplot, 'interval', 'value', order=times)

Grid histogram

https://seaborn.pydata.org/examples/faceted_histogram.html

import seaborn
seaborn.set_theme(style="darkgrid")
df = seaborn.load_dataset("penguins")
seaborn.displot(
    df, x="flipper_length_mm", col="species", row="sex",
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)

Bar plot in the object interface

Bar plot using the object interface

import seaborn
import seaborn.objects as so
titanic = seaborn.load_dataset("titanic")
p = so.Plot(titanic, x="class", color="sex")
p = p.add(so.Bar(), so.Count(), so.Stack())
p.show()

Bar plot with facets

p = p.facet("sex")
p.show()

Other example of a bar plot with facets, using a palette

    p = so.Plot(df_long, x="year", y="value", color="sink")
    p = p.add(so.Bar(),  so.Stack())
    p = p.facet("pathway", "country_group").share(x=False)
    p = p.layout(size=(14, 9), engine="tight")
    palette = {'living_biomass_sink': 'forestgreen',
               'dom_sink': 'gold',
               'soil_sink': 'black',
               'hwp_sink_bau': 'chocolate'}
    p = p.scale(x=so.Continuous().tick(at=selected_years), color=palette)
    p = p.label(x="", y="Million t CO2 eq", color="")
    # p = p.scale(color=palette)
    years_string = "_".join([str(x) for x in selected_years])
    index_string = "".join([x[0] for x in index])
    print(composite_plot_dir)
    p.save(composite_plot_dir / f"sink_composition_{index_string}_{years_string}.png")

Palettes and styles

Colour palette

See various examples in the plots in the seaborn section. The palette can be defined from pre existing palettes

palette = seaborn.color_palette("rocket_r")

Without argument this function displays the default palette

seaborn.color_palette()

It can translate a list of colour codes into a palette

seaborn.color_palette(["r","g","b"])
seaborn.color_palette(["red","green","blue", "orange"])

This function is used internally by the palette argument of plotting functions:

p1 = sns.relplot(x="Growth", y="Value", hue="Risk", col="Mcap", data=mx, s=200, palette=['r', 'g', 'y'])

Another example using a dictionary for the palette

palette = {"fair":"green", "ssp2":"orange", "historical":"black"}
p = seaborn.lineplot(x="year", y="gdp_t", hue="scenario", data=df_gdp_eu, palette=palette)

Seaborn tutorial on choosing colour palettes https://seaborn.pydata.org/tutorial/color_palettes.html

According to https://stackoverflow.com/a/46174007/2641825 you can also use a dictionary to associate hue values to a palette element.

selected_products = ["wood_fuel", 
                     "sawlogs_and_veneer_logs",
                     "pulpwood_round_and_split_all_species_production",
                     "other_industrial_roundwood"]
palette = dict(zip(selected_products, ["red", "brown", "blue", "grey"]))

Generate darker and lighter green and orange colours

lighter_green = seaborn.dark_palette('green', n_colors=5)[0]
darker_green = seaborn.dark_palette('green', n_colors=5, reverse=True)[0]
lighter_orange = seaborn.dark_palette('orange', n_colors=5)[0]
darker_orange = seaborn.dark_palette('orange', n_colors=5, reverse=True)[0]

Colour names

import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
colors = dict(mcolors.BASE_COLORS, **mcolors.CSS4_COLORS)

A SO answer has a plot of this dict of colors with names.

Colour blindness

Linestyle dict (doesn’t work)

Example of specifying line styles that doesn’t work:

linestyle_dict = {'Industrial roundwood': 'solid', 'Fuelwood': 'dotted'}
g = sns.relplot(data=df.loc[selector], x='year', y='demand', col='country',
                hue='combo_name', style="faostat_name", kind='line',
                col_wrap=col_wrap, palette=palette_combo,
                facet_kws={'sharey': False, 'sharex': False},
                dashes=linestyle_dict)

See also https://stackoverflow.com/questions/65549047/how-to-apply-a-linestyle-to-a-specific-line-in-seaborn-lineplot

Figure size

Resize a scatter plot

p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.figure.set_figwidth(15)

Larger grid plots

set_figwidth and set_figheight work well to resize a grid object in its entirety.

g = seaborn.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
g.fig.set_figwidth(10)
g.fig.set_figheight(10) 

Try also

g.fig.set_size_inches(15,15)

Mentioned as a comment under this answer

To change the height and aspect ratio of individual grid cells, you can use the height and aspect arguments of the FacetGrid call as such:

import seaborn 
import matplotlib.pyplot as plt
seaborn.set()
iris = seaborn.load_dataset("iris")
# Change height and aspect ratio
g = seaborn.FacetGrid(iris, col="species", height=8, aspect=0.3)
iris['species'] = iris['species'].astype('category')
g.map(seaborn.scatterplot,'petal_length','petal_width','species')
plt.show()

help(seaborn.FacetGrid)

aspect * height gives the width of each facet in inches.

Legend

Move a legend below a grid plot

g.fig.subplots_adjust(left=0.28, top=0.9) # resize the plot
g.legend.set_bbox_to_anchor((0.5, 0.15))

Another way to move the legend and make it flat

seaborn.move_legend(g, "upper center", bbox_to_anchor=(0.5, 0.1), ncol=4)

Move a legend below the plot

seaborn.move_legend(g, "upper left", bbox_to_anchor=(.05, .05), frameon=False, ncol=4, title="")

Code snippet to redraw a legend (didn’t use it in the end)

h,l = g.axes[0].get_legend_handles_labels()
g.axes[0].legend_.remove()
g.fig.legend(h,l, ncol=4)
g.legend.set_bbox_to_anchor((.05,.05)) #, transform=g.fig.transFigure)

Line plot

Create a line plot with a title and axis labels.

import numpy as np
import pandas
df = pandas.DataFrame({'value':np.random.random(100), 
                       'year':range(1901,2001)})
p = seaborn.lineplot(x="year", y="value", data=df)
p.set(ylabel = "Random variation", title = "Title here")
plt.show()

Line styles

Example generated by GPT4 with a series of prompt related to a time series plot I was refining.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Set a random seed for reproducibility
np.random.seed(42)

# Create a synthetic dataset with a random walk
years = np.arange(2000, 2031)
categories = ['x', 'y']
data = []

for category in categories:
    random_walk = np.random.randn(len(years)).cumsum()
    data.extend(zip(years, [category] * len(years), random_walk))

df = pd.DataFrame(data, columns=['year', 'category', 'value'])

# Create a custom linestyle and color dictionary for each category (x, y)
style_dict = {'x': ('-', 'black'), 'y': ('--', 'black')}

# Plot the lineplot
ax = sns.lineplot(
    x="year",
    y="value",
    hue="category",
    style="category",
    data=df
)

# Apply custom linestyle and color for each category (x, y)
for line, category in zip(ax.lines, df["category"].unique()):
    linestyle, color = style_dict[category]
    line.set_linestyle(linestyle)
    line.set_color(color)

# Set the ylabel and title
ax.set(ylabel="Value", title="Random Walk by Category")

# Modify the legend colors to black
legend = ax.legend()
for handle in legend.legendHandles:
    handle.set_color('black')

plt.show()

Grid line plot

relplot

help(seaborn.relplot)

> "This function provides access to several different axes-level functions
> that show the relationship between two variables with semantic mappings
> of subsets. The ``kind`` parameter selects the underlying axes-level
> function to use:
>     - :func:`scatterplot` with ``kind="scatter"``; the default
>     - :func:`lineplot` with ``kind="line"``
> Extra keyword arguments are passed to the underlying function, so you should
> refer to the documentation for each to see kind-specific options."

Example:

  • plot signal through time and facet along the region

  • use different axes size. This requires passing a dictionary to FacetGrid.

  • Add a y label

  • adjust the left margin so that the y label doesn’t overwrite the axis

  • Set the Y limit to zero

    g = seaborn.relplot(
        data=fmri, x="timepoint", y="signal", col="region",
        hue="event", style="event", kind="line",
        col_wrap=1, height=3,
        facet_kws={'sharey': False, 'sharex': False}
    )
    g.fig.supylabel("Adaptive Engagement of Cognitive Control")
    g.fig.subplots_adjust(left=0.28, top=0.9)
    g.fig.suptitle("Example")
    g.set_ylabels("Y label")
    g.set(ylim=(0, None))
    plt.show()

Older example from https://seaborn.pydata.org/examples/faceted_lineplot.html

import seaborn as sns
sns.set_theme(style="ticks")

dots = sns.load_dataset("dots")

# Define the palette as a list to specify exact values
palette = sns.color_palette("rocket_r")

# Plot the lines on two facets
g = sns.relplot(
    data=dots,
    x="time", y="firing_rate",
    hue="coherence", size="choice", col="align",
    kind="line", size_order=["T1", "T2"], palette=palette,
    height=5, aspect=.75, facet_kws=dict(sharex=False),
)
g.fig.suptitle("Dots example")
# Add a title and adjust the position so the title doesn't overwrite facets
g.set_ylabels("Y label")
plt.subplots_adjust(top=0.9)

Marker and text

Add a marker and text to a plot; this works both with simple plots and faceted plots.

plt.plot(2030, -420, marker='*', markersize=10, color='red')
plt.text(2030+1, -420, "Target -420", fontsize=10)

Grid Text

From a SO answer (https://stackoverflow.com/a/59775753/2641825)

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
data = [['APOLLOHOSP', 8, 6, 'High', 'small'],
        ['ANUP', 8, 7, 'High', 'small'],
        ['SIS', 4, 6, 'High', 'mid'],
        ['HAWKINCOOK', 5, 2, 'Low', 'mid'],
        ['NEULANDLAB', 6, 4, 'Low', 'large'],
        ['ORIENTELEC', 7, 9, 'Low', 'mid'],
        ['AXISBANK', 2, 3, 'Medium', 'mid'],
        ['DMART', 4, 1, 'Medium', 'large'],
        ['ARVIND', 2, 10, 'Medium', 'small'],
        ['TCI', 1, 7, 'High', 'mid'],
        ['MIDHANI', 5, 5, 'Low', 'large'],
        ['RITES', 6, 4, 'Medium', 'mid'],
        ['COROMANDEL', 9, 9, 'High', 'small'],
        ['SBIN', 10, 3, 'Medium', 'large']]
mx = pd.DataFrame(data=data, columns=["code", "Growth", "Value", "Risk", "Mcap"])
plotnum = {'small': 0, 'mid': 1, 'large': 2}
p1 = sns.relplot(x="Growth", y="Value", hue="Risk", col="Mcap", data=mx, s=200, palette=['r', 'g', 'y'])

for ax in p1.axes[0]:
    ax.set_xlim(0.0, max(mx["Growth"]) + 1.9)
for row in mx.itertuples():
    print(row)
    ax = p1.axes[0, plotnum[row.Mcap]]
    ax.text(row.Growth + 0.5, row.Value, row.code, horizontalalignment='left')
plt.show()

Pair plot

seaborn.pairplot

“Plot pairwise relationships in a dataset.”
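A minimal example on the iris dataset loaded above, colouring by species:

seaborn.pairplot(iris, hue="species")
plt.show()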

Scatter Plot

Create a scatter plot

import seaborn
import matplotlib.pyplot as plt
tips = seaborn.load_dataset("tips")
seaborn.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()

Group by another variable and show the groups with different colors:

seaborn(x="total_bill", y="tip", hue="time", data=tips)

scatterplot

Create a scatter plot with a title and axis labels

p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.set(xlabel = "Petal Length", ylabel = "Petal Width", title = "Flower sizes")
plt.show()

Grid scatter plot

Draw a scatter plot for each iris species, using the recommended relplot() function:

g = seaborn.relplot(x='petal_length', y='petal_width', col='species', hue='species', data=iris)
plt.show()

help(seaborn.relplot) explains that it returns a FacetGrid object:

> " After plotting, the :class:`FacetGrid` with the plot is returned and can be
> used directly to tweak supporting plot details or add other layers."

Old way: using FacetGrid directly requires changing the species to a categorical variable in order to have a different colour for each species.

g = seaborn.FacetGrid(iris, col="species", height=6)
iris['species'] = iris['species'].astype('category')
# Use map_dataframe to name the arguments
g.map_dataframe(seaborn.scatterplot,x='petal_length',y='petal_width',hue='species')
plt.show()

# Old way without named argument
g.map(seaborn.scatterplot,'petal_length','petal_width','species')
plt.show()

Notice that if you don’t change the species column to a categorical variable, the colour will not vary across the species. I reported this issue, which led me to update the seaborn documentation in this merge request.

“When using seaborn functions that infer semantic mappings from a dataset, care must be taken to synchronize those mappings across facets. In other words, some mechanism needs to ensure that the same mapping is used in each facet. This can be achieved for example by passing palette dictionaries or by defining categorical types in your dataframe. In most cases, it will be better to use a figure-level function (e.g. :func:relplot or :func:catplot) than to use :class:FacetGrid directly.”

Grid scatter plot with x=y line

A grid scatter plot with an x=y line for comparison purposes

g = seaborn.relplot(data=df,
                x="x_var",
                y="y=var",
                col="year",
                hue="partner",
                kind="scatter",
               )
g.fig.subplots_adjust(top=0.9)
# Add x=y line
for ax in g.axes.flat:
    ax.plot(ax.get_xlim(), ax.get_ylim(), ls="--", c=".3", scalex=False, scaley=False)

Sample data

Show all Seaborn sample datasets

for dataset in seaborn.get_dataset_names():
    print(dataset)
    print(seaborn.load_dataset(dataset).head())

Squarify treemaps

Plot a tree map from the python graph gallery

import matplotlib.pyplot as plt
import squarify    # pip install squarify (algorithm for treemap)
import pandas
df = pandas.DataFrame({'nb_people':[8,3,4,2], 'group':["group A", "group B", "group C", "group D"] })
squarify.plot(sizes=df['nb_people'], label=df['group'], alpha=.8 )
plt.axis('off')
plt.show()

Vega Altair
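A minimal sketch of an Altair scatter plot on the seaborn iris dataset (assuming the altair package is installed; not part of the original notes):

import altair as alt
import seaborn
iris = seaborn.load_dataset("iris")
# Encode petal dimensions on the axes and colour by species
chart = alt.Chart(iris).mark_circle().encode(
    x="petal_length", y="petal_width", color="species"
)
chart.save("/tmp/iris_altair.html")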

Print

How to print coloured text at the terminal?

“Print a string that starts a color/style, then the string, and then end the color/style change with ‘\x1b[0m’.”

For example

print(1000 * ("\x1b[1;32;44m" + "Winter" + "\x1b[0m" + ", " +
              "\x1b[1;32;42m" + "Spring" + "\x1b[0m" + ", " +
              "\x1b[1;35;41m" + "Summer" + "\x1b[0m" + ", " +
              "\x1b[1;35;45m" + "Autumn" + "\x1b[0m" + ", "))

Profiling and measuring time

Profiling

Run a script with the profiler, from within ipython

%run -i -p run_zz.py

Memory profiling https://stackoverflow.com/a/15682871/2641825

Time it

How can I time a code segment for testing performance with Pythons timeit?

Time a function:

import timeit
import time
def wait():
    time.sleep(1)
timeit.timeit(wait, number=3)

“If you are profiling your code and can use IPython, it has the magic function %timeit. %%timeit operates on cells.”

%timeit wait()

Time a code block:

import timeit
start_time = timeit.default_timer()
# code you want to evaluate
elapsed = timeit.default_timer() - start_time

R and python

See also the R page for more details on R.

Reddit python vs R

“R is for analysis. Python is for production. If you want to do analysis only, use R. If you want to do production only, use Python. If you want to do analysis then production, use Python for both. If you aren’t planning to do production then it’s not worth doing, (unless you’re an academic). Conclusion: Use python.”

History

The central objects in R are vectors, matrices and data frames, that is why I mostly compare R to the python packages numpy and pandas. R was created almost 20 years before numpy and more than 40 years before pandas.

R_(programming_language)

“R is an implementation of the S programming language combined with lexical scoping semantics, inspired by Scheme. S was created by John Chambers in 1976 while at Bell Labs. A commercial version of S was offered as S-PLUS starting in 1988.”

NumPy history

“In 1995 the special interest group (SIG) matrix-sig was founded with the aim of defining an array computing package; among its members was Python designer and maintainer Guido van Rossum, who extended Python’s syntax (in particular the indexing syntax) to make array computing easier. […] An implementation of a matrix package was completed by Jim Fulton, then generalized by Jim Hugunin and called Numeric. […] new package called Numarray was written as a more flexible replacement for Numeric. Like Numeric, it too is now deprecated. […] In early 2005, NumPy developer Travis Oliphant wanted to unify the community around a single array package and ported Numarray’s features to Numeric, releasing the result as NumPy 1.0 in 2006.”

Pandas_(software)

“Developer Wes McKinney started working on pandas in 2008 while at AQR Capital Management out of the need for a high performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR he was able to convince management to allow him to open source the library.”

Migrating from R to python

“Python is a full fledge programming language but it is missing statistical and plotting libraries. Vectors are an after thought in python most functionality can be reproduced using operator overloading, but some functionality looks clumsy.”

Numpy and R

R session showing a division by zero returning an infinite value.

> 1/0
[1] Inf

Python session showing a division by zero error for normal integer division and the same operation on a numpy array returning an infinite value with a warning.

In [1]: 1/0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-1-9e1622b385b6> in <module>
----> 1 1/0

ZeroDivisionError: division by zero

In [2]: import numpy as np

In [3]: np.array([1]) / 0
/home/paul/.local/bin/ipython:1: RuntimeWarning: divide by zero encountered in true_divide
  #!/usr/bin/python3
Out[3]: array([inf])

Pandas comparison with R

R data frame to be used for examples:

df = data.frame(x = 1:3, y = c('a','b','c'), stringsAsFactors = FALSE)

Pandas data frame to be used for examples:

import pandas
df = pandas.DataFrame({'x' : [1,2,3], 'y' : ['a','b','c']})

Base R                    | python or pandas                                | SO questions
--------------------------|-------------------------------------------------|------------------------------------------------------
df[df$y %in% c('a','b'),] | df[df['y'].isin(['a','b'])]                     | list of values to select a row
dput(df)                  | df.to_dict(orient="list")                       | Print pandas data frame for reproducible example
expand.grid(df$x,df$y)    | itertools.product                               | see section below
ifelse                    | df.where()                                      | ifelse in pandas
gsub                      | df.x.replace(regex=True) or df.x.str.replace()  | gsub in pandas
length(df) and dim(df)    | df.shape                                        | row count of a data frame
rbind                     | pandas.concat                                   | Pandas version of rbind
rep(1,3)                  | [1]*3                                           |
seq(1:5)                  | np.array(range(0,5))                            | numpy function to generate sequences
summary                   | describe                                        |
str                       | df.info()                                       | pandas equivalents for R functions like str, summary and head

ifelse in pandas
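
A minimal sketch with a made-up data frame, showing the usual equivalents of R’s ifelse (numpy.where and the Series.where method):

import numpy as np
import pandas
df = pandas.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
# R: ifelse(df$x > 1, "big", "small")
df['size'] = np.where(df['x'] > 1, 'big', 'small')
# Series.where keeps values where the condition is True and replaces the others
df['x_capped'] = df['x'].where(df['x'] <= 2, other=2)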

The mapping of tidyverse to pandas is:

tidyverse             | pandas                                                 | Help or SO questions
----------------------|--------------------------------------------------------|-----------------------------
arrange               | df.sort_values(by="y", ascending=False)                |
df %>% select(-a,-b)  | df.drop(columns=['x', 'y'])                            |
select(a)             | df.loc[:,"x"]  # strict, the column has to be present  |
                      | df.filter(items=['x'])  # not strict                   |
select(contains("a")) | df.filter(regex='x')                                   |
filter                | df.query("y=='b'")                                     |
group_by              | groupby                                                |
lag                   | shift                                                  | pandas lag function
mutate                | df.assign(e=lambda x: x["a"] * 3)                      | assign
pivot_longer          | melt or wide_to_long                                   |
pivot_wider           | pivot                                                  |
rename                | df.rename(columns={'a':'new'})                         |
separate              | df[['b','c']] = df.a.str.split(',', n=1, expand=True)  | pandas separate str section
separate              | df[['b','c']] = df.a.str.split(',', expand=True)       |
summarize             | agg                                                    |
unite                 | df["z"] = df.y + df.y                                  | pandas unite
unnest                | explode                                                | unnest in pandas

Methods to use inside the .groupby().agg() method:
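
For example, a minimal sketch with a made-up data frame showing a few aggregation methods passed to .agg() (named aggregation, pandas >= 0.25):

import pandas
df = pandas.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
# Named aggregations: built-in method names or custom functions
df.groupby("group").agg(
    total=("value", "sum"),
    average=("value", "mean"),
    spread=("value", lambda x: x.max() - x.min()),
)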

Expand grid in pandas

This SO answer provides an implementation of expand grid using itertools:

import itertools
import pandas
countries = ["a","b","c","d"]
years = range(1990, 2020)
expand_grid = list(itertools.product(countries, years))
df = pandas.DataFrame(expand_grid, columns=('country', 'year'))

Another SO answer on the same topic

Blogs and quotes on Pandas and R

“One thing that is a blessing and a curse in R is that the machine learning algorithms are generally segmented by package. […] it can be a pain for day-to-day use where you might be switching between algorithms. […] scikit-learn provides a common set of ML algorithms all under the same API.

“one thing that R still does better than Python is plotting. Hands down, R is better in just about every facet. Even so, Python plotting has matured though it’s a fractured community.”

“Tidyverse allows a mix of quoted and unquoted references to variable names. In my (in)experience, the convenience this brings is accompanied by equal consternation. It seems to me a lot of the problems solved by tidyeval would not exist if all variables were quoted all the time, as in pandas, but there are likely deeper truths I’m missing here…”

Help of the R function unite from the tidyr package:

“col: The name of the new column, as a string or symbol. This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with ‘rlang::ensym()’ (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).”

The use of symbols which do not represent actual objects was frustrating at first when using pandas, because we had to use df[“x”] to assign vectors to new column names, whereas we could use df.x to display them.

Xarray, pandas and R

The xarray user guide page on pandas cites Hadley Wickham’s paper on tidy data:

“Tabular data is easiest to work with when it meets the criteria for tidy data”.

Personal reflection

R is great for statistical analysis and plotting. You can also use it to elaborate a pipeline to load data, prepare it and analyse it. But when things start to get complicated, such as loading json data from APIs, dealing with http request issues, or understanding lazy evaluation and the consequences of non standard evaluation, moving down the rabbit hole can get really complicated with R. The rabbit hole slide is smoother with python. I have the feeling that I keep a certain level of understanding at all steps. It’s just a matter of taste anyway.

The Python language can be more verbose in some respects, but it allows for greater programmability. It is also more predictable because non standard evaluation doesn’t create scoping problems, and it makes it possible to dive deeper into input/output issues such as URL request headers. R remains very good for data exploration, statistical analysis and plotting because non standard evaluation makes it possible to call variables without quotes and to pass formulas to plotting and estimation functions.

I see R more like the bash command line. It’s great for scripts, but you wouldn’t want to write large applications in bash.

Non standard evaluation doesn’t exist in python.

  • An email thread discussing the idea of non standard evaluation in python.

  • A comparison of a python implementation and an R implementation using non standard evaluation.

Security

  • Compromised PyTorch-nightly dependency chain between December 25th and December 30th, 2022.

    “PyTorch-nightly Linux packages installed via pip during that time installed a dependency, torchtriton, which was compromised on the Python Package Index (PyPI) code repository and ran a malicious binary. This is what is known as a supply chain attack and directly affects dependencies for packages that are hosted on public package indices.”

  • Anaconda was not affected https://www.anaconda.com/blog/anaconda-unaffected-by-pytorch-security-incident 

    “Conda users installing packages from Anaconda’s “main” channel are not impacted. This is because Anaconda’s official channels (the location where all our packages are stored) only contain packages built from stable upstream releases, while the affected PyTorch releases were nightly, development builds.

    “Update: we have confirmed with the conda-forge maintainers that their PyTorch packages are also built from stable upstream releases and are similarly not impacted.”

String

See also string operations in pandas character vectors.

SO answer providing various ways to concatenate python strings.

F string

Number formatting in f strings

How to print number with commas as thousands separators?

Thousand mark

f"{1e6:,}"

Round to 2 decimal places

f"{0.129456789:.2f}"

See also string operations in pandas with df[“x”].str methods.

Search and replace

Simple search with in returns True or False

"a" in "bla"
"z" in "bla"

Regex patterns

  • \S matches any non white space character

  • \W matches any non-word character

Regex search Regular Expressions

Search for patterns

import re
re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']

re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]

Search for ab in baba:

re.search("ab", "baba")

Search for the numeric after “value_”

re.findall("value_(\d+)", "value_2022")

Use groups to access the different parts of the match, here the text and the numeric in “value_2022”

re.search("(value)_(\d+)", "value_2022").group(0)
re.search("(value)_(\d+)", "value_2022").group(1)
re.search("(value)_(\d+)", "value_2022").group(2)

Keep only the list elements that do not contain “value”

l = ["value123", "a", "b"]
[x for x in l if not re.search("value", x)]

Regex substitution

Documentation of the re package.

Replace one or another character by a space

import re
re.sub("l|k", " ", "mlkj")

Replace one or more consecutive non alphanumeric characters by an underscore.

re.sub(r'\W+', '_', 'bla: bla**(bla)')

Insert a suffix in a file name before the extension SO answer

import re
re.sub(r'(?:_a)?(\.[^\.]*)$' , r'_suff\1',"long.file.name.jpg")

Join

Join strings from a list to print them nicely

l = ["cons", "imp", "exp", "prod"]
print(l)
print(", ".join(l))

Split

Split lines in a string

input = """bla
bla
bla"""
for line in input.splitlines():
    print(line, "\n")

Statistics

Linear programming solvers

Real Python What is linear programming

Several free Python libraries are specialized to interact with linear or mixed-integer linear programming solvers (a minimal SciPy example follows the list):

SciPy Optimization and Root Finding

PuLP

Pyomo

CVXOPT
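
As an illustration, a minimal sketch with SciPy's linprog; the objective and constraints are made up and the "highs" method assumes SciPy >= 1.6:

from scipy.optimize import linprog

# Maximize x + 2y subject to x + y <= 4, x <= 3, x >= 0, y >= 0.
# linprog minimizes, so the objective coefficients are negated.
result = linprog(
    c=[-1, -2],
    A_ub=[[1, 1], [1, 0]],
    b_ub=[4, 3],
    bounds=[(0, None), (0, None)],
    method="highs",
)
print(result.x, -result.fun)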

Scaling

Feature scaling with scikit learn (a short example follows the list):

  • StandardScaler
  • MinMaxScaler
  • RobustScaler
  • Normalizer
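
A minimal sketch with made-up data showing two of these scalers:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
# Centre each feature to mean 0 and scale it to unit variance
X_std = StandardScaler().fit_transform(X)
# Rescale each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)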

Style and linter

AA Code style

  • EAFP Easier to ask for forgiveness than permission

    “This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.

  • LBYL Look before you leap

    “This coding style explicitly tests for pre-conditions before making calls or lookups. This style contrasts with the EAFP approach and is characterized by the presence of many if statements. In a multi-threaded environment, the LBYL approach can risk introducing a race condition between “the looking” and “the leaping”. For example, the code, if key in mapping: return mapping[key] can fail if another thread removes key from mapping after the test, but before the lookup. This issue can be solved with locks or by using the EAFP approach.
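
A minimal sketch contrasting the two styles on a dictionary lookup (the dictionary and key are made up):

mapping = {"a": 1}

# LBYL: check first (risk of a race condition in multi-threaded code)
if "a" in mapping:
    value = mapping["a"]

# EAFP: just try, and handle the failure
try:
    value = mapping["a"]
except KeyError:
    value = None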

Black

Black is “the uncompromising Python code formatter”

See the pre commit section below to install and run black as a pre commit hook with pre-commit.

In vim, you can run black on the current file with:

:!black %
  • Ignore a revision in git blame after moving to black

    “A long-standing argument against moving to automated code formatters like Black is that the migration will clutter up the output of git blame. This was a valid argument, but since Git version 2.23, Git natively supports ignoring revisions in blame with the --ignore-rev option.”

    “You can even configure git to automatically ignore revisions listed in a file on every call to git blame.”

      git config blame.ignoreRevsFile .git-blame-ignore-revs

Flake 8

Flake 8 looks at more than just formatting.

List of Flake8 warnings and error codes

https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html

  • Ignore errors in a .flake8 file at the root of the git repository (see the example after this list)

  • Ignore errors for just one line with a comment # noqa: E731
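
For example, a possible .flake8 file; the error codes and values below are only an illustration and should be adapted to the project:

[flake8]
max-line-length = 88
extend-ignore = E203, E731
exclude = .git,__pycache__,build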

PEP Python Enhancement Proposals

PEP 8 Style Guide for Python Code

“A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is the most important.”

“However, know when to be inconsistent – sometimes style guide recommendations just aren’t applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don’t hesitate to ask!”

“In particular: do not break backwards compatibility just to comply with this PEP!”

Pre commit hooks

Blog:

Install pre commit hooks

Install pre-commit

pip install pre-commit

Set up pre-commit in a repository

cd path_to_repository
# Add the "pre-commit" python module to a requirements file
vim requirements.txt 
# Create a configuration file
vim .pre-commit-config.yaml 

Add configuration options to the .pre-commit-config.yaml file, for example:

repos:
-   repo: https://github.com/ambv/black
    rev: 21.6b0
    hooks:
    - id: black
      language_version: python3.7
-   repo: https://gitlab.com/pycqa/flake8
    rev: 3.7.9
    hooks:
    - id: flake8

Update hook repositories to the latest version

pre-commit autoupdate

Install git hooks in your .git/ directory.

pre-commit install

Temporarily deactivate a pre commit hook

To deactivate a pre commit hook temporarily https://stackoverflow.com/questions/7230820/skip-git-commit-hooks

git commit --no-verify -m "commit message"

Usage in Continuous integration

Usage in Continuous integration has a gitlab example:

    my_job:
      variables:
        PRE_COMMIT_HOME: ${CI_PROJECT_DIR}/.cache/pre-commit
      cache:
        paths:
          - ${PRE_COMMIT_HOME}

Uninstall pre-commit hooks

Uninstall

pre-commit uninstall

Pylint

Configure pylint

Edit a user’s configuration file

vim ~/.pylintrc

You can also make a project specific configuration file at the root of a git repository. The content of the configuration file is as follows:

[pylint]
# List of good names that shouldn't give a "short name" warning
good-names=df,ds,t
# Use Paul's default virtual environment
init-hook='import sys; sys.path.append("/home/paul/rp/penv/lib/python3.11/site-packages/")'

Adding the path to site packages in the virtual environment is necessary in order to avoid pylint using the base system version of python, which doesn’t have pandas installed. This avoids reporting package not installed errors.

Generate a configuration file

pylint --generate-rcfile

Blog

Class with dynamic attributes

E1101: ‘Instance of .. has no .. member’ for class with dynamic attributes

To ignore this error, I entered this in a .pylintrc file at the root of the project’s git repository

[TYPECHECK]
generated-members=other,indround,fuel,sawn,panel,pulp,paper

Ignore unused import warning

Use case:

  # Make agg_trade_eu_row available here for backward compatibility
  # so that the following import statement continues to work:
  # >>> from biotrade.faostat.aggregate import agg_trade_eu_row
  from biotrade.common.aggregate import agg_trade_eu_row # noqa # pylint: disable=unused-import

SO answer

import <module> # noqa # pylint: disable=unused-import

Dangerous default argument

I understand the danger of using a mutable default value and I suggest switching the warning message to something like “Dangerous mutable default value as argument”. However, is this dangerous in all sorts of scenarios? (I know that pylint isn’t supposed to check the functionality of my code, I am just trying to clarify this anti-pattern)

>>> def find(_filter={'_id': 0}):
...     print({**find.__defaults__[0], **_filter})
...
>>> find()
{'_id': 0}
>>> find({'a': 1})
{'_id': 0, 'a': 1}
>>> find()
{'_id': 0}
>>> find({'a': 1, 'b': 2})
{'_id': 0, 'a': 1, 'b': 2}

One might argue that the following should be used and I tend to agree:

>>> def find(_filter=None):
...     if _filter is None:
...             _filter = {'_id': 0}
...     else:
...             _filter['_id'] = 0
...     print(_filter)
...
>>> find()
{'_id': 0}
>>> find({'a': 1})
{'a': 1, '_id': 0}
>>> find()
{'_id': 0}
>>> find({'a': 1, 'b': 2})
{'a': 1, 'b': 2, '_id': 0}

Using with for resource allocation

Pylint message

Consider using ‘with’ for resource-allocating operations

Explained in a SO answer

suppose you are opening a file:

file_handle = open("some_file.txt", "r")
...
...
file_handle.close()

You need to close that file manually after the required task is done. If it’s not closed, then the resource (memory/buffer in this case) is wasted. If you use with in the above example:

with open("some_file.txt", "r") as file_handle:
    ...
    ...

there is no need to close that file. Resource de-allocation automatically happens when you use with.

System information

Platform type

import sys
sys.platform

or

import os
os.name

sys.platform and os.name return different results: ‘linux’ and ‘posix’ respectively.

More details are given by

os.uname()

Environment variables

Get or set

Get an environment variable

import os
os.environ["XYZ"]

Set an environment variable

os.environ["XYZ"]  = "/tmp"

Python path

For example in bash, the python path can be updated as follows:

export PYTHONPATH="$HOME/repos/biotrade/":$PYTHONPATH

This tells python where the biotrade package is located.

From python, use sys.path to prepend to the python path.

import sys
sys.path.insert(0, "/home/rougipa/eu_cbm/eu_cbm_hat")

See also the section on Path/python path to change the python path and import a script from Jupyter notebook.

GIL Global Interpreter Lock

  • https://docs.python.org/3/glossary.html#term-global-interpreter-lock

    “The mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines.

    However, some extension modules, either standard or third-party, are designed so as to release the GIL when doing computationally intensive tasks such as compression or hashing. Also, the GIL is always released when doing I/O.”

  • https://numba.pydata.org/numba-doc/latest/user/jit.html#nogil

    “Whenever Numba optimizes Python code to native code that only works on native types and variables (rather than Python objects), it is not necessary anymore to hold Python’s global interpreter lock (GIL). Numba will release the GIL when entering such a compiled function if you passed nogil=True.”

Memory

Memory usage of a python object

To display the memory usage of a python object

import sys
a = 1
print(sys.getsizeof(a))

See also the section on memory usage of pandas data frames under columns / memory usage.

Out of memory error

Sometimes when a python process runs out of memory, it can get killed by the Linux Kernel. In that case the error message is short “killed” and there is no python trace back printed. You can check that it is indeed a memory error by calling

sudo dmesg

Here is a typical message:

[85962.510533] Out of memory: Kill process 16035 (ipython3) score 320 or sacrifice child
[85962.510554] Killed process 16035 (ipython3) total-vm:7081812kB, anon-rss:4536336kB, file-rss:0kB, shmem-rss:8kB
[85962.687468] oom_reaper: reaped process 16035 (ipython3), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB

Various related Stack Overflow questions: what does “kill” mean, How can I find the reason that python script is killed?, Why does python script randomly gets killed?.

Python version

See also the status of python versions:

Where are python modules stored

Show the location of an imported module:

import module_name
print(module_name.__file__)

For example on a system you might have built-in modules stored in one directory, user installed modules in another place and development modules yet in another place:

import os
os.__file__
# '/usr/lib/python3.9/os.py'

import pandas
pandas.__file__
# '/home/paul/.local/lib/python3.9/site-packages/pandas/__init__.py'

import biotrade
biotrade.__file__
# '/home/paul/repos/forobs/biotrade/biotrade/__init__.py'

Show all modules installed in a system:

help("modules")

Test driven development

A good post about TDD: Unit testing, you’re doing it wrong

“TDD is actually about every form of tests. For example, I often write performance tests as part of my TDD routine; end-to-end tests as well. Furthermore, this is about behaviours, not implementation: you write a new test when you need to fulfil a requirement. You do not write a test when you need to code a new class or a new method. Subtle, but important nuance. For example, you should not write a new test just because you refactored the code. If you have to, it means you were not really doing TDD.” […] “Good tests must test a behavior in isolation to other tests. Calling them unit, system or integration has no relevance to this. Kent Beck says this much better than I would ever do. ’‘’From this perspective, the integration/unit test frontier is a frontier of design, not of tools or frameworks or how long tests run or how many lines of code we wrote get executed while running the test.’’’ Kent Beck”

Assertions

pytest

Numpy moved from nose to pytest as explained in their testing guidelines:

“Until the 1.15 release, NumPy used the nose testing framework, it now uses the pytest framework. The older framework is still maintained in order to support downstream projects that use the old numpy framework, but all tests for NumPy should use pytest.”

Save this function in a file named test_numpy.py

import numpy as np

def test_numpy_closeness():
    assert [1,2] == [1,2]
    assert (np.array([1,2]) == np.array([1,2])).all()
    # This assertion fails because the arrays differ (2 vs 3)
    np.testing.assert_allclose(np.array([1,2]), np.array([1,3]))

Save the following in a file named test_nn.py (it tests a user-defined neural_nets module):

import neural_nets as nn
import numpy as np

def test_rectified_linear_unit():
    x = np.array([[1,0],
                  [-1,-3]])
    expected = np.array([[1,0],
                         [0,0]])
    provided = nn.rectified_linear_unit(x)
    assert np.allclose(expected, provided), "test failed"

Execute the test suite from bash with py.test as follows:

cd ~/rp/course_machine_learning/projects/project_2_3_mnist/part2-nn
py.test

AA run

Doctest

  • https://docs.python.org/3/library/doctest.html

    “The doctest module searches for pieces of text that look like interactive Python sessions, and then executes those sessions to verify that they work exactly as shown.”

  • Run doctest in pytest https://docs.pytest.org/en/7.1.x/how-to/doctest.html

    • Test examples in the whole module

        pytest --doctest-modules
    • Test only one file

        pytest --doctest-modules post_processor/nai.py

    “To skip a single check inside a doctest you can use the standard doctest.SKIP directive:”

      def test_random(y):
          """
          >>> random.random()  # doctest: +SKIP
          0.156231223

          >>> 1 + 1
          2
          """

Doctest errors

  • “ValueError: line 32 of the docstring for has inconsistent leading whitespace”

  • https://pandas.pydata.org/docs/development/contributing_docstring.html#tips-for-getting-your-examples-pass-the-doctests

    “If you have a code snippet that wraps multiple lines, you need to use ‘…’ on the continued lines”.

  • See also:

    • doctestplus https://github.com/scientific-python/pytest-doctestplus provides additional functionality to skip tests in certain classes or for an entire module.

      • doctest plus provides additional flags to skip or include tests on remote data. This works in conjunction with
        https://github.com/astropy/pytest-remotedata

        “The pytest-remotedata plugin allows developers to indicate which unit tests require access to the internet, and to control when and whether such tests should execute as part of any given run of the test suite.”

    Paul Rougieux’s SO question on testing pandas data frame with doctest

    I have a package with many methods that output pandas data frame. I would like to test the examples with pytest and doctest as explained on the pytest doctest integration page.

    When run with pytest, the data frame output is truncated to a certain number of columns, which might be different from the number of columns provided in the example.

        >>> import pandas
        >>> df = pandas.DataFrame({"variable": range(3)})
        >>> for i in range(7): 
        ...     df["variable_"+str(i)] = range(3)
        >>> df
        variable  variable_0  variable_1  variable_2  variable_3  variable_4  variable_5  variable_6
        0         0           0           0           0           0           0           0           0
        1         1           1           1           1           1           1           1           1
        2         2           2           2           2           2           2           2           2

    pytest --doctest-modules returns the following error because it displays 6 columns instead of 7

    Differences (unified diff with -expected +actual):
        @@ -1,4 +1,6 @@
        -   variable_1  variable_2  variable_3  variable_4  variable_5  variable_6  variable_7
        -0           0           0           0           0           0           0           0
        -1           1           1           1           1           1           1           1
        -2           2           2           2           2           2           2           2
        +   variable_1  variable_2  variable_3  ...  variable_5  variable_6  variable_7
        +0           0           0           0  ...           0           0           0
        +1           1           1           1  ...           1           1           1
        +2           2           2           2  ...           2           2           2
        +<BLANKLINE>
        +[3 rows x 7 columns]

    Is there a way to fix the number of columns? Does doctest always have a fixed terminal output?

    Number of columns issues

        >>> import pandas
        >>> df = pandas.DataFrame({"variable_1": range(3)})
        >>> for i in range(2, 8): df["variable_"+str(i)] = range(3)
        >>> df
           variable_1  variable_2  variable_3  variable_4  variable_5  variable_6  variable_7
        0           0           0           0           0           0           0           0
        1           1           1           1           1           1           1           1
        2           2           2           2           2           2           2           2

    Differences (unified diff with -expected +actual):
        @@ -1,4 +1,6 @@
        -   variable_1  variable_2  variable_3  variable_4  variable_5  variable_6  variable_7
        -0           0           0           0           0           0           0           0
        -1           1           1           1           1           1           1           1
        -2           2           2           2           2           2           2           2
        +   variable_1  variable_2  variable_3  ...  variable_5  variable_6  variable_7
        +0           0           0           0  ...           0           0           0
        +1           1           1           1  ...           1           1           1
        +2           2           2           2  ...           2           2           2
        +<BLANKLINE>
        +[3 rows x 7 columns]

    Test pandas data frame

    Methods to test data frame and series equality

    from pandas.testing import assert_frame_equal
    from pandas.testing import assert_series_equal
    import seaborn 
    iris = seaborn.load_dataset("iris")
    assert_frame_equal(iris, iris)
    iris["species2"] = iris["species"]
    assert_series_equal(iris["species"], iris["species2"])
    # Ignore names
    assert_series_equal(iris["species"], iris["species2"], check_names=False)

    Sometimes you want to allow a tolerance

    df = pandas.DataFrame({"a":[1.0,2,3],
                           "b":[1.0001,2,3]})
    # Fails: the default relative tolerance is stricter than the 1e-4 difference
    assert_series_equal(df["a"], df["b"], check_names=False)
    # Passes with a larger relative tolerance
    assert_series_equal(df["a"], df["b"], rtol=1e-2, check_names=False)

    Expected exceptions

    pytest assert

    “In order to write assertions about raised exceptions, you can use pytest.raises() as a context manager like this:”

    import pytest
    def test_zero_division():
        with pytest.raises(ZeroDivisionError):
            1 / 0

    “and if you need to have access to the actual exception info you may use:”

    def test_recursion_depth():
        with pytest.raises(RuntimeError) as excinfo:
    
            def f():
                f()
    
            f()
        assert "maximum recursion" in str(excinfo.value)

    “excinfo is an ExceptionInfo instance, which is a wrapper around the actual exception raised. The main attributes of interest are .type, .value and .traceback.”

    Fixtures

    import pytest
    import xarray
    from cobwood.gfpmx_equations import (
        consumption,
        consumption_pulp,
        consumption_indround,
    )
    
    @pytest.fixture
    def secondary_product_dataset():
        """Create a sample dataset for testing"""
        ds = xarray.Dataset({
            "cons_constant": xarray.DataArray([2, 3, 4], dims=["c"]),
            "price": xarray.DataArray([[1, 2], [3, 4], [5, 6]], dims=["c", "t"]),
            "gdp": xarray.DataArray([[100, 200], [300, 400], [500, 600]], dims=["c", "t"]),
            "prod": xarray.DataArray([[100, 200], [300, 400], [500, 600]], dims=["c", "t"]),
            "cons_price_elasticity": xarray.DataArray([0.5, 0.6, 0.7], dims=["c"]),
            "cons_gdp_elasticity": xarray.DataArray([0.8, 0.9, 1.0], dims=["c"]),
        })
        return ds
    
    def test_consumption(secondary_product_dataset):
        """Test the consumption function"""
        ds = secondary_product_dataset
        t = 1
        expected_result = xarray.DataArray([138.62896863, 1274.23051055, 7404.40635264], dims=["c"])
        result = consumption(ds, t)
        xarray.testing.assert_allclose(result, expected_result)

    Parametrize

    https://docs.pytest.org/en/6.2.x/parametrize.html

    “The builtin pytest.mark.parametrize decorator enables parametrization of arguments for a test function. Here is a typical example of a test function that implements checking that a certain input leads to an expected output:

    # content of test_expectation.py
    import pytest
    @pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
    def test_eval(test_input, expected):
        assert eval(test_input) == expected

    Pylint and pytest

    Add this at the beginning of pytest files

    # pylint: disable=redefined-outer-name

    Test use in projects

    • tabulate

      “uses pytest testing framework and tox to automate testing in different environments.”

    Web

    Back-end API

    Frameworks

    Flask vs. Django

    Note: Flask Evolution into Quart to support asyncio This last link contains a nice, simple example of how asyncio works with a simulated delay to fetch a web page.

    Workflows and pipelines

    Apache Airflow

    “DAGs: In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG’s structure (tasks and their dependencies) as code.

    For example, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow will run every night at 10pm, but shouldn’t start until a certain date.

    In this way, a DAG describes how you want to carry out your workflow; but notice that we haven’t said anything about what we actually want to do! A, B, and C could be anything. Maybe A prepares data for B to analyze while C sends an email. Or perhaps A monitors your location so B can open your garage door while C turns on your house lights. The important thing is that the DAG isn’t concerned with what its constituent tasks do; its job is to make sure that whatever they do happens at the right time, or in the right order, or with the right handling of any unexpected issues.

    DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.”

    “Workflows: You’re now familiar with the core building blocks of Airflow. Some of the concepts may sound very similar, but the vocabulary can be conceptualized like this:

    • DAG: The work (tasks), and the order in which work should take place (dependencies), written in Python.

    • DAG Run: An instance of a DAG for a particular logical date and time.

    • Operator: A class that acts as a template for carrying out some work.

    • Task: Defines work by implementing an operator, written in Python.

    • Task Instance: An instance of a task - that has been assigned to a DAG and has a state associated with a specific DAG run (i.e for a specific execution_date).

    • execution_date: The logical date and time for a DAG Run and its Task Instances.

    By combining DAGs and Operators to create TaskInstances, you can build complex workflows.”
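
    For illustration only, a minimal sketch of a DAG definition, assuming Airflow 2.x (the dag_id, schedule and bash commands are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_dag",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        a = BashOperator(task_id="a", bash_command="echo 'prepare data'")
        b = BashOperator(task_id="b", bash_command="echo 'analyse data'")
        c = BashOperator(task_id="c", bash_command="echo 'send email'")
        # A must run successfully before B; C is independent
        a >> b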

    Xarray

    Create a data array and plot it, example from the xarray quick overview:

    import numpy as np
    import xarray as xr
    import matplotlib.pyplot as plt
    da2 = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
    da2.attrs["long_name"] = "random velocity"
    da2.attrs["units"] = "metres/sec"
    da2.attrs["description"] = "A random variable created as an example."
    da2.attrs["random_attribute"] = 123
    da2.attrs
    da2.plot()
    plt.show()

    Create another data array with one dimension only and multiply it with the two dimensional array

    da1 = xr.DataArray([1,2], coords={"x":[10,20]})
    da2 * da1

    Textual definition of data array and dataset

    DataArray

    > "xarray.DataArray is xarray’s implementation of a labeled, multi-dimensional
    > array. It has several key properties:
    > 
    > - values: a numpy.ndarray holding the array’s values
    > 
    > - dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
    > 
    > - coords: a dict-like container of arrays (coordinates) that label each point
    >   (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
    > 
    > - attrs: dict to hold arbitrary metadata (attributes)
    > 
    > Xarray uses dims and coords to enable its core metadata aware operations.
    > Dimensions provide names that xarray uses instead of the axis argument found in
    > many numpy functions. Coordinates enable fast label based indexing and
    > alignment, building on the functionality of the index found on a pandas
    > DataFrame or Series."

    Dataset

    > "xarray.Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a
    > dict-like container of labeled arrays (DataArray objects) with aligned
    > dimensions. It is designed as an in-memory representation of the data model
    > from the netCDF file format.
    > 
    > In addition to the dict-like interface of the dataset itself, which can be used
    > to access any variable in a dataset, datasets have four key properties:
    > 
    > dims: a dictionary mapping from dimension names to the fixed length of each
    > dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
    > 
    > data_vars: a dict-like container of DataArrays corresponding to variables
    > 
    > coords: another dict-like container of DataArrays intended to label points
    > used in data_vars (e.g., arrays of numbers, datetime objects or strings)
    > attrs: dict to hold arbitrary metadata
    > 
    > The distinction between whether a variable falls in data or coordinates
    > (borrowed from CF conventions) is mostly semantic, and you can probably get
    > away with ignoring it if you like: dictionary like access on a dataset will
    > supply variables found in either category. However, xarray does make use of the
    > distinction for indexing and computations. Coordinates indicate
    > constant/fixed/independent quantities, unlike the varying/measured/dependent
    > quantities that belong in data."

    Converting between datasets and arrays

    > "This method broadcasts all data variables in the dataset against each other,
    > then concatenates them along a new dimension into a new array while
    > preserving coordinates."

    Convert to-from other formats

    To a list

    Convert a 1 dimensional data array to a list

    ds.country.values.tolist()

    Copy shallow and deep

    The documentation of the DataArray.copy and Dataset.copy methods shows that they both have a deep argument. If this argument is set to False (the default), the copy only returns a new view on the dataset. Illustration below: a dataset is passed to a function that removes values above a threshold. When deep=False the input data is changed as well, even though we used the copy() method. We really have to use copy(deep=True) to make sure that the input data remains unmodified.

    import xarray
    import numpy as np
    ds = xarray.Dataset(
        {"a": (("x", "y"), np.random.randn(2, 3))},
        coords={"x": [10, 20], "y": ["a", "b", "c"]},
    )
    ds
    def remove_x_larger_than(ds_in, threshold, deep):
        """Remove values of x larger than the threshold"""
        ds_out = ds_in.copy(deep=deep)
        ds_out.loc[dict(x=ds_out.coords["x"]>threshold)] = np.nan
        return ds_out
    remove_x_larger_than(ds, threshold=10, deep=True)
    print(ds)
    remove_x_larger_than(ds, threshold=10, deep=False)
    print(ds)

    Create a dataset

    Round trip from pandas to xarray and back from the xarray user guide page on pandas.

    import xarray
    import numpy as np
    ds = xarray.Dataset(
        {"foo": (("x", "y"), np.random.randn(2, 3))},
        coords={
            "x": [10, 20],
            "y": ["a", "b", "c"],
            "along_x": ("x", np.random.randn(2)),
            "scalar": 123,
        },
    )
    ds

    Dimensions

    x and y are dimensions.

    Attributes

    We can add attributes to qualify metadata.

    ds.attrs["product"] = "sponge"

    Convert xarray to pandas and back

    Convert the xarray dataset to a pandas data frame

    df = ds.to_dataframe()
    df

    Convert the data frame back to a dataset

    xarray.Dataset.from_dataframe(df)

    “Notice that that dimensions of variables in the Dataset have now expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex. Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.”

    You can also use

    xarray.DataArray(df)

    Create an empty dataset similar to existing one

    Fill an array with zero values, similar to an existing data array. Or fill it with NA values.

    import numpy as np
    xarray.zeros_like(da)
    # There is no xarray.nan; use numpy's nan as the fill value
    xarray.full_like(da, fill_value=np.nan)

    Data variables

    The equivalent of df.columns in pandas would be list(sawn.data_vars) for an xarray dataset. ds.data_vars displays the data variables with the beginning of their content. If you loop over it, each element is just the variable name as a string:

    for x in ds.data_vars:
        print(x, type(x))

    A list of variables

    list(sawn.data_vars)

    Groupby operations

    Group the given variable by region, using a dataArray called “region” which is stored inside the dataset

    region_data = gfpmx_data.country_groups.set_index('country')['region']
    region_dataarray = xarray.DataArray.from_series(region_data)
    aggregated_data = ds[var].loc[COUNTRIES, t].groupby(ds["region"]).sum()
    ds[var].loc["WORLD", t] = ds[var].loc[COUNTRIES, t].sum()
    ds[var].loc[regions,t] = ds[var].loc[COUNTRIES,t].groupby(ds["region"].loc[COUNTRIES]).sum()
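
    A generic, minimal sketch of the same pattern with made-up data: group a DataArray by a label array and sum within each group.

    import xarray
    da = xarray.DataArray(
        [1.0, 2.0, 3.0, 4.0],
        dims="country",
        coords={"country": ["a", "b", "c", "d"]},
    )
    region = xarray.DataArray(
        ["north", "north", "south", "south"],
        dims="country",
        coords={"country": ["a", "b", "c", "d"]},
        name="region",
    )
    da.groupby(region).sum()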

    Indexing

    .loc

    Example use with the GFPMx dataset

    ds["exp"].loc["Czechia", ds.coords["year"]>2015]

    Assigning values with indexing

    https://docs.xarray.dev/en/stable/user-guide/indexing.html#assigning-values-with-indexing

    To select and assign values to a portion of a DataArray() you can use indexing with .loc or .where.

    import xarray
    import matplotlib.pyplot as plt
    ds = xarray.tutorial.open_dataset("air_temperature")
    ds["empty"] = xarray.full_like(ds.air.mean("time"), fill_value=0)
    ds["empty"].loc[dict(lon=260, lat=30)] = 100
    lc = ds.coords["lon"]
    la = ds.coords["lat"]
    ds["empty"].loc[ 
    dict(lon=lc[(lc > 220) & (lc < 260)], lat=la[(la > 20) & (la < 60)]) 
    ] = 100
    # Plot
    ds.empty.plot()
    plt.show()
    # Write to a csv file
    ds.empty.to_dataframe().to_csv("/tmp/empty.csv")

    “Warning Do not try to assign values when using any of the indexing methods .isel or .sel:”

    da = xarray.DataArray([0, 1, 2, 3], dims=["x"])
    # This raises "SyntaxError: cannot assign to function call"
    # da.isel(x=[0, 1, 2]) = -1
    # Do not do this either: it assigns to a copy and leaves da unchanged
    da.isel(x=[0, 1, 2])[1] = -1
    # Use a dictionary instead
    da[dict(x=[1])] = -1
    # Also works with broadcasting
    da[dict(x=[0, 1, 2])] = -1

    Querying index variables

    Keep only data that is below the base year into the dataset.

    base_year = 2018
    ds.sel(year = ds.year <= base_year)
    ds.query(year = "year <= 2018")
    ds.query(year = "year <= @base_year")
    # Returns an error
    # SyntaxError: The '@' prefix is not allowed in top-level eval calls.
    # please refer to your variables by name without the '@' prefix.

    Reindex

    Reindex an array to get the same coordinates as another one, with empty values where values are missing.
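
    A minimal sketch with made-up arrays, using reindex_like or reindex with an explicit coordinate:

    import numpy as np
    import xarray
    da1 = xarray.DataArray([1.0, 2.0], dims="x", coords={"x": [10, 20]})
    da2 = xarray.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [10, 20, 30]})
    # Align da1 on the coordinates of da2; missing positions become NaN
    da1.reindex_like(da2)
    # The same with an explicit coordinate and fill value
    da1.reindex(x=[10, 20, 30], fill_value=np.nan)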

    Missing data

    There is no isna() method in xarray. Check for missing data with the isnull() method
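
    For example, with a made-up array:

    import numpy as np
    import xarray
    da = xarray.DataArray([1.0, np.nan, 3.0], dims="x")
    da.isnull()
    da.isnull().sum()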

    Panel data

    transitioning from pandas panel to xarray

    > "As discussed elsewhere in the docs, there are two primary data structures
    > in xarray: DataArray and Dataset. You can imagine a DataArray as a
    > n-dimensional pandas Series (i.e. a single typed array), and a Dataset as the
    > DataFrame equivalent (i.e. a dict of aligned DataArray objects).
    > So you can represent a Panel, in two ways:
    > 
    >    As a 3-dimensional DataArray,
    > 
    >    Or as a Dataset containing a number of 2-dimensional DataArray objects.
    
    
    > "Variables in Dataset objects can use a subset of its dimensions. For
    > example, you can have one dataset with Person x Score x Time, and another
    > with Person x Score."

    Plots

    • Xarray plotting introduction

    • Scatter plots

      “For more advanced scatter plots, we recommend converting the relevant data variables to a pandas DataFrame and using the extensive plotting capabilities of seaborn.”

    • Facetting

      “The easiest way to create faceted plots is to pass in row or col arguments to the xarray plotting methods/functions. This returns a xarray.plot.FacetGrid object.”

      spda.loc[dict(country=['Ukraine', 'Uzbekistan'])].plot(col="country")
      # Plot by continents
      gfpmxb2020.indround["prod"].loc[~gfpmxb2020.indround.c].plot(col="country")

    Pandas plots from xarray datasets

    Example using the GFPMx data structure to plot industrial roundwood consumption, production and trade in Czechia:

    variables = ["imp", "cons", "exp", "prod"]
    # Select inside the dataset
    gfpmx["indround"].loc[{"country":"Czechia"}][variables].to_dataframe()[variables].plot()
    # Convert to data frame first then plot
    gfpmx["indround"].to_dataframe().loc["Czechia"][variables].plot()

    Xarray IO

    • See the general section on IO and file formats. The subsection on netcdf files refers to xarray.

    • See also the conversion section for conversion to other in memory formats such as lists or pandas data frames.

    ZZ Media and organizations

    Blogs

    • Julio Biason Things I Learnt The Hard Way (in 30 Years of Software Development)

    • Daniel Lemire I do not use a debugger

      “Debuggers don’t remove bugs. They only show them in slow motion.”

      Linus Torvalds doesn’t use a debugger

    • Wes McKinney

      • 2017 Apache Arrow and the 10 things I hate about pandas

        “pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset”

      • 2018 Announcing Ursalabs

        “It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. This is part of why working on Arrow has been so important for me; it provides a path to sharing of systems code outside of Python by enabling free interoperability at the data level.”

        “Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development. Nearly 9 years have passed since J.J. started building the RStudio IDE, but in many ways he and Hadley and others feel like they are just getting started.”

    • Dotan Nahum Functional Programming with Python for People Without Time

      “Cracks in the Ice - We ended the previous part with stating that with a good measure of abstraction, functional programming doesn’t offer a considerable advantage over the “traditional” way of design, object oriented. It’s a lie. […] In our pipeline example above with our Executors — how do you feed in the output of one executor as the input for the next one? well, you have to build that infrastructure. With functional programming, those abstractions are not abstractions that you have to custom build. They’re part of the language, mindset, and ecosystem. Generically speaking — it’s all about impedence mismatch and leaky abstractions and when it comes to data and functions; there’s no mismatch because it’s built up from the core. The thesis is — that to build a functional programming approach over an object-oriented playground — is going to crash and burn at one point or another: be it bad modeling of abstractions, performance problems, bad developer ergonomics, and the worst — wrong mindset. Being able to model problems and solutions in a functional way, transcends above traditional abstraction; the object-oriented approach, in comparison, is crude, inefficient and prone to maintenance problems.”

    • Christopher Rackauckas Why numba and cython are no substitute for Julia discusses the advantages of the Julia language over Python for large code bases.

    • Ethan Rosenthal Everything Gets a Package: My Python Data Science Setup

    • Data Formats for Panel Data Analysis

    There are two primary methods to express data:

    • MultiIndex DataFrames where the outer index is the entity and the inner is the time index. This requires using pandas.

    • 3D structures where dimension 0 (outer) is variable, dimension 1 is time index and dimension 2 is the entity index. It is also possible to use a 2D data structure with dimensions (t, n) which is treated as a 3D data structure having dimensions (1, t, n). These 3D data structures can be pandas, NumPy or xarray.

    Explains multi index with stacking and unstacking.

    Foundations

    • COIN-OR project “open source for the operations research community”

      “Without open source implementations of existing algorithms, testing new ideas built on existing ones typically requires the time-consuming and error-prone process of re-implementing (and re-debugging and re-testing) the original algorithm. If the original algorithm were publicly available in a community repository, imagine the productivity gains from software reuse! Science evolves when previous results can be easily replicated”

    • Python software foundation

      “We support and maintain python.org, The Python Package Index, Python Documentation, and many other services the Python Community relies on.”

    Interviews

    Interview with Alex Martelli

    “Larry Page in his dormitory at Stanford had written or tried to write a web spider to get a copy of some subset of the web on his computers so he could try his famous Page algorithm. He was trying to use the brand-new language Java in 1.0 beta version, and it kept crashing. So he asked for help from his roommate and his roommate took a look at said ‘oh you’re using that Java disaster’. Of course, it crashed and did it in 100 lines of Python. It runs perfectly, and that’s how Google became possible through 100 lines of Python. But I had no idea until about five years ago that it had played so crucial role so early on.”

    ” Similarly, if I hadn’t heard it from the mouth of Guido himself, I would never have known that Python was at the heart of the web. The very first Web server and web browser were written by the inventor of the World Wide Web, HTTP, and HTML in Python. He wasn’t really a programmer; he was a physicist and Python was far easier to use than anything else.”

    Podcasts

    talk python to me