pypi.org pip:
“pip is the package installer for Python. You can use pip to install packages from the Python Package Index and other indexes.”
Run at the command line or from an ipython prompt:
pip install packagename
What is the difference between a Python module and a Python package?
“A package is a collection of modules in directories that give a package hierarchy.”
In Python, check the version of a package with the __version__ attribute; however, it is not always available.
>>> import pandas
>>> print(pandas.__version__)
You can also use importlib
>>> from importlib.metadata import version
>>> version('pandas')
An SO answer suggests a command line way to display the same information:
pip freeze | grep pandas
Check the location of a package
import eu_cbm_hat
eu_cbm_hat.__file__
For example to install pandas 0.24.2
python3 -m pip install --user pandas==0.24.2
or
pip3 install --user pandas==0.24.2
Sometimes you need to overwrite the existing version with the -I flag:
pip install -I package==version
To install the local version of a package with pip
pip install -e /develop/MyPackage
According to man pip, the -e option “installs a project in editable mode (i.e. setuptools ‘develop mode’) from a local project path or a VCS url”.
When uninstalling a package installed locally, you might get this error message:
pip uninstall localpackage
# Found existing installation: localpackage 0.0.1
# Can't uninstall 'localpackage'. No files were found to uninstall.
You can show the location of the package with
pip show localpackage
Then remove it manually with
rm -rf ~/.local/lib/python3.9/site-packages/localpackage*
And maybe this as well
rm -rf ~/.local/lib/python3.9/site-packages/build/lib/localpackage*
Install from the dev branch of a private repo on gitlab using ssh
pip install git+ssh://git@gitlab.com/bioeconomy/forobs/biotrade.git@dev
Install from the dev branch of a private repo on gitlab using an authentication token
pip install git+https://gitlab+deploy-token-833444:ByW1T2bJZRtYhWuGrauY@gitlab.com/bioeconomy/forobs/biotrade.git@dev
Install from the compressed tar.gz version of a repository; this doesn’t require git to be installed on your laptop:
pip install --force-reinstall https://github.com/ytdl-org/youtube-dl/archive/refs/heads/master.tar.gz
pip install --force-reinstall https://github.com/mwaskom/seaborn/archive/refs/heads/master.tar.gz
Use the conda update command to check whether a new update is available. If conda tells you an update is available, you can then choose whether or not to install it.
conda vs pip vs virtualenv commands
“If you have used pip and virtualenv in the past, you can use conda to perform all of the same operations. Pip is a package manager and virtualenv is an environment manager. conda is both.”
If a Python package is not available in the default conda channel, you can switch to the conda-forge channel as follows:
conda install -c conda-forge <package_name>
> - "The conda team, from Anaconda, Inc., packages a multitude of packages and
> provides them to all users free of charge in their default channel."
> "conda-forge is a community effort that tackles these issues:
> - All packages are shared in a single channel named conda-forge.
> - Care is taken that all packages are up-to-date.
> - Common standards ensure that all packages have compatible versions.
> - By default, we build packages for macOS, Linux AMD64 and Windows
> AMD64."
Documentation conda.io installing packages
To install a specific package such as SciPy into an existing environment “myenv”:
conda install --name myenv scipy
If you do not specify the environment name, which in this example is done with --name myenv, the package installs into the current environment:
conda install scipy
To install a specific version of a package such as SciPy:
conda install scipy=0.15.0
To install multiple packages at once, such as SciPy and cURL:
conda install scipy curl
Note: It is best to install all packages at once, so that all of the dependencies are installed at the same time.
Using pip in a conda environment
“Use pip only after conda. Install as many requirements as possible with conda, then use pip”
Documentation conda.io updating packages
Use the terminal or an Anaconda Prompt for the following steps.
To update a specific package:
conda update biopython
To update Python:
conda update python
To update conda itself:
conda update conda
Remove the package ‘scipy’ from the currently-active environment:
conda remove scipy
Remove a list of packages from an environment ‘myenv’:
conda remove -n myenv scipy curl wheel
Creating an environment file manually
You can create an environment file (environment.yml) manually to share with others.
EXAMPLE: A simple environment file:
name: stats
dependencies:
  - numpy
  - pandas
EXAMPLE: A more complex environment file:
name: stats2
channels:
  - javascript
dependencies:
  - python=3.6   # or 2.7
  - bokeh=0.9.2
  - numpy=1.9.*
  - nodejs=0.10.*
  - flask
  - pip:
    - Flask-Testing
Note
Note the use of the wildcard * when defining the patch version number. Defining the version number by fixing the major and minor version numbers while allowing the patch version number to vary allows us to use our environment file to update our environment to get any bug fixes whilst still maintaining consistency of software environment.
Documentation using an environment
https://docs.anaconda.com/free/navigator/tutorials/manage-environments/#using-an-environment
The mamba solver can speed up the dependency resolution process. It doesn’t require a special mamba installation; you can switch the default solver in a normal conda installation:
conda install -n base -c defaults conda-libmamba-solver
conda config --set solver libmamba
Some packages can be installed with the OS’s package manager, for example on Debian:
sudo apt install python3-pip
Venv or Anaconda? https://www.reddit.com/r/Python/comments/xhbhbh/venv_or_anaconda/
What are the downsides of using Anaconda versus https://www.reddit.com/r/Python/comments/6vq2m4/what_are_the_downsides_of_using_anaconda_vs/?rdt=43570&onetap_auto=true
Conda in production? https://www.reddit.com/r/Python/comments/58n9ox/conda_in_production/
User 1
“At my company we use conda to manage our entire python stack across multiple platforms (OSX, Windows and Linux) and we haven’t had any issues. Typically we have a metapackage that defines the specific requirements of a project.”
Another user
“We distribute a Python 3 GUI application on Mac, Windows, and Linux that uses PyQt4, gdal, matplotlib, pyopengl, etc. We use conda for all of our developers and beta testers and use pyinstaller to create an installer for each OS that we support.”
To upload a package to pypi, you need a pypi account. The instructions on uploading distribution archives explain how to upload the package to test.pypi:
python3 -m twine upload --repository testpypi dist/*
I updated the following packages before running this
pip install --upgrade build
pip install --upgrade twine
I built the package with
cd forobs/biotrade
python3 -m build
twine uses KDE Wallet to store the password; press cancel if you can’t use KDE Wallet and it will then ask for the password at the command line. There is a twine issue related to the use of keyring.
Register an account on pypi (it’s a different server than test.pypi). Create a token under account settings. Then upload to pypi itself
cd repository
python3 -m build
twine upload dist/*
To use the API token:
Set your username to __token__
Set your password to the token value, including the pypi- prefix
In bash, create a virtual environment to test the installation, and clear PYTHONPATH, otherwise the local development version is picked up:
mkdir /tmp/biotrade_env/
cd /tmp/biotrade_env/
python3 -m venv /tmp/biotrade_env/
source /tmp/biotrade_env/bin/activate
PYTHONPATH=""
python3
In Python, check that the package is not already installed (these imports should fail)
>>> import biotrade
>>> import pandas
Back in the shell, test the installation from test.pypi
pip install -i https://test.pypi.org/simple/ biotrade
# ERROR: Could not find a version that satisfies the requirement pandas (from biotrade)
# ERROR: No matching distribution found for pandas
Installing biotrade’s dependencies directly generates an error because pandas is not available in the test repository. You can install them from the main PyPI directly with pip install pandas.
Install from a wheel
cd ~/repos/forobs/biotrade/dist
pip install biotrade-0.2.2-py3-none-any.whl
# Or
pip install biotrade-0.2.2.tar.gz
On https://conda-forge.org/docs/maintainer/adding_pkgs.html conda recommends https://github.com/conda-incubator/grayskull to create the recipe
“Presently Grayskull can generate recipes for Python packages available on PyPI and also those not published on PyPI but available as GitHub repositories.”
The Python packaging documentation on adding non code files
“The mechanism that provides this is the MANIFEST.in file. This is relatively quite simple: MANIFEST.in is really just a list of relative file paths specifying files or globs to include:
include README.rst
include docs/*.txt
include funniest/data.json
“In order for these files to be copied at install time to the package’s folder inside site-packages, you’ll need to supply include_package_data=True to the setup() function.”
“Files which are to be used by your installed library (e.g. data files to support a particular computation method) should usually be placed inside of the Python module directory itself. E.g. in our case, a data file might be at funniest/funniest/data.json. That way, code which loads those files can easily specify a relative path from the consuming module’s __file__ variable.”
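For example, a minimal sketch of loading such a data file relative to the module’s __file__ (the funniest package and data.json come from the quote above):
import json
from pathlib import Path

# Inside a module of the funniest package, __file__ points to the module file,
# so data.json can be located relative to it
data_path = Path(__file__).parent / "data.json"
with open(data_path) as f:
    data = json.load(f)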
The Python packaging documentation on the MANIFEST.in commands gives the syntax of the recursive-include and graft commands.
Add all files under directories matching dir-pattern that match any of the listed patterns
recursive-include dir-pattern pat1 pat2
Add all files under directories matching dir-pattern
graft dir-pattern
The Python packaging documentation on source dist gives an example of the patterns
include *.txt
recursive-include examples *.txt *.py
prune examples/sample?/build
“The meanings should be fairly clear: include all files in the distribution root matching *.txt, all files anywhere under the examples directory matching *.txt or *.py, and exclude all directories matching examples/sample?/build.”
SO: What is the correct way to share package version with setup.py and the package?
The version of a package has to be set both in setup.py and __init__.py; it’s surprising how many options people have thought about. This answer summarizes the state of the art in 7 options, including a link to the Python packaging user guide.
Install bumpversion
pip install bumpversion
Increment the version number both in setup.py and
__init__.py with the command line tool bumpversion. First
create a configuration file .bumpversion.cfg
where the
current_version
matches the versions in
setup.py
and packagename/__init__.py
[bumpversion]
current_version = 0.0.5
commit = True
tag = True
[bumpversion:file:setup.py]
[bumpversion:file:biotrade/__init__.py]
Increment the version number in all files and the git tag with:
bumpversion patch
# Or to increment minor or major versions
bumpversion minor
bumpversion major
Push the corresponding tags to the remote repository
git push origin --tags
Check the updated version in setup.py
python setup.py --version
Start an ipython prompt to test the package version
ipython
import packagename
packagename.__version__
Generate the documentation of a package with pdoc:
pdoc -o public ./biotrade
This can be added to a .gitlab-ci.yml
file in order to
generate the documentation on a Continuous Integration system:
pages:
stage: document
script:
# GitLab Pages will only publish files in the public directory
- pdoc -o public ./biotrade
artifacts:
paths:
- public
only:
- main
interruptible: true
https://flit.pypa.io/en/latest/rationale.html
” The existence of Flit spurred the development of new standards, like PEP 518 and PEP 517, which are now used by other packaging tools such as Poetry and Enscons.”
https://pip.pypa.io/en/stable/reference/build-system/setup-py/
“Prior to the introduction of pyproject.toml-based builds (in PEP 517 and PEP 518), pip had only supported installing packages using setup.py files that were built using setuptools.”
“The interface documented here is retained currently solely for legacy purposes, until the migration to pyproject.toml-based builds can be completed.”
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
https://peps.python.org/pep-0518/ (from 2016 already) is worth looking at. It discusses why they didn’t choose other configuration file formats such as JSON, YAML or Python literals such as dict. They chose TOML in the end (it’s also used for Rust package metadata). Details of each format in https://gist.github.com/njsmith/78f68204c5d969f8c8bc645ef77d4a8f
The location of a package can be obtained from
package_name.__file__
.
Get the location of the python executable with
>>> import sys
>>> print(sys.executable)
In a virtual environment, it can return a symlink to another folder. In that case, the path can be deduced from
>>> import os
>>> os.__file__
venv is available by default in Python 3.3+
Installation
sudo apt install python3-venv
Usage
mkdir /tmp/testenv
python3 -m venv /tmp/testenv
source /tmp/testenv/bin/activate
Pipenv makes pip and virtual environments work together.
“There is a subtle but very important distinction to be made between applications and libraries. This is a very common source of confusion in the Python community.”
“Libraries provide reusable functionality to other libraries and applications (let’s use the umbrella term projects here). They are required to work alongside other libraries, all with their own set of sub-dependencies. They define abstract dependencies. To avoid version conflicts in sub-dependencies of different libraries within a project, libraries should never ever pin dependency versions. Although they may specify lower or (less frequently) upper bounds, if they rely on some specific feature/fix/bug. Library dependencies are specified via install_requires in setup.py.”
“Libraries are ultimately meant to be used in some application. Applications are different in that they usually are not depended on by other projects. They are meant to be deployed into some specific environment and only then should the exact versions of all their dependencies and sub-dependencies be made concrete. To make this process easier is currently the main goal of Pipenv.”
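As a sketch of that distinction, a library declares abstract, un-pinned dependencies in setup.py while an application pins exact versions elsewhere (requirements.txt, Pipfile.lock). The package name and bounds below are made up:
# setup.py of a hypothetical library: abstract dependencies, no exact pins
from setuptools import setup

setup(
    name="mylib",
    version="0.1.0",
    install_requires=[
        "pandas>=1.0",  # lower bound only, because a specific feature is needed
        "requests",     # no version constraint at all
    ],
)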
Install on Debian
sudo apt install pipenv
pyenv makes it possible to manage different Python versions. Within the resulting environment you can also install different packages with pip.
Illustration of the complementarity between pyenv and pipenv.
To test a fresh install of a package, or to test it in conditions where some environment variables are not defined, you can remove environment variables with unset:
unset BIOTRADE_DATABASE_URL
Example application with select boxes and a slider. Use the
index
argument to select a default value.
import streamlit
reporter = streamlit.sidebar.selectbox(
"Select a reporter Country", options=df["reporter"].unique()
)
products = streamlit.sidebar.multiselect(
"Select some products", options=df["product_name"].unique()
)
element = streamlit.sidebar.selectbox(
"Select a variable for the Y Axis", options=["net_weight", "price", "trade_value"]
)
flow = streamlit.sidebar.selectbox(
"Select a flow direction", options=["import", "export"]
)
n_partners = streamlit.sidebar.slider(
"Select N First Partners", min_value=1, max_value=10, value=5
)
“When a call is made to a Numba-decorated function it is compiled to machine code “just-in-time” for execution and all or part of your code can subsequently run at native machine code speed!”
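A minimal sketch of the decorator-based usage (the function below is made up for illustration):
from numba import njit

@njit
def sum_of_squares(n):
    # Compiled to machine code on the first call, then reused
    total = 0
    for i in range(n):
        total += i * i
    return total

sum_of_squares(10)   # first call triggers compilation
sum_of_squares(10)   # subsequent calls run the compiled version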
https://github.com/serge-sans-paille/pythran
“Pythran is an ahead of time compiler for a subset of the Python language, with a focus on scientific computing. It takes a Python module annotated with a few interface descriptions and turns it into a native Python module with the same interface, but (hopefully) faster.”
dotnetperls (note: not a Python-specific reference)
Sample code with a function and if conditions:
def function(condition):
    if condition:
        print("Hi")
    if not condition:
        print("Bye")
function(True)
function(False)
function('')
function('lalala')
Example from https://stackoverflow.com/a/16287793/2641825
Using two if conditions
# According to the UN Convention of the Rights of the Child
ADULT_AGE = 18
def analyze_age(age):
    if age < ADULT_AGE and age > 0:
        print("You are a child")
    if age >= ADULT_AGE:
        print("You are an adult")
    else:
        print("The age must be a positive integer!")
analyze_age(16)
>You are a child
>The age must be a positive integer!
“The elif fixes this and makes the two if statements ‘stick together’ as one:”
def analyze_age(age):
    if age < ADULT_AGE and age > 0:
        print("You are a child")
    elif age >= ADULT_AGE:
        print("You are an adult")
    else:
        print("The age must be a positive integer!")
analyze_age(16)
>You are a child
A for loop with a continue
statement
print("I print all numbers in the range except 2.")
for i in range(5):
    if i == 2:
        continue
    print(i)
The Argparse tutorial explains how to create a Python program that processes command line arguments. Save the following in prog.py:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
                    help="display a square of a given number")
parser.add_argument("-v", "--verbosity", type=int,
                    help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
    print(f"the square of {args.square} equals {answer}")
elif args.verbosity == 1:
    print(f"{args.square}^2 == {answer}")
else:
    print(answer)
Usage
$ python3 prog.py 4
16
$ python3 prog.py 4 -v
usage: prog.py [-h] [-v VERBOSITY] square
prog.py: error: argument -v/--verbosity: expected one argument
$ python3 prog.py 4 -v 1
4^2 == 16
$ python3 prog.py 4 -v 2
the square of 4 equals 16
$ python3 prog.py 4 -v 3
16
SO answer
explains that when using ipython, you need to separate
ipython arguments from your script arguments using --
.
Click looks like a rising star https://star-history.com/#docopt/docopt&pallets/click&pyinvoke/invoke&Date
SQL Alchemy is a database abstraction layer. Interaction with the database is built upon metadata objects:
The core of SQLAlchemy’s query and object mapping operations are supported by database metadata, which is comprised of Python objects that describe tables and other schema-level objects. These objects are at the core of three major types of operations - issuing CREATE and DROP statements (known as DDL), constructing SQL queries, and expressing information about structures that already exist within the database. Database metadata can be expressed by explicitly naming the various components and their properties, using constructs such as Table, Column, ForeignKey and Sequence, all of which are imported from the sqlalchemy.schema package. It can also be generated by SQLAlchemy using a process called reflection, which means you start with a single object such as Table, assign it a name, and then instruct SQLAlchemy to load all the additional information related to that name from a particular engine source.
from sqlalchemy import MetaData
from sqlalchemy import Table
meta = MetaData(schema = "raw_comtrade")
meta.bind = comtrade.database.engine
yearly_hs2 = Table('yearly_hs2', meta, autoload_with=comtrade.database.engine)
SQL Alchemy has an automap feature which generates mapped classes and relationships from a database schema.
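A hedged sketch of the automap usage on an in-memory SQLite database; note that automap only generates classes for tables that have a primary key:
from sqlalchemy import create_engine, Column, Integer, Text, MetaData, Table
from sqlalchemy.ext.automap import automap_base

# Build a small database to reflect
engine = create_engine("sqlite://")
meta = MetaData()
Table("user", meta,
      Column("id", Integer, primary_key=True),
      Column("name", Text))
meta.create_all(engine)

# Generate mapped classes and relationships from the schema
Base = automap_base()
Base.prepare(autoload_with=engine)  # older 1.3 style: Base.prepare(engine, reflect=True)
User = Base.classes.user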
I used sqlacodegen to automatically generate Python code from an existing PostgreSQL database table as follows
sqlacodegen --schema raw_comtrade --tables yearly_hs2 postgresql://rdb@localhost/biotrade
Paul’s SO
Answer. SQL Alchemy’s recommended way to check for the presence of a
table is to create an inspector object and use its
has_table()
method. The following example was copied from
sqlalchemy.engine.reflection.Inspector.has_table,
with the addition of an SQLite engine to make it reproducible:
from sqlalchemy import create_engine, inspect
from sqlalchemy import MetaData, Table, Column, Text
engine = create_engine('sqlite://')
meta = MetaData()
meta.bind = engine
user_table = Table('user', meta,
                   Column("name", Text),
                   Column("full_name", Text))
user_table.create()
inspector = inspect(engine)
inspector.has_table('user')
You can also use the user_table metadata element name to check if it exists, as such:
inspector.has_table(user_table.name)
Create a connection and execute a select statement; it’s a read-only operation.
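For example, a minimal sketch reusing the user_table defined above:
# A SELECT is read only, no commit needed
with engine.connect() as conn:
    result = conn.execute(user_table.select())
    print(result.fetchall())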
Create a connection and execute a create statement followed by a commit:
from sqlalchemy.schema import CreateSchema

# schema is a string with the schema name, e.g. schema = "raw_comtrade"
with engine.connect() as conn:
    if not engine.dialect.has_schema(conn, schema):
        conn.execute(CreateSchema(schema))
    conn.commit()
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html
“As a means of both proving the 2.0 architecture as well as allowing a fully iterative transition environment, the entire scope of 2.0’s new APIs and features are present and available within the 1.4 series;”
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-implicit-execution
“For schema level patterns, explicit use of an Engine or Connection is required.”
with engine.connect() as connection:
    # create tables, requires explicit begin and/or commit:
    with connection.begin():
        metadata_obj.create_all(connection)

    # reflect all tables
    metadata_obj.reflect(connection)

    # reflect individual table
    t = Table("t", metadata_obj, autoload_with=connection)

    # execute SQL statements
    result = connection.execute(t.select())
SQL Alchemy Object Relational Model Querying Guide
from sqlalchemy import select
stmt = select(user_table).where(user_table.c.name == 'spongebob')
print(stmt)
Since version 1.4 .where()
is a synonym of
.filter()
as explained in sqlalchemy.orm.Query.where.
To select only one column you can use Select.with_only_columns:
from sqlalchemy import MetaData, Table, Column, Text
meta = MetaData()
table = Table('user', meta,
              Column("name", Text),
              Column("full_name", Text))
stmt = (table.select()
        .with_only_columns([table.c.name])
        )
print(stmt)
Entering columns in the select method returns an error, although it should be valid according to the documentation.
print(table.select([table.c.name]))
# ArgumentError: SQL expression for WHERE/HAVING role expected,
# got [Column('name', Text(), table=<user>)].
Insert
some data into the user
table
from sqlalchemy import insert
from sqlalchemy.orm import Session
stmt = (
    insert(user_table).
    values(name='Bob', full_name='Sponge Bob')
)
with Session(engine) as session:
    result = session.execute(stmt)
    session.commit()
The pandas.to_sql method uses sqlalchemy to write pandas data frame to a PostgreSQL database.
“The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default.”
Repeat the example table defined above, read the result of a select statement into a pandas data frame:
import pandas
from sqlalchemy import create_engine
from sqlalchemy import MetaData, Table, Column, Text
from sqlalchemy.orm import Session
# Define metadata and create the table
engine = create_engine('sqlite://')
meta = MetaData()
meta.bind = engine
user_table = Table('user', meta,
                   Column("name", Text),
                   Column("full_name", Text))
user_table.create()
# Insert data into the user table
stmt = user_table.insert().values(name='Bob', full_name='Sponge Bob')
with Session(engine) as session:
    result = session.execute(stmt)
    session.commit()
# Select data into a pandas data frame
stmt = user_table.select().where(user_table.c.name == 'Bob')
df = pandas.read_sql_query(stmt, engine)
Another way importing the select statement:
from sqlalchemy import select
stmt = select(user_table).where(user_table.c.name == 'Bob')
df = pandas.read_sql_query(stmt, engine)
Another way using a session
with Session(engine) as session:
    df2 = pandas.read_sql(session.query(user_table).filter(user_table.c.name == "Bob").statement, session.bind)
Read the whole table into pandas
df3 = pandas.read_sql_table("user", engine)
Define an ORM structure for the iris dataset, then use pandas to insert the data into an SQLite database. Pandas inserts the data with the if_exists="append" argument so that it keeps the structure defined in SQL Alchemy.
import seaborn
import pandas
from sqlalchemy import create_engine
from sqlalchemy import MetaData, Table, Column, Text, Float
from sqlalchemy.orm import Session
Define metadata and create the table
engine = create_engine('sqlite://')
meta = MetaData()
meta.bind = engine
iris_table = Table('iris',
                   meta,
                   Column("sepal_length", Float),
                   Column("sepal_width", Float),
                   Column("petal_length", Float),
                   Column("petal_width", Float),
                   Column("species", Text))
iris_table.create()
Load data into the table
iris = seaborn.load_dataset("iris")
iris.to_sql(name="iris",
            con=engine,
            if_exists="append",
            index=False,
            chunksize=10 ** 6,
            )
The SQL Alchemy iris_table from above can be used to build a select statement that extracts unique values:
from sqlalchemy import distinct, select
stmt = select(distinct(iris_table.c.species))
df = pandas.read_sql_query(stmt, engine)
Create a database engine with SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine('postgresql://myusername:mypassword@myhost:5432/mydatabase')
Blogs and Stackoverflow
Load data into PostgreSQL using Python (without pandas)
5 ways to backup your PostgreSQL database using Python. Mentions the sh package, a subprocess replacement.
Create an SQLITE in memory database and add a table to it.
In [17]: from sqlalchemy import create_engine, inspect
...: from sqlalchemy import MetaData, Table, Column, Text
...: engine = create_engine('sqlite://')
...: meta = MetaData()
...: meta.bind = engine
...: user_table = Table('user', meta, Column("name", Text))
...: user_table.create()
...: inspector = inspect(engine)
...: inspector.has_table('user')
Out[17]: True
Create a file based database at a specific path:
# absolute path
e = create_engine('sqlite:////path/to/database.db')
I have set the following shortcuts to be similar to RStudio:
Ctrl+H find and replace dialog
Ctrl+R run selection or current line
Ctrl+Shift+C comment/uncomment code block
F1 inspect current object (i.e. display function and classes documentation)
F2 go to function definition
Spyder has a data frame explorer https://docs.spyder-ide.org/current/panes/variableexplorer.html#dataframes
“DataFrames, like Numpy arrays, display in a viewer where you can show or hide ‘heatmap’ colors, change the format and resize the rows and columns either manually or automatically”
I use Vim to edit python code and vim-slime to send the code to an ipython interpreter that runs inside a tmux pane. For more information, see my page on vim.html.
Auto reload a module in ipython
%load_ext autoreload
%autoreload 2
The following uses importlib.reload
to illustrate the
functionality and compares it with auto reload. Create a sample function
and load it
import sys
import pathlib
from importlib import reload
tmp_dir = pathlib.Path("/tmp/this_dir")
tmp_dir.mkdir(exist_ok=True)
sys.path.append(str(tmp_dir))
f = open(tmp_dir / "script.py",'w')
print("def compute_sum(i,j):\n return i+j", file=f)
f.close()
from script import compute_sum
compute_sum(1,2)
Change the function and reload it using
importlib.reload
f = open(tmp_dir / "script.py",'w')
print("def compute_sum(i,j):\n print('blabla')\n return i+j", file=f)
f.close()
reload(sys.modules['script'])
from script import compute_sum
compute_sum(1,2)
Change the function and reload it using auto reload in ipython
%load_ext autoreload
%autoreload 2
f = open(tmp_dir / "script.py",'w')
print("def compute_sum(i,j):\n print('blibli')\n return i+j", file=f)
f.close()
compute_sum(1,2)
CSV: the minimal common denominator, works everywhere. Great for small datasets to be shared across many languages and platforms.
NetCDF: supports rich metadata and complex data types, and is especially good at handling large datasets efficiently. Also supports various types of compression.
Parquet: columnar storage, efficient compression, and encoding schemes. Optimized for query performance.
Read and write csv
Write a compressed csv file as a gzip archive
import pandas
df = pandas.DataFrame({'x':range(0,3), 'y':['a','b','c']})
df.to_csv("/tmp/df.csv.gz", index=False, compression="gzip")
Write a compressed csv file as a zip archive, using a dict with the option “archive_name” (works only for the zip format)
compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('/tmp/df.csv.zip', index=False, compression=compression_opts)
Read compressed csv files
df1 = pandas.read_csv("/tmp/df.csv.gz")
df.equals(df1)
df2 = pandas.read_csv("/tmp/df.csv.zip")
df.equals(df2)
pandas.read_csv can only read zip archives that contain a single file. If there is more than one file in the archive, you can use a ZipFile object to provide access to the correct file inside the archive, see SO answer.
import zipfile
import pandas
zf = zipfile.ZipFile("archive_name.zip")
print("Files in the archive:", zf.namelist())
df = pandas.read_csv(zf.open("file_name.csv"))
https://en.wikipedia.org/wiki/CPU_time
“If a program uses parallel processing, total CPU time for that program would be more than its elapsed real time.”
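A small illustration of the two clocks in Python; for a single-threaded computation the two values are close, while with parallel processing the CPU time can exceed the elapsed (wall-clock) time:
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()
sum(i * i for i in range(10_000_000))
print("Wall-clock time:", time.perf_counter() - start_wall)
print("CPU time:", time.process_time() - start_cpu)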
Pandas data frames can be used to read CSV files from the Comtrade data API. For example, using the default API URL for all countries:
import pandas
df1 = pandas.read_csv('http://comtrade.un.org/api/get?max=500&type=C&freq=A&px=HS&ps=2020&r=all&p=0&rg=all&cc=TOTAL&fmt=csv')
df2 = pandas.read_csv('http://comtrade.un.org/api/get?max=500&type=C&freq=A&px=HS&ps=2020&r=all&p=0&rg=all&cc=01&fmt=csv',
# Force the id column to remain a character column,
# otherwise str "01" becomes an int 1.
dtype={'Commodity Code': str, 'bli': str})
Then use df.to_csv to write the data frame to a csv file
df1.to_csv("/tmp/comtrade.csv")
Load Eurostat population projection data. Eurostat tab separated values have the peculiarity of mixing tab separated and comma separated values, which is annoying when loading data into pandas.
Here is how to load the population projection dataset available at https://ec.europa.eu/eurostat/databrowser/view/PROJ_23NP/ into pandas.
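A hedged sketch of one way to do it: read the file as tab separated values, then split the first column (which packs several dimensions separated by commas) into proper columns. The file name and the exact header below are assumptions based on the usual Eurostat bulk-download layout.
import pandas

# Assumed name of the bulk-download file for proj_23np
df = pandas.read_csv("estat_proj_23np.tsv.gz", sep="\t")

# The first column header looks like "freq,projection,sex,age,unit,geo\TIME_PERIOD"
first_col = df.columns[0]
dim_names = first_col.split("\\")[0].split(",")

# Split the packed dimensions into separate columns and drop the original one
df[dim_names] = df[first_col].str.split(",", expand=True)
df = df.drop(columns=first_col)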
This reads only one sheet:
pandas.read_excel("file_name.xlsx", "sheet_name")
Open all sheets in an excel file, and concatenate them to a single data frame with an additional column that contains the sheet name.
import pandas as pd
sheets_dict = pd.read_excel("file_name.xlsx", sheet_name=None)
all_data = pd.concat(
[df.assign(sheet_name=s) for s, df in sheets_dict.items()],
ignore_index=True
)
print(all_data)
Load a sample data frame and save it to a feather file
import pandas
import seaborn
iris = seaborn.load_dataset("iris")
iris.to_feather("/tmp/iris.feather")
Load the data from the feather file
iris2 = pandas.read_feather("/tmp/iris.feather")
iris2.equals(iris)
GDX files store data for the GAMS modelling platform. They can be loaded into pandas data frames with the gdxpds package as explained in the gdxpds documentation:
import gdxpds
gdx_file = r'C:\path_to_my_gdx\data.gdx'
dataframes = gdxpds.to_dataframes(gdx_file)
for symbol_name, df in dataframes.items():
    print("Doing work with {}.".format(symbol_name))
Print a data frame to markdown, without the scientific notation https://stackoverflow.com/questions/66713432/suppress-scientific-notation-in-to-markdown-in-pandas
import pandas
import numpy as np
df = pandas.DataFrame({"x" : [1,1e7, 2], "y":[1e-5,100, np.nan]})
print(df.to_markdown())
print(df.to_markdown(floatfmt='.0f', index=False))
Print missing values as a minus sign https://stackoverflow.com/a/71165631/2641825
import pandas
import numpy as np
from tabulate import tabulate
df = pandas.DataFrame({"x": [1, 2], "y": [0, np.nan]})
print(tabulate(df,floatfmt=".0f", missingval="-",tablefmt="grid"))
print(tabulate(df.replace(np.nan, None),floatfmt=".0f", missingval="-",tablefmt="grid"))
In practice I use the data frame .to_markdown() method, which calls tabulate in the background, as explained in the pandas documentation
print(df.to_markdown(floatfmt=".0f", index=False, missingval="-"))
print(df.replace(np.nan, None).to_markdown(floatfmt=".0f", index=False, missingval="-"))
See also
You can use the command line tool ncdump to view the content of netcdf files
sudo apt install netcdf-bin
ncdump fuel.nc
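To read the same kind of file from Python, xarray is a common choice (not used elsewhere on this page, so treat this as a suggestion rather than the method used here):
import xarray

# Open the NetCDF file and convert its variables to a pandas data frame
ds = xarray.open_dataset("fuel.nc")
print(ds)              # dimensions, coordinates and variables, similar to ncdump
df = ds.to_dataframe()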
Open a text file and print lines containing “error”
with open('filename.txt', 'r') as file:
    for line in file:
        if "error" in line.lower():
            print(line)
Write to one file. This defaults to [snappy compression](https://en.wikipedia.org/wiki/Snappy_(compression))
import pandas
import seaborn
iris = seaborn.load_dataset("iris")
iris.to_parquet("/tmp/iris.parquet")
Read back the file
iris3 = pandas.read_parquet("/tmp/iris.parquet")
iris3.equals(iris)
You can also use gzip compression for a smaller file size (but slower read and write times)
iris.to_parquet("/tmp/iris.parquet.gzip", compression='gzip')
Write to multiple files along a column used as partition variable
iris.to_parquet("/tmp/iris",partition_cols="species")
The partitioned dataset is saved under a sub directory for each
unique value of the partition variable. For example there is a sub
directory for each species in the /tmp/iris
directory
iris
├── species=setosa
│ └── 1609afe5535d4e2b94e65f1892210269.parquet
├── species=versicolor
│ └── 18dd7ae6d0794fd48dad37bf8950d813.parquet
└── species=virginica
└── e0a9786251f54eed9f16380c8f5c3db3.parquet
One can read a single file in memory
virginica = pandas.read_parquet("/tmp/iris/species=virginica")
Note that it has lost the species column.
Read all files in memory
iris4 = pandas.read_parquet("/tmp/iris")
Note the data frame is slightly different. Values are the same but the species column has become a categorical variable.
iris4.equals(iris)
# False
iris4.species
# ...
# Name: species, Length: 150, dtype: category
# Categories (3, object): ['setosa', 'versicolor', 'virginica']
Changing it back to strings makes the two data frames equal again.
iris4["species"] = iris4["species"].astype("str")
iris4.equals(iris)
# True
Read only part of the content from parquet files with a filter. See
help(pyarrow.parquet.read_pandas)
for arguments concerning
the pyarrow engine. Reusing example files from the previous section:
selection = [("species", "in", ["versicolor","virginica"])]
iris5 = pandas.read_parquet("/tmp/iris", filters=selection)
In fact, the filter variable doesn’t have to be a partition variable.
selection = [("species", "in", ["versicolor","virginica"]),
("petal_width", ">", 2.4)]
iris6 = pandas.read_parquet("/tmp/iris", filters=selection)
This works as well on the single file version
iris7 = pandas.read_parquet("/tmp/iris.parquet", filters=selection)
# Change column type for the comparison
iris6["species"] = iris6["species"].astype("str")
iris7.equals(iris6)
Depending on whether or not the query is on the partition variable, read time can change by a lot. See the experiment in the next section.
Note that the dataset used to perform these comparisons is not made available here; I keep these for information purposes.
Compare a read of 2 countries with the read of the whole dataset
# start_time = timeit.default_timer()
# selection = [("reporter", "in", ["France","Germany"])]
# ft_frde = pandas.read_parquet(la_fo_data_dir / "comtrade_forest_footprint.parquet",
# filters=selection)
# print("Reading 2 countries took:",timeit.default_timer() - start_time)
#
# start_time = timeit.default_timer()
# ft2 = pandas.read_parquet(la_fo_data_dir / "comtrade_forest_footprint.parquet")
# print("Reading the whole dataset took:",timeit.default_timer() - start_time)
#
Time comparison when the reporter is used as a partition column: it’s about 10 times faster!
# ft.to_parquet("/tmp/ft", partition_cols="reporter")
# start_time = timeit.default_timer()
# selection = [("reporter", "in", ["France","Germany"])]
# ft_frde2 = pandas.read_parquet("/tmp/ft", filters=selection)
# print("Reading 2 countries took:",timeit.default_timer() - start_time)
#
# # Save to a compressed csv file in biotrade_data
# # file_path = la_fo_data_dir / "comtrade_forest_footprint.csv.gz"
# # ft.to_csv(file_path, index=False, compression="gzip")
Also try the feather format.
# # Save to a feather file
# ft.to_feather(la_fo_data_dir / "comtrade_forest_footprint.feather")
#
# # Read time of a feather file
# start_time = timeit.default_timer()
# ft_frde2 = pandas.read_feather(la_fo_data_dir / "comtrade_forest_footprint.feather")
# print("Reading a feather file took:",timeit.default_timer() - start_time)
“Parquet is a storage format designed for maximum space efficiency, using advanced compression and encoding techniques. It is ideal when wanting to minimize disk usage while storing gigabytes of data, or perhaps more. This efficiency comes at the cost of relatively expensive reading into memory, as Parquet data cannot be directly operated on but must be decoded in large chunks.
Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed (or only lightly so, when using dictionary encoding) but laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed.
Therefore, Arrow and Parquet complement each other and are commonly used together in applications. Storing your data on disk using Parquet and reading it into memory in the Arrow format will allow you to make the most of your computing hardware.”
What about “Arrow files” then?
Apache Arrow defines an inter-process communication (IPC) mechanism to transfer a collection of Arrow columnar arrays (called a “record batch”). It can be used synchronously between processes using the Arrow “stream format”, or asynchronously by first persisting data on storage using the Arrow “file format”.
The Arrow IPC mechanism is based on the Arrow in-memory format, such that there is no translation necessary between the on-disk representation and the in-memory representation. Therefore, performing analytics on an Arrow IPC file can use memory-mapping, avoiding any deserialization cost and extra copies.
Some things to keep in mind when comparing the Arrow IPC file format and the Parquet format:
Parquet is designed for long-term storage and archival purposes, meaning if you write a file today, you can expect that any system that says they can “read Parquet” will be able to read the file in 5 years or 10 years. While the Arrow on-disk format is stable and will be readable by future versions of the libraries, it does not prioritize the requirements of long-term archival storage.
Reading Parquet files generally requires efficient yet relatively complex decoding, while reading Arrow IPC files does not involve any decoding because the on-disk representation is the same as the in-memory representation.
Parquet files are often much smaller than Arrow IPC files because of the columnar data compression strategies that Parquet uses. If your disk storage or network is slow, Parquet may be a better choice even for short-term storage or caching.
Is it better to have one large parquet file or lots of smaller parquet files?
“Notice that Parquet files are internally split into row groups https://parquet.apache.org/documentation/latest/ So by making parquet files larger, row groups can still be the same if your baseline parquet files were not small/tiny. There is no huge direct penalty on processing, but opposite, there are more opportunities for readers to take advantage of perhaps larger/ more optimal row groups if your parquet files were smaller/tiny for example as row groups can’t span multiple parquet files.”
“Also larger parquet files don’t limit parallelism of readers, as each parquet file can be broken up logically into multiple splits (consisting of one or more row groups).”
“The only downside of larger parquet files is it takes more memory to create them. So you can watch out if you need to bump up Spark executors’ memory.”
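A hedged sketch with pyarrow showing how the row group size can be controlled when writing one large file (the sizes below are arbitrary):
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})
# Write a single file made of several row groups
pq.write_table(table, "/tmp/large.parquet", row_group_size=100_000)
print(pq.ParquetFile("/tmp/large.parquet").num_row_groups)  # 10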
Store a dictionary to a pickle file
import pickle
d = {"lkj":1}
with open('/tmp/d.pickle', 'wb') as file:
    pickle.dump(d, file)
Read from a pickle file
with open("/tmp/d.pickle", "rb") as file:
    e = pickle.load(file)
d == e
Print the size of the output layer
import torch
import torch.nn as nn
x = torch.randn(28,28).view(-1,1,28,28)
model = nn.Sequential(
    nn.Conv2d(1, 32, (3, 3)),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),
    nn.Conv2d(32, 64, (3, 3)),
)
print(model(x).shape)
type()
displays the type of an object.
i = 1
print(type(i))
# <class 'int'>
x = 1.2
print(type(x))
# <class 'float'>
t = (1,2)
print(type(t))
# <class 'tuple'>
l = [1,2]
print(type(l))
# <class 'list'>
Check if a variable is a string, int or float
isinstance("a", str)
isinstance(1, int)
isinstance(1.2, float)
Character to numeric
int("3")
float("3.33")
int(float("3.33"))  # int("3.33") would raise a ValueError
Numeric to character
str(2)
Convert a list to a comma separated string
",".join(["a","b","c"])
Another example with the list of the last 5 years
import datetime
year = datetime.datetime.today().year
# Create a numeric list of years
YEARS = [year - i for i in range(1,6)]
# Convert each element of the list to a string
YEARS = [str(x) for x in YEARS]
",".join(YEARS)
Create a dictionary with curly braces
ceci = {'x':1, 'y':2, 'z':3}
Convert 2 lists into a dictionary with the dict built-in function
dict(zip(['x', 'y', 'z'], [1, 2, 3]))
Dictionary comprehension
d = {n: True for n in range(5)}
Loop over the key and values of a dictionary
for key, value in ceci.items():
print(key, "has the value", value)
Invert keys and values
{value:key for key,value in ceci.items()}
The map function makes an iterator object of type map
it = map(lambda x: x + 1, range(3))  # avoid naming it `iter`, which would shadow the built-in
type(it)
[i for i in it]
Create a list
l1 = [1, 2, 3]
l2 = ["a", "b", "c"]
Create a list of strings using split (seen in this answer)
"slope, intercept, r_value, p_value, std_err".split(", ")
Remove an element from a list of strings
li = ['a', 'b', 'c', 'd']
li.remove('c')
li
['a', 'b', 'd']
Reverse a list with the reverse iterator
list(reversed(range(0,15)))
Reverse a list in place
bli = list(range(5))
print(bli)
bli.reverse()
print(bli)
How to flatten a list of tuples
nested_list = [(1, 2, 4), (0, 9)]
Using reduce:
from functools import reduce
reduce(lambda x, y: x + y, map(list, nested_list))
[1, 2, 4, 0, 9]
Using itertools.chain:
import itertools
list(itertools.chain.from_iterable(nested_list))
Using extend
:
flat_list = []
for a_tuple in nested_list:
    flat_list.extend(list(a_tuple))
flat_list
[1, 2, 4, 0, 9]
Difference between two sets:
set1 = {1,2,3}
set2 = {2,3,4}
set1 - set2
# {1}
set2 - set1
# {4}
Return a new set with elements common to the set and all others.
intersection(*others)
set & other & ...
bli = {1,2,3}
bli.intersection({1,2})
# {1, 2}
bli.intersection({1,2}, {1})
difference(*others)
set - other - ...
Return a new set with elements in the set that are not in the others.
symmetric_difference(other)
set ^ other
Return a new set with elements in either the set or other but not both.
For example check whether country names are all the same in 2 data frames
country_differences = set(df1["country"].unique()) ^ set(df2["country"].unique())
assert country_differences == set()
Instances of set provide the following operations:
issubset(other)
set <= other
Test whether every element in the set is in other. For example SO answer using issubset
l = [1,2,3]
m = [1,2]
set(m).issubset(l)
# True
seta = {1,2,3}
setb = {1,2}
setb.issubset(seta)
set < other
Test whether the set is a proper subset of other, that is, set <= other and set != other.
issuperset(other)
set >= other
Test whether every element in other is in the set.
set > other
Test whether the set is a proper superset of other, that is, set >= other and set != other.
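For example:
seta = {1, 2, 3}
setb = {1, 2}
seta >= setb   # True, every element of setb is in seta
seta > setb    # True, proper superset since seta != setb
seta > seta    # False, a set is not a proper superset of itself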
Return a new set with elements from the set and all others.
union(*others)
set | other | ...
Example
{1,2}.union({3,4}, {10})
Note the following perform a union:
set(range(3,10)).union(set(range(5)))
set(range(3,10)) | set(range(5))
But this is not a union:
set(range(3,10)) or set(range(5))
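That is because or is a boolean operator: it returns the first truthy operand (here the whole first set) instead of combining the two sets:
a = set(range(3, 10))
b = set(range(5))
a or b   # {3, 4, 5, 6, 7, 8, 9}, just a, which is truthy
a | b    # {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, the actual union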
For example
import pandas
from pathlib import Path
from typing import Union

def csv_to_df(path: Union[str, Path]) -> pandas.DataFrame:
    return pandas.read_csv(path)
Issue with https://medium.com/virtuslab/pandas-stubs-how-we-enhanced-pandas-with-type-annotations-1f69ecf1519e
pandas.DataFrame and spark.sql.DataFrame
This post is related to that project https://github.com/VirtusLab/pandas-stubs?tab=readme-ov-file
Which later turned into the pandas-stubs project https://github.com/pandas-dev/pandas-stubs maintained by the core pandas team. See also pandera.
https://stackoverflow.com/questions/43890844/pythonic-type-hints-with-pandas Provides an answer that uses pandera https://github.com/unionai-oss/pandera to specify the type of each column in the input data frame
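A minimal sketch of that approach, assuming a recent pandera version (the schema and function names are made up):
import pandera as pa
from pandera.typing import DataFrame, Series

class IrisSchema(pa.DataFrameModel):
    sepal_length: Series[float]
    species: Series[str]

@pa.check_types
def mean_sepal_length(df: DataFrame[IrisSchema]) -> float:
    # The decorator validates column names and types at runtime
    return df["sepal_length"].mean()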
See also function decorators in another section below.
https://stackoverflow.com/questions/2005878/what-are-python-metaclasses-useful-for
According to this Metaclass programming in Python article, you might not need them (yet):
Metaclasses are deeper magic than 99% of users should ever worry about. If you wonder whether you need them, you don’t (the people who actually need them know with certainty that they need them, and don’t need an explanation about why).
– Python Guru Tim Peters
https://developer.ibm.com/tutorials/ba-metaprogramming-python/ provides an example with a class that has camel case and another class that has snake case attributes.
Below is an example of object inheritance where a Car and a Boat classes inherit from a Vehicle class.
class Vehicle(object):
    def __init__(self, color, speed_max, garage=None):
        self.color = color
        self.speed_max = speed_max
        self.garage = garage

    def paint(self, new_color):
        self.color = new_color

    def go_back_home(self):
        # Assumes a go_to method and that the vehicle knows its garage
        self.position = self.go_to(self.garage.location)

class Car(Vehicle):
    def open_door(self):
        pass

class Boat(Vehicle):
    def open_balast(self):
        pass

honda = Car('bleu', 60)
gorgeoote = Boat('rouge', 30)
honda.paint('purple')
Note that an object can access its parent class’s methods and attributes through the super() method.
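A hypothetical subclass to illustrate super():
class ElectricCar(Car):
    def __init__(self, color, speed_max, battery_kwh, garage=None):
        # Reuse the parent class constructor through super()
        super().__init__(color, speed_max, garage)
        self.battery_kwh = battery_kwh

tesla = ElectricCar('white', 120, battery_kwh=75)
tesla.paint('black')   # method inherited from Vehicle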
Below an example of object composition where the Garage class is parent to many Vehicle objects.
class Garage(object):
    def __init__(self, all_vehicles):
        self.all_vehicles = all_vehicles

    def mass_paint(self, new_color):
        for v in self.all_vehicles:
            v.paint(new_color)

    def build_car(self, color):
        new_car = Car(color, 90, self)
        self.all_vehicles.append(new_car)
        return new_car

    @property
    def location(self):
        return '10, 18'

mike = Garage([honda, gorgeoote])
mike.mass_paint('green')  # mass_paint requires a colour argument
sport_car = mike.build_car('rouge')
Why do Python classes inherit object?
So, what should you do?
In Python 2: always inherit from object explicitly. Get the perks.
In Python 3: inherit from object if you are writing code that tries to be Python agnostic, that is, it needs to work both in Python 2 and in Python 3. Otherwise don’t, it really makes no difference since Python inserts it for you behind the scenes.
Get the active branch name in a git repository with GitPython:
import git
hat = git.Repo(path="~/repos/eu_cbm/eu_cbm_hat")
hat.active_branch.name
Find the location of git repositories for libcbm_py and eu_cbm_hat, then create git repository objects with them:
import sys
import git
def find_sys_path(path_contains):
    """Find path that contains the given characters.
    Raise an error if there's not exactly one matching path"""
    matching_paths = [path for path in sys.path if path_contains in path]
    if len(matching_paths) != 1:
        msg = f"Expected one path containing {path_contains}, "
        msg += f"found {len(matching_paths)}\n"
        msg += f"{matching_paths}"
        raise ValueError(msg)
    return matching_paths[0]
repo_libcbm_py = git.Repo(find_sys_path("libcbm_py"))
repo_eu_cbm_hat = git.Repo(find_sys_path("eu_cbm_hat"))
Checkout a branch if the repository is clean (no changes)
def checkout_branch(git_repo: git.repo.base.Repo, branch_name: str):
    """Check if a repository has any changes and checkout the given branch"""
    if git_repo.is_dirty(untracked_files=True):
        msg = f"There are changes in {git_repo}.\n"
        msg += f"Not checking out the '{branch_name}' branch."
        raise RuntimeError(msg)
    git_repo.git.checkout(branch_name)
    print(f"Checked out branch: {branch_name} of {git_repo}.")
# Usage
checkout_branch(repo_libcbm_py, "2.x")
The following example uses urllib.request.urlopen to download a zip
file containing Oceania’s crop production data from the FAO statistical
database. In that example, it is necessary to define a minimal header,
otherwise FAOSTAT throws an Error 403: Forbidden
. It was
posted as a StackOverflow
Answer.
import shutil
import urllib.request
import tempfile
# Create a request object with URL and headers
url = "http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_Oceania.zip"
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '}
req = urllib.request.Request(url=url, headers=header)
# Define the destination file
dest_file = tempfile.gettempdir() + '/' + 'crop.zip'
print(f"File located at:{dest_file}")
# Create an http response object
with urllib.request.urlopen(req) as response:
    # Create a file object
    with open(dest_file, "wb") as f:
        # Copy the binary content of the response to the file
        shutil.copyfileobj(response, f)
Based on https://stackoverflow.com/a/48691447/2641825 and https://stackoverflow.com/a/66591873/2641825, see also the documentation at https://docs.python.org/3/howto/urllib2.html
The following loads a JSON file into a pandas data frame from the Comtrade API.
import urllib.request
import json
import pandas
url_reporter = "https://comtrade.un.org/Data/cache/reporterAreas.json"
url_partner = "https://comtrade.un.org/Data/cache/partnerAreas.json"
# attempt with pandas.io, with an issue related to nested json
pandas.io.json.read_json(url_reporter, encoding='utf-8-sig')
pandas.io.json.read_json(url_partner)
# `results` is a character column containing {'id': '4', 'text': 'Afghanistan'}.
# Is there a way to tell read_json to load the id and text columns directly instead?
“Since the whole processing is done in the pd.io.json.read_json method, we cannot select the keys to direct to the actual data that we are after. So you need to run this additional code to get your desired results:”
df = pandas.io.json.read_json(url_reporter, encoding='utf-8-sig')
df2 = pandas.json_normalize(df.results.to_list())
Other attempt using lower level packages
req = urllib.request.Request(url=url_reporter)
with urllib.request.urlopen(req) as response:
    json_content = json.load(response)
df = pandas.json_normalize(json_content['results'])
In [17]: df
Out[17]:
id text
0 all All
1 4 Afghanistan
2 8 Albania
3 12 Algeria
4 20 Andorra
.. ... ...
252 876 Wallis and Futuna Isds
253 887 Yemen
254 894 Zambia
255 716 Zimbabwe
256 975 ASEAN
Related question I asked on SO: How to load a nested data frame with pandas.io.json.read_json?
Encoding issue: What is the difference between utf-8 and utf-8-sig?
Add these options at the ipython command line to reload objects automatically while you are coding
%load_ext autoreload
%autoreload 2
When pasting from another place, turn off auto indentation in ipython
%autoindent off
Once an error occurs at the ipython command line, type %debug, then you can move up the stack trace with:
`u`
Move down the stack trace with:
`d`
Show code context of the error:
`l`
Show available variable in the current context:
`a`
To enter interactive mode and paste more than one line of code at a time:
interact
To break at every step in a loop, use the breakpoint() function in any part of the code, as explained in step by step debugging with ipython. Type continue (or c) at the debugger prompt to resume execution until the next breakpoint.
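A small example; the debugger stops at every iteration and c resumes until the next breakpoint:
for i in range(3):
    breakpoint()   # drops into the debugger at each iteration
    print(i)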
See also the main section profiling and measuring time.
%prun statement
# Store the profiler output to a file
%prun -T /tmp/profiler.txt statement
Run a script with profiling enabled from the ipython console
%run -i -p script.py
Run a file from the ipython console
%run -i test.py
Prefix the bash call with an exclamation mark, for example:
!df -h
In fact the exclamation mark also works from an ipython shell. See also Difference between ! and % in Jupyter Notebooks
Google Colab
To work from the ipython command line it’s useful to load and execute the whole notebook inside the ipython shell with
ipython -c "%run notebook.ipynb"
It’s also possible to convert the long notebooks to a python script with:
jupyter nbconvert --to script notebook.ipynb
Then run the whole notebook and start an interactive shell with:
ipython -i notebook.py
Otherwise I also sometimes open the synchronized markdown version of the notebook and execute a few cells using Vim slime to send them to a tmux pane where ipython is running.
Convert the long notebooks to a python script with:
jupyter nbconvert --to script notebook.ipynb
Notebooks can be converted from the File / Save and Export Notebook As / HTML menu. Or at the command line with nbconvert
jupyter nbconvert --to html notebook.ipynb
Run an ipython notebook from python using nbconvert’s execute API:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import jupytext
####################
# Run one notebook #
####################
filename = 'notebook.ipynb'
with open(filename) as ff:
    nb_in = nbformat.read(ff, nbformat.NO_CONVERT)
# Read a notebook from the markdown file synchronized by jupytext
nb_md = jupytext.read('notebook.md')
# Run the notebook
ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
# Note: preprocess() returns a (notebook, resources) tuple
nb_out = ep.preprocess(nb_in)
# Save the output notebook
with open(filename, 'w', encoding='utf-8') as f:
    nbformat.write(nb_out, f)
Saving fails in my case.
“interactive HTML widgets for Jupyter notebooks and the IPython kernel.”
Documentation of interactive widgets
Create a text box
from ipywidgets import interact

def print_name(name):
    return "Name: " + name

interact(print_name, name="Paul")
Create a drop down list for an interactive plot
import matplotlib.pyplot as plt
import seaborn
from ipywidgets import interact
iris = seaborn.load_dataset("iris").set_index("species")
def plot_iris(species):
    """Plot the given species"""
    df = iris.loc[species]
    ax = df.plot.scatter(x='petal_length', y='petal_width', title=species)
    ax.set_xlim(0, 8)
    ax.set_ylim(0, 4)

interact(plot_iris, species=list(iris.index.unique()))
Use the @interact
decorator
@interact(species=list(iris.index.unique()))
def plot_iris(species):
    """Plot the given species"""
    df = iris.loc[species]
    ax = df.plot.scatter(x='petal_length', y='petal_width', title=species)
    ax.set_xlim(0, 8)
    ax.set_ylim(0, 4)
https://pandas.pydata.org/docs/user_guide/options.html#frequently-used-options
Round all numbers
pandas.set_option('display.precision', 0)
Precision with 2 digit
pandas.set_option('display.precision', 2)
Scientific notation with 2 significant digits after the dot
pandas.set_option('display.float_format', '{:.2e}'.format)
Display all columns
pandas.options.display.max_columns = None
Display max rows
pandas.set_option('display.max_rows', 500)
With a context manager as in this answer
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    ...  # some pandas operations
with pandas.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display(large_prod)
I wrote this csv download function in an SO answer
def csv_download_link(df, csv_file_name, delete_prompt=True):
    """Display a download link to load a data frame as csv from within a Jupyter notebook"""
    df.to_csv(csv_file_name, index=False)
    from IPython.display import FileLink
    display(FileLink(csv_file_name))
    if delete_prompt:
        a = input('Press enter to delete the file after you have downloaded it.')
        import os
        os.remove(csv_file_name)
To get a link to a csv file, enter the above function and the code below in a jupyter notebook cell :
csv_download_link(df, 'df.csv')
To get help on a function, enter function_name? in a cell. Quick help can also be obtained by pressing SHIFT + TAB.
To install Jupyter notebooks on python3:
pip3 install jupyter notebook
Then start the notebook server as such:
jupyter notebook
It is sometimes necessary to add the following at the beginning of a jupyter notebook so that plots are displayed inline
%matplotlib inline
Change the size of a plot displayed in a notebook
import seaborn
p = seaborn.lineplot(x="year", y="value", hue="source", data=df1)
p.figure.set_figwidth(15)
Install jupyter_contrib_nbextensions
python3 -m pip install --user jupyter_contrib_nbextensions
python3 -m jupyter contrib nbextension install --user
Activate the table of content extension:
python3 -m jupyter nbextension enable toc2/main
There are many other extensions available in this package. Optionally you can install the jupyter notebook extension configurator (not needed)
python3 -m pip install --user jupyter_nbextensions_configurator
jupyter nbextensions_configurator enable --user
This will make a configuration interface available at:
http://localhost:8888/nbextensions
Using the old Table of Contents extension: jupyter table of content extension
Convert a notebook to markdown or HTML with nbconvert:
jupyter nbconvert --to markdown mynotebook.ipynb
jupyter nbconvert --to html mynotebook.ipynb
For a colleague using Anaconda, the Installing jupyter_contrib_nbextensions documentation specifies that
“There are conda packages for the notebook extensions and the jupyter_nbextensions_configurator available from conda-forge. You can install both using”
conda install -c conda-forge jupyter_contrib_nbextensions
Star history comparison between nbstripout and jupytext https://star-history.com/#mwouts/jupytext&kynan/nbstripout&Date
nbdime https://nbdime.readthedocs.io/en/latest/ nbdime provides tools for diffing and merging Jupyter notebooks.
Convert notebooks to markdown so they are easier to track in git.
Install https://github.com/mwouts/jupytext
python3 -m pip install --user jupytext
Generate and edit the Jupyter notebook configuration:
python3 -m jupyter notebook --generate-config
vim ~/.jupyter/jupyter_notebook_config.py
Add this line:
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
And also this line if you always want to pair notebooks with their markdown counterparts:
c.ContentsManager.default_jupytext_formats = "ipynb,md"
Install and enable the jupytext notebook extension:
python3 -m jupyter nbextension install jupytext --py --user
python3 -m jupyter nbextension enable jupytext --py --user
Add syncing to a given notebook:
# Markdown sync
jupytext --set-formats ipynb,md --sync ~/repos/example_repos/notebooks/test.ipynb
# Python sync
jupytext --set-formats ipynb,py --sync ~/repos/example_repos/notebooks/test.ipynb
As an alternative to Jupytext, you can also clear the output of all cells before committing the notebook. That way the notebooks only contain code and not the output of tables and plots (which can sometimes take several megabytes of data).
how to remove Jupyter notebook output from terminal and when using git.
Clearing Jupyter output https://zhauniarovich.com/post/2020/2020-10-clearing-jupyter-output-p3/ previous approaches were using a pre-commit hook, current approach uses git attributes.
nbstripout https://github.com/kynan/nbstripout
“This does mostly the same thing as the Clear All Output command in the notebook UI.”
In pre-commit mode
“nbstripout is used as a git hook to strip any .ipynb files before committing. This also modifies your working copy!”
In regular mode
“In its regular mode, nbstripout acts as a filter and only modifies what git gets to see for committing or diffing. The working copy stays intact.”
It’s probably better to use the regular filter mode.
Install nbstripout
pip install --upgrade nbstripout
Configure a git repository to use nbstripout
cd git_repos
nbstripout --install
Uninstall nbstripout from the current repository “(remove the git filter and attributes)”
cd git_repos
nbstripout --uninstall
nbstripout-fast https://pypi.org/project/nbstripout-fast/ 200x faster implementation in rust avoids python startup times. They advertise 40s for git status with large repos, while their tool would speed it up to 1s.
Python documentation on Handling Exceptions.
while True:
try:
x = int(input("Please enter a number: "))
break
except ValueError:
print("Oops! That was no valid number. Try again...")
Re-raise the exception using from
to track the original
exception
for i in [1,0]:
try:
print(1/i)
except Exception as e:
msg = f"failed to compute: 1/{i} {str(e)}"
raise ValueError(msg) from e
The error message will contain:
"The above exception was the direct cause of the following exception"
Same example without “from”
for i in [1,0]:
try:
print(1/i)
except Exception as e:
msg = f"failed to compute: {str(e)}"
raise ValueError(msg)
The error message will contain:
"During handling of the above exception, another exception occurred"
Simple Exception message capturing with a print statement
for i in [1,0]:
try:
print(1/i)
except Exception as e:
print("Failed to compute:", str(e))
Handle empty data in pandas
import pandas
try:
df = gfpmx_data[s]
columns = df.columns
except pandas.errors.EmptyDataError:
print(f"no data in file {s}")
columns = []
Capture the first few lines of an exception and re-raise it using “from”:
from numpy.testing import assert_allclose
try:
assert_allclose(
ds[var].loc[COUNTRIES, t],
ds_ref[var].loc[COUNTRIES, t],
rtol=rtol,
)
except AssertionError as e:
first_line_of_error = ", ".join(str(e).split('\n')[:3])
msg = f"{ds.product}, {var}: {first_line_of_error}"
raise AssertionError(msg) from e
Python documentation on Raising Exceptions
raise Exception('spam')
raise ValueError('Not an acceptable value')
raise NameError("Wrong name: %s" % "quack quack quack")
Display variables in the error message:
raise ValueError("This is wrong: %s" % "wrong_value")
msg = "Value %s and value %s have problems."
raise ValueError(msg % (1, 2))
KeyError is raised when a key is missing, for example when an environment variable is undefined:
import os
os.environ["avarthatdoesntexist"]
EnvironmentError is the parent error of IO errors related to the operating system. It is only kept for compatibility purposes and should not be used for errors related to environment variables, according to an answer in https://stackoverflow.com/questions/50869968/is-it-appropriate-to-raise-an-environmenterror-for-os-environ
ValueError can be used to raise exceptions about data issues.
Send a warning to the user
import warnings
warnings.warn("there is no data")
Do not display the following warnings:
import warnings
warnings.filterwarnings("ignore", message="option is deprecated")
warnings.filterwarnings("ignore", ".*layout has changed to tight.*", category=UserWarning)
# related to https://github.com/mwaskom/seaborn/issues/3462
warnings.filterwarnings("ignore", "is_categorical_dtype")
warnings.filterwarnings("ignore", "use_inf_as_na")
docs.python.org logging cookbook
Pylint error: Use lazy % formatting in logging functions
Answer to Lazy evaluation of strings in python logging: comparing % with .format
The documentation https://docs.python.org/2/library/logging.html suggests the following for lazy evaluation of string interpolation:
logging.getLogger().debug('test: %i', 42)
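A minimal sketch (not from the notes above) showing the difference: with lazy % formatting the interpolation only happens if the message is actually emitted, while an f-string is always built before the call.
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
n_rows = 42
# Lazy: '%i' is only interpolated if the DEBUG level is enabled (here it is not)
logger.debug("Loaded %i rows", n_rows)
# Eager: the f-string is built even though the message is then discarded
logger.debug(f"Loaded {n_rows} rows")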
Functions in python can be defined with
def add_one(x):
return x + 1
add_one(1)
# 2
Annotations for parameters take the form of optional expressions that follow the parameter name:
def foo(a: expression, b: expression = 5):
...
Use -> to annotate the type of a function’s return value. This is done like so:
def sum() -> expression:
    ...
def kinetic_energy(m: 'in KG', v: 'in M/S') -> 'Joules':
    return 1/2*m*v**2
kinetic_energy.__annotations__
{'m': 'in KG', 'v': 'in M/S', 'return': 'Joules'}
The pandas code base doesn’t use it everywhere, there are functions that use the standard sphinx type of documentation timedeltas.py#L1094. I have the impression that the annotations are used for the package internal functions, while the sphinx documentation is used for the functions that are exposed to outside users. And in the same script, they use both sphinx documentation and type annotations timedeltas.py#L952.
There are several things to know about up front when it comes to type hinting in Python. Let’s look at the pros of type hinting first:
- Type hints are a nice way to document your code in addition to docstrings
- Type hints can make IDEs and linters give better feedback and better autocomplete
- Adding type hints forces you to think about types, which may help you make good decisions during the design of your applications.
Adding type hinting isn’t all rainbows and roses though. There are some downsides:
- The code is more verbose and arguably harder to write
- Type hinting adds development time
- Type hints only work in Python 3.5+. Before that, you had to use type comments
- Type hinting can have a minor start up time penalty in code that uses it, especially if you import the typing module.
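As a minimal illustration (not from the notes above), a small function annotated with types from the typing module:
from typing import List, Optional
def mean(values: List[float]) -> Optional[float]:
    """Return the arithmetic mean, or None for an empty list."""
    if not values:
        return None
    return sum(values) / len(values)
mean([1.0, 2.0, 3.0])
# 2.0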
When using numpy arrays, python displays a behaviour of call by reference
import numpy as np
a = np.array([1,2])
def changeinput(x, scalar):
x[0] = scalar
changeinput(a,3)
a
# array([3, 2])
This is really weird coming from R, which has a copy-on-modify principle.
The R Language Definition says this (in section 4.3.3 Argument Evaluation)
“The semantics of invoking a function in R argument are call-by-value. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame. [Emphasis added]”
Decorators are a way to wrap a function around another function. It is useful to repeat a pattern of behaviour around a function.
I have used decorators to cache the function output along a data processing pipeline.
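A minimal sketch of a hand-written decorator (illustrative only; a caching decorator like the one mentioned above would follow the same wrapper structure):
import functools
def log_calls(func):
    """Print the function name and arguments before each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} with {args} {kwargs}")
        return func(*args, **kwargs)
    return wrapper
@log_calls
def add_one(x):
    return x + 1
add_one(1)
# Calling add_one with (1,) {}
# 2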
Since python 3.8 there is also a @cached_property
decorator functools.cached_property
“Transform a method of a class into a property whose value is computed once and then cached as a normal attribute for the life of the instance. Similar to property(), with the addition of caching. Useful for expensive computed properties of instances that are otherwise effectively immutable.”
Example (by https://www.perplexity.ai/search/5cc7a6e1-ef72-418d-b7ae-d9049815b6f8?s=c#5cc7a6e1-ef72-418d-b7ae-d9049815b6f8):
from functools import cached_property
class MyClass:
def __init__(self):
self._data = [1, 2, 3, 4, 5]
@cached_property
def sum(self):
print("Computing sum...")
return sum(self._data)
Usage:
obj = MyClass()
print(obj.sum) # prints "Computing sum... 15"
print(obj.sum) # prints "15"
There is also a @cache
decorator functools.cache
that creates:
“a thin wrapper around a dictionary lookup for the function arguments. Because it never needs to evict old values, this is smaller and faster than lru_cache() with a size limit.
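A minimal sketch using functools.cache (available since Python 3.9) to memoise a recursive function:
from functools import cache
@cache
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
fib(30)
# 832040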
Deprecate the old name of a function argument
import warnings
def agg_trade_eu_row(df, grouping_side="partner", index_side=None):
if index_side is not None:
warnings.warn("index_side is deprecated; use grouping_side", DeprecationWarning, 2)
grouping_side = index_side
This SO question asks how to create an argument alias, without changing the number of arguments to the function.
Document python functions with the sphinx convention SO Answer
def send_message(sender, recipient, message_body, priority=1) -> int:
"""
Send a message to a recipient
:param str sender: The person sending the message
:param str recipient: The recipient of the message
:param str message_body: The body of the message
:param priority: The priority of the message, can be a number 1-5
:type priority: integer or None
:return: the message id
:rtype: int
:raises ValueError: if the message_body exceeds 160 characters
:raises TypeError: if the message_body is not a basestring
"""
https://pypi.org/project/Cartopy/
“A cartographic python library with Matplotlib support for visualisation”
See the sections on
How to do maths in python 3 with operators
2 to the power of 3
2**3
# 8
Floor division
5 // 3
# 1
# Use it to extract the year of a Comtrade period
202105 // 100
# 2021
Modulo
5 % 3
# 2
# Use it to extract the last 2 digits of an integer
202105 % 100
# 5
Both at the same time
divmod(5,3)
# (1, 2)
https://www.sympy.org/en/index.html
“SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.”
Wikipedia pyomo
“Pyomo is well suited to modeling simple and complex systems that can be described by linear or nonlinear algebraic, differential, and partial differential equations and constraints.”
https://pypi.org/project/openai/
“The library needs to be configured with your account’s secret key which is available on the website. […] Set it as the OPENAI_API_KEY environment variable”
Ask Chat GPT to complete a message
import openai
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "What are the trade-offs around deadwood in forests?"}]
)
print(response)
Print available models
models = openai.Model.list()
print([m["id"] for m in models.data])
Note ChatGPT plus and the API have separate pricing: https://help.openai.com/en/articles/7039783-how-can-i-access-the-chatgpt-api
“Please note that the ChatGPT API is not included in the ChatGPT Plus subscription and are billed separately. The API has its own pricing, which can be found at https://openai.com/pricing. The ChatGPT Plus subscription covers usage on chat.openai.com only and costs $20/month.”
https://www.statsmodels.org/stable/index.html
“statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.”
2010 paper https://conference.scipy.org/proceedings/scipy2010/pdfs/seabold.pdf
“The current main developers of statsmodels are trained as economists with a background in econometrics. As such, much of the development over the last year has focused on econometric applications.”
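A minimal sketch of an ordinary least squares fit with the statsmodels formula API, on made-up data (not from the notes):
import numpy as np
import pandas
import statsmodels.formula.api as smf
rng = np.random.default_rng(0)
df = pandas.DataFrame({"x": np.arange(30.0)})
df["y"] = 2 * df["x"] + 1 + rng.normal(scale=0.5, size=30)
model = smf.ols("y ~ x", data=df).fit()
print(model.params)    # intercept close to 1, slope close to 2
print(model.summary())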
https://pypi.org/project/wooldridge/
“A Python package which contains 111 data sets from one of the most famous econometrics textbooks for undergraduates. […] It is extensively used in Learning Introductory Econometrics with Python (Japanese, translated). It is also used in Using Python for Introductory Econometrics, which is a sister book of ‘Using R for Introductory Econometrics’.”
The package linearmodels provides fixed effects estimator for panel data.
All examples below are based on the numpy package being imported as np :
import numpy as np
I mostly use binary operators on boolean arrays for index selections in pandas data frames.
Bitwise and
np.array([True, True]) & np.array([False, True])
Bitwise not
~np.array([True, False])
They are equivalent to logical operators numpy.logical_and, numpy.logical_not for logical arrays.
A SO answer quotes the NumPy v1.15 Manual
> If you know you have boolean arguments, you can get away with using
> NumPy’s bitwise operators, but be careful with parentheses, like this:
> `z = (x > 1) & (x < 2)`. The absence of NumPy operator forms of
> `logical_and` and `logical_or` is an unfortunate consequence of Python’s
> design.
So one can also use ~ for logical_not and | for logical_or.
“The number 13 is represented by 00001101. Likewise, 17 is represented by 00010001. The bit-wise AND of 13 and 17 is therefore 000000001, or 1”
np.bitwise_and(13, 17)
# 1
The &
operator can be used as a shorthand for
np.bitwise_and on ndarrays.
x1 = np.array([2, 5, 255])
x2 = np.array([3, 14, 16])
x1 & x2
Numpy array indexing
“Basic slicing extends Python’s basic concept of slicing to N dimensions. Basic slicing occurs when obj is a slice object (constructed by start:stop:step notation inside of brackets), an integer, or a tuple of slice objects and integers. […] The basic slice syntax is i:j:k where i is the starting index, j is the stopping index, and k is the step (k ≠ 0).”
“Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).”
“Integer array indexing allows selection of arbitrary items in the array based on their N-dimensional index. Each integer array represents a number of indexes into that dimension.”
x[0:3,0:2]
# array([[0.64174957, 0.18540429],
# [0.97558697, 0.69314058],
# [0.51646795, 0.71055115]])
In this case because every row is selected, it is the same as:
x[:,0:2]
Examples modified from https://docs.scipy.org/doc/numpy/user/basics.indexing.html
y = np.arange(35).reshape(5,7)
print(y[np.array([0,2,4]), np.array([0,1,2])])
print('With slice 1:3')
print(y[np.array([0,2,4]),1:3])
print('is equivalent to')
print(y[np.array([[0],[2],[4]]),np.array([[1,2]])])
# This one is the same but transposed, which is weird
print(y[np.array([[0,2,4]]),np.array([[1],[2]])])
# Notice the difference with the following
print(y[np.array([0,2,4]),np.array([1,2,3])])
Masks: using a masked array, we wish to mark the fourth entry as invalid. The easiest is to create a masked array:
x = np.array([1, 2, 3, -1, 5])
mx = np.ma.masked_array(x, mask=[0, 0, 0, 1, 0])
print(x.sum(), mx.sum())
# 10 11
Create a vector
a = np.array([1,2,3])
Create a matrix
b = np.array([[1,2,3],[5,6,6]])
Shape
a.shape
# (3,)
b.shape
# (2, 3)
Matrix of zeroes
np.zeros([2,2])
#array([[0., 0.],
# [0., 0.]])
Create a matrix with an additional dimension
np.zeros(b.shape + (2,))
array([[[0., 0.],
[0., 0.],
[0., 0.]],
[[0., 0.],
[0., 0.],
[0., 0.]]])
Transpose
b.transpose()
# array([[1, 5],
# [2, 6],
# [3, 6]])
c = b.transpose()
Math functions in numpy:
np.cos()
np.sin()
np.tan()
np.exp()
min and max
x = np.array([1,2,3,4,5,-7,10,-8])
x.max()
# 10
x.min()
# -8
Matrix multiplication matmul
np.matmul(a,c)
# array([14, 35])
# Can also be written as
a @ c
# array([14, 35])
Otherwise the multiplication symbol implements an element-wise multiplication, also called the Hadamard product. It only works on two matrices of the same dimensions (or shapes that broadcast). Element-wise multiplication is used for example in convolution kernels.
b * b
# array([[ 1, 4, 9],
# [25, 36, 36]])
Here is again an example showing the difference between element-wise and matrix multiplication:
m = np.array([[0,1],[2,3]])
Element wise multiplication :
m * m
# array([[0, 1],
# [4, 9]])
Matrix multiplication :
m @ m
# array([[ 2, 3],
# [ 6, 11]])
Linear algebra functionalities are provided by numpy.linalg For example the norm of a matrix or vector:
np.linalg.norm(x)
# 16.3707055437449
np.linalg.norm(np.array([3,4]))
# 5.0
np.linalg.norm(a)
# 3.7416573867739413
Norm of the matrix for the regularization parameter in a machine learning model
bli = np.array([[1, 1, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 1, 0.0, 0.0, 0.0]])
sum(np.linalg.norm(bli, axis=0)**2)
# 3.0000000000000004
sum(np.linalg.norm(bli, axis=1)**2)
# 3.0000000000000004
np.linalg.norm(bli)**2
# 2.9999999999999996
Append vs concatenate
x = np.array([1,2])
print(np.append(x,x))
# [1 2 1 2]
print(np.concatenate((x,x),axis=None))
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
print(np.concatenate((a, b), axis=0))
print(np.concatenate((a, b.T), axis=1))
print(np.concatenate((a, b), axis=None))
Power of an array
import numpy as np
a = np.arange(4).reshape(2, 2)
print(a)
print(a**2)
print(a*a)
np.power(a, 2)
Broadcast the power operator
np.power(a, a)
x = np.random.random([3,4])
x
# array([[0.64174957, 0.18540429, 0.7045183 , 0.44623567],
# [0.97558697, 0.69314058, 0.32469324, 0.82612627],
# [0.51646795, 0.71055115, 0.74864751, 0.2142459 ]])
Random choice with a given probability: choose zero with probability 0.1 and one with probability 0.9.
for i in range(10):
print(np.random.choice(2, p=[0.1, 0.9]))
print(np.random.choice(2, 10, p=[0.1, 0.9]))
print(np.random.choice(2, (10,10), p=[0.1, 0.9]))
[[1 1 1 1 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 0 1 1]
[1 1 1 0 1 1 1 1 1 1]
[1 1 0 1 1 1 1 1 0 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 0 1 1 0]
[1 1 1 1 1 1 1 1 1 1]]
Error if probabilities do not sum up to one
print(np.random.choice(2, p=[0.1, 0.8]))
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-31-8a8665287968> in <module>
# ----> 1 print(np.random.choice(2, p=[0.1, 0.8]))
# mtrand.pyx in numpy.random.mtrand.RandomState.choice()
# ValueError: probabilities do not sum to 1
All code below assumes you have imported pandas
import pandas
You can create a data frame by passing a dictionary of lists with column names as keys
df = pandas.DataFrame({'x':range(0,3),
                       'y':['a','b','c']})
#    x  y
# 0  0  a
# 1  1  b
# 2  2  c
By passing data as a list of lists and specifying the name of the columns in the columns argument.
data = [['APOLLOHOSP', 8, 6, 'High', 'small'],
['COROMANDEL', 9, 9, 'High', 'small'],
['SBIN', 10, 3, 'Medium', 'large']]
pandas.DataFrame(data=data, columns=["code", "Growth", "Value", "Risk",
"Mcap"])
Or by passing a list of tuples and defining the columns
argument
pandas.DataFrame(
list(zip(range(0,3), ['a','b','c'])),
columns=["x", "y"]
)
Random numbers
import numpy as np
df = pandas.DataFrame({'x':np.random.random(100)})
Create a new column based on another one
df = pandas.DataFrame({'a':range(0,3),
'b':['p','q','r'],
'c':['m','n','o']})
df["d"] = df["a"] * 2
Use the assign
method
df.assign(e = lambda x: x["a"] * 3)
df.assign(e = lambda x: x["a"] / 1e3)
A recursive function is difficult to vectorize because each input at time t depends on the previous input at time t-1. When possible use a year index for shorter selection with .loc().
import pandas
df = pandas.DataFrame({'year':range(2020,2024),'a':range(3,7)})
df1 = df.copy()
# Set the initial value
t0 = min(df1.year)
df1.loc[df1.year==t0, "x"] = 0
# Doesn't work when the right side of the equation is a pandas.core.series.Series
for t in range (min(df1.year)+1, max(df1.year)+1):
df1.loc[df1.year==t, "x"] = df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]
print(df1)
# year a x
# 0 2020 3 0.0
# 1 2021 4 NaN
# 2 2022 5 NaN
# 3 2023 6 NaN
print(type(df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]))
# <class 'pandas.core.series.Series'>
# Works when the right side of the equation is a numpy array
for t in range (min(df1.year)+1, max(df1.year)+1):
df1.loc[df1.year==t, "x"] = (df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]).unique()
#break
print(df1)
# year a x
# 0 2020 3 0.0
# 1 2021 4 3.0
# 2 2022 5 7.0
# 3 2023 6 12.0
print(type((df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]).unique()))
# <class 'numpy.ndarray'>
# Assignment works directly when the .loc() selection is using a year index
df2 = df.set_index("year").copy()
# Set the initial value
df2.loc[df2.index.min(), "x"] = 0
for t in range (df2.index.min()+1, df2.index.max()+1):
df2.loc[t, "x"] = df2.loc[t-1, "x"] + df2.loc[t-1,"a"]
#break
print(df2)
# a x
# year
# 2020 3 0.0
# 2021 4 3.0
# 2022 5 7.0
# 2023 6 12.0
print(type(df2.loc[t-1, "x"] + df2.loc[t-1,"a"]))
# <class 'numpy.float64'>
#SO answer using cumsum
Our real problem is more complicated since there is a multiplicative and an additive component
import pandas
df3 = pandas.DataFrame({'year':range(2020,2024),'a':range(3,7), 'b':range(8,12)})
df3 = df3.set_index("year").copy()
# Set the initial value
initial_value = 1
df3.loc[df3.index.min(), "x"] = initial_value
# Use a loop
for t in range (df3.index.min()+1, df3.index.max()+1):
df3.loc[t, "x"] = df3.loc[t-1, "x"] * df3.loc[t-1, "a"] + df3.loc[t-1, "b"]
# Use cumsum and cumprod
df3["cumprod_a"] = df3.a.cumprod().shift(1).fillna(1)
df3["cumsum_cumprod_a_b"] = df3.cumprod_a.cumsum().shift(1).fillna(0) * df3.b
df3["x2"] = df3.cumprod_a * initial_value + df3.cumsum_cumprod_a_b
print(df3)
type(df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"])
is a pandas series while
type(df2.loc[t-1, "x"] + df2.loc[t-1,"a"])
is a numpy
float. Why are types different?
Is there a better way to write a recursive .loc()
assignment than to use .unique()
?
See also:
“It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation will invalidate the iterator, causing unexpected behavior.” […] “To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.”
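A small plain-Python illustration of that rule (not from the original notes): deleting dictionary keys while iterating over the dictionary raises a RuntimeError, iterating over a copy of the keys does not.
d = {"a": 1, "b": 2, "c": 3}
# for key in d:            # RuntimeError: dictionary changed size during iteration
#     if d[key] > 1:
#         del d[key]
for key in list(d):        # iterate over a copy of the keys
    if d[key] > 1:
        del d[key]
d
# {'a': 1}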
Create an example data frame
import pandas
df = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
index=['cobra', 'viper', 'sidewinder'],
columns=['max_speed', 'shield'])
Set value for all items matching the list of labels
df.loc[['viper', 'sidewinder'], ['shield']] = 50
# max_speed shield
# cobra 1 2
# viper 4 50
# sidewinder 7 50
Map the values of a column using a dictionary
df = pandas.DataFrame({'lettre':['p','q','r','r','s','v','p']})
mapping = {'p':'pour','q':'quoi','r':'roi'}
df["mot"] = df["lettre"].map(mapping)
Set categorical data type
import pandas
df = pandas.DataFrame({'id':["a", "b", "c"], 'x':range(3)})
list_of_ids = ["b", "c", "a"]
df['id'] = pandas.Categorical(df['id'], categories=list_of_ids, ordered=True)
df.sort_values('id', inplace=True)
df["id"]
# Out:
# 0 a
# 1 b
# 2 c
# Name: id, dtype: category
# Categories (3, object): ['b' < 'c' < 'a']
Remove a category
df["element"].cat.remove_categories(["nai_merch"])
See also the IO section to convert data frames to other files.
Convert 2 columns to a dictionary
df = pandas.DataFrame({'a':range(0,3),
'b':['p','q','r'],
'c':['m','n','o']})
df.set_index('b').to_dict()['c']
Convert a string to a numeric type using the argument errors="coerce":
s = pandas.Series(["1", "2", "a"])
pandas.to_numeric(s, errors="coerce")
Check if a column is of numeric or string type
pandas.api.types.is_numeric_dtype(s)
pandas.api.types.is_string_dtype(s)
The following would raise an error
s.astype(float)
s.astype(int)
And using the df[col].astype() method with errors="ignore" would not convert at all:
s.astype(float, errors="ignore")
pandas.to_numeric(s, errors="ignore")
Convert an integer to a string type
s = pandas.Series(range(3))
s.astype(str)
SO question Convert to scalar
iat()
“Access a single value for a row/column pair by integer position. Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or set a single value in a DataFrame or Series.”
squeeze()
“Squeeze 1 dimensional axis objects into scalars. Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged. This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.”
Pure equality example from pandas.DataFrame.equals
import pandas
df = pandas.DataFrame({1: [10], 2: [20]})
exactly_equal = pandas.DataFrame({1: [10], 2: [20]})
df.equals(exactly_equal)
different_column_type = pandas.DataFrame({1.0: [10], 2.0: [20]})
df.equals(different_column_type)
different_data_type = pandas.DataFrame({1: [10.0], 2: [20.0]})
df.equals(different_data_type)
Testing closeness (for example with floating point results computed in another software)
import numpy as np
np.testing.assert_allclose([1,2,3], [1.001,2,3],rtol=1e-2)
np.testing.assert_allclose([1,2,3], [1.001,2,3],rtol=1e-6)
Other examples
df.equals(df+1e-6)
np.testing.assert_allclose(df,df+1e-7)
np.testing.assert_allclose(df,df+1e-3)
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
“pandas provides various facilities for easily combining together Series or DataFrame with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.”
Example based on https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects modified
df1 = pandas.DataFrame(
{
"A": range(3),
"B": range(3,6)
}
)
df2 = pandas.DataFrame(
{
"A": range(4),
"B": range(4,8),
"C": ["w", "x", "y", "z"]
}
)
df3 = pandas.DataFrame(
{
"A": range(10,13),
"B": range(13,16)
}
)
result = pandas.concat([df1, df2, df3]).reset_index(drop=True)
Concatenate two series SO
Notice the difference between the default axis=0
concatenate on the index, and axis=1
concatenate on the
columns.
import pandas
s1 = pandas.Series([1, 2, 3], index=['A', 'B', 'c'], name='s1')
s2 = pandas.Series([4, 5, 6], index=['A', 'B', 'D'], name='s2')
pandas.concat([s1, s2], axis=0)
pandas.concat([s1, s2], axis=1)
Stackoverflow Pandas merging
pandas merge right_on do not keep variable name Stack Overflow
It proposes 3 solutions, sketched in the example below:
- rename the original data frame to merge on variables that have the same name
- merge and drop the redundant column with a different name
- set the merge column as an index in the right data frame and use right_index=True
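A minimal sketch of the three approaches, with made-up column names:
import pandas
left = pandas.DataFrame({"product_code": [1, 2], "value": [10, 20]})
right = pandas.DataFrame({"code": [1, 2], "product_name": ["ash", "beech"]})
# 1. Rename so both data frames share the key name
left.merge(right.rename(columns={"code": "product_code"}), on="product_code")
# 2. Merge on differently named keys, then drop the redundant column
left.merge(right, left_on="product_code", right_on="code").drop(columns="code")
# 3. Set the key as the index of the right data frame and use right_index=True
left.merge(right.set_index("code"), left_on="product_code", right_index=True)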
Three example data types
df = pandas.DataFrame({"a":range(0,3),
"b": ["a", "b", "c"],
"c": [0.1, 0.2, 0.3]})
df.info()
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 a 3 non-null int64
# 1 b 3 non-null object
# 2 c 3 non-null float64
Each data type has specific methods attached to it. For example the string accessors methods
df.b.str.contains("c")
Reusing the example from above
df["b"] = pandas.Categorical(
df["b"], categories=["b", "c", "a"], ordered=True
)
df.info()
df.sort_values("b")
List columns as an index object
df = pandas.DataFrame({'a':range(0,3),'b':range(3,6)})
df.columns
List columns as a list
df.columns.tolist()
Select only certain columns in a list
df['bla'] = 0
cols = df.columns.tolist()
[name for name in cols if 'a' in name]
https://datatofish.com/rows-with-nan-pandas-dataframe/
import pandas
import numpy as np
df = pandas.DataFrame({'i': ['a', 'b', 'c', 'd', 'e'],
                       'y': [np.nan, '2', '2', '4', '1'],
                       'z': ['2', '2', '4', '1', np.nan]})
df[df.isna().any(axis=1)]
A function that prints the number and proportion of NA values:
def nrows_available(df, var):
"""Number of rows where this variables is not NA"""
avail = sum(df[var] == df[var])
not_avail = sum(df[var] != df[var])
assert(not_avail + avail == len(df))
print(f"{var} is available in {avail} rows",
f"and NA in the other {not_avail} rows",
f"{round(avail/len(df)*100)}% are available.")
nrows_available(placette, "tpespar1")
nrows_available(placette, "tpespar2")
Remove empty columns where values are all NA
import pandas
import numpy as np
df = pandas.DataFrame({'A' : ['bli', 'bla', 'bla', 'bla', 'bla'],
'B' : [np.nan, '2','2', '4', '1'],
'C' : np.nan})
columns_to_keep = [x for x in df.columns if not all(df[x].isna())]
df = df[columns_to_keep].copy()
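A shorter built-in alternative (assuming the same goal of dropping columns whose values are all NA) is dropna with axis=1 and how="all":
df.dropna(axis=1, how="all")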
Rename the ‘a’ column to ‘new’
df.rename(columns={'a':'new'})
Rename columns to snake case using a regular expression
import re
df.rename(columns=lambda x: re.sub(r" ", "_", str(x)).lower(), inplace=True)
# Another regexp that replaces all non alphanumeric characters by an
# underscore
df.rename(columns=lambda x: re.sub(r"\W+", "_", str(x)).lower(), inplace=True)
Remove parenthesis and dots in column names
df.rename(columns=lambda x: re.sub(r"[()\.]", "", x), inplace=True)
Replace the content of the columns, see below:
iris["species"].replace("setosa","x")
You can use a selector data frame to select and rename at the same time.
https://stackoverflow.com/questions/57417520/selecting-and-renaming-columns-at-the-same-time
df.rename(columns=selector_d)[selector_d.values()]
Load a csv file which has headers on 2 lines, merge the headers, convert to lower case, remove the “unnamed_1_” part of the column name:
csv_file_name = self.data_dir / "names.csv"
df = pandas.read_csv(csv_file_name, header=[0, 1])
df.columns = [str('_'.join(col)).lower() for col in df.columns]
df.rename(columns=lambda x: re.sub(r"unnamed_\d+_", "", str(x)).lower(), inplace=True)
You can also rename a series with
iris["species"].rename("bla")
Place the last column first
cols = df.columns.to_list()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
This SO Answer provides 6 different ways to reorder columns.
Place the last 3 columns first
cols = list(df.columns)
cols = cols[-3:] + cols[:-3]
df = df[cols]
See also string operations in pandas.
Replace Comtrade product code by the FAOSTAT product codes
import seaborn
iris = seaborn.load_dataset("iris")
iris["species"].replace("setosa","x")
# Create a dictionary from 2 columns of a data frame
product_dict = product_mapping.set_index('comtrade_code').to_dict()['faostat_code']
df_comtrade["product_code"] = df_comtrade["product_code"].replace(product_dict)
To change the type of a column use astype:
s = pandas.Series(range(3))
s.to_list()
s.astype(str).to_list()
s.astype(float).to_list()
Note using NA values is not possible with the base integer type, it requires a special type Int64 as explained in this SO answer
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
When loading data, sometimes more than one type is detected per column, with a warning such as this one:
arbre = pandas.read_csv(zf.open("ARBRE.csv"), sep=";")
DtypeWarning: Columns (4,5,9,14,21,36) have mixed types.
Specify dtype option on import or set low_memory=False.
Create sample data with a column that has 2 data types
import seaborn
iris = seaborn.load_dataset("iris")
# Change one row to another type
iris.loc[0,"sepal_length"] = iris.loc[0,"sepal_length"].astype(str)
iris.loc[1,"sepal_length"] = iris.loc[1,"sepal_length"].astype(str)
Find columns that use more than one type
for col in iris.columns:
unique_types = iris[col].apply(type).unique()
if len(unique_types) > 1:
print(col, unique_types)
To display the memory usage of each column in a pandas data frame
import pandas
df = pandas.DataFrame({'x':range(0,3), 'y':['a','b','c']})
print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum())
df.info()
Using sys.getsizeof
:
import sys
print(sys.getsizeof(df))
Changing a repeated data series to a categorical can help reduce memory usage, although this is probably not true any more in pandas. Categorical variables come with additional annoyances (such as the memory blow-up bug with observed=False in groupby operations).
import seaborn
iris = seaborn.load_dataset("iris")
print(iris["species"].memory_usage(deep=True))
print(iris["species"].astype('category').memory_usage(deep=True))
iris2 = iris.copy()
iris2["species"] = iris["species"].astype('category')
print(sys.getsizeof(iris2))
print(sys.getsizeof(iris))
Use copy() to make a copy of a data frame’s indices and data.
import seaborn
iris1 = seaborn.load_dataset("iris")
iris2 = iris1.copy()
iris2["x"] = 0
print(iris2.head(1))
print(iris1.head(1))
iris2.equals(iris1)
If you don’t make a copy, modifying a newly assigned data frame also modifies the original data frame
iris3 = iris1
iris3["x"] = 0
print(iris3.head(1))
print(iris1.head(1))
iris3.equals(iris1)
Create date time columns from a character column
import pandas
pandas.to_datetime('2020-01-01', format='%Y-%m-%d')
pandas.to_datetime('2020-01-02')
pandas.to_datetime('20200103')
Extract the year
s = pandas.Series(pandas.date_range("2000-01-01", periods=3, freq="Y"))
print(s)
print(s.dt.year)
Convert integer years to a time series
s = pandas.Series([2020, 2021, 2022])
pandas.to_datetime(s, format="%Y")
Convert UN Comtrade dates in the format 202201 to a datetime type
df = pandas.DataFrame({'period':[202201, 202202]})
df["period2"] = pandas.to_datetime(df['period'], format='%Y%m')
df.info()
Rolling mean over a 5 year window for the whole data frame (provided that year is the index variable)
df.rolling(window=5).mean()
Plot the difference to a 5 years rolling mean
(df - df.rolling(window=5).mean()).plot.bar()
Example “rolling sum with a window length of 2 observations.”
df = pandas.DataFrame({'B': [0, 1, 2, np.nan, 4, 5, 6, 7]})
df.rolling(2).sum()
Yearly rolling of a monthly time series:
.transform(lambda x: x.rolling(13, min_periods=1).mean())
Note you might actually not need the transform in this case.
df["x"].rolling(13, min_periods=1).mean()
also works.
Compute the sum of sepal length grouped by species
import seaborn
iris = seaborn.load_dataset("iris")
# Aggregate one value
iris.groupby('species')["sepal_length"].agg(sum).reset_index()
# Aggregate multiple values
iris.groupby('species')[["sepal_length", "petal_length"]].agg(sum).reset_index()
# Aggregate multiple values and give new names
iris.groupby('species').agg(sepal_length_sum = ('sepal_length', sum),
petal_length_sum = ('petal_length', sum))
Compute the sum but repeated for every original row
iris['sepal_sum'] = iris.groupby('species')['sepal_length'].transform('sum')
iris
This is useful to compute the share of total in each group for example.
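For example, a sketch computing the share of each row in the total sepal length of its species:
iris["sepal_share"] = iris["sepal_length"] / iris.groupby("species")["sepal_length"].transform("sum")
iris.groupby("species")["sepal_share"].sum()
# each species sums to 1.0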
Compute the cumulative sum of the sepal length
iris['cumsum'] = iris.groupby('species').sepal_length.cumsum()
iris['cumsum'].plot()
from matplotlib import pyplot
pyplot.show()
Compute a lag
iris['cumsum_lag'] = iris.groupby('species')['cumsum'].transform('shift', fill_value=0)
iris[['cumsum', 'cumsum_lag']].plot()
pyplot.show()
Aggregate a trade data frame by decades
bins = range(2000, 2031, 10)
tf_agg["decade"] = pandas.cut(
tf_agg["year"], bins=bins, include_lowest=True, labels=range(2000, 2021, 10)
)
index = ["reporter", "partner", "flow", "product_code_4d", "decade"]
tf_decade = (
tf_agg.groupby(index)[["net_weight", "trade_value"]]
.agg(sum)
.reset_index()
)
Beyond standard functions such as sum and mean, it’s possible to use a self-defined lambda function as follows
import numpy as np
(iris
.groupby(["species"])
.agg(pw_sum = ("petal_width", sum),
pw_sum_div_by_10 = ("petal_width", lambda x: x.sum()/10),
n = ("petal_width", len),
mean1 = ("petal_width", np.mean))
.assign(mean2 = lambda x: x.pw_sum / x.n)
)
Display mean, std, min, 25%, 50%, 75%, max across group by variables
df.groupby(["status"])["diff"].describe()
Example aggregating some variables with a sum and taking the unique
value (first) for other variables (input coefficients). The code below
passes a dictionary of variables and aggregation functions to the
df.groupby().agg()
method.
# Aggregate product codes from the 6 to the 4 digit level
index = [
"year",
"period",
"reporter_code",
"reporter",
"reporter_iso",
"partner_code",
"partner",
"partner_iso",
"product_code_4d",
"unit_code",
"unit",
]
agg_dict = {'quantity': 'sum',
'net_weight': 'sum',
'trade_value': 'sum',
'vol_eqrwd_ub': 'sum',
'vol_eqrwd_ob': 'sum',
'la_fo': 'sum',
'conversion_factor_m3_mt':'first',
'bark_factor': 'first',
'nai': 'first'}
ft4d = (
ft
.groupby(index)
.agg(agg_dict)
.reset_index()
)
The first element of the aggregation dictionary shows how to simply compute all the unique values
agg = {
# All unique values in a list
"country_iso2":lambda x: x.unique(),
# Concatenate a list of strings into a string ("text_col" is a placeholder column name)
"text_col": lambda x: "".join(x.unique()),
# The first value if the value is repeated and only present once
'primary_eq':lambda x: x.unique()[0] if x.nunique() == 1 else np.nan,
'import_quantity':lambda x: x.unique()[0] if x.nunique() == 1 else np.nan,
# Sum the values
'primary_eq_imp_1':"sum"
}
df_agg = df.groupby(index)[selected_columns].agg(agg).reset_index()
Compute proportion within groups:
df = pandas.DataFrame({
'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
'value': [10, 20, 30, 40, 50, 60, 70]
})
df['proportion'] = df.groupby('category')['value'].transform(lambda x: x / x.sum())
Load the flights dataset and for each month, display the passenger
value in the same month of the previous year. Compare the
passengers
and pass_year_minus_one
columns by
displaying the tables for January and December.
import seaborn
flights = seaborn.load_dataset("flights")
flights['pass_year_minus_one'] = flights.groupby(['month']).passengers.shift()
flights.query("month=='January'")
flights.query("month=='December'")
Extract the min in each group
df.loc[df.groupby('A')['B val'].idxmin()]
Sort by max in each group
df.groupby('reporter')["value"].max().sort_values(ascending=False)
Number of unique combinations of one or 2 columns
df = pandas.DataFrame({'A' : ['bla', 'bla', 'bli', 'bli', 'bli'],
'B' : ['1', '2', '2', '4', '2']})
df.groupby(["A"]).nunique()
df.groupby(["B"]).nunique()
df.groupby(["A", "B"]).nunique()
How do I select the first row in each group in groupby
import pandas
import numpy as np
df = pandas.DataFrame({'A': ['bla', 'bla', 'bli', 'bli', 'bli'],
                       'B': ['1', '2', '2', '4', '1'],
                       'C': [np.nan, 'X', 'Y', 'Y', 'Y']})
df.sort_values('B').groupby('A').nth(0)
df.sort_values('B').groupby('A').nth(list(range(2)))
df.sort_values('B').groupby('A').head(2)
Sum by groups
import pandas
df = pandas.DataFrame({'i' : ['a', 'a', 'b', 'b', 'b'],
'x' : range(1,6)})
df["y"] = df.groupby("i")["x"].transform("sum")
Yearly rolling of a monthly time series:
import pandas
from matplotlib import pyplot as plt
li = list(range(15))
df = pandas.DataFrame({'x' : li + list(reversed(li)) + li})
df["y"] = df["x"].transform(lambda x: x.rolling(13, min_periods=1).mean())
df.plot()
plt.show()
Interpolate
import pandas
import numpy as np
df = pandas.DataFrame({'i' : ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
'x' : [1,np.nan, np.nan, 4, 1, 2, np.nan]})
df["y"] = df.groupby("i")["x"].transform(pandas.Series.interpolate)
See also
Compute the area diff
df["area_diff"] = df.groupby(groupby_area_diff)["area"].transform(
lambda x: x.diff()
)
Based on SO answer
import pandas
df = pandas.DataFrame({'A' : ['a', 'a', 'b', 'c', 'c'],
'B' : ['i', 'j', 'k', 'i', 'j'],
'X' : [1, 2, 2, 1, 3]})
df.groupby("X", as_index=False)["A"].agg(' '.join)
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
An index can be converted back to a data frame. See also index selection in the “.loc” section.
Example from pandas DataFrame drop
import pandas
midx = pandas.MultiIndex(levels=[['lama', 'cow', 'falcon'],
['speed', 'weight', 'length']],
codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
[0, 1, 2, 0, 1, 2, 0, 1, 2]])
df = pandas.DataFrame(index=midx, columns=['big', 'small'],
data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
[250, 150], [1.5, 0.8], [320, 250],
[1, 0.8], [0.3, 0.2]])
df
df.drop(index='cow', columns='small')
df.drop(index='length', level=1)
For example the result of a pivot operation on multiple value columns
returns a multi index. To flatten that multi index, use
.to_flat_index()
as follows:
df.columns = ["_".join(a) for a in df.columns.to_flat_index()]
Do this column renaming before the reset_index() that you would have used after the pivot operation.
Inspired by https://stackoverflow.com/questions/14507794/how-to-flatten-a-hierarchical-index-in-columns
Compute in a loop based on the value of the previous year t-1. If there is a single value per year, use a scalar computation:
df = pandas.DataFrame({'x':range(0,10)})
df.loc[0, "y"] = 2
for t in range(1, len(df)):
df.loc[t, "y"] = pow(df.loc[t-1, "y"], df.loc[t, "x"]/2)
df
If there are multiple values for each year, use a vector computation.
import itertools
import pandas
countries = ["a","b","c","d"]
years = range(1990, 2020)
expand_grid = list(itertools.product(countries, years))
df = pandas.DataFrame(expand_grid, columns=('country', 'year'))
df["x"] = 1
df["x"] = df["x"].cumsum()
df.set_index(["year"], inplace=True)
df.loc[min(years), "y"] = 2
for t in range(min(years)+1, max(years)+1):
df.loc[t, "y"] = pow(df.loc[t-1, "y"], df.loc[t, "x"]/2)
df
I would like to compute the consumption equation of a partial equilibrium model.
df = pandas.DataFrame({'x':range(0,3),
'y':['a','b','c']})
for t in range(gfpmx_data.base_year + 1, years.max()+1):
# TODO: replace this loop by vectorized operations using only the index on years
for c in countries:
# Consumption
swd.loc[(t,c), "cons2"] = (swd.loc[(t, c), "constant"]
* pow(swd.loc[(t-1, c), "price"],
swd.loc[(t, c), "price_elasticity"])
* pow(swd.loc[(t, c), "gdp"],
swd.loc[(t, c), "gdp_elasticity"])
)
swd['comp_prop'] = swd.cons2 / swd.cons -1
print(swd["comp_prop"].abs().max())
swd.query("year >= 2019")
Display the unique values of the two columns with a count of occurrences
import seaborn
penguins = seaborn.load_dataset("penguins")
penguins.value_counts(["species", "island"])
penguins[["species", "island"]].value_counts()
Lower level method using unique()
on a multi index and
returning a data frame
penguins.set_index(["species", "island"]).index.unique().to_frame(False)
See also the query section for other ways to query data frames.
df = pandas.DataFrame({'i':range(0,3),
'j':['a','b','c'],
'x':range(22,25)})
df = df.set_index(["i","j"])
df.loc[(df.index.get_level_values('i') > 1)]
Using query instead
df.query("i>1")
import pandas
import numpy as np
s = pandas.Series([0, 2, np.nan, 8, np.nan, np.nan])
s.interpolate(method='polynomial', order=2)
s.interpolate(method='linear')
# Also fill NA values at the beginning and end of the series
s.interpolate(method='linear', limit_direction="both")
Limit interpolation to the inner NAN values
s.interpolate(limit_area="inside")
# Interpolate the whole data frame
df.groupby("a").transform(pandas.DataFrame.interpolate)
# Only one column
df.groupby("a")["b"].transform(pandas.Series.interpolate)
“The Apache Arrow format allows computational routines and execution engines to maximize their efficiency when scanning and iterating large chunks of data. In particular, the contiguous columnar layout enables vectorization using the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors.” “[…] a standardized memory format facilitates reuse of libraries of algorithms, even across languages.” “Arrow libraries for C (Glib), MATLAB, Python, R, and Ruby are built on top of the C++ library.”
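A minimal sketch of the round trip between pandas and Arrow (assumes the pyarrow package is installed; not from the original notes):
import pandas
import pyarrow as pa
df = pandas.DataFrame({"x": range(3), "y": ["a", "b", "c"]})
table = pa.Table.from_pandas(df)   # pandas data frame -> Arrow table
df2 = table.to_pandas()            # Arrow table -> pandas data frame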
See the general section on IO input output, many of the subsections there refer to pandas IO.
The Pandas user guide on reshaping gives several examples using melt (easier to rename the “variable” and “value” columns) or stack (designed to work together with MultiIndex objects).
Reshape using melt
import seaborn
iris = seaborn.load_dataset("iris")
iris.melt(id_vars="species", var_name="measurement")
Another example with two index columns
cheese = pandas.DataFrame(
{
"first": ["John", "Mary"],
"last": ["Doe", "Bo"],
"height": [5.5, 6.0],
"weight": [130, 150],
}
)
cheese
cheese.melt(id_vars=["first", "last"], var_name="quantity")
Another example
grading_matrix = pandas.DataFrame({"dbh":["d1", "d2", "d3"],
"abies":["p","q","r"],
"picea":["m","n","o"],
"larix":["m","n","o"]})
grading_long = grading_matrix.melt(id_vars="dbh",
var_name="species",
value_name="grading")
Reshape using the wide_to_long
convenience function
import numpy as np
dft = pandas.DataFrame(
{
"A1970": {0: "a", 1: "b", 2: "c"},
"A1980": {0: "d", 1: "e", 2: "f"},
"B1970": {0: 2.5, 1: 1.2, 2: 0.7},
"B1980": {0: 3.2, 1: 1.3, 2: 0.1},
"X": dict(zip(range(3), np.random.randn(3))),
"id": {0: 0, 1: 1, 2: 2},
}
)
dft
pandas.wide_to_long(dft, stubnames=["A", "B"], i="id", j="year")
Pivot from long to wide format using pivot
:
df = pandas.DataFrame({
"lev1": [1, 1, 1, 2, 2, 2],
"lev2": [1, 1, 2, 1, 1, 2],
"lev3": [1, 2, 1, 2, 1, 2],
"lev4": [1, 2, 3, 4, 5, 6],
"values": [0, 1, 2, 3, 4, 5]})
df_wide = df.pivot(columns=["lev2", "lev3"], index="lev1", values="values")
df_wide
# lev2 1 2
# lev3 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0
Rename the (sometimes confusing) axis names
df_wide.rename_axis(columns=[None, None])
# 1 2
# 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0
Add a prefix to a year columns before pivoting
(df
.assign(year = lambda x: "net_trade_" + x["year"].astype(str))
.pivot(columns="year", index=["product", "scenario"], values="net_trade")
.reset_index()
)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
df1 = pandas.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df1.transpose()
Python pandas equivalent for replace
import pandas
s = pandas.Series(["ape", "monkey", "seagull"])
s.replace(["ape", "monkey"], ["lion", "panda"])
s.replace("a", "x", regex=True)
`s.replace({"ape": "lion", "monkey": "panda"})`
pandas.Series(["bla", "bla"]).replace("a","i",regex=True)
Replace by the upper case value
s.str.upper()
Replace values where the condition is false, see help(df.where):
“Where cond is True, keep the original value. Where False, replace with corresponding value from other.”
df = pandas.DataFrame({'a':range(0,3),
'b':['p','q','r'],
'c':['m','n','o']})
df["b"].where(df["c"].isin(["n","o"]),"no")
df.where(df["c"].isin(["n","o"]),"no")
See also the interpolate section.
Replace NA values by another value
import pandas
import numpy as np
df = pandas.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list("ABCD"))
# Replace all NaN elements with 0s.
df.fillna(0)
# Replace by 0 in column A and by 1 in column B
df.fillna({"A":0, "B":1}, inplace=True)
df
There are many ways to select data in pandas (square brackets, loc, iloc, query, isin). In a first stage, during data preparation, it’s better to keep data out of the index. But in a second stage, when you are doing modelling, multi indexes become useful, and especially slicers to compute on part of the dataset: only some years, only some products, only some countries. For this, tools such as df.xs or https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IndexSlice.html are needed.
.loc
is primarily label based, but may also be used with
a boolean array.
I copied the examples below from the pandas loc documentation at: pandas.DataFrame.loc
Create an example data frame
import pandas
df = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
index=['cobra', 'viper', 'sidewinder'],
columns=['max_speed', 'shield'])
List of index labels
In : df.loc[['viper', 'sidewinder']]
Out:
max_speed shield
viper 4 5
sidewinder 7 8
Selecting a cell with 2 lists returns a data frame
df.loc[["viper"], ["shield"]]
Selecting cell with tuples (for multi indexes) or strings returns its value
df.loc[("viper"), ("shield")]
df.loc["viper", "shield"]
Note: in the case of a multi index, use tuples for index selection, see section below on multi index selection with loc.
Conditional that returns a boolean Series
In : df.loc[df['shield'] > 6]
Out:
max_speed shield
sidewinder 7 8
Slice with labels for row and labels for columns.
In : df.loc['cobra':'viper', 'max_speed':'shield']
Out:
max_speed shield
cobra 1 2
viper 4 5
Set value for all items matching the list of labels
In : df.loc[['viper', 'sidewinder'], ['shield']] = 50
In : df
Out:
max_speed shield
cobra 1 2
viper 4 50
sidewinder 7 50
Another example using integers for the index
df2 = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
index=[7, 8, 9],
columns=['max_speed', 'shield'])
Slice with integer labels for rows. Note that both the start and stop of the slice are included. Python slices behave differently.
In : df2.loc[8:9]
Out:
max_speed shield
8 4 5
9 7 8
Using the same example as above, select rows that are not in [‘cobra’,‘viper’]. Based on a SO answer use isin on the index:
In : df.index.isin(['cobra','viper'])
Out: array([ True, True, False])
In : df.loc[~df.index.isin(['cobra','viper'])]
Out:
max_speed shield
sidewinder 7 8
Or assign the selector to reuse it:
selector = df.index.isin(['cobra','viper'])
df.loc[selector]
df.loc[~selector]
With an index corresponding to years, select all years below or equal to 2050
df.loc[df.index <= 2050]
import pandas
df = pandas.DataFrame([[1, 2], [4, 5], [7, 8]],
index=['cobra', 'viper', 'sidewinder'],
columns=['max_speed', 'shield'])
df.loc[(df["max_speed"] > 1) & (df["shield"] < 7)]
df.query("max_speed > 1 & shield < 7")
Create a panel data set with a multi index in years and countries
import pandas
import numpy as np
df = pandas.DataFrame(
{"country": ['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso'] * 2,
"year": np.repeat(np.array([2020,2021]), 5),
"value": np.random.randint(0,1e3,10)
})
df = df.set_index(["year", "country"])
Use the multi index to select data for 2020 only
idx = pandas.IndexSlice
df.loc[idx[2020, :]]
Use the multi index to select data for Algeria only, in all years
df.loc[idx[:, "Algeria"], :]
Note: it’s better to write df.loc[idx[2020, :], :] than df.loc[(2020,)]. The latter is in fact just equivalent to df.loc[2020]. Note that df.loc[(,"Algeria")] would return a syntax error.
See also the course material Pandas for panel data.
Sample data copied from help(df.loc)
:
tuples = [
('cobra', 'mark i'), ('cobra', 'mark ii'),
('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
('viper', 'mark ii'), ('viper', 'mark iii')
]
index = pandas.MultiIndex.from_tuples(tuples)
values = [[12, 2], [0, 4], [10, 20],
[1, 4], [7, 1], [16, 36]]
df = pandas.DataFrame(values, columns=['max_speed', 'shield'], index=index)
Single label. Note this returns a DataFrame with a single index.
df.loc['cobra']
Single index tuple. Note this returns a Series.
df.loc[('cobra', 'mark ii')]
# Note that df.loc[(:, 'mark ii')] is a syntax error; to select on the second level only, use pandas.IndexSlice (see below)
Single tuple. Note using [[]]
returns a DataFrame.
df.loc[[('cobra', 'mark ii')]]
Single label for row and column. Similar to passing in a tuple, this returns a Series.
df.loc['cobra', 'mark i']
Slice from index tuple to single label
df.loc[('cobra', 'mark i'):'viper']
Slice from index tuple to index tuple
df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
Invert a selection on the second index
df.loc[~df.index.isin(["mark i"], level=1)]
Get index level values to use conditional checks on those values. For example select years smaller than 2021:
selector = df.index.get_level_values("year") < 2021
df.loc[selector]
“You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).”
Other example from a SO question
import pandas
df = pandas.DataFrame(index = pandas.MultiIndex.from_product([range(2010,2020),
['mike', 'matt', 'dave', 'frank', 'larry'], ]))
df['x']=0
df.index.names=['year', 'people']
df.loc[2010]
df.loc[(2010,"mike")]
These two, df.loc[2010] and df.loc[(2010,"mike")], work, but df.loc["mike"] returns a KeyError: 'mike'. To select on the second index level only, you need a multi index slicer.
idx = pandas.IndexSlice
df.loc[idx[:, "mike"],:]
You can also use df.xs
df.xs("mike", level=1)
df.xs("mike", level="people")
Using loc on just the second index in a multi index. Another example, using the cobra / mark data frame from the earlier section.
idx = pandas.IndexSlice
df.loc[idx[:, "mark i"],:]
df.xs("mark i", level=1)
.iloc is primarily integer position based (from 0 to length -1 of the axis), but may also be used with a boolean array.
Create a sample data frame:
In : example = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df = pandas.DataFrame(example)
In : df
Out:
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000
Index with a slice object. Note that it doesn’t include the upper bound.
In : df.iloc[0:2]
Out:
a b c d
0 1 2 3 4
1 100 200 300 400
With lists of integers.
In : df.iloc[[0, 2], [1, 3]]
Out:
b d
0 2 4
2 2000 4000
With slice objects.
In : df.iloc[1:3, 0:3]
Out:
a b c
1 100 200 300
2 1000 2000 3000
With a boolean array whose length matches the columns.
In : df.iloc[:, [True, False, True, False]]
Out:
a c
0 1 3
1 100 300
2 1000 3000
Query the columns of a Data Frame with a boolean expression.
df = pandas.DataFrame({'A': range(1, 6),
'B': range(10, 0, -2),
'C': range(10, 5, -1)})
df.query("A > B")
A B C
5 2 6
Two queries
df.query("A < B and B < C")
df.query("A < B or B < C")
Query using a variable
limit = 3
df.query("A > @limit")
A B C
4 4 7
5 2 6
Query for a variable in a list
df.query("A in [3,6]")
Query for a variable not in a list
df.query("A not in [3,6]")
str.contains and str.startswith do not work with the default numexpr engine, you need to set engine="python" as explained in this answer.
Example use on a table of product codes, query products description that contain “oak” but not “cloak” and query sawnwood products starting with “4407”:
comtrade.products.hs.query("product_description.str.contains('oak') and not product_description.str.contains('cloak')", engine="python")
comtrade.products.hs.query("product_code.str.startswith('4407')", engine="python")
Use a list of values to select rows
df = pandas.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
df[df['A'].isin([3, 6])]
df.loc[df['A'].isin([3, 6])]
df.query("A in [3,6]")
Select the second column with square brackets
df[df.columns[1]]
The key and level arguments specify which part of the multilevel index should be used. Create a sample data frame, copied from help(df.xs):
d = {'num_legs': [4, 4, 2, 2],
'num_wings': [0, 0, 2, 2],
'class': ['mammal', 'mammal', 'mammal', 'bird'],
'animal': ['cat', 'dog', 'bat', 'penguin'],
'locomotion': ['walks', 'walks', 'flies', 'walks']}
df = pandas.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
print(df)
Select with a key following the order in which levels appear in the index:
df.xs('mammal')
df.xs(('mammal', 'dog'))
Select with a key and specify the levels:
df.xs(key='cat', level=1)
df.xs(key=('bird', 'walks'),
level=[0, 'locomotion'])
Pandas DataFrame.xs “cannot be used to set values.”
How to determine whether a pandas column contains a particular value
Using in on a Series checks whether the value is in the index:
In : s = pd.Series(list('abc'))
In : 1 in s
Out: True
In : 'a' in s
Out: False
One option is to see if it’s in unique values:
In : 'a' in s.unique()
Out: True
This SO question recommends using the boolean value df.empty to test whether a data frame is empty.
import seaborn
iris = seaborn.load_dataset("iris")
selector = iris["species"] == "non_existant"
df = iris[selector]
df.empty
Sort iris by descending order of species and ascending order of petal width
iris.sort_values(by=["species", "petal_width"], ascending=[False,True])
See also string operations in python in another section.
String operations in pandas use vectorized string methods of the class StringMethods(pandas.core.base.NoNewAttributesMixin).
df = pandas.DataFrame({'a':['a','b','c']})
help(df.a.str)
Concatenate all values in a character vector:
df['a'].str.cat()
Extract the first 2 or last 2 characters
df = pandas.DataFrame({'a':['bla','bli','quoi?']})
df["a"].str[:2]
df["a"].str[-2:]
Search one element or another in a character vector:
df = pandas.DataFrame({'a':['bla','ble','bli2']})
df[df['a'].str.contains('a|i')]
Replace elements in a character vector:
df['a'].replace('a|i','b',regex=True)
Remove letters (for example to keep only numbers)
df["a"].replace('[a-zA-Z]', '', regex=True)
Strip spaces in strings
df = pandas.DataFrame({'a':['bla','bli',' bla ']})
print(df.a.unique())
print(df.a.str.strip().unique())
The “too many values to unpack” error can also be returned by the str.split method of pandas data frames. For example splitting a character vector on the “,” pattern. Split by using both n=1 and expand=True. Then assign to new columns using multiple vector assignment. It is equivalent to tidyr::separate in R.
import pandas
df = pandas.DataFrame({"x": ["a", "a,b", "a,b,c"]})
df[["y", "z"]] = df.x.str.split(",", n=1, expand=True)
df
# x y z
# 0 a a None
# 1    a,b  a     b
# 2 a,b,c a b,c
# The split data frame returned by the split method
df.x.str.split(",", n=1, expand=True)
# 0 1
# 0 a None
# 1 a b
# 2 a b,c
df.x.str.split(",")
# 0 [a]
# 1 [a, b]
# 2 [a, b, c]
df.x.str.split(",", expand=True)
# 0 1 2
# 0 a None None
# 1 a b None
# 2 a b c
The following form of assignment works only if each row has exactly 2 splits. In this example, it fails with the error “too many values to unpack (expected 2)”, because of the first row which has only one value instead of two:
df["y"], df["z"] = df.x.str.split(",", n=1)
According to the documentation of pandas.Series.str.split:
“If for a certain row the number of found splits < n, append None for padding up to n if expand=True.”
Extract the first group of character before the first white space into a new column named product
df = pandas.DataFrame({"raw_content": ["A xyz", "BB xyz lala", "CDE o li"]})
df[["product"]] = df["raw_content"].str.extract(r"^(\S+)")
df
Place product patterns in a capture group for extraction
df = pandas.DataFrame({"x": ["am", "an", "o", "bm", "bn", "cm"]})
product_pattern = "a|b|c"
df[["product", "element"]] = df.x.str.extract(f"({product_pattern})?(.*)")
df
df.style.format?
“Format the text display value of cells.”
import pandas
import numpy as np
df = pandas.DataFrame({"x":[np.nan, 1.0, "A"], "y":[2.0, np.nan, 3.0]})
df["z"] = df["y"]
df.style.format("{:.2f}", na_rep="")
df.style.format({0: '{:.2f}', 1: '£ {:.1f}'}, na_rep='MISS', precision=1)
Two methods to find rows present in one data frame but not in the other. Using merge:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']
Using drop_duplicates:
newdf = pandas.concat([df1, df2]).drop_duplicates(keep=False)
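A minimal sketch with two small example data frames (df1 and df2 here are illustrative assumptions, not from the original):
import pandas
df1 = pandas.DataFrame({"x": [1, 2, 3]})
df2 = pandas.DataFrame({"x": [2, 3, 4]})
# Rows of df2 that are not in df1
merged = df1.merge(df2, indicator=True, how='outer')
print(merged[merged['_merge'] == 'right_only'])
# Rows present in only one of the two data frames
print(pandas.concat([df1, df2]).drop_duplicates(keep=False))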
Warn in case the variable x is duplicated
import pandas
df = pandas.DataFrame({"x": ["a", "b", "c", "a"], "y": range(4)})
dup_x = df["x"].duplicated(keep=False)
if any(dup_x):
msg = "x values are not unique. "
msg += "The following duplicates are present:\n"
msg += f"{df.loc[dup_x]}"
raise ValueError(msg)
Drop duplicates
df["x"].drop_duplicates()
df["x"].drop_duplicates(keep=False)
df["x"].drop_duplicates(keep="last")
where replaces values that do not fit the condition and mask replaces values that fit the condition.
s = pandas.Series(range(5))
s.where(s > 1, 10)
s.mask(s > 1, 10)
On a data frame
import pandas
import numpy as np
df1 = pandas.DataFrame({'x':[0,np.nan, np.nan],
'y':['a',np.nan,'c']})
df2 = pandas.DataFrame({'x':[10, 11, 12],
'y':['x','y', np.nan]})
df1.mask(df1.isna(), df2)
df1.where(df1.isna(), df2)
Write to a text file using a context manager, then copy the file somewhere else.
with open("/tmp/bli.md", "w") as f:
f.write('Hola!')
Copy a file
import shutil
shutil.copy("/tmp/bli.md", "/tmp/bla.md")
Move a file
shutil.move("/tmp/bli.md", "/tmp/bla.md")
Create a file and a path object for example purposes
import pathlib
with open("/tmp/bli.md", "w") as f:
f.write('Hola!')
bli_path = pathlib.Path("/tmp/bli.md")
Delete a file if it exists
if bli_path.exists():
bli_path.unlink()
Create a directory and path object for example purposes
import pathlib
Delete a directory if it exists
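A minimal sketch covering both operations (the directory name /tmp/bli_dir is an illustrative assumption; shutil.rmtree also removes a non-empty directory, while pathlib’s rmdir() only works on empty ones):
import pathlib
import shutil
bli_dir = pathlib.Path("/tmp/bli_dir")
# Create the directory, including missing parents, without failing if it exists
bli_dir.mkdir(parents=True, exist_ok=True)
# Delete the directory and its contents if it exists
if bli_dir.exists():
    shutil.rmtree(bli_dir)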
See also:
Pathlib is an object oriented path API for python as explained in PEP 428
Instead of
import os
os.path.join('~','downloads')
You can use:
from pathlib import Path
Path('~') / 'downloads'
Data located in the home folder
data_dir = Path.home() / "repos/data/"
Check if a directory is empty using pathlib
import pathlib
p1 = pathlib.Path("/tmp/")
p2 = pathlib.Path("/tmp/thisisempty/")
p2.mkdir()
any(p1.iterdir()) # returns True
any(p2.iterdir()) # returns False
SO question that illustrates different levels of parents
import os
import pathlib
p = pathlib.Path('/path/to/my/file')
p.parents[0]
p.parents[1]
p.parent
“Note that os.path.dirname and pathlib treat paths with a trailing slash differently. The pathlib parent of some/path/ is some: While os.path.dirname on some/path/ returns some/path”:
pathlib.Path('some/path/').parent
os.path.dirname('some/path/')
Cross platform way to refer to the home directory
from pathlib import Path
Path.home()
If p is a pathlib object you can list file names corresponding to a file pattern as such:
[x.name for x in p.glob('**/*.csv')]
You can also use the simpler iterdir() method to list all files in the directory:
from pathlib import Path
dir_path = Path('/tmp')
for file_path in dir_path.iterdir():
print(file_path)
Temporarily add to the python path (SO question) in order to import scripts
import sys
sys.path.append('/path/to/dir')
# You might want to prepend if you want to overwrite a system package
sys.path.insert(0, "/home/rougipa/eu_cbm/eu_cbm_hat")
# If it's a pathlib object, you want to convert it to string first
sys.path.insert(0, str(path_lib_object))
To permanently add a package under development to the python path, add the following to your .bashrc or .bash_profile:
export PYTHONPATH="$HOME/repos/project_name/":$PYTHONPATH
Docs.python.org tempfile examples using a context manager
import tempfile
# create a temporary directory using the context manager
with tempfile.TemporaryDirectory() as tmpdirname:
print('created temporary directory', tmpdirname)
# directory and contents have been removed
Using pathlib to facilitate path manipulation on top of tempfile makes it possible to create new paths using the / path operator of pathlib:
import tempfile
from pathlib import Path
with tempfile.TemporaryDirectory() as tmpdirname:
temp_dir = Path(tmpdirname)
print(temp_dir, temp_dir.exists())
file_name = temp_dir / "test.txt"
file_name.write_text("bla bla bla")
print(file_name, "contains", file_name.open().read())
Outside the context manager, files have been destroyed
print(temp_dir, temp_dir.exists())
# /tmp/tmp81iox6s2 False
print(file_name, file_name.exists())
# /tmp/tmp81iox6s2/test.txt False
Python plotting for exploratory analysis is a great gallery of plot examples, each example is written in 5 different plotting libraries: pandas, plotnine, plotly, altair and R ggplot2. There is also one seaborn example.
For some complex plots, I directly pasted images of plots together as follows:
from PIL import Image
# Load the images
p_hexprov_eu = Image.open(composite_plot_dir / "hexprov_eu.png")
p_sink_eu = Image.open(composite_plot_dir / "sink_eu.png")
p_harea_eu = Image.open(composite_plot_dir / "harea_eu.png")
p_harv_nai_eu = Image.open(composite_plot_dir / "harv_nai_eu.png")
# Get the widths and heights of the images
harea_width, harea_height = p_harea_eu.size
hexprov_width, hexprov_height = p_hexprov_eu.size
sink_width, sink_height = p_sink_eu.size
harv_nai_width, harv_nai_height = p_harv_nai_eu.size
# Create a figure with 2 plot images pasted together
# Change the letter in the sink plot
g_sink.savefig(composite_plot_dir / "sink_eu.png")
# Load images again
p_hexprov_eu = Image.open(composite_plot_dir / "hexprov_eu.png")
p_sink_eu = Image.open(composite_plot_dir / "sink_eu.png")
# Determine the width of the combined image (the maximum width)
max_width = max(hexprov_width, sink_width)
# Create a new image with the combined height and maximum width
combined_height = hexprov_height + sink_height
# Offset the x axis of the top figure to align both axes
x_offset = 25
combined_image = Image.new("RGB", (max_width+x_offset, combined_height), color="white")
# Paste the individual images into the combined image
combined_image.paste(p_hexprov_eu, (x_offset, 0))
combined_image.paste(p_sink_eu, (0, hexprov_height))
# Save the combined image
combined_image.save(composite_plot_dir / "hexprov_sink.png")
All matplotlib examples require the following imports:
from matplotlib import pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
Simple line plot changing the figure size and the axes limit with pyplot
plt.rcParams['figure.figsize'] = [10, 10]
fig = plt.figure()
ax = plt.axes()
x = np.linspace(-1.5, 1.5, 1000)
ax.plot(x, 1-3*x)
ax.set_xlim(-6, 6)
ax.set_ylim(-6, 6)
Scatter plot, using a colour variable and the ‘jet’ colour map.
Y = np.array([1,-1,-1, 1])
X = np.array([
[-1, -1],
[ 1, -1],
[-1, 1],
[ 1, 1]])
fig = plt.figure()
ax = plt.axes()
ax.scatter(X[:,0], X[:,1],c=Y, cmap='jet')
Use another colour map
ax.scatter(X[:,0], X[:,1],c=Y, cmap='Spectral')
Plot the probability density function of the normal distribution.
\[f(x)=\frac{1}{\sigma{\sqrt {2\pi }}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}\]
With various sigma and mu values displayed in the legend.
fig = plt.figure()
ax = plt.axes()
x = np.linspace(-5, 5, 1000)
def pdensitynormal(x, sigma_squared, mu):
    sigma = np.sqrt(sigma_squared)
    return 1/(sigma*np.sqrt(2*np.pi))*np.exp(-1/2*((x-mu)/sigma)**2)
ax.plot(x, pdensitynormal(x, 0.2, 0), label=r"$\sigma^2=0.2, \mu=0$")
ax.plot(x, pdensitynormal(x, 1, 0), label=r"$\sigma^2=1, \mu=0$")
ax.plot(x, pdensitynormal(x, 5, 0), label=r"$\sigma^2=5, \mu=0$")
ax.plot(x, pdensitynormal(x, 0.5, -2), label=r"$\sigma^2=0.5, \mu=-2$")
ax.legend(loc="upper right")
plt.show()
Plot a 3D surface
from mpl_toolkits import mplot3d # Required for 3d plots
fig = plt.figure()
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
xline = np.linspace(-10, 10, 1000)
yline = np.linspace(-10, 10, 1000)
# Just a line
zline = xline**2 + yline**2
ax.plot3D(xline, yline, zline, 'gray')
# A mesh grid
X, Y = np.meshgrid(xline, yline)
Z = X**2 + Y**2
ax.contour3D(X, Y, Z, 50, cmap='binary')
# Scatter points
ax.scatter(1,2,3)
plt.show()
See how the np.meshgrid objects interact with each other. Note this nested loop is not the optimal way to compute it. Better to use X**2 + Y**2 directly as above.
for i in range(Z.shape[0]):
for j in range(Z.shape[1]):
vector = np.array([X[i,j],Y[i,j]])
Z[i,j] = np.linalg.norm(vector)**2
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
Create an axes object
import pandas
df = pandas.DataFrame({'x':range(0,30), 'y':range(110,140)})
plot = df.plot(x="x", y="y", kind="scatter")
help(plot)
Create another axes object for a faceted plot
It has the following methods
print([m for m in dir(plot) if not m.startswith("_")])
['ArtistList', 'acorr', 'add_artist', 'add_callback', 'add_child_axes',
'add_collection', 'add_container', 'add_image', 'add_line', 'add_patch',
'add_table', 'angle_spectrum', 'annotate', 'apply_aspect', 'arrow',
'artists', 'autoscale', 'autoscale_view', 'axes', 'axhline', 'axhspan',
'axis', 'axison', 'axline', 'axvline', 'axvspan', 'bar', 'bar_label',
'barbs', 'barh', 'bbox', 'boxplot', 'broken_barh', 'bxp', 'callbacks',
'can_pan', 'can_zoom', 'child_axes', 'cla', 'clabel', 'clear', 'clipbox',
'cohere', 'collections', 'containers', 'contains', 'contains_point',
'contour', 'contourf', 'convert_xunits', 'convert_yunits', 'csd',
'dataLim', 'drag_pan', 'draw', 'draw_artist', 'end_pan', 'errorbar',
'eventplot', 'figure', 'fill', 'fill_between', 'fill_betweenx', 'findobj',
'fmt_xdata', 'fmt_ydata', 'format_coord', 'format_cursor_data',
'format_xdata', 'format_ydata', 'get_adjustable', 'get_agg_filter',
'get_alpha', 'get_anchor', 'get_animated', 'get_aspect',
'get_autoscale_on', 'get_autoscalex_on', 'get_autoscaley_on',
'get_axes_locator', 'get_axisbelow', 'get_box_aspect', 'get_children',
'get_clip_box', 'get_clip_on', 'get_clip_path', 'get_cursor_data',
'get_data_ratio', 'get_default_bbox_extra_artists', 'get_facecolor',
'get_fc', 'get_figure', 'get_frame_on', 'get_gid', 'get_gridspec',
'get_images', 'get_in_layout', 'get_label', 'get_legend',
'get_legend_handles_labels', 'get_lines', 'get_mouseover', 'get_navigate',
'get_navigate_mode', 'get_path_effects', 'get_picker', 'get_position',
'get_rasterization_zorder', 'get_rasterized', 'get_renderer_cache',
'get_shared_x_axes', 'get_shared_y_axes', 'get_sketch_params', 'get_snap',
'get_subplotspec', 'get_tightbbox', 'get_title', 'get_transform',
'get_transformed_clip_path_and_affine', 'get_url', 'get_visible',
'get_window_extent', 'get_xaxis', 'get_xaxis_text1_transform',
'get_xaxis_text2_transform', 'get_xaxis_transform', 'get_xbound',
'get_xgridlines', 'get_xlabel', 'get_xlim', 'get_xmajorticklabels',
'get_xminorticklabels', 'get_xscale', 'get_xticklabels', 'get_xticklines',
'get_xticks', 'get_yaxis', 'get_yaxis_text1_transform',
'get_yaxis_text2_transform', 'get_yaxis_transform', 'get_ybound',
'get_ygridlines', 'get_ylabel', 'get_ylim', 'get_ymajorticklabels',
'get_yminorticklabels', 'get_yscale', 'get_yticklabels', 'get_yticklines',
'get_yticks', 'get_zorder', 'grid', 'has_data', 'have_units', 'hexbin',
'hist', 'hist2d', 'hlines', 'ignore_existing_data_limits', 'images',
'imshow', 'in_axes', 'indicate_inset', 'indicate_inset_zoom', 'inset_axes',
'invert_xaxis', 'invert_yaxis', 'is_transform_set', 'label_outer',
'legend', 'legend_', 'lines', 'locator_params', 'loglog',
'magnitude_spectrum', 'margins', 'matshow', 'minorticks_off',
'minorticks_on', 'mouseover', 'name', 'patch', 'patches', 'pchanged',
'pcolor', 'pcolorfast', 'pcolormesh', 'phase_spectrum', 'pick', 'pickable',
'pie', 'plot', 'plot_date', 'properties', 'psd', 'quiver', 'quiverkey',
'redraw_in_frame', 'relim', 'remove', 'remove_callback', 'reset_position',
'scatter', 'secondary_xaxis', 'secondary_yaxis', 'semilogx', 'semilogy',
'set', 'set_adjustable', 'set_agg_filter', 'set_alpha', 'set_anchor',
'set_animated', 'set_aspect', 'set_autoscale_on', 'set_autoscalex_on',
'set_autoscaley_on', 'set_axes_locator', 'set_axis_off', 'set_axis_on',
'set_axisbelow', 'set_box_aspect', 'set_clip_box', 'set_clip_on',
'set_clip_path', 'set_facecolor', 'set_fc', 'set_figure', 'set_frame_on',
'set_gid', 'set_in_layout', 'set_label', 'set_mouseover', 'set_navigate',
'set_navigate_mode', 'set_path_effects', 'set_picker', 'set_position',
'set_prop_cycle', 'set_rasterization_zorder', 'set_rasterized',
'set_sketch_params', 'set_snap', 'set_subplotspec', 'set_title',
'set_transform', 'set_url', 'set_visible', 'set_xbound', 'set_xlabel',
'set_xlim', 'set_xmargin', 'set_xscale', 'set_xticklabels', 'set_xticks',
'set_ybound', 'set_ylabel', 'set_ylim', 'set_ymargin', 'set_yscale',
'set_yticklabels', 'set_yticks', 'set_zorder', 'sharex', 'sharey',
'specgram', 'spines', 'spy', 'stackplot', 'stairs', 'stale',
'stale_callback', 'start_pan', 'stem', 'step', 'sticky_edges',
'streamplot', 'table', 'tables', 'text', 'texts', 'tick_params',
'ticklabel_format', 'title', 'titleOffsetTrans', 'transAxes', 'transData',
'transLimits', 'transScale', 'tricontour', 'tricontourf', 'tripcolor',
'triplot', 'twinx', 'twiny', 'update', 'update_datalim', 'update_from',
'use_sticky_edges', 'viewLim', 'violin', 'violinplot', 'vlines', 'xaxis',
'xaxis_date', 'xaxis_inverted', 'xcorr', 'yaxis', 'yaxis_date',
'yaxis_inverted', 'zorder']
Create a figure object
import pandas
df = pandas.DataFrame({'x':range(0,30), 'y':range(110,140)})
plot = df.plot(x="x", y="y", kind="scatter")
fig = plot.get_figure()
help(fig)
A figure object is the “The top level container for all the plot elements.” It has the following methods:
print([m for m in dir(fig) if not m.startswith("_")])
['add_artist', 'add_axes', 'add_axobserver', 'add_callback',
'add_gridspec', 'add_subfigure', 'add_subplot', 'align_labels',
'align_xlabels', 'align_ylabels', 'artists', 'autofmt_xdate', 'axes',
'bbox', 'bbox_inches', 'callbacks', 'canvas', 'clear', 'clf', 'clipbox',
'colorbar', 'contains', 'convert_xunits', 'convert_yunits', 'delaxes',
'dpi', 'dpi_scale_trans', 'draw', 'draw_artist', 'draw_without_rendering',
'execute_constrained_layout', 'figbbox', 'figimage', 'figure', 'findobj',
'format_cursor_data', 'frameon', 'gca', 'get_agg_filter', 'get_alpha',
'get_animated', 'get_axes', 'get_children', 'get_clip_box', 'get_clip_on',
'get_clip_path', 'get_constrained_layout', 'get_constrained_layout_pads',
'get_cursor_data', 'get_default_bbox_extra_artists', 'get_dpi',
'get_edgecolor', 'get_facecolor', 'get_figheight', 'get_figure',
'get_figwidth', 'get_frameon', 'get_gid', 'get_in_layout', 'get_label',
'get_layout_engine', 'get_linewidth', 'get_mouseover', 'get_path_effects',
'get_picker', 'get_rasterized', 'get_size_inches', 'get_sketch_params',
'get_snap', 'get_tight_layout', 'get_tightbbox', 'get_transform',
'get_transformed_clip_path_and_affine', 'get_url', 'get_visible',
'get_window_extent', 'get_zorder', 'ginput', 'have_units', 'images',
'is_transform_set', 'legend', 'legends', 'lines', 'mouseover', 'number',
'patch', 'patches', 'pchanged', 'pick', 'pickable', 'properties', 'remove',
'remove_callback', 'savefig', 'sca', 'set', 'set_agg_filter', 'set_alpha',
'set_animated', 'set_canvas', 'set_clip_box', 'set_clip_on',
'set_clip_path', 'set_constrained_layout', 'set_constrained_layout_pads',
'set_dpi', 'set_edgecolor', 'set_facecolor', 'set_figheight', 'set_figure',
'set_figwidth', 'set_frameon', 'set_gid', 'set_in_layout', 'set_label',
'set_layout_engine', 'set_linewidth', 'set_mouseover', 'set_path_effects',
'set_picker', 'set_rasterized', 'set_size_inches', 'set_sketch_params',
'set_snap', 'set_tight_layout', 'set_transform', 'set_url', 'set_visible',
'set_zorder', 'show', 'stale', 'stale_callback', 'sticky_edges', 'subfigs',
'subfigures', 'subplot_mosaic', 'subplotpars', 'subplots',
'subplots_adjust', 'suppressComposite', 'suptitle', 'supxlabel',
'supylabel', 'text', 'texts', 'tight_layout', 'transFigure',
'transSubfigure', 'update', 'update_from', 'waitforbuttonpress', 'zorder']
When x and y are supposed to be the same value but are not necessarily equal. Compare the x and y values on a scatter plot to a y=x line.
def comp_plot(df, x_var, y_var, title):
"""Plot comparison for the given data frame"""
# Scatter plot
plt.scatter(df[x_var], df[y_var])
# 1:1 line
line = np.linspace(df[x_var].min(), df[x_var].max(), 100)
plt.plot(line, line, 'r--')
plt.xlabel(f'{x_var} additional text')
plt.ylabel(f'{y_var} additional text')
plt.title(title)
return plt
Note comparing suggestions from Bard and ChatGPT-4
# Create the 1:1 line suggested by bard
line_x = np.linspace(x.min(), x.max(), 100)
line_y = line_x
plt.plot(line_x, line_y, 'r--')
# 1:1 line suggested by GPT4 (wrong in some way)
plt.plot([min(x), max(x)], [min(y), max(y)], 'r')
Save a plot to a file. This works with pandas plots and seaborn plots as well. With the pyplot object, it only works immediately after building the plot.
plt.savefig("/tmp/bli.pdf")
plt.savefig("/tmp/bli.png")
plt.savefig("file_name.svg", bbox_inches='tight')
Save a plot object to a pdf file
import pandas
df = pandas.DataFrame({'x':range(0,30), 'y':range(110,140)})
plot = df.plot(x="x", y="y", kind="scatter")
fig = plot.get_figure()
fig.savefig('/tmp/output.pdf', format='pdf')
Save a grid plot object to a pdf file
fmri = seaborn.load_dataset("fmri")
g = seaborn.relplot(
data=fmri, x="timepoint", y="signal", col="region",
hue="event", style="event", kind="line",
facet_kws={'sharey': False, 'sharex': False}
)
g.savefig("/tmp/fmri.pdf")
The function df.plot() returns a matplotlib axes object for a plot of the A and B variables. You can add another line for a different variable C using the plot() method of that axes object.
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame({"A":range(0,30), "B":range(10,40)})
df["C"] = df["B"] + 2
# Using the plot method
ax = df.plot(x="A", y="B")
ax.plot(df["A"], df["C"])
plt.show()
Example values for the df.plot() function:
- figsize=(3,3) changes the figure size. An SO answer links to the documentation that explains that
> "plt.figure(figsize=(10,5)) doesn't work because df.plot() creates its
> own matplotlib.axes.Axes object, the size of which cannot be changed
> after the object has been created."
- title='bla bla' adds a plot title
- colormap changes the colours
Create some data and change the xticks labels
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame({'x':range(0,30), 'y':range(10,40)})
df.set_index('x', inplace=True)
plot = df.plot(title='Two ranges')
type(plot)
# help(plot)
plot.set_xticks(range(0,31,10), minor=False)
plt.show()
Simple palette as a dictionary
palette = {'ssp2': 'orange',
'fair': 'green',
'historical_period': 'black'}
df.plot(title = "Harvest Scenarios", ylabel="Million m3", color=palette)
Note: the argument for seaborn would be palette=palette.
Using a list of colours with matplotlib ListedColormap (see also example in that documentation page): Reusing the data frame from the previous section
from matplotlib.colors import ListedColormap
df["z"] = 39
df["a"] = 10
df.plot(colormap=ListedColormap(["red","green","orange"]), figsize=(3,3))
plt.show()
plt.savefig("/tmp/plotpalette.png")
Using a seaborn palette with the as_cmap=True argument:
palette = seaborn.color_palette("rocket_r", as_cmap=True)
df.plot(colormap=palette, figsize=(3,3))
# plt.show()
plt.savefig("/tmp/plotpalette2.png")
Histogram
iris["petal_width"].hist(bins=20)
Options for title, labels, colours
import pandas
import matplotlib.pyplot as plt
series = pandas.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
series.hist(grid=False, bins=20, rwidth=0.9, color='#607c8e')
plt.title('Title')
plt.xlabel('Counts')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()
Histogram with a log scale
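A minimal sketch reusing the series from the histogram example above; one simple approach (an assumption, other options exist such as passing log=True) is to switch the y axis to a logarithmic scale after drawing:
series.hist(bins=20)
plt.yscale("log")
plt.show()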
Using the same df as above, show 2 plots side by side, based on this SO answer
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))
df.plot(title='Two ranges', ax=ax1)
df.plot(title='Two ranges', ax=ax2)
plt.show()
The advantage of plotly is that it provides dynamic visualisation inside web pages, such as the possibility to zoom in on a graph. It is the open source component of a commercial product called Dash Enterprise.
For example this notebook on machine learning used to enhance the localisation of weather forecasts. Seen on this blog post What does machine learning have to do with weather.
https://plotly.com/python/bubble-charts/ example:
import plotly.express as px
df = px.data.gapminder()
fig = px.scatter(df.query("year==2007"), x="gdpPercap", y="lifeExp",
                 size="pop", color="continent",
                 hover_name="country", log_x=True, size_max=60)
fig.show()
Facet chart where the y facet labels were removed and replaced with a common annotation. (There was an issue with the annotation disappearing when going full screen in a streamlit app.)
y_var = f"{flow} {element}"
fig = plotly.express.line(
# shorten the plot facet titles
df.rename(columns={"product_name": "p",
flow: y_var}),
x="period",
y=y_var,
color="partner",
facet_row="p",
line_group="partner",
)
# Remove y facet labels
for axis in fig.layout:
if type(fig.layout[axis]) == plotly.graph_objects.layout.YAxis:
fig.layout[axis].title.text = ''
# Update y label, by adding to the existing annotation
fig.layout.annotations += (
dict(
x=0,
y=0.5,
showarrow=False,
text=f"{flow} {element}",
textangle=-90,
# xanchor='left',
# yanchor="middle",
xref="paper",
yref="paper"
), # Keep this comma, this needs to be a tuple
)
Plotly figures are normally rendered as HTML pages, but you can convert a figure to a static image file with the write_image method:
fig.write_image("/tmp/fig.png")
When x and y are supposed to be the same value but are not necessarily equal. Compare the x and y values on a scatter plot to a 1:1 line.
import plotly.graph_objects as go
def comp_plotly(df, x_var, y_var, title):
    """Plot comparison for the given data frame"""
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df[x_var], y=df[y_var], mode='markers', name='Data'))
    # 1:1 line from the minimum to the maximum of the x variable
    x_min, x_max = df[x_var].min(), df[x_var].max()
    fig.add_trace(go.Scatter(x=[x_min, x_max], y=[x_min, x_max], mode='lines', name='1:1 Line'))
    fig.update_layout(
        title=title,
        xaxis_title=x_var,
        yaxis_title=y_var
    )
# Add the reporter, partner, and year to the tooltip
fig.update_traces(
hoverinfo='text',
hovertext=list(zip(df['reporter'], df['partner'], df['year']))
)
return fig
this_primary_product = "rape_or_colza_seed"
selector = comp_2["primary_product"] == this_primary_product
comp_plotly(comp_2.loc[selector],
x_var = 'primary_crop_eq_re_allocated_2nd_level_imported',
y_var = 'primary_eq_imp_alloc_1',
title = f"Step 2 primary crop import {this_primary_product}")
Grammar of graphics for python https://github.com/has2k1/plotnine
Create a facet grid plot
from plotnine import ggplot, aes, geom_line, facet_grid, labs
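A minimal sketch, assuming a data frame df with illustrative columns year, value, country and variable (these names and data are not from the original; the formula string form of facet_grid is used here):
import pandas
from plotnine import ggplot, aes, geom_line, facet_grid, labs
df = pandas.DataFrame({"year": [2000, 2001] * 4,
                       "value": range(8),
                       "country": ["a"] * 4 + ["b"] * 4,
                       "variable": ["gdp", "gdp", "pop", "pop"] * 2})
p = (
    ggplot(df, aes(x="year", y="value", color="variable"))
    + geom_line()
    # One facet row per country
    + facet_grid("country ~ .")
    + labs(x="Year", y="Value")
)
print(p)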
All Seaborn examples below require the following imports and datasets:
import seaborn
iris = seaborn.load_dataset("iris")
tips = seaborn.load_dataset("tips")
fmri = seaborn.load_dataset("fmri")
from matplotlib import pyplot as plt
Resources
Use Figure-level interface for drawing plots onto a FacetGrid:
catplot for drawing categorical plots
relplot for drawing relational plots
The figure level interfaces return FacetGrid objects which can be reused to add subsequent layers.
Seaborn version 0.12 introduced an objects interface which can also be used to make facet plots.
This hack uses an image library to combine many plots together:
# Load the images
p_harea_eu = Image.open(composite_plot_dir / "harea_eu.png")
p_hexprov_eu = Image.open(composite_plot_dir / "hexprov_eu.png")
p_sink_eu = Image.open(composite_plot_dir / "sink_eu.png")
# Get the widths and heights of the images
harea_width, harea_height = p_harea_eu.size
hexprov_width, hexprov_height = p_hexprov_eu.size
sink_width, sink_height = p_sink_eu.size
# Determine the width of the combined image (the maximum width)
max_width = max(harea_width, hexprov_width, sink_width)
# Create a new image with the combined height and maximum width
combined_height = harea_height + hexprov_height + sink_height
combined_image = Image.new("RGB", (max_width, combined_height), color="white")
# Paste the individual images into the combined image
combined_image.paste(p_hexprov_eu, (0, 0))
combined_image.paste(p_harea_eu, (0, hexprov_height))
combined_image.paste(p_sink_eu, (0, harea_height + hexprov_height))
# Save the combined image
combined_image.save(composite_plot_dir / "combined_image.png")
Change row and column labels to display only the content (not “label=”) and change the size to 30.
import seaborn
seaborn.set_theme(style="darkgrid")
df = seaborn.load_dataset("penguins")
g = seaborn.displot(
df, x="flipper_length_mm", col="species", row="sex",
binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)
g.fig.subplots_adjust(top=.9, bottom=0.1, right=0.9)
g.set_titles(row_template="{row_name}", col_template="{col_name}", size=30)
See also the figure size section.
Iterate over facet objects
for i, ax in enumerate(g.axes.flatten()):
print(i, ax.title.get_text())
for ax in g.axes.flatten():
this_forest_type = ax.title.get_text()
Facet line plot example
import seaborn
import seaborn.objects as so
healthexp = seaborn.load_dataset("healthexp")
p = (
so.Plot(healthexp, x="Year", y="Life_Expectancy")
.facet("Country", wrap=3)
.add(so.Line(alpha=.3), group="Country", col=None)
.add(so.Line(linewidth=3))
)
p.show()
Example from my data
import seaborn.objects as so
(
so.Plot(df, x="age", y="volume")
.facet("forest_type", wrap=6)
.add(so.Line(alpha=.3), group="forest_type", col=None)
.add(so.Line(linewidth=3))
)
https://seaborn.pydata.org/generated/seaborn.objects.Plot.label.html
p = (
so.Plot(penguins, x="bill_length_mm", y="bill_depth_mm")
.add(so.Dot(), color="species")
)
p.label(x="Length", y="Depth", color="")
https://seaborn.pydata.org/generated/seaborn.objects.Plot.facet.html
“Use Plot.share() to specify whether facets should be scaled the same way:
p.facet("clarity", wrap=3).share(x=False)
Save to a file
Plot.save(loc, **kwargs)
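For example (the file path is an illustrative assumption), reusing the p object built above; extra keyword arguments are passed on to matplotlib savefig:
p.save("/tmp/plot.png", dpi=150)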
Invert the y axis of a single figure
ax.invert_yaxis()
Invert the y axis of a grid figure
g = seaborn.relplot(x='crop', y='ranking', col='intensity',
hue='conservation_target', data=df)
for ax in g.axes[0]:
ax.invert_yaxis()
Use p.set() to set a y label and a title
p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.set(xlabel = "Petal Length", ylabel = "Petal Width", title = "Flower sizes")
plt.show()
Use scientific notation on the axes of a FacetGrid object g:
for axes in g.axes.flat:
axes.ticklabel_format(axis='both', style='scientific', scilimits=(0, 0))
Do not use scientific notation on the axes
plt.ticklabel_format(style='plain', axis='y')
Use the scientific notation on the y axis labels, at every tick, without putting a 1e7 at the top that might be overwritten by a facet label.
g = seaborn.relplot(
data=rp_global.reset_index(), x="step", y="primary_eq", col="primary_product",
hue="year", kind="line",
col_wrap=3, height=3,
facet_kws={'sharey': False, 'sharex': False}
)
def y_fmt(x, pos):
"""function to format the y axis"""
return f"{x:.0e}"
from matplotlib.ticker import FuncFormatter
#g.set(yticklabels=[])
for ax in g.axes.flat:
ax.yaxis.set_major_formatter(FuncFormatter(y_fmt))
Rotate index labels
plt.xticks(rotation=70)
plt.tight_layout()
plt.show()
Set a common title for grid plots
g = seaborn.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
# Supplementary title
g.fig.suptitle("I don't smoke and I don't tip.")
Change axis label
g.set_ylabels("Y label")
Add larger axis labels for grid plots
g.fig.supxlabel("time in years")
g.fig.supylabel("weight in kg")
In case the title is overwritten on the subplots, you might need to use fig.subplot_adjust() as such:
g.fig.subplots_adjust(top=.95)
Set limits on one axis in a Seaborn plot:
p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.set(ylim=(-2,None))
In a Seaborn facet grid, see How to set xlim and ylim in seaborn facet grid:
g = seaborn.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
g.set(ylim=(0, None))
Years are sometimes displayed with commas, convert them to date time objects to avoid this:
pandas.to_datetime(comp["year"], format="%Y")
Use set_title
to add a title:
(seaborn
.scatterplot(x="total_bill", y="tip", data=tips)
.set_title('Progression of tips along the bill amount')
)
Bar plot
import matplotlib.pyplot as plt
import seaborn
iris = seaborn.load_dataset("iris")
iris_agg = iris.groupby("species").agg("sum")
iris_agg_long = iris_agg.melt(ignore_index=False).reset_index()
seaborn.barplot(data=iris_agg_long, x="variable", y="value", hue="species")
Rotate index labels
plt.xticks(rotation=70)
plt.tight_layout()
plt.show()
Other example
seaborn.barplot(df, x="scenario", y="value", hue="variable")
For stacked bar, use df.plot(), which uses matplotlib
p = df.plot.bar(stacked=True)
Draw a facet bar plot from SO for each combination of size and smoker/non smoker
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
tips=sns.load_dataset("tips")
g = sns.FacetGrid(tips, col = 'size', row = 'smoker', hue = 'day')
g = (g.map(sns.barplot, 'time', 'total_bill', errorbar = None).add_legend())
plt.show()
Another example https://stackoverflow.com/a/35234137/2641825
times = df.interval.unique()
g = sns.FacetGrid(df, row="variable", hue="segment", palette="Set3", size=4, aspect=2)
g.map(sns.barplot, 'interval', 'value', order=times)
https://seaborn.pydata.org/examples/faceted_histogram.html
import seaborn
seaborn.set_theme(style="darkgrid")
df = seaborn.load_dataset("penguins")
seaborn.displot(
df, x="flipper_length_mm", col="species", row="sex",
binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)
Bar plot using the object interface
import seaborn
import seaborn.objects as so
titanic = seaborn.load_dataset("titanic")
p = so.Plot(titanic, x="class", color="sex")
p = p.add(so.Bar(), so.Count(), so.Stack())
p.show()
Bar plot with facets
p = p.facet("sex")
p.show()
Other example of a bar plot with facets, using a palette
p = so.Plot(df_long, x="year", y="value", color="sink")
p = p.add(so.Bar(), so.Stack())
p = p.facet("pathway", "country_group").share(x=False)
p = p.layout(size=(14, 9), engine="tight")
palette = {'living_biomass_sink': 'forestgreen',
'dom_sink': 'gold',
'soil_sink': 'black',
'hwp_sink_bau': 'chocolate'}
p = p.scale(x=so.Continuous().tick(at=selected_years), color=palette)
p = p.label(x="", y="Million t CO2 eq", color="")
# p = p.scale(color=palette)
years_string = "_".join([str(x) for x in selected_years])
index_string = "".join([x[0] for x in index])
print(composite_plot_dir)
p.save(composite_plot_dir / f"sink_composition_{index_string}_{years_string}.png")
See various examples in the plots of the seaborn section. The palette can be defined from pre-existing palettes:
palette = seaborn.color_palette("rocket_r")
Without argument this function displays the default palette
seaborn.color_palette()
It can translate a list of colour codes into a palette
seaborn.color_palette(["r","g","b"])
seaborn.color_palette(["red","green","blue", "orange"])
This function is used internally by the palette argument of plotting functions:
p1 = sns.relplot(x="Growth", y="Value", hue="Risk", col="Mcap", data=mx, s=200, palette=['r', 'g', 'y'])
Another example using a dictionary for the palette
palette = {"fair":"green", "ssp2":"orange", "historical":"black"}
p = seaborn.lineplot(x="year", y="gdp_t", hue="scenario", data=df_gdp_eu, palette=palette)
Seaborn tutorial on choosing colour palettes https://seaborn.pydata.org/tutorial/color_palettes.html
According to https://stackoverflow.com/a/46174007/2641825 you can also use a dictionary to associate hue values to a palette element.
selected_products = ["wood_fuel",
"sawlogs_and_veneer_logs",
"pulpwood_round_and_split_all_species_production",
"other_industrial_roundwood"]
palette = dict(zip(selected_products, ["red", "brown", "blue", "grey"]))
Generate darker and lighter green and orange colours
lighter_green = seaborn.dark_palette('green', n_colors=5)[0]
darker_green = seaborn.dark_palette('green', n_colors=5, reverse=True)[0]
lighter_orange = seaborn.dark_palette('orange', n_colors=5)[0]
darker_orange = seaborn.dark_palette('orange', n_colors=5, reverse=True)[0]
https://seaborn.pydata.org/tutorial/color_palettes.html
“It’s also possible to pass a list of colors specified any way that matplotlib accepts (an RGB tuple, a hex code, or a name in the X11 table).”
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
colors = dict(mcolors.BASE_COLORS, **mcolors.CSS4_COLORS)
SO answer has a plot of this dict of colors with names.
This example of specifying line styles doesn’t work:
linestyle_dict = {'Industrial roundwood': 'solid', 'Fuelwood': 'dotted'}
g = sns.relplot(data=df.loc[selector], x='year', y='demand', col='country',
hue='combo_name', style="faostat_name", kind='line',
col_wrap=col_wrap, palette=palette_combo,
facet_kws={'sharey': False, 'sharex': False},
dashes=linestyle_dict)
Resize a scatter plot
p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.figure.set_figwidth(15)
set_figwidth and set_figheight work well to resize a grid object in its entirety.
g = seaborn.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
g.fig.set_figwidth(10)
g.fig.set_figheight(10)
Try also
g.fig.set_size_inches(15,15)
Mentioned as a comment under this answer
To change the height and aspect ratio of individual grid cells, you can use the height and aspect arguments of the FacetGrid call as such:
import seaborn
import matplotlib.pyplot as plt
seaborn.set()
iris = seaborn.load_dataset("iris")
# Change height and aspect ratio
g = seaborn.FacetGrid(iris, col="species", height=8, aspect=0.3)
iris['species'] = iris['species'].astype('category')
g.map(seaborn.scatterplot,'petal_length','petal_width','species')
plt.show()
help(seaborn.FacetGrid)
aspect * height gives the width of each facet in inches.
Move a legend below a grid plot
g.fig.subplots_adjust(left=0.28, top=0.9) # resize the plot
g.legend.set_bbox_to_anchor((0.5, 0.15))
Another way to move the legend and make it flat
seaborn.move_legend(g, "upper center", bbox_to_anchor=(0.5, 0.1), ncol=4)
https://stackoverflow.com/questions/56575756/how-to-split-seaborn-legend-into-multiple-columns
Placing the legend out in a floating box with facet_kws={"legend_out": False},
Move a legend below the plot
seaborn.move_legend(g, "upper left", bbox_to_anchor=(.05, .05), frameon=False, ncol=4, title="")
Code snippet to redraw a legend (didn’t use it in the end)
h,l = g.axes[0].get_legend_handles_labels()
g.axes[0].legend_.remove()
g.fig.legend(h,l, ncol=4)
g.legend.set_bbox_to_anchor((.05,.05)) #, transform=g.fig.transFigure)
Create a line plot with a title and axis labels.
import numpy as np
df = pandas.DataFrame({'value':np.random.random(100),
'year':range(1901,2001)})
p = seaborn.lineplot(x="year", y="value", data=df)
p.set(ylabel = "Random variation", title = "Title here")
plt.show()
Example generated by GPT-4 with a series of prompts related to a time series plot I was refining.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
# Set a random seed for reproducibility
np.random.seed(42)
# Create a synthetic dataset with a random walk
years = np.arange(2000, 2031)
categories = ['x', 'y']
data = []
for category in categories:
random_walk = np.random.randn(len(years)).cumsum()
data.extend(zip(years, [category] * len(years), random_walk))
df = pd.DataFrame(data, columns=['year', 'category', 'value'])
# Create a custom linestyle and color dictionary for each category (x, y)
style_dict = {'x': ('-', 'black'), 'y': ('--', 'black')}
# Plot the lineplot
ax = sns.lineplot(
x="year",
y="value",
hue="category",
style="category",
data=df
)
# Apply custom linestyle and color for each category (x, y)
for line, category in zip(ax.lines, df["category"].unique()):
linestyle, color = style_dict[category]
line.set_linestyle(linestyle)
line.set_color(color)
# Set the ylabel and title
ax.set(ylabel="Value", title="Random Walk by Category")
# Modify the legend colors to black
legend = ax.legend()
for handle in legend.legendHandles:
handle.set_color('black')
plt.show()
help(seaborn.relplot)
> "This function provides access to several different axes-level functions
> that show the relationship between two variables with semantic mappings
> of subsets. The ``kind`` parameter selects the underlying axes-level
> function to use:
> - :func:`scatterplot` with ``kind="scatter"``; the default
> - :func:`lineplot` with ``kind="line"``
> Extra keyword arguments are passed to the underlying function, so you should
> refer to the documentation for each to see kind-specific options."
Example:
- plot signal through time and facet along the region
- use different axes size. This requires passing a dictionary to FacetGrid.
- Add a y label
- adjust the left margin so that the y label doesn’t overwrite the axis
- Set the Y limit to zero
g = seaborn.relplot(
    data=fmri, x="timepoint", y="signal", col="region",
    hue="event", style="event", kind="line",
    col_wrap=1, height=3,
    facet_kws={'sharey': False, 'sharex': False}
)
g.fig.supylabel("Adaptive Engagement of Cognitive Control")
g.fig.subplots_adjust(left=0.28, top=0.9)
g.fig.suptitle("Example")
g.set_ylabels("Y label")
g.set(ylim=(0, None))
plt.show()
Older example from https://seaborn.pydata.org/examples/faceted_lineplot.html
import seaborn as sns
sns.set_theme(style="ticks")
dots = sns.load_dataset("dots")
# Define the palette as a list to specify exact values
palette = sns.color_palette("rocket_r")
# Plot the lines on two facets
g = sns.relplot(
data=dots,
x="time", y="firing_rate",
hue="coherence", size="choice", col="align",
kind="line", size_order=["T1", "T2"], palette=palette,
height=5, aspect=.75, facet_kws=dict(sharex=False),
)
g.fig.suptitle("Dots example")
# Add a title and adjust the position so the title doesn't overwrite facets
g.set_ylabels("Y label")
plt.subplots_adjust(top=0.9)
Add a marker and text to a plot; works both with simple plots and faceted plots.
plt.plot(2030, -420, marker='*', markersize=10, color='red')
plt.text(2030+1, -420, "Target -420", fontsize=10)
From [SO answer](https://stackoverflow.com/a/59775753/2641825)
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
data = [['APOLLOHOSP', 8, 6, 'High', 'small'],
['ANUP', 8, 7, 'High', 'small'],
['SIS', 4, 6, 'High', 'mid'],
['HAWKINCOOK', 5, 2, 'Low', 'mid'],
['NEULANDLAB', 6, 4, 'Low', 'large'],
['ORIENTELEC', 7, 9, 'Low', 'mid'],
['AXISBANK', 2, 3, 'Medium', 'mid'],
['DMART', 4, 1, 'Medium', 'large'],
['ARVIND', 2, 10, 'Medium', 'small'],
['TCI', 1, 7, 'High', 'mid'],
['MIDHANI', 5, 5, 'Low', 'large'],
['RITES', 6, 4, 'Medium', 'mid'],
['COROMANDEL', 9, 9, 'High', 'small'],
['SBIN', 10, 3, 'Medium', 'large']]
mx = pd.DataFrame(data=data, columns=["code", "Growth", "Value", "Risk", "Mcap"])
plotnum = {'small': 0, 'mid': 1, 'large': 2}
p1 = sns.relplot(x="Growth", y="Value", hue="Risk", col="Mcap", data=mx, s=200, palette=['r', 'g', 'y'])
for ax in p1.axes[0]:
ax.set_xlim(0.0, max(mx["Growth"]) + 1.9)
for row in mx.itertuples():
print(row)
ax = p1.axes[0, plotnum[row.Mcap]]
ax.text(row.Growth + 0.5, row.Value, row.code, horizontalalignment='left')
plt.show()
Create a scatter plot
import seaborn
import matplotlib.pyplot as plt
tips = seaborn.load_dataset("tips")
seaborn.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()
Group by another variable and show the groups with different colors:
seaborn.scatterplot(x="total_bill", y="tip", hue="time", data=tips)
Create a scatter plot with a title and axis labels
p = seaborn.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
p.set(xlabel = "Petal Length", ylabel = "Petal Width", title = "Flower sizes")
plt.show()
Draw a scatter plot for each iris species, using the recommended
relplot()
function:
g = seaborn.relplot(x='petal_length', y='petal_width', col='species', hue='species', data=iris)
plt.show()
help(seaborn.relplot) explains that it returns a FacetGrid object:
> " After plotting, the :class:`FacetGrid` with the plot is returned and can be
> used directly to tweak supporting plot details or add other layers."
Old way: using FacetGrid directly requires changing the species to a categorical variable in order to have a different colour for each species.
g = seaborn.FacetGrid(iris, col="species", height=6)
iris['species'] = iris['species'].astype('category')
# Use map_dataframe to name the arguments
g.map_dataframe(seaborn.scatterplot,x='petal_length',y='petal_width',hue='species')
plt.show()
# Old way without named argument
g.map(seaborn.scatterplot,'petal_length','petal_width','species')
plt.show()
Notice that if you don’t change the species column to a categorical variable, the colour will not vary across the species. I reported this issue which led me to update the seaborn documentation in this merge request.
“When using seaborn functions that infer semantic mappings from a dataset, care must be taken to synchronize those mappings across facets. In other words some mechanism needs to ensure that the same mapping is used in each facet. This can be achieved for example by passing palette dictionaries or by defining categorical types in your dataframe. In most cases, it will be better to use a figure-level function (e.g. :func:`relplot` or :func:`catplot`) than to use :class:`FacetGrid` directly.”
A grid scatter plot with an x=y line for comparison purposes
g = seaborn.relplot(data=df,
x="x_var",
y="y_var",
col="year",
hue="partner",
kind="scatter",
)
g.fig.subplots_adjust(top=0.9)
# Add x=y line
for ax in g.axes.flat:
ax.plot(ax.get_xlim(), ax.get_ylim(), ls="--", c=".3", scalex=False, scaley=False)
Show all Seaborn sample datasets
for dataset in seaborn.get_dataset_names():
print(dataset)
print(seaborn.load_dataset(dataset).head())
Plot a tree map from the python graph gallery
import matplotlib.pyplot as plt
import squarify # pip install squarify (algorithm for treemap)
import pandas
df = pandas.DataFrame({'nb_people':[8,3,4,2], 'group':["group A", "group B", "group C", "group D"] })
squarify.plot(sizes=df['nb_people'], label=df['group'], alpha=.8 )
plt.axis('off')
plt.show()
Altair https://altair-viz.github.io/
“Vega-Altair is a declarative visualization library for Python. Its simple, friendly and consistent API, built on top of the powerful Vega-Lite grammar, empowers you to spend less time writing code and more time exploring your data.”
Vega Lite
Vega lite gallery https://vega.github.io/vega-lite-v1/examples/
Vega lite documentation on tooltips
from vega import VegaLite
import pandas
df = pandas.read_json("cars.json")
VegaLite({
    "data": {"url": "data/cars.json"},
    "mark": {"type": "point", "tooltip": True},
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"}
    }
}, df)
The tool tip feature is nice in an interactive notebook.
Ipython vega for Jupyter notebooks https://github.com/vega/ipyvega
Vega gallery https://vega.github.io/vega/examples/
How to print coloured text at the terminal?
“Print a string that starts a color/style, then the string, and then end the color/style change with ‘\x1b[0m’.”
For example
print(1000 * ("\x1b[1;32;44m" + "Winter" + "\x1b[0m" + ", " +
"\x1b[1;32;42m" + "Spring" + "\x1b[0m" + ", " +
"\x1b[1;35;41m" + "Summer" + "\x1b[0m" + ", " +
"\x1b[1;35;45m" + "Autumn" + "\x1b[0m" + ", "))
Run a script with the profiler, from within ipython
%run -i -p run_zz.py
Memory profiling https://stackoverflow.com/a/15682871/2641825
How can I time a code segment for testing performance with Pythons timeit?
Time a function:
import timeit
import time
def wait():
time.sleep(1)
timeit.timeit(wait, number=3)
“If you are profiling your code and can use IPython, it has the magic function %timeit. %%timeit operates on cells.”
%timeit wait()
import timeit
start_time = timeit.default_timer()
# code you want to evaluate
elapsed = timeit.default_timer() - start_time
See also the R page for more details on R.
Reddit python vs R
“R is for analysis. Python is for production. If you want to do analysis only, use R. If you want to do production only, use Python. If you want to do analysis then production, use Python for both. If you aren’t planning to do production then it’s not worth doing, (unless you’re an academic). Conclusion: Use python.”
The central objects in R are vectors, matrices and data frames, that is why I mostly compare R to the python packages numpy and pandas. R was created almost 20 years before numpy and more than 40 years before pandas.
“R is an implementation of the S programming language combined with lexical scoping semantics, inspired by Scheme. S was created by John Chambers in 1976 while at Bell Labs. A commercial version of S was offered as S-PLUS starting in 1988.”
“In 1995 the special interest group (SIG) matrix-sig was founded with the aim of defining an array computing package; among its members was Python designer and maintainer Guido van Rossum, who extended Python’s syntax (in particular the indexing syntax) to make array computing easier. […] An implementation of a matrix package was completed by Jim Fulton, then generalized by Jim Hugunin and called Numeric. […] new package called Numarray was written as a more flexible replacement for Numeric. Like Numeric, it too is now deprecated. […] In early 2005, NumPy developer Travis Oliphant wanted to unify the community around a single array package and ported Numarray’s features to Numeric, releasing the result as NumPy 1.0 in 2006.”
“Developer Wes McKinney started working on pandas in 2008 while at AQR Capital Management out of the need for a high performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR he was able to convince management to allow him to open source the library.”
“Python is a full fledge programming language but it is missing statistical and plotting libraries. Vectors are an after thought in python most functionality can be reproduced using operator overloading, but some functionality looks clumsy.”
R session showing a division by zero returning an infinite value.
> 1/0
[1] Inf
Python session showing a division by zero error for normal integer division and the same operation on a numpy array returning an infinite value with a warning.
In [1]: 1/0
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-1-9e1622b385b6> in <module>
----> 1 1/0
ZeroDivisionError: division by zero
In [2]: import numpy as np
In [3]: np.array([1]) / 0
/home/paul/.local/bin/ipython:1: RuntimeWarning: divide by zero encountered in true_divide
#!/usr/bin/python3
Out[3]: array([inf])
R data frame to be used for examples:
df = data.frame(x = 1:3, y = c('a','b','c'), stringsAsFactors = FALSE)
Pandas data frame to be used for examples:
import pandas
df = pandas.DataFrame({'x' : [1,2,3], 'y' : ['a','b','c']})
Base R | python or pandas | SO questions |
---|---|---|
df[df$y %in% c('a','b'),] | df[df['y'].isin(['a','b'])] | list of values to select a row |
dput(df) | df.to_dict(orient="list") | Print pandas data frame for reproducible example |
expand.grid(df$x,df$y) | itertools.product | see section below |
ifelse | df.where() | [ifelse in pandas] |
gsub | df.x.replace(regex=True) or df.x.str.replace() | gsub in pandas |
length(df) and dim(df) | df.shape | row count of a data frame |
rbind | pandas.concat | Pandas version of rbind |
rep(1,3) | [1]*3 | |
seq(1:5) | np.array(range(0,5)) | numpy function to generate sequences |
summary | describe | |
str | df.info() | pandas equivalents for R functions like str summary and head |
The mapping of tidyverse to pandas is:
tidyverse | pandas | Help or SO questions |
---|---|---|
arrange | df.sort_values(by="y", ascending=False) | |
df %>% select(-a,-b) | df.drop(columns=['x', 'y']) | |
select(a) | df.loc[:,"x"] # Strict, var has to be present | |
 | df.filter(items=['x']) # Not strict | |
select(contains("a")) | df.filter(regex='x') | |
filter | df.query("y=='b'") | |
group_by | groupby | |
lag | shift | pandas lag function |
mutate | df.assign(e = lambda x: x["a"] * 3) | assign |
pivot_longer | melt or wide_to_long | |
pivot_wider | pivot | |
rename | df.rename(columns={'a':'new'}) | |
separate | df[['b','c']] = df.a.str.split(',',n=1,expand=True) | pandas separate str section |
separate | df[['b','c']] = df.a.str.split(',',expand=True) | |
summarize | agg | |
unite | df["z"] = df.y + df.y | pandas unite |
unnest | explode | unnest in pandas |
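A minimal sketch of pandas explode, the equivalent of tidyr unnest (the data frame is an illustrative assumption):
import pandas
df = pandas.DataFrame({"x": [1, 2], "y": [["a", "b"], ["c"]]})
# Each element of the lists in y gets its own row
df.explode("y")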
Methods to use inside the .groupby().agg() method:
- sum
- count
- mean
- ', '.join to get a union of strings
This SO answer provides an implementation of expand grid using itertools:
import itertools
import pandas
countries = ["a","b","c","d"]
years = range(1990, 2020)
expand_grid = list(itertools.product(countries, years))
df = pandas.DataFrame(expand_grid, columns=('country', 'year'))
Another SO answer on the same topic
“One thing that is a blessing and a curse in R is that the machine learning algorithms are generally segmented by package. […] it can be a pain for day-to-day use where you might be switching between algorithms. […] scikit-learn provides a common set of ML algorithms all under the same API.”
“one thing that R still does better than Python is plotting. Hands down, R is better in just about every facet. Even so, Python plotting has matured though it’s a fractured community.”
pandas.pydata.org Comparison with R
“Tidyverse allows a mix of quoted and unquoted references to variable names. In my (in)experience, the convenience this brings is accompanied by equal consternation. It seems to me a lot of the problems solved by tidyeval would not exist if all variables were quoted all the time, as in pandas, but there are likely deeper truths I’m missing here…”
Help of the R function unite from the tidyr package:
“col: The name of the new column, as a string or symbol. This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with ‘rlang::ensym()’ (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).”
The use of symbols which do not represent actual objects was frustrating at first when using pandas, because we had to use df["x"] to assign vectors to new column names whereas we could use df.x to display them.
The xarray user guide page on pandas cites Hadley Wickham’s paper on tidy data:
“Tabular data is easiest to work with when it meets the criteria for tidy data”.
R is great for statistical analysis and plotting. You can also use it to elaborate a pipeline to load data, prepare it and analyse it. But when things start to get complicated, such as loading json data from APIs, or dealing with http request issues, or understanding lazy evaluation, or the consequences of non standard evaluation, moving down the rabbit hole can get really complicated with R. The rabbit hole slide is smoother with python. I have the feeling that I keep a certain level of understanding at all steps. It’s just a matter of taste anyway.
The Python language can be more verbose on some aspects, but it allows for greater programmability. It is also more predictable because non standard evaluation doesn’t create scoping problems, and it makes it possible to dive deeper into input/output issues such as URL request headers for example. R remains very good for data exploration, statistical analysis and plotting because non standard evaluation makes it possible to call variables without quotes and to pass formulas to plotting and estimation functions.
I see R more like the bash command line. It’s great for scripts, but you wouldn’t want to write large applications in bash.
Non standard evaluation doesn’t exist in python. - An email thread discussing the idea of non standard evaluation in python. - A comparison of a python implementation and an R implementation using non standard evaluation.
Compromised PyTorch-nightly dependency chain between December 25th and December 30th, 2022.
“PyTorch-nightly Linux packages installed via pip during that time installed a dependency, torchtriton, which was compromised on the Python Package Index (PyPI) code repository and ran a malicious binary. This is what is known as a supply chain attack and directly affects dependencies for packages that are hosted on public package indices.”
Anaconda was not affected https://www.anaconda.com/blog/anaconda-unaffected-by-pytorch-security-incident
“Conda users installing packages from Anaconda’s “main” channel are not impacted. This is because Anaconda’s official channels (the location where all our packages are stored) only contain packages built from stable upstream releases, while the affected PyTorch releases were nightly, development builds.
“Update: we have confirmed with the conda-forge maintainers that their PyTorch packages are also built from stable upstream releases and are similarly not impacted.”
See also string operations in pandas character vectors.
SO answer providing various ways to concatenate python strings.
How to print number with commas as thousands separators?
Thousand mark
f"{1e6:,}"
Round to 2 decimal places
f"{0.129456789:.2f}"
See also string operations in pandas with df[“x”].str methods.
Simple search with in
returns True
or
False
"a" in "bla"
"z" in "bla"
\S matches any non white space character
\W matches any non-word character
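For example (a minimal illustration):
import re
re.findall(r"\S+", "a b  c")  # ['a', 'b', 'c']
re.sub(r"\W", "_", "a-b c!")  # 'a_b_c_'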
Search for patterns
import re
re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]
Search for ab in baba:
re.search("ab", "baba")
Search for the numeric after “value_”
re.findall(r"value_(\d+)", "value_2022")
Access the whole match and each captured group of the numeric after “value_”
re.search(r"(value)_(\d+)", "value_2022").group(0)
re.search(r"(value)_(\d+)", "value_2022").group(1)
re.search(r"(value)_(\d+)", "value_2022").group(2)
Search for list elements that do not contain “value”
l = ["value123", "a", "b"]
[x for x in l if not re.search("value", x)]
Documentation of the re package.
Replace one or another character by a space
import re
re.sub("l|k", " ", "mlkj")
Replace one or more consecutive non alphanumeric characters by an underscore.
re.sub(r'\W+', '_', 'bla: bla**(bla)')
Insert a suffix in a file name before the extension (SO answer)
import re
re.sub(r'(?:_a)?(\.[^\.]*)$' , r'_suff\1',"long.file.name.jpg")
Join strings from a list to print them nicely
l = ["cons", "imp", "exp", "prod"]
print(l)
print(", ".join(l))
input = """bla
bla
bla"""
for line in input.splitlines():
print(line, "\n")
Real Python What is linear programming
Several free Python libraries are specialized to interact with linear or mixed-integer linear programming solvers:
EAFP Easier to ask for forgiveness than permission
“This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.”
LBYL Look before you leap
“This coding style explicitly tests for pre-conditions before making calls or lookups. This style contrasts with the EAFP approach and is characterized by the presence of many if statements. In a multi-threaded environment, the LBYL approach can risk introducing a race condition between “the looking” and “the leaping”. For example, the code, if key in mapping: return mapping[key] can fail if another thread removes key from mapping after the test, but before the lookup. This issue can be solved with locks or by using the EAFP approach.”
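A small sketch contrasting the two styles on a dictionary lookup (my own illustration of the quoted definitions):
mapping = {"a": 1}

# EAFP: attempt the lookup and catch the failure
try:
    value = mapping["b"]
except KeyError:
    value = None

# LBYL: test before the lookup (can race with other threads)
value = mapping["b"] if "b" in mapping else None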
Black is “the uncompromising Python code formatter”
See the pre-commit section below to install and run black as a pre-commit hook with pre-commit.
In vim, you can run black on the current file with:
:!black %
Ignore a revision in git blame after moving to black
“A long-standing argument against moving to automated code formatters like Black is that the migration will clutter up the output of git blame. This was a valid argument, but since Git version 2.23, Git natively supports ignoring revisions in blame with the --ignore-rev option.”
“You can even configure git to automatically ignore revisions listed in a file on every call to git blame.”
git config blame.ignoreRevsFile .git-blame-ignore-revs
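For reference, the .git-blame-ignore-revs file simply lists full commit hashes, one per line; the hash below is a placeholder:
# Migrate the code base to black (placeholder hash)
1234567890abcdef1234567890abcdef12345678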
Flake 8 looks at more than just formatting.
List of Flake8 warnings and error codes
https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html
Ignore errors in a .flake8 file at the root of the git repository
Ignore errors for just one line with a comment # noqa: E731
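A hypothetical .flake8 file ignoring a couple of codes project wide (the codes are chosen as examples):
[flake8]
# E501 line too long, E731 do not assign a lambda expression
ignore = E501, E731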
PEP 8 Style Guide for Python Code
“A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is the most important.”
“However, know when to be inconsistent – sometimes style guide recommendations just aren’t applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don’t hesitate to ask!”
“In particular: do not break backwards compatibility just to comply with this PEP!”
Blog:
Install pre-commit
pip install pre-commit
Set up pre-commit
in a repository
cd path_to_repository
# Add the "pre-commit" python module to a requirements file
vim requirements.txt
# Create a configuration file
vim .pre-commit-config.yaml
Configuration options, for example:
repos:
- repo: https://github.com/ambv/black
rev: 21.6b0
hooks:
- id: black
language_version: python3.7
- repo: https://gitlab.com/pycqa/flake8
rev: 3.7.9
hooks:
- id: flake8
Update hook repositories to the latest version
pre-commit autoupdate
Install git hooks in your .git/ directory.
pre-commit install
To deactivate a pre commit hook temporarily https://stackoverflow.com/questions/7230820/skip-git-commit-hooks
git commit --no-verify -m "commit message"
Usage in Continuous integration has a gitlab example:
my_job:
variables:
PRE_COMMIT_HOME: ${CI_PROJECT_DIR}/.cache/pre-commit
cache:
paths:
- ${PRE_COMMIT_HOME}
Uninstall
pre-commit uninstall
Edit a user’s configuration file
vim ~/.pylintrc
You can also make a project specific configuration file at the root of a git repository. The content of the configuration file is as follows:
[pylint]
# List of good names that shouldn't give a "short name" warning
good-names=df,ds,t
# Use Paul's default virtual environment
init-hook='import sys; sys.path.append("/home/paul/rp/penv/lib/python3.11/site-packages/")'
Adding the path to site packages in the virtual environment is necessary in order to avoid pylint using the base system version of python, which doesn’t have pandas installed. This avoids reporting package not installed errors.
Generate a configuration file
pylint --generate-rcfile
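The output goes to standard output, so it can be redirected to a configuration file, for example:
pylint --generate-rcfile > ~/.pylintrc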
Blog
E1101: ‘Instance of .. has no .. member’ for class with dynamic attributes
To ignore this error, I entered this in a .pylintrc file at the root of the project’s git repository
[TYPECHECK]
generated-members=other,indround,fuel,sawn,panel,pulp,paper
Use case:
# Make agg_trade_eu_row available here for backward compatibility
# so that the following import statement continues to work:
# >>> from biotrade.faostat.aggregate import agg_trade_eu_row
from biotrade.common.aggregate import agg_trade_eu_row # noqa # pylint: disable=unused-import
import <module> # noqa # pylint: disable=unused-import
I understand the danger of using a mutable default value and I suggest switching the warning message to something like “Dangerous mutable default value as argument”. However, is this dangerous in all sorts of scenarios? (I know that pylint isn’t supposed to check the functionality of my code, I’m just trying to clarify this anti-pattern.)
>>> def find(_filter={'_id': 0}):
... print({**find.__defaults__[0], **_filter})
...
>>> find()
{'_id': 0}
>>> find({'a': 1})
{'_id': 0, 'a': 1}
>>> find()
{'_id': 0}
>>> find({'a': 1, 'b': 2})
{'_id': 0, 'a': 1, 'b': 2}
One might argue that the following should be used and I tend to agree:
>>> def find(_filter=None):
... if _filter is None:
... _filter = {'_id': 0}
... else:
... _filter['_id'] = 0
... print(_filter)
...
>>> find()
{'_id': 0}
>>> find({'a': 1})
{'a': 1, '_id': 0}
>>> find()
{'_id': 0}
>>> find({'a': 1, 'b': 2})
{'a': 1, 'b': 2, '_id': 0}
Pylint message
Consider using ‘with’ for resource-allocating operations
Explained in a SO answer
suppose you are opening a file:
file_handle = open("some_file.txt", "r")
...
...
file_handle.close()
You need to close that file manually after the required task is done. If it’s not closed, then the resource (memory/buffer in this case) is wasted. If you use with in the above example:
with open("some_file.txt", "r") as file_handle:
...
...
there is no need to close that file. Resource de-allocation automatically happens when you use with.
Platform type
import sys
sys.platform
or
import os
os.name
sys.platform and os.name return different results, for example ‘linux’ versus ‘posix’.
More details are given by
os.uname()
Get an environment variable
import os
os.environ["XYZ"]
Set an environment variable
os.environ["XYZ"] = "/tmp"
For example in bash, the python path can be updated as follows:
export PYTHONPATH="$HOME/repos/biotrade/":$PYTHONPATH
This tells python where the biotrade package is located.
From python, use sys.path to prepend to the python path.
import sys
sys.path.insert(0, "/home/rougipa/eu_cbm/eu_cbm_hat")
See also the section on Path/python path to change the python path and import a script from Jupyter notebook.
https://docs.python.org/3/glossary.html#term-global-interpreter-lock
“The mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines.
However, some extension modules, either standard or third-party, are designed so as to release the GIL when doing computationally intensive tasks such as compression or hashing. Also, the GIL is always released when doing I/O.”
https://numba.pydata.org/numba-doc/latest/user/jit.html#nogil
“Whenever Numba optimizes Python code to native code that only works on native types and variables (rather than Python objects), it is not necessary anymore to hold Python’s global interpreter lock (GIL). Numba will release the GIL when entering such a compiled function if you passed nogil=True.”
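A minimal sketch of the nogil option described above (assuming numba is installed; the summation function is just an illustration):
import numba
import numpy as np

@numba.jit(nopython=True, nogil=True)
def total(values):
    """Sum the values of a numeric array; the GIL is released while it runs"""
    s = 0.0
    for v in values:
        s += v
    return s

total(np.arange(1e6))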
To display the memory usage of a python object
import sys
a = 1
print(sys.getsizeof(a))
See also the section on memory usage of pandas data frames under columns / memory usage.
Sometimes when a python process runs out of memory, it can get killed by the Linux Kernel. In that case the error message is short “killed” and there is no python trace back printed. You can check that it is indeed a memory error by calling
sudo dmesg
Here is a typical message:
[85962.510533] Out of memory: Kill process 16035 (ipython3) score 320 or sacrifice child
[85962.510554] Killed process 16035 (ipython3) total-vm:7081812kB, anon-rss:4536336kB, file-rss:0kB, shmem-rss:8kB
[85962.687468] oom_reaper: reaped process 16035 (ipython3), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB
Various related Stack Overflow questions what does “kill” mean, How can I find the reason that python script is killed?, Why does python script randomly gets killed?.
Show the location of an imported module:
import module_name
print(module_name.__file__)
For example on a system you might have built-in modules stored in one directory, user installed modules in another place and development modules yet in another place:
import os
os.__file__
# '/usr/lib/python3.9/os.py'
import pandas
pandas.__file__
# '/home/paul/.local/lib/python3.9/site-packages/pandas/__init__.py'
import biotrade
biotrade.__file__
# '/home/paul/repos/forobs/biotrade/biotrade/__init__.py'
Show all modules installed in a system:
help("modules")
A good post about TDD: Unit testing, you’re doing it wrong
“TDD is actually about every form of tests. For example, I often write performance tests as part of my TDD routine; end-to-end tests as well. Furthermore, this is about behaviours, not implementation: you write a new test when you need to fulfil a requirement. You do not write a test when you need to code a new class or a new method. Subtle, but important nuance. For example, you should not write a new test just because you refactored the code. If you have to, it means you were not really doing TDD.” […] “Good tests must test a behavior in isolation to other tests. Calling them unit, system or integration has no relevance to this. Kent Beck says this much better than I would ever do. ’‘’From this perspective, the integration/unit test frontier is a frontier of design, not of tools or frameworks or how long tests run or how many lines of code we wrote get executed while running the test.’’’ Kent Beck”
Numpy moved from nose to pytest as explained in their testing guidelines:
“Until the 1.15 release, NumPy used the nose testing framework, it now uses the pytest framework. The older framework is still maintained in order to support downstream projects that use the old numpy framework, but all tests for NumPy should use pytest.”
Save this function in a file named test_numpy.py
import numpy as np

def test_numpy_closeness():
    assert [1, 2] == [1, 2]
    assert (np.array([1, 2]) == np.array([1, 2])).all()
    # This assertion raises an error because 2 and 3 are not close
    np.testing.assert_allclose(np.array([1, 2]), np.array([1, 3]))
Save the file as test_nn.py
import neural_nets as nn
import numpy as np
def test_rectified_linear_unit():
x = np.array([[1,0],
[-1,-3]])
expected = np.array([[1,0],
[0,0]])
provided = nn.rectified_linear_unit(x)
assert np.allclose(expected, provided), "test failed"
Execute the test suite from bash with py.test as follows:
cd ~/rp/course_machine_learning/projects/project_2_3_mnist/part2-nn
py.test
Run and enter debug mode: https://stackoverflow.com/a/48739098/2641825 suggests starting an ipython shell within the package directory and then running
!pytest --pdb
to enter inside the function that is causing a test to fail. Objects in that function can then be inspected at the python debugger prompt.
Run unittest with pytest How to use unittest-based tests with pytest
pytest file_test.py
https://docs.python.org/3/library/doctest.html
“The doctest module searches for pieces of text that look like interactive Python sessions, and then executes those sessions to verify that they work exactly as shown.”
Run doctest in pytest https://docs.pytest.org/en/7.1.x/how-to/doctest.html
Test examples in the whole module
pytest --doctest-modules
Test only one file
pytest --doctest-modules post_processor/nai.py
“To skip a single check inside a doctest you can use the standard doctest.SKIP directive:”
def test_random(y):
"""
>>> random.random() # doctest: +SKIP
0.156231223
>>> 1 + 1
2
"""
A truncated error such as “ValueError: line 32 of the docstring for …” can appear when a snippet spans several lines: “If you have a code snippet that wraps multiple lines, you need to use ‘…’ on the continued lines”.
See also:
doctestplus https://github.com/scientific-python/pytest-doctestplus provides additional functionality to skip tests in certain classes or for an entire module.
doctest plus provides additional flags to skip or include tests on remote data. This works in conjunction with https://github.com/astropy/pytest-remotedata
“The pytest-remotedata plugin allows developers to indicate which unit tests require access to the internet, and to control when and whether such tests should execute as part of any given run of the test suite.”
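A sketch of how a test can be marked with the plugin (the test itself is a made-up example); it only runs when pytest is called with the --remote-data option:
import urllib.request
import pytest

@pytest.mark.remote_data
def test_download_page():
    # Requires internet access, hence the remote_data marker
    with urllib.request.urlopen("https://example.com") as response:
        assert response.status == 200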
Paul Rougieux’s SO question on testing pandas data frame with doctest
I have a package with many methods that output pandas data frame. I would like to test the examples with pytest and doctest as explained on the pytest doctest integration page.
Pytest requires the output data frame to contain a certain number of columns that might be different than the number of columns provided in the example.
>>> import pandas
>>> df = pandas.DataFrame({"variable": range(3)})
>>> for i in range(7):
... df["variable_"+str(i)] = range(3)
>>> df
variable variable_0 variable_1 variable_2 variable_3 variable_4 variable_5 variable_6
0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
pytest --doctest-modules
returns the following error because it displays 6 columns instead of 7:
Differences (unified diff with -expected +actual):
@@ -1,4 +1,6 @@
- variable_1 variable_2 variable_3 variable_4 variable_5 variable_6 variable_7
-0 0 0 0 0 0 0 0
-1 1 1 1 1 1 1 1
-2 2 2 2 2 2 2 2
+ variable_1 variable_2 variable_3 ... variable_5 variable_6 variable_7
+0 0 0 0 ... 0 0 0
+1 1 1 1 ... 1 1 1
+2 2 2 2 ... 2 2 2
+<BLANKLINE>
+[3 rows x 7 columns]
Is there a way to fix the number of column? Does doctest always have a fixed terminal output?
Number of columns issues
>>> import pandas
>>> df = pandas.DataFrame({"variable_1": range(3)})
>>> for i in range(2, 8): df["variable_"+str(i)] = range(3)
>>> df
variable_1 variable_2 variable_3 variable_4 variable_5 variable_6 variable_7
0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2
Differences (unified diff with -expected +actual):
@@ -1,4 +1,6 @@
-   variable_1  variable_2  variable_3  variable_4  variable_5  variable_6  variable_7
-0           0           0           0           0           0           0           0
-1           1           1           1           1           1           1           1
-2           2           2           2           2           2           2           2
+   variable_1  variable_2  variable_3  ...  variable_5  variable_6  variable_7
+0           0           0           0  ...           0           0           0
+1           1           1           1  ...           1           1           1
+2           2           2           2  ...           2           2           2
+
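One possible workaround (my own suggestion, not the accepted answer) is to widen the pandas display options at the top of the doctest so the frame is not truncated:
>>> import pandas
>>> pandas.set_option("display.width", 200)
>>> pandas.set_option("display.max_columns", 20)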
Methods to test data frame and series equality
from pandas.testing import assert_frame_equal
from pandas.testing import assert_series_equal
import seaborn
iris = seaborn.load_dataset("iris")
assert_frame_equal(iris, iris)
iris["species2"] = iris["species"]
assert_series_equal(iris["species"], iris["species2"])
# Ignore names
assert_series_equal(iris["species"], iris["species2"], check_names=False)
Sometimes you want tolerance
df = pandas.DataFrame({"a": [1.0, 2, 3],
                       "b": [1.0001, 2, 3]})
# Raises an AssertionError: the default tolerance is tighter than the 1e-4 difference
assert_series_equal(df["a"], df["b"], check_names=False)
# Passes with a relative tolerance of 1%
assert_series_equal(df["a"], df["b"], rtol=1e-2, check_names=False)
pytest assert
“In order to write assertions about raised exceptions, you can use pytest.raises() as a context manager like this:”
import pytest
def test_zero_division():
with pytest.raises(ZeroDivisionError):
1 / 0
“and if you need to have access to the actual exception info you may use:”
def test_recursion_depth():
with pytest.raises(RuntimeError) as excinfo:
def f():
f()
f()
assert "maximum recursion" in str(excinfo.value)
“excinfo is an ExceptionInfo instance, which is a wrapper around the actual exception raised. The main attributes of interest are .type, .value and .traceback.”
import pytest
import xarray
from cobwood.gfpmx_equations import (
consumption,
consumption_pulp,
consumption_indround,
)
@pytest.fixture
def secondary_product_dataset():
"""Create a sample dataset for testing"""
ds = xarray.Dataset({
"cons_constant": xarray.DataArray([2, 3, 4], dims=["c"]),
"price": xarray.DataArray([[1, 2], [3, 4], [5, 6]], dims=["c", "t"]),
"gdp": xarray.DataArray([[100, 200], [300, 400], [500, 600]], dims=["c", "t"]),
"prod": xarray.DataArray([[100, 200], [300, 400], [500, 600]], dims=["c", "t"]),
"cons_price_elasticity": xarray.DataArray([0.5, 0.6, 0.7], dims=["c"]),
"cons_gdp_elasticity": xarray.DataArray([0.8, 0.9, 1.0], dims=["c"]),
})
return ds
def test_consumption(secondary_product_dataset):
"""Test the consumption function"""
ds = secondary_product_dataset
t = 1
expected_result = xarray.DataArray([138.62896863, 1274.23051055, 7404.40635264], dims=["c"])
result = consumption(ds, t)
xarray.testing.assert_allclose(result, expected_result)
https://docs.pytest.org/en/6.2.x/parametrize.html
“The builtin pytest.mark.parametrize decorator enables parametrization of arguments for a test function. Here is a typical example of a test function that implements checking that a certain input leads to an expected output:
# content of test_expectation.py
import pytest
@pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
def test_eval(test_input, expected):
assert eval(test_input) == expected
Add this at the beginning of pytest files to avoid pylint warnings about fixtures redefining names from the outer scope:
# pylint: disable=redefined-outer-name
https://fastapi.tiangolo.com/ seems to be a recommended way to create python APIs.
Django API
Flask API
Note: Flask Evolution into Quart to support asyncio This last link contains a nice, simple example of how asyncio works with a simulated delay to fetch a web page.
Star history comparison of ploomber and snakemake https://star-history.com/#snakemake/snakemake&ploomber/ploomber&Date
Ploomber https://github.com/ploomber/ploomber
Snake make https://github.com/snakemake/snakemake
“DAGs In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAGs structure (tasks and their dependencies) as code. For example, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow will run every night at 10pm, but shouldn’t start until a certain date. In this way, a DAG describes how you want to carry out your workflow; but notice that we haven’t said anything about what we actually want to do! A, B, and C could be anything. Maybe A prepares data for B to analyze while C sends an email. Or perhaps A monitors your location so B can open your garage door while C turns on your house lights. The important thing is that the DAG isn’t concerned with what its constituent tasks do; its job is to make sure that whatever they do happens at the right time, or in the right order, or with the right handling of any unexpected issues. DAGs are defined in standard Python files that are placed in Airflow’sDAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.”
” Workflows You’re now familiar with the core building blocks of Airflow. Some of the concepts may sound very similar, but the vocabulary can be conceptualized like this:
DAG: The work (tasks), and the order in which work should take place (dependencies), written in Python.
DAG Run: An instance of a DAG for a particular logical date and time.
Operator: A class that acts as a template for carrying out some work.
Task: Defines work by implementing an operator, written in Python.
Task Instance: An instance of a task - that has been assigned to a DAG and has a state associated with a specific DAG run (i.e for a specific execution_date).
execution_date: The logical date and time for a DAG Run and its Task Instances.
By combining DAGs and Operators to create TaskInstances, you can build complex workflows.”
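A minimal sketch of the three-task example described in the quote, assuming the Airflow 2.x API (the dag id, schedule and commands are invented for illustration):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_abc",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 22 * * *",  # every night at 10 pm
    catchup=False,
) as dag:
    a = BashOperator(task_id="a", bash_command="echo prepare data")
    b = BashOperator(task_id="b", bash_command="echo analyse data")
    c = BashOperator(task_id="c", bash_command="echo send email")
    a >> b  # B runs after A; C can run any time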
Create a data array and plot it, example from the xarray quick overview:
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
da2 = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
da2.attrs["long_name"] = "random velocity"
da2.attrs["units"] = "metres/sec"
da2.attrs["description"] = "A random variable created as an example."
da2.attrs["random_attribute"] = 123
da2.attrs
da2.plot()
plt.show()
Create another data array with one dimension only and multiply it with the two dimensional array
da1 = xr.DataArray([1,2], coords={"x":[10,20]})
da2 * da1
> "xarray.DataArray is xarray’s implementation of a labeled, multi-dimensional
> array. It has several key properties:
>
> - values: a numpy.ndarray holding the array’s values
>
> - dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
>
> - coords: a dict-like container of arrays (coordinates) that label each point
> (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
>
> - attrs: dict to hold arbitrary metadata (attributes)
>
> Xarray uses dims and coords to enable its core metadata aware operations.
> Dimensions provide names that xarray uses instead of the axis argument found in
> many numpy functions. Coordinates enable fast label based indexing and
> alignment, building on the functionality of the index found on a pandas
> DataFrame or Series."
> "xarray.Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a
> dict-like container of labeled arrays (DataArray objects) with aligned
> dimensions. It is designed as an in-memory representation of the data model
> from the netCDF file format.
>
> In addition to the dict-like interface of the dataset itself, which can be used
> to access any variable in a dataset, datasets have four key properties:
>
> dims: a dictionary mapping from dimension names to the fixed length of each
> dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
>
> data_vars: a dict-like container of DataArrays corresponding to variables
>
> coords: another dict-like container of DataArrays intended to label points
> used in data_vars (e.g., arrays of numbers, datetime objects or strings)
> attrs: dict to hold arbitrary metadata
>
> The distinction between whether a variable falls in data or coordinates
> (borrowed from CF conventions) is mostly semantic, and you can probably get
> away with ignoring it if you like: dictionary like access on a dataset will
> supply variables found in either category. However, xarray does make use of the
> distinction for indexing and computations. Coordinates indicate
> constant/fixed/independent quantities, unlike the varying/measured/dependent
> quantities that belong in data."
converting between datasets and arrays
> "This method broadcasts all data variables in the dataset against each other,
> then concatenates them along a new dimension into a new array while
> preserving coordinates."
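A minimal sketch of that conversion using Dataset.to_array (renamed to_dataarray in recent xarray releases) and the reverse DataArray.to_dataset; the example dataset is my own:
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("x", [10, 20, 30])})
# Broadcast the data variables along a new "variable" dimension
da = ds.to_array()
# Back to a dataset with one variable per label of the "variable" dimension
da.to_dataset(dim="variable")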
Convert a 1 dimensional data array to a list
ds.country.values.tolist()
The documentation of the DataArray.copy and Dataset.copy methods shows they both have a deep argument. If this argument is set to False (the default), the method only returns a new view on the dataset. In the illustration below, a dataset is passed to a function that removes values above a threshold. When deep=False, the input data is changed as well, even though we used the copy() method. We really have to use copy(deep=True) to make sure that the input data remains unmodified.
import xarray
import numpy as np
ds = xarray.Dataset(
{"a": (("x", "y"), np.random.randn(2, 3))},
coords={"x": [10, 20], "y": ["a", "b", "c"]},
)
ds
def remove_x_larger_than(ds_in, threshold, deep):
"""Remove values of x larger than the threshold"""
ds_out = ds_in.copy(deep=deep)
ds_out.loc[dict(x=ds_out.coords["x"]>threshold)] = np.nan
return ds_out
remove_x_larger_than(ds, threshold=10, deep=True)
print(ds)
remove_x_larger_than(ds, threshold=10, deep=False)
print(ds)
Round trip from pandas to xarray and back from the xarray user guide page on pandas.
import xarray
import numpy as np
ds = xarray.Dataset(
{"foo": (("x", "y"), np.random.randn(2, 3))},
coords={
"x": [10, 20],
"y": ["a", "b", "c"],
"along_x": ("x", np.random.randn(2)),
"scalar": 123,
},
)
ds
x and y are dimensions.
We can add attributes to qualify metadata.
ds.attrs["product"] = "sponge"
Convert the xarray dataset to a pandas data frame
df = ds.to_dataframe()
df
Convert the data frame back to a dataset
xarray.Dataset.from_dataframe(df)
“Notice that the dimensions of variables in the Dataset have now expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex. Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.”
You can also use
xarray.DataArray(df)
Fill an array with zero values, similar to an existing data array. Or fill it with NA values.
xarray.zeros_like(da)
xarray.full_like(da, fill_value=np.nan)  # use numpy.nan; there is no xarray.nan
The equivalent to df.columns in pandas would be list(sawn.data_vars) for an xarray dataset.
ds.data_vars displays the data variables with the beginning of their content. If you loop on it, it only yields the variable names as strings:
for x in ds.data_vars:
print(x, type(x))
A list of variables
list(sawn.data_vars)
Group the given variable by region, using a DataArray called “region” which is stored inside the dataset:
region_data = gfpmx_data.country_groups.set_index('country')['region']
region_dataarray = xarray.DataArray.from_series(region_data)
aggregated_data = ds[var].loc[COUNTRIES, t].groupby(ds["region"]).sum()
ds[var].loc["WORLD", t] = ds[var].loc[COUNTRIES, t].sum()
ds[var].loc[regions,t] = ds[var].loc[COUNTRIES,t].groupby(ds["region"].loc[COUNTRIES]).sum()
Example use with the GFPMx dataset
ds["exp"].loc["Czechia", ds.coords["year"]>2015]
https://docs.xarray.dev/en/stable/user-guide/indexing.html#assigning-values-with-indexing
To select and assign values to a portion of a DataArray() you can use indexing with .loc or .where.
import xarray
import matplotlib.pyplot as plt
ds = xarray.tutorial.open_dataset("air_temperature")
ds["empty"] = xarray.full_like(ds.air.mean("time"), fill_value=0)
ds["empty"].loc[dict(lon=260, lat=30)] = 100
lc = ds.coords["lon"]
la = ds.coords["lat"]
ds["empty"].loc[
dict(lon=lc[(lc > 220) & (lc < 260)], lat=la[(la > 20) & (la < 60)])
] = 100
# Plot
ds.empty.plot()
plt.show()
# Write to a csv file
ds.empty.to_dataframe().to_csv("/tmp/empty.csv")
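The .where method mentioned above keeps values where a condition holds and fills the rest with NaN; a one-liner on the same tutorial dataset (my own addition, not from the linked page):
ds["empty"].where(ds.lat > 40)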
“Warning Do not try to assign values when using any of the indexing methods .isel or .sel:”
da = xarray.DataArray([0, 1, 2, 3], dims=["x"])
# This will return an error
da.isel(x=[0, 1, 2]) = -1
# SyntaxError: cannot assign to function call
# Do not do this
da.isel(x=[0, 1, 2])[1] = -1
# Use a dictionary instead
da[dict(x=[1])] = -1
# Also works with broadcasting
da[dict(x=[0, 1, 2])] = -1
Keep only the data up to and including the base year in the dataset.
base_year = 2018
ds.sel(year = ds.year <= base_year)
ds.query(year = "year <= 2018")
ds.query(year = "year <= @base_year")
# Returns an error
# SyntaxError: The '@' prefix is not allowed in top-level eval calls.
# please refer to your variables by name without the '@' prefix.
Reindex an array to get the same coordinates as another one, with empty values where values are missing.
There is no isna() method in xarray. Check for missing data with the isnull() method
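A small sketch of both operations (the array contents are invented):
import xarray as xr

da1 = xr.DataArray([1.0, 2.0], coords={"x": [10, 20]}, dims="x")
da2 = xr.DataArray([1.0, 2.0, 3.0], coords={"x": [10, 20, 30]}, dims="x")
# Align da1 on the coordinates of da2; missing positions become NaN
da3 = da1.reindex_like(da2)
# Boolean array flagging the missing values
da3.isnull()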
transitioning from pandas panel to xarray
> "As discussed elsewhere in the docs, there are two primary data structures
> in xarray: DataArray and Dataset. You can imagine a DataArray as a
> n-dimensional pandas Series (i.e. a single typed array), and a Dataset as the
> DataFrame equivalent (i.e. a dict of aligned DataArray objects).
> So you can represent a Panel, in two ways:
>
> As a 3-dimensional DataArray,
>
> Or as a Dataset containing a number of 2-dimensional DataArray objects.
> "Variables in Dataset objects can use a subset of its dimensions. For
> example, you can have one dataset with Person x Score x Time, and another
> with Person x Score."
“For more advanced scatter plots, we recommend converting the relevant data variables to a pandas DataFrame and using the extensive plotting capabilities of seaborn.”
“The easiest way to create faceted plots is to pass in row or col arguments to the xarray plotting methods/functions. This returns a xarray.plot.FacetGrid object.”
spda.loc[dict(country=["Ukraine", "Uzbekistan"])].plot(col="country")
# Plot by continents
gfpmxb2020.indround["prod"].loc[~gfpmxb2020.indround.c].plot(col="country")
Example using the GFPMx data structure to plot industrial roundwood consumption, production and trade in Czechia:
variables = ["imp", "cons", "exp", "prod"]
# Select inside the dataset
gfpmx["indround"].loc[{"country":"Czechia"}][variables].to_dataframe()[variables].plot()
# Convert to data frame first then plot
gfpmx["indround"].to_dataframe().loc["Czechia"][variables].plot()
See the general section on IO and file formats. The subsection on netcdf files refers to xarray.
See also the conversion section for conversion to other in memory formats such as lists or pandas data frames.
Julio Biason Things I Learnt The Hard Way (in 30 Years of Software Development)
Daniel Lemire I do not use a debugger
“Debuggers don’t remove bugs. They only show them in slow motion.”
Wes McKinney
2017 Apache Arrow and the 10 things I hate about pandas
“pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset”
2018 Announcing Ursalabs
“It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. This is part of why working on Arrow has been so important for me; it provides a path to sharing of systems code outside of Python by enabling free interoperability at the data level.”
“Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development. Nearly 9 years have passed since J.J. started building the RStudio IDE, but in many ways he and Hadley and others feel like they are just getting started.”
Dotan Nahum Functional Programming with Python for People Without Time
“Cracks in the Ice - We ended the previous part with stating that with a good measure of abstraction, functional programming doesn’t offer a considerable advantage over the “traditional” way of design, object oriented. It’s a lie. […] In our pipeline example above with our Executors — how do you feed in the output of one executor as the input for the next one? well, you have to build that infrastructure. With functional programming, those abstractions are not abstractions that you have to custom build. They’re part of the language, mindset, and ecosystem. Generically speaking — it’s all about impedence mismatch and leaky abstractions and when it comes to data and functions; there’s no mismatch because it’s built up from the core. The thesis is — that to build a functional programming approach over an object-oriented playground — is going to crash and burn at one point or another: be it bad modeling of abstractions, performance problems, bad developer ergonomics, and the worst — wrong mindset. Being able to model problems and solutions in a functional way, transcends above traditional abstraction; the object-oriented approach, in comparison, is crude, inefficient and prone to maintenance problems.”
Christopher Rackauckas Why numba and cython are no substitute for Julia discusses the advantages of the Julia language over Python for large code bases.
Ethan Rosenthal Everything Gets a Package: My Python Data Science Setup
There are two primary methods to express data:
MultiIndex DataFrames where the outer index is the entity and the inner is the time index. This requires using pandas.
3D structures where dimension 0 (outer) is the variable, dimension 1 is the time index and dimension 2 is the entity index. It is also possible to use a 2D data structure with dimensions (t, n), which is treated as a 3D data structure having dimensions (1, t, n). These 3D data structures can be pandas, NumPy or xarray.
Explains multi index with stacking and unstacking.
COIN-OR project “open source for the operations research community”
“Without open source implementations of existing algorithms, testing new ideas built on existing ones typically requires the time-consuming and error-prone process of re-implementing (and re-debugging and re-testing) the original algorithm. If the original algorithm were publicly available in a community repository, imagine the productivity gains from software reuse! Science evolves when previous results can be easily replicated”
“We support and maintain python.org, The Python Package Index, Python Documentation, and many other services the Python Community relies on.”
“Larry Page in his dormitory at Stanford had written or tried to write a web spider to get a copy of some subset of the web on his computers so he could try his famous Page algorithm. He was trying to use the brand-new language Java in 1.0 beta version, and it kept crashing. So he asked for help from his roommate and his roommate took a look at said ‘oh you’re using that Java disaster’. Of course, it crashed and did it in 100 lines of Python. It runs perfectly, and that’s how Google became possible through 100 lines of Python. But I had no idea until about five years ago that it had played so crucial role so early on.”
” Similarly, if I hadn’t heard it from the mouth of Guido himself, I would never have known that Python was at the heart of the web. The very first Web server and web browser were written by the inventor of the World Wide Web, HTTP, and HTML in Python. He wasn’t really a programmer; he was a physicist and Python was far easier to use than anything else.”
Interesting how Samuel Colvin talks about serialization in “Automate your data exchange with pydantic”.