Python Standards and Best Practices

The Zen of Python, by Tim Peters

This page contains a list of proposed guidelines to (mostly) improve the formatting of our code, and create more consistency across developers. It's a proposal for discussion and should be considered as such.

These guidelines are based on PEP 8, PEP 257, and PEP 20, with some additions. They will be refined continuously.

A video presentation and a related slide can be found here for this doc.

Python Enhancement Proposals (PEPs)
Naming Convention
Break long lines
Comments & DocString
Classes and functions
Auto-optimize

Details

Code, likened to a sincere love letter, communicates a developer's logic and approach. Just as poets craft resonance with literary devices, good code deliberately applies design principles such as SOLID, DRY, KISS, and YAGNI. By following best practices such as naming conventions and modularity, good code becomes a lasting legacy for future developers, balancing comprehension and efficiency.

>>> import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
...

Python developers should prioritize code readability and best practices because they lead to easier maintenance, more efficient collaboration, fewer bugs, smoother scalability, a consistent codebase, streamlined documentation, a faster learning curve for new developers, quicker code reviews, longer-lived code, and professional coding standards.

Imports

Grouping imports

Imports should be grouped in the following order:

1, Standard library imports.
2, Related third-party imports (pip install).
3, Local application/library specific imports.

You should put a blank line between each group of imports.

Not recommended
import pendulum
from airflow import DAG
from cdh.airflow import global_default_args
from datetime import datetime

Recommended
from datetime import datetime

from airflow import DAG
import pendulum

from cdh.airflow import global_default_args

Order alphabetically

Not recommended
from cdh.variables import CDH_DAG_PREFIX, ...
from cdh.variables import HAAL_ADL_ROOT
from cdh.airflow.sensors.adf_activity_run_sensor import ADFActivityRunSensor
from cdh.py.dl_raw import partitioned as load_partitioned
from cdh.airflow.operators.cdh_hive_operator import CDHHiveOperator
from cdh.airflow.sensors.cdh_table_source_file_is_ready_sensor import CDHTableSourceFileIsReadySensor
from cdh.py.dl_raw import unpartitioned as load_unpartitioned
from cdh.airflow.operators.cdh_access_rights_operator import CDHAccessRightsOperator
from cdh.airflow.operators.cdh_load_status_operator import CDHLoadStatusOperator
from cdh.airflow.operators.cdh_compute_stats_operator import CDHComputeStatsOperator
from cdh.airflow import global_default_args

Recommended
from cdh.airflow import global_default_args
from cdh.airflow.operators.cdh_access_rights_operator import CDHAccessRightsOperator
from cdh.airflow.operators.cdh_compute_stats_operator import CDHComputeStatsOperator
from cdh.airflow.operators.cdh_hive_operator import CDHHiveOperator
from cdh.airflow.operators.cdh_load_status_operator import CDHLoadStatusOperator
from cdh.airflow.sensors.adf_activity_run_sensor import ADFActivityRunSensor
from cdh.airflow.sensors.cdh_table_source_file_is_ready_sensor import CDHTableSourceFileIsReadySensor
from cdh.py.dl_raw import partitioned as load_partitioned
from cdh.py.dl_raw import unpartitioned as load_unpartitioned
from cdh.variables import CDH_DAG_PREFIX, ...

Break long lines

Introduction

Overly long lines make it difficult:

1, to have several files next to each other
2, when doing code review, to see which part of the line has changed
3, to count how many arguments there are etc.

For these reasons, and to keep the code consistent and readable across the team, the recommended maximum length for a single line is 79 characters.
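
As a rough illustration of the limit above, the following sketch flags lines longer than 79 characters. The function name `find_long_lines` and the constant `MAX_LINE_LENGTH` are assumptions made for this example; in practice a linter such as flake8 or pycodestyle normally performs this check.

```python
from typing import List, Tuple

# Assumed constant for this sketch: the limit quoted above.
MAX_LINE_LENGTH = 79


def find_long_lines(
    source: str,
    max_length: int = MAX_LINE_LENGTH,
) -> List[Tuple[int, int]]:
    """Return (line_number, length) pairs for lines over the limit."""
    return [
        (line_number, len(line))
        for line_number, line in enumerate(source.splitlines(), start=1)
        if len(line) > max_length
    ]


# A two-line sample: the second line is 84 characters long.
sample = "short = 1\n" + "x = " + "1" * 80 + "\n"
long_lines = find_long_lines(sample)
```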

PEP8: Trailing commas

Trailing commas are often helpful when a list of values, arguments, or imported items is expected to be extended over time. The pattern is to put each value (etc.) on a line by itself, always adding a trailing comma, and add the close parenthesis/bracket/brace on the next line. Therefore:

1, always use one item per line when there are 3 or more parameters/elements
2, for 2 parameters/elements, use your common sense
3, keep in mind that readability > rules

Strings

Not recommended

PartitionedTable = collections.namedtuple(
    "PartitionedTable",
    (
        "cdh_target_schema_name cdh_target_table_name adf_activity_name adf_name "
        "adf_pipeline_name do_not_deduplicate primary_key_columns_list "
        "primary_key_sort_columns_list on_insert_sort_by min_adf_completion_count"
    )
)

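Recommended

PartitionedTable = collections.namedtuple(
    "PartitionedTable",
    [
        "cdh_target_schema_name",
        "cdh_target_table_name",
        "adf_activity_name",
        "adf_name",
        "adf_pipeline_name",
        "do_not_deduplicate",
        "primary_key_columns_list",
        "primary_key_sort_columns_list",
        "on_insert_sort_by",
        "min_adf_completion_count",
    ],
)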

Arrays

Not recommended

my_array = [element1, element2, element3, ]

Not recommended

my_array = [
    element1,
    element2,
    # ...
    element4
]

Recommended

my_array = [
    element1,
    element2,
    # ...
    element4,  # The final comma keeps the diff clean when adding elements.
]

Parameters

In addition, it is highly recommended to make the type of each object/parameter explicit. This helps in understanding the code and can help detect inconsistencies and bugs.

Not recommended
def my_methods(param1,
    param2,
    param3,
):

Not recommended
def my_methods(
    param1,
    param2,
    param3
):

Not recommended
def my_methods(param1,
               param2,
):

Recommended
def my_methods(
    param1,
    param2,
    param3,  # Don't forget the final comma.
):

List comprehensions

While sometimes it's okay to put an if/for/while with a small body on the same line, never do this for multi-clause statements. Also, avoid folding such long lines!

Not recommended
articles_ids = [ articles["id"] for articles in articles_list if articles["type"] == "A" ]

Recommended
articles_ids = [
    articles["id"]
    for articles in articles_list
    if articles["type"] == "A"
]

Imports

Not recommended
from cdh.variables import CDH_DAG_PREFIX, CDH_SCHEMA_PREFIX, CDH_WAREHOUSE_ROOT, DATA_FACTORY_V2, DATA_FACTORY_V1, CDH_TASK_PREFIX, CDH_INGRESS_ROOT, CDH_ENV_NAME, LOCAL_TZ

Recommended
from cdh.variables import (
    CDH_DAG_PREFIX,
    CDH_ENV_NAME,
    CDH_INGRESS_ROOT,
    CDH_SCHEMA_PREFIX,
    CDH_TASK_PREFIX,
    CDH_WAREHOUSE_ROOT,
    DATA_FACTORY_V1,
    DATA_FACTORY_V2,
    LOCAL_TZ,
)

Classes and functions

Introduction

These guidelines provide the recommended way to name functions and classes. The most important takeaways are:

1, PascalCase for classes (e.g., MyClass)
2, snake_case for variables and functions (e.g., function_to_calculate_average)

Constants

Constants are usually defined at module level and written in all capital letters with underscores separating words. They can be defined in the same script where they are used. However, where possible, it is recommended to keep them in a separate module (something like constants.py) within a coherent scope, from which they can be imported. This makes the constants easier to modify when required.

Not recommended

tuple_fields = "cdh_target_schema_name ..."

Recommended

TUPLE_FIELDS = "cdh_target_schema_name ..."
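
The separate-module suggestion above could look like the following minimal sketch, shown in a single block for brevity. The file name constants.py comes from the text; the constant names, their values, and the helper `split_into_batches` are hypothetical.

```python
from typing import List

# Contents of a hypothetical constants.py (names and values are made up).
DEFAULT_BATCH_SIZE = 100
MAX_RETRY_COUNT = 3


# In another module, the constants would be imported instead:
#     from constants import DEFAULT_BATCH_SIZE
def split_into_batches(
    items: List[int],
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> List[List[int]]:
    """Split items into consecutive batches of at most batch_size."""
    return [
        items[start:start + batch_size]
        for start in range(0, len(items), batch_size)
    ]


batches = split_into_batches(list(range(250)))
```

Changing the batch size then means editing one line in one file, rather than hunting for every place the value is used.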

Avoid using abbreviations

Abbreviations may look useful when programming/developing. However, it is recommended to avoid using them since projects usually involve many people and the meaning of abbreviations may be forgotten over time.

Not recommended
# inc => increment / include ?

def inc_ctr() -> None:
    pass

Recommended
def include_category() -> None:
    pass

Not recommended
# creds_info => credit / credential ?

self.creds_info = azureapi.get_credential(cred_name=self.cluster_name, cred_type="hdi")

Recommended

self.credential_info = azureapi.get_credential(
    credential_name=self.cluster_name,
    credential_type="hdi",
)

Classes

Use one leading underscore only for non-public methods and instance variables (see PEP 8).

Recommended
class MyClass:
    def _private_fun(self):
        return 0

    def public_fun(self):
        return self._private_fun()

Comments & DocString

Introduction

Docstrings are key in the development of code. Since any project is bound to be shared by many people and over a long period of time, clear documentation of what each function does, what parameters (and their types) it expects and what it returns is mandatory.
Docstrings should prevail over comments, although comments are recommended when necessary.

DocString

1, The FE team uses Google-style docstrings.
2, Use the automation tool pyment to auto-format docstrings.
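
For reference, a Google-style docstring might look like the sketch below. The function `compute_average` is a hypothetical example, not taken from the codebase; the section names (Args, Returns, Raises) follow the Google Python style guide.

```python
from typing import List


def compute_average(values: List[float]) -> float:
    """Compute the arithmetic mean of a list of numbers.

    Args:
        values: The numbers to average. Must not be empty.

    Returns:
        The arithmetic mean of the values.

    Raises:
        ValueError: If values is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


average = compute_average([1.0, 2.0, 3.0])
```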

Write complete sentences

Comments should be complete sentences. The first word should be capitalized.

Not recommended
# defining skip function

def skip_fn(*args, **kwargs) -> str:
    ...

Recommended
# Define a skip function.

def skip_fn(*args, **kwargs) -> str:
    ...

One-line Docstrings (PEP257)

One-liners are for really obvious cases. They should really fit on one line.

Not recommended
def skip_fn(*args, **kwargs) -> bool:
    """
    Define a placeholder function for the skip operator.
    """
    return True

Recommended
def skip_fn(*args, **kwargs) -> bool:
    """Define a placeholder function for the skip operator."""
    return True

Multi-line Docstrings

PEP8: Multi-line comments

Block comments generally consist of one or more paragraphs built out of complete sentences, with each sentence ending in a period.

PEP257: Multiline docstrings

Multi-line docstrings consist of:

1, a summary line just like a one-line docstring
2, followed by a blank line
3, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line.

Not recommended
def compute_params(context: object) -> List:
    """
    Passing date and environment parameters to ADF Pipeline
    :param context: ...
    :return: [pDayPath, pMonthPath, pYearPath, pDatePath]
    """

Recommended
from typing import List

def compute_params(context: object) -> List[str]:
    """Pass the date and environment parameters to the ADF pipeline.

    :param context: ...
    :return: a list of strings, [pDayPath, pMonthPath, pYearPath, pDatePath]
    """

None return or parameters

Not recommended
def skip_fn(*args, **kwargs) -> bool:
    """Placeholder function for the skip operator.

    :param args:
    :param kwargs:
    """
    return True

Recommended
def skip_fn(*args, **kwargs) -> bool:
    """Placeholder function for the skip operator."""
    return True

Object Typing

Introduction

Although Python is a dynamically typed language, it is highly recommended to specify the types of the objects used in functions, as well as the expected type of the returned object.

There are many good reasons for this: type hints show at a glance what a function needs, and they are very useful when debugging.
In general, built-in data types are good enough, although typing objects are very useful for collections.
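
To illustrate typing objects for collections, here is a small sketch. The function `count_by_type` and the sample data are hypothetical; `Dict`, `List`, and `Optional` come from the standard typing module.

```python
from typing import Dict, List, Optional


def count_by_type(
    articles: List[Dict[str, str]],
    only_type: Optional[str] = None,
) -> Dict[str, int]:
    """Count articles per type, optionally keeping a single type."""
    counts: Dict[str, int] = {}
    for article in articles:
        article_type = article["type"]
        if only_type is not None and article_type != only_type:
            continue
        counts[article_type] = counts.get(article_type, 0) + 1
    return counts


articles_list = [
    {"id": "1", "type": "A"},
    {"id": "2", "type": "B"},
    {"id": "3", "type": "A"},
]
type_counts = count_by_type(articles_list)
```

The signature alone tells the reader that the function takes a list of string-to-string dictionaries and returns a mapping from type name to count.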

Typing hint

The typing module provides runtime support for type hints. Python does not enforce the hints when the code runs, but static checkers and IDEs can use them to catch type issues in variables, function return values, etc. before the code is executed.

Not recommended
def check_df_empty(target_df):
    return target_df.empty

Recommended
import pandas as pd

def check_df_empty(target_df: pd.DataFrame) -> bool:
    return target_df.empty

Union

In case you need to define a function that can return different objects depending on the context, you can use the Union type.

Union[X, Y] is equivalent to X | Y (the latter syntax is available from Python 3.10) and means either X or Y.

Recommended
from typing import Optional, Union

import pandas as pd
import pyspark.sql

def load_table(
    cfg: dict,
    spark: Optional[pyspark.sql.SparkSession] = None,
) -> Union[pd.DataFrame, pyspark.sql.DataFrame]:
    """Return either a Spark or a pandas DataFrame."""
    pass
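
A runnable variant of the same idea, using only the standard library, is sketched below. The function `parse_number` is a hypothetical example: it returns an int for whole numbers and a float otherwise, so its return type is a Union.

```python
from typing import Union


def parse_number(text: str) -> Union[int, float]:
    """Return an int when the text is a whole number, else a float."""
    try:
        return int(text)
    except ValueError:
        # int() rejects text such as "3.5", so fall back to float().
        return float(text)


whole = parse_number("42")
fractional = parse_number("3.5")
```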

Use of logging over print

Introduction

In general, logging must be used instead of print. If configured correctly, the log level is useful for debugging. Using print is not recommended because its output only goes to the standard output and may not be recoverable afterwards.

What

When: Report events that occur during normal operation of a program (e.g. for status monitoring or fault investigation)
Recommend: logging.info() (or logging.debug() for very detailed output for diagnostic purposes)

When: Issue a warning regarding a particular runtime event
Recommend: logging.warning() if there is nothing the client application can do about the situation, but the event should still be noted

When: Report an error regarding a particular runtime event
Recommend: Raise an exception

When: Report suppression of an error without raising an exception (e.g. error handler in a long-running server process)
Recommend: logging.error(), logging.exception() or logging.critical() as appropriate for the specific error and application domain
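
The choices above can be sketched in one small function. Everything here is hypothetical: `load_rows`, the logger name, and `ListHandler`, which only exists to make the logged messages visible in the example.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Collect formatted records in a list (for demonstration only)."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("logging_levels_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def load_rows(rows: List[dict]) -> int:
    """Hypothetical loader that logs at the levels described above."""
    logger.debug("Received %d rows", len(rows))
    if not rows:
        # Nothing the caller can do about it, but worth noting.
        logger.warning("No rows to load")
        return 0
    loaded = 0
    for row in rows:
        try:
            row["id"]
        except KeyError:
            # The error is suppressed so the remaining rows still load.
            logger.exception("Row without id skipped")
            continue
        loaded += 1
    logger.info("Loaded %d rows", loaded)
    return loaded


loaded_count = load_rows([{"id": 1}, {"name": "x"}])
```

Note that logger.exception() logs at ERROR level and attaches the traceback, which is why it is only called inside the except block.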

Why

The logging package has a lot of useful features:

1, Easy to see where and when (even what line no.) a logging call is being made from.
2, You can log to files, sockets, pretty much anything, all at the same time.
3, You can differentiate your logging based on severity.

If the project is meant to be imported by other Python tools, it is bad practice for the package to print to stdout, since the user likely won't know where the messages come from. With logging, users of your package can choose whether or not to propagate logging messages from your tool.

print doesn't offer any of these.
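
A minimal sketch of how a library and its user might share this responsibility is shown below. The package name "my_library" and the function `do_work` are assumptions for illustration; `logging.NullHandler` and `Logger.isEnabledFor` are standard library features.

```python
import logging

# Library side: a module-level logger that stays silent by default.
# "my_library" is a hypothetical package name.
library_logger = logging.getLogger("my_library")
library_logger.addHandler(logging.NullHandler())


def do_work() -> str:
    """Hypothetical library function that reports progress via logging."""
    library_logger.info("Doing work")
    return "done"


# Application side: choose how much of the library's output to keep.
logging.getLogger("my_library").setLevel(logging.WARNING)

result = do_work()
info_enabled = library_logger.isEnabledFor(logging.INFO)
warning_enabled = library_logger.isEnabledFor(logging.WARNING)
```

With print there is no equivalent switch: the application would have to redirect stdout to silence the library.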

How

Not recommended

print("Hello World!")

Recommended
import logging

logging.basicConfig(level=logging.INFO)  # INFO messages are hidden by default.
logging.info("Hello World!")

Not recommended

try:
    ...  # risky code
except Exception:
    print('this is an exception')

Recommended
import logging

logger = logging.getLogger(__name__)

try:
    ...  # risky code
except Exception:
    logger.exception('this is an exception')

Other recommendations

A mix of single and double quote

In Python, single-quoted strings and double-quoted strings are the same. PEP 8 does not make a recommendation for this. Pick a rule and stick to it. When a string contains single or double quote characters, use the other kind of quote to avoid backslashes; it improves readability.

Not recommended
query = 'metrics_resourcemanager_clustermetrics_CL'
    '| where ClusterType_s == "spark" and TimeGenerated > ago(5m) and ClusterName_s '
    'contains '
    """ + CLUSTER_NAME + """
    '| sort by AggregatedValue desc'

Recommended
query = """metrics_resourcemanager_clustermetrics_CL
    | where ClusterType_s == "spark" and ClusterName_s contains "{CLUSTER_NAME}"
    | sort by AggregatedValue desc""".format(
        CLUSTER_NAME=CLUSTER_NAME
    )

Magic number

Magic numbers make the code difficult:

1, to read and understand
2, to update, because the same value may be repeated in several places instead of being defined once

Not recommended

LOAD = CDHPartitionedDecentralizedDeltaInsertOperator(
    num_partitions_per_batch=100,
    max_partitions_in_total=600,
    megabytes_per_reducer_on_finalize=128,
)

Recommended

PARTITIONED_TABLE = PartitionedTable(
    num_partitions_per_batch=100,
    max_partitions_in_total=600,
    megabytes_per_reducer_on_finalize=128,
)
...
LOAD = CDHPartitionedDecentralizedDeltaInsertOperator(
    num_partitions_per_batch=PARTITIONED_TABLE.num_partitions_per_batch,
    max_partitions_in_total=PARTITIONED_TABLE.max_partitions_in_total,
    megabytes_per_reducer_on_finalize=PARTITIONED_TABLE.megabytes_per_reducer_on_finalize,
)

End with a blank line

Not recommended
from airflow import DAG

with DAG(...) as dag:
    loo = ...
    insert = ...
    loo >> insert
dag.doc_md = doc  # The end of the file is here.

Recommended
from airflow import DAG

with DAG(...) as dag:
    loo = ...
    insert = ...
    loo >> insert
dag.doc_md = doc
# Leave a new blank line with no content here as the end of the file.

Exceptions

When catching exceptions, mention specific exceptions whenever possible instead of using a bare except: clause.

Not recommended

try:
    ...  # risky code
except Exception:
    logger.exception("this is an exception")

Recommended

try:
    ...  # risky code
except KeyError:
    logger.exception("this is a key error")
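
A complete, runnable version of this pattern is sketched below. The function `get_table_path` and the sample paths are hypothetical. Only the expected KeyError is handled; any other error still propagates to the caller.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def get_table_path(
    paths: Dict[str, str],
    table_name: str,
) -> Optional[str]:
    """Return the path for table_name, or None if it is unknown."""
    try:
        return paths[table_name]
    except KeyError:
        # Only the expected failure is handled here.
        logger.warning("Unknown table: %s", table_name)
        return None


known = get_table_path({"sales": "/raw/sales"}, "sales")
unknown = get_table_path({"sales": "/raw/sales"}, "orders")
```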

Lambda expression

Always use a def statement instead of an assignment statement that binds a lambda expression directly to an identifier. The use of the assignment statement eliminates the sole benefit a lambda expression can offer over an explicit def statement (i.e. that it can be embedded inside a larger expression).

Not recommended

f = lambda x: 2 * x

Recommended
def f(x): return 2 * x

Boolean expressions

Use the is not operator rather than not ... is. While both expressions are functionally identical, the former is more readable and preferred.

Not recommended

if not foo is None: pass

Recommended

if foo is not None: pass

Don't compare boolean values to True or False using ==.

Not recommended

if greeting == True: pass

Recommended

if greeting: pass
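
One practical reason for this rule, offered here as an observation rather than something stated in the guideline itself, is that comparing with == True and testing truthiness can give different answers for non-boolean values. The helper `describe` below is hypothetical.

```python
def describe(value: object) -> str:
    """Show how `== True` and plain truthiness can disagree."""
    equals_true = value == True  # noqa: E712 (shown only for contrast)
    is_truthy = bool(value)
    return f"equals_true={equals_true}, is_truthy={is_truthy}"


# A non-empty string is truthy but does not equal True.
text_result = describe("hello")
# The integer 1 is both truthy and equal to True.
one_result = describe(1)
```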

Specify param when calling a function

When calling a function, it is best to name the parameters you are passing instead of relying on their order. This minimises the chance of making an error.

Not recommended
def subtract_these_numbers(a: int, b: int) -> int:
    return a - b

subtracted_value = subtract_these_numbers(8, 4)

Recommended
def subtract_these_numbers(a: int, b: int) -> int:
    return a - b

subtracted_value = subtract_these_numbers(a=8, b=4)
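
If a function should only ever be called with named arguments, Python can enforce that with keyword-only parameters. This goes one step beyond the recommendation above and is shown as an optional sketch: the bare * in the signature makes a positional call raise a TypeError.

```python
def subtract_these_numbers(*, a: int, b: int) -> int:
    """Return a - b; the bare * makes both parameters keyword-only."""
    return a - b


subtracted_value = subtract_these_numbers(a=8, b=4)

# A positional call is rejected instead of silently swapping arguments.
try:
    subtract_these_numbers(8, 4)
    positional_call_allowed = True
except TypeError:
    positional_call_allowed = False
```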