Python Standards and Best Practices
The Zen of Python, by Tim Peters
This page contains a list of proposed guidelines to (mostly) improve the formatting of our code, and create more consistency across developers. It's a proposal for discussion and should be considered as such.
These guidelines are based on PEP 8, PEP 257, and PEP 20, with some additions. They will be refined continuously.
A video presentation and a related slide can be found here for this doc.
Details
Code, like a sincere love letter, communicates a developer's logic and approach. Just as poets craft resonance with literary devices, good code applies design principles such as SOLID, DRY, KISS, and YAGNI. By following best practices such as naming conventions and modularity, good code becomes a lasting legacy for future developers, balancing comprehension and efficiency.
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
...
Python developers should prioritize code readability and best practices because they lead to easier maintenance, efficient collaboration, fewer bugs, smoother scalability, a consistent codebase, streamlined documentation, a faster learning curve for new developers, expedited code reviews, code longevity, and professional coding standards.
Imports
Grouping imports
Imports should be grouped in the following order:
1, Standard library imports.
2, Related third-party imports (pip install).
3, Local application/library specific imports.
You should put a blank line between each group of imports.
Not recommended
import pendulum
from airflow import DAG
from cdh.airflow import global_default_args
from datetime import datetime
Recommended
from datetime import datetime

from airflow import DAG
import pendulum

from cdh.airflow import global_default_args
Order alphabetically
Not recommended
from cdh.variables import CDH_DAG_PREFIX, ...
from cdh.variables import HAAL_ADL_ROOT
from cdh.airflow.sensors.adf_activity_run_sensor import ADFActivityRunSensor
from cdh.py.dl_raw import partitioned as load_partitioned
from cdh.airflow.operators.cdh_hive_operator import CDHHiveOperator
from cdh.airflow.sensors.cdh_table_source_file_is_ready_sensor import CDHTableSourceFileIsReadySensor
from cdh.py.dl_raw import unpartitioned as load_unpartitioned
from cdh.airflow.operators.cdh_access_rights_operator import CDHAccessRightsOperator
from cdh.airflow.operators.cdh_load_status_operator import CDHLoadStatusOperator
from cdh.airflow.operators.cdh_compute_stats_operator import CDHComputeStatsOperator
from cdh.airflow import global_default_args
Recommended
from cdh.airflow import global_default_args
from cdh.airflow.operators.cdh_access_rights_operator import CDHAccessRightsOperator
from cdh.airflow.operators.cdh_compute_stats_operator import CDHComputeStatsOperator
from cdh.airflow.operators.cdh_hive_operator import CDHHiveOperator
from cdh.airflow.operators.cdh_load_status_operator import CDHLoadStatusOperator
from cdh.airflow.sensors.adf_activity_run_sensor import ADFActivityRunSensor
from cdh.airflow.sensors.cdh_table_source_file_is_ready_sensor import CDHTableSourceFileIsReadySensor
from cdh.py.dl_raw import partitioned as load_partitioned
from cdh.py.dl_raw import unpartitioned as load_unpartitioned
from cdh.variables import CDH_DAG_PREFIX, ...
Break long lines
Introduction
Overly long lines make it difficult:
1, to have several files next to each other
2, when doing code review, to see which part of the line has changed
3, to count how many arguments there are etc.
For these reasons and to keep it consistent within the team and to help readability, the recommended maximum length for a single line is 79 characters.
PEP8: Trailing commas
Trailing commas are often helpful when a list of values, arguments, or imported items is expected to be extended over time. The pattern is to put each value (etc.) on a line by itself, always adding a trailing comma, and to put the closing parenthesis/bracket/brace on the next line. Therefore:
1, always use the multi-line form when there are 3 or more parameters/elements
2, for 2 parameters/elements, use your common sense
3, keep in mind that readability > rules
Strings
Not recommended
PartitionedTable = collections.namedtuple(
    "PartitionedTable",
    (
        "cdh_target_schema_name cdh_target_table_name adf_activity_name adf_name "
        "adf_pipeline_name do_not_deduplicate primary_key_columns_list "
        "primary_key_sort_columns_list on_insert_sort_by min_adf_completion_count"
    )
)
Recommended
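PartitionedTable = collections.namedtuple(
    "PartitionedTable",
    (
        "cdh_target_schema_name",
        "cdh_target_table_name",
        "adf_activity_name",
        "adf_name",
        "adf_pipeline_name",
        "do_not_deduplicate",
        "primary_key_columns_list",
        "primary_key_sort_columns_list",
        "on_insert_sort_by",
        "min_adf_completion_count",
    ),
)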
Arrays
Not recommended
my_array = [element1, element2, element3, ]
Not recommended
my_array = [
    element1,
    element2,
    # ...
    element4
]
Recommended
my_array = [
    element1,
    element2,
    # ...
    element4,  # note the final comma: adding a new element gives a clean diff
]
Parameters
It is also highly recommended to state the type of each object/parameter explicitly. This helps with understanding the code and with detecting inconsistencies and bugs.
Not recommended
def my_methods(param1, param2, param3, ):
Not recommended
def my_methods(
    param1,
    param2,
    param3
):
Not recommended
def my_methods(param1, param2, ):
Recommended
def my_methods(
    param1,
    param2,
    param3,  # don't forget the final ,
):
List comprehensions
While sometimes it's okay to put an if/for/while with a small body on the same line, never do this for multi-clause statements. Also, avoid folding such long lines!
Not recommended
articles_ids = [articles["id"] for articles in articles_list if articles["type"] == "A"]
Recommended
articles_ids = [
    articles["id"]
    for articles in articles_list
    if articles["type"] == "A"
]
Imports
Not recommended
from cdh.variables import CDH_DAG_PREFIX, CDH_SCHEMA_PREFIX, CDH_WAREHOUSE_ROOT, DATA_FACTORY_V2, DATA_FACTORY_V1, CDH_TASK_PREFIX, CDH_INGRESS_ROOT, CDH_ENV_NAME, LOCAL_TZ
Recommended
from cdh.variables import (
    CDH_DAG_PREFIX,
    CDH_ENV_NAME,
    CDH_INGRESS_ROOT,
    CDH_SCHEMA_PREFIX,
    CDH_TASK_PREFIX,
    CDH_WAREHOUSE_ROOT,
    DATA_FACTORY_V1,
    DATA_FACTORY_V2,
    LOCAL_TZ,
)
Classes and functions
Introduction
These guidelines provide the recommended way to name functions and classes. In general, the most important takeaways are:
1, PascalCase for classes (e.g., MyClass)
2, snake_case for variables and functions (e.g., function_to_calculate_average)
Constants
Constants are usually defined at module level and written in all capital letters with underscores separating words. They can be defined in the same script where they are used. However, where possible, it is recommended to keep them in a separate module (something like constants.py) with a coherent scope, from which they can be imported. This makes the constants easier to modify when required.
Not recommended
tuple_fields = "cdh_target_schema_name ..."
Recommended
TUPLE_FIELDS = "cdh_target_schema_name ..."
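As a minimal sketch of the separate-module idea above, assuming a hypothetical constants.py: the constant names, values, and the helper function below are illustrative only and are not taken from the project.

```python
# constants.py (hypothetical module; names and values are illustrative)
from typing import Final

# Maximum line length recommended by this guideline.
MAX_LINE_LENGTH: Final[int] = 79

# Illustrative default, not a value taken from the project.
DEFAULT_LOG_LEVEL: Final[str] = "INFO"


def is_line_too_long(line: str) -> bool:
    """Return True if the line exceeds MAX_LINE_LENGTH characters."""
    return len(line) > MAX_LINE_LENGTH
```

Other modules would then use something like `from constants import MAX_LINE_LENGTH`, so a change to the value happens in one place only.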
Avoid using abbreviations
Abbreviations may look useful when programming/developing. However, it is recommended to avoid using them since projects usually involve many people and the meaning of abbreviations may be forgotten over time.
Not recommended
# inc => increment / include ?
def inc_ctr() -> None:
    pass
Recommended
def include_category() -> None: pass
Not recommended
# creds_info => credit / credential ?
self.creds_info = azureapi.get_credential(cred_name=self.cluster_name, cred_type="hdi")
Recommended
self.credential_info = azureapi.get_credential(
    credential_name=self.cluster_name,
    credential_type="hdi",
)
Classes
Use one leading underscore only for non-public methods and instance variables (see PEP 8).
Recommended
class MyClass:
    def _private_fun(self):
        return 0

    def public_fun(self):
        return self._private_fun()
Comments & DocString
Introduction
Docstrings are key in the development of code. Since any project is bound to be shared by many people and over a long period of time, clear documentation of what each function does, what parameters (and their types) it expects and what it returns is mandatory.
Docstrings should prevail over comments, although comments are recommended when necessary.
DocString
1, The FE team uses the Google docstring style.
2, Use the automation tool pyment to auto-format docstrings.
Write complete sentences
Comments should be complete sentences. The first word should be capitalized.
Not recommended
# defining skip function
def skip_fn(*args, **kwargs) -> str:
    ...
Recommended
# Define a skip function.
def skip_fn(*args, **kwargs) -> str:
    ...
One-line Docstrings (PEP257)
One-liners are for really obvious cases. They should really fit on one line.
Not recommended
def skip_fn(*args, **kwargs) -> bool:
    """
    Define a placeholder function for the skip operator.
    """
    return True
Recommended
def skip_fn(*args, **kwargs) -> bool:
    """Define a placeholder function for the skip operator."""
    return True
Multi-line Docstrings
Block comments generally consist of one or more paragraphs built out of complete sentences, with each sentence ending in a period.
Multi-line docstrings consist of:
1, a summary line just like a one-line docstring
2, followed by a blank line
3, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line.
Not recommended
from typing import List


def compute_params(context: object) -> List:
    """
    Passing date and environment parameters to ADF Pipeline
    :param context: ...
    :return: [pDayPath, pMonthPath, pYearPath, pDatePath]
    """
Recommended
from typing import List


def compute_params(context: object) -> List:
    """Pass the date and environment parameters to ADF Pipeline.

    :param context: ...
    :return: a list of strings, [pDayPath, pMonthPath, pYearPath, pDatePath]
    """
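For comparison, the same summary-line, blank-line, description structure can be written in the Google docstring style mentioned earlier. This is only a sketch: the function name, its arguments, and the path layout are hypothetical and chosen purely to illustrate the docstring sections.

```python
from typing import List


def build_date_paths(year: int, month: int, day: int) -> List[str]:
    """Build the date path fragments for a given calendar date.

    The function name, arguments, and path layout are hypothetical and
    only illustrate the docstring structure.

    Args:
        year: Four-digit year, e.g. 2024.
        month: Month number from 1 to 12.
        day: Day of the month from 1 to 31.

    Returns:
        A list of strings: [day path, month path, year path].
    """
    year_path = f"{year:04d}"
    month_path = f"{year_path}/{month:02d}"
    day_path = f"{month_path}/{day:02d}"
    return [day_path, month_path, year_path]
```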
Empty return or parameter entries
Omit :param: and :return: entries that carry no description.
Not recommended
def skip_fn(*args, **kwargs) -> bool:
    """Placeholder function for the skip operator.

    :param args:
    :param kwargs:
    """
    return True
Recommended
def skip_fn(*args, **kwargs) -> bool:
    """Placeholder function for the skip operator."""
    return True
Object Typing
Introduction
Although Python is a dynamically typed language, it is highly recommended to specify the types of the objects used in functions, as well as the expected type of the returned object.
There are many good reasons for this: it shows at a glance what the function needs, and it is very useful when debugging.
In general, built-in data types are good enough, although typing objects are very useful for collections.
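As an illustration of typing objects for collections, here is a minimal sketch. The function and the shape of the data (a list of dictionaries with a "type" key) are assumptions made for the example.

```python
from typing import Dict, List


def count_by_type(articles: List[Dict[str, str]]) -> Dict[str, int]:
    """Count how many articles there are of each type."""
    counts: Dict[str, int] = {}
    for article in articles:
        article_type = article["type"]
        counts[article_type] = counts.get(article_type, 0) + 1
    return counts
```

The signature alone tells the reader that the function takes a list of string-to-string dictionaries and returns a mapping from type name to count.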
Typing hint
The typing module provides runtime support for type hints. Type hints help developers check the code in advance for type issues in variables, function return values, etc.
Not recommended
def check_df_empty(target_df):
    return True
Recommended
import pandas as pd


def check_df_empty(target_df: pd.DataFrame) -> bool:
    return True
Union
In case you need to define a function that can return different objects depending on the context, you can use the Union type.
Union[X, Y] means either X or Y. From Python 3.10 onwards it can also be written as X | Y.
Recommended
from typing import Optional, Union

import pandas as pd
import pyspark.sql


def load_table(
    cfg: dict,
    spark: Optional[pyspark.sql.SparkSession] = None,
) -> Union[pd.DataFrame, pyspark.sql.DataFrame]:
    """Return either a Spark or pandas DataFrame."""
    pass
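The same idea can be tried without pandas or Spark installed. The sketch below uses a hypothetical helper that accepts either a single id or a list of ids (Union), plus an optional limit (Optional, which is shorthand for Union[int, None]).

```python
from typing import List, Optional, Union


def normalize_ids(
    ids: Union[int, List[int]],
    limit: Optional[int] = None,
) -> List[int]:
    """Return ids as a list, optionally cut to the first `limit` items."""
    id_list = [ids] if isinstance(ids, int) else list(ids)
    if limit is not None:
        id_list = id_list[:limit]
    return id_list
```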
Use of logging over print
Introduction
In general, logging should be used instead of print. When configured correctly, the log level is useful for debugging. The use of print is not recommended because its output only goes to standard output and may not be recoverable afterwards.
What
| When | Recommended |
| --- | --- |
| Report events that occur during normal operation of a program (e.g. for status monitoring or fault investigation) | logging.info() (or logging.debug() for very detailed output for diagnostic purposes) |
| Issue a warning regarding a particular runtime event | logging.warning() if there is nothing the client application can do about the situation, but the event should still be noted |
| Report an error regarding a particular runtime event | Raise an exception |
| Report suppression of an error without raising an exception (e.g. error handler in a long-running server process) | logging.error(), logging.exception() or logging.critical() as appropriate for the specific error and application domain |
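The cases above can be sketched in one small example. The functions, the port-parsing scenario, and the default value are hypothetical and only serve to show which call fits which situation.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port(raw_value: str) -> int:
    """Parse a port number, raising an error the caller can handle."""
    # Very detailed diagnostic output.
    logger.debug("Parsing port from %r", raw_value)
    # Reporting an error: int() raises ValueError on bad input.
    port = int(raw_value)
    if port < 1024:
        # Nothing the caller can do, but the event is worth noting.
        logger.warning("Port %d is in the privileged range", port)
    return port


def parse_port_or_default(raw_value: str, default: int = 8080) -> int:
    """Suppress the error, log it with its traceback, and fall back."""
    try:
        return parse_port(raw_value)
    except ValueError:
        logger.exception("Invalid port %r, using %d", raw_value, default)
        return default
```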
Why
The logging package has a lot of useful features:
1, Easy to see where and when (even what line no.) a logging call is being made from.
2, You can log to files, sockets, pretty much anything, all at the same time.
3, You can differentiate your logging based on severity.
If the project is meant to be imported by other Python tools, it is bad practice for the package to print to stdout, since the user likely won't know where the print messages are coming from. With logging, users of your package can choose whether or not to propagate logging messages from your tool.
print doesn't have any of these.
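A minimal sketch of that separation, assuming a hypothetical package name "my_package": the library only emits records through its own logger, and the application decides where they go and at which level. The in-memory stream stands in for a file or any other destination.

```python
import io
import logging

# Library side: a named logger with a NullHandler, so importing the
# library never prints anything on its own.
library_logger = logging.getLogger("my_package")  # hypothetical name
library_logger.addHandler(logging.NullHandler())


def load_data() -> int:
    """Pretend to load data and report progress through logging."""
    library_logger.info("Loading data")
    return 42


# Application side: the user chooses the destination, format and level.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
library_logger.addHandler(handler)
library_logger.setLevel(logging.INFO)

result = load_data()
captured = log_stream.getvalue()
```

Here `captured` contains the record formatted with the logger name and level, which is exactly the "where is this message coming from" information that print cannot provide.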
How
Not recommended
print("Hello World!")
Recommended
import logging

logging.info("Hello World!")
Not recommended
try:
    ...  # risky code
except Exception:
    print("this is an exception")
Recommended
import logging

logger = logging.getLogger(__name__)

try:
    ...  # risky code
except Exception:
    logger.exception("this is an exception")
Other recommendations
A mix of single and double quote
In Python, single-quoted strings and double-quoted strings are the same. PEP 8 does not make a recommendation for this. Pick a rule and stick to it. When a string contains single or double quote characters, use the other kind of quote to avoid backslashes in the string. It improves readability.
Not recommended
query = (
    'metrics_resourcemanager_clustermetrics_CL'
    '| where ClusterType_s == "spark" and TimeGenerated > ago(5m) and ClusterName_s '
    'contains ' "\"" + CLUSTER_NAME + "\""
    '| sort by AggregatedValue desc'
)
Recommended
query = """metrics_resourcemanager_clustermetrics_CL
| where ClusterType_s == "spark" and ClusterName_s contains "{CLUSTER_NAME}"
| sort by AggregatedValue desc""".format(
    CLUSTER_NAME=CLUSTER_NAME
)
Magic number
Magic numbers make the code difficult:
1, to read and understand
2, to change, as the same value may be duplicated in several places
Not recommended
LOAD = CDHPartitionedDecentralizedDeltaInsertOperator(
    num_partitions_per_batch=100,
    max_partitions_in_total=600,
    megabytes_per_reducer_on_finalize=128,
)
Recommended
PARTITIONED_TABLE = PartitionedTable(
    num_partitions_per_batch=100,
    max_partitions_in_total=600,
    megabytes_per_reducer_on_finalize=128,
)
...
LOAD = CDHPartitionedDecentralizedDeltaInsertOperator(
    num_partitions_per_batch=PARTITIONED_TABLE.num_partitions_per_batch,
    max_partitions_in_total=PARTITIONED_TABLE.max_partitions_in_total,
    megabytes_per_reducer_on_finalize=PARTITIONED_TABLE.megabytes_per_reducer_on_finalize,
)
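The same principle in a self-contained form, as a small sketch: the constant names, values, and helper functions are illustrative assumptions and are not part of the project code.

```python
# Named constants instead of magic numbers; the values are illustrative.
SECONDS_PER_MINUTE = 60
MAX_RETRY_COUNT = 3


def minutes_to_seconds(minutes: int) -> int:
    """Convert minutes to seconds using the named constant."""
    return minutes * SECONDS_PER_MINUTE


def should_retry(attempt: int) -> bool:
    """Return True while the attempt number is below the retry limit."""
    return attempt < MAX_RETRY_COUNT
```

A reader sees what each number means, and changing the retry limit is a one-line edit.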
End with a blank line
Not recommended
from airflow import DAG

with DAG(...) as dag:
    loo = ...
    insert = ...
    loo >> insert
    dag.doc_md = doc  # The end of the file is here.
Recommended
from airflow import DAG

with DAG(...) as dag:
    loo = ...
    insert = ...
    loo >> insert
    dag.doc_md = doc
# Leave a new blank line with no content here as the end of the file.
Exceptions
When catching exceptions, catch specific exceptions whenever possible instead of using a bare except: clause or a blanket except Exception: clause.
Not recommended
try:
    ...  # risky code
except Exception:
    logger.exception("this is an exception")
Recommended
try:
    ...  # risky code
except KeyError:
    logger.exception("this is a key error")
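A runnable sketch of the recommended form, using a hypothetical configuration lookup: only the expected KeyError is handled, so any other unexpected error still surfaces to the caller.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def get_table_name(config: Dict[str, str]) -> Optional[str]:
    """Return the configured table name, or None if it is missing."""
    try:
        return config["table_name"]
    except KeyError:
        # Only the expected failure is handled here; any other
        # exception still propagates to the caller.
        logger.exception("Missing 'table_name' in config")
        return None
```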
Lambda expression
Always use a def statement instead of an assignment statement that binds a lambda expression directly to an identifier. The use of the assignment statement eliminates the sole benefit a lambda expression can offer over an explicit def statement (i.e. that it can be embedded inside a larger expression).
Not recommended
f = lambda x: 2 * x
Recommended
def f(x): return 2 * x
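By contrast, a lambda remains a good fit when it is embedded inside a larger expression and never bound to a name. The data below is made up for the illustration.

```python
articles = [
    {"id": 3, "type": "B"},
    {"id": 1, "type": "A"},
    {"id": 2, "type": "A"},
]

# The lambda is passed directly as the sort key and has no name.
sorted_ids = [
    article["id"]
    for article in sorted(articles, key=lambda article: article["id"])
]
```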
Boolean expressions
Use the is not operator rather than not ... is. While both expressions are functionally identical, the former is more readable and preferred.
Not recommended
if not foo is None: pass
Recommended
if foo is not None: pass
Don't compare boolean values to True or False using ==.
Not recommended
if greeting == True: pass
Recommended
if greeting: pass
Specify param when calling a function
When calling a function, it is best to name the parameters you are passing instead of relying on their order. This minimises the chance of making an error.
Not recommended
def subtract_these_numbers(a: int, b: int) -> int:
    return a - b


subtracted_value = subtract_these_numbers(8, 4)
Recommended
def subtract_these_numbers(a: int, b: int) -> int:
    return a - b


subtracted_value = subtract_these_numbers(a=8, b=4)
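Where argument order is especially easy to get wrong, the function itself can require named arguments. The sketch below is an optional extension of the guideline rather than part of it: a bare * in the signature makes every following parameter keyword-only. The function and parameter names are illustrative.

```python
def subtract(*, minuend: int, subtrahend: int) -> int:
    """Return minuend minus subtrahend; arguments must be passed by name."""
    return minuend - subtrahend


difference = subtract(minuend=8, subtrahend=4)
```

Calling `subtract(8, 4)` raises a TypeError, so swapped positional arguments cannot slip through unnoticed.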