SQL Standards and Best Practices

SQL Standards and Best Practices

The success of data engineering projects

This page contains SQL guidelines on how to format your SQL. Proper SQL formatting is vital for precise, efficient, and sustainable data pipelines and databases. It minimizes errors, boosts collaboration, improves performance, and fuels the triumph of data engineering projects.

A video presentation and a related slide can be found here for this doc.

  • SQL Standards and Best Practices
  • Upper or Lower Case
  • Alias
  • Column Order
  • Code Comments
  • Auto-optimize tool

Format SQL

Purposes

1, To make the code more consistent regardless of developers.
2, To make the code easier and faster to read and understand.

Checklist

What to think about when writing SQL code:

1, Use format tool (PoorSQL)
2, Upper Case keyword, DB and Tables
3, Use good alias
4, Think about the column order
5, Comment your code
6, Anything else (e.g. ending semicolon ;, SORT BY, etc.)

Tool (Poor SQL)

To help us make the code more consistent, we are using a web-based tool called PoorSQL. PoorSQL is a small and easy-to-use website that will help format your code so you don't have to do everything manually.

It will take care of formatting issues, such as Indent, Line Break, Upper Case T-SQL keyword, and so on. PoorSQL is compatible with T-SQL code and as such will not understand keywords that are not used in T-SQL. Those cases will instead have to be handled manually, e.g. SORT BY.

In PoorSQL, change the setting by:
1, set Max Width (chars) is 79
2, check Trailing CommasX
3, check Break Join ON Sections
4, All other settings keep as is

Upper or Lower Case

There is no standard to when to use upper or lower case in SQL. Therefore, PoorSQL will not take care of it and instead requires manual handling.

Things that should be uppercase:

1, SQL keywords (e.g. a list of HIVE SQL keywords), PoorSQL will handle the T-SQL ones.
2, Schema Name
3, Table Name
4, Column Datatype

Tip: If you have a lot of code that needs converting first make all smaller cases with here then use PoorSQL.

Alias

Aliases are used to make the SQL easier to read by giving columns, tables, and sub-selects a "shorter name".

Keep in mind:

1, Always use an alias when more than one table is involved.
2, Keep it short, but make it relevant.

Example of aliases:

1, If its just one-word use first 3-4 letter e.g. DIM.DIM_ARTICLE AS art.
2, If its a snake case use the first letter from every word e.g. DIM.PRODUCT_ARTICLE_DATAMART AS pad.

Column Order

To make the table and query more readable, we want to structure it so things come in a logical order.

Column order left to right:

1, Primary key
2, Foreign key
3, Attributes
4, Metadata (non biz related columns, e.g in GDO tables, there is always a column called “run_id“)

Also, think about the order internally, keep its business logic, and go from the lowest granularity and up. e.g. DO NOT mix of product information and location every second column; Group the columns with business logic - product group columns and location group columns.

Not recommended

SELECT store, article, country, corporate brand, department, planning market, product ...

Recommended

SELECT store, country, planning market, corporate brand, article, product, department ...

Code Comments

For comments, it is preferable to add more rather than less. Keep in mind that someone without background information still needs to be able to understand the code. It’s always more important to comment WHY, instead of WHAT.

Example of comments:

1, (Optional) This code is doing Y
2, (Mandatory) Filtering on X due to Y
3, (Mandatory) Casting X due to Y
4, (Mandatory) JOIN -> why

Example

This is an example of the code that we have today that have been changed with the steps above.

Before Changes

INSERT INTO TABLE {SCHEMA_PREFIX}STAGE.ECC_APPOINTMENT_SCHEDULING_FINAL PARTITION (DELIVERY_MONTH) SELECT INBOUND_DELIVERY , lpad(variant, 18, '0') as VARIANT , ALLOCATION_WAREHOUSE , DELIVERY_QUANTITY , DELIVERY_DATE , HM_ORDER_NO , APPOINTMENT_SCHEDULE_DONE, a.ARTICLE_ID , a.PAS_CODE, a.PRODUCT_SEASON_ID , L.PLANNING_MARKET_ID, L.LOCATION_ID, ETL_LOAD_ID, DELIVERY_MONTH FROM {SCHEMA_PREFIX}STAGE.ECC_APPOINTMENT_SCHEDULING_DISTINCT S left join ( select pas_code, product_season_id, article_id from {SCHEMA_PREFIX}dma.product_article_datamart) a on a.pas_code = substring(lpad(variant, 18, '0'), 3, 13) left join ( select distinct loc.location_id, loc.planning_market_id, wtl.site_code from {SCHEMA_PREFIX}dim.dim_location loc inner join {SCHEMA_PREFIX}struct.warehouse_to_location wtl on wtl.location_code = loc.location_code ) l on l.site_code = s.ALLOCATION_WAREHOUSE sort by PRODUCT_SEASON_ID, PLANNING_MARKET_ID ;

Changes are done:

1, Used PoorSQL
2, Removed code that is not needed (sub-selects)
3, Upper and Lower case
4, Order of select statement
5, Added alias

After Changes

INSERT INTO TABLE {SCHEMA_PREFIX}STAGE.ECC_APPOINTMENT_SCHEDULING_FINAL PARTITION (delivery_month) SELECT seasd.inbound_delivery, seasd.delivery_date, seasd.allocation_warehouse, loc.location_id, loc.planning_market_id, lpad(seasd.variant, 18, '0') AS variant, padm.pas_code, padm.article_id, padm.product_season_id, seasd.hm_order_no, seasd.appointment_schedule_done, seasd.delivery_quantity, seasd.etl_load_id, seasd.delivery_month FROM {SCHEMA_PREFIX}STAGE.ECC_APPOINTMENT_SCHEDULING_DISTINCT seasd LEFT JOIN {SCHEMA_PREFIX}DMA.PRODUCT_ARTICLE_DATAMART padm ON padm.pas_code = substring(lpad(seasd.variant, 18, '0'), 3, 13) LEFT JOIN {SCHEMA_PREFIX}STRUCT.WAREHOUSE_TO_LOCATION wtl ON wtl.site_code = seasd.allocation_warehouse INNER JOIN {SCHEMA_PREFIX}DIM.DIM_LOCATION loc ON wtl.location_id = loc.location_id SORT BY seasd.product_season_id, seasd.planning_market_id ;