SQL Style Guide

It is our collective responsibility to enforce this Style Guide.

SQLFluff Linting

We use SQLFluff as our linter, which enforces most of our Style Guide, although some limitations remain.

Over time, we expect an increasing number of these guidelines will be handled by the SQLFluff lint checks. Areas that are known to be checked by SQLFluff are marked as such.

The Dev Standards guide now has guidance for SQL code, including SQLFluff settings file defaults.

SQLFluff has a Rules Reference with descriptions of all rules and many helpful examples. We recommend everyone read or scan through the rules documentation (and the other SQLFluff docs) at least once.

You can also become familiar with these rules by installing the SQLFluff VS Code extension, which gives real-time lint feedback and has autoformat capabilities.

Casing

These are enforced by SQLFluff:

Field names should all be lowercase.
Keywords should be UPPERCASE.
Function names should be UPPERCASE.

General Formatting

These are enforced by SQLFluff:

No tabs should be used, only spaces. Your editor should be set up to convert tabs to spaces.
Lines of SQL should be no longer than 80 characters.
Commas should be at the end-of-line (EOL) as a right comma.
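
As a minimal sketch of the casing and formatting rules above taken together, the query below uses lowercase field names, uppercase keywords and function names, 4-space indentation, trailing commas, and lines under 80 characters. The table and column names are hypothetical and only serve as illustration.

```sql
-- Hypothetical table and column names, for illustration only.
SELECT
    account_id,
    UPPER(account_name) AS account_name_upper,
    COUNT(*) AS opportunity_count
FROM sfdc_opportunity
GROUP BY 1, 2
```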

Field Naming and Reference Conventions

An id, name, or generally ambiguous value such as type should always be prefixed by what it is identifying or naming

-- Good
SELECT
    id AS account_id,
    name AS account_name,
    type AS account_type,
    ...

-- Bad
SELECT
    id,
    name,
    type,
    ...

When joining to any data from a different source, a field should be prefixed with the data source, e.g. sfdc_account_id, to avoid ambiguity

-- Good
SELECT
    sfdc_account.account_id AS sfdc_account_id,
    zuora_account.account_id AS zuora_account_id
FROM sfdc_account
LEFT JOIN zuora_account ON ...

-- Bad
SELECT
    sfdc_account.account_id,
    zuora_account.account_id AS zuora_id
FROM sfdc_account
LEFT JOIN zuora_account ON ...

When joining tables and referencing columns from both, strongly prefer to reference the full table name instead of an alias. When the table name is long (roughly 20 characters or more), first try to rename the CTE, and only as a last resort alias it to something descriptive.

-- Good
SELECT
    budget_forecast_cogs_opex.account_id,
    -- 15 more columns
    date_details.fiscal_year,
    date_details.fiscal_quarter,
    date_details.fiscal_quarter_name,
    cost_category.cost_category_level_1,
    cost_category.cost_category_level_2
FROM budget_forecast_cogs_opex
LEFT JOIN date_details
    ON date_details.first_day_of_month = budget_forecast_cogs_opex.accounting_period
LEFT JOIN cost_category
    ON budget_forecast_cogs_opex.unique_account_name = cost_category.unique_account_name

-- Ok, but not preferred. Consider renaming the CTE in lieu of aliasing
SELECT
    bfcopex.account_id,
    -- 15 more columns
    date_details.fiscal_year,
    date_details.fiscal_quarter,
    date_details.fiscal_quarter_name,
    cost_category.cost_category_level_1,
    cost_category.cost_category_level_2
FROM budget_forecast_cogs_opex bfcopex
LEFT JOIN date_details
    ON date_details.first_day_of_month = bfcopex.accounting_period
LEFT JOIN cost_category
    ON bfcopex.unique_account_name = cost_category.unique_account_name

-- Bad
SELECT
    a.*,
    -- 15 more columns
    b.fiscal_year,
    b.fiscal_quarter,
    b.fiscal_quarter_name,
    c.cost_category_level_1,
    c.cost_category_level_2
FROM budget_forecast_cogs_opex a
LEFT JOIN date_details b
    ON b.first_day_of_month = a.accounting_period
LEFT JOIN cost_category c
    ON a.unique_account_name = c.unique_account_name

All field names should be snake-cased

-- Good
SELECT
    dvcecreatedtstamp AS device_created_timestamp,
    account_id
FROM table

-- Bad
SELECT
    dvcecreatedtstamp AS DeviceCreatedTimestamp,
    account_id
FROM table

Boolean field names should start with has_, is_, or does_

-- Good
SELECT
    deleted AS is_deleted,
    sla AS has_sla
FROM table

-- Bad
SELECT
    deleted,
    sla
FROM table

When transforming source data, use double quotes to identify case-sensitive columns or columns that contain special characters other than “$” or “_”. Double quotes aren’t needed for all-uppercase field names, because Snowflake stores unquoted identifiers in uppercase internally.
```
-- Good
SELECT "First_Name_&_" AS first_name,

-- Bad
SELECT "FIRST_NAME" AS first_name,
```

Dates

Timestamps should end with _at, e.g. deal_closed_at, and should always be in UTC
Dates should end with _date, e.g. deal_closed_date
Months should be indicated as such and should always be truncated to a date format, e.g. deal_closed_month
Never use keywords such as date or month as column names
Prefer the explicit date function over date_part, but prefer date_part over extract, e.g. DAYOFWEEK(created_at) > DATE_PART(dayofweek, created_at) > EXTRACT(dow FROM created_at)
- Note that selecting a date’s part is different from truncating the date. date_trunc('month', created_at) will produce the calendar month (‘2019-01-01’ for ‘2019-01-25’) while SELECT date_part('month', '2019-01-25'::date) will produce the number 1
Be careful using DATEDIFF, as the results are often non-intuitive.
- For example, SELECT DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') returns 1 even though the timestamps differ by only one millisecond.
- Similarly, SELECT DATEDIFF('days', '2001-12-01 00:00:00.001', '2001-12-01 23:59:59.999') returns 0 even though the timestamps are nearly an entire day apart.
- Using the appropriate interval with the DATEDIFF function will ensure you are getting the right results. For example, DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') will provide a 1 day interval and DATEDIFF('ms', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') will provide a 1 millisecond interval.
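
The naming and function preferences in this section can be sketched in a single query. This is an illustration under assumptions: the sfdc_opportunity table and its created_at and closed_at columns are hypothetical, and the timestamps are assumed to already be in UTC.

```sql
-- Hypothetical source table and columns, for illustration only.
SELECT
    -- Timestamps end with _at
    closed_at AS deal_closed_at,
    -- Dates end with _date
    closed_at::DATE AS deal_closed_date,
    -- Months are truncated to a date, not reduced to a number
    DATE_TRUNC('month', closed_at)::DATE AS deal_closed_month,
    -- Explicit date function preferred over DATE_PART and EXTRACT
    DAYOFWEEK(closed_at) AS deal_closed_day_of_week,
    -- Counts day boundaries crossed, not elapsed 24-hour periods
    DATEDIFF('days', created_at, closed_at) AS days_to_close
FROM sfdc_opportunity
```

If elapsed time matters more than calendar boundaries, a finer interval such as 'ms' or 'seconds' is likely the safer choice for the DATEDIFF call.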

Use CTEs (Common Table Expressions), not subqueries

CTEs make SQL more readable and are more performant
Use CTEs to reference other tables. Think of these as import statements
CTEs should be placed at the top of the query
Where performance permits, CTEs should perform a single, logical unit of work
CTE names should be as concise as possible while still being clear
- Avoid long names like replace_sfdc_account_id_with_master_record_id and prefer a shorter name with a comment in the CTE. This will help avoid table aliasing in joins
CTEs with confusing or notable logic should be commented in the file and documented in dbt docs
CTEs that are duplicated across models should be pulled out into their own models
Leave an empty row above and below the query statement

CTEs should be formatted as follows:

WITH events AS ( -- think of these select statements as your import statements.

    ...

), filtered_events AS ( -- CTE comments go here

    ...

)

SELECT * -- you should always aim to "select * from final" for your last model
FROM filtered_events

CTEs and Subqueries

Within a CTE, the entire SQL statement should be indented 4 spaces

-- Good
WITH my_data AS (

    SELECT *
    FROM prod.my_data
    WHERE filter = 'my_filter'

)

-- Bad
WITH my_data AS (

  SELECT *
  FROM prod.my_data
  WHERE filter = 'my_filter'

)

Indentation within a query (e.g. columns, JOIN clauses, multi-line GROUP BY, etc.) should also be 4 spaces

-- Good
SELECT
    table_1.column_name1,
    table_1.column_name2,
    table_1.column_name3
FROM table_1
JOIN table_2
    ON table_1.id = table_2.id
WHERE table_2.clouds = TRUE
    AND table_2.gem = TRUE
GROUP BY 1, 2, 3
HAVING table_1.column_name1 > 0
    AND table_1.column_name2 > 0

-- Bad
SELECT
    column_name1,
    column_name2,
    column_name3
FROM table_1
JOIN table_2
ON table_1.id = table_2.id
WHERE clouds = true
AND gem = true
GROUP BY 1,2,3
HAVING column_name1 > 0
AND column_name2 > 0

General (Other)

When SELECTing, always give each column its own row, with the exception of SELECT * which can be on a single row
DISTINCT should be included on the same row as SELECT
The AS keyword should be used when projecting a field or table name
Fields should be stated before aggregates / window functions
Ordering and grouping by a number (e.g. GROUP BY 1, 2) is preferred
- When grouping by 3 or more columns in a dbt model, use the dbt-utils group_by macro
Prefer WHERE to HAVING when either would suffice
Prefer accessing JSON using the bracket syntax, e.g. data_by_row['id']::NUMBER AS id_value
Never use USING in joins because it produces inaccurate results in Snowflake. There is a forum discussion on this topic (viewing it requires an account).
Prefer UNION ALL to UNION. This is because a UNION could indicate upstream data integrity issues that are better solved elsewhere.
Prefer != to <>. This is because != is more common in other programming languages and reads like “not equal” which is how we’re more likely to speak
Consider performance. Understand the difference between LIKE vs ILIKE, IS vs =, and NOT vs ! vs <>. Use appropriately
Prefer LOWER(column) LIKE '%match%' to column ILIKE '%Match%'. This lowers the chance of stray capital letters leading to an unexpected result
Familiarize yourself with the DRY principle. Leverage CTEs, jinja and macros in dbt, and snippets in Sisense. If you type the same line twice, it needs to be maintained in two places
DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP. BRAIN TIME IS EXPENSIVE.
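
Several of the rules above can be sketched in one query: UNION ALL instead of UNION, WHERE instead of HAVING, != instead of <>, LOWER(column) LIKE instead of ILIKE, and bracket syntax for JSON. The tables web_events and mobile_events and all of their columns are hypothetical.

```sql
-- Hypothetical tables and columns, for illustration only.
WITH all_events AS (

    SELECT
        event_id,
        event_type,
        page_title,
        event_payload
    FROM web_events

    UNION ALL

    SELECT
        event_id,
        event_type,
        page_title,
        event_payload
    FROM mobile_events

)

SELECT
    event_id,
    event_payload['id']::NUMBER AS payload_id
FROM all_events
WHERE event_type != 'heartbeat'
    AND LOWER(page_title) LIKE '%pricing%'
```

Combining the sources in a CTE first keeps the filters in one place, which is one way to follow the DRY principle here.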

Data Types

Use default data types and not aliases. Review the Snowflake summary of data types for more details. The defaults are:
- NUMBER instead of DECIMAL, NUMERIC, INTEGER, BIGINT, etc.
- FLOAT instead of DOUBLE, REAL, etc.
- VARCHAR instead of STRING, TEXT, etc.
- TIMESTAMP instead of DATETIME

The exception to this is for timestamps. Prefer TIMESTAMP to TIME. Note that the default for TIMESTAMP is TIMESTAMP_NTZ which does not include a time zone.
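
A short sketch of casting to the default types rather than their aliases, assuming a hypothetical raw_invoice table whose columns all need explicit casts:

```sql
-- Hypothetical source table and columns, for illustration only.
SELECT
    -- NUMBER rather than DECIMAL, NUMERIC, INTEGER, or BIGINT
    raw_amount::NUMBER(38, 2) AS invoice_amount,
    -- FLOAT rather than DOUBLE or REAL
    raw_ratio::FLOAT AS discount_ratio,
    -- VARCHAR rather than STRING or TEXT
    raw_name::VARCHAR AS account_name,
    -- TIMESTAMP rather than DATETIME
    raw_created::TIMESTAMP AS invoice_created_at
FROM raw_invoice
```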

Functions

Prefer IFNULL to NVL
Prefer IFF to a single line CASE statement
Prefer IFF to selecting a boolean statement, e.g. (amount < 10) AS is_less_than_ten
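
As a hedged illustration of these three preferences, the query below assumes a hypothetical subscription table with discount_amount, amount, and plan_name columns.

```sql
-- Hypothetical table and columns, for illustration only.
SELECT
    -- IFNULL rather than NVL
    IFNULL(discount_amount, 0) AS discount_amount,
    -- IFF rather than (amount < 10) AS is_less_than_ten
    IFF(amount < 10, TRUE, FALSE) AS is_less_than_ten,
    -- IFF rather than a single line CASE statement
    IFF(plan_name = 'free', 'unpaid', 'paid') AS plan_category
FROM subscription
```

One practical difference worth keeping in mind: when amount is NULL, the IFF form returns FALSE because the condition is not true, whereas the bare comparison (amount < 10) returns NULL.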

Consider simplifying a repetitive CASE statement where possible:

-- OK
CASE
    WHEN field_id = 1 THEN 'date'
    WHEN field_id = 2 THEN 'integer'
    WHEN field_id = 3 THEN 'currency'
    WHEN field_id = 4 THEN 'boolean'
    WHEN field_id = 5 THEN 'variant'
    WHEN field_id = 6 THEN 'text'
END AS field_type

-- Better
CASE field_id
    WHEN 1 THEN 'date'
    WHEN 2 THEN 'integer'
    WHEN 3 THEN 'currency'
    WHEN 4 THEN 'boolean'
    WHEN 5 THEN 'variant'
    WHEN 6 THEN 'text'
END AS field_type

JOINs

Be explicit when joining, e.g. use LEFT JOIN instead of JOIN. (Default joins are INNER)
Prefix the table name to a column when joining, otherwise omit

Specify the order of a join with the FROM table first and JOIN table second:

-- Good
FROM source
LEFT JOIN other_source
    ON source.id = other_source.id
WHERE ...

-- Bad
FROM source
LEFT JOIN other_source
    ON other_source.id = source.id
WHERE ...

Example Code

Putting it all together:

WITH my_data AS (

    SELECT *
    FROM prod.my_data
    WHERE filter = 'my_filter'

),

some_cte AS (

    SELECT DISTINCT
        id,
        other_field_1,
        other_field_2
    FROM prod.my_other_data

),

final AS (

    SELECT
        my_data.field_1 AS detailed_field_1,
        my_data.field_2 AS detailed_field_2,
        my_data.detailed_field_3 AS detailed_field_3,
        my_data.data_by_row AS id_field,
        CASE
            WHEN my_data.cancellation_date IS NULL
                AND my_data.expiration_date IS NOT NULL
                THEN my_data.expiration_date
            WHEN my_data.cancellation_date IS NULL
                THEN my_data.start_date + 7
            ELSE my_data.cancellation_date
        END AS cancellation_date,
        LAG(my_data.detailed_field_3) OVER (
            PARTITION BY
                my_data.id_field,
                my_data.detailed_field_1
            ORDER BY cancellation_date
        ) AS previous_detailed_field_3,
        SUM(my_data.field_4) AS field_4_sum,
        MAX(my_data.field_5) AS field_5_max
    FROM my_data
    LEFT JOIN some_cte
        ON my_data.id = some_cte.id
    WHERE my_data.field_1 = 'abc'
        AND (my_data.field_2 = 'def' OR my_data.field_2 = 'ghi')
    GROUP BY 1, 2, 3, 4, 5
    HAVING COUNT(*) > 1
    ORDER BY 4 DESC

)

SELECT *
FROM final

Commenting

When making single line comments in a model use the -- syntax
When making multi-line comments in a model use the /* */ syntax
Respect the character line limit when making comments. Move to a new line or to the model documentation if the comment is too long
dbt model comments should live in the model documentation
Calculations made in SQL should have a brief description of what’s going on and a link to the handbook defining the metric (and how it’s calculated)
Instead of leaving TODO comments, create new issues for improvement
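
A minimal sketch of the two comment syntaxes in a model. The subscription table, its columns, and the annualized_amount calculation are hypothetical, and the handbook link is left as a placeholder.

```sql
/*
Multi-line comments use this syntax. Longer explanations belong
in the model documentation rather than in the SQL file.
*/
SELECT
    account_id,
    -- Hypothetical metric; link to its handbook definition here
    SUM(monthly_amount) * 12 AS annualized_amount
FROM subscription
GROUP BY 1
```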

Other SQL Style Guides
