Summarize
The summarize
function in DUQL is used to perform aggregations on your data, condensing multiple rows into summary statistics. It can be used both as a top-level function and within the group
function, offering flexibility in how you aggregate your data.
Syntax
As a top-level function:
summarize:
<new_column_name>: <aggregation_function>
Within a group
function:
group:
by: <grouping_columns>
steps:
summarize:
<new_column_name>: <aggregation_function>
Behavior
Top-level usage: When used at the top level,
summarize
produces a single row summarizing the entire dataset.Within
group
: When used inside agroup
function, it produces one summary row for each unique combination of the grouping columns.
This dual behavior allows for both global summaries and group-specific summaries within the same query language.
Common Aggregation Functions
Currently, all declared aggregation functions are min
, max
, count
, average
, stddev
, avg
, sum
and count_distinct
Examples
Top-level Summarize
dataset: sales
steps:
- summarize:
total_revenue: sum amount
average_order_value: avg amount
order_count: count order_id
unique_customers: count_distinct customer_id
This query will produce a single row with overall summary statistics for the entire sales dataset.
Summarize Within Group
dataset: sales
steps:
- group:
by: [category, order_year]
steps:
- summarize:
total_revenue: sum amount
average_order_value: avg amount
order_count: count order_id
This query will produce summary statistics for each unique combination of category and year.
Combined Usage
dataset: sales
steps:
- group:
by: category
steps:
- summarize:
category_revenue: sum amount
category_orders: count order_id
- summarize:
total_revenue: sum category_revenue
total_orders: sum category_orders
num_categories: count category
This query first summarizes data by category, then provides an overall summary of those category summaries.
Best Practices
🎯 Choose appropriate aggregation functions for your data types and analysis goals.
🏷️ Use clear and descriptive names for your summarized columns.
🧮 Be aware of how
null
values are handled in your aggregations.🚀 Consider performance implications when summarizing large datasets.
📊 Use top-level
summarize
for overall statistics andgroup
withsummarize
for segmented analysis.🔍 Validate your summary results, especially when using complex aggregation expressions.
Real-World Use Case
Here's an example of a DUQL query that uses summarize
both within group
and at the top level to analyze sales performance:
declare:
recent_date: @2023-01-01
calculate_margin: revenue cost -> (revenue - cost) / revenue * 100
dataset: sales
steps:
- filter: sale_date >= recent_date
- join:
dataset: products
where: sales.product_id == products.id
- generate:
revenue: price * quantity
margin: calculate_margin revenue cost
- group:
by: [category, sale_month]
steps:
- summarize:
monthly_revenue: sum revenue
monthly_quantity: sum quantity
avg_margin: avg margin
product_count: count_distinct product_id
- sort: [category, sale_month]
- summarize:
total_revenue: sum monthly_revenue
total_quantity: sum monthly_quantity
overall_avg_margin: avg avg_margin
peak_monthly_revenue: max monthly_revenue
total_products: sum product_count
into: sales_performance_summary
This query demonstrates:
Filtering recent sales data
Joining with product information
Calculating revenue and margin
Grouping by category and month, with monthly summaries
Sorting the grouped results
Providing an overall summary of the grouped data
The resulting sales_performance_summary
dataset offers both detailed monthly performance by category and overall performance metrics, showcasing the flexibility of the summarize
function in DUQL.
💡 Tip: The
summarize
function is a versatile tool in DUQL. Use it at the top level for quick overall insights, or withingroup
for detailed breakdowns. Combine both approaches to create comprehensive analytical queries!
Last updated
Was this helpful?