Steps
The steps component in DUQL represents a series of data transformation operations applied sequentially to your dataset. It's the heart of DUQL's query structure, providing a flexible and intuitive way to express complex data manipulations as a pipeline of operations.
Syntax
steps:
  - <step_1>
  - <step_2>
  - <step_3>
  # ... more steps as needed
Behavior
Each step in the pipeline takes the output of the previous step as its input.
Steps are executed in the order they are listed.
The pipeline is flexible, allowing you to add, remove, or reorder steps easily.
You can use any DUQL function as a step in the pipeline.
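For example, in the minimal pipeline below (the orders dataset and its columns are hypothetical), the rows kept by filter are exactly what sort receives, and take then limits the sorted result:
dataset: orders
steps:
  - filter: amount > 100   # keep only large orders
  - sort: -amount          # order the surviving rows, largest first
  - take: 5                # return the top five of those rows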
Supported Step Types
The steps pipeline can include any combination of the following operations:
filter: Select rows based on conditions
generate: Create new columns or modify existing ones
group: Aggregate data
join: Combine data from multiple sources
select: Choose or compute columns
sort: Order results
take: Limit the number of rows
window: Perform window functions
append: Combine datasets by adding rows
remove: Exclude specific rows or subsets
intersect: Find common rows between datasets
distinct: Remove duplicate rows
union: Combine datasets, removing duplicates
except: Find rows in one dataset but not in another
loop: Perform iterative processing
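For instance, row-level steps can be mixed with set-style steps in a single pipeline. The sketch below is only an illustration: the dataset names are hypothetical, and the exact shapes shown for append and distinct are assumptions, so refer to each function's own documentation for its real options.
dataset: online_orders
steps:
  - append: archived_orders   # assumed shape: append another dataset by name
  - distinct: true            # assumed shape: drop duplicate rows
  - filter: amount > 0
  - sort: -order_date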
Flexibility and Power
The steps pipeline in DUQL offers several advantages:
Modularity: Each step performs a specific operation, making queries easy to understand and maintain.
Reusability: You can easily copy, paste, and modify steps between queries.
Readability: The sequential nature of steps makes the data transformation process clear and logical.
Incremental Development: You can build complex queries step by step, testing and refining as you go.
Easy Debugging: You can comment out or remove steps to isolate issues in your query.
Flexibility: Steps can be easily reordered or modified to change the query logic.
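For example, because each step is a separate list item, you can comment one out to isolate a problem and re-enable it once the rest of the pipeline behaves as expected (the column names here are hypothetical):
steps:
  - filter: status == 'active'
  # - join:                   # temporarily disabled while debugging
  #     dataset: customer_details
  #     where: orders.customer_id == customer_details.id
  - sort: -created_at
  - take: 100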
Examples
Basic Pipeline
steps:
  - filter: order_date >= @2023-01-01
  - generate:
      total_amount: price * quantity
  - group:
      by: product_id
      summarize:
        total_sales: sum(total_amount)
  - sort: -total_sales
  - take: 10
Complex Data Transformation
steps:
  - filter: status == 'active'
  - join:
      dataset: customer_details
      where: orders.customer_id == customer_details.id
  - generate:
      age: datediff(years, birth_date, current_date()) # assuming this is a custom function
      customer_segment:
        case:
          - age < 25: "Young Adult"
          - age < 40: "Adult"
          - age < 60: "Middle Age"
          - true: "Senior"
  - group:
      by: [customer_segment, product_category]
      summarize:
        total_revenue: sum(price * quantity)
        average_order_value: avg(price * quantity)
  - sort: [customer_segment, -total_revenue]
  - generate:
      revenue_rank:
        sql'ROW_NUMBER() OVER (PARTITION BY customer_segment ORDER BY total_revenue DESC)'
  - filter: revenue_rank <= 5
Best Practices
🔍 Start with a filter step to reduce data volume early in the pipeline.
🧮 Use generate steps to create new columns needed for analysis.
🔗 Place join steps early in the pipeline, but after initial filtering.
📊 Use group steps for aggregations, followed by generate for post-aggregation calculations.
🏷️ Use select towards the end of the pipeline to choose final output columns.
🔢 End with sort and take steps to order and limit final results.
📝 Comment your steps for complex queries to explain the transformation logic.
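Putting these guidelines together, a typical pipeline might be shaped like the sketch below. The datasets and columns are hypothetical; the point is the ordering of the steps.
dataset: orders
steps:
  - filter: order_date >= @2024-01-01             # filter early to cut data volume
  - join:
      dataset: customers                          # join after the initial filter
      where: orders.customer_id == customers.id
  - generate:
      order_value: price * quantity               # create the column the aggregation needs
  - group:
      by: region
      summarize:
        total_value: sum(order_value)
        order_count: count order_id
  - generate:
      average_value: total_value / order_count    # post-aggregation calculation
  - select:                                       # choose the final output columns
      region: region
      total_value: total_value
      avg_order_value: average_value
  - sort: -total_value                            # order ...
  - take: 10                                      # ... and limit the final results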
Real-World Use Case
Here's an example of a DUQL query that uses a complex steps pipeline to perform a cohort analysis:
dataset: user_actions
steps:
  - filter: action_date >= @2023-01-01
  - group:
      by: [cohort_date, months_since_first_action]
      summarize:
        user_count: count_distinct user_id
        total_actions: count action_id
  - join:
      dataset:
        steps:
          - dataset: user_actions
          - group:
              by: cohort_date
              summarize:
                initial_users: count_distinct user_id
      where: ==cohort_date
  - generate:
      retention_rate: user_count / initial_users  # retention relative to each cohort's initial size
  - sort: [cohort_date, months_since_first_action]
  - select:
      Cohort: cohort_date
      "Months Since First Action": months_since_first_action
      "User Count": user_count
      "Retention Rate": retention_rate
  - filter: months_since_first_action <= 12
into: user_cohort_analysis
This query demonstrates:
Filtering recent data
Grouping by cohort and time-based columns
Aggregating user actions
Joining with a subquery to get initial user counts
Calculating retention rates
Sorting results
Selecting and renaming final columns
Filtering to show only the first year of data for each cohort
The steps pipeline allows for this complex analysis to be expressed clearly and logically, making it easy to understand and modify the cohort analysis process.
💡 Tip: Think of the steps pipeline as a data assembly line. Each step performs a specific task, gradually shaping your data into the final form you need. Don't hesitate to experiment with the order and composition of steps to achieve your desired result!