Steps

The steps component in DUQL represents a series of data transformation operations applied sequentially to your dataset. It's the heart of DUQL's query structure, providing a flexible and intuitive way to express complex data manipulations as a pipeline of operations.

Syntax

steps:
- <step_1>
- <step_2>
- <step_3>
# ... more steps as needed

Behavior

Each step in the pipeline takes the output of the previous step as its input.
Steps are executed in the order they are listed.
The pipeline is flexible, allowing you to add, remove, or reorder steps easily.
You can use any DUQL function as a step in the pipeline.

Supported Step Types

The steps pipeline can include any combination of the following operations:

filter: Select rows based on conditions
generate: Create new columns or modify existing ones
group: Aggregate data
join: Combine data from multiple sources
select: Choose or compute columns
sort: Order results
take: Limit the number of rows
window: Perform window functions
append: Combine datasets by adding rows
remove: Exclude specific rows or subsets
intersect: Find common rows between datasets
distinct: Remove duplicate rows
union: Combine datasets, removing duplicates
except: Find rows in one dataset but not in another
loop: Perform iterative processing

Flexibility and Power

The steps pipeline in DUQL offers several advantages:

Modularity: Each step performs a specific operation, making queries easy to understand and maintain.
Reusability: You can easily copy, paste, and modify steps between queries.
Readability: The sequential nature of steps makes the data transformation process clear and logical.
Incremental Development: You can build complex queries step by step, testing and refining as you go.
Easy Debugging: You can comment out or remove steps to isolate issues in your query.
Flexibility: Steps can be easily reordered or modified to change the query logic.

Examples

Basic Pipeline

steps:
- filter: order_date >= @2023-01-01
- generate:
    total_amount: price * quantity
- group:
    by: product_id
    summarize:
      total_sales: sum(total_amount)
- sort: -total_sales
- take: 10

Complex Data Transformation

steps:
- filter: status == 'active'
- join:
    dataset: customer_details
    where: orders.customer_id == customer_details.id
- generate:
    age: datediff(years, birth_date, current_date()) # assuming this is a custom function
    customer_segment:
      case:
      - age < 25: "Young Adult"
      - age < 40: "Adult"
      - age < 60: "Middle Age"
      - true: "Senior"
- group:
    by: [customer_segment, product_category]
    summarize:
      total_revenue: sum price * quantity
      average_order_value: avg price * quantity
- sort: [customer_segment, -total_revenue]
- generate:
    revenue_rank:
      sql'ROW_NUMBER() OVER (PARTITION BY customer_segment ORDER BY total_revenue DESC)'
- filter: revenue_rank <= 5

Best Practices

🔍 Start with a filter step to reduce data volume early in the pipeline.
🧮 Use generate steps to create new columns needed for analysis.
🔗 Place join steps early in the pipeline, but after initial filtering.
📊 Use group steps for aggregations, followed by generate for post-aggregation calculations.
🏷️ Use select towards the end of the pipeline to choose final output columns.
🔢 End with sort and take steps to order and limit final results.
📝 Comment your steps for complex queries to explain the transformation logic.

Real-World Use Case

Here's an example of a DUQL query that uses a complex steps pipeline to perform a cohort analysis:

dataset: user_actions

steps:
- filter: action_date >= @2023-01-01
- group:
    by: [cohort_date, months_since_first_action]
    summarize:
      user_count: count_distinct user_id
      total_actions: count action_id
- join:
    dataset: 
      steps:
      - dataset: user_actions
      - group:
          by: cohort_date
          summarize:
            initial_users: count_distinct user_id
    where: ==cohort_date
- sort: [cohort_date, months_since_first_action]
- select:
    Cohort: cohort_date
    "Months Since First Action": months_since_first_action
    "User Count": user_count
    "Retention Rate": retention_rate
- filter: months_since_first_action <= 12

into: user_cohort_analysis

This query demonstrates:

Filtering recent data
Generating cohort and time-based columns
Aggregating user actions
Joining with a subquery to get initial user counts
Calculating retention rates
Sorting results
Selecting and renaming final columns
Filtering to show only the first year of data for each cohort

The steps pipeline allows for this complex analysis to be expressed clearly and logically, making it easy to understand and modify the cohort analysis process.

💡 Tip: Think of the steps pipeline as a data assembly line. Each step performs a specific task, gradually shaping your data into the final form you need. Don't hesitate to experiment with the order and composition of steps to achieve your desired result!

PreviousDeclare NextInto

Last updated 11 months ago

Was this helpful?