Join

The join function in DUQL is used to combine rows from two or more tables based on a related column between them. It supports different types of joins and allows for complex join conditions.

Syntax

join:
  dataset: <data_source>
  where: <join_condition>
  retain: <join_type>

Parameters

Behavior

  • Combines rows from the current dataset with rows from the specified dataset based on the join condition.

  • The dataset parameter can include its own steps for preprocessing before the join.

  • The where condition supports a == shorthand for equality joins on matching column names.

  • The retain parameter specifies the join type, defaulting to an inner join.

Join Types

Examples

Basic Join

join:
  dataset: customers
  steps: 
  where: orders.customer_id == customers.id

Left Join

join:
  dataset: products
  where: orders.product_id == products.id
  retain: left

Join with CSV File

join:
  dataset: myorg/employee_data.csv
  type: csv
  where: departments.department_id == employee_data.department_id
  retain: inner

Join with Subquery and Full Outer Join

join:
  dataset: recent_inventory
  steps:
  - filter: last_updated > @2023-01-01
  where: products.product_id == recent_inventory.product_id
  retain: full

Complex Join with Aggregation

join:
  dataset: customer_views
  where: customers.id == customer_views.customer_id

Join with SQL Query

join:
  dataset: sql'SELECT * FROM myexample'
  where: ==id
  retain: full

Join with External Data Source

join:
  dataset: hdfs://cluster/user_profiles/thing.parquet
  type: parquet
  where: users.id == user_profiles.user_id
  retain: left

Best Practices

  1. 🎯 Always specify the join condition explicitly for clarity.

  2. 🤔 Choose the appropriate join type using the retain parameter based on your data requirements.

  3. 🚀 Utilize the steps within the dataset parameter for preprocessing before joining.

  4. 🔍 Use the == shorthand for simple equality joins when column names match.

  5. 📊 Be mindful of data volume when performing joins, especially with large datasets or complex subqueries.

  6. 🏷️ Consider using aliases or clear naming conventions when joining multiple tables to avoid ambiguity.

The join operations allow us to bring together data from multiple sources, creating a rich dataset for analysis that includes sales performance, customer behavior, inventory status, and category-level benchmarks.


💡 Tip: The join function in DUQL is highly flexible, allowing you to combine data from various sources and formats. Leverage its ability to include steps within the dataset parameter to preprocess your data before joining, enabling more sophisticated data integration in your analyses!

Last updated