Join
The join function in DUQL is used to combine rows from two or more tables based on a related column between them. It supports different types of joins and allows for complex join conditions.
Syntax
join:
dataset: <data_source>
where: <join_condition>
retain: <join_type>Parameters
dataset
object
Yes
-
The data source to join with (ref: dataset.s.duql.yml)
where
string
Yes
-
The join condition
retain
string
No
"inner"
The type of join to perform
Behavior
Combines rows from the current dataset with rows from the specified dataset based on the join condition.
The
datasetparameter can include its ownstepsfor preprocessing before the join.The
wherecondition supports a==shorthand for equality joins on matching column names.The
retainparameter specifies the join type, defaulting to an inner join.
Join Types
"inner"
Returns only the matched rows (default)
"left"
Returns all rows from the left table and matched rows from the right table
"right"
Returns all rows from the right table and matched rows from the left table
"full"
Returns all rows when there is a match in either left or right table
Examples
Basic Join
join:
dataset: customers
steps:
where: orders.customer_id == customers.idLeft Join
join:
dataset: products
where: orders.product_id == products.id
retain: leftJoin with CSV File
join:
dataset: myorg/employee_data.csv
type: csv
where: departments.department_id == employee_data.department_id
retain: innerJoin with Subquery and Full Outer Join
join:
dataset: recent_inventory
steps:
- filter: last_updated > @2023-01-01
where: products.product_id == recent_inventory.product_id
retain: fullComplex Join with Aggregation
join:
dataset: customer_views
where: customers.id == customer_views.customer_idJoin with SQL Query
join:
dataset: sql'SELECT * FROM myexample'
where: ==id
retain: fullJoin with External Data Source
join:
dataset: hdfs://cluster/user_profiles/thing.parquet
type: parquet
where: users.id == user_profiles.user_id
retain: leftBest Practices
🎯 Always specify the join condition explicitly for clarity.
🤔 Choose the appropriate join type using the
retainparameter based on your data requirements.🚀 Utilize the
stepswithin thedatasetparameter for preprocessing before joining.🔍 Use the
==shorthand for simple equality joins when column names match.📊 Be mindful of data volume when performing joins, especially with large datasets or complex subqueries.
🏷️ Consider using aliases or clear naming conventions when joining multiple tables to avoid ambiguity.
The join operations allow us to bring together data from multiple sources, creating a rich dataset for analysis that includes sales performance, customer behavior, inventory status, and category-level benchmarks.
💡 Tip: The
joinfunction in DUQL is highly flexible, allowing you to combine data from various sources and formats. Leverage its ability to includestepswithin thedatasetparameter to preprocess your data before joining, enabling more sophisticated data integration in your analyses!
Last updated
Was this helpful?