Datasets
The dataset
component in DUQL is a fundamental element that specifies the source of data for your query. It can represent a wide range of data sources, from simple table names to complex file systems or even subqueries.
Syntax
The dataset
can be defined in two main ways:
Simple Form
dataset: <table_name_or_simple_source>
Advanced Form
dataset:
name: <data_source_name_or_path>
format: <data_format>
Parameters
name
string
Yes
The name or path of the data source
format
string
No
The format of the data (default: table
)
Supported Formats
table
: A database table (default)csv
: Comma-separated values filejson
: JSON file or dataparquet
: Apache Parquet file
Examples
Basic Table Reference
dataset: customers
This simple form refers to a table named customers
in the default database.
CSV File
dataset:
name: /path/to/sales_data.csv
format: csv
This defines a dataset from a CSV file located at the specified path.
Remote Parquet Files
dataset:
name: s3://my-bucket/transactions/*.parquet
format: parquet
This example shows how to reference Parquet files stored in an Amazon S3 bucket.
SQL Query
dataset: sql"""
SELECT * FROM orders
WHERE order_date >= '2023-01-01'
"""
You can use raw SQL queries as datasets, which is useful for complex data sources or when transitioning from SQL to DUQL.
Subquery
dataset:
name: recent_orders
steps:
- dataset: orders
- filter: order_date >= @2023-01-01
This advanced example shows how to define a dataset based on a subquery, which can be reused in multiple parts of your DUQL script.
Best Practices
📁 Use meaningful names for your datasets to improve query readability.
🔒 Ensure you have the necessary permissions to access the specified data sources.
🚀 For large datasets, consider using efficient file formats like Parquet for better performance.
🧩 Leverage subqueries to create reusable dataset components.
📊 When working with files, use wildcards to process multiple files in a single query.
🔍 Always validate the data format and structure of external sources before using them in complex queries.
🔧 When using SQL strings (
sql''
), be aware that they bypass DUQL's syntax checking and optimization. Use them judiciously and consider requesting native DUQL features for frequently used SQL patterns.
Real-World Use Case
Here's an example of a DUQL query that combines data from multiple sources:
declare:
recent_customers:
dataset: customers
steps:
- filter: last_purchase_date >= @2023-01-01
dataset:
name: s3://sales-data/transactions/*.parquet
format: parquet
steps:
- filter: transaction_date >= @2023-01-01
- join:
dataset: recent_customers
where: transactions.customer_id == recent_customers.id
- join:
dataset: sql"""
SELECT product_id, name, category
FROM products
WHERE is_active = true
"""
where: transactions.product_id == products.product_id
- group:
by: [category, product_id, name]
summarize:
total_sales: sum amount
num_transactions: count transaction_id
- sort: -total_sales
- take: 100
into: top_products_analysis
This query demonstrates:
Declaring a reusable dataset (
recent_customers
)Using a Parquet file from S3 as the main dataset
Joining with the declared dataset and an SQL-defined product dataset
Performing aggregations and sorting on the combined data
The resulting top_products_analysis
dataset provides insights into the best-selling products among recent customers, combining data from multiple sources and formats.
💡 Tip: The flexibility of DUQL's
dataset
component allows you to work with a wide variety of data sources. This makes it easy to integrate data from different systems and formats into a single, cohesive analysis! (Note: Only DuckDB currently supports non-table source.)
Last updated
Was this helpful?