Datasets
The dataset
component in DUQL is a fundamental element that specifies the source of data for your query. It can represent a wide range of data sources, from simple table names to complex file systems or even subqueries.
Syntax
The dataset
can be defined in two main ways:
Simple Form
Advanced Form
Parameters
name
string
Yes
The name or path of the data source
format
string
No
The format of the data (default: table
)
Supported Formats
table
: A database table (default)csv
: Comma-separated values filejson
: JSON file or dataparquet
: Apache Parquet file
Examples
Basic Table Reference
This simple form refers to a table named customers
in the default database.
CSV File
This defines a dataset from a CSV file located at the specified path.
Remote Parquet Files
This example shows how to reference Parquet files stored in an Amazon S3 bucket.
SQL Query
You can use raw SQL queries as datasets, which is useful for complex data sources or when transitioning from SQL to DUQL.
Subquery
This advanced example shows how to define a dataset based on a subquery, which can be reused in multiple parts of your DUQL script.
Best Practices
📁 Use meaningful names for your datasets to improve query readability.
🔒 Ensure you have the necessary permissions to access the specified data sources.
🚀 For large datasets, consider using efficient file formats like Parquet for better performance.
🧩 Leverage subqueries to create reusable dataset components.
📊 When working with files, use wildcards to process multiple files in a single query.
🔍 Always validate the data format and structure of external sources before using them in complex queries.
🔧 When using SQL strings (
sql''
), be aware that they bypass DUQL's syntax checking and optimization. Use them judiciously and consider requesting native DUQL features for frequently used SQL patterns.
Real-World Use Case
Here's an example of a DUQL query that combines data from multiple sources:
This query demonstrates:
Declaring a reusable dataset (
recent_customers
)Using a Parquet file from S3 as the main dataset
Joining with the declared dataset and an SQL-defined product dataset
Performing aggregations and sorting on the combined data
The resulting top_products_analysis
dataset provides insights into the best-selling products among recent customers, combining data from multiple sources and formats.
💡 Tip: The flexibility of DUQL's
dataset
component allows you to work with a wide variety of data sources. This makes it easy to integrate data from different systems and formats into a single, cohesive analysis! (Note: Only DuckDB currently supports non-table source.)
Last updated
Was this helpful?