File Formats Guide
DBCrust supports querying Parquet, CSV, and JSON files directly using Apache DataFusion 50.0, a powerful SQL query engine. This allows you to analyze file-based data using familiar SQL syntax without loading it into a database.
🚀 Getting Started
Supported File Formats
DBCrust supports the following file formats via Apache DataFusion:
- Parquet - Columnar storage format optimized for analytics
- CSV - Comma-separated values with configurable delimiters
- JSON - JSON records and NDJSON (newline-delimited JSON)
Quick Start
```bash
# Query a Parquet file
dbcrust parquet:///data/sales_2024.parquet
> SELECT COUNT(*), AVG(amount) FROM sales_2024;

# Query multiple CSV files
dbcrust 'csv:///logs/*.csv?header=true'
> SELECT date, COUNT(*) FROM logs WHERE level = 'ERROR' GROUP BY date;

# Query JSON with nested structures
dbcrust json:///api_responses.json
> SELECT data.customer.name, data.order.total FROM api_responses LIMIT 10;
```

📦 Connection URLs
Parquet Files
Scheme: `parquet://`
```bash
# Single Parquet file
dbcrust parquet:///data/sales_2024.parquet

# Multiple files with glob pattern
dbcrust 'parquet:///data/sales_*.parquet'

# All Parquet files in directory
dbcrust 'parquet:///warehouse/*.parquet'

# Recursive glob pattern
dbcrust 'parquet:///data/**/*.parquet'
```

Table Naming:

- Single file: uses the filename without the extension (`sales_2024.parquet` → `sales_2024`)
- Directory: uses the directory name (`/warehouse/` → `warehouse`)
- Glob pattern: uses the directory name or a sanitized pattern
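The naming rules can be sketched in plain Python. This is a hypothetical illustration of the behavior described above, not DBCrust's actual implementation; in particular, the glob branch (falling back to the deepest non-glob directory) is an assumption about what "sanitized pattern" means.

```python
from pathlib import Path

GLOB_CHARS = "*?["

def table_name_for(path: str) -> str:
    """Derive a table name from a file, directory, or glob path (illustrative only)."""
    if any(ch in path for ch in GLOB_CHARS):
        # Glob pattern: use the deepest path component with no glob characters
        literal = [p for p in Path(path).parts if not any(ch in p for ch in GLOB_CHARS)]
        return literal[-1] if literal else "data"
    p = Path(path)
    if path.endswith("/"):
        return p.name            # directory: use the directory name
    return p.stem                # single file: strip the extension

print(table_name_for("/data/sales_2024.parquet"))  # sales_2024
print(table_name_for("/warehouse/"))               # warehouse
print(table_name_for("/data/sales_*.parquet"))     # data
```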
CSV Files
Scheme: `csv://`
```bash
# CSV with header row (default)
dbcrust csv:///data/users.csv

# Custom delimiter (tab-separated)
dbcrust 'csv:///data/data.tsv?delimiter=\t'

# No header row
dbcrust 'csv:///data/logs.csv?header=false'

# Multiple CSV files
dbcrust 'csv:///logs/*.csv?header=true'

# Custom delimiter with glob
dbcrust 'csv:///exports/*.tsv?delimiter=\t&header=true'
```

Query Parameters:

- `?header=true|false` - CSV has a header row (default: `true`)
- `?delimiter=,` - Field delimiter character (default: `,`)

Common Delimiters:

- `,` - Comma (CSV)
- `\t` - Tab (TSV)
- `|` - Pipe
- `;` - Semicolon
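These options are ordinary URL query parameters. As a mental model, here is a sketch of how such a `csv://` URL decomposes, using Python's standard library — illustrative only, not DBCrust's parser, and the `parse_csv_url` helper is hypothetical:

```python
from urllib.parse import urlparse, parse_qs

def parse_csv_url(url: str) -> dict:
    """Split a csv:// connection URL into its path and options (illustrative)."""
    parsed = urlparse(url)
    opts = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    return {
        "path": parsed.path,
        "header": opts.get("header", "true") == "true",  # default: header row present
        "delimiter": opts.get("delimiter", ","),         # default: comma
    }

print(parse_csv_url("csv:///exports/data.csv?delimiter=|&header=false"))
# {'path': '/exports/data.csv', 'header': False, 'delimiter': '|'}
```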
JSON Files
Scheme: `json://`
```bash
# Standard JSON file
dbcrust json:///api_responses.json

# NDJSON (newline-delimited JSON)
dbcrust json:///events.ndjson

# JSON with deeply nested structures
dbcrust json:///vault_policies.json
```

NDJSON Detection: DBCrust automatically detects NDJSON format and converts it to Arrow format for efficient querying:

```json
{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 25}
{"id": 3, "name": "Charlie", "age": 35}
```

📊 Querying File Formats
Basic SELECT Queries
```sql
-- All columns from Parquet file
SELECT * FROM sales_2024 LIMIT 10;

-- Specific columns with filtering
SELECT customer_name, total, date
FROM sales_2024
WHERE total > 1000
ORDER BY total DESC;

-- Aggregations
SELECT product_category,
       COUNT(*) as count,
       SUM(amount) as total_sales,
       AVG(amount) as avg_sale
FROM sales_2024
GROUP BY product_category;
```

Advanced SQL Features
DataFusion provides a rich SQL dialect with extensive function support:
Aggregate Functions
```sql
SELECT department,
       COUNT(*) as employees,
       AVG(salary) as avg_salary,
       MIN(salary) as min_salary,
       MAX(salary) as max_salary,
       STDDEV(salary) as salary_stddev,
       MEDIAN(salary) as median_salary
FROM employees
GROUP BY department;
```

Window Functions
```sql
SELECT employee_name,
       department,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_dept,
       LAG(salary, 1) OVER (PARTITION BY department ORDER BY salary) as previous_salary,
       AVG(salary) OVER (PARTITION BY department) as dept_avg_salary
FROM employees;
```

String Functions
```sql
SELECT UPPER(name) as name_upper,
       LOWER(email) as email_lower,
       CONCAT(first_name, ' ', last_name) as full_name,
       SUBSTRING(product_code, 1, 3) as category,
       LENGTH(description) as desc_length,
       TRIM(title) as trimmed_title
FROM products;
```

Date/Time Functions
```sql
SELECT date,
       EXTRACT(YEAR FROM date) as year,
       EXTRACT(MONTH FROM date) as month,
       DATE_TRUNC('month', date) as month_start,
       NOW() as current_timestamp,
       date + INTERVAL '7 days' as next_week
FROM events;
```

Array Functions
```sql
SELECT ARRAY_AGG(product_name) as products,
       ARRAY_LENGTH(tags) as tag_count,
       ARRAY_CONTAINS(categories, 'electronics') as has_electronics
FROM orders;
```

Multi-File Queries with Glob Patterns
Query multiple files as if they were a single table:
```sql
-- Query all CSV files in directory
SELECT date, level, COUNT(*) as log_count
FROM logs
WHERE level IN ('ERROR', 'CRITICAL')
GROUP BY date, level
ORDER BY date DESC;
```
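Glob patterns here follow standard shell-style semantics, so a quick way to preview which files a pattern will resolve to before pointing DBCrust at it is Python's `glob` module (the paths are placeholders):

```python
import glob

# Preview the files a pattern would resolve to before querying them.
for path in sorted(glob.glob("logs/*.csv")):
    print(path)

# recursive=True lets ** descend into nested directories,
# matching patterns like 'datalake/events/**/*.parquet'.
for path in sorted(glob.glob("data/**/*.parquet", recursive=True)):
    print(path)
```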
```sql
-- Combine multiple Parquet files by pattern
SELECT EXTRACT(MONTH FROM date) as month,
       SUM(revenue) as monthly_revenue
FROM 'sales_*.parquet'
GROUP BY month
ORDER BY month;
```

🔍 Nested Field Navigation
Understanding Nested Structures
Parquet and JSON files often contain nested structures (Struct, Array, and Map types). DBCrust provides intelligent multi-level autocomplete for navigating these structures.
Schema Display for Complex Types
When you describe a table with nested fields, DBCrust shows two sections:
- Main table with column summaries
- Nested field details section with expandable information
```
\d policies

Table: policies
Column    | Type
----------+--------------------
id        | Int64
data      | Struct<4 fields>
timestamp | Utf8

Nested field details:
data (Struct):
  - chroot_namespace: Utf8
  - exact_paths: Struct<25 fields>
  - glob_paths: Struct<10 fields>
  - root: Utf8

data.exact_paths (Struct):
  - auth/token/create: Struct<1 fields>
  - auth/token/lookup-self: Struct<1 fields>
  - aws_okta/creds/management-ecr: Struct<1 fields>
  ... and 22 more fields
```

Multi-Level Autocomplete
DBCrust supports deep nested field navigation with intelligent autocomplete:
```sql
-- First level - shows all top-level paths
SELECT data.[TAB] FROM policies
-- Suggests: data, data.chroot_namespace, data.exact_paths, data.glob_paths, data.root

-- Second level - shows direct children only
SELECT data.exact_paths.[TAB] FROM policies
-- Suggests: auth/token/create, auth/token/lookup-self, aws_okta/creds/management-ecr, ...

-- Third level - navigate even deeper
SELECT data.exact_paths.auth/token/create.[TAB] FROM policies
-- Suggests: capabilities

-- Fourth level - works at any depth
SELECT data.exact_paths.auth/token/create.capabilities[TAB] FROM policies
-- Shows capabilities array fields
```

Autocomplete Features:
- Direct Children Only: Shows immediate children, not all descendants
- Any Depth: Navigate nested structures to unlimited depth
- Special Characters: Handles field names with
/,-,@, and other characters - Context-Aware: Works in SELECT, WHERE, ORDER BY, GROUP BY, and all SQL clauses
- Performance: Fast even with deeply nested structures (100+ fields)
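The "direct children only" behavior can be pictured as a single step down a nested schema tree. The sketch below is a simplified model — the schema dict and `suggestions` function are hypothetical, not DBCrust internals — but it shows why field names containing `/` or `-` cause no ambiguity: only `.` separates levels.

```python
# Simplified model of multi-level completion over a nested schema.
# Structs are dicts; leaves are Arrow type names (illustrative data).
schema = {
    "data": {
        "chroot_namespace": "Utf8",
        "exact_paths": {
            "auth/token/create": {"capabilities": "List<Utf8>"},
            "auth/token/lookup-self": {"capabilities": "List<Utf8>"},
        },
        "root": "Utf8",
    }
}

def suggestions(schema: dict, dotted_path: str) -> list:
    """Return only the immediate children of dotted_path, not all descendants."""
    node = schema
    for part in [p for p in dotted_path.split(".") if p]:
        node = node[part]
    return sorted(node) if isinstance(node, dict) else []

print(suggestions(schema, "data"))
# ['chroot_namespace', 'exact_paths', 'root']
print(suggestions(schema, "data.exact_paths"))
# ['auth/token/create', 'auth/token/lookup-self']
```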
Querying Nested Fields
```sql
-- Select specific nested fields
SELECT id,
       data.chroot_namespace,
       data.exact_paths.auth/token/create.capabilities
FROM policies;

-- Filter by nested fields
SELECT *
FROM policies
WHERE data.root = 'true'
  AND data.exact_paths.auth/token/create.capabilities IS NOT NULL;

-- Aggregate by nested fields
SELECT data.chroot_namespace,
       COUNT(*) as policy_count
FROM policies
GROUP BY data.chroot_namespace;
```

Working with Arrays
```sql
-- Access array elements
SELECT tags[1] as first_tag,
       ARRAY_LENGTH(tags) as tag_count
FROM products;
```
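Note that SQL array indexing is 1-based, as the `tags[1]` example implies — unlike 0-based indexing in most programming languages. A quick side-by-side in Python:

```python
tags = ["electronics", "sale", "featured"]

# SQL:    tags[1]            -- first element (1-based indexing)
# Python: tags[0]            -- first element (0-based indexing)
sql_index = 1
first_tag = tags[sql_index - 1]
print(first_tag)       # electronics
print(len(tags))       # 3 — the ARRAY_LENGTH(tags) equivalent
```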
```sql
-- Unnest arrays
SELECT product_id, tag
FROM (
  SELECT product_id, UNNEST(tags) as tag
  FROM products
);
```

🎯 Use Cases
Data Analysis
Analyze large datasets without database setup:
```sql
-- Analyze sales trends from Parquet files
SELECT DATE_TRUNC('month', order_date) as month,
       product_category,
       COUNT(DISTINCT customer_id) as unique_customers,
       SUM(amount) as total_revenue,
       AVG(amount) as avg_order_value
FROM 'warehouse/sales_*.parquet'
WHERE order_date >= '2024-01-01'
GROUP BY month, product_category
ORDER BY month, total_revenue DESC;
```

Log Analysis
Query log files with SQL:
```sql
-- Analyze error patterns in CSV logs
SELECT date,
       service,
       error_type,
       COUNT(*) as error_count
FROM 'logs/*.csv'
WHERE level = 'ERROR'
  AND date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY date, service, error_type
ORDER BY error_count DESC
LIMIT 20;
```

Data Lake Queries
Query data lake structures:
```sql
-- Query partitioned Parquet files
SELECT year,
       month,
       country,
       SUM(revenue) as total_revenue
FROM 'datalake/events/**/*.parquet'
WHERE year = 2024
GROUP BY year, month, country;
```

API Response Analysis
Analyze JSON API responses:
```sql
-- Analyze nested JSON from API responses
SELECT data.user.id,
       data.user.email,
       data.order.total,
       data.order.status,
       ARRAY_LENGTH(data.order.items) as item_count
FROM api_responses
WHERE data.order.total > 100
  AND data.order.status = 'completed';
```

⚡ Performance Tips
Parquet Files
Best for: Large datasets (100MB+), analytical queries
Advantages:
- Columnar Format: Only reads columns needed for query
- Predicate Pushdown: Filters applied at file level before loading data
- Compression: Efficient storage with compression
- Type Preservation: Full type information maintained
Example:
```sql
-- Only reads 'amount' and 'date' columns from file
SELECT SUM(amount)
FROM 'sales.parquet'
WHERE date >= '2024-01-01';
-- ✓ Fast: Only scans needed columns
-- ✓ Predicate pushdown: Filters at file level
```

CSV Files
Best for: Small to medium datasets (<100MB), simple structure
Advantages:
- Universal Format: Works everywhere
- Human Readable: Easy to inspect
- Simple Schema: Good for flat data
Limitations:
- Sequential reading (no column pruning)
- Type inference required
- Less efficient compression
Example:
```sql
-- Reads entire file to parse CSV
SELECT *
FROM 'logs.csv'
WHERE date >= '2024-01-01';
-- ✗ Slower: Must read full file
-- ✓ Simple: Easy to work with
```

JSON Files
Best for: Semi-structured data, varying schemas, nested structures
Advantages:
- Flexible Schema: Handles varying field sets
- Nested Structures: Natural representation of hierarchical data
- Self-Describing: Field names included in data
NDJSON Benefits:
- Line-by-line processing
- Streaming-friendly
- Partial read support
Example:
```sql
-- Handles varying schemas gracefully
SELECT data.customer.name,
       data.order.total
FROM 'api_responses.json'
WHERE data.order.total > 100;
-- ✓ Flexible: Works with varying schemas
-- ✓ Nested: Direct access to nested fields
```

🔧 Advanced Features
Cross-Format JOINs
Join data across different file formats:
```sql
-- Join Parquet and CSV data
SELECT u.name, u.email, o.order_id, o.total
FROM 'users.parquet' u
JOIN 'orders.csv' o ON u.id = o.user_id
WHERE o.total > 1000;

-- Combine JSON and Parquet
SELECT p.data.customer.name as customer,
       s.product_name,
       s.amount
FROM 'policies.json' p
JOIN 'sales.parquet' s ON p.id = s.policy_id
WHERE s.amount > 500;

-- Three-way join across formats
SELECT u.name, o.order_id, p.product_name
FROM 'users.csv' u
JOIN 'orders.parquet' o ON u.user_id = o.user_id
JOIN 'products.json' p ON o.product_id = p.id;
```

File Path References
Reference files directly in queries:
```sql
-- Explicit file paths in FROM clause
SELECT a.*, b.*
FROM '/data/warehouse/sales.parquet' a
JOIN '/exports/customers.csv' b ON a.customer_id = b.id;

-- Mix table names and file paths
SELECT *
FROM sales s
JOIN '/path/to/products.parquet' p ON s.product_id = p.id;
```

Subqueries and CTEs
Use Common Table Expressions for complex queries:
```sql
-- CTE with file formats
WITH monthly_sales AS (
  SELECT DATE_TRUNC('month', date) as month,
         SUM(amount) as total
  FROM 'sales_*.parquet'
  GROUP BY month
)
SELECT month,
       total,
       LAG(total, 1) OVER (ORDER BY month) as prev_month,
       (total - LAG(total, 1) OVER (ORDER BY month))
         / LAG(total, 1) OVER (ORDER BY month) * 100 as growth_pct
FROM monthly_sales
ORDER BY month;
```

📖 Complete SQL Reference
DataFusion provides PostgreSQL-compatible SQL with extensive function support. Here's a quick reference:
Supported SQL Features
✅ SELECT Statements
- Column selection, aliases, wildcards
- DISTINCT
- WHERE conditions with all comparison operators
- GROUP BY, HAVING
- ORDER BY with ASC/DESC
- LIMIT, OFFSET
✅ JOINs
- INNER JOIN
- LEFT/RIGHT/FULL OUTER JOIN
- CROSS JOIN
- Self joins
✅ Subqueries
- Scalar subqueries
- IN/NOT IN subqueries
- EXISTS/NOT EXISTS
- Correlated subqueries
✅ Set Operations
- UNION, UNION ALL
- INTERSECT
- EXCEPT
✅ Window Functions
- ROW_NUMBER, RANK, DENSE_RANK
- LAG, LEAD
- FIRST_VALUE, LAST_VALUE
- Aggregate window functions
✅ Common Table Expressions (CTEs)
- WITH clause
- Recursive CTEs (limited support)
Function Categories
Aggregate Functions:
COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE, MEDIAN, APPROX_DISTINCT, APPROX_PERCENTILE
String Functions:
CONCAT, UPPER, LOWER, TRIM, LTRIM, RTRIM, SUBSTRING, REPLACE, SPLIT_PART, LENGTH, CHAR_LENGTH, POSITION, REGEXP_MATCH, REGEXP_REPLACE
Date/Time Functions:
NOW, CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, DATE_TRUNC, EXTRACT, TO_TIMESTAMP, TO_DATE, DATE_ADD, DATE_SUB
Math Functions:
ABS, CEIL, FLOOR, ROUND, TRUNC, SQRT, POWER, EXP, LN, LOG, SIN, COS, TAN
Type Conversion:
CAST, TRY_CAST, COALESCE, NULLIF
Array Functions:
ARRAY_AGG, ARRAY_LENGTH, ARRAY_CONTAINS, ARRAY_POSITION, ARRAY_CONCAT, UNNEST
Conditional:
CASE WHEN, IF, COALESCE, NULLIF
For complete DataFusion SQL documentation, see the Apache Arrow DataFusion SQL Reference.
🔍 Troubleshooting
Common Issues
File Not Found:
```
Error: File not found: /data/sales.parquet
```

- Verify the file path is absolute (`///` prefix for `file://`)
- Check file permissions
- Ensure the file exists: `ls -la /data/sales.parquet`
Schema Inference Failed:
```
Error: Failed to infer schema
```

- For CSV: Check if `?header=true` is set correctly
- For JSON: Verify the file is valid JSON or NDJSON
- Try reading the first few rows to diagnose
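Reading the first few rows is easy with Python's `csv` module; the `peek_csv` helper below is an illustrative snippet, independent of DBCrust:

```python
import csv
from itertools import islice

def peek_csv(path: str, n: int = 5, delimiter: str = ","):
    """Print and return the first n rows so you can eyeball whether the
    first row is really a header and whether the delimiter is right."""
    with open(path, newline="") as f:
        rows = list(islice(csv.reader(f, delimiter=delimiter), n))
    for row in rows:
        print(row)
    return rows
```

If the first row comes back looking like data rather than column names, pass `?header=false` in the connection URL; if each row parses as a single long field, the delimiter is wrong.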
Nested Field Not Found:
```
Error: No field named 'data.customer.address'
```

- Check the schema with `\d table_name`
- Verify field names match exactly (case-sensitive)
- Use autocomplete to discover available fields
Glob Pattern No Results:
```
Table 'logs' not found
```

- Verify the glob pattern matches files: `ls /path/*.csv`
- Check the directory exists and contains matching files
- Ensure proper quoting: `dbcrust 'csv:///logs/*.csv'`
Performance Issues
Slow Query on Large CSV:
- Consider converting to Parquet for better performance
- Use column selection to reduce data read
- Apply filters early in the query
Memory Usage:
- Parquet files are memory-efficient (columnar)
- CSV/JSON may load more data into memory
- Use LIMIT for exploratory queries
Nested Field Queries Slow:
- DataFusion efficiently handles nested structures
- Consider flattening extremely deep nesting
- Use specific field paths instead of SELECT *
Ready to query your data files? Start with a simple Parquet or CSV file and explore the powerful SQL capabilities DBCrust provides!