Open-Source Accelerators
Our Vault Developer Support team has built sample accelerators which you can use as-is with your Vault or as a starting point for building a custom integration. If you are building your own accelerator, refer to the best practices for building scalable accelerators below.
Each open-source accelerator performs the following fundamental processes:
- Downloads zipped Direct Data files and uploads to object storage
- Extracts data from Direct Data files
- Optionally converts CSV files to Parquet
- Loads data into the target system
- Optionally extracts document source content and text from Vault and uploads to object storage
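The extraction step above can be sketched with the Python standard library alone. This is a hypothetical example, assuming a Direct Data extract is a gzipped tarball containing CSV files; the real file layout and naming are defined by the Direct Data API, and a production accelerator would download the archive from Vault rather than build one locally:

```python
import csv
import io
import tarfile
from pathlib import Path
from tempfile import TemporaryDirectory

def extract_direct_data_file(archive_path: Path, dest_dir: Path) -> list[Path]:
    """Unpack a gzipped Direct Data extract and return the CSV files inside."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(dest_dir.rglob("*.csv"))

with TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a stand-in archive; a real accelerator would fetch this from Vault
    # and upload it to object storage before extracting.
    archive = tmp / "extract.tar.gz"
    payload = "id,name\n1,Study A\n2,Study B\n".encode()
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo("objects/study.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    csv_files = extract_direct_data_file(archive, tmp / "out")
    with open(csv_files[0], newline="") as f:
        records = list(csv.DictReader(f))
    print([r["name"] for r in records])  # ['Study A', 'Study B']
```

From here, each extracted CSV can be converted to Parquet or loaded directly into the target system.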
The open-source accelerators listed below load data from Vault into an object storage system, from which it is then loaded into the target data system. Additionally, if your organization relies on data visualization, each target system can connect to Power BI.
- Vault → AWS S3 → Snowflake
- Vault → AWS S3 → Databricks
- Vault → AWS S3 → Redshift
- Vault → Azure Blob Storage → Azure SQL Database
- Vault → Azure Blob Storage → Microsoft Fabric Warehouse
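Each pipeline above moves extracts independently, so several extracts can be ingested concurrently. A minimal sketch using `concurrent.futures`, with invented extract names and a placeholder load function standing in for the real download-unpack-load work:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical extract keys; a real accelerator would list these from
# object storage (e.g. an S3 prefix or an Azure Blob container).
extracts = [f"direct-data/extract-{n}.tar.gz" for n in range(4)]

def load_extract(key: str) -> str:
    # Placeholder for the real work: download, unpack, and load into the target.
    return f"loaded {key}"

# One worker per extract; each worker would use its own session to the target.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_extract, extracts))

print(results)
```

Because `pool.map` preserves input order, results can still be reconciled against the original extract list after a concurrent run.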
You can access the source code for these accelerators from our GitHub repository.
Best Practices for Building a Scalable Accelerator
Building a scalable accelerator that leverages the Direct Data API requires careful consideration of several key factors. Key strategies for maximizing throughput and minimizing processing time include:
- Parallel Loading: Optimize data ingestion by loading multiple Direct Data extracts simultaneously using separate sessions. This approach leverages parallel processing capabilities to significantly reduce overall load times.
- Memory Allocation: Ensure sufficient memory allocation for COPY operations and other data processing tasks. Adequate memory prevents bottlenecks and ensures smooth data transfer.
- Incremental Loads with Staging Tables: For incremental updates, use staging tables to manage changes efficiently. Separate "Delete" and "Update" tables within the staging area allow for optimized processing; apply deletes before updates. Consider using temporary tables for complex data transformations before final insertion into the target database.
- Performance Monitoring: Continuously monitor load times and other key performance metrics. Tracking these metrics helps you identify bottlenecks and confirm that load performance meets the expectations of your target system.