ETL Pipelines Implementation for a US Government Agency
Summary
The Customer is a US municipal government agency.
Challenge
As part of its long-term digitalization strategy, the Customer planned to build a data analytics system to facilitate analytics and reporting on transportation accidents. To implement it, the Customer first needed a data warehouse based on Microsoft SQL Server and was looking for a vendor with solid experience in building data warehouses and implementing ETL pipelines. A major requirement for the vendor was the ability to meet tight deadlines: the Customer needed the pipelines built within one month.
Solution
With 17 years in data warehousing services, a portfolio of successful DWH implementation projects, and an ISO 9001-certified quality management system, ScienceSoft won over the other vendors the Customer was considering.
The Customer commissioned ScienceSoft to develop ETL pipelines that would move the historical data related to transportation accidents from the government sources to the Customer’s data storage. The goal was to centralize the existing data from the past several years and to facilitate data import in the future.
To meet the Customer’s initial request, ScienceSoft’s lead SQL developer designed and developed ETL pipelines that enabled the extraction of historical data on transportation accidents from 4 sources to the agency’s data storage.
A special requirement applied to the pipeline running from FTP storage that held continuously updated JSON files. The Customer wanted to cleanse the FTP storage upon data loading while keeping all previous file versions at its disposal. For this purpose, our developers built intermediate file storage that retained all previous file versions using FileTable, a Microsoft SQL Server feature. This way, the Customer could restore older file versions when needed.
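The FileTable-based retention described above might look roughly like the following sketch. All database, table, and directory names here are illustrative assumptions, not the project's actual configuration, and the sketch assumes FILESTREAM is already enabled on the SQL Server instance.

```sql
-- Hypothetical sketch: a FileTable that retains prior versions of the JSON
-- files pulled from FTP. Assumes FILESTREAM is enabled on the instance and
-- the database has a FILESTREAM filegroup; names are illustrative.
ALTER DATABASE AccidentsDWH
    SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'AccidentsDWH');
GO

CREATE TABLE dbo.AccidentFileArchive AS FILETABLE
WITH (
    FILETABLE_DIRECTORY = N'AccidentFileArchive',  -- exposed Windows-share folder
    FILETABLE_COLLATE_FILENAME = database_default
);
GO

-- Files copied into the share appear as rows; older versions remain
-- queryable (and restorable) through standard T-SQL.
SELECT name,
       file_stream.GetFileNamespacePath() AS file_path,
       creation_time
FROM dbo.AccidentFileArchive
ORDER BY creation_time DESC;
```

Because a FileTable exposes its contents both as a Windows file share and as a relational table, the load process can drop incoming files into the share while downstream T-SQL keeps full version history.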
Throughout the project, new requirements appeared. In particular, the Customer wanted its new DWH to incorporate the data from the agency’s old Microsoft Access-based database. It stored large amounts of historical data, some of which overlapped with the data gathered from the FTP storage source. ScienceSoft’s team consolidated these two sources and conducted data deduplication.
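A common way to deduplicate overlapping rows after consolidating two sources is a windowed ranking over the shared business key; the sketch below illustrates this pattern only, with assumed staging-table and column names rather than the project's real schema.

```sql
-- Illustrative deduplication of rows consolidated from the Access database
-- and the FTP source. Table, key, and column names are assumptions.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY accident_id               -- business key shared by both sources
               ORDER BY source_priority,              -- prefer one source over the other
                        load_ts DESC                  -- then keep the most recent load
           ) AS rn
    FROM staging.AccidentsCombined
)
DELETE FROM ranked
WHERE rn > 1;   -- remove every row but the preferred copy per key
```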
Additionally, ScienceSoft built one more ETL pipeline to integrate another large data source into the system. It was a database holding 250+ tables, which was expected to grow further. For this source, we also enabled customized scheduling for automated data upload (e.g., hourly, daily, or weekly).
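Customized load scheduling of this kind is often implemented with SQL Server Agent jobs. The fragment below is a hedged sketch of an hourly schedule; the job and schedule names are hypothetical, and the actual project may have configured scheduling differently (e.g., directly in SSIS).

```sql
-- Illustrative SQL Server Agent schedule attached to an existing load job;
-- names are assumptions. Daily/weekly variants change @freq_type accordingly.
EXEC msdb.dbo.sp_add_jobschedule
    @job_name             = N'Load_LargeSource_ETL',
    @name                 = N'HourlyLoad',
    @freq_type            = 4,   -- daily
    @freq_interval        = 1,   -- every day
    @freq_subday_type     = 8,   -- repeat within the day on an hourly basis
    @freq_subday_interval = 1,   -- every 1 hour
    @active_start_time    = 0;   -- starting at midnight
```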
Results
Within one month, the Customer received an easy-to-configure automated mechanism for extracting, transforming, and loading data from 5 sources. With ScienceSoft’s help, the Customer was able to establish a robust data warehouse that was ready to facilitate the future analytics solution.
Technologies and Tools
Microsoft SQL Server, Microsoft SQL Server Integration Services (SSIS).