Modernizing Data Processing Platform for a Finance Customer

Executive Summary

A finance customer providing credit card services and installment loan products in the US embarked on modernizing their data processing platform, leveraging the cloud to optimize their data pipelines and reduce the cost and effort of managing the platform.

Customer Challenge

Previously, the customer faced challenges with building individual data pipelines for onboarding new datasets from existing or new source systems in their on-premises infrastructure. This process was time-intensive and required significant manual effort.

As part of their modernization plan, the customer aimed to automate this process to reduce development time and enhance operational efficiency. Additionally, they sought comprehensive visibility into data movement across various data lake zones, data quality issues, error logging, and alerting.

Why AWS

The customer chose AWS to leverage its managed services, allowing them to focus on creating business value rather than spending resources on managing infrastructure.

The Solution

To address these challenges, we designed and built a sophisticated data integration and observability platform using AWS services and Snowflake. The solution leveraged Amazon S3, AWS Glue, PySpark, Amazon Managed Workflows for Apache Airflow (MWAA), AWS Secrets Manager, Amazon CloudWatch, Amazon Simple Email Service (SES), and Snowflake.

  • Configuration tables were created to capture details of files, source and target systems, transformations, business rules, and data standardization requirements. These tables served as a central repository for managing metadata and orchestrating data processes (a sample configuration entry is sketched after this list).
  • Data was ingested into the data lake’s raw zone, stored in an S3 bucket.
  • The framework was designed to be robust and scalable, handling thousands of files efficiently.
  • Developed a scheduling system for data files, using control-file frequency codes to manage processing windows for intraday, daily, weekly, monthly, quarterly, and yearly schedules. The system included mechanisms for ad-hoc processing, checks on previous processing runs, and distinct processing windows to ensure timely and efficient data handling (see the frequency-code sketch after this list).
  • MWAA triggered PySpark jobs on AWS Glue, which automatically provisioned, managed, and ran the Spark processes.
  • These jobs standardized CSV files in the raw zone, applied business rules and transformations, converted the data to Parquet format, and moved it to the cleansed S3 bucket (a simplified PySpark example follows this list).
  • MWAA orchestrated the entire ETL process, ensuring seamless workflow management.
  • Validation processes ensured the integrity of data by checking the number of records processed between source and target.
  • Schema validation processes verified that incoming data conformed to the agreed schema and captured any deviations (see the validation sketch after this list).
  • Error tables recorded failed records, and email alerts were sent for any issues.
  • Airflow DAGs (Directed Acyclic Graphs) were automatically generated from entries in the configuration tables, streamlining workflow creation and management (see the DAG-generation sketch after this list).
  • An SLA escalation notification module monitored the data pipelines and alerted stakeholders when data was not available by a specified time (see the SLA sketch after this list).
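
To illustrate the metadata-driven design, here is a minimal sketch of what a single configuration entry might look like, expressed as a Python dictionary. Every name and value below (dataset, buckets, columns, rules, emails) is a hypothetical placeholder, not the customer's actual schema.

```python
# A hypothetical configuration entry for one source file feed.
# All names and values are illustrative placeholders.
dataset_config = {
    "dataset_name": "card_transactions",
    "source_system": "core_banking",
    "file_pattern": "card_txn_*.csv",
    "frequency_code": "D",                      # D = daily, W = weekly, M = monthly, ...
    "raw_path": "s3://example-raw-zone/card_transactions/",
    "cleansed_path": "s3://example-cleansed-zone/card_transactions/",
    "target_table": "ANALYTICS.CARD_TRANSACTIONS",   # Snowflake target
    "expected_schema": {
        "txn_id": "string",
        "card_number": "string",
        "txn_amount": "double",
        "txn_date": "date",
    },
    "business_rules": ["txn_amount > 0", "txn_date IS NOT NULL"],
    "standardization": {"trim_strings": True, "date_format": "yyyy-MM-dd"},
    "sla_deadline_utc": "06:00",
    "alert_emails": ["data-ops@example.com"],
}
```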
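
The scheduling logic can be sketched as a small frequency-code check. The codes and window rules below (for example, weekly runs on Mondays) are illustrative assumptions rather than the production rules.

```python
from datetime import date

def is_in_processing_window(frequency_code: str, run_date: date) -> bool:
    """Return True if a dataset with this frequency code should be processed
    on run_date. Codes and window rules are illustrative assumptions."""
    if frequency_code in ("I", "D"):      # intraday / daily: every run date
        return True
    if frequency_code == "W":             # weekly: e.g. Mondays
        return run_date.weekday() == 0
    if frequency_code == "M":             # monthly: first calendar day
        return run_date.day == 1
    if frequency_code == "Q":             # quarterly: first day of Jan/Apr/Jul/Oct
        return run_date.day == 1 and run_date.month in (1, 4, 7, 10)
    if frequency_code == "Y":             # yearly: 1 January
        return run_date.day == 1 and run_date.month == 1
    return False                          # unknown code: skip and let ops review

# Example: decide whether the daily feed should run today.
print(is_in_processing_window("D", date.today()))
```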
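
The raw-to-cleansed Glue step might look roughly like the following PySpark job: read CSV from the raw zone, apply standardization and a business rule, and write Parquet to the cleansed bucket. The paths, columns, and rule shown are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical locations; in the platform these come from the configuration tables.
RAW_PATH = "s3://example-raw-zone/card_transactions/"
CLEANSED_PATH = "s3://example-cleansed-zone/card_transactions/"

spark = SparkSession.builder.appName("standardize_card_transactions").getOrCreate()

# Read the raw CSV files, using the first row as a header.
raw_df = spark.read.option("header", True).csv(RAW_PATH)

# Illustrative standardization: trim strings, cast types, normalize dates,
# and apply one example business rule (keep only positive amounts).
cleansed_df = (
    raw_df
    .withColumn("txn_id", F.trim(F.col("txn_id")))
    .withColumn("txn_amount", F.col("txn_amount").cast("double"))
    .withColumn("txn_date", F.to_date(F.col("txn_date"), "yyyy-MM-dd"))
    .filter(F.col("txn_amount") > 0)
)

# Write to the cleansed zone as Parquet, partitioned by transaction date.
cleansed_df.write.mode("overwrite").partitionBy("txn_date").parquet(CLEANSED_PATH)
```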
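
The validation checks can be sketched as two helpers: a record-count comparison between source and target, and a schema check against the expected definition. The column names and expected-schema format are assumptions; in the actual platform, deviations were written to error tables and surfaced through email alerts.

```python
from pyspark.sql import DataFrame

def validate_record_counts(source_df: DataFrame, target_df: DataFrame) -> bool:
    """Flag a mismatch between the number of records read and written."""
    return source_df.count() == target_df.count()

def validate_schema(df: DataFrame, expected_schema: dict) -> list:
    """Return a list of deviations from the agreed schema (missing columns or
    unexpected types). expected_schema maps column name -> Spark type name."""
    actual = {f.name: f.dataType.simpleString() for f in df.schema.fields}
    deviations = []
    for column, expected_type in expected_schema.items():
        if column not in actual:
            deviations.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            deviations.append(
                f"type mismatch on {column}: expected {expected_type}, got {actual[column]}"
            )
    return deviations
```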
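
DAG auto-generation can be approximated with a loop over configuration entries that registers one DAG per dataset. The sketch below assumes Airflow 2.4+ and the Amazon provider's GlueJobOperator, and uses a static list in place of a read from the configuration tables.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# In production these entries would be read from the configuration tables;
# a static list stands in for them here.
dataset_configs = [
    {"dataset_name": "card_transactions", "schedule": "@daily",
     "glue_job": "standardize_card_transactions"},
    {"dataset_name": "loan_payments", "schedule": "@daily",
     "glue_job": "standardize_loan_payments"},
]

for cfg in dataset_configs:
    dag_id = f"ingest_{cfg['dataset_name']}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2024, 1, 1),
        schedule=cfg["schedule"],
        catchup=False,
    ) as dag:
        GlueJobOperator(
            task_id="run_glue_standardization",
            job_name=cfg["glue_job"],
        )
    # Registering the DAG object in the module namespace lets Airflow discover it.
    globals()[dag_id] = dag
```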
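
The SLA escalation idea reduces to: check whether the expected objects landed in S3 by the deadline and, if not, email stakeholders through Amazon SES. The bucket, prefix, deadline, and addresses below are placeholders.

```python
from datetime import datetime, timezone

import boto3

def data_arrived(bucket: str, prefix: str) -> bool:
    """Return True if at least one object exists under the expected prefix."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0

def escalate_if_missed(bucket: str, prefix: str, deadline_hour_utc: int) -> None:
    """If the SLA deadline has passed and the data is absent, alert stakeholders."""
    now = datetime.now(timezone.utc)
    if now.hour >= deadline_hour_utc and not data_arrived(bucket, prefix):
        ses = boto3.client("ses", region_name="us-east-1")
        ses.send_email(
            Source="data-platform-alerts@example.com",               # placeholder sender
            Destination={"ToAddresses": ["data-ops@example.com"]},   # placeholder recipients
            Message={
                "Subject": {"Data": f"SLA missed for {prefix}"},
                "Body": {"Text": {"Data":
                    f"No data in s3://{bucket}/{prefix} as of {now.isoformat()}."}},
            },
        )

# Example: the daily feed is expected by 06:00 UTC.
escalate_if_missed("example-raw-zone", "card_transactions/2024-01-01/", 6)
```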

High Level Architecture

Outcomes and Benefits

The implementation of this automated data integration and observability platform delivered several key benefits to the customer:

  • Reduced Development Time: Automated data pipelines significantly cut down the time required to onboard new datasets.
  • Increased Efficiency: The solution minimized manual effort, allowing the team to focus on higher-value tasks.
  • Enhanced Data Quality: Comprehensive validation processes ensured high data quality and integrity.
  • Operational Visibility: The platform provided full visibility into data movements, data quality issues, and errors.
  • Scalability: The architecture allowed easy scaling to accommodate new datasets and increased data volumes.
  • Automated Workflow Management: Automatically generated Airflow DAGs streamlined workflow management and ensured consistency.
  • SLA Escalation Notifications: The integrated notification module ensured timely alerts if data was not available by the specified SLA, enabling proactive issue resolution and reducing downtime.

Conclusion

By modernizing their data processing platform with an automated solution using Amazon S3, AWS Glue, PySpark, MWAA, and Snowflake, the finance customer achieved significant improvements in efficiency, data quality, and operational visibility. This solution not only streamlined data onboarding processes but also provided robust validation and error-handling capabilities, enabling the customer to leverage their data more effectively for strategic decision-making.