Implementing a Data Mesh Architecture for a Life Sciences Customer

Executive Summary

A biotechnology and life sciences customer set out to modernize its data management by building a robust cloud data platform on AWS. The customer opted for a data mesh architecture to improve scalability, flexibility, and data governance across the organization.

Customer Challenge

The customer was running a legacy data architecture, and technical debt kept growing because of a lack of automation. With no common standards or processes for teams to follow, a variety of tools and technologies were in use across the enterprise, which hindered collaboration.

As part of its modernization plan, the customer wanted to migrate to a data mesh architecture in the cloud using the latest technologies.

Why AWS

The customer chose AWS because its managed services would let them focus on creating business value rather than spending resources on managing infrastructure. AWS also offered all the services required to implement a robust data mesh architecture.

The Solution

To address these challenges, we designed and built a data mesh architecture using AWS services. The solution used Amazon S3, AWS Glue, Amazon Managed Workflows for Apache Airflow (MWAA), Amazon Athena, AWS IAM, and AWS Lake Formation.

  • AWS Glue was used to integrate with various source systems via API calls. Data was pulled into the raw zone of the data lake stored in Amazon S3.
  • AWS Glue jobs were used to standardize and transform the raw data and generate data products, which were cataloged and shared across functional domains.
  • Transformed data was moved from the raw zone to the standardized zone and subsequently to the published zone via AWS Glue.
  • AWS Lake Formation was employed to manage federated governance across the domains. This ensured data security, compliance, and access control were consistently enforced across different domains by the respective domain owners.
  • AWS Glue Data Catalog was used to catalog data assets in all zones of the data lake (raw, standardized, and published). The catalog was made available in AWS Lake Formation so that domain owners could discover data products and share them across domains.
  • Managed Workflows for Apache Airflow (MWAA) was used to orchestrate data integration, transformation, and publication processes.
  • MWAA facilitated the creation, scheduling, and monitoring of data workflows, ensuring seamless and automated data pipeline execution.
  • Amazon Athena was used to provide self-service query capabilities to business users.
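
The zone flow and cross-domain sharing described in the list above can be illustrated with a small Python sketch. It is not the customer's implementation. The key layout `<zone>/<domain>/<dataset>/<file>`, the function names, and the example domain, table, and role names are all assumptions made for illustration. The only real API shape referenced is the keyword-argument structure accepted by the `grant_permissions` call of the boto3 Lake Formation client; the sketch only builds that request and does not call AWS.

```python
from typing import Optional

# Zone names come from the text; their order reflects the described flow.
ZONES = ("raw", "standardized", "published")


def next_zone(zone: str) -> Optional[str]:
    """Return the zone data is promoted to next, or None after 'published'."""
    idx = ZONES.index(zone)  # raises ValueError for an unknown zone
    return ZONES[idx + 1] if idx + 1 < len(ZONES) else None


def promote_key(key: str) -> str:
    """Map an S3 object key in one zone to the matching key in the next zone.

    Assumes a hypothetical layout of '<zone>/<domain>/<dataset>/<file>'.
    """
    zone, _, rest = key.partition("/")
    target = next_zone(zone)
    if target is None:
        raise ValueError(f"'{zone}' is the final zone; nothing to promote to")
    return f"{target}/{rest}"


def build_grant_request(database: str, table: str, principal_arn: str) -> dict:
    """Build keyword arguments for a Lake Formation SELECT grant on one table.

    The dict follows the Principal/Resource/Permissions structure expected by
    boto3's lakeformation client method grant_permissions. The database, table,
    and principal values are supplied by the caller (hypothetical here).
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    # Hypothetical object landed in the raw zone by an ingestion job.
    key = "raw/clinical/assay_results/part-0000.json"
    standardized = promote_key(key)
    published = promote_key(standardized)
    print(standardized)
    print(published)

    # Hypothetical grant from a producing domain to a consuming role.
    request = build_grant_request(
        database="clinical_published",
        table="assay_results",
        principal_arn="arn:aws:iam::111122223333:role/research-analyst",
    )
    print(request)
    # With AWS credentials configured, a domain owner could apply it with:
    #   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request builder separate from the AWS call means the grant a domain owner intends to make can be reviewed or unit tested before it is applied.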

High-Level Architecture

 

Outcomes and Benefits

The implementation of the data mesh architecture on AWS provided the life sciences customer with several significant benefits:

  • Enhanced Scalability: The data mesh architecture allowed for scalable and decentralized data management, enabling different business units to manage their own data domains independently.
  • Efficiency: Automated data pipelines reduced manual effort and development time, leading to faster data processing and publication.
  • Robust Data Governance: AWS Lake Formation ensured consistent and comprehensive data governance across the data lake, enhancing data security and compliance.
  • Streamlined Data Management: AWS Glue Data Catalog facilitated efficient metadata management and dataset discovery, improving data accessibility and usability.
  • Orchestration: MWAA provided robust orchestration capabilities, ensuring seamless and automated data workflows.

Conclusion

By modernizing their data management system with a data mesh architecture using AWS services, the life sciences customer achieved a scalable, efficient, and well-governed data platform. This solution not only streamlined data integration and processing but also ensured robust data governance and enhanced operational agility, empowering the customer to leverage their data assets more effectively for strategic decision-making.