Best Practices to Implement an Azure Data Factory
– Sneha Patil
1. Setting up a code repository for your data factory
- To provide a better authoring experience, Azure Data Factory allows you to configure a Git repository with either Azure Repos or GitHub. Git is a version control system that allows for easier change tracking and collaboration.
- You can configure an Azure Repos Git repository with a data factory through two methods:
Configuration method 1: On the Azure Data Factory home page, select Set up Code Repository.
Configuration method 2: In the Azure Data Factory UX authoring canvas, select the Data Factory drop-down menu, and then select Set up Code Repository.
- You can also enable the “Continuous deployment trigger”, which instructs Azure Pipelines to create new releases automatically when it detects that new artifacts are available. For example, selecting adf_publish as the filter branch means a deployment is triggered whenever there is a commit on the adf_publish branch.
- For more information on continuous deployment triggers, refer to Link.
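As a rough illustration of the branch-filter idea, here is a sketch of an Azure Pipelines YAML definition that runs only on commits to the publish branch. The service connection and resource group names are placeholders; the two template file names are the ones ADF generates into the publish branch on “Publish”.

```yaml
# Sketch only: assumes a service connection named "my-service-connection"
# and a resource group "my-adf-rg" already exist.
trigger:
  branches:
    include:
      - adf_publish          # deploy only when the publish branch changes

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: AzureResourceManagerTemplateDeployment@3
    inputs:
      deploymentScope: 'Resource Group'
      azureResourceManagerConnection: 'my-service-connection'
      resourceGroupName: 'my-adf-rg'
      location: 'West Europe'
      csmFile: 'ARMTemplateForFactory.json'
      csmParametersFile: 'ARMTemplateParametersForFactory.json'
```

Classic release pipelines configure the same behaviour through the “Continuous deployment trigger” toggle in the UI rather than in YAML.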
2. Debugging Feature
- The main motive here is to have a clean separation of resources for development, testing, and production.
- Next, use the Debug feature to perform basic end-to-end testing of newly created pipelines, adding breakpoints on activities as required.
3. Setting different Environments (DEV, TEST, PROD)
- Another technique is to handle different environment setups internally in Data Factory via a Switch activity. In this situation a central variable controls which activity path is used at runtime, for example routing to different Databricks clusters and linked services connected to each environment’s activities.
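A minimal sketch of such a pipeline, assuming a pipeline parameter named Environment (the pipeline name is illustrative, and the case bodies are left empty here):

```json
{
  "name": "PL_Switch_By_Environment",
  "properties": {
    "parameters": {
      "Environment": { "type": "string", "defaultValue": "DEV" }
    },
    "activities": [
      {
        "name": "Switch On Environment",
        "type": "Switch",
        "typeProperties": {
          "on": { "value": "@pipeline().parameters.Environment", "type": "Expression" },
          "cases": [
            { "value": "DEV",  "activities": [] },
            { "value": "TEST", "activities": [] },
            { "value": "PROD", "activities": [] }
          ],
          "defaultActivities": []
        }
      }
    ]
  }
}
```

Each case’s activities array would hold the activities bound to that environment’s resources, e.g. a notebook activity using a DEV Databricks linked service versus a PROD one.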
4. Naming Rules
- When applying a naming convention in ADF there are four areas of consideration.
Activities within pipelines
- The naming convention will largely come down to how you design your ETL process, but we also need to be aware of the rules Microsoft enforces for the different components, provided here:
- NOTE: Finally, when considering component names, be mindful that when injecting expressions, some parts of Data Factory do not tolerate spaces or special characters in names, which can later break the JSON expression syntax.
- When creating any pipeline or activity, add a description to it to offer insight into the original thinking or the reasons behind it.
5. Pipeline hierarchies
The pipeline hierarchy consists of four levels — grandparent, parent, child, and infant — as described below.
When creating a grandparent pipeline, the approach would be to build and consider the below operations:
- Adding triggers to Data Factory to start solution execution, whether scheduled, event-based, fired from Logic Apps, or called by PowerShell. The grandparent starts the processing.
- Grouping the execution of our processes, either vertically through the layers of our logical data warehouse or maybe horizontally from ingestion to output. In each case, we need to handle the high-level dependencies within our wider platform.
- If created in Data Factory, a grandparent may have a structure as given below.
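As an illustrative sketch (the pipeline names are assumptions, not from the original), a grandparent can simply sequence its parent pipelines with Execute Pipeline activities:

```json
{
  "name": "PL_Grandparent_Process_All",
  "properties": {
    "description": "Grandparent: started by a trigger, orchestrates the high-level stages.",
    "activities": [
      {
        "name": "Execute Ingestion Parent",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "PL_Parent_Ingestion", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "Execute Transformation Parent",
        "type": "ExecutePipeline",
        "dependsOn": [
          { "activity": "Execute Ingestion Parent", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "pipeline": { "referenceName": "PL_Parent_Transformation", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      }
    ]
  }
}
```

The dependsOn chain expresses the high-level, layer-to-layer dependencies; whether you group vertically or horizontally, only the references change.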
At the parent level we need to handle two further elements of the data processing solution:
- Controlling the scale and state of the services we are about to invoke. For example, when working with:
Azure SQL Database (SQLDB): scale it up ready for processing (DTUs).
Azure SQL Data Warehouse (SQLDW): start the cluster and set the scale (DWUs).
Azure Analysis Services: resume the compute, maybe also sync our read-only replica databases, and pause the resource once processing has finished.
Azure Databricks: start up the cluster if interactive.
- To support and manage the parallel execution of child transformations/activities. For example, when working with:
Azure SQLDB or Azure SQLDW: how many stored procedures do we want to execute at once?
Azure SSIS in our ADF Integration Runtime: how many packages do we want to execute?
Azure Analysis Services: how many models do we want to process at once?
If created in Data Factory, we might have something like the below, where SQLDB is our transformation service.
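A hedged sketch of such a parent, assuming the SQLDB transformations are queued in a metadata table (every object name here is illustrative): a Lookup fetches the procedure list, and a ForEach with a batchCount caps how many children run in parallel:

```json
{
  "name": "PL_Parent_SQLDB_Transform",
  "properties": {
    "activities": [
      {
        "name": "Get Procedure List",
        "type": "Lookup",
        "typeProperties": {
          "source": { "type": "AzureSqlSource", "sqlReaderQuery": "SELECT ProcName FROM dbo.TransformQueue" },
          "dataset": { "referenceName": "DS_Metadata", "type": "DatasetReference" },
          "firstRowOnly": false
        }
      },
      {
        "name": "Run Procedures",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "Get Procedure List", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "items": { "value": "@activity('Get Procedure List').output.value", "type": "Expression" },
          "isSequential": false,
          "batchCount": 8,
          "activities": [
            {
              "name": "Execute Child",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": { "referenceName": "PL_Child_Run_Procedure", "type": "PipelineReference" },
                "parameters": { "ProcName": "@item().ProcName" },
                "waitOnCompletion": true
              }
            }
          ]
        }
      }
    ]
  }
}
```

Here batchCount is the direct answer to “how many stored procedures do we want to execute at once” — raise or lower it per environment as the service can tolerate.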
Next, at the child level, we handle the actual execution of our data transformations, plus the nesting of the ForEach activities in each parent. The child level then gives us the additional scale-out processing needed for some services. If created in Data Factory, we might have something like the below.
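A child at this level could be a small parameterized pipeline that executes a single stored procedure, receiving its name from the parent’s ForEach (names are illustrative sketches, not from the original):

```json
{
  "name": "PL_Child_Run_Procedure",
  "properties": {
    "parameters": {
      "ProcName": { "type": "string" }
    },
    "activities": [
      {
        "name": "Execute Procedure",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": { "referenceName": "LS_AzureSqlDb", "type": "LinkedServiceReference" },
        "typeProperties": {
          "storedProcedureName": { "value": "@pipeline().parameters.ProcName", "type": "Expression" }
        }
      }
    ]
  }
}
```

Because the child is parameterized, one definition serves every transformation the parent fans out to.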
Infants contain reusable handlers and calls that could potentially be used at any level in our solution. The best example of an infant is for an ‘Error Handler’.
6. Working with Folders
- We can use folders and sub-folders to organize Data Factory components, as they make navigation easier.
- However, these folders are only used within the Data Factory portal UI; they are not reflected in the structure of our source code repository.
- You can enable folder creation in the ADF authoring UI: launch “Author & Monitor” from the factory blade, then, in the left navigation of the resource explorer, click the “+” sign to create folders for Pipelines, Datasets, Data Flows, and Templates.
- Additionally, with the rich parameterization support in ADF V2, you can do a dynamic lookup and pass an array of values into a parameterized dataset, which drastically reduces the need to create or maintain a large number of hard-coded datasets or pipelines. Link
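For instance, a single generic dataset (all names here are illustrative) can stand in for many hard-coded ones by taking the schema and table as parameters:

```json
{
  "name": "DS_AzureSqlTable_Generic",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "LS_AzureSqlDb", "type": "LinkedServiceReference" },
    "parameters": {
      "SchemaName": { "type": "string" },
      "TableName":  { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
      "table":  { "value": "@dataset().TableName",  "type": "Expression" }
    }
  }
}
```

A Lookup can then return an array of schema/table pairs, and a ForEach passes each pair into this one dataset at runtime.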
7. Linked Service Security via Azure Key Vault
- Wherever possible we should include this extra layer of security, allowing only Data Factory to retrieve secrets from Key Vault using its own Managed Identity.
- If you aren’t familiar with this approach, check out this Microsoft Docs page: https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault
- Be aware that when working with custom activities in ADF using Key Vault is essential as the Azure Batch application can’t inherit credentials from the Data Factory linked service references.
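A sketch of the pattern (vault URL, linked service, and secret names are placeholders): a linked service pulls its whole connection string from Key Vault as an AzureKeyVaultSecret, instead of storing it inline:

```json
{
  "name": "LS_AzureSqlDb",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "LS_KeyVault", "type": "LinkedServiceReference" },
        "secretName": "SqlDbConnectionString"
      }
    }
  }
}
```

This assumes an AzureKeyVault-type linked service named LS_KeyVault has been defined once (its typeProperties hold only the vault’s baseUrl), and that the factory’s Managed Identity has been granted Get permission on the vault’s secrets.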
8. Dynamic linked services
- Reusing code is always a great time saver, and it means you often have a smaller footprint to update when changes are needed. For example, you can parameterize a linked service in your Azure Data Factory to make it reusable.
- For example, you can parameterize the database name in your ADF linked service instead of creating 10 separate linked services corresponding to the 10 Azure SQL databases.
- This reduces overhead and improves manageability for your data factories. You can then dynamically pass the database names at runtime.
- Simply create a new linked service and click Add Dynamic Content underneath the property that you want to parameterize in your linked service.
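As a sketch (the server name and parameter name are assumptions), a linked service parameterized on database name looks like this — one definition replaces ten fixed ones:

```json
{
  "name": "LS_AzureSqlDb_Dynamic",
  "properties": {
    "type": "AzureSqlDatabase",
    "parameters": {
      "DBName": { "type": "String" }
    },
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};"
    }
  }
}
```

Any dataset referencing this linked service then supplies a value for DBName, which is substituted into the connection string at runtime via the @{linkedService().DBName} expression.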
9. Deployment Options
To deploy Data Factory we can use:
1. ARM Templates:
ARM templates allow you to create and deploy an entire Azure infrastructure. For example, you can deploy not only virtual machines, but also the network infrastructure, storage systems, and any other resources you may need.
Repeatedly deploy your infrastructure throughout the development lifecycle and have confidence your resources are deployed in a consistent manner.
Here you deploy the template through one command, rather than through multiple imperative commands.
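Assuming the template and parameter files that ADF’s publish step exports (the resource group name is a placeholder), that single command could be the following Azure PowerShell call:

```powershell
# Sketch: deploy the factory's published ARM template in one step.
New-AzResourceGroupDeployment `
    -ResourceGroupName "my-adf-rg" `
    -TemplateFile ".\ARMTemplateForFactory.json" `
    -TemplateParameterFile ".\ARMTemplateParametersForFactory.json"
```

The same files can be deployed from the portal or Azure CLI; the point is that one declarative deployment replaces many imperative commands.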
- Link to the tutorial on creating and deploying an ARM template.
2. PowerShell cmdlets & JSON definition files:
- Azure PowerShell is a set of cmdlets (lightweight commands used in the Windows PowerShell environment) for managing Azure resources directly from the PowerShell command line. It simplifies and automates creating, testing, and deploying Azure cloud platform services using PowerShell.
- Generally, this technique of deploying Data Factory parts with a 1:1 approach between PowerShell cmdlets and JSON files offers much more control and options for dynamically changing any parts of the JSON at deployment time.
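For example (resource group, factory, and file names are placeholders), each component type has its own Set-AzDataFactoryV2* cmdlet that takes a JSON definition file, giving the 1:1 mapping described above:

```powershell
# Sketch: deploy components individually from their JSON definition files,
# in dependency order (linked services, then datasets, then pipelines).
Set-AzDataFactoryV2LinkedService -ResourceGroupName "my-adf-rg" -DataFactoryName "my-adf" `
    -Name "LS_AzureSqlDb" -DefinitionFile ".\linkedService\LS_AzureSqlDb.json"

Set-AzDataFactoryV2Dataset -ResourceGroupName "my-adf-rg" -DataFactoryName "my-adf" `
    -Name "DS_SqlTable" -DefinitionFile ".\dataset\DS_SqlTable.json"

Set-AzDataFactoryV2Pipeline -ResourceGroupName "my-adf-rg" -DataFactoryName "my-adf" `
    -Name "PL_Load" -DefinitionFile ".\pipeline\PL_Load.json"
```

Because each JSON file is read at deployment time, a script can rewrite any part of it (connection names, parameters) per environment before calling the cmdlet.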
- Whitepaper: https://azure.microsoft.com/mediahandler/files/resourcefiles/whitepaper-adf-on-azuredevops/Azure data Factory-Whitepaper-DevOps.pdf
Thanks for reading.