Demystifying Big Data: The Power of Delta Tables (Part 2)
Purpose:
The purpose of this article is to demonstrate how to create a data flow in Azure Synapse Analytics that processes data efficiently by first checking whether specific files already exist. By combining dynamic file name construction with metadata checks, the data flow processes only the relevant data, improving efficiency and maintaining data integrity. This approach is particularly useful when historical data must be preserved and only the latest partition needs to be processed. Through step-by-step instructions and code examples, readers will learn how to implement this dynamic data processing solution in their own Azure Synapse Analytics environment.
When to Use This Method:
This article explores an approach to write data from an Azure Data Factory (ADF) dataflow to Parquet files for the latest partition only, avoiding overwriting historical data. While data partitioning is a common technique for handling historical data, it might not be ideal if you require specific columns like year and month to be present in the output file.
This method utilizes dynamic file names constructed using dataflow script logic. Here's a step-by-step breakdown:
Source Data:
Dynamic File Name Construction with derivedColumn:
The script utilizes a derived column named colPattern to dynamically construct the file path for the file existence check. This approach offers flexibility compared to hardcoding the path directly.
concat(
    'sink/folder/path',
    iif(
        equals(length(toString(month(OrderDate))), 1),
        concat('0', toString(month(OrderDate))),
        toString(month(OrderDate))
    ),
    '-',
    toString(year(OrderDate)),
    '.parquet'
)
Here's how it works:
The expression uses the concat function to combine several elements:
- The sink folder path ('sink/folder/path').
- The month extracted from the OrderDate field using the month function. The script handles single-digit months by prepending a leading zero with the iif function.
- A hyphen, followed by the year extracted from the OrderDate field using the year function.
- The '.parquet' file extension.
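To make this concrete, here is a rough Python equivalent of the expression (an illustrative sketch only; the sample date is an assumption, and 'sink/folder/path' is the article's placeholder prefix):

from datetime import date

def col_pattern(order_date: date) -> str:
    # Mirrors the dataflow expression: zero-pad single-digit months
    # (the iif/concat branch), then append '-<year>.parquet'.
    month = str(order_date.month)
    if len(month) == 1:
        month = '0' + month
    return 'sink/folder/path' + month + '-' + str(order_date.year) + '.parquet'

print(col_pattern(date(2024, 3, 15)))  # -> sink/folder/path03-2024.parquet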
Sink Configuration:
In the sink settings, use the file name option property to specify the dataflow column containing the dynamic file name (colPattern). This ensures the data gets written to the appropriate Parquet file based on the date.

We've successfully created six Parquet files based on the Month and Year of the OrderDate column. However, we want to implement a dynamic dataflow that processes only the current Month-Year partition; the older partitions will remain as historical data.
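At this point the sink folder contains one file per Month-Year, something like the following (an illustrative listing only; the actual names depend on your OrderDate values, assumed here to span January through June 2024):

01-2024.parquet
02-2024.parquet
03-2024.parquet
04-2024.parquet
05-2024.parquet
06-2024.parquet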
1. Define Variables
In the pipeline, create a String variable named currentMonthYear to hold the Month-Year value of the partition to process.
2. Activities
Add a Set Variable activity that assigns currentMonthYear using the expression below, followed by a Get Metadata activity (named checkFileExistence here) with the Exists field selected, pointed at the expected file in the destination path:
@formatDateTime(utcNow(), 'MM-yyyy')
To summarize, the Set Variable activity returns a value in the format Month-Year (e.g., 03-2024). The Metadata activity then looks inside the destination path for a file named 03-2024.parquet and returns true if it exists, false if it doesn't. We now need to orchestrate the dataflow so that it processes only the latest partition.
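Together, the two activities behave like this hedged Python sketch (illustrative only: the pipeline uses utcNow()/formatDateTime and the Get Metadata activity's exists output rather than a local filesystem check):

import os
from datetime import datetime, timezone

# Build the current Month-Year, then check whether that partition's file exists.
current_month_year = datetime.now(timezone.utc).strftime('%m-%Y')  # e.g. '03-2024'
sink_prefix = 'sink/folder/path'  # placeholder path from the article
file_exists = os.path.exists(sink_prefix + current_month_year + '.parquet')
print(current_month_year, file_exists)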
3. Define Parameters
Now, inside the dataflow, we will create a parameter called fileExists. In the pipeline, this parameter is assigned the Month-Year to process, derived from the Metadata activity's return value, using the following expression:
@if(
equals(activity('checkFileExistence').output.exists, true),
variables('currentMonthYear'),
formatDateTime(addDays(startOfMonth(utcNow()), -1), 'MM-yyyy')
)
Here's how this expression behaves:
- It checks whether the Exists output from the Metadata activity is true, which indicates the presence of a file.
- If the file exists (true), it returns the currentMonthYear variable, so the current month and year are used for data processing.
- If the file does not exist (false), it returns the previous month (startOfMonth(utcNow()) minus one day, formatted as MM-yyyy), ensuring the previous month's data is processed when no current-month file exists.

With the Month-Year value determined, we need to modify the dataflow to process the returned month and potentially all subsequent months, if they exist. To achieve this, add a filter immediately after the data source within the dataflow, using the following expression:
(year(OrderDate) >= year(toDate($fileExists, 'MM-yyyy')) &&
month(OrderDate) >= month(toDate($fileExists, 'MM-yyyy'))) ||
(year(OrderDate) > year(toDate($fileExists, 'MM-yyyy')) &&
month(OrderDate) <= month(toDate($fileExists, 'MM-yyyy')))
This filter keeps rows whose OrderDate falls within or after the month and year specified in $fileExists: if the year and month of OrderDate are both greater than or equal to those extracted from $fileExists, or if the year is strictly greater and the month is less than or equal, the row is included in the result.

Examples:
- If $fileExists = 03-2024, the filter includes rows where the OrderDate is in March 2024 or any later month and year.
- If $fileExists = 12-2023, the filter includes rows where the OrderDate is in December 2023 or any later month and year (including all months of 2024).
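As a quick sanity check, this Python sketch mirrors the filter predicate and reproduces the examples above (an illustrative sketch; the sample dates are assumptions):

from datetime import date

def included(order_date: date, file_exists: str) -> bool:
    # Same predicate as the dataflow filter, with $fileExists parsed as MM-yyyy.
    m, y = (int(part) for part in file_exists.split('-'))
    return ((order_date.year >= y and order_date.month >= m) or
            (order_date.year > y and order_date.month <= m))

print(included(date(2024, 3, 10), '03-2024'))  # True  (March 2024)
print(included(date(2024, 2, 10), '03-2024'))  # False (February 2024)
print(included(date(2025, 1, 1), '03-2024'))   # True  (later year, earlier month)
print(included(date(2024, 1, 1), '12-2023'))   # True  (any month in 2024)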
Conclusion:
Implementing a dynamic data flow in Azure Synapse Analytics is a powerful way to process data efficiently while preserving historical records and maintaining data integrity. By combining dynamic file name construction with a metadata existence check, the pipeline processes only the relevant partition, which is especially useful when writing Parquet files that must retain columns such as year and month in the output. With the step-by-step instructions and code examples in this article, you can adapt this dynamic processing pattern to your own Azure Synapse Analytics environment.