
Selective Parquet Writing in Azure Synapse Analytics Dataflows using Dynamic File Names

Purpose:

The purpose of this article is to demonstrate how to create a data flow in Azure Synapse Analytics that efficiently processes data while considering the existence of specific files. By implementing dynamic file name construction and utilizing metadata checks, the data flow ensures that only relevant data is processed, thereby improving efficiency and maintaining data integrity. This approach is particularly useful for scenarios where historical data needs to be preserved, and only the latest partition needs to be processed. Through step-by-step instructions and code examples, readers will learn how to implement this dynamic data processing solution in their Azure Synapse Analytics environment.

When to Use This Method:

  • Specifically applicable to Parquet files (not Delta tables).
  • Ideal for scenarios where all historical data must be retained without being refreshed.
  • Suitable when all columns need to be preserved in the output files (typical partitioning schemes can drop the partition columns from the written files).

Description

This article explores an approach to writing data from an Azure Synapse Analytics (or Azure Data Factory) data flow to Parquet files for the latest partition only, without overwriting historical data. While data partitioning is a common technique for handling historical data, it is not ideal when specific columns such as year and month must remain present in the output file.

This method utilizes dynamic file names constructed using dataflow script logic. Here's a step-by-step breakdown:



  1. Source Data:

    • Use data flow sources to access your input data.
    • For this tutorial, a sample AdventureWorks file was used; you can download it here

      Source Data

  2. Dynamic File Name Construction with a Derived Column:

    • The script uses a derived column named colPattern to construct the target file path dynamically. This offers flexibility compared to hardcoding the path in the sink.

      concat('sink/folder/path/', iif(equals(length(toString(month(OrderDate))), 1),
          concat('0', toString(month(OrderDate))), toString(month(OrderDate))), '-',
          toString(year(OrderDate)), '.parquet')
      

      Here's how it works:

      • The script uses the concat function to combine several elements:
        • A base directory path ending with a slash (replace with your actual path).
        • The month extracted from the OrderDate field using the month function. The script handles single-digit months by prepending a leading zero using the iif function if necessary.
        • A hyphen (-) character.
        • The year extracted from the OrderDate field using the year function.
        • The file extension (e.g., ".parquet").

      By creating a derived column, this logic becomes reusable within the script: you can modify the base directory path or file extension without altering the core functionality. (The sketch after this list shows how the derived column and sink appear together in data flow script form.)

  3. Sink Configuration:

    • Configure the dataflow sink to write the data in Parquet format.
    • Set the sink's File name option to "As data in column" and select the colPattern column as the source of the file name, so each row is written to the Parquet file that matches its Month-Year.
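
For reference, here is a minimal sketch of how the derived column and sink might appear in the underlying data flow script. The stream names (source1, deriveFileName, sink1) are placeholders and the sink is assumed to use an inline Parquet dataset, so your generated script may differ slightly:

    source1 derive(colPattern = concat('sink/folder/path/',
            iif(equals(length(toString(month(OrderDate))), 1),
                concat('0', toString(month(OrderDate))),
                toString(month(OrderDate))),
            '-', toString(year(OrderDate)), '.parquet')) ~> deriveFileName
    deriveFileName sink(allowSchemaDrift: true,
            validateSchema: false,
            format: 'parquet',
            rowUrlColumn: 'colPattern') ~> sink1

Here, rowUrlColumn is the script-side counterpart of the "As data in column" file name option: each row is routed to the file whose relative path is held in colPattern, for example sink/folder/path/03-2024.parquet for an OrderDate in March 2024.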


Now, if you debug this dataflow, here is the output:


We've successfully created six Parquet files based on the Month and Year of the OrderDate column. However, we want the dataflow to process only the current Month-Year partition and leave the older partitions untouched as historical data. To achieve this, we'll orchestrate the dataflow from a pipeline built from the following elements:






  1. Define Variables:

    • In the pipeline's "Variables" tab, create a variable named "currentMonthYear" of type "String". This variable holds the Month-Year of the data you want to process.


  2. Activities:

    • The pipeline will consist of three activities:
      • setMonthYearVar: This Set Variable activity sets the value of the "currentMonthYear" variable. To target the current Month-Year, use this expression:

        @formatDateTime(utcNow(), 'MM-yyyy')
        


      • checkFileExistence: This Get Metadata activity checks whether a file named "[directory]/[month]-[year].parquet" exists in the specified container in your storage account, with the month and year populated dynamically from the "currentMonthYear" variable (a sketch of the file name expression follows this list). Note: replace "[directory]" with the actual directory path where your data resides.




      • productsFlow: This Data flow activity runs the core data processing logic defined in the dataflow above.
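
As a sketch of how the Month-Year value feeds the metadata check: assuming the Get Metadata activity points at a dataset parameterized with a hypothetical fileName parameter, the activity can pass it the expression below, with "Exists" selected in the activity's Field list:

    @concat(variables('currentMonthYear'), '.parquet')

The result of the check then surfaces as activity('checkFileExistence').output.exists, which is exactly what the next step consumes.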

To summarize, the Set Variable activity returns a value in Month-Year format (e.g., 03-2024). The Get Metadata activity then checks whether the destination path contains a file named 03-2024.parquet, returning true if it exists and false otherwise. We now need to orchestrate the dataflow so that only the latest partition is executed.

3. Define Parameters:

Now, inside the dataflow, create a parameter called fileExists. In the pipeline, the Data flow activity passes this parameter a value computed from the Get Metadata output, using the following expression:

@if(
  equals(activity('checkFileExistence').output.exists, true),
  variables('currentMonthYear'),
  formatDateTime(addDays(startOfMonth(utcNow()), -1), 'MM-yyyy')
)

  1. Checks for an existing file: It verifies whether the exists output of the Get Metadata activity is true, which indicates the current month's file is already present.
  2. Current month processing: If the file exists (true), it returns the currentMonthYear variable, so the current month and year are used for processing.
  3. Handling a missing file: If no file exists (false), it returns the previous month, derived from the last day of the previous month (a worked evaluation follows this list). This ensures the previous month's data is processed when no current-month file exists yet.
  4. Purpose: The overall objective is to refresh the entire previous month's data whenever a new month arrives, which avoids potential data loss and keeps the partitions consistent.
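
As a worked evaluation of the fallback branch (assuming, purely for illustration, that the pipeline runs on 2 April 2024 and 04-2024.parquet has not been written yet):

    startOfMonth(utcNow())                                           -> 2024-04-01T00:00:00Z
    addDays(startOfMonth(utcNow()), -1)                              -> 2024-03-31T00:00:00Z
    formatDateTime(addDays(startOfMonth(utcNow()), -1), 'MM-yyyy')   -> '03-2024'

The fileExists parameter therefore receives '03-2024', and the previous month's partition is reprocessed.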

With the Month-Year value determined, we need to modify the dataflow to process the returned month and potentially all subsequent months, if they exist. To achieve this, add a filter immediately after the data source within the dataflow using the following expression:

    (year(OrderDate) >= year(toDate($fileExists, 'MM-yyyy')) &&
     month(OrderDate) >= month(toDate($fileExists, 'MM-yyyy'))) ||
    (year(OrderDate) > year(toDate($fileExists, 'MM-yyyy')) &&
     month(OrderDate) <= month(toDate($fileExists, 'MM-yyyy')))

  • The purpose of the expression is to filter rows where the OrderDate falls within or after the month and year specified in $fileExists.

  • If the year and month of OrderDate are both greater than or equal to those extracted from $fileExists, or if the year is strictly greater and the month is less than or equal, then the row will be included in the result.

Examples:

  • If $fileExists = 03-2024, this filter would include rows where the OrderDate is in March 2024 or any later month and year.
  • If $fileExists = 12-2023, this filter would include rows where the OrderDate is in December 2023 or any later month and year. (including all months in 2024)
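
As a side note, the same condition can be written as a single comparison by collapsing year and month into one number; this is merely an equivalent alternative sketch, not a required change:

    (year(OrderDate) * 100 + month(OrderDate)) >=
    (year(toDate($fileExists, 'MM-yyyy')) * 100 + month(toDate($fileExists, 'MM-yyyy')))

Both forms include every row whose OrderDate falls in or after the $fileExists month.
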
If you run the dataflow with current Month-Year data in the source, only the 03-2024.parquet partition is refreshed. To test the dataflow's behavior, edit the Set Variable activity and change its value to 05-2017 (the CSV source file contains data from 01-2017 to 06-2017). In this scenario, you expect the dataflow to refresh the 05-2017 and 06-2017 partitions and ignore the historical data.




As the screenshot below shows, only the 05-2017 and 06-2017 partitions have a newer modified date, meaning only they were refreshed. This confirms that the dataflow works as expected: it processes only the target partitions derived from the Month-Year value and leaves historical data untouched.



Conclusion:

Implementing a dynamic data flow in Azure Synapse Analytics is a powerful way to process data efficiently while maintaining data integrity and preserving historical records. By combining dynamic file name construction with a metadata check, the pipeline processes only the relevant partition, which makes it particularly well suited to Parquet outputs that must retain all of their columns. With the step-by-step instructions and expressions provided in this article, you can adapt this dynamic data processing pattern to your own Azure Synapse Analytics environment.
