Create an Azure Data Factory Resource and Run Azure Data Pipeline

In this post, we will create an Azure Data Factory resource and run a data pipeline that combines data from two Azure Blob Storage containers, performs a left outer join, and stores the result in an output Azure SQL table.

In this part, we will provision an Azure Data Factory resource. Make sure to select the V2 version for Azure Data Factory.

  1. In the search bar of the Azure portal, search for Data Factories and select the service.

  2. Click the Create button in the top menu to set up a new Data Factory service.

  3. Fill in the required details to set up the Data Factory service.

a. For Resource group, select the default resource group from the dropdown menu.

b. For Name, enter a relevant name, such as "dataresourceAAA," where AAA are random numbers or characters.

c. For Region, select the region closest to your location.

d. For Version, ensure the latest version (V2) is selected.

  4. Click Review + Create, then click Create to finalize the Data Factory service setup.
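
If you prefer to script this step rather than use the portal, the same resource can be provisioned with the azure-mgmt-datafactory Python SDK. The sketch below is a minimal example under that assumption; the subscription ID, resource group, factory name, and region are placeholders you would replace with your own values.

```python
# pip install azure-identity azure-mgmt-datafactory
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

SUBSCRIPTION_ID = "<your-subscription-id>"  # placeholder
RESOURCE_GROUP = "<your-resource-group>"    # placeholder: the default resource group
FACTORY_NAME = "dataresourceAAA"            # placeholder: must be globally unique
REGION = "eastus"                           # placeholder: pick the region closest to you

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# The azure-mgmt-datafactory SDK provisions V2 factories, matching the portal steps above.
adf_client.factories.create_or_update(RESOURCE_GROUP, FACTORY_NAME, Factory(location=REGION))

# Wait until provisioning finishes before moving on.
while adf_client.factories.get(RESOURCE_GROUP, FACTORY_NAME).provisioning_state != "Succeeded":
    time.sleep(5)
print("Data Factory provisioned:", FACTORY_NAME)
```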

Create Data Sets in Azure Data Factory

This process involves three steps:

  1. Open Data Factory

  2. Create linked services

  3. Create datasets

  1. Open Data Factory

After a few minutes, confirm that your Data Factory has been provisioned.

Search for "Data Factories" in the search bar and select the Data Factories service.

Your Data Factory service should appear in the list. Select the Data Factory service that you provisioned.

  2. Click Launch Studio to open Azure Data Factory Studio.

  3. We will link the Azure Blob Storage and Azure SQL Database services to Data Factory. From the left menu, select the Manage icon (toolbox icon). Then, choose Linked services from the sub-menu and click the Create linked service button.

  4. Scroll down and select Azure SQL Database, then click Continue.

    Enter "AzureSqlDatabase1" for the Name.

    For the Account selection method, ensure "From Azure subscription" is selected. Choose the default subscription from the Azure subscription dropdown. Select the server you already have under Server name and select "data" under Database name.

    For Authentication type, choose SQL authentication, and enter the username and password for the SQL Server.

    Click Create.
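
If you would rather register this linked service from code, a minimal sketch with the Python SDK follows. The connection string values (server, user, password) are placeholders; in a real setup you would keep the password in Azure Key Vault rather than inline.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlDatabaseLinkedService, LinkedServiceResource, SecureString)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# Placeholder connection string for the "data" database using SQL authentication.
sql_conn = (
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=data;User ID=<sql-user>;Password=<sql-password>;"
)

sql_linked_service = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(value=sql_conn)))

adf_client.linked_services.create_or_update(
    "<your-resource-group>", "dataresourceAAA", "AzureSqlDatabase1", sql_linked_service)
```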

  5. Link the Storage Account

    1. Select the + New button on the top menu to create another linked service.

    2. Scroll down and select Azure Blob Storage and then select Continue.

    3. For the Name, type in "AzureBlobStorage1".

    For the Authentication type, select Account key, and keep Connection string selected.

    For the Account selection method, ensure "From Azure subscription" is selected. Under the Azure subscription dropdown, select the default subscription. Under Storage account name, select the storage account that you set up earlier. Then click Create.
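
The Blob Storage linked service can be scripted the same way. The sketch below assumes account-key authentication with a placeholder connection string for the storage account you set up earlier.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService, LinkedServiceResource, SecureString)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# Placeholder account-key connection string for your storage account.
blob_conn = (
    "DefaultEndpointsProtocol=https;AccountName=<your-storage-account>;"
    "AccountKey=<your-account-key>;EndpointSuffix=core.windows.net"
)

blob_linked_service = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string=SecureString(value=blob_conn)))

adf_client.linked_services.create_or_update(
    "<your-resource-group>", "dataresourceAAA", "AzureBlobStorage1", blob_linked_service)
```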

  6. Create the datasets. First, create the output SQL table dataset.

    From the left menu, select the Author icon (pencil). Next, click the ellipses in the Datasets menu and choose New dataset.

  7. Scroll down and select Azure SQL Database, then click Continue.

    Enter "output_table" for the Name. Choose "AzureSqlDatabase1" for the Linked service. Select data.waiting_times for the Table name or the name of the table you have in Azure SQL Database.

    Ensure "From connection/store" is selected for the Import schema.

    Click OK to proceed.
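
The output dataset can also be defined programmatically. This is a minimal sketch that assumes the "AzureSqlDatabase1" linked service already exists; adjust the schema and table names to match the table in your database (older versions of the SDK expose a single table_name property instead).

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlTableDataset, DatasetResource, LinkedServiceReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

output_dataset = DatasetResource(
    properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureSqlDatabase1"),
        # Adjust these to the schema and table you created in Azure SQL Database.
        schema_type_properties_schema="data",
        table="full_wait_times",
    ))

adf_client.datasets.create_or_update(
    "<your-resource-group>", "dataresourceAAA", "output_table", output_dataset)
```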

  8. To input the data:

    From the left menu, select the Author icon (pencil). Next, click the ellipses in the Datasets menu and choose New dataset.

    Scroll down and select Azure Blob Storage, then click Continue. Choose Delimited Text and click Continue.

    Enter "input_table_wait_times" for the Name. Select "inputdata" for the Linked service. For the File path, click the folder button and navigate to the wait_times.csv file. Ensure "From connection/store" is selected for the Import schema.

    Click OK to proceed.

Repeat the same steps for the flight info data: name the dataset "input_table_flight_info" and navigate to the flight_info.csv file.
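
The two CSV input datasets can likewise be created with the SDK. In the sketch below, the container name is a placeholder for wherever the two CSV files were uploaded in your storage account.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLocation, DatasetResource, DelimitedTextDataset, LinkedServiceReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")
blob_ls = LinkedServiceReference(type="LinkedServiceReference", reference_name="AzureBlobStorage1")

def csv_dataset(file_name: str) -> DatasetResource:
    """Build a comma-delimited, header-row dataset for a CSV file in the input container."""
    return DatasetResource(
        properties=DelimitedTextDataset(
            linked_service_name=blob_ls,
            location=AzureBlobStorageLocation(
                container="<your-input-container>",  # placeholder container name
                file_name=file_name),
            column_delimiter=",",
            first_row_as_header=True,
        ))

for dataset_name, csv_file in [("input_table_wait_times", "wait_times.csv"),
                               ("input_table_flight_info", "flight_info.csv")]:
    adf_client.datasets.create_or_update(
        "<your-resource-group>", "dataresourceAAA", dataset_name, csv_dataset(csv_file))
```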

Create Data Flows in Azure Data Factory

In this task, we will create a data flow within Azure Data Factory to accomplish the following:

  • Load the data from the two CSV files stored in Azure Storage containers into two tables.

  • Join the two tables using a left-outer join on "flight_id" as the join key.

  • Store the joined table data in the Azure SQL table created earlier (data.full_wait_times).
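
Before building the flow in the Studio, it helps to see the transformation it performs. Conceptually, the data flow is equivalent to the short pandas sketch below: read the two CSVs, left-join them on flight_id, and write out the result (here to a local CSV rather than the SQL table). The file names match the blobs used above; any columns other than flight_id are simply whatever your CSVs contain.

```python
import pandas as pd

# Load the two inputs (locally here; in the data flow they come from Blob Storage).
wait_times = pd.read_csv("wait_times.csv")
flight_info = pd.read_csv("flight_info.csv")

# Left outer join on flight_id: every wait-time row is kept,
# and matching flight details are appended where a flight_id matches.
full_wait_times = wait_times.merge(flight_info, on="flight_id", how="left")

# The data flow sink writes this result to the Azure SQL output table;
# here we just save it to a CSV for inspection.
full_wait_times.to_csv("full_wait_times.csv", index=False)
```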

  1. In Data Factory, select the Author icon (pencil) from the left menu. Next, click the ellipses in the Data flows menu and choose New data flow.

  2. In the flow canvas, add a Source by clicking the Add Source button.

  3. In the Source settings menu at the bottom of the screen, enter the following properties for the source:
  • Output stream name: inputdatawaittimes

  • Dataset: Select "input_table_wait_times" from the dropdown menu

  4. Add another source in the flow canvas by clicking Add Source. In the Source settings menu at the bottom of the screen, enter the following properties:

    • Output stream name: inputdataflightinfo

    • Dataset: Select "input_table_flight_info" from the dropdown menu

  5. Next, in the flow canvas, click the + button for the "inputdatawaittimes" element, and then select Join.

  6. In the Join Settings menu, enter the following properties:

    • Output stream name: join1

    • Left stream: inputdatawaittimes

    • Right stream: inputdataflightinfo

    • Join type: Left outer

    • Join condition: flight_id == flight_id

  7. Next, in the flow canvas, click the + button for the "join1" element, and then select Sink.

  8. Under the Sink properties, enter the following:

    • Output stream name: "outputsqltable"

    • Incoming stream: join1

    • Dataset: output_table

  9. With the "outputsqltable" element selected, go to the Mapping menu and perform the following steps:

    • De-select Auto-mapping

    • Manually map the input and output columns as specified

We have now set up our data flow, which takes the two input tables, joins them together, and saves the result to the SQL output table.

Create the Data Pipeline

  1. To run the data flow, we need to create a data pipeline that calls the data flow. Select the Author icon (pencil) from the left menu. Next, click the ellipses in the Pipelines menu and choose New pipeline. Select Move and Transform, then drag Data Flow into the canvas.

  2. From the bottom sub-menu, select Settings and change the Data flow property to "dataflow1". Next, click Publish All in the top menu to save your data flows and data pipelines.

  3. Select Add trigger from the top menu, then select Trigger now, and click OK to start the workflow.

  4. To monitor the pipeline in Azure Data Factory, select the Monitor icon (speedometer) from the left menu. Then, choose Pipeline runs. You should see the name of your pipeline and its status.

Once the status shows Succeeded, the Azure Data Factory data flow and pipeline have completed.
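
If you prefer to trigger and monitor the pipeline from code rather than the Studio, the run can be started and polled with the Python SDK. This is a minimal sketch; the pipeline name "pipeline1" is an assumption, so substitute the name shown in your Author pane.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

RESOURCE_GROUP = "<your-resource-group>"  # placeholder
FACTORY_NAME = "dataresourceAAA"          # placeholder
PIPELINE_NAME = "pipeline1"               # assumed name; use the pipeline you published

# Equivalent of "Add trigger > Trigger now".
run = adf_client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME)

# Equivalent of watching Monitor > Pipeline runs until the status is Succeeded.
while True:
    pipeline_run = adf_client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id)
    print("Pipeline run status:", pipeline_run.status)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
```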