Azure Data Factory -
Azure Data Factory is a cloud-based data integration service for creating data-driven workflows that orchestrate data movement and data transformation.
Azure Data Factory does not store any data itself. It acts as a path between different data sources, allowing us to create a pipeline, or data-driven workflow, between these data stores in an organized way. It is a managed cloud service built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
Through Azure Data Factory, raw data can be refined into meaningful information.
Why Azure Data Factory?
Without Azure Data Factory, developers would carry the load of creating custom data movement components, or writing services to integrate different data sources and processing, at an enterprise level. That process is long and tiresome, and the result is hard to monitor once in production. Such systems are also expensive and difficult to integrate and maintain, and they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer.
The pipelines created in Data Factory are used to move and transform data. These pipelines can run at the user's convenience, either on demand or on a schedule (hourly, daily, weekly, etc.).
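As a sketch of what scheduling looks like under the hood, a daily run can be expressed as a schedule trigger definition in Data Factory's JSON format. The trigger and pipeline names here are placeholders:

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-01-01T00:00:00Z"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopySalesforceToBlob",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```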
Connect and Collect -
- The very first step is to connect to all the required data sources and processing services, such as software-as-a-service (SaaS) applications, databases, file shares, and FTP web services.
- The next step is to move the data from an on-premises or cloud source data store to a centralized data store for further analysis. This is where the Copy activity of Azure Data Factory comes in, running inside a pipeline that connects the two stores.
- For example, you can collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
Transform and Enrich -
Once the data is present in a centralized data store in the cloud, process the collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
Once the raw data has been shaped into a consumable form, it can be loaded into any analytics tool according to the business requirements.
Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Log Analytics, and health panels on the Azure portal.
Components in Data Factory -
In brief, here is what is going to happen -
We are going to create two linked services: one for connecting to Salesforce, and the other for connecting to Azure Blob Storage. Then two datasets will be created: one representing the Salesforce object and one representing the data in Azure Blob Storage.
Pipeline -
A pipeline is a grouping of activities that together perform a unit of work, i.e. a task. A data factory can have more than one pipeline.
The advantage of a pipeline is that we can manage a set of activities as one unit instead of managing each task individually. The activities in a pipeline can be executed sequentially or independently in parallel.
For example, you might use a copy activity to copy data from an on-premises SQL Server to Azure Blob storage. In our case, we are moving data from Salesforce to Microsoft Azure Blob Storage.
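Under the hood, a pipeline with a single Copy activity is just a JSON definition. A minimal sketch for our Salesforce-to-Blob case might look like the following; the pipeline, activity, and dataset names are placeholders for the objects created in the steps below:

```json
{
  "name": "CopySalesforceToBlob",
  "properties": {
    "activities": [
      {
        "name": "CopyAccountToBlob",
        "type": "Copy",
        "inputs": [
          { "referenceName": "SalesforceAccountDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "AccountJsonDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "SalesforceSource" },
          "sink": { "type": "JsonSink" }
        }
      }
    ]
  }
}
```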
Activity -
An activity in Data Factory represents a single processing step in a pipeline; the Copy activity is one example.
Linked Services -
A linked service creates a link between the data factory and your data store. Linked services define the connection information that Azure Data Factory needs to connect to external resources, so they are much like connection strings.
Think of it as a highway that connects two different cities for the movement of entities.
Creating a Linked Service for Salesforce -
Step 1 :
Step 2 :
Step 3 :
Get the security token from your Salesforce org.
Now add the details according to your requirements:
Test your connection and click create.
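The portal generates the linked service definition behind the scenes. As a rough sketch, the equivalent Salesforce linked service JSON looks like this, with placeholder credentials; the security token goes in its own field alongside the password:

```json
{
  "name": "SalesforceLinkedService",
  "properties": {
    "type": "Salesforce",
    "typeProperties": {
      "environmentUrl": "https://login.salesforce.com",
      "username": "<your-username>",
      "password": { "type": "SecureString", "value": "<your-password>" },
      "securityToken": { "type": "SecureString", "value": "<your-security-token>" }
    }
  }
}
```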
Now we need a linked service for our Azure Blob Storage. As you can see, Azure already provides a linked service type called AzureBlobStorage, which uses the storage account details, i.e. the account name and key.
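A sketch of the AzureBlobStorage linked service definition, with the account name and key embedded in a connection string (both are placeholders here):

```json
{
  "name": "AzureBlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
    }
  }
}
```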
Step 4 :
Now let’s create a Copy activity -
Go back to the Data Factory home page and click on Copy data -
Specify the details according to your requirements -
And click next.
Select the source data store i.e. the linked service for Salesforce.
Click Next. Choose your folder in the next step.
Now you need to create the datasets.
Datasets represent the structure of data within the linked data stores. For example, in terms of Salesforce, every object can act as a dataset.
There are two types of Datasets -
Input Datasets -
An input dataset represents the input for an activity in the pipeline. It can be any object like Account, Contact, ContentVersion, or any custom object.
Output Datasets -
An output dataset represents the output of the activity. In Azure Data Factory, there are a few formats in which the data copied from Salesforce can be stored, as follows -
- Text format
- Avro format
- JSON format
- ORC format
- Parquet format
The output dataset will be created according to the mapping of fields of the Salesforce object.
We are going to use the JSON format.
Step 5 :
Creating input dataset -
Now choose any object you want to back up; we are choosing Account -
Preview your data and click Next.
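The resulting input dataset is a thin JSON wrapper over the chosen Salesforce object. A sketch, with assumed names matching the linked service created earlier:

```json
{
  "name": "SalesforceAccountDataset",
  "properties": {
    "type": "SalesforceObject",
    "linkedServiceName": {
      "referenceName": "SalesforceLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "objectApiName": "Account"
    }
  }
}
```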
Step 6 :
Now let’s create a dataset for the destination data store (the output dataset), i.e. Azure Blob Storage.
Specify the file path and file name, and add .json after the file name. Leaving out the .json extension won’t affect the file’s contents, but the extension makes the output easier to recognize.
Choose your output file format.
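A sketch of the corresponding JSON-format output dataset; the container and file names are placeholders:

```json
{
  "name": "AccountJsonDataset",
  "properties": {
    "type": "Json",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "salesforce-backup",
        "fileName": "account.json"
      }
    }
  }
}
```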
Step 7 :
Review your details on the Summary page -
Step 8 :
You will see a page like the one below -
You can monitor your pipeline from here or click finish.
Now go and check for your file in your container.
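You can also inspect the exported file programmatically. A minimal Python sketch, assuming the sink's default "set of objects" pattern where each record is one JSON object per line; the sample records and field values below are invented for illustration, and in practice you would download the blob first (for example with the azure-storage-blob package):

```python
import json

def parse_backup(text: str) -> list:
    """Parse a Data Factory JSON export, assuming the default
    'set of objects' pattern: one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# In practice, download the blob content first, e.g. with azure-storage-blob:
#   client = BlobServiceClient.from_connection_string(conn_str)
#   blob = client.get_blob_client("salesforce-backup", "account.json")
#   text = blob.download_blob().readall().decode("utf-8")
# Here we use a small inline sample instead (hypothetical records).
sample = '{"Id": "0015g00000A1", "Name": "Acme"}\n{"Id": "0015g00000A2", "Name": "Globex"}'
records = parse_backup(sample)
print(len(records))          # 2
print(records[0]["Name"])    # Acme
```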
By performing the above steps, we created linked services for Salesforce and Azure Blob Storage, datasets for both, and a Copy activity that fetches the data from Salesforce and saves it to a container in Azure Blob Storage.
Original content shared on LinkedIn.