Apache Falcon is a framework that simplifies data pipeline processing and management on Hadoop clusters.
It provides data management services such as retention, replication across clusters, and archival. It makes it much simpler to onboard new workflows and pipelines, with support for late data handling and retry policies. It lets you define relationships between data and processing elements and integrate with a metastore or catalog such as Hive/HCatalog. Finally, it lets you capture lineage information for feeds and processes.
In this tutorial we are going to walk through the process of:
- Defining the cluster entities
- Defining and executing a job to mirror data between two clusters
###Prerequisite
Once you have downloaded the Hortonworks Sandbox and started the VM, open a command-line shell to the Sandbox through SSH.
The default password is hadoop.
###Starting Falcon
By default, Falcon is not started on the sandbox. Use the following shell commands to start Apache Falcon.
First, log in as user falcon:
su - falcon
Change to the Falcon install directory:
cd /usr/hdp/2.3.0.0-2557/falcon/
Use falcon-stop to stop any existing Falcon instances:
./bin/falcon-stop
Now use falcon-start to start Falcon:
./bin/falcon-start
You can check the status of Falcon using the falcon-status command:
./bin/falcon-status
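The stop, start, and status steps above can be collected into one small script. This is a minimal sketch rather than anything shipped with Falcon: `FALCON_HOME` and `RUN` are variable names introduced here for illustration, and it assumes the Falcon scripts can be called by absolute path instead of from inside the install directory. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: restart Falcon using the scripts named in the tutorial.
# FALCON_HOME and RUN are illustrative names, not Falcon settings.
FALCON_HOME=${FALCON_HOME:-/usr/hdp/2.3.0.0-2557/falcon}
# RUN defaults to "echo" (dry run: commands are printed, not executed).
# Invoke the script with RUN= (empty) to execute them for real.
RUN=${RUN-echo}

restart_falcon() {
  $RUN "$FALCON_HOME/bin/falcon-stop"    # stop any existing instance
  $RUN "$FALCON_HOME/bin/falcon-start"   # start the server
  $RUN "$FALCON_HOME/bin/falcon-status"  # report whether it is running
}

restart_falcon
```

Run it as user falcon, as in the manual steps. If your sandbox ships a different HDP build, set `FALCON_HOME` to the matching install directory.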
###Creating the cluster entities
Before creating the cluster entities, we need to create the directories on HDFS representing the two clusters that we are going to define, namely primaryCluster and backupCluster.
Use hadoop fs -mkdir commands to create the /apps/falcon/primaryCluster and /apps/falcon/backupCluster directories on HDFS:
hadoop fs -mkdir /apps/falcon/primaryCluster
hadoop fs -mkdir /apps/falcon/backupCluster
Next, create a directory called staging inside each of the directories we created above:
hadoop fs -mkdir /apps/falcon/primaryCluster/staging
hadoop fs -mkdir /apps/falcon/backupCluster/staging
Now we need to change the permissions recursively on the falcon directory on HDFS:
hadoop fs -chmod -R 777 /apps/falcon/*
Next we need to change the owner of the directories to user falcon:
hadoop fs -chown -R falcon /apps/falcon/*
Next we need to create the working directories for primaryCluster and backupCluster:
hadoop fs -mkdir /apps/falcon/primaryCluster/working
hadoop fs -mkdir /apps/falcon/backupCluster/working
As before, we have to set the owner and permissions of the directories:
hadoop fs -chown -R falcon /apps/falcon/*
hadoop fs -chmod -R 755 /apps/falcon/primaryCluster/working /apps/falcon/backupCluster/working
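The directory setup above repeats the same steps for each cluster, so it can be expressed as a loop. The following is a sketch under a few assumptions: `RUN`, `BASE`, and `setup_cluster_dirs` are names made up for this example, and it uses `hadoop fs -mkdir -p` so that re-running it does not fail on directories that already exist. It aims for the same end state as the manual commands (staging world-writable, working at 755, everything owned by falcon).

```shell
#!/bin/sh
# Sketch: create the staging and working directories for both clusters.
# RUN is an illustrative dry-run switch: it defaults to "echo", so the
# commands are only printed. Invoke with RUN= (empty) to execute them.
RUN=${RUN-echo}
BASE=/apps/falcon

setup_cluster_dirs() {
  cluster=$1
  # -p creates missing parents and does not fail if the directory exists.
  $RUN hadoop fs -mkdir -p "$BASE/$cluster/staging" "$BASE/$cluster/working"
  # Open everything up first, then tighten the working directory.
  $RUN hadoop fs -chmod -R 777 "$BASE/$cluster"
  $RUN hadoop fs -chmod -R 755 "$BASE/$cluster/working"
  # Hand ownership of the whole tree to the falcon user.
  $RUN hadoop fs -chown -R falcon "$BASE/$cluster"
}

for cluster in primaryCluster backupCluster; do
  setup_cluster_dirs "$cluster"
done
```

Printing the commands first makes it easy to review the permission changes before applying them.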
Now let’s navigate to the Falcon UI in our browser. By default, the Falcon UI listens on port 15000. The default username is ambari-qa and the password is admin.
This UI allows us to create and manage entities such as Cluster, Feed, Process and Mirror. Each of these entities is represented by an XML file, which you either upload directly or generate by filling in the form fields.
You can also search for existing entities and then edit, change state, etc.
Let’s first create a couple of cluster entities. To create a cluster entity, click on the Cluster button at the top.
Then click on the edit button over the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="this is primary cluster" colo="primaryColo" xmlns="uri:falcon:cluster:0.1">
  <tags>primaryKey=primaryValue</tags>
  <interfaces>
    <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
    <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0x755"/>
  <properties>
    <property name="test" value="value1"/>
  </properties>
</cluster>
Click Finish at the top of the XML Preview area to save the XML.
The Falcon UI should have automatically parsed the values from the XML and populated the corresponding fields. Once you have verified that these are the correct values, press Next.
Click Save to persist the entity.
Similarly, we will create the backupCluster entity. Again click on the Cluster button at the top to open the form for creating a cluster entity.
Then click on the edit button over the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="backupCluster" description="this is backup colo" colo="backupColo" xmlns="uri:falcon:cluster:0.1">
  <tags>backupTag=backupTagValue</tags>
  <interfaces>
    <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
    <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/backupCluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/backupCluster/working"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0x755"/>
  <properties>
    <property name="key1" value="val1"/>
  </properties>
</cluster>
Click Finish at the top of the XML Preview area to save the XML, and then click the Next button to verify the values.
Click Save to persist the backupCluster entity.
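If you prefer the command line to the UI, the same cluster entities can be submitted with the Falcon CLI. This is a sketch, not the tutorial's method: it assumes the two XML documents above were saved as `primaryCluster.xml` and `backupCluster.xml` (hypothetical file names) and that the `falcon` client is on the `PATH`. `RUN` is again an illustrative dry-run switch.

```shell
#!/bin/sh
# Sketch: submit cluster entities with the Falcon CLI instead of the UI.
# Assumes the XML documents were saved as primaryCluster.xml and
# backupCluster.xml (hypothetical file names) in the current directory.
# RUN defaults to "echo" (dry run); invoke with RUN= (empty) to execute.
RUN=${RUN-echo}

submit_cluster() {
  # $1 is the path to a cluster entity XML file.
  $RUN falcon entity -type cluster -submit -file "$1"
}

submit_cluster primaryCluster.xml
submit_cluster backupCluster.xml
# List the registered cluster entities to confirm both were accepted.
$RUN falcon entity -type cluster -list
```

Because the entities declare ambari-qa as their ACL owner, it is probably simplest to run these commands as that user.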
Now let’s go back to the SSH terminal and create the directory /user/ambari-qa/falcon on HDFS, and then the directories mirrorSrc and mirrorTgt as the source and target of the mirroring job we are about to create:
hadoop fs -mkdir /user/ambari-qa/falcon
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorSrc
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorTgt
Now we need to set permissions to allow access:
hadoop fs -chmod -R 777 /user/ambari-qa/falcon
###Setting up the Mirroring Job
To create the mirroring job, go back to the Falcon UI in your browser and click on the Mirror button at the top.
Provide a name of your choice. We named the mirror job MirrorTest.
Select the appropriate Source and Target. In our case the source cluster is primaryCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorSrc.
The target cluster is backupCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorTgt.
Also set the validity of the job to start at your current time, so that when you run the job in a few minutes, it is still within the validity period.
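To double-check the validity window against the XML preview, it can help to print the current time in the UTC form that Falcon entity definitions use for validity (for example `2015-01-01T00:00Z`). The sketch below assumes GNU `date`, which the CentOS-based sandbox provides; `falcon_time` is a helper name invented here, and the one-day window is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: print a validity window that starts now and ends in one day,
# formatted as UTC timestamps like 2015-01-01T00:00Z.
# Note: "date -d" is a GNU coreutils option; BSD/macOS date differs.
falcon_time() {
  # $1 is a GNU date expression such as "now" or "now + 1 day".
  date -u -d "$1" +%Y-%m-%dT%H:%MZ
}

echo "validity start: $(falcon_time 'now')"
echo "validity end:   $(falcon_time 'now + 1 day')"
```

Keep in mind that these timestamps are in UTC, which may differ from the local time shown by your browser.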
###Running the Job
Before we can run the job, we need some test data on HDFS. Let’s give ourselves permission to upload some data using the HDFS View in Ambari:
hadoop fs -chmod -R 775 /user/ambari-qa
Open Ambari in your browser at port 8080.
Then launch the HDFS view from the top right-hand corner.
From the view in the Ambari console, navigate to the directory /user/ambari-qa/falcon/mirrorSrc.
Upload any file.
Once uploaded, the file should appear in the directory.
Now navigate to the Falcon UI and search for the job we created. The name of the mirror job we created was MirrorTest.
Select the MirrorTest job by checking the checkbox and then click on Schedule.
The state of the job should change to RUNNING.
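The job state can also be checked from the SSH terminal with the Falcon CLI. This sketch assumes that the mirror job created in the UI is stored as a process entity named MirrorTest; if your Falcon version registers it differently, the `-type` value will need adjusting. `RUN`, `job_status`, and `schedule_job` are names introduced for this example.

```shell
#!/bin/sh
# Sketch: query (and optionally schedule) the mirror job from the CLI.
# Assumption: the UI stored the mirror job as a process entity.
# RUN defaults to "echo" (dry run); invoke with RUN= (empty) to execute.
RUN=${RUN-echo}

job_status() {
  $RUN falcon entity -type process -name "$1" -status
}

schedule_job() {
  # CLI equivalent of clicking Schedule in the UI.
  $RUN falcon entity -type process -name "$1" -schedule
}

job_status MirrorTest
```

`schedule_job MirrorTest` is only needed if you did not already schedule the job through the UI.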
After a few minutes, use the HDFS View in the Ambari console to check the /user/ambari-qa/falcon/mirrorTgt directory; you should see your data mirrored there.
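The same check can be scripted by comparing the file listings of the two directories. The following is a rough sketch: `relative_files` is a helper written for this example, it assumes file names contain no spaces, and it compares only names, not file contents or sizes. The HDFS part runs only when the `hadoop` client is available.

```shell
#!/bin/sh
# Sketch: compare the file listings of the mirror source and target.
SRC=/user/ambari-qa/falcon/mirrorSrc
TGT=/user/ambari-qa/falcon/mirrorTgt

# Read `hadoop fs -ls -R` output on stdin and print the file paths
# relative to the base directory $1, sorted. Assumes no spaces in names.
relative_files() {
  awk -v base="$1/" '
    $1 ~ /^-/ {                       # keep regular files, skip directories
      path = $NF
      if (index(path, base) == 1) path = substr(path, length(base) + 1)
      print path
    }' | sort
}

# Only talk to HDFS when the hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
  src_list=$(hadoop fs -ls -R "$SRC" | relative_files "$SRC")
  tgt_list=$(hadoop fs -ls -R "$TGT" | relative_files "$TGT")
  if [ "$src_list" = "$tgt_list" ]; then
    echo "mirror is in sync"
  else
    echo "mirror differs from source"
  fi
fi
```

If the listings differ shortly after scheduling, the mirroring run may simply not have finished yet, so it is worth checking again after a few minutes.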
In this tutorial we walked through the process of defining the cluster entities representing two different clusters and then mirroring the datasets between them. In the next tutorial we will work through defining various data feeds and processing them to refine the data.
Saptak Sen