Mirroring Datasets between Hadoop clusters with Apache Falcon

Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.

It provides data management services such as retention, replication across clusters, and archival. It makes it much simpler to onboard new workflows and pipelines, with support for late data handling and retry policies. It lets you define relationships between data and processing elements and integrates with metastores and catalogs such as Hive/HCatalog. Finally, it lets you capture lineage information for feeds and processes.

In this tutorial we are going to walk through the process of:

Defining the cluster entities
Defining and executing a job to mirror data between two clusters

###Prerequisite

Download Hortonworks Sandbox

Once you have downloaded the Hortonworks Sandbox and started the VM, open a command-line shell to the Sandbox through SSH.

The default password is hadoop

###Starting Falcon

By default, Falcon is not started on the sandbox. You can use the following commands on the shell command line to start Apache Falcon:

First, log in as user falcon:

su - falcon

Change directory to Falcon install directory:

cd /usr/hdp/2.3.0.0-2557/falcon/

Use falcon-stop to stop any existing Falcon instances:

./bin/falcon-stop

Now use falcon-start to start Falcon:

./bin/falcon-start

You can check the status of Falcon using the falcon-status command:

./bin/falcon-status
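
The stop, start, and status steps above can be collected into one small script. This is a minimal sketch, not something shipped with Falcon: the `run` helper, the `restart_falcon` function, and the `DRY_RUN` and `FALCON_HOME` variables are names introduced here for illustration, and the install path is the one used in this tutorial. By default the script only prints the commands so you can review them; run it with `DRY_RUN=0` while logged in as user falcon to execute them.

```shell
#!/bin/sh
# Hypothetical wrapper around the three Falcon scripts used above.
# FALCON_HOME defaults to the install path assumed in this tutorial.
FALCON_HOME="${FALCON_HOME:-/usr/hdp/2.3.0.0-2557/falcon}"

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

restart_falcon() {
  run "$FALCON_HOME/bin/falcon-stop"    # stop any existing instance
  run "$FALCON_HOME/bin/falcon-start"   # start the server
  run "$FALCON_HOME/bin/falcon-status"  # report whether it is running
}

restart_falcon
```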

###Creating the cluster entities

Before creating the cluster entities, we need to create the directories on HDFS representing the two clusters that we are going to define, namely primaryCluster and backupCluster.

Use hadoop fs -mkdir to create the directories /apps/falcon/primaryCluster and /apps/falcon/backupCluster on HDFS:

hadoop fs -mkdir /apps/falcon/primaryCluster
hadoop fs -mkdir /apps/falcon/backupCluster

Next, create a directory called staging inside each of the directories we created above:

hadoop fs -mkdir /apps/falcon/primaryCluster/staging
hadoop fs -mkdir /apps/falcon/backupCluster/staging

Now we need to change the permissions recursively on the falcon directory on HDFS:

hadoop fs -chmod -R 777 /apps/falcon/*

Next we need to change the owner of the directories to user falcon

hadoop fs -chown -R falcon /apps/falcon/*

Next we will need to create the working directories for primaryCluster and backupCluster

hadoop fs -mkdir /apps/falcon/primaryCluster/working
hadoop fs -mkdir /apps/falcon/backupCluster/working

As before we have to set the owner and permission of the directories

hadoop fs -chown -R falcon /apps/falcon/*
hadoop fs -chmod -R 755 /apps/falcon/primaryCluster/working /apps/falcon/backupCluster/working
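
The directory setup above repeats the same steps for each cluster, so it can be sketched as a loop. This is an illustrative consolidation under a few assumptions: the `run` helper, the `prepare_cluster_dirs` function, and the `DRY_RUN` variable are names made up for this example, and `-mkdir -p` is used so that parent directories are created as needed. The resulting permissions and ownership are meant to match the individual commands above (staging at 777, working at 755, everything owned by falcon). By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: creates staging and working directories for both clusters.
prepare_cluster_dirs() {
  base=/apps/falcon
  for cluster in primaryCluster backupCluster; do
    # -p also creates the cluster directory itself if it is missing
    run hadoop fs -mkdir -p "$base/$cluster/staging" "$base/$cluster/working"
    # open up the cluster directory and staging
    run hadoop fs -chmod -R 777 "$base/$cluster"
    # then tighten working back to 755
    run hadoop fs -chmod 755 "$base/$cluster/working"
    # hand ownership to user falcon
    run hadoop fs -chown -R falcon "$base/$cluster"
  done
}

prepare_cluster_dirs
```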

Now let’s navigate to the Falcon UI in our browser. By default the Falcon UI is available on port 15000. The default username is ambari-qa and the password is admin.

This UI allows us to create and manage entities such as Cluster, Feed, Process and Mirror. Each of these entities is represented by an XML file, which you can either upload directly or generate by filling in the form fields.

You can also search for existing entities and then edit them, change their state, and so on.

Let’s first create a couple of cluster entities. To create a cluster entity, click the Cluster button at the top.

Then click the edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="this is primary cluster" colo="primaryColo" xmlns="uri:falcon:cluster:0.1">
    <tags>primaryKey=primaryValue</tags>
    <interfaces>
        <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0x755"/>
    <properties>
        <property name="test" value="value1"/>
    </properties>
</cluster>

Click Finish on top of the XML Preview area to save the XML.

The Falcon UI should have automatically parsed the values from the XML and populated the corresponding fields. Once you have verified that the values are correct, press Next.

Click Save to persist the entity.
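
If you prefer the command line to the UI, the same cluster definition can be submitted with the Falcon CLI. The following is a sketch under assumptions: it supposes the XML above was saved to a local file called `primaryCluster.xml` (a hypothetical name), that the `falcon` command is on the PATH, and that you run it as a user allowed to own the entity. The `run` and `submit_cluster` helpers and the `DRY_RUN` variable are introduced here for illustration; by default the commands are only printed, and `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1: path to a cluster entity XML file.
submit_cluster() {
  run falcon entity -type cluster -submit -file "$1"
}

# primaryCluster.xml is a hypothetical file holding the XML shown above.
submit_cluster primaryCluster.xml

# List the cluster entities Falcon knows about to confirm the submission.
run falcon entity -type cluster -list
```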

Similarly, we will create the backupCluster entity. Again, click the Cluster button at the top to open the form for creating a cluster entity.

Then click the edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="backupCluster" description="this is backup colo" colo="backupColo" xmlns="uri:falcon:cluster:0.1">
    <tags>backupTag=backupTagValue</tags>
    <interfaces>
        <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/backupCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/backupCluster/working"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0x755"/>
    <properties>
        <property name="key1" value="val1"/>
    </properties>
</cluster>

Click Finish on top of the XML Preview area to save the XML, and then click Next to verify the values.

Click Save to persist the backupCluster entity.
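
The two cluster definitions share the same interfaces and differ only in the fields that depend on the cluster name. A small template makes that explicit. This is only a sketch: `render_cluster_xml` is a hypothetical helper, its description text ("this is NAME") is simplified compared with the XML above, and the placeholder property is kept fixed rather than varied per cluster.

```shell
#!/bin/sh
# render_cluster_xml NAME COLO TAGS
# Hypothetical helper: prints a cluster entity in the shape used in this
# tutorial, varying only the name-dependent fields.
render_cluster_xml() {
  name="$1"
  colo="$2"
  tags="$3"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="$name" description="this is $name" colo="$colo" xmlns="uri:falcon:cluster:0.1">
    <tags>$tags</tags>
    <interfaces>
        <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/$name/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/$name/working"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0x755"/>
    <properties>
        <property name="key1" value="val1"/>
    </properties>
</cluster>
EOF
}

# Print a definition equivalent in structure to the backupCluster XML above.
render_cluster_xml backupCluster backupColo backupTag=backupTagValue
```

Redirecting the output to a file gives you something you can paste into the XML Preview area.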

Now let’s go back to the SSH terminal and create the directory /user/ambari-qa/falcon on HDFS and then the directories mirrorSrc and mirrorTgt as the source and target of the mirroring job we are about to create.

hadoop fs -mkdir /user/ambari-qa/falcon
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorSrc
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorTgt

Now we need to set permissions to allow access:

hadoop fs -chmod -R 777 /user/ambari-qa/falcon

###Setting up the Mirroring Job

To create the mirroring job, go back to the Falcon UI in your browser and click the Mirror button at the top.

Provide a name of your choice. We named the mirror job MirrorTest.

Select the appropriate Source and Target. In our case the source cluster is primaryCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorSrc.

The target cluster is backupCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorTgt.

Also set the validity of the job to your current time, so that when you attempt to run the job in a few minutes, the job is still within the validity period.
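
The Sandbox clock and your workstation clock may not agree, so it can help to ask the Sandbox what it considers the current time before filling in the validity fields. The sketch below prints a start and end time in UTC using the `yyyy-MM-ddTHH:mmZ` form that Falcon entity XML uses for validity. The function names are hypothetical, the one-day window is an arbitrary choice for this example, and the `-d` option assumes GNU date.

```shell
#!/bin/sh
# Current time in UTC, formatted as yyyy-MM-ddTHH:mmZ.
validity_start() {
  date -u +%Y-%m-%dT%H:%MZ
}

# One day later; the -d option is GNU date syntax (an assumption about the host).
validity_end() {
  date -u -d '+1 day' +%Y-%m-%dT%H:%MZ
}

echo "start: $(validity_start)"
echo "end:   $(validity_end)"
```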

###Running the Job

Before we can run the job we need some test data on HDFS. Let’s give ourselves permission to upload data using the HDFS View in Ambari:

hadoop fs -chmod -R 775 /user/ambari-qa

Open Ambari from your browser at port 8080.

Then launch the HDFS view from the top right hand corner.

From the view in the Ambari console, navigate to the directory /user/ambari-qa/falcon/mirrorSrc.

Upload any file.

Once uploaded, the file should appear in the directory.
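
As an alternative to the HDFS View, a test file can be uploaded from the SSH terminal. This is a sketch with assumed names: `/tmp/mirror-test.txt` is a hypothetical local file you would create first, and `run`, `upload_test_file`, and `DRY_RUN` are helpers introduced for this example. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

SRC_DIR=/user/ambari-qa/falcon/mirrorSrc

# $1: local file to copy into the mirror source directory.
upload_test_file() {
  run hadoop fs -put "$1" "$SRC_DIR/"
  # list the directory to confirm the file arrived
  run hadoop fs -ls "$SRC_DIR"
}

# /tmp/mirror-test.txt is a hypothetical local file.
upload_test_file /tmp/mirror-test.txt
```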

Now navigate to the Falcon UI and search for the job we created. The name of the mirror job we created was MirrorTest.

Select the MirrorTest job by checking the checkbox and then click on Schedule.

The state of the job should change to RUNNING.
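
Scheduling and checking the state can also be sketched with the Falcon CLI. An important assumption here is that the mirror job created in the UI is stored as a process entity named MirrorTest; if your Falcon version stores it differently, the `-type` value will need to change. The `run` helper and `DRY_RUN` variable are introduced for illustration; by default the commands are only printed, and `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Assumption: the mirror job is a process entity with this name.
JOB_NAME=MirrorTest

schedule_mirror() {
  run falcon entity -type process -name "$JOB_NAME" -schedule
}

mirror_status() {
  run falcon entity -type process -name "$JOB_NAME" -status
}

schedule_mirror
mirror_status
```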

After a few minutes, use the HDFS View in the Ambari console to check the /user/ambari-qa/falcon/mirrorTgt directory. You should see your data mirrored there.
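
The same check can be sketched from the SSH terminal by comparing the file names in the two directories. The helpers below are hypothetical and assume the usual `hadoop fs -ls` layout, in which each entry has at least eight whitespace-separated fields and the path is the last one; file names containing spaces are not handled.

```shell
#!/bin/sh
# Reduce `hadoop fs -ls` output (read from stdin) to sorted base file names.
# Lines with fewer than 8 fields, such as "Found 2 items", are skipped.
listing_names() {
  awk 'NF >= 8 { n = split($NF, parts, "/"); print parts[n] }' | sort
}

# Succeeds when both HDFS directories contain the same file names.
# $1: source directory, $2: target directory.
same_names() {
  src="$(hadoop fs -ls "$1" | listing_names)"
  tgt="$(hadoop fs -ls "$2" | listing_names)"
  [ "$src" = "$tgt" ]
}
```

Calling `same_names /user/ambari-qa/falcon/mirrorSrc /user/ambari-qa/falcon/mirrorTgt && echo mirrored` on the Sandbox would then print `mirrored` once the listings match. Note that this compares names only, not file contents.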

In this tutorial we walked through the process of defining the cluster entities representing two different clusters and then mirroring the datasets between them. In the next tutorial we will work through defining various data feeds and processing them to refine the data.

Saptak Sen
