Import, Export
Retention, Replication/DR/BCP, Anonymization of PII data, Archival, etc.
Late data handling, reprocessing, dependency checking, etc.
Multi-cluster management to support Local/Global Aggregations, Rollups, etc.
Lineage, Audit, Classification
Large datasets are incentives for users to come to Hadoop
Data Loading optimized for space, time and bandwidth
We cannot rely on users to adhere to data governance policies.
SEC, SOX, Article 29 (PII data), etc.
Data management is a common concern to be offered as a service
BCP, Security, Data Pipeline processing, etc.
New opportunities – from Traditional ETL
Steady growth in data volumes – the 3 Vs (volume, velocity, variety)
SLA requirements
DIY solutions lead to the silo problem
Best practices/patterns
Security, BCP, Resource management
Lineage, Audit, etc.
Decouples a data set's location and its properties from workflows (see the sketch after this list)
Understanding the lifetime of a feed allows for implicit validation of the processing rules
Common data services are simple directives; there is no need to define them verbosely in each job
Allows process owners to keep their processing specific to their application logic
Sits in the execution path and intercepts to handle out-of-band (OOB) data, retries, etc.
Does no heavy lifting itself but delegates to tools within the Hadoop ecosystem
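To illustrate the decoupling, a processing step in Falcon refers to its input by feed name and a time window rather than by a hard-coded HDFS path. The fragment below is a hypothetical sketch (the input name and EL expressions are assumptions; TestHourlySummary is the feed defined later in this walkthrough):
<inputs>
<!-- Bound to a feed entity by name; Falcon resolves the concrete HDFS location,
     partitions and late-arrival policy from the feed definition -->
<input name="summary" feed="TestHourlySummary" start="now(0,0)" end="now(0,0)"/>
</inputs>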
Infrastructure, Data Sets, Pipeline/Processing logic
Simply a dependency graph between infrastructure, data and processing logic
Transforms the input into automated and scheduled workflows
Handles retry logic and late data processing. Records audit, lineage and metrics
Seamless integration with metastore/catalog
Data Set management (Replication, Retention, etc.) offered as a service
Users can cherry-pick; there is no coupling between primitives
Automates processing and tracks end-to-end progress. Provides hooks for metering, monitoring and notifications
<cluster colo="colo-1" description="Primary cluster"
name="primary-cluster" xmlns="uri:ivory:cluster:0.1">
<interfaces>
<interface type="readonly" endpoint="hftp://localhost:50070” version="1.2"/>
<interface type="write" endpoint="hdfs://localhost:54310” version="1.2"/>
<interface type="execute" endpoint="localhost:54311" version="1.2"/>
<interface type="workflow" endpoint="http://localhost:11000/oozie/" version="3.3.0"/>
<interface type="messaging" endpoint="tcp://localhost:61616?daemon=true" version="5.1.6"/>
</interfaces>
<locations>
<location name="staging" path="/projects/ivory/staging"/>
<location name="temp" path="/tmp"/>
<location name="working" path="/projects/ivory/working"/>
</locations>
<properties/>
</cluster>
bin/falcon entity -url http://localhost:15000 -submit -type cluster -file primary-cluster.xml
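To verify the submission, the registered cluster entities can be listed (a hedged example, assuming the -list operation is available in this version of the CLI):
bin/falcon entity -url http://localhost:15000 -type cluster -list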
<feed description="TestHourlySummary" name="TestHourlySummary" xmlns="uri:ivory:feed:0.1">
<partitions/>
<groups>bi</groups>
<frequency>hours(1)</frequency>
<late-arrival cut-off="hours(4)"/>
<clusters>
<cluster name="primary-cluster" type="source">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
<cluster name="cluster-BCP" type="target">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data” path="/projects/test/TestHourlySummary/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
<location type="stats" path="/none"/>
<location type="meta" path="/none"/>
</locations>
<ACL owner="venkatesh" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
bin/falcon entity -url http://localhost:15000 -submit -type feed -file replicating-feed.xml
bin/falcon entity -type feed -url http://localhost:15000 -name TestHourlySummary -schedule
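Once scheduled, the state of the entity can be queried by name (a hedged example, assuming the -status operation is available in this version of the CLI):
bin/falcon entity -url http://localhost:15000 -type feed -name TestHourlySummary -status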
Scheduling this feed generates three recurring workflows:
A recurring workflow for copying data from the source to the target(s)
A recurring workflow for purging expired data on the primary cluster
A recurring workflow for purging expired data on the BCP cluster
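Processing logic is declared in the same declarative style through a process entity. The sketch below is illustrative only: the process name, output feed and workflow paths are hypothetical, and the elements loosely follow the Apache Falcon process schema (element names and the namespace version may differ in this early Ivory-era release):
<process name="hourly-summary-process" xmlns="uri:ivory:process:0.1">
<clusters>
<cluster name="primary-cluster">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<inputs>
<!-- Input bound to the feed by name; the path comes from the feed definition -->
<input name="summary" feed="TestHourlySummary" start="now(0,0)" end="now(0,0)"/>
</inputs>
<outputs>
<!-- Hypothetical output feed; it would be submitted as its own feed entity -->
<output name="rollup" feed="TestDailyRollup" instance="now(0,0)"/>
</outputs>
<workflow engine="oozie" path="/projects/test/workflows/hourly-summary"/>
<!-- Retry and late-data directives replace hand-written retry/reprocessing logic -->
<retry policy="periodic" delay="minutes(10)" attempts="3"/>
<late-process policy="exp-backoff" delay="hours(1)">
<late-input input="summary" workflow-path="/projects/test/workflows/late-handling"/>
</late-process>
</process>
bin/falcon entity -url http://localhost:15000 -submit -type process -file hourly-summary-process.xml
bin/falcon entity -type process -url http://localhost:15000 -name hourly-summary-process -schedule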
Provides a single interface to orchestrate data lifecycle across clusters.
Provides the key services that data processing applications need, so sophisticated DLM can easily be added to Hadoop applications.
Complex data processing logic handled by Falcon instead of hard-coded in apps.
Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
Promotes decoupling of data set location from the Oozie definition.
Declarative processing with simple directives enabling rapid prototyping
V0.2 has been in deployment at InMobi for over 12 months.
A release is coming soon at Apache.
Visit Apache Falcon