Import, Export
Retention, Replication/DR/BCP, Anonymization of PII data, Archival, etc.
Late data handling, reprocessing, dependency checking, etc.
Multi-cluster management to support Local/Global Aggregations, Rollups, etc.
Lineage, Audit, Classification
Large datasets are incentives for users to come to Hadoop
Data Loading optimized for space, time and bandwidth
We cannot rely on users to adhere to data governance policies.
SEC, SOX, Article 29 (PII data), etc.
Data management is a common concern to be offered as a service
BCP, Security, Data Pipeline processing, etc.
New opportunities – from Traditional ETL
Steady growth in data volumes – the 3 Vs (volume, velocity, variety)
SLA requirements
DIY solutions lead to the silo problem
Best practices/patterns
Security, BCP, Resource management
Lineage, Audit, etc.
Decouples a data set's location and its properties from workflows (see the sketch after this list)
Understanding the lifetime of a feed allows for implicit validation of the processing rules
Common data services are simple directives; there is no need to define them verbosely in each job
Allows process owners to keep their processing specific to their application logic
Sits in the execution path and intercepts to handle out-of-band (OOB) data, retries, etc.
Does no heavy lifting itself but delegates to tools within the Hadoop ecosystem
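To illustrate the decoupling, a processing step in Falcon refers to its input by feed name and a time window rather than by a hard-coded HDFS path. The fragment below is a hypothetical sketch (the input name and EL expressions are assumptions; TestHourlySummary is the feed defined later in this walkthrough):
<inputs>
<!-- Bound to a feed entity by name; Falcon resolves the concrete HDFS location,
     partitions and late-arrival policy from the feed definition -->
<input name="summary" feed="TestHourlySummary" start="now(0,0)" end="now(0,0)"/>
</inputs>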
Infrastructure, Data Sets, Pipeline/Processing logic
Simply a dependency graph between infrastructure, data and processing logic
Transforms the input into automated and scheduled workflows
Handles retry logic and late data processing. Records audit, lineage and metrics
Seamless integration with metastore/catalog
Data Set management (Replication, Retention, etc.) offered as a service
Users can cherry-pick; there is no coupling between primitives
Automates processing and tracks end-to-end progress. Provides hooks for metering, monitoring and notifications
<cluster colo="colo-1" description="Primary cluster"
name="primary-cluster" xmlns="uri:ivory:cluster:0.1">
<interfaces>
<interface type="readonly" endpoint="hftp://localhost:50070” version="1.2"/>
<interface type="write" endpoint="hdfs://localhost:54310” version="1.2"/>
<interface type="execute" endpoint="localhost:54311" version="1.2"/>
<interface type="workflow" endpoint="http://localhost:11000/oozie/" version="3.3.0"/>
<interface type="messaging" endpoint="tcp://localhost:61616?daemon=true" version="5.1.6"/>
</interfaces>
<locations>
<location name="staging" path="/projects/ivory/staging"/>
<location name="temp" path="/tmp"/>
<location name="working" path="/projects/ivory/working"/>
</locations>
<properties/>
</cluster>
bin/falcon entity -url http://localhost:15000 -submit -type cluster -file primary-cluster.xml
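To verify the submission, the registered cluster entities can be listed (a hedged example, assuming the -list operation is available in this version of the CLI):
bin/falcon entity -url http://localhost:15000 -type cluster -list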
<feed description="TestHourlySummary" name="TestHourlySummary" xmlns="uri:ivory:feed:0.1">
<partitions/>
<groups>bi</groups>
<frequency>hours(1)</frequency>
<late-arrival cut-off="hours(4)"/>
<clusters>
<cluster name="primary-cluster" type="source">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
<cluster name="cluster-BCP" type="target">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data” path="/projects/test/TestHourlySummary/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
<location type="stats" path="/none"/>
<location type="meta" path="/none"/>
</locations>
<ACL owner="venkatesh" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
bin/falcon entity -url http://localhost:15000 -submit -type feed -file replicating-feed.xml
bin/falcon entity -type feed -url http://localhost:15000 -name TestHourlySummary -schedule
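Once scheduled, the state of the entity can be queried by name (a hedged example, assuming the -status operation is available in this version of the CLI):
bin/falcon entity -url http://localhost:15000 -type feed -name TestHourlySummary -status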
Scheduling this feed generates three recurring workflows:
A recurring workflow for copying data from the source to the target(s)
A recurring workflow for purging expired data on the primary cluster
A recurring workflow for purging expired data on the BCP cluster
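Processing logic is declared in the same declarative style through a process entity. The sketch below is illustrative only: the process name, output feed and workflow paths are hypothetical, and the elements loosely follow the Apache Falcon process schema (element names and the namespace version may differ in this early Ivory-era release):
<process name="hourly-summary-process" xmlns="uri:ivory:process:0.1">
<clusters>
<cluster name="primary-cluster">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<inputs>
<!-- Input bound to the feed by name; the path comes from the feed definition -->
<input name="summary" feed="TestHourlySummary" start="now(0,0)" end="now(0,0)"/>
</inputs>
<outputs>
<!-- Hypothetical output feed; it would be submitted as its own feed entity -->
<output name="rollup" feed="TestDailyRollup" instance="now(0,0)"/>
</outputs>
<workflow engine="oozie" path="/projects/test/workflows/hourly-summary"/>
<!-- Retry and late-data directives replace hand-written retry/reprocessing logic -->
<retry policy="periodic" delay="minutes(10)" attempts="3"/>
<late-process policy="exp-backoff" delay="hours(1)">
<late-input input="summary" workflow-path="/projects/test/workflows/late-handling"/>
</late-process>
</process>
bin/falcon entity -url http://localhost:15000 -submit -type process -file hourly-summary-process.xml
bin/falcon entity -type process -url http://localhost:15000 -name hourly-summary-process -schedule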
Provides a single interface to orchestrate data lifecycle across clusters.
Provides the key services that data processing applications need, so sophisticated DLM can easily be added to Hadoop applications.
Complex data processing logic handled by Falcon instead of hard-coded in apps.
Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
Promotes decoupling of data set location from the Oozie definition.
Declarative processing with simple directives enabling rapid prototyping
V0.2 has been in deployment at InMobi for over 12 months.
A release is coming soon at Apache.
Visit Apache Falcon