Apache Gobblin (v0.15): Getting Started Guide

Distributed & Highly Scalable Data Integration Platform

Jay Sen
Nov 18, 2020

Introduction:

Apache Gobblin is a distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, and data management for both streaming and batch data ecosystems.

Brief history:

Apache Gobblin was open-sourced by LinkedIn’s data infrastructure engineering team back in 2015; it was under Apache incubation for several years and has been a top-level Apache project since January 2021. It started as a framework primarily focused on solving the Hadoop data ingestion problem, but later evolved into a universal ingestion framework supporting many source and target systems.

Apache Gobblin has released its latest 0.15 version, which brings a lot of features and bug fixes to an already stable platform. It also addresses the new-user learning curve and simplifies onboarding for new developers. This blog aims to serve as a getting-started guide for data infra developers who are looking to use Apache Gobblin for their data movement and integration needs.

Let’s dig into the details of Apache Gobblin.

Prepare Env.

# Requires Java >= 1.8 (use Jenv to manage multiple Java versions)
java -version
#optional
mkdir -p ~/src/apache; cd ~/src/apache
git clone https://github.com/apache/incubator-gobblin.git gobblin;
cd gobblin;
#Gobblin uses the Gradle wrapper, so download the wrapper jar into the source dir
curl --insecure -L https://github.com/apache/incubator-gobblin/raw/0.12.0/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar

1. Build

Build the project. For a faster build, skip the test, Javadoc, and checkstyle tasks:

./gradlew build -x findbugsMain -x rat -x checkstyleMain -x checkstyleTest -x test -x javadoc

2. Deploy

Set up the Apache Gobblin distribution: the build creates a tar distribution, so extract it.

mkdir -p /tools
tar -xvf apache-gobblin-incubating-bin-0.15.0.tar.gz -C /tools/
cd /tools/gobblin-dist

Let’s understand the structure of the Apache Gobblin distribution.

The recent Gobblin release (0.15) combines many scattered scripts into one standardized script with uniform features & functionality across all Gobblin execution modes (GOBBLIN-707). This makes new-user onboarding much easier and improves the overall experience.

  1. bin: contains the gobblin.sh and gobblin-env.sh scripts, which can manage all Gobblin processes and utilities. The 0.15 release of Apache Gobblin comes with many improvements that standardize and simplify usage, so all other scripts in bin are kept for backward compatibility only.
  2. conf: Apache Gobblin (for the most part) uses the excellent Lightbend (Typesafe) Config library for Java, which uses HOCON syntax (much friendlier than strict JSON). On startup, the Gobblin process uses the conf/<execution-mode> dir to pick up configs, reading reference.conf and then application.conf, in that order. You can also override this location via the --conf-dir parameter of gobblin.sh.
  3. lib: all required libraries and dependencies; everything placed here is added to the classpath.
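As a quick illustration of why HOCON is convenient for the conf directory, a config fragment can mix comments, env-variable substitution, and self-references. (The keys below mirror the standalone config shown later in this post; the values are just examples.)

```hocon
# application.conf values override reference.conf values with the same key
gobblin.work.dir=${GOBBLIN_WORK_DIR}   # substituted from the environment
fs.uri="file:///"                      # plain string value
state.store.fs.uri=${fs.uri}           # reference another key in the same file
```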

All Gobblin deployment modes are available under the gobblin service command, and all Gobblin CLI commands & utilities are available via the gobblin cli command. The gobblin.sh --help text is pretty self-explanatory, check it out.

Gobblin Service and Deployment

./bin/gobblin service

Gobblin can be run in one of several execution modes. To understand the execution modes better, we first have to understand how Gobblin works internally.

Gobblin Architecture:

As seen in the architecture diagram, Gobblin primarily has two major process components:

  1. Job Executor : responsible for scheduling and preparing the job to run, and for creating workunits that are submitted to the task executor.
  2. Task Executor : responsible for executing the submitted workunits, each of which goes through Extraction → Conversion → Quality Checks → Writer. (The Publisher only runs after the last workunit completes.)
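The workunit flow above can be sketched conceptually. To be clear, this is not Gobblin’s actual API; the function names below are hypothetical, and real Gobblin uses Source/Extractor/Converter/Writer/Publisher classes:

```python
# Conceptual sketch of the Gobblin task pipeline (hypothetical names):
# each workunit flows through extract -> convert -> quality check -> write,
# and the publisher step runs once, after the last workunit completes.

def extract(workunit):
    # A Source creates workunits; an Extractor pulls records for each one.
    return [f"record-{i}" for i in range(workunit["num_records"])]

def convert(record):
    # A Converter transforms each record (e.g. schema or format conversion).
    return record.upper()

def passes_quality_check(record):
    # Row-level quality checkers drop or flag bad records.
    return record != ""

def write(record, staging):
    # The Writer lands records in a task-level staging area.
    staging.append(record)

def run_job(workunits):
    staging = []
    for wu in workunits:
        for rec in extract(wu):
            rec = convert(rec)
            if passes_quality_check(rec):
                write(rec, staging)
    # Publisher: moves staged output to the final dir only after
    # every workunit has finished.
    return {"published": staging}

result = run_job([{"num_records": 2}, {"num_records": 1}])
print(result["published"])  # ['RECORD-0', 'RECORD-1', 'RECORD-0']
```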

All Gobblin Execution Modes:

Execution modes basically decide how these two components run: in a single process, in separate processes, or in a distributed way. Let’s look at each execution mode in a bit more detail.

  1. standalone : a single process that runs both the job & task executors. Other Gobblin services (REST services, Gobblin UI, metrics services, etc.) also run within the same JVM. This is the quickest way to run Gobblin and is best for local development.
./bin/gobblin service standalone start

2. cluster : Gobblin can also be deployed in cluster mode, where some nodes are masters and some are workers. Apache Gobblin uses Apache Helix to manage the cluster environment and to distribute the execution of the job’s work-units.

  • cluster-master : at least one master process is required; it can run on any node of the cluster and performs scheduling and work-unit creation.
  • cluster-worker : worker nodes execute the work-units.
./bin/gobblin service cluster-master start
./bin/gobblin service cluster-worker start

3. mapreduce : this execution mode is a bit of an outlier. Unlike the other modes, which run the job and task executor processes as daemons, this mode simply creates a mapper-only MapReduce job for the provided Gobblin job configuration and submits it to the given Hadoop cluster for execution. It runs the job once, without a schedule, and the MR application includes all other Gobblin services within the mapper JVM.

./bin/gobblin service mapreduce start --job-conf-file ./gobblin-jobs/kafka-to-hadoop-test.pull.done --jt localhost:8088 --fs localhost:8020 --verbose

4. yarn : this mode creates a YARN application master container that acts as the work-unit creator and spins up new containers based on the configuration and available work-units. This is also the best mode for running Gobblin in production.

./bin/gobblin service yarn start

5. aws : Gobblin can be deployed on cloud infrastructure, and this mode provides all the functionality needed to run Gobblin on AWS.

6. gobblin-as-service : the newest and most advanced mode, recently added, which aims to provide workflow orchestration and intelligent path-finding across available routes. We will talk about this mode in detail later.

How the Gobblin process starts:

The Gobblin process starts with a lot of default values (set in ./bin/gobblin-env.sh). Each mode runs a JVM with a predefined main class (the mapping can be found in the ./bin/gobblin script), using the default config from the conf/<execution-mode> directory to read all platform-level configs.

Apache Gobblin comes with a gazillion configurations, so it is highly configurable, but since not all configs are documented, one usually has to look into the code. The community is working towards adding more docs; contributions are highly welcome!

3. Run

Run Gobblin Standalone with example job.

Gobblin Standalone Config:

Gobblin ships with a default config for each mode. The standalone config is conf/standalone/application.conf; make changes to this config as per your local env.

GOBBLIN_WORK_DIR and GOBBLIN_JOB_CONFIG_DIR come from gobblin-env.sh by default, and can be overridden via gobblin.sh script parameters.

# standalone/application.conf
gobblin.work.dir=${GOBBLIN_WORK_DIR}
gobblin.jobconf.dir=${GOBBLIN_JOB_CONFIG_DIR}

# Directory where job configuration files are stored
jobconf.dir=${gobblin.jobconf.dir}
jobconf.fullyQualifiedPath="file://"${gobblin.jobconf.dir}
# Directory where job locks are stored
job.lock.dir=${gobblin.work.dir}/locks

# File system URIs
#use file:/// if you don't have Hadoop locally
#fs.uri="file:///"
fs.uri="hdfs://localhost:8020"
writer.fs.uri=${fs.uri}
state.store.fs.uri=${fs.uri}

# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
#writer.staging.dir=${gobblin.work.dir}/task-staging
#writer.output.dir=${gobblin.work.dir}/task-output
# neither of the above is required if task.data.root.dir is specified.
task.data.root.dir=${gobblin.work.dir}/task-staging


# Data publisher related configuration properties
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${gobblin.work.dir}/job-output
data.publisher.replace.final.dir=false

# Directory where job/task state files are stored
state.store.dir=${gobblin.work.dir}/state-store

# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${gobblin.work.dir}/err

# Directory where commit sequences are stored
gobblin.runtime.commit.sequence.store.dir=${gobblin.work.dir}/commit-sequence-store

# Directory where job locks are stored
job.lock.dir=${gobblin.work.dir}/locks

# Directory where metrics log files are stored
metrics.log.dir=${gobblin.work.dir}/metrics

# Enable metrics / events
metrics.enabled=true
#Metrics report interval in milliseconds.
metrics.report.interval=1000
metrics.reporting.file.enabled=true


# UI - starts the UI to see jobs and tasks
admin.server.enabled=true
admin.server.port=9000
#required for job/task to persist execution info
rest.server.host=localhost
rest.server.port=9090

# job history store used by the rest server
job.execinfo.server.enabled=true
job.history.store.enabled=true
job.history.store.url="jdbc:mysql://localhost:3306/gobblin?autoReconnect=true&useSSL=false"
#job.history.store.jdbc.driver=com.mysql.jdbc.Driver
job.history.store.jdbc.driver=com.mysql.cj.jdbc.Driver
job.history.store.user=gobblin_user
job.history.store.password=gobblin_pass

# The time gap for Job Detector to detect modification/deletion/creation of jobconfig.
# Unit in milliseconds, configurable.
jobconf.monitor.interval=10000
task.status.reportintervalinms=1000

Gobblin Sample Job:

Gobblin by default looks for job configurations at the location specified by the env variable GOBBLIN_JOB_CONFIG_DIR, which can be overridden in ./bin/gobblin-env.sh.

Let’s create a job config that moves some data from Hadoop to Hadoop, at GOBBLIN_JOB_CONFIG_DIR/hadoop-to-hadoop-test-job.pull, which looks as follows:

job.name=hadoop-to-hadoop-test
job.group=hadoop-to-hadoop
job.runonce=true
source.class = org.apache.gobblin.data.management.copy.CopySource
source.filebased.fs.uri = "hdfs://localhost:8020"
writer.fs.uri = "hdfs://localhost:8020"
writer.builder.class = org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
extract.namespace = org.apache.gobblin.copy
converter.classes = org.apache.gobblin.converter.IdentityConverter

gobblin.dataset.pattern = /tmp/test-job-data
gobblin.dataset.profile.class = org.apache.gobblin.data.management.copy.CopyableGlobDatasetFinder


data.publisher.type = org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
data.publisher.final.dir = /tmp/test-job

Start Gobblin in standalone mode; the launched Gobblin process detects and executes the above job. Gobblin also ships with many other example jobs under gobblin-example/src/main/resources/.

writer.fs.uri supports any Hadoop-compatible filesystem, so if you specify an S3 or GCS URI it will write data to S3 or GCS buckets respectively. This makes it very flexible for cloud use cases.
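For instance, redirecting the sample job’s output to S3 is in principle just a config change. A hypothetical fragment (the bucket name is made up; s3a:// is the scheme used by the hadoop-aws connector, and the connector jars plus AWS credentials must be available to Gobblin):

```properties
# write job output to an S3 bucket instead of local HDFS
writer.fs.uri="s3a://my-test-bucket"
data.publisher.final.dir=/gobblin/job-output
```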

Gobblin CLIs and Utilities

Gobblin comes with many prepackaged utilities that are useful for managing the data it moves. Some of them are self-explanatory, while others need more elaborate docs that I will cover in a separate blog.

./bin/gobblin cli

I have not explored all the features that Gobblin has to offer. I will add more details here as I explore those.

Gobblin Features

Gobblin comes with a plethora of features and flags that are useful in a high-scale production system. The team has done a good job of documenting many of them on the official documentation page, although not 100% is documented :)

Summary

The LinkedIn dev team’s “commit to open source first” approach ensures fairly stable releases and an unbroken master branch.

Apache Gobblin truly treats data integration as a distributed-systems problem and solves it with the right components, which are stable and horizontally scalable. I have been using Apache Gobblin for a while now and have personally found it highly stable and scalable for production workloads.

What is next?

Gobblin has accumulated a lot of features over years of development, and with them significant tech debt. The code works reliably but has some learning curve for newcomers.

Open Issues

Gobblin has its fair share of issues. The code base is huge (~350k LOC), and over time its module arrangement has become complex to navigate, but the community has an open improvement proposal to fix this, and (as with any OSS project) contributions are always welcome!

  • Scattered modules and the huge codebase make it hard to navigate; GIP-2 was created to resolve this.
  • Unit tests are sometimes flaky depending on the CI env.
  • Job management is file-based; DB-backed job persistence is still lacking (a GIP exists to improve this).
  • Some duplicate or dead code could be removed.
  • Would love to see a bigger community built around the project.

Start gobbling up some data!

And let me know if you find any gaps or if something doesn’t work for you; I will try to address it in this blog to make it more complete.
