The Evolution of Data-Movement Systems

Jay Sen
5 min read · Feb 4, 2021


Believe it or not, today data makes the decisions. Moreover, the larger and wider the data set, the more accurate the decisions can be. This goes for both realtime (online) decisions, like approving transactions, and offline decisions, like deciding on the next product feature or the next country to expand the business to.

“Data is the new oil” — The Economist.

…but moving data (or oil) around is not an easy task.

Why do we keep moving data?

The world creates more than 2.5 quintillion bytes of data every day (a quintillion is a 1 followed by 18 zeros), according to the “Data Never Sleeps” research. New data is typically produced on an online system that is optimized to store and serve production traffic (a.k.a. OLTP) but cannot be used directly for analysis or bulk processing. Hence it has to move to an offline system (a.k.a. OLAP), a cost-efficient storage and compute platform that facilitates analysis and offline processing. Even within OLAP systems, data constantly moves between special-purpose systems for reporting, processing, sharing, archiving, and so on, in order to realize its value. As data gets stale, it often loses that value.

“Data that moves is alive; data at rest is dead.” — Jay

A traditional data-movement system performs Extract, Transform, and Load operations (a.k.a. ETL), moving data from one system to another with heavy, light, or no transformations. As storage systems evolved to become more complex and purpose-built, data-movement platforms evolved along with them, adding layers of complexity and functionality to support newer requirements.

Often, people wonder why we can’t simply use SQL (select/insert), scp, or dist-cp commands or scripts to do the job. I don’t blame the beginners :), so this blog attempts to show how modern requirements brought complexity that grew over time, especially in today’s polyglot environments. The progression can loosely be seen as the evolution of data-movement systems.

credit for base image: https://i.redd.it/frirzh0fs5k31.jpg

We also use many different but similar-sounding terms like “data movement”, “data replication”, “ETL”, and nowadays “data ingestion”, “data integration”, or “data pipelines” interchangeably, adding to the confusion about which is which and what the boundaries and responsibilities of any such system should be.

Let’s dissect each stage to understand it better, with a high-level view of its components, capabilities, and features as the evolution progresses.

Movement: data movement is a commonly used umbrella term for all kinds of movement. Whenever someone wants data from another platform, they request a load of data, typically extracted into a CSV file, to be uploaded to a target system. For relative understanding, let’s consider this the simplest type of data transfer: it moves data from one place to another without any transformations, usually with a command or SQL (select/insert) and a simple scheduler like cron, with little to no monitoring (e.g., an email on failure). A minimal sketch of this stage follows the examples below.

Stage — 1: Data Movement

Examples:

  1. File transfer from one location to another, within or across organizations
  2. Ad-hoc backup of a database to a file (using non-native tools)
  3. Copying a table from one system to another similar system
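
As a concrete illustration of this stage, here is a minimal sketch in Python of that kind of “command plus cron” movement: dump a table to CSV and scp it to a target host. The database path, table name, and target host are hypothetical, and scheduling/alerting is assumed to be a plain cron entry with an email on failure.

    #!/usr/bin/env python3
    # Minimal sketch of Stage 1 "movement": dump a table to CSV and copy it to a
    # remote host. The table name, paths, and target host are hypothetical.
    # Scheduling and alerting are assumed to be a plain cron entry, e.g.:
    #   0 2 * * * /usr/local/bin/move_orders.py || mail -s "move failed" oncall@example.com
    import csv
    import sqlite3          # stand-in for any DB-API driver (cx_Oracle, psycopg2, ...)
    import subprocess

    SOURCE_DB = "/data/app.db"            # hypothetical source database
    EXPORT_FILE = "/tmp/orders.csv"       # intermediate CSV extract
    TARGET = "analytics-host:/landing/"   # hypothetical scp target

    def export_table():
        conn = sqlite3.connect(SOURCE_DB)
        cur = conn.execute("SELECT * FROM orders")   # plain select, no transformation
        with open(EXPORT_FILE, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur)
        conn.close()

    def ship_file():
        # a simple copy: no retries, no schema checks, no monitoring beyond the exit code
        subprocess.run(["scp", EXPORT_FILE, TARGET], check=True)

    if __name__ == "__main__":
        export_table()
        ship_file()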

Replication: data replication is not only about moving data but also about guaranteeing that the functional view of the data remains the same on both the source and target systems, i.e. it needs to capture deletes performed at the source, and the target system needs to be treated as read-only. Data replication between homogeneous systems, like Oracle to Oracle or MySQL to MySQL, is typically native to the system and primarily uses raw logs to achieve very low latency. Data replication between heterogeneous systems, on the other hand, typically happens by querying the database and has higher latency, but the guarantee remains the same: whatever happens on the source must also happen on the target, so that the functional data view stays identical. For example, the number data type in Oracle has to be mapped to a decimal with the right precision and scale. So while replicating the data, the system may apply small transformations (the small t in EtL) but won’t do any complex data transformation (the big T in ETL). A sketch of query-based replication follows the examples below.

Stage — 2: Data Replication

Examples:

  1. Oracle to Oracle table replication
  2. Teradata to Teradata table replication
  3. Oracle to Hadoop data replication
  4. SFTP file to Hadoop data ingestion
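
Below is a hedged sketch of what query-based replication between heterogeneous systems can look like: it reads rows changed since the last watermark, applies the small-t type mapping (e.g. Oracle NUMBER to a decimal with matching scale), and replays inserts, updates, and deletes on the target so the functional view stays identical. The change-log table, column names, cursors, and the Postgres-style upsert on the target are all illustrative assumptions, not any specific product’s API.

    # Minimal sketch of query-based replication between heterogeneous systems:
    # read rows changed since the last watermark, apply the small-t type mapping,
    # and replay inserts/updates/deletes so the target's functional view matches
    # the source. The change-log table, columns, and cursors are hypothetical.
    from decimal import Decimal

    def map_row(raw):
        # small "t": e.g. Oracle NUMBER(10,2) -> Decimal with matching scale
        return {
            "id": int(raw["ID"]),
            "amount": Decimal(str(raw["AMOUNT"])).quantize(Decimal("0.01")),
            "op": raw["OP"],   # 'I', 'U', or 'D', as captured on the source
        }

    def apply_changes(source_cur, target_cur, last_watermark):
        source_cur.execute(
            "SELECT id, amount, op FROM orders_changelog WHERE change_ts > :1",
            [last_watermark],
        )
        for record in source_cur:
            row = map_row(dict(zip(["ID", "AMOUNT", "OP"], record)))
            if row["op"] == "D":
                # deletes must be replayed too, otherwise the target diverges
                target_cur.execute("DELETE FROM orders WHERE id = %s", (row["id"],))
            else:
                target_cur.execute(
                    "INSERT INTO orders (id, amount) VALUES (%s, %s) "
                    "ON CONFLICT (id) DO UPDATE SET amount = EXCLUDED.amount",
                    (row["id"], row["amount"]),
                )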

Integration & Pipelines: data integration is a much more complex and diverse process, where the movement involves multiple sources or targets and complex transformations that may or may not create new datasets. It can also go through multiple stages of processing with intermediate states of the dataset. This can be conveniently modeled as a DAG (Directed Acyclic Graph) workflow, often called a data pipeline, which brings a lot of additional requirements and complexity to the movement platform. A small DAG sketch follows the examples below.

Stage — 3: Data Integration & Pipelines

Examples:

  1. Realtime data ingestion from Oracle to various business units’ data marts and also to the data lake.
  2. Bringing 3rd-party data into a data lake or data lakehouse, processing and converting it into the required data model, and making it available to other business entities.
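
Here is a small sketch of such a pipeline expressed as a DAG, using Apache Airflow purely as one possible orchestrator; the DAG id, task names, and callables are hypothetical placeholders.

    # Minimal sketch of a Stage 3 pipeline as a DAG, using Apache Airflow as one
    # example orchestrator. The DAG id, task names, and callables are hypothetical.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_from_oracle(**_): ...     # placeholder: pull changed rows from the OLTP source
    def load_to_lake(**_): ...            # placeholder: land raw data in the data lake
    def build_marts(**_): ...             # placeholder: heavy "T", model data per business unit
    def publish_quality_report(**_): ...  # placeholder: data-quality checks / monitoring

    with DAG(
        dag_id="orders_ingestion",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_from_oracle)
        lake = PythonOperator(task_id="load_lake", python_callable=load_to_lake)
        marts = PythonOperator(task_id="build_marts", python_callable=build_marts)
        dq = PythonOperator(task_id="data_quality", python_callable=publish_quality_report)

        # the DAG itself: one extract feeds the lake, which fans out downstream
        extract >> lake >> [marts, dq]

The point is the fan-out: a single extract feeds the lake, and the lake feeds multiple downstream consumers, each of which can apply its own (big T) transformations.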

With ever-increasing special-purpose systems, polyglot data architectures, and the adoption of cloud (or multi-cloud) environments, one cannot afford a spaghetti of data-movement tools. As the demand for data, agility, and speed increases, along with compliance, regulatory, and governance requirements, we need an advanced architecture that can incorporate and fulfill all of these needs: the North Star of data movement!

I believe we are at the peak of the curve due to the proliferation of data use cases, but with a plethora of startups aiming to solve this challenge with products backed by open-source projects, we may soon see some standardization that commoditizes moving data around with some ease. Until then, happy moving data!


Jay Sen

Building data platforms to make data actionable and fast!