Online Events

Due to time zone differences, events presented by American speakers will be spread over more days and will take place in the afternoon, from 2 pm to 6 pm Italian time.

Data Warehouse Modernisation

ONLINE LIVE STREAMING

Dec 04 - Dec 05, 2023

By: Mike Ferguson

Taxonomy and Metadata Design

ONLINE LIVE STREAMING

Dec 11 - Dec 12, 2023

By: Heather Hedden

Data Governance

ONLINE LIVE STREAMING

Dec 13 - Dec 14, 2023

By: Nigel Turner

Chatbot and LLM Bootcamp

ONLINE LIVE STREAMING

Dec 18 - Dec 19, 2023

By: Russell Jurney

Data Quality: A “must” for Business Success

ONLINE LIVE STREAMING

Apr 08 - Apr 09, 2024

By: Nigel Turner

Designing, Developing and Deploying a Microservices Architecture

ONLINE LIVE STREAMING

Apr 12, 2024

By: Sander Hoogendoorn

Practical Guidelines for Implementing a Data Mesh

ONLINE LIVE STREAMING

Apr 15 - Apr 16, 2024

By: Mike Ferguson

Embedded Analytics, Intelligent Apps & AI Automation

ONLINE LIVE STREAMING

Apr 17, 2024

By: Mike Ferguson

Artificial Intelligence, Machine Learning and Data Management

ONLINE LIVE STREAMING

Apr 18 - Apr 19, 2024

By: Derek Strauss

Free article of the month

Jesse Anderson
December 2023

Brief History of Data Engineering

In the beginning, there was Google. Google looked over the expanse of the growing internet and realized they’d need scalable systems. They created the Google File System (GFS) and MapReduce, and published the papers describing them in 2003 and 2004.

Doug Cutting took those papers and created Apache Hadoop in 2005.

Cloudera was started in 2008, and Hortonworks in 2011. They were the first companies to commercialize open source big data technologies, and they pushed the marketing and adoption of Hadoop.

Hadoop was hard to program, and Apache Hive came along in 2010 to add SQL. Apache Pig arrived in 2008 as well, but it never saw as much adoption.

With an immutable file system like HDFS, we needed scalable databases to read and write data randomly. Apache HBase came in 2007, and Apache Cassandra came in 2008. Along the way, there were explosions of databases within each type, such as GPU, graph, JSON, column-oriented, MPP, and key-value.

Hadoop didn’t support real-time processing, and Apache Storm was open sourced in 2011 to fill that gap. It didn’t get wide adoption, as it was a bit early for real-time and its API was difficult to wield.

Apache Spark came in 2009 and gave us a unified batch and streaming engine. Its usage grew, and it eventually displaced Hadoop.

Apache Flink came in 2011 and gave us our first real streaming engine. It handled the stateful problems of real-time elegantly.

We lacked a scalable pub/sub system. Apache Kafka came in 2011 and gave the industry a much better way to move real-time data. Apache Kafka has its own architectural limitations, though, and Apache Pulsar was released in 2016.

Continue reading…

Subscribe to our newsletter