Because of time zone differences, events presented by American speakers will be spread over more days and will take place in the afternoon, from 2 pm to 6 pm Italian time.
Brief History of Data Engineering
In the beginning, there was Google. Google looked over the expanse of the growing internet and realized they’d need scalable systems. They created MapReduce and the Google File System (GFS), publishing the GFS paper in 2003 and the MapReduce paper in 2004. The open source HDFS was later modeled on GFS.
Doug Cutting took those papers and created Apache Hadoop in 2005.
Cloudera was founded in 2008, and Hortonworks followed in 2011. They were the first companies to commercialize open source big data technologies, and they pushed the marketing of Hadoop.
Hadoop was hard to program. Apache Pig arrived in 2008, but it never saw as much adoption. Apache Hive came along in 2010 to add SQL.
With an immutable file system like HDFS, we needed scalable databases to read and write data randomly. Apache HBase came in 2007, and Apache Cassandra came in 2008. Along the way, databases proliferated within each category, such as GPU, graph, JSON, column-oriented, MPP, and key-value.
Hadoop didn’t support real-time processing, and Apache Storm, open sourced in 2011, set out to fill that gap. It didn’t get wide adoption, as it was a bit early for real-time and the API was difficult to wield.
Apache Spark came in 2009 and gave us a unified batch and streaming engine. Its usage grew, and it eventually displaced Hadoop MapReduce.
Apache Flink came in 2011 and gave us our first real streaming engine. It handled the stateful problems of real-time elegantly.
We lacked a scalable pub/sub system. Apache Kafka came in 2011 and gave the industry a much better way to move real-time data. Apache Kafka has its architectural limitations, and Apache Pulsar was released in 2016 as an alternative.