From classic ETL approach with Postgres to Streaming ETL using Kafka Streams – Carles Planas

Implementing ETLs in PostgreSQL and PL/pgSQL over more than 300M HTTP dialogs per day for a single client may be neither scalable nor maintainable, and moving to streaming ETLs with Kafka Streams is a challenge of its own.
Our DDC data collector, which can capture this volume of traffic with few resources, sends the raw data to a Kafka cluster, which lets us digest the traffic there instead of in PostgreSQL.
Our stack should work with different BI tools to visualize the important data insights for a particular client, for example by building OLAP cubes for batch processes or by serving real-time queries over different dimensions of the data. That is why we are implementing our ETL applications with Kafka Streams.

By the end of this talk you will know more about our experience with the Kafka Streams library for building aggregation, disaggregation and join applications; the motivation behind choosing Kafka Streams, including why we decided to use it to replace even batch processes; our experience using Avro or Kryo for serialization inside our applications; and finally our ideas for deploying those applications on Kubernetes.
The talk will include a demo of our application capturing HTTP dialog traffic resembling that of an OTA (Online Travel Agency), the deployment of our applications to aggregate, disaggregate and join streams, and finally the resulting data stored in InfluxDB and visualized with Grafana.


datahack Barcelona.
Avenida Josep Tarradellas, 34-36, 1º drcha

September 18, 2019 at 18:30

