About Datalot
Founded in 2009, Datalot provides digital marketing and analytics solutions for insurance policy sales at scale. Its SaaS product offers the largest marketplace of live, in-market insurance shoppers and delivers qualified customers to some of the largest insurance companies in the world - as well as to a broad, distributed network of independent providers.
With data at the core of its business, Datalot enables its customers to spend less time and money on marketing campaigns, and more time focusing on what they do best. The insurance industry had long lagged in digital marketing and customer targeting, and Datalot started with a mission to improve the quality of that data - delivering qualified customers directly to the appropriate insurance company or agent.
Data Engineering at Datalot
Josh Arenberg leads data engineering at Datalot as the Director of Engineering, reporting to Datalot's CTO, and is primarily responsible for the company's data environment. Josh brings more than two decades of engineering expertise to Datalot, with prior experience in data science, threat and botnet detection, and big data analysis, as well as experience with streaming frameworks like Apache Spark, ksqlDB, and Flink.
As data plays a critical role in their business model, their small data engineering team is always looking to do more with less.
Josh arrived at Datalot with a very broad remit - figure out how to modernize the company's data infrastructure. Everything had been based around a central SQL database, which had grown and grown over the years, with many read-only replicas attached and services polling it frequently for updates.
As Arenberg describes: “We were architecturally at the point where continuing to just add and add on top of the cluster was clearly not going to work through the next several years for the business.”
Datalot needed a way to take some of that load off the database - and to build better patterns around how analytics are built and how that data is derived.
Digital transformation and streaming data at Datalot
“There are lots of time-critical aspects to this business,” Arenberg explained. “Exposing the data in a way that wasn’t just a nightly ETL process was very important.”
“There’s a paradigm shift - thinking about the data in terms of a set of evolving conditions that are going to drive systems and building this machine that responds to events as they happen - rather than data as a static thing that we ask questions of. Data is an evolving thing that drives logic.”
While real-time data remains a goal for many companies, the initial shift away from a traditional architecture - an OLTP application database batch-loaded into an OLAP warehouse, both heavily reliant on relational joins - is a common challenge. Companies depend on common sets of joins across several different tables to generate and monitor critical business metrics.
As Arenberg describes: “That reality is probably blocking a ton of similar companies from making use of streaming data. In order to get to the base facts of the business, we’ve got to join a bunch of data together, and that’s not that easy to do in a typical streaming framework.”
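To make that concrete, a core business metric in this kind of model typically only emerges after several joins. The sketch below is purely illustrative - the table and column names are invented, not Datalot's schema - but it shows the shape of query that batch reporting depends on and that most streaming frameworks make hard to express:

    -- Illustrative only: a join-heavy metric query of the kind that
    -- batch reports rely on (all table and column names are hypothetical).
    SELECT
        c.carrier_name,
        date_trunc('hour', p.created_at) AS hour,
        count(*)                         AS policies_sold,
        sum(p.premium)                   AS total_premium
    FROM policies p
    JOIN quotes   q ON q.quote_id   = p.quote_id
    JOIN leads    l ON l.lead_id    = q.lead_id
    JOIN carriers c ON c.carrier_id = q.carrier_id
    GROUP BY c.carrier_name, date_trunc('hour', p.created_at);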
Materialize as a New Approach to Stream Data Processing
Using a combination of Apache Kafka and Debezium, an open source distributed platform for change data capture, Datalot established the foundation of a real-time data pipeline.
As Datalot began rewriting their analytics dashboards for real time, they discovered a great deal of institutional knowledge baked into the existing batch-oriented dashboards, and hoped to reuse these models without a major overhaul. At this point, Arenberg engaged Materialize.
Materialize easily processes complex analytics over streaming datasets - accelerating development of internal tools, interactive dashboards, and customer-facing experiences. The platform delivers incrementally-updated materialized views in ANSI-standard, Postgres-compatible SQL. Materialize is the only technology that enables engineers to build data products on streaming data in a powerful declarative language - SQL - instead of building custom microservices.
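As a rough sketch of what that looks like in practice - using the Kafka/Debezium source syntax from Materialize's 0.x releases and the same invented table names as above, not Datalot's actual configuration - a change-data topic becomes a source, and the join logic becomes a view that Materialize keeps up to date as events arrive:

    -- Hypothetical sketch (Materialize 0.x syntax; broker, registry, and
    -- topic names are invented). One source per Debezium-fed topic.
    CREATE SOURCE policies
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'dbserver.public.policies'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
    ENVELOPE DEBEZIUM;

    -- The same join-heavy metric, now maintained incrementally as new
    -- change events stream in, instead of recomputed by a nightly batch.
    CREATE MATERIALIZED VIEW hourly_policies_by_carrier AS
    SELECT
        c.carrier_name,
        date_trunc('hour', p.created_at) AS hour,
        count(*) AS policies_sold
    FROM policies p
    JOIN quotes   q ON q.quote_id   = p.quote_id
    JOIN carriers c ON c.carrier_id = q.carrier_id
    GROUP BY c.carrier_name, date_trunc('hour', p.created_at);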
“As I was managing our tech refresh, the timing was too good to not try to marry up some of these things. Where a lot of the dashboards before would have relied on summary table views, now the dashboards could simply rely on Materialize.”
Use Cases for Materialize at Datalot
The first iteration for Datalot was to use Materialize to build real-time dashboards and analytics visualizations. With a standard SQL interface, Materialize makes it simple to connect data visualization tools and applications and keeps query results incrementally refreshed with millisecond latency as new data arrives. An outline of data moving into Materialize from streaming sources and out to applications is featured below.
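Because Materialize presents itself over the Postgres wire protocol, a dashboard panel or BI tool reads those views with ordinary SQL. Continuing the hypothetical view from the earlier sketch, a refresh is just a query, and tools that want a push-based feed can stream changes instead:

    -- Point-in-time read for a dashboard panel refresh.
    SELECT * FROM hourly_policies_by_carrier ORDER BY hour DESC LIMIT 24;

    -- Or stream every change to the view as it happens
    -- (TAIL in Materialize 0.x; later versions call this SUBSCRIBE).
    TAIL hourly_policies_by_carrier;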
With Materialize, Datalot was able to roll out new dashboards without a significant investment from engineering in building something new. “We were already building real-time dashboards,” according to Arenberg. “Materialize meant that refresh could happen very quickly.” Access to real-time data analysis has improved operations across Datalot, deepening the kinds of notification services that alert Datalot employees to their clients’ performance.
Datalot is also building out real-time alert services using Materialize. Arenberg is encouraged by the potential of this simple implementation: “We can take the same analytics that used to be embedded in our reports, and use them to let people know as soon as something becomes an issue, rather than them needing to find a report or a dashboard. It is the simplest use case for this, but where we see that heading is driving further automation, with conditions that build more of an automated machine to handle a lot of these things.”
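A minimal sketch of that alerting pattern, again reusing the hypothetical view from above: define a view that only contains rows while an alert condition holds, and let a small notification service watch it rather than re-running a report.

    -- Hypothetical alert condition (names and threshold invented): hours in
    -- which a carrier's policy volume fell below a floor. A notification
    -- service can poll or TAIL this view and notify the right people.
    CREATE MATERIALIZED VIEW low_volume_alerts AS
    SELECT carrier_name, hour, policies_sold
    FROM hourly_policies_by_carrier
    WHERE policies_sold < 10;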
Datalot Architecture for Real-Time Dashboards
Most of the data moving into Materialize comes from Debezium, which the team runs via Strimzi on Kubernetes (AWS MSK). The team also runs Airflow jobs that pick up data from various provider APIs on a regular schedule and deliver it into Kafka.
The Datalot Kafka pipeline feeds a home-grown real-time ingestion pipeline into S3 and Snowflake. The production Kafka cluster is also mirrored using MirrorMaker into a secondary instance, which runs on Strimzi on Kubernetes and is snapshotted three times a day to EBS. All Kafka connectors and MirrorMaker also run on Strimzi in Kubernetes.