Materialize provides a SQL interface to work with continually changing data. You type SQL queries, and we maintain the queries incrementally, offering fast access to results. If you’re used to stream processors, Materialize offers temporal filters to present a time-based window into the data.
Support for temporal filters isn’t new, but we recently addressed some concerns around their resource utilization. In this blog, I’ll explain what makes maintaining temporal filters expensive and how we mitigate some of the cost.
What are temporal filters?
I’ll refer to Frank’s excellent introduction to temporal filters, so here’s just a quick recap. Assuming you have data that encodes event time, you can use temporal filters to define a time-based window on top of it. Let’s define a table that encodes timestamped ticket sales.
CREATE TABLE ticket_sales (
    content text,
    sale_ts mz_timestamp
);
The content column describes a sales event, and sale_ts plays the role of a timestamp, indicating the moment of the sale. This is just data, and Materialize will happily return all of it when selecting from the table. Note that I’m using the type mz_timestamp here, but we could use any other type that casts to it, too.
To make things more interesting, we ask Materialize to only return the sales that happened in the last 24 hours, using its logical time (ticket_sales is still empty, so the query won’t return any data yet):
SELECT content
FROM ticket_sales
WHERE mz_now() >= sale_ts
AND mz_now() < (sale_ts::timestamp + '24 hours'::interval)::mz_timestamp;
This query will change its result over time, not just for changes to the ticket_sales table, but also as Materialize’s system clock advances.
The mz_now() function returns the logical time at which the statement was executed, and can be used in WHERE conditions to limit the working dataset of ad-hoc SELECT queries, (materialized) views, and indexes. This is how you define a temporal filter in Materialize!
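Conceptually, that WHERE clause makes each row visible during the half-open interval from sale_ts to sale_ts plus 24 hours. Here’s a small Python sketch of the visibility check (illustrative only, not Materialize code; the function name is made up):

```python
HOUR_MS = 60 * 60 * 1000  # mz_timestamp values are milliseconds since the epoch

def visible(sale_ts_ms: int, now_ms: int) -> bool:
    # Mimics: mz_now() >= sale_ts AND mz_now() < sale_ts + '24 hours'
    return sale_ts_ms <= now_ms < sale_ts_ms + 24 * HOUR_MS

sale = 1736755846704  # a sale_ts, as in the example
print(visible(sale, sale + 1 * HOUR_MS))   # → True: inside the 24-hour window
print(visible(sale, sale + 25 * HOUR_MS))  # → False: the window has passed
```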
Let’s take a closer look at how this all works under the hood.
What makes temporal filters tick?
I’ll make this more concrete by continuing with the above example. We’ll now insert some data into the ticket_sales table:
INSERT INTO ticket_sales VALUES ('hello', now()::mz_timestamp);
This inserts a ticket sale with a content value of hello, along with its sale timestamp (sale_ts). A select reveals the data:
SELECT content, sale_ts FROM ticket_sales;
content | sale_ts
---|---
hello | 1736755846704
You might not find this particularly surprising as we’ve just inserted said data, but now we’ll define a temporal filter on this data:
SELECT content, sale_ts
FROM ticket_sales
WHERE mz_now() >= sale_ts
AND mz_now() < (sale_ts::timestamp + '24 hours'::interval)::mz_timestamp;
content | sale_ts
---|---
hello | 1736755846704
Unless you moved away from your computer for more than a day between the two queries, you should get the same result, because the current time is still within the sale_ts plus 24 hours bounds.
Next, let’s insert a ticket sale in the future. It should not appear in the select with the temporal filter because it’s too far out:
INSERT INTO ticket_sales VALUES ('future', (now() + '20 days')::mz_timestamp);
And repeating our select with a temporal filter still only produces the ticket sale with a content value of hello.
Materialize expresses data as updates in the form of (data, time, diff), representing an update at a time with a change in multiplicity. Think of the diff representing an insert (>0) or a delete (<0). The time represents the moment when Materialize learned about a change.
In the above example, the temporal filter turns data into a pair of updates, similar to the following table:
mz_timestamp | mz_diff | content | sale_ts
---|---|---|---
1736756302896 | 1 | hello | 2025-01-13 00:00:00+00
1736812800000 | -1 | hello | 2025-01-13 00:00:00+00
1738454400000 | 1 | future | 2025-02-02 00:00:00+00
1738540800000 | -1 | future | 2025-02-02 00:00:00+00
Some of the updates occur at the current time (the insertion of the hello sale event), but others occur in the future, such as the retraction of the hello sale event and the addition and retraction of the future sale event.
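To make the compilation step concrete, here’s a Python sketch of how a temporal filter could turn one row into such an insert/retract pair (a conceptual model with made-up names, not Materialize’s actual implementation): the insertion lands at the later of the learn time and sale_ts, and the retraction at sale_ts plus 24 hours.

```python
HOUR_MS = 60 * 60 * 1000  # mz_timestamp values are milliseconds since the epoch

def temporal_filter_updates(data, sale_ts_ms, learn_ts_ms):
    """Turn one row into a (data, time, diff) insert/retract pair for the
    window [sale_ts, sale_ts + 24 hours)."""
    insert_at = max(learn_ts_ms, sale_ts_ms)  # the row becomes visible
    retract_at = sale_ts_ms + 24 * HOUR_MS    # the window closes
    return [(data, insert_at, +1), (data, retract_at, -1)]

learned = 1736756302896  # the moment Materialize learned about both rows
print(temporal_filter_updates("hello", 1736726400000, learned))
# → [('hello', 1736756302896, 1), ('hello', 1736812800000, -1)]
print(temporal_filter_updates("future", 1738454400000, learned))
# → [('future', 1738454400000, 1), ('future', 1738540800000, -1)]
```

The printed pairs match the rows in the table above: hello is already inside its window, so it appears at the learn time; future only appears once its sale_ts arrives.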
Each relation has a time that marks what data is definite, and when you query a relation, we will only reveal data up to this time. Data in the future can still change, which is why Materialize cannot surface it, i.e., a pending future addition could cancel an equivalent future retraction.
Expiring future updates
Materialize supports a concept of arrangements, which you can think of as an in-memory index of data spanning a time range, ending in the current time that’s known to be definite. Specifically, it only represents data that is definite. While we might know about future updates, we cannot insert them into the arrangement yet because they might not be definite.
Instead, the arrangement stages them in a separate area that we scan whenever we learn that the time for definite data has changed. At that point, Materialize scans the pending data and extracts the newly definite updates while leaving indefinite future updates in place. This process is what causes resource utilization proportional to the number of outstanding updates: we need to retain the updates in memory, and spend CPU cycles on periodically scanning them.
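As a rough mental model, the staging area behaves like the toy Python class below (the names are made up, and the real arrangement is a far more sophisticated data structure). Note how every frontier advance rescans all pending updates:

```python
class ToyArrangement:
    """Toy model: definite data plus a staging area for future updates."""

    def __init__(self):
        self.definite = []  # updates at or before the frontier
        self.pending = []   # future updates, rescanned on every advance
        self.frontier = 0

    def insert(self, data, time, diff):
        target = self.definite if time <= self.frontier else self.pending
        target.append((data, time, diff))

    def advance_to(self, frontier):
        # The cost described above: `pending` is held in memory, and every
        # advance scans it in full to extract the newly definite updates.
        self.frontier = frontier
        newly_definite = [u for u in self.pending if u[1] <= frontier]
        self.pending = [u for u in self.pending if u[1] > frontier]
        self.definite.extend(newly_definite)

arr = ToyArrangement()
arr.insert("hello", 5, +1)
arr.insert("hello", 100, -1)  # a far-future retraction stays pending...
arr.advance_to(10)
print(arr.definite)  # → [('hello', 5, 1)]
print(arr.pending)   # → [('hello', 100, -1)] ...and is rescanned every time
```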
This is where an optimization we recently implemented comes into play: for some objects, we know that their definite time roughly correlates with the wall clock advancing. This implies that updates at a specific moment in the future can only become definite at that time. We plan restarts of Materialize environments ahead of time, which allows us to reason about which updates might be revealed in an index, and which updates will definitely not be revealed until the next restart.
We determine a timestamp for each replica that is some time after the next scheduled restart, and simply discard updates in the future of this timestamp. We ensure correctness by preventing times in the future of the expiration timestamp from becoming definite.
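A Python sketch of that cutoff (hypothetical names, and it simplifies the real mechanism): updates staged beyond the expiration timestamp can simply be dropped, because the replica is guaranteed to restart, and rebuild its state from scratch, before any time past that point becomes definite.

```python
def expire_updates(pending, expiration_ts):
    """Drop staged future updates at or beyond the expiration timestamp.

    Only safe because the replica restarts (and rehydrates its state)
    before `expiration_ts` can become definite."""
    return [(data, time, diff)
            for (data, time, diff) in pending
            if time < expiration_ts]

pending = [
    ("hello", 1736812800000, -1),   # retraction due tomorrow: keep
    ("future", 1738540800000, -1),  # retraction due in weeks: drop
]
print(expire_updates(pending, 1737000000000))
# → [('hello', 1736812800000, -1)]
```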
What objects support expiration?
The expiration feature assumes that the time of updates correlates with wall-clock time, which is true for a subset of the objects Materialize supports. If you create tables and sources, the feature applies transitively through downstream dependencies. We currently do not support load generators because their mapping of wall-clock time to definite times is implementation-specific.
Some objects are valid for all times and thus can be queried at all times (subscribes support AS OF/UP TO to carve out a user-supplied time range). We don’t enable expiration for such objects.
Experiencing it
We enabled the feature across all Materialize environments. We observed that it dropped overall memory utilization by between 20% and 50%, with no impact on availability or correctness!
But this doesn’t mean we’ve fully solved the underlying problem. We still need to frequently revisit indefinite updates, although likely a much smaller number of them. We have some ideas for how to solve the issue, but it is a fundamentally difficult problem because times are only partially ordered. Also, this feature is not yet supported for self-managed deployments due to its dependency on scheduled restarts.
Appendix: Showing updates for constant collections
Constant collections have the nice property that they’re defined for all times. Subscribing to a select over a constant collection allows us to see the updates Materialize applies over time (enable show diffs in the console):
SUBSCRIBE TO (
SELECT content, sale_ts::timestamptz
FROM (
VALUES
(
'hello',
'3024-12-17'::mz_timestamp
),
(
'future',
('3024-12-17'::timestamp + '20 days')::mz_timestamp
)
) AS ticket_sales(content, sale_ts)
WHERE mz_now() >= sale_ts AND mz_now() < (sale_ts::timestamp + '24 hours'::interval)::mz_timestamp
);
This query results in similar updates to what I’m presenting above.