Introduction
Recently, we updated Katapult to provide more granular log entries when installing custom disk templates. This has been particularly useful for debugging purposes. While implementing the feature, we spotted an opportunity to improve how log entries were queried from the database and rendered on the webpage.
Traversing hierarchical data
Our log entries are stored in a tree-like hierarchy, where each entry can have many child entries:
- Parent log entry
- Child log entry
- Another child log entry
- Grandchild log entry
- Great grandchild log entry
[...]
- Third child log entry
- Another grandchild log entry
- [...]
Each log entry has a parent_id
except for the root log entry, where the parent_id
is nil
. While the nesting depth is theoretically unlimited, in practice, it rarely exceeds 10 levels. However, there can be hundreds of child log entries.
The main challenge? Each log entry knows its direct parent but not its grandparent, great-grandparent, or all of its descendants—at least not without traversing the tree. Similarly, fetching all immediate children is straightforward, but retrieving an entire hierarchy efficiently is another story (again without traversing the tree).
N+1 queries
Initially, our page loaded the first log entry and then fetched its children with a separate database query. For each child, we would perform yet another query to fetch its children, and so on—repeating this process until there were no more entries to retrieve.
This approach worked, but it was quite inefficient. It led to an N+1 query problem, where potentially hundreds of queries were executed per page load (depending on the number of log entries). It's worth pointing out that none of the queries were slow, but the volume of them created unnecessary database load. Since access to this page was previously limited to admin users, the impact was minor—but as we expanded logging capabilities, a more scalable solution was needed.
What is a Recursive Common Table Expression?
Traversing tree-like data is a perfect use-case of a Recursive Common Table Expressions (CTEs).
A Recursive CTE allows a single query to traverse a hierarchical structure efficiently by referencing itself. It will repeatedly query subsets of the rows that are returned until it's collected the entire set (eliminating the need for multiple queries).
We like to use MariaDB documentation for a deeper dive into how they work. CTEs are a standard SQL feature, so you'll also find Postgres supports them too.
Show me the code
We use Ruby on Rails with ActiveRecord, which doesn't have a built-in CTE syntax. While you could write this query using Arel, raw SQL is often simpler and more readable. And the code will look very similar to what you'll find in the CTE docs for your database engine of choice.
def self.with_children_for(parent_entry)
return LogEntry.none if parent_entry.nil?
query = <<-SQL.squish
SET STATEMENT max_statement_time=3 FOR
WITH RECURSIVE log_entry_parents AS (
SELECT *, cast(JSON_ARRAY(id) as CHAR(255)) as parent_ids
FROM log_entries
WHERE id = '#{parent_entry.id}'
UNION ALL
SELECT l.*, JSON_ARRAY_APPEND(parent_ids, '$', l.id)
FROM log_entries l
INNER JOIN log_entry_parents parent ON l.parent_id = parent.id
WHERE NOT JSON_CONTAINS(parent_ids, l.id)
)
SELECT * FROM log_entry_parents ORDER BY parent_id ASC, id ASC
SQL
result = ActiveRecord::Base.connection.exec_query(query)
result.map { |e| LogEntry.new(e.except('parent_ids')) }
end
At first glance, this query might look complicated, but it's really just three SELECT statements working together.
Defining the Recursive CTE
The CTE "log_entry_parents" acts as a "working table" that progressively collects rows.
The Anchor query - finding the root entry
The first select is known as the "anchor" and needs to retrieve the top of the hierarchy:
Anchor query
SELECT *, cast(JSON_ARRAY(id) as CHAR(255)) as parent_ids
FROM log_entries
WHERE id = '#{parent_entry.id}'
It finds the root log entries and stores their IDs as an array in "parent_ids". This helps prevent infinite loops in cases where, due to an unexpected issue, a root log entry's parent_id mistakenly points to one of its child entries. This should never happen but better safe than sorry.
The Recursive query - fetching child entries
Recursive query
SELECT l.*, JSON_ARRAY_APPEND(parent_ids, '$', l.id)
FROM log_entries l
INNER JOIN log_entry_parents parent ON l.parent_id = parent.id
WHERE NOT JSON_CONTAINS(parent_ids, l.id)
This query retrieves all child log entries associated with a given parent.
- First run: Joins on the results from the anchor query, finding direct children.
- Subsequent runs: Joins on its own most recent results, fetching deeper levels in the hierarchy.
- Repeats until no more child entries exist.
Each iteration appends the child’s id to the "parent_ids" column, ensuring no duplicates and preventing infinite loops.
Final query - retrieving the full result set
Once the recursive query has gathered all log entries, the final step is to retrieve them all from the working table:
SELECT * FROM log_entry_parents ORDER BY parent_id ASC, id ASC
Is it performant?
Our log entries database table contains over 30 million rows, yet this query consistently runs in less than 10ms.
One final tip
The eagle-eyed will have spotted this at the top:
SET STATEMENT max_statement_time=3 FOR ...
The SET STATEMENT command isn't specific to CTEs. It's a useful escape hatch in case of unexpected performance issues. It means that if the query takes longer than 3 seconds it will be automatically aborted. This limit is only applied for the duration of the statement, so it doesn't impact any other queries.
Summary
If you're working with hierarchical data and need an efficient way to traverse it in a single query, a Recursive Common Table Expression is worth considering. It simplifies query logic, reduces load on your database, and improves performance—making it a powerful tool for handling tree-structured data.
Share this article
About the author
Paul S
Paul is a software engineer for Krystal, he works on Krystal's cloud hosting platform Katapult. Outside of work he enjoys music, football, gardening and spending time with his kids.