Memory Efficiency With Arena Allocators
If you’ve ever heard someone ask about lightweight desktops for older hardware, you’ve probably heard GNOME referred to as a heavyweight option.
One of my hobbies is working on resource-efficiency projects.
Last week I circled back to a project that I’d put on my list a couple of years ago but never started: gnome-software. On a typical workstation, gnome-software will often use more memory than even gnome-shell itself. It is usually the single most memory-expensive component of the GNOME desktop.
gnome-software serves several purposes. In the GNOME desktop overview, it provides search results for applications that are available but not installed. It also provides a GUI for software management. And it provides notifications to the user when there are updates available to install. Notifying users that updates are available is important to maintaining a good security posture, so disabling the application entirely isn’t a great option. But its memory use tends to cause some users to seek out more resource-friendly shells for their older hardware.
I’d observed that the gnome-shell process tended to increase in size as it handled search requests from the GNOME desktop overview, so I started by splitting the application-search functionality out of gnome-software and into its own application. Once it was a separate application, it was much easier to profile its memory use. Many profiling tools hide details that are very small relative to the whole application, so getting the GTK+ code out of the process made it easier to see where memory was being allocated.
valgrind didn’t report any leaks, so the memory allocations that increased the resident size of the process were still being tracked, not leaked. I moved on to valgrind’s massif tool to get information about where memory was allocated. The tool confirmed that there were peaks of high memory use, but it also indicated that most dynamic allocations were eventually freed. GNU libc has a malloc_trim() API that can be used to release memory that has been freed, but using it released far less memory than expected, given how little memory valgrind indicated was still in use.
This suggested that I might be looking at a problem that is common and well understood, but difficult to solve: dynamic allocations that the application managed were interspersed with allocations that were made and managed within shared libraries.
The basic problem is that memory can only be returned to the OS by
free() or malloc_trim() in relatively large, contiguous blocks.
As long as some memory within a block has not been freed, that block
cannot be released. A POSIX process typically shares an address space,
a memory allocator, and a heap with all of the shared libraries that
it uses.
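To make that concrete, here is a toy illustration (not taken from the gnome-software investigation, just a sketch of the effect) of how a handful of surviving allocations can pin a large amount of freed memory:

#include <malloc.h>
#include <stdlib.h>

#define COUNT (64 * 1024)

int
main (void)
{
  static void *blocks[COUNT];

  /* Fill the heap with small allocations.  */
  for (int i = 0; i < COUNT; i++)
    blocks[i] = malloc (512);

  /* Free everything except one allocation in eight.  With 512-byte
     chunks, roughly one surviving allocation lands on every page, so
     even though ~87% of the memory has been freed, malloc_trim()
     finds almost no whole pages it can return to the OS.  */
  for (int i = 0; i < COUNT; i++)
    if (i % 8 != 0)
      free (blocks[i]);

  malloc_trim (0);
  /* Resident size stays high; check /proc/self/status to see it.  */
  return 0;
}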
Sometimes the easiest way to solve this problem is to use fork() to
create a new process that can handle a request, and then exit that
process when it’s done, which will reliably release any memory
allocated by the process and its shared libraries. But that isn’t a
good option if there’s expensive setup for the first request, because
forking for every request would mean repeating that expensive setup
each time.
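A rough sketch of that fork-per-request pattern, for contrast (handle_request() here is a hypothetical stand-in for the real work):

#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void handle_request (int fd);   /* hypothetical request handler */

static void
process_request_in_child (int request_fd)
{
  pid_t pid = fork ();
  if (pid == 0)
    {
      /* Everything the child allocates, including allocations made
         inside shared libraries, is returned to the OS when it exits.  */
      handle_request (request_fd);
      _exit (EXIT_SUCCESS);
    }
  else if (pid > 0)
    waitpid (pid, NULL, 0);
}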
What we really want is an arena allocator that can keep the memory the application doesn’t manage in its own contiguous region. That would allow shared libraries to allocate memory in a way that doesn’t spread untracked allocations through the application’s main heap.
As I pondered that idea, I remembered… glibc does have an arena allocator. It uses per-thread arenas to reduce lock contention during allocation in threaded applications. And I wondered: how difficult would it be to expose that to applications, so they could hint that they want allocations to go to a different memory pool?
Such an API should be very simple. There should be a function to request a new arena, and there should be a function to swap the current arena for a new one. An application could then allocate a new arena for shared libraries that are known to allocate memory, and it could swap memory arenas before and after making calls into such a shared library.
The idea was simple, but I wasn’t familiar with the design and architecture of glibc. So I described the API that I wanted to add, and asked Claude to implement that API in glibc’s malloc, consistent with the coding standards used in the library.
Before diving into the implementation details, let’s visualize the problem and solution.
Understanding the Problem
To visualize why arena segregation matters, consider how memory allocations are typically distributed.
Standard glibc: Interleaved Allocations
Without an arena API, all allocations go to the main arena. Library allocations (red) are scattered throughout, interleaved with application allocations (green). Even with a relatively small number of allocations by shared libraries, most memory pages contain at least one library allocation.
After Freeing App Memory (Standard glibc)
Even though the application frees 95% of the memory, each page still contains at least one library allocation (red). Since the OS can only reclaim entire pages, none of this memory can be returned.
Using the API
The API is designed to be simple and lightweight:
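Roughly, the surface is just two functions (the names below follow the example later in this post; the exact signatures in the prototype may differ):

#include <malloc.h>

/* Create a new, initially empty arena.  Returns NULL and sets errno
   on failure.  */
arena_hd *malloc_new_arena (void);

/* Make ARENA the preferred arena for the calling thread's allocations
   and return the previously preferred arena, so it can be restored.  */
arena_hd *malloc_swap_thread_arena (arena_hd *arena);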
The typical pattern, sketched in code after this list, is:
- Create a dedicated arena once during initialization
- Attach the arena before calling library functions
- The library’s allocations go to the dedicated arena
- Restore the previous arena after the call returns
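In code, with the same illustrative names, the pattern is a swap before the library call and a swap back afterward:

extern void some_library_function (void);   /* hypothetical library call */

static arena_hd *lib_arena;

static void
init_lib_arena (void)
{
  /* Create the dedicated arena once, at startup.  */
  lib_arena = malloc_new_arena ();
}

static void
call_into_library (void)
{
  /* Attach the dedicated arena, call the library, then restore.  */
  arena_hd *previous = malloc_swap_thread_arena (lib_arena);
  some_library_function ();
  malloc_swap_thread_arena (previous);
}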
With Arena API: Segregated Allocations
The arena API segregates allocations into separate arenas, so that the allocations that the application manages are contiguous. Application allocations (green) go to the main arena, while library allocations (red) go to a dedicated library arena. There is no interleaving.
After Freeing App Memory (With Arena API)
When the application frees its memory, the main arena pages contain no active allocations. Entire pages are immediately returned to the OS. Library allocations remain active in their isolated arena.
The Development Process
Design by blog
At first I simply intended to write about arena allocators. Arena allocators are often written about as a technique to reduce the risk of memory leaks and to simplify allocation tracking: an arena allocator can free all of the allocations associated with an arena at once. Although memory reclamation problems caused by interleaved allocations are common and well understood, the usefulness of arena allocators for keeping allocations contiguous is not frequently mentioned in the references and discussions that I’ve seen.
My initial intent was merely to discuss that function of arena allocators. I described it as follows:
An application might process a data stream and dynamically allocate memory as it processes elements in that stream. If it uses a shared library as it processes each element, the shared library might also dynamically allocate memory for a private internal cache. In such a case, the heap will contain application allocations interleaved with allocations from the shared library. Even if the application reliably tracks its allocations and frees them when it finishes processing the data stream, the heap might still contain small allocations from the shared library, which prevent libc from returning memory to the operating system.
Hypothetically, a malloc implementation could allow an application to register new memory arenas. The application could then set the preferred arena for a thread to an arena dedicated to a shared library before calling that shared library’s functions, and restore the default arena on return. By segregating the arenas used by a shared library and by the rest of the process, an application could avoid having allocations that it can’t track land in its own memory arena, which would improve its ability to compact its memory.
Because the shared library’s allocations will be in a dedicated arena, the application should be able to return memory from its own arenas to the OS, reducing its resident size.
For example, the application might look something like:
#include <malloc.h>

static arena_hd *netio_hd = NULL;

static void
app_register_netio_hd (void)
{
  if (netio_hd)
    return;

  netio_hd = malloc_new_arena ();
  if (netio_hd == NULL)
    {
      /* check errno and handle allocation failure */
    }
}

static void
app_process_element (AppElement *element)
{
  arena_hd *current;

  /* Switch to a dedicated arena */
  current = malloc_swap_thread_arena (netio_hd);
  netio_process_element (element);

  /* Restore the default arena */
  malloc_swap_thread_arena (current);
}
Initial Design
In that markdown file, I had described the API I wanted. I decided to see whether Claude could help me implement it quickly, to determine whether the idea was worth pursuing.
My first prompt was detailed:
In ../malloc-blog/malloc-arenas-proposed.md I described a problem, in
which an application and a shared library might allocate memory in
interleaved pages within an arena, and suggested that libc might
expose an API that allows an application to request an extra arena and
set a preferred arena before and after calling functions in a shared
library. The current directory contains glibc, and its malloc
implementation is in the malloc directory. I believe that this
implementation uses per-thread arenas. Review this malloc API and
suggest an idiomatic API extension that would allow an application to
request an arena and set a preferred arena for the current thread. The
API will need tests, and I'd also like a demo application consisting
of a main application and a simple shared library that demonstrates
the new API. It should allocate around 200MB of memory total, at 512
bytes per allocation, mostly in the application code but with some
allocations in the library. Once the memory is allocated, the program
should print stats about its memory use including its resident
size. Then it should free the allocations from the application but not
the library and print stats again. Prioritize consistency with the
programming style in this codebase.
The first implementation looked pretty good at first glance, but
failed to build. Claude was able to work through the build failures,
determine that the problem was that it had defined functions that
should have been public API with libc_hidden_def macros, and
correct the problem.
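For readers unfamiliar with glibc’s conventions: exported functions normally also get a hidden internal alias, declared with libc_hidden_proto in an internal header and paired with libc_hidden_def at the definition, so internal callers avoid the PLT. A rough sketch of that pattern (illustrative only, not the actual patch):

/* Internal header (e.g. include/malloc.h): declare the hidden alias.  */
libc_hidden_proto (malloc_new_arena)

/* malloc/arena.c: the public definition, followed by the matching
   hidden alias.  */
arena_hd *
malloc_new_arena (void)
{
  /* ... create and initialize the new arena ... */
}
libc_hidden_def (malloc_new_arena)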
Once the library and the demo compiled successfully, I was able to run the demo and compare the results. Unfortunately, resident memory use was basically the same with a standard glibc and with the version that used the new API.
The initial memory information wasn’t very detailed, but I knew that
glibc supported a malloc_stats() function that might give me more
information.
This works but each verion of the demo app we've tried shows no
significant difference between the version with the new API and the
version without the new API. Can you add malloc_stats and we'll see if
that provides any hints
The new build produced information that indicated that the new API was successfully creating a new memory arena, but that no allocations were expanding its size.
I reviewed the new malloc_arena_new() function and found that it was
nearly identical to _int_new_arena(), which I had read about in
reference material beforehand. I examined the differences closely and
determined that it was initially attempting to allocate an incorrect
size; however, that didn’t seem likely to be the cause. One thing that
I was less sure how to handle was arena ownership. There was code in
the initial implementation that handled reference counting and free
lists, which looked appropriate for normal arena handling but might
result in an arena being released and removed in the intended use
pattern. In the design I’d proposed, a single thread would own
multiple arenas.
I told Claude to re-sync with the changes I’d made, and to suggest appropriate handling of reference counting and free list handling:
I've made some changes to malloc_arena_new to make it more consistent
with _int_new_arena. The arena that we're creating with this API is
intended to be used in the current thread and temporarily swapped
while calling a library. So I think this API should avoid some of the
free list accounting normally associated with changing an arena. When
the arena is initially created, it should appear to be attached to one
thread even though it isn't. And when it is attached with
malloc_arena_attached, the free lists shouldn't be changed, nor should
the atttached thread count. Basically, one thread is using both arenas
concurrently.
Claude updated the API, removing the sections that I suspected did not belong but that I didn’t understand well enough to adjust on my own.
I rebuilt glibc and the demo app.
Still, no dice. The demo app was still allocating memory in the main arena, even though all available debugging information confirmed that the thread_arena pointer was being updated to reference the new arena.
I prompted Claude again:
OK, with the current state of malloc/ and demo/, there are no signs
that the new arena is being used. malloc_stats still shows a second
arena, but basically no utilization. system bytes = 167936 and in use
bytes = 2160. The library's call to malloc appears to be using the
main arena. Check over the malloc implementation to see how it selects
an arena. Maybe setting thread_arena is insufficient?
Claude pondered the malloc code further and found that single-threaded applications bypassed the arena selection logic. That makes sense, of course, since the per-thread arena feature is intended to reduce lock contention in threaded applications. If this feature were adopted, I’d want to enable arena selection when a new arena was created, but for initial implementation purposes, Claude simply created a temporary thread in the demo app.
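The workaround amounts to something like the following (a sketch of the idea, not the exact demo code): create and join a do-nothing thread, so that glibc treats the process as multi-threaded and malloc stops taking the single-threaded shortcut.

#include <pthread.h>

static void *
noop_thread (void *arg)
{
  return arg;
}

static void
force_multithreaded_malloc (void)
{
  /* Once a second thread has existed, malloc goes through its normal
     arena-selection path, so the swapped-in arena is actually used.  */
  pthread_t tid;
  if (pthread_create (&tid, NULL, noop_thread, NULL) == 0)
    pthread_join (tid, NULL);
}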
With that, the demo app started working!
There were some minor inconsistencies in the output from the demo application, so I reorganized some of the reporting code, which finished the initial work on the feature.
Does This Idea Really Work?
Theory is one thing, but does arena segregation actually solve the memory fragmentation problem in practice?
The demo allocates 200MB total: 190MB for the application and 10MB in a library, using interleaved 512-byte allocations. This creates realistic fragmentation where library allocations are scattered throughout memory.
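Based on that description, the allocation phase of the demo looks roughly like this (a reconstruction for illustration; lib_alloc() stands in for the demo’s shared-library allocation, and the real demo may differ):

#include <stdlib.h>

enum { ALLOC_SIZE = 512 };

extern void *lib_alloc (size_t size);   /* hypothetical library allocator */

/* Interleave ~190MB of application allocations with ~10MB of library
   allocations: one library allocation for every nineteen application
   allocations, so library chunks end up scattered through the heap.  */
static void
allocate_interleaved (void **app, size_t app_count,
                      void **lib, size_t lib_count)
{
  size_t li = 0;
  for (size_t ai = 0; ai < app_count; ai++)
    {
      app[ai] = malloc (ALLOC_SIZE);
      if (ai % 19 == 0 && li < lib_count)
        lib[li++] = lib_alloc (ALLOC_SIZE);
    }
}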
Results from running the demo:
Without arena API:
- After all allocations: RSS = 206MB
- After freeing app memory: RSS = 205MB (minimal reduction)
- After malloc_trim(): RSS = 96MB (still high)
In the demo, malloc_trim is able to find some areas large enough to return to the OS, but a significant amount of memory remains resident.
With arena API:
- After all allocations: RSS = 206MB
- After freeing app memory: RSS = 15MB
- After malloc_trim(): RSS = 15MB
The difference is dramatic: with arena segregation, the application
can reclaim essentially all of its 190MB of allocations, while the
library’s 10MB remains in use. Without library allocations hidden from
the application interleaved through its heap, the application is able
to release memory from its main arena. In the case of the demo, the
application doesn’t even need to call malloc_trim(): once it frees its
allocations, the free space at the top of the main heap exceeds
glibc’s trim threshold and free() returns it to the OS automatically.
Broader Implications
This kind of exploration—prototyping a new API in a complex codebase to validate an architectural idea—would have been difficult to justify without AI assistance. The learning curve for glibc’s malloc is steep: understanding arena management, thread-local storage, optimization paths, and symbol versioning all at once is a significant investment. Without assistance, the time required to explore the idea would have been too high a barrier, especially with no way to estimate the chance of a useful result.
Using Claude Code allowed me to explore, in just a couple of days over a weekend, an idea that could otherwise have taken weeks or months. With Claude, I could focus on the problem I wanted to solve while getting guidance on implementation details.
The result is a working implementation that demonstrates both the problem and the solution, ready for consideration by the glibc maintainers.
Next Steps
The proof-of-concept demonstrates that arena segregation can dramatically reduce memory fragmentation when application and library allocations are interleaved. The next step is to propose this API to the glibc community and gather feedback on the design and implementation.
If accepted, this simple API could help applications like gnome-software reduce their memory footprint significantly, making GNOME more viable on resource-constrained systems. And beyond GNOME, any long-running application that loads shared libraries with different allocation patterns could benefit from this approach.
The demo code and implementation are available in my glibc fork on Codeberg for anyone interested in experimenting with the API or understanding the fragmentation problem in more detail.