Awk performs malloc() when accessing arrays?

I've written an awk script that shouldn't, in theory, take up much memory, but in practice it crashes after running out of virtual memory on my server (something like 800GB). It stores about 3200 strings in an associative array, but that part doesn't seem to take much memory. It's the subsequent accessing of the array elements that causes the memory footprint to soar. The following script illustrates this behaviour:

BEGIN { idx = 1 }

# Store the first 800 words
NR < 801 { word[idx++] = $1 }

# Now test whether accessing a stored array value increases
# the memory load
NR > 800 {
    for (i=1; i<800; i++)
        printf("%d\t%d\t%s\n", NR, i, word[i]);
}

If one calls this 'test.awk' and runs it on the built-in dictionary (awk -f test.awk /usr/share/dict/words), all it does is store the first 800 words of the dictionary file in a simple array, and this doesn't take much memory. But the second part just keeps printing the damn things over and over again, and this starts seriously running up the memory requirements. I watch the process grow in memory using Activity Monitor and don't get it. In contrast, if you replace "word[i]" in the printf() statement with "duh" (a constant string), the memory profile is perfectly flat over time. So it is something about accessing the array element that is costing memory (using malloc()s).

Can anyone explain this to me? It's ruining an otherwise reasonable script and doing my head in.

Thanks, in advance.
 
You've definitely found something -- running valgrind on awk with your example and again on GNU awk (from MacPorts) shows that sure enough awk leaks memory like a sieve, while GNU awk doesn't.

It might be worth filing a Bug Report with Apple (not that you'll ever hear back but it might at least call their attention to it and get it fixed in a future release).

Consider installing GNU awk, which is not only not afflicted by this apparent bug, but also more powerful. As a quick fix, if you don't want to install GNU awk right now, you can modify your code as shown below, and even the Apple-provided awk doesn't leak memory with this. There's one major caveat -- you're not guaranteed to have your array items in any particular order, so it may not be suitable for what you want to accomplish:

Code:
# Store the first 800 words 
NR < 801 { word[NR] = $1 }

# Now access a stored array value without a memory leak
NR > 800 {
    for (i in word)
        printf("%d\t%d\t%s\n", NR, i, word[i]);
}
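
If the order does turn out to matter, one possible workaround is to keep the "for (i in word)" loop and let sort put the rows back in order afterwards. This is only a rough sketch: the five-word sample file and the cutoff of 3 (instead of 800) are made up to keep it small, and I haven't checked how it behaves memory-wise on the Apple awk with a large input.

```shell
# Hypothetical sample input: five words, one per line
tmp=$(mktemp)
printf '%s\n' alpha bravo charlie delta echo > "$tmp"

# Store the first 3 words (instead of 800), dump the array for every
# later line, then let sort(1) order the rows by NR and then by index.
sorted=$(awk '
NR < 4 { word[NR] = $1 }
NR > 3 {
    for (i in word)
        printf("%d\t%d\t%s\n", NR, i, word[i])
}' "$tmp" | sort -k1,1n -k2,2n)

rm -f "$tmp"
printf '%s\n' "$sorted"
```

The sort happens outside awk, so the script itself still only uses the "for (i in word)" form; the cost is an extra pass over the output.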
 
Thanks, that's extremely helpful. Both fixed the memory bug, and order wasn't important. Interestingly, the Mac's native awk ran 4x faster than my newly compiled gawk on the same code and the same data.
 