Post

Fizz Buzz Pro on Elixir

Fizz Buzz Pro on Elixir

This post is inspired by the following StackExchange CodeGolf thread. For the uninitiated, Fizz Buzz is the programming equivalent of a “Hello World” - a simple task used in job interviews to filter out people who can’t code their way out of a paper bag.

The challenge:

Write a program that prints numbers from 1 to a very high number. For multiples of 3, print “Fizz” instead. For multiples of 5, print “Buzz”. For multiples of both, print “FizzBuzz”.

The output must go on forever (or at least to 2^58) without halting prematurely.

Simple, right? A 2-minute problem at most.

I spent 2 weekends on this.

The baseline: C does it easily

Here’s your standard C implementation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#include <stdio.h>

int main() {
    for (int i = 1; i < 1000000000; i++) {
        if ((i % 3 == 0) && (i % 5 == 0)) {
            printf("FizzBuzz\n");
        } else if (i % 3 == 0) {
            printf("Fizz\n");
        } else if (i % 5 == 0) {
            printf("Buzz\n");
        } else {
            printf("%d\n", i);
        }
    }
}

Compile it, run it, done. 65-70 MiB/s on my 2019 Intel MacBook Pro.

“This should be easy to port to Elixir,” I thought, like a fool.

The humbling: Elixir’s first attempt

1
2
3
4
5
6
7
8
9
10
11
defp reply(n) do
  case {rem(n, 3), rem(n, 5)} do
    {0, 0} -> "FizzBuzz\n"
    {0, _} -> "Fizz\n"
    {_, 0} -> "Buzz\n"
    {_, _} -> "#{n}\n"
  end
end

Stream.map(lower..upper, &reply/1)
|> Enum.into(IO.binstream)

To measure throughput:

1
2
MIX_ENV=prod mix escript.build
./fizzbuzz 1 1000000000 | pv > /dev/null

Result: 2-3 MiB/s.

Two. To. Three.

That’s not a typo. The Elixir version was running at roughly 3% of the C version’s speed. For a counting game.

And thus began my descent into madness.

Optimization #1: Stop screaming into the void one number at a time

The obvious problem: we’re making an IO call for every single number. A billion numbers means a billion IO calls. Each IO call has overhead. We’re basically knocking on the operating system’s door a billion times asking “can you print this one number please?”

The OS, understandably, is not amused.

Solution: batch the output into chunks.

1
2
3
Stream.map(lower..upper, &reply/1)
|> Stream.chunk_every(6000)
|> Enum.into(IO.binstream)

Instead of printing one number at a time, we now collect 6000 results and print them all at once.

Result: 18 MiB/s

We just 6x’d our performance by adding one line. This is the kind of win that makes you feel smart for about 30 seconds before you realize you’re still 4x slower than C.

Optimization #2: Put those cores to work

Modern computers have multiple cores. My laptop has 12 threads. Currently, 11 of them are sitting idle watching one core do all the FizzBuzz work.

The plan: divide the input range into chunks, process them in parallel, then print results in order.

For input 1..1000, we create sub-ranges: [1..100, 101..200, 201..300, ..., 901..1000]

1
2
3
4
5
6
7
8
9
10
11
defp get_input_ranges(lower, upper, chunk_size) do
  if chunk_size >= 10 do
    if lower >= upper do
      []
    else
      [lower..min(lower+chunk_size, upper) | get_input_ranges(min(lower+chunk_size, upper) + 1, upper, chunk_size)]
    end
  else
    [lower..upper]
  end
end

But what’s the right chunk size?

Too small: We spin up thousands of tasks that do almost nothing. The overhead of managing tasks exceeds the work itself.

Too big: Each task holds its results in memory while waiting to print. Giant chunks = giant memory usage.

Here’s the thing about concurrency: a task can’t just print whenever it wants. Task #47 has to wait for tasks #1-46 to print first, or we’ll get FizzBuzz soup. So every task sits there, holding its results hostage, waiting for its turn.

Image1

Image2 FIGURE 2: Concurrent execution without buffering - chaos

Image3 FIGURE 3: Concurrent execution with buffering - controlled chaos

The magic formula I landed on:

1
chunk_size = min(div(upper - lower, System.schedulers_online), 6000)

This gives us chunks that are:

  • Small enough to not explode memory
  • Big enough to be worth the task overhead
  • Conveniently sized for IO batching (remember our 6000-item chunks from earlier?)

Optimization #3: Task.async_stream to the rescue

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
defmodule Fizzbuzz do
  def fizzbuzz(enumerable) do
    Stream.map(enumerable, &reply/1)
    |> Stream.chunk_every(6000)
    |> Enum.into([])
  end
end

defmodule Fizzbuzz.Cli do
  def main([lower, upper]) do
    {lower, upper} = {String.to_integer(lower), String.to_integer(upper)}
    chunk_size = min(div(upper - lower, System.schedulers_online), 5000)
    input_enumerable = get_input_ranges(lower, upper, chunk_size)
    stream = Task.async_stream(input_enumerable,
              fn range -> Fizzbuzz.fizzbuzz(range) end,
              timeout: :infinity)
    Enum.reduce(stream, :ok,
      fn {:ok, res}, _acc -> res |> Enum.into(IO.binstream) end
    )
  end
end

Quick note on iolists: In Elixir, concatenating lists is expensive. But we can cheat using iolists - special lists that can contain strings, bytes, or other iolists. IO operations flatten them automatically. It’s like telling the printer “here’s a box of boxes of papers” and it just figures it out.

Result: 40 MiB/s

We’re now faster than half of C. Progress!

Optimization #4: GenServer over Task (reduce data ping-pong)

Here’s the problem with Task: it computes results, sends them back to the main process, which then sends them to the IO process. That’s two hops for every chunk of data.

1
Worker Task → Main Process → IO Process

What if workers could talk directly to IO?

With GenServer, we can tell a worker “compute now, print later” - and when we say print, it sends directly to IO. One hop instead of two.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
defmodule Fizzbuzz.Worker do
  use GenServer

  def init([range]) do
    send(self(), {:calculate, range})
    {:ok, []}
  end

  def handle_info({:calculate, range}, _state) do
    res = Fizzbuzz.fizzbuzz(range)
    {:noreply, res}
  end

  def handle_call(:print, _from, results) do
    results |> Enum.into(IO.binstream)
    {:reply, :ok, []}
  end
end

The driver spawns workers, waits for them to compute, then tells them to print in order:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
def main([lower, upper]) do
  {lower, upper} = {String.to_integer(lower), String.to_integer(upper)}
  chunk_size = min(div(upper - lower, System.schedulers_online), 6000)
  input_enumerable = get_input_ranges(lower, upper, chunk_size)

  Task.async_stream(input_enumerable,
    fn input -> elem(GenServer.start_link(Fizzbuzz.Worker, [input]), 1) end,
    timeout: :infinity)
  |> Stream.map(fn {:ok, res} -> res end)
  |> Stream.each(fn pid ->
      GenServer.call(pid, :print)
      Process.exit(pid, :kill)  # You've served your purpose. Goodbye.
    end)
  |> Stream.run()
end

Result: 80 MiB/s

We’ve now surpassed C. But we’re just getting started.

Optimization #5: String interpolation is secretly expensive

Look at this innocent line:

1
{_, _} -> "#{n}\n"  # <- This guy right here

String interpolation creates a new string every time. For a billion numbers, that’s a billion string allocations. We can use iolists instead:

1
2
3
4
5
6
7
8
defp reply(n) do
  case {rem(n, 3), rem(n, 5)} do
    {0, 0} -> "FizzBuzz\n"
    {0, _} -> "Fizz\n"
    {_, 0} -> "Buzz\n"
    {_, _} -> [Integer.to_string(n), "\n"]  # iolist, not string
  end
end

“This can’t possibly matter that much,” I thought.

Result: 150 MiB/s

It mattered. Almost 2x improvement from changing one line. The reply function gets called a lot.

Optimization #6: Don’t materialize 48 trillion elements

The problem statement says we need to handle up to 2^58 numbers. Let’s try it:

1
./fizzbuzz 1 288230376151711744

The program hangs.

Why? Our get_input_ranges function returns a list. For 2^58 with 6000-sized chunks, that’s 48 trillion list elements. We’d need 48,000+ GB of RAM just to store the list of ranges, let alone process them.

Solution: streams. Generate ranges lazily, one at a time.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
defmodule ChunkRangeStream do
  def create(range, chunk_size) do
    Stream.resource(
      fn -> initialize(range, chunk_size) end,
      &generate_next_value/1,
      &done/1
    )
  end

  defp initialize(range, chunk_size), do: {range, chunk_size, range.first}

  defp generate_next_value({range, chunk_size, lower}) do
    if lower < range.last do
      next_range = lower..min(lower+chunk_size, range.last)
      new_lower = min(range.last, lower+chunk_size+1)
      {[next_range], {range, chunk_size, new_lower}}
    else
      {:halt, {range, chunk_size, lower}}
    end
  end

  defp done(_), do: nil
end

Now we generate ranges on-demand. Memory crisis averted.

Optimization #7: Exploit the pattern

Here’s a shower thought: FizzBuzz has a period of 15.

1
1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz

Then it repeats, just with bigger numbers. We can hardcode the pattern and just fill in the numbers:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def fizzbuzz6k(range) do
  i = range.first - 1
  0..399
  |> Stream.map(fn j ->
    [[Integer.to_string(15*j + 1 + i), "\n"],
     [Integer.to_string(15*j + 2 + i), "\n"],
     "Fizz\n",
     [Integer.to_string(15*j + 4 + i), "\n"],
     "Buzz\nFizz\n",
     [Integer.to_string(15*j + 7 + i), "\n"],
     [Integer.to_string(15*j + 8 + i), "\n"],
     "Fizz\nBuzz\n",
     [Integer.to_string(15*j + 11 + i), "\n"],
     "Fizz\n",
     [Integer.to_string(15*j + 13 + i), "\n"],
     [Integer.to_string(15*j + 14 + i), "\n"],
     "FizzBuzz\n"]
  end)
  |> Enum.chunk_every(400)
end

No more modulo operations. No more conditionals. Just math and string conversion.

For this to work, we need to split our input into three parts:

  • Beginning: Numbers before the first multiple of 15 (processed normally)
  • Middle: The bulk of the work, in 6000-number chunks (processed with the pattern)
  • End: Leftover numbers (processed normally)

Result: 175 MiB/s

Optimization #8: Stop copying gigabytes of data

In Erlang/Elixir, messages between processes are copied. Our GenServer computes an iolist, sends it to IO process = one full copy.

But there’s an exception: refc binaries (strings > 64 bytes) are passed by reference, not copied.

So if we convert our iolist to a binary before sending…

1
2
3
4
5
6
7
8
9
def handle_info({:calculate, range}, _state) do
  res = Fizzbuzz.fizzbuzz(range)
  {:noreply, res |> :erlang.iolist_to_binary}  # Convert to binary
end

def handle_call(:print, _from, results) do
  IO.binwrite(results)  # Now passing a reference, not copying
  {:reply, :ok, []}
end

For a billion numbers, that’s 166,666 chunks. Each chunk was being fully copied. Now it’s just passing pointers.

Result: 230 MiB/s

Since IPC is no longer the bottleneck, we can now use bigger chunks and pass even more data per worker.

Result: 300+ MiB/s

Optimization #9: Erlang’s IO has too many features

Erlang’s standard IO is designed for distributed systems. When you call IO.puts, it:

  1. Ensures atomic ordering (no interleaved output from concurrent calls)
  2. Forwards to a “group leader” process (for distributed environments)
  3. Handles all sorts of edge cases we don’t care about

We’re not distributed. We’re not concurrent on IO. We just want to blast bytes to stdout.

Enter: ports. Direct communication with the OS, no middleman.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Open a port to stdout
port = :erlang.open_port({:fd, 0, 1}, [:binary, :out])

# Hand control to a worker
send(port, {self(), {:connect, worker_pid}})

# Worker writes directly
send(port, {self(), {:command, results}})

# Hand control back to main
send(port, {self(), {:connect, main_pid}})

# Close when done
send(port, {self(), :close})

Result: 550+ MiB/s

We’ve removed IO as the bottleneck. The only limit now is how fast we can generate numbers.

Optimization #10: When in doubt, Rust

At this point, I’d squeezed everything I could from pure Elixir. Time to bring in the heavy artillery.

Using Rustler, I ported the number-crunching to Rust and called it as a NIF (Native Implemented Function).

Result: 3 GiB/s

That’s a 6x improvement just from recompiling the hot path in Rust. The Elixir code still handles orchestration, concurrency, and IO - Rust just does the math faster.

The scoreboard

OptimizationThroughputImprovement
Naive Elixir2-3 MiB/sBaseline
Chunked IO18 MiB/s6x
Parallel tasks40 MiB/s2x
GenServer80 MiB/s2x
Iolist optimization150 MiB/s1.9x
Pattern exploitation175 MiB/s1.2x
Refc binaries300+ MiB/s1.7x
Raw ports550+ MiB/s1.8x
Rust NIF3 GiB/s5.5x

Total improvement: ~1000x over naive Elixir. ~46x over naive C.

All for a children’s counting game.

What I learned

  1. IO is usually the bottleneck. Batch your writes.
  2. Concurrency helps, but coordination costs. Choose your chunk sizes wisely.
  3. Data copying is expensive. Use refc binaries when passing large data between processes.
  4. Standard library IO is feature-rich. Sometimes you just need a raw pipe.
  5. Know when to drop to native code. Elixir is great for orchestration; Rust is great for math.
  6. Premature optimization is the root of all evil, but mature optimization is pretty fun.

Was any of this practical? Absolutely not. Would I do it again? In a heartbeat.

Code is available on GitHub.

Credits

All the amazing folks at ElixirForum who answered my increasingly unhinged questions: https://elixirforum.com/t/question-regarding-improving-fizzbuzz-io-performance/56816

This post is licensed under CC BY 4.0 by the author.