The last article on Cajal set up a lot of important background, but it was rather light on Rust'y material. In this article, I'll walk through the growth phase of Cajal which should, hopefully, be a bit more interesting

Rustc version: rustc 1.8.0-nightly (fae516277 2016-02-13)

To entice you into reading, I'll be covering the mechanics that makes this happen:

Low density Cajal growth. Brown = neuron body, Teal = Dendrite, Red = Axon

Cajal Initialization

The initialization procedure is relatively straightforward, and I don't really want to spend a lot of time on it. It has three main goals:

Allocate a whole kaboodle of memory
Spread a random genome across the entire set of memory
Pre-populate a certain number of “neuron” bodies

Each Cajal struct holds a Grid, which in turn holds a vector of Page's. Pages are self-contained computation units which can be processed in parallel by Rayon. By default, each page is 256 x 256 cells wide, totaling 65,536 cells.

The size of the page is arbitrary; it should be sufficiently large so that more time is spent processing the Page than spent context-switching, but small enough that other threads aren't sitting around waiting for the last page to finish. Some very rough perf tuning shows that 256x256 is at least moderately decent.

Pages are allocated in row-major order to simplify access. Internally, each Page initializes a big slab of memory holding the Cell's. Each Cell is given a random chromosome, gate direction and firing threshold.

Finally, a small handful of cells are converted into neuron bodies and then “sprouted” so that a single iteration of axon/dendrite growth ocurrs. This simplifies later growth processing since the code won't need to deal with the neuron body condition.

Aaaaand that's about it. Simple, and yet my implementation is also rather slow. Whoops!

On my test server, I tried to allocate a Grid with 500 x 500 pages (~60gb of memory ) and it took fooooorever. I suspect the allocations can be optimized, and the various random number generation is probably not helping. If the random numbers are the bottleneck, the code can be made parallel fairly easily. If it's memory allocation speed…I'm not sure what to do. :)

Growth

The scene is set, our memory is allocated, the neurons have sprouted cute little axons and dendrites. Let's start growing!

Growth has three components:

Process each active cell and record it's growth pattern
Collect changes that cross Page boundaries and transfer to the appropriate pages
Apply all pending changes to update the grid

The first step, growing local neurons, can be accomplished (potentially) in parallel with Rayon:

impl Grid {

  pub fn grow_step(&mut self) -> u32 {

    self.pages.par_iter_mut()
                .weight_max()
                .for_each(|page| page.grow());
    // ...
  }
}

This uses Rayon's “parallel iterator” to easily parallelize the growth of each page. We have to explicitly tell Rayon that each computation unit is “heavy” with weight_max(), otherwise Rayon assumes each operation is very light (like copying an integer) and tries to batch them all together. Net result is that few or no threads are actually used. By setting weight_max(), we tell Rayon that each element probably needs it's own thread instead of being batched.

On my test server, that small change improved runtime of a test simulation from 40s to 5s, and leveraged all 12 threads instead of just two. Yay!

Then we use for_each to execute the grow() method on each page. Simple! Rayon will (usually) execute each of these Pages in parallel across multiple threads.

If we dive inside that grow() method, we'll see a ball of mud that performs the actual calculations:

impl Page {
  pub fn grow(&mut self) {

    // Does this Page have any active cells from last growth phase?
    // If no, just return early
    // This might be unnecessary and optimized away?
    if self.active.is_empty() == true {
        return;
    }

    // Not strictly necessary, just allows brevity later
    let mut cells = &mut self.cells;

    for index in self.active.iter() {

        // Pull out the basic metadata we need
        let (x, y) = zorder::z_to_xy(index);
        let cell_type = cells[index as usize].get_cell_type();
        let stim = cells[index as usize].get_stim();

        // Each cell has a chromosome that "points" north, south, east, west or
        // any combination of those. We need to check the four cardinal directions
        // to see if they are set
        for direction in CARDINAL_DIRECTIONS {
            if cells[index as usize].get_chromosome().contains(Chromosome::from(*direction)) {

                // If the chromosome does point in that direction, attempt to grow that way
                let change = Page::process_chromosome_direction(*direction, &mut cells, x, y,
                                self.offset_x, self.offset_y, cell_type, stim);

                // Persist the change in our local changes, or buffer for "exporting"
                match change {
                    ChangeType::Local((target, change)) => {self.changes.insert(target, change);},
                    ChangeType::Remote(change) => {self.remote_changes.push(change);},
                    ChangeType::NoChange => {}
                }
            }
        }
    }
  }
}

The first thing a Page does is check it's active list to see if there are any flagged cells that need processing. The active object is a Roaring Bitmap which compactly tracks the active cells. This is why we “sprout” our axons/dendrites during initialization; we need the active list to be populated with axons/dendrites for any growth to occur.

Side note: I love Roaring Bitmaps, but they may not be the appropriate solution in this case. Informal profiling shows Roaring as a relatively large contender for CPU time. This isn't a knock on the library at all…it's just the set of tradeoffs that the Roaring algo makes.

The Roaring paper states that Roaring (and indeed, any sort of bitmap, compressed or otherwise) are generally inappropriate if your sparsity is less than 0.1% due to memory and CPU overhead. Depending on settings, Cajal starts at 1-10% sparsity. But by halfway through the simulation, that usually drops down to < 0.1% (~65 active cells). Which means a simpler vector of changes would likely be more compact and more efficient.

So, earmarking this for future investigation. Perhaps start out with Roaring but cutover to a simple vector once sparsity falls below a threshold?

Next, we need to iterate over the active cells. Conveniently, cells are toggled “active” based on their index, meaning our iteration walks over the cells in z-order. This gives us the best chance possible for pre-fetching the right strides of memory.

Each cell has an underlying “chromosome” which points in some combination of directions (or none at all, called block). A cell grows in the directions dictated by the underlying chromosome at that point in space. So we need to iterate over the four cardinal directions and see if the chromosome “contains” that direction.

Dealing with `&mut self`

If the chromosome does “contain” that direction, we attempt to grow a new axon/dendrite in that direction. I say attempt because we may be at the edge of the Page, or the target may already be occupied.

To handle these potential outcomes, we call Page::process_chromosome_direction(), whose signature looks like so:

fn process_chromosome_direction(travel_direction: Gate,
                                cells: &mut Vec<Cell>,
                                x: u32, y: u32,
                                offset_x: u32, offset_y: u32,
                                cell_type: CellType, stim: bool) -> ChangeType {

Well, first, eww. Just lookit’ how gross that function signature is. Why so many parameters? Alas, I've been bitten by the “can't mutate self multiple times” anti-pattern. Because we've borrowed self here:

for index in self.active.iter() {
  // ...
}

We run into troubles where other methods need to access self. The root problem is that grow() needs mutable access to self, so the compiler can't guarantee (I think) that other methods won't mutate self in some way that, say, invalidates the iterator we are traversing.

Instead, I've refactored the methods into impl-private plain ol'functions. Pass in some state, get something back out. The upside is that unit testing will be simpler for these functions because they don't rely on hidden state of the Page struct. The downside is that we now need to pass around a fair amount of arguments.

But in reality, what this really means is that I have logical groupings of components that should probably be wrapped up in their own struct. E.g. x and y always travel as pairs, so perhaps add a Coord struct, etc. Or perhaps this function is trying to do too much, and the functionality needs to be partitioned in a different way.

Anyways, back on target, notice the return type: ChangeType. What's that, you say?

enum ChangeType {
    Local((u32, Cell)),
    Remote(RemoteChange),
    NoChange
}

Behold, a glorified Option! We can either grow into the new cell (Local), grow out of bounds and into a different Page (Remote), or can't grow at all because the space is occupied by something else (NoChange).

And since our plain ol'function can't modify self directly, it just returns this ChangeType and kicks the problem back upstream. That's the meat of it, although the mudball continues downward since we have to actually determine if the target is in/out of bounds, if it's an empty cell, etc. But eventually we'll bubble back up a ChangeType.

Then we just persist the change, depending on what kind of change it is (e.g. our local hashmap of changes to apply next round, or a list of remote changes which will be exported to other pages).

Collecting Remote Changes

You may be wondering “Why bother separating remote and local changes?" We need to separate these because remote changes must touch a different Page, and Page's are executed in isolation on potentially different threads. We'd prefer pages to not interact, as any cross-page interaction could induce a slowdown from coordination.

Imagine the situation where page A needs to check coordinate (10,0) on page B, to see if it can grow there. Unfortunately, page B hasn't finished processing yet and the outcome of (10,0) is presently unknown. Page A must stall until B eventually processes (10,0).

Ideally, we want all compute units to make progress independent of other pages. And it also avoids sharing memory between threads, invalidating caches local to processors, etc etc.

So what we do is grow the pages independently, then collect up all the remote changes and apply those on a single thread:

impl Grid {

  pub fn grow_step(&mut self) -> u32 {

    // ... Growth code
    // ...

    // Run back over the pages and collect their "active cell counts"
    // This value is returned so the calling code knows when to stop
    // (e.g. when count == 0)
    let active_cells = self.pages.iter()
                            .map(|page| page.get_active_cell_count())
                            .fold(0u32, |acc, x| acc + x);

    for i in 0..self.pages.len() {

        // Get the list of remote changes generated by the page...
        let changes = self.pages[i].get_remote_changes().clone();

        for c in changes {
            if !(c.x > 0 && c.x < self.dimension && c.y > 0 && c.y < self.dimension ) {
                continue;
            }

            // ... and apply them to the appropriate page
            self.get_mut_page(c.x, c.y)
                .add_change(c.x % PAGE_WIDTH, c.y % PAGE_WIDTH, c.cell, c.travel_direction, c.stim);
        }
    }
  }
}

Few interesting points here. First, we clone out the changes, but we could probably work some lifetime magic and avoid extra allocations. But remember, these changes are really just u32's wrapped in some enums, so it's not a huge burden (I don't think).

Second, we are iterating over a range instead of iterating on the pages directly. This is another location where I've run afoul with &mut self borrowing. I need to iterate over the set of pages, retrieve the changes for that page, then potentially modify one or more other pages mutably using add_change().

If I iterate over the pages directly, I'm prevented from borrowing self later because it was previously borrowed mutably in the iterator. Which makes sense, I'm only allowed mutable access to the current element, otherwise I might accidentally invalidate the entire iterator.

Alas, that means the only solution I found was to simply iterate over the range and use direct indexing.

`reduce_with` vs sequential getter

Finally, you might notice that we retrieved the remote changes via a special getter method. Rayon actually has the capability of returning values from the parallel iterator, which would be a more ergonomic solution. That was my first approach, but I ran into some performance difficulties.

My original iteration looked like this:

let (active_cells, remote_changes) = self.pages.par_iter_mut()
    .map(|page| page.grow())
    .reduce_with(|a, b| {
        let active_cells = a.0 + b.0;
        let remote_changes = match (a.1, b.1) {
            (Some(a_r), Some(b_r)) => {
                let mut t = Vec::with_capacity(a_r.len() + b_r.len());
                t.extend(a_r);
                t.extend(b_r);
                Some(t)
            },
            (Some(a_r), None) => Some(a_r),
            (None, Some(b_r)) => Some(b_r),
            (None, None) => None
        };
        (active_cells, remote_changes)
    }).unwrap();

There are some terrible things happening here, but they aren't Rayon's fault. Each page can return zero or more remote changes, and also needs to express how many cells were made active this round. So we use reduce_with to accumulate these two return values, which devolves into a relatively gross match combining Options of tuples (blargh).

A quick performance check showed that this cobbled together reduction was burning a fair amount of CPU. Ultimately, I suspect it's that vector allocation which is dooming performance. This is less than ideal anyway, but could be exacerbated by how Rayon does reductions.

My neanderthal understanding of Rayon is that it recursively divides the work to be done using its core join method, until it reaches the bottom of the work tree. Then it starts computing, and as results finish on various threads the results bubble back together and are reduced in a reverse-pairwise fashion.

I think :)

So if that's the case, then a new vector is allocated for each fork point in the computation tree, instead of just n times (one for each page) as I initially thought.

In any case, it was simple enough to rip out the reduction, earmark it for future investigation when I have time to do it properly and use a simpler getter for now. I do want to revisit this eventually. A reduction would be ideal because it forces the programmer (me) to deal with the results, instead of accidentally forgetting to call the getter.

After swapping to a simple getter, the flamechart looks like this.. The on-CPU portions are largely Roaring and Hashmap related now.

Getting back to the remote changes, the final point is that we add these to the appropriate page as if they were generated as local changes. So handling them during the next iteration is transparent.

Updating pages

Ok, so far we have generated a set of changes local to each Page then collected any remote changes that may have cross page boundaries and propagated those to the appropriate page. Now we need to actually apply those changes and update the grid.

For this we can go back to parallel execution and hand things off to Rayon again:

impl Grid {

  pub fn grow_step(&mut self) -> u32 {

    // ... Growth Code
    // ...
    // ... Remote Changes Code
    // ...

    self.pages.par_iter_mut()
            .weight_max()
            .for_each(|page| page.update());
    }
}

Because we have propagated remote changes already, pages have no dependence on each other anymore and can go about their business in parallel. Inside the update() method:

impl Page {
  pub fn update(&mut self) {
      // Clear out the active cell bitmap, and add the cells we just grew
      self.active.clear();

      for (k, v) in &self.changes {
          self.cells[*k as usize].set_cell_type(v.get_cell_type());
          self.cells[*k as usize].set_gate(v.get_gate());
          self.cells[*k as usize].set_stim(v.get_stim());
          self.active.insert(*k);
      }

      self.changes.clear();
      self.remote_changes.clear();
  }
}

Which is pretty mundane actually. Clear out our bitmap of active cells because they have all been processed, then apply each change to the grid (setting cell type, direction gate and stimulatory status). Also insert this cell into the list of active cells, since next time through the growth phase it'll need to check its surroundings for growth.

Finally do some housecleaning and we're done!

Conclusion and some perf numbers

And that is growth. Check the active cells to see if they can grow anywhere, save/propagate the changes, materialize those changes and repeat the whole process until the grid stops changing.

Building a clean benchmark is a little tricky because the processing time per page changes as the simulation progresses (due to more/less active cells). But on my Macbook Air, it takes ~200-400 ns per active cell. An average 1-3% sparse Page can expect to finish processing in 800-1200 microseconds.

On one hand that feels pretty awesome to me, as someone coming from slower dynamic languages. But on the other hand, these operations are relatively simple (bit shifts and conditionals basically), so there is likely plenty of room for optimization in the future.

For fun, here are some more graphs. This is a single page at “high density” (and yes, some of those neuron bodies are “growing”, which is a bug):

High density Cajal growth. Brown = neuron body, Teal = Dendrite, Red = Axon

And here are four pages, which demonstrates cross-page growth (the neuron in the upper left grows across all four pages total):

Low density Cajal growth across four pages. Brown = neuron body, Teal = Dendrite, Red = Axon

Note: these gifs were taken when the page size was considerably smaller, 64x64 iirc*

Next time I'll dig into the signaling mechanics which ocurr after the growth phase.

Growing a network in Cajal

Agenda

Cajal Initialization

Growth

Dealing with `&mut self`

Collecting Remote Changes

`reduce_with` vs sequential getter

Updating pages

Conclusion and some perf numbers

Growing a network in Cajal

Agenda

Cajal Initialization

Growth

Dealing with &mut self

Collecting Remote Changes

reduce_with vs sequential getter

Updating pages

Conclusion and some perf numbers

Dealing with `&mut self`

`reduce_with` vs sequential getter