The last article on Cajal set up a lot of important background, but it was rather light on Rust'y material. In this article, I'll walk through the growth phase of Cajal which should, hopefully, be a bit more interesting
Rustc version:
rustc 1.8.0-nightly (fae516277 2016-02-13)
To entice you into reading, I'll be covering the mechanics that makes this happen:
Cajal Initialization
The initialization procedure is relatively straightforward, and I don't really want to spend a lot of time on it. It has three main goals:
- Allocate a whole kaboodle of memory
- Spread a random genome across the entire set of memory
- Pre-populate a certain number of “neuron” bodies
Each Cajal struct holds a Grid, which in turn holds a vector of Page
's. Pages
are self-contained computation units which can be processed in parallel by Rayon.
By default, each page is 256 x 256 cells wide, totaling 65,536 cells.
The size of the page is arbitrary; it should be sufficiently large so that more
time is spent processing the Page
than spent context-switching, but small enough that
other threads aren't sitting around waiting for the last page to finish. Some very
rough perf tuning shows that 256x256 is at least moderately decent.
Pages are allocated in row-major order to simplify access. Internally, each Page
initializes
a big slab of memory holding the Cell
's. Each Cell
is given a random chromosome,
gate direction and firing threshold.
Finally, a small handful of cells are converted into neuron bodies and then “sprouted” so that a single iteration of axon/dendrite growth ocurrs. This simplifies later growth processing since the code won't need to deal with the neuron body condition.
Aaaaand that's about it. Simple, and yet my implementation is also rather slow. Whoops!
On my test server, I tried to allocate a Grid with 500 x 500 pages (~60gb of memory ) and it took fooooorever. I suspect the allocations can be optimized, and the various random number generation is probably not helping. If the random numbers are the bottleneck, the code can be made parallel fairly easily. If it's memory allocation speed…I'm not sure what to do. :)
Growth
The scene is set, our memory is allocated, the neurons have sprouted cute little axons and dendrites. Let's start growing!
Growth has three components:
- Process each active cell and record it's growth pattern
- Collect changes that cross
Page
boundaries and transfer to the appropriate pages - Apply all pending changes to update the grid
The first step, growing local neurons, can be accomplished (potentially) in parallel with Rayon:
impl Grid {
pub fn grow_step(&mut self) -> u32 {
self.pages.par_iter_mut()
.weight_max()
.for_each(|page| page.grow());
// ...
}
}
This uses Rayon's “parallel iterator” to easily parallelize the growth of each page.
We have to explicitly tell Rayon that each computation unit is “heavy” with weight_max()
,
otherwise Rayon assumes each operation is very light (like copying an integer) and
tries to batch them all together. Net result is that few or no threads are actually
used. By setting weight_max()
, we tell Rayon that each element probably needs
it's own thread instead of being batched.
On my test server, that small change improved runtime of a test simulation from 40s to 5s, and leveraged all 12 threads instead of just two. Yay!
Then we use for_each
to execute the grow()
method on each page. Simple!
Rayon will (usually) execute each of these Pages in parallel across multiple threads.
If we dive inside that grow()
method, we'll see a ball of mud that performs the
actual calculations:
impl Page {
pub fn grow(&mut self) {
// Does this Page have any active cells from last growth phase?
// If no, just return early
// This might be unnecessary and optimized away?
if self.active.is_empty() == true {
return;
}
// Not strictly necessary, just allows brevity later
let mut cells = &mut self.cells;
for index in self.active.iter() {
// Pull out the basic metadata we need
let (x, y) = zorder::z_to_xy(index);
let cell_type = cells[index as usize].get_cell_type();
let stim = cells[index as usize].get_stim();
// Each cell has a chromosome that "points" north, south, east, west or
// any combination of those. We need to check the four cardinal directions
// to see if they are set
for direction in CARDINAL_DIRECTIONS {
if cells[index as usize].get_chromosome().contains(Chromosome::from(*direction)) {
// If the chromosome does point in that direction, attempt to grow that way
let change = Page::process_chromosome_direction(*direction, &mut cells, x, y,
self.offset_x, self.offset_y, cell_type, stim);
// Persist the change in our local changes, or buffer for "exporting"
match change {
ChangeType::Local((target, change)) => {self.changes.insert(target, change);},
ChangeType::Remote(change) => {self.remote_changes.push(change);},
ChangeType::NoChange => {}
}
}
}
}
}
}
The first thing a Page
does is check it's active
list to see if there are any flagged
cells that need processing. The active
object is a Roaring Bitmap
which compactly tracks the active cells. This is why we “sprout” our axons/dendrites
during initialization; we need the active
list to be populated with axons/dendrites
for any growth to occur.
Side note: I love Roaring Bitmaps, but they may not be the appropriate solution in this case. Informal profiling shows Roaring as a relatively large contender for CPU time. This isn't a knock on the library at all…it's just the set of tradeoffs that the Roaring algo makes.
The Roaring paper states that Roaring (and indeed, any sort of bitmap, compressed or otherwise) are generally inappropriate if your sparsity is less than 0.1% due to memory and CPU overhead. Depending on settings, Cajal starts at 1-10% sparsity. But by halfway through the simulation, that usually drops down to < 0.1% (~65 active cells). Which means a simpler vector of changes would likely be more compact and more efficient.
So, earmarking this for future investigation. Perhaps start out with Roaring but cutover to a simple vector once sparsity falls below a threshold?
Next, we need to iterate over the active cells. Conveniently, cells are toggled “active” based on their index, meaning our iteration walks over the cells in z-order. This gives us the best chance possible for pre-fetching the right strides of memory.
Each cell has an underlying “chromosome” which points in some combination of directions
(or none at all, called block
). A cell grows in the directions dictated by the
underlying chromosome at that point in space. So we need to iterate over the
four cardinal directions and see if the chromosome “contains” that direction.
Dealing with &mut self
If the chromosome does “contain” that direction, we attempt to grow a new axon/dendrite in that direction. I say
attempt because we may be at the edge of the Page
, or the target may already be occupied.
To handle these potential outcomes, we call Page::process_chromosome_direction()
,
whose signature looks like so:
fn process_chromosome_direction(travel_direction: Gate,
cells: &mut Vec<Cell>,
x: u32, y: u32,
offset_x: u32, offset_y: u32,
cell_type: CellType, stim: bool) -> ChangeType {
Well, first, eww. Just lookit’ how gross that function signature is. Why so many parameters? Alas, I've been bitten by the “can't mutate self multiple times” anti-pattern. Because we've borrowed self here:
for index in self.active.iter() {
// ...
}
We run into troubles where other methods need to access self. The root problem
is that grow()
needs mutable access to self, so the compiler can't guarantee (I think)
that other methods won't mutate self in some way that, say, invalidates the iterator
we are traversing.
Instead, I've refactored the methods into impl-private plain ol'functions. Pass in
some state, get something back out. The upside is that unit testing will be simpler
for these functions because they don't rely on hidden state of the Page
struct.
The downside is that we now need to pass around a fair amount of arguments.
But in reality, what this really means is that I have logical groupings of components
that should probably be wrapped up in their own struct. E.g. x and y always travel
as pairs, so perhaps add a Coord
struct, etc. Or perhaps this function is
trying to do too much, and the functionality needs to be partitioned in a different
way.
Anyways, back on target, notice the return type: ChangeType. What's that, you say?
enum ChangeType {
Local((u32, Cell)),
Remote(RemoteChange),
NoChange
}
Behold, a glorified Option! We can either grow into the new cell (Local
), grow
out of bounds and into a different Page
(Remote
), or can't grow at all because
the space is occupied by something else (NoChange
).
And since our plain ol'function can't modify self
directly, it just returns
this ChangeType and kicks the problem back upstream. That's the meat of it, although
the mudball continues downward since we have to actually determine if the target
is in/out of bounds, if it's an empty cell, etc. But eventually we'll bubble
back up a ChangeType.
Then we just persist the change, depending on what kind of change it is (e.g. our local hashmap of changes to apply next round, or a list of remote changes which will be exported to other pages).
Collecting Remote Changes
You may be wondering “Why bother separating remote and local changes?" We need
to separate these because remote changes must touch a different Page
,
and Page
's are executed in isolation on potentially different threads. We'd
prefer pages to not interact, as any cross-page interaction could induce a slowdown
from coordination.
Imagine the situation where page A needs to check coordinate (10,0)
on page B,
to see if it can grow there. Unfortunately, page B hasn't finished processing yet
and the outcome of (10,0)
is presently unknown. Page A must stall until B
eventually processes (10,0)
.
Ideally, we want all compute units to make progress independent of other pages. And it also avoids sharing memory between threads, invalidating caches local to processors, etc etc.
So what we do is grow the pages independently, then collect up all the remote changes and apply those on a single thread:
impl Grid {
pub fn grow_step(&mut self) -> u32 {
// ... Growth code
// ...
// Run back over the pages and collect their "active cell counts"
// This value is returned so the calling code knows when to stop
// (e.g. when count == 0)
let active_cells = self.pages.iter()
.map(|page| page.get_active_cell_count())
.fold(0u32, |acc, x| acc + x);
for i in 0..self.pages.len() {
// Get the list of remote changes generated by the page...
let changes = self.pages[i].get_remote_changes().clone();
for c in changes {
if !(c.x > 0 && c.x < self.dimension && c.y > 0 && c.y < self.dimension ) {
continue;
}
// ... and apply them to the appropriate page
self.get_mut_page(c.x, c.y)
.add_change(c.x % PAGE_WIDTH, c.y % PAGE_WIDTH, c.cell, c.travel_direction, c.stim);
}
}
}
}
Few interesting points here. First, we clone out the changes, but we could probably work some lifetime magic and avoid extra allocations. But remember, these changes are really just u32's wrapped in some enums, so it's not a huge burden (I don't think).
Second, we are iterating over a range instead of iterating on the pages directly.
This is another location where I've run afoul with &mut self
borrowing. I need
to iterate over the set of pages, retrieve the changes for that page, then potentially
modify one or more other pages mutably using add_change()
.
If I iterate over the pages directly, I'm prevented from borrowing self later because it was previously borrowed mutably in the iterator. Which makes sense, I'm only allowed mutable access to the current element, otherwise I might accidentally invalidate the entire iterator.
Alas, that means the only solution I found was to simply iterate over the range and use direct indexing.
reduce_with
vs sequential getter
Finally, you might notice that we retrieved the remote changes via a special getter method. Rayon actually has the capability of returning values from the parallel iterator, which would be a more ergonomic solution. That was my first approach, but I ran into some performance difficulties.
My original iteration looked like this:
let (active_cells, remote_changes) = self.pages.par_iter_mut()
.map(|page| page.grow())
.reduce_with(|a, b| {
let active_cells = a.0 + b.0;
let remote_changes = match (a.1, b.1) {
(Some(a_r), Some(b_r)) => {
let mut t = Vec::with_capacity(a_r.len() + b_r.len());
t.extend(a_r);
t.extend(b_r);
Some(t)
},
(Some(a_r), None) => Some(a_r),
(None, Some(b_r)) => Some(b_r),
(None, None) => None
};
(active_cells, remote_changes)
}).unwrap();
There are some terrible things happening here, but they aren't Rayon's fault. Each page
can return zero or more remote changes, and also needs to express how many cells
were made active this round. So we use reduce_with
to accumulate these two
return values, which devolves into a relatively gross match combining Options of tuples
(blargh).
A quick performance check showed that this cobbled together reduction was burning a fair amount of CPU. Ultimately, I suspect it's that vector allocation which is dooming performance. This is less than ideal anyway, but could be exacerbated by how Rayon does reductions.
My neanderthal understanding of Rayon is that it recursively divides the work to
be done using its core join
method, until it reaches the bottom of the work tree.
Then it starts computing, and as results finish on various threads the results bubble
back together and are reduced in a reverse-pairwise fashion.
I think :)
So if that's the case, then a new vector is allocated for each fork point in the
computation tree, instead of just n
times (one for each page) as I initially thought.
In any case, it was simple enough to rip out the reduction, earmark it for future investigation when I have time to do it properly and use a simpler getter for now. I do want to revisit this eventually. A reduction would be ideal because it forces the programmer (me) to deal with the results, instead of accidentally forgetting to call the getter.
After swapping to a simple getter, the flamechart looks like this.. The on-CPU portions are largely Roaring and Hashmap related now.
Getting back to the remote changes, the final point is that we add these to the appropriate page as if they were generated as local changes. So handling them during the next iteration is transparent.
Updating pages
Ok, so far we have generated a set of changes local to each Page
then collected
any remote changes that may have cross page boundaries and propagated those to the
appropriate page. Now we need to actually apply those changes and update the grid.
For this we can go back to parallel execution and hand things off to Rayon again:
impl Grid {
pub fn grow_step(&mut self) -> u32 {
// ... Growth Code
// ...
// ... Remote Changes Code
// ...
self.pages.par_iter_mut()
.weight_max()
.for_each(|page| page.update());
}
}
Because we have propagated remote changes already, pages have no dependence on
each other anymore and can go about their business in parallel. Inside the
update()
method:
impl Page {
pub fn update(&mut self) {
// Clear out the active cell bitmap, and add the cells we just grew
self.active.clear();
for (k, v) in &self.changes {
self.cells[*k as usize].set_cell_type(v.get_cell_type());
self.cells[*k as usize].set_gate(v.get_gate());
self.cells[*k as usize].set_stim(v.get_stim());
self.active.insert(*k);
}
self.changes.clear();
self.remote_changes.clear();
}
}
Which is pretty mundane actually. Clear out our bitmap of active cells because they have all been processed, then apply each change to the grid (setting cell type, direction gate and stimulatory status). Also insert this cell into the list of active cells, since next time through the growth phase it'll need to check its surroundings for growth.
Finally do some housecleaning and we're done!
Conclusion and some perf numbers
And that is growth. Check the active cells to see if they can grow anywhere, save/propagate the changes, materialize those changes and repeat the whole process until the grid stops changing.
Building a clean benchmark is a little tricky because the processing time
per page changes as the simulation progresses (due to more/less active cells).
But on my Macbook Air, it takes ~200-400 ns per active cell. An average 1-3%
sparse Page
can expect to finish processing in 800-1200 microseconds.
On one hand that feels pretty awesome to me, as someone coming from slower dynamic languages. But on the other hand, these operations are relatively simple (bit shifts and conditionals basically), so there is likely plenty of room for optimization in the future.
For fun, here are some more graphs. This is a single page at “high density” (and yes, some of those neuron bodies are “growing”, which is a bug):
And here are four pages, which demonstrates cross-page growth (the neuron in the upper left grows across all four pages total):
Note: these gifs were taken when the page size was considerably smaller, 64x64 iirc*
Next time I'll dig into the signaling mechanics which ocurr after the growth phase.