Parallel cat-o-rama?

I'm having some "fun" with OpenCL.  If you have never heard of it, it's a framework (with a C-like kernel language) for general-purpose programming on GPUs.  Since current-generation GPUs pack a ton of processing power, it can lead to some impressive speed-ups if you recode the time-critical parts of your program and make them run on the GPU.  Only caveat: the GPU is massively parallel, so if your problem is not suitable for parallelization, you have a problem (and you lose performance).
Of course things are not as easy as the online tutorials make them look: coding is trivial, but getting good performance is a nightmare of memory-access optimizations.  This is because even though the GPU has massive computing power, the memory latency is enormous, and unless you arrange the execution of your code so that the (intrinsically slow) memory accesses can be "coalesced" (= many threads access contiguous memory at the same time, so the hardware serves them with a single operation), performance gains can be minimal.  Still, when things work it's fun: right now I'm trying different approaches for a 7x7 image box filter.  I'm at a x100 execution speed-up (note: that's the filtering calculation only; I'm leaving out the computer<->GPU memory transfers, which add some more time and bring the speed-up down to "only" x40-50 or so), but I'm trying to see if it's possible to do more...
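
For anyone wondering what a 7x7 box filter actually computes, here is a minimal plain-Python reference sketch (not my OpenCL kernel, and nowhere near fast).  Each output pixel is the mean of the 7x7 window around it, and the body of the two outer loops is roughly what a single GPU thread would do.  The function name `box_filter` and the clamp-to-edge border handling are assumptions made for this illustration.

```python
def box_filter(image, radius=3):
    """Naive box filter over a 2D list of numbers.

    radius=3 gives the 7x7 window. Border handling (clamp to the
    nearest edge pixel) is an assumption made for this sketch.
    """
    height = len(image)
    width = len(image[0])
    size = 2 * radius + 1
    out = [[0.0] * width for _ in range(height)]
    # On a GPU, each (y, x) pair would be handled by its own thread.
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), height - 1)  # clamp row index
                for dx in range(-radius, radius + 1):
                    xx = min(max(x + dx, 0), width - 1)  # clamp column index
                    total += image[yy][xx]
            out[y][x] = total / (size * size)
    return out
```

Note that neighbouring output pixels read almost the same 49 inputs, which is exactly why the memory-access pattern matters so much more than the arithmetic.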

What's the relation to WoW?  None.  Except that a cat simulator which runs on the GPU would be nice and would allow for some serious kick-ass speed-up of the simulations (simulating multiple combats is very parallelizable and barely memory-bound if all you care about is the final DPS).
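
To show why this parallelizes so well, here is a toy sketch in plain Python (no GPU involved).  Every combat gets its own random generator, shares nothing with the others, and returns a single number.  The names `simulate_combat` and `average_dps`, and all the numbers in them (hit damage, crit chance, attack speed), are made-up placeholders and not real cat mechanics.

```python
import random


def simulate_combat(seed, duration=300.0):
    """Simulate one fight and return its DPS.

    All the numbers below are placeholders for illustration only.
    """
    rng = random.Random(seed)  # private RNG: no state shared between combats
    time = 0.0
    damage = 0.0
    while time < duration:
        hit = 1000.0            # placeholder damage per attack
        if rng.random() < 0.4:  # placeholder crit chance
            hit *= 2.0
        damage += hit
        time += 1.0             # placeholder: one attack per second
    return damage / duration


def average_dps(n_combats, duration=300.0):
    """Run n independent combats and average their DPS.

    The calls do not depend on each other, so each one could be
    handed to its own GPU thread instead of running in a loop.
    """
    results = [simulate_combat(seed, duration) for seed in range(n_combats)]
    return sum(results) / len(results)
```

Each combat only needs a seed as input and hands back one float, so there is hardly any memory traffic to worry about.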
Genetic cat optimization algorithms, anyone? :)