<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">
  <title>strobe.cc - new articles</title>
  <link href="http://strobe.cc/" rel="alternate"></link>
  <link href="http://strobe.cc/feeds/content.xml" rel="self"></link>
  <id>http://strobe.cc/</id>
  <updated>2009-12-30T00:00:00Z</updated>
  <entry>
    <title>CUDA atomics: a practical analysis (updated)</title>
    <link href="http://strobe.cc/cuda_atomics/" rel="alternate"></link>
    <updated>2009-12-30T00:00:00Z</updated>
    <id>http://strobe.cc/cuda_atomics/</id>
    <author><name>Steven Robertson</name></author>
    <summary type="html"><![CDATA[
    
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#hardware-and-assumptions" id="id13">Hardware and assumptions</a></li>
<li><a class="reference internal" href="#precision-and-accuracy" id="id14">Precision and accuracy</a></li>
<li><a class="reference internal" href="#what-does-clocks-return" id="id15">What does clocks() return?</a></li>
<li><a class="reference internal" href="#how-long-does-an-atomic-operation-stall-a-thread" id="id16">How long does an atomic operation stall a thread?</a><ul>
<li><a class="reference internal" href="#benchmark" id="id17">Benchmark</a></li>
<li><a class="reference internal" href="#analysis" id="id18">Analysis</a></li>
</ul>
</li>
<li><a class="reference internal" href="#memory-access-patterns-with-delays-between-accesses" id="id19">Memory access patterns with delays between accesses</a><ul>
<li><a class="reference internal" href="#id7" id="id20">Benchmark</a></li>
<li><a class="reference internal" href="#id8" id="id21">Analysis</a></li>
</ul>
</li>
<li><a class="reference internal" href="#do-adjacent-memory-operations-cause-atomic-collisions" id="id22">Do adjacent memory operations cause atomic collisions?</a><ul>
<li><a class="reference internal" href="#id11" id="id23">Benchmark</a></li>
<li><a class="reference internal" href="#id12" id="id24">Analysis</a></li>
</ul>
</li>
<li><a class="reference internal" href="#to-be-continued" id="id25">To be continued&#8230;</a></li>
<li><a class="reference internal" href="#acknowledgments" id="id26">Acknowledgments</a></li>
</ul>
</div>
<p>NVIDIA&#8217;s chips can have huge numbers of threads in-flight at a time; on my GTX
275, nearly 30,000 threads can be in the midst of executing. There is limited
synchronization between threads on the same processor, and no
inherent synchronization between processors. Any coordination
between threads must be achieved by writing to global memory, an activity
with a large latency penalty.</p>
<p>As such, the atomic operations in the CUDA ISA are critical to some
algorithms. In typical fashion, NVIDIA provides little guidance as to the
performance of these operations, or as to the manner in which they are
implemented. On a traditional CPU with a reasonable number of cores, one might
simply suggest a few benchmarks, and indeed some benchmarks have been done on
GPUs. The issue is that those benchmarks reveal that, for an operation as
simple as an addition of a single variable, performance of atomic operations
can be thousands of times slower than a traditional read-modify-write cycle.
Under the right circumstances, they can also perform <em>faster</em> than non-atomic
instructions.</p>
<p>For those crafting algorithms which use a number of atomic operations,
understanding what causes the enormous differences in the performance of these
operations makes it easier to create optimized implementations. Along with a few others,
I&#8217;m working on <a class="reference external" href="http://strobe.cc/do_androids_render/">one such algorithm</a>. This article will attempt to use
benchmarks to help uncover the architecture used for memory operations in
CUDA, in order to make writing such algorithms less of a trial and error
affair.</p>
<p>I&#8217;m aiming to build an understanding of the architecture by asking a testable
question, benchmarking, crafting a testable hypothesis as to why the results
are the way they are, and then repeating the cycle until we&#8217;re out of
surprises. It&#8217;s possible that this will lead to bad predictions (in fact, it
already has); for the sake of conciseness, I&#8217;ll edit the incorrect conclusions
out (and possibly place them in a scrap-heap article so that everyone who
wishes to can still mock me for being so very wrong) rather than describe my
backtracking.</p>
<div class="section" id="hardware-and-assumptions">
<h2><a class="toc-backref" href="#id13">Hardware and assumptions</a></h2>
<p>These benchmarks are being done on an NVIDIA GTX 275 GPU, running at standard
clocks, plugged into an Intel G965 motherboard, driven by a Core 2 Duo 6400
with 6GB RAM. The fact that my motherboard only supports PCIe 1.0 shouldn&#8217;t
cause a difference with any of these benchmarks, as they exclude any
host-to-device latencies.</p>
<p>My understanding of NVIDIA&#8217;s GT200 architecture comes largely from <a class="reference external" href="http://developer.nvidia.com/page/home.html">NVIDIA&#8217;s
own documentation</a> and the analyses done by <a class="reference external" href="http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242">Real World Tech</a> and <a class="reference external" href="http://www.beyond3d.com/content/reviews/51">Beyond3D</a>.
I suggest the latter two if you&#8217;re not familiar with the architecture.</p>
<p>It should also be noted that my background is in audio and video, <em>not</em> in 3D,
so this will be focused exclusively on general-purpose computation. While
certain tricks may exist that use dedicated hardware for better performance,
if you can&#8217;t do it from CUDA, I don&#8217;t consider it here. That might be a stupid
move, but we&#8217;ll see.</p>
<p>These benchmarks are being conducted using <a class="reference external" href="http://mathema.tician.de/software/pycuda">PyCUDA</a>, with data processing
handled by numpy and scipy. They&#8217;re all written in PTX, NVIDIA&#8217;s own assembly
language<sup><a class="footnote-reference" href="#id2" id="id1">1</a></sup><span class="fntarget" id="id2_target"></span>. Output is rendered by <a class="reference external" href="http://matplotlib.sourceforge.net/index.html">matplotlib</a>.  The source (<em>caution:
ugly</em>) is available <a class="reference external" href="http://strobe.cc/cuda_atomicsptx.py">here</a>.</p>
<div class="fnwrap"><p class="footnote" id="id2"><a class="fn-backref" href="#id1">1</a> PTX is a rather nice assembly language, and in some ways is still
rather high-level. After trying both CUDA and PTX, I find that PTX
allows me to write optimized code more easily, and involves fewer
guesses about what&#8217;s going on than CUDA&#8217;s C compiler.</p></div>
</div>
<div class="section" id="precision-and-accuracy">
<h2><a class="toc-backref" href="#id14">Precision and accuracy</a></h2>
<p>Oh, and before we get started: you should probably know that <a class="reference external" href="http://strobe.cc/articles/cuda_atomics_FAIL/">my last
attempt</a> contained flagrantly incorrect results. So maybe don&#8217;t trust these
so much until the article is done, okay?</p>
<p>For the curious, or those prone to schadenfreude: the benchmarks previously
presented here were entirely correct; they accurately reported the number of
clocks it took to run the given kernel. They seemed exceptionally strange, and
they were—how many architectures do you know of where an atomic operation
beats an unsynchronized one?—but I verified all the data being written by the
kernels, ran the tests dozens of times, reread the statistics and graph code
to make sure that everything was right. It all checked out. In fact, it
checked out exceptionally well; most runs ended up with averages that stayed
within one cycle of each other, and several would produce exactly identical
results across each warp in each SM.</p>
<p>I figured that with so much consistency on my side, I had to be right. I even
came up with an explanation which was consistent with all public information
about the chip, one that seemed to satisfy myself and others working on the
same project. Sure, the numbers implied a sustained write performance of 7
TB/s to global memory, but that&#8217;s not unreasonable for a large, wide cache.</p>
<p>Ultimately, I uncovered my error when testing my explanation. The test kernel
used a single 32-thread warp to probe memory locations in a pattern designed
to uncover the total cache size and cache line size of the supposed writeback
cache. The results were fantastical, absurd; either NVIDIA hid more than a
billion extra transistors on the die for a 32 MB SRAM cache, or something was
wrong with my technique.</p>
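<p>As a rough sanity check on the "more than a billion transistors" figure, the arithmetic can be sketched as below. The six-transistor SRAM cell is an assumption made for illustration, not something measured on the chip, and the count ignores tags, decoders, and sense amplifiers.</p>

```python
# Back-of-the-envelope transistor count for a 32 MB SRAM data array.
# ASSUMPTION: a conventional 6-transistor (6T) SRAM cell; overhead such as
# tags and decoders is ignored, so this is a lower bound.
CACHE_BYTES = 32 * 1024 * 1024
BITS_PER_BYTE = 8
TRANSISTORS_PER_BIT = 6  # assumed 6T cell


def sram_transistors(cache_bytes, transistors_per_bit=TRANSISTORS_PER_BIT):
    """Return the transistor count for the data array alone."""
    return cache_bytes * BITS_PER_BYTE * transistors_per_bit


total = sram_transistors(CACHE_BYTES)
print(f"{total:,} transistors (~{total / 1e9:.2f} billion)")
```

<p>Under that assumption the data array alone comes to roughly 1.6 billion transistors, which is why the result was so hard to believe.</p>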
<p>I had fallen into a trap which has ensnared scientists and engineers for
centuries: assuming that <a class="reference external" href="http://en.wikipedia.org/wiki/Accuracy_and_precision">precision meant accuracy</a>. When an
experiment contains a systematic error, sometimes the data is kind enough to
be inexplicable, at which point most rational people will go back and recheck
their experimental design. Unfortunately, humans are <em>very good at explaining
things.</em> Given data which is remarkably consistent, we excel at conjuring a
mechanism which explains it.</p>
<p>I suppose the moral of this story is that you should never use a black box to explain the behavior of a system unless you can set up an experiment to isolate and characterize the black box directly<sup><a class="footnote-reference" href="#id4" id="id3">2</a></sup><span class="fntarget" id="id4_target"></span>.</p>
<div class="fnwrap"><p class="footnote" id="id4"><a class="fn-backref" href="#id3">2</a> It is a very <em>particular</em> moral.</p></div>
<p>As for the bug, it was pretty trivial: I was repeating a short benchmark many times, and I assumed that the precision of the results indicated that they were an accurate representation of longer kernels. I also did not depend on the return value of the atomic operations. The out-of-order capabilities of the compiler and/or chip were allowing all timing code to execute before any results were returned. The rest of the article will take this into account, and (upon completion) will also explore these capabilities directly.</p>
</div>
<div class="section" id="what-does-clocks-return">
<h2><a class="toc-backref" href="#id15">What does clocks() return?</a></h2>
<p>This one has nothing to do with atomics, but a good understanding is necessary for benchmarking. It seems like it could be a stupid question, as the documentation says it quite clearly:</p>
<blockquote>
&quot;When executed in device code, returns the value of a per-multiprocessor
counter that is incremented every clock cycle.&quot;</blockquote>
<p>Okay, neat. Except, wait, <em>which</em> clock? One would assume that this refers to the frontend clock, which ticks twice for each warp, but does that leave the two half-warps with different clocks? Or does it refer to the clock on the backend, which ticks four times per warp, leaving us with up to four different values per warp?</p>
<p>The heart of this experiment is in these three lines:</p>
<div class="blockcode"><div class="highlight"><pre>    mov.u32     clka,   %clock;
    mov.u32     clkb,   %clock;
    sub.u32     clka,   clka,   clkb;
</pre></div>
</div><p>Register &#8216;clka&#8217; ends up holding the difference between two samples of the clock. Running the kernel in a single thread per SM and dumping the results to memory, we get to see this value. Turns out it&#8217;s <em>exactly</em> 28 clocks, without deviation.</p>
<p>Running it at 32 threads per SM, the results stay steady at 28 clocks, and all results in a warp are equal, indicating that this is the frontend clock latched at the start of a two-clock warp. Setting up a tight 256-round loop and storing the sum of differences to memory, we find this result:</p>
<img alt="consecutive_clocks.png" src="http://strobe.cc/cuda_atomicsconsecutive_clocks.png" />
<p>The uncanny exactness of 28 clocks per round disappears when you have more
than 2 warps per SM. This makes a lot of sense; at 4 warps, with two
cycles per warp instruction and two instructions per clock, a round-robin
scheduler would take 32 cycles to come back to the first warp, giving enough
room to hide whatever caused the 28-clock minimum latency. Adding a few
instructions in between those operations suggests that each SM is pipelined to
give that massive register file time to breathe (obvious), that the exactness
of the 28 clocks may be related to accessing special registers like %clock
(less obvious), and that register dependencies are caught and handled by the
instruction scheduler (obvious in hindsight).</p>
<p>The tightness of the error bars, even as the card climbs past full occupancy,
is misleading, as this is the mean of 256 runs per thread. Cutting down the
number of runs per thread to 8 shows much less determinism in saturated SM
scheduling, although it&#8217;s comforting to note that the algorithm in use tends
to keep threads at approximately the same instruction count (in the absence of
memory operations) without the explicit use of thread synchronization over
longer runs.</p>
<img alt="consecutive_clocks_8_iter.png" src="http://strobe.cc/cuda_atomicsconsecutive_clocks_8_iter.png" />
<p><em>Conclusion:</em> clocks() returns the frontend clock at the start of a warp&#8217;s execution. On an underutilized SM which can&#8217;t hide instruction latency, the comparison adds 28 cycles of latency on top of whatever was between the calls; this drops to 2 cycles on a fully utilized SM. It should be safe to use clocks() for benchmarking.</p>
</div>
<div class="section" id="how-long-does-an-atomic-operation-stall-a-thread">
<h2><a class="toc-backref" href="#id16">How long does an atomic operation stall a thread?</a></h2>
<div class="section" id="benchmark">
<h3><a class="toc-backref" href="#id17">Benchmark</a></h3>
<p>For this question, we&#8217;ll consider five types of operations: &#8216;load&#8217; and
&#8216;store&#8217;, neither of which is sufficient to compare to an atomic operation like
&#8216;add&#8217; but are included for reference; &#8216;load_store&#8217;, the traditional
read-modify-write approach to addition; &#8216;red&#8217;, which performs an atomic
reduction—that is, it computes and stores to global memory, but does not use
the value returned from the memory controller in subsequent operations<sup><a class="footnote-reference" href="#id6" id="id5">3</a></sup><span class="fntarget" id="id6_target"></span>;
and &#8216;atomic&#8217;, which explicitly uses the result.</p>
<div class="fnwrap"><p class="footnote" id="id6"><a class="fn-backref" href="#id5">3</a> In C/C++, the compiler should emit a &#8216;red&#8217; automatically when you
ignore the return value of atomicAdd() and friends.</p></div>
<p>These global memory operations will be run in a tight loop with code that times each operation. For &#8216;load&#8217;, &#8216;load_store&#8217;, and &#8216;atomic&#8217;, an explicit register dependency is created on the return value of the global memory operation by xor&#8217;ing it with &#8216;clka&#8217; in the example above before reading in &#8216;clkb&#8217;. This trick seems to keep the SM from reordering the clock sampling, which improves the accuracy of the timings. It does <em>not</em> affect &#8216;store&#8217; or &#8216;red&#8217; operations, which return no value to depend on, so the reported numbers there may be incorrect or at least unrepresentative. More on this later.</p>
<p>Three memory access patterns will be tested. The first goes straight for the jugular: all writes across an SM go to the same address, ensuring that all atomic operations cause a conflict. Each SM gets its own address, though, because having all processors write to the same location caused several system crashes during testing. This is expected to be nearly the worst case for atomic operations, and the results do not disappoint:</p>
<img alt="basic_add_good_single.png" src="http://strobe.cc/cuda_atomicsbasic_add_good_single.png" />
<p>Ick. Let&#8217;s not do that again.</p>
<p>The next access pattern is less pessimal; each memory location is separated by 128 bytes, and each thread gets its own memory location, ensuring that no conflicts occur but also preventing the chip from coalescing any memory operations.</p>
<img alt="basic_add_good_uncoa.png" src="http://strobe.cc/cuda_atomicsbasic_add_good_uncoa.png" />
<p>Well, that&#8217;s&#8230; tolerable. It remains to be seen whether atomics can be used for scatters in computation threads, but this looks like it wouldn&#8217;t cause too much damage. One last access pattern: this time, all threads are neatly coalesced, each accessing a 4-byte memory location in order, such that a warp hits a single 256-byte-wide, 256-byte-aligned region of memory.</p>
<img alt="basic_add_good_coa.png" src="http://strobe.cc/cuda_atomicsbasic_add_good_coa.png" />
<p>Crap. That&#8217;s quite a bit worse. Sure, the total latency for an atomic operation is better, but the ratio between an uncoalesced atomic and read-modify-write latency is much smaller than that for the coalesced pattern, so the <em>relative</em> cost of atomic operations in this context is much worse.</p>
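<p>For reference, the three access patterns can be sketched as plain address arithmetic. This is a host-side illustration rather than the benchmark's PTX; the function names and the spacing between the per-SM addresses in the all-conflicts case are assumptions.</p>

```python
WORD_BYTES = 4       # every access is one 32-bit word
STRIDE_BYTES = 128   # spacing used by the uncoalesced pattern
WARP_SIZE = 32


def all_conflicts(sm_id, thread_id):
    # Every thread on an SM targets that SM's one address.
    # ASSUMPTION: per-SM addresses are spaced STRIDE_BYTES apart.
    return sm_id * STRIDE_BYTES


def uncoalesced(thread_id):
    # One private location per thread, 128 bytes from its neighbours.
    return thread_id * STRIDE_BYTES


def coalesced(thread_id):
    # Consecutive threads touch consecutive 4-byte words.
    return thread_id * WORD_BYTES


def distinct_addresses(pattern, warp=0):
    """Number of different addresses touched by one warp."""
    first = warp * WARP_SIZE
    return len({pattern(t) for t in range(first, first + WARP_SIZE)})


def warp_stays_in_aligned_region(pattern, warp, region=256):
    """True if every byte a warp touches lies in one aligned region."""
    first = warp * WARP_SIZE
    lo = pattern(first)
    hi = pattern(first + WARP_SIZE - 1) + WORD_BYTES - 1
    return lo // region == hi // region


print(distinct_addresses(lambda t: all_conflicts(0, t)))
print(distinct_addresses(uncoalesced))
print(distinct_addresses(coalesced))
print(warp_stays_in_aligned_region(coalesced, warp=3))
```

<p>In this sketch a warp in the all-conflicts pattern touches a single address, while the other two patterns give each lane its own word; only the coalesced pattern keeps a whole warp inside one 256-byte-aligned region.</p>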
</div>
<div class="section" id="analysis">
<h3><a class="toc-backref" href="#id18">Analysis</a></h3>
<p>Take a look at the error bars in the above graphs. For the &#8216;all conflicts&#8217; access pattern, there&#8217;s an enormous variability in the time it takes to serve requests; whatever mechanism is being used to deal with conflicting atomic operations isn&#8217;t capable of FIFO scheduling all of them. In the &#8216;uncoalesced&#8217; access pattern, the error bars shrink substantially; the variability of the times it takes to issue the memory request is very low. Coalesced memory accesses also have very steady times for both the load and load-store operations, but have a higher variance for store, atomic, and reduction operations. Note also that coalesced reductions, which should in theory allow the scheduler more freedom to hide memory latency, take longer and have more variance than atomics which prevent a kernel from processing the next instruction.</p>
<p>To explain this behavior, we need a detailed model of the memory architecture
of the chip. From the descriptions at <a class="reference external" href="http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242">Real World Tech</a> and <a class="reference external" href="http://www.beyond3d.com/content/reviews/51">Beyond3D</a>, along
with a little inference and a few patent searches, we have some <em>a priori</em>
knowledge. Streaming Multiprocessors have independent computation hardware,
register files, and shared memory, but they&#8217;re not entirely independent. Each
SM is bundled with two others into a Thread Processing Cluster, which handles
instruction fetch, scheduling, and dispatch, as well as global memory
operations (including ROP and texture fetch). The TPC&#8217;s controlling logic
(<em>frontend</em>) is in a different clock domain from the ALU, FPU, and SFU
(<em>backend</em>), with the former at half the speed of the latter. The TPC is also
connected to a crossbar bus that connects to the other TPCs and the memory
controller, among other things.</p>
<p><a class="reference external" href="http://www.google.com/patents/about?id=IQS_AAAAEBAJ">US Patent Application 12/327,626</a> vaguely describes a GPU memory controller.
Given the filing date and subject matter, it probably covers technology
developed for Fermi, but Fermi and GT200 are not so dissimilar as to make the
filing irrelevant. It states,</p>
<blockquote>
&quot;In one embodiment, memory hub bus 240 is a high-speed bus, such as a bus
communicating data and memory requests in data packets (a &quot;packetized&quot;
bus). For example, high-speed I/O buses may be implemented using a low
voltage differential signal technique and interface logic to support a
packet protocol to transmit and receive data as data packets.&quot;</blockquote>
<p>By indulging in some speculation, it is easy to envision a vague protocol for
issuing memory transactions on this bus. For the sake of having something to
test, even if it is later found incorrect, let us assume that each TPC has a
finite queue for pending memory operations, and that the memory controller
also has such a queue. A TPC issuing a memory transaction would queue it, mark
some registers as dirty on the <a class="reference external" href="http://www.google.com/patents/about?id=vDiuAAAAEBAJ">scoreboard</a>, and post the request on the bus.
Then—and this part is entirely speculation, as other mechanisms for doing QoS
or rate-limiting are widely employed in buses like PCI-E and
HyperTransport—the TPC waits for an acknowledgment from the memory controller
indicating that the memory request was successfully queued. In the event that
the memory controller&#8217;s queue is full, the controller would bounce a &quot;retry
later&quot; message to the TPC. All of this is done over the packet-oriented bus
described above. Atomic calculations are handled by a dedicated SIMD ALU on or
near the memory controller.</p>
<p>This mechanism will be tested and refined as we go, but for now it does manage
to account for a few of the curiosities in the first round of benchmark
results. If we assume the proposed system is true, then:</p>
<ul class="simple">
<li>The small but nonzero wait time of &quot;set-and-forget&quot; operations such as
&#8216;store&#8217; and &#8216;red&#8217; under low-utilization conditions is the round-trip time to
the controller. (When the controller&#8217;s not flooded, a &#8216;red&#8217; performs more or
less just as fast as a &#8216;store&#8217;, as we&#8217;ll see later.)</li>
<li>The increasing wait time and variance of &#8216;store&#8217; and &#8216;red&#8217; as compared to
their typically-slower analogs &#8216;load&#8217; and &#8216;atomic&#8217;, respectively, under
conditions when the controller was starved for DRAM bandwidth—viz,
coalesced, 32 warps/SM—are related to increased numbers of memory controller
&quot;retry later&quot; rejection messages. In other words, the limited TPC memory
transaction queue is filled by &#8216;load&#8217; or &#8216;atomic&#8217; instructions waiting to
return, acting as an implicit rate-control, whereas the TPCs simply retry
continuously when attempting to push a &#8216;store&#8217; or &#8216;red&#8217; at the GPU, and the
loop of rejection packets floods the <em>internal bus bandwidth</em> (or packet
rate limit) as well as the DRAM bandwidth, causing the slight penalty seen
in those instructions on the latter benchmark.</li>
<li>The limiting factor causing the decrease in the ratio &#8216;load_store&#8217;/&#8217;atomic&#8217;
in the coalesced case is the memory controller&#8217;s ALU.</li>
</ul>
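<p>To make the speculated acknowledge-or-retry behavior concrete, here is a toy discrete-time model of it. Everything in it (the <tt>simulate</tt> function, the queue depths, the service period) is invented for illustration and is not derived from the hardware; it only shows how continuous retries of &#8216;store&#8217;-style requests could generate far more rejection traffic than &#8216;load&#8217;-style requests that hold a TPC queue slot until they return.</p>

```python
from collections import deque


def simulate(ticks, issuers, ctrl_depth, service_period, max_outstanding):
    """Toy model of the speculated protocol; returns (accepted, rejected).

    max_outstanding=None models 'store'/'red': the issuer never waits for
    a result and retries every tick. An integer models 'load'/'atomic',
    which hold a TPC queue slot until the controller retires the request.
    All parameters are hypothetical, not measured hardware values.
    """
    queue = deque()                # controller queue, holds issuer ids
    outstanding = [0] * issuers    # in-flight requests per issuer
    accepted = rejected = 0
    for tick in range(ticks):
        # The controller retires one request every service_period ticks.
        if queue and tick % service_period == 0:
            done = queue.popleft()
            outstanding[done] -= 1
        for i in range(issuers):
            if max_outstanding is not None and outstanding[i] == max_outstanding:
                continue           # TPC queue full: issuer stalls quietly
            if len(queue) == ctrl_depth:
                rejected += 1      # a "retry later" packet crosses the bus
            else:
                queue.append(i)
                outstanding[i] += 1
                accepted += 1
    return accepted, rejected


blocking = simulate(1000, issuers=10, ctrl_depth=16,
                    service_period=4, max_outstanding=1)
flooding = simulate(1000, issuers=10, ctrl_depth=16,
                    service_period=4, max_outstanding=None)
print("load/atomic-like (accepted, rejected):", blocking)
print("store/red-like   (accepted, rejected):", flooding)
```

<p>With these made-up parameters both modes get about the same number of requests accepted, since the controller is the bottleneck either way, but only the fire-and-forget mode produces rejections, and it produces many of them. That is the shape of the effect proposed in the list above, nothing more.</p>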
<p>However, the explanation is not perfect, or at least not complete; it doesn&#8217;t
seem to explain why uncoalesced operations have such a tight variance, nor
does it answer any questions about how conflicts are handled. It also doesn&#8217;t
include hard numbers, such as the width of the SIMD ALU at the memory
controller or the depth of the transaction queues. But there are plenty of
benchmarks left to run which could help clear up these matters.</p>
</div>
</div>
<div class="section" id="memory-access-patterns-with-delays-between-accesses">
<h2><a class="toc-backref" href="#id19">Memory access patterns with delays between accesses</a></h2>
<div class="section" id="id7">
<h3><a class="toc-backref" href="#id20">Benchmark</a></h3>
<p>The same three benchmarks as above, but with 50 32-bit multiply-adds thrown in. Remember, on GT200, a multiply-add is implemented as four separate instructions, so this is actually 200 instructions or 400 front-end cycles of computation on top of the memory operation and loop construct.</p>
<img alt="compute_bar_single.png" src="http://strobe.cc/cuda_atomicscompute_bar_single.png" />
<p>Yes, atomic collisions suck. But we knew that.</p>
<img alt="compute_bar_uncoa.png" src="http://strobe.cc/cuda_atomicscompute_bar_uncoa.png" />
<p>Note the performance of memory operations when the memory core is underutilized. Promising.</p>
<img alt="compute_bar_coa.png" src="http://strobe.cc/cuda_atomicscompute_bar_coa.png" />
<p>That&#8217;s right: free atomics.</p>
</div>
<div class="section" id="id8">
<h3><a class="toc-backref" href="#id21">Analysis</a></h3>
<p>If you have the luxury of using coalesced memory operations, the performance
cost of atomic operations which use the result is essentially identical to
that of a read-modify-write cycle. The performance cost of a coalesced &#8216;red&#8217;
operation actually <em>beats</em> &#8216;load_store&#8217; handily. If your kernel has enough
number-crunching instructions between memory accesses, then the performance
difference between any of these is insignificant, as long as the memory
controller is not flooded<sup><a class="footnote-reference" href="#id10" id="id9">4</a></sup><span class="fntarget" id="id10_target"></span>.</p>
<div class="fnwrap"><p class="footnote" id="id10"><a class="fn-backref" href="#id9">4</a> Coalescing uses the memory controller more efficiently, so it reduces
the load, but the same effect can be achieved for uncoalesced memory
writes if your kernels perform more computations between writes, as we&#8217;ll
see in later benchmarks.</p></div>
<p>The particular conditions determining when atomic operations are &quot;free&quot; depend
on a number of factors, including kernel length, SM occupancy, register
dependencies, and memory access patterns. For example, the test kernel&#8217;s
&#8216;filler&#8217; instructions all depend on the result of the previous instruction, so
it takes an occupancy of 8 warps/SM to hide register file latency and fully
utilize the ALU. A different kernel might be able to swap threads more
frequently, meaning that 1/8 occupancy might fully load the ALU. Of course,
such a kernel might also issue more memory transactions as a result of its
faster rate of execution, which could lead to the bandwidth constraints that
result in higher penalties for memory operations. In other words, if you
absolutely need atomics to be &quot;free&quot;, benchmark your particular code!</p>
<p>On the other hand, these results also show that it&#8217;s not hard to get free or at least cheap atomics. I had prepared a complex workaround for the flame algorithm to avoid using these &quot;slow&quot; operations, and the <a class="reference external" href="http://sourceforge.net/projects/flam4/">flam4</a>
implementation just gives the finger to atomicity and doesn&#8217;t attempt to avoid collisions (granted, they shouldn&#8217;t be <em>that</em> common, but still). Both of these tradeoffs were intended to avoid the high perceived cost of atomic operations; neither, as it turns out, was necessary.</p>
<p>These results are consistent with the proposed model for operation of the
memory controller, but do not provide significant refinements to that model.
The delays in the uncoalesced access pattern provide a bit more support for
the theory that atomic operations are handled on-chip by a SIMD ALU;
presumably, the flood of single-location uncoalesced memory requests was
causing the ALU to be saturated with 1-vector operations.</p>
</div>
</div>
<div class="section" id="do-adjacent-memory-operations-cause-atomic-collisions">
<h2><a class="toc-backref" href="#id22">Do adjacent memory operations cause atomic collisions?</a></h2>
<div class="section" id="id11">
<h3><a class="toc-backref" href="#id23">Benchmark</a></h3>
<p>Each CTA is given its own 32K region of global memory. The first eight lanes
of each warp in a 32×8 CTA choose a memory address, so that each is offset
from a 4K boundary by the distance under test. The result is that each warp
places 8 memory accesses per iteration, each exactly 4K apart, and each offset
from a 4K boundary by the same distance per <em>warp</em>, but a linearly varying
distance across the CTA. It&#8217;s easier to understand with an equation:</p>
<pre class="literal-block">
address = 32768*ctaid.x + 4096*tid.x + OFFSET*tid.y;
</pre>
<p>Note that &#8216;x&#8217; and &#8216;y&#8217; are in the opposite order from what you might expect, to
prevent memory accesses from being coalesced. A concern with this method is
the potential to exhaust the number of queued memory operations per local TPC
scheduler; this motivated the choice to limit to 8 memory operations per warp,
which may help to avoid that condition, but will be investigated again after
the queue depth tests.</p>
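<p>The addressing scheme described above can be sketched on the host as follows. The sketch assumes, following the prose, that the lane index picks the 4K slot and the warp index scales the offset under test; the function and constant names are illustrative and do not come from the benchmark source.</p>

```python
CTA_REGION = 32768   # bytes of global memory reserved per CTA
LANE_STRIDE = 4096   # active lanes sit exactly 4K apart
ACTIVE_LANES = 8     # only the first eight lanes of a warp issue accesses
WARPS_PER_CTA = 8    # a 32x8 CTA holds eight warps


def address(cta, lane, warp, offset):
    """Byte address of one access (index names are assumptions)."""
    return CTA_REGION * cta + LANE_STRIDE * lane + offset * warp


def warp_addresses(cta, warp, offset):
    """The eight addresses one warp touches in a single iteration."""
    return [address(cta, lane, warp, offset) for lane in range(ACTIVE_LANES)]


def region_per_lane(offset):
    """Bytes of each 4K slot spanned by the eight warps' offset slots."""
    return WARPS_PER_CTA * offset


print(warp_addresses(cta=0, warp=1, offset=16))
print(region_per_lane(4), region_per_lane(8))
```

<p>Each warp's accesses land exactly 4K apart, and the eight warps of a CTA together cover a 32-byte region per lane at a 4-byte offset and a 64-byte region at an 8-byte offset.</p>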
<p>We expect to see a penalty for atomic operations with a very high peak at low
offsets, which drops off sharply to draw nearly even with a load/store
operation at higher offsets. We are not disappointed:</p>
<img alt="atomic_lock_width.png" src="http://strobe.cc/cuda_atomicsatomic_lock_width.png" />
</div>
<div class="section" id="id12">
<h3><a class="toc-backref" href="#id24">Analysis</a></h3>
<p>CUDA is geared towards parallel computation, and the memory architecture
benefits from coalescing memory operations. Given the bias towards
implementing memory operations with large widths, regardless of the actual
amount of data requested, it seems likely that the mechanism which prevents
atomic transactions from interfering with each other also operates on
something more than a byte at a time. This, really, is the ultimate purpose of
asking this question; it can be imagined that if you&#8217;re relying on atomic
scatters to make your algorithm feasible, you&#8217;re not likely to implement a
complicated mechanism for preventing adjacent writes, so the information
gathered here is simply being used to test and expand the model for GT200b&#8217;s
operation.</p>
<p>Unfortunately, the data isn&#8217;t as much of a slam dunk as would be desired for
further asserting the nature of the underlying architecture. The chip was
running a 30-block grid, which should allocate one 256-thread CTA per SM, and
yet the lowest time to complete an iteration is much higher for this benchmark
than for previous benchmarks at this occupancy. Perhaps this discrepancy will
be explained by later benchmarks, but for now, be cautious in interpreting
these results.</p>
<p>While the scale may be different than what was expected, however, the
<em>relative</em> sizes of the results are right in line with expectations. This
benchmark is somewhat more probabilistic in nature than the previous
benchmarks: not only is the thread execution order nondeterministic (as far as
we know), the results of a collision can only be seen when that collision
actually happens, which requires both memory transactions to be in the
collision detection mechanism at the same time. For the all-conflicts case,
assuming that both the memory controller and TPC have queues of sufficient
depth to dispatch more than one warp&#8217;s worth of transactions at a time (8
transactions, in this case), a collision should happen for every thread
iteration; as this is a deterministic result, we can arrive at conclusions by
comparing the other results with the 0-byte offset case.</p>
<p>The results for a 4-byte offset are clearly very similar to those obtained for
a 0-byte offset. With 8 different warps each writing 4 bytes of data with a
4-byte offset, every write happens within a 32-byte range per lane. Since the
results are so close to the collision-guaranteed 0-byte case, the data
suggests that atomic writes within 32 bytes of one another are considered
conflicting by the memory controller; in other words, we can say with
confidence that the atomic lock width is at least 32 bytes.</p>
<p>At 8 bytes, corresponding to a 64-byte region of memory per lane, a run takes
about 80% of the time per iteration as the baseline. Barring DDR shenanigans,
we can chalk this improvement up to consecutive atomic operations that do not
result in collisions. The mere presence of an improvement alone is compelling
evidence which points to an atomic lock size of exactly 32 bytes, as a larger
lock would have given the same performance for an 8-byte offset as for a
4-byte one. The trend continues, with a 16-byte offset having an even lower
performance penalty, and the 32-byte offset having no significant performance
penalty. As a whole, the evidence is strong for a 32-byte lock width.</p>
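<p>The offset arithmetic above can be sketched as a small model. This is only an illustration, not the benchmark itself: it assumes a 32-byte lock granule, 8 warps per block, and a base address aligned to the granule, and the names <tt class="docutils literal"><span class="pre">granules_touched</span></tt> and <tt class="docutils literal"><span class="pre">writes_per_granule</span></tt> are made up for this example. It counts how many distinct granules one lane touches across all warps; it says nothing about actual timings.</p>

```python
# Hypothetical model: which lock granule does each warp's write land in?
LOCK_WIDTH = 32               # bytes; the lock width the results point to (assumed)
NUM_WARPS = 8                 # warps per block in this benchmark
OFFSETS = [0, 4, 8, 16, 32]   # byte offset between consecutive warps' targets


def granules_touched(offset, lock_width=LOCK_WIDTH, warps=NUM_WARPS):
    """Return the set of lock granules hit by one lane across all warps.

    Assumes the lane's base address is aligned to the lock width.
    """
    return {(warp * offset) // lock_width for warp in range(warps)}


def writes_per_granule(offset, lock_width=LOCK_WIDTH, warps=NUM_WARPS):
    """Average number of writes that could contend for the same granule."""
    return warps / len(granules_touched(offset, lock_width, warps))


for offset in OFFSETS:
    # Columns: offset in bytes, distinct granules, writes per granule.
    print(offset, len(granules_touched(offset)), writes_per_granule(offset))
```

<p>Under these assumptions, the 0-byte and 4-byte offsets both put all eight writes in one granule, while each doubling of the offset past that spreads them over twice as many granules, until at 32 bytes every write has a granule to itself. That matches the ordering of the measured results, though not necessarily their magnitudes.</p>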
<p>An interesting and unexpected result came up when running this benchmark with
more than one warp per SM. Take a look:</p>
<img alt="atomic_lock_width_3_warps_per_sm.png" src="http://strobe.cc/cuda_atomicsatomic_lock_width_3_warps_per_sm.png" />
<p>The performance boost at 64 bytes, and the further improvement at 128 bytes,
point toward an increase in memory bandwidth when transactions are spread out
past a certain width. Curiously, the 256-byte offsets and above aren&#8217;t quite
as fast as the 128-byte case, although they&#8217;re certainly faster than the
32-byte offsets. (Again, these numbers refer to the offsets only; all of the
actual write widths are 4 bytes.) These figures are probably due to the effect
of the memory itself shining through the memory controller. They may be
related to the way bank interleaving is done on the card, or to latency
penalties for certain addressing operations. I&#8217;d have to study DDR&#8217;s operation
a little more to draw conclusions from these results.</p>
<p>The first hints of the manner in which memory operations are scheduled and
retired are also visible in these results. Once a few experiments have been run
which target the TPC and memory controller queue depths explicitly, we can come
back and validate them. For now, the probabilistic nature of these results
(along with the lack of nice assumptions like purely-random scheduling) makes
extracting that kind of information a bit of a stretch, so I&#8217;m also setting
that one aside for later.</p>
</div>
</div>
<div class="section" id="to-be-continued">
<h2><a class="toc-backref" href="#id25">To be continued&#8230;</a></h2>
<p>This article is split into multiple parts, some of which have yet to be
written. Questions that I&#8217;m working on answering include:</p>
<ul class="simple">
<li>How many instructions are needed to hide atomic latency?</li>
<li>Does an atomic collision interfere with non-colliding transactions?</li>
<li>What&#8217;s the width of the memory controller SIMD ALU?</li>
<li>What is the depth of the transaction queue on a TPC? On the memory
controller?</li>
<li>Does the hardware do any kind of out-of-order execution? If so, what?</li>
<li>Can instruction reordering by the programmer result in significant speedups?</li>
</ul>
<p>Note: as of March 1, it&#8217;s looking highly unlikely that I&#8217;ll actually finish
answering these questions before the release of GF100-based chips, after
which interest (both yours and mine) is expected to wane considerably. Rest
assured, I&#8217;m hard at work, just not on this.</p>
</div>
<div class="section" id="acknowledgments">
<h2><a class="toc-backref" href="#id26">Acknowledgments</a></h2>
<p>My thanks go out to <a class="reference external" href="http://gpgpu.univ-perp.fr/index.php/Barra">Sylvain Collange</a> and Christian Buchner for valuable
feedback and pointers at <a class="reference external" href="http://forums.nvidia.com/index.php?showtopic=150856">the CUDA forums</a>.</p>
</div>

    ]]>
    </summary>
  </entry>
  <entry>
    <title>Quodlibot, a Google Code IRC bot</title>
    <link href="http://strobe.cc/quodlibot/" rel="alternate"></link>
    <updated>2009-12-26T00:00:00Z</updated>
    <id>http://strobe.cc/quodlibot/</id>
    <author><name>Steven Robertson</name></author>
    <summary type="html"><![CDATA[
    <p>I was looking for a simple IRC bot which would announce changes to the <a class="reference external" href="http://code.google.com/p/quodlibet/">Quod
Libet</a> project, like what <a class="reference external" href="http://cia.vc/">CIA</a> does for version control. Finding none, I
hacked one together in about an hour, using <a class="reference external" href="http://twistedmatrix.com/trac/">Twisted</a> and <a class="reference external" href="http://www.feedparser.org/">Feed Parser</a>. It&#8217;s
trivial, but code that isn&#8217;t shared is lost, and it may save a few others some
time in the future. If you maintain a GC project and want me to host an
instance of the bot, send me an email.</p>
quodlibot.py
<a class="smallcaps" style="color: black;" href="http://strobe.cc/quodlibotquodlibot.py">(download)</a><div class="blockcode"><table class="highlighttable"><tr><td class="code"><div class="highlight"><pre><span style="color: #408080; font-style: italic"># Copyright (c) 2009 Steven Robertson.</span>
<span style="color: #408080; font-style: italic">#</span>
<span style="color: #408080; font-style: italic"># This program is free software; you can redistribute it and/or modify</span>
<span style="color: #408080; font-style: italic"># it under the terms of the GNU General Public License version 2 or</span>
<span style="color: #408080; font-style: italic"># later, as published by the Free Software Foundation.</span>

NAME<span style="color: #666666">=</span><span style="color: #BA2121">&quot;Google_Code_RSS_IRC_Bridge_Bot&quot;</span>
VERSION<span style="color: #666666">=</span><span style="color: #BA2121">&quot;0.1&quot;</span>

<span style="color: #008000; font-weight: bold">from</span> <span style="color: #0000FF; font-weight: bold">twisted.words.protocols</span> <span style="color: #008000; font-weight: bold">import</span> irc
<span style="color: #008000; font-weight: bold">from</span> <span style="color: #0000FF; font-weight: bold">twisted.internet</span> <span style="color: #008000; font-weight: bold">import</span> reactor, protocol, task

<span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">feedparser</span>

<span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">re</span>
<span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">sys</span>
<span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">urllib2</span>

<span style="color: #008000; font-weight: bold">class</span> <span style="color: #0000FF; font-weight: bold">AnnounceBot</span>(irc<span style="color: #666666">.</span>IRCClient):

    username <span style="color: #666666">=</span> <span style="color: #BA2121">&quot;</span><span style="color: #BB6688; font-weight: bold">%s</span><span style="color: #BA2121">-</span><span style="color: #BB6688; font-weight: bold">%s</span><span style="color: #BA2121">&quot;</span> <span style="color: #666666">%</span> (NAME, VERSION)
    sourceURL <span style="color: #666666">=</span> <span style="color: #BA2121">&quot;http://strobe.cc/&quot;</span>

    <span style="color: #408080; font-style: italic"># I am a terrible person.</span>
    instance <span style="color: #666666">=</span> <span style="color: #008000">None</span>

    <span style="color: #408080; font-style: italic"># Intentionally &#39;None&#39; until we join a channel</span>
    channel <span style="color: #666666">=</span> <span style="color: #008000">None</span>

    <span style="color: #408080; font-style: italic"># Prevent flooding</span>
    lineRate <span style="color: #666666">=</span> <span style="color: #666666">3</span>

    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">signedOn</span>(<span style="color: #008000">self</span>):
        <span style="color: #008000">self</span><span style="color: #666666">.</span>join(<span style="color: #008000">self</span><span style="color: #666666">.</span>factory<span style="color: #666666">.</span>channel)
        AnnounceBot<span style="color: #666666">.</span>instance <span style="color: #666666">=</span> <span style="color: #008000">self</span>

    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">joined</span>(<span style="color: #008000">self</span>, channel):
        <span style="color: #008000">self</span><span style="color: #666666">.</span>channel <span style="color: #666666">=</span> <span style="color: #008000">self</span><span style="color: #666666">.</span>factory<span style="color: #666666">.</span>channel

    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">left</span>(<span style="color: #008000">self</span>, channel):
        <span style="color: #008000">self</span><span style="color: #666666">.</span>channel <span style="color: #666666">=</span> <span style="color: #008000">None</span>

    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">trysay</span>(<span style="color: #008000">self</span>, msg):
        <span style="color: #BA2121; font-style: italic">&quot;&quot;&quot;Attempts to send the given message to the channel.&quot;&quot;&quot;</span>
        <span style="color: #008000; font-weight: bold">if</span> <span style="color: #008000">self</span><span style="color: #666666">.</span>channel:
            <span style="color: #008000; font-weight: bold">try</span>:
                <span style="color: #008000">self</span><span style="color: #666666">.</span>say(<span style="color: #008000">self</span><span style="color: #666666">.</span>channel, msg)
                <span style="color: #008000; font-weight: bold">return</span> <span style="color: #008000">True</span>
            <span style="color: #008000; font-weight: bold">except</span>: <span style="color: #008000; font-weight: bold">pass</span>

<span style="color: #008000; font-weight: bold">class</span> <span style="color: #0000FF; font-weight: bold">AnnounceBotFactory</span>(protocol<span style="color: #666666">.</span>ReconnectingClientFactory):
    protocol <span style="color: #666666">=</span> AnnounceBot
    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">__init__</span>(<span style="color: #008000">self</span>, channel):
        <span style="color: #008000">self</span><span style="color: #666666">.</span>channel <span style="color: #666666">=</span> channel

    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">clientConnectionFailed</span>(<span style="color: #008000">self</span>, connector, reason):
        <span style="color: #008000; font-weight: bold">print</span> <span style="color: #BA2121">&quot;connection failed:&quot;</span>, reason
        reactor<span style="color: #666666">.</span>stop()

<span style="color: #008000; font-weight: bold">class</span> <span style="color: #0000FF; font-weight: bold">FeedReader</span>:
    _schema <span style="color: #666666">=</span> <span style="color: #BA2121">&#39;http://code.google.com/feeds/p/</span><span style="color: #BB6688; font-weight: bold">%s</span><span style="color: #BA2121">/updates/basic&#39;</span>

    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">__init__</span>(<span style="color: #008000">self</span>, project):
        <span style="color: #008000">self</span><span style="color: #666666">.</span>project <span style="color: #666666">=</span> project
        <span style="color: #008000">self</span><span style="color: #666666">.</span>entries <span style="color: #666666">=</span> {}

    <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">update</span>(<span style="color: #008000">self</span>):
        <span style="color: #BA2121; font-style: italic">&quot;&quot;&quot;Returns list of new items.&quot;&quot;&quot;</span>
        feed <span style="color: #666666">=</span> feedparser<span style="color: #666666">.</span>parse(<span style="color: #008000">self</span><span style="color: #666666">.</span>_schema <span style="color: #666666">%</span> <span style="color: #008000">self</span><span style="color: #666666">.</span>project)
        added <span style="color: #666666">=</span> []
        <span style="color: #008000; font-weight: bold">for</span> entry <span style="color: #AA22FF; font-weight: bold">in</span> feed[<span style="color: #BA2121">&#39;entries&#39;</span>]:
            <span style="color: #008000; font-weight: bold">if</span> entry[<span style="color: #BA2121">&#39;id&#39;</span>] <span style="color: #AA22FF; font-weight: bold">not</span> <span style="color: #AA22FF; font-weight: bold">in</span> <span style="color: #008000">self</span><span style="color: #666666">.</span>entries:
                <span style="color: #008000">self</span><span style="color: #666666">.</span>entries[entry[<span style="color: #BA2121">&#39;id&#39;</span>]] <span style="color: #666666">=</span> entry
                added<span style="color: #666666">.</span>append(entry)
        <span style="color: #008000; font-weight: bold">return</span> added

<span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">strip_tags</span>(value):
    <span style="color: #008000; font-weight: bold">return</span> re<span style="color: #666666">.</span>sub(<span style="color: #BA2121">r&#39;&lt;[^&gt;]*?&gt;&#39;</span>, <span style="color: #BA2121">&#39;&#39;</span>, value)

<span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">announce</span>(feed):
    new <span style="color: #666666">=</span> feed<span style="color: #666666">.</span>update()
    <span style="color: #008000; font-weight: bold">for</span> entry <span style="color: #AA22FF; font-weight: bold">in</span> new:
        msg <span style="color: #666666">=</span> <span style="color: #BA2121">&#39;</span><span style="color: #BB6688; font-weight: bold">%s</span><span style="color: #BA2121">: </span><span style="color: #BB6688; font-weight: bold">%s</span><span style="color: #BA2121">&#39;</span> <span style="color: #666666">%</span> (strip_tags(entry[<span style="color: #BA2121">&#39;title&#39;</span>]), entry[<span style="color: #BA2121">&#39;link&#39;</span>])
        <span style="color: #008000; font-weight: bold">if</span> AnnounceBot<span style="color: #666666">.</span>instance:
            AnnounceBot<span style="color: #666666">.</span>instance<span style="color: #666666">.</span>trysay(msg<span style="color: #666666">.</span>replace(<span style="color: #BA2121">&#39;</span><span style="color: #BB6622; font-weight: bold">\n</span><span style="color: #BA2121">&#39;</span>, <span style="color: #BA2121">&#39;&#39;</span>)<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-8&#39;</span>))

<span style="color: #008000; font-weight: bold">if</span> __name__ <span style="color: #666666">==</span> <span style="color: #BA2121">&#39;__main__&#39;</span>:
    <span style="color: #408080; font-style: italic"># All per-project customizations should be done here</span>

    AnnounceBot<span style="color: #666666">.</span>nickname <span style="color: #666666">=</span> <span style="color: #BA2121">&#39;quodlibot&#39;</span>
    fact <span style="color: #666666">=</span> AnnounceBotFactory(<span style="color: #BA2121">&quot;#quodlibet&quot;</span>)
    feed <span style="color: #666666">=</span> FeedReader(<span style="color: #BA2121">&#39;quodlibet&#39;</span>)
    reactor<span style="color: #666666">.</span>connectTCP(<span style="color: #BA2121">&#39;irc.oftc.net&#39;</span>, <span style="color: #666666">6667</span>, fact)

    <span style="color: #408080; font-style: italic"># Don&#39;t reannounce every update on startup</span>
    feed<span style="color: #666666">.</span>update()

    update_task <span style="color: #666666">=</span> task<span style="color: #666666">.</span>LoopingCall(announce, feed)
    update_task<span style="color: #666666">.</span>start(<span style="color: #666666">600</span>, now<span style="color: #666666">=</span><span style="color: #008000">False</span>)

    reactor<span style="color: #666666">.</span>callLater(<span style="color: #666666">10</span>, announce, feed)

    reactor<span style="color: #666666">.</span>run()
</pre></div>
</td></tr></table></div>
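<p>The part of the bot that decides what counts as &quot;new&quot; is easy to try without a network connection or an IRC server. Here is a minimal sketch of that logic; <tt class="docutils literal"><span class="pre">SeenTracker</span></tt> is a hypothetical stand-in for <tt class="docutils literal"><span class="pre">FeedReader</span></tt> that takes a list of entry dicts instead of fetching a feed, and the sample entries are made up.</p>

```python
class SeenTracker:
    """Hypothetical stand-in for FeedReader: remembers entry ids already seen."""

    def __init__(self):
        self.entries = {}

    def update(self, fetched):
        """Return only the entries whose 'id' has not been seen before."""
        added = []
        for entry in fetched:
            if entry['id'] not in self.entries:
                self.entries[entry['id']] = entry
                added.append(entry)
        return added


tracker = SeenTracker()

# Made-up feed contents at startup.
first = [{'id': 'r1', 'title': 'Fix tag editor'},
         {'id': 'r2', 'title': 'Update docs'}]

# Priming call, as the bot does on startup: the result is discarded so
# that existing entries are not announced again.
tracker.update(first)

# A later poll sees the old entries plus one new one.
later = first + [{'id': 'r3', 'title': 'New plugin'}]
new_entries = tracker.update(later)
print([entry['id'] for entry in new_entries])  # prints ['r3']
```

<p>Keying on the entry id keeps the check simple, at the cost of holding every entry ever seen in memory for as long as the process runs.</p>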
    ]]>
    </summary>
  </entry>
  <entry>
    <title>This is your segfault on CUDA</title>
    <link href="http://strobe.cc/cuda_segfault/" rel="alternate"></link>
    <updated>2009-11-21T00:00:00Z</updated>
    <id>http://strobe.cc/cuda_segfault/</id>
    <author><name>Steven Robertson</name></author>
    <summary type="html"><![CDATA[
    <p>This is your segfault:</p>
<img alt="segv.png" src="http://strobe.cc/cuda_segfaultsegv.png" />
<p>This is your segfault on CUDA:</p>
<a class="reference external image-reference" href="http://strobe.cc/cuda_segfaultcuda-segv-large.png"><img alt="cuda-segv-small.png" src="http://strobe.cc/cuda_segfaultcuda-segv-small.png" /></a>
<p><em>Any questions?</em></p>

    ]]>
    </summary>
  </entry>
  <entry>
    <title>The video codec formerly known as Fermi</title>
    <link href="http://strobe.cc/fermi/" rel="alternate"></link>
    <updated>2009-10-30T00:00:00Z</updated>
    <id>http://strobe.cc/fermi/</id>
    <author><name>Steven Robertson</name></author>
    <summary type="html"><![CDATA[
    
<p>I&#8217;m writing a video codec!</p>
<p>To those unfamiliar with the video compression landscape, this seems like a bold and innovative move, and one which should generate much excitement from people who have that irrational love of squeezing their treasured collection of videos of their cats doing silly things in high definition down by an additional 2%<sup><a class="footnote-reference" href="#id2" id="id1">1</a></sup><span class="fntarget" id="id2_target"></span>. On the other hand, more experienced individuals might simply mutter, &quot;another one?&quot;</p>
<div class="fnwrap"><p class="footnote" id="id2"><a class="fn-backref" href="#id1">1</a> I&#8217;m one of &#8216;em. (The 2% part, not the cat videos. I do not have cats.)</p></div>
<p>Yes, folks, another stab at video compression, a field which generates hundreds
of published papers a year (and likely many more unpublishable ones) describing
how to eke out another 0.02dB PSNR from a Playboy centerfold snapshot or a
ten-second CIF-sized video of some UT idiot fumbling a football<sup><a class="footnote-reference" href="#id4" id="id3">2</a></sup><span class="fntarget" id="id4_target"></span>. Hooray,
you&#8217;ve saved five bytes on one video and increased decoding time by 600%. <em>In
MatLab.</em></p>
<div class="fnwrap"><p class="footnote" id="id4"><a class="fn-backref" href="#id3">2</a> <a class="reference external" href="http://en.wikipedia.org/wiki/Lenna">Not</a> <a class="reference external" href="http://media.xiph.org/video/derf/">kidding</a>.</p></div>
<p>Okay, I admit it, that&#8217;s unfair. Engineering research is all about trying new
things, even if they sometimes kinda suck, and publishing those papers may
eventually lead to better real-world codecs. I can understand and almost
forgive people trying to make tenure by trying ten ideas in a MatLab script and
spacing out twenty 4-page papers over two years describing them. It should also
be said that video compression is one of those fields in computer science where
things get <em>harder</em> as time goes on: because compression is fighting against
entropy, each step forward takes us another step towards a hard,
nature-of-the-universe law, making it that much harder to attain the next round
of stunning performance enhancements.</p>
<p>Of course, sometimes a new idea comes along which changes something fundamental
about the way we do things and enables that next round of gains. For a couple
years, a class of domain transforms which can accurately but somewhat
enigmatically be referred to as &quot;directional multiresolution decompositions&quot;
have seemed like they could be that breakthrough in image and video
compression. I&#8217;m preparing a blog post describing that&#8230; <em>very
impressive-sounding</em> concept in high-level terms<sup><a class="footnote-reference" href="#id6" id="id5">3</a></sup><span class="fntarget" id="id6_target"></span>, but for now it will
suffice to say that these methods are new, promising, and just different enough
to require throwing out most of the old tricks we used to squish video
before.</p>
<div class="fnwrap"><p class="footnote" id="id6"><a class="fn-backref" href="#id5">3</a> After describing it to friends and family for a few months now, I
might be able to pull this off in a reasonably articulate way.</p></div>
<p>As far as my preliminary paper-trawl has turned up, nobody&#8217;s even tried a directional decomposition video codec before, much less constructed one with a real-world implementation that makes impressive gains in coding efficiency. This either means that the topic is ripe for the researching, or that people have tried and failed abjectly. It certainly means that the topic is challenging.</p>
<p>Naturally, this means I pretty much have to try it, because I&#8217;m An Idiot™.</p>
<p>So, in one way, this is slightly different from most of the instances in which
someone announces that they&#8217;re making a new video codec before they have code
to prove it, as the fundamental underpinnings of the video codec are
substantially different from any previous attempts I have seen and thus have
the potential for producing significant and informative results.</p>
<p>In another way, it&#8217;s <em>exactly</em> like most of those instances, because by
creating a new codec I&#8217;m ensuring that it will be at least ten years or so
before it sees widespread adoption and general usefulness, <em>even if it is a
staggeringly good improvement</em>. Perhaps it will take even longer than that.
But, alas, there&#8217;s little room for avoiding that fate, as these changes can&#8217;t
simply be patched into the <a class="reference external" href="http://diracvideo.org">Dirac</a> or <a class="reference external" href="http://www.videolan.org/developers/x264.html">x264</a> code.</p>
<p>As I move from researching this problem to actually coding it, I hope to
collaborate closely with existing open-source video codecs, -stealing- sharing
code wherever possible and submitting patches enabling any new
backwards-compatible compression techniques I might be so fortunate as to
discover.</p>
<p>More as I learn it.</p>
<div class="section" id="about-the-title">
<h2>About the title</h2>
<p>Oh, and I should mention: the codec was to be called &quot;Fermi&quot;. Enrico Fermi
derived Fermi-Dirac statistics independently of Paul Dirac, and since I plan to
make use of as much code (including a significant amount of the bitstream
specification) from Dirac as possible, I thought the name was apt. Except
NVIDIA decided to steal my name for their latest GPU architecture. The product
is still months away from launch, but Googling for &quot;fermi video&quot; already fills
the page with noise about the cards, so a new name is required but not yet
chosen<sup><a class="footnote-reference" href="#id8" id="id7">4</a></sup><span class="fntarget" id="id8_target"></span>.</p>
<div class="fnwrap"><p class="footnote" id="id8"><a class="fn-backref" href="#id7">4</a> I&#8217;m calling it &quot;The Video Codec Formerly Known As Fermi&quot; in my head.
FCAF (eff-kaff) for short.</p></div>
</div>

    ]]>
    </summary>
  </entry>
  <entry>
    <title>Dear Typekit...</title>
    <link href="http://strobe.cc/dear_typekit/" rel="alternate"></link>
    <updated>2009-10-28T00:00:00Z</updated>
    <id>http://strobe.cc/dear_typekit/</id>
    <author><name>Steven Robertson</name></author>
    <summary type="html"><![CDATA[
    
<p><strong>Update</strong>: Typekit has done one better, not only using Gecko browser
detection but also <a class="reference external" href="http://blog.typekit.com/2010/01/21/typekit-supports-woff-in-firefox-3-6/">using WOFF for Firefox 3.6 and up</a>. Took a few months,
but they got it done the right way.</p>
<hr class="docutils" />
<p>Dear Typekit,</p>
<p>Your service is currently broken. The good news: it&#8217;s a one-line fix.</p>
<p>The JavaScript generated for each client includes the following regex against
<tt class="docutils literal"><span class="pre">navigator.userAgent</span></tt>, intended to check whether or not the browser is
compatible with &#64;font-face and Typekit:</p>
<div class="blockcode"><table class="highlighttable"><tr><td class="code"><div class="highlight"><pre><span style="color: #008000; font-weight: bold">function</span>(D){
    <span style="color: #008000; font-weight: bold">var</span> C<span style="color: #666666">=</span>D.match(<span style="color: #BB6688">/Firefox\/(\d+\.\d+)/</span>);
    <span style="color: #008000; font-weight: bold">if</span>(C){
        <span style="color: #008000; font-weight: bold">var</span> B<span style="color: #666666">=</span>C[<span style="color: #666666">1</span>];
        <span style="color: #008000; font-weight: bold">return</span> <span style="color: #008000">parseFloat</span>(B)<span style="color: #666666">&gt;=3.5</span>
    }
}
</pre></div>
</td></tr></table></div><p>The problem is that many Linux distributors <a class="reference external" href="http://en.wikipedia.org/wiki/Mozilla_Corporation_software_rebranded_by_the_Debian_project#Origins_of_the_issue_and_of_the_Iceweasel_name">can&#8217;t legally call their browser
Mozilla Firefox</a>. The Debian project is a notable example; they&#8217;ve rebranded
their Firefox build &quot;Iceweasel&quot;, and chosen similar names for other Mozilla
software. To avoid this dispute, other distributions have taken to using the
code-name for a particular Firefox build as the name—I&#8217;m currently posting this
from a browser called &quot;Shiretoko&quot;. This doesn&#8217;t even cover contexts in which
the Gecko engine is embedded by other applications which fully support Typekit,
such as Mozilla SeaMonkey.</p>
<p>It would be unreasonable to check for every variant of the browser name in the
JavaScript handed out by your application. Fortunately, Mozilla has made a
provision allowing the right decision to be made, regardless of browser name.
Here&#8217;s the value of <tt class="docutils literal"><span class="pre">navigator.userAgent</span></tt> from the official build of
Firefox:</p>
<pre class="literal-block">
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20091021 Firefox/3.5.3
</pre>
<p>Here&#8217;s what it looks like in my browser:</p>
<pre class="literal-block">
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20091021 Shiretoko/3.5.3
</pre>
<p>Note the common string component <tt class="docutils literal"><span class="pre">rv:1.9.1.3</span></tt>; this identifies the Gecko
release version. Since browsers based on the Gecko rendering engine get most of
their characteristics from that engine, you can in almost all cases simply
check the Gecko version instead of the Firefox version<sup><a class="footnote-reference" href="#id2" id="id1">1</a></sup><span class="fntarget" id="id2_target"></span>.</p>
<div class="fnwrap"><p class="footnote" id="id2"><a class="fn-backref" href="#id1">1</a> Ideally, you shouldn&#8217;t use the user-agent string at all to do browser
detection; <a class="reference external" href="https://developer.mozilla.org/en/Gecko_User_Agent_Strings">Mozilla says why, and what to do instead</a>. But often
user-agent checking is the pragmatic solution.</p></div>
<p>Doing so is really this simple:</p>
<div class="blockcode"><table class="highlighttable"><tr><td class="code"><div class="highlight"><pre><span style="color: #008000; font-weight: bold">function</span>(D){
    <span style="color: #008000; font-weight: bold">var</span> C<span style="color: #666666">=</span>D.match(<span style="color: #BB6688">/rv:(\d+\.\d+).*Gecko\//</span>);
    <span style="color: #008000; font-weight: bold">if</span>(C){
        <span style="color: #008000; font-weight: bold">var</span> B<span style="color: #666666">=</span>C[<span style="color: #666666">1</span>];
        <span style="color: #008000; font-weight: bold">return</span> <span style="color: #008000">parseFloat</span>(B)<span style="color: #666666">&gt;=1.9</span>
    }
}
</pre></div>
</td></tr></table></div><p>Since this change is a simple one, and since Gecko&#8217;s user-agent strings have
been standardized for a while, I hope that you can make it soon, and enable
support for the many users using browsers which would otherwise work flawlessly
with Typekit.</p>
<p>Thank you for providing an awesome and much-needed service.</p>
<p>Steven</p>
<p>PS: For users of unbranded Firefox versions who just want things to work now:
go to <tt class="docutils literal"><span class="pre">about:config</span></tt> and change the string
<tt class="docutils literal"><span class="pre">general.useragent.extra.firefox</span></tt> to <tt class="docutils literal"><span class="pre">Firefox/3.5.3</span></tt> (or your current
version). But remember to update it along with your browser.</p>
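<p>For anyone who wants to experiment with the Gecko-version check outside a browser, here is a rough Python sketch of the same idea. It is not Typekit&#8217;s code: the function names are made up, and the minimum version is an assumption, namely that &#64;font-face support arrived with Gecko 1.9.1 (the engine in Firefox 3.5). Comparing the version as a tuple of integers, rather than as a float, keeps <tt class="docutils literal"><span class="pre">rv:1.9.0</span></tt> and <tt class="docutils literal"><span class="pre">rv:1.9.1</span></tt> distinct.</p>

```python
import re

# Assumption: @font-face support arrived with Gecko 1.9.1 (Firefox 3.5).
MIN_GECKO = (1, 9, 1)

# Capture the dotted release version after "rv:" in a Gecko user-agent string.
RV_PATTERN = re.compile(r"rv:(\d+(?:\.\d+)*).*Gecko/")


def gecko_version(user_agent):
    """Return the Gecko release version as a tuple of ints, or None."""
    match = RV_PATTERN.search(user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def supports_font_face(user_agent):
    """True if the user agent reports a Gecko release of MIN_GECKO or later."""
    version = gecko_version(user_agent)
    return version is not None and version >= MIN_GECKO


# The two user-agent strings quoted in the letter.
FIREFOX = ("Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) "
           "Gecko/20091021 Firefox/3.5.3")
SHIRETOKO = ("Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) "
             "Gecko/20091021 Shiretoko/3.5.3")
# Hypothetical example of an older, Gecko 1.9.0-based browser.
OLDER = ("Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.5) "
         "Gecko/2008121622 Firefox/3.0.5")

for ua in (FIREFOX, SHIRETOKO, OLDER):
    print(gecko_version(ua), supports_font_face(ua))
```

<p>Both strings from the letter yield the same Gecko version regardless of the browser name at the end, which is the whole point of checking <tt class="docutils literal"><span class="pre">rv:</span></tt> instead.</p>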

    ]]>
    </summary>
  </entry>
  <entry>
    <title>Touchy-Feely: a TouchBook teardown</title>
    <link href="http://strobe.cc/touchy_feely/" rel="alternate"></link>
    <updated>2009-09-23T00:00:00Z</updated>
    <id>http://strobe.cc/touchy_feely/</id>
    <author><name>Steven Robertson</name></author>
    <summary type="html"><![CDATA[
    <p>The <a class="reference external" href="http://alwaysinnovating.com/">Always Innovating TouchBook</a> is a touch-sensitive tablet/netbook
device, powered by a TI OMAP-3530 processor and running Linux. They&#8217;re
currently in very limited supply, although many pre-orders should be shipping
soon, according to the company.</p>
<p>I was lucky enough to get one of the earlier models, and decided to tear it
apart for everyone&#8217;s edification. The whole is, as usual, greater than the sum
of the parts (viz, <em>it works</em>), but having a bunch of parts strewn around your
house looks really impressive to non-technical friends.</p>
<p>There&#8217;s a <a class="reference external" href="http://www.flickr.com/photos/strobe_cc/sets/72157622439090430/">Flickr set</a> with more photos.</p>
<div class="section" id="uncasing-it">
<h2>Uncasing it</h2>
<p>I&#8217;m assuming that if you intend to tear this thing to pieces, you&#8217;ve
already figured out how to get the back cover off. You&#8217;ll see something
like this:</p>
<a class="reference external image-reference" href="http://www.flickr.com/photos/strobe_cc/3947916630/"><img alt="Together again" src="http://farm4.static.flickr.com/3107/3947916630_241459ec94.jpg" /></a>
<p>Before you go any further, pop off the battery&#8217;s tiny connector to the main
PCB to avoid damaging the device or yourself, and pop off the speaker
connector at the bottom of the board as well. The battery is glued to a metal
shield, which is held on at four points by screws.  You have two choices for
removing it, and both kind of suck.</p>
<p><strong>Choice 1:</strong> You can very carefully insert a putty knife or other thin, dull
piece of metal under the top left corner of the battery, and slowly work it
downwards, cutting the glue. Go slowly as you get towards the middle of the
device. The glue stops, and if you go at it with the putty knife you might hit
the small, delicate cable located below the bottom half of the battery (see
below).</p>
<p>This will, of course, permanently weaken the stickiness of the glue; you&#8217;ll
have to re-glue the device when you&#8217;re done with it.  There&#8217;s a less-expected
downside to this method: the battery is not reinforced, and is therefore
pretty flexible. You will almost certainly bend the battery while doing this,
and could damage or destroy it. (Remember, many of the exploding
cell-phone/iPod incidents were blamed on damaged batteries, and most of those
are a <em>lot</em> smaller than this one.) Here&#8217;s mine after the removal:</p>
<a class="reference external image-reference" href="http://www.flickr.com/photos/strobe_cc/3947079383/"><img alt="Battery" src="http://farm4.static.flickr.com/3443/3947079383_ca180f9519.jpg" /></a>
<p><strong>Choice 2:</strong> remove the battery&#8217;s shield with the battery still attached to
it. The shield is secured to the case at four points, as seen below (with the
battery already unstuck via the first method).</p>
<a class="reference external image-reference" href="http://www.flickr.com/photos/strobe_cc/3947079251/"><img alt="Battery shield" src="http://farm4.static.flickr.com/3462/3947079251_c8c6b755a3.jpg" /></a>
<p>When you&#8217;ve got all four screws off, <em>very gently</em> lift the battery towards
the bottom of the unit, angling slightly upward and feeding the black cable
through, as the cable is still attached to the display underneath with a very
thin connector. When you have enough clearance, detach the display cable from
underneath. This is what the cable looks like when attached:</p>
<a class="reference external image-reference" href="http://www.flickr.com/photos/strobe_cc/3947136897/"><img alt="Display connector" src="http://farm4.static.flickr.com/3519/3947136897_34c516bdac.jpg" /></a>
<p>Whether you want to break the glue or try <em>not</em> to break the cable is up to
you. I went the first route, as I didn&#8217;t know what was underneath the battery
when I started digging, but I may have tried the second method if I had.</p>
<p>Once the battery&#8217;s clear, using either method, you can unscrew the main PCB;
two screws anchor it at the top of the unit, and one at the bottom. Rotate the
board clockwise slightly, moving the top left away from the case; then lift
the top of the board slowly until it is free.  There&#8217;s one last cable
anchoring the board to the case. It&#8217;s attached close to the left side of the
PCB; you&#8217;ll need to rotate it like it&#8217;s hinged on the outside left of the case
to see it without damaging it.</p>
<a class="reference external image-reference" href="http://www.flickr.com/photos/strobe_cc/3947137157/"><img alt="Panel power (attached)" src="http://farm3.static.flickr.com/2561/3947137157_977c0d5721.jpg" /></a>
<p>There&#8217;s no trick to this cable; just pull, gently but firmly, until it
releases from the socket. (Reassembling works the same way, although a pair of
pliers may help in applying enough force to the cardboard backing the contacts
to get it to properly mate in such a small space.)</p>
<p>The rest of the tear-down is pretty obvious. Take a look at the Flickr set
linked above if you&#8217;d like to see the results.</p>
<p>Enjoy your TouchBooks!</p>
</div>

    ]]>
    </summary>
  </entry>
  <entry>
    <title>Can androids render Electric Sheep?</title>
    <link href="http://strobe.cc/do_androids_render/" rel="alternate"></link>
    <updated>2009-09-21T00:00:00Z</updated>
    <id>http://strobe.cc/do_androids_render/</id>
    <author><name>Steven Robertson</name></author>
    <summary type="html"><![CDATA[
    
<p>The <a class="reference external" href="http://electricsheep.org">Electric Sheep</a> screensaver combines a genetic algorithm, an iterated
function system, and some postprocessing to create one of the most mesmerizing
visualizations I&#8217;ve ever seen. Over the summer of 2008, I looked into creating
a version of the rendering library of Electric Sheep, a library known as
<a class="reference external" href="http://flam3.com">flam3</a>, which would use NVIDIA GPUs to accelerate its transforms.</p>
<p>The algorithm is pretty straightforward. In broad terms, a fractal flame, or
simply <em>sheep</em>, consists of a set of equations, each having as input a
coordinate <span class="raw-math"><img src="http://strobe.cc/do_androids_render.eqnb26ac7906215dae6d92a66ead02921e0.gif" alt="(equation)" class="eqn" /></span> and as output a coordinate <span class="raw-math"><img src="http://strobe.cc/do_androids_render.eqn72a7a162ec5958c154ce6a5ed58cea84.gif" alt="(equation)" class="eqn" /></span>.  Start with a random point. Plug this point&#8217;s coordinates into one
randomly chosen pair of these equations, and plot the result.  Repeat.
When it&#8217;s done, you get a fractal image. There&#8217;s more to it than that, of
course - here&#8217;s <a class="reference external" href="http://flam3.com/flame.pdf">Scott Draves&#8217; paper</a> on the subject if you want the
details.</p>
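<p>The loop described above can be sketched in a few lines of Python. This is only an illustration, not flam3 code: the three affine maps in <tt class="docutils literal"><span class="pre">TRANSFORMS</span></tt> are a hypothetical stand-in (they draw a Sierpinski triangle), and the grid size, seed, and 20-point warm-up are arbitrary choices.</p>

```python
import random

# Hypothetical example system: three affine maps (a Sierpinski triangle),
# standing in for the much richer function set flam3 supports.
TRANSFORMS = [
    lambda x, y: (0.5 * x, 0.5 * y),
    lambda x, y: (0.5 * x + 0.5, 0.5 * y),
    lambda x, y: (0.5 * x + 0.25, 0.5 * y + 0.5),
]

def chaos_game(iterations, width=64, height=64, seed=0):
    """Return a height x width grid of hit counts."""
    rng = random.Random(seed)
    grid = [[0] * width for _ in range(height)]
    x, y = rng.random(), rng.random()  # start with a random point
    for i in range(iterations):
        # Pick one pair of equations at random and apply it.
        x, y = rng.choice(TRANSFORMS)(x, y)
        if i >= 20:  # skip the first few points while the orbit settles
            col = min(int(x * width), width - 1)
            row = min(int(y * height), height - 1)
            grid[row][col] += 1  # "plot" by counting hits per pixel
    return grid

grid = chaos_game(10000)
print(sum(map(sum, grid)))  # total number of plotted points
```

<p>Every iteration depends on the previous point, so a single orbit is inherently serial. The parallelism has to come from running many independent orbits at once.</p>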
<p>At first glance, this maps relatively well to the idea of a massively parallel
processor like a GPU.  Instead of running one point at a time, just run
hundreds! Unfortunately, as you might have guessed, it ain&#8217;t so simple.</p>
<p>Part of the problem in getting the transforms over to the GPU had to do with
NVIDIA&#8217;s CUDA SDK. Normally, CUDA is supposed to be written in a C-like
language, which gets compiled to an assembly language equivalent.  The
assembly - written in NVIDIA&#8217;s own PTX language - then gets some
register-allocation optimizations before winding up as machine code. The PTX
emitted by the first implementation of the transform in CUDA&#8217;s C dialect did
not agree with the register allocator, and the resulting code used more than
50 registers after allocation.</p>
<p>50 registers is really bad. In CUDA, each processor unit handles hundreds of
threads, and attempts to hide latency by switching between threads rapidly.
Context-switching has a low overhead because all registers are allocated at
the start of a thread and remain allocated throughout the thread&#8217;s lifetime.
In first-generation GPGPUs, maximizing the occupancy of each processor (and
therefore hiding the most latency) required kernels which used a mere 10
registers. Needless to say, this code performed very poorly (for this and
other reasons).  After fighting with the C code for a while, I gave up on the
compiler and decided to code the entire transform kernel in assembly.</p>
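<p>A rough way to see why 50 registers hurts is to estimate how many threads fit on a multiprocessor from the register file alone. The sketch below is a simplification that ignores shared memory and block-granularity limits. The default figures (8,192 registers and 768 threads per multiprocessor) are my assumption for first-generation hardware; the function names are made up for this example.</p>

```python
WARP_SIZE = 32

def resident_threads(regs_per_thread, regs_per_sm, max_threads_per_sm):
    """Estimate how many threads fit on one multiprocessor, by registers alone."""
    by_registers = regs_per_sm // regs_per_thread
    # Threads are scheduled in whole warps, so round down to a multiple of 32.
    by_registers -= by_registers % WARP_SIZE
    return min(by_registers, max_threads_per_sm)

def occupancy(regs_per_thread, regs_per_sm=8192, max_threads_per_sm=768):
    """Fraction of the multiprocessor's thread slots that can be filled."""
    fit = resident_threads(regs_per_thread, regs_per_sm, max_threads_per_sm)
    return fit / max_threads_per_sm

for regs in (10, 50):
    print(regs, "registers:", round(100 * occupancy(regs)), "% occupancy")
```

<p>Under these assumptions a 10-register kernel fills every thread slot, while a 50-register kernel leaves room for only 160 threads, about a fifth of the slots available for hiding latency.</p>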
<p>It helps that PTX is actually a rather nice assembly language, as far as they
go, but it still turned out to be a pretty significant task, taking me the
better part of a season to even get to a workable state. Assembly can be
challenging for large applications, but the time and effort had more to do
with the unusual memory architecture of NVIDIA GPUs than the programming
language in use.</p>
<div class="section" id="hardware-interpolation">
<h2>Hardware interpolation</h2>
<p>Some essential background information is necessary for the next sections to
make sense; most of this is detailed in NVIDIA&#8217;s <a class="reference external" href="http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf">CUDA Programming Guide</a>
[pdf].  A lot of this section is well-founded speculation.</p>
<p>Computation on the GPU is handled by <em>multiprocessors</em>. Each multiprocessor on
a current-gen (GTX 200 series) GPU is capable of tracking 1024 threads
in-flight. These threads are grouped into <em>warps</em> of 32 threads; each of these
warps has a single instruction pointer, meaning that the 32 threads must all
execute the same instructions on different data. (Instructions can be
predicated, meaning the results of those computations are not stored, so
branches where some threads of the warp follow a different path can be
simulated by following one path with some threads &quot;turned off,&quot; then the other
path with the complement threads disabled.  Highly-divergent paths are an
absolute disaster, performance-wise.)</p>
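<p>To make the cost of divergence concrete, here is a toy model of one warp executing an if/else with predication. It is a sketch of the idea only, not of how the hardware is built; <tt class="docutils literal"><span class="pre">run_warp</span></tt> and its arguments are hypothetical names.</p>

```python
def run_warp(data, condition, then_op, else_op):
    """Simulate one warp executing an if/else with predication."""
    mask = [condition(v) for v in data]  # per-thread predicate
    passes = 0
    out = list(data)
    if any(mask):
        # Path A: threads whose predicate is false sit idle.
        passes += 1
        out = [then_op(v) if m else o for v, m, o in zip(data, mask, out)]
    if not all(mask):
        # Path B: the complement threads run, the others sit idle.
        passes += 1
        out = [else_op(v) if not m else o for v, m, o in zip(data, mask, out)]
    return out, passes

data = list(range(32))
_, uniform_passes = run_warp(data, lambda v: v >= 0, lambda v: v * 2, lambda v: v + 1)
_, divergent_passes = run_warp(data, lambda v: v % 2 == 0, lambda v: v * 2, lambda v: v + 1)
print(uniform_passes, divergent_passes)
```

<p>When every thread agrees on the branch, the warp makes one pass. When they disagree, it walks both paths, and each extra level of divergence multiplies the wasted work.</p>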
<p>Each multiprocessor will step through a warp in two &quot;front-end&quot; clock cycles.
(The front-end faces the multiprocessor bus, which interfaces with the memory
controller; the back-end contains the ALU and is double-clocked.) During each
of these clock cycles, threads being executed can issue a memory access
request. If the memory access meets certain criteria, the front-end bundles
the requests into a single transaction and ships it off to the memory
controller; if not, the requests are transmitted individually.</p>
<p>Main memory on an NVIDIA GPU is accessed by a controller that is shared by all
processors.  Memory requests are sent down a very wide bus to the controller,
which queues them. The controller interleaves storage across all RAM on the
board in order to increase bandwidth.  When processing requests for contiguous
blocks of data, this has a significant effect on performance; the latency cost
of bringing 14 separate RAM chips (on my GPU) to the same address is less
significant than being able to complete your transaction in a few clock
cycles.  Unfortunately, it means that the latency cost for accessing a single
byte of data is at worst case 14 times higher than for non-interleaved memory
architectures. I&#8217;m not sure if the controller can process requests for
non-contiguous memory simultaneously if the regions are located on separate
chips, but given the dire warnings against non-contiguous memory access
scattered throughout CUDA documentation, I doubt it.</p>
<p>These latencies are compounded by the lack of any real cache.  The decision to
omit a cache from each multiprocessor is the right one; for most operations, a
small cache would just miss anyway, and a large one would be absurdly expensive.
Worse still, keeping cache coherent between 30 processors would require
jaw-dropping complexity. To sort of make up for it, NVIDIA places 16K of
(usually) penalty-free RAM on each multiprocessor, which must be shared by all
active threads. While it sounds like it could be used as a manually-managed
cache, the size of the shared memory is too small to make that effective; at
full capacity, you get a mere 16 bytes per thread. This memory is the only way
to communicate with the other threads active on a particular multiprocessor,
so it is more often dedicated to coordination instead of cache.</p>
<p>There is also a shared 8K cache per multiprocessor which shadows a region of
memory only writable by the host system; this <em>constant memory</em> is useful in
providing constants (transform coefficients, pointers to buffers, etc) to a
kernel, but cannot be used for coordination as it is not guaranteed to be
consistent if memory is changed by the host after the kernel is started.</p>
<p>These hardware oddities can be summed up in three simple design rules for CUDA
applications:</p>
<ul class="simple">
<li>Split your tasks into groups of 32 threads.</li>
<li>Make these groups branch as little as possible.</li>
<li>Access memory infrequently and contiguously.</li>
</ul>
</div>
<div class="section" id="porting-flam3">
<h2>Porting flam3</h2>
<p>These three goals may seem trivial, but they can be Zen-like in their
elusiveness. Some unexpectedly difficult parts of the port are outlined below,
along with solutions, in no particular order. I&#8217;ll continue to update this as
I learn more.</p>
<div class="section" id="random-number-generation">
<h3>Random number generation</h3>
<p>The iterated function system at the heart of the fractal flame algorithm
depends heavily on high-quality random numbers. The current implementation
uses the ISAAC PRNG.  However, a reasonable ISAAC implementation requires a minimum
state of 144 bytes per random context, and ISAAC produces random numbers using
a procedure that must operate in serial over the context. At a minimum, using
atomic transactions to block the execution of any other thread on the chip,
and keeping one random context per warp, the context alone would consume 4,608
bytes.  This blows through the entire allocation of shared memory for a
256-thread block without storing a shred of data about the IFS itself.</p>
<p>Other random number generators, such as the popular Mersenne Twister,
could have been selected; these generators would have to be called separately
to generate and store a large block of random numbers, which would then be
read in as needed by each warp. This solution may be the best in terms of
quality of generated numbers, but the coordination required to get this working
was deemed too expensive for a first effort.</p>
<p>Ultimately, I decided to go with an <a class="reference external" href="http://en.wikipedia.org/wiki/Multiply-with-carry">MWC</a> algorithm, using different values of
<tt class="docutils literal"><span class="pre">a</span></tt> for each thread selected from a pregenerated table distributed with my
build. Given the extensive list of compromises already made in porting flam3
to the GPU, I doubt that an MWC algorithm would make a significant impact on
the results of the computation. This method was implemented using two
persistent registers per thread, and required no shared memory.  I will
reconsider this decision after I see CUDA 3.0&#8217;s memory hierarchy, which
arrives alongside the GTX 300 series sometime soon.</p>
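<p>For reference, a lag-1 multiply-with-carry generator looks roughly like the Python sketch below. This is not the PTX I wrote. The values in <tt class="docutils literal"><span class="pre">MULTIPLIERS</span></tt> are unvetted placeholders rather than entries from my table, and the seeding scheme is an assumption for the example.</p>

```python
BASE = 2 ** 32

# Placeholder multipliers, NOT vetted: a real table would be pregenerated
# offline so that every value of a gives a long period.
MULTIPLIERS = [4294957665, 4294963023, 4162943475]

class MWC:
    """Lag-1 multiply-with-carry generator: two words of state per thread."""

    def __init__(self, a, seed, carry=1):
        self.a = a            # per-thread multiplier (a constant, not state)
        self.x = seed % BASE  # state word 1: current value
        self.c = carry        # state word 2: current carry

    def next_u32(self):
        t = self.a * self.x + self.c
        self.x = t % BASE     # low 32 bits become the new value
        self.c = t // BASE    # high bits become the new carry
        return self.x

    def next_float(self):
        return self.next_u32() / BASE  # uniform in [0, 1)

# One generator per simulated thread, each with its own multiplier and seed.
threads = [MWC(MULTIPLIERS[i % len(MULTIPLIERS)], seed=12345 + i) for i in range(4)]
print([round(t.next_float(), 3) for t in threads])
```

<p>The only values that change between calls are <tt class="docutils literal"><span class="pre">x</span></tt> and <tt class="docutils literal"><span class="pre">c</span></tt>, which is why two persistent registers per thread are enough.</p>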
</div>
<div class="section" id="transform-data-structure">
<h3>Transform data structure</h3>
<p>In flam3, a single (x, y) pair of functions in the IFS is created from a
series of fixed-form functions that operate on the input coordinate pair. The
outputs of each of these are summed together with weights applied. The
transform description includes the list of functions to run, information about
the weights of the functions, and for many functions, coefficients or other
parameters. On the CPU, these transforms are stored as arrays; the
coefficients for every transform are present and simply ignored if the
transform is not in the list. These arrays take around 8K per transform, and
there can be many transforms in a particular sheep. Even using constant
memory, this creates an unacceptable amount of memory usage and traffic for
the embedded architecture.</p>
<p>Instead of storing the transforms in this format, the CPU reads the entire set
of information about the transforms and packs it into a stream.  The stream is
then read in sequence on the GPU; each transform is executed in order,
and its information is consumed from the front of the stream as the transform is processed.
This cut most transforms to 200 bytes, allowing the entire set of transforms
to fit in the constant memory cache and allowing multiple transforms to be run
at one time.  It&#8217;s obvious in retrospect, but it took me a while to see this
strategy.</p>
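<p>Here is a small Python sketch of the packing idea, using the <tt class="docutils literal"><span class="pre">struct</span></tt> module. The record layout (a count, then a variation id, a weight, and only the parameters that variation needs) and the <tt class="docutils literal"><span class="pre">VARIATIONS</span></tt> table are assumptions made for illustration; the real format differs.</p>

```python
import struct

# Hypothetical variation table: id mapped to (name, number of extra parameters).
VARIATIONS = {0: ("linear", 0), 1: ("sinusoidal", 0), 2: ("blob", 3)}

def pack_transform(active):
    """active: list of (variation_id, weight, params). Unused variations cost nothing."""
    out = struct.pack("=I", len(active))
    for var_id, weight, params in active:
        assert len(params) == VARIATIONS[var_id][1]
        out += struct.pack("=If", var_id, weight)
        for p in params:
            out += struct.pack("=f", p)
    return out

def read_transforms(stream, count):
    """Walk the stream front to back, the way a kernel would consume it."""
    pos, result = 0, []
    for _ in range(count):
        (n,) = struct.unpack_from("=I", stream, pos)
        pos += 4
        active = []
        for _ in range(n):
            var_id, weight = struct.unpack_from("=If", stream, pos)
            pos += 8
            n_params = VARIATIONS[var_id][1]
            params = [struct.unpack_from("=f", stream, pos + 4 * k)[0]
                      for k in range(n_params)]
            pos += 4 * n_params
            active.append((var_id, weight, params))
        result.append(active)
    return result

xform = [(0, 0.5, []), (2, 0.5, [0.25, 1.0, 2.0])]
stream = pack_transform(xform) + pack_transform([(1, 1.0, [])])
print(len(stream), read_transforms(stream, 2))
```

<p>Because the reader only ever moves forward, every thread in a warp touches the same address at the same time, which is the access pattern the constant cache handles best.</p>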
</div>
<div class="section" id="trigonometric-functions">
<h3>Trigonometric functions</h3>
<p>Many processors don&#8217;t have trig primitives implemented as instructions. If you
find yourself in an assembly language and need an arctangent, compute a Taylor
series for the function of interest. Be careful to measure the divergence of
the series from the function being modeled; for things like a tangent, you may
have to clamp the input values to a certain range depending on the size and
precision of your series.</p>
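<p>As an example of measuring that divergence, the sketch below evaluates the arctangent series in Python and compares it against <tt class="docutils literal"><span class="pre">math.atan</span></tt>. Instead of clamping, it folds large inputs back into range with the identity arctan(x) = pi/2 - arctan(1/x), which is a substitution on my part. The term counts and sample range are arbitrary.</p>

```python
import math

def atan_taylor(x, terms=10):
    """arctan(x) from its Taylor series: x - x^3/3 + x^5/5 - ..."""
    # The series only converges for |x| up to 1, so fold larger inputs
    # back into range with arctan(x) = pi/2 - arctan(1/x).
    if abs(x) > 1.0:
        return math.copysign(math.pi / 2, x) - atan_taylor(1.0 / x, terms)
    total, power = 0.0, x
    for k in range(terms):
        total += (-1) ** k * power / (2 * k + 1)
        power *= x * x
    return total

def max_error(terms, samples=200, limit=4.0):
    """Largest absolute error against math.atan over evenly spaced inputs."""
    xs = [-limit + 2 * limit * i / (samples - 1) for i in range(samples)]
    return max(abs(atan_taylor(x, terms) - math.atan(x)) for x in xs)

for terms in (5, 10, 40):
    print(terms, "terms: max error", max_error(terms))
```

<p>The error is concentrated near |x| = 1, where the series converges slowly. Adding terms helps only gradually there, so a production version would want a tighter range reduction or a different approximation.</p>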
</div>
<div class="section" id="storing-transform-output">
<h3>Storing transform output</h3>
<p>The brightness of fractal flame images is computed as the log of the density
of the points in a sample region. Hence, the bright spots on an image
correspond to a set of counters in memory which have been written to hundreds
or even thousands of times during the course of a render.  Combine a high
write density to a particular region of memory, a massive thread count, and
enormous delays between reading a memory location, adding to it, and writing
the result, and you get one of two things: either a result which can be off by
an order of magnitude, or a desperate need for atomic transactions.</p>
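<p>The lost-update problem is easy to reproduce in miniature. The Python sketch below is a deterministic model, not a real threaded program: it shows the worst case, where every thread reads the counter before any thread writes it back. The function name is made up for the example.</p>

```python
def accumulate(n_threads, atomic):
    """All threads try to add 1 to the same counter."""
    counter = 0
    if atomic:
        for _ in range(n_threads):
            counter += 1        # read, add and write as one indivisible step
    else:
        # Worst case: every thread reads before any thread writes back.
        snapshots = [counter for _ in range(n_threads)]
        for seen in snapshots:
            counter = seen + 1  # each write clobbers the previous one
    return counter

print(accumulate(256, atomic=True), accumulate(256, atomic=False))
```

<p>With long memory latencies the window between the read and the write is wide, so the hottest pixels, which are exactly the ones that matter most for brightness, lose the most updates.</p>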
<p>A previous approach to this problem involved the latter: simply do every
operation on the framebuffer using atomic intrinsics.  Unfortunately, this
slaughtered performance. Not only was the memory interface crippled by
millions of separate I/O requests, the majority of them suddenly became
atomic - stalling most of the execution units on the chips. I don&#8217;t have hard
numbers for the performance hit yet, but theoretical calculations suggested that
the code achieved less than 2% of its expected throughput with atomic writes
enabled.</p>
<p>I&#8217;ve thought of a different approach - a considerably more complex one, to be
sure, but also one that is much more suited to the GPU&#8217;s memory model, and
will almost certainly yield great gains in practical rendering performance. It
may even be possible to achieve real-time performance at near-HD resolutions
on GTX 300 cards, opening the door to a new class of sheep visualizations
taking their input from real-time data. It will take some time to implement
it, but I plan to restart my work as a hobby project over the next few months,
and get things ready to test when the new GPUs roll out.</p>
<p>The algorithm addresses the key aspects of the memory system: small working
set, high latency, no externally-controlled consistency, and efficiency gains
with contiguous requests. It is conceptually simple: instead of writing the
results of the computation - a coordinate pair and a color value - directly to
the frame buffer, pack them into a 32-bit int and write them into a log. When
the log is full, hand it off to another thread. This thread will read the log
and split it, moving each entry into one of 32 smaller logs, each
corresponding to one of 32 subdivisions of the image. Another thread will take these
logs when they get full, further dividing them until each log corresponds to
an image area of 256 pixels. 256 four-channel pixels, at four bytes per channel,
gives a total memory size of 4,096 bytes. This is small enough to fit three
256-wide thread blocks onto a multiprocessor at a time, ensuring 75% occupancy
of each GPU multiprocessor, giving the device enough threads to work
effectively without stalling for memory.</p>
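<p>A minimal Python sketch of the first split, plus the memory budget from the paragraph above, might look like this. The bit layout (12 bits of x, 12 bits of y, 8 bits of color index) and the choice of horizontal strips as the 32 subdivisions are assumptions for illustration; I haven&#8217;t settled on either.</p>

```python
# Assumed bit layout: 12 bits of x, 12 bits of y, 8 bits of color index,
# most significant first. Arithmetic is used in place of shifts and masks.
def pack(x, y, color):
    return (x * 4096 + y) * 256 + color

def unpack(entry):
    return entry // (4096 * 256), entry // 256 % 4096, entry % 256

def split(log, height, parts=32):
    """Distribute one full log into `parts` smaller logs, one per horizontal strip."""
    logs = [[] for _ in range(parts)]
    for entry in log:
        _, y, _ = unpack(entry)
        logs[y * parts // height].append(entry)
    return logs

log = [pack(10, 0, 1), pack(500, 1023, 2), pack(7, 512, 3)]
strips = split(log, height=1024)
print([len(s) for s in strips])

# Budget check for the final stage.
tile_bytes = 256 * 4 * 4  # pixels * channels * bytes per channel
blocks = 3
print(tile_bytes, blocks * tile_bytes, blocks * 256 / 1024)
```

<p>Each pass only reads one log sequentially and appends to a handful of others, so the scattered writes to the frame buffer are deferred until a tile is small enough to live entirely in shared memory.</p>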
<p>Here&#8217;s the gotcha: there&#8217;s no central coordination mechanism on NVIDIA GPUs.
While they run, the threads in a block can communicate with each other
using shared memory, and with other blocks using main memory,
and that&#8217;s it. So, while this simple strategy takes the memory delay from
<span class="raw-math"><img src="http://strobe.cc/do_androids_render.eqne7a2f022962441f2be6dc8e70e837b4a.gif" alt="(equation)" class="eqn" /></span> to <span class="raw-math"><img src="http://strobe.cc/do_androids_render.eqnfa65ace72709f92a1411c3dfc767a29d.gif" alt="(equation)" class="eqn" /></span>, it also involves writing a
memory allocator and threading library from scratch, in assembly, without a
debugger and using nothing but a few atomic intrinsics. It&#8217;s not an impossible
task by any means, but it <em>is</em> a significant challenge. I&#8217;ll have more details
on this process as I go about it.</p>
</div>
</div>
<div class="section" id="moving-forward">
<h2>Moving forward</h2>
<p>I&#8217;d like to finish this up sometime. I&#8217;ll be investing in a new system to
coincide with the upcoming release of new graphics cards sometime in the
future, and I&#8217;ll also be performing a lot of heavy lifting on GPUs and DSPs as
I work on my thesis. I will investigate OpenCL as well; there&#8217;s no doubt it
would be easier to code this thing in a high-level language<sup><a class="footnote-reference" href="#id2" id="id1">1</a></sup><span class="fntarget" id="id2_target"></span>.</p>
<div class="fnwrap"><p class="footnote" id="id2"><a class="fn-backref" href="#id1">1</a> God help me, I just called a C-based language &quot;high-level&quot;.</p></div>
<p>I&#8217;ll keep you posted.</p>
</div>

    ]]>
    </summary>
  </entry>
</feed>
