Age | Commit message | Author |
|
In spu_tri.c:setup_sort_vertices(), triangles are culled after the
vertices are sorted. This patch moves the check a little earlier
and makes the check itself a little faster through intrinsics and
a little trickery.
Code size is reduced, and less work is done before a triangle is
deemed OK to skip.
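For reference, the underlying test is the usual signed-area check; a
scalar sketch of the idea (the patch does this with SPU vector
intrinsics, and names like v0 and sign here are illustrative):

    /* signed area of the triangle; its sign gives the winding */
    const float dx1 = v1->x - v0->x, dy1 = v1->y - v0->y;
    const float dx2 = v2->x - v0->x, dy2 = v2->y - v0->y;
    const float area = dx1 * dy2 - dx2 * dy1;

    if (area * sign <= 0.0f)
       return FALSE;   /* wrong facing or degenerate: cull */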
|
|
It was taking approximately 50 cycles to extract the vertex indices,
calculate the vertex_header pointers and call tri_draw() for each
set of three vertices.
Unrolled, it takes less than 100 cycles to extract, unpack,
calculate pointers and call tri_draw() eight times. It does have a
nasty jump-tabled switch; I'm sure that there's a better way...
Code size of spu_render.o gets larger due to the extra constants and
work in the inner loop, the extra stack saves and loads (more
registers are in use), and an assert. spu_tri.o gets a little
smaller.
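Roughly, the old per-triangle path looked like this (a sketch;
vertex_ptr() is a hypothetical stand-in for the pointer calculation,
and the unrolled version does eight of these per loop iteration):

    vector unsigned short idx = *(const vector unsigned short *) ptr;
    tri_draw(vertex_ptr(spu_extract(idx, 0)),
             vertex_ptr(spu_extract(idx, 1)),
             vertex_ptr(spu_extract(idx, 2)));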
|
|
The debug functions depend on several util functions for OS abstractions, and
these depend on the debug functions, so a separate module is not possible.
|
|
Suggested by Jonathan Adamczewski. There may be more places to do this...
|
|
|
|
|
|
|
|
|
|
Without the f, the constant is treated as a double, resulting in
slower arithmetic and libgcc conversion calls each time CEILF()
is used.
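A minimal illustration (the macro bodies here are hypothetical and
simplified; the point is the type of the constant):

    /* 0.99999 is a double: the argument is widened, the math is done
     * in double precision, and the SPU calls libgcc helpers to
     * convert back and forth. */
    #define CEILF_SLOW(x) ((float) (int) ((x) + 0.99999))

    /* 0.99999f keeps the whole expression in single precision. */
    #define CEILF(x)      ((float) (int) ((x) + 0.99999f))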
|
|
|
|
Replace cell_batch_{align,alloc}*() with cell_batch_alloc16(), allocating
multiples of 16 bytes that are 16-byte aligned.
Opcodes are stored in the preferred slot of an SPU machine word.
Various structures are explicitly padded to multiples of 16 bytes.
Added STATIC_ASSERT().
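A sketch of the usual C trick for such a macro (the in-tree version
may differ; EXPR must be a compile-time constant):

    /* breaks the build when EXPR is false: a false EXPR produces a
     * negative array size, which is a compile error */
    #define STATIC_ASSERT(EXPR) \
       do { (void) sizeof(char [(EXPR) ? 1 : -1]); } while (0)

    /* e.g. guard the 16-byte padding assumption: */
    STATIC_ASSERT(sizeof(struct cell_command_render) % 16 == 0);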
|
|
Put setup.v{min,mid,max,provoke} into a union with qword vertex_headers.
Rewrite vertex sorting to more efficiently handle the packed data items.
Reduces spu_tri.o by ~128 bytes.
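The idea, sketched (on the SPU, four 32-bit pointers fill exactly one
16-byte qword; names besides v{min,mid,max,provoke} are illustrative):

    union {
       struct {
          struct vertex_header *vmin, *vmid, *vmax, *vprovoke;
       } ptrs;
       qword vertex_headers;   /* the same 16 bytes as one register */
    } v;

    /* sorting can now permute all four pointers at once, e.g. */
    v.vertex_headers = si_shufb(v.vertex_headers, v.vertex_headers,
                                pattern);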
|
|
Put edge.{dx,dy} into a union with a vector and perform subtractions in
setup_sort_vertices() on vectors.
Reduces spu_tri.o by ~300 bytes.
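A sketch of the same trick for the edges (names illustrative): with
dx/dy overlaid on a vector float, one spu_sub() replaces a pair of
scalar subtractions.

    struct edge {
       union {
          struct { float dx, dy; } d;   /* scalar view */
          vector float v;               /* vector view */
       } u;
       /* ... */
    };

    /* { vmax->x, vmax->y, ?, ? } - { vmin->x, vmin->y, ?, ? } */
    setup.emaj.u.v = spu_sub(max_xy, min_xy);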
|
|
Replace int setup.span{left,right}[2] with vec_uint4 setup.span.quad.
SIMDize calculate_mask() and inline it into flush_spans().
Set setup.span.quad members using spu_shuffle() or spu_sel().
Reduces spu_tri.o by ~116 bytes.
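For instance (illustrative names), spu_sel() merges new bounds into
the packed span in one op, taking bits from its second operand
wherever the mask bit is set:

    setup.span.quad = spu_sel(setup.span.quad, new_bounds, mask);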
|
|
Facilitates creation of shuffle patterns for use with spu_shuffle()
and si_shufb() intrinsics.
To be used by subsequent patches.
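A sketch of what such a helper can look like (the actual in-tree
macros may differ): word selectors 0-3 pick from the first operand of
spu_shuffle(), 4-7 from the second.

    #define SHUF_WORD(n)  4*(n), 4*(n)+1, 4*(n)+2, 4*(n)+3

    #define SHUFFLE4(a, b, c, d) \
       ((vector unsigned char) { SHUF_WORD(a), SHUF_WORD(b), \
                                 SHUF_WORD(c), SHUF_WORD(d) })

    /* e.g. swap the two halves of a vector: */
    v = spu_shuffle(v, v, SHUFFLE4(2, 3, 0, 1));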
|
|
This is a set of changes that optimizes the memory use of fragment
operation programs (by using and transmitting only as much memory as is
needed for the fragment ops programs, instead of maximal sizes) and
eliminates the dependency on hard-coded maximal program sizes. For state
that is not dependent on fragment facing (i.e. that isn't using
two-sided stenciling), only a single fragment operation program is
saved and transmitted, instead of two identical programs.
- Added the ability to emit a LNOP (No Operation (Load)) instruction.
This is used to pad the generated fragment operations programs to
a multiple of 8 bytes, which is necessary for proper operation of
the dual instruction pipeline, and also required for proper SPU-side
decoding.
- Added the ability to allocate and manage a variant-length
struct cell_command_fragment_ops. This structure now puts the
generated function field at the end, where it can be as large
as necessary (see the sketch after this list).
- On the PPU side, we now combine the generated front-facing and
back-facing code into a single variant-length buffer (and only use one
if the two sets of code are identical) for transmission to the SPU.
- On the SPU side, we pull the correct sizes out of the buffer,
allocate a new code buffer if the one we have isn't large enough,
and save the code to that buffer. The buffer is deallocated when
the SPU exits.
- Commented out the emit_fetch() static function, which was not being used.
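A sketch of the variant-length layout (only the struct tag and the
trailing code field come from the description above; the other fields
are illustrative): putting the generated code last lets the command
be allocated at exactly the size it needs.

    struct cell_command_fragment_ops {
       uint64_t opcode;             /* command identifier            */
       unsigned total_code_size;    /* bytes of generated SPU code   */
       unsigned front_code_index;   /* where front-/back-facing code */
       unsigned back_code_index;    /*   starts inside code[]        */
       unsigned code[0];            /* variant-length: front + back  */
    };

    /* allocated as sizeof(struct cell_command_fragment_ops) plus the
     * generated code size, padded to a 16-byte multiple */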
|
|
With these changes, the tests/stencil_twoside test now works.
- Eliminate blending from the stencil_twoside test, as it created an
unneeded dependency on working blending.
- The spe_splat() function will now work if the register being splatted
and the destination register are the same
- Separate fragment code generated for front-facing and back-facing
fragments. Often these are the same; if two-sided stenciling is on,
they can be different. This is easier and faster than generating
code that does both tests and merges the results.
- Fixed a cut/paste bug where if the back Z-pass stencil operation
were different from all the other operations, the back Z-fail
results were incorrect.
|
|
Two definite bugs in stenciling were fixed.
The first, reversed registers in the generated Select Bytes (selb)
instruction, caused the stenciling INCR and DECR operations to
fail dramatically, putting new values in where old values were
supposed to be and vice versa.
The second caused stencil tiles not to be read from and written to
main memory by the SPUs. A per-SPU flag, spu.read_depth, was used
to indicate whether the SPU should be reading depth tiles, and was set
only when depth was enabled. A second flag, spu.read_stencil, was
set when stenciling was enabled, but never referenced.
As stenciling and depth are in the same tiles on the Cell, and there
is no corresponding TAG_WRITE_TILE_STENCIL to complement
TAG_WRITE_TILE_COLOR and TAG_WRITE_TILE_Z, I fixed this by
eliminating the unused "spu.read_stencil", renaming "spu.read_depth"
to "spu.read_depth_stencil", and setting it if either stenciling or
depth is enabled.
I also added an optimization to the fragment ops generation code,
that avoids calculating stencil values and/or stencil writemask
when the stencil operations are all KEEP.
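For reference, the selb semantics that made the first bug bite
(variable names illustrative): spu_sel(a, b, mask) takes bits from b
where the mask is 1 and from a where it is 0.

    /* correct: masked fragments receive the incremented value */
    result = spu_sel(old_vals, incr_vals, mask);

    /* the bug: operands reversed, so old and new values trade places */
    result = spu_sel(incr_vals, old_vals, mask);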
|
|
|
|
Plus add assertions to check status, alignment, etc.
|
|
If we delete a texture, we need to keep the underlying tiled data buffer
around until any rendering that references it has completed.
Keep a list of buffers referenced by a rendering batch. Unref/free them when
the associated batch's fence is executed/signalled.
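A minimal sketch of the scheme, with hypothetical names (struct batch
and release_buffer() are stand-ins):

    struct cell_buffer_node {
       struct pipe_buffer *buffer;
       struct cell_buffer_node *next;
    };

    /* record every buffer the current batch references */
    static void
    batch_ref_buffer(struct batch *b, struct pipe_buffer *buf)
    {
       struct cell_buffer_node *n = malloc(sizeof *n);
       n->buffer = buf;               /* holds a reference */
       n->next = b->buffers;
       b->buffers = n;
    }

    /* called once the batch's fence is executed/signalled */
    static void
    batch_release_buffers(struct batch *b)
    {
       while (b->buffers) {
          struct cell_buffer_node *n = b->buffers;
          b->buffers = n->next;
          release_buffer(n->buffer);  /* hypothetical unref/free */
          free(n);
       }
    }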
|
|
Though, the logf() call still needs attention.
|
|
|
|
This allows us to use 16-bit signed mul/add instructions. We had to
use unsigned mul before, and there's no unsigned mul/add instruction.
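The instruction in question is mpya, reachable through spu_madd() on
halfword vectors (a sketch; function and variable names are
illustrative):

    /* one instruction per vector: (coord * scale) + bias, where the
     * low 16 bits of each word element of coord/scale hold the signed
     * values and bias holds the 32-bit accumulators */
    vector signed int
    scale_bias(vector signed short coord,
               vector signed short scale,
               vector signed int bias)
    {
       return spu_madd(coord, scale, bias);
    }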
|
|
|
|
|
|
|
|
|
|
|
|
Distinguish among texture targets in codegen.
progs/demos/cubemap.c runs correctly now too.
|
|
Use the spu_write_decrementer() and spu_read_decrementer() functions to
measure time. Convert to milliseconds according to the system timebase value.
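A minimal sketch (TIMEBASE_HZ stands in for the system timebase value
and is a hypothetical constant here):

    #include <spu_intrinsics.h>

    static unsigned int t0;

    static void
    time_start(void)
    {
       spu_write_decrementer(~0U);   /* count down from the maximum */
       t0 = spu_read_decrementer();
    }

    static unsigned int
    time_elapsed_ms(void)
    {
       /* the decrementer counts down at the timebase frequency */
       unsigned int ticks = t0 - spu_read_decrementer();
       return ticks / (TIMEBASE_HZ / 1000);
    }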
|
|
|
|
Results in slightly tighter code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Though, progs/demos/cubemap.c doesn't quite work right...
|
|
glDrawPixels works now.
|
|
|
|
|
|
Though, only GL_MIPMAP_NEAREST / GL_LINEAR works right now.
|