I have a series of low poly models (~20 triangles each, 100 of them on the screen at a time). They are not interconnected, so drawing them all at one time with GL_TRIANGLE_STRIP is not an option.
As I see it, I have two options. I can put all 100 models' vertex, normal, and color data into one big interleaved array and draw them with a single DrawElements() call using GL_TRIANGLES. Alternatively, I can construct a strip-ordered interleaved array containing the data for one model, and use this array to draw each model over and over again in a for loop by calling DrawElements() 100 times using GL_TRIANGLE_STRIP.
Does each of the models need to be able to rotate and translate on its own? If so, how do you intend to render them in a batch? The only way I'm aware of is to apply matrix transforms to the vertex data yourself, and that could be slow for that many vertices. If you use glRotate, glTranslate etc. you obviously can't draw them all in one batch.
Also, are you enabling GL_BLEND? This seems to be a big performance hit. If your models don't use alpha blending, disable it and make sure your model textures are 24-bit.
Are you using PVR texture compression? Are all your textures in one big texture atlas?
GL_TRIANGLE_STRIP is faster than GL_TRIANGLES, but I don't see much of a performance hit for what you're doing. The iPhone's main problems are alpha blending and fill rate. Of course, the fewer texture and state changes the better.
The best way to find out where your bottlenecks are is to profile your code. A simple way to do this is the following
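As a rough illustration of timing sections of a frame by hand, here is a minimal sketch. It is an assumption about what such profiling could look like, not the poster's own snippet: the names SectionTimer, now_seconds, timer_add and timer_average_ms are hypothetical, and clock() measures CPU time only, so on a device you would probably swap in a wall-clock source such as mach_absolute_time().

```c
#include <assert.h>
#include <time.h>

/* Hypothetical accumulator for one profiled section (e.g. "transform" or "draw"). */
typedef struct {
    double total_seconds;   /* sum of all measured intervals */
    unsigned long samples;  /* number of intervals recorded  */
} SectionTimer;

/* CPU time in seconds since program start (standard C clock()). */
double now_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Records one measured interval [start, end]. */
void timer_add(SectionTimer *t, double start, double end)
{
    assert(end >= start);
    t->total_seconds += end - start;
    t->samples += 1;
}

/* Average cost of the section in milliseconds, or 0 if nothing was recorded. */
double timer_average_ms(const SectionTimer *t)
{
    if (t->samples == 0)
        return 0.0;
    return t->total_seconds * 1000.0 / (double)t->samples;
}

/* Typical use inside a frame (update_models is a hypothetical function):
 *
 *     double t0 = now_seconds();
 *     update_models();
 *     timer_add(&update_timer, t0, now_seconds());
 *
 * Print timer_average_ms() every few hundred frames and compare sections.
 */
```

Wrapping the CPU transform code and the draw submission in separate timers should show which of the two dominates a frame.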
Which one is faster on the iPhone?
Yeah, the only sane way of handling this setup is to transform vertices on the CPU and submit them with a single interleaved array.
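To make that concrete, here is a minimal sketch of CPU-side batching, under some assumptions: the Vertex, Instance and Batch types and batch_append are hypothetical names, each model only rotates about Z, scales uniformly and translates, and the caller precomputes the cosine and sine of the angle. Each model's vertices are transformed and copied into one interleaved array, and its indices are offset so they point at the copied vertices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex: position, normal, RGBA color. */
typedef struct {
    float position[3];
    float normal[3];
    unsigned char color[4];
} Vertex;

/* Hypothetical per-model transform. cos_a and sin_a are precomputed
 * by the caller (for example with cosf/sinf) for a rotation about Z. */
typedef struct {
    float cos_a, sin_a;
    float scale;        /* uniform scale */
    float tx, ty, tz;   /* translation   */
} Instance;

/* Caller-owned batch storage, large enough for all models. */
typedef struct {
    Vertex *vertices;
    unsigned short *indices;
    size_t vertex_count;
    size_t index_count;
} Batch;

/* Appends one transformed copy of a model (GL_TRIANGLES indices) to the batch. */
void batch_append(Batch *b,
                  const Vertex *model_vertices, size_t model_vertex_count,
                  const unsigned short *model_indices, size_t model_index_count,
                  const Instance *inst)
{
    const size_t base = b->vertex_count;
    size_t i;

    assert(base + model_vertex_count <= 65536); /* indices are 16-bit */

    for (i = 0; i < model_vertex_count; ++i) {
        Vertex v = model_vertices[i];
        const float x = v.position[0], y = v.position[1];
        const float nx = v.normal[0], ny = v.normal[1];

        /* Rotate about Z, scale, then translate the position. */
        v.position[0] = (inst->cos_a * x - inst->sin_a * y) * inst->scale + inst->tx;
        v.position[1] = (inst->sin_a * x + inst->cos_a * y) * inst->scale + inst->ty;
        v.position[2] = v.position[2] * inst->scale + inst->tz;

        /* Normals are only rotated; uniform scale does not change direction. */
        v.normal[0] = inst->cos_a * nx - inst->sin_a * ny;
        v.normal[1] = inst->sin_a * nx + inst->cos_a * ny;

        b->vertices[base + i] = v;
    }

    /* Offset indices so they refer to this model's copy in the batch. */
    for (i = 0; i < model_index_count; ++i)
        b->indices[b->index_count + i] = (unsigned short)(base + model_indices[i]);

    b->vertex_count += model_vertex_count;
    b->index_count += model_index_count;
}

/* After appending every model, the whole batch goes out in one call:
 *
 *     glVertexPointer(3, GL_FLOAT, sizeof(Vertex), batch.vertices[0].position);
 *     glNormalPointer(GL_FLOAT, sizeof(Vertex), batch.vertices[0].normal);
 *     glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), batch.vertices[0].color);
 *     glDrawElements(GL_TRIANGLES, (GLsizei)batch.index_count,
 *                    GL_UNSIGNED_SHORT, batch.indices);
 */
```

With 100 models of about 20 triangles each, the batch stays far below the 65536-vertex limit of 16-bit indices.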
Actually, that's an interesting point: is it that important to have the array interleaved? Currently, for my model animation system, I set the vertex pointer for each animation frame rather than copying the data over, which is obviously quicker. This means my vertices, tex coords etc. are all separate pointers to the data. But I did stumble upon an article which said to keep your vertex data in a single struct for better performance. I take it that's what you mean by interleaved? But with the iPhone sharing its RAM and not having to transfer data to any sort of "VRAM", would it really incur a penalty?
EDIT: Found the article, Interleaving Vertex Data, which is where this is stated. Also, for the OP, who may want to look at transforming vertex data using matrices, there is another good article from the same site: Transformations and Matrices
It's all about spatial locality. If you are accessing positions, there is a good chance you will also access normals (if you are doing CPU transformations), and having normals reside in some other part of memory means potentially two cache line misses as opposed to just one.
Besides your own code, the GLES driver on the iPhone walks your entire vertex stream and does some preprocessing on its own, so you do want to use interleaved arrays.
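For reference, an interleaved layout just means one struct per vertex, with every attribute pointer using the same stride. A minimal sketch, assuming a hypothetical InterleavedVertex with float positions and normals plus a byte color, and a hypothetical helper attribute_pointer:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex: all attributes of one vertex sit
 * next to each other in memory. */
typedef struct {
    float position[3];       /* byte offset 0  */
    float normal[3];         /* byte offset 12 */
    unsigned char color[4];  /* byte offset 24 */
} InterleavedVertex;         /* 28 bytes per vertex */

/* Address of an attribute of the first vertex, given its byte offset. */
const void *attribute_pointer(const InterleavedVertex *vertices, size_t offset)
{
    assert(offset < sizeof(InterleavedVertex));
    return (const unsigned char *)vertices + offset;
}

/* Every attribute uses the same stride, sizeof(InterleavedVertex):
 *
 *     glVertexPointer(3, GL_FLOAT, sizeof(InterleavedVertex),
 *                     attribute_pointer(v, offsetof(InterleavedVertex, position)));
 *     glNormalPointer(GL_FLOAT, sizeof(InterleavedVertex),
 *                     attribute_pointer(v, offsetof(InterleavedVertex, normal)));
 *     glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(InterleavedVertex),
 *                    attribute_pointer(v, offsetof(InterleavedVertex, color)));
 */
```

With separate arrays, the position and normal of the same vertex can be thousands of bytes apart; here they are always 12 bytes apart.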
PS. I am not sure I understand your point about animations and having separate vertex streams.
Do each of the models need to be able to rotate and translate on their own? If so how do you intend on rendering them in a batch?
No texture, no lighting. I will have to perform rotation and scaling operations on them. I realize I could write my own operations to do this logic on the CPU, as a way to avoid having to call DrawElements 100 times. Just trying to get a sense of whether or not it's worth the effort.
It makes a big difference.
I actually did some tests and even have pictures to prove it :-)
Here is a run without batching (you can see it issues 121 DrawElements calls, shown as "renderables"). It runs at around 22 fps in release mode (this screenshot is from my debug session): http://www.warmi.net/tmp/Screenshot_1a.png
And here is another run with batching, transforming positions/normals on the CPU (you can see it only has 4 DrawElements "renderables"). It runs at 55+ fps (again, the screenshot is from a debug session): http://www.warmi.net/tmp/Screenshot_1b.png
Thanks for the stats, warmi. It's good to know I'm heading in the right direction too.
BTW I assume you're using matrix transforms like those in the link I posted?
Quote:
PS. I am not sure I understand your point about animations and having separate vertex streams.
The problem is my models can change texture coordinates so they can use different texture images inside a texture atlas. It's too slow for me to loop through the tex coords in all the frames of vertex arrays to set them all, so I have a separate array for them. Unfortunately this means I can't use interleaving for the whole vertex.
I am using custom VFP/NEON asm code, which is actually faster than the GPU itself (about 3-4 times faster than optimized C code), so for me transforming vertices/normals on the CPU is not a problem at all.
The biggest FPS killers are the draw calls and the internal driver vertex processing code, which I can do nothing about. Well, almost nothing: one way to minimize that cost is to submit your positions/normals/UVs as shorts and have them rescaled back on the GPU. This way a typical vertex structure that takes 32 bytes (3 floats/position, 3 floats/normal, 2 floats/UVs) can be shortened to 20 bytes (4 shorts/position, 4 shorts/normal, 2 shorts/UVs).
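A minimal sketch of that packing step, with hypothetical names (FloatVertex, ShortVertex, quantize, pack_vertex) and the assumption that positions lie within a known range around the origin. How the values get rescaled on the GPU is not shown; one plausible approach, which is an assumption here, is to fold the inverse scale into the modelview and texture matrices.

```c
#include <assert.h>
#include <stdint.h>

/* 3 floats position + 3 floats normal + 2 floats UV = 32 bytes. */
typedef struct {
    float position[3];
    float normal[3];
    float uv[2];
} FloatVertex;

/* 4 shorts position + 4 shorts normal + 2 shorts UV = 20 bytes.
 * The fourth position/normal component is padding. */
typedef struct {
    int16_t position[4];
    int16_t normal[4];
    int16_t uv[2];
} ShortVertex;

/* Maps a value in [-range, range] to [-32767, 32767], rounding to
 * nearest and clamping anything outside the range. */
int16_t quantize(float value, float range)
{
    float scaled = value / range * 32767.0f;
    if (scaled > 32767.0f)
        scaled = 32767.0f;
    if (scaled < -32767.0f)
        scaled = -32767.0f;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Packs one float vertex. position_range and uv_range are the largest
 * absolute values expected (assumed known by the caller). */
ShortVertex pack_vertex(const FloatVertex *v, float position_range, float uv_range)
{
    ShortVertex out;
    int i;
    for (i = 0; i < 3; ++i) {
        out.position[i] = quantize(v->position[i], position_range);
        out.normal[i] = quantize(v->normal[i], 1.0f); /* unit normals */
    }
    out.position[3] = 0;
    out.normal[3] = 0;
    out.uv[0] = quantize(v->uv[0], uv_range);
    out.uv[1] = quantize(v->uv[1], uv_range);
    return out;
}
```

The trade-off is precision: each component keeps about 16 bits, which is usually plenty for low-poly models but worth checking on large meshes.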
Interesting. I had read somewhere that multiple draw calls did not seem to affect performance but this definitely points to the contrary.
I see you've ignored the suggestion about aligning your vertex structure row length to a multiple of 8 bytes. Wondering what your results would be like if you manage to somehow cut those 20 bytes down to 16 (or better yet, pad with 4 to make it 24).
Why multiples of 8? Multiples of 4 are plenty, since that's what ARM operates with. If you really want to be cache friendly, you would have to make your entire vertex struct exactly 32 bytes.
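To see why the exact size matters more than the multiple, here is a small sketch that counts how many vertices cross a cache line boundary for a given stride. It assumes a 32-byte cache line and an array that starts on a line boundary, both of which are assumptions about the hardware and allocator, and count_straddling is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Counts vertices whose bytes span two cache lines, assuming the
 * array starts on a cache line boundary and each vertex occupies
 * exactly `stride` bytes. */
size_t count_straddling(size_t stride, size_t vertex_count, size_t line_size)
{
    size_t straddling = 0;
    size_t i;

    assert(stride > 0 && line_size > 0);

    for (i = 0; i < vertex_count; ++i) {
        size_t first_line = (i * stride) / line_size;
        size_t last_line = (i * stride + stride - 1) / line_size;
        if (first_line != last_line)
            ++straddling;
    }
    return straddling;
}
```

Under these assumptions, half of the vertices straddle a line at both 20 and 24 bytes per vertex, while none do at 16 or 32 bytes.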
As for multiple draw calls: of course they matter. Even if you are simply calling glDrawArrays in a loop, it still makes a lot of difference. And if you take into account that a typical engine (as opposed to some lean demo code) will attempt to do a lot more than just call glDrawArrays for each renderable batch, it becomes even more important.