以下是analog devices的dsp bf532的汇编程序。是YUV2RGB。现在基本上所有图像压缩都是先把RGB转为YUV,然后再进行变换压缩。解压时就要把YUV转回到RGB显示。每个像数都要做这样的处理。所以YUV与RGB的变换的效率直接影响编解码的速度。为了提高速度,很多人想出各种不同的方法,多数是在时间与空间上进行调配。而在一些带有特殊硬件的,就进行特别优化计算(下面的程序就是了)。这优化手段都有一个共同的特点--使用汇编进行编程。因为很难用C进行优化,说句难听的话,基本上是无能为力。看看现在的开源图像编解码器,无一不是使用汇编进行优化。在x86方面,mmx sse sse2 sse3 3DNOW 3DNOWext扩展指令集等基本上C编译器支持很少。虽说支持,但无法真正发挥他们的效能。以下面的为例说,它利用了DSP的地址发生器实现参数循环自动加载和无开销循环,利用了双乘法器同时进行两组16位数据乘法,并行多指令运行方式使它可同时加载参数为下一步计算准备并进行加减法。在保存结果的同时加载新数据,为下一组YUV计算做准备,实现流水线式运算。通过这些优化,使得它进行一个像数的运算只需要13个指令周期(包括取数据,计算,存数据)。比查表方式的计算还要快(网上流行的经典YUVtoRGB程序还要快),不得不佩服其智慧。引用一些人的话:是一门艺术。但是它的设计还可以进一步优化。启发来源于哪个经典YUVtoRGB程序,是根据YUV数据特点而想到的。因为很多解码后得到的数据是以I420或YV12的,简单地说就是四个像数的UV值是相同,只是Y不同。而下面的程序设计是每个像数YUV都不同。而整转换最主要是UV进行计算,而Y就只是进行加法。按这个想法,是不是可以提高三倍的计算速度呢?现在正开始研究如何实现了。
以上讲了对汇编的爱,恨是。。。不用说就是和c语言的区别了,功能不是很到位。总之各有利弊。。加油。。
/*******************************************************************************
Copyright(c) 2000 - 2002 Analog Devices. All Rights Reserved.
Developed by Joint Development Software Application Team, IPDC, Bangalore, India
for Blackfin DSPs ( Micro Signal Architecture 1.0 specification).
By using this module you agree to the terms of the Analog Devices License
Agreement for DSP Software.
********************************************************************************
Module Name : YCbCrtoRGB.asm
Label Name : __YCbCrtoRGB
Version : 1.3
Change History :
Version Date Author Comments
1.3 11/18/2002 Swarnalatha Tested with VDSP++ 3.0
compiler 6.2.2 on
ADSP-21535 Rev.0.2
1.2 11/13/2002 Swarnalatha Tested with VDSP++ 3.0
on ADSP-21535 Rev. 0.2
1.1 02/28/2002 Raghavendra Modified to match
silicon cycle count
1.0 05/16/2001 Raghavendra Original
Description : In this function the range of Y, Cb and Cr is 0 to 255 and the
output range of R, G and B is also 0 to 255.
The formula implemented is
as below:
R = Y + 1.402 (Cr - 128) = Y +(Cr-128) + 0.402(Cr-128)
G = Y - 0.34414 (Cb - 128) - 0.71414 (Cr - 128)
B = Y + 1.772 (Cb - 128) = Y + (Cb-128) + 0.772(Cb-128)
Prototype : void YCbCrtoRGB(unsigned char input[], unsigned char out[],
int N);
input[] - Input YCbCr array
out[] - Out put array to store in RGB format
N - Number of inputs
Registers used : A0, A1, R0-R7, I1, B1, L1, P0-P2, LC0.
Performance :
Code Size : 164 bytes
Cycle count : 13*N + 31 cycles
: 96 Cycles (for N = 5)
*******************************************************************************/
.section L1_code;
.global __YCbCrtoRGB;
.align 8;
__YCbCrtoRGB: [--SP] = (R7:4);
// Pushing the Registers on stack.
SP += -8; // SP modified to store coefficients
I1 = SP; //
B1 = I1; // Initialize base register B1 and I1 for circular
// buffer
L1 = 8; // Initialize length for circular buffer
R6.L = 0x3374;
R6.H = 0xA498; // Coefficients 0.402 and -0.34414 are stored in R6
R7.L = 0xD3F4;
P0 = R0; // Address of input YCbCr array
R7.H = 0x62D0; // Coefficients -0.71414 and 0.772 are stored in R7
P2 = R1; // Address of output array to store RGB values
P1 = R2; // Number of inputs N
R4.L = 0xFF;
R4 = PACK(R4.L,R4.L) || R0 = B[P0++](Z) || [I1++] = R6;
// Initialize R4.H to 255 and fetch Y value
// and store coefficients 0.402 and -0,34414 to
// temp. location
R5 = R7-R7(NS) || R1 = B[P0++] (Z)|| [I1++] = R7;
// Clear R5 and fetch Cb value and store
// coefficients
// -0.71414 and 0,772 in temp. location
R7 = 128; // Initialize R7 to 128
R1 = R1 - R7(NS) || R2 = B[P0++](Z);
// R1 = Cb-128 and fetch Cr value
R6.L = 0X7FFF; // Initialize R6 to maximum positive value
LSETUP(YCB_STRT, YCB_END) LC0 = P1;
YCB_STRT:
R2 = R2 - R7; // R2 = Cr-128
A1 = R0.L * R6.L, A0 = R0.L * R6.L || R3 = [I1++];
// Get Y value in A1 and A0 and fetch coefficients
A1 += R2.L * R3.H, A0 += R2.L * R3.L || R3 = [I1++];
// Multiply (Cr -128) value with coefficients 0.402
// and -0.71414
R2.L = (A0 += R2.L * R6.L);
// Add (Cr-128) value to A0 to get R value
R2.H = (A1 += R1.L * R3.L), A0 = R1.L * R3.H;
// multiply (Cb-128) with -0.34414 and add to A1,
// A0= 0.772(Cb-128)
R2 = MAX(R2,R5)(V); // check if value is within 0 and 255
R2 = MIN(R2,R4)(V); // R2.L contains R value and R2.H contains B value
A0 += R0.L * R6.L || B[P2++] = R2;
// Add Y value to A0 and store R value
R2 = R2 >> 16 || R0 = B[P0++](Z);
// Leftshift to get B value in lower half and fetch
// next Y data
R3.L = (A0 += R1.L * R6.L) || B[P2++] = R
2;
// Add (Cb-128) value to A0 and store B value
R3 = MAX(R3,R5)(V) || R1 = B[P0++](Z);
// Check if value is within the limit 0 to 255 and
// fetch next Cb data
R3 = MIN(R3,R4)(V) || R2 = B[P0++](Z);
// fetch next Cr value
YCB_END:
R1 = R1 - R7(NS) || B[P2++] = R3;
// R1 = Cb-128 and store B data
SP += 8; // Clear temp. location
(R7:4) = [SP++]; // Pop up the saved registers.
RTS;
NOP; //to avoid one stall if LINK or UNLINK happens to be
//the next instruction after RTS in the memory.
__YCbCrtoRGB.end:
|