[安富莱 (Armfly) DSP Tutorial] Chapter 8: Using BasicMathFunctions (Part 1)


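Chapter 8: Using BasicMathFunctions (Part 1)

This installment starts our study of ARM's official DSP library, beginning with the basic math functions. It covers four operations: absolute value, addition, dot product, and multiplication.

8.1 Absolute value (Vector Absolute Value)

8.2 Addition (Vector Addition)

8.3 Dot product (Vector Dot Product)

8.4 Multiplication (Vector Multiplication)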

硬汉Eric2013 · 2015-6-4 14:21:12

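8.1 Absolute value (Vector Absolute Value)

These functions compute the absolute value of each element, as described by:

pDst[n] = abs(pSrc[n]), 0 <= n < blockSize.

Note in particular that these functions allow the destination pointer and the source pointer to refer to the same buffer.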
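To make the formula and the in-place rule concrete, here is a minimal plain-C sketch. It is not the library code; the name ref_abs_f32 is made up for illustration, and it deliberately avoids any ARM-specific header so it can be tried on a PC.

```c
#include <stddef.h>

/* Hypothetical reference helper (not part of the DSP library):
 * pDst[n] = |pSrc[n]| for 0 <= n < blockSize.
 * pSrc and pDst may point to the same buffer, because element n is
 * read before element n is written. */
static void ref_abs_f32(const float *pSrc, float *pDst, size_t blockSize)
{
    for (size_t n = 0; n < blockSize; n++) {
        float x = pSrc[n];
        pDst[n] = (x < 0.0f) ? -x : x;
    }
}
```

Calling ref_abs_f32(buf, buf, 3) overwrites buf with its absolute values, which mirrors the in-place usage allowed by the library functions.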

8.1.1 arm_abs_f32

This function computes the absolute value of 32-bit floating-point numbers. The annotated source code is as follows:

/**
* @brief Floating-point vector absolute value. （1）
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* @return none.
*/
void arm_abs_f32( （2）
float32_t * pSrc,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY （3）
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u; （4）
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = |A| */
/* Calculate absolute and then store the results in the destination buffer. */
/* read sample from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
/* find absolute value */
in1 = fabsf(in1); （5）
/* read sample from source */
in4 = *(pSrc + 3);
/* find absolute value */
in2 = fabsf(in2);
/* read sample from source */
*pDst = in1;
/* find absolute value */
in3 = fabsf(in3);
/* find absolute value */
in4 = fabsf(in4);
/* store result to destination */
*(pDst + 1) = in2;
/* store result to destination */
*(pDst + 2) = in3;
/* store result to destination */
*(pDst + 3) = in4;
/* Update source pointer to process next sampels */ （6）
pSrc += 4u;
/* Update destination pointer to process next sampels */
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else （7）
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u) （8）
{
/* C = |A| */
/* Calculate absolute and then store the results in the destination buffer. */
*pDst++ = fabsf(*pSrc++);
/* Decrement the loop counter */
blkCnt--;
}
}


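1. Here is a brief introduction to the general conventions of the functions in the DSP library; it will not be repeated later.

(1) Nearly all of the functions are reentrant.

(2) Most functions operate on a block of data; arm_abs_f32, for example, computes the absolute values of a whole array. If you only need the absolute value of a few numbers, the library function offers no real advantage.

(3) The library functions generally support CM0, CM3 and CM4 (the latest DSP library has added CM7 support).

(4) Data is generally processed in groups of four samples; samples left over after the last full group are processed one at a time.

(5) Most functions come in four variants: f32, Q31, Q15 and Q7.

2. The function parameters: an input array is passed in and the absolute value of each element is written to the output array.

3. This part of the code is for the CM3 and CM4 cores.

4. Shifting right by two bits (dividing by 4) gives the number of complete groups of four samples.

5. fabsf: this is not one of the DSP intrinsics of the Cortex-M4F; it is an ordinary C library function supplied with MDK. With the FPU enabled, the compiler may still reduce the call to a single floating-point instruction.

6. Advance to the next group of data.

7. This part of the code is for CM0.

8. This loop handles the samples left over after the groups of four, or all of the samples on a CM0 core.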
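The loop structure described in notes 4 and 8 shows up in almost every function of this chapter, so it is worth seeing in isolation. The sketch below is only an illustration in plain C; negate_unrolled is a made-up name and the operation (negation) is chosen arbitrarily.

```c
#include <stdint.h>

/* Hypothetical illustration of the library's loop structure:
 * four samples per pass, then a second loop for the leftovers. */
static void negate_unrolled(const int32_t *pSrc, int32_t *pDst, uint32_t blockSize)
{
    uint32_t blkCnt = blockSize >> 2u;   /* number of complete groups of 4 */

    while (blkCnt > 0u) {
        pDst[0] = -pSrc[0];
        pDst[1] = -pSrc[1];
        pDst[2] = -pSrc[2];
        pDst[3] = -pSrc[3];
        pSrc += 4;                       /* move on to the next group */
        pDst += 4;
        blkCnt--;
    }

    blkCnt = blockSize % 4u;             /* 0 to 3 remaining samples */
    while (blkCnt > 0u) {
        *pDst++ = -(*pSrc++);
        blkCnt--;
    }
}
```

For blockSize = 6 the first loop runs once (6 >> 2 is 1) and the second loop runs twice (6 % 4 is 2).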

8.1.2 arm_abs_q31

This function computes the absolute value of 32-bit fixed-point (Q31) numbers. The annotated source code is as follows:

/**
* @brief Q31 vector absolute value.
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: （1）
* \par
* The function uses saturating arithmetic.
* The Q31 value -1 (0x80000000) will be saturated to the maximum allowable positive value 0x7FFFFFFF.
*/
void arm_abs_q31(
q31_t * pSrc,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q31_t in; /* Input value */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = |A| */
/* Calculate absolute of input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
*pDst++ = (in1 > 0) ? in1 : (q31_t)__QSUB(0, in1); （2）
*pDst++ = (in2 > 0) ? in2 : (q31_t)__QSUB(0, in2);
*pDst++ = (in3 > 0) ? in3 : (q31_t)__QSUB(0, in3);
*pDst++ = (in4 > 0) ? in4 : (q31_t)__QSUB(0, in4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = |A| */
/* Calculate absolute value of the input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
in = *pSrc++;
*pDst++ = (in > 0) ? in : ((in == INT32_MIN) ? INT32_MAX : -in);
/* Decrement the loop counter */
blkCnt--;
}
}


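1. This function uses saturating arithmetic, and so do many of the functions that follow. For an explanation of saturating arithmetic, see section 4.3.6, "Assembly language: saturating operations", of the Chinese edition of The Definitive Guide to the ARM Cortex-M3.

For Q31 data, saturation turns 0x80000000 into 0x7FFFFFFF. This value is a special case: its negation (+1.0) cannot be represented in Q31, so the result is clamped to the largest positive value.

2. The key point here is __QSUB, which corresponds to the QSUB instruction and performs a saturating subtraction. QSUB belongs to the DSP extension of the Cortex-M4; as far as I can tell, for Cortex-M3 builds arm_math.h supplies a C implementation with the same behavior.

For example, __QSUB(0, in1) computes 0 - in1 with saturation and returns the result. __QSUB operates on 32-bit values; __QSUB16 and __QSUB8 perform the corresponding saturating subtraction on 16-bit and 8-bit values.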
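If you want to experiment with saturation on a PC, the behavior of a 32-bit saturating subtraction can be modeled in portable C. This is only a sketch of the arithmetic, not the instruction or the library's implementation; sat_sub32 and abs_q31_model are made-up names.

```c
#include <stdint.h>

/* Portable model of a 32-bit saturating subtraction (hypothetical name). */
static int32_t sat_sub32(int32_t x, int32_t y)
{
    int64_t r = (int64_t)x - (int64_t)y;   /* exact result in 64 bits */
    if (r > INT32_MAX) return INT32_MAX;   /* clamp on positive overflow */
    if (r < INT32_MIN) return INT32_MIN;   /* clamp on negative overflow */
    return (int32_t)r;
}

/* Q31 absolute value written the same way as the unrolled loop:
 * positive inputs pass through, the rest are negated with saturation. */
static int32_t abs_q31_model(int32_t in)
{
    return (in > 0) ? in : sat_sub32(0, in);
}
```

With this model, abs_q31_model(INT32_MIN) gives 0x7FFFFFFF instead of wrapping back to 0x80000000, which is exactly the special case mentioned in note 1.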

8.1.3 arm_abs_q15

This function computes the absolute value of 16-bit fixed-point (Q15) numbers. The annotated source code is as follows:

/**
* @brief Q15 vector absolute value.
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior:
* \par
* The function uses saturating arithmetic.
* The Q15 value -1 (0x8000) will be saturated to the maximum allowable positive value 0x7FFF. （1）
*/
void arm_abs_q15(
q15_t * pSrc,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
__SIMD32_TYPE *simd; （2）
/* Run the below code for Cortex-M4 and Cortex-M3 */
q15_t in1; /* Input value1 */
q15_t in2; /* Input value2 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
simd = __SIMD32_CONST(pDst); （3）
while(blkCnt > 0u)
{
/* C = |A| */
/* Read two inputs */
in1 = *pSrc++;
in2 = *pSrc++;
/* Store the Absolute result in the destination buffer by packing the two values, in a single cycle */
#ifndef ARM_MATH_BIG_ENDIAN
*simd++ =
__PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), （4）
((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
#else
*simd++ =
__PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
in1 = *pSrc++;
in2 = *pSrc++;
#ifndef ARM_MATH_BIG_ENDIAN
*simd++ =
__PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)),
((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
#else
*simd++ =
__PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
/* Decrement the loop counter */
blkCnt--;
}
pDst = (q15_t *)simd;
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = |A| */
/* Read the input */
in1 = *pSrc++;
/* Calculate absolute value of input and then store the result in the destination buffer. */
*pDst++ = (in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
q15_t in; /* Temporary input variable */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = |A| */
/* Read the input */
in = *pSrc++;
/* Calculate absolute value of input and then store the result in the destination buffer. */
*pDst++ = (in > 0) ? in : ((in == (q15_t) 0x8000) ? 0x7fff : -in);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


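1. For Q15 data, saturation turns 0x8000 into 0x7FFF.

2. __SIMD32_TYPE is defined in arm_math.h as follows:

#define __SIMD32_TYPE int32_t __packed

SIMD is the single-instruction, multiple-data technique covered in the previous installment. Put simply, __SIMD32_TYPE is an int32_t, and the __packed qualifier tells the compiler not to assume 4-byte alignment, so a pointer of this type can address a pair of 16-bit values that is only 2-byte aligned.

3. __SIMD32_CONST is defined as follows:

#define __SIMD32_CONST(addr) ((__SIMD32_TYPE *)(addr))

4. __PKHBT is defined in core_cm4_simd.h as follows:

#define __PKHBT(ARG1,ARG2,ARG3) ( ((((uint32_t)(ARG1)) ) & 0x0000FFFFUL) | \
                                  ((((uint32_t)(ARG2)) << (ARG3)) & 0xFFFF0000UL) )

This macro merges two 16-bit values into one 32-bit value: ARG1 supplies the low halfword, and ARG2, shifted left by ARG3, supplies the high halfword. Note that PKHBT is also an instruction supported by the CM4 core; MDK can recognize the C expression in the macro above and emit the corresponding PKHBT instruction.

__QSUB16 performs saturating subtraction on 16-bit data.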
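The packing step is easy to check by hand. The sketch below restates the expression from the __PKHBT macro as a plain C function with a fixed shift of 16; pack_halfwords is a made-up name used only for this illustration.

```c
#include <stdint.h>

/* Portable restatement of the packing expression (hypothetical name):
 * `low` ends up in bits 0..15 and `high` in bits 16..31. */
static uint32_t pack_halfwords(int16_t low, int16_t high)
{
    return ((uint32_t)low & 0x0000FFFFUL) |
           (((uint32_t)high << 16) & 0xFFFF0000UL);
}
```

For example, pack_halfwords(0x1234, 0x5678) yields 0x56781234. On a little-endian target the low halfword is the one stored at the lower address, which is why the little-endian branch passes in1 first.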

8.1.4 arm_abs_q7

This function computes the absolute value of 8-bit fixed-point (Q7) numbers. The annotated source code is as follows:

/**
* @brief Q7 vector absolute value.
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* @return none.
*
* \par Conditions for optimum performance
* Input and output buffers should be aligned by 32-bit
*
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* The Q7 value -1 (0x80) will be saturated to the maximum allowable positive value 0x7F.
*/
void arm_abs_q7(
q7_t * pSrc,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q7_t in; /* Input value1 */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = |A| */
/* Read inputs */
in1 = (q31_t) * pSrc;
in2 = (q31_t) * (pSrc + 1);
in3 = (q31_t) * (pSrc + 2);
/* find absolute value */
out1 = (in1 > 0) ? in1 : (q31_t)__QSUB8(0, in1); (2)
/* read input */
in4 = (q31_t) * (pSrc + 3);
/* find absolute value */
out2 = (in2 > 0) ? in2 : (q31_t)__QSUB8(0, in2);
/* store result to destination */
*pDst = (q7_t) out1;
/* find absolute value */
out3 = (in3 > 0) ? in3 : (q31_t)__QSUB8(0, in3);
/* find absolute value */
out4 = (in4 > 0) ? in4 : (q31_t)__QSUB8(0, in4);
/* store result to destination */
*(pDst + 1) = (q7_t) out2;
/* store result to destination */
*(pDst + 2) = (q7_t) out3;
/* store result to destination */
*(pDst + 3) = (q7_t) out4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
blkCnt = blockSize;
#endif // #define ARM_MATH_CM0_FAMILY
while(blkCnt > 0u)
{
/* C = |A| */
/* Read the input */
in = *pSrc++;
/* Store the Absolute result in the destination buffer */
*pDst++ = (in > 0) ? in : ((in == (q7_t) 0x80) ? 0x7f : -in);
/* Decrement the loop counter */
blkCnt--;
}
}


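1. Because of saturation, the absolute value of 0x80 becomes 0x7F.

2. __QSUB8 performs saturating subtraction on 8-bit values.

8.1.5 Worked example

Purpose:

1. Compute absolute values for all four data types.

Procedure:

1. Press key K1; the results are printed over the serial port.

Result:

The printed output, viewed with the serial terminal program SecureCRT (included on the V5 disc), is as follows:

Program: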

/*
*********************************************************************************************************
* Function: DSP_ABS
* Description: compute absolute values
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_ABS(void)
{
    static float32_t pSrc;
    static float32_t pDst;
    static q31_t pSrc1;
    static q31_t pDst1;
    static q15_t pSrc2;
    static q15_t pDst2;
    static q7_t pSrc3 = 127; /* Start at 127 so that the += 1 below produces 0x80; then check that 0x80 saturates to 0x7F */
    static q7_t pDst3;

    pSrc -= 1.23f;
    arm_abs_f32(&pSrc, &pDst, 1);                  (1)
    printf("arm_abs_f32 = %f\r\n", pDst);

    pSrc1 -= 1;
    arm_abs_q31(&pSrc1, &pDst1, 1);                (2)
    printf("arm_abs_q31 = %d\r\n", pDst1);

    pSrc2 -= 1;
    arm_abs_q15(&pSrc2, &pDst2, 1);                (3)
    printf("arm_abs_q15 = %d\r\n", pDst2);

    pSrc3 += 1;
    printf("pSrc3 = %d\r\n", pSrc3);
    arm_abs_q7(&pSrc3, &pDst3, 1);                 (4)
    printf("arm_abs_q7 = %d\r\n", pDst3);

    printf("***********************************\r\n");
}

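(1) to (4) compute the absolute value in each of the four formats. Only a single value is processed here; you can try computing the absolute values of a whole array.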

硬汉Eric2013 · 2015-6-4 14:25:52

8.2 Addition (Vector Addition)

These functions compute element-wise sums, as described by:

pDst[n] = pSrcA[n] + pSrcB[n], 0 <= n < blockSize.

8.2.1 arm_add_f32

This function computes the sum of 32-bit floating-point numbers. The annotated source code is as follows:

/**
* @brief Floating-point vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*/
void arm_add_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary input variabels */
float32_t inB1, inB2, inB3, inB4; /* temporary input variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
/* read four inputs from sourceA and four inputs from sourceB */
inA1 = *pSrcA;
inB1 = *pSrcB;
inA2 = *(pSrcA + 1);
inB2 = *(pSrcB + 1);
inA3 = *(pSrcA + 2);
inB3 = *(pSrcB + 2);
inA4 = *(pSrcA + 3);
inB4 = *(pSrcB + 3);
/* C = A + B */ (1)
/* add and store result to destination */
*pDst = inA1 + inB1;
*(pDst + 1) = inA2 + inB2;
*(pDst + 2) = inA3 + inB3;
*(pDst + 3) = inA4 + inB4;
/* update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (*pSrcA++) + (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
}


1. This part of the code is straightforward: it simply adds each pair of corresponding samples.

8.2.2 arm_add_q31

This function computes the sum of 32-bit fixed-point (Q31) numbers. The annotated source code is as follows:

/**
* @brief Q31 vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] will be saturated.
*/
void arm_add_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
*pDst++ = __QADD(inA1, inB1); (2)
*pDst++ = __QADD(inA2, inB2);
*pDst++ = __QADD(inA3, inB3);
*pDst++ = __QADD(inA4, inB4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = __QADD(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrcA++ + *pSrcB++); (3)
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


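1. This function also uses saturating arithmetic. The output range is [0x80000000, 0x7FFFFFFF]; results outside this range are saturated.

2. __QADD performs saturating addition of 32-bit values.

3. clip_q63_to_q31 is defined in arm_math.h:

static __INLINE q31_t clip_q63_to_q31(
  q63_t x)
{
  return ((q31_t) (x >> 32) != ((q31_t) x >> 31)) ?
    ((0x7FFFFFFF ^ ((q31_t) (x >> 63)))) : (q31_t) x;
}

This function saturates a 64-bit value to the Q31 range: if the upper 32 bits are not simply the sign extension of the lower 32 bits, the value is out of range and is replaced by 0x7FFFFFFF or 0x80000000 according to its sign.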
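The bit manipulation in clip_q63_to_q31 is compact but not obvious at first glance. As a rough check, the sketch below places a straightforward clamp next to the same expression rewritten with stdint types. The names clip_simple, clip_bits and add_q31_model are made up, and the bit-trick version assumes that right shifts of negative values are arithmetic and that narrowing to int32_t keeps the low 32 bits, as is the case with the ARM compilers and gcc.

```c
#include <stdint.h>

/* Readable clamp of a 64-bit value to the 32-bit range (hypothetical name). */
static int32_t clip_simple(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* The expression quoted from arm_math.h, restated with stdint types.
 * Assumes arithmetic right shift and wrap-around narrowing. */
static int32_t clip_bits(int64_t x)
{
    return ((int32_t)(x >> 32) != ((int32_t)x >> 31)) ?
           (0x7FFFFFFF ^ ((int32_t)(x >> 63))) : (int32_t)x;
}

/* Saturating Q31 addition in the style of the Cortex-M0 branch. */
static int32_t add_q31_model(int32_t a, int32_t b)
{
    return clip_simple((int64_t)a + b);
}
```

Both clamps agree on in-range and out-of-range inputs, and add_q31_model(0x7FFFFFFF, 1) stays at 0x7FFFFFFF instead of wrapping to a negative number.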

8.2.3 arm_add_q15

This function computes the sum of 16-bit fixed-point (Q15) numbers. The annotated source code is as follows:

/**
* @brief Q15 vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
*/
void arm_add_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inB1, inB2;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */ (2)
/* Add and then store the results in the destination buffer. */
inA1 = *__SIMD32(pSrcA)++;
inA2 = *__SIMD32(pSrcA)++;
inB1 = *__SIMD32(pSrcB)++;
inB2 = *__SIMD32(pSrcB)++;
*__SIMD32(pDst)++ = __QADD16(inA1, inB1);
*__SIMD32(pDst)++ = __QADD16(inA2, inB2);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q15_t) __QADD16(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrcA++ + *pSrcB++), 16); (3)
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


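1. This function also uses saturating arithmetic. The output range is [0x8000, 0x7FFF]; results outside this range are saturated.

2. The statement inA1 = *__SIMD32(pSrcA)++ reads two adjacent 16-bit values into the 32-bit variable inA1 with a single 32-bit load, so that __QADD16 can then add both pairs in one instruction.

3. __SSAT saturates a signed value to a given bit width; here the sum is saturated to 16 bits. Strictly speaking, SSAT is a saturation instruction rather than a SIMD instruction.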
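To see what "saturate to 16 bits" means in numbers, here is a small portable model. It is only a sketch of the arithmetic, assuming a bit width between 1 and 32; ssat_model and add_q15_model are made-up names, not library functions.

```c
#include <stdint.h>

/* Portable model of signed saturation to `bits` bits, 1 <= bits <= 32
 * (hypothetical name). */
static int32_t ssat_model(int32_t x, uint32_t bits)
{
    int32_t max = (int32_t)((1u << (bits - 1u)) - 1u);  /* 0x7FFF for 16 bits */
    int32_t min = -max - 1;                              /* -0x8000 for 16 bits */
    if (x > max) return max;
    if (x < min) return min;
    return x;
}

/* Q15 addition in the style of the Cortex-M0 branch: add in 32 bits,
 * then saturate the sum to 16 bits. */
static int16_t add_q15_model(int16_t a, int16_t b)
{
    return (int16_t)ssat_model((int32_t)a + b, 16u);
}
```

With this model, 32767 + 1 stays at 32767 and -32768 + (-1) stays at -32768, matching the range given in note 1.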

8.2.4 arm_add_q7

This function computes the sum of 8-bit fixed-point (Q7) numbers. The annotated source code is as follows:

/**
* @brief Q7 vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
*/
void arm_add_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */ (2)
*__SIMD32(pDst)++ = __QADD8(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q7_t) __SSAT(*pSrcA++ + *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrcA++ + *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


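1. This function also uses saturating arithmetic. The output range is [0x80, 0x7F]; results outside this range are saturated.

2. Here a single SIMD instruction adds four pairs of 8-bit values at once.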
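The effect of adding four 8-bit lanes at once can be modeled with an ordinary loop. The following is a plain-C sketch of the lane arithmetic only (the real instruction does this in one step); qadd8_model is a made-up name.

```c
#include <stdint.h>

/* Portable model of a four-lane saturating byte addition (hypothetical name).
 * Each byte of x and y is treated as a signed 8-bit value. */
static uint32_t qadd8_model(uint32_t x, uint32_t y)
{
    uint32_t result = 0;

    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 8u * lane;
        int32_t a = (int32_t)((x >> shift) & 0xFFu);
        int32_t b = (int32_t)((y >> shift) & 0xFFu);
        if (a > 127) a -= 256;           /* reinterpret the byte as signed */
        if (b > 127) b -= 256;

        int32_t s = a + b;
        if (s > 127)  s = 127;           /* saturate each lane separately */
        if (s < -128) s = -128;

        result |= ((uint32_t)s & 0xFFu) << shift;
    }
    return result;
}
```

For x = 0x7F8001FF and y = 0x01FF02FF the lanes are (-1)+(-1), 1+2, (-128)+(-1) and 127+1, so the two upper lanes saturate and the result is 0x7F8003FE.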

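8.2.5 Worked example

Purpose:

1. Add values of all four data types.

Procedure:

1. Press key K2; the results are printed over the serial port.

Result:

The printed output, viewed with the serial terminal program SecureCRT (included on the V5 disc), is as follows:

Program: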

/*
*********************************************************************************************************
* Function: DSP_Add
* Description: addition
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Add(void)
{
    static float32_t pSrcA;
    static float32_t pSrcB;
    static float32_t pDst;
    static q31_t  pSrcA1;
    static q31_t  pSrcB1;
    static q31_t  pDst1;
    static q15_t  pSrcA2;
    static q15_t  pSrcB2;
    static q15_t  pDst2;
    static q7_t  pSrcA3;
    static q7_t  pSrcB3;
    static q7_t  pDst3;

    pSrcA--;
    arm_add_f32(&pSrcA, &pSrcB, &pDst, 1);
    printf("arm_add_f32 = %f\r\n", pDst);

    pSrcA1--;
    arm_add_q31(&pSrcA1, &pSrcB1, &pDst1, 1);
    printf("arm_add_q31 = %d\r\n", pDst1);

    pSrcA2--;
    arm_add_q15(&pSrcA2, &pSrcB2, &pDst2, 1);
    printf("arm_add_q15 = %d\r\n", pDst2);

    pSrcA3--;
    arm_add_q7(&pSrcA3, &pSrcB3, &pDst3, 1);
    printf("arm_add_q7 = %d\r\n", pDst3);

    printf("***********************************\r\n");
}

硬汉Eric2013 · 2015-6-4 14:32:38

8.3 Dot product (Vector Dot Product)

These functions compute the dot product, as described by:

sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]
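Written as a plain loop, the formula looks like the sketch below. It is a reference version for checking results on a PC, not the library code; ref_dot_prod_f32 is a made-up name.

```c
#include <stddef.h>

/* Hypothetical reference helper: multiply element by element and
 * accumulate the products into a single sum. */
static float ref_dot_prod_f32(const float *pSrcA, const float *pSrcB, size_t blockSize)
{
    float sum = 0.0f;

    for (size_t n = 0; n < blockSize; n++) {
        sum += pSrcA[n] * pSrcB[n];
    }
    return sum;
}
```

For the vectors {1, 2, 3} and {4, 5, 6} this gives 1*4 + 2*5 + 3*6 = 32.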

8.3.1 arm_dot_prod_f32

This function computes the dot product of 32-bit floating-point vectors. The annotated source code is as follows:

/**
* @defgroup dot_prod Vector Dot Product
*
* Computes the dot product of two vectors.
* The vectors are multiplied element-by-element and then summed.
*
*     sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]
*
* There are separate functions for floating-point, Q7, Q15, and Q31 data types.
*/
/**
* @addtogroup dot_prod
* @{
*/
/**
* @brief Dot product of floating-point vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*/
void arm_dot_prod_f32(
float32_t * pSrcA,
float32_t * pSrcB,
uint32_t blockSize,
float32_t * result)
{
float32_t sum = 0.0f; /* Temporary result storage */ (1)
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer */
sum += (*pSrcA++) * (*pSrcB++); (2)
sum += (*pSrcA++) * (*pSrcB++);
sum += (*pSrcA++) * (*pSrcB++);
sum += (*pSrcA++) * (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
sum += (*pSrcA++) * (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
/* Store the result back in the destination buffer */
*result = sum;
}


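1. The FPU on the CM4 is single precision, so float32_t constants should be written with an f suffix; without the suffix the constant has type double.

2. A statement such as sum += (*pSrcA++) * (*pSrcB++) is ultimately carried out with the floating-point MAC (multiply-accumulate) instruction, which shortens the execution time.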

8.3.2 arm_dot_prod_q31

This function computes the dot product of 32-bit fixed-point (Q31) vectors. The annotated source code is as follows:

/**
* @brief Dot product of Q31 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The intermediate multiplications are in 1.31 x 1.31 = 2.62 format and these
* are truncated to 2.48 format by discarding the lower 14 bits.
* The 2.48 result is then added without saturation to a 64-bit accumulator in 16.48 format.
* There are 15 guard bits in the accumulator and there is no risk of overflow as long as
* the length of the vectors is less than 2^16 elements.
* The return result is in 16.48 format.
*/
void arm_dot_prod_q31(
q31_t * pSrcA,
q31_t * pSrcB,
uint32_t blockSize,
q63_t * result)
{
q63_t sum = 0; /* Temporary result storage */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
sum += ((q63_t) inA1 * inB1) >> 14u; (2)
sum += ((q63_t) inA2 * inB2) >> 14u;
sum += ((q63_t) inA3 * inB3) >> 14u;
sum += ((q63_t) inA4 * inB4) >> 14u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
sum += ((q63_t) * pSrcA++ * *pSrcB++) >> 14u;
/* Decrement the loop counter */
blkCnt--;
}
/* Store the result in the destination buffer in 16.48 format */
*result = sum;
}


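1. Multiplying two Q31 values yields a result in 1.31 * 1.31 = 2.62 format. That much precision is rarely needed in practice, so the function discards the low 14 bits: in the code each product is shifted right by 14 bits, which leaves 48 fractional bits (2.48 format). Accumulating these products in a 64-bit variable gives a result in 16.48 format. According to the source comment, there is no risk of overflow as long as fewer than 2^16 products are accumulated (I do not know why this is not 2^14; left to be resolved later).

2. Shift the product right by 14 bits.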
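The 16.48 result is easier to interpret once it is converted back to a real number. The sketch below models the accumulation in portable C and divides by 2^48 at the end. The names dot_prod_q31_model and q16_48_to_double are made up, and the shift of negative products assumes an arithmetic right shift, as with the ARM compilers and gcc.

```c
#include <stdint.h>
#include <stddef.h>

/* Portable model of the Q31 dot product (hypothetical name): each 2.62
 * product is shifted right by 14 bits to 2.48 format and added to a
 * 64-bit sum, which is then in 16.48 format. */
static int64_t dot_prod_q31_model(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += ((int64_t)a[i] * b[i]) >> 14;
    }
    return sum;
}

/* Convert a 16.48 value to a real number: 48 fractional bits. */
static double q16_48_to_double(int64_t v)
{
    return (double)v / 281474976710656.0;   /* 2^48 */
}
```

As an example, 0x40000000 is 0.5 and 0x20000000 is 0.25 in Q31, so the vectors {0.5, 0.5} and {0.5, 0.25} give 0.25 + 0.125 = 0.375 after conversion.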

8.3.3 arm_dot_prod_q15

This function computes the dot product of 16-bit fixed-point (Q15) vectors. The annotated source code is as follows:

/**
* @brief Dot product of Q15 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The intermediate multiplications are in 1.15 x 1.15 = 2.30 format and these
* results are added to a 64-bit accumulator in 34.30 format.
* Nonsaturating additions are used and given that there are 33 guard bits in the accumulator
* there is no risk of overflow.
* The return result is in 34.30 format.
*/
void arm_dot_prod_q15(
q15_t * pSrcA,
q15_t * pSrcB,
uint32_t blockSize,
q63_t * result)
{
q63_t sum = 0; /* Temporary result storage */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */ (2)
/* Calculate dot product and then store the result in a temporary buffer. */
sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);
sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the results in a temporary buffer. */
sum = __SMLALD(*pSrcA++, *pSrcB++, sum);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the results in a temporary buffer. */
sum += (q63_t) ((q31_t) * pSrcA++ * *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
/* Store the result in the destination buffer in 34.30 format */
*result = sum;
}


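1. Multiplying two Q15 values yields a result in 1.15 * 1.15 = 2.30 format. The function accumulates the products in a 64-bit variable, so the result is in 34.30 format and there is essentially no risk of overflow.

2. __SMLALD is also a SIMD instruction: it multiplies two pairs of 16-bit values and adds both products to a 64-bit accumulator.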
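The arithmetic of this dual multiply-accumulate can be written out in portable C. This is a sketch of the effect only, with made-up names (sext16 and smlald_model); each 32-bit argument is assumed to hold two signed 16-bit samples.

```c
#include <stdint.h>

/* Interpret the low 16 bits of v as a signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Portable model of the dual 16x16 multiply with 64-bit accumulation
 * (hypothetical name): low*low + high*high + acc. */
static int64_t smlald_model(uint32_t x, uint32_t y, int64_t acc)
{
    int64_t low  = (int64_t)sext16(x) * sext16(y);
    int64_t high = (int64_t)sext16(x >> 16) * sext16(y >> 16);
    return acc + low + high;
}
```

With x = 0xFFFE0003 (samples 3 and -2) and y = 0x00050004 (samples 4 and 5), the call smlald_model(x, y, 100) returns 100 + 3*4 + (-2)*5 = 102. Two products of -32768 * -32768 already add up to 2^31, which no longer fits in a signed 32-bit value and shows why the accumulator is 64 bits wide.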

8.3.4 arm_dot_prod_q7

This function computes the dot product of 8-bit fixed-point (Q7) vectors. The annotated source code is as follows:

/**
* @brief Dot product of Q7 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The intermediate multiplications are in 1.7 x 1.7 = 2.14 format and these
* results are added to an accumulator in 18.14 format.
* Nonsaturating additions are used and there is no danger of wrap around as long as
* the vectors are less than 2^18 elements long.
* The return result is in 18.14 format.
*/
void arm_dot_prod_q7(
q7_t * pSrcA,
q7_t * pSrcB,
uint32_t blockSize,
q31_t * result)
{
uint32_t blkCnt; /* loop counter */
q31_t sum = 0; /* Temporary variables to store output */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t input1, input2; /* Temporary variables to store input */
q31_t inA1, inA2, inB1, inB2; /* Temporary variables to store input */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read 4 samples at a time from sourceA */ (2)
input1 = *__SIMD32(pSrcA)++;
/* read 4 samples at a time from sourceB */
input2 = *__SIMD32(pSrcB)++;
/* extract two q7_t samples to q15_t samples */
inA1 = __SXTB16(__ROR(input1, 8)); (3)
/* extract reminaing two samples */
inA2 = __SXTB16(input1);
/* extract two q7_t samples to q15_t samples */
inB1 = __SXTB16(__ROR(input2, 8));
/* extract reminaing two samples */
inB2 = __SXTB16(input2);
/* multiply and accumulate two samples at a time */
sum = __SMLAD(inA1, inB1, sum); (4)
sum = __SMLAD(inA2, inB2, sum);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Dot product and then store the results in a temporary buffer. */
sum = __SMLAD(*pSrcA++, *pSrcB++, sum);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Dot product and then store the results in a temporary buffer. */
sum += (q31_t) ((q15_t) * pSrcA++ * *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
/* Store the result in the destination buffer in 18.14 format */
*result = sum;
}


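1. Multiplying two Q7 values gives a result in 1.7 x 1.7 = 2.14 format. The running sum is kept in a 32-bit variable, so the final format is 18.14. According to the source comment, there is no danger of overflow as long as fewer than 2^18 products are accumulated (the author's impression is that this bound should be 2^16).

2. Four 8-bit samples are read at a time.

3. __SXTB16 is also a SIMD instruction. It sign-extends two 8-bit values (bytes 0 and 2 of the word) to 16 bits each. __ROR rotates a word right by the given number of bits.

4. __SMLAD is also a SIMD instruction. It performs the following operation:

sum = __SMLAD(x, y, z)

sum = z + ((short)(x>>16) * (short)(y>>16)) + ((short)x * (short)y)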
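The intrinsics described in notes 3 and 4 can be tried out on a PC with plain C stand-ins. The sketch below is only an illustration: ror32, sxtb16, smlad, pack4 and dot4_q7 are hypothetical helper names, not CMSIS functions, and pack4 assumes the little-endian byte order in which a 32-bit load sees four consecutive q7 samples.

```c
#include <stdint.h>

/* Hypothetical portable stand-ins for the Cortex-M4 intrinsics.
   The names are illustrative and are not part of CMSIS. */

/* Rotate a 32-bit word right by n bits (0 < n < 32), like __ROR. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Sign-extend bytes 0 and 2 into the two 16-bit halves, like __SXTB16. */
static uint32_t sxtb16(uint32_t x)
{
    uint32_t lo = (uint16_t)(int16_t)(int8_t)(x & 0xFFu);
    uint32_t hi = (uint16_t)(int16_t)(int8_t)((x >> 16) & 0xFFu);
    return (hi << 16) | lo;
}

/* Dual signed 16x16 multiply with accumulate, like __SMLAD. */
static int32_t smlad(uint32_t x, uint32_t y, int32_t acc)
{
    int32_t xl = (int16_t)(x & 0xFFFFu), xh = (int16_t)(x >> 16);
    int32_t yl = (int16_t)(y & 0xFFFFu), yh = (int16_t)(y >> 16);
    return acc + xl * yl + xh * yh;
}

/* Pack four q7 samples the way a little-endian 32-bit load would see them. */
static uint32_t pack4(int8_t b0, int8_t b1, int8_t b2, int8_t b3)
{
    return (uint32_t)(uint8_t)b0 | ((uint32_t)(uint8_t)b1 << 8) |
           ((uint32_t)(uint8_t)b2 << 16) | ((uint32_t)(uint8_t)b3 << 24);
}

/* Dot product of four packed q7 samples, following the unrolled loop body. */
static int32_t dot4_q7(uint32_t a, uint32_t b)
{
    int32_t sum = 0;
    sum = smlad(sxtb16(ror32(a, 8)), sxtb16(ror32(b, 8)), sum); /* samples 1 and 3 */
    sum = smlad(sxtb16(a), sxtb16(b), sum);                      /* samples 0 and 2 */
    return sum;
}
```

For instance, dot4_q7(pack4(1, -2, 3, -4), pack4(5, 6, -7, 8)) accumulates 5 - 12 - 21 - 32. Rotating by 8 before the sign extension is what brings samples 1 and 3 into the byte positions that __SXTB16 picks up.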

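8.3.5 Worked example

Purpose:

1. Dot product for the four data types.

Procedure:

1. Press key K3; the results are printed over the serial port.

Result:

The printed output can be viewed in the serial terminal program SecureCRT (included on the V5 CD).

Program: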

/*
*********************************************************************************************************
* Function: DSP_DotProduct
* Description: dot product
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_DotProduct(void)
{
   static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
   static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
   static float32_t result;

   static q31_t  pSrcA1[5] = {0x7ffffff0,1,1,1,1};
   static q31_t  pSrcB1[5] = {1,1,1,1,1};
   static q63_t result1;

   static q15_t  pSrcA2[5] = {1,1,1,1,1};
   static q15_t  pSrcB2[5] = {1,1,1,1,1};
   static q63_t  result2;

   static q7_t  pSrcA3[5] = {1,1,1,1,1};
   static q7_t  pSrcB3[5] = {1,1,1,1,1};
   static q31_t result3;

   /* The arrays are static, so these adjustments accumulate on every call. */
   pSrcA[0] -= 1.1f;
   arm_dot_prod_f32(pSrcA, pSrcB, 5, &result);
   printf("arm_dot_prod_f32 = %f\r\n", result);

   pSrcA1[0] -= 0xffff;
   arm_dot_prod_q31(pSrcA1, pSrcB1, 5, &result1);
   printf("arm_dot_prod_q31 = %lld\r\n", result1);

   pSrcA2[0] -= 1;
   arm_dot_prod_q15(pSrcA2, pSrcB2, 5, &result2);
   printf("arm_dot_prod_q15 = %lld\r\n", result2);

   pSrcA3[0] -= 1;
   arm_dot_prod_q7(pSrcA3, pSrcB3, 5, &result3);
   printf("arm_dot_prod_q7 = %d\r\n", result3);

   printf("***********************************\r\n");
}
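To relate the raw integer returned by arm_dot_prod_q7 to a real number, and to check the overflow bound by plain arithmetic, a small PC-side sketch may help. It is an assumption-laden illustration rather than library code: dot_q7_ref, q18_14_to_double and max_worst_case_terms are hypothetical names, and the bound assumes the worst case in which every product is (-128) * (-128) = 2^14.

```c
#include <stdint.h>

/* Reference q7 dot product: 2.14 products summed in a 32-bit (18.14) accumulator.
   Hypothetical helper, written only to mirror the Cortex-M0 branch of the listing. */
static int32_t dot_q7_ref(const int8_t *a, const int8_t *b, uint32_t n)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}

/* Interpret an 18.14 value as a real number: there are 14 fractional bits. */
static double q18_14_to_double(int32_t x)
{
    return (double)x / 16384.0;
}

/* Number of worst-case products, each (-128) * (-128) = 2^14,
   that still fit in a signed 32-bit accumulator. */
static uint32_t max_worst_case_terms(void)
{
    return (uint32_t)(INT32_MAX / (128 * 128));
}
```

With the values used in DSP_DotProduct on its first call ({0,1,1,1,1} against five ones) the reference sum is 4, which corresponds to 4 / 2^14 in 18.14 format. Under the worst-case assumption the helper returns 2^17 - 1, so the strict guarantee holds for fewer than 2^17 elements; typical data stays far away from that case.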

硬汉Eric2013 · 2015-6-4 14:36:49

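8.4 Multiplication (Vector Multiplication)

These functions perform element-wise multiplication, described by the following formula:

pDst[n] = pSrcA[n] * pSrcB[n], 0 <= n < blockSize.

8.4.1 arm_mult_f32

This function multiplies 32-bit floating-point vectors. The source code is analyzed below: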

/**
* @brief Floating-point vector multiplication.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*/
void arm_mult_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary input variables */
float32_t inB1, inB2, inB3, inB4; /* temporary input variables */
float32_t out1, out2, out3, out4; /* temporary output variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in output buffer */ (1)
/* read sample from sourceA */
inA1 = *pSrcA;
/* read sample from sourceB */
inB1 = *pSrcB;
/* read sample from sourceA */
inA2 = *(pSrcA + 1);
/* read sample from sourceB */
inB2 = *(pSrcB + 1);
/* out = sourceA * sourceB */
out1 = inA1 * inB1;
/* read sample from sourceA */
inA3 = *(pSrcA + 2);
/* read sample from sourceB */
inB3 = *(pSrcB + 2);
/* out = sourceA * sourceB */
out2 = inA2 * inB2;
/* read sample from sourceA */
inA4 = *(pSrcA + 3);
/* store result to destination buffer */
*pDst = out1;
/* read sample from sourceB */
inB4 = *(pSrcB + 3);
/* out = sourceA * sourceB */
out3 = inA3 * inB3;
/* store result to destination buffer */
*(pDst + 1) = out2;
/* out = sourceA * sourceB */
out4 = inA4 * inB4;
/* store result to destination buffer */
*(pDst + 2) = out3;
/* store result to destination buffer */
*(pDst + 3) = out4;
/* update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in output buffer */
*pDst++ = (*pSrcA++) * (*pSrcB++);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}


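1. 32-bit floating-point multiplication is straightforward. As before, the loop is unrolled so that four products are computed per iteration.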

8.4.2 arm_mult_q31

This function multiplies 32-bit fixed-point (Q31) vectors. The source code is analyzed below:

/**
* @brief Q31 vector multiplication.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] will be saturated.
*/
void arm_mult_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4; /* temporary input variables */
q31_t inB1, inB2, inB3, inB4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
out1 = ((q63_t) inA1 * inB1) >> 32; (2)
out2 = ((q63_t) inA2 * inB2) >> 32;
out3 = ((q63_t) inA3 * inB3) >> 32;
out4 = ((q63_t) inA4 * inB4) >> 32;
out1 = __SSAT(out1, 31); (3)
out2 = __SSAT(out2, 31);
out3 = __SSAT(out3, 31);
out4 = __SSAT(out4, 31);
*pDst++ = out1 << 1u; (4)
*pDst++ = out2 << 1u;
*pDst++ = out3 << 1u;
*pDst++ = out4 << 1u;
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and then store the results in the destination buffer. */
*pDst++ =
(q31_t) clip_q63_to_q31(((q63_t) (*pSrcA++) * (*pSrcB++)) >> 31);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}


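1. This function uses saturating arithmetic.

The result is in Q31 format, with range [0x80000000 0x7FFFFFFF].

2. The 64-bit product is shifted right by 32 bits, which keeps only its upper 32 bits.

3. The value is saturated to 31-bit precision.

4. The value is shifted left by one bit so that the result is in Q31 format.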
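The unrolled loop (shift right by 32, saturate to 31 bits, shift left by 1) and the tail loop (shift right by 31, clip to Q31) are not bit-identical, which is easy to see with a portable sketch. The helpers ssat, mult_q31_fast and mult_q31_tail are hypothetical names written for this illustration, and the code assumes that >> on a negative signed value is an arithmetic shift, as it is on common compilers.

```c
#include <stdint.h>

/* Saturate x to a signed value of the given bit width (1..32), like __SSAT. */
static int32_t ssat(int32_t x, unsigned bits)
{
    const int32_t max = (int32_t)((1u << (bits - 1u)) - 1u);
    const int32_t min = -max - 1;
    if (x > max) return max;
    if (x < min) return min;
    return x;
}

/* Unrolled-loop path: keep the top 32 bits, saturate to 31 bits, then double. */
static int32_t mult_q31_fast(int32_t a, int32_t b)
{
    int32_t out = (int32_t)(((int64_t)a * b) >> 32);
    out = ssat(out, 31);
    return out * 2;   /* same bit pattern as out << 1, without shifting a negative value */
}

/* Tail-loop / Cortex-M0 path: shift by 31 and clip to the Q31 range. */
static int32_t mult_q31_tail(int32_t a, int32_t b)
{
    int64_t p = ((int64_t)a * b) >> 31;
    if (p > INT32_MAX) return INT32_MAX;
    if (p < INT32_MIN) return INT32_MIN;
    return (int32_t)p;
}
```

For 0.5 * 0.5 (0x40000000 twice) both paths give 0x20000000. They differ in the least significant bit: the fast path always leaves bit 0 cleared, so (-1) * (-1) saturates to 0x7FFFFFFE there and to 0x7FFFFFFF in the tail loop.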

8.4.3 arm_mult_q15

This function multiplies 16-bit fixed-point (Q15) vectors. The source code is analyzed below:

/**
* @brief Q15 vector multiplication
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
*/
void arm_mult_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inB1, inB2; /* temporary input variables */
q15_t out1, out2, out3, out4; /* temporary output variables */
q31_t mul1, mul2, mul3, mul4; /* temporary variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read two samples at a time from sourceA */
inA1 = *__SIMD32(pSrcA)++; (2)
/* read two samples at a time from sourceB */
inB1 = *__SIMD32(pSrcB)++;
/* read two samples at a time from sourceA */
inA2 = *__SIMD32(pSrcA)++;
/* read two samples at a time from sourceB */
inB2 = *__SIMD32(pSrcB)++;
/* multiply mul = sourceA * sourceB */
mul1 = (q31_t) ((q15_t) (inA1 >> 16) * (q15_t) (inB1 >> 16)); (3)
mul2 = (q31_t) ((q15_t) inA1 * (q15_t) inB1);
mul3 = (q31_t) ((q15_t) (inA2 >> 16) * (q15_t) (inB2 >> 16));
mul4 = (q31_t) ((q15_t) inA2 * (q15_t) inB2);
/* saturate result to 16 bit */
out1 = (q15_t) __SSAT(mul1 >> 15, 16); (4)
out2 = (q15_t) __SSAT(mul2 >> 15, 16);
out3 = (q15_t) __SSAT(mul3 >> 15, 16);
out4 = (q15_t) __SSAT(mul4 >> 15, 16);
/* store the result */
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT(out2, out1, 16); (5)
*__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#else
*__SIMD32(pDst)++ = __PKHBT(out2, out1, 16);
*__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#endif // #ifndef ARM_MATH_BIG_ENDIAN
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the result in the destination buffer */
*pDst++ = (q15_t) __SSAT((((q31_t) (*pSrcA++) * (*pSrcB++)) >> 15), 16);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}


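1. This function uses saturating arithmetic.

The result is in Q15 format, with range [0x8000 0x7FFF].

2. Two Q15 samples are read at a time.

3. The four products are stored in the Q31 variables mul1, mul2, mul3 and mul4.

4. The low 15 bits of each 32-bit product are discarded, and the result is saturated to 16-bit precision.

5. The SIMD instruction __PKHBT packs two Q15 results into one 32-bit word, so both can be written to the result array with a single 32-bit store instead of two 16-bit stores.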
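Notes 4 and 5 can be illustrated with a portable sketch. The names mult_q15_ref and pack_q15x2 are hypothetical, the packing helper only mimics what __PKHBT(lo, hi, 16) produces, and an arithmetic right shift is assumed for negative products.

```c
#include <stdint.h>

/* Saturating Q15 multiply: 1.15 x 1.15 = 2.30, drop the low 15 bits,
   then clamp to the 16-bit range. */
static int16_t mult_q15_ref(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}

/* Pack two halfwords into one word: low half from lo, high half from hi,
   which is the layout produced by __PKHBT(lo, hi, 16). */
static uint32_t pack_q15x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}
```

On a little-endian target the low halfword lands at the lower address, which is why the listing passes out2 (the product of the first sample pair) as the low half and out1 as the high half.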

8.4.4 arm_mult_q7

This function multiplies 8-bit fixed-point (Q7) vectors. The source code is analyzed below:

/**
* @brief Q7 vector multiplication
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* \par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
*/
void arm_mult_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t out1, out2, out3, out4; /* Temporary variables to store the product */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in temporary variables */ (2)
out1 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out2 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out3 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out4 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
/* Store the results of 4 inputs in the destination buffer in single cycle by packing */
*__SIMD32(pDst)++ = __PACKq7(out1, out2, out3, out4); (3)
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the result in the destination buffer */
*pDst++ = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}


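1. This function uses saturating arithmetic.

The result is in Q7 format, with range [0x80 0x7F].

2. The product of two Q7 values is shifted right by 7 bits, which discards its low 7 bits, and the result is saturated to 8-bit precision.

3. __PACKq7 packs the four Q7 results into one 32-bit word, so that all four can be written to the destination buffer with a single 32-bit store.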
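The Q7 case follows the same pattern with a 7-bit shift and an 8-bit clamp. The sketch below uses hypothetical helper names (mult_q7_ref, pack_q7x4); the packing order assumes a little-endian target, where the first result occupies the lowest byte.

```c
#include <stdint.h>

/* Saturating Q7 multiply: 1.7 x 1.7 = 2.14, drop the low 7 bits,
   then clamp to the 8-bit range. */
static int8_t mult_q7_ref(int8_t a, int8_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 7;
    if (p > INT8_MAX) p = INT8_MAX;
    if (p < INT8_MIN) p = INT8_MIN;
    return (int8_t)p;
}

/* Pack four q7 results into one word, first result in the lowest byte. */
static uint32_t pack_q7x4(int8_t v0, int8_t v1, int8_t v2, int8_t v3)
{
    return (uint32_t)(uint8_t)v0 | ((uint32_t)(uint8_t)v1 << 8) |
           ((uint32_t)(uint8_t)v2 << 16) | ((uint32_t)(uint8_t)v3 << 24);
}
```

For example, 0x71 * 0x7F is 14351, and dropping the low 7 bits leaves 112 (0x70), while (-128) * (-128) would give 128 and is clamped to 127.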

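8.4.5 Worked example

Purpose:

1. Multiplication for the four data types.

Procedure:

1. Push the joystick UP; the results are printed over the serial port.

Result:

The printed output can be viewed in the serial terminal program SecureCRT (included on the V5 CD).

Program: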

/*
*********************************************************************************************************
* Function: DSP_Multiplication
* Description: multiplication
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Multiplication(void)
{
   static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
   static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
   static float32_t pDst[5];

   static q31_t  pSrcA1[5] = {1,1,1,1,1};
   static q31_t  pSrcB1[5] = {1,1,1,1,1};
   static q31_t  pDst1[5];

   static q15_t  pSrcA2[5] = {1,1,1,1,1};
   static q15_t  pSrcB2[5] = {1,1,1,1,1};
   static q15_t  pDst2[5];

   static q7_t  pSrcA3[5] = {0x70,1,1,1,1};
   static q7_t  pSrcB3[5] = {0x7f,1,1,1,1};
   static q7_t pDst3[5];

   /* The arrays are static, so these adjustments accumulate on every call. */
   pSrcA[0] += 1.1f;
   arm_mult_f32(pSrcA, pSrcB, pDst, 5);
   printf("arm_mult_f32 = %f\r\n", pDst[0]);

   pSrcA1[0] += 1;
   arm_mult_q31(pSrcA1, pSrcB1, pDst1, 5);
   printf("arm_mult_q31 = %d\r\n", pDst1[0]);

   pSrcA2[0] += 1;
   arm_mult_q15(pSrcA2, pSrcB2, pDst2, 5);
   printf("arm_mult_q15 = %d\r\n", pDst2[0]);

   pSrcA3[0] += 1;
   arm_mult_q7(pSrcA3, pSrcB3, pDst3, 5);
   printf("arm_mult_q7 = %d\r\n", pDst3[0]);

   printf("***********************************\r\n");
}
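Going by the formulas in the listings, raw inputs as small as 1 and 2 should give a Q31 or Q15 product of 0, because the shift discards every significant bit. To see non-trivial results the inputs need to be scaled fractions. The following is a minimal sketch of that scaling for Q15 only, assuming hypothetical helpers to_q15, from_q15 and mult_q15_ref that are not part of the tutorial's project (CMSIS has its own conversion routines, which are not used here).

```c
#include <stdint.h>

/* Hypothetical helper: convert a real value in [-1, 1) to Q15 with saturation. */
static int16_t to_q15(double x)
{
    double scaled = x * 32768.0;
    if (scaled > 32767.0) return INT16_MAX;
    if (scaled < -32768.0) return INT16_MIN;
    return (int16_t)scaled;   /* truncates toward zero */
}

/* Hypothetical helper: interpret a Q15 value as a real number. */
static double from_q15(int16_t x)
{
    return (double)x / 32768.0;
}

/* Same arithmetic as the tail loop of arm_mult_q15 for a single pair. */
static int16_t mult_q15_ref(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}
```

With these helpers, mult_q15_ref(to_q15(0.5), to_q15(0.25)) yields 4096, which from_q15 maps back to 0.125, whereas mult_q15_ref(2, 1) is 0.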