STM32/STM8技术论坛

硬汉Eric2013

9年用户 1035经验值

擅长:嵌入式技术

私信关注

[经验]

【安富莱——DSP教程】第8章 BasicMathFunctions的使用（一）

第8章 BasicMathFunctions的使用（一）

本期教程开始学习ARM官方的DSP库，这里我们先从基本数学函数开始。本期教程主要讲绝对值，加法，点乘和乘法四种运算。

8.1 绝对值（VectorAbsolute Value）

8.2 求和（VectorAddition）

8.3 点乘（VectorDot Product）

8.4 乘法（VectorMultiplication）

回帖（5）

硬汉Eric2013

2015-6-4 14:21:12

8.1 绝对值（Vector Absolute Value）

这部分函数主要用于求绝对值，公式描述如下：

pDst[n] = abs(pSrc[n]), 0 <= n < blockSize.

特别注意，这部分函数支持目标指针和源指针指向相同的缓冲区。

8.1.1 arm_abs_f32

这个函数用于求32位浮点数的绝对值，源代码分析如下：

/**
* @brief Floating-point vector absolute value. （1）
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* [url=home.php?mod=space&uid=1141835]@Return[/url] none.
*/
void arm_abs_f32( （2）
float32_t * pSrc,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY （3）
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u; （4）
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = |A| */
/* Calculate absolute and then store the results in the destination buffer. */
/* read sample from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
/* find absolute value */
in1 = fabsf(in1); （5）
/* read sample from source */
in4 = *(pSrc + 3);
/* find absolute value */
in2 = fabsf(in2);
/* read sample from source */
*pDst = in1;
/* find absolute value */
in3 = fabsf(in3);
/* find absolute value */
in4 = fabsf(in4);
/* store result to destination */
*(pDst + 1) = in2;
/* store result to destination */
*(pDst + 2) = in3;
/* store result to destination */
*(pDst + 3) = in4;
/* Update source pointer to process next sampels */ （6）
pSrc += 4u;
/* Update destination pointer to process next sampels */
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else （7）
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u) （8）
{
/* C = |A| */
/* Calculate absolute and then store the results in the destination buffer. */
*pDst++ = fabsf(*pSrc++);
/* Decrement the loop counter */
blkCnt--;
}
}

1. 在这里简单的跟大家介绍一下DSP库中函数的通用格式，后面就不再赘述了。

（1）基本所有的函数都是可重入的。

（2）大部分函数都支持一组数的计算，比如这个函数arm_abs_f32就可以计算一组数的绝对值。所以如果只是就几个数的绝对值，用这个库函数就没有什么优势了。

（3）库函数基本是CM0，CM3和CM4都支持的（最新的DSP库已经添加CM7的支持）。

（4）每组数据基本上都是以4个数为一个单位进行计算，不够四个再单独计算。

（5）大部分函数都是配有f32，Q31，Q15和Q7四种格式。

2. 函数参数，支持输入一个数组进行计算绝对值。

3. 这部分代码是用于CM3和CM4内核。

4. 左移两位从而实现每4个数据为一组进行计算。

5. fabsf：这个函数不是用Cortex-M4F支持的DSP指令实现的，而是用C语言实现的，这个函数是被MDK封装起来的。

6. 切换到下一组数据。

7. 这部分代码用于CM0.

8. 用于不够4个数据的计算或者CM0内核。

8.1.2 arm_abs_q31

这个函数用于求32位定点数的绝对值，源代码分析如下：

/**
* @brief Q31 vector absolute value.
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: （1）
* par
* The function uses saturating arithmetic.
* The Q31 value -1 (0x80000000) will be saturated to the maximum allowable positive value 0x7FFFFFFF.
*/
void arm_abs_q31(
q31_t * pSrc,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q31_t in; /* Input value */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = |A| */
/* Calculate absolute of input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
*pDst++ = (in1 > 0) ? in1 : (q31_t)__QSUB(0, in1); （2）
*pDst++ = (in2 > 0) ? in2 : (q31_t)__QSUB(0, in2);
*pDst++ = (in3 > 0) ? in3 : (q31_t)__QSUB(0, in3);
*pDst++ = (in4 > 0) ? in4 : (q31_t)__QSUB(0, in4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = |A| */
/* Calculate absolute value of the input (if -1 then saturated to 0x7fffffff) and then store the results in the destination buffer. */
in = *pSrc++;
*pDst++ = (in > 0) ? in : ((in == INT32_MIN) ? INT32_MAX : -in);
/* Decrement the loop counter */
blkCnt--;
}
}

1. 这个函数使用了饱和运算，其实不光这个函数，后面很多函数都是使用了饱和运算的，关于什么是饱和运算，大家看Cortex-M3权威指南中文版的4.3.6 小节：汇编语言：饱和运算即可。

对于Q31格式的数据，饱和运算会使得数据0x80000000变成0x7fffffff（这个数比较特殊，算是特殊处理，记住即可）。

2. 这里重点说一下函数__QSUB，其实这个函数算是Cortex-M4/M3的一个指令，用于实现饱和减法。

比如函数：__QSUB(0,in1) 的作用就是实现0 – in1并返回结果。这里__QSUB实现的是32位数的饱和减法。还有__QSUB16和__QSUB8实现的是16位和8位数的减法。

8.1.3 arm_abs_q15

这个函数用于求15位定点数的绝对值，源代码分析如下：

/**
* @brief Q15 vector absolute value.
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior:
* par
* The function uses saturating arithmetic.
* The Q15 value -1 (0x8000) will be saturated to the maximum allowable positive value 0x7FFF. （1）
*/
void arm_abs_q15(
q15_t * pSrc,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
__SIMD32_TYPE *simd; （2）
/* Run the below code for Cortex-M4 and Cortex-M3 */
q15_t in1; /* Input value1 */
q15_t in2; /* Input value2 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
simd = __SIMD32_CONST(pDst); （3）
while(blkCnt > 0u)
{
/* C = |A| */
/* Read two inputs */
in1 = *pSrc++;
in2 = *pSrc++;
/* Store the Absolute result in the destination buffer by packing the two values, in a single cycle */
#ifndef ARM_MATH_BIG_ENDIAN
*simd++ =
__PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), （4）
((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
#else
*simd++ =
__PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
in1 = *pSrc++;
in2 = *pSrc++;
#ifndef ARM_MATH_BIG_ENDIAN
*simd++ =
__PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)),
((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
#else
*simd++ =
__PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
/* Decrement the loop counter */
blkCnt--;
}
pDst = (q15_t *)simd;
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = |A| */
/* Read the input */
in1 = *pSrc++;
/* Calculate absolute value of input and then store the result in the destination buffer. */
*pDst++ = (in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
q15_t in; /* Temporary input variable */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = |A| */
/* Read the input */
in = *pSrc++;
/* Calculate absolute value of input and then store the result in the destination buffer. */
*pDst++ = (in > 0) ? in : ((in == (q15_t) 0x8000) ? 0x7fff : -in);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

1. 对于Q15格式的数据，饱和运算会使得数据0x8000变成0x7fff。

2. __SIMD32_TYPE的定义在文件arm_math.h中，具体定义如下：

#define __SIMD32_TYPE int32_t __packed

SIMD就是咱们上期教程所将的单指令多数据流。简单的理解就是__SIMD32_TYPE就是定义了一个int32_t类型的数据，__packed的含义就是实现字节的对齐功能，方便两个16位数据的都存入到这个数据类型中。

3. 函数__SIMD32_CONST的定义如下：

#define __SIMD32_CONST(addr) ((__SIMD32_TYPE *)(addr))

4. 函数__PKHBT的定义在文件core_cm4_simd.h，定义如下：

#define __PKHBT(ARG1,ARG2,ARG3) ( ((((uint32_t)(ARG1)) ) &0x0000FFFFUL) |

((((uint32_t)(ARG2)) <<(ARG3)) & 0xFFFF0000UL) )

这个宏定义的作用就是将将两个16位的数据合并成32位数据。但是有一点要特别说明__PKHBT也是CM4内核支持的SIMD指令，上面的宏定义的C函数会被MDK自动识别并调用相应的PKHBT指令。

__QSUB16用于实现16位数据的饱和减法。

8.1.4 arm_abs_q7

这个函数用于求8位定点数的绝对值，源代码分析如下：

/**
* @brief Q7 vector absolute value.
* @param[in] *pSrc points to the input buffer
* @param[out] *pDst points to the output buffer
* @param[in] blockSize number of samples in each vector
* @return none.
*
* par Conditions for optimum performance
* Input and output buffers should be aligned by 32-bit
*
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* The Q7 value -1 (0x80) will be saturated to the maximum allowable positive value 0x7F.
*/
void arm_abs_q7(
q7_t * pSrc,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q7_t in; /* Input value1 */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = |A| */
/* Read inputs */
in1 = (q31_t) * pSrc;
in2 = (q31_t) * (pSrc + 1);
in3 = (q31_t) * (pSrc + 2);
/* find absolute value */
out1 = (in1 > 0) ? in1 : (q31_t)__QSUB8(0, in1); (2)
/* read input */
in4 = (q31_t) * (pSrc + 3);
/* find absolute value */
out2 = (in2 > 0) ? in2 : (q31_t)__QSUB8(0, in2);
/* store result to destination */
*pDst = (q7_t) out1;
/* find absolute value */
out3 = (in3 > 0) ? in3 : (q31_t)__QSUB8(0, in3);
/* find absolute value */
out4 = (in4 > 0) ? in4 : (q31_t)__QSUB8(0, in4);
/* store result to destination */
*(pDst + 1) = (q7_t) out2;
/* store result to destination */
*(pDst + 2) = (q7_t) out3;
/* store result to destination */
*(pDst + 3) = (q7_t) out4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
blkCnt = blockSize;
#endif // #define ARM_MATH_CM0_FAMILY
while(blkCnt > 0u)
{
/* C = |A| */
/* Read the input */
in = *pSrc++;
/* Store the Absolute result in the destination buffer */
*pDst++ = (in > 0) ? in : ((in == (q7_t) 0x80) ? 0x7f : -in);
/* Decrement the loop counter */
blkCnt--;
}
}

1. 由于饱和运算，0x80求绝对值将变成数据0x7F。

2. __QSUB8用以实现8位数的饱和减法运算。

8.1.5 实例讲解

实验目的：

1. 四种数据类型数据绝对值求解

实验内容：

1. 按下按键K1, 串口打印输出结果

实验现象：

通过窗口上位机软件SecureCRT（V5光盘里面有此软件）查看打印信息现象如下：

程序设计：

/*

*********************************************************************************************************

* 函数名: DSP_ABS

* 功能说明: 求绝对值

* 形参：无

* 返回值: 无

*********************************************************************************************************

*/

static void DSP_ABS(void)

{

   static float32_t pSrc;

static float32_t pDst;

   static q31_t pSrc1;

   static q31_t pDst1;

   static q15_t pSrc2;

   static q15_t pDst2;

   static q7_t pSrc3 = 127; /* 为了说明问题，在这里设置初始值为127，然后查看0x80是否饱和为0x7F */

   static q7_t pDst3;

   pSrc -= 1.23f;

   arm_abs_f32(&pSrc, &pDst, 1);                                                                   (1)

   printf("arm_abs_f32 = %frn", pDst);

   pSrc1 -= 1;

   arm_abs_q31(&pSrc1, &pDst1, 1);                                                                (2)

   printf("arm_abs_q31 = %drn", pDst1);

   pSrc2 -= 1;

   arm_abs_q15(&pSrc2, &pDst2, 1);                                                                (3)

   printf("arm_abs_q15 = %drn", pDst2);

   pSrc3 += 1;

   printf("pSrc3 = %drn", pSrc3);

   arm_abs_q7(&pSrc3, &pDst3, 1);                                                                (4)

   printf("arm_abs_q7 = %drn", pDst3);

   printf("***********************************rn");

}

(1)到(4)实现相应格式下绝对值的求解。这里只求了一个数，大家可以尝试求解一个数组的绝对值。

硬汉Eric2013

2015-6-4 14:25:52

8.2 求和（Vector Addition）

这部分函数主要用于求和，公式描述如下：

pDst[n] = pSrcA[n] + pSrcB[n], 0 <= n < blockSize.

8.2.1 arm_add_f32

这个函数用于求32位浮点数的和，源代码分析如下：

/**
* @brief Floating-point vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*/
void arm_add_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary input variabels */
float32_t inB1, inB2, inB3, inB4; /* temporary input variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
/* read four inputs from sourceA and four inputs from sourceB */
inA1 = *pSrcA;
inB1 = *pSrcB;
inA2 = *(pSrcA + 1);
inB2 = *(pSrcB + 1);
inA3 = *(pSrcA + 2);
inB3 = *(pSrcB + 2);
inA4 = *(pSrcA + 3);
inB4 = *(pSrcB + 3);
/* C = A + B */ (1)
/* add and store result to destination */
*pDst = inA1 + inB1;
*(pDst + 1) = inA2 + inB2;
*(pDst + 2) = inA3 + inB3;
*(pDst + 3) = inA4 + inB4;
/* update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (*pSrcA++) + (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
}

1. 这部分的代码比较简单，只是求解两个数的和。

8.2.2 arm_add_q31

这个函数用于求32位定点数的和，源代码分析如下：

/**
* @brief Q31 vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] will be saturated.
*/
void arm_add_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
*pDst++ = __QADD(inA1, inB1); (2)
*pDst++ = __QADD(inA2, inB2);
*pDst++ = __QADD(inA3, inB3);
*pDst++ = __QADD(inA4, inB4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = __QADD(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrcA++ + *pSrcB++); (3)
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

1. 这个函数也是饱和运算，输出结果的范围[0x800000000x7FFFFFFF]，超出这个结果将产生饱和结果。

2. __QADD实现32位数的加法。

3. 函数clip_q63_to_q31的定义在文件arm_math.h里面

static __INLINE q31_t clip_q63_to_q31(

q63_t x)

{

return ((q31_t) (x >> 32) != ((q31_t)x >> 31)) ?

((0x7FFFFFFF ^ ((q31_t) (x >>63)))) : (q31_t) x;

}

这个函数的作用是实现饱和结果。

8.2.3 arm_add_q15

这个函数用于求16位定点数的和，源代码分析如下：

/**
* @brief Q15 vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
*/
void arm_add_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inB1, inB2;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */ (2)
/* Add and then store the results in the destination buffer. */
inA1 = *__SIMD32(pSrcA)++;
inA2 = *__SIMD32(pSrcA)++;
inB1 = *__SIMD32(pSrcB)++;
inB2 = *__SIMD32(pSrcB)++;
*__SIMD32(pDst)++ = __QADD16(inA1, inB1);
*__SIMD32(pDst)++ = __QADD16(inA2, inB2);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q15_t) __QADD16(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrcA++ + *pSrcB++), 16); (3)
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

1. 这个函数也是饱和运算，输出结果的范围[0x80000x7FFF]，超出这个结果将产生饱和结果。

2. 函数inA1 = *__SIMD32(pSrcA)++仅需要一条SIMD指令即可完成将两个16位数存到32位的变量inA1中。

3. __SSAT也是SIMD指令，这里是将结果饱和到16位精度。

8.2.4 arm_add_q7

这个函数用于求8位定点数的绝对值，源代码分析如下：

/**
* @brief Q7 vector addition.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
*/
void arm_add_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */ (2)
*__SIMD32(pDst)++ = __QADD8(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q7_t) __SSAT(*pSrcA++ + *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + B */
/* Add and then store the results in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrcA++ + *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

1．这个函数也是饱和运算，输出结果的范围[0x800x7F]，超出这个结果将产生饱和。

2．这里通过SIMD指令实现4组8位数的加法。

8.2.5 实例讲解

实验目的：

1. 四种类似数据的求和

实验内容：

1. 按下按键K2, 串口打印输出结果

实验现象：

通过窗口上位机软件SecureCRT（V5光盘里面有此软件）查看打印信息现象如下：

程序设计：

/*

*********************************************************************************************************

* 函数名: DSP_ABS

* 功能说明: 加法

* 形参：无

* 返回值: 无

*********************************************************************************************************

*/

static void DSP_Add(void)

{

   static float32_t pSrcA;

   static float32_t pSrcB;

   static float32_t pDst;

   static q31_t  pSrcA1;

   static q31_t  pSrcB1;

   static q31_t  pDst1;

   static q15_t  pSrcA2;

   static q15_t  pSrcB2;

   static q15_t  pDst2;

   static q7_t  pSrcA3;

   static q7_t  pSrcB3;

   static q7_t  pDst3;

   pSrcA--;

   arm_add_f32(&pSrcA, &pSrcB, &pDst, 1);

   printf("arm_add_f32 = %frn", pDst);

   pSrcA1--;

   arm_add_q31(&pSrcA1, &pSrcB1, &pDst1, 1);

   printf("arm_add_q31 = %drn", pDst1);

   pSrcA2--;

   arm_add_q15(&pSrcA2, &pSrcB2, &pDst2, 1);

   printf("arm_add_q15 = %drn", pDst2);

   pSrcA3--;

   arm_add_q7(&pSrcA3, &pSrcB3, &pDst3, 1);

   printf("arm_add_q7 = %drn", pDst3);

   printf("***********************************rn");

}

硬汉Eric2013

2015-6-4 14:32:38

8.3 点乘（Vector Dot Product）

这部分函数主要用于点乘，公式描述如下：

sum =pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... +pSrcA[blockSize-1]*pSrcB[blockSize-1]

8.3.1 arm_dot_prod_f32

这个函数用于求32位浮点数的点乘，源代码分析如下：

/**
* @defgroup dot_prod Vector Dot Product
*
* Computes the dot product of two vectors.
* The vectors are multiplied element-by-element and then summed.
*



 *     sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]

 *

*
* There are separate functions for floating-point, Q7, Q15, and Q31 data types.
*/
/**
* @addtogroup dot_prod
* @{
*/
/**
* @brief Dot product of floating-point vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*/
void arm_dot_prod_f32(
float32_t * pSrcA,
float32_t * pSrcB,
uint32_t blockSize,
float32_t * result)
{
float32_t sum = 0.0f; /* Temporary result storage */ (1)
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer */
sum += (*pSrcA++) * (*pSrcB++); (2)
sum += (*pSrcA++) * (*pSrcB++);
sum += (*pSrcA++) * (*pSrcB++);
sum += (*pSrcA++) * (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
sum += (*pSrcA++) * (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
/* Store the result back in the destination buffer */
*result = sum;
}

1. 由于CM4上带的FPU是单精度的，所以初始化float32_t类型的浮点数时需要在数据的末尾加上f。

2. 类似函数sum += (*pSrcA++) * (*pSrcB++)最终会通过浮点的MAC（乘累加）实现，从而加快执行时间。

8.3.2 arm_dot_prod_q31

这个函数用于求32位定点数的点乘，源代码分析如下：

/**
* @brief Dot product of Q31 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The intermediate multiplications are in 1.31 x 1.31 = 2.62 format and these
* are truncated to 2.48 format by discarding the lower 14 bits.
* The 2.48 result is then added without saturation to a 64-bit accumulator in 16.48 format.
* There are 15 guard bits in the accumulator and there is no risk of overflow as long as
* the length of the vectors is less than 2^16 elements.
* The return result is in 16.48 format.
*/
void arm_dot_prod_q31(
q31_t * pSrcA,
q31_t * pSrcB,
uint32_t blockSize,
q63_t * result)
{
q63_t sum = 0; /* Temporary result storage */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
sum += ((q63_t) inA1 * inB1) >> 14u; (2)
sum += ((q63_t) inA2 * inB2) >> 14u;
sum += ((q63_t) inA3 * inB3) >> 14u;
sum += ((q63_t) inA4 * inB4) >> 14u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
sum += ((q63_t) * pSrcA++ * *pSrcB++) >> 14u;
/* Decrement the loop counter */
blkCnt--;
}
/* Store the result in the destination buffer in 16.48 format */
*result = sum;
}

1. 两个Q31格式的32位数相乘，那么输出结果的格式是1.31*1.31 = 2.62。实际应用中基本不需要这么高的精度，这个函数将低14位的数据截取掉，反应在函数中就是两个数的乘积左移14位，也就是定点数的小数点也左移14位，那么最终的结果的格式是16.48。所以只要乘累加的个数小于2^16就没有输出结果溢出的危险（不知道这里为什么不是2^14，留作以后解决）。

2. 将获取的结果左移14位。

8.3.3 arm_dot_prod_q15

这个函数用于求16位定点数的点乘，源代码分析如下：

/**
* @brief Dot product of Q15 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The intermediate multiplications are in 1.15 x 1.15 = 2.30 format and these
* results are added to a 64-bit accumulator in 34.30 format.
* Nonsaturating additions are used and given that there are 33 guard bits in the accumulator
* there is no risk of overflow.
* The return result is in 34.30 format.
*/
void arm_dot_prod_q15(
q15_t * pSrcA,
q15_t * pSrcB,
uint32_t blockSize,
q63_t * result)
{
q63_t sum = 0; /* Temporary result storage */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */ (2)
/* Calculate dot product and then store the result in a temporary buffer. */
sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);
sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the results in a temporary buffer. */
sum = __SMLALD(*pSrcA++, *pSrcB++, sum);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the results in a temporary buffer. */
sum += (q63_t) ((q31_t) * pSrcA++ * *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
/* Store the result in the destination buffer in 34.30 format */
*result = sum;
}

1．两个Q15格式的数据相乘，那么输出结果的格式是1.15*1.15 = 2.30，这个函数将输出结果赋值给了64位变量，那么输出结果就是34.30格式。所以基本没有溢出的危险。

2． __SMLALD也是SIMD指令，实现两个16位数相乘，并把结果累加给64位变量。

8.3.4 arm_dot_prod_q7

这个函数用于求8位定点数的点乘，源代码分析如下：

/**
* @brief Dot product of Q7 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The intermediate multiplications are in 1.7 x 1.7 = 2.14 format and these
* results are added to an accumulator in 18.14 format.
* Nonsaturating additions are used and there is no danger of wrap around as long as
* the vectors are less than 2^18 elements long.
* The return result is in 18.14 format.
*/
void arm_dot_prod_q7(
q7_t * pSrcA,
q7_t * pSrcB,
uint32_t blockSize,
q31_t * result)
{
uint32_t blkCnt; /* loop counter */
q31_t sum = 0; /* Temporary variables to store output */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t input1, input2; /* Temporary variables to store input */
q31_t inA1, inA2, inB1, inB2; /* Temporary variables to store input */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read 4 samples at a time from sourceA */ (2)
input1 = *__SIMD32(pSrcA)++;
/* read 4 samples at a time from sourceB */
input2 = *__SIMD32(pSrcB)++;
/* extract two q7_t samples to q15_t samples */
inA1 = __SXTB16(__ROR(input1, 8)); (3)
/* extract reminaing two samples */
inA2 = __SXTB16(input1);
/* extract two q7_t samples to q15_t samples */
inB1 = __SXTB16(__ROR(input2, 8));
/* extract reminaing two samples */
inB2 = __SXTB16(input2);
/* multiply and accumulate two samples at a time */
sum = __SMLAD(inA1, inB1, sum); (4)
sum = __SMLAD(inA2, inB2, sum);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Dot product and then store the results in a temporary buffer. */
sum = __SMLAD(*pSrcA++, *pSrcB++, sum);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Dot product and then store the results in a temporary buffer. */
sum += (q31_t) ((q15_t) * pSrcA++ * *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
/* Store the result in the destination buffer in 18.14 format */
*result = sum;
}

1. 两个Q8格式的数据相乘，那么输出结果就是1.7*1.7 = 2.14格式。这里将最终结果赋值给了32位的变量，那么最终的格式就是18.14。如果乘累加的个数小于2^18那么就不会有溢出的危险（感觉这里应该是2^16）。

2. 一次读取4个8位的数据。

3. __SXTB16也是SIMD指令，用于将两个8位的有符号数扩展成16位。__ROR用于实现数据的循环右移。

4. __SMLAD也是SIMD指令，用于实现如下功能：

sum = __SMLAD(x, y, z)

sum = z + ((short)(x>>16) * (short)(y>>16))+ ((short)x * (short)y)

8.3.5 实例讲解

实验目的：

1. 四种类型数据的点乘。

实验内容：

1. 按下按键K3, 串口打印输出结果

实验现象：

通过窗口上位机软件SecureCRT（V5光盘里面有此软件）查看打印信息现象如下：

程序设计：

/*

*********************************************************************************************************

* 函数名: DSP_DotProduct

* 功能说明: 乘积

* 形参：无

* 返回值: 无

*********************************************************************************************************

*/

static void DSP_DotProduct(void)

{

   static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};

   static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};

   static float32_t result;

   static q31_t  pSrcA1[5] = {0x7ffffff0,1,1,1,1};

   static q31_t  pSrcB1[5] = {1,1,1,1,1};

   static q63_t result1;

   static q15_t  pSrcA2[5] = {1,1,1,1,1};

   static q15_t  pSrcB2[5] = {1,1,1,1,1};

   static q63_t  result2;

   static q7_t  pSrcA3[5] = {1,1,1,1,1};

   static q7_t  pSrcB3[5] = {1,1,1,1,1};

   static q31_t result3;

   pSrcA[0] -= 1.1f;

   arm_dot_prod_f32(pSrcA, pSrcB, 5, &result);

   printf("arm_dot_prod_f32 = %frn", result);

   pSrcA1[0] -= 0xffff;

   arm_dot_prod_q31(pSrcA1, pSrcB1, 5, &result1);

   printf("arm_dot_prod_q31 = %lldrn", result1);

   pSrcA2[0] -= 1;

   arm_dot_prod_q15(pSrcA2, pSrcB2, 5, &result2);

   printf("arm_dot_prod_q15 = %lldrn", result2);

   pSrcA3[0] -= 1;

   arm_dot_prod_q7(pSrcA3, pSrcB3, 5, &result3);

   printf("arm_dot_prod_q7 = %drn", result3);

   printf("***********************************rn");

}

8.3 点乘（Vector Dot Product）

这部分函数主要用于点乘，公式描述如下：

sum =pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... +pSrcA[blockSize-1]*pSrcB[blockSize-1]

8.3.1 arm_dot_prod_f32

这个函数用于求32位浮点数的点乘，源代码分析如下：

/**
* @defgroup dot_prod Vector Dot Product
*
* Computes the dot product of two vectors.
* The vectors are multiplied element-by-element and then summed.
*



 *     sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]

 *

*
* There are separate functions for floating-point, Q7, Q15, and Q31 data types.
*/
/**
* @addtogroup dot_prod
* @{
*/
/**
* @brief Dot product of floating-point vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*/
void arm_dot_prod_f32(
float32_t * pSrcA,
float32_t * pSrcB,
uint32_t blockSize,
float32_t * result)
{
float32_t sum = 0.0f; /* Temporary result storage */ (1)
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer */
sum += (*pSrcA++) * (*pSrcB++); (2)
sum += (*pSrcA++) * (*pSrcB++);
sum += (*pSrcA++) * (*pSrcB++);
sum += (*pSrcA++) * (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
sum += (*pSrcA++) * (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
/* Store the result back in the destination buffer */
*result = sum;
}

1. 由于CM4上带的FPU是单精度的，所以初始化float32_t类型的浮点数时需要在数据的末尾加上f。

2. 类似函数sum += (*pSrcA++) * (*pSrcB++)最终会通过浮点的MAC（乘累加）实现，从而加快执行时间。

8.3.2 arm_dot_prod_q31

这个函数用于求32位定点数的点乘，源代码分析如下：

/**
* @brief Dot product of Q31 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The intermediate multiplications are in 1.31 x 1.31 = 2.62 format and these
* are truncated to 2.48 format by discarding the lower 14 bits.
* The 2.48 result is then added without saturation to a 64-bit accumulator in 16.48 format.
* There are 15 guard bits in the accumulator and there is no risk of overflow as long as
* the length of the vectors is less than 2^16 elements.
* The return result is in 16.48 format.
*/
void arm_dot_prod_q31(
q31_t * pSrcA,
q31_t * pSrcB,
uint32_t blockSize,
q63_t * result)
{
q63_t sum = 0; /* Temporary result storage */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
sum += ((q63_t) inA1 * inB1) >> 14u; (2)
sum += ((q63_t) inA2 * inB2) >> 14u;
sum += ((q63_t) inA3 * inB3) >> 14u;
sum += ((q63_t) inA4 * inB4) >> 14u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the result in a temporary buffer. */
sum += ((q63_t) * pSrcA++ * *pSrcB++) >> 14u;
/* Decrement the loop counter */
blkCnt--;
}
/* Store the result in the destination buffer in 16.48 format */
*result = sum;
}

2. 将获取的结果左移14位。

8.3.3 arm_dot_prod_q15

这个函数用于求16位定点数的点乘，源代码分析如下：

/**
* @brief Dot product of Q15 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The intermediate multiplications are in 1.15 x 1.15 = 2.30 format and these
* results are added to a 64-bit accumulator in 34.30 format.
* Nonsaturating additions are used and given that there are 33 guard bits in the accumulator
* there is no risk of overflow.
* The return result is in 34.30 format.
*/
void arm_dot_prod_q15(
q15_t * pSrcA,
q15_t * pSrcB,
uint32_t blockSize,
q63_t * result)
{
q63_t sum = 0; /* Temporary result storage */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */ (2)
/* Calculate dot product and then store the result in a temporary buffer. */
sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);
sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the results in a temporary buffer. */
sum = __SMLALD(*pSrcA++, *pSrcB++, sum);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Calculate dot product and then store the results in a temporary buffer. */
sum += (q63_t) ((q31_t) * pSrcA++ * *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
/* Store the result in the destination buffer in 34.30 format */
*result = sum;
}

2． __SMLALD也是SIMD指令，实现两个16位数相乘，并把结果累加给64位变量。

8.3.4 arm_dot_prod_q7

这个函数用于求8位定点数的点乘，源代码分析如下：

/**
* @brief Dot product of Q7 vectors.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[in] blockSize number of samples in each vector
* @param[out] *result output result returned here
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The intermediate multiplications are in 1.7 x 1.7 = 2.14 format and these
* results are added to an accumulator in 18.14 format.
* Nonsaturating additions are used and there is no danger of wrap around as long as
* the vectors are less than 2^18 elements long.
* The return result is in 18.14 format.
*/
void arm_dot_prod_q7(
q7_t * pSrcA,
q7_t * pSrcB,
uint32_t blockSize,
q31_t * result)
{
uint32_t blkCnt; /* loop counter */
q31_t sum = 0; /* Temporary variables to store output */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t input1, input2; /* Temporary variables to store input */
q31_t inA1, inA2, inB1, inB2; /* Temporary variables to store input */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read 4 samples at a time from sourceA */ (2)
input1 = *__SIMD32(pSrcA)++;
/* read 4 samples at a time from sourceB */
input2 = *__SIMD32(pSrcB)++;
/* extract two q7_t samples to q15_t samples */
inA1 = __SXTB16(__ROR(input1, 8)); (3)
/* extract reminaing two samples */
inA2 = __SXTB16(input1);
/* extract two q7_t samples to q15_t samples */
inB1 = __SXTB16(__ROR(input2, 8));
/* extract reminaing two samples */
inB2 = __SXTB16(input2);
/* multiply and accumulate two samples at a time */
sum = __SMLAD(inA1, inB1, sum); (4)
sum = __SMLAD(inA2, inB2, sum);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Dot product and then store the results in a temporary buffer. */
sum = __SMLAD(*pSrcA++, *pSrcB++, sum);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
/* Dot product and then store the results in a temporary buffer. */
sum += (q31_t) ((q15_t) * pSrcA++ * *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
/* Store the result in the destination buffer in 18.14 format */
*result = sum;
}

2. 一次读取4个8位的数据。

3. __SXTB16也是SIMD指令，用于将两个8位的有符号数扩展成16位。__ROR用于实现数据的循环右移。

4. __SMLAD也是SIMD指令，用于实现如下功能：

sum = __SMLAD(x, y, z)

sum = z + ((short)(x>>16) * (short)(y>>16))+ ((short)x * (short)y)

8.3.5 实例讲解

实验目的：

1. 四种类型数据的点乘。

实验内容：

1. 按下按键K3, 串口打印输出结果

实验现象：

通过窗口上位机软件SecureCRT（V5光盘里面有此软件）查看打印信息现象如下：

程序设计：

硬汉Eric2013

2015-6-4 14:36:49

8.4 乘法（Vector Multiplication）

这部分函数主要用于乘法，公式描述如下：

pDst[n]= pSrcA[n] * pSrcB[n], 0 <= n
8.4.1 arm_mult_f32

这个函数用于求32位浮点数的乘法，源代码分析如下：

/**
* @brief Floating-point vector multiplication.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*/
void arm_mult_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary input variables */
float32_t inB1, inB2, inB3, inB4; /* temporary input variables */
float32_t out1, out2, out3, out4; /* temporary output variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in output buffer */ (1)
/* read sample from sourceA */
inA1 = *pSrcA;
/* read sample from sourceB */
inB1 = *pSrcB;
/* read sample from sourceA */
inA2 = *(pSrcA + 1);
/* read sample from sourceB */
inB2 = *(pSrcB + 1);
/* out = sourceA * sourceB */
out1 = inA1 * inB1;
/* read sample from sourceA */
inA3 = *(pSrcA + 2);
/* read sample from sourceB */
inB3 = *(pSrcB + 2);
/* out = sourceA * sourceB */
out2 = inA2 * inB2;
/* read sample from sourceA */
inA4 = *(pSrcA + 3);
/* store result to destination buffer */
*pDst = out1;
/* read sample from sourceB */
inB4 = *(pSrcB + 3);
/* out = sourceA * sourceB */
out3 = inA3 * inB3;
/* store result to destination buffer */
*(pDst + 1) = out2;
/* out = sourceA * sourceB */
out4 = inA4 * inB4;
/* store result to destination buffer */
*(pDst + 2) = out3;
/* store result to destination buffer */
*(pDst + 3) = out4;
/* update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in output buffer */
*pDst++ = (*pSrcA++) * (*pSrcB++);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 浮点的32位乘法比较简单，这里依然是以4次的计算为一组。

8.4.2 arm_mult_q31

这个函数用于求32位定点数的乘法，源代码分析如下：

/**
* @brief Q31 vector multiplication.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] will be saturated.
*/
void arm_mult_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4; /* temporary input variables */
q31_t inB1, inB2, inB3, inB4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
out1 = ((q63_t) inA1 * inB1) >> 32; (2)
out2 = ((q63_t) inA2 * inB2) >> 32;
out3 = ((q63_t) inA3 * inB3) >> 32;
out4 = ((q63_t) inA4 * inB4) >> 32;
out1 = __SSAT(out1, 31); (3)
out2 = __SSAT(out2, 31);
out3 = __SSAT(out3, 31);
out4 = __SSAT(out4, 31);
*pDst++ = out1 << 1u; (4)
*pDst++ = out2 << 1u;
*pDst++ = out3 << 1u;
*pDst++ = out4 << 1u;
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and then store the results in the destination buffer. */
*pDst++ =
(q31_t) clip_q63_to_q31(((q63_t) (*pSrcA++) * (*pSrcB++)) >> 31);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 这个函数使用了饱和算法。

所得结果是Q31格式，范围Q31range[0x80000000 0x7FFFFFFF]。

2. 所得乘积左移32位。

3. 实现31位精度的饱和运算。

4. 右移一位，保证所得结果是Q31格式。

8.4.3 arm_mult_q15

这个函数用于求16位定点数的乘法，源代码分析如下：

/**
* @brief Q15 vector multiplication
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
*/
void arm_mult_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inB1, inB2; /* temporary input variables */
q15_t out1, out2, out3, out4; /* temporary output variables */
q31_t mul1, mul2, mul3, mul4; /* temporary variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read two samples at a time from sourceA */
inA1 = *__SIMD32(pSrcA)++; (2)
/* read two samples at a time from sourceB */
inB1 = *__SIMD32(pSrcB)++;
/* read two samples at a time from sourceA */
inA2 = *__SIMD32(pSrcA)++;
/* read two samples at a time from sourceB */
inB2 = *__SIMD32(pSrcB)++;
/* multiply mul = sourceA * sourceB */
mul1 = (q31_t) ((q15_t) (inA1 >> 16) * (q15_t) (inB1 >> 16)); (3)
mul2 = (q31_t) ((q15_t) inA1 * (q15_t) inB1);
mul3 = (q31_t) ((q15_t) (inA2 >> 16) * (q15_t) (inB2 >> 16));
mul4 = (q31_t) ((q15_t) inA2 * (q15_t) inB2);
/* saturate result to 16 bit */
out1 = (q15_t) __SSAT(mul1 >> 15, 16); (4)
out2 = (q15_t) __SSAT(mul2 >> 15, 16);
out3 = (q15_t) __SSAT(mul3 >> 15, 16);
out4 = (q15_t) __SSAT(mul4 >> 15, 16);
/* store the result */
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT(out2, out1, 16); (5)
*__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#else
*__SIMD32(pDst)++ = __PKHBT(out2, out1, 16);
*__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#endif // #ifndef ARM_MATH_BIG_ENDIAN
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the result in the destination buffer */
*pDst++ = (q15_t) __SSAT((((q31_t) (*pSrcA++) * (*pSrcB++)) >> 15), 16);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 这个函数使用了饱和算法。

所得结果是Q15格式，范围[0x8000 0x7FFF]。

2. 一次读取两个Q15格式的数据。

3. 将四组数的乘积保存到Q31格式的变量mul1，mul2，mul3，mul4。

4. 丢弃32位数据的低15位，并把最终结果饱和到16位精度。

5. 通过SIMD指令__PKHBT将两个Q15格式的数据保存的结果数组中，从而一个指令周期就能完成两个数据的存储。

8.4.4 arm_mult_q7

这个函数用于求8位定点数的乘法，源代码分析如下：

/**
* @brief Q7 vector multiplication
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
*/
void arm_mult_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t out1, out2, out3, out4; /* Temporary variables to store the product */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in temporary variables */ (2)
out1 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out2 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out3 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out4 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
/* Store the results of 4 inputs in the destination buffer in single cycle by packing */
*__SIMD32(pDst)++ = __PACKq7(out1, out2, out3, out4); (3)
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the result in the destination buffer */
*pDst++ = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 这个函数使用了饱和算法。

所得结果是Q7格式，范围 [0x80 0x7F]。

2. 将两个Q7格式的数据乘积左移7位，也就是丢掉低7位的数据，并将所得结果饱和到8位精度。

3. __PACKq7函数可以在一个时钟周期就能完成相应操作。

8.4.5 实例讲解

实验目的：

1. 四种类型数据的乘法。

实验内容：

1. 按下摇杆的UP键, 串口打印输出结果

实验现象：

通过窗口上位机软件SecureCRT（V5光盘里面有此软件）查看打印信息现象如下：

程序设计：

/*

*********************************************************************************************************

* 函数名: DSP_Multiplication

* 功能说明: 乘法

* 形参：无

* 返回值: 无

*********************************************************************************************************

*/

static void DSP_Multiplication(void)

{

   static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};

   static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};

   static float32_t pDst[5];

   static q31_t  pSrcA1[5] = {1,1,1,1,1};

   static q31_t  pSrcB1[5] = {1,1,1,1,1};

   static q31_t  pDst1[5];

   static q15_t  pSrcA2[5] = {1,1,1,1,1};

   static q15_t  pSrcB2[5] = {1,1,1,1,1};

   static q15_t  pDst2[5];

   static q7_t  pSrcA3[5] = {0x70,1,1,1,1};

   static q7_t  pSrcB3[5] = {0x7f,1,1,1,1};

   static q7_t pDst3[5];

   pSrcA[0] += 1.1f;

   arm_mult_f32(pSrcA, pSrcB, pDst, 5);

   printf("arm_mult_f32 = %frn", pDst[0]);

   pSrcA1[0] += 1;

   arm_mult_q31(pSrcA1, pSrcB1, pDst1, 5);

   printf("arm_mult_q31 = %drn", pDst1[0]);

   pSrcA2[0] += 1;

   arm_mult_q15(pSrcA2, pSrcB2, pDst2, 5);

   printf("arm_mult_q15 = %drn", pDst2[0]);

   pSrcA3[0] += 1;

   arm_mult_q7(pSrcA3, pSrcB3, pDst3, 5);

   printf("arm_mult_q7 = %drn", pDst3[0]);

   printf("***********************************rn");

}

8.4 乘法（Vector Multiplication）

这部分函数主要用于乘法，公式描述如下：

pDst[n]= pSrcA[n] * pSrcB[n], 0 <= n
8.4.1 arm_mult_f32

这个函数用于求32位浮点数的乘法，源代码分析如下：

/**
* @brief Floating-point vector multiplication.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*/
void arm_mult_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary input variables */
float32_t inB1, inB2, inB3, inB4; /* temporary input variables */
float32_t out1, out2, out3, out4; /* temporary output variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in output buffer */ (1)
/* read sample from sourceA */
inA1 = *pSrcA;
/* read sample from sourceB */
inB1 = *pSrcB;
/* read sample from sourceA */
inA2 = *(pSrcA + 1);
/* read sample from sourceB */
inB2 = *(pSrcB + 1);
/* out = sourceA * sourceB */
out1 = inA1 * inB1;
/* read sample from sourceA */
inA3 = *(pSrcA + 2);
/* read sample from sourceB */
inB3 = *(pSrcB + 2);
/* out = sourceA * sourceB */
out2 = inA2 * inB2;
/* read sample from sourceA */
inA4 = *(pSrcA + 3);
/* store result to destination buffer */
*pDst = out1;
/* read sample from sourceB */
inB4 = *(pSrcB + 3);
/* out = sourceA * sourceB */
out3 = inA3 * inB3;
/* store result to destination buffer */
*(pDst + 1) = out2;
/* out = sourceA * sourceB */
out4 = inA4 * inB4;
/* store result to destination buffer */
*(pDst + 2) = out3;
/* store result to destination buffer */
*(pDst + 3) = out4;
/* update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in output buffer */
*pDst++ = (*pSrcA++) * (*pSrcB++);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 浮点的32位乘法比较简单，这里依然是以4次的计算为一组。

8.4.2 arm_mult_q31

这个函数用于求32位定点数的乘法，源代码分析如下：

/**
* @brief Q31 vector multiplication.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] will be saturated.
*/
void arm_mult_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4; /* temporary input variables */
q31_t inB1, inB2, inB3, inB4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
out1 = ((q63_t) inA1 * inB1) >> 32; (2)
out2 = ((q63_t) inA2 * inB2) >> 32;
out3 = ((q63_t) inA3 * inB3) >> 32;
out4 = ((q63_t) inA4 * inB4) >> 32;
out1 = __SSAT(out1, 31); (3)
out2 = __SSAT(out2, 31);
out3 = __SSAT(out3, 31);
out4 = __SSAT(out4, 31);
*pDst++ = out1 << 1u; (4)
*pDst++ = out2 << 1u;
*pDst++ = out3 << 1u;
*pDst++ = out4 << 1u;
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and then store the results in the destination buffer. */
*pDst++ =
(q31_t) clip_q63_to_q31(((q63_t) (*pSrcA++) * (*pSrcB++)) >> 31);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 这个函数使用了饱和算法。

所得结果是Q31格式，范围Q31range[0x80000000 0x7FFFFFFF]。

2. 所得乘积左移32位。

3. 实现31位精度的饱和运算。

4. 右移一位，保证所得结果是Q31格式。

8.4.3 arm_mult_q15

这个函数用于求16位定点数的乘法，源代码分析如下：

/**
* @brief Q15 vector multiplication
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
*/
void arm_mult_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inB1, inB2; /* temporary input variables */
q15_t out1, out2, out3, out4; /* temporary output variables */
q31_t mul1, mul2, mul3, mul4; /* temporary variables */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read two samples at a time from sourceA */
inA1 = *__SIMD32(pSrcA)++; (2)
/* read two samples at a time from sourceB */
inB1 = *__SIMD32(pSrcB)++;
/* read two samples at a time from sourceA */
inA2 = *__SIMD32(pSrcA)++;
/* read two samples at a time from sourceB */
inB2 = *__SIMD32(pSrcB)++;
/* multiply mul = sourceA * sourceB */
mul1 = (q31_t) ((q15_t) (inA1 >> 16) * (q15_t) (inB1 >> 16)); (3)
mul2 = (q31_t) ((q15_t) inA1 * (q15_t) inB1);
mul3 = (q31_t) ((q15_t) (inA2 >> 16) * (q15_t) (inB2 >> 16));
mul4 = (q31_t) ((q15_t) inA2 * (q15_t) inB2);
/* saturate result to 16 bit */
out1 = (q15_t) __SSAT(mul1 >> 15, 16); (4)
out2 = (q15_t) __SSAT(mul2 >> 15, 16);
out3 = (q15_t) __SSAT(mul3 >> 15, 16);
out4 = (q15_t) __SSAT(mul4 >> 15, 16);
/* store the result */
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT(out2, out1, 16); (5)
*__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#else
*__SIMD32(pDst)++ = __PKHBT(out2, out1, 16);
*__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#endif // #ifndef ARM_MATH_BIG_ENDIAN
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the result in the destination buffer */
*pDst++ = (q15_t) __SSAT((((q31_t) (*pSrcA++) * (*pSrcB++)) >> 15), 16);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 这个函数使用了饱和算法。

所得结果是Q15格式，范围[0x8000 0x7FFF]。

2. 一次读取两个Q15格式的数据。

3. 将四组数的乘积保存到Q31格式的变量mul1，mul2，mul3，mul4。

4. 丢弃32位数据的低15位，并把最终结果饱和到16位精度。

5. 通过SIMD指令__PKHBT将两个Q15格式的数据保存的结果数组中，从而一个指令周期就能完成两个数据的存储。

8.4.4 arm_mult_q7

这个函数用于求8位定点数的乘法，源代码分析如下：

/**
* @brief Q7 vector multiplication
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
*/
void arm_mult_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counters */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t out1, out2, out3, out4; /* Temporary variables to store the product */
/* loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the results in temporary variables */ (2)
out1 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out2 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out3 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
out4 = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
/* Store the results of 4 inputs in the destination buffer in single cycle by packing */
*__SIMD32(pDst)++ = __PACKq7(out1, out2, out3, out4); (3)
/* Decrement the blockSize loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * B */
/* Multiply the inputs and store the result in the destination buffer */
*pDst++ = (q7_t) __SSAT((((q15_t) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
/* Decrement the blockSize loop counter */
blkCnt--;
}
}

1. 这个函数使用了饱和算法。

所得结果是Q7格式，范围 [0x80 0x7F]。

2. 将两个Q7格式的数据乘积左移7位，也就是丢掉低7位的数据，并将所得结果饱和到8位精度。

3. __PACKq7函数可以在一个时钟周期就能完成相应操作。

8.4.5 实例讲解

实验目的：

1. 四种类型数据的乘法。

实验内容：

1. 按下摇杆的UP键, 串口打印输出结果

实验现象：

通过窗口上位机软件SecureCRT（V5光盘里面有此软件）查看打印信息现象如下：

程序设计：

硬汉Eric2013

2015-6-4 14:37:43

8.5 总结

本期教程就跟大家讲这么多，还是那句话，可以自己写些代码调用本期教程中讲的这几个函数，如果可以的话，可以自己尝试直接调用这些DSP指令。

更多回帖

硬汉Eric2013

【安富莱——DSP教程】第8章 BasicMathFunctions的使用（一）

回帖（5）

硬汉Eric2013

硬汉Eric2013

硬汉Eric2013

硬汉Eric2013

硬汉Eric2013

相关帖子

【安富莱DSP教程】第9章 BasicMathFunctions的使用（二）

【安富莱——DSP教程】第17章 ComplexMathFunctions的使用（一）

【安富莱——DSP教程】第13章 SupportFunctions的使用(一)

【安富莱——DSP教程】第11章 StatisticsMathFunctions的使用(一)

【安富莱——DSP教程】第19章 MatrixFunctions的使用（一）

【安富莱——DSP教程】第20章 MatrixFunctions的使用（二）

【安富莱——DSP教程】第23章傅里叶变换

【安富莱——DSP教程】第45章示波器设计—系统框架

【安富莱——DSP教程】第32章实数FFT的实现

【安富莱——DSP教程】第27章 FFT的Matlab实现

20万+工程师都在用，免费PCB检查工具

硬汉Eric2013

【安富莱——DSP教程】第8章 BasicMathFunctions的使用（一）

回帖（5）

硬汉Eric2013

硬汉Eric2013

硬汉Eric2013

硬汉Eric2013

硬汉Eric2013

相关帖子

【安富莱DSP教程】第9章 BasicMathFunctions的使用（二）

【安富莱——DSP教程】第17章 ComplexMathFunctions的使用（一）

【安富莱——DSP教程】第13章 SupportFunctions的使用(一)

【安富莱——DSP教程】第11章 StatisticsMathFunctions的使用(一)

【安富莱——DSP教程】第19章 MatrixFunctions的使用（一）

【安富莱——DSP教程】第20章 MatrixFunctions的使用（二）

【安富莱——DSP教程】第23章 傅里叶变换

【安富莱——DSP教程】第45章 示波器设计—系统框架

【安富莱——DSP教程】第32章 实数FFT的实现

【安富莱——DSP教程】第27章 FFT的Matlab实现

20万+工程师都在用，免费PCB检查工具

【安富莱——DSP教程】第23章傅里叶变换

【安富莱——DSP教程】第45章示波器设计—系统框架

【安富莱——DSP教程】第32章实数FFT的实现