[Armfly DSP Tutorial] Chapter 9: Using BasicMathFunctions (Part 2)


This tutorial covers five groups of basic math functions: negate, offset, shift, subtraction, and scale.

9.1 Negate (VectorNegate)

9.2 Offset (VectorOffset)

9.3 Shift (VectorShift)

9.4 Subtraction (VectorSub)

9.5 Scale (VectorScale)

9.6 Important notes on BasicMathFunctions

9.7 Summary

硬汉Eric2013 · 2015-6-5 14:37:04

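9.1 Negate (Vector Negate)

These functions compute the negation of each element. The formula is:

pDst[n] = -pSrc[n], 0 <= n < blockSize.

Note that these functions support in-place operation: the destination pointer and the source pointer may point to the same buffer.

9.1.1 arm_negate_f32

This function negates 32-bit floating-point values. The source code is analyzed below: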

/**
* @brief Negates the elements of a floating-point vector.
* @param[in] *pSrc points to the input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*/
void arm_negate_f32(
float32_t * pSrc,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read inputs from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* negate the input */ (1)
in1 = -in1;
in2 = -in2;
in3 = -in3;
in4 = -in4;
/* store the result to destination */
*pDst = in1;
*(pDst + 1) = in2;
*(pDst + 2) = in3;
*(pDst + 3) = in4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
*pDst++ = -*pSrc++;
/* Decrement the loop counter */
blkCnt--;
}
}

1. Negating a floating-point value is straightforward: apply the unary minus operator to the variable.
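
The structure of the listing above (a main loop that handles four samples per iteration, followed by a tail loop for the remaining 0 to 3 samples) can be sketched in portable C. This is only an illustration: `negate_f32_sketch` is a hypothetical name, not part of CMSIS-DSP, and plain `float` stands in for `float32_t`.

```c
#include <stdint.h>

/* Hypothetical portable sketch of the unrolled negate loop; not the CMSIS-DSP code. */
void negate_f32_sketch(const float *src, float *dst, uint32_t block_size)
{
    uint32_t blk = block_size >> 2u;      /* number of complete groups of four */

    while (blk > 0u) {
        /* Read all four inputs before writing, so src == dst (in-place) is safe. */
        float a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = -a;
        dst[1] = -b;
        dst[2] = -c;
        dst[3] = -d;
        src += 4;
        dst += 4;
        blk--;
    }

    blk = block_size % 4u;                /* 0 to 3 leftover samples */
    while (blk > 0u) {
        *dst++ = -*src++;
        blk--;
    }
}
```

Passing the same array as both `src` and `dst` exercises the in-place case mentioned at the start of this section.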

9.1.2 arm_negate_q31

This function negates 32-bit fixed-point (Q31) values. The source code is analyzed below:

/**
* @brief Negates the elements of a Q31 vector.
* @param[in] *pSrc points to the input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* The Q31 value -1 (0x80000000) will be saturated to the maximum allowable positive value 0x7FFFFFFF.
*/
void arm_negate_q31(
q31_t * pSrc,
q31_t * pDst,
uint32_t blockSize)
{
q31_t in; /* Temporary variable */
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
*pDst++ = __QSUB(0, in1); (2)
*pDst++ = __QSUB(0, in2);
*pDst++ = __QSUB(0, in3);
*pDst++ = __QSUB(0, in4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the result in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;
/* Decrement the loop counter */
blkCnt--;
}
}

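1. This function uses saturating arithmetic.

The value 0x80000000 has no positive counterpart in Q31, so its negation saturates to 0x7FFFFFFF.

2. The saturating instruction __QSUB was covered in detail in the previous chapter; here it subtracts the argument from 0.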
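
The saturation rule in note 1 can be written as a small reference model in portable C. The function name `negate_q31_sat` is hypothetical and used only for illustration; it mirrors the tail-loop expression in the listing above rather than the __QSUB instruction itself.

```c
#include <stdint.h>

/* Hypothetical reference model of a saturating Q31 negate; not the CMSIS-DSP code. */
int32_t negate_q31_sat(int32_t x)
{
    /* -INT32_MIN does not fit in 32 bits, so clamp it to the largest positive value. */
    return (x == INT32_MIN) ? INT32_MAX : -x;
}
```

Every input other than 0x80000000 negates exactly; only that single value is clamped.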

9.1.3 arm_negate_q15

This function negates 16-bit fixed-point (Q15) values. The source code is analyzed below:

/**
* @brief Negates the elements of a Q15 vector.
* @param[in] *pSrc points to the input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* par Conditions for optimum performance
* Input and output buffers should be aligned by 32-bit
*
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* The Q15 value -1 (0x8000) will be saturated to the maximum allowable positive value 0x7FFF.
*/
void arm_negate_q15(
q15_t * pSrc,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q15_t in;
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2; /* Temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Read two inputs at a time */ (2)
in1 = _SIMD32_OFFSET(pSrc);
in2 = _SIMD32_OFFSET(pSrc + 2);
/* negate two samples at a time */ (3)
in1 = __QSUB16(0, in1);
/* negate two samples at a time */
in2 = __QSUB16(0, in2);
/* store the result to destination 2 samples at a time */ (4)
_SIMD32_OFFSET(pDst) = in1;
/* store the result to destination 2 samples at a time */
_SIMD32_OFFSET(pDst + 2) = in2;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the result in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == (q15_t) 0x8000) ? 0x7fff : -in;
/* Decrement the loop counter */
blkCnt--;
}
}

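1. This function uses saturating arithmetic.

The value 0x8000 saturates to 0x7FFF when negated.

2. Two Q15 values are read at a time.

3. Because __QSUB16 is a SIMD instruction, one call negates two Q15 values.

4. Two Q15 values are stored at a time.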
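
To make the two-lane behavior in note 3 concrete, here is a rough portable model of a saturating dual 16-bit subtraction. The names `lane16`, `clamp16`, and `qsub16_model` are hypothetical helpers written for this sketch; they only imitate what the __QSUB16 instruction computes and are not the CMSIS intrinsic.

```c
#include <stdint.h>

/* Extract one 16-bit lane and sign-extend it to 32 bits. */
static int32_t lane16(uint32_t word, unsigned shift)
{
    int32_t v = (int32_t)((word >> shift) & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Clamp a wide intermediate to the signed 16-bit range. */
static int32_t clamp16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return v;
}

/* Hypothetical model: two independent saturating 16-bit subtractions (a - b per lane). */
uint32_t qsub16_model(uint32_t a, uint32_t b)
{
    int32_t lo = clamp16(lane16(a, 0u) - lane16(b, 0u));
    int32_t hi = clamp16(lane16(a, 16u) - lane16(b, 16u));
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}
```

Calling `qsub16_model(0u, packed)` negates both Q15 values held in `packed`, and a lane containing 0x8000 comes back as 0x7FFF.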

9.1.4 arm_negate_q7

This function negates 8-bit fixed-point (Q7) values. The source code is analyzed below:

/**
* @brief Negates the elements of a Q7 vector.
* @param[in] *pSrc points to the input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* The Q7 value -1 (0x80) will be saturated to the maximum allowable positive value 0x7F.
*/
void arm_negate_q7(
q7_t * pSrc,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
q7_t in;
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t input; /* Input values1-4 */
q31_t zero = 0x00000000; (2)
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = -A */
/* Read four inputs */
input = *__SIMD32(pSrc)++; (3)
/* Store the Negated results in the destination buffer in a single cycle by packing the results */
*__SIMD32(pDst)++ = __QSUB8(zero, input); (4)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = -A */
/* Negate and then store the results in the destination buffer. */
in = *pSrc++;
*pDst++ = (in == (q7_t) 0x80) ? 0x7f : -in;
/* Decrement the loop counter */
blkCnt--;
}
}

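1. This function uses saturating arithmetic.

The value 0x80 saturates to 0x7F when negated.

2. The local variable is explicitly initialized to 0 because local variables are not guaranteed to start at 0. In that sense, giving variables an initial value is good practice.

3. Four Q7 values are read into input at once.

4. __QSUB8 negates four Q7 values in a single call.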

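9.1.5 Example

Objective:

1. Negate values of all four data types.

Procedure:

1. Press key K1; the results are printed over the serial port.

Result:

The printed output, viewed in the serial terminal program SecureCRT (included on the V5 CD), is as follows:

Program design: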

/*
*********************************************************************************************************
* Function: DSP_Negate
* Description: Negation
* Parameters: none
* Return: none
*********************************************************************************************************
*/
static void DSP_Negate(void)
{
    static float32_t pSrc;
    static float32_t pDst;
    static q31_t pSrc1;
    static q31_t pDst1;
    static q15_t pSrc2;
    static q15_t pDst2;
    static q7_t pSrc3 = 127; /* Starts at 127 so that the += 1 below produces 0x80, to check that 0x80 saturates to 0x7F */
    static q7_t pDst3;

    pSrc -= 1.23f;
    arm_negate_f32(&pSrc, &pDst, 1);
    printf("arm_negate_f32 = %f\r\n", pDst);

    pSrc1 -= 1;
    arm_negate_q31(&pSrc1, &pDst1, 1);
    printf("arm_negate_q31 = %d\r\n", pDst1);

    pSrc2 -= 1;
    arm_negate_q15(&pSrc2, &pDst2, 1);
    printf("arm_negate_q15 = %d\r\n", pDst2);

    pSrc3 += 1;
    arm_negate_q7(&pSrc3, &pDst3, 1);
    printf("arm_negate_q7 = %d\r\n", pDst3);

    printf("***********************************\r\n");
}

硬汉Eric2013 · 2015-6-5 14:41:05

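9.2 Offset (Vector Offset)

These functions add a constant offset to each element. The formula is:

pDst[n] = pSrc[n] + offset, 0 <= n < blockSize.

Note that these functions support in-place operation: the destination pointer and the source pointer may point to the same buffer.

9.2.1 arm_offset_f32

This function adds an offset to 32-bit floating-point values. The source code is analyzed below: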

/**
* @brief Adds a constant offset to a floating-point vector.
* @param[in] *pSrc points to the input vector
* @param[in] offset is the offset to be added
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*/
void arm_offset_f32(
float32_t * pSrc,
float32_t offset,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */ (1)
/* Add offset and then store the results in the destination buffer. */
/* read samples from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
/* add offset to input */
in1 = in1 + offset;
/* read samples from source */
in3 = *(pSrc + 2);
/* add offset to input */
in2 = in2 + offset;
/* read samples from source */
in4 = *(pSrc + 3);
/* add offset to input */
in3 = in3 + offset;
/* store result to destination */
*pDst = in1;
/* add offset to input */
in4 = in4 + offset;
/* store result to destination */
*(pDst + 1) = in2;
/* store result to destination */
*(pDst + 2) = in3;
/* store result to destination */
*(pDst + 3) = in4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (*pSrc++) + offset;
/* Decrement the loop counter */
blkCnt--;
}
}

1. Adding an offset to a floating-point value is straightforward: add the offset and store the result in the destination.

9.2.2 arm_offset_q31

This function adds an offset to 32-bit fixed-point (Q31) values. The source code is analyzed below:

/**
* @brief Adds a constant offset to a Q31 vector.
* @param[in] *pSrc points to the input vector
* @param[in] offset is the offset to be added
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range [0x80000000 0x7FFFFFFF] are saturated.
*/
void arm_offset_q31(
q31_t * pSrc,
q31_t offset,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
*pDst++ = __QADD(in1, offset); (2)
*pDst++ = __QADD(in2, offset);
*pDst++ = __QADD(in3, offset);
*pDst++ = __QADD(in4, offset);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = __QADD(*pSrc++, offset);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrc++ + offset);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

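1. This function uses saturating arithmetic.

Results outside the Q31 range [0x80000000, 0x7FFFFFFF] are saturated to the nearest limit.

2. The __QADD instruction was covered in the previous chapter; here it performs a saturating addition of its two arguments.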
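
The Cortex-M0 branch of the listing above widens the sum to 64 bits and then clips it. A minimal sketch of that idea in portable C follows; `offset_q31_sat` is a hypothetical name used for illustration and is not a CMSIS-DSP function.

```c
#include <stdint.h>

/* Hypothetical model of a saturating Q31 add: widen to 64 bits, add, then clip. */
int32_t offset_q31_sat(int32_t x, int32_t offset)
{
    int64_t sum = (int64_t)x + offset;    /* cannot overflow in 64 bits */

    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

The result matches ordinary addition whenever the true sum fits in 32 bits, and sticks at the nearest limit otherwise.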

9.2.3 arm_offset_q15

This function adds an offset to 16-bit fixed-point (Q15) values. The source code is analyzed below:

/**
* @brief Adds a constant offset to a Q15 vector.
* @param[in] *pSrc points to the input vector
* @param[in] offset is the offset to be added
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] are saturated.
*/
void arm_offset_q15(
q15_t * pSrc,
q15_t offset,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t offset_packed; /* Offset packed to 32 bit */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Offset is packed to 32 bit in order to use SIMD32 for addition */
offset_packed = __PKHBT(offset, offset, 16); (2)
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer, 2 samples at a time. */
*__SIMD32(pDst)++ = __QADD16(*__SIMD32(pSrc)++, offset_packed); (3)
*__SIMD32(pDst)++ = __QADD16(*__SIMD32(pSrc)++, offset_packed);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
*pDst++ = (q15_t) __QADD16(*pSrc++, offset);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrc++ + offset), 16);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

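1. This function uses saturating arithmetic.

Results outside the Q15 range [0x8000, 0x7FFF] are saturated to the nearest limit.

2. The Q15 offset is packed twice into one 32-bit word so that it can be passed to __QADD16.

3. Because __QADD16 is a SIMD instruction, one call processes two Q15 values.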
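
Notes 2 and 3 can be illustrated with a portable sketch: first duplicate the offset into both halfwords, then add lane by lane with saturation. The names `pack_offset_q15`, `q15_lane`, and `qadd16_model` are hypothetical helpers for this example; they imitate the effect of __PKHBT(offset, offset, 16) and __QADD16 without using the intrinsics.

```c
#include <stdint.h>

/* Duplicate a Q15 offset into both halfwords of a 32-bit word. */
uint32_t pack_offset_q15(int16_t offset)
{
    uint32_t half = (uint32_t)offset & 0xFFFFu;
    return half | (half << 16);
}

/* Extract one 16-bit lane and sign-extend it to 32 bits. */
static int32_t q15_lane(uint32_t word, unsigned shift)
{
    int32_t v = (int32_t)((word >> shift) & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Hypothetical model: two independent saturating 16-bit additions. */
uint32_t qadd16_model(uint32_t a, uint32_t b)
{
    uint32_t out = 0u;
    for (unsigned shift = 0u; shift < 32u; shift += 16u) {
        int32_t s = q15_lane(a, shift) + q15_lane(b, shift);
        if (s > INT16_MAX) s = INT16_MAX;   /* clamp each lane separately */
        if (s < INT16_MIN) s = INT16_MIN;
        out |= ((uint32_t)s & 0xFFFFu) << shift;
    }
    return out;
}
```

Because each lane is clamped on its own, an overflow in one Q15 value never carries into its neighbor.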

9.2.4 arm_offset_q7

This function adds an offset to 8-bit fixed-point (Q7) values. The source code is analyzed below:

/**
* @brief Adds a constant offset to a Q7 vector.
* @param[in] *pSrc points to the input vector
* @param[in] offset is the offset to be added
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] are saturated.
*/
void arm_offset_q7(
q7_t * pSrc,
q7_t offset,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t offset_packed; /* Offset packed to 32 bit */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Offset is packed to 32 bit in order to use SIMD32 for addition */ (2)
offset_packed = __PACKq7(offset, offset, offset, offset);
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the results in the destination buffer for 4 samples at a time. */
*__SIMD32(pDst)++ = __QADD8(*__SIMD32(pSrc)++, offset_packed); (3)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT(*pSrc++ + offset, 8);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A + offset */
/* Add offset and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrc++ + offset, 8);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

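1. This function uses saturating arithmetic.

Results outside the Q7 range [0x80, 0x7F] are saturated to the nearest limit.

2. __PACKq7 packs four Q7 values into one 32-bit word.

3. Because __QADD8 is a SIMD instruction, one call processes four Q7 values.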
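
The four-lane case works the same way as the two-lane one. The sketch below is a rough portable model, assuming nothing beyond standard C: `pack_offset_q7` and `qadd8_model` are hypothetical names that imitate what __PACKq7(offset, offset, offset, offset) and __QADD8 produce.

```c
#include <stdint.h>

/* Replicate one Q7 offset into all four bytes of a 32-bit word. */
uint32_t pack_offset_q7(int8_t offset)
{
    uint32_t b = (uint32_t)offset & 0xFFu;
    return b | (b << 8) | (b << 16) | (b << 24);
}

/* Hypothetical model: four independent saturating 8-bit additions. */
uint32_t qadd8_model(uint32_t a, uint32_t b)
{
    uint32_t out = 0u;
    for (unsigned shift = 0u; shift < 32u; shift += 8u) {
        int32_t x = (int32_t)((a >> shift) & 0xFFu);
        int32_t y = (int32_t)((b >> shift) & 0xFFu);
        if (x >= 0x80) x -= 0x100;          /* sign-extend each byte */
        if (y >= 0x80) y -= 0x100;
        int32_t s = x + y;
        if (s > INT8_MAX) s = INT8_MAX;     /* clamp to the Q7 range */
        if (s < INT8_MIN) s = INT8_MIN;
        out |= ((uint32_t)s & 0xFFu) << shift;
    }
    return out;
}
```

With an offset of 2, the bytes 0x7E and 0x7F both end up at 0x7F, while 0x80 becomes 0x82.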

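9.2.5 Example

Objective:

1. Add an offset to values of all four data types.

Procedure:

1. Press key K2; the results are printed over the serial port.

Result:

The printed output, viewed in the serial terminal program SecureCRT (included on the V5 CD), is as follows:

Program design: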

/*
*********************************************************************************************************
* Function: DSP_Offset
* Description: Offset
* Parameters: none
* Return: none
*********************************************************************************************************
*/
static void DSP_Offset(void)
{
    static float32_t pSrcA;
    static float32_t Offset = 0.0f;
    static float32_t pDst;
    static q31_t pSrcA1;
    static q31_t Offset1 = 0;
    static q31_t pDst1;
    static q15_t pSrcA2;
    static q15_t Offset2 = 0;
    static q15_t pDst2;
    static q7_t pSrcA3;
    static q7_t Offset3 = 0;
    static q7_t pDst3;

    Offset--;
    arm_offset_f32(&pSrcA, Offset, &pDst, 1);
    printf("arm_offset_f32 = %f\r\n", pDst);

    Offset1--;
    arm_offset_q31(&pSrcA1, Offset1, &pDst1, 1);
    printf("arm_offset_q31 = %d\r\n", pDst1);

    Offset2--;
    arm_offset_q15(&pSrcA2, Offset2, &pDst2, 1);
    printf("arm_offset_q15 = %d\r\n", pDst2);

    Offset3--;
    arm_offset_q7(&pSrcA3, Offset3, &pDst3, 1);
    printf("arm_offset_q7 = %d\r\n", pDst3);

    printf("***********************************\r\n");
}

硬汉Eric2013 · 2015-6-5 14:46:47

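9.3 Shift (Vector Shift)

These functions shift each element by a given number of bits. The formula is:

pDst[n] = pSrc[n] << shift, 0 <= n < blockSize.

Note that these functions support in-place operation: the destination pointer and the source pointer may point to the same buffer.

9.3.1 arm_shift_q31

This function shifts 32-bit fixed-point (Q31) values. The source code is analyzed below: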

/**
* @brief Shifts the elements of a Q31 vector a specified number of bits.
* @param[in] *pSrc points to the input vector
* @param[in] shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
*
* Scaling and Overflow Behavior: (2)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range [0x80000000 0x7FFFFFFF] will be saturated.
*/
void arm_shift_q31(
q31_t * pSrc,
int8_t shiftBits,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign = (shiftBits & 0x80); /* Sign of shiftBits */ (3)
#ifndef ARM_MATH_CM0_FAMILY
q31_t in1, in2, in3, in4; /* Temporary input variables */
q31_t out1, out2, out3, out4; /* Temporary output variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
if(sign == 0u) (4)
{
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Shift the input and then store the results in the destination buffer. */
in1 = *pSrc;
in2 = *(pSrc + 1);
out1 = in1 << shiftBits;
in3 = *(pSrc + 2);
out2 = in2 << shiftBits;
in4 = *(pSrc + 3);
if(in1 != (out1 >> shiftBits)) (5)
out1 = 0x7FFFFFFF ^ (in1 >> 31);
if(in2 != (out2 >> shiftBits))
out2 = 0x7FFFFFFF ^ (in2 >> 31);
*pDst = out1;
out3 = in3 << shiftBits;
*(pDst + 1) = out2;
out4 = in4 << shiftBits;
if(in3 != (out3 >> shiftBits))
out3 = 0x7FFFFFFF ^ (in3 >> 31);
if(in4 != (out4 >> shiftBits))
out4 = 0x7FFFFFFF ^ (in4 >> 31);
*(pDst + 2) = out3;
*(pDst + 3) = out4;
/* Update source and destination pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
}
else (6)
{
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the input and then store the results in the destination buffer. */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
*pDst = (in1 >> -shiftBits); (7)
*(pDst + 1) = (in2 >> -shiftBits);
*(pDst + 2) = (in3 >> -shiftBits);
*(pDst + 3) = (in4 >> -shiftBits);
pSrc += 4u;
pDst += 4u;
blkCnt--;
}
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A (>> or <<) shiftBits */
/* Shift the input and then store the result in the destination buffer. */ (8)
*pDst++ = (sign == 0u) ? clip_q63_to_q31((q63_t) * pSrc++ << shiftBits) :
(*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}

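1. A positive shiftBits shifts left; a negative shiftBits shifts right.

2. This function uses saturating arithmetic.

Left-shift results outside the Q31 range [0x80000000, 0x7FFFFFFF] are saturated to the nearest limit.

3. Extracts the sign bit of shiftBits to determine whether the shift amount is positive or negative.

4. If the shift amount is positive, shift left.

5. A left-shifted result is accepted only if shifting it back right by the same number of bits restores the original value. Otherwise the shift has overflowed, and the output saturates to one of two values depending on the sign of the input:

out = 0x7FFFFFFF ^ 0xFFFFFFFF = 0x80000000 (negative input)

out = 0x7FFFFFFF ^ 0x00000000 = 0x7FFFFFFF (non-negative input)

6. If the shift amount is negative, shift right.

7. Negate the shift amount and shift right by that many bits.

8. Processes the remaining samples.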
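
The per-sample behavior described in the notes above can be summarized in a small reference model. This is a sketch under stated assumptions, not the library code: `shift_q31_model` is a hypothetical name, the shift amount is assumed to lie between -31 and 31, and a right shift of a negative value is assumed to be arithmetic (implementation-defined in C, but the usual behavior of ARM compilers).

```c
#include <stdint.h>

/* Hypothetical reference model of one arm_shift_q31 sample; not the CMSIS-DSP code. */
int32_t shift_q31_model(int32_t in, int8_t shift)
{
    if (shift >= 0) {
        /* Multiply in 64 bits instead of left-shifting a signed 32-bit value. */
        int64_t wide = (int64_t)in * ((int64_t)1 << shift);

        /* The sign of the input picks the saturation limit, which is what
         * 0x7FFFFFFF ^ (in >> 31) selects in the listing. */
        if (wide > INT32_MAX) return INT32_MAX;
        if (wide < INT32_MIN) return INT32_MIN;
        return (int32_t)wide;
    }

    /* Negative shift amount: shift right by its magnitude, no saturation needed. */
    return in >> -shift;
}
```

A right shift can only move a value toward zero (or toward -1), so only the left-shift branch needs a range check.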

9.3.2 arm_shift_q15

This function shifts 16-bit fixed-point (Q15) values. The source code is analyzed below:

/**
* @brief Shifts the elements of a Q15 vector a specified number of bits.
* @param[in] *pSrc points to the input vector
* @param[in] shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (2)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
*/
void arm_shift_q15(
q15_t * pSrc,
int8_t shiftBits,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign; /* Sign of shiftBits */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q15_t in1, in2; /* Temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80); (3)
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Read 2 inputs */
in1 = *pSrc++;
in2 = *pSrc++;
/* C = A << shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in1 << shiftBits), 16),
__SSAT((in2 << shiftBits), 16), 16);
#else
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in2 << shiftBits), 16), (4)
__SSAT((in1 << shiftBits), 16), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
in1 = *pSrc++;
in2 = *pSrc++;
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in1 << shiftBits), 16),
__SSAT((in2 << shiftBits), 16), 16);
#else
*__SIMD32(pDst)++ = __PKHBT(__SSAT((in2 << shiftBits), 16),
__SSAT((in1 << shiftBits), 16), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Shift and then store the results in the destination buffer. */
*pDst++ = __SSAT((*pSrc++ << shiftBits), 16); (5)
/* Decrement the loop counter */
blkCnt--;
}
}
else (6)
{
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Read 2 inputs */
in1 = *pSrc++;
in2 = *pSrc++;
/* C = A >> shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT((in1 >> -shiftBits),
(in2 >> -shiftBits), 16);
#else
*__SIMD32(pDst)++ = __PKHBT((in2 >> -shiftBits), (7)
(in1 >> -shiftBits), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
in1 = *pSrc++;
in2 = *pSrc++;
#ifndef ARM_MATH_BIG_ENDIAN
*__SIMD32(pDst)++ = __PKHBT((in1 >> -shiftBits),
(in2 >> -shiftBits), 16);
#else
*__SIMD32(pDst)++ = __PKHBT((in2 >> -shiftBits),
(in1 >> -shiftBits), 16);
#endif /* #ifndef ARM_MATH_BIG_ENDIAN */
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
*pDst++ = (*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#else
/* Run the below code for Cortex-M0 */
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80);
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Shift and then store the results in the destination buffer. */
*pDst++ = __SSAT(((q31_t) * pSrc++ << shiftBits), 16);
/* Decrement the loop counter */
blkCnt--;
}
}
else
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the inputs and then store the results in the destination buffer. */
*pDst++ = (*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

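1. A positive shiftBits shifts left; a negative shiftBits shifts right.

2. This function uses saturating arithmetic.

Left-shift results outside the Q15 range [0x8000, 0x7FFF] are saturated to the nearest limit.

3. Determines whether the shift amount is positive or negative.

4. One call to __PKHBT packs the two saturated Q15 results into a single 32-bit word, so both are stored at once.

5. Processes the remaining samples.

6. If the shift amount is negative, shift right.

7. After the shift amount is negated, one call to __PKHBT packs the two right-shifted Q15 results.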
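
For a single Q15 sample, the left-shift branch boils down to "shift in a wider type, then saturate to 16 bits". A minimal portable sketch follows. The names `ssat16_model` and `shift_q15_model` are hypothetical, the shift amount is assumed to lie between -15 and 15, and right shifts of negative values are assumed to be arithmetic.

```c
#include <stdint.h>

/* Hypothetical stand-in for __SSAT(v, 16): clamp to the signed 16-bit range. */
static int32_t ssat16_model(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return v;
}

/* Hypothetical reference model of one arm_shift_q15 sample; not the CMSIS-DSP code. */
int16_t shift_q15_model(int16_t in, int8_t shift)
{
    if (shift >= 0) {
        /* The widened product fits in 32 bits for shift amounts up to 15. */
        return (int16_t)ssat16_model((int32_t)in * ((int32_t)1 << shift));
    }

    /* Negative shift amount: arithmetic right shift by its magnitude. */
    return (int16_t)(in >> -shift);
}
```

For instance, the Q15 bit pattern 0x8866 is -30618 as a signed value; shifting it right by 3 gives -3828 (0xF10C), and shifting it left by 3 saturates to -32768.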

9.3.3 arm_shift_q7

This function shifts 8-bit fixed-point (Q7) values. The source code is analyzed below:

/**
* @brief Shifts the elements of a Q7 vector a specified number of bits.
* @param[in] *pSrc points to the input vector
* @param[in] shiftBits number of bits to shift.
* A positive value shifts left; a negative value shifts right. (1)
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* par Conditions for optimum performance
* Input and output buffers should be aligned by 32-bit
*
*
* Scaling and Overflow Behavior: (2)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
*/
void arm_shift_q7(
q7_t * pSrc,
int8_t shiftBits,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
uint8_t sign; /* Sign of shiftBits */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t in1; /* Input value1 */
q7_t in2; /* Input value2 */
q7_t in3; /* Input value3 */
q7_t in4; /* Input value4 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80); (3)
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Read 4 inputs */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
(4)
/* Store the Shifted result in the destination buffer in single cycle by packing the outputs */
*__SIMD32(pDst)++ = __PACKq7(__SSAT((in1 << shiftBits), 8),
__SSAT((in2 << shiftBits), 8),
__SSAT((in3 << shiftBits), 8),
__SSAT((in4 << shiftBits), 8));
/* Update source pointer to process next samples */
pSrc += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A << shiftBits */ (5)
/* Shift the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((*pSrc++ << shiftBits), 8);
/* Decrement the loop counter */
blkCnt--;
}
}
else (6)
{
shiftBits = -shiftBits;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Read 4 inputs */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* Store the Shifted result in the destination buffer in single cycle by packing the outputs */
*__SIMD32(pDst)++ = __PACKq7((in1 >> shiftBits), (in2 >> shiftBits),
(in3 >> shiftBits), (in4 >> shiftBits));
pSrc += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the input and then store the result in the destination buffer. */
in1 = *pSrc++;
*pDst++ = (in1 >> shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#else
/* Run the below code for Cortex-M0 */
/* Getting the sign of shiftBits */
sign = (shiftBits & 0x80);
/* If the shift value is positive then do left shift else right shift */
if(sign == 0u)
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A << shiftBits */
/* Shift the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT(((q15_t) * pSrc++ << shiftBits), 8);
/* Decrement the loop counter */
blkCnt--;
}
}
else
{
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A >> shiftBits */
/* Shift the input and then store the result in the destination buffer. */
*pDst++ = (*pSrc++ >> -shiftBits);
/* Decrement the loop counter */
blkCnt--;
}
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}

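1. A positive shiftBits shifts left; a negative shiftBits shifts right.

2. This function uses saturating arithmetic.

Left-shift results outside the Q7 range [0x80, 0x7F] are saturated to the nearest limit.

3. Determines whether the shift amount is positive or negative.

4. One call to __PACKq7 packs the four shifted Q7 results so they are stored at once.

5. Shifts the remaining samples (fewer than 4).

6. If the shift amount is negative, shift right.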

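9.3.4 Example

Objective:

1. Shift values of three data types.

Procedure:

1. Press key K3; the results are printed over the serial port.

Result:

The printed output, viewed in the serial terminal program SecureCRT (included on the V5 CD), is as follows:

Program design: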

/*
*********************************************************************************************************
* Function: DSP_Shift
* Description: Shift
* Parameters: none
* Return: none
*********************************************************************************************************
*/
static void DSP_Shift(void)
{
    static q31_t pSrcA1 = 0x88886666;
    static q31_t pDst1;
    static q15_t pSrcA2 = 0x8866;
    static q15_t pDst2;
    static q7_t pSrcA3 = 0x86;
    static q7_t pDst3;

    arm_shift_q31(&pSrcA1, 3, &pDst1, 1);
    printf("arm_shift_q31 = %8x\r\n", pDst1);

    arm_shift_q15(&pSrcA2, -3, &pDst2, 1);
    printf("arm_shift_q15 = %4x\r\n", pDst2);

    arm_shift_q7(&pSrcA3, 3, &pDst3, 1);
    printf("arm_shift_q7 = %2x\r\n", pDst3);

    printf("***********************************\r\n");
}

硬汉Eric2013 · 2015-6-5 14:54:50

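9.4 Subtraction (Vector Sub)

These functions perform element-wise subtraction, described by the formula:

pDst[n] = pSrcA[n] - pSrcB[n], 0 <= n < blockSize.

9.4.1 arm_sub_f32

This function subtracts 32-bit floating-point vectors. The annotated source code is as follows: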

/**
* @brief Floating-point vector subtraction.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*/
void arm_sub_f32(
float32_t * pSrcA,
float32_t * pSrcB,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t inA1, inA2, inA3, inA4; /* temporary variables */
float32_t inB1, inB2, inB3, inB4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
/* Read 4 input samples from sourceA and sourceB */
inA1 = *pSrcA;
inB1 = *pSrcB;
inA2 = *(pSrcA + 1);
inB2 = *(pSrcB + 1);
inA3 = *(pSrcA + 2);
inB3 = *(pSrcB + 2);
inA4 = *(pSrcA + 3);
inB4 = *(pSrcB + 3);
/* dst = srcA - srcB */
/* subtract and store the result */ (1)
*pDst = inA1 - inB1;
*(pDst + 1) = inA2 - inB2;
*(pDst + 2) = inA3 - inB3;
*(pDst + 3) = inA4 - inB4;
/* Update pointers to process next samples */
pSrcA += 4u;
pSrcB += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
*pDst++ = (*pSrcA++) - (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
}


1. Floating-point subtraction is straightforward: the two values are subtracted directly.
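
The loop structure used by arm_sub_f32 (blockSize >> 2 groups of four samples, then blockSize % 4 leftover samples) can be sketched on its own. The function name sub_f32_unrolled is hypothetical, and the code is only a minimal host-side illustration of the pattern, not a replacement for the library routine.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the structure of arm_sub_f32:
 * dst[i] = a[i] - b[i] for 0 <= i < n. */
void sub_f32_unrolled(const float *a, const float *b, float *dst, size_t n)
{
    size_t blk = n >> 2;   /* number of complete groups of four samples */

    /* Unrolled part: four outputs per iteration. */
    while (blk > 0) {
        dst[0] = a[0] - b[0];
        dst[1] = a[1] - b[1];
        dst[2] = a[2] - b[2];
        dst[3] = a[3] - b[3];
        a += 4;
        b += 4;
        dst += 4;
        blk--;
    }

    /* Tail: the remaining 0 to 3 samples, one per iteration. */
    blk = n % 4;
    while (blk > 0) {
        *dst++ = *a++ - *b++;
        blk--;
    }
}
```

Calling it with n = 5, as the example in section 9.4.5 does for the library function, runs the unrolled loop once and the tail loop once.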

9.4.2 arm_sub_q31

This function subtracts 32-bit fixed-point (Q31) vectors. The annotated source code is as follows:

/**
* @brief Q31 vector subtraction.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q31 range [0x80000000 0x7FFFFFFF] will be saturated.
*/
void arm_sub_q31(
q31_t * pSrcA,
q31_t * pSrcB,
q31_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2, inA3, inA4;
q31_t inB1, inB2, inB3, inB4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer. */
inA1 = *pSrcA++;
inA2 = *pSrcA++;
inB1 = *pSrcB++;
inB2 = *pSrcB++;
inA3 = *pSrcA++;
inA4 = *pSrcA++;
inB3 = *pSrcB++;
inB4 = *pSrcB++;
*pDst++ = __QSUB(inA1, inB1); (2)
*pDst++ = __QSUB(inA2, inB2);
*pDst++ = __QSUB(inA3, inB3);
*pDst++ = __QSUB(inA4, inB4);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = __QSUB(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrcA++ - *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


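1. This function uses saturating arithmetic.

Results outside the Q31 range [0x80000000, 0x7FFFFFFF] are saturated: positive overflow is clamped to 0x7FFFFFFF and negative overflow to 0x80000000.

2. __QSUB is a saturating-arithmetic instruction. Unlike __QSUB16 and __QSUB8 it operates on a single 32-bit value, so each call performs one saturating subtraction of two Q31 values.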
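
The effect of a saturating Q31 subtraction can be reproduced in portable C by widening to 64 bits and clamping, which is essentially what the Cortex-M0 branch above does with clip_q63_to_q31. The helper name qsub_q31_ref is hypothetical; it is a sketch of the behaviour, not the definition of the instruction.

```c
#include <stdint.h>

/* Hypothetical portable model of a saturating Q31 subtraction. */
int32_t qsub_q31_ref(int32_t a, int32_t b)
{
    /* The difference of two 32-bit values always fits in 64 bits. */
    int64_t diff = (int64_t)a - (int64_t)b;

    if (diff > INT32_MAX) {
        return INT32_MAX;   /* positive overflow -> 0x7FFFFFFF */
    }
    if (diff < INT32_MIN) {
        return INT32_MIN;   /* negative overflow -> 0x80000000 */
    }
    return (int32_t)diff;
}
```

For example, subtracting 1 from the most negative Q31 value stays at 0x80000000 instead of wrapping around to 0x7FFFFFFF.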

9.4.3 arm_sub_q15

This function subtracts 16-bit fixed-point (Q15) vectors. The annotated source code is as follows:

/**
* @brief Q15 vector subtraction.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior:
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
*/
void arm_sub_q15(
q15_t * pSrcA,
q15_t * pSrcB,
q15_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t inA1, inA2;
q31_t inB1, inB2;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer two samples at a time. */
inA1 = *__SIMD32(pSrcA)++; (1)
inA2 = *__SIMD32(pSrcA)++;
inB1 = *__SIMD32(pSrcB)++;
inB2 = *__SIMD32(pSrcB)++;
*__SIMD32(pDst)++ = __QSUB16(inA1, inB1); (2)
*__SIMD32(pDst)++ = __QSUB16(inA2, inB2);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q15_t) __QSUB16(*pSrcA++, *pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q15_t) __SSAT(((q31_t) * pSrcA++ - *pSrcB++), 16);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


1. Two Q15 values are read at once as a single 32-bit word.

2. Because __QSUB16 is a SIMD instruction, a single call performs two saturating subtractions.
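
To make the "two subtractions per call" idea concrete, the following sketch models a dual 16-bit saturating subtract on a plain 32-bit word. The names lane16, sat16 and qsub16_ref are hypothetical, and the code only imitates the lane-wise behaviour described above; on the target the work is done by a single instruction.

```c
#include <stdint.h>

/* Interpret the low 16 bits of v as a signed Q15 value. */
int32_t lane16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Saturate to the Q15 range [-32768, 32767]. */
int32_t sat16(int32_t x)
{
    if (x > INT16_MAX) {
        return INT16_MAX;
    }
    if (x < INT16_MIN) {
        return INT16_MIN;
    }
    return x;
}

/* Hypothetical portable model of a dual 16-bit saturating subtract:
 * each half of the 32-bit word is an independent Q15 lane. */
uint32_t qsub16_ref(uint32_t x, uint32_t y)
{
    int32_t lo = sat16(lane16(x) - lane16(y));
    int32_t hi = sat16(lane16(x >> 16) - lane16(y >> 16));

    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}
```

The two lanes never influence each other: a saturation in the upper half leaves the lower half untouched.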

9.4.4 arm_sub_q7

This function subtracts 8-bit fixed-point (Q7) vectors. The annotated source code is as follows:

/**
* @brief Q7 vector subtraction.
* @param[in] *pSrcA points to the first input vector
* @param[in] *pSrcB points to the second input vector
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in each vector
* @return none.
*
* Scaling and Overflow Behavior:
* par
* The function uses saturating arithmetic.
* Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
*/
void arm_sub_q7(
q7_t * pSrcA,
q7_t * pSrcB,
q7_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the results in the destination buffer 4 samples at a time. */
*__SIMD32(pDst)++ = __QSUB8(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++); (1)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = __SSAT(*pSrcA++ - *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A - B */
/* Subtract and then store the result in the destination buffer. */
*pDst++ = (q7_t) __SSAT((q15_t) * pSrcA++ - *pSrcB++, 8);
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


1. __QSUB8 is also a SIMD instruction: a single call subtracts four Q7 values with saturation.
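
The four-lane case can be modelled the same way, this time with a loop over the four bytes of a 32-bit word. The names lane8 and qsub8_ref are hypothetical, and the sketch only illustrates the lane-wise saturating behaviour; it is not how the instruction is implemented.

```c
#include <stdint.h>

/* Interpret the low 8 bits of v as a signed Q7 value. */
int32_t lane8(uint32_t v)
{
    v &= 0xFFu;
    return (v & 0x80u) ? (int32_t)v - 0x100 : (int32_t)v;
}

/* Hypothetical portable model of a quad 8-bit saturating subtract. */
uint32_t qsub8_ref(uint32_t x, uint32_t y)
{
    uint32_t result = 0;
    int lane;

    for (lane = 0; lane < 4; lane++) {
        int32_t d = lane8(x >> (8 * lane)) - lane8(y >> (8 * lane));

        /* Saturate each lane to the Q7 range [-128, 127]. */
        if (d > INT8_MAX) {
            d = INT8_MAX;
        }
        if (d < INT8_MIN) {
            d = INT8_MIN;
        }
        result |= ((uint32_t)d & 0xFFu) << (8 * lane);
    }
    return result;
}
```

In the top byte of the first test value below, 0x71 - 0x7F gives -14 (0xF2), which needs no saturation, while the two middle bytes saturate to 0x7F and 0x80.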

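9.4.5 Worked example

Objective:

1. Subtract data of four types.

Procedure:

1. Press the UP key to print the results over the serial port.

Result:

Use the serial terminal program SecureCRT (included on the V5 CD) to view the printed output, shown below:

Program design: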

/*
*********************************************************************************************************
* Function name: DSP_Sub
* Description: Subtraction
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Sub(void)
{
   static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
   static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
   static float32_t pDst[5];
   static q31_t  pSrcA1[5] = {1,1,1,1,1};
   static q31_t  pSrcB1[5] = {1,1,1,1,1};
   static q31_t  pDst1[5];
   static q15_t  pSrcA2[5] = {1,1,1,1,1};
   static q15_t  pSrcB2[5] = {1,1,1,1,1};
   static q15_t  pDst2[5];
   static q7_t  pSrcA3[5] = {0x70,1,1,1,1};
   static q7_t  pSrcB3[5] = {0x7f,1,1,1,1};
   static q7_t pDst3[5];

   /* The first source sample grows on every key press, so the output changes each time */
   pSrcA[0] += 1.1f;
   arm_sub_f32(pSrcA, pSrcB, pDst, 5);
   printf("arm_sub_f32 = %f\r\n", pDst[0]);

   pSrcA1[0] += 1;
   arm_sub_q31(pSrcA1, pSrcB1, pDst1, 5);
   printf("arm_sub_q31 = %d\r\n", pDst1[0]);

   pSrcA2[0] += 1;
   arm_sub_q15(pSrcA2, pSrcB2, pDst2, 5);
   printf("arm_sub_q15 = %d\r\n", pDst2[0]);

   pSrcA3[0] += 1;
   arm_sub_q7(pSrcA3, pSrcB3, pDst3, 5);
   printf("arm_sub_q7 = %d\r\n", pDst3[0]);

   printf("***********************************\r\n");
}

硬汉Eric2013 · 2015-6-5 18:02:28

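9.5 Scale Factor (Vector Scale)

These functions scale data up or down. For floating-point data the formula is:

pDst[n] = pSrc[n] * scale, 0 <= n < blockSize.

For data in Q31, Q15 or Q7 format the formula is:

pDst[n] = (pSrc[n] * scaleFract) << shift, 0 <= n < blockSize.

In this case the overall scale factor is:

scale = scaleFract * 2^shift.

Note that these functions support the destination pointer and the source pointer referring to the same buffer.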
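
Because scaleFract is a fraction with magnitude below 1, a gain of 1 or more has to be split into a fractional part and a shift. The sketch below shows one possible way to do this for Q15. The helper split_scale_q15 is hypothetical (it is not part of the DSP library), it assumes a finite gain with |gain| < 32768, and it only ever produces a non-negative shift.

```c
#include <stdint.h>

/* Hypothetical helper: split a real-valued gain into a Q15 fraction and a
 * left-shift count so that gain ~= (scaleFract / 32768) * 2^shift.
 * Assumes the gain is finite and |gain| < 32768. */
void split_scale_q15(float gain, int16_t *scaleFract, int8_t *shift)
{
    float mag = (gain < 0.0f) ? -gain : gain;
    int8_t s = 0;
    int32_t q;

    /* Halve the magnitude until it fits the fractional range [0, 1). */
    while (mag >= 1.0f && s < 15) {
        mag *= 0.5f;
        s++;
    }

    /* Convert to Q15 with rounding and clamp to the largest positive value. */
    q = (int32_t)(mag * 32768.0f + 0.5f);
    if (q > INT16_MAX) {
        q = INT16_MAX;
    }

    *scaleFract = (int16_t)((gain < 0.0f) ? -q : q);
    *shift = s;
}
```

For a gain of 2.5 this yields scaleFract = 20480 (0.625 in Q15) and shift = 2, since 0.625 * 2^2 = 2.5.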

9.5.1 arm_scale_f32

This function scales a 32-bit floating-point vector. The annotated source code is as follows:

/**
* @brief Multiplies a floating-point vector by a scalar.
* @param[in] *pSrc points to the input vector
* @param[in] scale scale factor to be applied
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*/
void arm_scale_f32(
float32_t * pSrc,
float32_t scale,
float32_t * pDst,
uint32_t blockSize)
{
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
float32_t in1, in2, in3, in4; /* temporary variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the results in the destination buffer. */
/* read input samples from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
/* multiply with scaling factor */ (1)
in1 = in1 * scale;
/* read input sample from source */
in3 = *(pSrc + 2);
/* multiply with scaling factor */
in2 = in2 * scale;
/* read input sample from source */
in4 = *(pSrc + 3);
/* multiply with scaling factor */
in3 = in3 * scale;
in4 = in4 * scale;
/* store the result to destination */
*pDst = in1;
*(pDst + 1) = in2;
*(pDst + 2) = in3;
*(pDst + 3) = in4;
/* update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (*pSrc++) * scale;
/* Decrement the loop counter */
blkCnt--;
}
}


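1. Scaling floating-point data is straightforward: each source value is multiplied by the scale factor.

9.5.2 arm_scale_q31

This function scales a 32-bit fixed-point (Q31) vector. The annotated source code is as follows: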

/**
* @brief Multiplies a Q31 vector by a scalar.
* @param[in] *pSrc points to the input vector
* @param[in] scaleFract fractional portion of the scale value
* @param[in] shift number of bits to shift the result by
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The input data *pSrc and scaleFract are in 1.31 format.
* These are multiplied to yield a 2.62 intermediate result and this is shifted with saturation to 1.31 format.
*/
void arm_scale_q31(
q31_t * pSrc,
q31_t scaleFract,
int8_t shift,
q31_t * pDst,
uint32_t blockSize)
{
int8_t kShift = shift + 1; /* Shift to apply after scaling */ (2)
int8_t sign = (kShift & 0x80);
uint32_t blkCnt; /* loop counter */
q31_t in, out;
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q31_t in1, in2, in3, in4; /* temporary input variables */
q31_t out1, out2, out3, out4; /* temporary output variables */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
if(sign == 0u) (3)
{
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read four inputs from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* multiply input with scalar value */ (4)
in1 = ((q63_t) in1 * scaleFract) >> 32;
in2 = ((q63_t) in2 * scaleFract) >> 32;
in3 = ((q63_t) in3 * scaleFract) >> 32;
in4 = ((q63_t) in4 * scaleFract) >> 32;
/* apply shifting */
out1 = in1 << kShift;
out2 = in2 << kShift;
/* saturate the results. */
if(in1 != (out1 >> kShift)) (5)
out1 = 0x7FFFFFFF ^ (in1 >> 31);
if(in2 != (out2 >> kShift))
out2 = 0x7FFFFFFF ^ (in2 >> 31);
out3 = in3 << kShift;
out4 = in4 << kShift;
*pDst = out1;
*(pDst + 1) = out2;
if(in3 != (out3 >> kShift))
out3 = 0x7FFFFFFF ^ (in3 >> 31);
if(in4 != (out4 >> kShift))
out4 = 0x7FFFFFFF ^ (in4 >> 31);
/* Store result destination */
*(pDst + 2) = out3;
*(pDst + 3) = out4;
/* Update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
}
else {
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* read four inputs from source */
in1 = *pSrc;
in2 = *(pSrc + 1);
in3 = *(pSrc + 2);
in4 = *(pSrc + 3);
/* multiply input with scalar value */
in1 = ((q63_t) in1 * scaleFract) >> 32;
in2 = ((q63_t) in2 * scaleFract) >> 32;
in3 = ((q63_t) in3 * scaleFract) >> 32;
in4 = ((q63_t) in4 * scaleFract) >> 32;
/* apply shifting */ (6)
out1 = in1 >> -kShift;
out2 = in2 >> -kShift;
out3 = in3 >> -kShift;
out4 = in4 >> -kShift;
/* Store result destination */
*pDst = out1;
*(pDst + 1) = out2;
*(pDst + 2) = out3;
*(pDst + 3) = out4;
/* Update pointers to process next samples */
pSrc += 4u;
pDst += 4u;
/* Decrement the loop counter */
blkCnt--;
}
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
if(sign == 0)
{
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
in = *pSrc++;
in = ((q63_t) in * scaleFract) >> 32;
out = in << kShift;
if(in != (out >> kShift))
out = 0x7FFFFFFF ^ (in >> 31);
*pDst++ = out;
/* Decrement the loop counter */
blkCnt--;
}
}
else
{
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
in = *pSrc++;
in = ((q63_t) in * scaleFract) >> 32;
out = in >> -kShift;
*pDst++ = out;
/* Decrement the loop counter */
blkCnt--;
}
}
}


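1. Both the source data and the scale factor are in Q31 format, so their product is in 1.31 * 1.31 = 2.62 format. Because the output is also Q31, the product is shifted right by 32 bits and the final result is saturated.

2. The reason for the +1: keeping only the upper 32 bits of the 2.62 product leaves a value in 2.30 format, so one extra left shift is needed to return to 1.31 format. That extra bit is folded into kShift.

3. If kShift is non-negative, a left shift is performed; otherwise a right shift.

4. The product of the source sample and the scale factor is shifted right by 32 bits, which keeps the upper 32 bits of the 64-bit product.

5. This is the saturation step. If shifting the output back does not reproduce the input, the left shift overflowed, and the output is set to 0x7FFFFFFF for a positive input or 0x80000000 for a negative one.

6. A right shift cannot overflow, so no saturation is needed; the value is simply shifted right by -kShift.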
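
The format bookkeeping behind kShift = shift + 1 is easier to follow in a single-sample model. The function scale_q31_ref below is a hypothetical reference written for this explanation, not the library code; it assumes -31 <= shift <= 30 and an arithmetic >> on negative values, and it saturates through a 64-bit intermediate instead of the shift-back comparison used in the listing.

```c
#include <stdint.h>

/* Hypothetical reference model for one output sample of arm_scale_q31.
 * Assumes -31 <= shift <= 30 and an arithmetic >> on negative values. */
int32_t scale_q31_ref(int32_t in, int32_t scaleFract, int8_t shift)
{
    /* 1.31 * 1.31 = 2.62; the upper 32 bits of the product are in 2.30
     * format, so one extra left shift is needed to get back to 1.31. */
    int8_t kShift = (int8_t)(shift + 1);
    int32_t hi = (int32_t)(((int64_t)in * scaleFract) >> 32);
    int64_t wide;

    if (kShift < 0) {
        /* Right shift: cannot overflow, no saturation required. */
        return hi >> -kShift;
    }

    /* Left shift done as a 64-bit multiplication, then saturated. */
    wide = (int64_t)hi * ((int64_t)1 << kShift);
    if (wide > INT32_MAX) {
        return INT32_MAX;
    }
    if (wide < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)wide;
}
```

For instance, 0.5 * 0.5 with shift = 0 (both operands 0x40000000) gives 0x20000000, which is 0.25 in Q31. With the inputs 0x6fffffff and 0x70000000 and shift = 0, this model yields 0x61fffffe.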

9.5.3 arm_scale_q15

This function scales a 16-bit fixed-point (Q15) vector. The annotated source code is as follows:

/**
* @brief Multiplies a Q15 vector by a scalar.
* @param[in] *pSrc points to the input vector
* @param[in] scaleFract fractional portion of the scale value
* @param[in] shift number of bits to shift the result by
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The input data *pSrc and scaleFract are in 1.15 format.
* These are multiplied to yield a 2.30 intermediate result and this is shifted with saturation to 1.15 format.
*/
void arm_scale_q15(
q15_t * pSrc,
q15_t scaleFract,
int8_t shift,
q15_t * pDst,
uint32_t blockSize)
{
int8_t kShift = 15 - shift; /* shift to apply after scaling */ (2)
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q15_t in1, in2, in3, in4;
q31_t inA1, inA2; /* Temporary variables */
q31_t out1, out2, out3, out4;
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Reading 2 inputs from memory */
inA1 = *__SIMD32(pSrc)++; (3)
inA2 = *__SIMD32(pSrc)++;
/* C = A * scale */
/* Scale the inputs and then store the 2 results in the destination buffer
* in single cycle by packing the outputs */
out1 = (q31_t) ((q15_t) (inA1 >> 16) * scaleFract); (4)
out2 = (q31_t) ((q15_t) inA1 * scaleFract);
out3 = (q31_t) ((q15_t) (inA2 >> 16) * scaleFract);
out4 = (q31_t) ((q15_t) inA2 * scaleFract);
/* apply shifting */
out1 = out1 >> kShift;
out2 = out2 >> kShift;
out3 = out3 >> kShift;
out4 = out4 >> kShift;
/* saturate the output */
in1 = (q15_t) (__SSAT(out1, 16)); (5)
in2 = (q15_t) (__SSAT(out2, 16));
in3 = (q15_t) (__SSAT(out3, 16));
in4 = (q15_t) (__SSAT(out4, 16));
/* store the result to destination */ (6)
*__SIMD32(pDst)++ = __PKHBT(in2, in1, 16);
*__SIMD32(pDst)++ = __PKHBT(in4, in3, 16);
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q15_t) (__SSAT(((*pSrc++) * scaleFract) >> kShift, 16));
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q15_t) (__SSAT(((q31_t) * pSrc++ * scaleFract) >> kShift, 16));
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


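1. Both the source data and the scale factor are in Q15 format, so their product is in 1.15 * 1.15 = 2.30 format. Because the output is also Q15, the result must be saturated.

2. This variable is a neat trick. The 2.30 product needs a right shift by 15 to return to 1.15 format; combining that with the requested shift gives a single right shift by 15 - shift, which handles both positive (left) and negative (right) values of shift.

3. Read two Q15 values as one 32-bit word.

4. Multiply each source sample by the scale factor and store the product in a Q31 variable.

5. Saturate the outputs to 16 bits.

6. Each call to __PKHBT packs two Q15 results into one 32-bit word, which is then written to the destination in a single store.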
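
The single right shift by 15 - shift and the halfword packing can be illustrated with a small host-side model. Both function names are hypothetical, shift is assumed to lie in [-16, 15] so that kShift stays within 0 to 31, and >> on a negative value is assumed to be arithmetic. The packing helper imitates the effect of __PKHBT(low, high, 16) as it is used in the listing.

```c
#include <stdint.h>

/* Hypothetical reference model for one output sample of arm_scale_q15.
 * Assumes -16 <= shift <= 15 and an arithmetic >> on negative values. */
int16_t scale_q15_ref(int16_t in, int16_t scaleFract, int8_t shift)
{
    int8_t kShift = (int8_t)(15 - shift);       /* 2.30 -> 1.15, plus shift */
    int32_t out = ((int32_t)in * scaleFract) >> kShift;

    /* Saturate to the Q15 range. */
    if (out > INT16_MAX) {
        return INT16_MAX;
    }
    if (out < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)out;
}

/* Pack two Q15 values into one word: 'low' in bits 15..0, 'high' in
 * bits 31..16. */
uint32_t pack_q15_pair(int16_t low, int16_t high)
{
    return ((uint32_t)low & 0xFFFFu) | ((uint32_t)high << 16);
}
```

With scaleFract = 20480 (0.625) and shift = 2, an input of 1000 becomes 2500, that is, a gain of 2.5. With the inputs 0x6fff and 0x7000 and shift = 0, the model yields 0x61ff.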

9.5.4 arm_scale_q7

This function scales an 8-bit fixed-point (Q7) vector. The annotated source code is as follows:

/**
* @brief Multiplies a Q7 vector by a scalar.
* @param[in] *pSrc points to the input vector
* @param[in] scaleFract fractional portion of the scale value
* @param[in] shift number of bits to shift the result by
* @param[out] *pDst points to the output vector
* @param[in] blockSize number of samples in the vector
* @return none.
*
* Scaling and Overflow Behavior: (1)
* par
* The input data *pSrc and scaleFract are in 1.7 format.
* These are multiplied to yield a 2.14 intermediate result and this is shifted with saturation to 1.7 format.
*/
void arm_scale_q7(
q7_t * pSrc,
q7_t scaleFract,
int8_t shift,
q7_t * pDst,
uint32_t blockSize)
{
int8_t kShift = 7 - shift; /* shift to apply after scaling */ (2)
uint32_t blkCnt; /* loop counter */
#ifndef ARM_MATH_CM0_FAMILY
/* Run the below code for Cortex-M4 and Cortex-M3 */
q7_t in1, in2, in3, in4, out1, out2, out3, out4; /* Temporary variables to store input & output */
/*loop Unrolling */
blkCnt = blockSize >> 2u;
/* First part of the processing with loop unrolling. Compute 4 outputs at a time.
** a second loop below computes the remaining 1 to 3 samples. */
while(blkCnt > 0u)
{
/* Reading 4 inputs from memory */
in1 = *pSrc++;
in2 = *pSrc++;
in3 = *pSrc++;
in4 = *pSrc++;
/* C = A * scale */
/* Scale the inputs and then store the results in the temporary variables. */
out1 = (q7_t) (__SSAT(((in1) * scaleFract) >> kShift, 8)); (3)
out2 = (q7_t) (__SSAT(((in2) * scaleFract) >> kShift, 8));
out3 = (q7_t) (__SSAT(((in3) * scaleFract) >> kShift, 8));
out4 = (q7_t) (__SSAT(((in4) * scaleFract) >> kShift, 8));
/* Packing the individual outputs into 32bit and storing in
* destination buffer in single write */
*__SIMD32(pDst)++ = __PACKq7(out1, out2, out3, out4); (4)
/* Decrement the loop counter */
blkCnt--;
}
/* If the blockSize is not a multiple of 4, compute any remaining output samples here.
** No loop unrolling is used. */
blkCnt = blockSize % 0x4u;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) (__SSAT(((*pSrc++) * scaleFract) >> kShift, 8));
/* Decrement the loop counter */
blkCnt--;
}
#else
/* Run the below code for Cortex-M0 */
/* Initialize blkCnt with number of samples */
blkCnt = blockSize;
while(blkCnt > 0u)
{
/* C = A * scale */
/* Scale the input and then store the result in the destination buffer. */
*pDst++ = (q7_t) (__SSAT((((q15_t) * pSrc++ * scaleFract) >> kShift), 8));
/* Decrement the loop counter */
blkCnt--;
}
#endif /* #ifndef ARM_MATH_CM0_FAMILY */
}


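1. Both the source data and the scale factor are in Q7 format, so their product is in 1.7 * 1.7 = 2.14 format. Because the output is also Q7, the result must be saturated.

2. This variable is a neat trick. The 2.14 product needs a right shift by 7 to return to 1.7 format; combining that with the requested shift gives a single right shift by 7 - shift, which handles both positive (left) and negative (right) values of shift.

3. The product of the source sample and the scale factor is shifted and then saturated to 8 bits.

4. The four results are packed into one 32-bit word with __PACKq7 and written to the destination in a single store.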
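
A minimal sketch of the Q7 path follows: scale one sample with a single right shift by 7 - shift, then pack four results into one word as the listing does before its 32-bit store. The names scale_q7_ref and pack_q7_x4 are hypothetical, shift is assumed to lie in [-8, 7], >> on a negative value is assumed to be arithmetic, and the byte order shown (first value in the lowest byte) assumes a little-endian target.

```c
#include <stdint.h>

/* Hypothetical reference model for one output sample of arm_scale_q7.
 * Assumes -8 <= shift <= 7 and an arithmetic >> on negative values. */
int8_t scale_q7_ref(int8_t in, int8_t scaleFract, int8_t shift)
{
    int8_t kShift = (int8_t)(7 - shift);        /* 2.14 -> 1.7, plus shift */
    int32_t out = ((int32_t)in * scaleFract) >> kShift;

    /* Saturate to the Q7 range. */
    if (out > INT8_MAX) {
        return INT8_MAX;
    }
    if (out < INT8_MIN) {
        return INT8_MIN;
    }
    return (int8_t)out;
}

/* Pack four Q7 values into one 32-bit word, v0 in the lowest byte
 * (little-endian layout assumed). */
uint32_t pack_q7_x4(int8_t v0, int8_t v1, int8_t v2, int8_t v3)
{
    return ((uint32_t)(uint8_t)v0)
         | ((uint32_t)(uint8_t)v1 << 8)
         | ((uint32_t)(uint8_t)v2 << 16)
         | ((uint32_t)(uint8_t)v3 << 24);
}
```

With an input of 0x70 scaled by 0x70 and shift = 0, the model yields 0x62: 112 * 112 = 12544, and 12544 >> 7 = 98.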

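9.5.5 Worked example

Objective:

1. Scale data of four types.

Procedure:

1. Press the DOWN key to print the results over the serial port.

Result:

Use the serial terminal program SecureCRT (included on the V5 CD) to view the printed output, shown below:

Program design: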

/*
*********************************************************************************************************
* Function name: DSP_Scale
* Description: Scale factor
* Parameters: none
* Return value: none
*********************************************************************************************************
*/
static void DSP_Scale(void)
{
   static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
   static float32_t scale = 0.0f;
   static float32_t pDst[5];
   static q31_t pSrcA1[5] = {0x6fffffff,1,1,1,1};
   static q31_t scale1 = 0x6fffffff;
   static q31_t pDst1[5];
   static q15_t pSrcA2[5] = {0x6fff,1,1,1,1};
   static q15_t scale2 = 0x6fff;
   static q15_t pDst2[5];
   static q7_t pSrcA3[5] = {0x70,1,1,1,1};
   static q7_t scale3 = 0x6f;
   static q7_t pDst3[5];

   /* The scale factors grow on every key press, so the output changes each time */
   scale += 0.1f;
   arm_scale_f32(pSrcA, scale, pDst, 5);
   printf("arm_scale_f32 = %f\r\n", pDst[0]);

   scale1 += 1;
   arm_scale_q31(pSrcA1, scale1, 0, pDst1, 5);
   printf("arm_scale_q31 = %x\r\n", pDst1[0]);

   scale2 += 1;
   arm_scale_q15(pSrcA2, scale2, 0, pDst2, 5);
   printf("arm_scale_q15 = %x\r\n", pDst2[0]);

   scale3 += 1;
   arm_scale_q7(pSrcA3, scale3, 0, pDst3, 5);
   printf("arm_scale_q7 = %x\r\n", pDst3[0]);

   printf("***********************************\r\n");