


Chapter 8: Using BasicMathFunctions (Part 1)

2016-9-22 13:05:33 5482 ARM

Reposted from the DSP tutorial series.

This installment starts our study of ARM's official DSP library, beginning with the basic math functions. It covers four operations: absolute value, addition, dot product, and multiplication.

8.1 Absolute value (Vector Absolute Value)
8.2 Addition (Vector Addition)
8.3 Dot product (Vector Dot Product)
8.4 Multiplication (Vector Multiplication)

8.1 Absolute value (Vector Absolute Value)

These functions compute the element-wise absolute value:

    pDst[n] = abs(pSrc[n]),   0 <= n < blockSize.

Note that these functions work in place: the destination pointer and the source pointer may refer to the same buffer.

8.1.1 arm_abs_f32

This function computes the absolute value of 32-bit floating-point data. The source, with callouts (1) to (8) that are explained in the next post, is:

    /**
     * @brief Floating-point vector absolute value.                      (1)
     * @param[in]  *pSrc      points to the input buffer
     * @param[out] *pDst      points to the output buffer
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     */
    void arm_abs_f32(                                                 /* (2) */
      float32_t * pSrc,
      float32_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY                                       /* (3) */

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      float32_t in1, in2, in3, in4;         /* temporary variables */

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;                                       /* (4) */

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Calculate absolute and then store the results in the destination buffer. */
        /* read sample from source */
        in1 = *pSrc;
        in2 = *(pSrc + 1);
        in3 = *(pSrc + 2);

        /* find absolute value */
        in1 = fabsf(in1);                                             /* (5) */

        /* read sample from source */
        in4 = *(pSrc + 3);

        /* find absolute value */
        in2 = fabsf(in2);

        /* store result to destination */
        *pDst = in1;

        /* find absolute value */
        in3 = fabsf(in3);

        /* find absolute value */
        in4 = fabsf(in4);

        /* store result to destination */
        *(pDst + 1) = in2;

        /* store result to destination */
        *(pDst + 2) = in3;

        /* store result to destination */
        *(pDst + 3) = in4;

        /* Update source pointer to process next samples */           /* (6) */
        pSrc += 4u;

        /* Update destination pointer to process next samples */
        pDst += 4u;

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

    #else                                                             /* (7) */

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

    #endif /*   #ifndef ARM_MATH_CM0_FAMILY   */

      while(blkCnt > 0u)                                              /* (8) */
      {
        /* C = |A| */
        /* Calculate absolute and then store the results in the destination buffer. */
        *pDst++ = fabsf(*pSrc++);

        /* Decrement the loop counter */
        blkCnt--;
      }
    }

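lee_st · 2016-9-22 13:05:49 · 2#

Notes on the callouts in arm_abs_f32:

1. A brief introduction to the conventions shared by the DSP library functions, so that they need not be repeated later:
   (1) Almost all functions are reentrant.
   (2) Most functions operate on a block of data. arm_abs_f32, for example, computes the absolute value of a whole array. If you only need the absolute value of a few numbers, the library function offers no real advantage.
   (3) The library functions generally support CM0, CM3, and CM4 (the latest DSP library has added CM7 support).
   (4) Data is generally processed four samples at a time; any leftover samples (fewer than four) are handled separately.
   (5) Most functions come in four formats: f32, Q31, Q15, and Q7.
2. The function parameters: an input array is passed in and the absolute value of each element is computed.
3. This part of the code is for the CM3 and CM4 cores.
4. A right shift by two bits (division by 4), so that the data is processed in groups of four.
5. fabsf: this function is not implemented with a DSP instruction of the Cortex-M4F. It is implemented in C and is provided, pre-packaged, by MDK.
6. Advance to the next group of data.
7. This part of the code is for CM0.
8. Handles the leftover samples (fewer than four), or all samples on a CM0 core.

8.1.2 arm_abs_q31

This function computes the absolute value of 32-bit fixed-point (Q31) data. The source is:

    /**
     * @brief Q31 vector absolute value.
     * @param[in]  *pSrc      points to the input buffer
     * @param[out] *pDst      points to the output buffer
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The function uses saturating arithmetic.
     * The Q31 value -1 (0x80000000) will be saturated to the maximum allowable positive value 0x7FFFFFFF.
     */
    void arm_abs_q31(
      q31_t * pSrc,
      q31_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */
      q31_t in;                             /* Input value */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      q31_t in1, in2, in3, in4;

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Calculate absolute of input (if -1 then saturated to 0x7fffffff)
           and then store the results in the destination buffer. */
        in1 = *pSrc++;
        in2 = *pSrc++;
        in3 = *pSrc++;
        in4 = *pSrc++;

        *pDst++ = (in1 > 0) ? in1 : (q31_t)__QSUB(0, in1);            /* (2) */
        *pDst++ = (in2 > 0) ? in2 : (q31_t)__QSUB(0, in2);
        *pDst++ = (in3 > 0) ? in3 : (q31_t)__QSUB(0, in3);
        *pDst++ = (in4 > 0) ? in4 : (q31_t)__QSUB(0, in4);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

    #endif /*   #ifndef ARM_MATH_CM0_FAMILY   */

      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Calculate absolute value of the input (if -1 then saturated to 0x7fffffff)
           and then store the results in the destination buffer. */
        in = *pSrc++;
        *pDst++ = (in > 0) ? in : ((in == INT32_MIN) ? INT32_MAX : -in);

        /* Decrement the loop counter */
        blkCnt--;
      }
    }

1. This function uses saturating arithmetic, as do many of the functions that follow. For an explanation of saturating arithmetic, see section 4.3.6, "Assembly language: saturation operations", of the Chinese edition of The Definitive Guide to the ARM Cortex-M3. For Q31 data, saturation turns 0x80000000 into 0x7fffffff. This value is a special case because its negation cannot be represented in 32 bits; just remember it.
2. A word about __QSUB. It corresponds to the QSUB saturating-subtraction instruction of the Cortex-M4; the Cortex-M3 has no such instruction, so there arm_math.h supplies an equivalent written in C. For example, __QSUB(0, in1) computes 0 - in1 with saturation and returns the result. __QSUB performs a 32-bit saturating subtraction; __QSUB16 and __QSUB8 perform saturating subtraction on packed 16-bit and 8-bit values.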
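To make the saturation rule concrete, here is a minimal portable sketch of a Q31 absolute value that mirrors the special case in the Cortex-M0 branch above. The name sat_abs_q31 is hypothetical and is not part of the DSP library; the sketch uses only standard C.

```c
#include <stdint.h>

/* Hypothetical helper (not a CMSIS-DSP function): saturating |x| for Q31.
 * INT32_MIN (0x80000000, i.e. -1.0 in Q31) has no positive counterpart
 * in 32 bits, so it is clamped to INT32_MAX (0x7FFFFFFF). */
static int32_t sat_abs_q31(int32_t x)
{
    if (x > 0) {
        return x;
    }
    return (x == INT32_MIN) ? INT32_MAX : -x;
}
```

Every input other than INT32_MIN behaves like an ordinary absolute value; only that one input is clamped.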


lee_st · 2016-9-22 13:06:02 · 3#

8.1.3 arm_abs_q15

This function computes the absolute value of 16-bit fixed-point (Q15) data. The source is:

    /**
     * @brief Q15 vector absolute value.
     * @param[in]  *pSrc      points to the input buffer
     * @param[out] *pDst      points to the output buffer
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     *
     * Scaling and Overflow Behavior:
     * \par
     * The function uses saturating arithmetic.
     * The Q15 value -1 (0x8000) will be saturated to the maximum allowable positive value 0x7FFF.   (1)
     */
    void arm_abs_q15(
      q15_t * pSrc,
      q15_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY
      __SIMD32_TYPE *simd;                                            /* (2) */

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      q15_t in1;                            /* Input value1 */
      q15_t in2;                            /* Input value2 */

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      simd = __SIMD32_CONST(pDst);                                    /* (3) */
      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Read two inputs */
        in1 = *pSrc++;
        in2 = *pSrc++;

        /* Store the Absolute result in the destination buffer by packing the two values, in a single cycle */
    #ifndef ARM_MATH_BIG_ENDIAN
        *simd++ =
          __PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)),        /* (4) */
                  ((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
    #else
        *simd++ =
          __PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
                  ((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
    #endif /* #ifndef ARM_MATH_BIG_ENDIAN */

        in1 = *pSrc++;
        in2 = *pSrc++;

    #ifndef ARM_MATH_BIG_ENDIAN
        *simd++ =
          __PKHBT(((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)),
                  ((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)), 16);
    #else
        *simd++ =
          __PKHBT(((in2 > 0) ? in2 : (q15_t)__QSUB16(0, in2)),
                  ((in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1)), 16);
    #endif /* #ifndef ARM_MATH_BIG_ENDIAN */

        /* Decrement the loop counter */
        blkCnt--;
      }
      pDst = (q15_t *)simd;

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Read the input */
        in1 = *pSrc++;

        /* Calculate absolute value of input and then store the result in the destination buffer. */
        *pDst++ = (in1 > 0) ? in1 : (q15_t)__QSUB16(0, in1);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #else

      /* Run the below code for Cortex-M0 */
      q15_t in;                             /* Temporary input variable */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Read the input */
        in = *pSrc++;

        /* Calculate absolute value of input and then store the result in the destination buffer. */
        *pDst++ = (in > 0) ? in : ((in == (q15_t) 0x8000) ? 0x7fff : -in);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */
    }

1. For Q15 data, saturation turns 0x8000 into 0x7fff.
2. __SIMD32_TYPE is defined in arm_math.h as follows:

    #define __SIMD32_TYPE int32_t __packed

   SIMD is the "single instruction, multiple data" concept covered in the previous installment. Put simply, __SIMD32_TYPE declares an int32_t. The __packed qualifier tells the compiler that the object may not be naturally aligned, so a pointer to two adjacent 16-bit values can safely be accessed as one 32-bit word.
3. __SIMD32_CONST is defined as follows:

    #define __SIMD32_CONST(addr) ((__SIMD32_TYPE *)(addr))

4. __PKHBT is defined in core_cm4_simd.h as follows:

    #define __PKHBT(ARG1,ARG2,ARG3) ( ((((uint32_t)(ARG1))          ) & 0x0000FFFFUL) | \
                                      ((((uint32_t)(ARG2)) << (ARG3)) & 0xFFFF0000UL)  )

   This macro merges two 16-bit values into one 32-bit value. One point deserves special mention: PKHBT is also an instruction supported by the CM4 core, and MDK recognizes the C expression in the macro above and emits the corresponding PKHBT instruction. __QSUB16 performs saturating subtraction on 16-bit data.
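As an illustration of what the packing step produces, the following sketch models __PKHBT(lo, hi, 16) in plain C. The function name pack_q15_pair is hypothetical and only stands in for the macro; it is not part of the library.

```c
#include <stdint.h>

/* Hypothetical portable model of __PKHBT(lo, hi, 16):
 * the low halfword of the result comes from lo,
 * the high halfword comes from hi. */
static uint32_t pack_q15_pair(int16_t lo, int16_t hi)
{
    uint32_t low  = (uint32_t)(uint16_t)lo;        /* bits 15..0  */
    uint32_t high = (uint32_t)(uint16_t)hi << 16;  /* bits 31..16 */
    return low | high;
}
```

On a little-endian target, storing the packed word writes lo to the lower address and hi to the next one, which is why the big-endian branch of arm_abs_q15 swaps the two arguments.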


lee_st · 2016-9-22 13:06:14 · 4#

8.1.4 arm_abs_q7

This function computes the absolute value of 8-bit fixed-point (Q7) data. The source is:

    /**
     * @brief Q7 vector absolute value.
     * @param[in]  *pSrc      points to the input buffer
     * @param[out] *pDst      points to the output buffer
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     *
     * \par Conditions for optimum performance
     *  Input and output buffers should be aligned by 32-bit
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The function uses saturating arithmetic.
     * The Q7 value -1 (0x80) will be saturated to the maximum allowable positive value 0x7F.
     */
    void arm_abs_q7(
      q7_t * pSrc,
      q7_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */
      q7_t in;                              /* Input value1 */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      q31_t in1, in2, in3, in4;             /* temporary input variables */
      q31_t out1, out2, out3, out4;         /* temporary output variables */

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Read inputs */
        in1 = (q31_t) * pSrc;
        in2 = (q31_t) * (pSrc + 1);
        in3 = (q31_t) * (pSrc + 2);

        /* find absolute value */
        out1 = (in1 > 0) ? in1 : (q31_t)__QSUB8(0, in1);              /* (2) */

        /* read input */
        in4 = (q31_t) * (pSrc + 3);

        /* find absolute value */
        out2 = (in2 > 0) ? in2 : (q31_t)__QSUB8(0, in2);

        /* store result to destination */
        *pDst = (q7_t) out1;

        /* find absolute value */
        out3 = (in3 > 0) ? in3 : (q31_t)__QSUB8(0, in3);

        /* find absolute value */
        out4 = (in4 > 0) ? in4 : (q31_t)__QSUB8(0, in4);

        /* store result to destination */
        *(pDst + 1) = (q7_t) out2;

        /* store result to destination */
        *(pDst + 2) = (q7_t) out3;

        /* store result to destination */
        *(pDst + 3) = (q7_t) out4;

        /* update pointers to process next samples */
        pSrc += 4u;
        pDst += 4u;

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

    #else

      /* Run the below code for Cortex-M0 */
      blkCnt = blockSize;

    #endif //      #define ARM_MATH_CM0_FAMILY

      while(blkCnt > 0u)
      {
        /* C = |A| */
        /* Read the input */
        in = *pSrc++;

        /* Store the Absolute result in the destination buffer */
        *pDst++ = (in > 0) ? in : ((in == (q7_t) 0x80) ? 0x7f : -in);

        /* Decrement the loop counter */
        blkCnt--;
      }
    }

1. Because of saturation, the absolute value of 0x80 becomes 0x7F.
2. __QSUB8 performs saturating subtraction on 8-bit data.


lee_st · 2016-9-22 13:06:26 · 5#

8.1.5 Worked example

Goal:
1. Compute the absolute value for each of the four data types.

Procedure:
1. Press key K1; the result is printed over the serial port.

Observed result: view the printed output on the PC with the serial terminal program SecureCRT (included on the V5 CD).

Program:

    /*
    *********************************************************************************************************
    *   Function:    DSP_ABS
    *   Description: absolute value
    *   Parameters:  none
    *   Returns:     none
    *********************************************************************************************************
    */
    static void DSP_ABS(void)
    {
        static float32_t pSrc;
        static float32_t pDst;
        static q31_t pSrc1;
        static q31_t pDst1;
        static q15_t pSrc2;
        static q15_t pDst2;
        static q7_t pSrc3 = 127;  /* start at 127 so we can check that 0x80 saturates to 0x7F */
        static q7_t pDst3;

        pSrc -= 1.23f;
        arm_abs_f32(&pSrc, &pDst, 1);                                 /* (1) */
        printf("arm_abs_f32 = %f\r\n", pDst);

        pSrc1 -= 1;
        arm_abs_q31(&pSrc1, &pDst1, 1);                               /* (2) */
        printf("arm_abs_q31 = %d\r\n", pDst1);

        pSrc2 -= 1;
        arm_abs_q15(&pSrc2, &pDst2, 1);                               /* (3) */
        printf("arm_abs_q15 = %d\r\n", pDst2);

        pSrc3 += 1;
        printf("pSrc3 = %d\r\n", pSrc3);
        arm_abs_q7(&pSrc3, &pDst3, 1);                                /* (4) */
        printf("arm_abs_q7 = %d\r\n", pDst3);
        printf("**********************************\r\n");
    }

Callouts (1) to (4) compute the absolute value in each format. Only a single value is processed here; try computing the absolute value of a whole array yourself.


lee_st · 2016-9-22 13:06:42 · 6#

8.2 Addition (Vector Addition)

These functions compute the element-wise sum:

    pDst[n] = pSrcA[n] + pSrcB[n],   0 <= n < blockSize.

8.2.1 arm_add_f32

This function adds 32-bit floating-point data. The source is:

    /**
     * @brief Floating-point vector addition.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[out] *pDst      points to the output vector
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     */
    void arm_add_f32(
      float32_t * pSrcA,
      float32_t * pSrcB,
      float32_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      float32_t inA1, inA2, inA3, inA4;     /* temporary input variables */
      float32_t inB1, inB2, inB3, inB4;     /* temporary input variables */

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */

        /* read four inputs from sourceA and four inputs from sourceB */
        inA1 = *pSrcA;
        inB1 = *pSrcB;
        inA2 = *(pSrcA + 1);
        inB2 = *(pSrcB + 1);
        inA3 = *(pSrcA + 2);
        inB3 = *(pSrcB + 2);
        inA4 = *(pSrcA + 3);
        inB4 = *(pSrcB + 3);

        /* C = A + B */                                               /* (1) */
        /* add and store result to destination */
        *pDst = inA1 + inB1;
        *(pDst + 1) = inA2 + inB2;
        *(pDst + 2) = inA3 + inB3;
        *(pDst + 3) = inA4 + inB4;

        /* update pointers to process next samples */
        pSrcA += 4u;
        pSrcB += 4u;
        pDst += 4u;

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */

      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        *pDst++ = (*pSrcA++) + (*pSrcB++);

        /* Decrement the loop counter */
        blkCnt--;
      }
    }

1. This code is straightforward: it simply adds the corresponding elements of the two inputs.
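The "four at a time, then the leftovers" structure shared by all of these functions can be boiled down to the sketch below. It is a simplified reference written for this explanation, not the library source, and the name add_f32_unrolled is hypothetical.

```c
#include <stddef.h>

/* Hypothetical reference routine showing the unroll-by-4 pattern:
 * process complete groups of four first, then the 0 to 3 leftovers. */
static void add_f32_unrolled(const float *a, const float *b,
                             float *dst, size_t n)
{
    size_t blocks = n >> 2;   /* number of complete groups of four */
    size_t tail   = n & 3u;   /* same as n % 4: leftover samples   */

    while (blocks > 0) {
        dst[0] = a[0] + b[0];
        dst[1] = a[1] + b[1];
        dst[2] = a[2] + b[2];
        dst[3] = a[3] + b[3];
        a += 4;
        b += 4;
        dst += 4;
        blocks--;
    }
    while (tail > 0) {
        *dst++ = *a++ + *b++;
        tail--;
    }
}
```

With n = 6, for instance, the first loop runs once (four samples) and the second loop handles the remaining two.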


lee_st · 2016-9-22 13:06:52 · 7#

8.2.2 arm_add_q31

This function adds 32-bit fixed-point (Q31) data. The source is:

    /**
     * @brief Q31 vector addition.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[out] *pDst      points to the output vector
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The function uses saturating arithmetic.
     * Results outside of the allowable Q31 range [0x80000000 0x7FFFFFFF] will be saturated.
     */
    void arm_add_q31(
      q31_t * pSrcA,
      q31_t * pSrcB,
      q31_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      q31_t inA1, inA2, inA3, inA4;
      q31_t inB1, inB2, inB3, inB4;

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        inA1 = *pSrcA++;
        inA2 = *pSrcA++;
        inB1 = *pSrcB++;
        inB2 = *pSrcB++;

        inA3 = *pSrcA++;
        inA4 = *pSrcA++;
        inB3 = *pSrcB++;
        inB4 = *pSrcB++;

        *pDst++ = __QADD(inA1, inB1);                                 /* (2) */
        *pDst++ = __QADD(inA2, inB2);
        *pDst++ = __QADD(inA3, inB3);
        *pDst++ = __QADD(inA4, inB4);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        *pDst++ = __QADD(*pSrcA++, *pSrcB++);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        *pDst++ = (q31_t) clip_q63_to_q31((q63_t) * pSrcA++ + *pSrcB++);   /* (3) */

        /* Decrement the loop counter */
        blkCnt--;
      }

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */
    }

1. This function also uses saturating arithmetic. The output range is [0x80000000 0x7FFFFFFF]; results outside this range are saturated to the nearest limit.
2. __QADD performs 32-bit saturating addition.
3. clip_q63_to_q31 is defined in arm_math.h:

    static __INLINE q31_t clip_q63_to_q31(
      q63_t x)
    {
      return ((q31_t) (x >> 32) != ((q31_t) x >> 31)) ?
        ((0x7FFFFFFF ^ ((q31_t) (x >> 63)))) : (q31_t) x;
    }

   It saturates a 64-bit value to the Q31 range.
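The same saturating addition can be written in a more readable, if less compact, way. The sketch below is my own restatement for illustration and sat_add_q31 is a hypothetical name; it widens to 64 bits and clamps, which is the effect the Cortex-M0 branch obtains with clip_q63_to_q31.

```c
#include <stdint.h>

/* Hypothetical readable model of Q31 saturating addition:
 * add in 64 bits so the true sum is never lost, then clamp
 * it to the representable 32-bit range. */
static int32_t sat_add_q31(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;

    if (sum > INT32_MAX) {
        return INT32_MAX;   /* positive overflow: 0x7FFFFFFF */
    }
    if (sum < INT32_MIN) {
        return INT32_MIN;   /* negative overflow: 0x80000000 */
    }
    return (int32_t)sum;
}
```

Compared with ordinary wrap-around addition, the result sticks at the nearest limit, which for signal data is usually far less damaging than a sign flip.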


lee_st · 2016-9-22 13:07:04 · 8#

8.2.3 arm_add_q15

This function adds 16-bit fixed-point (Q15) data. The source is:

    /**
     * @brief Q15 vector addition.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[out] *pDst      points to the output vector
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The function uses saturating arithmetic.
     * Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
     */
    void arm_add_q15(
      q15_t * pSrcA,
      q15_t * pSrcB,
      q15_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      q31_t inA1, inA2, inB1, inB2;

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = A + B */                                               /* (2) */
        /* Add and then store the results in the destination buffer. */
        inA1 = *__SIMD32(pSrcA)++;
        inA2 = *__SIMD32(pSrcA)++;
        inB1 = *__SIMD32(pSrcB)++;
        inB2 = *__SIMD32(pSrcB)++;

        *__SIMD32(pDst)++ = __QADD16(inA1, inB1);
        *__SIMD32(pDst)++ = __QADD16(inA2, inB2);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        *pDst++ = (q15_t) __QADD16(*pSrcA++, *pSrcB++);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        *pDst++ = (q15_t) __SSAT(((q31_t) * pSrcA++ + *pSrcB++), 16);   /* (3) */

        /* Decrement the loop counter */
        blkCnt--;
      }

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */
    }

1. This function also uses saturating arithmetic. The output range is [0x8000 0x7FFF]; results outside this range are saturated.
2. The statement inA1 = *__SIMD32(pSrcA)++ reads two adjacent 16-bit values into the 32-bit variable inA1 with a single 32-bit load. __QADD16 then adds both 16-bit halves at once, each with saturation.
3. __SSAT is a saturation operation (it is not a SIMD instruction); here it saturates the result to 16-bit precision.
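For readers who want to see what "saturate to N bits" means numerically, here is a small portable model. The name ssat_model is hypothetical; it imitates the effect of __SSAT(value, bits) for 1 <= bits <= 32 and is not how the intrinsic is actually implemented.

```c
#include <stdint.h>

/* Hypothetical portable model of __SSAT(value, bits), 1 <= bits <= 32:
 * clamp value to the signed range [-2^(bits-1), 2^(bits-1) - 1]. */
static int32_t ssat_model(int32_t value, unsigned bits)
{
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;

    if (value > max) {
        return (int32_t)max;
    }
    if (value < min) {
        return (int32_t)min;
    }
    return value;
}
```

With bits = 16 the limits are 32767 and -32768 (the Q15 range); with bits = 8 they are 127 and -128 (the Q7 range).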


lee_st · 2016-9-22 13:07:23 · 9#

8.2.4 arm_add_q7

This function adds 8-bit fixed-point (Q7) data. The source is:

    /**
     * @brief Q7 vector addition.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[out] *pDst      points to the output vector
     * @param[in]  blockSize  number of samples in each vector
     * @return none.
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The function uses saturating arithmetic.
     * Results outside of the allowable Q7 range [0x80 0x7F] will be saturated.
     */
    void arm_add_q7(
      q7_t * pSrcA,
      q7_t * pSrcB,
      q7_t * pDst,
      uint32_t blockSize)
    {
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */    /* (2) */
        *__SIMD32(pDst)++ = __QADD8(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        *pDst++ = (q7_t) __SSAT(*pSrcA++ + *pSrcB++, 8);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

      while(blkCnt > 0u)
      {
        /* C = A + B */
        /* Add and then store the results in the destination buffer. */
        *pDst++ = (q7_t) __SSAT((q15_t) * pSrcA++ + *pSrcB++, 8);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */
    }

1. This function also uses saturating arithmetic. The output range is [0x80 0x7F]; results outside this range are saturated.
2. A single SIMD instruction adds four pairs of 8-bit values here.
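To show what one such four-lane saturating add does to a 32-bit word, here is a plain C model. It is only an illustration under my own naming (qadd8_model is hypothetical); on the Cortex-M4 the work is done by one instruction rather than a loop.

```c
#include <stdint.h>

/* Hypothetical portable model of __QADD8: treat x and y as four
 * signed 8-bit lanes and add lane by lane, saturating each lane
 * to [-128, 127] independently of its neighbours. */
static uint32_t qadd8_model(uint32_t x, uint32_t y)
{
    uint32_t result = 0;

    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 8u * lane;
        int32_t a = (int32_t)((x >> shift) & 0xFFu);
        int32_t b = (int32_t)((y >> shift) & 0xFFu);

        if (a > 127) a -= 256;   /* sign-extend the 8-bit lane */
        if (b > 127) b -= 256;

        int32_t s = a + b;
        if (s > 127)  s = 127;   /* saturate high */
        if (s < -128) s = -128;  /* saturate low  */

        result |= (uint32_t)(uint8_t)s << shift;
    }
    return result;
}
```

For example, with lanes (127, 1, -1, -128) in x and (1, 2, -1, -1) in y, listed from the top byte down, the top and bottom lanes saturate to 127 and -128 while the middle lanes give 3 and -2.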


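lee_st · 2016-9-22 13:07:41 · 10#

8.2.5 Worked example

Goal:
1. Add data in each of the four data types.

Procedure:
1. Press key K2; the result is printed over the serial port.

Observed result: view the printed output on the PC with the serial terminal program SecureCRT (included on the V5 CD).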


lee_st · 2016-9-22 13:08:00 · 11#

Program:

    /*
    *********************************************************************************************************
    *   Function:    DSP_Add
    *   Description: addition
    *   Parameters:  none
    *   Returns:     none
    *********************************************************************************************************
    */
    static void DSP_Add(void)
    {
        static float32_t pSrcA;
        static float32_t pSrcB;
        static float32_t pDst;

        static q31_t pSrcA1;
        static q31_t pSrcB1;
        static q31_t pDst1;

        static q15_t pSrcA2;
        static q15_t pSrcB2;
        static q15_t pDst2;

        static q7_t pSrcA3;
        static q7_t pSrcB3;
        static q7_t pDst3;

        pSrcA--;
        arm_add_f32(&pSrcA, &pSrcB, &pDst, 1);
        printf("arm_add_f32 = %f\r\n", pDst);

        pSrcA1--;
        arm_add_q31(&pSrcA1, &pSrcB1, &pDst1, 1);
        printf("arm_add_q31 = %d\r\n", pDst1);

        pSrcA2--;
        arm_add_q15(&pSrcA2, &pSrcB2, &pDst2, 1);
        printf("arm_add_q15 = %d\r\n", pDst2);

        pSrcA3--;
        arm_add_q7(&pSrcA3, &pSrcB3, &pDst3, 1);
        printf("arm_add_q7 = %d\r\n", pDst3);
        printf("**********************************\r\n");
    }


lee_st · 2016-9-22 13:08:12 · 12#

8.3 Dot product (Vector Dot Product)

These functions compute the dot product of two vectors:

    sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]

8.3.1 arm_dot_prod_f32

This function computes the dot product of 32-bit floating-point data. The source is:

    /**
     * @defgroup dot_prod Vector Dot Product
     *
     * Computes the dot product of two vectors.
     * The vectors are multiplied element-by-element and then summed.
     *
     *     sum = pSrcA[0]*pSrcB[0] + pSrcA[1]*pSrcB[1] + ... + pSrcA[blockSize-1]*pSrcB[blockSize-1]
     *
     * There are separate functions for floating-point, Q7, Q15, and Q31 data types.
     */

    /**
     * @addtogroup dot_prod
     * @{
     */

    /**
     * @brief Dot product of floating-point vectors.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[in]  blockSize  number of samples in each vector
     * @param[out] *result    output result returned here
     * @return none.
     */
    void arm_dot_prod_f32(
      float32_t * pSrcA,
      float32_t * pSrcB,
      uint32_t blockSize,
      float32_t * result)
    {
      float32_t sum = 0.0f;                 /* Temporary result storage */   /* (1) */
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Calculate dot product and then store the result in a temporary buffer */
        sum += (*pSrcA++) * (*pSrcB++);                               /* (2) */
        sum += (*pSrcA++) * (*pSrcB++);
        sum += (*pSrcA++) * (*pSrcB++);
        sum += (*pSrcA++) * (*pSrcB++);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */

      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Calculate dot product and then store the result in a temporary buffer. */
        sum += (*pSrcA++) * (*pSrcB++);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* Store the result back in the destination buffer */
      *result = sum;
    }

1. The FPU on the CM4 is single precision, so a float32_t constant must be written with a trailing f when it is initialized; without the suffix the literal is a double.
2. A statement of the form sum += (*pSrcA++) * (*pSrcB++) ends up being implemented with the floating-point MAC (multiply-accumulate) operation, which shortens the execution time.


lee_st · 2016-9-22 13:08:23 · 13#

8.3.2 arm_dot_prod_q31

This function computes the dot product of 32-bit fixed-point (Q31) data. The source is:

    /**
     * @brief Dot product of Q31 vectors.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[in]  blockSize  number of samples in each vector
     * @param[out] *result    output result returned here
     * @return none.
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The intermediate multiplications are in 1.31 x 1.31 = 2.62 format and these
     * are truncated to 2.48 format by discarding the lower 14 bits.
     * The 2.48 result is then added without saturation to a 64-bit accumulator in 16.48 format.
     * There are 15 guard bits in the accumulator and there is no risk of overflow as long as
     * the length of the vectors is less than 2^16 elements.
     * The return result is in 16.48 format.
     */
    void arm_dot_prod_q31(
      q31_t * pSrcA,
      q31_t * pSrcB,
      uint32_t blockSize,
      q63_t * result)
    {
      q63_t sum = 0;                        /* Temporary result storage */
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      q31_t inA1, inA2, inA3, inA4;
      q31_t inB1, inB2, inB3, inB4;

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Calculate dot product and then store the result in a temporary buffer. */
        inA1 = *pSrcA++;
        inA2 = *pSrcA++;
        inA3 = *pSrcA++;
        inA4 = *pSrcA++;
        inB1 = *pSrcB++;
        inB2 = *pSrcB++;
        inB3 = *pSrcB++;
        inB4 = *pSrcB++;

        sum += ((q63_t) inA1 * inB1) >> 14u;                          /* (2) */
        sum += ((q63_t) inA2 * inB2) >> 14u;
        sum += ((q63_t) inA3 * inB3) >> 14u;
        sum += ((q63_t) inA4 * inB4) >> 14u;

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */

      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Calculate dot product and then store the result in a temporary buffer. */
        sum += ((q63_t) * pSrcA++ * *pSrcB++) >> 14u;

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* Store the result in the destination buffer in 16.48 format */
      *result = sum;
    }

1. Multiplying two 32-bit Q31 values gives a result in 1.31 x 1.31 = 2.62 format. Real applications rarely need that much precision, so the function discards the low 14 bits; in the code this is the right shift of each product by 14 bits, which leaves a 2.48 value. These values are accumulated in a 64-bit variable, so the final result is in 16.48 format. According to the library comment, there is no risk of overflow as long as fewer than 2^16 products are accumulated. (I do not know why this is not 2^14; I will leave that question for later.)
2. Shift the product right by 14 bits.
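The format bookkeeping is easier to follow with numbers. The sketch below is a simplified model of the arithmetic described above, written for this explanation with hypothetical names (dot_q31_model, q16_48_to_double); it is not the library routine. It assumes the compiler performs an arithmetic right shift on negative 64-bit values, which is what common ARM and PC compilers do.

```c
#include <stdint.h>

/* Hypothetical model of the Q31 dot-product arithmetic:
 * each 2.62 product is shifted right by 14 to 2.48 and
 * accumulated in a 64-bit 16.48 sum. */
static int64_t dot_q31_model(const int32_t *a, const int32_t *b, uint32_t n)
{
    int64_t sum = 0;

    for (uint32_t i = 0; i < n; i++) {
        sum += ((int64_t)a[i] * b[i]) >> 14;   /* 2.62 -> 2.48 */
    }
    return sum;                                /* 16.48 */
}

/* Convert a 16.48 value to double: divide by 2^48. */
static double q16_48_to_double(int64_t v)
{
    return (double)v / 281474976710656.0;
}
```

With a = {0.5, 0.5} and b = {0.5, 0.25} in Q31 (0x40000000 is 0.5, 0x20000000 is 0.25), the model returns 3 * 2^45, which converts to 0.375 as expected. One observation from this model, offered as my own arithmetic rather than anything stated in the library documentation: the largest possible term, (-1.0) x (-1.0), is exactly 2^48 in 16.48 units, and a signed 64-bit accumulator tops out just below 2^63, so in that extreme case there is room for a little under 2^15 terms. Typical data stays far from that corner.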


lee_st · 2016-9-22 13:08:34 · 14#

8.3.3 arm_dot_prod_q15

This function computes the dot product of 16-bit fixed-point (Q15) data. The source is:

    /**
     * @brief Dot product of Q15 vectors.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[in]  blockSize  number of samples in each vector
     * @param[out] *result    output result returned here
     * @return none.
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The intermediate multiplications are in 1.15 x 1.15 = 2.30 format and these
     * results are added to a 64-bit accumulator in 34.30 format.
     * Nonsaturating additions are used and given that there are 33 guard bits in the accumulator
     * there is no risk of overflow.
     * The return result is in 34.30 format.
     */
    void arm_dot_prod_q15(
      q15_t * pSrcA,
      q15_t * pSrcB,
      uint32_t blockSize,
      q63_t * result)
    {
      q63_t sum = 0;                        /* Temporary result storage */
      uint32_t blkCnt;                      /* loop counter */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */   /* (2) */
        /* Calculate dot product and then store the result in a temporary buffer. */
        sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);
        sum = __SMLALD(*__SIMD32(pSrcA)++, *__SIMD32(pSrcB)++, sum);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Calculate dot product and then store the results in a temporary buffer. */
        sum = __SMLALD(*pSrcA++, *pSrcB++, sum);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Calculate dot product and then store the results in a temporary buffer. */
        sum += (q63_t) ((q31_t) * pSrcA++ * *pSrcB++);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */

      /* Store the result in the destination buffer in 34.30 format */
      *result = sum;
    }

1. Multiplying two Q15 values gives a result in 1.15 x 1.15 = 2.30 format. The function accumulates the products in a 64-bit variable, so the output is in 34.30 format and there is essentially no risk of overflow.
2. __SMLALD is also a SIMD instruction. It multiplies two pairs of 16-bit values and adds both products to a 64-bit accumulator.
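The dual multiply-accumulate can be modelled in a few lines of plain C. This is an illustrative model only, with hypothetical names (sext16, smlald_model); the real __SMLALD is a single Cortex-M4 instruction.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v to a 32-bit signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* Hypothetical portable model of __SMLALD(x, y, acc):
 * multiply the low halfwords, multiply the high halfwords,
 * and add both products to the 64-bit accumulator. */
static int64_t smlald_model(uint32_t x, uint32_t y, int64_t acc)
{
    int32_t lo = sext16(x) * sext16(y);
    int32_t hi = sext16(x >> 16) * sext16(y >> 16);

    return acc + lo + hi;
}
```

For instance, if x packs (3, -2) and y packs (4, 5), low halfword first, the call adds 3*4 + (-2)*5 = 2 to the accumulator. This is why the unrolled loop needs only two calls to consume four Q15 samples from each vector.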


lee_st · 2016-9-22 13:08:45 · 15#

8.3.4 arm_dot_prod_q7

This function computes the dot product of 8-bit fixed-point (Q7) data. The source is:

    /**
     * @brief Dot product of Q7 vectors.
     * @param[in]  *pSrcA     points to the first input vector
     * @param[in]  *pSrcB     points to the second input vector
     * @param[in]  blockSize  number of samples in each vector
     * @param[out] *result    output result returned here
     * @return none.
     *
     * Scaling and Overflow Behavior:                                    (1)
     * \par
     * The intermediate multiplications are in 1.7 x 1.7 = 2.14 format and these
     * results are added to an accumulator in 18.14 format.
     * Nonsaturating additions are used and there is no danger of wrap around as long as
     * the vectors are less than 2^18 elements long.
     * The return result is in 18.14 format.
     */
    void arm_dot_prod_q7(
      q7_t * pSrcA,
      q7_t * pSrcB,
      uint32_t blockSize,
      q31_t * result)
    {
      uint32_t blkCnt;                      /* loop counter */
      q31_t sum = 0;                        /* Temporary variables to store output */

    #ifndef ARM_MATH_CM0_FAMILY

      /* Run the below code for Cortex-M4 and Cortex-M3 */
      q31_t input1, input2;                 /* Temporary variables to store input */
      q31_t inA1, inA2, inB1, inB2;         /* Temporary variables to store input */

      /* loop Unrolling */
      blkCnt = blockSize >> 2u;

      /* First part of the processing with loop unrolling.  Compute 4 outputs at a time.
       ** a second loop below computes the remaining 1 to 3 samples. */
      while(blkCnt > 0u)
      {
        /* read 4 samples at a time from sourceA */                   /* (2) */
        input1 = *__SIMD32(pSrcA)++;
        /* read 4 samples at a time from sourceB */
        input2 = *__SIMD32(pSrcB)++;

        /* extract two q7_t samples to q15_t samples */
        inA1 = __SXTB16(__ROR(input1, 8));                            /* (3) */
        /* extract remaining two samples */
        inA2 = __SXTB16(input1);
        /* extract two q7_t samples to q15_t samples */
        inB1 = __SXTB16(__ROR(input2, 8));
        /* extract remaining two samples */
        inB2 = __SXTB16(input2);

        /* multiply and accumulate two samples at a time */
        sum = __SMLAD(inA1, inB1, sum);                               /* (4) */
        sum = __SMLAD(inA2, inB2, sum);

        /* Decrement the loop counter */
        blkCnt--;
      }

      /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
       ** No loop unrolling is used. */
      blkCnt = blockSize % 0x4u;

      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Dot product and then store the results in a temporary buffer. */
        sum = __SMLAD(*pSrcA++, *pSrcB++, sum);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #else

      /* Run the below code for Cortex-M0 */

      /* Initialize blkCnt with number of samples */
      blkCnt = blockSize;

      while(blkCnt > 0u)
      {
        /* C = A[0]* B[0] + A[1]* B[1] + A[2]* B[2] + .....+ A[blockSize-1]* B[blockSize-1] */
        /* Dot product and then store the results in a temporary buffer. */
        sum += (q31_t) ((q15_t) * pSrcA++ * *pSrcB++);

        /* Decrement the loop counter */
        blkCnt--;
      }

    #endif /* #ifndef ARM_MATH_CM0_FAMILY */

      /* Store the result in the destination buffer in 18.14 format */
      *result = sum;
    }

1. Multiplying two Q7 values gives a result in 1.7 x 1.7 = 2.14 format. The final result is stored in a 32-bit variable, so its format is 18.14. According to the library comment, there is no risk of overflow as long as fewer than 2^18 products are accumulated (my feeling is that this should be 2^16).
2. Read four 8-bit values at once.
3. __SXTB16 is also a SIMD instruction; it sign-extends two 8-bit values to 16 bits each. __ROR performs a rotate right.
4. __SMLAD is also a SIMD instruction. It does the following:

    sum = __SMLAD(x, y, z)
    sum = z + ((short)(x>>16) * (short)(y>>16)) + ((short)x * (short)y)
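Steps (2) to (4) can be traced with a portable model. Everything below is a sketch under my own naming (ror32, sxtb16_model, smlad_model, and dot4_q7_model are hypothetical helpers), meant only to show how four Q7 products come out of two dual multiply-accumulates.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by r bits. */
static uint32_t ror32(uint32_t v, unsigned r)
{
    r &= 31u;
    return r ? (v >> r) | (v << (32u - r)) : v;
}

/* Model of __SXTB16: sign-extend byte 0 and byte 2 of v
 * into the low and high 16-bit lanes of the result. */
static uint32_t sxtb16_model(uint32_t v)
{
    uint32_t lo = v & 0xFFu;
    uint32_t hi = (v >> 16) & 0xFFu;

    if (lo & 0x80u) lo |= 0xFF00u;
    if (hi & 0x80u) hi |= 0xFF00u;
    return lo | (hi << 16);
}

/* Sign-extend the low 16 bits of v. */
static int32_t lane16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* Model of __SMLAD(x, y, z): z + x.lo*y.lo + x.hi*y.hi. */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t z)
{
    return z + lane16(x) * lane16(y) + lane16(x >> 16) * lane16(y >> 16);
}

/* Dot product of four packed Q7 samples, mirroring the unrolled loop. */
static int32_t dot4_q7_model(uint32_t a, uint32_t b, int32_t sum)
{
    uint32_t a13 = sxtb16_model(ror32(a, 8));  /* bytes 1 and 3 */
    uint32_t a02 = sxtb16_model(a);            /* bytes 0 and 2 */
    uint32_t b13 = sxtb16_model(ror32(b, 8));
    uint32_t b02 = sxtb16_model(b);

    sum = smlad_model(a13, b13, sum);
    sum = smlad_model(a02, b02, sum);
    return sum;
}
```

Packing the samples (1, -2, 3, -4) into one word and (5, 6, 7, 8) into another, lowest byte first, gives 5 - 12 + 21 - 32 = -18. The rotate by 8 is what brings bytes 1 and 3 into the positions that __SXTB16 extracts.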


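lee_st · 2016-9-22 13:09:06 16^#

8.3.5 Worked example

Objective:
1. Compute the dot product for the four data types.

Procedure:
1. Press key K3; the results are printed over the serial port.

Observed result: the printed output can be viewed with the serial terminal software SecureCRT (included on the V5 CD).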


lee_st · 2016-9-22 13:09:27 17^#

Program design:

/*
*********************************************************************************************************
*    Function name: DSP_DotProduct
*    Description:   Dot product
*    Parameters:    none
*    Return value:  none
*********************************************************************************************************
*/
static void DSP_DotProduct(void)
{
    static float32_t pSrcA[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
    static float32_t pSrcB[5] = {1.0f,1.0f,1.0f,1.0f,1.0f};
    static float32_t result;

    static q31_t pSrcA1[5] = {0x7ffffff0,1,1,1,1};
    static q31_t pSrcB1[5] = {1,1,1,1,1};
    static q63_t result1;

    static q15_t pSrcA2[5] = {1,1,1,1,1};
    static q15_t pSrcB2[5] = {1,1,1,1,1};
    static q63_t result2;

    static q7_t pSrcA3[5] = {1,1,1,1,1};
    static q7_t pSrcB3[5] = {1,1,1,1,1};
    static q31_t result3;

    pSrcA[0] -= 1.1f;
    arm_dot_prod_f32(pSrcA, pSrcB, 5, &result);
    printf("arm_dot_prod_f32 = %f\r\n", result);

    pSrcA1[0] -= 0xffff;
    arm_dot_prod_q31(pSrcA1, pSrcB1, 5, &result1);
    printf("arm_dot_prod_q31 = %lld\r\n", result1);

    pSrcA2[0] -= 1;
    arm_dot_prod_q15(pSrcA2, pSrcB2, 5, &result2);
    printf("arm_dot_prod_q15 = %lld\r\n", result2);

    pSrcA3[0] -= 1;
    arm_dot_prod_q7(pSrcA3, pSrcB3, 5, &result3);
    printf("arm_dot_prod_q7 = %d\r\n", result3);
    printf("**********************************\r\n");
}
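For checking the fixed-point lines of the printout without the board, the following is a rough host-side model of the three fixed-point dot products. The function names (ref_dot_q31, ref_dot_q15, ref_dot_q7) are hypothetical. The model assumes the scaling described in the CMSIS-DSP documentation: arm_dot_prod_q31 discards the low 14 bits of each product before accumulating into 64 bits, while the Q15 and Q7 versions accumulate the raw products. It also assumes that >> on a negative value is an arithmetic shift, which C leaves implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Q31 model: each 2.62 product is truncated to 2.48 (>> 14), then summed in 64 bits. */
int64_t ref_dot_q31(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += ((int64_t)a[i] * b[i]) >> 14;
    }
    return sum;
}

/* Q15 model: raw 2.30 products summed in a 64-bit accumulator (34.30). */
int64_t ref_dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}

/* Q7 model: raw 2.14 products summed in a 32-bit accumulator (18.14). */
int32_t ref_dot_q7(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

With the inputs of the first call to DSP_DotProduct (pSrcA1[0] = 0x7ffffff0 - 0xffff, pSrcA2[0] = 0, pSrcA3[0] = 0, all other elements 1), this model gives 131067, 4 and 4. Because the arrays are static, the first element keeps changing on every later key press, so later printouts differ.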


lee_st · 2016-9-22 13:09:41 18^#

8.4 Multiplication (Vector Multiplication)

These functions perform element-by-element multiplication, described by:

    pDst[n] = pSrcA[n] * pSrcB[n],   0 <= n < blockSize.

8.4.1 arm_mult_f32

This function multiplies two vectors of 32-bit floating-point values. The source code is analyzed below:

/**
 * @brief Floating-point vector multiplication.
 * @param[in]  *pSrcA points to the first input vector
 * @param[in]  *pSrcB points to the second input vector
 * @param[out] *pDst points to the output vector
 * @param[in]  blockSize number of samples in each vector
 * @return none.
 */
void arm_mult_f32(
  float32_t * pSrcA,
  float32_t * pSrcB,
  float32_t * pDst,
  uint32_t blockSize)
{
  uint32_t blkCnt;                               /* loop counters */

#ifndef ARM_MATH_CM0_FAMILY

  /* Run the below code for Cortex-M4 and Cortex-M3 */
  float32_t inA1, inA2, inA3, inA4;              /* temporary input variables */
  float32_t inB1, inB2, inB3, inB4;              /* temporary input variables */
  float32_t out1, out2, out3, out4;              /* temporary output variables */

  /* loop Unrolling */
  blkCnt = blockSize >> 2u;

  /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
   ** a second loop below computes the remaining 1 to 3 samples. */
  while(blkCnt > 0u)
  {
    /* C = A * B */
    /* Multiply the inputs and store the results in output buffer */    /* (1) */
    /* read sample from sourceA */
    inA1 = *pSrcA;
    /* read sample from sourceB */
    inB1 = *pSrcB;
    /* read sample from sourceA */
    inA2 = *(pSrcA + 1);
    /* read sample from sourceB */
    inB2 = *(pSrcB + 1);

    /* out = sourceA * sourceB */
    out1 = inA1 * inB1;

    /* read sample from sourceA */
    inA3 = *(pSrcA + 2);
    /* read sample from sourceB */
    inB3 = *(pSrcB + 2);

    /* out = sourceA * sourceB */
    out2 = inA2 * inB2;

    /* read sample from sourceA */
    inA4 = *(pSrcA + 3);

    /* store result to destination buffer */
    *pDst = out1;

    /* read sample from sourceB */
    inB4 = *(pSrcB + 3);

    /* out = sourceA * sourceB */
    out3 = inA3 * inB3;

    /* store result to destination buffer */
    *(pDst + 1) = out2;

    /* out = sourceA * sourceB */
    out4 = inA4 * inB4;
    /* store result to destination buffer */
    *(pDst + 2) = out3;
    /* store result to destination buffer */
    *(pDst + 3) = out4;

    /* update pointers to process next samples */
    pSrcA += 4u;
    pSrcB += 4u;
    pDst += 4u;

    /* Decrement the blockSize loop counter */
    blkCnt--;
  }

  /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
   ** No loop unrolling is used. */
  blkCnt = blockSize % 0x4u;

#else

  /* Run the below code for Cortex-M0 */

  /* Initialize blkCnt with number of samples */
  blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

  while(blkCnt > 0u)
  {
    /* C = A * B */
    /* Multiply the inputs and store the results in output buffer */
    *pDst++ = (*pSrcA++) * (*pSrcB++);

    /* Decrement the blockSize loop counter */
    blkCnt--;
  }
}

1. The 32-bit floating-point multiplication is straightforward; as before, the samples are processed in groups of four.
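The "groups of four plus a tail loop" pattern from note 1 can be tried on a PC with a small sketch like the one below. The function name mult_f32_unrolled is hypothetical, and this is plain C rather than the CMSIS routine; it only illustrates how the two loop counters split the work.

```c
#include <stddef.h>

/* Element-wise multiply: n >> 2 groups of four, then n % 4 leftover samples. */
void mult_f32_unrolled(const float *a, const float *b, float *dst, size_t n)
{
    size_t blk = n >> 2;                 /* number of full groups of four */

    while (blk > 0u) {
        /* compute four products before storing any of them */
        float o0 = a[0] * b[0];
        float o1 = a[1] * b[1];
        float o2 = a[2] * b[2];
        float o3 = a[3] * b[3];

        dst[0] = o0;
        dst[1] = o1;
        dst[2] = o2;
        dst[3] = o3;

        a += 4;
        b += 4;
        dst += 4;
        blk--;
    }

    blk = n % 4u;                        /* 0 to 3 samples remain */

    while (blk > 0u) {
        *dst++ = (*a++) * (*b++);
        blk--;
    }
}
```

For n = 7, for example, the first loop runs once (samples 0 to 3) and the tail loop handles samples 4 to 6.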


lee_st · 2016-9-22 13:09:56 19^#

8.4.2 arm_mult_q31

This function multiplies two vectors of 32-bit fixed-point (Q31) values. The source code is analyzed below:

/**
 * @brief Q31 vector multiplication.
 * @param[in]  *pSrcA points to the first input vector
 * @param[in]  *pSrcB points to the second input vector
 * @param[out] *pDst points to the output vector
 * @param[in]  blockSize number of samples in each vector
 * @return none.
 *
 * Scaling and Overflow Behavior:                                        (1)
 * \par
 * The function uses saturating arithmetic.
 * Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] will be saturated.
 */
void arm_mult_q31(
  q31_t * pSrcA,
  q31_t * pSrcB,
  q31_t * pDst,
  uint32_t blockSize)
{
  uint32_t blkCnt;                               /* loop counters */

#ifndef ARM_MATH_CM0_FAMILY

  /* Run the below code for Cortex-M4 and Cortex-M3 */
  q31_t inA1, inA2, inA3, inA4;                  /* temporary input variables */
  q31_t inB1, inB2, inB3, inB4;                  /* temporary input variables */
  q31_t out1, out2, out3, out4;                  /* temporary output variables */

  /* loop Unrolling */
  blkCnt = blockSize >> 2u;

  /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
   ** a second loop below computes the remaining 1 to 3 samples. */
  while(blkCnt > 0u)
  {
    /* C = A * B */
    /* Multiply the inputs and then store the results in the destination buffer. */
    inA1 = *pSrcA++;
    inA2 = *pSrcA++;
    inA3 = *pSrcA++;
    inA4 = *pSrcA++;
    inB1 = *pSrcB++;
    inB2 = *pSrcB++;
    inB3 = *pSrcB++;
    inB4 = *pSrcB++;

    out1 = ((q63_t) inA1 * inB1) >> 32;                                  /* (2) */
    out2 = ((q63_t) inA2 * inB2) >> 32;
    out3 = ((q63_t) inA3 * inB3) >> 32;
    out4 = ((q63_t) inA4 * inB4) >> 32;

    out1 = __SSAT(out1, 31);                                             /* (3) */
    out2 = __SSAT(out2, 31);
    out3 = __SSAT(out3, 31);
    out4 = __SSAT(out4, 31);

    *pDst++ = out1 << 1u;                                                /* (4) */
    *pDst++ = out2 << 1u;
    *pDst++ = out3 << 1u;
    *pDst++ = out4 << 1u;

    /* Decrement the blockSize loop counter */
    blkCnt--;
  }

  /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
   ** No loop unrolling is used. */
  blkCnt = blockSize % 0x4u;

#else

  /* Run the below code for Cortex-M0 */

  /* Initialize blkCnt with number of samples */
  blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

  while(blkCnt > 0u)
  {
    /* C = A * B */
    /* Multiply the inputs and then store the results in the destination buffer. */
    *pDst++ = (q31_t) clip_q63_to_q31(((q63_t) (*pSrcA++) * (*pSrcB++)) >> 31);

    /* Decrement the blockSize loop counter */
    blkCnt--;
  }
}

1. This function uses saturating arithmetic. The result is in Q31 format, with range [0x80000000 0x7FFFFFFF].
2. The 64-bit product is shifted right by 32 bits, keeping only its upper 32 bits.
3. The value is saturated to 31 bits.
4. The value is shifted left by one bit so that the result is in Q31 format.
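The two code paths above are not bit-identical, which a small host-side model makes visible. The sketch below is only an illustration: mult_q31_fast mirrors the unrolled path (>> 32, saturate to 31 bits, << 1) and mult_q31_ref mirrors the tail loop (>> 31, then clip to 32 bits). Both names and the ssat helper are hypothetical, and the model assumes >> on negative values is an arithmetic shift.

```c
#include <stdint.h>

/* Saturate x to a signed value of the given bit width (1..32); models __SSAT. */
static int32_t ssat(int32_t x, unsigned bits)
{
    int32_t max = (int32_t)((1u << (bits - 1u)) - 1u);
    int32_t min = -max - 1;

    if (x > max) return max;
    if (x < min) return min;
    return x;
}

/* Unrolled path: keep the top 32 bits, saturate to 31 bits, scale back by 2. */
int32_t mult_q31_fast(int32_t a, int32_t b)
{
    int32_t hi = (int32_t)(((int64_t)a * b) >> 32);

    hi = ssat(hi, 31);
    return hi * 2;          /* same value as hi << 1, without shifting a negative */
}

/* Tail-loop path: shift the 2.62 product to 1.31, then clip to 32 bits. */
int32_t mult_q31_ref(int32_t a, int32_t b)
{
    int64_t p = ((int64_t)a * b) >> 31;

    if (p > INT32_MAX) return INT32_MAX;
    if (p < INT32_MIN) return INT32_MIN;
    return (int32_t)p;
}
```

Under these assumptions, 0.5 x 0.5 (0x40000000 x 0x40000000) gives 0x20000000 on both paths, but the unrolled path always produces an even number: for (-1.0) x (-1.0) it returns 0x7FFFFFFE where the tail loop returns 0x7FFFFFFF, and for 3 x 0x40000000 it returns 0 where the tail loop returns 1.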


lee_st · 2016-9-22 13:10:13 20^#

8.4.3 arm_mult_q15

This function multiplies two vectors of 16-bit fixed-point (Q15) values. The source code is analyzed below:

/**
 * @brief Q15 vector multiplication
 * @param[in]  *pSrcA points to the first input vector
 * @param[in]  *pSrcB points to the second input vector
 * @param[out] *pDst points to the output vector
 * @param[in]  blockSize number of samples in each vector
 * @return none.
 *
 * Scaling and Overflow Behavior:                                        (1)
 * \par
 * The function uses saturating arithmetic.
 * Results outside of the allowable Q15 range [0x8000 0x7FFF] will be saturated.
 */
void arm_mult_q15(
  q15_t * pSrcA,
  q15_t * pSrcB,
  q15_t * pDst,
  uint32_t blockSize)
{
  uint32_t blkCnt;                               /* loop counters */

#ifndef ARM_MATH_CM0_FAMILY

  /* Run the below code for Cortex-M4 and Cortex-M3 */
  q31_t inA1, inA2, inB1, inB2;                  /* temporary input variables */
  q15_t out1, out2, out3, out4;                  /* temporary output variables */
  q31_t mul1, mul2, mul3, mul4;                  /* temporary variables */

  /* loop Unrolling */
  blkCnt = blockSize >> 2u;

  /* First part of the processing with loop unrolling. Compute 4 outputs at a time.
   ** a second loop below computes the remaining 1 to 3 samples. */
  while(blkCnt > 0u)
  {
    /* read two samples at a time from sourceA */
    inA1 = *__SIMD32(pSrcA)++;                                           /* (2) */
    /* read two samples at a time from sourceB */
    inB1 = *__SIMD32(pSrcB)++;
    /* read two samples at a time from sourceA */
    inA2 = *__SIMD32(pSrcA)++;
    /* read two samples at a time from sourceB */
    inB2 = *__SIMD32(pSrcB)++;

    /* multiply mul = sourceA * sourceB */
    mul1 = (q31_t) ((q15_t) (inA1 >> 16) * (q15_t) (inB1 >> 16));        /* (3) */
    mul2 = (q31_t) ((q15_t) inA1 * (q15_t) inB1);
    mul3 = (q31_t) ((q15_t) (inA2 >> 16) * (q15_t) (inB2 >> 16));
    mul4 = (q31_t) ((q15_t) inA2 * (q15_t) inB2);

    /* saturate result to 16 bit */
    out1 = (q15_t) __SSAT(mul1 >> 15, 16);                               /* (4) */
    out2 = (q15_t) __SSAT(mul2 >> 15, 16);
    out3 = (q15_t) __SSAT(mul3 >> 15, 16);
    out4 = (q15_t) __SSAT(mul4 >> 15, 16);

    /* store the result */
#ifndef ARM_MATH_BIG_ENDIAN
    *__SIMD32(pDst)++ = __PKHBT(out2, out1, 16);                         /* (5) */
    *__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#else
    *__SIMD32(pDst)++ = __PKHBT(out2, out1, 16);
    *__SIMD32(pDst)++ = __PKHBT(out4, out3, 16);
#endif // #ifndef ARM_MATH_BIG_ENDIAN

    /* Decrement the blockSize loop counter */
    blkCnt--;
  }

  /* If the blockSize is not a multiple of 4, compute any remaining output samples here.
   ** No loop unrolling is used. */
  blkCnt = blockSize % 0x4u;

#else

  /* Run the below code for Cortex-M0 */

  /* Initialize blkCnt with number of samples */
  blkCnt = blockSize;

#endif /* #ifndef ARM_MATH_CM0_FAMILY */

  while(blkCnt > 0u)
  {
    /* C = A * B */
    /* Multiply the inputs and store the result in the destination buffer */
    *pDst++ = (q15_t) __SSAT((((q31_t) (*pSrcA++) * (*pSrcB++)) >> 15), 16);

    /* Decrement the blockSize loop counter */
    blkCnt--;
  }
}

1. This function uses saturating arithmetic. The result is in Q15 format, with range [0x8000 0x7FFF].
2. Two Q15 values are read at a time.
3. The four products are stored in the Q31 variables mul1, mul2, mul3 and mul4.
4. The low 15 bits of each 32-bit product are discarded, and the result is saturated to 16 bits.
5. The SIMD instruction __PKHBT packs two Q15 results into one 32-bit word, so a single 32-bit write stores two values in the output array.
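Notes 3 to 5 can be modeled for a single pair of samples with a few lines of portable C. The names mult_q15, pack_q15_pair and ssat below are hypothetical stand-ins for the per-element arithmetic and for the packing done by __PKHBT(lo, hi, 16); the sketch assumes >> on a negative value is an arithmetic shift.

```c
#include <stdint.h>

/* Saturate x to a signed value of the given bit width (1..32); models __SSAT. */
static int32_t ssat(int32_t x, unsigned bits)
{
    int32_t max = (int32_t)((1u << (bits - 1u)) - 1u);
    int32_t min = -max - 1;

    if (x > max) return max;
    if (x < min) return min;
    return x;
}

/* One Q15 multiply: 1.15 x 1.15 = 2.30, drop the low 15 bits, saturate to 16 bits. */
int16_t mult_q15(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;

    return (int16_t)ssat(prod >> 15, 16);
}

/* Pack two Q15 results into one word: lo in bits 0..15, hi in bits 16..31. */
uint32_t pack_q15_pair(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}
```

The only input pair that actually needs the saturation is (-1.0) x (-1.0): 0x8000 x 0x8000 gives 32768 after the shift, which is clamped to 0x7FFF.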

