Hi everyone, I'm using a PIC32MZ2048EFH144 running at 200 MHz. I need to implement a number of float multiplications in my code, and they take much longer than they should. As mentioned in the datasheet, this MCU has a hardware single-cycle multiplier and also an FPU. How long should a 32-bit float by 32-bit float multiplication take? The sample code I used is as follows:

    float a, b, c;
    LED1 = 1;
    c = a * b;
    LED1 = 0;

In my test, the multiplication above took about 140 ns (each cycle is 5 ns), so the code takes about 70 cycles to run. What am I doing wrong?
5 answers
140 / 5 = 28, not 70. The extra time is probably spent moving the variables to and from the FPU, plus the peripheral access time to turn the LED off.
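The arithmetic in this reply can be written out as a small helper. This is only a sketch: the name `ns_to_cycles` is made up for illustration, and it assumes the clock is given in whole MHz.

```c
#include <stdint.h>

/* Convert an elapsed time in nanoseconds to clock cycles.
 * cycles = t_ns * f_MHz / 1000, because 1 MHz is one cycle per 1000 ns.
 * The division truncates, so partial cycles are dropped.
 * (Hypothetical helper, not part of any Microchip library.) */
static uint32_t ns_to_cycles(uint32_t elapsed_ns, uint32_t clock_mhz)
{
    return (uint32_t)(((uint64_t)elapsed_ns * clock_mhz) / 1000u);
}
```

With the numbers from the question, `ns_to_cycles(140, 200)` gives 28. Seventy cycles at 200 MHz would correspond to 350 ns, not 140 ns.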
Do you have optimizations turned on? Full IEEE floating-point support usually adds some overhead. GCC has some optimization switches (e.g. -ffast-math, -funsafe-math-optimizations) that allow better optimization but can lead to unexpected results.
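One way to see why such switches can give unexpected results: floating-point addition is not associative, and -ffast-math allows GCC to regroup operations as if it were. The sketch below does the regrouping by hand rather than relying on any compiler flag. The function names and the test values are chosen purely for illustration.

```c
/* Sum three floats with two different groupings.
 * The volatile temporaries force each intermediate result to be
 * rounded to float and keep the compiler from folding the sums. */
static float sum_left_first(float a, float b, float c)
{
    volatile float t = a + b;   /* (a + b) is evaluated first */
    return t + c;
}

static float sum_right_first(float a, float b, float c)
{
    volatile float t = b + c;   /* (b + c) is evaluated first */
    return a + t;
}
```

For a = 1e20f, b = -1e20f, c = 1.0f the first grouping returns 1.0f, while the second returns 0.0f because the 1.0f is lost when it is added to -1e20f. A compiler that is allowed to reassociate may pick either form.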
Are you looking at this on the first run only? The cache should be filling the first time around, and subsequent runs should then be much faster.
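To separate a cold first run from warm repeats, the same code can be timed several times, keeping the first result and the best result apart. The sketch below assumes a free-running tick counter is available through a function pointer. Every name in it is hypothetical, and the `sim_*` stand-ins use arbitrary tick values (100 cold, 10 warm) only so the sketch can run off-target. They are not measurements.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*counter_fn)(void);   /* returns a free-running tick count */
typedef void (*work_fn)(void);          /* the code under test */

typedef struct {
    uint32_t first;   /* ticks for the first (cold) run */
    uint32_t best;    /* fewest ticks seen over all runs */
} run_timing;

/* Time `work` `runs` times and report the first and the best result. */
static run_timing time_runs(counter_fn now, work_fn work, size_t runs)
{
    run_timing t = { 0u, UINT32_MAX };
    for (size_t i = 0; i < runs; i++) {
        uint32_t start = now();
        work();
        uint32_t ticks = now() - start;  /* unsigned math tolerates wraparound */
        if (i == 0) {
            t.first = ticks;
        }
        if (ticks < t.best) {
            t.best = ticks;
        }
    }
    return t;
}

/* Host-only stand-ins with made-up costs: 100 ticks cold, 10 ticks warm. */
static uint32_t sim_ticks;
static int sim_warm;
static uint32_t sim_now(void) { return sim_ticks; }
static void sim_work(void)
{
    sim_ticks += sim_warm ? 10u : 100u;
    sim_warm = 1;
}
```

On real hardware, `sim_now` and `sim_work` would be replaced by a real counter read and the code being measured. A large gap between `first` and `best` would point to a warm-up effect such as cache fill.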
Well, it should take one cycle. But (a big but) the code sequence between the instruction that turns on the LED and the instruction that turns off the LED is more than one machine instruction.

Step 1: Create a .lst file and count the instructions. You can look at instructions a couple of ways in MPLABX, but I like to have a real document that I can inspect and copy/paste sections from to send to others to explain (or to have them explain to me) what's happening. Here's how I create a .lst file in MPLABX: In Project Properties, in the Building box, check the "Execute this line after build" box. Windows users paste the following into the next box (all on one line):

    ${MP_CC_DIR}\xc32-objdump -S ${ImageDir}/${ProjectName}.${IMAGE_TYPE}.elf > ${ImageDir}/${ProjectName}.${IMAGE_TYPE}.lst

Linux users change that backslash to a forward slash.
Step 2: Now build the project, and you will find a .lst file in the same directory that contains the .hex file. I like to do it this way so that I can compare results from different configurations.

Step 3: Now that you have a .lst file (with the source code interspersed among the assembly code), look for the place where you do the multiplication. Count the instructions between the place where the LED is turned on and the place where it is turned off. With XC32 version 1.44 and optimization level set to zero, here's that section (fx, fy, and fz were declared volatile floats, and LED2 was defined to be a particular latch bit of a particular port):

    LED2 = 1;
    9d001684: 3c03bf86  lui   v1,0xbf86
    9d001688: 94620030  lhu   v0,48(v1)
    9d00168c: 24040001  li    a0,1
    9d001690: 7c820844  ins   v0,a0,0x1,0x1
    9d001694: a4620030  sh    v0,48(v1)
    fz = fx * fy;
    9d001698: 8f838050  lw    v1,-32688(gp)
    9d00169c: 8f828048  lw    v0,-32696(gp)
    9d0016a0: 44830000  mtc1  v1,$f0
    9d0016a4: 44820800  mtc1  v0,$f1
    9d0016a8: 46010002  mul.s $f0,$f0,$f1
    9d0016ac: 44020000  mfc1  v0,$f0
    9d0016b0: af82804c  sw    v0,-32692(gp)
    LED2 = 0;
    9d0016b4: 3c03bf86  lui   v1,0xbf86
    9d0016b8: 94620030  lhu   v0,48(v1)
    9d0016bc: 7c020844  ins   v0,zero,0x1,0x1
    9d0016c0: a4620030  sh    v0,48(v1)

That is eleven instructions from the time the LED is turned on (the first sh instruction) until the LED is turned off (the second sh instruction). Set the optimization level to 1 and you can eliminate four of the machine instructions in the instrumented sequence. I think it's kind of cool to look at stuff like this.

Anyhow, everything up until now has been completely logical and deterministic, but here's the biggie: because of wait states, pipelining, and instruction and data caches, it's not so easy to compute run time directly from an instruction count. (At least, I haven't found a golden rule for this.)
[/Begin Edit] Also, using a port's SET, CLR, and INV registers, rather than explicitly setting, clearing, and inverting bits of a LAT register, is not only more efficient in terms of number of instructions and run time, it is also atomic, allowing LEDs to be set and cleared in interrupt routines without disrupting other bits on the same port. I suggest that you get used to doing things the "Big Boy" way with your PIC32. You will thank me later. [/End Edit]

Also, due to pipelining and all of the rest, the number of instruction cycles can change from one section of code to another, depending on, say, whether the code is in a tight loop or in a context completely disconnected from other uses. An optimizing compiler can change things in such a way that the timing is even less predictable from just looking at the source code.

Bottom line, which might (or, maybe, not) help you get past this point and on to your application: here's my ROTTIWAGOS "Rule Of Thumb --- Take It With A Grain Of Salt". If the actual run time corresponds to a number of machine cycles between two and three times the number of instructions, I accept it as "normal" and get on with my life. That's assuming there are no interrupt routines taking a meaningful number of cycles, and, in this case, assuming that the peripheral bus clock has not been slowed down significantly from its default value of Fsys/2. The 140 ns I got with the above code corresponds to 140e-9 * 200e6 = 28 machine cycles, or approximately 2.54 times the number of instructions.

Tested with my PIC32MZ2048EF PIM on my Explorer 16/32 board. MPLABX version 4.05, XC32 version 1.44.

Note that for this particular simple test, changing the number of wait states from the default (7) to two (the minimum allowed for a 200 MHz system clock) didn't change the timing, but I do this as a matter of course. Turning on the prefetch operation did not change the timing either, but I usually do this also.
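The SET/CLR/INV suggestion from the edit note might look like the sketch below. It assumes the LED sits on bit 1 of port A, which is a hypothetical choice for illustration and should be adjusted to the actual board. On an XC32 build the register names come from `<xc.h>`. The `#else` branch defines plain variables with the same names only so the sketch compiles off-target. They do not emulate the hardware.

```c
#include <stdint.h>

#ifdef __XC32
#include <xc.h>   /* device header: declares LATASET, LATACLR, LATAINV */
#else
/* Host stand-ins: record the last mask written, nothing more. */
static volatile uint32_t LATASET, LATACLR, LATAINV;
#endif

#define LED_MASK (1u << 1)   /* hypothetical: LED on bit 1 of port A */

/* Each helper is a single store of a mask. The hardware applies it to
 * the latch, so no read-modify-write of LATA happens in software. */
static inline void led_on(void)     { LATASET = LED_MASK; }
static inline void led_off(void)    { LATACLR = LED_MASK; }
static inline void led_toggle(void) { LATAINV = LED_MASK; }
```

Compared with the lhu/ins/sh sequence in the listing, each helper should compile to loading an address and a mask followed by one store. Fewer instructions then sit between the two LED edges being measured.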
[/Begin Disclaimer] Although I have completed a couple of PIC32MX projects (performance wasn't an issue; the '32MX was just loafing along), I don't have real project experience with a 'MZ device. I'm just trying to familiarize myself. My "Rule of Thumb" may not be very good for all applications, but I have seen it enough times to ease my worries. If performance is really, really, really critical, well, measurement always trumps abstractions, and, in particular, it outvotes other people's opinions/guesses. See Footnote. [/End Disclaimer]

Regards,
Dave

Footnote: "Do your own research." ---Richard Feynman
As suggested, even in a tight loop you can only toggle a pin at something like 30-ish MHz. When I do timing I use the core timer: get the value at the start and then at the end of whatever you are measuring. Then I use a debug printf to tell me how many core cycles were used. HTH
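A minimal sketch of the core-timer approach follows. It assumes XC32's `_CP0_GET_COUNT()` macro for reading the CP0 Count register, and it assumes the core timer advances once every two system clock cycles. That ratio should be checked against the datasheet for the part in use. The helper names are made up for illustration.

```c
#include <stdint.h>

#ifdef __XC32
#include <xc.h>   /* provides _CP0_GET_COUNT() */
/* Read the free-running core timer (CP0 Count register). */
static inline uint32_t core_ticks(void) { return _CP0_GET_COUNT(); }
#endif

/* Assumption: one core timer tick equals two SYSCLK cycles. */
#define SYSCLK_PER_CORE_TICK 2u

/* Convert two core timer readings to elapsed SYSCLK cycles.
 * Unsigned subtraction gives the right answer even if the counter
 * wrapped once between the two readings. */
static uint32_t elapsed_sysclk(uint32_t start_ticks, uint32_t end_ticks)
{
    return (end_ticks - start_ticks) * SYSCLK_PER_CORE_TICK;
}

/* On target, usage would look roughly like this:
 *
 *     uint32_t t0 = core_ticks();
 *     c = a * b;
 *     uint32_t t1 = core_ticks();
 *     printf("%lu cycles\n", (unsigned long)elapsed_sysclk(t0, t1));
 *
 * The two counter reads add their own overhead. Timing an empty
 * section first and subtracting that result is one way to estimate it.
 */
```

Because the counter ticks at half the system clock under this assumption, the result has a resolution of two cycles. Timing a loop of many multiplications and dividing gives a finer estimate.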