Hi,

I've read the posts about PIC32 speed and still don't know why I'm not getting more performance out of it. I need to send data/clock signals to an Adafruit RGB LED display. I bought 2 units and plugged them together to form a 32 x 128 display. I need to send each frame 8 times, once for each color bit, so I need as high a data rate as I can get.

I have a PIC32MZ2048EFH144 on a PIC32MZ Starter Kit and I know it's running at 200MHz. I've verified it with the CLKO/OSC2 pin, which runs at PBCLK1/2.

I have this code for the prefetch/wait states:

    unsigned int sysclk = 200000000;
    PRECONbits.PREFEN = 0b11;        // enable pre-fetch for cacheable and non-cacheable regions
    if (sysclk <= 40000000)
        PRECONbits.PFMWS = 0;        // zero wait states for f<40MHz
    else if (sysclk <= 80000000)
        PRECONbits.PFMWS = 1;
    else if (sysclk <= 100000000)
        PRECONbits.PFMWS = 1;
    else
        PRECONbits.PFMWS = 2;

I have this asm function that is the fastest possible code, but it should be running faster:

    int writeData(FRAME_BUFFER_DATA_TYPE *data, int len)
    {
        /******************************************************
         *
         * Write pixel data to RGB LED display
         *
         * *data will be loaded into register a0
         * len will be loaded into a1
         *
         * - write 1 byte of data to LATE
         *     bit    7  6  5  4  3  2  1  0
         *     color  x  x  B2 B1 G2 G1 R2 R1
         *   Note: 1=rows 0-15, 2=rows 16-31
         * - turn clock on (bit 7)
         * - turn clock off (bit 7)
         * - increment pointer for next column of data
         * - repeat number of times specified by len
         *
         *****************************************************/
        // This code combines clock/data I/O. Clock is on bit 7 since only 6 data bits are needed
        _mtc0(25, 0, 18 << 5);
        asm volatile (".set noreorder");           // stops optimizer messing with the following code (Important, esp. at -O1 and above)
        asm volatile ("ori $t0, $0, 0x80");        // load 0x80 (bit 7) into t0 - used for LATESET & LATECLR
        asm volatile ("addu $a1, $a0, $a1");       // add the pointer address to len to determine the stop point
        asm volatile ("lui $v1, 0xBF86");          // load v1 with I/O port base address BF86h
        asm volatile ("mtc0 $0, $25, 1");          // clear stall counter
        asm volatile (".LOOP1:");                  // bne (branch not equal) loops here
        asm volatile ("lbu $v0, 0($a0)");          // v0 = *a0   load v0 with frame buffer byte pointed to by a0
        asm volatile ("sb $v0, 0x430($v1)");       // *v1 = v0   store frame buffer byte at LATE, BF86h base + 430h offset; also turns off clock bit
        asm volatile ("addiu $a0, $a0, 1");        // a0++       increment a0
        asm volatile ("bne $a0, $a1, .LOOP1");     // if (a0!=a1), goto .LOOP1
        asm volatile ("sw $t0, 0x438($v1)");       // RE7 on     write t0 (value 0x80) to LATESET, BF86h base + 438h offset (branch delay slot)
        asm volatile ("sw $t0, 0x434($v1)");       // RE7 off    write t0 (value 0x80) to LATECLR, BF86h base + 434h offset so clock isn't left on
        asm volatile ("mfc0 $v0, $25, 1");         // read stall counter into v0 for return value
        asm volatile (".set reorder");             // re-enable optimizer
    }

There are only 5 instructions in the loop. The disassembly is identical to the code. I read that the 32MZ can't toggle a port I/O any faster than 25nS (thanks Microchip! The MX can do it in half the time!). If that's the case and I'm writing twice in the loop, then at worst case, with 5nS per other instruction, it shouldn't be any more than (2 x 25) + (3 x 5) = 65nS, but it's 90nS.

If anyone knows why this isn't running any faster, please help. I even tried making this function a __longramfunc__, but then the compiler writes all NOPs for the code (wha?!?).

Please help and thanks in advance!
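To put the numbers from this post side by side, here is a minimal C sketch of the timing arithmetic. The function names are hypothetical, and the bytes-per-frame figure rests on one reading of the post: each byte carries one column for two rows (rows 0-15 and 16-31), so one bit plane of the 32 x 128 display is 16 x 128 bytes. Row latching and any other per-row overhead are ignored.

```c
#include <stdint.h>

/* Loop time estimated in the post: two port writes plus three other instructions. */
uint32_t estimated_loop_ns(uint32_t port_write_ns, uint32_t instr_ns)
{
    return 2u * port_write_ns + 3u * instr_ns;
}

/* Data clock in kHz for a given loop period in ns (1 / period, truncated). */
uint32_t data_clock_khz(uint32_t period_ns)
{
    return 1000000u / period_ns;
}

/* Bytes shifted out per frame, assuming two rows share each byte (assumption). */
uint32_t bytes_per_frame(uint32_t rows, uint32_t cols, uint32_t bit_planes)
{
    return (rows / 2u) * cols * bit_planes;
}

/* Time to shift out a whole frame, in microseconds (truncated). */
uint32_t frame_time_us(uint32_t bytes, uint32_t period_ns)
{
    return (bytes * period_ns) / 1000u;
}
```

With the figures above, the 65 ns estimate corresponds to roughly a 15.4 MHz data clock, while the measured 90 ns corresponds to roughly 11.1 MHz.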
10 replies
What is the peripheral bus clock speed for the bus that serves the device's ports?
Thanks for the response. I don't touch it in my program, so it'll be at its default of divide by 2. Another thing I noticed is that it makes no difference whether I set the wait states to 2 or 7; I still get the same timing. But I'm fairly certain the setting is taking effect because, if I set it to 0 or 1, my CPU stops working. I know this is a difficult problem, but I need this to operate a little faster. Are my expectations unrealistic? Is there another MPU that would work better/faster?
I never trust "default" settings. Try actually setting it.
Keep in mind that the MZ has a multi-layer system bus, so the reads and writes to the bus may be held up by other activity. The CPU can be given priority on the bus by setting the CFGCON.CPUPRI bit. Also, if the loop accesses data that isn't in the cache, a cache fill has to fill an entire line, and that can delay when the CPU can get the data to then write it back out.

Yes, the MX can do port I/O faster, but the tradeoff is that the MX cannot do multiple simultaneous bus transactions the way the MZ can. Between all of the MZ bus masters, there can be up to 7 simultaneous reads/writes (four for Flash, RAM (x2), and EBI, two for DMA accesses, and one for the CPU). When all 7 are going, that means there can be (let's assume a 100MHz interface to peripherals) 2.8 GB/s or more. Actually handling that much data flow is another matter, though.

Have you tried DMA? DMA doesn't have the cache issue that the CPU does, and it can buffer read data while transferring it.
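The 2.8 GB/s figure can be reproduced with a one-line calculation. This is only a sketch: the function name is hypothetical, and the 4 bytes per transfer (a 32-bit wide path) is an assumption chosen because it matches the quoted number; the post itself does not state the width.

```c
#include <stdint.h>

/* Aggregate bandwidth in MB/s for N simultaneous transfers on a bus
 * clocked at bus_mhz, moving bytes_per_transfer bytes each cycle. */
uint32_t aggregate_mb_per_s(uint32_t transfers, uint32_t bus_mhz,
                            uint32_t bytes_per_transfer)
{
    return transfers * bus_mhz * bytes_per_transfer;
}
```

For example, 7 transfers at 100 MHz with 4 bytes each gives 2800 MB/s.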
qhb, I'll try setting it tonight. I'm not holding my breath, though.

Larry, I tried DMA/PMP and it wouldn't go any faster than about 8MHz no matter what settings I used. That's why I'm going back to bit-banging it. If there's code that you know of that does it faster, please let me know. DMA/PMP also seemed to offset/lose data unexpectedly, i.e. the pixel at 0,0 would be at 2,0 and the remaining pixels at the end would be clipped. Other than the timer and its interrupt, for the bit-banging code, there's nothing else going on. I'll eventually be reading graphics from an SD card, so I want it as efficient as possible.
I wonder if you're running into the same problem that the guy who's been doing the OpenScope ran into. He has DMA pumping the I/O to run a waveform generator, but found that he couldn't do a full 64K transfer, because the DMA went into a weird state and delayed resetting after the transfer. He had to back off (i.e. do less than 64K), and then it was fine.
Hi,

In the pipeline of the MIPS processor, there will be a pipeline stall if you do an operation and use the result in the next instruction. I think that may happen when the pixel counter is incremented and then tested in the next instruction:

    asm volatile ("addiu $a0, $a0, 1");        // a0++       increment a0
    asm volatile ("bne $a0, $a1, .LOOP1");     // if (a0!=a1), goto .LOOP1

When the compiler generates code, it will try to reorder and stagger instructions to avoid such stalls. It may not help much, but you may try to reorder a little:

    asm volatile ("addiu $a0, $a0, 1");        // a0++       increment a0
    asm volatile ("sb $v0, 0x430($v1)");       // *v1 = v0   store frame buffer byte at LATE, BF86h base + 430h offset; also turns off clock bit
    asm volatile ("bne $a0, $a1, .LOOP1");     // if (a0!=a1), goto .LOOP1

The next step would then be to unroll the loop a little, so that there is more time between reading a pixel value from memory and needing that value for the transfer to the latch SFR. Set it up by reading the first pixel before the loop starts. In the loop, read the 2nd pixel into a different register before displaying the 1st pixel. Next, read the 3rd pixel into the first register before displaying the 2nd pixel. Then do the loop control and jump back to the loop start. There may be some tidying up to do after the loop, or just be aware that a byte outside the current line of the frame buffer has been loaded but will not be displayed.

It is possible to take this further, to also even out the cache fill delay mentioned by Larry Standage, but that would take a lot more instructions in the loop.

Regards,
Mysil
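The read-ahead scheme described in this reply can be sketched in C as follows. This is an illustration of the ordering only, not the poster's code: `port_write`, `sim_trace` and `write_data_pipelined` are hypothetical names, and the port is replaced by a host-side trace buffer so the sketch runs off-target. On the PIC32 the writes would go to the LATE registers, and an optimizing compiler is free to reorder the load, so the real thing would likely stay in assembly.

```c
#include <stddef.h>
#include <stdint.h>

#define CLOCK_BIT 0x80u   /* bit 7 carries the display clock, as in the original routine */

/* Host-side stand-in for the port latch (assumption, for illustration only). */
static uint8_t sim_trace[64];
static size_t  sim_count;

static void port_write(uint8_t value)
{
    if (sim_count < sizeof sim_trace)
        sim_trace[sim_count++] = value;
}

/* The load for pixel i+1 is issued before pixel i is written out. */
void write_data_pipelined(const uint8_t *data, size_t len)
{
    if (len == 0)
        return;

    uint8_t cur = data[0];                        /* first pixel read before the loop */
    for (size_t i = 1; i < len; i++) {
        uint8_t next = data[i];                   /* read the next pixel into a second variable */
        port_write(cur);                          /* data out, clock low */
        port_write((uint8_t)(cur | CLOCK_BIT));   /* clock high */
        cur = next;
    }
    port_write(cur);                              /* last pixel, clock low */
    port_write((uint8_t)(cur | CLOCK_BIT));       /* clock high */
    port_write((uint8_t)(cur & 0x7Fu));           /* leave the clock low on exit */
}
```

Handling the last pixel outside the loop avoids the read past the end of the line that the reply warns about.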
How is your frame buffer defined? Have you used the coherent attribute? If so, that explains the speed. When I define the data to be written as coherent, I get an 11MHz data clock (~90ns period), the same as you.

Note that you don't need the coherent attribute unless you are using DMA! For bit-banging it is fine not to use coherent. Without the coherent attribute on the data, the code clocks the data out at 25MHz.

Note that if you do enable caching on the data, the CPU needs to go out over the system bus far fewer times, as reads will pull more than 1 byte at a time into cache, so you will get fewer stalls.

Unfortunately, Mysil's idea of interleaving reads and writes doesn't work, as the stalls on reads from uncached memory are not maskable on an in-order machine. The whole pipeline just stalls; it can't do anything else whilst the memory read is being carried out (unlike OOO machines).

edit: Just to add, you would also get a speedup with a coherent buffer if you read in a 32-bit word at a time, then rotated and wrote out each byte in turn. I haven't tested that though.
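The untested suggestion in the edit (one 32-bit read, then four byte writes) might look roughly like the sketch below. The names `latch_write`, `word_trace` and `write_data_words` are hypothetical, the port is again a host-side trace buffer, and clock toggling is left out to keep the focus on the read pattern. Shifting out the low byte first matches memory order on a little-endian core such as the PIC32.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the port latch (assumption, for illustration only). */
static uint8_t word_trace[64];
static size_t  word_count;

static void latch_write(uint8_t value)
{
    if (word_count < sizeof word_trace)
        word_trace[word_count++] = value;
}

/* One bus read fetches four pixel bytes; they are then shifted out low byte first. */
void write_data_words(const uint32_t *words, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = words[i];                   /* single 32-bit read from the frame buffer */
        for (int b = 0; b < 4; b++) {
            latch_write((uint8_t)(w & 0xFFu));   /* lowest byte goes to the port */
            w >>= 8;                             /* move the next byte into place */
        }
    }
}
```

The point of the design is that an uncached buffer costs one system-bus read per four pixels instead of one per pixel.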
simong, you're my freakin' hero!! Yes!! I've struggled for days on this!! I took the coherent out of main.h when I first read your post, but didn't take the one out of main.c (it's declared extern), and it didn't change anything (surprised it compiled), so I almost gave up hope. But then I removed it from both places and it worked!! 25MHz!!

I read up on coherent and now I understand why: it gets placed halfway around the world in the kseg1 region where it can't be cached. That also puts it on another bus so that DMA can run simultaneously, right? I just might try to read in 32 bits and rotate them into the port. Thanks, you rock!!
No. The virtual-to-physical address translation occurs in the MMU, which is part of the CPU core. Whilst there are 2 separate banks of RAM, with separate connections to the system bus matrix, the bank is decided by the physical address, which is the same for both kseg0 and kseg1. If both the CPU and DMA are trying to access the same bank, you will get contention. To avoid contention you would have to ping-pong between the RAM banks, using either regions or the linker script to place the buffers.
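A short C sketch of why kseg0 and kseg1 end up on the same physical RAM: on MIPS32 both segments are fixed windows onto the low 512 MB of physical address space, so the physical address is the virtual address with the top three bits cleared. The helper names are hypothetical, and the RAM bank boundaries are device-specific and not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

#define KSEG0_BASE     0x80000000u   /* cacheable window */
#define KSEG1_BASE     0xA0000000u   /* uncached window */
#define KSEG2_BASE     0xC0000000u   /* first address past kseg1 */
#define KSEG_PHYS_MASK 0x1FFFFFFFu   /* low 29 bits form the physical address */

/* Physical address behind a kseg0 or kseg1 virtual address. */
uint32_t kseg_to_phys(uint32_t vaddr)
{
    return vaddr & KSEG_PHYS_MASK;
}

/* True if the address lies in the uncached kseg1 window. */
bool is_kseg1(uint32_t vaddr)
{
    return vaddr >= KSEG1_BASE && vaddr < KSEG2_BASE;
}

/* Cacheable (kseg0) alias of the same physical location. */
uint32_t kseg0_alias(uint32_t vaddr)
{
    return kseg_to_phys(vaddr) | KSEG0_BASE;
}
```

So a buffer seen at 0xA0001000 and the same buffer seen at 0x80001000 are one physical location; only the caching behaviour of the access differs.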