Hello,

I have developed a small microcomputer system on a Spartan-3E development board (simple CPU, video, I/O controllers, etc.). My system works fine, but I only just met my target timing constraint of 50 MHz and I'd like to improve on this. The longest paths in the timing report seem to involve the logic I use for multiplexing memory units and registers.

The CPU memory space includes multiple SRAM blocks with different functions and also memory-mapped LUT-based registers. Where I have needed to "join" different blocks together, I simply used WITH SELECT or IF THEN structures to multiplex the blocks according to the relevant control signals. This is particularly the case with my LUT-based registers: I have more than 25 separate 8-bit LUT-based registers that control various modules such as video, I/O, etc. All of them are memory-mapped into the CPU address space through one giant multiplexer that uses the memory address given by the CPU as its control signal.

Even writing this down, I realize it must be enormously inefficient, and I wondered whether there are alternative ways of achieving the same ends with fewer logic levels and consequently better timing performance.

I would be very grateful for any thoughts!

Best regards,
Anding
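To make the size of that read path concrete, here is a small behavioral model in Python (not synthesizable VHDL) of the structure described above: one flat multiplexer whose select is the full CPU address. The addresses, region sizes and the `FlatReadMux` name are hypothetical; only the count of more than 25 8-bit registers comes from the post.

```python
# Behavioral model only (Python, not VHDL). The memory map below is
# hypothetical: the post gives no addresses or block sizes.
SRAM_REGIONS = [        # (base address, size in bytes, name)
    (0x0000, 0x1000, "program"),
    (0x1000, 0x0800, "video"),
]
REG_BASE = 0x1800       # hypothetical base of the register window
NUM_REGS = 26           # "more than 25" 8-bit LUT-based registers


class FlatReadMux:
    """One flat read multiplexer: the whole CPU address selects the source."""

    def __init__(self):
        self.sram = {name: [0] * size for _, size, name in SRAM_REGIONS}
        self.regs = [0] * NUM_REGS

    def read(self, addr):
        # Combinational read path: address decode and data select in one step.
        for base, size, name in SRAM_REGIONS:
            if base <= addr < base + size:
                return self.sram[name][addr - base]
        if REG_BASE <= addr < REG_BASE + NUM_REGS:
            return self.regs[addr - REG_BASE]
        return 0                      # unmapped addresses read as zero

    @staticmethod
    def mux_inputs_per_bit():
        # Every bit of the read bus picks from each SRAM region and each register.
        return len(SRAM_REGIONS) + NUM_REGS
```

With this hypothetical map, each of the 8 read-data bits is a 28-input multiplexer whose select is decoded from the full address, which is consistent with these paths showing up at the top of the timing report.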
2 answers
For fabric-based registers, a common optimization is to have a "shadow RAM" (usually a block RAM) that holds a copy of the most recently written value of each register. Then, to read the registers, your multiplexer defaults to this shadow RAM unless it is reading back data that can change externally, such as read-only bits or register bits that can otherwise be changed outside the scope of the processor. In most systems, this reduces the number of mux inputs by quite a bit. It also helps to group together the registers that are read-only or that have other reasons not to use the shadow RAM.

Another thing to consider is that block RAM and distributed RAM both have longer clock-to-Q timing than fabric flip-flops. Adding a pipeline stage after these RAMs helps immensely if you can afford the extra cycle of latency. In many cases you are multiplexing together things that don't really change on every cycle (like registers), so the extra latency is of little or no concern.

-- Gabor
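A rough behavioral sketch of the shadow-RAM scheme with the extra pipeline stage, again in Python rather than HDL. It assumes 8-bit registers, a read-first shadow RAM, and a hypothetical set of "live" addresses (read-only status bits driven by hardware) that bypass the shadow copy; `ShadowRegisterFile`, `set_status` and the addresses in the example are invented for illustration.

```python
class ShadowRegisterFile:
    """Behavioral model of fabric registers plus a block-RAM shadow copy.

    Assumptions (hypothetical): 8-bit registers, a read-first shadow RAM,
    and a small set of 'live' read-only addresses driven by hardware.
    """

    def __init__(self, num_regs=32, live_addrs=()):
        self.regs = [0] * num_regs           # fabric registers driving the hardware
        self.shadow = [0] * num_regs         # shadow RAM: last value the CPU wrote
        self.live_addrs = frozenset(live_addrs)
        self.status = {a: 0 for a in self.live_addrs}  # externally driven values
        self._shadow_q = 0                   # block RAM output register
        self._addr_d = 0                     # address delayed to match RAM latency
        self.read_q = 0                      # the added pipeline register

    def set_status(self, addr, value):
        # Hardware (not the CPU) updates a live read-only register.
        self.status[addr] = value & 0xFF

    def read_mux_inputs(self):
        # Shadow RAM output plus one input per live register.
        return 1 + len(self.live_addrs)

    def clock(self, addr, we=False, wdata=0):
        """One rising clock edge. Returns the pipelined read data, which
        belongs to the address presented on the previous call."""
        # Stage 2: small mux between shadow RAM output and live registers,
        # captured in the extra pipeline register.
        if self._addr_d in self.live_addrs:
            self.read_q = self.status[self._addr_d]
        else:
            self.read_q = self._shadow_q
        # Stage 1: synchronous, read-first shadow RAM access.
        self._shadow_q = self.shadow[addr]
        self._addr_d = addr
        # Write port: update the real register and its shadow copy together.
        if we and addr not in self.live_addrs:
            self.regs[addr] = wdata & 0xFF
            self.shadow[addr] = wdata & 0xFF
        return self.read_q
```

In this sketch, with 32 registers and two live addresses, the read mux drops from 32 inputs per bit to 3, at the cost of one extra cycle of read latency.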
Thanks Gabor, this has given me much food for thought.

I made some quite helpful improvements by:
-- making writable hardware registers write-only. (In other words, keeping track of the values of the writable registers becomes a software problem rather than a hardware problem, which reduces the size of the multiplexer tree on the read side.)
-- pipelining the update of all writable registers to the cycle following the CPU write (this moves the writable-register multiplexer tree into the clock cycle after the CPU write)
-- maintaining local register copies of all incoming hardware signals (this moves the incoming routing delays into the clock cycle before the CPU read)

Do you know any trick for combining separate SRAM blocks into a single address space efficiently? I configure my SRAM blocks with the CORE wizard. When the wizard connects multiple SRAM blocks into a larger piece of memory, it seems to use dedicated routing resources rather than general-purpose multiplexers, so the result is almost as fast as a single block. However, sometimes I need to configure separate chunks of SRAM (perhaps because of different connections on the port B side) and then connect them together into a single address space on the port A side. When I do this in VHDL, the connections are made with multiplexers, and this introduces timing delays.
Is there any way to link SRAM blocks together using the dedicated resources on the port A side, yet still be able to specify different connections for different blocks on the port B side, as well as use different COE initialization files for different blocks?
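Gabor's reply does not cover this question, so the following is only a sketch of one common manual approach, not a description of what the CORE wizard generates. Each block keeps its own port B and its own initial contents (standing in for separate COE files). On port A, the low address bits go to every block, the high bits gate the write enables, and the output multiplexer is driven by a registered copy of the high bits so that it lines up with the block RAM's one-cycle read latency. It is a Python behavioral model; `BlockRam`, `CombinedPortA` and the sizes are hypothetical.

```python
class BlockRam:
    """Behavioral dual-port, read-first synchronous RAM (one memory block)."""

    def __init__(self, depth, init=()):
        init = list(init)                     # stand-in for a per-block COE file
        self.mem = init + [0] * (depth - len(init))
        self.dout_a = 0
        self.dout_b = 0

    def clock_a(self, addr, we, din):
        self.dout_a = self.mem[addr]          # read-first: old data appears on dout
        if we:
            self.mem[addr] = din

    def clock_b(self, addr, we=False, din=0):
        # Port B is wired independently for each block.
        self.dout_b = self.mem[addr]
        if we:
            self.mem[addr] = din
        return self.dout_b


class CombinedPortA:
    """Joins separate blocks into one port-A address space."""

    def __init__(self, blocks, block_bits):
        self.blocks = blocks
        self.block_bits = block_bits          # each block holds 2**block_bits words
        self.sel_q = 0                        # registered upper address bits

    def clock(self, addr, we=False, din=0):
        sel = addr >> self.block_bits
        offset = addr & ((1 << self.block_bits) - 1)
        for i, blk in enumerate(self.blocks):
            # Same low address to every block; only the selected one is written.
            blk.clock_a(offset, we and i == sel, din)
        # Register the select on the same edge the RAMs register their data,
        # so the output mux select comes straight from a flip-flop.
        self.sel_q = sel
        return self.blocks[self.sel_q].dout_a
```

The multiplexer that remains has one input per block rather than one per word, and its select comes directly from a flip-flop, which typically keeps it off the critical path. Whether the tools can map it onto dedicated resources instead is not something this sketch answers.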