CSE291 Matt Eric
As of last night, I had an 8x16 pop counter digit working in greyscale at a fixed location on the screen. I'm fairly sure I can get a second digit working today, and the RGB channels and variable positioning either today or later this week. I also plan to add letters.
Currently, the system works as follows: you have an arbitrary number of sprites (a term for graphics objects used in games), each with its own associated image. Each image is refreshed from the BRAM on the V-Sync. On the workstation side, you load images once from text files and can choose any of them to memory-map into the BRAM. If performance becomes a concern (e.g., if we want to increase the maximum sprite size beyond 256 pixels), it would probably be more efficient to load all the images into BRAM once and then let each sprite map to any of the loaded images.
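Roughly, the per-sprite refresh looks like this sketch (every name here is illustrative, and the single-cycle BRAM read glosses over the real interface's latency):

module sprite_refresh #(parameter IMG_PIXELS = 256) (
    input            CLK,
    input            VSYNC,       // high during vertical blanking
    input      [7:0] BRAM_DATA,   // pixel value for the address below
    output reg [8:0] bramAddr     // next pixel to fetch
);
    reg [7:0] image [0:IMG_PIXELS-1];  // local copy drawn during the frame

    initial bramAddr = 0;

    always @(posedge CLK) begin
        if (VSYNC) begin
            image[bramAddr] <= BRAM_DATA;  // refresh one pixel per cycle
            bramAddr <= (bramAddr == IMG_PIXELS-1) ? 9'd0 : bramAddr + 9'd1;
        end
    end
endmodule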
Edit: Greyscale double-digit bubble pop counter up and running in Freund Lab!
Converting to three-channel images seems to have caused a timing failure in bitstream generation. However, the same number of bits are being read from the BRAM as before; they're just being utilized more. I believe this is the constraint that cannot be satisfied:
----------------------------------------------------------------------------------------------------------
Constraint                                  | Check     | Worst Case | Best Case  | Timing | Timing
                                            |           | Slack      | Achievable | Errors | Score
----------------------------------------------------------------------------------------------------------
* NET "plbv46_pcie_0/plbv46_pcie_0/comp_blo | SETUP     |   -1.170ns |    5.170ns |    140 |  56078
  ck_plus/core_clk" PERIOD = 4 ns HIGH      | HOLD      |    0.342ns |            |      0 |      0
  50%                                       | MINPERIOD |    0.000ns |    4.000ns |      0 |      0
----------------------------------------------------------------------------------------------------------
How does this become an issue?
I tried downloading the bitstream anyway, ignoring the timing errors. It works fine until I start communicating with the BRAM, at which point everything slows down and then the workstation side crashes.
I had a breakthrough today! Digits are being drawn to the screen. The final obstacle was figuring out that the index ordering of multi-dimensional arrays in Verilog is reversed from what I expected. Finally, I can work on the details of the interface: communicating the position and size, the enable bit, etc.
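An illustrative example of the ordering (not the actual code): in Verilog-2001, the unpacked dimensions after the identifier are indexed in declaration order, while the packed range before the identifier sets the bit order, so it's easy to get these backwards.

reg [7:0] glyph [0:15];  // 16 rows, each an 8-bit packed word: index as glyph[row][bit]
wire b = glyph[3][0];    // row 3, bit 0: the LSB if declared [7:0], the MSB if declared [0:7]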
- Fixed bug where second image address was getting pulled from 256*32 instead of 256*4 (bytes).
- Created nextInSprite wire, should fix alignment bug
- RGB framework done
The workstation-side code necessary for a minimal demo is mostly done, so now it's just a matter of getting the Verilog code to work. Struggled with ISE problems today, apparently from corrupted project files.
- EDIT: I've finally discovered that the problem is completely non-deterministic and happens when I restart the computer to reset the Linux-side PCIe driver. Sometimes when I restart, the state machine halts; other times it doesn't. I'm not exactly sure what happens to the FPGA when the computer restarts; it seems to me that only the VGA signals should be interrupted, and that shouldn't affect the state machine. In any case, this is a huge relief, and I can finally proceed.
- It took some work to set up the behavioral simulation, but it showed some anomalies in the plb_master code (e.g., the output M_ABUS was supposed to be 32 bits wide, but it had only been assigned to a one-bit-wide register). There were also signs that the crop module wasn't outputting a valid DVI v-sync signal, but that may have been an artifact of the simulated VGA input. However, working around these issues hasn't corrected the problem.
- At least my own state machine code worked fine. That's reassuring, but now I have to examine all of the preexisting code I started from.
- Problems seem much more non-deterministic than I thought. Just re-downloading the already-compiled bitstream gives different results, but it doesn't seem to fix the underlying problem.
- I may have to try ChipScope soon.
- There may have been race conditions between the combinational logic and the sequential logic, so I restructured the code to eliminate them.
- It turns out that the bubbles code I've been working with this whole time is outdated. It uses the PLB master to communicate with the BRAM, which is the reason for the cumbersome state machine. However, Matt says it's possible to communicate with the BRAM modules directly. Not only is this much simpler, it would also make it possible to do behavioral simulation with a generated BRAM.
- However, this would require porting not only my changes but the old messy bubbles code as well, and I am concerned about this amount of overhead. I'll look into it and try to estimate whether it would be worth doing.
- Before I try ChipScope, I should set up a behavioral simulation of just the state machine in isolation and see if it might be a problem with that.
Still having trouble with the state machine. Here's the problem:
Regardless of whether there is any valid data in the BRAM, the state machine in video_controller should run through and read all the values for the bubble coordinates and such. This had originally been working fine: the state machine keeps running, so whenever I eventually run the workstation-side program, it picks up the values.
However, I've extended the state machine to read additional addresses, and now it stalls when it attempts to read them. When I run the workstation-side program, I can write values to BRAM just fine, but the video controller's state machine is already broken, so nothing happens.
I think the culprit is the condition that the request ID, rReqId, must equal the response ID, wRespId, before the state machine advances. If the request is outside the legal address range, perhaps the response ID never updates?
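In Verilog terms, the suspect wait looks something like this (only rReqId and wRespId are names from the real code; the rest, including the watchdog that would at least make the hang visible, is an illustrative sketch):

localparam WAIT_READ = 2'd0, NEXT_STATE = 2'd1, ERROR_STATE = 2'd2;
reg  [1:0]  state = WAIT_READ;
reg  [3:0]  rReqId;
wire [3:0]  wRespId;     // advanced by the PLB master on each response
reg  [15:0] watchdog = 0;

always @(posedge CLK) begin
    case (state)
        WAIT_READ:
            if (rReqId == wRespId) begin
                state    <= NEXT_STATE;    // response arrived; proceed
                watchdog <= 0;
            end else begin
                watchdog <= watchdog + 1;  // count cycles spent waiting
                if (&watchdog)             // no response ever came back
                    state <= ERROR_STATE;  // flag the deadlock instead of spinning
            end
        default: watchdog <= 0;
    endcase
end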
I'm not sure, but it seems like the state machine breaks when it tries to read somewhere around PLB address 32'h92e10078. I'm trying to pinpoint it, but it's difficult.
Edit: I think I've fixed the problem, but I am still completely baffled. Originally, I used the following parameters:
parameter C_SHARED_MEM_ADDR = 32'h92e10000;
parameter C_SPRITE_ADDR     = C_SHARED_MEM_ADDR + 32'h50;
parameter C_IMG_ADDR        = C_SPRITE_ADDR + 32'h4;
Later in the program, I access the PLB using this line, where tblReadOffset is a 32-bit register that holds an offset between 0 and 32:
rReqAddr <= C_IMG_ADDR + tblReadOffset;
This caused problems. However, rewriting the third parameter as:
parameter C_IMG_ADDR = C_SHARED_MEM_ADDR + 32'h54
fixes the problem. Why would an additional level of parameter nesting be a problem?
- 256-length data?
- Debug1/debug2: adding different sized registers?
- Change img_addr offset back to 60?
- Does changing Verilog files during XST synthesis create problems?
- Finally start reading in image data.
I decided to abandon the dynamic image addressing for now, simply using fixed-size image storage. If I have additional time, I can try to re-implement the image table.
Next time: fix the state machine bug in state 20.
The last image table address is going to have to be an endpoint: if there are 10 addresses, there can only be 9 images. I changed video_controller.v accordingly, and also lowered the number of images to 8 (9 addresses) to keep things simple for now.
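Concretely (names here are illustrative): each image's extent is the difference of consecutive table entries, so N entries can only delimit N-1 images.

reg  [31:0] imgTable [0:8];                  // 9 addresses delimit 8 images
reg  [2:0]  imgId;
wire [31:0] imgStart = imgTable[imgId];      // first word of image imgId
wire [31:0] imgEnd   = imgTable[imgId + 1];  // first word past its data
wire [31:0] imgWords = imgEnd - imgStart;    // image length in 32-bit words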
The test worked perfectly! My next goal is to dynamically access a stored bitmap and display it. Then I will implement the (length, color) encoding instead. After that, I'll get the sprites to take positions.
I realized I was addressing BRAM by bit instead of by byte. That would have been bad. I've updated the workstation app to test writing to and reading from BRAM, so I just need to run it with the FPGA next time.
Trying to use 2D registers to avoid code repetition, but running into this strange and apparently still-unresolved XST problem: http://www.xilinx.com/support/answers/20391.htm
Rather than enumerate all values as suggested in the answer record, I'm going to try changing the always @(*) block into an always @(CLK) block. I would expect the clock to be fast enough.
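A minimal sketch of the change (illustrative names; I'm showing @(posedge CLK), the standard clocked form — @(CLK) alone would trigger on both edges):

reg [31:0] tbl [0:8];
reg [3:0]  sel;
reg [31:0] tblOut;

always @(posedge CLK)       // was: always @(*) with tblOut = tbl[sel];
    tblOut <= tbl[sel];     // registered read sidesteps the combinational 2D index,
                            // at the cost of one cycle of latency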
On the FPGA side, I extended the state machine to read several values that correspond to where the images will be in BRAM. For testing purposes, it also writes the values back into BRAM in another spot. The next step is to modify the workstation-side app to put the expected values into BRAM.
After setting up simulation tests, I fixed the bug and have a simple, hardcoded bitmap drawing showing up on the screen. However, buffers are only 64 bits long each so far, and hardcoded. Next, I'm going to try putting values in BRAM and using them.
Here is the Sprite module so far. The nature of the buffer will change to holding (length, color) pairs.
I think that encoding our images as (length, color) pairs is definitely the way to go. It's much better suited to the way the images will actually be drawn on the screen, and shouldn't be difficult to preprocess. A fortunate consequence is that scaling images by integer factors should be fairly straightforward, so long as we restrict the (length, color) pairs from crossing the image bounding box: to scale the width, we just transform each (length, color) pair into (length*scale, color); to scale the height, we just repeat the set of (length, color) pairs for a given line scale times.
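A rough Verilog sketch of both directions (every name here is hypothetical):

module scale_sketch (
    input             CLK,
    input             endOfSpriteRow,   // last pair of the current row consumed
    input      [11:0] runLen,           // length field of the current pair
    input      [3:0]  scaleW,
    input      [3:0]  scaleH,
    output     [15:0] drawLen,          // pixels to emit for this pair
    output reg        advanceRow        // pulse: move on to the next row of pairs
);
    assign drawLen = runLen * scaleW;   // width: stretch each run by the scale

    reg [3:0] rowRepeat = 0;            // height: replay each row scaleH times
    always @(posedge CLK) begin
        advanceRow <= 1'b0;
        if (endOfSpriteRow) begin
            if (rowRepeat == scaleH - 1) begin
                rowRepeat  <= 0;
                advanceRow <= 1'b1;     // this row has been drawn scaleH times
            end else
                rowRepeat <= rowRepeat + 1;  // re-read the same pairs next line
        end
    end
endmodule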
- Since we will likely have sprites that use the same images, I'd like to define each sprite by only one coordinate (upper left) and an image ID. We'll then need a table mapping IDs to addresses. These addresses will point to blocks of BRAM that define images with: (width, height), (# of L-C pairs), and the (L, C) pairs themselves. This way, we can make the image sizes variable (a sketch of the layout follows this list).
- Currently, all buffering of BRAM values is done during V_SYNC. This seems to be the safest approach, provided the V_SYNC is long enough. However, continuing this practice would require buffering all of the images in their entirety. A 50x50 image of the digit 1 would take around 3*height = 150 LC pairs, i.e., 150 x 32 bits; the worst digit, 0, would take almost 5*50 = 250 pairs, i.e., 250 x 32 bits. What is the maximum number of registers we can create on this FPGA? Hopefully we don't come close.
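One possible layout, sketched in Verilog (the base address is hypothetical, and the per-image block format is just the one described above):

// Image table: one 32-bit start address per image ID.
// Image block:  word 0     -> {width[15:0], height[15:0]}
//               word 1     -> number of (length, color) pairs
//               word 2...  -> the (length, color) pairs themselves
localparam IMG_TABLE_BASE = 32'h92e10054;                 // hypothetical base
reg  [2:0]  imgId;                                        // from the sprite entry
wire [31:0] tblEntryAddr = IMG_TABLE_BASE + (imgId << 2); // 4 bytes per entry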
I just finished laying out the BRAM address space and parametrizing the module. Next time, I need to add to the BRAM IO state machine, hopefully using loops and parametrized 2D registers, since the images are too large to be reading each value explicitly.
Matt figured out the problem. For some reason, installing the driver causes a 0 to be written at PCIe address 0, which happens to correspond to the skin detection threshold value. Rewriting a sane value fixes the problem.
Also ran into some trouble with XPS not recognizing new Verilog files I had added to the video processor's ISE project. It turns out you have to update some files in /core/data.
Trying to simply draw some black squares on screen. Not working as hoped. I might have to add some simulation test files, even for this simple job.
Still haven't figured out the problem with the driver. Matt was surprised by the behavior as well. Possibly there's a newer version somewhere, but he doesn't know where it is. Unfortunately, it seems that the system in the foyer has been reset, so no chance of pulling it from there, either.
I'll ignore this problem for now, since I don't know where everything is in the Seed repository and don't have access anyway. My first incremental goal is to get an arbitrary, hardcoded bitmap to show up on-screen.
Matt showed me how to download a bitstream into the freundlabpc3 FPGA after changing code with ISE. I tweaked the skin detection color to make sure it was working properly.
Matt also directed me to the SVN repository containing the workstation-side code. I tried installing the workstation PCIe driver (sudo insmod xc5vsx50t.ko), and it does show up in /dev/, but it does something funny to the FPGA if I install it while the FPGA is running: the entire screen fills with the skin detection color, though the bubble outlines are still drawn on top. So I tried installing the driver first, then starting up the FPGA. This avoids the disruption, except that after an FPGA restart I need to restart the computer before I can run a workstation-side program that uses the PCIe, which forces me to reinstall the driver.
--Yoavfreund 17:38, 6 December 2009 (UTC) Sounds good. However, I would really like you and Matt to move from SVN to mercurial.
--Eric Doi 02:26, 8 December 2009 (UTC) Will do. I just needed to get the data; when I push new changes up, I'll start a mercurial project in the class tree.
Modify the system to enable a bubble pop counter. The workstation will send the FPGA a value to display on the screen; this will be stored in BRAM for asynchronous retrieval by the FPGA, which will use stored bitmaps (or encoded representations) of the digits 0-9 to display the count on the screen.
Check with Matt if this understanding is correct.
I started looking through the XPS projects on Seed and found one (seed.ucsd.edu/data/hgroot/fpgaBaseSystem/ml506/bs0_bubbles) which has the video processor pcore that implements the bubbles app and integrates the skin detection processing.
Currently, there is a module, region_counter, which represents the bubbles. Each instance deals with:
- VGA information (current (x,y) coords, clock, sync variables, etc)
- a flag containing the skin detection output for the current pixel
- a bounding box ((x1,y1) to (x2,y2)) - comes from workstation, read from BRAM
- a flag for whether the current (x,y) lies on the outline of the bounding box (in which case we should draw the outline color)
- a counter of how many skin pixels have been seen inside of the bounding box
However, we would like to be able to draw arbitrary bitmaps, and also be able to play the bubbles game with these, so we need some modifications.
I'm working first on creating a module, "sprite", that represents a drawable object, to satisfy a subset of these requirements:
- VGA information
- a bounding box
- a circular buffer of some number of (length, color) pairs
The bounding box and the (length, color) pairs would ultimately come from BRAM, but this should be done outside of this module.
--Eric Doi 20:52, 24 November 2009 (UTC) On second thought, I think it would be more flexible to encode the mask separately; all it would require is an extra bit in addition to the (length, color) pairs, e.g. (length, color, mask).
- OUT_INSIDE: A flag for whether the current pixel is "inside" of the object represented by this bitmap. I think the easiest way to do this would be to reserve a color value to be a special "mask" color. This color would be drawn as transparent, but would make it straightforward to determine if a pixel is "inside."
- OUT_ENABLE: A flag for whether this pixel is a reserved color value of "transparent" (if it is, don't override the VGA input)
- OUT_COLOR: A color value
The basic operation of this module would be roughly the following state machine (a rough Verilog sketch appears after the list):
- State 1: we are outside of the sprite's bounding box. Set OUT_ENABLE = 0, OUT_INSIDE = 0.
- State 2: we just entered the sprite's bounding box. Read a (length, color) pair from the buffer; store the length into a counter. Set OUT_ENABLE = 1, OUT_INSIDE = 1.
- State 3: we already have a counter value. If we are inside of the bounding box, decrement it and output the color; otherwise, wait until we reach our bounding box again on the next horizontal line. When the counter runs out, read another pair from the buffer.
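A rough Verilog rendering of these three states (signal names are illustrative; buffer wraparound, re-entry on each scan line, and V-Sync handling are omitted). This would sit inside the sprite module, with PIXEL_CLK, inBox, and bufferData as ports or internal wires:

localparam S_OUTSIDE = 2'd0, S_ENTER = 2'd1, S_INSIDE = 2'd2;
reg  [1:0]  state = S_OUTSIDE;
reg         OUT_ENABLE, OUT_INSIDE;
reg  [7:0]  OUT_COLOR;
reg  [11:0] runCount;           // pixels left in the current run
reg  [7:0]  runColor;
wire        inBox;              // current (x,y) inside the bounding box
wire [19:0] bufferData;         // next {length[11:0], color[7:0]} pair

always @(posedge PIXEL_CLK) begin
    case (state)
        S_OUTSIDE: begin                          // State 1
            OUT_ENABLE <= 0;
            OUT_INSIDE <= 0;
            if (inBox) state <= S_ENTER;
        end
        S_ENTER: begin                            // State 2: entered the box
            {runCount, runColor} <= bufferData;   // read a (length, color) pair
            OUT_ENABLE <= 1;
            OUT_INSIDE <= 1;
            state <= S_INSIDE;
        end
        S_INSIDE: begin                           // State 3
            if (inBox) begin
                OUT_COLOR <= runColor;
                if (runCount == 1)
                    {runCount, runColor} <= bufferData;  // run spent: next pair
                else
                    runCount <= runCount - 1;
            end
            // outside the box on this line: hold until the box is re-entered
        end
    endcase
end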
Then, the updated region_counter would simply instantiate a sprite and route all the VGA information to it. The only additional input it needs is the skin detection flag; it then increments the skin pixel counter whenever the current pixel is a skin pixel and the sprite reports OUT_INSIDE == 1.
After this is finished, I would like to try to hard-code an encoding of a round bubble bitmap and get round bubbles working.
I put the code up on the repository. Matt suggested that I look at the base system, particularly where the skin detection output gets integrated into the VGA output. That's where I want to work.
11-9-09 Update: Verilog Code for decoder
I've finally got some Verilog code working as a decoder.
- The real question now is how the data is going to be put in and how it's going to be read out. Most likely we'll want to do the two-way BRAM transfer mentioned in class and demonstrated by Evan?
- After working that out, I can try making a C program that will translate bitmaps into these Elias coded streams, so we can see some actual data.
Verilog code for decoder
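The original listing isn't reproduced in this log; as a stand-in, here is a minimal bit-serial sketch of a gamma decoder (all names illustrative, not the actual module):

module gamma_decoder (
    input             CLK,
    input             RST,
    input             bitIn,      // one code bit per cycle
    input             bitValid,
    output reg [31:0] value,      // decoded integer
    output reg        done        // pulses when value is complete
);
    reg [5:0] zeros;              // leading zeros seen so far
    reg [5:0] remaining;          // payload bits still to shift in
    reg       counting;           // 1 while still in the zero prefix

    always @(posedge CLK) begin
        done <= 1'b0;
        if (RST) begin
            zeros    <= 0;
            counting <= 1'b1;
        end else if (bitValid) begin
            if (counting) begin
                if (bitIn) begin             // the first 1 ends the prefix...
                    value <= 32'd1;          // ...and is also the MSB of N
                    if (zeros == 0)
                        done <= 1'b1;        // code "1" means N = 1
                    else begin
                        remaining <= zeros;  // that many payload bits follow
                        counting  <= 1'b0;
                    end
                    zeros <= 0;
                end else
                    zeros <= zeros + 1;      // still counting the prefix
            end else begin
                value     <= {value[30:0], bitIn};  // shift in a payload bit
                remaining <= remaining - 1;
                if (remaining == 1) begin
                    done     <= 1'b1;        // codeword complete
                    counting <= 1'b1;        // ready for the next codeword
                end
            end
        end
    end
endmodule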
Great to hear that it's working. Yes, doing this in C will be better because most apps will probably be written in C (for several reasons, including speed and more predictable runtimes). Additionally, I think it will be easier in C, since one can cast arbitrary sequences of 8 bits into a char/byte.
I think it's fair to say we support 8 bits of color only and that the color table is known to the decoder.
We'll want to start implementing the decoder in Verilog, but I'd like to see the code and run some experiments to see how reasonable the encoded size is for a (say) 40x40 tile. If encoding a 40x40-pixel tile (1600 pixels) takes close to 1.6KB, it's not that much better than uncompressed, right? --MattJacobsen 15:27, 20 October 2009 (UTC)
--Yoavfreund 17:53, 20 October 2009 (UTC) For 40 by 40 there will not be much difference if it is an elaborate map such as a character glyph, but for a 100x100 circle there will be a large difference. A reasonable solution is to have an API that supports more than one format, and to start by implementing the fixed-bitmap format first.
10-16-09 Elias Coding
- Implement an Elias gamma coding encoder and decoder in software. (Actually, just copy the Java implementation from the Wikipedia page and test that it is correct: encode and decode all numbers from 1 to 500,000.)
- Implement the decoder in hardware
Read about various Elias codings, and prefix-free codes in general. From what I've read (and one great comparison graph), Elias gamma is what we want to use if our values are frequently less than 32. Omega coding (recursive) is also very interesting, and probably falls somewhere between gamma and delta in bit-length efficiency.
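For reference, the standard gamma-code cost (textbook facts, not from my experiments): a value N >= 1 is written as floor(log2 N) zeros followed by the binary expansion of N, for a total of 2*floor(log2 N) + 1 bits. For example, 9 = 1001 in binary encodes as 0001001 (7 bits), so any N < 32 costs at most 9 bits, which is why gamma wins for small run lengths.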
--Yoavfreund 04:02, 18 October 2009 (UTC) There is no single best code; different codes suit different distributions. The Elias gamma coding is a good first method to use. After you create a large set of codes describing different shapes and characters, we can analyze the statistics of the run lengths and come up with a better code. For now, the gamma code is good enough. The next task to tackle is writing Verilog code to decode the gamma-encoded integers. The following step would be to create images on the video display defined by a sequence of integers stored in the BRAM (in gamma-encoded fashion). One of the steps towards creating these fonts is understanding how to get a bitmap (glyph) of a nice-looking fixed-width font. I don't know much about this stuff, but here is a promising starting point: http://www.levien.com/type/myfonts/inconsolata.html
I am a little unclear about the difference between comma-free and prefix-free coding, but my intuition is that prefix-free codings need more backtracking to decode than comma-free codes. In any case, I can revisit this later, when I have some experimental results to consider.
Worked on implementing the Elias gamma coding in Java. The concept is simple, but it's taking me a while to work out the details of a bit-level implementation amid the quirks of Java; Java doesn't supply an actual BitStream, though I did find an implementation via Google. For now, I am using Strings of 0s and 1s to represent streams of bits, just as a proof of concept.
- I'm wondering if it might be better to use C for this, since it's more lax about typing and might be faster at working with bit-level data. Actually, that's probably not where the bottleneck will be.
- Since the number of colors should be known beforehand, they won't need to be Elias encoded.
It turns out that the Wikipedia code had a few bugs in it. I finally found and fixed them, and the encoding/decoding now works nicely with console input. I'll set up file IO and then I can do some more thorough testing.
--Eric Doi 08:40, 17 October 2009 (UTC)
Successfully encoded and decoded integers from 1 to 500,000, but only separately; attempting to encode and decode them as a single stream produced extremely large Strings and Integer linked lists, and it didn't look like it would be feasible.
--Eric Doi 10:49, 17 October 2009 (UTC)