
coclocking

    Jake Read authored

    Collaborative Clocking

    What?

    I'm going to attempt to implement a 'Collaboratively Clocked' Serial PHY[1] - that's a CCSP[2], thanks very much.

    Motivation

    Serial Communication

    Serial communication is the bread and butter of simple physical communication layers - we take a group of bits (say, a Byte) and transfer it over a single line - so we go one-bit-at-a-time. Here is Sparkfun's excellent introduction to serial communication. UART, SPI and I2C are all forms of serial communication.

    This is a parallel line - we push 8 bits at once, requiring 8 physical media (wires) for data and one more for the clock.

    parallel

    This is a serial line - we push bits one at a time, requiring one data line (media, wire), and one clock line.

    serial

    I want to address two pitfalls of serial communication:

    • Clock Configuration
      • In an Asynchronous (Clockless) serial line (like UART), we have to configure both endpoints to run at the same speed - so both processors know how often to sample the line in order to determine what the bit is at that time-step.
    • Master / Slave
      • In Clocked (Synchronous) Serial Communication (SPI), we generally have a Master and a Slave, where the Master sets the clock speed and drives a clock line that the Slave uses as a reference for when to sample a bit. Here is Sparkfun's SPI documentation, including a more thorough rundown of this issue.

    See how the clock defines sampling positions:

    spi

    Co-Clocked Serial Communication

    In this solution, which Neil presents in his work on ATP (asynchronous token protocol) - endpoints 'pass the clock' back and forth. Endpoints have two transmit and two receive lines each, one for data and one for the 'clock' or the 'token'. The token is flipped to signal that new data is present on the line, and that flip indicates to the other endpoint that it should read the new data. When the other side has done so, it flips its token to indicate that it's ready for a new bit.

    So, for example, we have these four lines (T: token, D: data)

    wb-phy

    In the first timestep, we send a token 'I have data on the line now' and the data. The other side responds (once it has handled the data - i.e. read the pin state and put that in memory) by setting its token line in the second timestep. This causes the first side to repeat the process. We do this until there is no more data.

    wb-timesteps

    This allows us to have a 'clock-like' line - i.e. there is some certainty about when bits should be sampled, without the constraint of configuring a clock. Instead, the slower side sets the rate - it OK's every bit by 'bouncing' the clock (token) back.
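
The handshake described above can be modelled in plain C. This is only a sketch: the `cc_link` struct, the function names, and the LSB-first bit order are assumptions made for illustration, and both endpoints run inside one loop here rather than on separate chips.

```c
#include <assert.h>
#include <stdint.h>

/* One direction of a co-clocked link (hypothetical model, not the real PHY). */
typedef struct {
    int data;      /* D: driven by the sender */
    int tx_token;  /* T: flipped by the sender when new data is on the line */
    int rx_token;  /* T: flipped by the receiver once it has read the data */
} cc_link;

/* Sender: put a bit on the data line, then flip the token to announce it. */
void cc_send_bit(cc_link *link, int bit) {
    link->data = bit & 1;
    link->tx_token ^= 1;
}

/* Receiver: sample the data line and flip the token back to acknowledge. */
int cc_receive_bit(cc_link *link) {
    int bit = link->data;
    link->rx_token ^= 1;
    return bit;
}

/* Move one byte across the link, LSB first (bit order is an assumption).
   Returns the byte as reassembled by the receiver. */
uint8_t cc_transfer_byte(uint8_t value) {
    cc_link link = {0, 0, 0};
    int tx_seen_by_rx = 0;  /* last sender token the receiver saw */
    int rx_seen_by_tx = 0;  /* last receiver token the sender saw */
    uint8_t received = 0;

    for (int i = 0; i < 8; i++) {
        cc_send_bit(&link, (value >> i) & 1);

        /* The receiver only samples after it sees the sender's token flip. */
        assert(link.tx_token != tx_seen_by_rx);
        tx_seen_by_rx = link.tx_token;
        received |= (uint8_t)(cc_receive_bit(&link) << i);

        /* The sender only continues after it sees the acknowledgement flip. */
        assert(link.rx_token != rx_seen_by_tx);
        rx_seen_by_tx = link.rx_token;
    }
    return received;
}
```

Neither side needs to know the other's speed in this model: every step waits on a token edge, so whichever endpoint is slower paces the link.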

    This is important for two reasons

    • We want to interface slower chips (or overbuffered chips) with faster chips. Co-Clocking auto-rates to the fastest speed that both ends can sustain.
    • In a sense, this auto-configures speed for the length of the medium - propagation time is accounted for with the clock - or maybe another way to say it is that we ensure that the 'size of a bit' - i.e. the length of the wire it takes up - is no shorter than the wire. In this way, we eliminate issues that arise from reflections in the medium, and avoid dealing with fancy RF-like link processing as seen in Ethernet.

    Why FPGAs?

    I'm going to try to do this with an FPGA[4]. I want to do this because I want to work around the speed limit inherent in most of the microcontrollers we've seen: the data rate in this model is limited by the rate at which we can push/pull data into and out of the 'computer' part of the microcontroller. On each clock cycle, we have to pull sampled data in, set the new data, and set the 'ok' token. This is one complete cycle into-and-out-of the microcontroller. The CBA has been keeping track of these speeds on our Ring Test Page - and we can see that even the fastest chips have a ring speed of ~6MHz (I just tested a new one, the ATSAMS70 running at 300MHz, and it measures in at 5.8MHz). This translates to a maximum bitrate of 6Mbps - where modern Ethernet is running towards 100Gbps (!). In addition, this 6MHz ring is occupying all of the processor's time - if we tried to do other stuff, this would slow down.

    So the goal is to use an FPGA kind of like a fancy multiplexer to read parallel lines (where wiring many pins -> many pins on the PCB is no problem[5]) and output a serial line. This works because with a single instruction on the microcontroller we can set all of the lines on a port, ports generally being the same width (in bits) as the MCU's primitive data type - 32 bits on 32 bit MCUs, 8 bits on smaller MCUs like the XMEGA.

    fpga-multiplexing

    So we effectively multiply the bitrate by the bus size - turning the 6MHz line into a 48Mbps (for a byte) or a 192Mbps (for a full 32-bit word) line, using the higher speed of the FPGA to drive those 'wide' blocks along a serial line. We retain the Token lines on the MCU side - as it needs to co-clock with the FPGA.
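
The multiplication is simple enough to write down. A tiny sketch, where `effective_mbps` is a hypothetical helper name and the only assumption is that one whole bus word moves per MCU handshake:

```c
#include <assert.h>

/* Serial bitrate (Mbps) needed to carry one bus word per MCU handshake. */
unsigned effective_mbps(unsigned ring_mhz, unsigned bus_width_bits) {
    assert(bus_width_bits > 0);
    return ring_mhz * bus_width_bits;
}
```

With the ~6MHz ring speed, `effective_mbps(6, 8)` gives 48 and `effective_mbps(6, 32)` gives 192, the two figures quoted above.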

    Sounds pretty neat to me!

    Some drawbacks:

    • System Complexity (?) w/r/t simple UART implementation, but this is simpl-er than, say, an Ethernet implementation. And robust!
    • LOTS of PIO is taken up on the processor side. For a full duplex 32-bit line we would be driving 66 pins! As it turns out, when one of these ATSAMs is configured to read / write from external RAM, it does essentially this - pushing a bus of 16 bits in parallel. Same limits. QSPI[6] works similarly, running a bus of four lines in parallel to drive bitrates past permissible clock rates. So, some evidence that this is an appropriate solution. Also, it seems to make basic sense.. OK!
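
The 66-pin figure can be reproduced with a one-line count. This is a sketch under an assumption: each direction uses one data bus plus a single token line, which is what matches the number above. `duplex_pin_count` is a hypothetical name.

```c
#include <assert.h>

/* MCU pins for a full-duplex parallel link: one data bus plus one token
   line in each direction (an assumption that matches the 66-pin figure). */
unsigned duplex_pin_count(unsigned bus_width_bits) {
    assert(bus_width_bits > 0);
    return 2u * (bus_width_bits + 1u);
}
```

`duplex_pin_count(32)` gives 66, and `duplex_pin_count(8)` gives 18 for a byte-wide port.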

    Implementation

    I first had this idea when I saw the TinyFPGA project[7] - so I purchased two, and the programmer they conveniently developed.

    tinyfpgas

    I started by watching this video from the EEVBlog guy - a very enthusiastic expert-seeming youtuber whose explanations of most things EE I would highly recommend. I also found this tutorial: this person is even using the same chip as I am!

    I followed the TinyFPGA guide here to set up the Lattice Diamond Software that I needed to program the chip. The size of the download for the software (1.6GB) indicates that I am probably in for a steep learning curve. The datasheet is also 8MB. Yikes!

    I spent some time listening to more people talk about FPGAs and Verilog[8] - here as well as this and then this.

    As I'm learning I'm trying to consider how I'm going to design this thing. It's basically two shift registers + the co-clock. These are simple logic circuits.

    Verilog can describe things behaviorally and structurally. Behavioral descriptions talk about what a circuit does, structural descriptions talk about what a circuit is. For example, in Verilog we can 'write' AND, OR, NOT, NAND, NOR, XOR and XNOR gates with code.

    For example, here's the NOT gate - gate primitives list the output first, so y is output, x is input

    not(y,x); 

    Really, we are describing these gates and their interconnects, and then Verilog 'compiles' this code into a description of the appropriate FPGA interconnects[9]. So, my circuit being pretty simple - and my interest in learning about barebones logic design being large - I'm probably going to go about writing code like this.

    Here are the primitives (yes I had to google this)

    gates

    So a Ring Oscillator is just a NOT

    gates

    I'm going to loop through setting up and running this before I throw myself down the hole of trying to understand and implement a shift register. I'm in foreign lands now. Here be dragons, etc.

    Yada yada yada, I followed the TinyFPGA example (a counter) and it worked, great success. Here's the code - direct from the TinyFPGA Example:

    module TinyFPGA_A2 (
      inout pin1,
      inout pin2,
      inout pin3_sn,
      inout pin4_mosi,
      inout pin5,
      inout pin6,
      inout pin7_done,
      inout pin8_pgmn,
      inout pin9_jtgnb,
      inout pin10_sda,
      inout pin11_scl,
      //inout pin12_tdo,
      //inout pin13_tdi,
      //inout pin14_tck,
      //inout pin15_tms,
      inout pin16,
      inout pin17,
      inout pin18_cs,
      inout pin19_sclk,
      inout pin20_miso,
      inout pin21,
      inout pin22
    );
    
      // left side of board
      assign pin1 = 1'bz;
      assign pin2 = 1'bz;
      assign pin3_sn = 1'bz;
      assign pin4_mosi = 1'bz;
      assign pin5 = 1'bz;
      assign pin6 = 1'bz;
      assign pin7_done = 1'bz;
      assign pin8_pgmn = 1'bz;
      //assign pin9_jtgnb = 1'bz;
      //assign pin10_sda = 1'bz;
      //assign pin11_scl = 1'bz;
      
      // right side of board
      //assign pin12_tdo = 1'bz;
      //assign pin13_tdi = 1'bz;
      //assign pin14_tck = 1'bz;
      //assign pin15_tms = 1'bz;
      assign pin16 = 1'bz;
      assign pin17 = 1'bz;
      assign pin18_cs = 1'bz;
      assign pin19_sclk = 1'bz;
      assign pin20_miso = 1'bz;
      assign pin21 = 1'bz;
      assign pin22 = 1'bz;
      
      wire clk;
    
      OSCH #(
    	  .NOM_FREQ("2.08")
      ) internal_oscillator_inst (
    	  .STDBY(1'b0),
    	  .OSC(clk)
      );
    
      reg[23:0] led_timer;
    
      always @(posedge clk) begin
    	  led_timer <= led_timer + 1;
      end
    
      assign pin9_jtgnb = led_timer[23];
      assign pin10_sda = led_timer[22];
      assign pin11_scl = led_timer[21];
    
    endmodule

    Warning! Graphic Breadboard Content:

    counter

    OK, now I'm trying the ring. I think I can just NOT some pins, let's see.

    module TinyFPGA_A2 (
      input pin21,
      output pin22
    );
    
     assign pin22 = !pin21; // pin22 is the logical inverse of pin21
    
    endmodule

    Nice. This blows all other ring tests out of the water with a 120MHz wave. Actually, this is close to the oscilloscope's maximum frequency - 200MHz, so Sam and I are going to test it out on the LeCroy later on. Whoop.

    rt-tek

    OK, now I have two boards: one side is out = !in, the other is out = in, so we have this antagonistic ring. The 'co-clock' is right at 56MHz, roughly half of the previous measurement, which makes sense! Strange difference in the waves...
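
Why halving makes sense: an inverting ring toggles once per trip around the loop, so a full period takes two trips. A small sketch of that arithmetic follows. The function names are hypothetical, and it assumes both boards add the same delay and ignores the wires between them.

```c
#include <assert.h>

/* A ring with an odd number of inversions toggles once per trip around
   the loop, so one full period takes two trips: f = 1 / (2 * delay). */
double ring_frequency_mhz(double loop_delay_ns) {
    assert(loop_delay_ns > 0.0);
    return 1000.0 / (2.0 * loop_delay_ns);
}

/* Inverse of the above: loop delay implied by a measured ring frequency. */
double ring_loop_delay_ns(double frequency_mhz) {
    assert(frequency_mhz > 0.0);
    return 1000.0 / (2.0 * frequency_mhz);
}
```

A 120MHz single-board ring implies a loop delay of a little over 4ns. Doubling that delay for two boards predicts 60MHz, so the measured 56MHz would leave a bit of extra delay unaccounted for, possibly in the jumpers between the boards (a guess, not something measured here).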

    co-rt-tek

    Parallel -> Serial

    OK, I want to wrap this up tonight with a semi-complete experiment. Next thing to look at is how I'm actually going to implement the shifting logic. I'll finish watching some verilog videos from prior work while I scratch my head at this.

    Turns out there's a wiki for this.

    Ok, that seems like I've got the right components - I'll use two shift registers (one Parallel In Serial Out [PISO], the other SIPO) and I'll clock them on the 'co-clocking' line. That handles the FPGA side, but I also have to do something similar on the Microcontroller side... It'll set a port of bits (a byte) and then set an additional bit to tell the FPGA it has a byte ready. The FPGA will then do this out-shifting (with the possibility of in-shifting simultaneously, for a duplex connection - or I add a 2nd set of in/out tokens to async duplex). When the byte is transferred, the FPGA ticks its out line to the MCU, triggering the next byte to be parallel'd to the FPGA, and another 'go' token sent from MCU -> FPGA.
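
To make the two register types concrete, here is a minimal software model of a PISO feeding a SIPO. It is a sketch, not the FPGA design: the struct and function names are hypothetical, and shifting LSB first is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Parallel In, Serial Out: load a byte, then shift it out one bit per clock. */
typedef struct { uint8_t bits; } piso_reg;

void piso_load(piso_reg *r, uint8_t value) { r->bits = value; }

int piso_shift(piso_reg *r) {
    int out = r->bits & 1;          /* LSB leaves first (an assumption) */
    r->bits = (uint8_t)(r->bits >> 1);
    return out;
}

/* Serial In, Parallel Out: shift bits in until a whole byte can be read. */
typedef struct { uint8_t bits; } sipo_reg;

void sipo_shift(sipo_reg *r, int in) {
    assert(in == 0 || in == 1);
    /* New bits enter at the top, so after 8 clocks the first bit is the LSB. */
    r->bits = (uint8_t)((r->bits >> 1) | (in << 7));
}

/* Clock a byte through a PISO into a SIPO, as one shared clock would. */
uint8_t piso_to_sipo(uint8_t value) {
    piso_reg tx;
    sipo_reg rx = {0};
    piso_load(&tx, value);
    for (int clock = 0; clock < 8; clock++) {
        sipo_shift(&rx, piso_shift(&tx));
    }
    return rx.bits;
}
```

After eight clocks the receiving register holds the byte that was loaded on the sending side. On the real link, each of those clocks would be one edge of the co-clock line.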

    Here are my next building blocks:

    1. Actual Co-Clock:
      • each side out=in the clock line after an op, in this example, after a clock tick. Critically, this requires time to exist. I have to kick this cycle into happening by flipping the output side to 'hi' from 'lo' (or vice versa) on the '1st' cycle. I can say this '1st' cycle will be triggered from the microcontroller's output.
      • or does the receiving side have to out=in while the transmitting side out=!in and initially set hi ? this requires a state bit
    2. Shift Register: on clock cycle, outshift bit [7:0]
    3. Shift on Co-Clock: on co-clock cycle, outshift bit [7:0]
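
The question in the first item (both sides out=in with a kick, or one side out=!in) can be explored with a toy model of the loop. This is a sketch with a hypothetical `ring_edges` helper: it starts with both outputs low and counts how often side A's output changes.

```c
#include <assert.h>

/* Count edges on side A's output over a number of trips around the loop.
   Each side either copies (out = in) or inverts (out = !in) what it sees. */
int ring_edges(int a_inverts, int b_inverts, int round_trips) {
    assert(round_trips >= 0);
    int a_out = 0, b_out = 0, edges = 0;
    for (int i = 0; i < round_trips; i++) {
        int next_a = a_inverts ? !b_out : b_out;  /* A reacts to B's output */
        if (next_a != a_out) edges++;
        a_out = next_a;
        b_out = b_inverts ? !a_out : a_out;       /* B reacts to A's output */
    }
    return edges;
}
```

With one inverting side, `ring_edges(1, 0, 10)` sees an edge on every trip, so the loop clocks itself. With two copying sides, `ring_edges(0, 0, 10)` never sees an edge, which is why that variant needs an external kick on every cycle. Two inverting sides settle after a single edge. In this model, an odd number of inversions in the loop is what keeps it running.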

    I want to say I figured this out with pure, raw brilliance, but I also googled around. this contains a few examples of shift registers. I also watched this which helped disambiguate when / how different inputs and outputs are set, when statements are evaluated... etc. It's a messy, massively parallel world!

    OK, mad head-scratching later, I figured out how to do the clock with a reg of a few bits ... and how to do parallel in. Parallel in involves setting up an input as ... something that looks like an array. Here's the modified Ring, operating on a clock...

    • reg is used for any variable assigned in an always@ statement
    • wire (aka net) is used for any variable assigned in a continuous assign statement

    We can use 'always @(edge clock)', or 'always @(var, var, var)' where the block is evaluated whenever any of these vars change, or 'assign var =' for continuous assignment.

    Hierarchically, when we write a module and instantiate it in a higher level module, the lower level module exists within the higher level module, as hardware. It is not a function we evaluate, it is a piece of hardware that is one of the components that make up the higher level module.

    Because modules are considered hardware, they cannot be instantiated or used inside of a procedural block (always@). They can only be used with continuous assignments.

    OK, cool, I think learning just takes raw time, occasionally. I started thinking about Verilog three days ago now and I'm finally starting to be able to do things-that-I-want-to-do. Here's my working bit-shifter

    module TinyFPGA_A2 (
    	input [3:0] DIN,
    	output SO,
    	output CLK
    );
    
    	wire clk;
    	
    	reg [3:0] din; // latched copy of the 4-bit wide input port
    	reg [2:0] counter; // 3-bit counter, counts 0 to 7
    	reg out; // out data
    	reg outclk; // shows counter edge
    	
    	OSCH #( // setup the oscillator
    		.NOM_FREQ("2.08")
    	) internal_oscillator_inst (
    		.STDBY(1'b0),
    		.OSC(clk) // oscillator bangs the clk wire back-and-forth
    	);
    	
    	always @(posedge clk) begin // every time the clk wire has a positive edge, do:
    		counter <= counter + 1; // increment the counter
    		if(counter == 0) begin // on a new cycle,
    			din = DIN; // read data in from port
    			outclk = 1'b1; // expose counter clock
    			end
    		else begin
    			outclk = 1'b0; // counter clock low
    			end
    		
    		out <= din[counter]; // set the out wire to be equal to the i-th element in the data in port
    	end
    	
    	assign SO = out; // serial output is the out value
    	assign CLK = outclk; // CLK output is the outclk value
    
    endmodule

    And the hardware side:

    BLOCK RESETPATHS ;
    BLOCK ASYNCPATHS ;
    
    //LOCATE COMP "pin1" SITE "13" ;
    //LOCATE COMP "pin2" SITE "14" ;
    
    LOCATE COMP "CLK" SITE "13" ; // pin1
    LOCATE COMP "SO" SITE "14" ; // pin2
    
    LOCATE COMP "pin3_sn" SITE "16" ;
    LOCATE COMP "pin4_mosi" SITE "17" ;
    LOCATE COMP "pin5" SITE "20" ;
    LOCATE COMP "pin6" SITE "21" ;
    LOCATE COMP "pin7_done" SITE "23" ;
    LOCATE COMP "pin8_pgmn" SITE "25" ;
    LOCATE COMP "pin9_jtgnb" SITE "26" ;
    LOCATE COMP "pin10_sda" SITE "27" ;
    LOCATE COMP "pin11_scl" SITE "28" ;
    
    //LOCATE COMP "pin16" SITE "4" ;
    //LOCATE COMP "pin17" SITE "5" ;
    //LOCATE COMP "pin18_cs" SITE "8" ;
    //LOCATE COMP "pin19_sclk" SITE "9" ;
    //LOCATE COMP "pin20_miso" SITE "10" ;
    
    LOCATE COMP "DIN[0]" SITE "4" ; // pin16
    LOCATE COMP "DIN[1]" SITE "5" ; // pin17
    LOCATE COMP "DIN[2]" SITE "8" ; // pin18
    LOCATE COMP "DIN[3]" SITE "9" ; // pin19
    //LOCATE COMP "DIN[4]" SITE "10" ; // pin 20
    
    LOCATE COMP "pin21" SITE "11" ;
    LOCATE COMP "pin22" SITE "12" ;
    

    And here's the setup - I use resistors to hi / low on the breadboard to 'simulate' input bits - also, this port should be 8-bits wide, not 4, but here we are. I'm tracing the clock output line (pin1) and the data output line (pin2).

    first-shift-hardware

    And the scope traces, the yellow line is the clock indicator, the blue line is the data. The data being read out is the 'hi, hi, lo, hi' port you see in the breadboard, repeated twice because my counter is counting 8 bits.
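
The repeated pattern can be reproduced with a small model of the shifter. This is a sketch under assumptions: only the low two bits of the 3-bit counter are used to index the 4-bit port (which would explain the repetition), DIN[0] is taken as the first 'hi', and the one-clock output delay of the real design is not modelled. `shift_out_cycle` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the 4-bit shifter: a 3-bit counter steps 0..7 and selects a bit
   of the 4-bit port. Only the low two counter bits are used as the index
   here, an assumption that reproduces the repeated pattern on the scope. */
void shift_out_cycle(uint8_t port, int out[8]) {
    assert(port < 16);
    for (unsigned counter = 0; counter < 8; counter++) {
        out[counter] = (port >> (counter & 3u)) & 1;
    }
}
```

For a port value of 0x0B (hi, hi, lo, hi reading from bit 0 upwards), the eight outputs are 1, 1, 0, 1, 1, 1, 0, 1: the same four bits twice per counter cycle.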

    first-shift-scope

    Now that I see I can do this, I'm going to go forwards with a 'complete' implementation - I'm going to make a little XMEGA dev-board to hang on this breadboard w/ the TinyFPGA - I'll make two of these - and then I'll try to get them to chat with each other. This will cause me to have to close out this project with ready states, and 'banging' on the XMEGA side as well. Neat.

    XMEGA Breadboard and USB

    Here's my plan: XMEGA w/ USB & some pins exposed to a breadboard. Maybe this will be generally handy for other stuff also.

    OK, ready:

    I put this in a repo, here. I have the USB CDC working, I can send characters through the serial port, and return them. In my demo, I'm going to put a byte on the port, send it via fpga to the other breadboarded device, and then read it out on that port.

    Here's one port, counting to 255:

    int main (void)
    {
      sysclk_init();
      irq_initialize_vectors();
      cpu_irq_enable();
      board_init();
      
      usb_init();
        
      PORTD.DIRSET = PIN3_bm | PIN4_bm; // set output (leds)
      PORTD.DIRCLR = PIN5_bm; // set input (button)
      
      PORTA.DIRSET = PIN0_bm | PIN1_bm | PIN2_bm | PIN3_bm | PIN4_bm | PIN5_bm | PIN6_bm | PIN7_bm;
      
      uint8_t counter = 0; // a uint8_t wraps from 255 back to 0 by itself
      while(1){
        PORTA.OUTSET = counter; // drive high every pin whose bit is 1
        PORTA.OUTCLR = ~counter; // drive low every pin whose bit is 0
        counter ++;
        delay_ms(15);
      }
    }
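
The OUTSET / OUTCLR pair in that loop is worth a closer look. Below is a host-side sketch of what the two writes do to the port. `port_model` and its helpers are hypothetical stand-ins for the real registers, not part of the XMEGA headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one XMEGA port's output register. */
typedef struct { uint8_t out; } port_model;

/* OUTSET: every 1 bit in the mask drives that pin high; 0 bits are left alone. */
void port_outset(port_model *p, uint8_t mask) { p->out |= mask; }

/* OUTCLR: every 1 bit in the mask drives that pin low; 0 bits are left alone. */
void port_outclr(port_model *p, uint8_t mask) { p->out &= (uint8_t)~mask; }

/* The two-step write used in the loop above. */
void port_write_two_step(port_model *p, uint8_t value) {
    port_outset(p, value);            /* raise the pins that should be high */
    port_outclr(p, (uint8_t)~value);  /* lower the pins that should be low */
    assert(p->out == value);          /* the port now holds exactly value */
}
```

Between the two writes the pins briefly show the OR of the old and new values: going from 0xF0 to 0x3C passes through 0xFC. That does not matter for a slow counter on LEDs, but a device sampling the port could catch the intermediate state. Writing the port's OUT register in a single store would avoid it.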

    counting

    With some verilog, this is now counting on the serial line out...

    counting-out

    This is a different ball game at 133MHz:

    counting-133

    The Message Pass

    OK, It's late, this needs more time. I'm going to wrap this up with a simple implementation and call it (for now). I'm going to implement a (clocked, boo) shift-in register on the other side. Here's the Verilog on TX

    module TinyFPGA_A2 (
      input [7:0] DIN, // parallel data in port 
      output DOUT, // serial data out
      output CLKOUT, // literal clock
      output TRGOUT // starts frame
    );
    
      wire clk;
      
      reg [7:0] din; // latched copy of the 8-bit wide input port
      reg [2:0] counter; // 3-bit counter, counts 0 to 7
      reg dout; // out data
      reg clkout; // high for one clock per byte (frame trigger)
      
      OSCH #( // setup the oscillator
        .NOM_FREQ("2.08") // 2.08, 10.23, 19.00, 44.33, 66.50, 88.67, 133
      ) oscillator_instance (
        .STDBY(1'b0),
        .OSC(clk) // oscillator bangs the clk wire back-and-forth
      );
        
      always @(posedge clk) begin // every time the clk wire has a positive edge, do:
        dout <= din[counter - 3'd1]; // send bit (counter - 1); the 3-bit subtraction wraps 0 back to 7
        if(counter == 7) begin // on a new cycle,
          din <= DIN; // read data in from port
          clkout <= 1'b1; // expose counter clock
          end
        else begin
          clkout <= 1'b0; // counter clock low
          end
        counter <= counter + 1; // increment the counter
        end
      
      assign DOUT = dout; // serial output is the out value
      assign CLKOUT = clk; // clock output is the raw oscillator clock
      assign TRGOUT = clkout; // trigger output marks the start of a frame
    
    endmodule

    and

    BLOCK RESETPATHS ;
    BLOCK ASYNCPATHS ;
    
    // pins 12 -> 15 are JTAG
    
    //LOCATE COMP "pin1" SITE "13" ;
    //LOCATE COMP "pin2" SITE "14" ;
    
    LOCATE COMP "DIN[0]" SITE "13" ; // pin1 (lsb)
    LOCATE COMP "DIN[1]" SITE "14" ; // pin2
    LOCATE COMP "DIN[2]" SITE "16" ; // pin3
    LOCATE COMP "DIN[3]" SITE "17" ; // pin4
    LOCATE COMP "DIN[4]" SITE "20" ; // pin5
    LOCATE COMP "DIN[5]" SITE "21" ; // pin6
    LOCATE COMP "DIN[6]" SITE "23" ; // pin7
    LOCATE COMP "DIN[7]" SITE "25" ; // pin8 (msb)
    
    LOCATE COMP "NDIN" SITE "27" ; // pin10 (new data in)
    LOCATE COMP "NDREADY" SITE "28" ; // pin11 (ready for new data)
    
    LOCATE COMP "DOUT" SITE "12" ; // pin22
    LOCATE COMP "TRGOUT" SITE "11" ; // pin21
    LOCATE COMP "CLKOUT" SITE "10" ; // pin20

    I got this to decode on the other end.

    co-counting

    This is the RX Side:

    module TinyFPGA_A2 (
      output [7:0] DOUT, // parallel data out port 
      input DIN, // serial data in
      input CLKIN, // literal clock
      input TRGIN // starts frame
    );
      
      reg dhold;
      reg [7:0] dout; // 8-bit wide output register
      reg [3:0] counter; // index of the next bit to store
      
      always @(posedge CLKIN) begin // every time the clk wire has a positive edge, do:
        dout[counter] <= DIN; // shift in new data
        if(TRGIN) begin // signals new block
          counter <= 0;
          end
        else begin // without this else, the increment would override the reset
          counter <= counter + 1;
          end
        end
      
      assign DOUT = dout; // parallel output is the dout register
    
    endmodule

    and

    BLOCK RESETPATHS ;
    BLOCK ASYNCPATHS ;
    
    // pins 12 -> 15 are JTAG
    
    //LOCATE COMP "pin1" SITE "13" ;
    //LOCATE COMP "pin2" SITE "14" ;
    
    LOCATE COMP "DOUT[0]" SITE "13" ; // pin1 (lsb)
    LOCATE COMP "DOUT[1]" SITE "14" ; // pin2
    LOCATE COMP "DOUT[2]" SITE "16" ; // pin3
    LOCATE COMP "DOUT[3]" SITE "17" ; // pin4
    LOCATE COMP "DOUT[4]" SITE "20" ; // pin5
    LOCATE COMP "DOUT[5]" SITE "21" ; // pin6
    LOCATE COMP "DOUT[6]" SITE "23" ; // pin7
    LOCATE COMP "DOUT[7]" SITE "25" ; // pin8 (msb)
    
    LOCATE COMP "NDIN" SITE "27" ; // pin10 (new data in)
    LOCATE COMP "NDREADY" SITE "28" ; // pin11 (ready for new data)
    
    LOCATE COMP "DIN" SITE "12" ; // pin22
    LOCATE COMP "TRGIN" SITE "11" ; // pin21
    LOCATE COMP "CLKIN" SITE "10" ; // pin20

    Now we're doing the business! Here's what the signal looks like on the scope - blue line is the data (in series), the yellow channel is the 'start' bit, and the purple line is the clock. Note: no co-clocking here (boo), but I feel like I'm just a few steps away from bringing that in.
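
The role of the 'start' bit can be shown with a small model of the framing. This is a sketch only, and not a cycle-accurate copy of the Verilog above: the type and function names are hypothetical, bits go LSB first, and the trigger is assumed to be high on the first bit of each byte.

```c
#include <assert.h>
#include <stdint.h>

/* One clock tick on the wire: the frame trigger and the serial data bit. */
typedef struct { int trg; int bit; } wire_sample;

/* Receiver state: the byte being assembled and the next bit position. */
typedef struct { uint8_t shift; unsigned index; } rx_state;

/* TX side: serialize one byte, LSB first, raising trg on the first bit. */
void tx_frame(uint8_t value, wire_sample out[8]) {
    for (unsigned i = 0; i < 8; i++) {
        out[i].trg = (i == 0);
        out[i].bit = (value >> i) & 1;
    }
}

/* RX side: call once per clock. Returns 1 when *byte holds a full frame. */
int rx_clock(rx_state *s, wire_sample w, uint8_t *byte) {
    if (w.trg) {            /* trigger: realign to the start of a frame */
        s->index = 0;
        s->shift = 0;
    }
    assert(s->index < 8);
    s->shift |= (uint8_t)((w.bit & 1) << s->index);
    s->index++;
    if (s->index < 8) return 0;
    *byte = s->shift;
    s->index = 0;
    s->shift = 0;
    return 1;
}
```

If the receiver starts listening partway through a byte, the stray bits it collects are thrown away as soon as the next trigger arrives, and the following byte decodes cleanly. Without the trigger line, the two counters would have no way to agree on where a byte begins.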

    co-counting-scope

    OK - I'm going to put this away for now, I feel like I've had as much success as I will until I can come back to this project with neurons that are more plastic / less time-constrained.

    In Summary

    • I learned (the basics of) Verilog
    • I discovered that even really basic FPGAs have VERY fast IO (133MHz seen here)
    • I showed a 'co-clocking' scheme between two FPGAs, but ran out of time for implementing it along with message passing.

    Next Steps

    I really want to keep going down this path in the future... This presents the possibility of passing messages at really big datarates without using any kind of preamble / processor overhead / RF fanciness / expensive hardware. In my brief overview of existing network technology for robotics, the industry is a bit stuck in the mud with switched Ethernet[10] and it seems like no-one has a good solution for super-simple low-level networking.

    Also, I felt like I really hit a mental wall here - like, I was not able to carefully enough design this experiment. So, just that is enough motivation to take another shot at it in the future. I would love to implement this hardware layer in a complete revision of my larger networks project here.

    Footnotes

    1. Physical Layer
    2. Because networking is a real ABD[3], you know?
    3. Acronym Based Discipline
    4. Field Programmable Gate Array
    5. But we certainly don't want to run a cable with 18 lines.
    6. Quad SPI
    7. Bless you, open source hardware-ists.
    8. Verilog is the
    9. OMG Neato
    10. Which has all of those things: RF Fanciness, Expensive & Bulky Hardware, and Processor Overhead. Also unnecessarily large message sizes: an Ethernet Header is ~42 bytes, and messages often need only be 1-8 bytes in length. Yikes!