The YGREC8's Integration Manual

Created sam. 30 sept. 2023 21:52:48 CEST by whygee@f-cpu.org
Version sam. 16 déc. 2023 05:22:02 CET


PRELIMINARY / WORK IN PROGRESS
It's not yet a reliable and definitive reference!

Check the latest news and updates at ygrec8.com
and download the latest version from src.ygrec8.com
or if you prefer git : gitlab.com/fhwd/ygrec8

 ©2023 Yann Guidon
all documentation published under CC BY-NC-SA 4.0 license.

Introduction

Along with this document, you should also read:

The YGREC8's Main manual contains the general overview of the project, the ISA and the programming model.
The YGREC8's Assembly Language manual for details and specifications for assemblers and disassemblers: programs that transform text into binary instructions or vice versa.
Look at the YGREC8's Programming manual for exemple code sequences, program samples and coding idioms.

This manual describes one intended implementation for integration in a microchip. Other structures and organisations might be used for other technologies.

System Overview

A complete and functional YGREC8 system requires more than the main datapath. The following diagram shows the other necessary blocks :

The debug system is essential during development: this is the circuit that lets the developer write and read data inside the system. It can access:

the current and next instruction pointers (read/write)
the result, SRI and SND (read only)
the flags (read only)
the current status of the core (read only)

The clocks are very different so the commands are resynchronised by talking to the FSM, which also receive a small set of commands such as RESET, START, STEP, STOP, WR_INSTR, WR_IMEM etc. The FSM can also inhibit the update of the registers or the PC for example, so the currently injected instruction is not considered as coming from the program memory's instructions stream.

From there, the developer can inject arbitrary instructions to access the other parts of the system (memory blocks, I/O etc.), dump or modify the state of the system... The debug circuits are totally asynchronous from the main clock, and usually significantly slower. The developer must stop the core when inspecting or writing data. It is a bit more intrusive than other debug systems but it is designed to use as few transistors as possible.

A version of the debug circuit called DTAP has already been designed as a standalone circuit but it could be replaced by anything that does the job, such as the Caravel harness used by eFabless (though it is much more complex and uses more real estate, meaning it is more prone to failure). Or a more classic JTAG scan chain.

The core should be able to work in "standalone" mode, when no host controls the debug circuits. This is usually controlled by the state of some pins during /RESET : the System FSM can wait for a debugger to take control, or the FSM can load external instructions, either passively (as a SPI slave) or actively (by reading from an external SPI Flash chip). If the Y8 system has no internal code storage, you can bootstrap the core by sending 512 bytes through SPI (with an Arduino, RPi or ESP). The System FSM also handles the RESET/LOAD/START/STEP/SKIP/STOP commands received from the debugger.

The Instruction FSM manages the lower-level details of scheduling the instructions and their details (such as OVL, HLT, LDCx, writes to PC...). This split of the FSM greatly simplifies the design and the modularity. More states are added as more features and corner cases are developed. Meanwhile, the System FSM remains core-agnostic.

The Y8 core does most of the remaining work : decodes instructions, fetches operands, performs operations, writes the results to the registers, checks conditions and move things around... It reads instructions from the Instruction Register, which usually transparently interfaces the Instruction SRAM. The instruction can also be read or written by the debug circuit, for inspection, injection as well as for writing a new program or overlay into the Instruction SRAM.

The Data SRAM is just a plain memory array. 2 areas of 256 bytes each do the trick but some banking could take place if needed and if the necessary resources are available.

The IO & Config Registers is a user-defined area of 512 bytes that does everything that the rest can't. This is usually where customisation takes place, signals enter or exit the system, default behaviours are configured...

Still missing : the interrupt controller and exception handling (everything jumps back to address 0 for now). Stay tuned.

The debug system (DTAP)

The default implementation of the debug system is the DTAP, which is now spun off into a parallel project. A simulator or emulator can easily perform its function during development.

The interface between the debug circuit is asynchronous, must be correctly sampled (beware during signal transitions) and the rest of the Y8 system must have at least these signals :

Current PC (read) [8] => write by injection: SET 123 PC
Next PC (read) [8] (can not be written, depends on current instruction)
SRI (read) [8] => also used to dump the registers and/or memory
SND (read) [8] => also used to dump the registers and/or memory
Result (read) [8] => also used to read I/O registers and others
Current instruction (read & write) [16]
Status/Condition bits (read) [8] (C, S, Z, b0, b1, b2, b3, writeback)
System FSM state (read) [4] (affected by the command register)
Instruction FSM state (read) [4] (affected by the instructions)
Command register (read & write) [4] (the famous NOP, RESET, START, STEP, SKIP, STOP, WR_INSTR, WR_IMEM etc. sent to the system FSM)

and probably more but not much because most basic functions are already possible with some combination of commands.

The System FSM

The System FSM is part of the Y8 system, it is clocked like the core and helps synchronise the debug commands. It manages initialisation, loads the initial code overlay, controls the high-level core behaviour and interfaces with the debug system. See this discussion.

The host controls the debug session and is responsible for enforcing correct sequences and timing to prevent overlaps or undefined states. Going from one state to another must use a buffer NOP command. Some states are duplicated to allow the command register to contain single-cycle orders during multiple cycles (STEP, SKIP, WR_IMEM...). Four states are reached when the command goes back to NOP. Here is a (preliminary) list of the commands that the System FSM can receive:

RESET : reinitialise the core, clear PC, reload the boot overlay. Has precedence over all other commands.
NOP : do nothing. No command.
START : start or resume program execution.
STEP : execute the current instruction then STOP.
SKIP : like STEP but no writeback of the result. Prevents a CALL from jumping, for example.
STOP : if RUNning, stops program execution (inhibits PC write and writeback)
WR_IMEM : initiate a write cycle of the program memory, with the 16 bits written to the instruction register.

The internal states can be read by the debug system.

Some of the required inputs for this FSM:

external /RESET pin
configuration pins
command register
PC overflow bit (from the incrementer)
Opcode=OVL
Opcode=INV
Trap/watchpoint
...

Some of the required outputs for this FSM:

Inhibit PC update
Clear PC
Write Instruction Memory enable/strobe
...
and the FSM state

The Instruction FSM

This is the ISA-specific part of the scheduling system, which needs more direct information from the core, in particular from the instruction decoder.

The PRFX state is almost identical to the INST state, with some transition conditions to IDLE inhibited (RUN, STEP...).

BTW it would be extra cringe to have an instruction sequence like PF PC , LDCH R1 R1 or something like that, since it would scan several different states in sequence, for no useful purpose. But the FSM must not let such a corner case fall into a crack...

Some of the required inputs for this FSM:

System FSM state
Y8 core instruction decoder
...

Some of the required outputs for this FSM:

PC update control {increment, clear, set to Result, inhibit}
Writeback enable/inhibit
...

This FSM will stop if the instruction is not a valid one, if more than one prefix is used or if an overlay is requested. The System FSM will manage what to do with these conditions: reboot ? load or switch to a different overlay ? This is discussed later in this manual but so far, the core reboots or freezes.

The Instruction Register

At the start, it is a simple latch to buffer the Instruction Memory's output. The memory's read port is also multiplexed with another register written by the debug system to inject arbitrary instructions.

The program memory is directly addressed by the PC register (to prevent circuit duplication). During an overlay load sequence (controlled by the System FSM), the PC is reset then incremented, all core WriteBacks are inhibited, the Instruction Register is loaded with new values then it is written to the instruction memory. The cycle stops when the PC has overflown, reaching again a cleared state, and execution resumes.

More features can be added to this simple unit, to allow better debugging by triggerring traps and/or incrementing event counters:

one or more registers to compare the PC
one or more registers to compare the instruction and/or its subfields
one or more registers to compare the result bus
...

The instruction register could also later inject one or more prerecorded instruction sequences to manage transitions (such as state backup/restoration) when an exception or IRQ occurs, so a small SRAM or ROM block might emulate microcode (as a last resort because the FSMs must also be adapted).

...

Datapath and scheduling

The core's speed should be decent, with a pretty short critical datapath, and most of the instructions execute in one clock cycle. Exceptions include the LDCx instructions, computed writes to PC and prefixes.

Legend:

The green wires represent the working data
The red wires represent control signals coming from the instruction register
The blue wires represent the condition logic (for inhibiting writeback of the result) and the PC (write-to-PC, call, etc.)

The exceptions and scheduling are handled by the System FSM and the Instruction FSM.

Memory blocks expect a stable address to be latched at the start of every clock cycle. Both Program Memory and Data Memory operate simultaneously, independently, they are not multiplexed.

The Data Memory could be a single SRAM block because only one register can be updated during a cycle. A1, D1, A2 and D2 can act as "cache" for the memory blocks. The single-memory-block case adds some complexity because A1 and A2 must be multiplexed for each access. Using two independent blocks looks easier at first because their respective addresses are taken directly from A1 and A2.

The program memory is a bit more delicate because the incremented address from PC must first be latched. Depending on the latency ratio between the memory and the core, there are two possible configurations:

If the memory is fast compared to the core's logic (such as FPGA) then the memory's read port can be directly latched into the instruction register. The new PC computed during one cycle will give the instruction for the next cycle. It's the simplest case.
If the memory is slow and the clock cycle must be shortened, the instruction fetch must take a whole cycle, which overlaps the execution of the preceding instruction. This pipeline might create scheduling issues when writing to PC, CALL and with LDCx. The bubbles and the inhibition of some operations are then controlled by the FSM.

The datapath is conceptually split into 2 stages:

The first stage decodes the instruction and fetches the operands from the register set, or multiplexes the immediate values. At the "Nexus", both SRI and SND operands are available.
The second stage does all the rest: perform the operation, multiplexes the results in the "Suxen", evaluate the flags, check for the condition to inhibit the writeback and commits the value(s) to the register and memory.

The first stage is typically as fast as the increment of the PC so the value of SRI can be multiplexed directly to the Instruction Memory address bus. This allows some operations like LDCx, CALL or SET to execute with no stall. However if the value must pass through the other units, such as during a relative jump (or any other "write to PC" instruction), the instruction fetch is stalled while waiting for the result to be available on the Suxen hub. This is why the Instruction FSM has an extra "write to PC" stall state. Of course, this is only a suggested organisation and the scheduling can be widly different in other situations.

The I/O and Configuration Registers space

The 512 registers are only accessed with an immediate address to shorten the decoding time and give enough time for the many registers to react. This bus has a very large fan-in and fan-out, on top of the complex decoding, so it is expected to be "slow". Eventually, wait states could be added in the FSM if the IO bus is too crowded.

The IO system is modular, user-defined, so it is not part of the core. Indirect addresses might be possible (and even self-incrementing) through specific registers but this is independent from the core as well. There is a simple synchronous interface, without indication that the address is implemented (or not):

Address (9 bits)
Data IN (8 bits)
Data OUT (8 bits)
Read enable
Write enable
Clock

There is no standard map of the IO space, but here are some of the intended uses :

Four Scratch registers (used during exception or interrupt)
A "reset cause" register (MCLR, jump to 0, brownout, exception, interrupt, ...)
Shadow status register (to save or set the carry, sign and zero flags)
Bank registers (for extending A1 and A2)
Overlay register (for the program memory)
Timer : 8 byte-wide and chainable counters, and their configuration
Condition source selection registers (for b0 b1 b2 and b3 conditions)
Interrupt controller (masks, active signals, trigger modes...)
GPIO (bytes with copy/set/clear and direction addresses)
Clock source & frequency
UART
Trap filters and conditions (trap on opcode, register, address...)
Watchdog timer
Some framebuffer ?
SPI master and/or slave
Your own custom devices
LEDs and buttons, of course
HD44780 4-bit interface ?
...

As already mentioned, fan-in and fan-out is the main challenge. The addresse maps change wildly across implementations because decoding can use boolean shortcuts. This is also why 9 bits are used, and not 8 : some address bits could be directly used as "enable" signals. This is discussed here but too many things have changed since. The interface names should be standardised to allow porting the source code, and the address map could be "compiled" with some smart software (one day).

Overlays

The YGREC8 has a lousy addressing space, like we're in 1972 or so. There is no split of the addressing space into several remapped banks like in other 8-bit CPUs (Z80 or 68xx). The approach is to simply swap or switch the whole program memory to different contents. In the YGREC8 system, each block of 256 instructions is called an overlay.

If the physically available SRAM space is too tight, the System FSM can reload the whole program memory from external sources (usually a SPI Flash chip), possibly even starting at a different offset, thus loading new code. It's slow, but it reuses the available mechanisms with minimal changes or updates.
Even if more SRAM is available, only 256 instructions still remain reachable. Jumping from one overlay to another will simply change the SRAM's Most Significant Bits of the address, controlled by a hidden register (and restarting at address 0). It's faster but the whole SRAM must be preloaded, probably using the previous mechanism.

A program can switch to a different overlay by hijacking the CALL instruction: it makes no sense to CALL and write the PC+1 to PC so this is interpreted as the OVL opcode. The operand can be a register, a 4-bit or 8-bit immediate, and the first operands allow conditions. Since a register can be an operand, it is possible for an overlay to "call" another, which can then "return" to the first one when a function is completed. Saving the PC+1 adds some overhead though.

Overlays can later be used to handle traps, interrupts and other system management tasks, and the YGREC8 could cycle through several "user" overlays, which share common functions provided by yet other "library" overlays.

How many overlays can there be ? Maximum 256 since the OVL opcode can only specify one byte. The value 255 means HALT and is currently hardwired to stop the core (it may be implemented as a real overlay that provides some debug functions). Overlay#0 is the bootstrap overlay. Overall, the 64Ki instructions (128KiB) could look like this:

BOOT: default entry point to start the system
setup1: initialise some I/O registers
setup2: initialise some memory from OVL4 or OVL5
setup3: more initialisation
DataTable1: some constant tables
DataTable2: more constant tables
Program_main: the start and main body of the program
Program submodule 1
Program submodule 2
Program submodule 3
...
...
Generic library 1
Generic library 3
Generic library 2
...
...
...
...

...
...
Trap handler
IRQ handler
HALT: this overlay could be actually implemented to catch errors and handle a graceful restart, or even provide debug and breakpoints, but for now this situation will only go to the STOP condition in the system FSM.

Traps, exceptions, interrupts

So far, exceptional conditions are not handled specifically and it causes the System FSM to reboot the core. A "reason" I/O register would help distinguish the sources of reboot but with a space of only 256 instructions, there is not much to both do and catch.

A larger program memory space will help with easier handling: the traps and interrupts can then be configured to switch to a user-given overlay.

To be continued...