The YGREC8's Assembly Language Manual

Created Mon. 16 Oct. 2023 05:50:46 CEST by whygee@f-cpu.org
Version Sat. 16 Dec. 2023 03:43:00 CET


PRELIMINARY / WORK IN PROGRESS
This is not yet a definitive reference; check again soon.

Check the latest news and updates at ygrec8.com
and download the latest version from src.ygrec8.com
or if you prefer git : gitlab.com/fhwd/ygrec8

 ©2023 Yann Guidon
all documentation published under CC BY-NC-SA 4.0 license.

Introduction

This manual is limited to details and specifications for assemblers and disassemblers: programs that transform text into binary instructions or vice versa.

Look at the YGREC8's Main manual for the general overview of the project, the ISA and the programming model.
Look at the YGREC8's Programming manual for example code sequences, program samples and coding idioms.

General syntax, conventions and other rules

This specification tries to keep the source code syntax of YGREC8 programs as simple and as easy to understand and parse as possible. The assembler can easily be written with high-level tools (such as JavaScript, or bison and flex) but low-level languages (even assembly languages) should work as well. Ultimately (even though this is very far-fetched) the YGREC8 should be able to assemble its own programs, so frivolous features are avoided because of the obvious space and resource constraints.

The assembler may have one, two or more passes depending on the implementation.
Invocation of the assembly program is not specified. It can be a command-line interface, a graphical interface or a module in another program.
The recommended file name extension for assembly language input is .y8 and the syntax is described by this document.
No file inclusion is provided. Just concatenate all the files before assembly.
No aliasing or macro is provided. If needed, preprocess using an existing program such as m4 or cpp.
The file name extension for text-encoded binary code is .hyx (see https://gitlab.com/fhwd/HYX/). Raw binary code (.byn) is also possible but less flexible and lacks metadata such as addresses or purpose (data or instructions? overlay number?).
Instructions are output Lower Byte first (Little Endian). All addresses must be in increasing order.
By default, code is assembled starting at address 0. Don't forget to .ORG if you need otherwise.
All uninitialised and unused instruction locations must be set to all-ones 0xFFFF (like a blank Flash or EEPROM cell). Thus the program correctly traps with the INV opcode when an unexpected condition occurs, which is clearly distinct from NOP.
Each instruction has its own line, and so do the PF prefix and the labels (which are equivalent to .EQU).
Lines cannot be merged with a trailing "\" (unlike C and derived languages).
Comments start at the semicolon ";" and extend to the end of the current line.
During parsing, each line is converted to upper case to normalise everything (even symbols). Commas "," are treated as spaces (in case you have this old habit). The following lines give the same result:

OR D1, D2 IFC
or d1  d2,ifc
Or D1  d2 IfC
oR d1, D2, iFc
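
This normalisation step can be sketched in a few lines of Python. This is only an illustration: the function name normalise_line is hypothetical, and the sketch assumes that no character or string literal appears on the line (a real assembler would have to protect those from the upper-casing and from the comment stripping).

```python
def normalise_line(line):
    """Return the list of upper-case words found on one source line."""
    code = line.split(";", 1)[0]             # drop the comment, if any
    code = code.upper().replace(",", " ")    # one case, commas act as spaces
    return code.split()                      # split on runs of whitespace

variants = [
    "OR D1, D2 IFC",
    "or d1  d2,ifc",
    "Or D1  d2 IfC",
    "oR d1, D2, iFc ; same instruction again",
]
for text in variants:
    print(normalise_line(text))
```

All four variants produce the same list of words, which is what the instruction assembler receives.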

Numbers are recognised by their first character: a minus sign or a decimal digit. They are written in decimal by default but can take an optional suffix that specifies the base:

d (decimal)
h (hexadecimal)
o (octal, optional)
b (binary)

The equivalent regular expression would be something like:

(([-][0-9a-fA-F])|([0-9]))[0-9a-fA-F]*[bBdDhHoO]?
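
As an illustration, here is a small Python sketch that uses this regular expression to convert a token into an integer. The function name parse_number is hypothetical. One assumption is made loudly: since "b" and "d" are also hexadecimal digits, the sketch always treats a trailing b, d, h or o as the base suffix, so a hexadecimal number must end with "h".

```python
import re

NUMBER_RE = re.compile(r"(([-][0-9a-fA-F])|([0-9]))[0-9a-fA-F]*[bBdDhHoO]?")
BASES = {"B": 2, "O": 8, "D": 10, "H": 16}

def parse_number(token):
    """Convert a number token to an int, or raise ValueError."""
    if not NUMBER_RE.fullmatch(token):
        raise ValueError(f"not a number: {token}")
    text = token.upper()
    negative = text.startswith("-")
    if negative:
        text = text[1:]
    base = 10                                  # decimal by default
    if len(text) > 1 and text[-1] in BASES:    # assumption: last letter is the suffix
        base = BASES[text[-1]]
        text = text[:-1]
    value = int(text, base)                    # rejects digits invalid for the base
    return -value if negative else value
```

With these rules, "101b" is read as binary, "0FFh" as hexadecimal and "1F" is rejected because it lacks the "h" suffix.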

Each number must fit in the desired field. All numbers are considered signed but can also be written as unsigned numbers for convenience (as long as the binary representation fits in the field).

Imm9 range: -256 to 511
Imm8 range: -128 to 255
Imm4 range: -8 to 7 (8 to 15 are not recommended because this field is sign-extended, so it may output a warning)
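
The three ranges follow the same pattern: for a field of N bits, the lowest accepted value is the most negative signed number and the highest is the largest unsigned number. A possible check is sketched below in Python; the function name encode_immediate is hypothetical and the warning text is only an example.

```python
import warnings

def encode_immediate(value, bits):
    """Check that value fits in a field of `bits` bits, return its bit pattern."""
    lowest = -(1 << (bits - 1))      # most negative signed value
    highest = (1 << bits) - 1        # largest unsigned value
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {bits} bits")
    if bits == 4 and value > 7:
        # Imm4 is sign-extended, so 8 to 15 really mean -8 to -1
        warnings.warn(f"Imm4 value {value} will be read as {value - 16}")
    return value & highest           # two's-complement bit pattern
```

For example encode_immediate(-256, 9) and encode_immediate(511, 9) are both accepted, while encode_immediate(512, 9) raises an error.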

User's symbols and intermediary results in expressions are stored as signed numbers coded in 32 bits (int32_t in C). The assembler must check the range for every instruction field. All arithmetic overflows result in throwing errors, or at least a warning if the corresponding option is enabled.
When expecting a number, the assembler can accept a direct number (as in the previous point), a character ('a'), a symbol (previously defined and valued), or an expression (always between parentheses) that accepts all 4 of these data types.
Some basic arithmetic may be provided to compute immediate values (+, -, *, >>, >>>, <<, smin, smax, etc.) inside a binary "expression" (see above). There is no precedence because every operation uses its own pair of parentheses.
UTF-8/Unicode is not supported, only plain ASCII. Symbol names can only contain letters (A-Z), digits (0-9) and underscores (_) but may not start with a digit (otherwise this is considered as a number).
The default maximum length of a symbol (or label) is 24 ASCII characters. Depending on the implementation, symbol names that exceed this length may either be supported, or throw an error, or issue a warning or use only the 24 leading characters. Check and configure your source code and tools carefully, in case you need longer names.
The symbols and reserved words can not be redefined (by .EQU, .OVL, label or others). This will throw an error.
The order of the operands immediately follows the structure of the instruction, so you usually have the following order : OPCODE SRI SND (SRI being an immediate number or register name). However the condition is always at the end. A typical instruction would be:

ADD -6 PC IFNZ ; subtract 6 from PC if Z flag is 0

The above instruction loops back 6 instructions until the result of the preceding computation is zero.
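
The rules on symbol names given earlier in this section (allowed characters, 24-character limit, no redefinition, signed 32-bit values) can be sketched as follows in Python. The names define_symbol and RESERVED are hypothetical, the reserved list is abridged, and the sketch arbitrarily picks the "throw an error" policy for names that are too long.

```python
import re

SYMBOL_RE = re.compile(r"[A-Z_][A-Z0-9_]*")   # may not start with a digit
MAX_SYMBOL_LENGTH = 24
RESERVED = {"ADD", "OR", "PC", "D1", "IFC", "NOP"}   # abridged for the example

def define_symbol(table, name, value):
    """Add name=value to the symbol table, or raise ValueError."""
    name = name.upper()                        # symbols are case-insensitive
    if not SYMBOL_RE.fullmatch(name):
        raise ValueError(f"invalid symbol name: {name}")
    if len(name) > MAX_SYMBOL_LENGTH:
        raise ValueError(f"symbol name too long: {name}")
    if name in RESERVED or name in table:
        raise ValueError(f"{name} can not be redefined")
    if not -(1 << 31) <= value <= (1 << 31) - 1:
        raise ValueError(f"{value} does not fit in int32_t")
    table[name] = value

symbols = {}
define_symbol(symbols, "plop", 42)             # stored as PLOP
```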

Pseudo-instructions

Before passing a line to the actual instruction assembler, the line can be pre-parsed and examined, looking for pseudo-instructions: these are commands that do not turn into actual instructions but do help manage the overall program's structures and definitions. The first word is parsed and tested:

• If the word starts with a dot "." then a pseudo-instruction is expected and looked-up. It could be one of those:

.END (essential) Ends parsing of the file. The source file can contain any garbage (or the dump of the symbol table) below this line.
.ORG (essential) sets the address where the next instruction is stored. The value may be a number, symbol or expression, but can not be post-defined. Depending on the assembler, it can also be an error to have a new address smaller than the current address ($). For example the line

.ORG 42

means that the next instruction will be stored at address 42.
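
To show how .ORG interacts with the general rules (start at address 0, blank locations set to 0xFFFF, lower byte first), here is a minimal Python sketch of an output image. The class name Image and its size parameter are hypothetical, and the sketch takes the strict option of rejecting a .ORG that goes backwards.

```python
import struct

BLANK = 0xFFFF    # all-ones, like an erased Flash or EEPROM cell (INV)

class Image:
    def __init__(self, size):
        self.words = [BLANK] * size    # size: hypothetical memory depth
        self.address = 0               # assembly starts at address 0

    def org(self, address):
        if address < self.address:     # strict choice made by this sketch
            raise ValueError("addresses must be in increasing order")
        self.address = address

    def emit(self, word):
        self.words[self.address] = word & 0xFFFF
        self.address += 1

    def to_bytes(self):
        # "<H" packs each 16-bit word with its lower byte first
        return b"".join(struct.pack("<H", word) for word in self.words)

image = Image(64)
image.org(42)
image.emit(0x1234)
```

After these calls, the word at address 42 holds 0x1234 (bytes 0x34 then 0x12 in the output) and every other location still holds 0xFFFF.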

.EQU (essential) creates a new symbol with a known value.

.EQU plop 42

defines the symbol "PLOP" and assigns the value 42. The symbol "PLOP" can be used later in the source file. The accepted range is a signed 32-bit integer (int32_t). This value must be fully defined.

.DW (essential) assembles a sequence of 16-bit immediate values.

.DW 42

will output the value 42 to the instruction memory space at the current address ($), as if it was an instruction. .DW accepts more than one argument:

.DW 1,1,2,3,5,8,13,21,33

Depending on the assembler, the first value may be "fixed" by a second pass if one symbol is undefined. It would be too complex to fix the following ones.

.DW forward,1,2,3
.EQU forward 21
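
One possible way to implement this "fix" is sketched below in Python, assuming a very reduced assembler that only knows .DW and .EQU with decimal numbers. The function name assemble_dw is hypothetical, and range checks are left out to keep the example short.

```python
def assemble_dw(lines):
    """Assemble .DW and .EQU lines, returning {address: 16-bit word}."""
    symbols, image, fixups = {}, {}, []
    address = 0
    # First pass: emit the words, remember the forward references
    for line in lines:
        words = line.upper().replace(",", " ").split()
        if not words:
            continue
        if words[0] == ".EQU":
            symbols[words[1]] = int(words[2])
        elif words[0] == ".DW":
            for position, argument in enumerate(words[1:]):
                if argument.lstrip("-").isdigit():
                    image[address] = int(argument) & 0xFFFF
                elif argument in symbols:
                    image[address] = symbols[argument] & 0xFFFF
                elif position == 0:
                    fixups.append((address, argument))   # patch it later
                    image[address] = 0xFFFF              # blank for now
                else:
                    raise ValueError(f"undefined symbol {argument}")
                address += 1
    # Second pass: patch the first values that were still unknown
    for fixup_address, name in fixups:
        if name not in symbols:
            raise ValueError(f"undefined symbol {name}")
        image[fixup_address] = symbols[name] & 0xFFFF
    return image

result = assemble_dw([".DW forward,1,2,3", ".EQU forward 21"])
```

Only the first argument of each .DW may be patched; an undefined symbol in any other position is rejected during the first pass.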

.BL (preliminary, optional) is like .DW but "byte low", to be accessed by LDCL. Accepts a sequence of bytes and/or ASCII characters as strings (such as "abc"). The upper byte is not affected.

.BL 11, "hello world"

.BH (preliminary, optional) is like .BL but "byte high", to be accessed by LDCH. Does not affect the lower byte so you can use .BL and .BH sequentially on the same addresses.

.BH 26, "Bonjour le monde du haut !\n"
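
The way the low and high bytes share the same instruction words can be sketched in Python. This is an assumption-laden illustration, since these pseudo-instructions are still preliminary: the helper names are hypothetical, each byte is assumed to go to its own consecutive address, and escape sequences in strings are not interpreted.

```python
def to_bytes(arguments):
    """Flatten numbers and strings into a list of byte values."""
    result = []
    for argument in arguments:
        if isinstance(argument, str):
            result.extend(argument.encode("ascii"))   # plain ASCII only
        else:
            result.append(argument & 0xFF)
    return result

def store_low(words, address, arguments):
    """Write the lower bytes, leave the upper bytes untouched."""
    for offset, byte in enumerate(to_bytes(arguments)):
        index = address + offset
        words[index] = (words[index] & 0xFF00) | byte

def store_high(words, address, arguments):
    """Write the upper bytes, leave the lower bytes untouched."""
    for offset, byte in enumerate(to_bytes(arguments)):
        index = address + offset
        words[index] = (words[index] & 0x00FF) | (byte << 8)

words = [0xFFFF] * 4
store_low(words, 0, [2, "hi"])     # low bytes of addresses 0 to 2
store_high(words, 0, [2, "HI"])    # high bytes of the same addresses
```

Address 3 is never written in this example, so it keeps its blank value 0xFFFF.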

.OVL (preliminary, optional) declares a new overlay. It resets the address to 0 (like .ORG 0), expects a number (0 to 255) and can declare an optional symbol name (a sort of enhanced .EQU with more side effects).

.OVL 23 MyCustomLibrary

.{ and .} are reserved for possible nested contexts (local symbol tables) in a later version. Throw an error if not supported.

• Otherwise, if the word ends in ':' (and is not a number, a symbol or a keyword) then it is a label and its definition is added to the table of symbols. There is no separator before ':' and the label must be alone on the line. The line

plop:

is a shorthand to the line

.EQU plop $

The following line must fail:

plop :  ; MUST_FAIL: don't allow spaces before ":"

• Lastly, if none of the above applies, the line is passed to the actual assembler.
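
The three cases above can be sketched as a small dispatch function in Python. The function name classify and the returned tags are hypothetical. The sketch does not check that a label is a valid new symbol, and it assumes that no character literal containing ':' or ';' appears on the line.

```python
def classify(line):
    """Return (kind, words) for one source line."""
    words = line.split(";", 1)[0].upper().replace(",", " ").split()
    if not words:
        return ("EMPTY", None)
    first = words[0]
    if first.startswith("."):
        return ("PSEUDO", words)              # looked up later
    if first.endswith(":"):
        if len(first) == 1 or len(words) > 1:
            raise SyntaxError("a label needs a name and must be alone on its line")
        return ("PSEUDO", [".EQU", first[:-1], "$"])   # label shorthand
    if any(":" in word for word in words):
        raise SyntaxError("no space is allowed before ':'")
    return ("INSTRUCTION", words)             # passed to the actual assembler
```

With this function, "plop:" becomes the equivalent of ".EQU plop $" while "plop :" raises an error, as required.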

Basic Syntax

End of line is given by CR (ASCII 13d) or LF (ASCII 10d).
Separators are

space ' ' (ASCII 32d)
comma ',' (ASCII 44d)
horizontal tab (ASCII 9d)
non-breakable space (code 160d, &nbsp; in HTML)

The dollar sign '$' (ASCII 36d) represents/returns the value of the current address where the result is stored (often equivalent to the PC's value).
Character bytes are written between apostrophes. This means only raw ASCII, no UNICODE code point.
"Expressions" allow binary computations. When implemented, each operation is always between parentheses, to avoid precedence. The assembler could support the following arithmetic and boolean operations:

+ (addition)
- (subtraction)
* (multiplication)
/ (division)
% (modulo/remainder)
^ (boolean XOR)
| (boolean OR)
& (boolean AND)
>> (logic shift right)
>>> (sign-preserving shift right)
<< (shift left)
<@ (rotate left)
@< (rotate right)
?> (return the lowest unsigned value)
<? (return the highest unsigned value)
$> (return the lowest signed value)
<$ (return the highest signed value)

Let's also consider comparisons:

< ()
> ()
= ()
!= ()
<= ()
>= ()
()
()

All those symbols are reserved.
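
Because every operation has its own pair of parentheses, an expression can be evaluated with a very short recursive function and no precedence table. The Python sketch below illustrates this for a subset of the operators. It makes several assumptions: operands and operators are separated by spaces, numbers are decimal only, character literals are not handled, and the logic shift works on a 32-bit pattern. All function names are hypothetical.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def check_int32(value):
    """Intermediate results must fit in a signed 32-bit integer."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} overflows a signed 32-bit integer")
    return value

OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
    "<<": lambda a, b: a << b,
    ">>>": lambda a, b: a >> b,                 # sign-preserving shift
    ">>": lambda a, b: (a & 0xFFFFFFFF) >> b,   # logic shift, assumed 32 bits wide
}

def parse_operand(tokens, symbols):
    """Consume one operand and return (value, remaining tokens)."""
    if not tokens:
        raise SyntaxError("operand expected")
    head, rest = tokens[0], tokens[1:]
    if head == "(":
        left, rest = parse_operand(rest, symbols)
        if not rest or rest[0] not in OPERATIONS:
            raise SyntaxError("operator expected")
        operator, rest = rest[0], rest[1:]
        right, rest = parse_operand(rest, symbols)
        if not rest or rest[0] != ")":
            raise SyntaxError("one operation per pair of parentheses")
        return check_int32(OPERATIONS[operator](left, right)), rest[1:]
    if head in symbols:
        return symbols[head], rest
    return check_int32(int(head)), rest         # decimal numbers only here

def evaluate(text, symbols):
    tokens = text.upper().replace("(", " ( ").replace(")", " ) ").split()
    value, rest = parse_operand(tokens, symbols)
    if rest:
        raise SyntaxError("unexpected text after the expression")
    return value

print(evaluate("((plop + 1) * 2)", {"PLOP": 42}))
```

An expression such as "(1 + 2 + 3)" is rejected because it holds two operations inside a single pair of parentheses.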

Keywords

In addition to the previous keywords, the assembler pre-defines the following keywords as elements of an instruction. The list may expand in the future but these are essential definitions:

Opcodes

The opcodes take 2 to 4 characters.

"ADD" "AND" "ANDN" "CALL" "CMPS" "CMPU" "IN" "INV" "LDCH" "LDCL" "OR" "OUT" "PF" "RC" "RO" "SA" "SET" "SH" "SUB" "XOR"

Aliases

Some instructions are encoded as a special case of other instructions (see aliases).

"NOP" (see "OR") "HLT" (see "CALL") "OVL" (see "CALL")

Registers

The register names always take 2 characters.

"D1" (000) "A1" (001) "D2" (010) "A2" (011) "R1" (100) "R2" (101) "R3" (110) "PC" (111)

Conditions

The conditions fit in 3 or 4 characters only, often shortened.

0000 NEVR Never (instruction is not executed or committed, like a NOP)
0001 IFN0 B0=0
0010 IFNC Carry=0
0011 IFN1 B1=0
0100 IFNS Sign=0 (positive signed number, aliased as IFP?)
0101 IFN2 B2=0
0110 IFNZ Zero=0 (last result had at least one set bit)
0111 IFN3 B3=0
1000 ALWS Always (the default condition so writing it is not required)
1001 IF0 B0=1
1010 IFC Carry=1
1011 IF1 B1=1
1100 IFS Sign=1 (negative signed number)
1101 IF2 B2=1
1110 IFZ Zero=1 (last result had all bits cleared)
1111 IF3 B3=1
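
The register and condition lists above translate directly into lookup tables. The Python sketch below builds them; all names are hypothetical. It relies on a regularity that can be read from the list of conditions: the lower 3 bits select what is tested and the upper bit selects the expected value (0 for the "IFN" names, 1 for the "IF" names), with NEVR and ALWS as the two special cases.

```python
REGISTERS = {name: code for code, name in
             enumerate(["D1", "A1", "D2", "A2", "R1", "R2", "R3", "PC"])}

# Selector (lower 3 bits) -> letter or digit used in the mnemonic
SELECTORS = ["", "0", "C", "1", "S", "2", "Z", "3"]

CONDITIONS = {"NEVR": 0b0000, "ALWS": 0b1000}
for selector, flag in enumerate(SELECTORS):
    if flag:
        CONDITIONS["IFN" + flag] = selector            # tested bit must be 0
        CONDITIONS["IF" + flag] = 0b1000 | selector    # tested bit must be 1

def condition_code(word=None):
    """Return the 4-bit code of a condition, ALWS when none is written."""
    return CONDITIONS["ALWS" if word is None else word.upper()]

print(format(condition_code("ifnz"), "04b"))
```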

Forms

The instruction's textual format follows the binary format (except the condition which is a suffix). "Forms" are valid sequences of zero, one or two operands:

No operand (often instruction aliases):

"HLT" "INV" "NOP"

Examples:

NOP ; do nothing

HLT ; properly terminate a program

INV ; trigger a trap/reboot/panic

One immediate number (0 to 255: Imm8):

"OVL"

Example:

OVL 42 ; Load and execute overlay #42 (special case for CALL 42 PC)

One immediate number (0 to 511: Imm9) and a (source or destination) register:

"IN" "OUT"

Examples:

IN 67, D2  ; get value from IOspace at address 67 and write it to register D2

OUT 45 A1 ; Put the value of A1 into IOspace at address 45

One short immediate (-8 to 7) or a source register, and a source/destination register:

"LDCH" "LDCL" "RC" "RO" "SA" "SH"

Examples:

RC  1  D1   ; Rotate register D1 through carry by 1 position (left)

RO -3  D2   ; Rotate register D2 by 3 positions (right)

SA  R1 A1   ; Arithmetic Shift of A1 by R1 positions

SH  R2 A2   ; Logic shift of A2 by R2 positions

One source/destination register and one optional condition:

"PF"

PF R3 ; put the next result in R3
; default condition is ALWS so Carry-In is set to 1

PF R1 IFN2 ; put the next result in R1
; and copy the negated condition bit #2 in the Carry-in flag

One immediate byte (Imm8) and a source/destination register

(same list as below)

One short immediate (Imm4) or a source register, then a source/destination register and one condition:

"ADD" "AND" "ANDN" "CALL" "CMPS" "CMPU" "OR" "SET" "SUB" "XOR"

If no condition is specified, it is considered as ALWS (1000)

Examples:

AND 123 R1 ; bit-mask R1 with byte 123

ADD -76 R2 ; subtract byte 76 from register R2

SUB  95 R3 ; Subtract R3 from byte 95 and put the result in R3

ADD 1 A1 IFC ; increment A1 by short 1 if the carry bit is set

ADD D1 A1 IFS ; Add D1 to A1 (result in A1) if the Sign/Negative flag is set.

There is no explicit syntax for marking a number as short or long (Imm4 or Imm8). The assembler first tries to fit the value in a short immediate, so that an optional condition can be accepted. If an immediate exceeds the Imm4 range, then any explicit condition suffix is an error.
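
This selection rule can be sketched as follows in Python. The function name choose_immediate and the returned tags are hypothetical. The sketch assumes that the values 8 to 15 are treated as outside the short range (so they use the long form and can not take a condition), which is only one possible reading of the Imm4 rule.

```python
def choose_immediate(value, condition=None):
    """Return ("IMM4" or "IMM8", value) for an immediate operand."""
    if -8 <= value <= 7:
        return ("IMM4", value)       # short form, a condition is allowed
    if condition is not None:
        raise ValueError(f"{value} needs the long form, "
                         f"so the condition {condition} is not allowed")
    if -128 <= value <= 255:
        return ("IMM8", value)       # long form, no room for a condition
    raise ValueError(f"{value} does not fit in Imm8")

print(choose_immediate(1, "IFC"))    # ADD 1 A1 IFC uses the short form
print(choose_immediate(123))         # AND 123 R1 uses the long form
```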

To be continued...

Now have a look at the YGREC8's Integration manual, the Programming manual or the Main YGREC8 Manual.