From: <Saved by Windows Internet Explorer 7>
Subject: Emergent Technologies Inc. -- LMI K-Machine
Date: Wed, 3 Jan 2007 18:29:43 -0800
MIME-Version: 1.0
Content-Type: text/html;
	charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://fare.tunes.org/tmp/emergent/kmachine.htm
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- Copyright 1996 Emergent Technologies Inc. --><!-- All rights =
reserved. --><HTML><HEAD><TITLE>Emergent Technologies Inc. -- LMI =
K-Machine</TITLE>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Dwindows-1252">
<META content=3D"Lisp Machine Inc. K-machine architecture" =
name=3Ddescription>
<META content=3D"LISP, computer, Lisp Machine, hardware" =
name=3Dkeywords>
<STYLE type=3Dtext/css>BODY {
	FONT-FAMILY: Trebuchet MS
}
P {
	TEXT-INDENT: 0.25in
}
CODE {
	FONT-FAMILY: courier new
}
PRE {
	FONT-FAMILY: courier new
}
</STYLE>

<META content=3D"MSHTML 6.00.5730.11" name=3DGENERATOR></HEAD>
<BODY background=3D"">
<CENTER>
<H1>Lisp Machine Inc. K-machine:</H1><BR>
<H3>The Deffenbaugh, Marshall, Powell, Willison architecture<BR>as =
remembered by=20
Joe Marshall</H3></CENTER>
<BLOCKQUOTE><I>Abstract:</I>=20
  <P>The LMI K-machine was the last processor designed and built by Lisp =

  Machine, Inc. Unlike other Lisp Machines, the K-machine is not =
descended from=20
  Tom Knight's original CONS architecture; the K-machine is an original =
design.=20
  This paper provides an overview of the more interesting elements of =
the=20
  K-machine architecture.</P></BLOCKQUOTE>
<H2>Introduction</H2>
<P>The LMI K-machine was the last processor designed and built by Lisp =
Machine,=20
Inc. The K-machine was designed in late 1985 - early 1986 and the first=20
instructions were executed December 31, 1986. The hardware was =
significantly=20
debugged by April 1986 and an effort to port the legacy Lisp Machine =
code to the=20
K-machine was underway. Financial difficulties kept the K-machine from =
reaching=20
the marketplace. Lisp Machine's successor, Gigamos, was beset by legal=20
difficulties which again prevented marketing of the K-machine processor. =

Nonetheless, several K-machine board sets were produced and debugged =
before the=20
demise of Lisp Machine, Inc.</P>
<P>The K-machine processor is notable for several reasons: Unlike other =
Lisp=20
Machines, the K-machine is not descended from Tom Knight's original CONS =

architecture; the K-machine is an original design. The K-machine is =
designed to=20
run Lisp, and Lisp only. As a result, the performance of the K-machine =
on the=20
Gabriel benchmarks is significantly higher than that of any of the =
several=20
contemporary competing products of the mid-to-late 1980's, among them =
the TI=20
Explorer chip, the Symbolics Ivory, and Lucid Common Lisp. In fact, the=20
K-machine was competitive with Lisp implementations on the Cray 1 and =
the IBM=20
3670. Since the project was kept relatively secret during its =
development, very=20
few people know the details of its architecture.</P>
<P>The authors of this paper were the primary architects of the original =

K-machine as designed and built by Lisp Machine, Inc. The hardware=20
implementation of the architecture was primarily designed by Robert =
Powell,=20
Bruce Deffenbaugh, and Kent Hoult. Subsequent minor hardware changes =
were made=20
by the staff of Gigamos. Many people were involved in the project.</P>
<P>This paper provides an overview of the more interesting elements of =
the=20
K-machine architecture. The K-machine contained its share of mundane =
elements=20
such as instruction caches, virtual memory hardware, etc. Details about =
the=20
mundane hardware is beyond the scope of this paper.</P><BR>
<H2>Architectural Principles</H2>
<P>In any processor design there are fundamental "principles" from which =
the=20
rest of the architecture derives. For the LMI K-machine, these were as =
follows:=20
<UL>
  <LI><B>Optimistic instruction execution</B><BR>The processor was =
designed with=20
  the assumption that certain operations, are common enough that no time =
should=20
  be wasted processing them. For example, when two numbers are added it =
is=20
  assumed that they are both fixnums. If this turns out not to be the =
case, the=20
  addition is aborted and the computation retried with the generic =
arithmetic=20
  code. This is in contrast to the LMI LAMBDA which did a type dispatch =
on=20
  arithmetic operands <EM>before</EM> attempting to operate upon =
them.<BR><BR>
  <LI><B>Fully type-safe operation</B><BR>All user-operations are fully =
type=20
  safe. Speed and safety flags are ignored by the compiler and full =
safety is=20
  provided by the hardware.<BR><BR>
  <LI><B>Non-expert user assumption</B><BR>A typical user of an LMI =
LAMBDA does=20
  not have the time necessary to learn how to squeeze high performance =
from that=20
  processor. We therefore designed the machine to optimize the patters =
of usage=20
  found in code written by non-expert users. This is not to say that=20
  inexperienced programmers can expect high performance from poorly =
written=20
  code, but a pedagogic and abstract style of code need not be abandoned =
when=20
  coding a high-performance application.<BR><BR>
  <LI><B>32-bit objects</B><BR>As in the CONS, CADR, and LMI LAMBDA, the =

  K-machine supports 32-bit objects. In contrast, the Symbolics 3600, =
had 36-bit=20
  objects. In both the 3600 and the LMI LAMBDA, the type tag is kept in =
the=20
  pointer, so the address space, as defined by the length of a pointer, =
is=20
  significantly smaller in the LMI machines.<BR><BR>A 32-bit object =
admits a=20
  larger range of stock hardware upon which to base an implementation, =
and=20
  allows Lisp objects to be stored and manipulated on non-specialized =
hardware,=20
  for instance, the standard 32-bit bus found in most modern computers.=20
  Furthermore, the smaller address space simplifies the implementation =
of the=20
  virtual memory system and ephemeral GC by allowing a single level of =
page=20
  tables.<BR><BR>The drawbacks of using 32-bit objects are an inability =
to use=20
  full IEEE floating point, integers are required to have less than =
32-bits=20
  precision, and the address space is quite a bit smaller. Additionally =
it is=20
  much more difficult to ensure that all objects are tagged at all =
times;=20
  generally, the tag bits must be stripped off objects before they are =
passed=20
  through the ALU, and the objects must be re-tagged before they are =
stored back=20
  in memory. This increases the burden upon the system programmer to =
ensure that=20
  the garbage collector only sees valid pointers.<BR><BR>
  <LI><B>RISC-like architecture</B><BR>Certain RISC principles were =
adhered to:=20
  Instructions are atomic; there are no "instruction prefixes" or =
"string=20
  operations". There are no complicated addressing modes (although that =
the low=20
  order bit can be forced to 1 during a read cycle to be able to take =
CDR's=20
  without having to perform an add). All instructions complete in a =
single=20
  fixed-time clock cycle. <BR><BR>Other RISC principles, such as a =
uniform=20
  instruction set, were discarded. The K-machine has a way to fit an ALU =

  operation into virtually any instruction. There are even "short" =
conditional=20
  branches for tight loops that have enough bits in them to execute an=20
  arithmetic (but not a shift/rotate) instruction while =
branching.<BR><BR>Unlike=20
  some RISC machines, while the instruction set was "reduced" as above,=20
  complicated instructions that performed several operations in parallel =
were=20
  introduced so long as they could be completed in one clock =
cycle.<BR><BR>In=20
  retrospect, we inherited many of the advantages and disadvantages of =
RISC=20
  processors. We had simpler and cheaper hardware, interrupt processing =
was=20
  easier, and the compiler was easy to target. However, we also had very =
wide=20
  instructions (64-bits), and often the compiler could not find useful =
work to=20
  be done while waiting for condition codes to become available. This =
caused=20
  several <CODE>NOP</CODE>s to be inserted. We also found that the =
performance=20
  of the machine can be limited by the rate at which the instruction =
cache is=20
  loaded.<BR><BR>
  <LI><B>50ns clock cycle</B><BR>Our target was a 50ns clock cycle. Each =

  instruction was to complete within this time. The actual machine came =
in at=20
  80ns. It turns out that the AMD 29332 computes a stable output some =
20ns or so=20
  after the inputs become stable except for one crucial line: the =
integer=20
  overflow line can take up to 47ns to stabilize. The critical timing =
loop is=20
  between the ALU receiving correct inputs to the ALU output =
stabilizing, a=20
  couple of gate delays to recognize an error condition, and the time =
necessary=20
  to abort the current instruction fetch and set the program counter to=20
  0.<BR><BR>
  <LI><B>3-address machine</B><BR>The CONS, CADR, LAMBDA, and Symbolics =
3600=20
  emulated a stack machine with the microcode. While a stack machine is =
an easy=20
  architecture to compile to, there is a performance penalty to be paid. =
For=20
  instance, in the evaluation of a simple addition, 3 instructions must =
be used:=20
  <CODE><BR>push arg1 <BR>push arg2 <BR>add <BR></CODE>Where a register =
machine=20
  may accomplish the same operation in one instruction. A 3 address =
machine,=20
  however, requires a much wider instruction than a stack machine =
because the=20
  arguments to a stack machine operation are implicit. The K-machine has =
a=20
  64-bit instruction word. An instruction word of this size creates some =

  instruction bandwidth difficulties, but there are compensatory =
advantages in=20
  that the instruction can specify a number of operations to be done in=20
  parallel.<BR><BR>
  <LI><B>No CDR codes</B><BR>Analysis of the storage on existing Lisp =
Machines=20
  indicated that only about 20 percent of the storage was actually =
stored in=20
  lists (the rest being vectors, structures, or classes), and that about =
20=20
  percent of the lists benefited from CDR codes. CDR codes take their =
toll in=20
  increasing memory references (the CAR must be examined before the CDR =
can be=20
  located), and reduced address space (2-bits per pointer reserved for =
every=20
  object in the system). One CDR-code bit was added to the address =
space, and=20
  the other to the tag.<BR><BR>
  <LI><B>"Level" implementation technology</B><BR>The entire K-machine =
is=20
  implemented in standard CMOS and TTL. No ECL or other high-performance =
parts=20
  were used. The point of doing this is to be able to easily incorporate =
faster=20
  technology by simply replacing all the parts with their faster=20
  equivalent.<BR><BR></LI></UL>
<P></P>
<P>The final data layout is a 6-bit tag at the most-significant end of =
the=20
(4-byte) word, and a 26-bit <EM>word</EM> pointer in the least =
significant end.=20
This makes a 2<SUP>26</SUP> word (64 MWord) address space, or a 1/4 GB =
address=20
space. Again, this is smaller than Symbolics, but nothing to sneeze at. =
The=20
machine is correct-endian, <I>i.e.</I> little.</P><BR>
<H2>The Pipeline</H2>
<P>The K machine has a 4 stage pipeline. While this is transparent to =
the end=20
user, and mostly transparent to the system programmer, there are =
occasions where=20
the pipeline is exposed. Notes about this will be interspersed =
below.</P>
<P>The clock cycle to the machine registers run at twice the speed of =
the main=20
processor clock and alternate between read and write cycles. This means =
that=20
each instruction can both read from the registers and write back to =
them.</P>
<P>The four pipelines stages are as follows:=20
<OL>
  <LI><B>Fetch</B><BR>The instruction is fetched from the instruction=20
  cache.<BR><BR>
  <LI><B>Read</B><BR>The instruction source data are fetched from the =
register=20
  file, functional input bus, or instruction register (for immediate=20
  values).<BR><BR>
  <LI><B>ALU</B><BR>The data are passed through the ALU and latched at =
the=20
  output register.<BR><BR>
  <LI><B>Write</B><BR>The value in the output register is written back =
to the=20
  register file or to a functional destination.</LI></OL>
<P>An instruction may be aborted at any time before the output register =
value is=20
written.</P>
<P>A four-stage pipeline generally means that branch instructions are =
not taken=20
immediately. Usually, there are "branch delay instructions" that are=20
unconditionally executed after the branch is specified. The K-machine =
branch=20
delay is unusual in that the condition codes are set in the instruction=20
<EM>before</EM> the conditional branch instruction and the branch target =
is=20
specified in the instruction <EM>after</EM> the conditional branch. =
Thus, when=20
the condition codes are generated by the ALU, the conditional branch is =
in the=20
instruction register and the machine can then decide whether to follow =
the=20
branch at this point. This ameliorates the effect of branch delay by =
spreading=20
the branch across several instructions. Of course, the compiler hides =
the=20
details of conditional branches.</P><BR>
<H2>The Primary Data Paths</H2>
<P>Within the processor itself, the data paths are 33 bits wide. The =
extra=20
"boxed" bit allows us to manipulate 32-bit non-pointer data (such as =
floating=20
point numbers) directly. The 33<SUP>rd</SUP> bit is constructed on the =
fly from=20
context when a pointer is read into the machine, and appropriate action =
is taken=20
to "box" non-pointer data when it is written to memory.</P>
<H3>The register file</H3>
<P>Unlike the CONS derivatives, but much like a RISC machine, the =
K-machine has=20
a register file consisting of 4096 33-bit words. This register file is=20
duplicated to allow 2 simultaneous reads from different registers. =
Writes to the=20
register file always update both copies. The output of the register file =
is=20
latched and presented only to the ALU. In order to use the contents of a =

register as a virtual address or to write it to memory, it must be =
passed=20
through the ALU. This incurs a one-cycle delay.</P>
<P>The register file is both read and written during each machine =
instruction,=20
and thus requires fast RAM. Static RAM is used both for speed, and to =
allow=20
single stepping of the K-machine hardware for debugging purposes.</P>
<P>The register file is addressed by the function calling hardware. This =
will be=20
described in detail below.</P>
<H3>The ALU</H3>
<H4>Integer unit</H4>
<P>The K-machine was designed to use the AMD 29332 ALU. This is a 32-bit =
ALU=20
that has quite a bit of functionality including a barrel shifter (for =
tag=20
extraction), and 8, 16, 24, and 32-bit modes of operation. The decision =
was made=20
to make fixnums 24-bits wide to take advantage of the natural ability of =
the ALU=20
to handle this. This was a step backward, as the LMI-lambda had 25-bit =
fixnums,=20
but the gain in performance was considered more important.</P>
<H4>Floating point unit</H4>
<P>In parallel with the integer ALU (the AMD 29332), is a floating point =
adder=20
and a floating point multiplier. The output register, which holds the =
value to=20
written back to the registers, may receive its value from any of these=20
sources.</P>
<H4>Type checking unit</H4>
<P>Neither the CADR or the LMI LAMBDA had hardware for checking the =
types of=20
operands in parallel with the actual computation done by the ALU. It is =
my=20
understanding that the Symbolics 3600 has a type checking unit, but we =
are=20
unfamiliar with it. We invented a simple, but effective, type checking =
hardware=20
module that operates in parallel with the arithmetic units. This =
hardware=20
consists of a 256Kbit x 1bit RAM. This RAM is addressed by 7bits each =
from the=20
left and right operands of the ALU (the boxed and tag bits), two bits =
from the=20
instruction to specify the type check operation (for instance, operands =
must be=20
boxed fixnums, or left operand is an aggregate data structure and the =
right=20
operand is a fixnum), and two bits from a mode register (this allowed =
the=20
program to change type checking strategies when garbage collecting, for=20
instance). These 7+7+2+2 are concatenated to form a 2<SUP>18</SUP> =
address for=20
the type check RAM. This single bit output indicates whether the =
operation=20
should succeed or cause a type check error. This turns out to be an =
elegant,=20
simple, and inexpensive mechanism for parallel type checking.</P>
<P>The type check RAM is loaded at boot time by presenting all possible=20
combinations to the RAM (by sending synthesized data through the ALU). =
Traps are=20
turned off during programming, and a special control register toggles =
the write=20
line at the appropriate time as the data flies by. It is tricky, but it =
avoids=20
the necessity of adding special data paths in order to program the type =
checking=20
hardware. By implementing the checking in a RAM, we can experiment with =
which=20
types are represented, and even add new hardware tags on the fly.</P>
<H4>Boxed unit</H4>
<P>A small piece of logic computes the boxed bit for the ALU result. The =

instruction can specify that the result is forced to be boxed, forced to =
be=20
unboxed, is to have the same boxedness as the left-hand operand, or is =
to have=20
the opposite boxedness as the left-hand operand (this last one has no=20
discernible use, but it is what the hardware does).</P>
<H3>Interconnection</H3>
<P>As mentioned previously, the output of the register file is latched =
and=20
presented to the inputs of the ALU. The output of the ALU is similarly =
latched=20
(by the output register) and written back to the register file in the =
next=20
instruction. Thus the primary data paths in the machine are composed of =
a=20
two-stage pipeline.</P>
<P>A set of address comparators enable a pass-around path when the =
result of an=20
ALU operation is needed immediately for the next computation. This =
relieves the=20
compiler from part of the instruction scheduling burden. In general, the =

compiler can ignore the effect of the pipeline except when =
<EM>functional=20
sources and destinations</EM> are involved.</P>
<P>There are additional points to inject and extract data from the main =
data=20
paths. <EM>Functional destinations</EM> can be written with the value of =
the=20
output register. These consist of various control registers, the memory =
maps,=20
and the virtual memory address and data registers. <EM>Functional =
sources</EM>=20
may be injected into the right hand side of the ALU input registers.=20
Additionally, the virtual memory data may be injected directly into the =
output=20
register (see below for why).</P>
<P>Like the CADR, but unlike the LAMBDA, the functional busses (called =
the MFI=20
and MFO bus in the CADR), are not connected. The MFI bus passes data =
destined=20
for the processor, the MFO bus passes data sourced from the processor. =
This=20
admits the possibility that some inconsistency may arise in a read-write =

register (for instance, the memory data register), but the difficulty in =

"turning around" the bus, quite evident in the LMI LAMBDA, is easily =
avoided in=20
this manner.
<P>
<P>Because the functional destinations are located beyond the output =
register,=20
the effect of the pipeline is exposed to the compiler and care is taken =
to=20
ensure that enough time is allowed for the data to arrive at its =
destination=20
before an attempt is made to use it; in particular, it is important to =
not=20
attempt to use the virtual memory data register until two instructions =
after=20
initiating the read. Like the CADR and LAMBDA, the K-machine has =
interlocks to=20
stall the pipeline while the processor awaits data from the memory =
system; the=20
two instruction rule is to make sure that the request to read from the =
memory=20
system precedes the request for the value read. </P>
<P>As mentioned before, the virtual memory data can be injected into the =

pipeline at the output register rather than at the right hand input to =
the ALU.=20
The unusual placement of the injection point of the memory data saves =
time;=20
memory data is usually written to a register and this will bypass a =
pointless=20
trip through the ALU.</P>
<P>Since the functional destinations are delayed by a clock tick, =
certain=20
operations, such as loading the type checking RAM, can take advantage of =
this=20
delay by specifying that the data is to be written, And providing said =
data in=20
the next instruction in the stream. The data will therefore be at the =
correct=20
point in the pipeline when the write pulse is generated by the delayed=20
functional destination. This is a bit of a hack, but it allows us to use =
a=20
baroque set of instructions to program the hardware without requiring =
any extra=20
data paths. (Of course this can only be done with interrupts off).</P>
<P>The instruction register can also be injected into the right hand =
side of the=20
ALU. This allows us to embed immediate constants into the instruction =
stream.=20
However, since the immediate constants had to be part of an instruction, =
there=20
is no room for an ALU operation to be specified when this mechanism is =
used. The=20
hardware forces the ALU to perform a <CODE>pass right</CODE> =
instruction, so all=20
immediate instructions must be loaded into a register before any =
operation can=20
take place (note that if the immediate was loaded to a functional =
destination it=20
would simply pass through the ALU on its way to the functional output =
bus).</P>
<P>A special case is made for immediate fixnums. These are represented =
in the=20
instruction stream as untagged immediate values. The hardware generates =
the=20
fixnum tag on the fly as it is injected, and the 8 bits saved (remember =
that=20
fixnums are 24 bits), allow for an ALU operation to be specified.</P>
<P>The PC can be loaded from the output of the ALU. This allows computed =

branches to be executed.</P>
<H2>Function calling hardware</H2>
<P>While the LMI Lambda was advertised as having special hardware to =
speed=20
function calls, there is no particular piece of hardware devoted to =
rapid=20
function calling. No machine that I am aware of had function calling =
hardware=20
like the K-machine.</P>
<H3>Register addressing</H3>
<P>The function calling mechanism is intimately tied to the register =
addressing=20
scheme. A register address consists of 6 bits. Two bits select a "frame" =
and 4=20
bits provide an offset within that frame. Each frame therefore has 16 =
general=20
purpose registers.</P>
<P>There are 4 frames available to each instruction (more below). =
Analysis of=20
the LMI LAMBDA showed that most functions have one or two arguments and =
one or=20
two local parameters. The unusual functions that overflow the register =
frame are=20
handled in software. The compiler generates the calls to the register =
spilling=20
code.</P>
<P>The register frames are not overlapped as in other RISC machines. =
This allows=20
us to allocate and free register frames in a non-stack (LIFO) manner. =
The=20
advantages of this are discussed below.</P>
<P>Unlike the CADR, LAMBDA, and 3600, <EM>registers do not have a =
virtual=20
address</EM>. In the LAMBDA, virtual memory accesses to the stack cause =
a load=20
from the stack cache. In the K-machine, it is not possible to address =
register=20
values in this way. Thus it is not possible to create a pointer into the =

stack.</P>
<P>The 4 frames available to each instruction are as follows:=20
<UL>
  <LI><B>Global</B><BR>A set of 16 global frames is allocated for use by =
all=20
  functions. Each instruction includes a 4 bit global frame selector, so =
only 16=20
  of the 256 global registers are available within a single instruction. =
Global=20
  registers contain data such as commonly used constants (like 1, 0, =
NIL, -1,=20
  etc.) to avoid using the costly <CODE>load immediate</CODE> =
instruction,=20
  pointers to wired memory (like the page tables), etc.=20
  <LI><B>Active</B><BR>This frame contains the arguments, locals, and=20
  temporaries of the currently executing function.=20
  <LI><B>Open</B><BR>This frame is the one currently being prepared for =
the next=20
  function call.=20
  <LI><B>Return</B><BR>This frame is the one that was used by the last =
function=20
  call. Multiple values can be returned in this frame, but single values =
are=20
  returned via a different mechanism. In retrospect, this frame causes =
more=20
  problems than expected as it needs to be saved and restored on each=20
  interrupt.</LI></UL>
<H3>Call sequence</H3>
<P>The K-machine uses a "3-point" calling sequence. A new open frame is=20
allocated, then it is filled with arguments, then control is transferred =
to the=20
callee. When the callee returns, the open frame that was allocated is =
discarded=20
(actually moved to the RETURN frame). While it may not sound like it, =
this=20
implements a callee saves convention.</P>
<P>When a function is called as a subproblem, a continuation must be =
created=20
(and pushed) in order for the function to return a value. Note, however, =
that=20
<EM>the continuation does not need to be in a register</EM>. In fact, it =
is rare=20
for a continuation to be manipulated as a value, so a separate "hidden" =
stack is=20
used to hold it. This allows us to manipulate continuations in parallel =
with=20
operations that use the register set.</P>
<P>The continuation has 4 parts: the caller's active frame, the caller's =

previous open frame, the return PC, and the return destination. The =
actual=20
control sequence is complicated, but I'll make a stab at explaining it:=20
<UL>
  <LI><B>Open a frame</B><BR>The <CODE>open</CODE> instruction saves the =
current=20
  values of the OPEN and ACTIVE register frame pointers on the hidden =
stack and=20
  loads the OPEN register with a new frame pointer.<BR><BR><EM>Frame =
pointers=20
  are kept in a hardware freelist.</EM> This freelist is popped when a =
frame is=20
  opened, and it is pushed when a function returns. When the freelist is =
close=20
  to empty, an interrupt is generated to dump the stack to memory. The =
oldest=20
  frame on the stack contains a trampoline that reloads the stack from=20
  memory.<BR><BR>Multiple <CODE>open</CODE> statements are allowed. Each =
one=20
  will push the previous value of the OPEN register on the hidden stack. =
This=20
  means that temporaries necessary for building a stack frame need not =
be=20
  allocated from the currently active frame, or copied from temporary =
locations=20
  to the argument locations at call time.<BR><BR>
  <LI><B>Call a function</B><BR>The <CODE>call</CODE> instruction saves =
the=20
  return PC on the hidden stack, loads the ACTIVE frame pointer from the =
OPEN=20
  frame pointer, and transfers control to the new function.<BR><BR>
  <LI><B>Return from a function</B><BR>The <CODE>return</CODE> =
instruction moves=20
  the current RETURN frame pointer to the frame freelist, moves the =
ACTIVE frame=20
  pointer to the RETURN register, restores the callers OPEN and ACTIVE=20
  registers, restores the PC and transfers control back to the callee. =
Note:=20
  when a function is called, the callee is responsible for destroying =
its own=20
  frame, thus the value of the OPEN register in the caller is the value =
it had=20
  before the OPEN statement that created the callee's =
frame.<BR><BR></LI></UL>
<H4>Return destination</H4>
<P>In addition to all the above activity, the <CODE>call</CODE> =
instruction=20
specifies a "return destination" for the computed value. Rather than =
return a=20
value to a canonical location and then move the value to the desired =
register, a=20
register address is specified in the <CODE>call</CODE> instruction, and =
the=20
return value is loaded into that register when the function returns.</P>
<P>The <CODE>open</CODE>, <CODE>call</CODE>, and <CODE>return</CODE>=20
instructions each can be executed <EM>in one clock cycle</EM>.</P>
<P>For the sake of illustration, here is how the compiler might generate =
some=20
code: <PRE>LISP:  (defun xxx (f0 f1 z0 b1 b2 f3)
               (foo f0 f1 (bar (baz z0) b1 b2) f3))

OPEN                ; open a frame for call to foo,=20

O0 &lt;- A0            ; load open register 0 with contents of active
                    ; register 0 (f0)

O1 &lt;- A1            ; load open register 1 with contents of active
                    ; register 1 (f1)

OPEN                ; open a frame for call to bar,=20

OPEN                ; open a frame for call to baz,

O0 &lt;- A2            ; load open register 1 with contents of active
                    ; register 2 (z0)

O0 &lt;- CALL BAZ      ; call function BAZ, result to appear in slot 0
                    ; in the frame for BAR.

O1 &lt;- A3            ; BAZ's open frame is gone, result of call is in
                    ; open register 0, load open register 1
                    ; with contents of active register 3 (b1)

02 &lt;- A4            ; load open register 2 with contents of active
                    ; register 4 (b2)

O2 &lt;- CALL BAR      ; call function BAR, result to appear in slot 2
                    ; of frame for FOO

O3 &lt;- A5            ; BAR's open frame is gone, result of call is in
                    ; open register 2, load open register 3
                    ; with contents of active register 5 (f3)

CALL FOO            ; and away we go.
</PRE><BR>
<P>12 instructions. Now, you might think that this is pretty good code, =
but we=20
can do better. The K-machine supports the following optimizations:=20
<UL>
  <LI>Call hardware operations in parallel with register operations. The =

  register fetches occur before the call hardware operation completes, =
and the=20
  register writeback occurs in the instruction <EM>after</EM> the call =
hardware=20
  operation completes. After pondering this for a while, we convinced =
ourselves=20
  that this is the desired behavior.<BR><BR>
  <LI><CODE>open-call</CODE> instructions can open a frame in parallel =
with=20
  calling the function. This, in combination with the parallel moves =
allow one=20
  argument functions to be called in one instruction.<BR><BR>
  <LI>A special <CODE>OPEN on RETURN</CODE> destination opens a new =
frame when=20
  returning from a call. This, in combination with the parallel moves, =
allows a=20
  lazy opening strategy; the frame opening happens when the first =
argument to be=20
  placed in the frame is computed.<BR><BR>
  <LI><B>Tail recursion</B> is implemented in hardware. The special=20
  <CODE>TOPEN</CODE> instruction opens a new frame without pushing the =
current=20
  OPEN and ACTIVE frame pointers on the hidden stack (in essence, not =
creating a=20
  new continuation), and the special <CODE>tail-call</CODE> instruction=20
  transfers control to the callee (like a jump), but because no=20
  <CODE>RETURN</CODE> is executed for the current frame, the current =
ACTIVE=20
  frame is immediately returned to the hardware freelist. This is =
necessary in=20
  order not to leak frames.<BR><BR>Note that a tail recursive call done =
like=20
  this violates the LIFO order of the stack, thus the hardware freelist =
is truly=20
  a freelist and the frames may become arbitrarily shuffled.<BR><BR>Also =
note=20
  that it is not necessary to smash the arguments in the current frame =
to=20
  implement tail recursion. In the worst case, this can add several move =

  instructions to a tail-recursive procedure. By using the hardware, it =
is often=20
  the case that <EM>tail recursive calls are faster than the equivalent =
loop=20
  that mutates the arguments</EM>.<BR><BR>Finally note that the return=20
  destination is stored in the hidden stack. This means that not only =
does the=20
  callee <EM>not know</EM> where the returned data is going, but also =
that the=20
  caller <EM>does not know</EM> what function returned the value! In =
other=20
  words, arbitrary tail recursion is supported by the hardware. =
</LI></UL><BR>
<P></P>
<P>Here is what the code looks like with the above optimizations: =
<PRE>LISP:  (defun xxx (f0 f1 z0 b1 b2 f3)
               (foo f0 f1 (bar (baz z0) b1 b2) f3))

TOPEN, 00 &lt;- A0     ; open a frame for tail recursive call to foo,=20
                    ; in parallel, load open register 0 with contents of =
active
                    ; register 0 (f0)

O1 &lt;- A1            ; load open register 1 with contents of active
                    ; register 1 (f1)

NEW-OPEN0 &lt;- OPEN CALL BAZ, O0 &lt;- A2
                    ; open a frame for a call to baz,
                    ; load open register 0 with the contents
                    ; of active register 2 (z0), result to appear
                    ; in slot 0 of open frame to be allocated upon
                    ; return from BAZ.

O1 &lt;- A3            ; load open register 1 with contents of
                    ; active register 3 (b1)


O2 &lt;- CALL BAR, O2 &lt;- A4
                    ; call bar, returning value to slot 2 in FOO's
                    ; frame.
                    ; in parallel, load open register 2 with contents
                    ; of active register 4 (b2)

TCALL FOO, O3 &lt;- A5 ; tail-recurse to foo, in parallel move=20
                    ; contents of active register 5 (f3) into
                    ; open register 3
</PRE><BR>As you can see, we have removed half the instructions! This =
kind of=20
optimization turns out to be fairly common. Remember, that each of these =

instruction execute in one clock cycle.
<P></P>
<P>The K-machine function calling hardware is implemented as a maze of=20
registered 2-to-1 multiplexers and RAM. The only restriction that the=20
implementation placed upon programmers is that two return operations =
cannot be=20
performed in sequence. In other words, when a function returns, it's =
caller is=20
not allowed to return in the next instruction. This is not a problem =
because a=20
return immediately followed by a return is a tail-recursive call and is =
handled=20
differently.</P>
<P>The K-machine function calling hardware is easy to describe, but many =
hours=20
were spent proving that it operated correctly even in the presence of=20
interrupts. The hardware implementation evolved through several =
versions. Each=20
had equivalent functionality, but quite different approaches to getting =
the data=20
to the right place at the right time.</P>
<P>To compare the K-machine call hardware with the LMI LAMBDA we counted =
the=20
number of microinstructions executed by a simple function call. The LMI =
LAMBDA=20
executes a minimum of 100 microinstructions to implement a function =
call. These=20
instructions execute at about 250 ns each. The K machine executes a =
function=20
call in a single 80ns instruction. This results in a factor of 300 in=20
performance. The LMI LAMBDA takes 7 seconds to run the tak benchmark. =
The=20
K-machine executes the same code in .03 seconds. This is faster than PCL =
on a=20
Cray-1.</P>
<H2>Multiple values</H2>
<P>There are two mechanisms for returning arguments. When a value is =
returned,=20
the special return functional destination makes sure the return value =
ends up in=20
the appropriate slot in the appropriate frame. This implements the =
standard case=20
of the caller expecting one return value and the callee providing one =
return=20
value.
<P>
<P>When a callee wishes to return multiple values, the first value is =
returned=20
via a mechanism that differs from the single value return only in that =
it sets a=20
flag indicating that additional values are present. Callers expecting a =
single=20
value never check this flag.</P>
<P>When a caller is prepared to accept multiple values, it checks the =
multiple=20
value return flag. If this flag is clear, the caller sets the remaining =
multiple=20
values to NIL. If this flag is set, the multiple values are specified by =
special=20
info left in the stack frame of the callee. This is available to the =
caller in=20
the RETURN frame.</P>
<P>Some functions, such as unwind protect, handle multiple values in a=20
pass-through mode. There is a return path that leaves the =
multiple-value-flag=20
unchanged.</P>
<H2>Linking</H2>
<P>As you may have noticed in the code above, functions are statically =
linked.=20
Functions are re-linked on the fly to simulate value cells. Special =
hardware=20
support is provided to enable "lazy linking".</P>
<P>Each instruction had a link bit that was normally set to 0. When a =
function=20
is re-defined, the old definition is "killed" by setting the link bit to =
1. No=20
other action is taken at this point.</P>
<P>The link bit is noticed by the hardware when an instruction is =
fetched.=20
Because of the pipeline, the processor will still be executing code =
within the=20
caller. If a function is linked to a dead function, the interrupt =
handler calls=20
the linker. It is important to interrupt at instruction fetch time =
because a=20
call to a tail-recursive function leaves no trace of where it came =
from.</P>
<P>If a function is linked to dead function, that does not mean it =
should be=20
unconditionally re-linked to the new definition. Imagine this scenario: =
<PRE>(defun foo () 'foo)
(setf (symbol-function bar) foo)

(defun test1 () (foo))
(defun test2 () (bar))
</PRE>In this case, the symbol functions for foo and bar point to the =
same=20
function object. Thus the code for <CODE>test1</CODE> and =
<CODE>test2</CODE>=20
will be statically linked to the same function.
<P></P>
<P>Now suppose foo is re-defined as follows: <PRE>(defun foo () =
'new-foo)
</PRE><CODE>Test1</CODE> and <CODE>test2</CODE> are not immediately =
re-linked to=20
the new definition. If <CODE>test1</CODE> is executed, when the code for =
foo is=20
fetched in preparation for the call, an interrupt occurs that invokes =
the=20
linker. The linker determines (from extra data associated with=20
<CODE>test1</CODE> and <CODE>foo</CODE>) that <CODE>test1</CODE> should =
be=20
re-linked to the new definition of foo. The linker performs this =
re-linking and=20
restarts the call instruction.
<P></P>
<P>The situation is different for <CODE>test2</CODE> however. In this =
case, the=20
linker determines that the code for bar needs to be unshared from that=20
associated with foo. The linker "resurrects" a copy of the dead code =
associated=20
with foo and re-links bar to this new code. Essentially, if a function =
is called=20
via a function cell, the linker relinks to the new definition, however =
if the=20
function is called via a function pointer, the linker relinks to a copy =
of the=20
old definition.</P>
<P>I have been told that this is similar to <CODE>UUO-LINKS</CODE> used =
in ITS,=20
but we developed this mechanism independently.</P>
<H2>Interrupt handling</H2>
<P>The CONS, CADR, and LAMBDA machines polled for interrupts at the =
microcode=20
level. The K-machine, not having microcode, does not poll for =
interrupts.=20
Interrupts are either synchronous, for instance integer overflow or page =
fault,=20
or asynchronous, such as the 16 millisecond clock.</P>
<P>Interrupts are indicated by a bit vector stored in a functional =
source. If=20
multiple interrupts occur at the same time, the interrupt code can check =
this=20
source to decide what order in which to process the interrupts.</P>
<H3>The Interrupt Handler</H3>
<P>When interrupts are enabled, an interrupt causes several things to =
happen:=20
<UL>
  <LI>Interrupts are disabled. This prevents recursive interrupts and is =
"the=20
  right thing". It is under programmer control to re-enable interrupts =
should=20
  recursive interrupts be desired, but the race condition implicit in =
leaving=20
  the interrupts enabled are avoided.<BR><BR>
  <LI>The register writeback is aborted.<BR><BR>
  <LI>The condition codes are copied to a special location.<BR><BR>
  <LI>The PC is forced to 0. This effects an unconditional transfer to =
location=20
  0. The address of the instruction that would have been executed is =
saved in a=20
  special location.<BR><BR>
  <LI>The pipeline registers are frozen; the clock for the pipeline is=20
  temporarily disabled.</LI></UL>
<P></P>
<P>As the next few instructions proceed, various parts of pipeline are=20
re-enabled. If the right code is in place at location 0, the pipeline =
state will=20
be correctly saved in a global frame reserved for this purpose. The =
interrupt=20
sequence goes something like the following (remember that the pipeline =
registers=20
are not being clocked until further notice):=20
<OL>
  <LI>The output of the ALU is saved, then the output register clock is=20
  enabled.<BR><BR>
  <LI>The condition codes are read from a functional source. This allows =
us to=20
  restart conditional branches should the processor be interrupted =
during one of=20
  the "branch delay" instructions. The saved condition code clock is=20
  re-enabled.<BR><BR>
  <LI>The right hand side of the ALU is passed through the ALU to be =
saved in=20
  the global frame, then the right hand side clock is enabled.<BR><BR>
  <LI>The left hand side of the ALU is passed through the ALU to be =
saved in the=20
  global frame, then the left hand side clock is enabled.<BR><BR>
  <LI>The PC of the interrupted instruction is read and stored in the =
global=20
  frame.<BR></LI></OL>
<P></P>
<P>At this point, the entire state of the pipeline has been saved and =
execution=20
proceeds as normal. The interrupt functional source contains the value =
of the=20
interrupt, so the correct interrupt handler can be located.</P>
<P>Note that there is <EM>no other state</EM> to indicate that the =
processor has=20
been interrupted. Thus it is not necessary to balance every interrupt =
with an=20
interrupt restore.</P>
<P>When the interrupt has been processed, there are three paths back =
into the=20
interrupted code. All of them reload the pipeline in a similar way, =
differing=20
only in the last instruction. As the pipeline is reloaded, the pipeline =
clocks=20
are incrementally disabled in the same order that they were enabled for=20
unloading the pipeline. Pipeline reload happens something like this:=20
<OL>
  <LI>The saved ALU output is passed through and captured in the output=20
  register. The output register is then disabled.<BR><BR>
  <LI>The saved left hand and right hand operands are read and latched =
into the=20
  ALU input registers.<BR><BR>
  <LI>At this point, the pipeline is reloaded. All the pipeline clocks =
are=20
  enabled and control is transferred to the saved PC. In addition, the =
saved=20
  condition codes are used to specify a conditional branch. Should a =
pending=20
  branch be present when restarting, it will be taken as if the =
condition codes=20
  had been generated normally.</LI></OL>
<P></P>
<P>As mentioned before, there were different interrupt crawlout =
routines. The=20
standard one, described above, was used for asynchronous interrupts such =
as the=20
clock. In this case, the pipeline state after the interrupt crawlout is=20
identical to that when the interrupt occurred, and the interrupted =
instruction=20
is re-executed.</P>
<P>Another interrupt crawlout is used for synchronous interrupts. These=20
interrupts are caused by the program and generally indicate an =
error-like=20
condition that can be fixed (something like integer overflow). For these =

interrupts, the ALU output register is <EM>not</EM> enabled until one=20
instruction <EM>after</EM> the interrupted one. To see why this is =
interesting,=20
consider the case of integer overflow.</P>
<P>When integer overflow occurs, an interrupt is taken. The interrupt =
handler=20
conses a bignum result and proceeds to restart the instruction. If the=20
instruction were simply re-executed, the integer overflow would occur =
again.=20
Instead, the bignum value is loaded into the ALU output register during =
the=20
crawlout. When the instruction is re-executed, the overflow condition =
and the=20
output of the ALU are ignored, and the bignum is used instead. This =
presents the=20
illusion that <EM>the addition of two fixnums resulted in a bignum =
pointer. Thus=20
we are able to abstract away the limitations of the ALU.</EM></P>
<P>One more crawlout is provided. This causes an immediate interrupt =
before the=20
execution of a single instruction. This seemingly useless ability allows =
the=20
computer to read the contents of the instruction cache without actually=20
executing instructions.</P>
<P>The critical positioning of the instructions in the interrupt crawlin =
and=20
crawlout routines forced us to write them by hand. Other than a few =
pieces of=20
code that clearly benefited from hand tuning (for instance, a=20
<CODE>mapcar</CODE> would speculatively fetch the CDR while testing the =
CAR),=20
there was no code in the system that was not written in Lisp.</P>
<H3>Single Stepping</H3>
<P>The interrupt logic could be programmed to re-interrupt after a =
single=20
instruction. This allows a process to single step another for debugging=20
purposes.</P>
<H3>Interrupt disabling</H3>
<P>It is important to be able to disable interrupts, and the instruction =
to do=20
so must be atomic. A functional source is used to disable interrupts. =
Reads from=20
this source clear the interrupt enable and return the value that the =
interrupt=20
enable flag had before the source was read. This allows reliable control =
of=20
interrupts, even if an asynchronous interrupt interrupts the disable =
interrupts=20
instruction.</P>
<P>Synchronous interrupts are discarded when interrupts are disabled, =
but=20
asynchronous interrupts are simply deferred. When interrupts are =
re-enabled, any=20
pending asynchronous interrupts are immediately taken. Synchronous =
interrupts=20
may be disabled independently.</P>
<H2>Bibliography</H2>
<P>Many sources were consulted for the design of the K-machine. =
Unfortunately, a=20
complete list of sources has been lost in the mists of time. For those=20
forgotten, be assured that no intellectual dishonesty is intended. Among =
the=20
sources consulted were:=20
<UL>
  <LI>The Lisp Machine Manual=20
  <LI>Abelson, Sussman, and Sussman's "Structure and Interpretation of =
Computer=20
  Programs".=20
  <LI>the MIT AI Lab technical reports relating to the Lisp Machine =
Project=20
  <LI>Papers relating to the MIPS processor=20
  <LI>Almost all of Guy Steele's papers.=20
  <LI>Tom Knight's class at MIT on how to build high-performance =
hardware.=20
  <LI>The work of Jonathan Rees, David Kranz, and others of the Yale T =
Project,=20
  and specially the Orbit compiler</LI></UL>
<P></P>
<H2>Appendix</H2>
<H3>What ever happened to the hardware?</H3>
<P>During the investigation of Guy Montpetit, owner of Gigamos, for =
fraud=20
surrounding his use of Japanese investment capital, the Canadian =
government=20
requested that the United States assets be seized pending resolution of =
the=20
case. Coopers and Lybrand held the assets. Gigamos eventually filed for=20
bankruptcy in the U.S., and Coopers and Lybrand sought out a buyer. =
After a=20
period of time, when no investors were found, the material assets of =
Gigamos,=20
including the K-machine board sets, specifications, schematics, and =
printed=20
circuit artwork were sold for salvage to Eli Hefron and Sons in =
Cambridge, MA. I=20
purchased these materials from Eli Hefron and Sons and they are =
currently in my=20
possession.</P>
<SCRIPT language=3DJavascript>=0A=
<!--=0A=
=0A=
// FILE ARCHIVED ON 20041013090637 AND RETRIEVED FROM THE=0A=
// INTERNET ARCHIVE ON 20050920050219.=0A=
// JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.=0A=
// ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.=0A=
// SECTION 108(a)(3)).=0A=
=0A=
   var sWayBackCGI =3D "http://web.archive.org/web/20041013090637/";=0A=
=0A=
   function xLateUrl(aCollection, sProp) {=0A=
      var i =3D 0;=0A=
      for(i =3D 0; i < aCollection.length; i++)=0A=
         if (aCollection[i][sProp].indexOf("mailto:") =3D=3D -1 &&=0A=
             aCollection[i][sProp].indexOf("javascript:") =3D=3D -1)=0A=
            aCollection[i][sProp] =3D sWayBackCGI + =
aCollection[i][sProp];=0A=
   }=0A=
=0A=
   if (document.links)  xLateUrl(document.links, "href");=0A=
   if (document.images) xLateUrl(document.images, "src");=0A=
   if (document.embeds) xLateUrl(document.embeds, "src");=0A=
=0A=
   if (document.body && document.body.background)=0A=
      document.body.background =3D sWayBackCGI + =
document.body.background;=0A=
=0A=
//-->=0A=
=0A=
</SCRIPT>
<!-- Edwin variables: --><!-- Mode: text --><!-- End: --></BODY></HTML>
