papers/20070526_cpu08_divu16_by_u16/cpu08div16by16a.tex

%$Header: /home/dashley/cvsrep/e3ft_gpl01/e3ft_gpl01/dtaipubs/cron/2007/cpu08div16by16a/cpu08div16by16a.tex,v 1.12 2007/05/26 18:25:46 dashley Exp $
%
\documentclass[letterpaper,10pt,titlepage]{article}
%
%\pagestyle{headings}
%
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage[ansinew]{inputenc}
\usepackage[OT1]{fontenc}
\usepackage{graphicx}
%\usepackage{makeidx}
%
%Contributed by Robin Fairbairns of the comp.text.tex newsgroup.  I have no 
%idea how it works ... but it gives a working \bdiv.
%
\makeatletter
\def\bdiv{%
  \nonscript\mskip-\medmuskip\mkern5mu%
  \mathbin{\operator@font div}\penalty900\mkern5mu%
  \nonscript\mskip-\medmuskip}
\makeatother
%
%
\begin{document}
\title{Notes on Economical 16/16 = 16 Unsigned Integer Division on the Freescale CPU08}
\author{\vspace{1cm}\\David T. Ashley\\\texttt{dta@e3ft.com}\\\vspace{1cm}}
\date{\vspace*{8mm}\small{Version Control $ $Revision: 1.12 $ $ \\
      Version Control $ $Date: 2007/05/26 18:25:46 $ $ (UTC) \\
      $ $RCSfile: cpu08div16by16a.tex,v $ $ \\
      \LaTeX{} Compilation Date: \today{}}}
\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
\begin{abstract}
This document contains my notes on attempting to optimize
the 16 = 16/16 unsigned integer division subroutine distributed with
a Freescale CPU08 compiler.  I attempted to replace the standard
shift-compare-subtract algorithm with Knuth's algorithm.

In the end, the attempt was mostly a failure.  There seemed to be no way
to gain the execution time advantages of Knuth's algorithm without increasing
FLASH size.  (The goal would have been to obtain more favorable 
execution time \emph{and} FLASH size, rather than a tradeoff.)
\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction and Overview}
\label{siov0}

The fundamental goal of practical computer integer arithmetic is to obtain
an exact result using machine instructions that are as fast and compact as 
possible.

For three of the four fundamental operations (addition, subtraction, and
multiplication), it is intuitively obvious to most programmers how to 
use existing machine instructions to operate on operands that are larger
than the machine instructions can accommodate.

The fourth fundamental operation (division), however, is less well understood
by a typical programmer than the other three.  It is not obvious to most
programmers how to use machine division instructions to divide integers larger
than the native machine division instructions will accommodate.

In 2007, I noticed that the integer division library functions associated with
a particular \emph{C} compiler were of the shift-compare-subtract variety.
As the CPU08 has a 16/16=8 native division instruction, I suspected but was not
sure that the 16/16=16 division function could be improved.

Note that for a library function used by a \emph{C} compiler, it may not
be necessary (although it may be desirable) to calculate the quotient and the
remainder at the same time.  An algorithm that calculates the quotient but
does not calculate the remainder, or vice-versa, may be acceptable for
use by compiled code. 

Although I haven't examined the \emph{C} standards, I doubt that there is
any special requirement for the quotient or remainder calculated when
the denominator is zero.  The only requirement is probably that the calculation
not continue indefinitely (i.e. not an infinite loop).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Nomenclature and Notation}
\label{snom0}

I use the nomenclature ``16/16=16 division'' to denote a division function that
accepts a 16-bit unsigned numerator and a 16-bit unsigned denominator; and
produces either/both a 16-bit quotient and a 16-bit remainder.

I use $n$ to denote the numerator, $d$ to denote the denominator, $q$ to denote
the integer quotient, and $r$ to denote the remainder.

Any of the 16-bit quantities can be subscripted with ``H'' or ``L'' to denote
the most-significant or least-significant byte.  For example,

\begin{equation}
\label{eq:snom0:01}
n = 2^8 n_H + n_L .
\end{equation}

Within any of the 16-bit quantities, bits are subscripted in the traditional fashion.
For example,

\begin{equation}
\label{eq:snom0:02}
n = \sum_{i=0}^{15} 2^i n_i .
\end{equation}

Within any of the 8-bit quantities, bits are also subscripted in the traditional fashion.
For example,

\begin{equation}
\label{eq:snom0:03}
n = 2^8 \sum_{i=0}^{7} 2^i n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} 
  =     \sum_{i=0}^{7} 2^{i+8} n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} .
\end{equation}

Because only non-negative integers are involved, \emph{floor()} and 
\emph{div} are used interchangeably, i.e.

\begin{equation}
\label{eq:snom0:04}
\left\lfloor \frac{a}{b} \right\rfloor = a \bdiv b, \;\; a,b \geq 0 .
\end{equation}

An identity that is frequently used in this document is

\begin{eqnarray}
\label{eq:snom0:05}
\frac{a}{b} & = & a \bdiv b + \frac{a \bmod b}{b} \\
\label{eq:snom0:06}
            & = & \left\lfloor \frac{a}{b} \right\rfloor + \frac{a \bmod b}{b} .
\end{eqnarray}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Analysis of Existing Code}
\label{saec0}
 
This section presents an analysis of the behavior and characteristics of
the existing code.

 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Listing}
\label{saec0:slec0}

The existing code is reproduced below, with line numbers.
 
\begin{verbatim}
001: c_umod:
002:         bsr        c_udiv         ; divide
003: pshh                              ; transfer MSB
004:         pula                      ; in place
005:         sta        c_reg
006:         txa                       ; LSB in place
007:         rts                       ; and return
008: ;
009: c_udiv:
010:         psha                      ; save NL
011:         pshh                      ; DH on the stack
012:         tst        1,sp           ; test zero
013:         bne        full           ; full division
014:         pulh                      ; clean stack
015:         cpx        c_reg          ; compare DL with NH
016:         bls        half           ; half division
017:         lda        c_reg          ; NH
018:         psha                      ; in
019:         pulh                      ; H
020:         pula                      ; NL in A
021:         div                       ; DL in X, divide
022:         clr        c_reg          ; QH is zero
023: fini:
024:         pshh                      ; move RL
025:         pulx                      ; in X
026:         clrh                      ; RH is zero
027:         rts                       ; and return
028: half:
029:         lda        c_reg          ; NH in A
030:         clrh                      ; 1st divide 8 X 8
031:         div                       ; divide
032:         sta        c_reg          ; QH in place
033:         pula                      ; complete dividend with NL
034:         div                       ; divide
035:         bra        fini           ; complete remainder
036: full:
037:         pshx                      ; save DL
038:         ldx        #8             ; counter
039:         clra                      ; Extention
040:         pshx                      ; save on
041:         psha                      ; the stack
042:         tsx                       ; stack addressed by H:X
043: bcl:
044:         lsl        4,x            ; shift E:NH:NL
045:         rol        c_reg          ; left
046:         rola
047:         sta        0,x            ; store E
048:         cmp        3,x            ; compare DH
049:         blo        next           ; too low, continue
050:         bne        ok             ; ok to subtract
051:         lda        c_reg          ; compare NH
052:         sub        2,x            ; with DL
053:         bhs        ok2            ; ok, complete result
054:         lda        0,x            ; restore value
055:         bra        next           ; and continue
056: ok:
057:         lda        c_reg          ; substract D
058:         sub        2,x            ; from E:NH
059: ok2:
060:         sta        c_reg          ; in place
061:         lda        0,x
062:         sbc        3,x
063:         inc        4,x            ; set result bit
064: next:
065:         dbnz       1,x,bcl        ; count down and loop
066:         sta        0,x            ; store E
067:         pulh                      ; RH in place
068:         ais        #3             ; clean up stack
069:         ldx        c_reg          ; RL in place
070:         pula                      ; QL in place
071:         clr        c_reg          ; QH is zero
072:         rts                       ; and return
\end{verbatim}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Algorithmic Analysis}
\label{saec0:saan0}

The algorithm is divided into two cases:

\begin{itemize}
\item Case I:  $d_H = 0$ (lines 14-35).
\item Case II: $d_H > 0$ (lines 36-72).
\end{itemize} 

 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsubsection{Case I}
\label{saec0:saan0:scas0}

In the case of $d_H = 0$, the division is 16/8:

\begin{eqnarray}
\label{eq:saec0:saan0:scas0:01}
\frac{2^8 n_H + n_L}{d_L} & = & \frac{2^8 n_H}{d_L} + \frac{n_L}{d_L} \\
\label{eq:saec0:saan0:scas0:02}
& = & 2^8 \left( n_H \bdiv d_L + \frac{n_H \bmod d_L}{d_L} \right) + \frac{n_L}{d_L} \\
\label{eq:saec0:saan0:scas0:03}
& = & 2^8 (n_H \bdiv d_L) + \frac{2^8 (n_H \bmod d_L) + n_L}{d_L}
\end{eqnarray}

(\ref{eq:saec0:saan0:scas0:03}) is an exact expression involving rational numbers.
However, we don't want to calculate the left side of (\ref{eq:saec0:saan0:scas0:03});
rather, we wish to calculate its \emph{floor()}\@.  Applying the
\emph{floor()} function to both sides of (\ref{eq:saec0:saan0:scas0:03}) yields:

\begin{equation}
\label{eq:saec0:saan0:scas0:04}
\left\lfloor \frac{2^8 n_H + n_L}{d_L} \right\rfloor
=
2^8 (n_H \bdiv d_L) + \left\lfloor \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} \right\rfloor .
\end{equation}

Note that (\ref{eq:saec0:saan0:scas0:04}) is in a form
that can be readily evaluated using a processor with 16/8 division 
capability; so long as 

\begin{equation}
\label{eq:saec0:saan0:scas0:05}
\frac{2^8 (n_H \bmod d_L) + n_L}{d_L} < 2^8 ,
\end{equation}

\noindent{}a fact that can be easily verified by the reader.

(\ref{eq:saec0:saan0:scas0:04}) can be readily evaluated by a processor with 
16/8 division capability because such a division instruction always provides
both quotient and remainder.  It is easy to see that (\ref{eq:saec0:saan0:scas0:04})
can be evaluated with a division, a re-staging of bytes, and a second division.

If (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested, it needs to be verified
whether the remainder of the second division is the same as the remainder
of the larger division, i.e.

\begin{equation}
\label{eq:saec0:saan0:scas0:06}
(2^8 n_H + n_L) \bmod d_L =? ((2^8 \bmod d_L) + n_L) \bmod d_L.
\end{equation}

The question of whether (\ref{eq:saec0:saan0:scas0:06}) is an
equality is the question of whether

\begin{equation}
\label{eq:saec0:saan0:scas0:07}
ka \bmod b = (k (a \bmod b)) \bmod b .
\end{equation}

In order to prove or disprove (\ref{eq:saec0:saan0:scas0:07}),
decompose $a$ into $ib + j$.  Then, since $kib \bmod b = 0$,

\begin{eqnarray}
\label{eq:saec0:saan0:scas0:08}
k (ib + j) \bmod b & = & k j \bmod b \\
\label{eq:saec0:saan0:scas0:09}
k j \bmod b & = & k j \bmod b .
\end{eqnarray}

Thus, if (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested (with
two divisions), the final remainder will be the same as the remainder for the
original division.  (\ref{eq:saec0:saan0:scas0:04}) will, in fact, deliver both
the quotient and remainder economically.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsubsection{Case II}
\label{saec0:saan0:scas1}

The case of $d_H > 0$ (\S{}\ref{saec0:slec0}, lines 36-72) is conventional
shift-compare-subtract division.  Only eight iterations of the loop are required
because with $d_H > 0$, $d \geq 2^8$, and $n/d < 2^8$.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Timing Analysis}
\label{saec0:stan0}

The code of \S{}\ref{saec0:slec0} is reproduced below, with instruction
timing (number of clocks) and FLASH requirements (number of bytes) added
as (clocks/bytes).  It was determined that \emph{c\_reg} resides
in zero-page (i.e. \emph{direct}) memory.

\begin{verbatim}
001: c_umod:
002:         bsr        c_udiv     4/2 ; divide
003: pshh                              ; transfer MSB
004:         pula                  2/1 ; in place
005:         sta        c_reg      3/2
006:         txa                   1/1 ; LSB in place
007:         rts                   4/1 ; and return
008: ;
009: c_udiv:
010:         psha                  2/1 ; save NL
011:         pshh                  2/1 ; DH on the stack
012:         tst        1,sp       4/3 ; test zero
013:         bne        full       3/2 ; full division
014:         pulh                  2/1 ; clean stack
015:         cpx        c_reg      3/2 ; compare DL with NH
016:         bls        half       3/2 ; half division
017:         lda        c_reg      3/2 ; NH
018:         psha                  2/1 ; in
019:         pulh                  2/1 ; H
020:         pula                  2/1 ; NL in A
021:         div                   7/1 ; DL in X, divide
022:         clr        c_reg      3/2 ; QH is zero
023: fini:
024:         pshh                  2/1 ; move RL
025:         pulx                  2/1 ; in X
026:         clrh                  1/1 ; RH is zero
027:         rts                   4/1 ; and return
028: half:
029:         lda        c_reg      3/2 ; NH in A
030:         clrh                  1/1 ; 1st divide 8 X 8
031:         div                   7/1 ; divide
032:         sta        c_reg      3/2 ; QH in place
033:         pula                  2/1 ; complete dividend with NL
034:         div                   7/1 ; divide
035:         bra        fini       3/2 ; complete remainder
036: full:
037:         pshx                  2/1 ; save DL
038:         ldx        #8         2/2 ; counter
039:         clra                  1/1 ; Extention
040:         pshx                  2/1 ; save on
041:         psha                  2/1 ; the stack
042:         tsx                   2/1 ; stack addressed by H:X
043: bcl:
044:         lsl        4,x        4/2 ; shift E:NH:NL
045:         rol        c_reg      4/2 ; left
046:         rola                  1/1
047:         sta        0,x        2/1 ; store E
048:         cmp        3,x        3/2 ; compare DH
049:         blo        next       3/2 ; too low, continue
050:         bne        ok         3/2 ; ok to subtract
051:         lda        c_reg      3/2 ; compare NH
052:         sub        2,x        3/2 ; with DL
053:         bhs        ok2        3/2 ; ok, complete result
054:         lda        0,x        2/1 ; restore value
055:         bra        next       3/2 ; and continue
056: ok:
057:         lda        c_reg      3/2 ; substract D
058:         sub        2,x        3/2 ; from E:NH
059: ok2:
060:         sta        c_reg      3/2 ; in place
061:         lda        0,x        2/1
062:         sbc        3,x        3/2
063:         inc        4,x        3/2 ; set result bit
064: next:
065:         dbnz       1,x,bcl    5/3 ; count down and loop
066:         sta        0,x        2/1 ; store E
067:         pulh                  2/1 ; RH in place
068:         ais        #3         2/2 ; clean up stack
069:         ldx        c_reg      3/2 ; RL in place
070:         pula                  2/1 ; QL in place
071:         clr        c_reg      3/2 ; QH is zero
072:         rts                   4/1 ; and return
\end{verbatim}

There are three distinct timing cases for the \emph{c\_udiv} function:

\begin{enumerate}
\item $d_H = 0$ and $n_H < d_L$:
      47 clocks are required, representing straight flow of the 
      instructions from line 10 through line 27.
\item $d_H = 0$ and $n_H \geq d_L$:  54 clocks are required.
\item $d_H > 0$ and every bit of the quotient is 1, in which case
      400 clocks are required.  This represents 22 clocks up through
      line 42, 45 clocks $\times$ 8 in the lines from 43 through 65, and
      18 clocks in the lines from 66 through 72.
\end{enumerate}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{FLASH/RAM Consumption Analysis}
\label{saec0:sfra0}

From \S{}\ref{saec0:stan0}, 93 bytes of FLASH are used.  Only one byte
of RAM is used (\emph{c\_reg}, probably shared with other functions
as well).


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Potential Optimizations}
\label{spop0}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Potential Case I Optimizations}
\label{spop0:scas0}

This section corresponds to \emph{Case I} of \S{}\ref{saec0:saan0:scas0}.

The most obvious observation about the code (\S{}\ref{saec0:slec0}) is that
division instructions are very inexpensive on the CPU08---7 clock cycles,
or about 2 typical instructions.  Branching based on
$n_H \geq d_L$ 
(\S{}\ref{saec0:slec0}, lines 15-16)
may cost more in the test, the branch, and in other
data transfer overhead than is saved.  It may make sense to apply the
full formula in (\ref{eq:saec0:saan0:scas0:04}) in all cases where
$d_H = 0$.

When $d_H = 0$, and if one assumes normal distribution of data, the expected
value of execution time is about $(47+54)/2 = 50.5$ clocks.\footnote{If the
data is assumed normally distributed, $n_H$ has about a 0.5 probability of being at least as large
as $d_L$.}

The code below combines two of the three timing cases into one by ignoring the
relationship between $n_H$ and $d_L$.

\begin{verbatim}
;Condition at function entry:
;N_H in 1,SP
;N_L in A
;D_H in H
;D_L in X
;
;Condition at function exit:
;Q_H in c_reg
;Q_L in A
;R_H in H
;R_L in X
;
c_udiv:
   psha            2/1 ; save NL
   pshh            2/1 ; DH on the stack
   tst   1,sp      4/3 ; test zero
   bne   full      3/2 ; full division
;
;From here on we're committed to the division with
;arbitary numerator, and denominator <= 255.
;  N_H at 3,sp
;  N_L at 2,sp
;  D_H at 1,sp
;
   clrh            1/1
   lda   3,sp      4/3 ; H:A now contains N_H
   div             7/1 ; divide
   sta   c_reg     3/2 ; QH in place
   lda   2,sp      4/3 ; complete dividend with NL
   div             7/1 ; divide.  Q_L in A, R_L in H
   pshh            2/1 ; move RL
   pulx            2/1 ; in X
   clrh            1/1 ; RH is zero
   ais   #3        2/2 ; clean stack
   rts             4/1 ; and return
\end{verbatim}

Although the code does raise the minimum execution time from 47 to
48 clocks:

\begin{itemize}
\item It lowers the expected value of the $d_H=0$ execution time from
      50.5 to 48 clocks.
\item It saves approximately 15 bytes of FLASH.
\end{itemize}

This optimization is recommended.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Potential Case II Optimizations}
\label{spop0:scas1}

This section corresponds to \emph{Case II} of \S{}\ref{saec0:saan0:scas1}.

I sent an e-mail to an engineer at the compiler manufacturer indicating
that:

\begin{itemize}
\item I believed \emph{Case I} could be optimized as indicated earlier
      in the document.
\item I believed \emph{Case II} could be optimized by applying Knuth's
      algorithm.      
\end{itemize}

The reply I received from the compiler manufacturer was that:

\begin{itemize}
\item There was agreeement about \emph{Case I}.
\item \emph{Case II} may be a little faster using Knuth's algorithm, but would
      definitely be larger (code used to evaluate the application of Knuth's
      algorithm was also provided).
\end{itemize}

In the test code provided by the compiler manufacturer, the approach used was 
to obtain a trial quotient and then to subtract the divisor from the remainder
up to twice to adjust the quotient up to 2 counts downward (Knuth's algorithm).

I did try a different approach (to iterate on the quotient and to reconstruct
$q \times d$ with decreasing $q$).  This approach promised to be slightly more
compact because $q \times d$ reconstruction was reutilized.  However, it worked out
to occupy about 153 bytes rather than 103 for the shift-compare-subtract
algorithm (timing was not examined).  (This test code is version-controlled
in the same directory as this \LaTeX{} document.)

At this point I agree with the
compiler manufacturer 
that there is a tradeoff between size and speed (it seems
nearly impossible to get both).

In any reimplementation of this algorithm, will probably need to choose between
size and speed.  I believe there is some possibility to reduce the reimplementation
from 153 bytes, but not down to 103 bytes.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% $Log: cpu08div16by16a.tex,v $
% Revision 1.12  2007/05/26 18:25:46  dashley
% Edits to incorporate final thoughts (for now) and test code results.
%
% Revision 1.11  2007/05/21 16:49:40  dashley
% \bdiv functionality as suggested by Robin Fairbairns added.
%
% Revision 1.10  2007/05/18 15:53:59  dashley
% Edits.
%
% Revision 1.9  2007/05/18 02:02:56  dashley
% Edits.
%
% Revision 1.8  2007/05/18 02:02:10  dashley
% Edits.
%
% Revision 1.7  2007/05/18 01:40:20  dashley
% Edits.
%
% Revision 1.6  2007/05/18 00:15:56  dashley
% Edits.
%
% Revision 1.5  2007/05/17 15:53:57  dashley
% Edits.
%
% Revision 1.4  2007/05/17 14:46:34  dashley
% Edits.
%
% Revision 1.3  2007/05/17 04:15:22  dashley
% Edits.
%
% Revision 1.2  2007/05/17 03:50:48  dashley
% Edits.
%
% Revision 1.1  2007/05/17 03:09:06  dashley
% Initial checkin.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1	%$Header: /home/dashley/cvsrep/e3ft_gpl01/e3ft_gpl01/dtaipubs/cron/2007/cpu08div16by16a/cpu08div16by16a.tex,v 1.12 2007/05/26 18:25:46 dashley Exp $
2	%
3	\documentclass[letterpaper,10pt,titlepage]{article}
4	%
5	%\pagestyle{headings}
6	%
7	\usepackage{amsmath}
8	\usepackage{amsfonts}
9	\usepackage{amssymb}
10	\usepackage[ansinew]{inputenc}
11	\usepackage[OT1]{fontenc}
12	\usepackage{graphicx}
13	%\usepackage{makeidx}
14	%
15	%Contributed by Robin Fairbairns of the comp.text.tex newsgroup. I have no
16	%idea how it works ... but it gives a working \bdiv.
17	%
18	\makeatletter
19	\def\bdiv{%
20	\nonscript\mskip-\medmuskip\mkern5mu%
21	\mathbin{\operator@font div}\penalty900\mkern5mu%
22	\nonscript\mskip-\medmuskip}
23	\makeatother
24	%
25	%
26	\begin{document}
27	\title{Notes on Economical 16/16 = 16 Unsigned Integer Division on the Freescale CPU08}
28	\author{\vspace{1cm}\\David T. Ashley\\\texttt{dta@e3ft.com}\\\vspace{1cm}}
29	\date{\vspace*{8mm}\small{Version Control $ $Revision: 1.12 $ $ \\
30	Version Control $ $Date: 2007/05/26 18:25:46 $ $ (UTC) \\
31	$ $RCSfile: cpu08div16by16a.tex,v $ $ \\
32	\LaTeX{} Compilation Date: \today{}}}
33	\maketitle
34
35	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
36	%
37	\begin{abstract}
38	This document contains my notes on attempting to optimize
39	the 16 = 16/16 unsigned integer division subroutine distributed with
40	a Freescale CPU08 compiler. I attempted to replace the standard
41	shift-compare-subtract algorithm with Knuth's algorithm.
42
43	In the end, the attempt was mostly a failure. There seemed to be no way
44	to gain the execution time advantages of Knuth's algorithm without increasing
45	FLASH size. (The goal would have been to obtain more favorable
46	execution time \emph{and} FLASH size, rather than a tradeoff.)
47	\end{abstract}
48
49	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
50
51	\section{Introduction and Overview}
52	\label{siov0}
53
54	The fundamental goal of practical computer integer arithmetic is to obtain
55	an exact result using machine instructions that are as fast and compact as
56	possible.
57
58	For three of the four fundamental operations (addition, subtraction, and
59	multiplication), it is intuitively obvious to most programmers how to
60	use existing machine instructions to operate on operands that are larger
61	than the machine instructions can accommodate.
62
63	The fourth fundamental operation (division), however, is less well understood
64	by a typical programmer than the other three. It is not obvious to most
65	programmers how to use machine division instructions to divide integers larger
66	than the native machine division instructions will accommodate.
67
68	In 2007, I noticed that the integer division library functions associated with
69	a particular \emph{C} compiler were of the shift-compare-subtract variety.
70	As the CPU08 has a 16/16=8 native division instruction, I suspected but was not
71	sure that the 16/16=16 division function could be improved.
72
73	Note that for a library function used by a \emph{C} compiler, it may not
74	be necessary (although it may be desirable) to calculate the quotient and the
75	remainder at the same time. An algorithm that calculates the quotient but
76	does not calculate the remainder, or vice-versa, may be acceptable for
77	use by compiled code.
78
79	Although I haven't examined the \emph{C} standards, I doubt that there is
80	any special requirement for the quotient or remainder calculated when
81	the denominator is zero. The only requirement is probably that the calculation
82	not continue indefinitely (i.e. not an infinite loop).
83
84	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
85
86	\section{Nomenclature and Notation}
87	\label{snom0}
88
89	I use the nomenclature ``16/16=16 division'' to denote a division function that
90	accepts a 16-bit unsigned numerator and a 16-bit unsigned denominator; and
91	produces either/both a 16-bit quotient and a 16-bit remainder.
92
93	I use $n$ to denote the numerator, $d$ to denote the denominator, $q$ to denote
94	the integer quotient, and $r$ to denote the remainder.
95
96	Any of the 16-bit quantities can be subscripted with ``H'' or ``L'' to denote
97	the most-significant or least-significant byte. For example,
98
99	\begin{equation}
100	\label{eq:snom0:01}
101	n = 2^8 n_H + n_L .
102	\end{equation}
103
104	Within any of the 16-bit quantities, bits are subscripted in the traditional fashion.
105	For example,
106
107	\begin{equation}
108	\label{eq:snom0:02}
109	n = \sum_{i=0}^{15} 2^i n_i .
110	\end{equation}
111
112	Within any of the 8-bit quantities, bits are also subscripted in the traditional fashion.
113	For example,
114
115	\begin{equation}
116	\label{eq:snom0:03}
117	n = 2^8 \sum_{i=0}^{7} 2^i n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i}
118	= \sum_{i=0}^{7} 2^{i+8} n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} .
119	\end{equation}
120
121	Because only non-negative integers are involved, \emph{floor()} and
122	\emph{div} are used interchangeably, i.e.
123
124	\begin{equation}
125	\label{eq:snom0:04}
126	\left\lfloor \frac{a}{b} \right\rfloor = a \bdiv b, \;\; a,b \geq 0 .
127	\end{equation}
128
129	An identity that is frequently used in this document is
130
131	\begin{eqnarray}
132	\label{eq:snom0:05}
133	\frac{a}{b} & = & a \bdiv b + \frac{a \bmod b}{b} \\
134	\label{eq:snom0:06}
135	& = & \left\lfloor \frac{a}{b} \right\rfloor + \frac{a \bmod b}{b} .
136	\end{eqnarray}
137
138
139	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
140
141	\section{Analysis of Existing Code}
142	\label{saec0}
143
144	This section presents an analysis of the behavior and characteristics of
145	the existing code.
146
147
148	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
149
150	\subsection{Listing}
151	\label{saec0:slec0}
152
153	The existing code is reproduced below, with line numbers.
154
155	\begin{verbatim}
156	001: c_umod:
157	002: bsr c_udiv ; divide
158	003: pshh ; transfer MSB
159	004: pula ; in place
160	005: sta c_reg
161	006: txa ; LSB in place
162	007: rts ; and return
163	008: ;
164	009: c_udiv:
165	010: psha ; save NL
166	011: pshh ; DH on the stack
167	012: tst 1,sp ; test zero
168	013: bne full ; full division
169	014: pulh ; clean stack
170	015: cpx c_reg ; compare DL with NH
171	016: bls half ; half division
172	017: lda c_reg ; NH
173	018: psha ; in
174	019: pulh ; H
175	020: pula ; NL in A
176	021: div ; DL in X, divide
177	022: clr c_reg ; QH is zero
178	023: fini:
179	024: pshh ; move RL
180	025: pulx ; in X
181	026: clrh ; RH is zero
182	027: rts ; and return
183	028: half:
184	029: lda c_reg ; NH in A
185	030: clrh ; 1st divide 8 X 8
186	031: div ; divide
187	032: sta c_reg ; QH in place
188	033: pula ; complete dividend with NL
189	034: div ; divide
190	035: bra fini ; complete remainder
191	036: full:
192	037: pshx ; save DL
193	038: ldx #8 ; counter
194	039: clra ; Extention
195	040: pshx ; save on
196	041: psha ; the stack
197	042: tsx ; stack addressed by H:X
198	043: bcl:
199	044: lsl 4,x ; shift E:NH:NL
200	045: rol c_reg ; left
201	046: rola
202	047: sta 0,x ; store E
203	048: cmp 3,x ; compare DH
204	049: blo next ; too low, continue
205	050: bne ok ; ok to subtract
206	051: lda c_reg ; compare NH
207	052: sub 2,x ; with DL
208	053: bhs ok2 ; ok, complete result
209	054: lda 0,x ; restore value
210	055: bra next ; and continue
211	056: ok:
212	057: lda c_reg ; substract D
213	058: sub 2,x ; from E:NH
214	059: ok2:
215	060: sta c_reg ; in place
216	061: lda 0,x
217	062: sbc 3,x
218	063: inc 4,x ; set result bit
219	064: next:
220	065: dbnz 1,x,bcl ; count down and loop
221	066: sta 0,x ; store E
222	067: pulh ; RH in place
223	068: ais #3 ; clean up stack
224	069: ldx c_reg ; RL in place
225	070: pula ; QL in place
226	071: clr c_reg ; QH is zero
227	072: rts ; and return
228	\end{verbatim}
229
230
231	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
232
233	\subsection{Algorithmic Analysis}
234	\label{saec0:saan0}
235
236	The algorithm is divided into two cases:
237
238	\begin{itemize}
239	\item Case I: $d_H = 0$ (lines 14-35).
240	\item Case II: $d_H > 0$ (lines 36-72).
241	\end{itemize}
242
243
244	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
245
246	\subsubsection{Case I}
247	\label{saec0:saan0:scas0}
248
249	In the case of $d_H = 0$, the division is 16/8:
250
251	\begin{eqnarray}
252	\label{eq:saec0:saan0:scas0:01}
253	\frac{2^8 n_H + n_L}{d_L} & = & \frac{2^8 n_H}{d_L} + \frac{n_L}{d_L} \\
254	\label{eq:saec0:saan0:scas0:02}
255	& = & 2^8 \left( n_H \bdiv d_L + \frac{n_H \bmod d_L}{d_L} \right) + \frac{n_L}{d_L} \\
256	\label{eq:saec0:saan0:scas0:03}
257	& = & 2^8 (n_H \bdiv d_L) + \frac{2^8 (n_H \bmod d_L) + n_L}{d_L}
258	\end{eqnarray}
259
260	(\ref{eq:saec0:saan0:scas0:03}) is an exact expression involving rational numbers.
261	However, we don't want to calculate the left side of (\ref{eq:saec0:saan0:scas0:03});
262	rather, we wish to calculate its \emph{floor()}\@. Applying the
263	\emph{floor()} function to both sides of (\ref{eq:saec0:saan0:scas0:03}) yields:
264
265	\begin{equation}
266	\label{eq:saec0:saan0:scas0:04}
267	\left\lfloor \frac{2^8 n_H + n_L}{d_L} \right\rfloor
268	=
269	2^8 (n_H \bdiv d_L) + \left\lfloor \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} \right\rfloor .
270	\end{equation}
271
272	Note that (\ref{eq:saec0:saan0:scas0:04}) is in a form
273	that can be readily evaluated using a processor with 16/8 division
274	capability; so long as
275
276	\begin{equation}
277	\label{eq:saec0:saan0:scas0:05}
278	\frac{2^8 (n_H \bmod d_L) + n_L}{d_L} < 2^8 ,
279	\end{equation}
280
281	\noindent{}a fact that can be easily verified by the reader.
282
283	(\ref{eq:saec0:saan0:scas0:04}) can be readily evaluated by a processor with
284	16/8 division capability because such a division instruction always provides
285	both quotient and remainder. It is easy to see that (\ref{eq:saec0:saan0:scas0:04})
286	can be evaluated with a division, a re-staging of bytes, and a second division.
287
288	If (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested, it needs to be verified
289	whether the remainder of the second division is the same as the remainder
290	of the larger division, i.e.
291
292	\begin{equation}
293	\label{eq:saec0:saan0:scas0:06}
294	(2^8 n_H + n_L) \bmod d_L =? ((2^8 \bmod d_L) + n_L) \bmod d_L.
295	\end{equation}
296
297	The question of whether (\ref{eq:saec0:saan0:scas0:06}) is an
298	equality is the question of whether
299
300	\begin{equation}
301	\label{eq:saec0:saan0:scas0:07}
302	ka \bmod b = (k (a \bmod b)) \bmod b .
303	\end{equation}
304
305	In order to prove or disprove (\ref{eq:saec0:saan0:scas0:07}),
306	decompose $a$ into $ib + j$. Then, since $kib \bmod b = 0$,
307
308	\begin{eqnarray}
309	\label{eq:saec0:saan0:scas0:08}
310	k (ib + j) \bmod b & = & k j \bmod b \\
311	\label{eq:saec0:saan0:scas0:09}
312	k j \bmod b & = & k j \bmod b .
313	\end{eqnarray}
314
315	Thus, if (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested (with
316	two divisions), the final remainder will be the same as the remainder for the
317	original division. (\ref{eq:saec0:saan0:scas0:04}) will, in fact, deliver both
318	the quotient and remainder economically.
319
320	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
321
322	\subsubsection{Case II}
323	\label{saec0:saan0:scas1}
324
325	The case of $d_H > 0$ (\S{}\ref{saec0:slec0}, lines 36-72) is conventional
326	shift-compare-subtract division. Only eight iterations of the loop are required
327	because with $d_H > 0$, $d \geq 2^8$, and $n/d < 2^8$.
328
329
330	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
331
332	\subsection{Timing Analysis}
333	\label{saec0:stan0}
334
335	The code of \S{}\ref{saec0:slec0} is reproduced below, with instruction
336	timing (number of clocks) and FLASH requirements (number of bytes) added
337	as (clocks/bytes). It was determined that \emph{c\_reg} resides
338	in zero-page (i.e. \emph{direct}) memory.
339
340	\begin{verbatim}
341	001: c_umod:
342	002: bsr c_udiv 4/2 ; divide
343	003: pshh ; transfer MSB
344	004: pula 2/1 ; in place
345	005: sta c_reg 3/2
346	006: txa 1/1 ; LSB in place
347	007: rts 4/1 ; and return
348	008: ;
349	009: c_udiv:
350	010: psha 2/1 ; save NL
351	011: pshh 2/1 ; DH on the stack
352	012: tst 1,sp 4/3 ; test zero
353	013: bne full 3/2 ; full division
354	014: pulh 2/1 ; clean stack
355	015: cpx c_reg 3/2 ; compare DL with NH
356	016: bls half 3/2 ; half division
357	017: lda c_reg 3/2 ; NH
358	018: psha 2/1 ; in
359	019: pulh 2/1 ; H
360	020: pula 2/1 ; NL in A
361	021: div 7/1 ; DL in X, divide
362	022: clr c_reg 3/2 ; QH is zero
363	023: fini:
364	024: pshh 2/1 ; move RL
365	025: pulx 2/1 ; in X
366	026: clrh 1/1 ; RH is zero
367	027: rts 4/1 ; and return
368	028: half:
369	029: lda c_reg 3/2 ; NH in A
370	030: clrh 1/1 ; 1st divide 8 X 8
371	031: div 7/1 ; divide
372	032: sta c_reg 3/2 ; QH in place
373	033: pula 2/1 ; complete dividend with NL
374	034: div 7/1 ; divide
375	035: bra fini 3/2 ; complete remainder
376	036: full:
377	037: pshx 2/1 ; save DL
378	038: ldx #8 2/2 ; counter
379	039: clra 1/1 ; Extention
380	040: pshx 2/1 ; save on
381	041: psha 2/1 ; the stack
382	042: tsx 2/1 ; stack addressed by H:X
383	043: bcl:
384	044: lsl 4,x 4/2 ; shift E:NH:NL
385	045: rol c_reg 4/2 ; left
386	046: rola 1/1
387	047: sta 0,x 2/1 ; store E
388	048: cmp 3,x 3/2 ; compare DH
389	049: blo next 3/2 ; too low, continue
390	050: bne ok 3/2 ; ok to subtract
391	051: lda c_reg 3/2 ; compare NH
392	052: sub 2,x 3/2 ; with DL
393	053: bhs ok2 3/2 ; ok, complete result
394	054: lda 0,x 2/1 ; restore value
395	055: bra next 3/2 ; and continue
396	056: ok:
397	057: lda c_reg 3/2 ; substract D
398	058: sub 2,x 3/2 ; from E:NH
399	059: ok2:
400	060: sta c_reg 3/2 ; in place
401	061: lda 0,x 2/1
402	062: sbc 3,x 3/2
403	063: inc 4,x 3/2 ; set result bit
404	064: next:
405	065: dbnz 1,x,bcl 5/3 ; count down and loop
406	066: sta 0,x 2/1 ; store E
407	067: pulh 2/1 ; RH in place
408	068: ais #3 2/2 ; clean up stack
409	069: ldx c_reg 3/2 ; RL in place
410	070: pula 2/1 ; QL in place
411	071: clr c_reg 3/2 ; QH is zero
412	072: rts 4/1 ; and return
413	\end{verbatim}
414
415	There are three distinct timing cases for the \emph{c\_udiv} function:
416
417	\begin{enumerate}
418	\item $d_H = 0$ and $n_H < d_L$:
419	47 clocks are required, representing straight flow of the
420	instructions from line 10 through line 27.
421	\item $d_H = 0$ and $n_H \geq d_L$: 54 clocks are required.
422	\item $d_H > 0$ and every bit of the quotient is 1, in which case
423	400 clocks are required. This represents 22 clocks up through
424	line 42, 45 clocks $\times$ 8 in the lines from 43 through 65, and
425	18 clocks in the lines from 66 through 72.
426	\end{enumerate}
427
428
429	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
430
431	\subsection{FLASH/RAM Consumption Analysis}
432	\label{saec0:sfra0}
433
434	From \S{}\ref{saec0:stan0}, 93 bytes of FLASH are used. Only one byte
435	of RAM is used (\emph{c\_reg}, probably shared with other functions
436	as well).
437
438
439	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
440
441	\section{Potential Optimizations}
442	\label{spop0}
443
444
445	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
446
447	\subsection{Potential Case I Optimizations}
448	\label{spop0:scas0}
449
450	This section corresponds to \emph{Case I} of \S{}\ref{saec0:saan0:scas0}.
451
452	The most obvious observation about the code (\S{}\ref{saec0:slec0}) is that
453	division instructions are very inexpensive on the CPU08---7 clock cycles,
454	or about 2 typical instructions. Branching based on
455	$n_H \geq d_L$
456	(\S{}\ref{saec0:slec0}, lines 15-16)
457	may cost more in the test, the branch, and in other
458	data transfer overhead than is saved. It may make sense to apply the
459	full formula in (\ref{eq:saec0:saan0:scas0:04}) in all cases where
460	$d_H = 0$.
461
462	When $d_H = 0$, and if one assumes normal distribution of data, the expected
463	value of execution time is about $(47+54)/2 = 50.5$ clocks.\footnote{If the
464	data is assumed normally distributed, $n_H$ has about a 0.5 probability of being at least as large
465	as $d_L$.}
466
467	The code below combines two of the three timing cases into one by ignoring the
468	relationship between $n_H$ and $d_L$.
469
470	\begin{verbatim}
471	;Condition at function entry:
472	;N_H in 1,SP
473	;N_L in A
474	;D_H in H
475	;D_L in X
476	;
477	;Condition at function exit:
478	;Q_H in c_reg
479	;Q_L in A
480	;R_H in H
481	;R_L in X
482	;
483	c_udiv:
484	psha 2/1 ; save NL
485	pshh 2/1 ; DH on the stack
486	tst 1,sp 4/3 ; test zero
487	bne full 3/2 ; full division
488	;
489	;From here on we're committed to the division with
490	;arbitary numerator, and denominator <= 255.
491	; N_H at 3,sp
492	; N_L at 2,sp
493	; D_H at 1,sp
494	;
495	clrh 1/1
496	lda 3,sp 4/3 ; H:A now contains N_H
497	div 7/1 ; divide
498	sta c_reg 3/2 ; QH in place
499	lda 2,sp 4/3 ; complete dividend with NL
500	div 7/1 ; divide. Q_L in A, R_L in H
501	pshh 2/1 ; move RL
502	pulx 2/1 ; in X
503	clrh 1/1 ; RH is zero
504	ais #3 2/2 ; clean stack
505	rts 4/1 ; and return
506	\end{verbatim}
507
508	Although the code does raise the minimum execution time from 47 to
509	48 clocks:
510
511	\begin{itemize}
512	\item It lowers the expected value of the $d_H=0$ execution time from
513	50.5 to 48 clocks.
514	\item It saves approximately 15 bytes of FLASH.
515	\end{itemize}
516
517	This optimization is recommended.
518
519
520	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
521
522	\subsection{Potential Case II Optimizations}
523	\label{spop0:scas1}
524
525	This section corresponds to \emph{Case II} of \S{}\ref{saec0:saan0:scas1}.
526
527	I sent an e-mail to an engineer at the compiler manufacturer indicating
528	that:
529
530	\begin{itemize}
531	\item I believed \emph{Case I} could be optimized as indicated earlier
532	in the document.
533	\item I believed \emph{Case II} could be optimized by applying Knuth's
534	algorithm.
535	\end{itemize}
536
537	The reply I received from the compiler manufacturer was that:
538
539	\begin{itemize}
540	\item There was agreeement about \emph{Case I}.
541	\item \emph{Case II} may be a little faster using Knuth's algorithm, but would
542	definitely be larger (code used to evaluate the application of Knuth's
543	algorithm was also provided).
544	\end{itemize}
545
546	In the test code provided by the compiler manufacturer, the approach used was
547	to obtain a trial quotient and then to subtract the divisor from the remainder
548	up to twice to adjust the quotient up to 2 counts downward (Knuth's algorithm).
549
550	I did try a different approach (to iterate on the quotient and to reconstruct
551	$q \times d$ with decreasing $q$). This approach promised to be slightly more
552	compact because $q \times d$ reconstruction was reutilized. However, it worked out
553	to occupy about 153 bytes rather than 103 for the shift-compare-subtract
554	algorithm (timing was not examined). (This test code is version-controlled
555	in the same directory as this \LaTeX{} document.)
556
557	At this point I agree with the
558	compiler manufacturer
559	that there is a tradeoff between size and speed (it seems
560	nearly impossible to get both).
561
562	In any reimplementation of this algorithm, will probably need to choose between
563	size and speed. I believe there is some possibility to reduce the reimplementation
564	from 153 bytes, but not down to 103 bytes.
565
566	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
567	\end{document}
568	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
569	% $Log: cpu08div16by16a.tex,v $
570	% Revision 1.12 2007/05/26 18:25:46 dashley
571	% Edits to incorporate final thoughts (for now) and test code results.
572	%
573	% Revision 1.11 2007/05/21 16:49:40 dashley
574	% \bdiv functionality as suggested by Robin Fairbairns added.
575	%
576	% Revision 1.10 2007/05/18 15:53:59 dashley
577	% Edits.
578	%
579	% Revision 1.9 2007/05/18 02:02:56 dashley
580	% Edits.
581	%
582	% Revision 1.8 2007/05/18 02:02:10 dashley
583	% Edits.
584	%
585	% Revision 1.7 2007/05/18 01:40:20 dashley
586	% Edits.
587	%
588	% Revision 1.6 2007/05/18 00:15:56 dashley
589	% Edits.
590	%
591	% Revision 1.5 2007/05/17 15:53:57 dashley
592	% Edits.
593	%
594	% Revision 1.4 2007/05/17 14:46:34 dashley
595	% Edits.
596	%
597	% Revision 1.3 2007/05/17 04:15:22 dashley
598	% Edits.
599	%
600	% Revision 1.2 2007/05/17 03:50:48 dashley
601	% Edits.
602	%
603	% Revision 1.1 2007/05/17 03:09:06 dashley
604	% Initial checkin.
605	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%