%$Header: /home/dashley/cvsrep/e3ft_gpl01/e3ft_gpl01/dtaipubs/cron/2007/cpu08div16by16a/cpu08div16by16a.tex,v 1.12 2007/05/26 18:25:46 dashley Exp $ % \documentclass[letterpaper,10pt,titlepage]{article} % %\pagestyle{headings} % \usepackage{amsmath} \usepackage{amsfonts} \usepackage{amssymb} \usepackage[ansinew]{inputenc} \usepackage[OT1]{fontenc} \usepackage{graphicx} %\usepackage{makeidx} % %Contributed by Robin Fairbairns of the comp.text.tex newsgroup. I have no %idea how it works ... but it gives a working \bdiv. % \makeatletter \def\bdiv{% \nonscript\mskip-\medmuskip\mkern5mu% \mathbin{\operator@font div}\penalty900\mkern5mu% \nonscript\mskip-\medmuskip} \makeatother % % \begin{document} \title{Notes on Economical 16/16 = 16 Unsigned Integer Division on the Freescale CPU08} \author{\vspace{1cm}\\David T. Ashley\\\texttt{dta@e3ft.com}\\\vspace{1cm}} \date{\vspace*{8mm}\small{Version Control $ $Revision: 1.12 $ $ \\ Version Control $ $Date: 2007/05/26 18:25:46 $ $ (UTC) \\ $ $RCSfile: cpu08div16by16a.tex,v $ $ \\ \LaTeX{} Compilation Date: \today{}}} \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \begin{abstract} This document contains my notes on attempting to optimize the 16 = 16/16 unsigned integer division subroutine distributed with a Freescale CPU08 compiler. I attempted to replace the standard shift-compare-subtract algorithm with Knuth's algorithm. In the end, the attempt was mostly a failure. There seemed to be no way to gain the execution time advantages of Knuth's algorithm without increasing FLASH size. (The goal would have been to obtain more favorable execution time \emph{and} FLASH size, rather than a tradeoff.) \end{abstract} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction and Overview} \label{siov0} The fundamental goal of practical computer integer arithmetic is to obtain an exact result using machine instructions that are as fast and compact as possible. For three of the four fundamental operations (addition, subtraction, and multiplication), it is intuitively obvious to most programmers how to use existing machine instructions to operate on operands that are larger than the machine instructions can accommodate. The fourth fundamental operation (division), however, is less well understood by a typical programmer than the other three. It is not obvious to most programmers how to use machine division instructions to divide integers larger than the native machine division instructions will accommodate. In 2007, I noticed that the integer division library functions associated with a particular \emph{C} compiler were of the shift-compare-subtract variety. As the CPU08 has a 16/16=8 native division instruction, I suspected but was not sure that the 16/16=16 division function could be improved. Note that for a library function used by a \emph{C} compiler, it may not be necessary (although it may be desirable) to calculate the quotient and the remainder at the same time. An algorithm that calculates the quotient but does not calculate the remainder, or vice-versa, may be acceptable for use by compiled code. Although I haven't examined the \emph{C} standards, I doubt that there is any special requirement for the quotient or remainder calculated when the denominator is zero. The only requirement is probably that the calculation not continue indefinitely (i.e. not an infinite loop). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Nomenclature and Notation} \label{snom0} I use the nomenclature ``16/16=16 division'' to denote a division function that accepts a 16-bit unsigned numerator and a 16-bit unsigned denominator; and produces either/both a 16-bit quotient and a 16-bit remainder. I use $n$ to denote the numerator, $d$ to denote the denominator, $q$ to denote the integer quotient, and $r$ to denote the remainder. Any of the 16-bit quantities can be subscripted with ``H'' or ``L'' to denote the most-significant or least-significant byte. For example, \begin{equation} \label{eq:snom0:01} n = 2^8 n_H + n_L . \end{equation} Within any of the 16-bit quantities, bits are subscripted in the traditional fashion. For example, \begin{equation} \label{eq:snom0:02} n = \sum_{i=0}^{15} 2^i n_i . \end{equation} Within any of the 8-bit quantities, bits are also subscripted in the traditional fashion. For example, \begin{equation} \label{eq:snom0:03} n = 2^8 \sum_{i=0}^{7} 2^i n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} = \sum_{i=0}^{7} 2^{i+8} n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} . \end{equation} Because only non-negative integers are involved, \emph{floor()} and \emph{div} are used interchangeably, i.e. \begin{equation} \label{eq:snom0:04} \left\lfloor \frac{a}{b} \right\rfloor = a \bdiv b, \;\; a,b \geq 0 . \end{equation} An identity that is frequently used in this document is \begin{eqnarray} \label{eq:snom0:05} \frac{a}{b} & = & a \bdiv b + \frac{a \bmod b}{b} \\ \label{eq:snom0:06} & = & \left\lfloor \frac{a}{b} \right\rfloor + \frac{a \bmod b}{b} . \end{eqnarray} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Analysis of Existing Code} \label{saec0} This section presents an analysis of the behavior and characteristics of the existing code. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Listing} \label{saec0:slec0} The existing code is reproduced below, with line numbers. \begin{verbatim} 001: c_umod: 002: bsr c_udiv ; divide 003: pshh ; transfer MSB 004: pula ; in place 005: sta c_reg 006: txa ; LSB in place 007: rts ; and return 008: ; 009: c_udiv: 010: psha ; save NL 011: pshh ; DH on the stack 012: tst 1,sp ; test zero 013: bne full ; full division 014: pulh ; clean stack 015: cpx c_reg ; compare DL with NH 016: bls half ; half division 017: lda c_reg ; NH 018: psha ; in 019: pulh ; H 020: pula ; NL in A 021: div ; DL in X, divide 022: clr c_reg ; QH is zero 023: fini: 024: pshh ; move RL 025: pulx ; in X 026: clrh ; RH is zero 027: rts ; and return 028: half: 029: lda c_reg ; NH in A 030: clrh ; 1st divide 8 X 8 031: div ; divide 032: sta c_reg ; QH in place 033: pula ; complete dividend with NL 034: div ; divide 035: bra fini ; complete remainder 036: full: 037: pshx ; save DL 038: ldx #8 ; counter 039: clra ; Extention 040: pshx ; save on 041: psha ; the stack 042: tsx ; stack addressed by H:X 043: bcl: 044: lsl 4,x ; shift E:NH:NL 045: rol c_reg ; left 046: rola 047: sta 0,x ; store E 048: cmp 3,x ; compare DH 049: blo next ; too low, continue 050: bne ok ; ok to subtract 051: lda c_reg ; compare NH 052: sub 2,x ; with DL 053: bhs ok2 ; ok, complete result 054: lda 0,x ; restore value 055: bra next ; and continue 056: ok: 057: lda c_reg ; substract D 058: sub 2,x ; from E:NH 059: ok2: 060: sta c_reg ; in place 061: lda 0,x 062: sbc 3,x 063: inc 4,x ; set result bit 064: next: 065: dbnz 1,x,bcl ; count down and loop 066: sta 0,x ; store E 067: pulh ; RH in place 068: ais #3 ; clean up stack 069: ldx c_reg ; RL in place 070: pula ; QL in place 071: clr c_reg ; QH is zero 072: rts ; and return \end{verbatim} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Algorithmic Analysis} \label{saec0:saan0} The algorithm is divided into two cases: \begin{itemize} \item Case I: $d_H = 0$ (lines 14-35). \item Case II: $d_H > 0$ (lines 36-72). \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Case I} \label{saec0:saan0:scas0} In the case of $d_H = 0$, the division is 16/8: \begin{eqnarray} \label{eq:saec0:saan0:scas0:01} \frac{2^8 n_H + n_L}{d_L} & = & \frac{2^8 n_H}{d_L} + \frac{n_L}{d_L} \\ \label{eq:saec0:saan0:scas0:02} & = & 2^8 \left( n_H \bdiv d_L + \frac{n_H \bmod d_L}{d_L} \right) + \frac{n_L}{d_L} \\ \label{eq:saec0:saan0:scas0:03} & = & 2^8 (n_H \bdiv d_L) + \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} \end{eqnarray} (\ref{eq:saec0:saan0:scas0:03}) is an exact expression involving rational numbers. However, we don't want to calculate the left side of (\ref{eq:saec0:saan0:scas0:03}); rather, we wish to calculate its \emph{floor()}\@. Applying the \emph{floor()} function to both sides of (\ref{eq:saec0:saan0:scas0:03}) yields: \begin{equation} \label{eq:saec0:saan0:scas0:04} \left\lfloor \frac{2^8 n_H + n_L}{d_L} \right\rfloor = 2^8 (n_H \bdiv d_L) + \left\lfloor \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} \right\rfloor . \end{equation} Note that (\ref{eq:saec0:saan0:scas0:04}) is in a form that can be readily evaluated using a processor with 16/8 division capability; so long as \begin{equation} \label{eq:saec0:saan0:scas0:05} \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} < 2^8 , \end{equation} \noindent{}a fact that can be easily verified by the reader. (\ref{eq:saec0:saan0:scas0:04}) can be readily evaluated by a processor with 16/8 division capability because such a division instruction always provides both quotient and remainder. It is easy to see that (\ref{eq:saec0:saan0:scas0:04}) can be evaluated with a division, a re-staging of bytes, and a second division. If (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested, it needs to be verified whether the remainder of the second division is the same as the remainder of the larger division, i.e. \begin{equation} \label{eq:saec0:saan0:scas0:06} (2^8 n_H + n_L) \bmod d_L =? ((2^8 \bmod d_L) + n_L) \bmod d_L. \end{equation} The question of whether (\ref{eq:saec0:saan0:scas0:06}) is an equality is the question of whether \begin{equation} \label{eq:saec0:saan0:scas0:07} ka \bmod b = (k (a \bmod b)) \bmod b . \end{equation} In order to prove or disprove (\ref{eq:saec0:saan0:scas0:07}), decompose $a$ into $ib + j$. Then, since $kib \bmod b = 0$, \begin{eqnarray} \label{eq:saec0:saan0:scas0:08} k (ib + j) \bmod b & = & k j \bmod b \\ \label{eq:saec0:saan0:scas0:09} k j \bmod b & = & k j \bmod b . \end{eqnarray} Thus, if (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested (with two divisions), the final remainder will be the same as the remainder for the original division. (\ref{eq:saec0:saan0:scas0:04}) will, in fact, deliver both the quotient and remainder economically. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Case II} \label{saec0:saan0:scas1} The case of $d_H > 0$ (\S{}\ref{saec0:slec0}, lines 36-72) is conventional shift-compare-subtract division. Only eight iterations of the loop are required because with $d_H > 0$, $d \geq 2^8$, and $n/d < 2^8$. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Timing Analysis} \label{saec0:stan0} The code of \S{}\ref{saec0:slec0} is reproduced below, with instruction timing (number of clocks) and FLASH requirements (number of bytes) added as (clocks/bytes). It was determined that \emph{c\_reg} resides in zero-page (i.e. \emph{direct}) memory. \begin{verbatim} 001: c_umod: 002: bsr c_udiv 4/2 ; divide 003: pshh ; transfer MSB 004: pula 2/1 ; in place 005: sta c_reg 3/2 006: txa 1/1 ; LSB in place 007: rts 4/1 ; and return 008: ; 009: c_udiv: 010: psha 2/1 ; save NL 011: pshh 2/1 ; DH on the stack 012: tst 1,sp 4/3 ; test zero 013: bne full 3/2 ; full division 014: pulh 2/1 ; clean stack 015: cpx c_reg 3/2 ; compare DL with NH 016: bls half 3/2 ; half division 017: lda c_reg 3/2 ; NH 018: psha 2/1 ; in 019: pulh 2/1 ; H 020: pula 2/1 ; NL in A 021: div 7/1 ; DL in X, divide 022: clr c_reg 3/2 ; QH is zero 023: fini: 024: pshh 2/1 ; move RL 025: pulx 2/1 ; in X 026: clrh 1/1 ; RH is zero 027: rts 4/1 ; and return 028: half: 029: lda c_reg 3/2 ; NH in A 030: clrh 1/1 ; 1st divide 8 X 8 031: div 7/1 ; divide 032: sta c_reg 3/2 ; QH in place 033: pula 2/1 ; complete dividend with NL 034: div 7/1 ; divide 035: bra fini 3/2 ; complete remainder 036: full: 037: pshx 2/1 ; save DL 038: ldx #8 2/2 ; counter 039: clra 1/1 ; Extention 040: pshx 2/1 ; save on 041: psha 2/1 ; the stack 042: tsx 2/1 ; stack addressed by H:X 043: bcl: 044: lsl 4,x 4/2 ; shift E:NH:NL 045: rol c_reg 4/2 ; left 046: rola 1/1 047: sta 0,x 2/1 ; store E 048: cmp 3,x 3/2 ; compare DH 049: blo next 3/2 ; too low, continue 050: bne ok 3/2 ; ok to subtract 051: lda c_reg 3/2 ; compare NH 052: sub 2,x 3/2 ; with DL 053: bhs ok2 3/2 ; ok, complete result 054: lda 0,x 2/1 ; restore value 055: bra next 3/2 ; and continue 056: ok: 057: lda c_reg 3/2 ; substract D 058: sub 2,x 3/2 ; from E:NH 059: ok2: 060: sta c_reg 3/2 ; in place 061: lda 0,x 2/1 062: sbc 3,x 3/2 063: inc 4,x 3/2 ; set result bit 064: next: 065: dbnz 1,x,bcl 5/3 ; count down and loop 066: sta 0,x 2/1 ; store E 067: pulh 2/1 ; RH in place 068: ais #3 2/2 ; clean up stack 069: ldx c_reg 3/2 ; RL in place 070: pula 2/1 ; QL in place 071: clr c_reg 3/2 ; QH is zero 072: rts 4/1 ; and return \end{verbatim} There are three distinct timing cases for the \emph{c\_udiv} function: \begin{enumerate} \item $d_H = 0$ and $n_H < d_L$: 47 clocks are required, representing straight flow of the instructions from line 10 through line 27. \item $d_H = 0$ and $n_H \geq d_L$: 54 clocks are required. \item $d_H > 0$ and every bit of the quotient is 1, in which case 400 clocks are required. This represents 22 clocks up through line 42, 45 clocks $\times$ 8 in the lines from 43 through 65, and 18 clocks in the lines from 66 through 72. \end{enumerate} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{FLASH/RAM Consumption Analysis} \label{saec0:sfra0} From \S{}\ref{saec0:stan0}, 93 bytes of FLASH are used. Only one byte of RAM is used (\emph{c\_reg}, probably shared with other functions as well). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Potential Optimizations} \label{spop0} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Potential Case I Optimizations} \label{spop0:scas0} This section corresponds to \emph{Case I} of \S{}\ref{saec0:saan0:scas0}. The most obvious observation about the code (\S{}\ref{saec0:slec0}) is that division instructions are very inexpensive on the CPU08---7 clock cycles, or about 2 typical instructions. Branching based on $n_H \geq d_L$ (\S{}\ref{saec0:slec0}, lines 15-16) may cost more in the test, the branch, and in other data transfer overhead than is saved. It may make sense to apply the full formula in (\ref{eq:saec0:saan0:scas0:04}) in all cases where $d_H = 0$. When $d_H = 0$, and if one assumes normal distribution of data, the expected value of execution time is about $(47+54)/2 = 50.5$ clocks.\footnote{If the data is assumed normally distributed, $n_H$ has about a 0.5 probability of being at least as large as $d_L$.} The code below combines two of the three timing cases into one by ignoring the relationship between $n_H$ and $d_L$. \begin{verbatim} ;Condition at function entry: ;N_H in 1,SP ;N_L in A ;D_H in H ;D_L in X ; ;Condition at function exit: ;Q_H in c_reg ;Q_L in A ;R_H in H ;R_L in X ; c_udiv: psha 2/1 ; save NL pshh 2/1 ; DH on the stack tst 1,sp 4/3 ; test zero bne full 3/2 ; full division ; ;From here on we're committed to the division with ;arbitary numerator, and denominator <= 255. ; N_H at 3,sp ; N_L at 2,sp ; D_H at 1,sp ; clrh 1/1 lda 3,sp 4/3 ; H:A now contains N_H div 7/1 ; divide sta c_reg 3/2 ; QH in place lda 2,sp 4/3 ; complete dividend with NL div 7/1 ; divide. Q_L in A, R_L in H pshh 2/1 ; move RL pulx 2/1 ; in X clrh 1/1 ; RH is zero ais #3 2/2 ; clean stack rts 4/1 ; and return \end{verbatim} Although the code does raise the minimum execution time from 47 to 48 clocks: \begin{itemize} \item It lowers the expected value of the $d_H=0$ execution time from 50.5 to 48 clocks. \item It saves approximately 15 bytes of FLASH. \end{itemize} This optimization is recommended. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Potential Case II Optimizations} \label{spop0:scas1} This section corresponds to \emph{Case II} of \S{}\ref{saec0:saan0:scas1}. I sent an e-mail to an engineer at the compiler manufacturer indicating that: \begin{itemize} \item I believed \emph{Case I} could be optimized as indicated earlier in the document. \item I believed \emph{Case II} could be optimized by applying Knuth's algorithm. \end{itemize} The reply I received from the compiler manufacturer was that: \begin{itemize} \item There was agreeement about \emph{Case I}. \item \emph{Case II} may be a little faster using Knuth's algorithm, but would definitely be larger (code used to evaluate the application of Knuth's algorithm was also provided). \end{itemize} In the test code provided by the compiler manufacturer, the approach used was to obtain a trial quotient and then to subtract the divisor from the remainder up to twice to adjust the quotient up to 2 counts downward (Knuth's algorithm). I did try a different approach (to iterate on the quotient and to reconstruct $q \times d$ with decreasing $q$). This approach promised to be slightly more compact because $q \times d$ reconstruction was reutilized. However, it worked out to occupy about 153 bytes rather than 103 for the shift-compare-subtract algorithm (timing was not examined). (This test code is version-controlled in the same directory as this \LaTeX{} document.) At this point I agree with the compiler manufacturer that there is a tradeoff between size and speed (it seems nearly impossible to get both). In any reimplementation of this algorithm, will probably need to choose between size and speed. I believe there is some possibility to reduce the reimplementation from 153 bytes, but not down to 103 bytes. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % $Log: cpu08div16by16a.tex,v $ % Revision 1.12 2007/05/26 18:25:46 dashley % Edits to incorporate final thoughts (for now) and test code results. % % Revision 1.11 2007/05/21 16:49:40 dashley % \bdiv functionality as suggested by Robin Fairbairns added. % % Revision 1.10 2007/05/18 15:53:59 dashley % Edits. % % Revision 1.9 2007/05/18 02:02:56 dashley % Edits. % % Revision 1.8 2007/05/18 02:02:10 dashley % Edits. % % Revision 1.7 2007/05/18 01:40:20 dashley % Edits. % % Revision 1.6 2007/05/18 00:15:56 dashley % Edits. % % Revision 1.5 2007/05/17 15:53:57 dashley % Edits. % % Revision 1.4 2007/05/17 14:46:34 dashley % Edits. % % Revision 1.3 2007/05/17 04:15:22 dashley % Edits. % % Revision 1.2 2007/05/17 03:50:48 dashley % Edits. % % Revision 1.1 2007/05/17 03:09:06 dashley % Initial checkin. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%