1 |
%$Header: /home/dashley/cvsrep/e3ft_gpl01/e3ft_gpl01/dtaipubs/cron/2007/cpu08div16by16a/cpu08div16by16a.tex,v 1.12 2007/05/26 18:25:46 dashley Exp $
|
2 |
%
|
3 |
\documentclass[letterpaper,10pt,titlepage]{article}
|
4 |
%
|
5 |
%\pagestyle{headings}
|
6 |
%
|
7 |
\usepackage{amsmath}
|
8 |
\usepackage{amsfonts}
|
9 |
\usepackage{amssymb}
|
10 |
\usepackage[ansinew]{inputenc}
|
11 |
\usepackage[OT1]{fontenc}
|
12 |
\usepackage{graphicx}
|
13 |
%\usepackage{makeidx}
|
14 |
%
|
15 |
%Contributed by Robin Fairbairns of the comp.text.tex newsgroup. I have no
|
16 |
%idea how it works ... but it gives a working \bdiv.
|
17 |
%
|
18 |
\makeatletter
|
19 |
\def\bdiv{%
|
20 |
\nonscript\mskip-\medmuskip\mkern5mu%
|
21 |
\mathbin{\operator@font div}\penalty900\mkern5mu%
|
22 |
\nonscript\mskip-\medmuskip}
|
23 |
\makeatother
|
24 |
%
|
25 |
%
|
26 |
\begin{document}
|
27 |
\title{Notes on Economical 16/16 = 16 Unsigned Integer Division on the Freescale CPU08}
|
28 |
\author{\vspace{1cm}\\David T. Ashley\\\texttt{dta@e3ft.com}\\\vspace{1cm}}
|
29 |
\date{\vspace*{8mm}\small{Version Control $ $Revision: 1.12 $ $ \\
|
30 |
Version Control $ $Date: 2007/05/26 18:25:46 $ $ (UTC) \\
|
31 |
$ $RCSfile: cpu08div16by16a.tex,v $ $ \\
|
32 |
\LaTeX{} Compilation Date: \today{}}}
|
33 |
\maketitle
|
34 |
|
35 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
36 |
%
|
37 |
\begin{abstract}
|
38 |
This document contains my notes on attempting to optimize
|
39 |
the 16 = 16/16 unsigned integer division subroutine distributed with
|
40 |
a Freescale CPU08 compiler. I attempted to replace the standard
|
41 |
shift-compare-subtract algorithm with Knuth's algorithm.
|
42 |
|
43 |
In the end, the attempt was mostly a failure. There seemed to be no way
|
44 |
to gain the execution time advantages of Knuth's algorithm without increasing
|
45 |
FLASH size. (The goal would have been to obtain more favorable
|
46 |
execution time \emph{and} FLASH size, rather than a tradeoff.)
|
47 |
\end{abstract}
|
48 |
|
49 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
50 |
|
51 |
\section{Introduction and Overview}
|
52 |
\label{siov0}
|
53 |
|
54 |
The fundamental goal of practical computer integer arithmetic is to obtain
|
55 |
an exact result using machine instructions that are as fast and compact as
|
56 |
possible.
|
57 |
|
58 |
For three of the four fundamental operations (addition, subtraction, and
|
59 |
multiplication), it is intuitively obvious to most programmers how to
|
60 |
use existing machine instructions to operate on operands that are larger
|
61 |
than the machine instructions can accommodate.
|
62 |
|
63 |
The fourth fundamental operation (division), however, is less well understood
|
64 |
by a typical programmer than the other three. It is not obvious to most
|
65 |
programmers how to use machine division instructions to divide integers larger
|
66 |
than the native machine division instructions will accommodate.
|
67 |
|
68 |
In 2007, I noticed that the integer division library functions associated with
|
69 |
a particular \emph{C} compiler were of the shift-compare-subtract variety.
|
70 |
As the CPU08 has a 16/16=8 native division instruction, I suspected but was not
|
71 |
sure that the 16/16=16 division function could be improved.
|
72 |
|
73 |
Note that for a library function used by a \emph{C} compiler, it may not
|
74 |
be necessary (although it may be desirable) to calculate the quotient and the
|
75 |
remainder at the same time. An algorithm that calculates the quotient but
|
76 |
does not calculate the remainder, or vice-versa, may be acceptable for
|
77 |
use by compiled code.
|
78 |
|
79 |
Although I haven't examined the \emph{C} standards, I doubt that there is
|
80 |
any special requirement for the quotient or remainder calculated when
|
81 |
the denominator is zero. The only requirement is probably that the calculation
|
82 |
not continue indefinitely (i.e. not an infinite loop).
|
83 |
|
84 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
85 |
|
86 |
\section{Nomenclature and Notation}
|
87 |
\label{snom0}
|
88 |
|
89 |
I use the nomenclature ``16/16=16 division'' to denote a division function that
|
90 |
accepts a 16-bit unsigned numerator and a 16-bit unsigned denominator; and
|
91 |
produces either/both a 16-bit quotient and a 16-bit remainder.
|
92 |
|
93 |
I use $n$ to denote the numerator, $d$ to denote the denominator, $q$ to denote
|
94 |
the integer quotient, and $r$ to denote the remainder.
|
95 |
|
96 |
Any of the 16-bit quantities can be subscripted with ``H'' or ``L'' to denote
|
97 |
the most-significant or least-significant byte. For example,
|
98 |
|
99 |
\begin{equation}
|
100 |
\label{eq:snom0:01}
|
101 |
n = 2^8 n_H + n_L .
|
102 |
\end{equation}
|
103 |
|
104 |
Within any of the 16-bit quantities, bits are subscripted in the traditional fashion.
|
105 |
For example,
|
106 |
|
107 |
\begin{equation}
|
108 |
\label{eq:snom0:02}
|
109 |
n = \sum_{i=0}^{15} 2^i n_i .
|
110 |
\end{equation}
|
111 |
|
112 |
Within any of the 8-bit quantities, bits are also subscripted in the traditional fashion.
|
113 |
For example,
|
114 |
|
115 |
\begin{equation}
|
116 |
\label{eq:snom0:03}
|
117 |
n = 2^8 \sum_{i=0}^{7} 2^i n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i}
|
118 |
= \sum_{i=0}^{7} 2^{i+8} n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} .
|
119 |
\end{equation}
|
120 |
|
121 |
Because only non-negative integers are involved, \emph{floor()} and
|
122 |
\emph{div} are used interchangeably, i.e.
|
123 |
|
124 |
\begin{equation}
|
125 |
\label{eq:snom0:04}
|
126 |
\left\lfloor \frac{a}{b} \right\rfloor = a \bdiv b, \;\; a,b \geq 0 .
|
127 |
\end{equation}
|
128 |
|
129 |
An identity that is frequently used in this document is
|
130 |
|
131 |
\begin{eqnarray}
|
132 |
\label{eq:snom0:05}
|
133 |
\frac{a}{b} & = & a \bdiv b + \frac{a \bmod b}{b} \\
|
134 |
\label{eq:snom0:06}
|
135 |
& = & \left\lfloor \frac{a}{b} \right\rfloor + \frac{a \bmod b}{b} .
|
136 |
\end{eqnarray}
|
137 |
|
138 |
|
139 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
140 |
|
141 |
\section{Analysis of Existing Code}
|
142 |
\label{saec0}
|
143 |
|
144 |
This section presents an analysis of the behavior and characteristics of
|
145 |
the existing code.
|
146 |
|
147 |
|
148 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
149 |
|
150 |
\subsection{Listing}
|
151 |
\label{saec0:slec0}
|
152 |
|
153 |
The existing code is reproduced below, with line numbers.
|
154 |
|
155 |
\begin{verbatim}
|
156 |
001: c_umod:
|
157 |
002: bsr c_udiv ; divide
|
158 |
003: pshh ; transfer MSB
|
159 |
004: pula ; in place
|
160 |
005: sta c_reg
|
161 |
006: txa ; LSB in place
|
162 |
007: rts ; and return
|
163 |
008: ;
|
164 |
009: c_udiv:
|
165 |
010: psha ; save NL
|
166 |
011: pshh ; DH on the stack
|
167 |
012: tst 1,sp ; test zero
|
168 |
013: bne full ; full division
|
169 |
014: pulh ; clean stack
|
170 |
015: cpx c_reg ; compare DL with NH
|
171 |
016: bls half ; half division
|
172 |
017: lda c_reg ; NH
|
173 |
018: psha ; in
|
174 |
019: pulh ; H
|
175 |
020: pula ; NL in A
|
176 |
021: div ; DL in X, divide
|
177 |
022: clr c_reg ; QH is zero
|
178 |
023: fini:
|
179 |
024: pshh ; move RL
|
180 |
025: pulx ; in X
|
181 |
026: clrh ; RH is zero
|
182 |
027: rts ; and return
|
183 |
028: half:
|
184 |
029: lda c_reg ; NH in A
|
185 |
030: clrh ; 1st divide 8 X 8
|
186 |
031: div ; divide
|
187 |
032: sta c_reg ; QH in place
|
188 |
033: pula ; complete dividend with NL
|
189 |
034: div ; divide
|
190 |
035: bra fini ; complete remainder
|
191 |
036: full:
|
192 |
037: pshx ; save DL
|
193 |
038: ldx #8 ; counter
|
194 |
039: clra ; Extention
|
195 |
040: pshx ; save on
|
196 |
041: psha ; the stack
|
197 |
042: tsx ; stack addressed by H:X
|
198 |
043: bcl:
|
199 |
044: lsl 4,x ; shift E:NH:NL
|
200 |
045: rol c_reg ; left
|
201 |
046: rola
|
202 |
047: sta 0,x ; store E
|
203 |
048: cmp 3,x ; compare DH
|
204 |
049: blo next ; too low, continue
|
205 |
050: bne ok ; ok to subtract
|
206 |
051: lda c_reg ; compare NH
|
207 |
052: sub 2,x ; with DL
|
208 |
053: bhs ok2 ; ok, complete result
|
209 |
054: lda 0,x ; restore value
|
210 |
055: bra next ; and continue
|
211 |
056: ok:
|
212 |
057: lda c_reg ; substract D
|
213 |
058: sub 2,x ; from E:NH
|
214 |
059: ok2:
|
215 |
060: sta c_reg ; in place
|
216 |
061: lda 0,x
|
217 |
062: sbc 3,x
|
218 |
063: inc 4,x ; set result bit
|
219 |
064: next:
|
220 |
065: dbnz 1,x,bcl ; count down and loop
|
221 |
066: sta 0,x ; store E
|
222 |
067: pulh ; RH in place
|
223 |
068: ais #3 ; clean up stack
|
224 |
069: ldx c_reg ; RL in place
|
225 |
070: pula ; QL in place
|
226 |
071: clr c_reg ; QH is zero
|
227 |
072: rts ; and return
|
228 |
\end{verbatim}
|
229 |
|
230 |
|
231 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
232 |
|
233 |
\subsection{Algorithmic Analysis}
|
234 |
\label{saec0:saan0}
|
235 |
|
236 |
The algorithm is divided into two cases:
|
237 |
|
238 |
\begin{itemize}
|
239 |
\item Case I: $d_H = 0$ (lines 14-35).
|
240 |
\item Case II: $d_H > 0$ (lines 36-72).
|
241 |
\end{itemize}
|
242 |
|
243 |
|
244 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
245 |
|
246 |
\subsubsection{Case I}
|
247 |
\label{saec0:saan0:scas0}
|
248 |
|
249 |
In the case of $d_H = 0$, the division is 16/8:
|
250 |
|
251 |
\begin{eqnarray}
|
252 |
\label{eq:saec0:saan0:scas0:01}
|
253 |
\frac{2^8 n_H + n_L}{d_L} & = & \frac{2^8 n_H}{d_L} + \frac{n_L}{d_L} \\
|
254 |
\label{eq:saec0:saan0:scas0:02}
|
255 |
& = & 2^8 \left( n_H \bdiv d_L + \frac{n_H \bmod d_L}{d_L} \right) + \frac{n_L}{d_L} \\
|
256 |
\label{eq:saec0:saan0:scas0:03}
|
257 |
& = & 2^8 (n_H \bdiv d_L) + \frac{2^8 (n_H \bmod d_L) + n_L}{d_L}
|
258 |
\end{eqnarray}
|
259 |
|
260 |
(\ref{eq:saec0:saan0:scas0:03}) is an exact expression involving rational numbers.
|
261 |
However, we don't want to calculate the left side of (\ref{eq:saec0:saan0:scas0:03});
|
262 |
rather, we wish to calculate its \emph{floor()}\@. Applying the
|
263 |
\emph{floor()} function to both sides of (\ref{eq:saec0:saan0:scas0:03}) yields:
|
264 |
|
265 |
\begin{equation}
|
266 |
\label{eq:saec0:saan0:scas0:04}
|
267 |
\left\lfloor \frac{2^8 n_H + n_L}{d_L} \right\rfloor
|
268 |
=
|
269 |
2^8 (n_H \bdiv d_L) + \left\lfloor \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} \right\rfloor .
|
270 |
\end{equation}
|
271 |
|
272 |
Note that (\ref{eq:saec0:saan0:scas0:04}) is in a form
|
273 |
that can be readily evaluated using a processor with 16/8 division
|
274 |
capability; so long as
|
275 |
|
276 |
\begin{equation}
|
277 |
\label{eq:saec0:saan0:scas0:05}
|
278 |
\frac{2^8 (n_H \bmod d_L) + n_L}{d_L} < 2^8 ,
|
279 |
\end{equation}
|
280 |
|
281 |
\noindent{}a fact that can be easily verified by the reader.
|
282 |
|
283 |
(\ref{eq:saec0:saan0:scas0:04}) can be readily evaluated by a processor with
|
284 |
16/8 division capability because such a division instruction always provides
|
285 |
both quotient and remainder. It is easy to see that (\ref{eq:saec0:saan0:scas0:04})
|
286 |
can be evaluated with a division, a re-staging of bytes, and a second division.
|
287 |
|
288 |
If (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested, it needs to be verified
|
289 |
whether the remainder of the second division is the same as the remainder
|
290 |
of the larger division, i.e.
|
291 |
|
292 |
\begin{equation}
|
293 |
\label{eq:saec0:saan0:scas0:06}
|
294 |
(2^8 n_H + n_L) \bmod d_L =? ((2^8 \bmod d_L) + n_L) \bmod d_L.
|
295 |
\end{equation}
|
296 |
|
297 |
The question of whether (\ref{eq:saec0:saan0:scas0:06}) is an
|
298 |
equality is the question of whether
|
299 |
|
300 |
\begin{equation}
|
301 |
\label{eq:saec0:saan0:scas0:07}
|
302 |
ka \bmod b = (k (a \bmod b)) \bmod b .
|
303 |
\end{equation}
|
304 |
|
305 |
In order to prove or disprove (\ref{eq:saec0:saan0:scas0:07}),
|
306 |
decompose $a$ into $ib + j$. Then, since $kib \bmod b = 0$,
|
307 |
|
308 |
\begin{eqnarray}
|
309 |
\label{eq:saec0:saan0:scas0:08}
|
310 |
k (ib + j) \bmod b & = & k j \bmod b \\
|
311 |
\label{eq:saec0:saan0:scas0:09}
|
312 |
k j \bmod b & = & k j \bmod b .
|
313 |
\end{eqnarray}
|
314 |
|
315 |
Thus, if (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested (with
|
316 |
two divisions), the final remainder will be the same as the remainder for the
|
317 |
original division. (\ref{eq:saec0:saan0:scas0:04}) will, in fact, deliver both
|
318 |
the quotient and remainder economically.
|
319 |
|
320 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
321 |
|
322 |
\subsubsection{Case II}
|
323 |
\label{saec0:saan0:scas1}
|
324 |
|
325 |
The case of $d_H > 0$ (\S{}\ref{saec0:slec0}, lines 36-72) is conventional
|
326 |
shift-compare-subtract division. Only eight iterations of the loop are required
|
327 |
because with $d_H > 0$, $d \geq 2^8$, and $n/d < 2^8$.
|
328 |
|
329 |
|
330 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
331 |
|
332 |
\subsection{Timing Analysis}
|
333 |
\label{saec0:stan0}
|
334 |
|
335 |
The code of \S{}\ref{saec0:slec0} is reproduced below, with instruction
|
336 |
timing (number of clocks) and FLASH requirements (number of bytes) added
|
337 |
as (clocks/bytes). It was determined that \emph{c\_reg} resides
|
338 |
in zero-page (i.e. \emph{direct}) memory.
|
339 |
|
340 |
\begin{verbatim}
|
341 |
001: c_umod:
|
342 |
002: bsr c_udiv 4/2 ; divide
|
343 |
003: pshh ; transfer MSB
|
344 |
004: pula 2/1 ; in place
|
345 |
005: sta c_reg 3/2
|
346 |
006: txa 1/1 ; LSB in place
|
347 |
007: rts 4/1 ; and return
|
348 |
008: ;
|
349 |
009: c_udiv:
|
350 |
010: psha 2/1 ; save NL
|
351 |
011: pshh 2/1 ; DH on the stack
|
352 |
012: tst 1,sp 4/3 ; test zero
|
353 |
013: bne full 3/2 ; full division
|
354 |
014: pulh 2/1 ; clean stack
|
355 |
015: cpx c_reg 3/2 ; compare DL with NH
|
356 |
016: bls half 3/2 ; half division
|
357 |
017: lda c_reg 3/2 ; NH
|
358 |
018: psha 2/1 ; in
|
359 |
019: pulh 2/1 ; H
|
360 |
020: pula 2/1 ; NL in A
|
361 |
021: div 7/1 ; DL in X, divide
|
362 |
022: clr c_reg 3/2 ; QH is zero
|
363 |
023: fini:
|
364 |
024: pshh 2/1 ; move RL
|
365 |
025: pulx 2/1 ; in X
|
366 |
026: clrh 1/1 ; RH is zero
|
367 |
027: rts 4/1 ; and return
|
368 |
028: half:
|
369 |
029: lda c_reg 3/2 ; NH in A
|
370 |
030: clrh 1/1 ; 1st divide 8 X 8
|
371 |
031: div 7/1 ; divide
|
372 |
032: sta c_reg 3/2 ; QH in place
|
373 |
033: pula 2/1 ; complete dividend with NL
|
374 |
034: div 7/1 ; divide
|
375 |
035: bra fini 3/2 ; complete remainder
|
376 |
036: full:
|
377 |
037: pshx 2/1 ; save DL
|
378 |
038: ldx #8 2/2 ; counter
|
379 |
039: clra 1/1 ; Extention
|
380 |
040: pshx 2/1 ; save on
|
381 |
041: psha 2/1 ; the stack
|
382 |
042: tsx 2/1 ; stack addressed by H:X
|
383 |
043: bcl:
|
384 |
044: lsl 4,x 4/2 ; shift E:NH:NL
|
385 |
045: rol c_reg 4/2 ; left
|
386 |
046: rola 1/1
|
387 |
047: sta 0,x 2/1 ; store E
|
388 |
048: cmp 3,x 3/2 ; compare DH
|
389 |
049: blo next 3/2 ; too low, continue
|
390 |
050: bne ok 3/2 ; ok to subtract
|
391 |
051: lda c_reg 3/2 ; compare NH
|
392 |
052: sub 2,x 3/2 ; with DL
|
393 |
053: bhs ok2 3/2 ; ok, complete result
|
394 |
054: lda 0,x 2/1 ; restore value
|
395 |
055: bra next 3/2 ; and continue
|
396 |
056: ok:
|
397 |
057: lda c_reg 3/2 ; substract D
|
398 |
058: sub 2,x 3/2 ; from E:NH
|
399 |
059: ok2:
|
400 |
060: sta c_reg 3/2 ; in place
|
401 |
061: lda 0,x 2/1
|
402 |
062: sbc 3,x 3/2
|
403 |
063: inc 4,x 3/2 ; set result bit
|
404 |
064: next:
|
405 |
065: dbnz 1,x,bcl 5/3 ; count down and loop
|
406 |
066: sta 0,x 2/1 ; store E
|
407 |
067: pulh 2/1 ; RH in place
|
408 |
068: ais #3 2/2 ; clean up stack
|
409 |
069: ldx c_reg 3/2 ; RL in place
|
410 |
070: pula 2/1 ; QL in place
|
411 |
071: clr c_reg 3/2 ; QH is zero
|
412 |
072: rts 4/1 ; and return
|
413 |
\end{verbatim}
|
414 |
|
415 |
There are three distinct timing cases for the \emph{c\_udiv} function:
|
416 |
|
417 |
\begin{enumerate}
|
418 |
\item $d_H = 0$ and $n_H < d_L$:
|
419 |
47 clocks are required, representing straight flow of the
|
420 |
instructions from line 10 through line 27.
|
421 |
\item $d_H = 0$ and $n_H \geq d_L$: 54 clocks are required.
|
422 |
\item $d_H > 0$ and every bit of the quotient is 1, in which case
|
423 |
400 clocks are required. This represents 22 clocks up through
|
424 |
line 42, 45 clocks $\times$ 8 in the lines from 43 through 65, and
|
425 |
18 clocks in the lines from 66 through 72.
|
426 |
\end{enumerate}
|
427 |
|
428 |
|
429 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
430 |
|
431 |
\subsection{FLASH/RAM Consumption Analysis}
|
432 |
\label{saec0:sfra0}
|
433 |
|
434 |
From \S{}\ref{saec0:stan0}, 93 bytes of FLASH are used. Only one byte
|
435 |
of RAM is used (\emph{c\_reg}, probably shared with other functions
|
436 |
as well).
|
437 |
|
438 |
|
439 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
440 |
|
441 |
\section{Potential Optimizations}
|
442 |
\label{spop0}
|
443 |
|
444 |
|
445 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
446 |
|
447 |
\subsection{Potential Case I Optimizations}
|
448 |
\label{spop0:scas0}
|
449 |
|
450 |
This section corresponds to \emph{Case I} of \S{}\ref{saec0:saan0:scas0}.
|
451 |
|
452 |
The most obvious observation about the code (\S{}\ref{saec0:slec0}) is that
|
453 |
division instructions are very inexpensive on the CPU08---7 clock cycles,
|
454 |
or about 2 typical instructions. Branching based on
|
455 |
$n_H \geq d_L$
|
456 |
(\S{}\ref{saec0:slec0}, lines 15-16)
|
457 |
may cost more in the test, the branch, and in other
|
458 |
data transfer overhead than is saved. It may make sense to apply the
|
459 |
full formula in (\ref{eq:saec0:saan0:scas0:04}) in all cases where
|
460 |
$d_H = 0$.
|
461 |
|
462 |
When $d_H = 0$, and if one assumes normal distribution of data, the expected
|
463 |
value of execution time is about $(47+54)/2 = 50.5$ clocks.\footnote{If the
|
464 |
data is assumed normally distributed, $n_H$ has about a 0.5 probability of being at least as large
|
465 |
as $d_L$.}
|
466 |
|
467 |
The code below combines two of the three timing cases into one by ignoring the
|
468 |
relationship between $n_H$ and $d_L$.
|
469 |
|
470 |
\begin{verbatim}
|
471 |
;Condition at function entry:
|
472 |
;N_H in 1,SP
|
473 |
;N_L in A
|
474 |
;D_H in H
|
475 |
;D_L in X
|
476 |
;
|
477 |
;Condition at function exit:
|
478 |
;Q_H in c_reg
|
479 |
;Q_L in A
|
480 |
;R_H in H
|
481 |
;R_L in X
|
482 |
;
|
483 |
c_udiv:
|
484 |
psha 2/1 ; save NL
|
485 |
pshh 2/1 ; DH on the stack
|
486 |
tst 1,sp 4/3 ; test zero
|
487 |
bne full 3/2 ; full division
|
488 |
;
|
489 |
;From here on we're committed to the division with
|
490 |
;arbitary numerator, and denominator <= 255.
|
491 |
; N_H at 3,sp
|
492 |
; N_L at 2,sp
|
493 |
; D_H at 1,sp
|
494 |
;
|
495 |
clrh 1/1
|
496 |
lda 3,sp 4/3 ; H:A now contains N_H
|
497 |
div 7/1 ; divide
|
498 |
sta c_reg 3/2 ; QH in place
|
499 |
lda 2,sp 4/3 ; complete dividend with NL
|
500 |
div 7/1 ; divide. Q_L in A, R_L in H
|
501 |
pshh 2/1 ; move RL
|
502 |
pulx 2/1 ; in X
|
503 |
clrh 1/1 ; RH is zero
|
504 |
ais #3 2/2 ; clean stack
|
505 |
rts 4/1 ; and return
|
506 |
\end{verbatim}
|
507 |
|
508 |
Although the code does raise the minimum execution time from 47 to
|
509 |
48 clocks:
|
510 |
|
511 |
\begin{itemize}
|
512 |
\item It lowers the expected value of the $d_H=0$ execution time from
|
513 |
50.5 to 48 clocks.
|
514 |
\item It saves approximately 15 bytes of FLASH.
|
515 |
\end{itemize}
|
516 |
|
517 |
This optimization is recommended.
|
518 |
|
519 |
|
520 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
521 |
|
522 |
\subsection{Potential Case II Optimizations}
|
523 |
\label{spop0:scas1}
|
524 |
|
525 |
This section corresponds to \emph{Case II} of \S{}\ref{saec0:saan0:scas1}.
|
526 |
|
527 |
I sent an e-mail to an engineer at the compiler manufacturer indicating
|
528 |
that:
|
529 |
|
530 |
\begin{itemize}
|
531 |
\item I believed \emph{Case I} could be optimized as indicated earlier
|
532 |
in the document.
|
533 |
\item I believed \emph{Case II} could be optimized by applying Knuth's
|
534 |
algorithm.
|
535 |
\end{itemize}
|
536 |
|
537 |
The reply I received from the compiler manufacturer was that:
|
538 |
|
539 |
\begin{itemize}
|
540 |
\item There was agreeement about \emph{Case I}.
|
541 |
\item \emph{Case II} may be a little faster using Knuth's algorithm, but would
|
542 |
definitely be larger (code used to evaluate the application of Knuth's
|
543 |
algorithm was also provided).
|
544 |
\end{itemize}
|
545 |
|
546 |
In the test code provided by the compiler manufacturer, the approach used was
|
547 |
to obtain a trial quotient and then to subtract the divisor from the remainder
|
548 |
up to twice to adjust the quotient up to 2 counts downward (Knuth's algorithm).
|
549 |
|
550 |
I did try a different approach (to iterate on the quotient and to reconstruct
|
551 |
$q \times d$ with decreasing $q$). This approach promised to be slightly more
|
552 |
compact because $q \times d$ reconstruction was reutilized. However, it worked out
|
553 |
to occupy about 153 bytes rather than 103 for the shift-compare-subtract
|
554 |
algorithm (timing was not examined). (This test code is version-controlled
|
555 |
in the same directory as this \LaTeX{} document.)
|
556 |
|
557 |
At this point I agree with the
|
558 |
compiler manufacturer
|
559 |
that there is a tradeoff between size and speed (it seems
|
560 |
nearly impossible to get both).
|
561 |
|
562 |
In any reimplementation of this algorithm, will probably need to choose between
|
563 |
size and speed. I believe there is some possibility to reduce the reimplementation
|
564 |
from 153 bytes, but not down to 103 bytes.
|
565 |
|
566 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
567 |
\end{document}
|
568 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
569 |
% $Log: cpu08div16by16a.tex,v $
|
570 |
% Revision 1.12 2007/05/26 18:25:46 dashley
|
571 |
% Edits to incorporate final thoughts (for now) and test code results.
|
572 |
%
|
573 |
% Revision 1.11 2007/05/21 16:49:40 dashley
|
574 |
% \bdiv functionality as suggested by Robin Fairbairns added.
|
575 |
%
|
576 |
% Revision 1.10 2007/05/18 15:53:59 dashley
|
577 |
% Edits.
|
578 |
%
|
579 |
% Revision 1.9 2007/05/18 02:02:56 dashley
|
580 |
% Edits.
|
581 |
%
|
582 |
% Revision 1.8 2007/05/18 02:02:10 dashley
|
583 |
% Edits.
|
584 |
%
|
585 |
% Revision 1.7 2007/05/18 01:40:20 dashley
|
586 |
% Edits.
|
587 |
%
|
588 |
% Revision 1.6 2007/05/18 00:15:56 dashley
|
589 |
% Edits.
|
590 |
%
|
591 |
% Revision 1.5 2007/05/17 15:53:57 dashley
|
592 |
% Edits.
|
593 |
%
|
594 |
% Revision 1.4 2007/05/17 14:46:34 dashley
|
595 |
% Edits.
|
596 |
%
|
597 |
% Revision 1.3 2007/05/17 04:15:22 dashley
|
598 |
% Edits.
|
599 |
%
|
600 |
% Revision 1.2 2007/05/17 03:50:48 dashley
|
601 |
% Edits.
|
602 |
%
|
603 |
% Revision 1.1 2007/05/17 03:09:06 dashley
|
604 |
% Initial checkin.
|
605 |
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|