Parent Directory | Revision Log

Revision **10** -
(**show annotations**)
(**download**)
(**as text**)

*Fri Oct 7 03:23:35 2016 UTC*
(8 years ago)
by *dashley*

File MIME type: application/x-tex

File size: 22866 byte(s)

File MIME type: application/x-tex

File size: 22866 byte(s)

Commit of u16 = u16 / u16 analysis for the CPU08.

1 | %$Header: /home/dashley/cvsrep/e3ft_gpl01/e3ft_gpl01/dtaipubs/cron/2007/cpu08div16by16a/cpu08div16by16a.tex,v 1.12 2007/05/26 18:25:46 dashley Exp $ |

2 | % |

3 | \documentclass[letterpaper,10pt,titlepage]{article} |

4 | % |

5 | %\pagestyle{headings} |

6 | % |

7 | \usepackage{amsmath} |

8 | \usepackage{amsfonts} |

9 | \usepackage{amssymb} |

10 | \usepackage[ansinew]{inputenc} |

11 | \usepackage[OT1]{fontenc} |

12 | \usepackage{graphicx} |

13 | %\usepackage{makeidx} |

14 | % |

15 | %Contributed by Robin Fairbairns of the comp.text.tex newsgroup. I have no |

16 | %idea how it works ... but it gives a working \bdiv. |

17 | % |

18 | \makeatletter |

19 | \def\bdiv{% |

20 | \nonscript\mskip-\medmuskip\mkern5mu% |

21 | \mathbin{\operator@font div}\penalty900\mkern5mu% |

22 | \nonscript\mskip-\medmuskip} |

23 | \makeatother |

24 | % |

25 | % |

26 | \begin{document} |

27 | \title{Notes on Economical 16/16 = 16 Unsigned Integer Division on the Freescale CPU08} |

28 | \author{\vspace{1cm}\\David T. Ashley\\\texttt{dta@e3ft.com}\\\vspace{1cm}} |

29 | \date{\vspace*{8mm}\small{Version Control $ $Revision: 1.12 $ $ \\ |

30 | Version Control $ $Date: 2007/05/26 18:25:46 $ $ (UTC) \\ |

31 | $ $RCSfile: cpu08div16by16a.tex,v $ $ \\ |

32 | \LaTeX{} Compilation Date: \today{}}} |

33 | \maketitle |

34 | |

35 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

36 | % |

37 | \begin{abstract} |

38 | This document contains my notes on attempting to optimize |

39 | the 16 = 16/16 unsigned integer division subroutine distributed with |

40 | a Freescale CPU08 compiler. I attempted to replace the standard |

41 | shift-compare-subtract algorithm with Knuth's algorithm. |

42 | |

43 | In the end, the attempt was mostly a failure. There seemed to be no way |

44 | to gain the execution time advantages of Knuth's algorithm without increasing |

45 | FLASH size. (The goal would have been to obtain more favorable |

46 | execution time \emph{and} FLASH size, rather than a tradeoff.) |

47 | \end{abstract} |

48 | |

49 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

50 | |

51 | \section{Introduction and Overview} |

52 | \label{siov0} |

53 | |

54 | The fundamental goal of practical computer integer arithmetic is to obtain |

55 | an exact result using machine instructions that are as fast and compact as |

56 | possible. |

57 | |

58 | For three of the four fundamental operations (addition, subtraction, and |

59 | multiplication), it is intuitively obvious to most programmers how to |

60 | use existing machine instructions to operate on operands that are larger |

61 | than the machine instructions can accommodate. |

62 | |

63 | The fourth fundamental operation (division), however, is less well understood |

64 | by a typical programmer than the other three. It is not obvious to most |

65 | programmers how to use machine division instructions to divide integers larger |

66 | than the native machine division instructions will accommodate. |

67 | |

68 | In 2007, I noticed that the integer division library functions associated with |

69 | a particular \emph{C} compiler were of the shift-compare-subtract variety. |

70 | As the CPU08 has a 16/16=8 native division instruction, I suspected but was not |

71 | sure that the 16/16=16 division function could be improved. |

72 | |

73 | Note that for a library function used by a \emph{C} compiler, it may not |

74 | be necessary (although it may be desirable) to calculate the quotient and the |

75 | remainder at the same time. An algorithm that calculates the quotient but |

76 | does not calculate the remainder, or vice-versa, may be acceptable for |

77 | use by compiled code. |

78 | |

79 | Although I haven't examined the \emph{C} standards, I doubt that there is |

80 | any special requirement for the quotient or remainder calculated when |

81 | the denominator is zero. The only requirement is probably that the calculation |

82 | not continue indefinitely (i.e. not an infinite loop). |

83 | |

84 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

85 | |

86 | \section{Nomenclature and Notation} |

87 | \label{snom0} |

88 | |

89 | I use the nomenclature ``16/16=16 division'' to denote a division function that |

90 | accepts a 16-bit unsigned numerator and a 16-bit unsigned denominator; and |

91 | produces either/both a 16-bit quotient and a 16-bit remainder. |

92 | |

93 | I use $n$ to denote the numerator, $d$ to denote the denominator, $q$ to denote |

94 | the integer quotient, and $r$ to denote the remainder. |

95 | |

96 | Any of the 16-bit quantities can be subscripted with ``H'' or ``L'' to denote |

97 | the most-significant or least-significant byte. For example, |

98 | |

99 | \begin{equation} |

100 | \label{eq:snom0:01} |

101 | n = 2^8 n_H + n_L . |

102 | \end{equation} |

103 | |

104 | Within any of the 16-bit quantities, bits are subscripted in the traditional fashion. |

105 | For example, |

106 | |

107 | \begin{equation} |

108 | \label{eq:snom0:02} |

109 | n = \sum_{i=0}^{15} 2^i n_i . |

110 | \end{equation} |

111 | |

112 | Within any of the 8-bit quantities, bits are also subscripted in the traditional fashion. |

113 | For example, |

114 | |

115 | \begin{equation} |

116 | \label{eq:snom0:03} |

117 | n = 2^8 \sum_{i=0}^{7} 2^i n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} |

118 | = \sum_{i=0}^{7} 2^{i+8} n_{H_i} + \sum_{i=0}^{7} 2^i n_{L_i} . |

119 | \end{equation} |

120 | |

121 | Because only non-negative integers are involved, \emph{floor()} and |

122 | \emph{div} are used interchangeably, i.e. |

123 | |

124 | \begin{equation} |

125 | \label{eq:snom0:04} |

126 | \left\lfloor \frac{a}{b} \right\rfloor = a \bdiv b, \;\; a,b \geq 0 . |

127 | \end{equation} |

128 | |

129 | An identity that is frequently used in this document is |

130 | |

131 | \begin{eqnarray} |

132 | \label{eq:snom0:05} |

133 | \frac{a}{b} & = & a \bdiv b + \frac{a \bmod b}{b} \\ |

134 | \label{eq:snom0:06} |

135 | & = & \left\lfloor \frac{a}{b} \right\rfloor + \frac{a \bmod b}{b} . |

136 | \end{eqnarray} |

137 | |

138 | |

139 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

140 | |

141 | \section{Analysis of Existing Code} |

142 | \label{saec0} |

143 | |

144 | This section presents an analysis of the behavior and characteristics of |

145 | the existing code. |

146 | |

147 | |

148 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

149 | |

150 | \subsection{Listing} |

151 | \label{saec0:slec0} |

152 | |

153 | The existing code is reproduced below, with line numbers. |

154 | |

155 | \begin{verbatim} |

156 | 001: c_umod: |

157 | 002: bsr c_udiv ; divide |

158 | 003: pshh ; transfer MSB |

159 | 004: pula ; in place |

160 | 005: sta c_reg |

161 | 006: txa ; LSB in place |

162 | 007: rts ; and return |

163 | 008: ; |

164 | 009: c_udiv: |

165 | 010: psha ; save NL |

166 | 011: pshh ; DH on the stack |

167 | 012: tst 1,sp ; test zero |

168 | 013: bne full ; full division |

169 | 014: pulh ; clean stack |

170 | 015: cpx c_reg ; compare DL with NH |

171 | 016: bls half ; half division |

172 | 017: lda c_reg ; NH |

173 | 018: psha ; in |

174 | 019: pulh ; H |

175 | 020: pula ; NL in A |

176 | 021: div ; DL in X, divide |

177 | 022: clr c_reg ; QH is zero |

178 | 023: fini: |

179 | 024: pshh ; move RL |

180 | 025: pulx ; in X |

181 | 026: clrh ; RH is zero |

182 | 027: rts ; and return |

183 | 028: half: |

184 | 029: lda c_reg ; NH in A |

185 | 030: clrh ; 1st divide 8 X 8 |

186 | 031: div ; divide |

187 | 032: sta c_reg ; QH in place |

188 | 033: pula ; complete dividend with NL |

189 | 034: div ; divide |

190 | 035: bra fini ; complete remainder |

191 | 036: full: |

192 | 037: pshx ; save DL |

193 | 038: ldx #8 ; counter |

194 | 039: clra ; Extention |

195 | 040: pshx ; save on |

196 | 041: psha ; the stack |

197 | 042: tsx ; stack addressed by H:X |

198 | 043: bcl: |

199 | 044: lsl 4,x ; shift E:NH:NL |

200 | 045: rol c_reg ; left |

201 | 046: rola |

202 | 047: sta 0,x ; store E |

203 | 048: cmp 3,x ; compare DH |

204 | 049: blo next ; too low, continue |

205 | 050: bne ok ; ok to subtract |

206 | 051: lda c_reg ; compare NH |

207 | 052: sub 2,x ; with DL |

208 | 053: bhs ok2 ; ok, complete result |

209 | 054: lda 0,x ; restore value |

210 | 055: bra next ; and continue |

211 | 056: ok: |

212 | 057: lda c_reg ; substract D |

213 | 058: sub 2,x ; from E:NH |

214 | 059: ok2: |

215 | 060: sta c_reg ; in place |

216 | 061: lda 0,x |

217 | 062: sbc 3,x |

218 | 063: inc 4,x ; set result bit |

219 | 064: next: |

220 | 065: dbnz 1,x,bcl ; count down and loop |

221 | 066: sta 0,x ; store E |

222 | 067: pulh ; RH in place |

223 | 068: ais #3 ; clean up stack |

224 | 069: ldx c_reg ; RL in place |

225 | 070: pula ; QL in place |

226 | 071: clr c_reg ; QH is zero |

227 | 072: rts ; and return |

228 | \end{verbatim} |

229 | |

230 | |

231 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

232 | |

233 | \subsection{Algorithmic Analysis} |

234 | \label{saec0:saan0} |

235 | |

236 | The algorithm is divided into two cases: |

237 | |

238 | \begin{itemize} |

239 | \item Case I: $d_H = 0$ (lines 14-35). |

240 | \item Case II: $d_H > 0$ (lines 36-72). |

241 | \end{itemize} |

242 | |

243 | |

244 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

245 | |

246 | \subsubsection{Case I} |

247 | \label{saec0:saan0:scas0} |

248 | |

249 | In the case of $d_H = 0$, the division is 16/8: |

250 | |

251 | \begin{eqnarray} |

252 | \label{eq:saec0:saan0:scas0:01} |

253 | \frac{2^8 n_H + n_L}{d_L} & = & \frac{2^8 n_H}{d_L} + \frac{n_L}{d_L} \\ |

254 | \label{eq:saec0:saan0:scas0:02} |

255 | & = & 2^8 \left( n_H \bdiv d_L + \frac{n_H \bmod d_L}{d_L} \right) + \frac{n_L}{d_L} \\ |

256 | \label{eq:saec0:saan0:scas0:03} |

257 | & = & 2^8 (n_H \bdiv d_L) + \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} |

258 | \end{eqnarray} |

259 | |

260 | (\ref{eq:saec0:saan0:scas0:03}) is an exact expression involving rational numbers. |

261 | However, we don't want to calculate the left side of (\ref{eq:saec0:saan0:scas0:03}); |

262 | rather, we wish to calculate its \emph{floor()}\@. Applying the |

263 | \emph{floor()} function to both sides of (\ref{eq:saec0:saan0:scas0:03}) yields: |

264 | |

265 | \begin{equation} |

266 | \label{eq:saec0:saan0:scas0:04} |

267 | \left\lfloor \frac{2^8 n_H + n_L}{d_L} \right\rfloor |

268 | = |

269 | 2^8 (n_H \bdiv d_L) + \left\lfloor \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} \right\rfloor . |

270 | \end{equation} |

271 | |

272 | Note that (\ref{eq:saec0:saan0:scas0:04}) is in a form |

273 | that can be readily evaluated using a processor with 16/8 division |

274 | capability; so long as |

275 | |

276 | \begin{equation} |

277 | \label{eq:saec0:saan0:scas0:05} |

278 | \frac{2^8 (n_H \bmod d_L) + n_L}{d_L} < 2^8 , |

279 | \end{equation} |

280 | |

281 | \noindent{}a fact that can be easily verified by the reader. |

282 | |

283 | (\ref{eq:saec0:saan0:scas0:04}) can be readily evaluated by a processor with |

284 | 16/8 division capability because such a division instruction always provides |

285 | both quotient and remainder. It is easy to see that (\ref{eq:saec0:saan0:scas0:04}) |

286 | can be evaluated with a division, a re-staging of bytes, and a second division. |

287 | |

288 | If (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested, it needs to be verified |

289 | whether the remainder of the second division is the same as the remainder |

290 | of the larger division, i.e. |

291 | |

292 | \begin{equation} |

293 | \label{eq:saec0:saan0:scas0:06} |

294 | (2^8 n_H + n_L) \bmod d_L =? ((2^8 \bmod d_L) + n_L) \bmod d_L. |

295 | \end{equation} |

296 | |

297 | The question of whether (\ref{eq:saec0:saan0:scas0:06}) is an |

298 | equality is the question of whether |

299 | |

300 | \begin{equation} |

301 | \label{eq:saec0:saan0:scas0:07} |

302 | ka \bmod b = (k (a \bmod b)) \bmod b . |

303 | \end{equation} |

304 | |

305 | In order to prove or disprove (\ref{eq:saec0:saan0:scas0:07}), |

306 | decompose $a$ into $ib + j$. Then, since $kib \bmod b = 0$, |

307 | |

308 | \begin{eqnarray} |

309 | \label{eq:saec0:saan0:scas0:08} |

310 | k (ib + j) \bmod b & = & k j \bmod b \\ |

311 | \label{eq:saec0:saan0:scas0:09} |

312 | k j \bmod b & = & k j \bmod b . |

313 | \end{eqnarray} |

314 | |

315 | Thus, if (\ref{eq:saec0:saan0:scas0:04}) is evaluated as suggested (with |

316 | two divisions), the final remainder will be the same as the remainder for the |

317 | original division. (\ref{eq:saec0:saan0:scas0:04}) will, in fact, deliver both |

318 | the quotient and remainder economically. |

319 | |

320 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

321 | |

322 | \subsubsection{Case II} |

323 | \label{saec0:saan0:scas1} |

324 | |

325 | The case of $d_H > 0$ (\S{}\ref{saec0:slec0}, lines 36-72) is conventional |

326 | shift-compare-subtract division. Only eight iterations of the loop are required |

327 | because with $d_H > 0$, $d \geq 2^8$, and $n/d < 2^8$. |

328 | |

329 | |

330 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

331 | |

332 | \subsection{Timing Analysis} |

333 | \label{saec0:stan0} |

334 | |

335 | The code of \S{}\ref{saec0:slec0} is reproduced below, with instruction |

336 | timing (number of clocks) and FLASH requirements (number of bytes) added |

337 | as (clocks/bytes). It was determined that \emph{c\_reg} resides |

338 | in zero-page (i.e. \emph{direct}) memory. |

339 | |

340 | \begin{verbatim} |

341 | 001: c_umod: |

342 | 002: bsr c_udiv 4/2 ; divide |

343 | 003: pshh ; transfer MSB |

344 | 004: pula 2/1 ; in place |

345 | 005: sta c_reg 3/2 |

346 | 006: txa 1/1 ; LSB in place |

347 | 007: rts 4/1 ; and return |

348 | 008: ; |

349 | 009: c_udiv: |

350 | 010: psha 2/1 ; save NL |

351 | 011: pshh 2/1 ; DH on the stack |

352 | 012: tst 1,sp 4/3 ; test zero |

353 | 013: bne full 3/2 ; full division |

354 | 014: pulh 2/1 ; clean stack |

355 | 015: cpx c_reg 3/2 ; compare DL with NH |

356 | 016: bls half 3/2 ; half division |

357 | 017: lda c_reg 3/2 ; NH |

358 | 018: psha 2/1 ; in |

359 | 019: pulh 2/1 ; H |

360 | 020: pula 2/1 ; NL in A |

361 | 021: div 7/1 ; DL in X, divide |

362 | 022: clr c_reg 3/2 ; QH is zero |

363 | 023: fini: |

364 | 024: pshh 2/1 ; move RL |

365 | 025: pulx 2/1 ; in X |

366 | 026: clrh 1/1 ; RH is zero |

367 | 027: rts 4/1 ; and return |

368 | 028: half: |

369 | 029: lda c_reg 3/2 ; NH in A |

370 | 030: clrh 1/1 ; 1st divide 8 X 8 |

371 | 031: div 7/1 ; divide |

372 | 032: sta c_reg 3/2 ; QH in place |

373 | 033: pula 2/1 ; complete dividend with NL |

374 | 034: div 7/1 ; divide |

375 | 035: bra fini 3/2 ; complete remainder |

376 | 036: full: |

377 | 037: pshx 2/1 ; save DL |

378 | 038: ldx #8 2/2 ; counter |

379 | 039: clra 1/1 ; Extention |

380 | 040: pshx 2/1 ; save on |

381 | 041: psha 2/1 ; the stack |

382 | 042: tsx 2/1 ; stack addressed by H:X |

383 | 043: bcl: |

384 | 044: lsl 4,x 4/2 ; shift E:NH:NL |

385 | 045: rol c_reg 4/2 ; left |

386 | 046: rola 1/1 |

387 | 047: sta 0,x 2/1 ; store E |

388 | 048: cmp 3,x 3/2 ; compare DH |

389 | 049: blo next 3/2 ; too low, continue |

390 | 050: bne ok 3/2 ; ok to subtract |

391 | 051: lda c_reg 3/2 ; compare NH |

392 | 052: sub 2,x 3/2 ; with DL |

393 | 053: bhs ok2 3/2 ; ok, complete result |

394 | 054: lda 0,x 2/1 ; restore value |

395 | 055: bra next 3/2 ; and continue |

396 | 056: ok: |

397 | 057: lda c_reg 3/2 ; substract D |

398 | 058: sub 2,x 3/2 ; from E:NH |

399 | 059: ok2: |

400 | 060: sta c_reg 3/2 ; in place |

401 | 061: lda 0,x 2/1 |

402 | 062: sbc 3,x 3/2 |

403 | 063: inc 4,x 3/2 ; set result bit |

404 | 064: next: |

405 | 065: dbnz 1,x,bcl 5/3 ; count down and loop |

406 | 066: sta 0,x 2/1 ; store E |

407 | 067: pulh 2/1 ; RH in place |

408 | 068: ais #3 2/2 ; clean up stack |

409 | 069: ldx c_reg 3/2 ; RL in place |

410 | 070: pula 2/1 ; QL in place |

411 | 071: clr c_reg 3/2 ; QH is zero |

412 | 072: rts 4/1 ; and return |

413 | \end{verbatim} |

414 | |

415 | There are three distinct timing cases for the \emph{c\_udiv} function: |

416 | |

417 | \begin{enumerate} |

418 | \item $d_H = 0$ and $n_H < d_L$: |

419 | 47 clocks are required, representing straight flow of the |

420 | instructions from line 10 through line 27. |

421 | \item $d_H = 0$ and $n_H \geq d_L$: 54 clocks are required. |

422 | \item $d_H > 0$ and every bit of the quotient is 1, in which case |

423 | 400 clocks are required. This represents 22 clocks up through |

424 | line 42, 45 clocks $\times$ 8 in the lines from 43 through 65, and |

425 | 18 clocks in the lines from 66 through 72. |

426 | \end{enumerate} |

427 | |

428 | |

429 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

430 | |

431 | \subsection{FLASH/RAM Consumption Analysis} |

432 | \label{saec0:sfra0} |

433 | |

434 | From \S{}\ref{saec0:stan0}, 93 bytes of FLASH are used. Only one byte |

435 | of RAM is used (\emph{c\_reg}, probably shared with other functions |

436 | as well). |

437 | |

438 | |

439 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

440 | |

441 | \section{Potential Optimizations} |

442 | \label{spop0} |

443 | |

444 | |

445 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

446 | |

447 | \subsection{Potential Case I Optimizations} |

448 | \label{spop0:scas0} |

449 | |

450 | This section corresponds to \emph{Case I} of \S{}\ref{saec0:saan0:scas0}. |

451 | |

452 | The most obvious observation about the code (\S{}\ref{saec0:slec0}) is that |

453 | division instructions are very inexpensive on the CPU08---7 clock cycles, |

454 | or about 2 typical instructions. Branching based on |

455 | $n_H \geq d_L$ |

456 | (\S{}\ref{saec0:slec0}, lines 15-16) |

457 | may cost more in the test, the branch, and in other |

458 | data transfer overhead than is saved. It may make sense to apply the |

459 | full formula in (\ref{eq:saec0:saan0:scas0:04}) in all cases where |

460 | $d_H = 0$. |

461 | |

462 | When $d_H = 0$, and if one assumes normal distribution of data, the expected |

463 | value of execution time is about $(47+54)/2 = 50.5$ clocks.\footnote{If the |

464 | data is assumed normally distributed, $n_H$ has about a 0.5 probability of being at least as large |

465 | as $d_L$.} |

466 | |

467 | The code below combines two of the three timing cases into one by ignoring the |

468 | relationship between $n_H$ and $d_L$. |

469 | |

470 | \begin{verbatim} |

471 | ;Condition at function entry: |

472 | ;N_H in 1,SP |

473 | ;N_L in A |

474 | ;D_H in H |

475 | ;D_L in X |

476 | ; |

477 | ;Condition at function exit: |

478 | ;Q_H in c_reg |

479 | ;Q_L in A |

480 | ;R_H in H |

481 | ;R_L in X |

482 | ; |

483 | c_udiv: |

484 | psha 2/1 ; save NL |

485 | pshh 2/1 ; DH on the stack |

486 | tst 1,sp 4/3 ; test zero |

487 | bne full 3/2 ; full division |

488 | ; |

489 | ;From here on we're committed to the division with |

490 | ;arbitary numerator, and denominator <= 255. |

491 | ; N_H at 3,sp |

492 | ; N_L at 2,sp |

493 | ; D_H at 1,sp |

494 | ; |

495 | clrh 1/1 |

496 | lda 3,sp 4/3 ; H:A now contains N_H |

497 | div 7/1 ; divide |

498 | sta c_reg 3/2 ; QH in place |

499 | lda 2,sp 4/3 ; complete dividend with NL |

500 | div 7/1 ; divide. Q_L in A, R_L in H |

501 | pshh 2/1 ; move RL |

502 | pulx 2/1 ; in X |

503 | clrh 1/1 ; RH is zero |

504 | ais #3 2/2 ; clean stack |

505 | rts 4/1 ; and return |

506 | \end{verbatim} |

507 | |

508 | Although the code does raise the minimum execution time from 47 to |

509 | 48 clocks: |

510 | |

511 | \begin{itemize} |

512 | \item It lowers the expected value of the $d_H=0$ execution time from |

513 | 50.5 to 48 clocks. |

514 | \item It saves approximately 15 bytes of FLASH. |

515 | \end{itemize} |

516 | |

517 | This optimization is recommended. |

518 | |

519 | |

520 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

521 | |

522 | \subsection{Potential Case II Optimizations} |

523 | \label{spop0:scas1} |

524 | |

525 | This section corresponds to \emph{Case II} of \S{}\ref{saec0:saan0:scas1}. |

526 | |

527 | I sent an e-mail to an engineer at the compiler manufacturer indicating |

528 | that: |

529 | |

530 | \begin{itemize} |

531 | \item I believed \emph{Case I} could be optimized as indicated earlier |

532 | in the document. |

533 | \item I believed \emph{Case II} could be optimized by applying Knuth's |

534 | algorithm. |

535 | \end{itemize} |

536 | |

537 | The reply I received from the compiler manufacturer was that: |

538 | |

539 | \begin{itemize} |

540 | \item There was agreeement about \emph{Case I}. |

541 | \item \emph{Case II} may be a little faster using Knuth's algorithm, but would |

542 | definitely be larger (code used to evaluate the application of Knuth's |

543 | algorithm was also provided). |

544 | \end{itemize} |

545 | |

546 | In the test code provided by the compiler manufacturer, the approach used was |

547 | to obtain a trial quotient and then to subtract the divisor from the remainder |

548 | up to twice to adjust the quotient up to 2 counts downward (Knuth's algorithm). |

549 | |

550 | I did try a different approach (to iterate on the quotient and to reconstruct |

551 | $q \times d$ with decreasing $q$). This approach promised to be slightly more |

552 | compact because $q \times d$ reconstruction was reutilized. However, it worked out |

553 | to occupy about 153 bytes rather than 103 for the shift-compare-subtract |

554 | algorithm (timing was not examined). (This test code is version-controlled |

555 | in the same directory as this \LaTeX{} document.) |

556 | |

557 | At this point I agree with the |

558 | compiler manufacturer |

559 | that there is a tradeoff between size and speed (it seems |

560 | nearly impossible to get both). |

561 | |

562 | In any reimplementation of this algorithm, will probably need to choose between |

563 | size and speed. I believe there is some possibility to reduce the reimplementation |

564 | from 153 bytes, but not down to 103 bytes. |

565 | |

566 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

567 | \end{document} |

568 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

569 | % $Log: cpu08div16by16a.tex,v $ |

570 | % Revision 1.12 2007/05/26 18:25:46 dashley |

571 | % Edits to incorporate final thoughts (for now) and test code results. |

572 | % |

573 | % Revision 1.11 2007/05/21 16:49:40 dashley |

574 | % \bdiv functionality as suggested by Robin Fairbairns added. |

575 | % |

576 | % Revision 1.10 2007/05/18 15:53:59 dashley |

577 | % Edits. |

578 | % |

579 | % Revision 1.9 2007/05/18 02:02:56 dashley |

580 | % Edits. |

581 | % |

582 | % Revision 1.8 2007/05/18 02:02:10 dashley |

583 | % Edits. |

584 | % |

585 | % Revision 1.7 2007/05/18 01:40:20 dashley |

586 | % Edits. |

587 | % |

588 | % Revision 1.6 2007/05/18 00:15:56 dashley |

589 | % Edits. |

590 | % |

591 | % Revision 1.5 2007/05/17 15:53:57 dashley |

592 | % Edits. |

593 | % |

594 | % Revision 1.4 2007/05/17 14:46:34 dashley |

595 | % Edits. |

596 | % |

597 | % Revision 1.3 2007/05/17 04:15:22 dashley |

598 | % Edits. |

599 | % |

600 | % Revision 1.2 2007/05/17 03:50:48 dashley |

601 | % Edits. |

602 | % |

603 | % Revision 1.1 2007/05/17 03:09:06 dashley |

604 | % Initial checkin. |

605 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

dashley@gmail.com | ViewVC Help |

Powered by ViewVC 1.1.25 |