Parent Directory | Revision Log

Revision **30** -
(**show annotations**)
(**download**)
(**as text**)

*Sat Oct 8 07:22:17 2016 UTC*
(7 years, 7 months ago)
by *dashley*

File MIME type: application/x-tex

File size: 32767 byte(s)

File MIME type: application/x-tex

File size: 32767 byte(s)

Initial commit.

1 | %$Header: /home/dashley/cvsrep/uculib01/uculib01/doc/manual/c_tbg0/c_tbg0.tex,v 1.12 2010/04/05 15:13:50 dashley Exp $ |

2 | |

3 | \chapter{Technical Background} |

4 | \label{ctbg0} |

5 | |

6 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

7 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

8 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

9 | \section{Introduction} |

10 | %Section tag: int0 |

11 | \label{ctbg0:sint0} |

12 | |

13 | |

14 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

15 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

16 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

17 | \section{Interrupt Compatibility Issues} |

18 | %Section tag: ici0 |

19 | \label{ctbg0:sici0} |

20 | |

21 | |

22 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

23 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

24 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

25 | \section{Linear Software Timers} |

26 | %Section tag: lst0 |

27 | \label{ctbg0:slst0} |

28 | |

29 | Several of the block memory functions described in |

30 | Chapter \ref{cbmf0} are designed primarily to |

31 | implement what we call here a |

32 | \index{software timer}\index{linear software timer}\emph{software timer} (although these functions |

33 | may have other uses). |

34 | |

35 | We define a \index{software timer}\index{linear software timer}software timer as an unsigned integer |

36 | of any length convenient to the machine that: |

37 | |

38 | \begin{itemize} |

39 | \item Is decremented or incremented at a constant rate,\footnote{It is also |

40 | possible to decrement (or increment) software timers at a non-constant rate to |

41 | approximate equations of the form $x(k+1) = \overline{\alpha} x(k)$, |

42 | $\overline{\alpha} < 1$ (or related |

43 | equations), but we don't discuss schemes of this type here. The advantage |

44 | of such a scheme would be to extend the maximum time period of the timer |

45 | while still keeping good precision at low values of time. We would call |

46 | such software timers \index{non-linear software timer}\index{software timer}% |

47 | \emph{non-linear} software timers to differentiate them |

48 | from the timers described here.} |

49 | but not |

50 | below or above the minimum or maximum value of an unsigned |

51 | integer. |

52 | \item Is set and tested by one process (the application) but |

53 | is decremented or incremented by another process (typically |

54 | system software). |

55 | \end{itemize} |

56 | |

57 | Typically, linear software timers are grouped together as arrays of |

58 | unsigned integers that are decremented or incremented at the same rate (hence the |

59 | efficiency of the block manipulation described in Chapter \ref{cbmf0}). |

60 | Most commonly, the rates at which groups of timers may be decremented or |

61 | incremented are provided in binary decades (but 1-2-5 decades are also |

62 | not uncommon). |

63 | |

64 | Linear software timers are the most efficient known way to manipulate time |

65 | in stateful logic where the logic is implemented as code (rather than data). |

66 | The efficiency comes about because: |

67 | |

68 | \begin{itemize} |

69 | \item The timers are decremented and incremented centrally using |

70 | block memory operations (saves FLASH bulk |

71 | and CPU cycles). |

72 | \item The arrangement of timers in binary decades or 1-2-5 decades |

73 | allows the usage of smaller data types (at the expense of precision) |

74 | than would be possible if time were manipulated only in terms of one |

75 | base quantum. |

76 | \item The test for an expired timer ($t=0$) often compiles well because |

77 | tests against zero are less expensive than tests against other constants |

78 | in CPU instruction sets. |

79 | \end{itemize} |

80 | |

81 | In most of the discussion which follows, we confine the discussion to |

82 | the case of software timers that are decremented, but not below zero. |

83 | (Software timers that are incremented are less common and have different |

84 | applications.) |

85 | |

86 | The arrangement of software timers in most applications is in terms of |

87 | a base quantum, which we denote $\sigma$\@. $\sigma$=1 millisecond is |

88 | a common choice. |

89 | |

90 | In a binary decade arrangement, the quantum used to decrement the |

91 | different arrays is $\sigma 2^q$, where $q$ is the ``tier'' of the array of |

92 | software timers. If $\sigma$=1 millisecond, ``Tier 0'' software timers |

93 | are decremented every millisecond, ``Tier 1'' software timers are decremented |

94 | every two milliseconds, ``Tier 2'' software timers are decremented |

95 | every four milliseconds, and so on. |

96 | |

97 | Note that if a software timer (at tier $q$) is set to a value $n$ |

98 | and then tested periodically to determine if it has reached zero, the |

99 | actual amount of time $t$ that may have elapsed is given by\footnote{In |

100 | order to obtain (\ref{eq:ctbg0:slst0:01}), we make the assumption that |

101 | the process that tests the software timer runs at least as often as the |

102 | software timer is decremented. There are scheduling scenarios that add |

103 | yet more to the error, and these are not developed here.} |

104 | |

105 | \begin{equation} |

106 | \label{eq:ctbg0:slst0:01} |

107 | (n - 1) \sigma 2^q |

108 | < |

109 | t |

110 | \leq |

111 | n \sigma 2^q . |

112 | \end{equation} |

113 | |

114 | \noindent{}(\ref{eq:ctbg0:slst0:01}) comes about because it is possible |

115 | that the timer was set to $n$ just before it was decremented. |

116 | |

117 | We propose to choose |

118 | |

119 | \begin{equation} |

120 | \label{eq:ctbg0:slst0:02} |

121 | n = \left\lfloor {\frac{\alpha}{\sigma 2^q}} \right\rfloor . |

122 | \end{equation} |

123 | |

124 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

125 | \noindent\rule{\textwidth}{2pt} |

126 | |

127 | TODO: |

128 | |

129 | \begin{enumerate} |

130 | \item Break into subsections. |

131 | \item Develop scheduling assumptions that lead to (\ref{eq:ctbg0:slst0:01}) and friends. |

132 | \item Show relationship with timed automata. |

133 | \item Show state-space and complexity reduction advantages. |

134 | \end{enumerate} |

135 | |

136 | |

137 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

138 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

139 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

140 | \section{Square Root Extraction (Integer)} |

141 | %Section tag: sre0 |

142 | \label{ctbg0:ssre0} |

143 | |

144 | This section presents information about the extraction of |

145 | square roots over an integer domain and integer range. Usually, the function |

146 | of interest is |

147 | |

148 | \begin{equation} |

149 | \label{eq:ctbg0:ssre0:01} |

150 | f(x) = \left\lfloor \sqrt{x} \right\rfloor . |

151 | \end{equation} |

152 | |

153 | Other functions of interest can generally be implemented in terms |

154 | of (\ref{eq:ctbg0:ssre0:01}). |

155 | |

156 | If a square root with information after the radix point is desired, the argument $x$ can be premultiplied |

157 | by the square of the reciprocal of the precision after the radix point. For example, |

158 | a good approximation to $10 \sqrt{x}$ is |

159 | |

160 | \begin{equation} |

161 | \label{eq:ctbg0:ssre0:02} |

162 | 10 \sqrt{x} \approx g(x) = \left\lfloor 10 \sqrt{x} \right\rfloor |

163 | = \left\lfloor \sqrt{100 x} \right\rfloor , |

164 | \end{equation} |

165 | |

166 | \noindent{}which can be implemented in terms of (\ref{eq:ctbg0:ssre0:01})\@. |

167 | (\ref{eq:ctbg0:ssre0:02}) provides a result that is human-friendly (i.e. 453 is |

168 | interpreted as 45.3), but the radix point of the result is not positioned between |

169 | any two bits. A more typical approach might be to premultiply by an even |

170 | power of two so as to place the radix point of the result between two bits. |

171 | For example, |

172 | |

173 | \begin{equation} |

174 | \label{eq:ctbg0:ssre0:02b} |

175 | 2^8 \sqrt{x} \approx \left\lfloor 2^8 \sqrt{x} \right\rfloor |

176 | = \left\lfloor \sqrt{2^{16} x} \right\rfloor |

177 | \end{equation} |

178 | |

179 | \noindent{}will calculate the square root of an integer |

180 | with eight bits after the radix point. |

181 | |

182 | Similarly, if rounding to the nearest integer is desired, note that\footnote{To |

183 | prove (\ref{eq:ctbg0:ssre0:03}), consider four cases. \emph{Case 1:} If the fractional part of |

184 | $\sqrt{x}$ is 0, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$ |

185 | will be an integer, the result will be exact, and (\ref{eq:ctbg0:ssre0:03}) holds. |

186 | \emph{Case 2:} If the fractional part of |

187 | $\sqrt{x}$ is less than 0.5, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$ |

188 | will be an even integer, the square root will be rounded down, and (\ref{eq:ctbg0:ssre0:03}) holds. |

189 | \emph{Case 3:} If the fractional part of |

190 | $\sqrt{x}$ is equal to 0.5, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$ |

191 | will be an odd integer, the square root will be rounded up, and (\ref{eq:ctbg0:ssre0:03}) holds. |

192 | \emph{Case 4:} If the fractional part of |

193 | $\sqrt{x}$ is greater than 0.5, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$ |

194 | will be an odd integer, the square root will be rounded up, and (\ref{eq:ctbg0:ssre0:03}) holds.} |

195 | |

196 | \begin{equation} |

197 | \label{eq:ctbg0:ssre0:03} |

198 | h(x) = \left\lfloor \sqrt{x} + 0.5 \right\rfloor |

199 | = \left\lfloor \frac{\left\lfloor \sqrt{4 x} \right\rfloor + 1}{2} \right\rfloor , |

200 | \end{equation} |

201 | |

202 | \noindent{}which can also be easily implemented in terms of (\ref{eq:ctbg0:ssre0:01}). |

203 | |

204 | |

205 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

206 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

207 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

208 | \subsection{Summation of Odd Integers} |

209 | \label{ctbg0:ssre0:ssoi0} |

210 | |

211 | Consider the square of a non-negative integer $n$, $n^2$. The distance from |

212 | $n^2$ to the square of the next larger integer is |

213 | |

214 | \begin{equation} |

215 | \label{eq:ctbg0:ssre0:ssoi0:01} |

216 | (n+1)^2 - n^2 = 2n + 1 . |

217 | \end{equation} |

218 | |

219 | $2n+1, n \in \mathbb{Z}^+$ is the set of odd non-negative integers. |

220 | It follows directly that the set of squared integers can be formed by |

221 | successively adding odd integers, i.e. |

222 | |

223 | \begin{eqnarray} |

224 | \nonumber 1^2 = 1 & = & 1 \\ |

225 | 2^2 = 4 & = & 1 + 3 \\ |

226 | \nonumber 3^2 = 9 & = & 1 + 3 + 5 \\ |

227 | \nonumber 4^2 = 16 & = & 1 + 3 + 5 + 7 \\ |

228 | \nonumber & \ldots{} & |

229 | \end{eqnarray} |

230 | |

231 | One possible integer square root algorithm would involve summing consecutive |

232 | odd integers until the argument is exceeded. |

233 | This algorithm is not developed further because it would be impractical to |

234 | calculate the square root of any integers except small integers using this |

235 | method. |

236 | |

237 | |

238 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

239 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

240 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

241 | \subsection{Bit-By-Bit Bisection} |

242 | \label{ctbg0:ssre0:sbis0} |

243 | |

244 | Square roots of an integer can be found using bit-by-bit bisection. |

245 | In this method, starting with the most significant bit of the result, each bit |

246 | is assumed `1' and a trial square is calculated. If the trial square is no |

247 | larger than the argument, that bit of the result is `1'. However, if the trial |

248 | square is larger than the argument, that bit of the result is `0'. |

249 | |

250 | \begin{tiny} |

251 | \begin{verbatim} |

252 | Essentially, it is bit-by-bit trial squaring. "Bisection" because |

253 | each bit of the square root cuts the interval (in which the square root |

254 | might lie) in half. |

255 | |

256 | Two of the posters in comp.arch.embedded indicated that "bisection" can be |

257 | economized further by using the identity: |

258 | |

259 | (n + 2^k)^2 = n^2 + n*2^(k+1) + 2^(2*k) |

260 | |

261 | The "n^2" term is significant because it means that if you have the |

262 | previous trial square and you are considering setting bit "k", you can |

263 | compute the next trial square from the previous square using only shifting |

264 | and adding. |

265 | |

266 | If you carry this to the extreme, because you initially assume no bits |

267 | are set, you start with n = n**2 = 0, and the first trial square does |

268 | not involve multiplication. So you can implement the entire square |

269 | algorithm without multiplying, even once. |

270 | |

271 | Anyway, applying all possible optimizations, one gets this: |

272 | |

273 | http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/ |

274 | cosmic/modx0/atu24sqrtfrxn/devel/alg_eval01.c?revision=1.7&view=markup |

275 | |

276 | or this in the 32-bit case: |

277 | |

278 | unsigned int sqrt32(unsigned long arg) |

279 | { |

280 | unsigned long two_2k_mask; |

281 | unsigned long trial_square; |

282 | unsigned long trial_square_prev; |

283 | unsigned long square_root_ls_ip1; |

284 | |

285 | two_2k_mask = 0x40000000UL; |

286 | trial_square = 0; |

287 | trial_square_prev = 0; |

288 | square_root_ls_ip1 = 0; |

289 | |

290 | while(1) |

291 | { |

292 | trial_square = (trial_square_prev) |

293 | + (square_root_ls_ip1) |

294 | + (two_2k_mask); |

295 | |

296 | square_root_ls_ip1 >>= 1; |

297 | |

298 | if (arg >= trial_square) |

299 | { |

300 | square_root_ls_ip1 |= two_2k_mask; |

301 | trial_square_prev = trial_square; |

302 | } |

303 | |

304 | if (two_2k_mask == 1) |

305 | break; |

306 | |

307 | two_2k_mask >>= 2; |

308 | } |

309 | |

310 | return(square_root_ls_ip1); |

311 | } |

312 | |

313 | If you look at the loop, there really isn't much there. There will be |

314 | 16 iterations. The test against "1" to exit the loop can be implemented |

315 | as a single byte test (because a single "1" is being right-shifted, "1" |

316 | as the value of the least significant byte means that the other three |

317 | bytes are zero). |

318 | |

319 | Temporary storage requirements are about 4 times the size of the input |

320 | argument. So, for a 32-bit argument, that would be 16 bytes. |

321 | \end{verbatim} |

322 | \end{tiny} |

323 | |

324 | |

325 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

326 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

327 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

328 | \subsection{The Babylonian Method} |

329 | \label{ctbg0:ssre0:sbab0} |

330 | |

331 | TBD. |

332 | |

333 | |

334 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

335 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

336 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

337 | \subsection{Unprocessed Newsgroup Posts} |

338 | \label{ctbg0:ssre0:sunp0} |

339 | |

340 | Have to process these newsgroup posts: |

341 | |

342 | \begin{tiny} |

343 | \begin{verbatim} |

344 | I have the need in a microcontroller application to calculate floor(sqrt(x)) |

345 | where x is approximately in the range of 2^32 - 2^64. |

346 | |

347 | What are the algorithms I should look at? |

348 | |

349 | These small microcontrollers are characterized by having a 8-bit times 8-bit |

350 | to give 16-bit result integer unsigned multiply instruction, and a similar |

351 | 16/8 division instruction to give a quotient and a remainder. |

352 | Mulitplications of larger integers have to be done by multiplying the bytes |

353 | and adding. |

354 | |

355 | I'm aware of these algorithms: |

356 | |

357 | a)Adding consecutive odd integers and counting how many you have to add, |

358 | i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc. |

359 | |

360 | b)Trial squaring, testing setting each bit to "1", i.e. |

361 | |

362 | http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu8sqrtfrxx/src/atu8sqrtfrxx.c?revision=1.3&view=markup |

363 | |

364 | c)Another method that seems to work (don't know the name of it), here is |

365 | source code: |

366 | |

367 | http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu16sqrtx10frxx/src/atu16sqrtx10frxx.c?revision=1.3&view=markup |

368 | |

369 | What other options should I investigate? |

370 | |

371 | ================================================================================ |

372 | Here is some C code for integer square roots using many different |

373 | algorithms: |

374 | http://cap.connx.com/chess-engines/new-approach/jsqrt0.ZIP |

375 | |

376 | Worth a look: |

377 | http://www.azillionmonkeys.com/qed/sqroot.html |

378 | |

379 | Consider also: |

380 | http://www.google.com/#hl=en&source=hp&q=fast+integer+square+root&rlz= |

381 | 1W1GGLD_en&aq=0&aqi=g2&oq=fast+integer+sq&fp=a048890d3c90c6fc |

382 | ================================================================================ |

383 | > I have the need in a microcontroller application to calculate |

384 | > floor(sqrt(x)) where x is approximately in the range of 2^32 - 2^64. |

385 | |

386 | x = y^2 - (2^16)^2 |

387 | |

388 | > What are the algorithms I should look at? |

389 | |

390 | Google is your friend :> There are many sqrt algorithms with |

391 | different tradeoffs (speed/size). |

392 | |

393 | I would, instead, ask that you might want to examine what *else* |

394 | you know about "x". I.e., you've already constrained the range |

395 | (why? does that give you any other information that can be |

396 | exploited?); are there any other characteristics about how |

397 | "x" is synthesized that you can exploit to divide and conquer? |

398 | |

399 | E.g. if x = y * z, then knowing sqrt(y) and/or sqrt(z) simplifies |

400 | your problem. |

401 | |

402 | > These small microcontrollers are characterized by having a 8-bit times |

403 | > 8-bit to give 16-bit result integer unsigned multiply instruction, and a |

404 | > similar 16/8 division instruction to give a quotient and a remainder. |

405 | > Mulitplications of larger integers have to be done by multiplying the |

406 | > bytes and adding. |

407 | > |

408 | > I'm aware of these algorithms: |

409 | > |

410 | > a)Adding consecutive odd integers and counting how many you have to add, |

411 | > i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc. |

412 | > |

413 | > b)Trial squaring, testing setting each bit to "1", i.e. |

414 | > |

415 | > http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu8sqrtfrxx/src/atu8sqrtfrxx.c?revision=1.3&view=markup |

416 | > |

417 | > c)Another method that seems to work (don't know the name of it), here is |

418 | > source code: |

419 | > |

420 | > http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu16sqrtx10frxx/src/atu16sqrtx10frxx.c?revision=1.3&view=markup |

421 | > |

422 | > What other options should I investigate? |

423 | ================================================================================ |

424 | > Hi, |

425 | > |

426 | > I have the need in a microcontroller application to calculate floor(sqrt(x)) |

427 | > where x is approximately in the range of 2^32 - 2^64. |

428 | > |

429 | > What are the algorithms I should look at? |

430 | > |

431 | > These small microcontrollers are characterized by having a 8-bit times 8-bit |

432 | > to give 16-bit result integer unsigned multiply instruction, and a similar |

433 | > 16/8 division instruction to give a quotient and a remainder. |

434 | > Mulitplications of larger integers have to be done by multiplying the bytes |

435 | > and adding. |

436 | > |

437 | > I'm aware of these algorithms: |

438 | > |

439 | > a)Adding consecutive odd integers and counting how many you have to add, |

440 | > i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc. |

441 | > |

442 | > b)Trial squaring, testing setting each bit to "1", i.e. |

443 | > |

444 | > http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic... |

445 | > |

446 | > c)Another method that seems to work (don't know the name of it), here is |

447 | > source code: |

448 | > |

449 | > http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic... |

450 | > |

451 | > What other options should I investigate? |

452 | > |

453 | > Thanks, Datesfat |

454 | |

455 | The one I have been using last 15-20 years is probably your B, at |

456 | least sounds |

457 | like what I made up back then - successive approximation for each bit |

458 | of the result. |

459 | If multiplication is fast on the core you will be using it it is hard |

460 | to beat. |

461 | |

462 | Dimiter |

463 | ================================================================================ |

464 | > Hi, |

465 | > |

466 | > I have the need in a microcontroller application to calculate |

467 | > floor(sqrt(x)) |

468 | > where x is approximately in the range of 2^32 - 2^64. |

469 | > |

470 | > What are the algorithms I should look at? |

471 | > |

472 | > These small microcontrollers are characterized by having a 8-bit times |

473 | > 8-bit |

474 | > to give 16-bit result integer unsigned multiply instruction, and a similar |

475 | > 16/8 division instruction to give a quotient and a remainder. |

476 | > Mulitplications of larger integers have to be done by multiplying the |

477 | > bytes |

478 | > and adding. |

479 | > |

480 | > I'm aware of these algorithms: |

481 | > |

482 | > a)Adding consecutive odd integers and counting how many you have to add, |

483 | > i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc. |

484 | > |

485 | > b)Trial squaring, testing setting each bit to "1", i.e. |

486 | > |

487 | > http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic... |

488 | > |

489 | > c)Another method that seems to work (don't know the name of it), here is |

490 | > source code: |

491 | > |

492 | > http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic... |

493 | > |

494 | > What other options should I investigate? |

495 | > |

496 | > Thanks, Datesfat |

497 | > |

498 | >The one I have been using last 15-20 years is probably your B, at |

499 | >least sounds |

500 | >like what I made up back then - successive approximation for each bit |

501 | >of the result. |

502 | >If multiplication is fast on the core you will be using it it is hard |

503 | >to beat. |

504 | |

505 | Thanks for the insight. |

506 | |

507 | For small operands, I agree with your analysis. |

508 | |

509 | However, for larger operands, the Babylonian method: |

510 | |

511 | http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Babylonian_method |

512 | |

513 | looks very promising. |

514 | |

515 | The issue is that for the cost of one division (the /2 and addition is |

516 | virtually free) you get far more than one bit of additional information |

517 | about the square root. |

518 | |

519 | For larger operands, I don't believe that (b) will be the right answer any |

520 | longer. |

521 | |

522 | But I'm not sure yet ... |

523 | ================================================================================ |

524 | > a)Adding consecutive odd integers and counting how many you have to add, |

525 | > i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc. |

526 | |

527 | That quite certainly can't work for the input range you're targeting. |

528 | You would be looking at up to four billion iterations if x was 2^64 |

529 | |

530 | > b)Trial squaring, testing setting each bit to "1", i.e. |

531 | |

532 | A whole lot better already. It becomes even better if you update the |

533 | square of your current result differentially, applying the binomial |

534 | formula. As you update the result-to-be ('res' in the following), by |

535 | adding a 1-bit at position 'k': |

536 | |

537 | x --> x + 2^k |

538 | |

539 | it's square, which eventually should be equal to the input number, |

540 | updates too: |

541 | |

542 | x^2 --> x^2 + 2*(2^k)*x + (2^k)^2 |

543 | = x^2 + x<<(k+1) + 1<<(2*k) |

544 | |

545 | That's two additions and a bit of shifting. Even an 8-bit CPU should be |

546 | able to manage that in reasonable time. Basically you go looking for a |

547 | bit position 'k' such that the current difference between x^2 and the |

548 | target value is still larger than or equal to (x<<(k+1)+1<<(2*k)). |

549 | |

550 | This is actually a special case of an algorithm people used to be taught |

551 | in higher schools. Just like we all (hopefully) learned to do long |

552 | multiplication and long division in elementary school, there used to be |

553 | an algorithm for long square-roots. I've seen it in a textbook used to |

554 | teach metal workers' apprentices from the 1940s. It's a little daunting |

555 | in decimal numbers on paper, but boils down to the above when performed |

556 | in binary: you figure out one digit per iteration by looking at the |

557 | difference between the target figure and square of the result-to-be. |

558 | It's also somewhat similar to long division in that way. |

559 | |

560 | > What other options should I investigate? |

561 | |

562 | One traditional trick for sqrt(), and some other inverse functions, too, |

563 | is to treat it as a root-finding problem for the function |

564 | |

565 | f(x) := x^2 - y |

566 | |

567 | The x that solves f(x)=0 is sqrt(y). You can solve that using the |

568 | Newton-Raphson iteration algorithm, a.k.a. the tangent method: |

569 | |

570 | x' = x - f(x)/f'(x) |

571 | = x - (x^2 - y) / (2 * x) |

572 | = x - x/2 - y/(2*x) |

573 | = x/2 - (y/2)/x |

574 | |

575 | This requires long division, but pays off by converging on the solution |

576 | real fast, once you've come anywhere near it. |

577 | |

578 | This algorithm and the bit-by-bit one above are somewhat related. The |

579 | search for the right bit position, 'k', is effectively a simplified |

580 | version of the division (y/2)/x. |

581 | ================================================================================ |

582 | > I have the need in a microcontroller application to calculate floor(sqrt(x)) |

583 | > where x is approximately in the range of 2^32 - 2^64. |

584 | > |

585 | > What are the algorithms I should look at? |

586 | |

587 | Newton-Raphson or bisection, depending upon how slow division is. N-R is |

588 | asymptotically faster (the number of correct digits doubles at each step), |

589 | but each step requires a division. |

590 | |

591 | Bisection is essentially your option b), but it can be computed much more |

592 | efficiently than the algorithm given. You don't need arbitrary multiplies, |

593 | only shifts, as: |

594 | |

595 | (a+2^k)^2 = a^2 + 2.a.2^k + (2^k)^2 |

596 | = a^2 + a.2^(k+1) + 2^2k |

597 | |

598 | IOW: |

599 | |

600 | uint32_t isqrt(uint64_t x) |

601 | { |

602 | uint64_t a = 0; |

603 | uint64_t a2 = 0; |

604 | int k; |

605 | |

606 | for (k = 31; k >= 0; k--) { |

607 | |

608 | uint64_t t = a2 + (a << (k+1)) + ((uint64_t)1 << (k + k)); |

609 | if (x > t) { |

610 | a += 1 << k; |

611 | a2 = t; |

612 | } |

613 | } |

614 | |

615 | return a; |

616 | } |

617 | |

618 | Depending upon the CPU, it may be faster to calculate the a.2^(k+1) and |

619 | 2^2k terms incrementally. |

620 | ================================================================================ |

621 | Wiki |

622 | |

623 | http://en.wikipedia.org/wiki/Methods_of_computing_square_roots\\end{verbatim} |

624 | ================================================================================ |

625 | Noticed your post on sci.math. I am surprised no one mentioned this one |

626 | from Joe Leherbauer that only uses shifts and adds. I cannot find the |

627 | original usenet message but a copy is here: |

628 | |

629 | http://groups.google.com/group/comp.lang.c/browse\_thread/thread/14ed11f99b6ac8ab/d4297a8de3ead321?hl=en\&ie=UTF-8\&q=\%22joe+Leherbauer\%22 |

630 | |

631 | I clipped the message: |

632 | |

633 | <----------------------clipped from message----------------------------> |

634 | The function below is probably the fastest you can get with integer |

635 | operations only. It executes in constant time of 16 loop iterations |

636 | assuming a long is 32 bits, i.e. it's O(log(n)). |

637 | As a by-product it also produces the square root remainder. |

638 | More formally, it computes s and r, such that: |

639 | |

640 | s = floor(sqrt(n)) |

641 | r = n - s * s |

642 | |

643 | I have to admit that I did not invent this. |

644 | It is from an anonymous Usenet source. |

645 | |

646 | unsigned long |

647 | i\_sqrt (unsigned long n, unsigned long * rem) |

648 | { |

649 | unsigned long r, s, t; |

650 | |

651 | r = 0; |

652 | t = (~0UL >> 2) + 1; /* largest power of 4 supported by data type */ |

653 | do |

654 | { |

655 | s = r + t; |

656 | if ( n >= s ) |

657 | { |

658 | n -= s; |

659 | r = s + t; |

660 | } |

661 | r >>= 1; |

662 | } |

663 | while ( (t >>= 2) != 0 ); |

664 | *rem = n; |

665 | |

666 | return r; |

667 | |

668 | } |

669 | |

670 | --- |

671 | Joe Leherbauer Leherbauer at telering dot at |

672 | Registered Linux User \# 215803 |

673 | ================================================================================ |

674 | \end{tiny} |

675 | |

676 | |

677 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

678 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

679 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

680 | \section[\emph{UcuAtS32S16v2CpDiva2FRxx(\protect\mbox{\protect$\cdot$})} Function Calculation Method] |

681 | {\emph{UcuAtS32S16v2CpDiva2FRxx(\protect\mbox{\protect\boldmath $\cdot$})} Function Calculation Method} |

682 | \label{ctbg0:svec2} |

683 | |

684 | \index{UcuAtS32S16v2CpDiva2FRxx()@\emph{UcuAtS32S16v2CpDiva2FRxx($\cdot$)}}% |

685 | The nominal calculation performed by the |

686 | \emph{UcuAtS32S16v2CpDiva2FRxx($\cdot$)} function is |

687 | |

688 | \begin{equation} |

689 | \label{eq:ctbg0:svec2:01} |

690 | g(a_x, a_y, b_x, b_y) |

691 | = |

692 | \left\lfloor \frac{\vec{a} \times \vec{b}}{| \vec{b} | \hat{k}} \right\rfloor |

693 | = |

694 | \left\lfloor \frac{a_x b_y - a_y b_x}{\sqrt{b_x^2 + b_y^2}} \right\rfloor , |

695 | \end{equation} |

696 | |

697 | \noindent{}where the floor function of a negative argument rounds towards zero. |

698 | |

699 | Clearly, (\ref{eq:ctbg0:svec2:01}) isn't a convenient method of calcuation for an |

700 | inexpensive microcontroller, as the result |

701 | of the square root calculation might need to include a fractional part. |

702 | |

703 | (\ref{eq:ctbg0:svec2:01}) can be modified by squaring then taking the square root, |

704 | making adjustments for the loss of sign information caused by squaring: |

705 | |

706 | \begin{equation} |

707 | \label{eq:ctbg0:svec2:02} |

708 | \left\lfloor \frac{a_x b_y - a_y b_x}{\sqrt{b_x^2 + b_y^2}} \right\rfloor |

709 | = |

710 | \left\lfloor \sqrt{\frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2}} \right\rfloor sgn (a_x b_y - a_y b_x). |

711 | \end{equation} |

712 | |

713 | Note in (\ref{eq:ctbg0:svec2:02}) |

714 | the the remainder of the division can be discarded, i.e. |

715 | |

716 | \begin{equation} |

717 | \label{eq:ctbg0:svec2:03} |

718 | \left\lfloor \sqrt{\frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2}} \right\rfloor |

719 | = |

720 | \left\lfloor \sqrt{\left\lfloor\frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2}\right\rfloor} \right\rfloor . |

721 | \end{equation} |

722 | |

723 | \noindent{}(\ref{eq:ctbg0:svec2:03}) is valid because only the square root of an integer can be |

724 | an integer, so flooring the input to a square root calculation whose output is floored |

725 | will not have an effect. |

726 | |

727 | The size of the integer to be squared in (\ref{eq:ctbg0:svec2:03}) must be established. |

728 | $a_x b_y - a_y b_x$ may have a magnitude as large as $2^{31}-2^{15}$, so 31 bits are required (for an unsigned representation). |

729 | In practice, 32 bits (4 bytes) will be used for storage. |

730 | |

731 | The term $(a_x b_y - a_y b_x)^2$ may be as large as $(2^{31}-2^{15})^2 = 2^{62}-2^{47}+2^{30}$, so 62 bits are required. |

732 | In practice, 64 bits (8 bytes) will be used for storage. |

733 | |

734 | The most difficult question to answer is the upper bound on the result of the |

735 | division. One would suspect from the form of (\ref{eq:ctbg0:svec2:03}) |

736 | that there is an upper bound less than $2^{62}-2^{47}+2^{30}$, as the |

737 | numerator and denominator both increase with increasing $b_x$ or $b_y$. |

738 | If an upper bound smaller than $2^{62}-2^{47}+2^{30}$ exists, it would make |

739 | both the division and the square root extraction more efficient. |

740 | |

741 | An upper bound on the result of the division in (\ref{eq:ctbg0:svec2:03}) |

742 | need not be overly tight. The upper bound is only to assist in devising |

743 | efficient calculation. |

744 | |

745 | The maximum magnitude of $a_x b_y - a_y b_x$ can only occur when a negative |

746 | number is subtracted from a positive, or a positive number is subtracted from |

747 | a negative. The most positive number that can be formed as the product |

748 | of two UCU\_SINT16's is $2^{15}2^{15} = 2^{30}$. The most negative number that can be formed |

749 | is $-2^{15}(2^{15}-1) = -2^{30}+2^{15}$. The maximum magnitude of |

750 | $a_x b_y - a_y b_x$ is thus $2^{30} - (-2^{30}+2^{15}) = 2^{31} - 2^{15}$. |

751 | |

752 | To simplify the algebra (in exchange for a looser bound), |

753 | one can assume that both positive and negative |

754 | UCU\_SINT16's may reach a magnitude of $2^{15}$. Assuming |

755 | that $a_x = a_y = 2^{15}$, the maximum value of |

756 | (\ref{eq:ctbg0:svec2:03}) is |

757 | |

758 | \begin{eqnarray} |

759 | \nonumber |

760 | & |

761 | \displaystyle{ |

762 | \sqrt{\frac{(2^{15} b_x + 2^{15} b_y)^2}{b_x^2 + b_y^2}} |

763 | = |

764 | \sqrt{\frac{2^{30}(b_x^2 + 2 b_x b_y + b_y^2)}{b_x^2 + b_y^2}} |

765 | } |

766 | & \\ |

767 | \label{eq:ctbg0:svec2:04} |

768 | & |

769 | \displaystyle{ |

770 | = \sqrt{2^{30} \frac{b_x^2 + b_y^2}{b_x^2 + b_y^2} + 2^{31} \frac{b_x b_y}{b_x^2 + b_y^2}} |

771 | } |

772 | & \\ |

773 | \nonumber |

774 | & |

775 | \displaystyle{ |

776 | = \sqrt{2^{30} + 2^{31} \frac{b_x b_y}{b_x^2 + b_y^2}} |

777 | } |

778 | & |

779 | \end{eqnarray} |

780 | |

781 | (\ref{eq:ctbg0:svec2:04}) implies that the maximum value of (\ref{eq:ctbg0:svec2:03}) |

782 | depends on the maximum of the function |

783 | |

784 | \begin{equation} |

785 | \label{eq:ctbg0:svec2:05} |

786 | f(b_x, b_y) = \frac{b_x b_y}{b_x^2 + b_y^2} , \;\; b_x, b_y \in \mathbb{Z}^+ . |

787 | \end{equation} |

788 | |

789 | Several helpful posters\footnote{Thanks to Ray Vickson, Gerry Myerson, Rob Johnson, and ``Scattered''.} |

790 | on \texttt{sci.math} provided proofs that |

791 | (\ref{eq:ctbg0:svec2:05}) could not exceed 1/2. The proof that I found easiest to understand |

792 | was provided by Gerry Myerson. Consider the function |

793 | |

794 | \begin{equation} |

795 | \label{eq:ctbg0:svec2:06} |

796 | h(b_x, b_y) |

797 | = |

798 | \frac{1}{2} - f(b_x, b_y) |

799 | = |

800 | \frac{1}{2} \left( \frac{b_x^2 - 2 b_x b_y + b_y^2}{b_x^2 + b_y^2} \right) |

801 | = |

802 | \frac{1}{2} \left( \frac{(b_x - b_y)^2}{b_x^2 + b_y^2} \right). |

803 | \end{equation} |

804 | |

805 | $h(b_x, b_y)$ is non-negative over the domain of interest, proving that $f(b_x, b_y)$ cannot exceed $1/2$. |

806 | |

807 | Substituting the maximum value of 1/2 into (\ref{eq:ctbg0:svec2:04}) |

808 | leads to |

809 | |

810 | \begin{equation} |

811 | \label{eq:ctbg0:svec2:07} |

812 | \frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2} |

813 | \leq |

814 | 2^{30} + 2^{31} \left( \frac{1}{2} \right) |

815 | = |

816 | 2^{30} + 2^{30} |

817 | = |

818 | 2^{31} . |

819 | \end{equation} |

820 | |

821 | \noindent{}(\ref{eq:ctbg0:svec2:07}) |

822 | establishes that the result of the division may not exceed 32 bits. This is |

823 | an important result, as the dividend of the division is 64 bits, requiring |

824 | 64 rounds in the general case to obtain a quotient. Knowledge that the |

825 | quotient may not exceed 32 bits will approximately halve the division time, as |

826 | only 32 iterations rather than 64 need to be performed. |

827 | |

828 | The square root extraction will involve a 32-bit input (the quotient from the |

829 | division) and a 16-bit result. |

830 | |

831 | |

832 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

833 | \noindent\begin{figure}[!b] |

834 | \noindent\rule[-0.25in]{\textwidth}{1pt} |

835 | \begin{tiny} |

836 | \begin{verbatim} |

837 | $RCSfile: c_tbg0.tex,v $ |

838 | $Source: /home/dashley/cvsrep/uculib01/uculib01/doc/manual/c_tbg0/c_tbg0.tex,v $ |

839 | $Revision: 1.12 $ |

840 | $Author: dashley $ |

841 | $Date: 2010/04/05 15:13:50 $ |

842 | \end{verbatim} |

843 | \end{tiny} |

844 | \noindent\rule[0.25in]{\textwidth}{1pt} |

845 | \end{figure} |

846 | |

847 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

848 | %$Log: c_tbg0.tex,v $ |

849 | %Revision 1.12 2010/04/05 15:13:50 dashley |

850 | %Edits. |

851 | % |

852 | %Revision 1.11 2010/04/05 12:17:12 dashley |

853 | %Edits. |

854 | % |

855 | %Revision 1.10 2010/04/01 21:07:34 dashley |

856 | %Edits. |

857 | % |

858 | %Revision 1.9 2010/04/01 20:21:50 dashley |

859 | %Edits. |

860 | % |

861 | %Revision 1.8 2010/03/03 16:30:35 dashley |

862 | %Edits. |

863 | % |

864 | %Revision 1.7 2010/03/03 00:19:21 dashley |

865 | %Edits. |

866 | % |

867 | %Revision 1.6 2010/02/05 03:04:01 dashley |

868 | %Additional square root information added. |

869 | % |

870 | %Revision 1.5 2010/02/05 02:58:12 dashley |

871 | %Newsgroup post text added. |

872 | % |

873 | %Revision 1.4 2010/01/28 21:18:33 dashley |

874 | %a)Chapter start quotes removed. |

875 | %b)Aesthetic comment line added at the bottom of most files. |

876 | % |

877 | %Revision 1.3 2007/10/08 18:16:34 dtashley |

878 | %Edits. |

879 | % |

880 | %Revision 1.2 2007/10/07 18:11:22 dtashley |

881 | %Edits. |

882 | % |

883 | %Revision 1.1 2007/10/07 01:22:01 dtashley |

884 | %Initial checkin. |

885 | % |

886 | %End of $RCSfile: c_tbg0.tex,v $. |

887 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |

888 |

dashley@gmail.com | ViewVC Help |

Powered by ViewVC 1.1.25 |