%$Header: /home/dashley/cvsrep/uculib01/uculib01/doc/manual/c_tbg0/c_tbg0.tex,v 1.12 2010/04/05 15:13:50 dashley Exp $

\chapter{Technical Background}
\label{ctbg0}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%Section tag: int0
\label{ctbg0:sint0}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Interrupt Compatibility Issues}
%Section tag: ici0
\label{ctbg0:sici0}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Linear Software Timers}
%Section tag: lst0
\label{ctbg0:slst0}

Several of the block memory functions described in
Chapter \ref{cbmf0} are designed primarily to
implement what we call here a
\index{software timer}\index{linear software timer}\emph{software timer} (although these functions
may have other uses).

We define a \index{software timer}\index{linear software timer}software timer as an unsigned integer
of any length convenient to the machine that:

\begin{itemize}
\item Is decremented or incremented at a constant rate,\footnote{It is also
possible to decrement (or increment) software timers at a non-constant rate to
approximate equations of the form $x(k+1) = \overline{\alpha} x(k)$,
$\overline{\alpha} < 1$ (or related
equations), but we don't discuss schemes of this type here. The advantage
of such a scheme would be to extend the maximum time period of the timer
while still keeping good precision at low values of time. We would call
such software timers \index{non-linear software timer}\index{software timer}%
\emph{non-linear} software timers to differentiate them
from the timers described here.}
but never below the minimum or above the maximum value representable
by the unsigned integer.
\item Is set and tested by one process (the application) but
is decremented or incremented by another process (typically
system software).
\end{itemize}

Typically, linear software timers are grouped together as arrays of
unsigned integers that are decremented or incremented at the same rate (hence the
efficiency of the block manipulation described in Chapter \ref{cbmf0}).
Most commonly, the rates at which groups of timers may be decremented or
incremented are provided in binary decades (but 1-2-5 decades are also
not uncommon).

Linear software timers are the most efficient known way to manipulate time
in stateful logic where the logic is implemented as code (rather than data).
The efficiency comes about because:

\begin{itemize}
\item The timers are decremented and incremented centrally using
block memory operations (saving FLASH bulk
and CPU cycles).
\item The arrangement of timers in binary decades or 1-2-5 decades
allows the use of smaller data types (at the expense of precision)
than would be possible if time were manipulated only in terms of one
base quantum.
\item The test for an expired timer ($t=0$) often compiles well because
tests against zero are less expensive than tests against other constants
in CPU instruction sets.
\end{itemize}

In most of what follows, we confine the discussion to
software timers that are decremented, but not below zero.
(Software timers that are incremented are less common and have different
applications.)

In most applications, software timers are arranged in terms of
a base quantum, which we denote $\sigma$\@. $\sigma = 1$ millisecond is
a common choice.

In a binary decade arrangement, the quantum used to decrement the
different arrays is $\sigma 2^q$, where $q$ is the ``tier'' of the array of
software timers. If $\sigma = 1$ millisecond, ``Tier 0'' software timers
are decremented every millisecond, ``Tier 1'' software timers are decremented
every two milliseconds, ``Tier 2'' software timers are decremented
every four milliseconds, and so on.

Note that if a software timer (at tier $q$) is set to a value $n$
and then tested periodically to determine whether it has reached zero, the
actual amount of time $t$ that may have elapsed is given by\footnote{In
order to obtain (\ref{eq:ctbg0:slst0:01}), we make the assumption that
the process that tests the software timer runs at least as often as the
software timer is decremented. There are scheduling scenarios that add
yet more to the error, and these are not developed here.}

\begin{equation}
\label{eq:ctbg0:slst0:01}
(n - 1) \sigma 2^q
<
t
\leq
n \sigma 2^q .
\end{equation}

\noindent{}(\ref{eq:ctbg0:slst0:01}) comes about because it is possible
that the timer was set to $n$ just before it was decremented.

Given a desired time interval $\alpha$, we propose to choose

\begin{equation}
\label{eq:ctbg0:slst0:02}
n = \left\lfloor {\frac{\alpha}{\sigma 2^q}} \right\rfloor .
\end{equation}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent\rule{\textwidth}{2pt}

TODO:

\begin{enumerate}
\item Break into subsections.
\item Develop scheduling assumptions that lead to (\ref{eq:ctbg0:slst0:01}) and friends.
\item Show relationship with timed automata.
\item Show state-space and complexity reduction advantages.
\end{enumerate}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Square Root Extraction (Integer)}
%Section tag: sre0
\label{ctbg0:ssre0}

This section presents information about the extraction of
square roots over an integer domain and integer range. Usually, the function
of interest is

\begin{equation}
\label{eq:ctbg0:ssre0:01}
f(x) = \left\lfloor \sqrt{x} \right\rfloor .
\end{equation}

Other functions of interest can generally be implemented in terms
of (\ref{eq:ctbg0:ssre0:01}).

If a square root with information after the radix point is desired, the argument $x$ can be premultiplied
by the square of the reciprocal of the precision after the radix point. For example,
a good approximation to $10 \sqrt{x}$ is

\begin{equation}
\label{eq:ctbg0:ssre0:02}
10 \sqrt{x} \approx g(x) = \left\lfloor 10 \sqrt{x} \right\rfloor
= \left\lfloor \sqrt{100 x} \right\rfloor ,
\end{equation}

\noindent{}which can be implemented in terms of (\ref{eq:ctbg0:ssre0:01})\@.
(\ref{eq:ctbg0:ssre0:02}) provides a result that is human-friendly (i.e.\ 453 is
interpreted as 45.3), but the radix point of the result is not positioned between
any two bits. A more typical approach might be to premultiply by an even
power of two so as to place the radix point of the result between two bits.
For example,

\begin{equation}
\label{eq:ctbg0:ssre0:02b}
2^8 \sqrt{x} \approx \left\lfloor 2^8 \sqrt{x} \right\rfloor
= \left\lfloor \sqrt{2^{16} x} \right\rfloor
\end{equation}

\noindent{}will calculate the square root of an integer
with eight bits after the radix point.

Similarly, if rounding to the nearest integer is desired, note that\footnote{To
prove (\ref{eq:ctbg0:ssre0:03}), consider four cases. \emph{Case 1:} If the fractional part of
$\sqrt{x}$ is 0, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$
will be an integer, the result will be exact, and (\ref{eq:ctbg0:ssre0:03}) holds.
\emph{Case 2:} If the fractional part of
$\sqrt{x}$ is less than 0.5, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$
will be an even integer, the square root will be rounded down, and (\ref{eq:ctbg0:ssre0:03}) holds.
\emph{Case 3:} If the fractional part of
$\sqrt{x}$ is equal to 0.5, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$
will be an odd integer, the square root will be rounded up, and (\ref{eq:ctbg0:ssre0:03}) holds.
\emph{Case 4:} If the fractional part of
$\sqrt{x}$ is greater than 0.5, then $\lfloor \sqrt{4x} \rfloor = \lfloor 2 \sqrt{x} \rfloor$
will be an odd integer, the square root will be rounded up, and (\ref{eq:ctbg0:ssre0:03}) holds.}

\begin{equation}
\label{eq:ctbg0:ssre0:03}
h(x) = \left\lfloor \sqrt{x} + 0.5 \right\rfloor
= \left\lfloor \frac{\left\lfloor \sqrt{4 x} \right\rfloor + 1}{2} \right\rfloor ,
\end{equation}

\noindent{}which can also be easily implemented in terms of (\ref{eq:ctbg0:ssre0:01}).
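
Both (\ref{eq:ctbg0:ssre0:02}) and (\ref{eq:ctbg0:ssre0:03}) then reduce to calls on a primitive implementing (\ref{eq:ctbg0:ssre0:01}). A sketch (the names are illustrative; the bit-by-bit primitive shown is a placeholder for any of the methods discussed in the subsections below):

```c
#include <stdint.h>

/* floor(sqrt(x)): the primitive of (eq. 01), here by bit-by-bit
   trial squaring; any of the methods below would serve. */
static uint32_t isqrt(uint64_t x)
{
    uint32_t r = 0u;
    int k;

    for (k = 31; k >= 0; k--)
    {
        uint64_t trial = (uint64_t)(r | (1u << k));
        if (trial * trial <= x)
            r |= (1u << k);
    }
    return r;
}

/* (eq. 02): floor(10*sqrt(x)), one decimal digit after the radix point. */
static uint32_t sqrt_x10(uint32_t x) { return isqrt(100ull * (uint64_t)x); }

/* (eq. 03): sqrt(x) rounded to the nearest integer. */
static uint32_t sqrt_rnd(uint32_t x) { return (isqrt(4ull * (uint64_t)x) + 1u) / 2u; }
```

For example, $\mathit{sqrt\_x10}(2) = \lfloor\sqrt{200}\rfloor = 14$, read as 1.4.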


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Summation of Odd Integers}
\label{ctbg0:ssre0:ssoi0}

Consider the square of a non-negative integer $n$, $n^2$. The distance from
$n^2$ to the square of the next larger integer is

\begin{equation}
\label{eq:ctbg0:ssre0:ssoi0:01}
(n+1)^2 - n^2 = 2n + 1 .
\end{equation}

As $n$ ranges over the non-negative integers, $2n+1$ ranges over the odd
positive integers.
It follows directly that the set of squared integers can be formed by
successively adding odd integers, i.e.

\begin{eqnarray}
\nonumber 1^2 = 1 & = & 1 \\
2^2 = 4 & = & 1 + 3 \\
\nonumber 3^2 = 9 & = & 1 + 3 + 5 \\
\nonumber 4^2 = 16 & = & 1 + 3 + 5 + 7 \\
\nonumber & \ldots{} &
\end{eqnarray}

One possible integer square root algorithm would involve summing consecutive
odd integers until the argument is exceeded.
This algorithm is not developed further because it is impractical for all but
small arguments: the number of iterations grows with $\sqrt{x}$.
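
For completeness, a direct rendering of the method (a sketch; the name is illustrative):

```c
#include <stdint.h>

/* floor(sqrt(x)) by summing consecutive odd integers, per (eq. 01)
   of this subsection.  Loop invariant: sum = root^2, odd = 2*root + 1.
   Runtime grows with sqrt(x), hence impractical for large arguments. */
static uint16_t isqrt_odd(uint32_t x)
{
    uint64_t sum = 0u, odd = 1u;   /* 64-bit so sum + odd cannot wrap */
    uint16_t root = 0u;

    while (sum + odd <= x)         /* (root+1)^2 = sum + odd          */
    {
        sum += odd;
        odd += 2u;
        root++;
    }
    return root;
}
```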


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Bit-By-Bit Bisection}
\label{ctbg0:ssre0:sbis0}

Square roots of an integer can be found using bit-by-bit bisection.
In this method, starting with the most significant bit of the result, each bit
is assumed `1' and a trial square is calculated. If the trial square is no
larger than the argument, that bit of the result is `1'. However, if the trial
square is larger than the argument, that bit of the result is `0'.
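
A sketch of the method for a 32-bit argument, with the trial square computed by multiplication (the raw notes below show how the multiplication can be eliminated):

```c
#include <stdint.h>

/* floor(sqrt(x)) by bit-by-bit bisection: propose each result bit
   from the most significant down, square the trial value, and keep
   the bit if the trial square does not exceed the argument. */
static uint16_t isqrt32(uint32_t x)
{
    uint32_t root = 0u;
    int k;

    for (k = 15; k >= 0; k--)
    {
        uint32_t trial = root | (1u << k);
        if (trial * trial <= x)   /* trial <= 0xFFFF, so trial^2 fits in 32 bits */
            root = trial;
    }
    return (uint16_t)root;
}
```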

\begin{tiny}
\begin{verbatim}
Essentially, it is bit-by-bit trial squaring. "Bisection" because
each bit of the square root cuts the interval (in which the square root
might lie) in half.

Two of the posters in comp.arch.embedded indicated that "bisection" can be
economized further by using the identity:

(n + 2^k)^2 = n^2 + n*2^(k+1) + 2^(2*k)

The "n^2" term is significant because it means that if you have the
previous trial square and you are considering setting bit "k", you can
compute the next trial square from the previous square using only shifting
and adding.

If you carry this to the extreme, because you initially assume no bits
are set, you start with n = n**2 = 0, and the first trial square does
not involve multiplication. So you can implement the entire square
algorithm without multiplying, even once.

Anyway, applying all possible optimizations, one gets this:

http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/
cosmic/modx0/atu24sqrtfrxn/devel/alg_eval01.c?revision=1.7&view=markup

or this in the 32-bit case:

unsigned int sqrt32(unsigned long arg)
{
   unsigned long two_2k_mask;
   unsigned long trial_square;
   unsigned long trial_square_prev;
   unsigned long square_root_ls_ip1;

   two_2k_mask        = 0x40000000UL;
   trial_square       = 0;
   trial_square_prev  = 0;
   square_root_ls_ip1 = 0;

   while(1)
   {
      trial_square = (trial_square_prev)
                     + (square_root_ls_ip1)
                     + (two_2k_mask);

      square_root_ls_ip1 >>= 1;

      if (arg >= trial_square)
      {
         square_root_ls_ip1 |= two_2k_mask;
         trial_square_prev   = trial_square;
      }

      if (two_2k_mask == 1)
         break;

      two_2k_mask >>= 2;
   }

   return(square_root_ls_ip1);
}

If you look at the loop, there really isn't much there. There will be
16 iterations. The test against "1" to exit the loop can be implemented
as a single byte test (because a single "1" is being right-shifted, "1"
as the value of the least significant byte means that the other three
bytes are zero).

Temporary storage requirements are about 4 times the size of the input
argument. So, for a 32-bit argument, that would be 16 bytes.
\end{verbatim}
\end{tiny}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Babylonian Method}
\label{ctbg0:ssre0:sbab0}

TBD.
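
Pending that development, a minimal sketch of the integer Babylonian (Newton-Raphson) iteration $x' = \lfloor (x + \lfloor n/x \rfloor)/2 \rfloor$, run until the estimate stops falling (the function name and starting guess are illustrative, not the library's implementation):

```c
#include <stdint.h>

/* floor(sqrt(n)) by the Babylonian (Newton-Raphson) iteration.
   Each step costs one long division, but each division yields far
   more than one bit of information about the square root. */
static uint32_t isqrt_bab(uint64_t n)
{
    uint64_t x, y;

    if (n == 0u)
        return 0u;

    x = n / 2u + 1u;              /* starting guess; any x0 >= sqrt(n) works     */
    for (;;)
    {
        y = (x + n / x) / 2u;     /* Newton step                                 */
        if (y >= x)
            break;                /* estimate stopped falling: x = floor(sqrt(n)) */
        x = y;
    }
    return (uint32_t)x;
}
```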


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Unprocessed Newsgroup Posts}
\label{ctbg0:ssre0:sunp0}

Have to process these newsgroup posts:

\begin{tiny}
\begin{verbatim}
I have the need in a microcontroller application to calculate floor(sqrt(x))
where x is approximately in the range of 2^32 - 2^64.

What are the algorithms I should look at?

These small microcontrollers are characterized by having a 8-bit times 8-bit
to give 16-bit result integer unsigned multiply instruction, and a similar
16/8 division instruction to give a quotient and a remainder.
Mulitplications of larger integers have to be done by multiplying the bytes
and adding.

I'm aware of these algorithms:

a)Adding consecutive odd integers and counting how many you have to add,
i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc.

b)Trial squaring, testing setting each bit to "1", i.e.

http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu8sqrtfrxx/src/atu8sqrtfrxx.c?revision=1.3&view=markup

c)Another method that seems to work (don't know the name of it), here is
source code:

http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu16sqrtx10frxx/src/atu16sqrtx10frxx.c?revision=1.3&view=markup

What other options should I investigate?

================================================================================
Here is some C code for integer square roots using many different
algorithms:
http://cap.connx.com/chess-engines/new-approach/jsqrt0.ZIP

Worth a look:
http://www.azillionmonkeys.com/qed/sqroot.html

Consider also:
http://www.google.com/#hl=en&source=hp&q=fast+integer+square+root&rlz=
1W1GGLD_en&aq=0&aqi=g2&oq=fast+integer+sq&fp=a048890d3c90c6fc
================================================================================
> I have the need in a microcontroller application to calculate
> floor(sqrt(x)) where x is approximately in the range of 2^32 - 2^64.

x = y^2 - (2^16)^2

> What are the algorithms I should look at?

Google is your friend :> There are many sqrt algorithms with
different tradeoffs (speed/size).

I would, instead, ask that you might want to examine what *else*
you know about "x". I.e., you've already constrained the range
(why? does that give you any other information that can be
exploited?); are there any other characteristics about how
"x" is synthesized that you can exploit to divide and conquer?

E.g. if x = y * z, then knowing sqrt(y) and/or sqrt(z) simplifies
your problem.

> These small microcontrollers are characterized by having a 8-bit times
> 8-bit to give 16-bit result integer unsigned multiply instruction, and a
> similar 16/8 division instruction to give a quotient and a remainder.
> Mulitplications of larger integers have to be done by multiplying the
> bytes and adding.
>
> I'm aware of these algorithms:
>
> a)Adding consecutive odd integers and counting how many you have to add,
> i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc.
>
> b)Trial squaring, testing setting each bit to "1", i.e.
>
> http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu8sqrtfrxx/src/atu8sqrtfrxx.c?revision=1.3&view=markup
>
> c)Another method that seems to work (don't know the name of it), here is
> source code:
>
> http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic/modxx/atu16sqrtx10frxx/src/atu16sqrtx10frxx.c?revision=1.3&view=markup
>
> What other options should I investigate?
================================================================================
> Hi,
>
> I have the need in a microcontroller application to calculate floor(sqrt(x))
> where x is approximately in the range of 2^32 - 2^64.
>
> What are the algorithms I should look at?
>
> These small microcontrollers are characterized by having a 8-bit times 8-bit
> to give 16-bit result integer unsigned multiply instruction, and a similar
> 16/8 division instruction to give a quotient and a remainder.
> Mulitplications of larger integers have to be done by multiplying the bytes
> and adding.
>
> I'm aware of these algorithms:
>
> a)Adding consecutive odd integers and counting how many you have to add,
> i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc.
>
> b)Trial squaring, testing setting each bit to "1", i.e.
>
> http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic...
>
> c)Another method that seems to work (don't know the name of it), here is
> source code:
>
> http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic...
>
> What other options should I investigate?
>
> Thanks, Datesfat

The one I have been using last 15-20 years is probably your B, at
least sounds
like what I made up back then - successive approximation for each bit
of the result.
If multiplication is fast on the core you will be using it it is hard
to beat.

Dimiter
================================================================================
> Hi,
>
> I have the need in a microcontroller application to calculate
> floor(sqrt(x))
> where x is approximately in the range of 2^32 - 2^64.
>
> What are the algorithms I should look at?
>
> These small microcontrollers are characterized by having a 8-bit times
> 8-bit
> to give 16-bit result integer unsigned multiply instruction, and a similar
> 16/8 division instruction to give a quotient and a remainder.
> Mulitplications of larger integers have to be done by multiplying the
> bytes
> and adding.
>
> I'm aware of these algorithms:
>
> a)Adding consecutive odd integers and counting how many you have to add,
> i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc.
>
> b)Trial squaring, testing setting each bit to "1", i.e.
>
> http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic...
>
> c)Another method that seems to work (don't know the name of it), here is
> source code:
>
> http://www.uculib.com/vvcuculib01/viewvc.cgi/uculib01/src/stm8/cosmic...
>
> What other options should I investigate?
>
> Thanks, Datesfat
>
>The one I have been using last 15-20 years is probably your B, at
>least sounds
>like what I made up back then - successive approximation for each bit
>of the result.
>If multiplication is fast on the core you will be using it it is hard
>to beat.

Thanks for the insight.

For small operands, I agree with your analysis.

However, for larger operands, the Babylonian method:

http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Babylonian_method

looks very promising.

The issue is that for the cost of one division (the /2 and addition is
virtually free) you get far more than one bit of additional information
about the square root.

For larger operands, I don't believe that (b) will be the right answer any
longer.

But I'm not sure yet ...
================================================================================
> a)Adding consecutive odd integers and counting how many you have to add,
> i.e. 1 + 3 = 4, 4 + 5 = 9, 9 + 7 = 16, 16 + 9 = 25, etc.

That quite certainly can't work for the input range you're targeting.
You would be looking at up to four billion iterations if x was 2^64

> b)Trial squaring, testing setting each bit to "1", i.e.

A whole lot better already. It becomes even better if you update the
square of your current result differentially, applying the binomial
formula. As you update the result-to-be ('res' in the following), by
adding a 1-bit at position 'k':

x --> x + 2^k

it's square, which eventually should be equal to the input number,
updates too:

x^2 --> x^2 + 2*(2^k)*x + (2^k)^2
      = x^2 + x<<(k+1) + 1<<(2*k)

That's two additions and a bit of shifting. Even an 8-bit CPU should be
able to manage that in reasonable time. Basically you go looking for a
bit position 'k' such that the current difference between x^2 and the
target value is still larger than or equal to (x<<(k+1)+1<<(2*k)).

This is actually a special case of an algorithm people used to be taught
in higher schools. Just like we all (hopefully) learned to do long
multiplication and long division in elementary school, there used to be
an algorithm for long square-roots. I've seen it in a textbook used to
teach metal workers' apprentices from the 1940s. It's a little daunting
in decimal numbers on paper, but boils down to the above when performed
in binary: you figure out one digit per iteration by looking at the
difference between the target figure and square of the result-to-be.
It's also somewhat similar to long division in that way.

> What other options should I investigate?

One traditional trick for sqrt(), and some other inverse functions, too,
is to treat it as a root-finding problem for the function

f(x) := x^2 - y

The x that solves f(x)=0 is sqrt(y). You can solve that using the
Newton-Raphson iteration algorithm, a.k.a. the tangent method:

x' = x - f(x)/f'(x)
   = x - (x^2 - y) / (2 * x)
   = x - x/2 - y/(2*x)
   = x/2 - (y/2)/x

This requires long division, but pays off by converging on the solution
real fast, once you've come anywhere near it.

This algorithm and the bit-by-bit one above are somewhat related. The
search for the right bit position, 'k', is effectively a simplified
version of the division (y/2)/x.
================================================================================
> I have the need in a microcontroller application to calculate floor(sqrt(x))
> where x is approximately in the range of 2^32 - 2^64.
>
> What are the algorithms I should look at?

Newton-Raphson or bisection, depending upon how slow division is. N-R is
asymptotically faster (the number of correct digits doubles at each step),
but each step requires a division.

Bisection is essentially your option b), but it can be computed much more
efficiently than the algorithm given. You don't need arbitrary multiplies,
only shifts, as:

(a+2^k)^2 = a^2 + 2.a.2^k + (2^k)^2
          = a^2 + a.2^(k+1) + 2^2k

IOW:

uint32_t isqrt(uint64_t x)
{
    uint64_t a  = 0;
    uint64_t a2 = 0;
    int k;

    for (k = 31; k >= 0; k--) {

        uint64_t t = a2 + (a << (k+1)) + ((uint64_t)1 << (k + k));
        if (x > t) {
            a += 1 << k;
            a2 = t;
        }
    }

    return a;
}

Depending upon the CPU, it may be faster to calculate the a.2^(k+1) and
2^2k terms incrementally.
================================================================================
Wiki

http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
================================================================================
Noticed your post on sci.math. I am surprised no one mentioned this one
from Joe Leherbauer that only uses shifts and adds. I cannot find the
original usenet message but a copy is here:

http://groups.google.com/group/comp.lang.c/browse_thread/thread/14ed11f99b6ac8ab/d4297a8de3ead321?hl=en&ie=UTF-8&q=%22joe+Leherbauer%22

I clipped the message:

<----------------------clipped from message---------------------------->
The function below is probably the fastest you can get with integer
operations only. It executes in constant time of 16 loop iterations
assuming a long is 32 bits, i.e. it's O(log(n)).
As a by-product it also produces the square root remainder.
More formally, it computes s and r, such that:

s = floor(sqrt(n))
r = n - s * s

I have to admit that I did not invent this.
It is from an anonymous Usenet source.

unsigned long
i_sqrt (unsigned long n, unsigned long * rem)
{
    unsigned long r, s, t;

    r = 0;
    t = (~0UL >> 2) + 1; /* largest power of 4 supported by data type */
    do
    {
        s = r + t;
        if ( n >= s )
        {
            n -= s;
            r = s + t;
        }
        r >>= 1;
    }
    while ( (t >>= 2) != 0 );
    *rem = n;

    return r;
}

---
Joe Leherbauer Leherbauer at telering dot at
Registered Linux User # 215803
================================================================================
\end{verbatim}
\end{tiny}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section[\emph{UcuAtS32S16v2CpDiva2FRxx(\protect\mbox{\protect$\cdot$})} Function Calculation Method]
{\emph{UcuAtS32S16v2CpDiva2FRxx(\protect\mbox{\protect\boldmath $\cdot$})} Function Calculation Method}
\label{ctbg0:svec2}

\index{UcuAtS32S16v2CpDiva2FRxx()@\emph{UcuAtS32S16v2CpDiva2FRxx($\cdot$)}}%
The nominal calculation performed by the
\emph{UcuAtS32S16v2CpDiva2FRxx($\cdot$)} function is

\begin{equation}
\label{eq:ctbg0:svec2:01}
g(a_x, a_y, b_x, b_y)
=
\left\lfloor \frac{(\vec{a} \times \vec{b}) \cdot \hat{k}}{| \vec{b} |} \right\rfloor
=
\left\lfloor \frac{a_x b_y - a_y b_x}{\sqrt{b_x^2 + b_y^2}} \right\rfloor ,
\end{equation}

\noindent{}where the floor function of a negative argument rounds towards zero.

Clearly, (\ref{eq:ctbg0:svec2:01}) isn't a convenient method of calculation for an
inexpensive microcontroller, as the result
of the square root calculation might need to include a fractional part.
702 |
|
703 |
(\ref{eq:ctbg0:svec2:01}) can be modified by squaring then taking the square root, |
704 |
making adjustments for the loss of sign information caused by squaring: |
705 |
|
706 |
\begin{equation} |
707 |
\label{eq:ctbg0:svec2:02} |
708 |
\left\lfloor \frac{a_x b_y - a_y b_x}{\sqrt{b_x^2 + b_y^2}} \right\rfloor |
709 |
= |
710 |
\left\lfloor \sqrt{\frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2}} \right\rfloor sgn (a_x b_y - a_y b_x). |
711 |
\end{equation} |
712 |
|
713 |
Note in (\ref{eq:ctbg0:svec2:02}) |
714 |
the the remainder of the division can be discarded, i.e. |
715 |
|
\begin{equation}
\label{eq:ctbg0:svec2:03}
\left\lfloor \sqrt{\frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2}} \right\rfloor
=
\left\lfloor \sqrt{\left\lfloor\frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2}\right\rfloor} \right\rfloor .
\end{equation}

\noindent{}(\ref{eq:ctbg0:svec2:03}) is valid because, for any non-negative integer $m$,
$\sqrt{x} \geq m$ if and only if $x \geq m^2$, and $m^2$ is itself an integer; hence
flooring the radicand of a square root whose result is floored has no effect.

The size of the integer to be squared in (\ref{eq:ctbg0:svec2:03}) must be established.
$a_x b_y - a_y b_x$ may have a magnitude as large as $2^{31}-2^{15}$, so 31 bits are required (for an unsigned representation).
In practice, 32 bits (4 bytes) will be used for storage.

The term $(a_x b_y - a_y b_x)^2$ may be as large as $(2^{31}-2^{15})^2 = 2^{62}-2^{47}+2^{30}$, so 62 bits are required.
In practice, 64 bits (8 bytes) will be used for storage.

The most difficult question to answer is the upper bound on the result of the
division. One would suspect from the form of (\ref{eq:ctbg0:svec2:03})
that there is an upper bound less than $2^{62}-2^{47}+2^{30}$, as the
numerator and denominator both increase with increasing $b_x$ or $b_y$.
If an upper bound smaller than $2^{62}-2^{47}+2^{30}$ exists, it would make
both the division and the square root extraction more efficient.

An upper bound on the result of the division in (\ref{eq:ctbg0:svec2:03})
need not be overly tight; it is needed only to assist in devising an
efficient calculation.

The maximum magnitude of $a_x b_y - a_y b_x$ can only occur when a negative
product is subtracted from a positive one, or a positive product is subtracted from
a negative one. The most positive product
of two UCU\_SINT16's is $(-2^{15})(-2^{15}) = 2^{30}$. The most negative product
is $-2^{15}(2^{15}-1) = -2^{30}+2^{15}$. The maximum magnitude of
$a_x b_y - a_y b_x$ is thus $2^{30} - (-2^{30}+2^{15}) = 2^{31} - 2^{15}$.

To simplify the algebra (in exchange for a looser bound),
one can assume that both positive and negative
UCU\_SINT16's may reach a magnitude of $2^{15}$. Assuming
that $a_x = a_y = 2^{15}$, the maximum value of
(\ref{eq:ctbg0:svec2:03}) is

\begin{eqnarray}
\nonumber
&
\displaystyle{
\sqrt{\frac{(2^{15} b_x + 2^{15} b_y)^2}{b_x^2 + b_y^2}}
=
\sqrt{\frac{2^{30}(b_x^2 + 2 b_x b_y + b_y^2)}{b_x^2 + b_y^2}}
}
& \\
\label{eq:ctbg0:svec2:04}
&
\displaystyle{
= \sqrt{2^{30} \frac{b_x^2 + b_y^2}{b_x^2 + b_y^2} + 2^{31} \frac{b_x b_y}{b_x^2 + b_y^2}}
}
& \\
\nonumber
&
\displaystyle{
= \sqrt{2^{30} + 2^{31} \frac{b_x b_y}{b_x^2 + b_y^2}}
}
&
\end{eqnarray}

(\ref{eq:ctbg0:svec2:04}) implies that the maximum value of (\ref{eq:ctbg0:svec2:03})
depends on the maximum of the function

\begin{equation}
\label{eq:ctbg0:svec2:05}
f(b_x, b_y) = \frac{b_x b_y}{b_x^2 + b_y^2} , \;\; b_x, b_y \in \mathbb{Z}^+ .
\end{equation}

Several helpful posters\footnote{Thanks to Ray Vickson, Gerry Myerson, Rob Johnson, and ``Scattered''.}
on \texttt{sci.math} provided proofs that
(\ref{eq:ctbg0:svec2:05}) cannot exceed 1/2. The proof that I found easiest to understand
was provided by Gerry Myerson. Consider the function

\begin{equation}
\label{eq:ctbg0:svec2:06}
h(b_x, b_y)
=
\frac{1}{2} - f(b_x, b_y)
=
\frac{1}{2} \left( \frac{b_x^2 - 2 b_x b_y + b_y^2}{b_x^2 + b_y^2} \right)
=
\frac{1}{2} \left( \frac{(b_x - b_y)^2}{b_x^2 + b_y^2} \right).
\end{equation}

$h(b_x, b_y)$ is non-negative over the domain of interest, proving that $f(b_x, b_y)$ cannot exceed $1/2$.

Substituting the maximum value of 1/2 into (\ref{eq:ctbg0:svec2:04})
leads to

\begin{equation}
\label{eq:ctbg0:svec2:07}
\frac{(a_x b_y - a_y b_x)^2}{b_x^2 + b_y^2}
\leq
2^{30} + 2^{31} \left( \frac{1}{2} \right)
=
2^{30} + 2^{30}
=
2^{31} .
\end{equation}

\noindent{}(\ref{eq:ctbg0:svec2:07})
establishes that the result of the division fits in 32 bits. This is
an important result, as the dividend of the division is 64 bits, requiring
64 rounds in the general case to obtain a quotient. Knowledge that the
quotient cannot exceed 32 bits approximately halves the division time, as
only 32 iterations rather than 64 need to be performed.

The square root extraction will involve a 32-bit input (the quotient from the
division) and a 16-bit result.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent\begin{figure}[!b]
\noindent\rule[-0.25in]{\textwidth}{1pt}
\begin{tiny}
\begin{verbatim}
$RCSfile: c_tbg0.tex,v $
$Source: /home/dashley/cvsrep/uculib01/uculib01/doc/manual/c_tbg0/c_tbg0.tex,v $
$Revision: 1.12 $
$Author: dashley $
$Date: 2010/04/05 15:13:50 $
\end{verbatim}
\end{tiny}
\noindent\rule[0.25in]{\textwidth}{1pt}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%$Log: c_tbg0.tex,v $
%Revision 1.12  2010/04/05 15:13:50  dashley
%Edits.
%
%Revision 1.11  2010/04/05 12:17:12  dashley
%Edits.
%
%Revision 1.10  2010/04/01 21:07:34  dashley
%Edits.
%
%Revision 1.9  2010/04/01 20:21:50  dashley
%Edits.
%
%Revision 1.8  2010/03/03 16:30:35  dashley
%Edits.
%
%Revision 1.7  2010/03/03 00:19:21  dashley
%Edits.
%
%Revision 1.6  2010/02/05 03:04:01  dashley
%Additional square root information added.
%
%Revision 1.5  2010/02/05 02:58:12  dashley
%Newsgroup post text added.
%
%Revision 1.4  2010/01/28 21:18:33  dashley
%a)Chapter start quotes removed.
%b)Aesthetic comment line added at the bottom of most files.
%
%Revision 1.3  2007/10/08 18:16:34  dtashley
%Edits.
%
%Revision 1.2  2007/10/07 18:11:22  dtashley
%Edits.
%
%Revision 1.1  2007/10/07 01:22:01  dtashley
%Initial checkin.
%
%End of $RCSfile: c_tbg0.tex,v $.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
