**emFloat**

Developed and honed for more than two decades, emFloat is a highly optimized component of emRun, SEGGER’s C runtime library, and also a part of SEGGER Embedded Studio.

Designed for plug-and-play, emFloat can replace a default floating-point library, delivering better performance with less code. Very fast and very small, it delivers FPU-like performance in pure software. Where available, it even boosts the performance of an FPU for complex mathematical functions.

It is available stand-alone, in source code form, for developers who wish to increase performance or reduce the code size of their application without replacing the entire runtime library supplied with their toolchain.

emFloat can also be licensed for inclusion in third-party IDEs. For example, Microchip includes emFloat in the Microchip XC32 V4.0 Compiler Toolchain.

Benchmarking for both floating-point and runtime libraries can be done quickly and easily using Embedded Studio, which is readily available at no cost for evaluation and non-commercial usage under SEGGER’s Friendly License.

For details on why using a thoughtfully designed runtime library is important, refer to the emRun page.

**Key features**

- Small code size, high performance
- Plug-and-play: easily replaces the default floating-point library, delivering better performance with less code
- Flexible licensing for integration into user applications or toolchains
- C variant can be used on any 8/16/32/64-bit CPU
- Hand-coded, assembly-optimized variants for Arm and RISC-V
- Fully reentrant
- No heap requirements

**Licensing**

emFloat is available for integration into specific projects by end users, as well as to toolchain providers that want to deliver a top-of-the-line runtime and/or floating-point library to their users.

Licensing options are available to fit any such needs, usually with a single payment and no royalty obligation.

The library is delivered in source code, with optional rights for redistribution in object code form. All delivered C and assembly language source files are fully commented.

SEGGER software is not covered by an open source or required attribution license, and can be integrated into any commercial or proprietary product, without the obligation to disclose the combined source.

**Variants**

emFloat is available in a universal variant written in C, and in specific variants for different CPUs. The specific variants include modules written in assembly language, optimized for the CPU architecture, and deliver higher performance than the universal C variant.

*Universal C Variant:*

The universal variant is written in C. It is highly optimized and delivers much higher performance than comparable, C-coded open-source implementations.

**Supported CPUs:** The universal variant can be used on any platform, including 8-, 16-, 32-, and 64-bit processors.

*Arm Variant:*

The Arm-optimized variant is fully coded in assembly language, conforming to the AEABI. This means it is compatible with any (A)EABI-compliant toolchain, including GCC- and LLVM/Clang-based toolchains as well as Arm's own compiler (including Keil) and IAR, and can replace the default runtime library or parts of it.

**Supported CPUs:** The Arm variant supports any 32-bit Arm CPU, from Arm architecture v4 upward. This includes Cortex-M, Cortex-A, and Cortex-R.

*RISC-V Variant:*

The RISC-V Variant is written in assembly language, providing functions compatible with the EABI. It can easily be used to replace the default runtime library of EABI compliant toolchains.

**Supported CPUs:** The RISC-V variant supports RV32I and RV32E with architecture-specific acceleration. It uses the M (multiply/divide) extension for faster multiply and divide where present, and still provides fast division even if the M extension lacks a divide instruction.

**Implementation and design**

emFloat consists of two parts:

- Arithmetic functions, implementing functionality similar to that of an FPU: floating-point add, subtract, multiply, divide, comparisons, and conversions
- Mathematical functions, using the most efficient modern algorithms, benefiting systems with or without an FPU

While all mathematical functions are written in C, the arithmetic functions for the Arm and RISC-V variants are hand-coded in assembly language. For other processor architectures the library has a portable C implementation.

emFloat is optimized at both the design level (using efficient algorithms) and the implementation level (tuned to each architecture). The source code contains options to fine-tune it for high performance, small code size, or a balance of the two, delivering excellent performance in all cases.

It provides a consistent execution environment, ensuring that infinities, NaNs, and correctly signed zeros are accepted as inputs and generated as outputs. To be consistent with floating-point units executing in fast mode, the library flushes subnormals to a correctly signed zero. Because subnormals rarely occur in embedded systems, this optimization enables a significant code size reduction.

**Integration and use**

emFloat provides all well-known floating-point API functions of the C standard library, as well as the floating-point operation functions defined by the architecture's EABI, whose calls are added implicitly by the compiler.

The floating-point library can either be integrated into a toolchain to replace the existing standard library implementation, or it can be used side-by-side. Side-by-side use enables selective calls to emFloat while retaining the toolchain’s standard library, which makes integration straightforward.

**Example:** To return the sine of a value x:

- With the integrated use, call sin(x).
- With the side-by-side use, call SEGGER_sin(x).

To multiply two float values A and B without the use of an FPU:

- With the integrated use, call A * B, for which the compiler will implicitly call __mulsf3(A, B) or __aeabi_fmul(A, B).
- With the side-by-side use, call SEGGER_fmul(A, B).

**Configuration options**

emFloat is configurable for small code size, high execution speed, or a balance of the two. Optimizing for code size or execution speed does not cause any loss of accuracy: calculated results are identical in all modes.

In the source distribution, the library can be configured and tuned to favor faster or smaller code with different levels of optimization:

- -2 – Favor size at the expense of speed
- -1 – Favor size over speed
- 0 – Balanced
- +1 – Favor speed over size
- +2 – Favor speed at the expense of size
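Assuming the optimization level is selected through the __SEGGER_RTL_OPTIMIZE option named in the benchmark sections of this page, a source configuration might look like the following hypothetical excerpt; the exact mechanism (a compile-time define) is an assumption:

```c
/* Hypothetical configuration excerpt. __SEGGER_RTL_OPTIMIZE is the option
   name used in the benchmark sections of this page; treating it as a
   compile-time define with the range -2..+2 is an assumption. */
#ifndef __SEGGER_RTL_OPTIMIZE
  #define __SEGGER_RTL_OPTIMIZE  0   /* -2 .. +2, 0 = balanced */
#endif
```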

The sections below show the performance of the high-level explicit functions and the low-level implicit floating-point functions on different architectures. For more information, please refer to the blog post Floating-point face-off, part 2.

*Note: Due to the specific features of each architecture, the performance values should not be compared across architectures. Instead, they are intended for comparison with other floating-point libraries.*

**Architecture-specific optimization**

The assembly language variants of emFloat take advantage of processor-specific features. Each architecture has its own fine-tuned implementation.

In the Arm variant, the floating-point support makes use of the 32-bit Arm and Thumb-2 instruction sets and uses divide and extended multiply instructions, when available, resulting in a smaller and faster implementation. Pure Thumb instruction sets, such as on Cortex-M0 and Cortex-M23 processors, are fully supported as well.

In the RISC-V variant, software floating-point is supported on all RV32I and RV32E architectures. The implementation takes full advantage of the M extension where present, including M-extension implementations that lack a divide instruction. The C extension is supported by selecting registers that make the best use of the compact instruction encoding, achieving smaller code.

The assembly language versions have multiple implementations of arithmetic operations, using architectural details and tailored algorithms to make the best use of the available instruction set. For instance, for single precision division:

- If a division instruction is available, use it to iteratively develop a quotient.
- If there is no division instruction but there is a multiplication instruction, use an initial reciprocal approximation refined by Newton-Raphson iterations, with a final correction and multiplication.
- If there is neither a division nor a multiplication instruction, use a non-restoring division algorithm.

Similar optimizations apply to double-precision division.

**API functions — Explicit & implicit**

**Explicit Functions:** emFloat implements all standard library functions usually exposed through math.h. These functions are always called explicitly by the user application.

With the integrated use, the standard functions are simply called with no prefix. With the side-by-side use, the functions of this library can be called instead of the standard library implementation by adding the prefix “SEGGER_”; the API is otherwise identical.


**Implicit Functions:** When there is no hardware support for basic operations, such as multiplication of two floats, the compiler adds calls to helper functions that emulate the operation with the available resources. These implicit functions are defined by the toolchain’s and architecture’s EABI.

With the integrated use of the specific variants of emFloat, the compiler’s implicit calls are resolved by the library. With the side-by-side use, the functions can be called explicitly instead of writing the standard operation in code.


**Explicit function performance**

For verification and benchmarking of the explicit functions, the IEEE-754 Floating-Point Library Benchmark application is available. It measures the performance and precision of the implementation. For each function, significant input values have been chosen for best coverage.

The tables below show the results of the benchmark application running the C implementation on different architectures.

**Performance on Arm:** The benchmark has been run on an Arm Cortex-M4 microcontroller (NXP K66FN2M0), running from RAM.

| sinf() | Bit Error | Cycles |
|---|---|---|
| sin(1e-4) | 0.00 | 21 |
| sin(1e-3) | 0.00 | 55 |
| sin(1e-2) | 0.00 | 55 |
| sin(1e-1) | 0.00 | 54 |
| sin(1) | 0.00 | 139 |
| sin(1.47264147) | 0.00 | 138 |
| sin(1.57079089) | 0.00 | 138 |
| sinf(3.14158154) | 0.00 | 106 |
| sin(39.0735703) | 0.00 | 148 |
| sin(355) | 0.00 | 152 |
| sin(1048582.75) | 0.00 | 176 |
| sin(100000*Pi) | 0.00 | 151 |
| sin(1e10) | 0.00 | 187 |
| sin(1e38) | 0.00 | 186 |
| Total | 0.00 | 1706 |

| cosf() | Bit Error | Cycles |
|---|---|---|
| cos(1e-4) | 0.00 | 3 |
| cos(1e-3) | 0.00 | 48 |
| cos(1e-2) | 0.00 | 48 |
| cos(1e-1) | 0.00 | 48 |
| cos(1) | 0.00 | 136 |
| cos(1.47264147) | 0.00 | 103 |
| cos(1.57079780) | 0.00 | 103 |
| cos(6.28319073) | 0.00 | 136 |
| cos(355) | 0.00 | 180 |
| cos(100000*Pi) | 0.00 | 180 |
| cos(1e10) | 0.00 | 183 |
| cos(1e38) | 0.00 | 182 |
| Total | 0.00 | 1350 |

| tanf() | Bit Error | Cycles |
|---|---|---|
| tan(1e-4) | 0.00 | 25 |
| tan(1e-3) | 0.00 | 74 |
| tan(1e-2) | 0.00 | 74 |
| tan(1e-1) | 0.00 | 73 |
| tan(1) | 0.00 | 258 |
| tan(6.45840693) | 0.00 | 258 |
| tan(355) | 0.00 | 282 |
| tan(100000*Pi) | 0.00 | 273 |
| tan(1e10) | 0.00 | 304 |
| tan(1e38) | 0.00 | 321 |
| Total | 0.00 | 1942 |

| expf() | Bit Error | Cycles |
|---|---|---|
| expf(0) | 0.00 | 3 |
| expf(1e-5) | 0.00 | 44 |
| expf(1e-4) | 0.00 | 44 |
| expf(2e-4) | 0.00 | 44 |
| expf(4e-4) | 0.00 | 43 |
| expf(4.5e-4) | 0.00 | 44 |
| expf(1e-3) | 0.00 | 44 |
| expf(0.25123) | 0.00 | 81 |
| expf(0.55123) | 0.00 | 80 |
| expf(8.1) | 0.00 | 81 |
| expf(16.1) | 0.00 | 81 |
| Total | 0.00 | 589 |

| sinhf() | Bit Error | Cycles |
|---|---|---|
| sinhf(1e-5) | 0.00 | 22 |
| sinhf(1e-4) | 0.00 | 23 |
| sinhf(2e-4) | 0.00 | 23 |
| sinhf(4e-4) | 0.00 | 60 |
| sinhf(4.5e-4) | 0.00 | 59 |
| sinhf(1e-3) | 0.00 | 60 |
| sinhf(0.25123) | 0.00 | 60 |
| sinhf(0.55123) | 0.00 | 119 |
| sinhf(8.1) | 0.00 | 121 |
| sinhf(16.1) | 0.00 | 108 |
| Total | 0.00 | 655 |

| coshf() | Bit Error | Cycles |
|---|---|---|
| coshf(1e-5) | 0.00 | 28 |
| coshf(1e-4) | 0.00 | 28 |
| coshf(2e-4) | 0.00 | 29 |
| coshf(4e-4) | 0.00 | 48 |
| coshf(4.5e-4) | 0.00 | 48 |
| coshf(1e-3) | 0.00 | 47 |
| coshf(0.25123) | 0.00 | 48 |
| coshf(0.55123) | 0.00 | 111 |
| coshf(8.1) | 0.00 | 114 |
| coshf(16.1) | 0.00 | 100 |
| Total | 0.00 | 601 |

| tanhf() | Bit Error | Cycles |
|---|---|---|
| tanhf(0.25) | 0.00 | 66 |
| tanhf(1) | 0.00 | 108 |
| tanhf(10) | 0.00 | 18 |
| Total | 0.00 | 192 |

| logf() | Bit Error | Cycles |
|---|---|---|
| logf(1e-5) | 0.00 | 158 |
| logf(1024) | 0.00 | 100 |
| logf(4177.25) | 0.00 | 140 |
| Total | 0.00 | 398 |

**Performance on RISC-V:** The benchmark has been run on a RISC-V RV32IMAC microcontroller (GigaDevice GD32VF103), running from RAM.

| sinf() | Bit Error | Cycles |
|---|---|---|
| sin(1e-4) | 0.00 | 8 |
| sin(1e-3) | 0.00 | 70 |
| sin(1e-2) | 0.00 | 67 |
| sin(1e-1) | 0.00 | 67 |
| sin(1) | 0.00 | 182 |
| sin(1.47264147) | 0.00 | 193 |
| sin(1.57079089) | 0.00 | 196 |
| sinf(3.14158154) | 0.00 | 153 |
| sin(39.0735703) | 0.00 | 193 |
| sin(355) | 0.00 | 219 |
| sin(1048582.75) | 0.00 | 236 |
| sin(100000*Pi) | 0.00 | 214 |
| sin(1e10) | 0.00 | 255 |
| sin(1e38) | 0.00 | 248 |
| Total | 0.00 | 2301 |

| cosf() | Bit Error | Cycles |
|---|---|---|
| cos(1e-4) | 0.00 | 10 |
| cos(1e-3) | 0.00 | 50 |
| cos(1e-2) | 0.00 | 43 |
| cos(1e-1) | 0.00 | 43 |
| cos(1) | 0.00 | 186 |
| cos(1.47264147) | 0.00 | 158 |
| cos(1.57079780) | 0.00 | 161 |
| cos(6.28319073) | 0.00 | 190 |
| cos(355) | 0.00 | 252 |
| cos(100000*Pi) | 0.00 | 251 |
| cos(1e10) | 0.00 | 245 |
| cos(1e38) | 0.00 | 257 |
| Total | 0.00 | 1846 |

| tanf() | Bit Error | Cycles |
|---|---|---|
| tan(1e-4) | 0.00 | 7 |
| tan(1e-3) | 0.00 | 92 |
| tan(1e-2) | 0.00 | 87 |
| tan(1e-1) | 0.00 | 86 |
| tan(1) | 0.00 | 403 |
| tan(6.45840693) | 0.00 | 397 |
| tan(355) | 0.00 | 444 |
| tan(100000*Pi) | 0.00 | 430 |
| tan(1e10) | 0.00 | 458 |
| tan(1e38) | 0.00 | 483 |
| Total | 0.00 | 2887 |

| expf() | Bit Error | Cycles |
|---|---|---|
| expf(0) | 0.00 | 10 |
| expf(1e-5) | 0.00 | 45 |
| expf(1e-4) | 0.00 | 41 |
| expf(2e-4) | 0.00 | 38 |
| expf(4e-4) | 0.00 | 38 |
| expf(4.5e-4) | 0.00 | 38 |
| expf(1e-3) | 0.00 | 38 |
| expf(0.25123) | 0.00 | 86 |
| expf(0.55123) | 0.00 | 89 |
| expf(8.1) | 0.00 | 88 |
| expf(16.1) | 0.00 | 86 |
| Total | 0.00 | 597 |

| sinhf() | Bit Error | Cycles |
|---|---|---|
| sinhf(1e-5) | 0.00 | 14 |
| sinhf(1e-4) | 0.00 | 14 |
| sinhf(2e-4) | 0.00 | 13 |
| sinhf(4e-4) | 0.00 | 67 |
| sinhf(4.5e-4) | 0.00 | 59 |
| sinhf(1e-3) | 0.00 | 59 |
| sinhf(0.25123) | 0.00 | 59 |
| sinhf(0.55123) | 0.00 | 137 |
| sinhf(8.1) | 0.00 | 133 |
| sinhf(16.1) | 0.00 | 112 |
| Total | 0.00 | 667 |

| coshf() | Bit Error | Cycles |
|---|---|---|
| coshf(1e-5) | 0.00 | 26 |
| coshf(1e-4) | 0.00 | 24 |
| coshf(2e-4) | 0.00 | 24 |
| coshf(4e-4) | 0.00 | 50 |
| coshf(4.5e-4) | 0.00 | 50 |
| coshf(1e-3) | 0.00 | 50 |
| coshf(0.25123) | 0.00 | 50 |
| coshf(0.55123) | 0.00 | 140 |
| coshf(8.1) | 0.00 | 139 |
| coshf(16.1) | 0.00 | 126 |
| Total | 0.00 | 679 |

| tanhf() | Bit Error | Cycles |
|---|---|---|
| tanhf(0.25) | 0.00 | 89 |
| tanhf(1) | 0.00 | 145 |
| tanhf(10) | 0.00 | 14 |
| Total | 0.00 | 248 |

| logf() | Bit Error | Cycles |
|---|---|---|
| logf(1e-5) | 0.00 | 265 |
| logf(1024) | 0.00 | 183 |
| logf(4177.25) | 0.00 | 240 |
| Total | 0.00 | 688 |

**Implicit function performance**

The following tables show the performance and code size of the Arm and RISC-V EABI floating-point functions.

The performance benchmark runs the speed-optimized implementation of the floating-point library (__SEGGER_RTL_OPTIMIZE +2).

The code size has been measured with size optimization (__SEGGER_RTL_OPTIMIZE -2). The speed-optimized configuration requires slightly more code.

**Performance on Arm:** The benchmarks have been done on an Arm Cortex-M4 microcontroller (NXP K66FN2M0), running from RAM, compiled with Embedded Studio (GCC).

| Group | Function | Average Cycles |
|---|---|---|
| Float, Math | __aeabi_fadd | 31.0 |
| | __aeabi_fsub | 39.9 |
| | __aeabi_frsub | 39.9 |
| | __aeabi_fmul | 26.0 |
| | __aeabi_fdiv | 53.0 |
| Float, Compare | __aeabi_fcmplt | 13.0 |
| | __aeabi_fcmple | 13.0 |
| | __aeabi_fcmpgt | 13.0 |
| | __aeabi_fcmpge | 13.0 |
| | __aeabi_fcmpeq | 7.0 |
| Double, Math | __aeabi_dadd | 54.5 |
| | __aeabi_dsub | 71.2 |
| | __aeabi_drsub | 71.2 |
| | __aeabi_dmul | 56.4 |
| | __aeabi_ddiv | 134.0 |
| Double, Compare | __aeabi_dcmplt | 14.0 |
| | __aeabi_dcmple | 14.0 |
| | __aeabi_dcmpgt | 14.0 |
| | __aeabi_dcmpge | 14.0 |
| | __aeabi_dcmpeq | 14.0 |
| Float, Conversion | __aeabi_f2iz | 9.0 |
| | __aeabi_f2uiz | 6.0 |
| | __aeabi_f2lz | 13.5 |
| | __aeabi_f2ulz | 12.0 |
| | __aeabi_i2f | 10.5 |
| | __aeabi_ui2f | 7.5 |
| | __aeabi_l2f | 19.0 |
| | __aeabi_ul2f | 13.8 |
| | __aeabi_f2d | 9.0 |
| Double, Conversion | __aeabi_d2iz | 10.0 |
| | __aeabi_d2uiz | 8.0 |
| | __aeabi_d2lz | 16.5 |
| | __aeabi_d2ulz | 13.5 |
| | __aeabi_i2d | 12.0 |
| | __aeabi_ui2d | 8.0 |
| | __aeabi_l2d | 17.9 |
| | __aeabi_ul2d | 12.9 |
| | __aeabi_d2f | 11.0 |

**EABI function performance on RISC-V**

The benchmarks have been done on a GD32VD107 (RV32IMAC), running from Flash, compiled with Embedded Studio (GCC), optimized for speed.

| Group | Function | Cycles, Min | Cycles, Max | Cycles, Avg |
|---|---|---|---|---|
| Float, Math | __addsf3 | 45 | 60 | 49.5 |
| | __subsf3 | 42 | 84 | 62.2 |
| | __mulsf3 | 37 | 57 | 39.3 |
| | __divsf3 | 67 | 70 | 67.0 |
| Float, Compare | __ltsf2 | 11 | 15 | 11.0 |
| | __lesf2 | 10 | 14 | 10.0 |
| | __gtsf2 | 10 | 17 | 10.0 |
| | __gesf2 | 11 | 14 | 11.0 |
| | __eqsf2 | 10 | 13 | 10.0 |
| | __nesf2 | 10 | 10 | 10.0 |
| Double, Math | __adddf3 | 52 | 89 | 62.8 |
| | __subdf3 | 60 | 123 | 82.8 |
| | __muldf3 | 68 | 88 | 75.0 |
| | __divdf3 | 192 | 204 | 197.2 |
| Double, Compare | __ltdf2 | 15 | 20 | 16.0 |
| | __ledf2 | 15 | 19 | 16.0 |
| | __gtdf2 | 15 | 20 | 16.1 |
| | __gedf2 | 15 | 19 | 16.1 |
| | __eqdf2 | 14 | 17 | 14.0 |
| | __nedf2 | 14 | 14 | 14.0 |
| Float, Conversion | __fixsfsi | 14 | 14 | 14.0 |
| | __fixunssfsi | 13 | 13 | 13.0 |
| | __fixsfdi | 20 | 29 | 23.2 |
| | __fixunssfdi | 15 | 23 | 18.9 |
| | __floatsisf | 28 | 47 | 32.6 |
| | __floatunsisf | 28 | 42 | 33.0 |
| | __floatdisf | 39 | 66 | 49.1 |
| | __floatundisf | 35 | 58 | 44.1 |
| | __extendsfdf2 | 14 | 18 | 14.1 |
| Double, Conversion | __fixdfsi | 9 | 20 | 16.8 |
| | __fixunsdfsi | 9 | 14 | 13.8 |
| | __fixdfdi | 9 | 34 | 26.9 |
| | __fixunsdfdi | 9 | 25 | 21.5 |
| | __floatsidf | 28 | 47 | 31.6 |
| | __floatunsidf | 19 | 32 | 23.9 |
| | __floatdidf | 30 | 73 | 45.1 |
| | __floatundidf | 27 | 62 | 39.3 |
| | __truncdfsf2 | 25 | 36 | 25.1 |

**EABI function code size on RISC-V**

For function code size, the floating-point library has been compiled with optimization for size, targeting RV32IMC.

| Group | Function | Code Size [Bytes] |
|---|---|---|
| Float, Math | __addsf3 | 410 |
| | __subsf3 | 10 |
| | __mulsf3 | 178 |
| | __divsf3 | 184 |
| Float, Compare | __ltsf2 | 58 |
| | __lesf2 | 54 |
| | __gtsf2 | 50 |
| | __gesf2 | 62 |
| | __eqsf2 | 44 |
| | __nesf2 | — |
| Double, Math | __adddf3 | 724 |
| | __subdf3 | 10 |
| | __muldf3 | 286 |
| | __divdf3 | 278 |
| Double, Compare | __ltdf2 | 70 |
| | __ledf2 | 70 |
| | __gtdf2 | 70 |
| | __gedf2 | 70 |
| | __eqdf2 | 52 |
| | __nedf2 | — |
| Float, Conversion | __fixsfsi | 74 |
| | __fixunssfsi | 50 |
| | __fixsfdi | 146 |
| | __fixunssfdi | 98 |
| | __floatsisf | 66 |
| | __floatunsisf | 52 |
| | __floatdisf | 96 |
| | __floatundisf | 70 |
| | __extendsfdf2 | 64 |
| Double, Conversion | __fixdfsi | 84 |
| | __fixunsdfsi | 54 |
| | __fixdfdi | 146 |
| | __fixunsdfdi | 96 |
| | __floatsidf | 46 |
| | __floatunsidf | 34 |
| | __floatdidf | 128 |
| | __floatundidf | 106 |
| | __truncdfsf2 | 130 |

*Notes: __subsf3 tail-calls __addsf3, __subdf3 tail-calls __adddf3. __nesf2 is an alias of __eqsf2, __nedf2 is an alias of __eqdf2.*

**Size comparison**

To demonstrate how competitive the floating-point library is from a size perspective, a level-playing-field benchmark is available. Quite simply, it calls a selection of the explicit floating-point library functions from main().

Because all compilers conform to the core-specific EABI, a single application and startup code can link against multiple vendor-provided runtime systems, for both the Arm and RISC-V variants. This simplifies swapping libraries, both in the benchmark and in other projects.

The benchmark project uses exactly the same object modules with different runtimes for each vendor. The project uses Embedded Studio and a standard project template to build for different architectures.

The Arm applications are built with the SEGGER Compiler and the SEGGER Linker. The RISC-V applications use the GNU tools which are included in Embedded Studio.

**Size comparison on Arm**

emFloat has been tested against:

- IAR Embedded Workbench 8.50
- GNU Arm Embedded 9-2020-q2-update
- Arm Compiler 6.14, standard Arm libraries (flush-to-zero mode)
- Arm Compiler 6.14, MicroLib (non-conforming IEEE implementation)
- TI Code Composer 20.2.1 LTS

The table shows the results for Armv7-M with full software floating point.

| Library | ROM Usage |
|---|---|
| SEGGER | 10,628 bytes |
| IAR | 17,656 bytes |
| AC6 MicroLib | 18,668 bytes |
| AC6 | 21,514 bytes |
| GNU | 33,809 bytes |
| CCS | 34,274 bytes |

*The overhead from the benchmark application is 306 bytes of Flash.*

**Size comparison on RISC-V**

For RISC-V, emFloat has been tested against:

- the standard 2019-08-gcc-8.3.0 toolset (maintained by SiFive)

These are the results for RV32IMC.

| Library | ROM Usage |
|---|---|
| SEGGER | 12,644 bytes |
| GNU | 47,176 bytes |