Research

Azequia

{tab=Description}

Introduction

In the high-performance computing world, MPI is the de facto standard for message passing among clusters of workstations. AzequiaMPI is an ongoing, thread-based, fully conformant implementation of the MPI-1 standard for MMU-less processors. It runs on the entire Texas Instruments TMS320C6000 family of digital signal processors, as well as on MicroBlaze and PowerPC 405, which makes it a useful tool in FPGA-based distributed image processing. AzequiaMPI also runs on top of Linux.

Azequia was originally conceived as an alternative communication solution for Sundance multicomputers of digital signal processors, such as the one shown in Figure 1. Sundance provides embedded, PCI and CompactPCI versions of the carrier boards.

Fig. 1. Sundance multicomputer composed of a CompactPCI SMT310Q board with four TIM modules.

These machines fit in the category of high-performance embedded computing (HPEC) environments. Their modularity and scalability attracted our attention years ago. A multicomputer can be built by combining one or more carrier boards. Each board provides four sites where a TIM module is inserted. TIM (for Texas Instruments Module) is just a physical connection specification: a TIM-conformant module can be plugged into a TIM-conformant board. Sundance provides I/O, DSP and FPGA TIM modules.

A DSP module contains a high-end Texas Instruments DSP processor from the TMS320C6000 family. These DSPs are widely used in high-performance real-time signal processing applications. They natively run DSP/BIOS, a small proprietary monoprocessor real-time operating system (RTOS).

Diamond is the proprietary distributed operating system for Sundance machines. Under Diamond, a distributed application is an immutable graph of tasks (nodes) and data streams (arrows), statically configured by hand. Every task has a vector of input ports and a vector of output ports that connect tasks by name. These vectors are passed to the main routine of the task. A program called the configurer, running in the host PC, combines task image files to form the executable that it later loads on each processor. A user-supplied textual configuration file drives the configurer. It specifies the hardware (available processors and the physical links connecting them), the software (tasks and how they are connected), and how tasks are assigned to processors. Once running, a task sends, for instance, the upper-case version of character ch to output port 0 by invoking chan_out_word(toupper(ch), out_ports[0]);
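To make the port model concrete, here is a minimal sketch of a Diamond-style task body along the lines just described. The entry signature follows the 3L convention of passing the port vectors to main; it may differ between Diamond versions.

    #include <chan.h>   /* Diamond channel primitives */
    #include <ctype.h>

    /* Sketch of a Diamond task (3L-style entry convention; details may
       vary by version). The configurer supplies the port vectors, so
       the task never names a peer task or a processor. */
    int main(int argc, char *argv[], char *envp[],
             CHAN *in_ports[],  int ins,
             CHAN *out_ports[], int outs)
    {
        int ch;
        for (;;) {
            chan_in_word(&ch, in_ports[0]);            /* word from input port 0 */
            chan_out_word(toupper(ch), out_ports[0]);  /* word to output port 0  */
        }
    }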

Note that no addressing is involved, which makes a communication independent of the rank of the receiver task or its specific machine. As a result, the source code of a task is independent of the graph in which it participates. Static configuration ensures that the real-time application will keep enough processing power and communication bandwidth during its lifetime, but it prohibits something as simple and useful as forking new applications at run time. Another severe limitation is that sporadic communication between two unconnected tasks is not possible. Finally, timeouts in communications are not considered at all. In our view, Diamond puts in perspective the challenges of defining and building the right real-time distributed software for the HPEC field.


Fig. 2. AzequiaMPI versus Diamond 3.1.10 bandwidth test for increasing message sizes. Diamond only outperforms AzequiaMPI when it reserves the physical SDB link for the communicating tasks.

In contrast with Diamond, Azequia is not a closed solution for mapping a distributed algorithm onto a multi-DSP, but an open library upon the native RTOS, be it Xilkernel, DSP/BIOS or any other, which enables advanced communication middleware such as AzequiaMPI. Figure 2 shows the performance of AzequiaMPI against Diamond.

{tab-segundo=Threads}

Threads

An MPI application is a graph of processing nodes. Running an MPI node as a process imposes some disadvantages: firstly, process context switching and synchronization are expensive; secondly, message passing between two MPI nodes in the same machine must go through a system buffer, and buffer copying degrades the communication efficiency. The MPI node, notwithstanding, is a process in all the mainstream implementations of MPI, such as MPICH and Open MPI. In contrast, the MPI node is a thread in a thread-based implementation. Thread-based MPI is not a new concept. Paper [7] discusses TOMPI (for Threads-Only MPI), which follows an approach similar to AzequiaMPI. TOMPI, however, is just an early proof-of-concept prototype that implements a reduced set of MPI primitives. As it does not support multicomputers, it has not had any influence on the Azequia design.
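The single-copy advantage is easy to see in code. The following minimal sketch (our illustration, not AzequiaMPI source) shows how, with ranks running as threads of one process, the receiver can copy directly out of the sender's user buffer, with no intermediate system buffer:

    #include <pthread.h>
    #include <string.h>

    /* Minimal sketch (not AzequiaMPI source): with ranks as threads,
       a pending-send descriptor can expose the sender's buffer and the
       receiver copies from it directly -- one copy, no system buffer. */
    typedef struct {
        const void     *buf;    /* sender's user buffer, directly visible */
        size_t          len;
        int             posted; /* a send is pending                      */
        pthread_mutex_t lock;
        pthread_cond_t  ready;  /* signalled when a send is posted        */
    } pending_send_t;

    void post_send(pending_send_t *p, const void *buf, size_t len) {
        pthread_mutex_lock(&p->lock);
        p->buf = buf; p->len = len; p->posted = 1;
        pthread_cond_signal(&p->ready);
        pthread_mutex_unlock(&p->lock);
    }

    size_t do_recv(pending_send_t *p, void *dst, size_t max) {
        pthread_mutex_lock(&p->lock);
        while (!p->posted)
            pthread_cond_wait(&p->ready, &p->lock);
        size_t n = p->len < max ? p->len : max;
        memcpy(dst, p->buf, n);   /* the single copy */
        p->posted = 0;
        pthread_mutex_unlock(&p->lock);
        return n;
    }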

Some recent research projects, such as RSC, aimed at, or at least considered, porting the full MPICH to uClinux ([5]). This task, however, is not trivial, because MPICH is process-based and uClinux does not support processes. In fact, we are not aware that it has been successfully completed yet. A similar migration effort, also based on MPICH, is [9], likewise without visible results.

AzequiaMPI needs a Pthreads layer, which in most cases requires an underlying operating system. Depending on the platform, the operating system is DSP/BIOS, Xilkernel, uClinux or Linux. As the DSP/BIOS interface has nothing to do with the Pthreads standard, we decided to write a partially conformant Pthreads library upon DSP/BIOS. This library is OSI, for Operating System Interface. The timed counting semaphore is the only relevant locking abstraction of DSP/BIOS. Though it is well known that implementing condition variables out of simple primitives like semaphores is surprisingly tricky ([8]), we have managed to develop OSI as a monolithic monitor that currently provides 39 Pthreads functions. Azequia uses most of them; AzequiaMPI uses just mutexes and pthread_self. OSI is also necessary upon Xilkernel because, although partially Pthreads-conformant, it lacks essential services such as condition variables, thread-specific data and an absolute clock. Table I shows the relevant internal features of both kernels.
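For illustration, the sketch below (ours, not the OSI source; the osi_ names are made up) shows the core of a condition variable built from a counting semaphore and a mutex, in the spirit of [8]. The gap between releasing the monitor mutex and blocking on the semaphore is exactly where the well-known subtleties concentrate.

    #include <pthread.h>
    #include <semaphore.h>

    typedef struct {
        sem_t           queue;    /* pending wakeups              */
        int             waiters;  /* threads blocked in cond_wait */
        pthread_mutex_t lock;     /* protects the waiter counter  */
    } osi_cond_t;

    void osi_cond_init(osi_cond_t *c) {
        sem_init(&c->queue, 0, 0);
        c->waiters = 0;
        pthread_mutex_init(&c->lock, NULL);
    }

    /* 'm' is the monitor mutex, held by the caller on entry. */
    void osi_cond_wait(osi_cond_t *c, pthread_mutex_t *m) {
        pthread_mutex_lock(&c->lock);
        c->waiters++;
        pthread_mutex_unlock(&c->lock);

        pthread_mutex_unlock(m);  /* release the monitor...            */
        sem_wait(&c->queue);      /* ...then block: this window is the */
                                  /* tricky part analysed in [8]       */
        pthread_mutex_lock(m);    /* reacquire before returning        */
    }

    void osi_cond_signal(osi_cond_t *c) {
        pthread_mutex_lock(&c->lock);
        if (c->waiters > 0) {     /* wake one waiter, if any */
            c->waiters--;
            sem_post(&c->queue);
        }
        pthread_mutex_unlock(&c->lock);
    }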

 

             Join    Condition variable    Mutex    Semaphore    Thread-specific data
Xilkernel    Yes     No                    Yes      Yes          No
DSP/BIOS     No      No                    Yes      Yes          Yes

Table I. Services of OSI target operating systems.

{tab-segundo=Azequia Design}

Azequia Design

AzequiaMPI follows a two-layer design. Figure 3 shows the protocol stack. The Azequia layer knows nothing about the MPI standard, which is provided by the AzequiaMPI layer. The latter exports the standard MPI interface and implements MPI-specific issues such as communicators, topologies, data types, etc. Primitives such as MPI_Send, MPI_Recv, etc., are practically mapped onto their Azequia COM counterparts.


Fig. 3. AzequiaMPI internal design.

Azequia is composed of six interfaces. The first one, COM, provides the twelve combinations of (Blocking, Non-blocking, Persistent) x (Synchronous, Standard, Buffered, Ready) communication modes of MPI. Besides, it extends them with a timeout parameter. The second one, GRP, is in charge of the "process management" functionality, creating and destroying the threads that compose a distributed application across the available processors. The third interface is THR, which manages the local threads. RPC is the Remote Procedure Call facility of Azequia. It allows GRP, for instance, to create and start a thread in a given remote machine. The networking block consists of the NET and LNK modules. LNK connects two adjacent processors, wired through an SDB or SHB link. The NET layer is a non-reliable connectionless transport service between the multicomputer TIMs.
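As a hypothetical illustration of this layering (the identifiers COM_send, COM_TIMEOUT_FOREVER, rank_to_addr and datatype_size below are ours, not the actual Azequia API), an MPI_Send-style primitive could reduce to its timed COM counterpart roughly as follows:

    /* Hypothetical sketch of the AzequiaMPI -> Azequia mapping.
       All COM_*, rank_to_addr and datatype_size names are illustrative. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        /* AzequiaMPI resolves the (communicator, rank) pair to an
           Azequia-level address; COM itself knows nothing about MPI. */
        int bytes = count * datatype_size(datatype);

        /* A blocking standard-mode send maps to the COM standard mode
           with an infinite timeout, matching MPI blocking semantics.  */
        return COM_send(rank_to_addr(comm, dest), tag, buf, bytes,
                        COM_TIMEOUT_FOREVER);
    }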

{tab-segundo=Memory footprint and performance}

Memory footprint and performance

We conducted some tests (Figure 4) to compare the performance of AzequiaMPI, TOMPI and MPICH in a single Linux PC (shared memory). Note, on the one hand, how the thread-based implementations double the performance of MPICH due to the single-copy principle and, on the other hand, how the performance of AzequiaMPI is quite similar to that of TOMPI, even though the former supports the whole standard and remote machines.


Fig. 4. Performance of thread-based versus process-based MPI implementations in shared memory.

Table II shows the memory footprint of a minimal MPI application in a Linux PC. Figures were obtained by directly applying the size command to the executable images. Bear in mind that MPICH makes extensive use of malloc from the .bss segment, something that AzequiaMPI does not. Even so, the MPICH text and data sizes double those of AzequiaMPI. In particular, OSI shows a similar memory footprint on DSP/BIOS and Xilkernel, less than 8 KB.
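For instance, the AzequiaMPI row of Table II corresponds to size output of the following form (GNU size, Berkeley format; the executable name is illustrative):

       text    data     bss     dec     hex filename
      88910    1152   97632  187694   2dd2e azq_app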

 

                     .text     .data     .bss     Total
AzequiaMPI appl.     88910      1152    97632    187694
MPICH appl.         209465      2276    98688    310429

Table II. Memory footprint of a minimal application using AzequiaMPI versus MPICH on a Pentium Linux desktop.

{tab-segundo=AzequiaMPI and reconfigurable computing}

AzequiaMPI and reconfigurable computing

Xilkernel is a small, robust and modular monoprocessor kernel for MicroBlaze and PowerPC. Highly integrated with the Xilinx development tools, it is apparently the operating system we need to run AzequiaMPI in reconfigurable hardware. We first got involved in the field of reconfigurable computing in the context of the Hesperia project ([4]). Our objectives are, firstly, porting AzequiaMPI to the MicroBlaze and PowerPC 405 processors; secondly, running it on reconfigurable clusters-on-chip (MPSoC) on Xilinx FPGA platforms; and, finally, making DSPs and these clusters interoperate. One of our working platforms is a National Instruments CompactPCI rack with two CompactPCI SMT300Q carrier boards (see Figure 1). Each board carries a mixture of SMT395-VP30-6 DSP and SMT338-VP30-6 FPGA modules. Each FPGA module contains a Xilinx VirtexIIPro-VP30-FF896-6 FPGA. Indeed, we have recently ported AzequiaMPI to Xilkernel/PowerPC on these FPGAs.

 

References

[1] Juan A. Rico Gallego, Jesús M. Álvarez Llorente, Juan C. Díaz Martín, Francisco Perogil Duque, “A Network Service for DSP Multicomputers”, Lecture Notes in Computer Science, Springer, Volume 5022/2008. pp. 169-172.

[2] http://www.sundance.com. Accessed 18th July 2008.

[3] http://www.3l.com. Accessed 18th July 2008.

[4] http://www.proyecto-hesperia.org

[5] J. A. Williams, N. W. Bergmann and R. F. Hodson, "A Linux-based Software Platform for the Reconfigurable Scaleable Computing Project", 8th MAPLD International Conference, September 7-9, 2005.

[6] H. Tang, K. Shen, and T. Yang, “Program Transformation and Runtime Support for Threaded MPI Execution on Shared Memory Machines”, ACM Transactions on Programming Languages and Systems, Vol. 22, No. 4, Nov 2000, pp. 673-700.

[7] E. D. Demaine, "A Threads-Only MPI Implementation for the Development of Parallel Programs", International Symposium on High Performance Computing Systems, July 1997, pp. 153-163.

[8] Andrew D. Birrell, "Implementing Condition Variables with Semaphores". In "Computer Systems: Theory, Technology, and Applications: A Tribute to Roger Needham", pp. 29-37. Edited by Andrew J. Herbert and Karen Spärck Jones. Springer, 2004.

[9] M. Scarpino, “Implementing the Message Passing Interface (MPI) with FPGAs”, 9th MAPLD International Conference, September 26-29, 2006.

{tab=Documentation}

Related papers

{tab=Download}

Azequia is no longer maintained by the GIM group, but some level of support could be provided if you are interested. If so, please contact us by email (the address is protected against spam bots; JavaScript must be enabled to view it).

The latest release can be downloaded from here:

DATE             VERSION   FILE                      README                         CHANGES
2010, Apr 28th   2.1.2     Azequia (50 downloads)    Readme.txt (103 downloads)     -
2010, Mar 8th    2.1.1     Azequia (33 downloads)    -                              -
2009, Nov 12th   2.0.0     Azequia (19 downloads)    -                              -

{tab=Members}

Juan Carlos Díaz Martín


juancarl (at) unex (dot) es
PhD in Computer Science
Associate Professor
Department of Computer and Communications Technology


Juan Antonio Rico Gallego


jarico (at) unex (dot) es
Advanced Studies Diploma (DEA)
Assistant Professor
Department of Computer Systems and Telematics Engineering


Jesús María Álvarez Llorente


llorente (at) unex (dot) es
Advanced Studies Diploma (DEA)
Senior Lecturer
Area of Computer Languages and Systems


 

