Azequia

A Thread-Based Implementation of the Message Passing Interface (MPI-1) Standard



AzequiaMPI is a thread-based full implementation of the Message Passing Interface (MPI-1) standard. It is being built on the knowledge gained from studying the code and design of MPICH2, Open MPI and other implementations, such as TOMPI, TMPI, MPC-MPI and several more. The most important characteristic of the implementation is that it builds every MPI rank as a thread, not as an operating system process.

AzequiaMPI has three main layers:

  • The kernel, called IDSP or Azequia, provides point-to-point communication and RPC-based group and thread management. It comes in two versions: a BLK kernel, implementing a concurrency model based on POSIX 1003.1c (Pthreads) primitives, and an LFQ kernel, implementing synchronization through non-blocking lock-free structures. The kernel can also be used directly from applications, bypassing the MPI interface.
  • INET (Network Interface), which implements network facilities. It is under development, based on the study of and experience with the Open MPI BTLs. Our goal is to provide TCP/IP and InfiniBand support.
  • The MPI interface, built upon the two layers above, provides MPI semantics to applications.

AzequiaMPI layers design

AzequiaMPI uses other tools:

  • The Process Manager Interface (PMI, from MPICH2) through MPD (Multi-Purpose Daemon), used for launching applications in cluster environments.
  • HWLOC (Hardware Locality), from the Open MPI project, used for binding threads to processors.
  • Etc.


{tab-azequia=Design model}

AzequiaMPI implements an MPI endpoint as a thread, launching one container process on each shared-memory machine. The figure shows current MPI implementation models (P is a process with a rank, T is a thread with a rank, and th is a thread without MPI weight).

Process-based MPI

Hybrid model

Thread-based MPI

{tab-azequia=Current work}

1. Improving several aspects of the implementation:

  • Collective operations
  • Static variable issues, through precompilation and other techniques
  • Datatype and packing in shared memory thread communication
  • Adding MPI-2 features supported by threads (e.g. One Sided Communication)
  • Profiling tools

2. Adding network (TCP/IP and Infiniband) support to the implementation.

3. AzequiaMPI is now being ported to a TCP/IP FPGA cluster on top of PowerPC hardware cores.


AzequiaMPI runs on different platforms:

  1. Linux based clusters.
  2. Heterogeneous DSP/FPGA multicomputers from Sundance.
  3. Xilkernel/Petalinux operating systems on top of the Microblaze soft processor in Xilinx FPGAs.
  4. Texas Instruments C6000 DSPs on top of DSP/BIOS. To implement Pthreads semantics upon DSP/BIOS semaphores, we developed OSI (Operating System Interface).

You can get the latest versions from the download section.


AzequiaMPI performance has been compared to that of other implementations, both process- and thread-based. Latency and bandwidth in shared memory take advantage of the threads' common address space. A set of figures is presented below, showing snapshots of the development.

{tab-perfomance=Microbenchmarks performance}

July 2011. We are studying one-copy mechanisms for communicating processes in shared-memory multicore computers, and the algorithms of several implementations, on an Intel Nehalem dual-socket quad-core E5620 (shared memory) with a 12 MB cache. We explore SMARTMAP, LiMIC2 and KNEM under Open MPI and MPICH2, as well as other thread-based implementations such as MPC-MPI.

Figures show results from initial research on latency and bandwidth for point-to-point communication and some collectives with one-to-all and all-to-one patterns. Note that ranks are bound to processors in round-robin order (rank i to core i, no SMT), and collectives are executed with 8 ranks. Note as well the different color order across figures:

Netpipe pingpong latency Netpipe pingpong bandwidth
IMB broadcast latency IMB broadcast bandwidth
IMB Reduce bandwidth IMB Gather bandwidth

We are working on the design and implementation of new algorithms for collective communication that take advantage of threads in shared memory.

{tab-perfomance=Euroben Matrix Product Performance}

July 2011. The next figures show Euroben benchmark performance measurements. The Euroben benchmark computes a matrix product using different communication mechanisms provided by MPI. The benchmark gives some weight to communication time, so MPI implementations with higher communication performance can be expected to improve.

All tests are run with 8 ranks on an Intel Nehalem (E5620) dual-socket quad-core machine with 12 MB of L3 cache. All ranks are bound to processors (rank i to processor i, no SMT).

The next figures show a matrix product using standard communication (MPI_Send/MPI_Recv) of data between tasks. Relative performance compared to AzequiaMPI is also shown. The X axis shows the square matrix side size. A table with the bytes sent for each matrix side size is provided.

Side size  Bytes sent
16 256 B
32 1 KB
64 4 KB
128 16 KB
256 64 KB
512 256 KB
1024 1 MB
2048 4 MB
4096 16 MB
Euroben 1D Matrix Product (standard) | Euroben 1D Matrix Product (standard), relative

The next figures show a matrix product where data is communicated between tasks by the MPI broadcast collective operation. Relative performance compared to AzequiaMPI is shown as well. All parameters are the same as in the figures above.

Euroben 1D Matrix Product (broadcast) | Euroben 1D Matrix Product (broadcast), relative

{tab-perfomance=HPL and NAS benchmarks performance}

July 2011. The next figures show realistic performance measurements with the High Performance Linpack (HPL) and NAS Parallel Benchmarks (NPB). The low communication weight in these benchmarks results in virtually no differences between implementations.

Both benchmarks are run on an Intel Nehalem dual-socket quad-core machine (E5620) with 12 MB of L3 cache. 8 tasks run with binding of tasks to processors (rank i to processor i, no SMT).

Although we chose the NPB SP multi-zone benchmark, flat MPI is used, with 8 ranks and no OpenMP threads.

HPL runs with the following parameters: N=20000, NB=16, PxQ=2x4 and 1-ring broadcast.
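For reference, those parameters map onto the HPL.dat input file roughly as follows (a sketch showing only the relevant fields; the remaining lines of the file keep their defaults):

```
1            # of problems sizes (N)
20000        Ns
1            # of NBs
16           NBs
1            # of process grids (P x Q)
2            Ps
4            Qs
1            # of broadcast
0            BCASTs (0=1-ring)
```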

NPB SP Multizone class D HPL 2x4 1-ring


Additional information

We are interested in High Performance Computing software research and development. The International Exascale Software Project group has defined a roadmap for coordinating efforts of the international open source software community to create a software environment that exploits exascale systems in the coming years. We are interested in aspects of this roadmap and in their development within our projects.

1. We are studying the influence of software on power consumption. Our goal is to find and apply software techniques for saving energy in supercomputers. Initially, we are evaluating blocking and busy waiting in synchronization and communication, using the BLK and LFQ versions of the AzequiaMPI kernel, respectively. We are collaborating with CenitS/Computaex in this field.

CenitS - Computaex logo

2. We are interested in applications using MPI. Our goal is to improve AzequiaMPI performance and scalability to support real-world applications efficiently on different platforms.

3. Some HPC useful links are:

a) PRACE. Partnership for Advanced Computing in Europe
b) HPC-Europa2. Pan-European Research Infrastructure for High Performance Computing
c) IESP. International Exascale Software Project
d) PlanetHPC. Research and Innovation Roadmap for High Performance Computing in Europe

PRACE logo HPC-Europa2 logo IESP logo PlanetHPC logo



We are writing a User and Installation Guide for AzequiaMPI. Coming soon...

Related papers



Please read the license information before downloading AzequiaMPI.


You can download the latest version under development from the svn repository: It can be accessed via web at: If you want to get involved and apply changes, please send an email to get an account.

If you have any comments or proposals about AzequiaMPI or related software, please contact us through the forum or by email.

A wiki has been created (mostly in Spanish) for internal documentation about development and the platforms and software in use.


The latest stable release for Linux multicore (not yet clusters) can be downloaded from here:

2014, June 10th: 2.3.0 AzequiaMPI (Downloads: 20)

Quick Ref. Guide (Downloads: 24)

2014, January 29th: 2.2.4 AzequiaMPI (Downloads: 15)
Readme.txt Changelog.txt


We have developed a version for FPGAs on top of PetaLinux/Microblaze:

2010, Dec 10th: 2.1.2_fpga Azequia (Downloads: 41)
Readme.txt Changelog.txt
2010, Dec 10th: 1.4.0_fpga AzequiaMPI (Downloads: 39)
Readme.txt Changelog.txt


We have ported AzequiaMPI to DSP multicomputers. A version for the SMT310Q multicomputer from Sundance, with Texas Instruments C64x DSPs (SMT361, SMT361A, etc.), is available on the Azequia page.


Neither the University of Extremadura nor any of its employees makes any
warranty, express or implied, or assumes any legal liability or responsibility
for the accuracy, completeness, or usefulness of any information, apparatus,
product, or process disclosed, or represents that its use would not infringe
privately owned rights.


Juan Carlos Díaz Martín


juancarl (arroba) unex (punto) es
PhD in Computer Science
Associate Professor
Department of Computer and Communications Technology

Juan Antonio Rico Gallego


jarico (arroba) unex (punto) es
Advanced Studies Diploma
Lecturer
Department of Computer Systems and Telematics Engineering

Jesús María Álvarez Llorente


llorente (arroba) unex (punto) es
Advanced Studies Diploma
Senior Lecturer
Area of Languages and Computer Systems


