Arm’s innovative Scalable Vector Extension (SVE) instructions are a new set of instructions designed for data-heavy supercomputing applications. Superficially they are similar to the existing NEON/AdvSIMD instructions in that they allow you to exploit data parallelism in algorithms by executing several identical operations at the same time over multiple lanes of a vector register. Where SVE differs from traditional vector processing is that the number of lanes is implementation defined. The clever part comes from a programming model that ensures assumptions about the number of lanes are not hard coded into the binaries. This means the same binary that runs on a mobile SoC, with a small number of lanes, can automatically take advantage of the wider vectors on an HPC cluster, without recompiling the program.

The SVE instruction set also provides some additional features which are worth taking advantage of. These include a fault-handling mechanism, the First Fault Register, which allows a speculative vector load to stop at a non-accessible page rather than faulting, so that reaching the edge of valid memory becomes just another boundary condition for your loop. Another feature of note is scatter/gather load/store support, which allows complex structures to be loaded into registers in fewer instructions.

While SVE enabled hardware is on its way, it’s going to be a while before anyone outside of a silicon lab can get their hands on real hardware. In the meantime, software developers want to be able to port and test their software now so they can be ready for the arrival of real SVE enabled chips. With the release of QEMU 3.0, they will be able to do just that. Thanks to QEMU’s linux-user emulation, we now have the ability to develop and test userspace Arm binaries utilising SVE.

If you want to learn more about the history of vector processing and the implications it has for dynamic binary translation, watch the talk I gave at KVM Forum last year.

If you are ready to start experimenting with these instructions then read on.


These instructions currently assume you are running a Linux x86_64 development environment. SVE support will be available in the upcoming QEMU 3.0 release, but for now you can use a fresh checkout of the master repository.

You will also need to have a working Docker setup as we are going to be hosting our development environment inside a Docker container. Ideally it should be set up in a developer-friendly way so the main user account can run Docker without a password. The QEMU project has some notes on potential Docker setups.

We’ll be using the kernel binfmt_misc support to automatically run AArch64 Linux binaries with QEMU. As the configuration is system wide, we need to configure it on the host. On Debian based systems it is usually enough to install the “qemu-user” package which will ensure everything is set up for you. On other systems you can run the helper script in the QEMU source tree:

$ sudo ./scripts/qemu-binfmt-conf.sh --qemu-path /usr/bin
Setting /usr/bin/qemu-alpha as binfmt interpreter for alpha
Setting /usr/bin/qemu-arm as binfmt interpreter for arm
Setting /usr/bin/qemu-aarch64 as binfmt interpreter for aarch64
Setting /usr/bin/qemu-aarch64_be as binfmt interpreter for aarch64_be
Setting /usr/bin/qemu-microblaze as binfmt interpreter for microblaze
Setting /usr/bin/qemu-microblazeel as binfmt interpreter for microblazeel

It is important that the interpreter for aarch64 binaries points at /usr/bin/qemu-aarch64 as this is where we will be installing the QEMU support binary in our docker images.
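If you want to double check the registration, binfmt_misc exposes each registered handler as a small status file under /proc. Here is a minimal sketch for inspecting it (the helper function is my own illustration, not part of QEMU):

```shell
# binfmt_interpreter: pull the interpreter path out of a binfmt_misc status
# file (the file contains lines like "enabled", "interpreter <path>", "flags:").
binfmt_interpreter() {
    sed -n 's/^interpreter //p'
}

# On a configured host you would check the aarch64 handler like this:
#   binfmt_interpreter < /proc/sys/fs/binfmt_misc/qemu-aarch64
# and expect it to print /usr/bin/qemu-aarch64
```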

Building QEMU

So the first thing we need to do is build a statically linked version of the qemu-aarch64 binary.

./configure --target-list=aarch64-linux-user --static

And then build:

make -j8

The -j option specifies how many jobs make runs in parallel; typically you set it to the number of cores your system has. If you are running on beefy server hardware, feel free to crank the number higher ;-)
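On Linux you can let the system pick the number for you. A small sketch (nproc is part of GNU coreutils; the fallback to 1 is just defensive):

```shell
# Use one make job per available core; fall back to 1 if nproc is missing.
JOBS=$(nproc 2>/dev/null || echo 1)
echo "make -j${JOBS}"
```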

Creating the base Docker Image

The next thing we are going to do is create the base docker image. This will be a complete AArch64 rootfs with basic tools and the ability to install more software. In this case I’ve chosen Ubuntu’s Bionic Beaver but you can bootstrap any Debian based environment. Importantly, we will want to install gcc-8 as that is the first version of gcc that can compile SVE enabled binaries, so on Debian you would want to track the rolling “testing” release.
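For reference, once gcc-8 is available the autovectorizer can be pointed at SVE with the -march flag. This is just a sketch: the file name and loop are my own illustration, and the exact tuning flags you want may vary.

```shell
# Write out a trivially vectorisable loop and show the flags that ask
# gcc-8 (inside the AArch64 container) to emit SVE code for it.
cat > saxpy.c <<'EOF'
void saxpy(float *y, const float *x, float a, long n)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* candidate for SVE auto-vectorisation */
}
EOF
CFLAGS="-O3 -march=armv8-a+sve"
echo "gcc-8 $CFLAGS -c saxpy.c"
```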

Fortunately all the details of the build are hidden behind QEMU’s build system. So we just need to execute this rather long make invocation:

make docker-binfmt-image-debian-ubuntu-bionic-arm64 \
  DEB_ARCH=arm64 DEB_TYPE=bionic DEB_URL= \
  EXECUTABLE=./aarch64-linux-user/qemu-aarch64 V=1

The V=1 will show you what’s going on under the hood as the bootstrapping process will take a while to complete. Once it has completed there should be an image in your local docker repository tagged qemu:debian-ubuntu-bionic-arm64. We can run it to verify that everything worked ok:

$ docker run --rm -it qemu:debian-ubuntu-bionic-arm64 /bin/bash
root@e68be4cb7b0f:/# uname -a
Linux e68be4cb7b0f 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
root@e68be4cb7b0f:/# exit

While uname reports the host kernel version, as far as the binaries are concerned they are running on an AArch64 machine. Another cool aspect of QEMU’s docker support is that it automatically creates a user in the container that is mapped to the current user on the host. This allows us to spin up a container process with the user’s privileges. This is useful when combined with Docker’s volume mounts to allow access to the host file-system, as it means any files created will be owned by the user and not the all-powerful “root” user.

$ docker run --rm -it -u $(id -u) -v $(pwd):$(pwd) -w $(pwd) qemu:debian-ubuntu-bionic-arm64 /bin/bash
alex.bennee@01cfe5adbbec:~/lsrc/qemu.git$ whoami
alex.bennee

Setting up our Development Environment

Now that we have a working base system, we can use it to create a new Docker image with all the tools we need to experiment with SVE. Up until this point we have been running with the --rm flag, which tells Docker to throw away any data inside the container when we are done. For the next set of steps we will run without the flag so we can save the final state of the container as a new image for later use. In this example we do this manually, but most Docker image creation is scripted through Dockerfiles, which is good practice; see for example QEMU’s aarch64 cross-compiler image.
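For completeness, here is roughly what the interactive steps below would look like captured in a Dockerfile. This is a sketch: the Dockerfile name is mine, and the base image tag assumes the qemu:debian-ubuntu-bionic-arm64 image built above.

```shell
# Generate a Dockerfile equivalent to the interactive setup steps.
cat > Dockerfile.sve <<'EOF'
FROM qemu:debian-ubuntu-bionic-arm64
RUN sed -i 's/main/main universe/' /etc/apt/sources.list && \
    apt update && \
    apt install -y gcc-8 g++-8 wget libtool autoconf gdb less && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 1000 && \
    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 1000
EOF
echo "docker build -t development:bionic-arm64-sve -f Dockerfile.sve ."
```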

So let’s get our image set up:

$ docker run -it qemu:debian-ubuntu-bionic-arm64 /bin/bash
root@c4dc9b5426ad:/# sed -i 's/main/main universe/' /etc/apt/sources.list
root@c4dc9b5426ad:/# apt update
root@c4dc9b5426ad:/# apt install -y gcc-8 g++-8 wget libtool autoconf gdb less
root@c4dc9b5426ad:/# update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 1000
root@c4dc9b5426ad:/# update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 1000
root@c4dc9b5426ad:/# gcc --version
gcc (Ubuntu 8-20180414-1ubuntu2) 8.0.1 20180414 (experimental) [trunk revision 259383]
root@c4dc9b5426ad:/# exit

We have done the following things:

  1. Added the universe repository (for gcc/g++-8)
  2. Updated the repo lists
  3. Installed gcc-8 and some other useful tools
  4. Updated the default gcc to use gcc-8 and confirmed the switch

We now want to save this container as an image we can re-use. First we need to identify the container we have just run:

$ docker ps -a
CONTAINER ID   IMAGE                            COMMAND     CREATED      STATUS                    NAMES
c4dc9b5426ad   qemu:debian-ubuntu-bionic-arm64  "/bin/bash" 2 hours ago  Exited (0) 2 seconds ago  adoring_goodall
e174632927ba   238519b386bc                     "/bin/sh"   5 hours ago  Exited (0) 5 hours ago    friendly_mayer
ad33c7bc7558   0da2cdd3455f                     "/bin/sh"   8 hours ago  Exited (0) 8 hours ago    silly_noether

The first entry on the list is the one we have just exited, so we commit that as a new image:

$ docker commit -m "setup arm64 env" adoring_goodall development:bionic-arm64-sve

The command prints a sha256 which is the new image id, but we have tagged it as development:bionic-arm64-sve so we can use a friendly name in the future. You can also query docker with -l or --latest and use the --format option to grab the container id automatically:

$ docker run -it development:bionic-arm64-sve /bin/bash
root@79d919733249:/# ... add more packages into container ...
root@79d919733249:/# exit
$ docker commit -m "another comment" $(docker ps -l --format  "{{.ID}}") development:bionic-arm64-sve

However you should take care if you are using a shared machine as the latest created container might not be the one you were working on. You can view the history of the image with:

$ docker history development:bionic-arm64-sve
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
20683aaafb93        2 minutes ago       /bin/bash                                       33.1MB              Add libncurses-dev
88109c68e0b3        29 minutes ago      /bin/bash                                       837MB               Add linux build-deps
9fc483157d5a        40 minutes ago      /bin/bash                                       400MB               Install dev tools
cef396d7e8c1        About an hour ago   /bin/sh -c #(nop)  LABEL com.qemu.dockerfile…   0B
f7eb4cfcaa7f        About an hour ago   /bin/sh -c id alex.bennee 2>/dev/null || use…   4.15MB
0aad8f469288        About an hour ago   /bin/sh -c cat /etc/apt/sources.list | sed "…   110B
35de98f23355        About an hour ago   /bin/sh -c /debootstrap/debootstrap --second…   189MB
d853fbb0490b        About an hour ago   /bin/sh -c sed -i 's/in_target mount/echo no…   42kB
add85c47ea47        About an hour ago   /bin/sh -c #(nop) ADD dir:ffaea2b15380f9991a…   135MB

All the /bin/sh steps were during the initial creation of the qemu:debian-ubuntu-bionic-arm64 image by the build system. The additional steps were where I ran bash and manually added things to my custom development:bionic-arm64-sve image.

Taking it for a Spin

Now we have a development environment, how do we use it? For this example I’m going to use the Cortex Strings library. This is a staging ground for various accelerated library functions using NEON, AdvSIMD and SVE instructions. We shall be building it with SVE enabled.

First of all, check out the git repository on the host system:

$ git clone https://git.linaro.org/toolchain/cortex-strings.git
$ cd cortex-strings.git

We shall now start our development container as the user with our host path mounted:

$ docker run --rm -it -u $(id -u) -v $(pwd):$(pwd) -w $(pwd) development:bionic-arm64-sve /bin/bash

The remaining steps are run inside our container:

user@container:~/cortex-strings.git $ ./
user@container:~/cortex-strings.git $ ./configure --with-sve --enable-static --disable-shared
user@container:~/cortex-strings.git $ make -j
user@container:~/cortex-strings.git $ make check -j

At the end you should be presented with the result of the self-test:

PASS: tests/test-memset
PASS: tests/test-strnlen
PASS: tests/test-strlen
PASS: tests/test-memchr
PASS: tests/test-strcpy
PASS: tests/test-strchr
PASS: tests/test-memmove
PASS: tests/test-strrchr
PASS: tests/test-memcmp
PASS: tests/test-memcpy
PASS: tests/test-strncmp
PASS: tests/test-strcmp
Testsuite summary for cortex-strings 1.1-2012.06~dev
# TOTAL: 12
# PASS:  12
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

Not sure we have just run SVE enabled code? Let’s examine it with gdb:

user@container:~/cortex-string.git $ gdb tests/test-strcpy
GNU gdb (Ubuntu 8.1-0ubuntu3)
Reading symbols from tests/test-strcpy...done.
(gdb) disassemble strcpy
Dump of assembler code for function strcpy:
  0x00000000000023c0 <+0>:     setffr
  0x00000000000023c4 <+4>:     ptrue   p2.b
  0x00000000000023c8 <+8>:     mov     x2, #0x0                        // #0
  0x00000000000023cc <+12>:    nop
  0x00000000000023d0 <+16>:    ldff1b  {z0.b}, p2/z, [x1, x2]
  0x00000000000023d4 <+20>:    rdffrs  p0.b, p2/z
  0x00000000000023d8 <+24>:    b.cs    0x23f0 <strcpy+48>  // b.hs, b.nlast
  0x00000000000023dc <+28>:    cmpeq   p1.b, p2/z, z0.b, #0
  0x00000000000023e0 <+32>:     0x2408 <strcpy+72>  // b.any
  0x00000000000023e4 <+36>:    st1b    {z0.b}, p2, [x0, x2]
  0x00000000000023e8 <+40>:    incb    x2
  0x00000000000023ec <+44>:    b       0x23d0 <strcpy+16>
  0x00000000000023f0 <+48>:    cmpeq   p1.b, p0/z, z0.b, #0
  0x00000000000023f4 <+52>:     0x2408 <strcpy+72>  // b.any
  0x00000000000023f8 <+56>:    setffr
  0x00000000000023fc <+60>:    st1b    {z0.b}, p0, [x0, x2]
  0x0000000000002400 <+64>:    incp    x2, p0.b
  0x0000000000002404 <+68>:    b       0x23d0 <strcpy+16>
  0x0000000000002408 <+72>:    brka    p0.b, p2/z, p1.b
  0x000000000000240c <+76>:    st1b    {z0.b}, p0, [x0, x2]
  0x0000000000002410 <+80>:    ret
End of assembler dump.

And there you have it. You can see the predicate setup (ptrue), the first-faulting vector loads (ldff1b {z0.b}) and the resetting of the first-fault register (setffr).
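If you’d rather not fire up gdb, the same sanity check can be scripted with objdump. A sketch (it assumes binutils is installed in the container, and the list of SVE-only mnemonics is my own selection, not exhaustive):

```shell
# count_sve_insns: count lines of disassembly containing SVE-only mnemonics.
count_sve_insns() {
    grep -cE 'setffr|rdffr|ldff1|ptrue|brka' || true
}

# Inside the container you would run:
#   objdump -d tests/test-strcpy | count_sve_insns
# A non-zero count means SVE instructions made it into the binary.
```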