Glean on aarch64 on Apple Silicon : part 2

See: Part 1: get an aarch64/Linux VM running in UTM on the M1

I want to develop and use Glean on ARM as I have a MacBook Air (road warrior mode) and I’m interested in making Glean more useful for local developer IDE backends. (cf. What is Glean?)

To build Glean just read the fine instructions and fix any compilation errors, right? Actually, we need a few patches to disable Intel-specific things, but otherwise the instructions are the same. It’s a fairly normal-ish Haskell set of projects with an FFI into some moderately bespoke C++ runtime relying on folly and a few other C++ libs.

Thankfully, all the non-portable parts of Glean are easily isolated to the rts/ownership parts of the Glean database runtime. In this case “ownership” is only used for incremental updates to the database and other semi-advanced things I don’t need right now.

The only real bits of non-portable code are:

  • Flags to tell folly and thrift to use haswell or corei7 (we will ignore this on non-x86_64)
  • An implementation of 256-bit bitsets (via AVX).
  • Use of folly/Elias-Fano coding, for efficient compression of sorted integer lists or sets as offsets (how we represent ownership of facts to things they depend on).

Why is this stuff in Glean? Well, Glean is a database for storing and querying very large scale code information, represented as 64 bit keys into “tables” (predicates) which represent facts. These facts relate to each other forming DAGs. Facts are named by 64 bit key. A Glean db is millions (or billions) of facts across hundreds of predicates. I.e. lots of 64 bit values.

So we’re in classic information retrieval territory – hence the focus on efficient bit and word encodings and operations. Generally, you flatten AST information (or other code facts) into tables, then write those tables into Glean. Glean then goes to a lot of work to store that efficiently. That’s how we get the sub-millisecond query times.

What is a “fact about code”? A single true statement about the code. E.g. for a method M in file F we might have quite a lot of information:

M is a method
M is located at file F
M is located at span 102-105
M has parent P
F is a file
M has type signature T
M is referred to by file/spans (G, 107-110) and (H, 23-26)

Real code bases have millions of such facts, all relating things in the code to each other – types to methods, methods to container modules, declarations to uses, definitions to declarations etc. We want that to be efficient, hence all the bit fiddling.
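To make the "facts in tables" idea concrete, here is a toy model in C++ of the method example above. It is only an illustration: the struct names, the field layout and findReferences are invented for this sketch and are not how Glean stores or queries facts.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using FactId = std::uint64_t;  // every fact is named by a 64-bit id

struct Span { std::uint32_t begin, end; };

// Each struct plays the role of one predicate ("table"); each value is one fact.
struct FileFact      { FactId id; std::string path; };
struct MethodFact    { FactId id; std::string name; FactId file; Span span; };
struct ReferenceFact { FactId id; FactId target; FactId file; Span span; };

// Toy "find references": every Reference fact that points at the given target.
// A linear scan is used purely for clarity.
std::vector<ReferenceFact> findReferences(const std::vector<ReferenceFact>& refs,
                                          FactId target) {
  std::vector<ReferenceFact> out;
  for (const auto& r : refs) {
    if (r.target == target) out.push_back(r);
  }
  return out;
}
```

Facts refer to each other only by id, which is why a real code base ends up as a very large pile of 64-bit values.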

So let’s try to build this on non-x86 and see what breaks.

Building Glean from scratch

The normal way to build Glean is from source. There are two repos: Glean itself and hsthrift.

I’ve put some PRs up for the non-x86_64 builds, so if you’re building for ARM or something else, you’ll need these from here:

git clone https://github.com/donsbot/Glean.git
cd Glean
git clone https://github.com/donsbot/hsthrift.git

Worth doing a cabal update as well, just in case you’ve never built Haskell stuff before.

Now we can build the dependent libraries and the thrift compiler (n.b. we need some stuff installed in /usr/local, which needs sudo).

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

So let’s build thrift and folly:

 cd hsthrift
 ./install-deps --sudo

I’m doing this on the aarch64 Debian 11 image running in UTM on a MacBook Air.

Now… first time build I seem to reliably get a gcc segfault on both x86 and aarch64, which I will conveniently sidestep by running it again. This seems mildly concerning as open source thrift might be miscompiled with gcc. I should likely be using clang here.

[ 57%] Building CXX object thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/protocol/TableBasedSerializer.cpp.o
In file included from :
/usr/include/stdc-predef.h: In substitution of ‘template constexpr T apache::thrift::detail::identity(T) [with T = ]’:
/home/dons/Glean/hsthrift/fbthrift/thrift/lib/cpp2/protocol/TableBasedSerializer.cpp:37:1: required from here
/usr/include/stdc-predef.h:32:92: internal compiler error: Segmentation fault
32 | whether the overall intent is to support these features; otherwise,
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make[2]: *** [thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/build.make:173: thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/protocol/TableBasedSerializer.cpp.o] Error 1

Vanilla Debian gcc.

$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110

That really looks like a gcc bug and probably other things lurking there. Urk. Re-run the command and it seems to make progress. Hmm. Compilers in memory-unsafe languages eh? Moving along quickly…

Build the Glean rts and Haskell bits

Once hsthrift is built and installed, we have all the various C++ deps (folly, xxhash etc). So we can try building Glean itself now. Glean is a mixture of Haskell tools over a C++ runtime. There’s a ton of schemas, bytecode generators, thrift mungers. Glean is sort of an ecosystem of indexers (analyzing code and spitting out facts as logs), a database runtime coupled to a Thrift server (“Glean” itself) and tooling for building up distributed systems around this (for restoring / migrating / monitoring / administering clusters of Glean services).

Building Glean: if you get an error about a missing HUnit, that means the cabal package list hasn’t been synced. I got this on the first go with a blank Debian ISO, as the initial cabal package list is a basic one.

Resolving dependencies…
cabal: Could not resolve dependencies:
[__0] trying: fb-stubs-0.1.0.0 (user goal)
[__1] unknown package: HUnit (dependency of fb-stubs)
[__1] fail (backjumping, conflict set: HUnit, fb-stubs)

That’s fixable with a cabal update.

If you’re not using my branch and are building on non-x86, you’ll fail at the first AVX header.

Preprocessing library 'rts' for glean-0.1.0.0..
Building library 'rts' for glean-0.1.0.0..
In file included from ./glean/rts/ownership/uset.h:11,
from ./glean/rts/ownership.h:12,
from glean/rts/ffi.cpp:18:0: error:
glean/rts/ownership/setu32.h:11:10: error:
fatal error: immintrin.h: No such file or directory
11 | #include <immintrin.h>

Similarly, hsthrift needed some patches where the Intel arch was baked in, otherwise you’ll get:

cc1plus: error: unknown value ‘haswell’ for ‘-march’
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a

I fixed up all the .cabal files and other bits:

$ find . -type f -exec grep -hl haswell {} \;
./hsthrift/server/thrift-server.cabal
./hsthrift/cpp-channel/thrift-cpp-channel.cabal
./hsthrift/common/util/fb-util.cabal
./glean.cabal

See this PR for the tweaks for hsthrift https://github.com/facebookincubator/hsthrift/pull/53/commits

AVX instructions

Now, Glean itself uses a whole set of AVX instructions for different things. To see what’s actually needed I added a define to conditionally include immintrin.h (skipping it on ARM), and then stubbed out each of the methods until the compiler was happy.

$ find . -type f -exec grep -hl immintrin.h {} \;
./glean/rts/ownership/setu32.h
./glean/rts/ownership.cpp
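A guard along these lines is one way to do the conditional include. This is a sketch rather than the actual patch: SKETCH_HAVE_AVX2, kHaveAvx2 and simdBackend are names made up for illustration, and the only things relied on are the compiler-defined macros __x86_64__ and __AVX2__.

```cpp
// Sketch only: SKETCH_HAVE_AVX2 is a hypothetical macro, not a name from the Glean tree.
#if defined(__x86_64__) && defined(__AVX2__)
#include <immintrin.h>       // real intrinsics on AVX2-enabled x86_64 builds
#define SKETCH_HAVE_AVX2 1
#else
#define SKETCH_HAVE_AVX2 0   // aarch64 and friends: stubs or a portable shim go here
#endif

constexpr bool kHaveAvx2 = SKETCH_HAVE_AVX2 != 0;

// Reports which code path this translation unit was compiled with.
inline const char* simdBackend() { return kHaveAvx2 ? "avx2" : "portable"; }
```

Everything that touches __m256i then lives behind the same macro, so the non-x86 build never sees the header.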

The methods we need to stub out are:

int _mm256_testc_si256(__m256i __M, __m256i __V);
int _mm256_testz_si256(__m256i __M, __m256i __V);
__m256i _mm256_setzero_si256();
__m256i _mm256_set1_epi32(int __A);
__m256i _mm256_sllv_epi32(__m256i __X, __m256i __Y);
__m256i _mm256_sub_epi32(__m256i __A, __m256i __B);
__m256i _mm256_set_epi32(int __A, int __B, int __C, int __D, int __E, int __F, int __G, int __H);

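Ooh, AVX-512. Or rather, dodging it: the 256-bit popcount needs AVX-512, so the code uses the scalar 64-bit popcnt instead.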

long long _mm_popcnt_u64(unsigned long long __X);

Also

unsigned long long _lzcnt_u64(unsigned long long __X);
__m256i _mm256_or_si256(__m256i __A, __m256i __B);
__m256i _mm256_and_si256(__m256i __A, __m256i __B);
__m256i _mm256_xor_si256(__m256i __A, __m256i __B);

Figuring out what these are all used for is interesting. We have 256-bit bitsets everywhere, and e.g. four 64-bit popcnts to count the set bits (fact counts?).

size_t count() const {
  const uint64_t* p = reinterpret_cast<const uint64_t*>(&value);
  // _mm256_popcnt instructions require AVX512
  return
    _mm_popcnt_u64(p[0]) +
    _mm_popcnt_u64(p[1]) +
    _mm_popcnt_u64(p[2]) +
    _mm_popcnt_u64(p[3]);
}

Anyway, it’s relatively trivial to stub these out, match the types and we have a mocked AVX layer. Left to the reader to write a portable shim for 256-bit bitsets that does these things on vectors of words.
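As a starting point for that exercise, here is a rough sketch of such a shim: a 256-bit set held as four 64-bit words. Bits256, testz and testc are names invented here (the last two mirror the semantics of _mm256_testz_si256 and _mm256_testc_si256), and __builtin_popcountll assumes GCC or Clang. It is not the code in my branch, which only stubs things out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for a 256-bit AVX register: four 64-bit words.
struct Bits256 {
  std::array<std::uint64_t, 4> w{};  // all bits start cleared

  void set(unsigned bit) {
    assert(bit < 256);
    w[bit / 64] |= std::uint64_t{1} << (bit % 64);
  }
  bool test(unsigned bit) const {
    assert(bit < 256);
    return ((w[bit / 64] >> (bit % 64)) & 1) != 0;
  }
  // Number of set bits: four scalar popcounts, as in the AVX version's count().
  std::size_t count() const {
    std::size_t n = 0;
    for (std::uint64_t x : w) n += static_cast<std::size_t>(__builtin_popcountll(x));
    return n;
  }
};

inline Bits256 operator&(Bits256 a, const Bits256& b) {
  for (std::size_t i = 0; i < 4; ++i) a.w[i] &= b.w[i];
  return a;
}
inline Bits256 operator|(Bits256 a, const Bits256& b) {
  for (std::size_t i = 0; i < 4; ++i) a.w[i] |= b.w[i];
  return a;
}
inline Bits256 operator^(Bits256 a, const Bits256& b) {
  for (std::size_t i = 0; i < 4; ++i) a.w[i] ^= b.w[i];
  return a;
}

// Like _mm256_testz_si256: true when (a & b) has no bits set.
inline bool testz(const Bits256& a, const Bits256& b) {
  std::uint64_t acc = 0;
  for (std::size_t i = 0; i < 4; ++i) acc |= a.w[i] & b.w[i];
  return acc == 0;
}

// Like _mm256_testc_si256: true when (~a & b) has no bits set, i.e. b is a subset of a.
inline bool testc(const Bits256& a, const Bits256& b) {
  std::uint64_t acc = 0;
  for (std::size_t i = 0; i < 4; ++i) acc |= ~a.w[i] & b.w[i];
  return acc == 0;
}
```

A decent compiler will often vectorise the four-word loops on its own (e.g. with NEON on aarch64), though I have not measured that here.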

Elias Fano

So the other bit is a little more hairy. Glean uses Elias Fano to compress all these sets of 64 bit keys we have floating around. Tons of sets indicating facts are owned or related to other facts. The folly implementation of Elias Fano is x86_64 only, so just falls over on aarch64:

/usr/local/include/folly/experimental/EliasFanoCoding.h:43:2: error:
error: #error EliasFanoCoding.h requires x86_64
43 | #error EliasFanoCoding.h requires x86_64
| ^~~~~
|
43 | #error EliasFanoCoding.h requires x86_64

So hmm. Reimplement? No, it’s Saturday, so I’m going to stub this out as well, just enough to get it to compile. My guess is we don’t use many methods here: read/write/iterate and some constructors. So I copy just enough of the canonical implementation’s declarations, with dummy bodies, to get it all to go through. Hsthrift under aarch64 emulation in UTM on an arm64 M1 takes about 10 mins to build with no custom flags.
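For intuition about what is being stubbed out, here is a minimal sketch of the Elias-Fano idea itself. It is not folly's implementation and shares none of its API: the EliasFano struct and the encode/decode functions are invented for this example, and std::vector<bool> is used for clarity rather than speed. Each value is split into low bits, stored verbatim, and a high part, stored in unary in one shared bit vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EliasFano {
  unsigned lowBits = 0;     // l: how many low bits of each value are stored verbatim
  std::size_t size = 0;     // n: number of encoded values
  std::vector<bool> lower;  // n * l bits; value i occupies [i*l, (i+1)*l)
  std::vector<bool> upper;  // unary-coded high parts: bit (high_i + i) is set
};

// Encode a non-decreasing list of 64-bit values.
EliasFano encode(const std::vector<std::uint64_t>& sorted) {
  EliasFano ef;
  ef.size = sorted.size();
  if (sorted.empty()) return ef;
  const std::uint64_t maxValue = sorted.back();
  // l = floor(log2(max / n)), or 0 when max / n < 2.
  for (std::uint64_t ratio = maxValue / ef.size; ratio > 1; ratio >>= 1) ++ef.lowBits;
  ef.upper.assign((maxValue >> ef.lowBits) + ef.size + 1, false);
  ef.lower.reserve(ef.size * ef.lowBits);
  std::uint64_t prev = 0;
  for (std::size_t i = 0; i < ef.size; ++i) {
    const std::uint64_t v = sorted[i];
    assert(v >= prev && "input must be sorted");
    prev = v;
    for (unsigned b = 0; b < ef.lowBits; ++b) ef.lower.push_back(((v >> b) & 1) != 0);
    ef.upper[(v >> ef.lowBits) + i] = true;
  }
  return ef;
}

// Decode by scanning the upper bits: the i-th set bit at position p has high part p - i.
std::vector<std::uint64_t> decode(const EliasFano& ef) {
  std::vector<std::uint64_t> out;
  out.reserve(ef.size);
  std::size_t i = 0;  // index of the next value to emit
  for (std::size_t pos = 0; pos < ef.upper.size() && i < ef.size; ++pos) {
    if (!ef.upper[pos]) continue;
    std::uint64_t low = 0;
    for (unsigned b = 0; b < ef.lowBits; ++b) {
      if (ef.lower[i * ef.lowBits + b]) low |= std::uint64_t{1} << b;
    }
    out.push_back((static_cast<std::uint64_t>(pos - i) << ef.lowBits) | low);
    ++i;
  }
  return out;
}
```

For the eight values 3, 4, 7, 13, 14, 15, 21, 43 this picks two low bits per value and a 19-bit upper vector, so 35 bits in total instead of eight 64-bit words. A production version adds fast select and skip structures on top, which is the part that tends to lean on platform-specific instructions.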

Build and test

So we should be good to go now. Compile the big thing: Glean. Some of these generated bits of schema are pretty big too.

Glean storage is described via “schemas” for languages. Schemas represent what predicates (tables and their types) we want to capture. Uniquely, Glean’s Angle language is rich enough to support abstracting over types and predicates, building up layers of API that let you hide language complexity. You can paper over differences between languages while also providing precise language-specific capture.

To see an example, look at the multi-language find-references layer in codemarkup.angle.

The joy of this is that a client only has to know to query codemarkup:find-references and the right query will be issued for the right language. The client doesn’t have to know language-specific stuff; it’s all hidden in the database engine.

But .. that does end up meaning we generate quite a lot of code. With some trial and error I needed something under 16G to compile the “codemarkup” abstraction layer (this is a language-agnostic navigation layer over the Glean schemas).

make
make test

That should pass and we are in business. We can run the little hello world demo.

$ uname -msr
Linux 5.10.0-10-arm64 aarch64

$ glean shell --db-root /tmp/glean/db/ --schema /tmp/glean/schema/
Glean Shell, built on 2022-01-08 07:22:56.472585205 UTC, from rev 9adc5e80b7f6f7fb9b556fbf3d7a8774fa77d254
type :help for help.

Check our little db created from the walkthrough:

:list
facts/0 (complete)
Created: 2022-01-08 10:59:17 UTC (1 day, 18 hours ago)
Completed: 2022-01-08 10:59:18 UTC (1 day, 18 hours ago)

What predicates does it have?

facts> :schema
predicate example.Member.1 :
{ method : { name : string, doc : maybe string } | variable : { name : string } | }

predicate example.FileClasses.1 : { file : string, classes : [example.Class.1] }

predicate example.Reference.1 :
{ file : string, line : nat, column : nat }
-> example.Class.1

predicate example.Class.1 : { name : string, line : nat }

predicate example.Parent.1 : { child : example.Class.1, parent : example.Class.1 }

predicate example.Has.1 :
{ class_ : example.Class.1, has : example.Member.1, access : enum { Public | Private | } }

predicate example.Child.1 : { parent : example.Class.1, child : example.Class.1 }

Try a query or two: e.g. “How many classes do we have?”

facts> example.Class _
{ "id": 1026, "key": { "name": "Fish", "line": 30 } }
{ "id": 1027, "key": { "name": "Goldfish", "line": 40 } }
{ "id": 1025, "key": { "name": "Lizard", "line": 20 } }
{ "id": 1024, "key": { "name": "Pet", "line": 10 } }

What is the parent of the Fish class?

facts> example.Parent { child = { name = "Fish" } }
{
"id": 1029,
"key": {
"child": { "id": 1026, "key": { "name": "Fish", "line": 30 } },
"parent": { "id": 1024, "key": { "name": "Pet", "line": 10 } }
}
}

1 results, 3 facts, 5.59ms, 172320 bytes, 1014 compiled bytes

Ok we have a working ARM64 port of Glean. In the next post I’ll look at indexing some real code and serving up queries.

Glean on aarch64 on Apple Silicon : part 1

Get a working aarch64 box

This post shows how to get a working aarch64 env on the MacBook Air (M1) for Haskell.

I’m working on the road at the moment, so picked up a MacBook Air with the M1 chip, to travel light. I wanted to use it as a development environment for Glean (cf. what is Glean), the code search system I work on. But Glean is Linux/x86_64 only at the moment due to use of some fancy AVX extensions deep down in the runtime. Let’s fix that.

Motivation: getting Glean working on Apple ARM chips could be useful for a few reasons. Apple Silicon is becoming really common, and a lot of devs have MacBooks as their primary development environment (essentially expensive dumb terminals to run VS Code). Glean is/could be the core of a lot of developer environments, as it indexes source code and serves up queries extremely efficiently, so it could be killer as a local language server backend for your Mac IDE. (e.g. a common backend for all your languages, with unified search, jump-to-def, find-refs etc).

Set up UTM

Glean is still very Linux-focused. So we need a VM. I’m building on an M1 MacBook Air (ARM64). So I install UTM from the app store or internet – this will be our fancy iOS QEMU virtualization layer.

Configure the OS image as per https://medium.com/@lizrice/linux-vms-on-an-m1-based-mac-with-vscode-and-utm-d73e7cb06133 for aarch64 Debian, using https://mac.getutm.app/gallery/ubuntu-20-04 for the basic configuration.

In particular, I set up the following.

  • Information -> Style: Operating system 
  • System -> Hardware -> Architecture: aarch64
  • System -> Memory -> 16G (compiling stuff!)
  • Drives -> VirtIO at least 20G, this will be the install drive and build artifacts
  • Drives -> Removable USB , for the installation .iso
  • Display -> console only (we’ll use ssh)
  • Network -> Mode: Emulated VLAN

[Figure: VM disk configuration]

I’ll point VS Code and other things at this VM, so I’m going to forward port 2200 on the Mac to port 22 on the Debian VM.

[Figure: Network settings for the VM]

Choose OS installer and boot

Set the CD/DVD to the Debian ISO file path. I used the arm64 netinst iso for Debian 11 from https://cdimage.debian.org/debian-cd/current/arm64/iso-cd/

Boot the machine and run the Debian install. It’s like 1999 here. (So much nostalgia when I used to scavenge x86 boxes from dumpsters in the Sydney CBD 20 years ago to put Linux on them).

Yeah!

Boot the image and log in. Now we have a working Linux aarch64 box on the M1, running very close to native speed (arm on arm virtualization).

You can ssh into this from the macOS side (on port 2200), or set it up as a remote host for VS Code just fine, which is shockingly convenient.

Install the dev env

This is a really basic Debian image, so you need a couple of things to get started with a barebones Haskell env:

apt install sudo curl cabal-install

We have a basic dev env now.

$ uname -msr
Linux 5.10.0-10-arm64 aarch64

$ ghci
GHCi, version 8.8.4: https://www.haskell.org/ghc/  :? for help
Prelude> System.Info.arch
"aarch64"
Prelude> let s = 1 : 1 : zipWith (+) s (tail s) in take 20 s
[1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765]

To build Glean a la https://glean.software/docs/building/ we need to update Cabal to 3.6.x or greater, as Glean uses some fancy Cabal configuration features.

Update cabal

We need cabal >= 3.6.x, which isn’t in Debian stable, so I’ll just use the pre-built binary from https://www.haskell.org/cabal/download.html

Choose: Binary download for Debian 10 (aarch64, requires glibc 2.12 or later): cabal-install-3.6.0.0-aarch64-linux-deb10.tar.xz

Unpack that. You’ll also need to apt-get install libnuma-dev if you use that binary.

$ tar xvfJ  cabal-install-3.6.0.0-aarch64-linux-deb10.tar.xz
$ ./cabal --version
cabal-install version 3.6.0.0
compiled using version 3.6.1.0 of the Cabal library

I just copy that over the system cabal for great good. It’s a good idea now to sync the package list for Hackage with a cabal update, before we start trying to build anything Haskell.

Install the Glean dependencies

To build Glean we need a bunch of C++ things. Glean itself will bootstrap the Haskell parts. The Debian packages needed are identical to those for Ubuntu in the Glean install instructions: https://glean.software/docs/building/#ubuntu except you might see “Package ‘libmysqlclient-dev’ has no installation candidate”. We will instead need default-libmysqlclient-dev. We also need libfmt-dev.

So the full set of Debian Glean dependencies is:

apt install g++ \
    cmake \
    bison flex \
    git \
    libzstd-dev \
    libboost-all-dev \
    libevent-dev \
    libdouble-conversion-dev \
    libgoogle-glog-dev \
    libgflags-dev \
    libiberty-dev \
    liblz4-dev \
    liblzma-dev \
    libsnappy-dev \
    make \
    zlib1g-dev \
    binutils-dev \
    libjemalloc-dev \
    default-libmysqlclient-dev \
    libssl-dev \
    pkg-config \
    libunwind-dev \
    libsodium-dev \
    curl \
    libpcre3-dev \
    libfftw3-dev \
    librocksdb-dev \
    libxxhash-dev \
    libfmt-dev

Now we have a machine ready to build Glean. We’ll do the ARM port of Glean in the next post and get something running.