% SPDX-FileCopyrightText: 2015-2024 Quentin Carbonneaux <quentin@c9x.me>
% SPDX-FileCopyrightText: 2025-2026 Sören Tempel <soeren+git@soeren-tempel.net>
%
% SPDX-License-Identifier: MIT AND GPL-3.0-only

\documentclass{article}
%include polycode.fmt

%subst blankline = "\\[5mm]"

% See https://github.com/kosmikus/lhs2tex/issues/58
%format <$> = "\mathbin{\langle\$\rangle}"
%format <&> = "\mathbin{\langle\&\rangle}"
%format <|> = "\mathbin{\langle\:\vline\:\rangle}"
%format <?> = "\mathbin{\langle?\rangle}"
%format <*> = "\mathbin{\langle*\rangle}"
%format <* = "\mathbin{\langle*}"
%format *> = "\mathbin{*\rangle}"

\long\def\ignore#1{}

\usepackage{hyperref}
\hypersetup{
  colorlinks = true,
}

\begin{document}

\title{QBE Intermediate Language\vspace{-2em}}
\date{}
\maketitle
\frenchspacing

\ignore{
\begin{code}
module Language.QBE.Parser
  ( skipInitComments,
    dataDef,
    typeDef,
    funcDef,
    fileDef
  )
where

import Control.Monad (foldM)
import Data.Char (chr)
import Data.Word (Word64)
import Data.Functor ((<&>))
import Data.List (singleton)
import Data.Map qualified as Map
import qualified Language.QBE.Types as Q
import Language.QBE.Util (bind, decNumber, octNumber, float)
import Text.ParserCombinators.Parsec
  ( Parser,
    alphaNum,
    anyChar,
    between,
    char,
    choice,
    letter,
    many,
    many1,
    manyTill,
    newline,
    noneOf,
    oneOf,
    optional,
    optionMaybe,
    sepBy,
    sepBy1,
    skipMany,
    skipMany1,
    string,
    try,
    (<?>),
    (<|>),
  )
\end{code}
}

This is an executable description of the
\href{https://c9x.me/compile/doc/il-v1.2.html}{QBE intermediate language},
specified through \href{https://hackage.haskell.org/package/parsec}{Parsec}
parser combinators and generated from a literate Haskell file.
The description
is derived from the original QBE IL documentation, licensed under MIT.
Presently, this implementation targets version 1.2 of the QBE intermediate
language and aims to be equivalent to the original specification.

\section{Basic Concepts}

The intermediate language (IL) is a higher-level language than the
machine's assembly language. It smooths most of the
irregularities of the underlying hardware and allows an infinite number
of temporaries to be used. This higher abstraction level lets frontend
programmers focus on language design issues.

\subsection{Input Files}

The intermediate language is provided to QBE as text. Usually, one file
is generated per compilation unit from the frontend input language.
An IL file is a sequence of \nameref{sec:definitions} for
data, functions, and types. Once processed by QBE, the resulting file
can be assembled and linked using a standard toolchain (e.g., GNU
binutils).

\begin{code}
comment :: Parser ()
comment = skipMany blankNL >> comment' >> skipMany blankNL
  where
    comment' = char '#' >> manyTill anyChar newline
\end{code}

\ignore{
\begin{code}
skipNoCode :: Parser () -> Parser ()
skipNoCode blankP = try (skipMany1 comment <?> "comments") <|> blankP
\end{code}
}

Here is a complete "Hello World" IL file which defines a function that
prints to the screen. Since the string is not a first class object (only
the pointer is) it is defined outside the function\textquotesingle s
body. Comments start with a \# character and finish with the end of the
line.

\begin{verbatim}
data $str = { b "hello world", b 0 }

export function w $main() {
@start
        # Call the puts function with $str as argument.
        %r =w call $puts(l $str)
        ret 0
}
\end{verbatim}

If you have read the LLVM language reference, you might recognize the
example above.
In comparison, QBE makes a much lighter use of types and
the syntax is terser.

\subsection{Parser Combinators}

\ignore{
\begin{code}
bracesNL :: Parser a -> Parser a
bracesNL = between (wsNL $ char '{') (wsNL $ char '}')

quoted :: Parser a -> Parser a
quoted = let q = char '"' in between q q

sepByTrail1 :: Parser a -> Parser sep -> Parser [a]
sepByTrail1 p sep = do
  x <- p
  xs <- many (try $ sep >> p)
  _ <- optional sep
  return (x:xs)

sepByTrail :: Parser a -> Parser sep -> Parser [a]
sepByTrail p sep = sepByTrail1 p sep <|> return []

parenLst :: Parser a -> Parser [a]
parenLst p = between (ws $ char '(') (char ')') inner
  where
    inner = sepBy (ws p) (ws $ char ',')

unaryInstr :: (Q.Value -> Q.Instr) -> String -> Parser Q.Instr
unaryInstr conc keyword = do
  _ <- ws (string keyword)
  conc <$> ws val

binaryInstr :: (Q.Value -> Q.Value -> Q.Instr) -> String -> Parser Q.Instr
binaryInstr conc keyword = do
  _ <- ws (string keyword)
  vfst <- ws val <* ws (char ',')
  conc vfst <$> ws val

-- Can only appear in data and type definitions and hence allows newlines.
alignAny :: Parser Word64
alignAny = (ws1 (string "align")) >> wsNL decNumber

-- Returns true if it is signed.
signageChar :: Parser Bool
signageChar = (char 's' <|> char 'u') <&> (== 's')
\end{code}
}

The original QBE specification defines the syntax using a BNF grammar. In
contrast, this document defines it using Parsec parser combinators. As such,
this specification is less formal but more accurate as the parsing code is
actually executable. Consequently, this specification also captures constructs
omitted in the original specification (e.g., \nameref{sec:identifiers}, or
\nameref{sec:strlit}).
Nonetheless, the formal language recognized by these
combinators aims to be equivalent to that of the BNF grammar.

\subsection{Identifiers}
\label{sec:identifiers}

% Ident is not documented in the original QBE specification.
% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n304

\begin{code}
ident :: Parser String
ident = do
  start <- letter <|> oneOf "._"
  rest <- many (alphaNum <|> oneOf "$._")
  return $ start : rest
\end{code}

Identifiers for data, types, and functions can start with any ASCII letter or
the special characters \texttt{.} and \texttt{\_}. This initial character can
be followed by a sequence of zero or more alphanumeric characters and the
special characters \texttt{\$}, \texttt{.}, and \texttt{\_}.

\subsection{Sigils}

\begin{code}
userDef :: Parser Q.UserIdent
userDef = Q.UserIdent <$> (char ':' >> ident)

global :: Parser Q.GlobalIdent
global = Q.GlobalIdent <$> (char '$' >> ident)

local :: Parser Q.LocalIdent
local = Q.LocalIdent <$> (char '%' >> ident)

label :: Parser Q.BlockIdent
label = Q.BlockIdent <$> (char '@' >> ident)
\end{code}

The intermediate language makes heavy use of sigils: all user-defined
names are prefixed with a sigil. This is to avoid keyword conflicts, and
also to quickly spot the scope and nature of identifiers.

\begin{itemize}
  \item \texttt{:} is for user-defined \nameref{sec:aggregate-types}
  \item \texttt{\$} is for globals (represented by a pointer)
  \item \texttt{\%} is for function-scope temporaries
  \item \texttt{@@} is for block labels
\end{itemize}

\subsection{Spacing}

\begin{code}
blank :: Parser Char
blank = oneOf "\t " <?> "blank"

blankNL :: Parser Char
blankNL = oneOf "\n\t " <?> "blank or newline"
\end{code}

Individual tokens in IL files must be separated by one or more spacing
characters.
Both spaces and tabs are recognized as spacing characters.
In data and type definitions, newlines may also be used as spaces to
prevent overly long lines. When exactly one of two consecutive tokens is
a symbol (for example \texttt{,} or \texttt{=} or \texttt{\{}), spacing may be omitted.

\ignore{
\begin{code}
ws :: Parser a -> Parser a
ws p = p <* skipMany blank

ws1 :: Parser a -> Parser a
ws1 p = p <* skipMany1 blank

wsNL :: Parser a -> Parser a
wsNL p = p <* skipNoCode (skipMany blankNL)

wsNL1 :: Parser a -> Parser a
wsNL1 p = p <* skipNoCode (skipMany1 blankNL)

-- Only intended to be used to skip comments at the start of a file.
skipInitComments :: Parser ()
skipInitComments = skipNoCode (skipMany blankNL)
\end{code}
}

\subsection{String Literals}
\label{sec:strlit}

% The string literal is not documented in the original QBE specification.
% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n287

\begin{code}
strLit :: Parser String
strLit = concat <$> quoted (many strChr)
  where
    strChr :: Parser [Char]
    strChr = (singleton <$> noneOf "\"\\") <|> escSeq

    -- TODO: not documented in the QBE BNF.
    octEsc :: Parser Char
    octEsc = do
      n <- octNumber
      pure $ chr (fromIntegral n)

    escSeq :: Parser [Char]
    escSeq = try $ do
      esc <- char '\\'
      (singleton <$> octEsc) <|> (anyChar <&> (\c -> [esc, c]))
\end{code}

Strings are enclosed by double quotes and are, for example, used to specify a
section name as part of the \nameref{sec:linkage} information. Within a string,
a double quote can be escaped using a \texttt{\textbackslash} character. All
escape sequences, including double quote escaping, are passed through as-is to
the generated assembly file.

\section{Types}

\subsection{Simple Types}

The IL makes minimal use of types.
By design, the types used are
restricted to what is necessary for unambiguous compilation to machine
code and C interfacing. Unlike LLVM, QBE does not use types as a means
to safety; they are only here for semantic purposes.

\begin{code}
baseType :: Parser Q.BaseType
baseType = choice
  [ bind "w" Q.Word
  , bind "l" Q.Long
  , bind "s" Q.Single
  , bind "d" Q.Double ]
\end{code}

The four base types are \texttt{w} (word), \texttt{l} (long), \texttt{s} (single), and \texttt{d}
(double). They stand respectively for 32-bit and 64-bit integers, and
32-bit and 64-bit floating-point numbers. There are no pointer types
available; pointers are typed by an integer type sufficiently wide to
represent all memory addresses (e.g., \texttt{l} on 64-bit architectures).
Temporaries in the IL can only have a base type.

\begin{code}
extType :: Parser Q.ExtType
extType = (Q.Base <$> baseType)
  <|> bind "b" Q.Byte
  <|> bind "h" Q.HalfWord
\end{code}

Extended types contain base types plus \texttt{b} (byte) and \texttt{h} (half word),
respectively for 8-bit and 16-bit integers. They are used in \nameref{sec:aggregate-types}
and \nameref{sec:data} definitions.

For C interfacing, the IL also provides user-defined aggregate types as
well as signed and unsigned variants of the sub-word extended types.
Read more about these types in the \nameref{sec:aggregate-types}
and \nameref{sec:functions} sections.

\subsection{Subtyping}
\label{sec:subtyping}

The IL has a minimal subtyping feature, for integer types only. Any
value of type \texttt{l} can be used in a \texttt{w} context. In that case, only the
32 least significant bits of the long value are used.

Note that this is the opposite of the usual subtyping on integers (in
C, we can safely use an \texttt{int} where a \texttt{long} is expected). A word value
cannot be used in a long context.
The rationale is that a word can be
signed or unsigned, so extending it to a long could be done in two ways,
either by zero-extension, or by sign-extension.

\subsection{Constants and Vals}
\label{sec:constants-and-vals}

\begin{code}
dynConst :: Parser Q.DynConst
dynConst =
  (Q.Const <$> constant)
    <|> (Q.Thread <$> (key "thread" >> global))
    <|> (Q.Extern <$> try (key "extern" >> global))
    <|> (Q.ExternThread <$> (key "extern" >> key "thread" >> global))
    <?> "dynconst"
  where
    key s = ws1 $ string s
\end{code}

Constants come in two kinds: compile-time constants and dynamic
constants. Dynamic constants include compile-time constants and other
symbol variants that are only known at program-load time or execution
time. Consequently, dynamic constants can only occur in function bodies.

When the \texttt{extern} keyword prefixes a symbol name, the symbol is
accessed indirectly through a table edited by the dynamic linker (e.g.,
GOT/PLT). This enables PIE/PIC code generation. When \texttt{extern} is
combined with \texttt{thread}, the symbol is accessed using the
initial-exec TLS model, suitable for thread-local variables defined in
shared objects available at startup time (i.e., not loaded through
dlopen).

The representation of integers is two's complement.
Floating-point numbers are represented using the single-precision and
double-precision formats of the IEEE 754 standard.

\begin{code}
constant :: Parser Q.Const
constant =
  (Q.Number <$> decNumber)
    <|> (Q.SFP <$> sfp)
    <|> (Q.DFP <$> dfp)
    <|> (Q.Global <$> global)
    <?> "const"
  where
    sfp = string "s_" >> float
    dfp = string "d_" >> float
\end{code}

Constants specify a sequence of bits and are untyped. They are always
parsed as 64-bit blobs. Depending on the context surrounding a constant,
only some of its bits are used.
For example, in the program below, the
two variables defined have the same value since the first operand of the
subtraction is a word (32-bit) context.

\begin{verbatim}
%x =w sub -1, 0
%y =w sub 4294967295, 0
\end{verbatim}

Because specifying floating-point constants by their bits makes the code
less readable, syntactic sugar is provided to express them. Standard
scientific notation is prefixed with \texttt{s\_} and \texttt{d\_} for single and
double precision numbers respectively. Once again, the following example
defines the same double-precision constant twice.

\begin{verbatim}
%x =d add d_0, d_-1
%y =d add d_0, -4616189618054758400
\end{verbatim}

Global symbols can also be used directly as constants; they will be
resolved and turned into actual numeric constants by the linker.

When the \texttt{thread} keyword prefixes a symbol name, the
symbol\textquotesingle s numeric value is resolved at runtime in the
thread-local storage.

\begin{code}
val :: Parser Q.Value
val =
  (Q.VConst <$> dynConst)
    <|> (Q.VLocal <$> local)
    <?> "val"
\end{code}

Vals are used as arguments in regular, phi, and jump instructions within
function definitions. They are either constants or function-scope
temporaries.

\subsection{Linkage}
\label{sec:linkage}

\begin{code}
linkage :: Parser Q.Linkage
linkage =
  wsNL (bind "export" Q.LExport)
    <|> wsNL (bind "thread" Q.LThread)
    <|> do
      _ <- ws1 $ string "section"
      (try secWithFlags) <|> sec
  where
    sec :: Parser Q.Linkage
    sec = wsNL strLit <&> (`Q.LSection` Nothing)

    secWithFlags :: Parser Q.Linkage
    secWithFlags = do
      n <- ws1 strLit
      wsNL strLit <&> Q.LSection n . Just
\end{code}

Function and data definitions (see below) can specify linkage
information to be passed to the assembler and eventually to the linker.

The \texttt{export} linkage flag marks the defined item as visible outside the
current file\textquotesingle s scope. If absent, the symbol can only be
referred to locally. Functions compiled by QBE and called from C need to
be exported.

The \texttt{thread} linkage flag can only qualify data definitions. It mandates
that the object defined is stored in thread-local storage. Each time a
runtime thread starts, the supporting platform runtime is in charge of
making a new copy of the object for the fresh thread. Objects in
thread-local storage must be accessed using the \texttt{thread \$IDENT} syntax,
as specified in the \nameref{sec:constants-and-vals} section.

A \texttt{section} flag can be specified to tell the linker to put the defined
item in a certain section. The use of the section flag is platform
dependent and we refer the user to the documentation of their assembler
and linker for relevant information.

\begin{verbatim}
section ".init_array" data $.init.f = { l $f }
\end{verbatim}

The section flag can be used to add function pointers to a global
initialization list, as depicted above. Note that some platforms provide
a BSS section that can be used to minimize the footprint of uniformly
zeroed data. When this section is available, QBE will automatically make
use of it and no section flag is required.

The section and export linkage flags should each appear at most once in
a definition. If multiple occurrences are present, QBE is free to use
any.

\subsection{Definitions}
\label{sec:definitions}

Definitions are the essential components of an IL file. They can define
three types of objects: aggregate types, data, and functions. Aggregate
types are never exported and do not compile to any code.
Data and
function definitions have file scope and are mutually recursive (even
across IL files). Their visibility can be controlled using linkage
flags.

\subsubsection{Aggregate Types}
\label{sec:aggregate-types}

\begin{code}
typeDef :: Parser Q.TypeDef
typeDef = do
  _ <- wsNL1 (string "type")
  i <- wsNL1 userDef
  _ <- wsNL1 (char '=')
  a <- optionMaybe alignAny
  bracesNL (opaqueType <|> unionType <|> regularType) <&> Q.TypeDef i a
\end{code}

Aggregate type definitions start with the \texttt{type} keyword. They have file
scope, but types must be defined before being referenced. The inner
structure of a type is expressed by a comma-separated list of fields.

\begin{code}
subType :: Parser Q.SubType
subType =
  (Q.SExtType <$> extType)
    <|> (Q.SUserDef <$> userDef)

field :: Parser Q.Field
field = do
  -- TODO: newline is required if there is a number argument
  f <- wsNL subType
  s <- ws $ optionMaybe decNumber
  pure (f, s)

fields :: Bool -> Parser [Q.Field]
fields allowEmpty =
  (if allowEmpty then sepByTrail else sepByTrail1) field (wsNL $ char ',')
\end{code}

A field consists of a subtype, either an extended type or a user-defined type,
and an optional number expressing how many times this field is repeated. In case
many items of the same type are sequenced (like in a C array), this shorter array
syntax can be used.

\begin{code}
regularType :: Parser Q.AggType
regularType = Q.ARegular <$> fields True
\end{code}

Three different kinds of aggregate types are presently supported: regular
types, union types and opaque types. The fields of regular types will be
packed. By default, the alignment of an aggregate type is the maximum alignment
of its members.
The alignment can be explicitly specified by the programmer.

\begin{code}
unionType :: Parser Q.AggType
unionType = Q.AUnion <$> many1 (wsNL unionType')
  where
    unionType' :: Parser [Q.Field]
    unionType' = bracesNL $ fields False
\end{code}

Union types allow the same chunk of memory to be used with different layouts. They are defined by enclosing multiple regular aggregate type bodies in a pair of curly braces. Size and alignment of union types are set to the maximum size and alignment of each variation or, in the case of alignment, can be explicitly specified.

\begin{code}
opaqueType :: Parser Q.AggType
opaqueType = Q.AOpaque <$> wsNL decNumber
\end{code}

Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined simply by enclosing their size between curly braces.

\subsubsection{Data}
\label{sec:data}

\begin{code}
dataDef :: Parser Q.DataDef
dataDef = do
  link <- many linkage
  name <- wsNL1 (string "data") >> wsNL global
  _ <- wsNL (char '=')
  alignment <- optionMaybe alignAny
  bracesNL dataObjs <&> Q.DataDef link name alignment
  where
    -- TODO: sepByTrail is not documented in the QBE BNF.
    dataObjs = sepByTrail dataObj (wsNL $ char ',')
\end{code}

Data definitions express objects that will be emitted in the compiled
file.
Their visibility and location in the compiled artifact are
controlled with linkage flags described in the \nameref{sec:linkage}
section.

They define a global identifier (starting with the sigil \texttt{\$}) that
will contain a pointer to the object specified by the definition.

\begin{code}
dataObj :: Parser Q.DataObj
dataObj =
  (Q.OZeroFill <$> (wsNL1 (char 'z') >> wsNL decNumber))
    <|> do
      t <- wsNL1 extType
      i <- many1 (wsNL dataItem)
      return $ Q.OItem t i
\end{code}

Objects are described by a sequence of fields that start with a type
letter. This letter can either be an extended type, or the \texttt{z} letter.
If the letter used is an extended type, the data item following
specifies the bits to be stored in the field.

\begin{code}
dataItem :: Parser Q.DataItem
dataItem =
  (Q.DString <$> strLit)
    <|> try
      ( do
          i <- ws global
          off <- (ws $ char '+') >> ws decNumber
          return $ Q.DSymOff i off
      )
    <|> (Q.DConst <$> constant)
\end{code}

Within each object, several items can be defined. When several data items
follow a letter, they initialize multiple fields of the same size.

\begin{code}
allocSize :: Parser Q.AllocSize
allocSize =
  choice
    [ bind "4" Q.AllocWord,
      bind "8" Q.AllocLong,
      bind "16" Q.AllocLongLong
    ]
\end{code}

The members of a struct will be packed. This means that padding has to
be emitted by the frontend when necessary. Alignment of the whole data
object can be manually specified, and when no alignment is provided,
the maximum alignment from the platform is used.

When the \texttt{z} letter is used, the number following indicates the size of
the field; the contents of the field are zero initialized.
It can be
used to add padding between fields or zero-initialize big arrays.

\subsubsection{Functions}
\label{sec:functions}

\begin{code}
funcDef :: Parser Q.FuncDef
funcDef = do
  link <- many linkage
  _ <- ws1 (string "function")
  retTy <- optionMaybe (ws1 abity)
  name <- ws global
  args <- wsNL params
  body <- between (wsNL1 $ char '{') (wsNL $ char '}') $ many1 block

  case (insertJumps body) of
    Nothing -> fail $ "invalid fallthrough in " ++ show name
    Just bl -> return $ Q.FuncDef link name retTy args bl
\end{code}

Function definitions contain the actual code to emit in the compiled
file. They define a global symbol that contains a pointer to the
function code. This pointer can be used in \texttt{call} instructions or stored
in memory.

\begin{code}
subWordType :: Parser Q.SubWordType
subWordType = choice
  [ try $ bind "sb" Q.SignedByte
  , try $ bind "ub" Q.UnsignedByte
  , bind "sh" Q.SignedHalf
  , bind "uh" Q.UnsignedHalf ]

abity :: Parser Q.Abity
abity = try (Q.ASubWordType <$> subWordType)
  <|> (Q.ABase <$> baseType)
  <|> (Q.AUserDef <$> userDef)
\end{code}

The type given right before the function name is the return type of the
function. All return values of this function must have this return type.
If the return type is missing, the function must not return any value.

\begin{code}
param :: Parser Q.FuncParam
param = (Q.Env <$> (ws1 (string "env") >> local))
  <|> (string "..." >> pure Q.Variadic)
  <|> do
    ty <- ws1 abity
    Q.Regular ty <$> local

params :: Parser [Q.FuncParam]
params = parenLst param
\end{code}

The parameter list is a comma-separated list of temporary names prefixed
by types. The types are used to correctly implement C compatibility.
When an argument has an aggregate type, a pointer to the aggregate is
passed by the caller.
In the example below, we have to use a load
instruction to get the value of the first (and only) member of the
struct.

\begin{verbatim}
type :one = { w }

function w $getone(:one %p) {
@start
        %val =w loadw %p
        ret %val
}
\end{verbatim}

If a function accepts or returns values that are smaller than a word,
such as \texttt{signed char} or \texttt{unsigned short} in C, one of the sub-word types
must be used. The sub-word types \texttt{sb}, \texttt{ub}, \texttt{sh}, and \texttt{uh} stand,
respectively, for signed and unsigned 8-bit values, and signed and
unsigned 16-bit values. Parameters associated with a sub-word type of
bit width N only have their N least significant bits set and have base
type \texttt{w}. For example, the function

\begin{verbatim}
function w $addbyte(w %a, sb %b) {
@start
        %bw =w extsb %b
        %val =w add %a, %bw
        ret %val
}
\end{verbatim}

needs to sign-extend its second argument before the addition. Dually,
return values with sub-word types do not need to be sign or zero
extended.

If the parameter list ends with \texttt{...}, the function is a variadic
function: it can accept a variable number of arguments. To access the
extra arguments provided by the caller, use the \texttt{vastart} and \texttt{vaarg}
instructions described in the \nameref{sec:variadic} section.

Optionally, the parameter list can start with an environment parameter
\texttt{env \%e}. This special parameter is a 64-bit integer temporary (i.e.,
of type \texttt{l}). If the function does not use its environment parameter,
callers can safely omit it. This parameter is invisible to a C caller:
for example, the function

\begin{verbatim}
export function w $add(env %e, w %a, w %b) {
@start
        %c =w add %a, %b
        ret %c
}
\end{verbatim}

must be given the C prototype \texttt{int add(int, int)}.
The intended use of
this feature is to pass the environment pointer of closures while
retaining a very good compatibility with C. The \nameref{sec:call}
section explains how to pass an environment parameter.

Since global symbols are mutually recursive, there is no need
for function declarations: a function can be referenced before its
definition. Similarly, functions from other modules can be used without
previous declaration. All the type information necessary to compile a
call is in the instruction itself.

The syntax and semantics for the body of functions are described in the
\nameref{sec:control} section.

\section{Control}
\label{sec:control}

The IL represents programs as textual transcriptions of control flow
graphs. The control flow is serialized as a sequence of blocks of
straight-line code which are connected using jump instructions.

\subsection{Blocks}
\label{sec:blocks}

\ignore{
\begin{code}
-- Basic block abstraction with optional exit points. The 'insertJumps'
-- function takes care of inserting fallthrough for omitted jumps.
data Block'
  = Block'
  { label' :: Q.BlockIdent,
    phi' :: [Q.Phi],
    stmt' :: [Q.Statement],
    term' :: Maybe Q.JumpInstr
  }
  deriving (Show, Eq)

insertJumps :: [Block'] -> Maybe [Q.Block]
insertJumps xs = foldM go [] $ zipWithNext xs
  where
    zipWithNext :: [a] -> [(a, Maybe a)]
    zipWithNext [] = []
    zipWithNext lst@(_ : t) = zip lst $ map Just t ++ [Nothing]

    fromBlock' :: Block' -> Q.JumpInstr -> Q.Block
    fromBlock' (Block' l p s _) = Q.Block l p s

    go :: [Q.Block] -> (Block', Maybe Block') -> Maybe [Q.Block]
    go acc (x@Block' {term' = Just ji}, _) =
      Just (acc ++ [fromBlock' x ji])
    go acc (x@Block' {term' = Nothing}, Just nxt) =
      Just (acc ++ [fromBlock' x (Q.Jump $ label' nxt)])
    go _ (Block' {term' = Nothing}, Nothing) =
      Nothing
\end{code}
}

\begin{code}
block :: Parser Block'
block = do
  l <- wsNL1 label
  p <- many (wsNL1 $ try phiInstr)
  s <- many (wsNL1 statement)
  Block' l p s <$> (optionMaybe $ wsNL1 jumpInstr)
\end{code}

All blocks have a name that is specified by a label at their beginning.
Then follows a sequence of instructions that have "fall-through" flow.
Finally one jump terminates the block. The jump can either transfer
control to another block of the same function or return; jumps are
described further below.

The first block in a function must not be the target of any jump in the
program. If a jump to the function start is needed, the frontend must
insert an empty prelude block at the beginning of the function.

When one block jumps to the next block in the IL file, it is not
necessary to write the jump instruction; it will be automatically added
by the parser.
For example, the start block in the example below jumps
directly to the loop block.

\subsection{Jumps}
\label{sec:jumps}

\begin{code}
jumpInstr :: Parser Q.JumpInstr
jumpInstr = (string "hlt" >> pure Q.Halt)
  -- TODO: Return requires a space if there is an optionMaybe
  <|> Q.Return <$> ((ws $ string "ret") >> optionMaybe val)
  <|> try (Q.Jump <$> ((ws1 $ string "jmp") >> label))
  <|> do
    _ <- ws1 $ string "jnz"
    v <- ws val <* ws (char ',')
    l1 <- ws label <* ws (char ',')
    l2 <- ws label
    return $ Q.Jnz v l1 l2
\end{code}

A jump instruction ends every block and transfers the control to another
program location. The target of a jump must never be the first block in
a function. The four kinds of jumps available are described in the
following list.

\begin{enumerate}
  \item \textbf{Unconditional jump.} Jumps to another block of the same function.
  \item \textbf{Conditional jump.} When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type; because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0.
  \item \textbf{Function return.} Terminates the execution of the current function, optionally returning a value to the caller. The value returned must be of the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used.
  \item \textbf{Program termination.} Terminates the execution of the program with a target-dependent error.
  This instruction can be used when it is expected that the execution never reaches the end of the block it closes; for example, after having called a function such as \texttt{exit()}.
\end{enumerate}

\section{Instructions}
\label{sec:instructions}

\begin{code}
instr :: Parser Q.Instr
instr =
  choice
    [ try $ binaryInstr Q.Add "add",
      try $ binaryInstr Q.Sub "sub",
      try $ binaryInstr Q.Mul "mul",
      try $ binaryInstr Q.Div "div",
      try $ binaryInstr Q.URem "urem",
      try $ binaryInstr Q.Rem "rem",
      try $ binaryInstr Q.UDiv "udiv",
      try $ binaryInstr Q.Or "or",
      try $ binaryInstr Q.Xor "xor",
      try $ binaryInstr Q.And "and",
      try $ binaryInstr Q.Sar "sar",
      try $ binaryInstr Q.Shr "shr",
      try $ binaryInstr Q.Shl "shl",
      try $ unaryInstr Q.Neg "neg",
      try $ unaryInstr Q.Cast "cast",
      try $ unaryInstr Q.Copy "copy",
      try $ unaryInstr Q.VAArg "vaarg",
      try $ loadInstr,
      try $ allocInstr,
      try $ compareInstr,
      try $ extInstr,
      try $ truncInstr,
      try $ fromFloatInstr,
      try $ toFloatInstr
    ]
\end{code}

Instructions are the smallest piece of code in the IL; they form the body of
\nameref{sec:blocks}. This specification distinguishes instructions and
volatile instructions; the latter do not return a value.
For the former, the IL
uses a three-address code, which means that one instruction computes an
operation between two operands and assigns the result to a third one.

\begin{code}
assign :: Parser Q.Statement
assign = do
  n <- ws local
  t <- ws (char '=') >> ws1 baseType
  Q.Assign n t <$> instr

volatileInstr :: Parser Q.Statement
volatileInstr =
  Q.Volatile <$>
    (storeInstr <|> blitInstr <|> vastartInstr <|> dbglocInstr)

-- TODO: Not documented in the QBE BNF.
statement :: Parser Q.Statement
statement = (try callInstr) <|> assign <|> volatileInstr
\end{code}

An instruction has both a name and a return type; this return type is a base
type that defines the size of the instruction's result. The type of the
arguments can be unambiguously inferred using the instruction name and the
return type. For example, for all arithmetic instructions, the type of the
arguments is the same as the return type. The two additions below are valid if
\texttt{\%y} is a word or a long (because of \nameref{sec:subtyping}).

\begin{verbatim}
%x =w add 0, %y
%z =w add %x, %x
\end{verbatim}

Some instructions, like comparisons and memory loads, have operand types
that differ from their return types. For instance, two floating-point numbers
can be compared to give a word result (1 if the comparison succeeds, 0
if it fails).

\begin{verbatim}
%c =w cgts %a, %b
\end{verbatim}

In the example above, both operands have to have single type.
This is
made explicit by the instruction suffix.

\subsection{Arithmetic and Bits}

\begin{quote}
\begin{itemize}
\item \texttt{add}, \texttt{sub}, \texttt{div}, \texttt{mul}
\item \texttt{neg}
\item \texttt{udiv}, \texttt{rem}, \texttt{urem}
\item \texttt{or}, \texttt{xor}, \texttt{and}
\item \texttt{sar}, \texttt{shr}, \texttt{shl}
\end{itemize}
\end{quote}

The base arithmetic instructions in the first bullet are available for
all types, integers and floating points.

When \texttt{div} is used with word or long return type, the arguments are
treated as signed. The unsigned integral division is available as the
\texttt{udiv} instruction. When the result of a division is not an integer, it
is truncated towards zero.

The signed and unsigned remainder operations are available as \texttt{rem} and
\texttt{urem}. The sign of the remainder is the same as that of the
dividend. Its magnitude is smaller than that of the divisor. These two
instructions and \texttt{udiv} are only available with integer arguments and
result.

Bitwise OR, AND, and XOR operations are available for both integer
types. Logical operations of typical programming languages can be
implemented using \nameref{sec:comparisions} and \nameref{sec:jumps}.

The shift instructions \texttt{sar}, \texttt{shr}, and \texttt{shl} shift their
first operand right or left by the amount given by the second operand. The
shifting amount is taken modulo the size of the result type. Shifting right can
either preserve the sign of the value (using \texttt{sar}), or fill the newly
freed bits with zeroes (using \texttt{shr}). Shifting left always fills the
freed bits with zeroes.

Note that an arithmetic shift right (\texttt{sar}) is only equivalent to a
division by a power of two for non-negative numbers.
This is because the shift
right ``truncates'' towards minus infinity, while the division truncates towards
zero.

\subsection{Memory}
\label{sec:memory}

The following sections discuss instructions for interacting with values stored in memory.

\subsubsection{Store instructions}

\begin{code}
storeInstr :: Parser Q.VolatileInstr
storeInstr = do
  t <- string "store" >> ws1 extType
  v <- ws val
  _ <- ws $ char ','
  ws val <&> Q.Store t v
\end{code}

Store instructions exist to store a value of any base type and any extended
type. Since halfwords and bytes are not first class in the IL, \texttt{storeh}
and \texttt{storeb} take a word as argument. Only the least significant 16 or 8
bits of this word will be stored in memory at the address specified in the
second argument.

\subsubsection{Load instructions}

\begin{code}
loadInstr :: Parser Q.Instr
loadInstr = do
  _ <- string "load"
  t <- ws1 $ choice
    [ try $ bind "sw" (Q.LBase Q.Word),
      try $ bind "uw" (Q.LBase Q.Word),
      try $ Q.LSubWord <$> subWordType,
      Q.LBase <$> baseType
    ]
  ws val <&> Q.Load t
\end{code}

For types smaller than long, two variants of the load instruction are
available: one will sign extend the loaded value, while the other will zero
extend it. Note that all loads smaller than long can load to either a long or a
word.

The two instructions \texttt{loadsw} and \texttt{loaduw} have the same effect
when they are used to define a word temporary.
A \texttt{loadw} instruction is
provided as syntactic sugar for \texttt{loadsw} to make explicit that the
extension mechanism used is irrelevant.

\subsubsection{Blits}

\begin{code}
blitInstr :: Parser Q.VolatileInstr
blitInstr = do
  v1 <- (ws1 $ string "blit") >> ws val <* (ws $ char ',')
  v2 <- ws val <* (ws $ char ',')
  nb <- decNumber
  return $ Q.Blit v1 v2 nb
\end{code}

The blit instruction copies in-memory data from its first address argument to
its second address argument. The third argument is the number of bytes to copy.
The source and destination spans are required to be either non-overlapping, or
fully overlapping (source address identical to the destination address). The
byte count argument must be a nonnegative numeric constant; it cannot be a
temporary.

One blit instruction may generate a number of instructions proportional to its
byte count argument; consequently, it is recommended to keep this argument
relatively small. If large copies are necessary, it is preferable that
frontends generate calls to a supporting \texttt{memcpy} function.

\subsubsection{Stack Allocation}

\begin{code}
allocInstr :: Parser Q.Instr
allocInstr = do
  siz <- (ws $ string "alloc") >> (ws1 allocSize)
  val <&> Q.Alloc siz
\end{code}

These instructions allocate a chunk of memory on the stack. The number ending
the instruction name is the alignment required for the allocated slot. QBE will
make sure that the returned address is a multiple of that alignment value.

Stack allocation instructions are used, for example, when compiling C local
variables, because their address can be taken.
When compiling Fortran,
temporaries can be used directly instead, because it is illegal to take the
address of a variable.

\subsection{Comparisons}
\label{sec:comparisions}

\begin{code}
compareInstr :: Parser Q.Instr
compareInstr = do
  _ <- char 'c'
  (try intCompare) <|> floatCompare

compareArgs :: Parser (Q.Value, Q.Value)
compareArgs = do
  lhs <- ws val <* ws (char ',')
  rhs <- ws val
  pure (lhs, rhs)

intCompare :: Parser Q.Instr
intCompare = do
  op <- compareIntOp
  ty <- ws1 intArg

  (lhs, rhs) <- compareArgs
  pure $ Q.CompareInt ty op lhs rhs

floatCompare :: Parser Q.Instr
floatCompare = do
  op <- compareFloatOp
  ty <- ws1 floatArg

  (lhs, rhs) <- compareArgs
  pure $ Q.CompareFloat ty op lhs rhs
\end{code}

Comparison instructions return an integer value (either a word or a long), and
compare values of arbitrary types. The returned value is 1 if the two operands
satisfy the comparison relation, or 0 otherwise.
The names of comparisons
respect a standard naming scheme in three parts:

\begin{enumerate}
  \item All comparisons start with the letter \texttt{c}.
  \item Then comes a comparison type.
  \item Finally, the instruction name is terminated with a basic type suffix specifying the type of the operands to be compared.
\end{enumerate}

The following instructions are available for integer comparisons:

\begin{code}
compareIntOp :: Parser Q.IntCmpOp
compareIntOp = choice
  [ bind "eq" Q.IEq
  , bind "ne" Q.INe
  , try $ bind "sle" Q.ISle
  , try $ bind "slt" Q.ISlt
  , try $ bind "sge" Q.ISge
  , try $ bind "sgt" Q.ISgt
  , try $ bind "ule" Q.IUle
  , try $ bind "ult" Q.IUlt
  , try $ bind "uge" Q.IUge
  , try $ bind "ugt" Q.IUgt ]
\end{code}

For floating point comparisons use one of these instructions:

\begin{code}
compareFloatOp :: Parser Q.FloatCmpOp
compareFloatOp = choice
  [ bind "eq" Q.FEq
  , bind "ne" Q.FNe
  , try $ bind "le" Q.FLe
  , bind "lt" Q.FLt
  , try $ bind "ge" Q.FGe
  , bind "gt" Q.FGt
  , bind "o" Q.FOrd
  , bind "uo" Q.FUnord ]
\end{code}

For example, \texttt{cod} compares two double-precision floating point numbers
and returns 1 if the two floating points are not NaNs, or 0 otherwise. The
\texttt{csltw} instruction compares two words representing signed numbers and
returns 1 when the first argument is smaller than the second one.

\subsection{Conversions}

Conversion operations change the representation of a value, possibly modifying
it if the target type cannot hold the value of the source type.
Conversions can
extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or
convert a floating point into an integer and vice versa.

\begin{code}
extInstr :: Parser Q.Instr
extInstr = do
  _ <- string "ext"
  ty <- ws1 extArg
  ws val <&> Q.Ext ty
  where
    extArg :: Parser Q.ExtArg
    extArg = try (Q.ExtSubWord <$> subWordType)
               <|> try (bind "sw" Q.ExtSignedWord)
               <|> bind "s" Q.ExtSingle
               <|> bind "uw" Q.ExtUnsignedWord
\end{code}

Extending the precision of a temporary is done using the \texttt{ext} family of
instructions. Because QBE types do not specify the signedness (like in LLVM),
extension instructions exist to sign-extend and zero-extend a value. For
example, \texttt{extsb} takes a word argument and sign-extends the 8
least-significant bits to a full word or long, depending on the return type.

\begin{code}
truncInstr :: Parser Q.Instr
truncInstr = do
  _ <- ws1 $ string "truncd"
  ws val <&> Q.TruncDouble
\end{code}

The instructions \texttt{exts} (extend single) and \texttt{truncd} (truncate
double) are provided to change the precision of a floating point value.
When
the double argument of \texttt{truncd} cannot be represented as a
single-precision floating point, it is truncated towards zero.

\begin{code}
floatArg :: Parser Q.FloatArg
floatArg = bind "d" Q.FDouble <|> bind "s" Q.FSingle

fromFloatInstr :: Parser Q.Instr
fromFloatInstr = do
  arg <- floatArg <* string "to"
  isSigned <- signageChar
  _ <- ws1 $ char 'i'
  ws val <&> Q.FloatToInt arg isSigned

intArg :: Parser Q.IntArg
intArg = bind "w" Q.IWord <|> bind "l" Q.ILong

toFloatInstr :: Parser Q.Instr
toFloatInstr = do
  isSigned <- signageChar
  arg <- intArg
  _ <- ws1 $ string "tof"
  ws val <&> Q.IntToFloat arg isSigned
\end{code}

Converting between integers and floating points is done using
\texttt{stosi} (single to signed integer), \texttt{stoui} (single to unsigned
integer), \texttt{dtosi} (double to signed integer), \texttt{dtoui} (double to
unsigned integer), \texttt{swtof} (signed word to float), \texttt{uwtof}
(unsigned word to float), \texttt{sltof} (signed long to float) and
\texttt{ultof} (unsigned long to float).

\subsection{Cast and Copy}

The \texttt{cast} and \texttt{copy} instructions return the bits of their
argument verbatim. However, a cast will change an integer into a floating point
of the same width and vice versa.

Casts can be used to make bitwise operations on the representation of floating
point numbers. For example, the following program will compute the opposite of
the single-precision floating point number \texttt{\%f} into \texttt{\%rs}.

\begin{verbatim}
%b0 =w cast %f
%b1 =w xor 2147483648, %b0  # flip the msb
%rs =s cast %b1
\end{verbatim}

\subsection{Call}
\label{sec:call}

\begin{code}
-- TODO: Code duplication with 'param'.
callArg :: Parser Q.FuncArg
callArg = (Q.ArgEnv <$> (ws1 (string "env") >> val))
      <|> (string "..."
            >> pure Q.ArgVar)
      <|> do
        ty <- ws1 abity
        Q.ArgReg ty <$> val

callArgs :: Parser [Q.FuncArg]
callArgs = parenLst callArg

callInstr :: Parser Q.Statement
callInstr = do
  retValue <- optionMaybe $ do
    i <- ws local <* ws (char '=')
    a <- ws1 abity
    return (i, a)
  toCall <- ws1 (string "call") >> ws val
  fnArgs <- callArgs
  return $ Q.Call retValue toCall fnArgs
\end{code}

The call instruction is special in several ways. It is not a three-address
instruction and requires the type of all its arguments to be given. Also, the
return type can be either a base type or an aggregate type. These specifics are
required to compile calls with C compatibility (i.e., to respect the ABI).

When an aggregate type is used as argument type or return type, the value
respectively passed or returned needs to be a pointer to a memory location
holding the value. This is because aggregate types are not first-class
citizens of the IL.

Sub-word types are used for arguments and return values of width less than a
word. Details on these types are presented in the \nameref{sec:functions} section.
Arguments with sub-word types need not be sign or zero extended according to
their type. Calls with a sub-word return type define a temporary of base type
\texttt{w} with its most significant bits unspecified.

Unless the called function does not return a value, a return temporary must be
specified, even if it is never used afterwards.

An environment parameter can be passed as first argument using the \texttt{env}
keyword. The passed value must be a 64-bit integer. If the called function does
not expect an environment parameter, it will be safely discarded.
See the
\nameref{sec:functions} section for more information about environment
parameters.

When the called function is variadic, there must be a \texttt{...} marker
separating the named and variadic arguments.

\subsection{Variadic}
\label{sec:variadic}

\begin{code}
vastartInstr :: Parser Q.VolatileInstr
vastartInstr = do
  _ <- ws1 (string "vastart")
  Q.VAStart <$> ws val
\end{code}

The \texttt{vastart} and \texttt{vaarg} instructions provide a portable way to
access the extra parameters of a variadic function.

\begin{enumerate}
  \item \texttt{vastart} -- \texttt{(m)}
  \item \texttt{vaarg} -- \texttt{T(mmmm)}
\end{enumerate}

The \texttt{vastart} instruction initializes a variable argument list used to
access the extra parameters of the enclosing variadic function. It is safe to
call it multiple times.

The \texttt{vaarg} instruction fetches the next argument from a variable
argument list. It is currently limited to fetching arguments that have a base
type. This instruction is essentially effectful: calling it twice in a row will
return two consecutive arguments from the argument list.

Both instructions take a pointer to a variable argument list as the sole argument.
The size and alignment of the variable argument lists depend on the target used.

\subsection{Phi}

\begin{code}
phiBranch :: Parser (Q.BlockIdent, Q.Value)
phiBranch = do
  n <- ws1 label
  v <- val
  pure (n, v)

phiInstr :: Parser Q.Phi
phiInstr = do
  -- TODO: code duplication with 'assign'
  n <- ws local
  t <- ws (char '=') >> ws1 baseType

  _ <- ws1 (string "phi")
  -- TODO: combinator for sepBy
  p <- Map.fromList <$> sepBy1 (ws phiBranch) (ws $ char ',')
  return $ Q.Phi n t p
\end{code}

First and foremost, phi instructions are NOT necessary when writing a frontend
to QBE.
One solution to avoid having to deal with SSA form is to use stack
allocated variables for all source program variables and perform assignments
and lookups using \nameref{sec:memory} operations. This is what LLVM users
typically do.

Another solution is to simply emit code that is not in SSA form! Contrary to
LLVM, QBE is able to fix up programs not in SSA form without requiring the
boilerplate of loading and storing in memory. For example, the following
program will be correctly compiled by QBE.

\begin{verbatim}
@start
    %x =w copy 100
    %s =w copy 0
@loop
    %s =w add %s, %x
    %x =w sub %x, 1
    jnz %x, @loop, @end
@end
    ret %s
\end{verbatim}

Now, if you want to know what phi instructions are and how to use them in QBE,
you can read the following.

Phi instructions are specific to SSA form. In SSA form, values can only be
assigned once; without phi instructions, this requirement is too strong to
represent many programs. For example, consider the following C program.

\begin{verbatim}
int f(int x) {
    int y;
    if (x)
        y = 1;
    else
        y = 2;
    return y;
}
\end{verbatim}

The variable \texttt{y} is assigned twice; the solution to translate it into SSA
form is to insert a phi instruction.

\begin{verbatim}
@ifstmt
    jnz %x, @ift, @iff
@ift
    jmp @retstmt
@iff
    jmp @retstmt
@retstmt
    %y =w phi @ift 1, @iff 2
    ret %y
\end{verbatim}

Phi instructions return one of their arguments depending on where the control
came from. In the example, \texttt{\%y} is set to 1 if the
\texttt{@@ift} branch is taken, or it is set to 2 otherwise.

An important remark about phi instructions is that QBE assumes that if a
variable is defined by a phi it respects all the SSA invariants.
So it is
critical to not use phi instructions unless you know exactly what you are
doing.

\subsection{Debug Information}

QBE supports the inclusion of debug information. Specifically, it allows
defining from which source file type, data, and function definitions originated.
For this purpose, it provides the \texttt{dbgfile} definition, which receives a
file name (string literal) as its sole argument. Every type, data, and function
definition thereafter is assumed to originate in this file.

\begin{code}
-- TODO: not documented in the QBE BNF.
fileDef :: Parser String
fileDef = do
  _ <- ws1 $ string "dbgfile"
  wsNL1 strLit
\end{code}

Further, instructions within a function can be associated with a specific line
and column number of a previously defined \texttt{dbgfile}. The
\texttt{dbgfile} is referenced by index using the first argument to
\texttt{dbgloc}. The second argument represents the line number, the third
(optional) argument the column number.

\begin{code}
-- TODO: not documented in the QBE BNF.
dbglocInstr :: Parser Q.VolatileInstr
dbglocInstr = do
  _ <- ws1 $ string "dbgloc"
  file <- ws decNumber <* ws (char ',')
  line <- ws decNumber
  col <- optionMaybe (ws (char ',') >> ws decNumber)
  return $ Q.DBGLoc file line col
\end{code}

\end{document}