1% SPDX-FileCopyrightText: 2015-2024 Quentin Carbonneaux <quentin@c9x.me>2% SPDX-FileCopyrightText: 2025-2026 Sören Tempel <soeren+git@soeren-tempel.net>3%4% SPDX-License-Identifier: MIT AND GPL-3.0-only56\documentclass{article}7%include polycode.fmt89%subst blankline = "\\[5mm]"1011% See https://github.com/kosmikus/lhs2tex/issues/5812%format <$> = "\mathbin{\langle\$\rangle}"13%format <&> = "\mathbin{\langle\&\rangle}"14%format <|> = "\mathbin{\langle\:\vline\:\rangle}"15%format <?> = "\mathbin{\langle?\rangle}"16%format <*> = "\mathbin{\langle*\rangle}"17%format <* = "\mathbin{\langle*}"18%format *> = "\mathbin{*\rangle}"1920\long\def\ignore#1{}2122\usepackage{hyperref}23\hypersetup{24 colorlinks = true,25}2627\begin{document}2829\title{QBE Intermediate Language\vspace{-2em}}30\date{}31\maketitle32\frenchspacing3334\ignore{35\begin{code}36module Language.QBE.Parser37 ( skipInitComments,38 dataDef,39 typeDef,40 funcDef,41 fileDef42 )43where4445import Data.Char (chr)46import Data.Word (Word64)47import Data.Functor ((<&>))48import Data.List (singleton)49import Data.Map qualified as Map50import qualified Language.QBE.Types as Q51import Language.QBE.Util (bind, decNumber, octNumber, float)52import Text.ParserCombinators.Parsec53 ( Parser,54 alphaNum,55 anyChar,56 between,57 char,58 choice,59 letter,60 many,61 many1,62 manyTill,63 newline,64 noneOf,65 oneOf,66 optional,67 optionMaybe,68 sepBy,69 sepBy1,70 skipMany,71 skipMany1,72 string,73 try,74 (<?>),75 (<|>),76 )77\end{code}78}7980This an executable description of the81\href{https://c9x.me/compile/doc/il-v1.2.html}{QBE intermediate language},82specified through \href{https://hackage.haskell.org/package/parsec}{Parsec}83parser combinators and generated from a literate Haskell file. The description84is derived from the original QBE IL documentation, licensed under MIT.85Presently, this implementation targets version 1.2 of the QBE intermediate86language and aims to be equivalent with the original specification.8788\section{Basic Concepts}8990The intermediate language (IL) is a higher-level language than the91machine's assembly language. It smoothes most of the92irregularities of the underlying hardware and allows an infinite number93of temporaries to be used. This higher abstraction level lets frontend94programmers focus on language design issues.9596\subsection{Input Files}9798The intermediate language is provided to QBE as text. Usually, one file99is generated per each compilation unit from the frontend input language.100An IL file is a sequence of \nameref{sec:definitions} for101data, functions, and types. Once processed by QBE, the resulting file102can be assembled and linked using a standard toolchain (e.g., GNU103binutils).104105\begin{code}106comment :: Parser ()107comment = skipMany blankNL >> comment' >> skipMany blankNL108 where109 comment' = char '#' >> manyTill anyChar newline110\end{code}111112\ignore{113\begin{code}114skipNoCode :: Parser () -> Parser ()115skipNoCode blankP = try (skipMany1 comment <?> "comments") <|> blankP116\end{code}117}118119Here is a complete "Hello World" IL file which defines a function that120prints to the screen. Since the string is not a first class object (only121the pointer is) it is defined outside the function\textquotesingle s122body. Comments start with a \# character and finish with the end of the123line.124125\begin{verbatim}126data $str = { b "hello world", b 0 }127128export function w $main() {129@start130 # Call the puts function with $str as argument.131 %r =w call $puts(l $str)132 ret 0133}134\end{verbatim}135136If you have read the LLVM language reference, you might recognize the137example above. In comparison, QBE makes a much lighter use of types and138the syntax is terser.139140\subsection{Parser Combinators}141142\ignore{143\begin{code}144bracesNL :: Parser a -> Parser a145bracesNL = between (wsNL $ char '{') (wsNL $ char '}')146147quoted :: Parser a -> Parser a148quoted = let q = char '"' in between q q149150sepByTrail1 :: Parser a -> Parser sep -> Parser [a]151sepByTrail1 p sep = do152 x <- p153 xs <- many (try $ sep >> p)154 _ <- optional sep155 return (x:xs)156157sepByTrail :: Parser a -> Parser sep -> Parser [a]158sepByTrail p sep = sepByTrail1 p sep <|> return []159160parenLst :: Parser a -> Parser [a]161parenLst p = between (ws $ char '(') (char ')') inner162 where163 inner = sepBy (ws p) (ws $ char ',')164165unaryInstr :: (Q.Value -> Q.Instr) -> String -> Parser Q.Instr166unaryInstr conc keyword = do167 _ <- ws (string keyword)168 conc <$> ws val169170binaryInstr :: (Q.Value -> Q.Value -> Q.Instr) -> String -> Parser Q.Instr171binaryInstr conc keyword = do172 _ <- ws (string keyword)173 vfst <- ws val <* ws (char ',')174 conc vfst <$> ws val175176-- Can only appear in data and type definitions and hence allows newlines.177alignAny :: Parser Word64178alignAny = (ws1 (string "align")) >> wsNL decNumber179180-- Returns true if it is signed.181signageChar :: Parser Bool182signageChar = (char 's' <|> char 'u') <&> (== 's')183\end{code}184}185186The original QBE specification defines the syntax using a BNF grammar. In187contrast, this document defines it using Parsec parser combinators. As such,188this specification is less formal but more accurate as the parsing code is189actually executable. Consequently, this specification also captures constructs190omitted in the original specification (e.g., \nameref{sec:identifiers}, or191\nameref{sec:strlit}). Nonetheless, the formal language recognized by these192combinators aims to be equivalent to the one of the BNF grammar.193194\subsection{Identifiers}195\label{sec:identifiers}196197% Ident is not documented in the original QBE specification.198% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n304199200\begin{code}201ident :: Parser String202ident = do203 start <- letter <|> oneOf "._"204 rest <- many (alphaNum <|> oneOf "$._")205 return $ start : rest206\end{code}207208Identifiers for data, types, and functions can start with any ASCII letter or209the special characters \texttt{.} and \texttt{\_}. This initial character can210be followed by a sequence of zero or more alphanumeric characters and the211special characters \texttt{\$}, \texttt{.}, and \texttt{\_}.212213\subsection{Sigils}214215\begin{code}216userDef :: Parser Q.UserIdent217userDef = Q.UserIdent <$> (char ':' >> ident)218219global :: Parser Q.GlobalIdent220global = Q.GlobalIdent <$> (char '$' >> ident)221222local :: Parser Q.LocalIdent223local = Q.LocalIdent <$> (char '%' >> ident)224225label :: Parser Q.BlockIdent226label = Q.BlockIdent <$> (char '@' >> ident)227\end{code}228229The intermediate language makes heavy use of sigils, all user-defined230names are prefixed with a sigil. This is to avoid keyword conflicts, and231also to quickly spot the scope and nature of identifiers.232233\begin{itemize}234 \item \texttt{:} is for user-defined \nameref{sec:aggregate-types}235 \item \texttt{\$} is for globals (represented by a pointer)236 \item \texttt{\%} is for function-scope temporaries237 \item \texttt{@@} is for block labels238\end{itemize}239240\subsection{Spacing}241242\begin{code}243blank :: Parser Char244blank = oneOf "\t " <?> "blank"245246blankNL :: Parser Char247blankNL = oneOf "\n\t " <?> "blank or newline"248\end{code}249250Individual tokens in IL files must be separated by one or more spacing251characters. Both spaces and tabs are recognized as spacing characters.252In data and type definitions, newlines may also be used as spaces to253prevent overly long lines. When exactly one of two consecutive tokens is254a symbol (for example \texttt{,} or \texttt{=} or \texttt{\{}), spacing may be omitted.255256\ignore{257\begin{code}258ws :: Parser a -> Parser a259ws p = p <* skipMany blank260261ws1 :: Parser a -> Parser a262ws1 p = p <* skipMany1 blank263264wsNL :: Parser a -> Parser a265wsNL p = p <* skipNoCode (skipMany blankNL)266267wsNL1 :: Parser a -> Parser a268wsNL1 p = p <* skipNoCode (skipMany1 blankNL)269270-- Only intended to be used to skip comments at the start of a file.271skipInitComments :: Parser ()272skipInitComments = skipNoCode (skipMany blankNL)273\end{code}274}275276\subsection{String Literals}277\label{sec:strlit}278279% The string literal is not documented in the original QBE specification.280% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n287281282\begin{code}283strLit :: Parser String284strLit = concat <$> quoted (many strChr)285 where286 strChr :: Parser [Char]287 strChr = (singleton <$> noneOf "\"\\") <|> escSeq288289 -- TODO: not documnted in the QBE BNF.290 octEsc :: Parser Char291 octEsc = do292 n <- octNumber293 pure $ chr (fromIntegral n)294295 escSeq :: Parser [Char]296 escSeq = try $ do297 esc <- char '\\'298 (singleton <$> octEsc) <|> (anyChar <&> (\c -> [esc, c]))299\end{code}300301Strings are enclosed by double quotes and are, for example, used to specify a302section name as part of the \nameref{sec:linkage} information. Within a string,303a double quote can be escaped using a \texttt{\textbackslash} character. All304escape sequences, including double quote escaping, are passed through as-is to305the generated assembly file.306307\section{Types}308309\subsection{Simple Types}310311The IL makes minimal use of types. By design, the types used are312restricted to what is necessary for unambiguous compilation to machine313code and C interfacing. Unlike LLVM, QBE is not using types as a means314to safety; they are only here for semantic purposes.315316\begin{code}317baseType :: Parser Q.BaseType318baseType = choice319 [ bind "w" Q.Word320 , bind "l" Q.Long321 , bind "s" Q.Single322 , bind "d" Q.Double ]323\end{code}324325The four base types are \texttt{w} (word), \texttt{l} (long), \texttt{s} (single), and \texttt{d}326(double), they stand respectively for 32-bit and 64-bit integers, and32732-bit and 64-bit floating-point numbers. There are no pointer types328available; pointers are typed by an integer type sufficiently wide to329represent all memory addresses (e.g., \texttt{l} on 64-bit architectures).330Temporaries in the IL can only have a base type.331332\begin{code}333extType :: Parser Q.ExtType334extType = (Q.Base <$> baseType)335 <|> bind "b" Q.Byte336 <|> bind "h" Q.HalfWord337\end{code}338339Extended types contain base types plus \texttt{b} (byte) and \texttt{h} (half word),340respectively for 8-bit and 16-bit integers. They are used in \nameref{sec:aggregate-types}341and \nameref{sec:data} definitions.342343For C interfacing, the IL also provides user-defined aggregate types as344well as signed and unsigned variants of the sub-word extended types.345Read more about these types in the \nameref{sec:aggregate-types}346and \nameref{sec:functions} sections.347348\subsection{Subtyping}349\label{sec:subtyping}350351The IL has a minimal subtyping feature, for integer types only. Any352value of type \texttt{l} can be used in a \texttt{w} context. In that case, only the35332 least significant bits of the word value are used.354355Make note that it is the opposite of the usual subtyping on integers (in356C, we can safely use an \texttt{int} where a \texttt{long} is expected). A long value357cannot be used in word context. The rationale is that a word can be358signed or unsigned, so extending it to a long could be done in two ways,359either by zero-extension, or by sign-extension.360361\subsection{Constants and Vals}362\label{sec:constants-and-vals}363364\begin{code}365dynConst :: Parser Q.DynConst366dynConst =367 (Q.Const <$> constant)368 <|> (Q.Thread <$> (key "thread" >> global))369 <|> (Q.Extern <$> try (key "extern" >> global))370 <|> (Q.ExternThread <$> (key "extern" >> key "thread" >> global))371 <?> "dynconst"372 where373 key s = ws1 $ string s374\end{code}375376Constants come in two kinds: compile-time constants and dynamic377constants. Dynamic constants include compile-time constants and other378symbol variants that are only known at program-load time or execution379time. Consequently, dynamic constants can only occur in function bodies.380381When the \texttt{extern} keyword prefixes a symbol name, the symbol is382accessed indirectly through a table edited by the dynamic linker (e.g.,383GOT/PLT). This enables PIE/PIC code generation. When \texttt{extern} is384combined with \texttt{thread}, the symbol is accessed using the385initial-exec TLS model, suitable for thread-local variables defined in386shared objects available at startup time (i.e., not loaded through387dlopen).388389The representation of integers is two's complement.390Floating-point numbers are represented using the single-precision and391double-precision formats of the IEEE 754 standard.392393\begin{code}394constant :: Parser Q.Const395constant =396 (Q.Number <$> decNumber)397 <|> (Q.SFP <$> sfp)398 <|> (Q.DFP <$> dfp)399 <|> (Q.Global <$> global)400 <?> "const"401 where402 sfp = string "s_" >> float403 dfp = string "d_" >> float404\end{code}405406Constants specify a sequence of bits and are untyped. They are always407parsed as 64-bit blobs. Depending on the context surrounding a constant,408only some of its bits are used. For example, in the program below, the409two variables defined have the same value since the first operand of the410subtraction is a word (32-bit) context.411412\begin{verbatim}413%x =w sub -1, 0 %y =w sub 4294967295, 0414\end{verbatim}415416Because specifying floating-point constants by their bits makes the code417less readable, syntactic sugar is provided to express them. Standard418scientific notation is prefixed with \texttt{s\_} and \texttt{d\_} for single and419double precision numbers respectively. Once again, the following example420defines twice the same double-precision constant.421422\begin{verbatim}423%x =d add d_0, d_-1424%y =d add d_0, -4616189618054758400425\end{verbatim}426427Global symbols can also be used directly as constants; they will be428resolved and turned into actual numeric constants by the linker.429430When the \texttt{thread} keyword prefixes a symbol name, the431symbol\textquotesingle s numeric value is resolved at runtime in the432thread-local storage.433434\begin{code}435val :: Parser Q.Value436val =437 (Q.VConst <$> dynConst)438 <|> (Q.VLocal <$> local)439 <?> "val"440\end{code}441442Vals are used as arguments in regular, phi, and jump instructions within443function definitions. They are either constants or function-scope444temporaries.445446\subsection{Linkage}447\label{sec:linkage}448449\begin{code}450linkage :: Parser Q.Linkage451linkage =452 wsNL (bind "export" Q.LExport)453 <|> wsNL (bind "thread" Q.LThread)454 <|> do455 _ <- ws1 $ string "section"456 (try secWithFlags) <|> sec457 where458 sec :: Parser Q.Linkage459 sec = wsNL strLit <&> (`Q.LSection` Nothing)460461 secWithFlags :: Parser Q.Linkage462 secWithFlags = do463 n <- ws1 strLit464 wsNL strLit <&> Q.LSection n . Just465\end{code}466467Function and data definitions (see below) can specify linkage468information to be passed to the assembler and eventually to the linker.469470The \texttt{export} linkage flag marks the defined item as visible outside the471current file\textquotesingle s scope. If absent, the symbol can only be472referred to locally. Functions compiled by QBE and called from C need to473be exported.474475The \texttt{thread} linkage flag can only qualify data definitions. It mandates476that the object defined is stored in thread-local storage. Each time a477runtime thread starts, the supporting platform runtime is in charge of478making a new copy of the object for the fresh thread. Objects in479thread-local storage must be accessed using the \texttt{thread \$IDENT} syntax,480as specified in the \nameref{sec:constants-and-vals} section.481482A \texttt{section} flag can be specified to tell the linker to put the defined483item in a certain section. The use of the section flag is platform484dependent and we refer the user to the documentation of their assembler485and linker for relevant information.486487\begin{verbatim}488section ".init_array" data $.init.f = { l $f }489\end{verbatim}490491The section flag can be used to add function pointers to a global492initialization list, as depicted above. Note that some platforms provide493a BSS section that can be used to minimize the footprint of uniformly494zeroed data. When this section is available, QBE will automatically make495use of it and no section flag is required.496497The section and export linkage flags should each appear at most once in498a definition. If multiple occurrences are present, QBE is free to use499any.500501\subsection{Definitions}502\label{sec:definitions}503504Definitions are the essential components of an IL file. They can define505three types of objects: aggregate types, data, and functions. Aggregate506types are never exported and do not compile to any code. Data and507function definitions have file scope and are mutually recursive (even508across IL files). Their visibility can be controlled using linkage509flags.510511\subsubsection{Aggregate Types}512\label{sec:aggregate-types}513514\begin{code}515typeDef :: Parser Q.TypeDef516typeDef = do517 _ <- wsNL1 (string "type")518 i <- wsNL1 userDef519 _ <- wsNL1 (char '=')520 a <- optionMaybe alignAny521 bracesNL (opaqueType <|> unionType <|> regularType) <&> Q.TypeDef i a522\end{code}523524Aggregate type definitions start with the \texttt{type} keyword. They have file525scope, but types must be defined before being referenced. The inner526structure of a type is expressed by a comma-separated list of fields.527528\begin{code}529subType :: Parser Q.SubType530subType =531 (Q.SExtType <$> extType)532 <|> (Q.SUserDef <$> userDef)533534field :: Parser Q.Field535field = do536 -- TODO: newline is required if there is a number argument537 f <- wsNL subType538 s <- ws $ optionMaybe decNumber539 pure (f, s)540541fields :: Bool -> Parser [Q.Field]542fields allowEmpty =543 (if allowEmpty then sepByTrail else sepByTrail1) field (wsNL $ char ',')544\end{code}545546A field consists of a subtype, either an extended type or a user-defined type,547and an optional number expressing the value of this field. In case many items548of the same type are sequenced (like in a C array), the shorter array syntax549can be used.550551\begin{code}552regularType :: Parser Q.AggType553regularType = Q.ARegular <$> fields True554\end{code}555556Three different kinds of aggregate types are presentl ysupported: regular557types, union types and opaque types. The fields of regular types will be558packed. By default, the alignment of an aggregate type is the maximum alignment559of its members. The alignment can be explicitly specified by the programmer.560561\begin{code}562unionType :: Parser Q.AggType563unionType = Q.AUnion <$> many1 (wsNL unionType')564 where565 unionType' :: Parser [Q.Field]566 unionType' = bracesNL $ fields False567\end{code}568569Union types allow the same chunk of memory to be used with different layouts. They are defined by enclosing multiple regular aggregate type bodies in a pair of curly braces. Size and alignment of union types are set to the maximum size and alignment of each variation or, in the case of alignment, can be explicitly specified.570571\begin{code}572opaqueType :: Parser Q.AggType573opaqueType = Q.AOpaque <$> wsNL decNumber574\end{code}575576Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined simply by enclosing their size between curly braces.577578\subsubsection{Data}579\label{sec:data}580581\begin{code}582dataDef :: Parser Q.DataDef583dataDef = do584 link <- many linkage585 name <- wsNL1 (string "data") >> wsNL global586 _ <- wsNL (char '=')587 alignment <- optionMaybe alignAny588 bracesNL dataObjs <&> Q.DataDef link name alignment589 where590 -- TODO: sepByTrail is not documented in the QBE BNF.591 dataObjs = sepByTrail dataObj (wsNL $ char ',')592\end{code}593594Data definitions express objects that will be emitted in the compiled595file. Their visibility and location in the compiled artifact are596controlled with linkage flags described in the \nameref{sec:linkage}597section.598599They define a global identifier (starting with the sigil \texttt{\$}), that600will contain a pointer to the object specified by the definition.601602\begin{code}603dataObj :: Parser Q.DataObj604dataObj =605 (Q.OZeroFill <$> (wsNL1 (char 'z') >> wsNL decNumber))606 <|> do607 t <- wsNL1 extType608 i <- many1 (wsNL dataItem)609 return $ Q.OItem t i610\end{code}611612Objects are described by a sequence of fields that start with a type613letter. This letter can either be an extended type, or the \texttt{z} letter.614If the letter used is an extended type, the data item following615specifies the bits to be stored in the field.616617\begin{code}618dataItem :: Parser Q.DataItem619dataItem =620 (Q.DString <$> strLit)621 <|> try622 ( do623 i <- ws global624 off <- (ws $ char '+') >> ws decNumber625 return $ Q.DSymOff i off626 )627 <|> (Q.DConst <$> constant)628\end{code}629630Within each object, several items can be defined. When several data items631follow a letter, they initialize multiple fields of the same size.632633\begin{code}634allocSize :: Parser Q.AllocSize635allocSize =636 choice637 [ bind "4" Q.AllocWord,638 bind "8" Q.AllocLong,639 bind "16" Q.AllocLongLong640 ]641\end{code}642643The members of a struct will be packed. This means that padding has to644be emitted by the frontend when necessary. Alignment of the whole data645objects can be manually specified, and when no alignment is provided,646the maximum alignment from the platform is used.647648When the \texttt{z} letter is used the number following indicates the size of649the field; the contents of the field are zero initialized. It can be650used to add padding between fields or zero-initialize big arrays.651652\subsubsection{Functions}653\label{sec:functions}654655\begin{code}656funcDef :: Parser Q.FuncDef657funcDef = do658 link <- many linkage659 _ <- ws1 (string "function")660 retTy <- optionMaybe (ws1 abity)661 name <- ws global662 args <- wsNL params663 body <- between (wsNL1 $ char '{') (wsNL $ char '}') $ many1 block664665 case (Q.insertJumps body) of666 Nothing -> fail $ "invalid fallthrough in " ++ show name667 Just bl -> return $ Q.FuncDef link name retTy args bl668\end{code}669670Function definitions contain the actual code to emit in the compiled671file. They define a global symbol that contains a pointer to the672function code. This pointer can be used in \texttt{call} instructions or stored673in memory.674675\begin{code}676subWordType :: Parser Q.SubWordType677subWordType = choice678 [ try $ bind "sb" Q.SignedByte679 , try $ bind "ub" Q.UnsignedByte680 , bind "sh" Q.SignedHalf681 , bind "uh" Q.UnsignedHalf ]682683abity :: Parser Q.Abity684abity = try (Q.ASubWordType <$> subWordType)685 <|> (Q.ABase <$> baseType)686 <|> (Q.AUserDef <$> userDef)687\end{code}688689The type given right before the function name is the return type of the690function. All return values of this function must have this return type.691If the return type is missing, the function must not return any value.692693\begin{code}694param :: Parser Q.FuncParam695param = (Q.Env <$> (ws1 (string "env") >> local))696 <|> (string "..." >> pure Q.Variadic)697 <|> do698 ty <- ws1 abity699 Q.Regular ty <$> local700701params :: Parser [Q.FuncParam]702params = parenLst param703\end{code}704705The parameter list is a comma separated list of temporary names prefixed706by types. The types are used to correctly implement C compatibility.707When an argument has an aggregate type, a pointer to the aggregate is708passed by thea caller. In the example below, we have to use a load709instruction to get the value of the first (and only) member of the710struct.711712\begin{verbatim}713type :one = { w }714715function w $getone(:one %p) {716@start717 %val =w loadw %p718 ret %val719}720\end{verbatim}721722If a function accepts or returns values that are smaller than a word,723such as \texttt{signed char} or \texttt{unsigned short} in C, one of the sub-word type724must be used. The sub-word types \texttt{sb}, \texttt{ub}, \texttt{sh}, and \texttt{uh} stand,725respectively, for signed and unsigned 8-bit values, and signed and726unsigned 16-bit values. Parameters associated with a sub-word type of727bit width N only have their N least significant bits set and have base728type \texttt{w}. For example, the function729730\begin{verbatim}731function w $addbyte(w %a, sb %b) {732@start733 %bw =w extsb %b734 %val =w add %a, %bw735 ret %val736}737\end{verbatim}738739needs to sign-extend its second argument before the addition. Dually,740return values with sub-word types do not need to be sign or zero741extended.742743If the parameter list ends with \texttt{...}, the function is a variadic744function: it can accept a variable number of arguments. To access the745extra arguments provided by the caller, use the \texttt{vastart} and \texttt{vaarg}746instructions described in the \nameref{sec:variadic} section.747748Optionally, the parameter list can start with an environment parameter749\texttt{env \%e}. This special parameter is a 64-bit integer temporary (i.e.,750of type \texttt{l}). If the function does not use its environment parameter,751callers can safely omit it. This parameter is invisible to a C caller:752for example, the function753754\begin{verbatim}755export function w $add(env %e, w %a, w %b) {756@start757 %c =w add %a, %b758 ret %c759}760\end{verbatim}761762must be given the C prototype \texttt{int add(int, int)}. The intended use of763this feature is to pass the environment pointer of closures while764retaining a very good compatibility with C. The \nameref{sec:call}765section explains how to pass an environment parameter.766767Since global symbols are defined mutually recursive, there is no need768for function declarations: a function can be referenced before its769definition. Similarly, functions from other modules can be used without770previous declaration. All the type information necessary to compile a771call is in the instruction itself.772773The syntax and semantics for the body of functions are described in the774\nameref{sec:control} section.775776\section{Control}777\label{sec:control}778779The IL represents programs as textual transcriptions of control flow780graphs. The control flow is serialized as a sequence of blocks of781straight-line code which are connected using jump instructions.782783\subsection{Blocks}784\label{sec:blocks}785786\begin{code}787block :: Parser Q.Block'788block = do789 l <- wsNL1 label790 p <- many (wsNL1 $ try phiInstr)791 s <- many (wsNL1 statement)792 Q.Block' l p s <$> (optionMaybe $ wsNL1 jumpInstr)793\end{code}794795All blocks have a name that is specified by a label at their beginning.796Then follows a sequence of instructions that have "fall-through" flow.797Finally one jump terminates the block. The jump can either transfer798control to another block of the same function or return; jumps are799described further below.800801The first block in a function must not be the target of any jump in the802program. If a jump to the function start is needed, the frontend must803insert an empty prelude block at the beginning of the function.804805When one block jumps to the next block in the IL file, it is not806necessary to write the jump instruction, it will be automatically added807by the parser. For example the start block in the example below jumps808directly to the loop block.809810\subsection{Jumps}811\label{sec:jumps}812813\begin{code}814jumpInstr :: Parser Q.JumpInstr815jumpInstr = (string "hlt" >> pure Q.Halt)816 -- TODO: Return requires a space if there is an optionMaybe817 <|> Q.Return <$> ((ws $ string "ret") >> optionMaybe val)818 <|> try (Q.Jump <$> ((ws1 $ string "jmp") >> label))819 <|> do820 _ <- ws1 $ string "jnz"821 v <- ws val <* ws (char ',')822 l1 <- ws label <* ws (char ',')823 l2 <- ws label824 return $ Q.Jnz v l1 l2825\end{code}826827A jump instruction ends every block and transfers the control to another828program location. The target of a jump must never be the first block in829a function. The three kinds of jumps available are described in the830following list.831832\begin{enumerate}833 \item \textbf{Unconditional jump.} Jumps to another block of the same function.834 \item \textbf{Conditional jump.} When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type; because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0.835 \item \textbf{Function return.} Terminates the execution of the current function, optionally returning a value to the caller. The value returned must be of the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used.836 \item \textbf{Program termination.} Terminates the execution of the program with a target-dependent error. This instruction can be used when it is expected that the execution never reaches the end of the block it closes; for example, after having called a function such as \texttt{exit()}.837\end{enumerate}838839\section{Instructions}840\label{sec:instructions}841842\begin{code}843instr :: Parser Q.Instr844instr =845 choice846 [ try $ binaryInstr Q.Add "add",847 try $ binaryInstr Q.Sub "sub",848 try $ binaryInstr Q.Mul "mul",849 try $ binaryInstr Q.Div "div",850 try $ binaryInstr Q.URem "urem",851 try $ binaryInstr Q.Rem "rem",852 try $ binaryInstr Q.UDiv "udiv",853 try $ binaryInstr Q.Or "or",854 try $ binaryInstr Q.Xor "xor",855 try $ binaryInstr Q.And "and",856 try $ binaryInstr Q.Sar "sar",857 try $ binaryInstr Q.Shr "shr",858 try $ binaryInstr Q.Shl "shl",859 try $ unaryInstr Q.Neg "neg",860 try $ unaryInstr Q.Cast "cast",861 try $ unaryInstr Q.Copy "copy",862 try $ unaryInstr Q.VAArg "vaarg",863 try $ loadInstr,864 try $ allocInstr,865 try $ compareInstr,866 try $ extInstr,867 try $ truncInstr,868 try $ fromFloatInstr,869 try $ toFloatInstr870 ]871\end{code}872873Instructions are the smallest piece of code in the IL, they form the body of874\nameref{sec:blocks}. This specification distinguishes instructions and875volatile instructions, the latter do not return a value. For the former, the IL876uses a three-address code, which means that one instruction computes an877operation between two operands and assigns the result to a third one.878879\begin{code}880assign :: Parser Q.Statement881assign = do882 n <- ws local883 t <- ws (char '=') >> ws1 baseType884 Q.Assign n t <$> instr885886volatileInstr :: Parser Q.Statement887volatileInstr =888 Q.Volatile <$>889 (storeInstr <|> blitInstr <|> vastartInstr <|> dbglocInstr)890891-- TODO: Not documented in the QBE BNF.892statement :: Parser Q.Statement893statement = (try callInstr) <|> assign <|> volatileInstr894\end{code}895896An instruction has both a name and a return type, this return type is a base897type that defines the size of the instruction's result. The type of the898arguments can be unambiguously inferred using the instruction name and the899return type. For example, for all arithmetic instructions, the type of the900arguments is the same as the return type. The two additions below are valid if901\texttt{\%y} is a word or a long (because of \nameref{sec:subtyping}).902903\begin{verbatim}904%x =w add 0, %y905%z =w add %x, %x906\end{verbatim}907908Some instructions, like comparisons and memory loads have operand types909that differ from their return types. For instance, two floating points910can be compared to give a word result (0 if the comparison succeeds, 1911if it fails).912913\begin{verbatim}914%c =w cgts %a, %b915\end{verbatim}916917In the example above, both operands have to have single type. This is918made explicit by the instruction suffix.919920\subsection{Arithmetic and Bits}921922\begin{quote}923\begin{itemize}924\item \texttt{add}, \texttt{sub}, \texttt{div}, \texttt{mul}925\item \texttt{neg}926\item \texttt{udiv}, \texttt{rem}, \texttt{urem}927\item \texttt{or}, \texttt{xor}, \texttt{and}928\item \texttt{sar}, \texttt{shr}, \texttt{shl}929\end{itemize}930\end{quote}931932The base arithmetic instructions in the first bullet are available for933all types, integers and floating points.934935When \texttt{div} is used with word or long return type, the arguments are936treated as signed. The unsigned integral division is available as \texttt{udiv}937instruction. When the result of a division is not an integer, it is truncated938towards zero.939940The signed and unsigned remainder operations are available as \texttt{rem} and941\texttt{urem}. The sign of the remainder is the same as the one of the942dividend. Its magnitude is smaller than the divisor one. These two instructions943and \texttt{udiv} are only available with integer arguments and result.944945Bitwise OR, AND, and XOR operations are available for both integer946types. Logical operations of typical programming languages can be947implemented using \nameref{sec:comparisions} and \nameref{sec:jumps}.948949Shift instructions \texttt{sar}, \texttt{shr}, and \texttt{shl}, shift right or950left their first operand by the amount from the second operand. The shifting951amount is taken modulo the size of the result type. Shifting right can either952preserve the sign of the value (using \texttt{sar}), or fill the newly freed953bits with zeroes (using \texttt{shr}). Shifting left always fills the freed954bits with zeroes.955956Remark that an arithmetic shift right (\texttt{sar}) is only equivalent to a957division by a power of two for non-negative numbers. This is because the shift958right "truncates" towards minus infinity, while the division truncates towards959zero.960961\subsection{Memory}962\label{sec:memory}963964The following sections discuss instructions for interacting with values stored in memory.965966\subsubsection{Store instructions}967968\begin{code}969storeInstr :: Parser Q.VolatileInstr970storeInstr = do971 t <- string "store" >> ws1 extType972 v <- ws val973 _ <- ws $ char ','974 ws val <&> Q.Store t v975\end{code}976977Store instructions exist to store a value of any base type and any extended978type. Since halfwords and bytes are not first class in the IL, \texttt{storeh}979and \texttt{storeb} take a word as argument. Only the first 16 or 8 bits of980this word will be stored in memory at the address specified in the second981argument.982983\subsubsection{Load instructions}984985\begin{code}986loadInstr :: Parser Q.Instr987loadInstr = do988 _ <- string "load"989 t <- ws1 $ choice990 [ try $ bind "sw" (Q.LBase Q.Word),991 try $ bind "uw" (Q.LBase Q.Word),992 try $ Q.LSubWord <$> subWordType,993 Q.LBase <$> baseType994 ]995 ws val <&> Q.Load t996\end{code}997998For types smaller than long, two variants of the load instruction are999available: one will sign extend the loaded value, while the other will zero1000extend it. Note that all loads smaller than long can load to either a long or a1001word.10021003The two instructions \texttt{loadsw} and \texttt{loaduw} have the same effect1004when they are used to define a word temporary. A \texttt{loadw} instruction is1005provided as syntactic sugar for \texttt{loadsw} to make explicit that the1006extension mechanism used is irrelevant.10071008\subsubsection{Blits}10091010\begin{code}1011blitInstr :: Parser Q.VolatileInstr1012blitInstr = do1013 v1 <- (ws1 $ string "blit") >> ws val <* (ws $ char ',')1014 v2 <- ws val <* (ws $ char ',')1015 nb <- decNumber1016 return $ Q.Blit v1 v2 nb1017\end{code}10181019The blit instruction copies in-memory data from its first address argument to1020its second address argument. The third argument is the number of bytes to copy.1021The source and destination spans are required to be either non-overlapping, or1022fully overlapping (source address identical to the destination address). The1023byte count argument must be a nonnegative numeric constant; it cannot be a1024temporary.10251026One blit instruction may generate a number of instructions proportional to its1027byte count argument, consequently, it is recommended to keep this argument1028relatively small. If large copies are necessary, it is preferable that1029frontends generate calls to a supporting \texttt{memcpy} function.10301031\subsubsection{Stack Allocation}10321033\begin{code}1034allocInstr :: Parser Q.Instr1035allocInstr = do1036 siz <- (ws $ string "alloc") >> (ws1 allocSize)1037 val <&> Q.Alloc siz1038\end{code}10391040These instructions allocate a chunk of memory on the stack. The number ending1041the instruction name is the alignment required for the allocated slot. QBE will1042make sure that the returned address is a multiple of that alignment value.10431044Stack allocation instructions are used, for example, when compiling the C local1045variables, because their address can be taken. When compiling Fortran,1046temporaries can be used directly instead, because it is illegal to take the1047address of a variable.10481049\subsection{Comparisons}1050\label{sec:comparisions}10511052\begin{code}1053compareInstr :: Parser Q.Instr1054compareInstr = do1055 _ <- char 'c'1056 (try intCompare) <|> floatCompare10571058compareArgs :: Parser (Q.Value, Q.Value)1059compareArgs = do1060 lhs <- ws val <* ws (char ',')1061 rhs <- ws val1062 pure (lhs, rhs)10631064intCompare :: Parser Q.Instr1065intCompare = do1066 op <- compareIntOp1067 ty <- ws1 intArg10681069 (lhs, rhs) <- compareArgs1070 pure $ Q.CompareInt ty op lhs rhs10711072floatCompare :: Parser Q.Instr1073floatCompare = do1074 op <- compareFloatOp1075 ty <- ws1 floatArg10761077 (lhs, rhs) <- compareArgs1078 pure $ Q.CompareFloat ty op lhs rhs1079\end{code}10801081Comparison instructions return an integer value (either a word or a long), and1082compare values of arbitrary types. The returned value is 1 if the two operands1083satisfy the comparison relation, or 0 otherwise. The names of comparisons1084respect a standard naming scheme in three parts:10851086\begin{enumerate}1087 \item All comparisons start with the letter \texttt{c}.1088 \item Then comes a comparison type.1089 \item Finally, the instruction name is terminated with a basic type suffix precising the type of the operands to be compared.1090\end{enumerate}10911092The following instruction are available for integer comparisons:10931094\begin{code}1095compareIntOp :: Parser Q.IntCmpOp1096compareIntOp = choice1097 [ bind "eq" Q.IEq1098 , bind "ne" Q.INe1099 , try $ bind "sle" Q.ISle1100 , try $ bind "slt" Q.ISlt1101 , try $ bind "sge" Q.ISge1102 , try $ bind "sgt" Q.ISgt1103 , try $ bind "ule" Q.IUle1104 , try $ bind "ult" Q.IUlt1105 , try $ bind "uge" Q.IUge1106 , try $ bind "ugt" Q.IUgt ]1107\end{code}11081109For floating point comparisons use one of these instructions:11101111\begin{code}1112compareFloatOp :: Parser Q.FloatCmpOp1113compareFloatOp = choice1114 [ bind "eq" Q.FEq1115 , bind "ne" Q.FNe1116 , try $ bind "le" Q.FLe1117 , bind "lt" Q.FLt1118 , try $ bind "ge" Q.FGe1119 , bind "gt" Q.FGt1120 , bind "o" Q.FOrd1121 , bind "uo" Q.FUnord ]1122\end{code}11231124For example, \texttt{cod} compares two double-precision floating point numbers1125and returns 1 if the two floating points are not NaNs, or 0 otherwise. The1126\texttt{csltw} instruction compares two words representing signed numbers and1127returns 1 when the first argument is smaller than the second one.11281129\subsection{Conversions}11301131Conversion operations change the representation of a value, possibly modifying1132it if the target type cannot hold the value of the source type. Conversions can1133extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or1134convert a floating point into an integer and vice versa.11351136\begin{code}1137extInstr :: Parser Q.Instr1138extInstr = do1139 _ <- string "ext"1140 ty <- ws1 extArg1141 ws val <&> Q.Ext ty1142 where1143 extArg :: Parser Q.ExtArg1144 extArg = try (Q.ExtSubWord <$> subWordType)1145 <|> try (bind "sw" Q.ExtSignedWord)1146 <|> bind "s" Q.ExtSingle1147 <|> bind "uw" Q.ExtUnsignedWord1148\end{code}11491150Extending the precision of a temporary is done using the \texttt{ext} family of1151instructions. Because QBE types do not specify the signedness (like in LLVM),1152extension instructions exist to sign-extend and zero-extend a value. For1153example, \texttt{extsb} takes a word argument and sign-extends the 81154least-significant bits to a full word or long, depending on the return type.11551156\begin{code}1157truncInstr :: Parser Q.Instr1158truncInstr = do1159 _ <- ws1 $ string "truncd"1160 ws val <&> Q.TruncDouble1161\end{code}11621163The instructions \texttt{exts} (extend single) and \texttt{truncd} (truncate1164double) are provided to change the precision of a floating point value. When1165the double argument of truncd cannot be represented as a single-precision1166floating point, it is truncated towards zero.11671168\begin{code}1169floatArg :: Parser Q.FloatArg1170floatArg = bind "d" Q.FDouble <|> bind "s" Q.FSingle11711172fromFloatInstr :: Parser Q.Instr1173fromFloatInstr = do1174 arg <- floatArg <* string "to"1175 isSigned <- signageChar1176 _ <- ws1 $ char 'i'1177 ws val <&> Q.FloatToInt arg isSigned11781179intArg :: Parser Q.IntArg1180intArg = bind "w" Q.IWord <|> bind "l" Q.ILong11811182toFloatInstr :: Parser Q.Instr1183toFloatInstr = do1184 isSigned <- signageChar1185 arg <- intArg1186 _ <- ws1 $ string "tof"1187 ws val <&> Q.IntToFloat arg isSigned1188\end{code}11891190Converting between signed integers and floating points is done using1191\texttt{stosi} (single to signed integer), \texttt{stoui} (single to unsigned1192integer), \texttt{dtosi} (double to signed integer), \texttt{dtoui} (double to1193unsigned integer), \texttt{swtof} (signed word to float), \texttt{uwtof}1194(unsigned word to float), \texttt{sltof} (signed long to float) and1195\texttt{ultof} (unsigned long to float).11961197\subsection{Cast and Copy}11981199The \texttt{cast} and \texttt{copy} instructions return the bits of their1200argument verbatim. However a cast will change an integer into a floating point1201of the same width and vice versa.12021203Casts can be used to make bitwise operations on the representation of floating1204point numbers. For example the following program will compute the opposite of1205the single-precision floating point number \texttt{\%f} into \texttt{\%rs}.12061207\begin{verbatim}1208%b0 =w cast %f1209%b1 =w xor 2147483648, %b0 # flip the msb1210%rs =s cast %b11211\end{verbatim}12121213\subsection{Call}1214\label{sec:call}12151216\begin{code}1217-- TODO: Code duplication with 'param'.1218callArg :: Parser Q.FuncArg1219callArg = (Q.ArgEnv <$> (ws1 (string "env") >> val))1220 <|> (string "..." >> pure Q.ArgVar)1221 <|> do1222 ty <- ws1 abity1223 Q.ArgReg ty <$> val12241225callArgs :: Parser [Q.FuncArg]1226callArgs = parenLst callArg12271228callInstr :: Parser Q.Statement1229callInstr = do1230 retValue <- optionMaybe $ do1231 i <- ws local <* ws (char '=')1232 a <- ws1 abity1233 return (i, a)1234 toCall <- ws1 (string "call") >> ws val1235 fnArgs <- callArgs1236 return $ Q.Call retValue toCall fnArgs1237\end{code}12381239The call instruction is special in several ways. It is not a three-address1240instruction and requires the type of all its arguments to be given. Also, the1241return type can be either a base type or an aggregate type. These specifics are1242required to compile calls with C compatibility (i.e., to respect the ABI).12431244When an aggregate type is used as argument type or return type, the value1245respectively passed or returned needs to be a pointer to a memory location1246holding the value. This is because aggregate types are not first-class1247citizens of the IL.12481249Sub-word types are used for arguments and return values of width less than a1250word. Details on these types are presented in the \nameref{sec:functions} section.1251Arguments with sub-word types need not be sign or zero extended according to1252their type. Calls with a sub-word return type define a temporary of base type1253\texttt{w} with its most significant bits unspecified.12541255Unless the called function does not return a value, a return temporary must be1256specified, even if it is never used afterwards.12571258An environment parameter can be passed as first argument using the \texttt{env}1259keyword. The passed value must be a 64-bit integer. If the called function does1260not expect an environment parameter, it will be safely discarded. See the1261\nameref{sec:functions} section for more information about environment1262parameters.12631264When the called function is variadic, there must be a \texttt{...} marker1265separating the named and variadic arguments.12661267\subsection{Variadic}1268\label{sec:variadic}12691270\begin{code}1271vastartInstr :: Parser Q.VolatileInstr1272vastartInstr = do1273 _ <- ws1 (string "vastart")1274 Q.VAStart <$> ws val1275\end{code}12761277The \texttt{vastart} and \texttt{vaarg} instructions provide a portable way to1278access the extra parameters of a variadic function.12791280\begin{enumerate}1281 \item \texttt{vastart} -- \texttt{(m)}1282 \item \texttt{vaarg} -- \texttt{T(mmmm)}1283\end{enumerate}12841285The \texttt{vastart} instruction initializes a variable argument list used to1286access the extra parameters of the enclosing variadic function. It is safe to1287call it multiple times.12881289The \texttt{vaarg} instruction fetches the next argument from a variable1290argument list. It is currently limited to fetching arguments that have a base1291type. This instruction is essentially effectful: calling it twice in a row will1292return two consecutive arguments from the argument list.12931294Both instructions take a pointer to a variable argument list as the sole argument.1295The size and alignment of the variable argument lists depends on the target used.12961297\subsection{Phi}12981299\begin{code}1300phiBranch :: Parser (Q.BlockIdent, Q.Value)1301phiBranch = do1302 n <- ws1 label1303 v <- val1304 pure (n, v)13051306phiInstr :: Parser Q.Phi1307phiInstr = do1308 -- TODO: code duplication with 'assign'1309 n <- ws local1310 t <- ws (char '=') >> ws1 baseType13111312 _ <- ws1 (string "phi")1313 -- TODO: combinator for sepBy1314 p <- Map.fromList <$> sepBy1 (ws phiBranch) (ws $ char ',')1315 return $ Q.Phi n t p1316\end{code}13171318First and foremost, phi instructions are NOT necessary when writing a frontend1319to QBE. One solution to avoid having to deal with SSA form is to use stack1320allocated variables for all source program variables and perform assignments1321and lookups using \nameref{sec:memory} operations. This is what LLVM users1322typically do.13231324Another solution is to simply emit code that is not in SSA form! Contrary to1325LLVM, QBE is able to fixup programs not in SSA form without requiring the1326boilerplate of loading and storing in memory. For example, the following1327program will be correctly compiled by QBE.13281329\begin{verbatim}1330@start1331 %x =w copy 1001332 %s =w copy 01333@loop1334 %s =w add %s, %x1335 %x =w sub %x, 11336 jnz %x, @loop, @end1337@end1338 ret %s1339\end{verbatim}13401341Now, if you want to know what phi instructions are and how to use them in QBE,1342you can read the following.13431344Phi instructions are specific to SSA form. In SSA form values can only be1345assigned once, without phi instructions, this requirement is too strong to1346represent many programs. For example consider the following C program.13471348\begin{verbatim}1349int f(int x) {1350 int y;1351 if (x)1352 y = 1;1353 else1354 y = 2;1355 return y;1356}1357\end{verbatim}13581359The variable \texttt{y} is assigned twice, the solution to translate it in SSA1360form is to insert a phi instruction.13611362\begin{verbatim}1363@ifstmt1364 jnz %x, @ift, @iff1365@ift1366 jmp @retstmt1367@iff1368 jmp @retstmt1369@retstmt1370 %y =w phi @ift 1, @iff 21371 ret %y1372\end{verbatim}13731374Phi instructions return one of their arguments depending on where the control1375came from. In the example, \texttt{\%y} is set to 1 if the1376\texttt{\textbackslash{}ift} branch is taken, or it is set to 2 otherwise.13771378An important remark about phi instructions is that QBE assumes that if a1379variable is defined by a phi it respects all the SSA invariants. So it is1380critical to not use phi instructions unless you know exactly what you are1381doing.13821383\subsection{Debug Information}13841385QBE supports the inclusion of debug information. Specifically, it allows1386defining from which source file type, data, and function definitions originated.1387For this purpose, it provides the \texttt{dbgfile} definition, which receives a1388file name (string literal) as its sole argument. Every type, data and function1389definition thereafter are assumed to originate in this file.13901391\begin{code}1392-- TODO: not documnted in the QBE BNF.1393fileDef :: Parser String1394fileDef = do1395 _ <- ws1 $ string "dbgfile"1396 wsNL1 strLit1397\end{code}13981399Further, instructions within a function can be associated with a specific line1400and column number of a previously defined \texttt{dbgfile}. The1401\texttt{dbgfile} is referenced by index using the first argument to1402\texttt{dbgloc}. The second argument represents the line number, the third1403(optional) argument the column number.14041405\begin{code}1406-- TODO: not documnted in the QBE BNF.1407dbglocInstr :: Parser Q.VolatileInstr1408dbglocInstr = do1409 _ <- ws1 $ string "dbgloc"1410 file <- ws decNumber <* ws (char ',')1411 line <- ws decNumber1412 col <- optionMaybe (ws (char ',') >> ws decNumber)1413 return $ Q.DBGLoc file line col1414\end{code}14151416\end{document}