A software analysis framework built around the QBE intermediate language
git clone https://git.8pit.net/quebex.git
1% SPDX-FileCopyrightText: 2015-2024 Quentin Carbonneaux <quentin@c9x.me> 2% SPDX-FileCopyrightText: 2025 Sören Tempel <soeren+git@soeren-tempel.net> 3% 4% SPDX-License-Identifier: MIT AND GPL-3.0-only 5 6\documentclass{article} 7%include polycode.fmt 8 9%subst blankline = "\\[5mm]" 10 11% See https://github.com/kosmikus/lhs2tex/issues/58 12%format <$> = "\mathbin{\langle\$\rangle}" 13%format <&> = "\mathbin{\langle\&\rangle}" 14%format <|> = "\mathbin{\langle\:\vline\:\rangle}" 15%format <?> = "\mathbin{\langle?\rangle}" 16%format <*> = "\mathbin{\langle*\rangle}" 17%format <* = "\mathbin{\langle*}" 18%format *> = "\mathbin{*\rangle}" 19 20\long\def\ignore#1{} 21 22\usepackage{hyperref} 23\hypersetup{ 24 colorlinks = true, 25} 26 27\begin{document} 28 29\title{QBE Intermediate Language\vspace{-2em}} 30\date{} 31\maketitle 32\frenchspacing 33 34\ignore{ 35\begin{code} 36module Language.QBE.Parser (dataDef, typeDef, funcDef) where 37 38import Data.Functor ((<&>)) 39import Data.List (singleton) 40import Data.Map qualified as Map 41import qualified Language.QBE.Types as Q 42import Language.QBE.Util (bind, decNumber, float) 43import Text.ParserCombinators.Parsec 44 ( Parser, 45 alphaNum, 46 anyChar, 47 between, 48 char, 49 choice, 50 letter, 51 many, 52 many1, 53 manyTill, 54 newline, 55 noneOf, 56 oneOf, 57 optionMaybe, 58 sepBy, 59 sepBy1, 60 skipMany, 61 skipMany1, 62 string, 63 try, 64 (<?>), 65 (<|>), 66 ) 67\end{code} 68} 69 70This an executable description of the 71\href{https://c9x.me/compile/doc/il-v1.2.html}{QBE intermediate language}, 72specified through \href{https://hackage.haskell.org/package/parsec}{Parsec} 73parser combinators and generated from a literate Haskell file. The description 74is derived from the original QBE IL documentation, licensed under MIT. 75Presently, this implementation targets version 1.2 of the QBE intermediate 76language and aims to be equivalent with the original specification. 77 78\section{Basic Concepts} 79 80The intermediate language (IL) is a higher-level language than the 81machine's assembly language. It smoothes most of the 82irregularities of the underlying hardware and allows an infinite number 83of temporaries to be used. This higher abstraction level lets frontend 84programmers focus on language design issues. 85 86\subsection{Input Files} 87 88The intermediate language is provided to QBE as text. Usually, one file 89is generated per each compilation unit from the frontend input language. 90An IL file is a sequence of \nameref{sec:definitions} for 91data, functions, and types. Once processed by QBE, the resulting file 92can be assembled and linked using a standard toolchain (e.g., GNU 93binutils). 94 95\begin{code} 96comment :: Parser () 97comment = skipMany blankNL >> comment' >> skipMany blankNL 98 where 99 comment' = char '#' >> manyTill anyChar newline 100\end{code} 101 102\ignore{ 103\begin{code} 104skipNoCode :: Parser () -> Parser () 105skipNoCode blankP = try (skipMany1 comment <?> "comments") <|> blankP 106\end{code} 107} 108 109Here is a complete "Hello World" IL file which defines a function that 110prints to the screen. Since the string is not a first class object (only 111the pointer is) it is defined outside the function\textquotesingle s 112body. Comments start with a \# character and finish with the end of the 113line. 114 115\begin{verbatim} 116data $str = { b "hello world", b 0 } 117 118export function w $main() { 119@start 120 # Call the puts function with $str as argument. 121 %r =w call $puts(l $str) 122 ret 0 123} 124\end{verbatim} 125 126If you have read the LLVM language reference, you might recognize the 127example above. In comparison, QBE makes a much lighter use of types and 128the syntax is terser. 129 130\subsection{Parser Combinators} 131 132\ignore{ 133\begin{code} 134bracesNL :: Parser a -> Parser a 135bracesNL = between (wsNL $ char '{') (wsNL $ char '}') 136 137quoted :: Parser a -> Parser a 138quoted = let q = char '"' in between q q 139 140parenLst :: Parser a -> Parser [a] 141parenLst p = between (ws $ char '(') (char ')') inner 142 where 143 inner = sepBy (ws p) (ws $ char ',') 144 145binaryInstr :: (Q.Value -> Q.Value -> Q.Instr) -> String -> Parser Q.Instr 146binaryInstr conc keyword = do 147 _ <- ws (string keyword) 148 vfst <- ws val <* ws (char ',') 149 conc vfst <$> ws val 150 151-- Can only appear in data and type definitions and hence allows newlines. 152alignSpec :: Parser Q.AllocSize 153alignSpec = (ws1 (string "align")) >> wsNL allocSize 154\end{code} 155} 156 157The original QBE specification defines the syntax using a BNF grammar. In 158contrast, this document defines it using Parsec parser combinators. As such, 159this specification is less formal but more accurate as the parsing code is 160actually executable. Consequently, this specification also captures constructs 161omitted in the original specification (e.g., \nameref{sec:identifiers}, 162\nameref{sec:statements}, or \nameref{sec:strlit}). Nonetheless, the formal 163language recognized by these combinators aims to be equivalent to the one of 164the BNF grammar. 165 166\subsection{Identifiers} 167\label{sec:identifiers} 168 169% Ident is not documented in the original QBE specification. 170% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n304 171 172\begin{code} 173ident :: Parser String 174ident = do 175 start <- letter <|> oneOf "._" 176 rest <- many (alphaNum <|> oneOf "$._") 177 return $ start : rest 178\end{code} 179 180Identifiers for data, types, and functions can start with any ASCII letter or 181the special characters \texttt{.} and \texttt{\_}. This initial character can 182be followed by a sequence of zero or more alphanumeric characters and the 183special characters \texttt{\$}, \texttt{.}, and \texttt{\_}. 184 185\subsection{Sigils} 186 187\begin{code} 188userDef :: Parser Q.UserIdent 189userDef = Q.UserIdent <$> (char ':' >> ident) 190 191global :: Parser Q.GlobalIdent 192global = Q.GlobalIdent <$> (char '$' >> ident) 193 194local :: Parser Q.LocalIdent 195local = Q.LocalIdent <$> (char '%' >> ident) 196 197label :: Parser Q.BlockIdent 198label = Q.BlockIdent <$> (char '@' >> ident) 199\end{code} 200 201The intermediate language makes heavy use of sigils, all user-defined 202names are prefixed with a sigil. This is to avoid keyword conflicts, and 203also to quickly spot the scope and nature of identifiers. 204 205\begin{itemize} 206 \item \texttt{:} is for user-defined \nameref{sec:aggregate-types} 207 \item \texttt{\$} is for globals (represented by a pointer) 208 \item \texttt{\%} is for function-scope temporaries 209 \item \texttt{@@} is for block labels 210\end{itemize} 211 212\subsection{Spacing} 213 214\begin{code} 215blank :: Parser Char 216blank = oneOf "\t " <?> "blank" 217 218blankNL :: Parser Char 219blankNL = oneOf "\n\t " <?> "blank or newline" 220\end{code} 221 222Individual tokens in IL files must be separated by one or more spacing 223characters. Both spaces and tabs are recognized as spacing characters. 224In data and type definitions, newlines may also be used as spaces to 225prevent overly long lines. When exactly one of two consecutive tokens is 226a symbol (for example \texttt{,} or \texttt{=} or \texttt{\{}), spacing may be omitted. 227 228\ignore{ 229\begin{code} 230ws :: Parser a -> Parser a 231ws p = p <* skipMany blank 232 233ws1 :: Parser a -> Parser a 234ws1 p = p <* skipMany1 blank 235 236wsNL :: Parser a -> Parser a 237wsNL p = p <* skipNoCode (skipMany blankNL) 238 239wsNL1 :: Parser a -> Parser a 240wsNL1 p = p <* skipNoCode (skipMany1 blankNL) 241\end{code} 242} 243 244\subsection{String Literals} 245\label{sec:strlit} 246 247% The string literal is not documented in the original QBE specification. 248% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n287 249 250\begin{code} 251strLit :: Parser String 252strLit = concat <$> quoted (many strChr) 253 where 254 strChr :: Parser [Char] 255 strChr = (singleton <$> noneOf "\"\\") <|> escSeq 256 257 escSeq :: Parser [Char] 258 escSeq = try $ do 259 esc <- char '\\' 260 chr <- anyChar 261 return $ [esc, chr] 262\end{code} 263 264Strings are enclosed by double quotes and are, for example, used to specify a 265section name as part of the \nameref{sec:linkage} information. Within a string, 266a double quote can be escaped using a \texttt{\textbackslash} character. All 267escape sequences, including double quote escaping, are passed through as-is to 268the generated assembly file. 269 270\section{Types} 271 272\subsection{Simple Types} 273 274The IL makes minimal use of types. By design, the types used are 275restricted to what is necessary for unambiguous compilation to machine 276code and C interfacing. Unlike LLVM, QBE is not using types as a means 277to safety; they are only here for semantic purposes. 278 279\begin{code} 280baseType :: Parser Q.BaseType 281baseType = choice 282 [ bind "w" Q.Word 283 , bind "l" Q.Long 284 , bind "s" Q.Single 285 , bind "d" Q.Double ] 286\end{code} 287 288The four base types are \texttt{w} (word), \texttt{l} (long), \texttt{s} (single), and \texttt{d} 289(double), they stand respectively for 32-bit and 64-bit integers, and 29032-bit and 64-bit floating-point numbers. There are no pointer types 291available; pointers are typed by an integer type sufficiently wide to 292represent all memory addresses (e.g., \texttt{l} on 64-bit architectures). 293Temporaries in the IL can only have a base type. 294 295\begin{code} 296extType :: Parser Q.ExtType 297extType = (Q.Base <$> baseType) 298 <|> bind "b" Q.Byte 299 <|> bind "h" Q.HalfWord 300\end{code} 301 302Extended types contain base types plus \texttt{b} (byte) and \texttt{h} (half word), 303respectively for 8-bit and 16-bit integers. They are used in \nameref{sec:aggregate-types} 304and \nameref{sec:data} definitions. 305 306For C interfacing, the IL also provides user-defined aggregate types as 307well as signed and unsigned variants of the sub-word extended types. 308Read more about these types in the \nameref{sec:aggregate-types} 309and \nameref{sec:functions} sections. 310 311\subsection{Subtyping} 312 313The IL has a minimal subtyping feature, for integer types only. Any 314value of type \texttt{l} can be used in a \texttt{w} context. In that case, only the 31532 least significant bits of the word value are used. 316 317Make note that it is the opposite of the usual subtyping on integers (in 318C, we can safely use an \texttt{int} where a \texttt{long} is expected). A long value 319cannot be used in word context. The rationale is that a word can be 320signed or unsigned, so extending it to a long could be done in two ways, 321either by zero-extension, or by sign-extension. 322 323\subsection{Constants and Vals} 324\label{sec:constants-and-vals} 325 326\begin{code} 327dynConst :: Parser Q.DynConst 328dynConst = 329 (Q.Const <$> constant) 330 <|> (Q.Thread <$> global) 331 <?> "dynconst" 332\end{code} 333 334Constants come in two kinds: compile-time constants and dynamic 335constants. Dynamic constants include compile-time constants and other 336symbol variants that are only known at program-load time or execution 337time. Consequently, dynamic constants can only occur in function bodies. 338 339The representation of integers is two's complement. 340Floating-point numbers are represented using the single-precision and 341double-precision formats of the IEEE 754 standard. 342 343\begin{code} 344constant :: Parser Q.Const 345constant = 346 (Q.Number <$> decNumber) 347 <|> (Q.SFP <$> sfp) 348 <|> (Q.DFP <$> dfp) 349 <|> (Q.Global <$> global) 350 <?> "const" 351 where 352 sfp = string "s_" >> float 353 dfp = string "d_" >> float 354\end{code} 355 356Constants specify a sequence of bits and are untyped. They are always 357parsed as 64-bit blobs. Depending on the context surrounding a constant, 358only some of its bits are used. For example, in the program below, the 359two variables defined have the same value since the first operand of the 360subtraction is a word (32-bit) context. 361 362\begin{verbatim} 363%x =w sub -1, 0 %y =w sub 4294967295, 0 364\end{verbatim} 365 366Because specifying floating-point constants by their bits makes the code 367less readable, syntactic sugar is provided to express them. Standard 368scientific notation is prefixed with \texttt{s\_} and \texttt{d\_} for single and 369double precision numbers respectively. Once again, the following example 370defines twice the same double-precision constant. 371 372\begin{verbatim} 373%x =d add d_0, d_-1 374%y =d add d_0, -4616189618054758400 375\end{verbatim} 376 377Global symbols can also be used directly as constants; they will be 378resolved and turned into actual numeric constants by the linker. 379 380When the \texttt{thread} keyword prefixes a symbol name, the 381symbol\textquotesingle s numeric value is resolved at runtime in the 382thread-local storage. 383 384\begin{code} 385val :: Parser Q.Value 386val = 387 (Q.VConst <$> dynConst) 388 <|> (Q.VLocal <$> local) 389 <?> "val" 390\end{code} 391 392Vals are used as arguments in regular, phi, and jump instructions within 393function definitions. They are either constants or function-scope 394temporaries. 395 396\subsection{Linkage} 397\label{sec:linkage} 398 399\begin{code} 400linkage :: Parser Q.Linkage 401linkage = 402 wsNL (bind "export" Q.LExport) 403 <|> wsNL (bind "thread" Q.LThread) 404 <|> do 405 _ <- ws1 $ string "section" 406 (try secWithFlags) <|> sec 407 where 408 sec :: Parser Q.Linkage 409 sec = wsNL strLit <&> (`Q.LSection` Nothing) 410 411 secWithFlags :: Parser Q.Linkage 412 secWithFlags = do 413 n <- ws1 strLit 414 wsNL strLit <&> Q.LSection n . Just 415\end{code} 416 417Function and data definitions (see below) can specify linkage 418information to be passed to the assembler and eventually to the linker. 419 420The \texttt{export} linkage flag marks the defined item as visible outside the 421current file\textquotesingle s scope. If absent, the symbol can only be 422referred to locally. Functions compiled by QBE and called from C need to 423be exported. 424 425The \texttt{thread} linkage flag can only qualify data definitions. It mandates 426that the object defined is stored in thread-local storage. Each time a 427runtime thread starts, the supporting platform runtime is in charge of 428making a new copy of the object for the fresh thread. Objects in 429thread-local storage must be accessed using the \texttt{thread \$IDENT} syntax, 430as specified in the \nameref{sec:constants-and-vals} section. 431 432A \texttt{section} flag can be specified to tell the linker to put the defined 433item in a certain section. The use of the section flag is platform 434dependent and we refer the user to the documentation of their assembler 435and linker for relevant information. 436 437\begin{verbatim} 438section ".init_array" data $.init.f = { l $f } 439\end{verbatim} 440 441The section flag can be used to add function pointers to a global 442initialization list, as depicted above. Note that some platforms provide 443a BSS section that can be used to minimize the footprint of uniformly 444zeroed data. When this section is available, QBE will automatically make 445use of it and no section flag is required. 446 447The section and export linkage flags should each appear at most once in 448a definition. If multiple occurrences are present, QBE is free to use 449any. 450 451\subsection{Definitions} 452\label{sec:definitions} 453 454Definitions are the essential components of an IL file. They can define 455three types of objects: aggregate types, data, and functions. Aggregate 456types are never exported and do not compile to any code. Data and 457function definitions have file scope and are mutually recursive (even 458across IL files). Their visibility can be controlled using linkage 459flags. 460 461\subsubsection{Aggregate Types} 462\label{sec:aggregate-types} 463 464\begin{code} 465typeDef :: Parser Q.TypeDef 466typeDef = do 467 _ <- wsNL1 (string "type") 468 i <- wsNL1 userDef 469 _ <- wsNL1 (char '=') 470 a <- optionMaybe alignSpec 471 bracesNL (opaqueType <|> unionType <|> regularType) <&> Q.TypeDef i a 472\end{code} 473 474Aggregate type definitions start with the \texttt{type} keyword. They have file 475scope, but types must be defined before being referenced. The inner 476structure of a type is expressed by a comma-separated list of fields. 477 478\begin{code} 479subType :: Parser Q.SubType 480subType = 481 (Q.SExtType <$> extType) 482 <|> (Q.SUserDef <$> userDef) 483 484field :: Parser Q.Field 485field = do 486 -- TODO: newline is required if there is a number argument 487 f <- wsNL subType 488 s <- ws $ optionMaybe decNumber 489 pure (f, s) 490 491-- TODO: "For ease of IL generation, a trailing comma is tolerated by the parser". 492fields :: Bool -> Parser [Q.Field] 493fields allowEmpty = 494 (if allowEmpty then sepBy else sepBy1) field (wsNL $ char ',') 495\end{code} 496 497A field consists of a subtype, either an extended type or a user-defined type, 498and an optional number expressing the value of this field. In case many items 499of the same type are sequenced (like in a C array), the shorter array syntax 500can be used. 501 502\begin{code} 503regularType :: Parser Q.AggType 504regularType = Q.ARegular <$> fields True 505\end{code} 506 507Three different kinds of aggregate types are presentl ysupported: regular 508types, union types and opaque types. The fields of regular types will be 509packed. By default, the alignment of an aggregate type is the maximum alignment 510of its members. The alignment can be explicitly specified by the programmer. 511 512\begin{code} 513unionType :: Parser Q.AggType 514unionType = Q.AUnion <$> many1 (wsNL unionType') 515 where 516 unionType' :: Parser [Q.Field] 517 unionType' = bracesNL $ fields False 518\end{code} 519 520Union types allow the same chunk of memory to be used with different layouts. They are defined by enclosing multiple regular aggregate type bodies in a pair of curly braces. Size and alignment of union types are set to the maximum size and alignment of each variation or, in the case of alignment, can be explicitly specified. 521 522\begin{code} 523opaqueType :: Parser Q.AggType 524opaqueType = Q.AOpaque <$> wsNL decNumber 525\end{code} 526 527Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined simply by enclosing their size between curly braces. 528 529\subsubsection{Data} 530\label{sec:data} 531 532\begin{code} 533dataDef :: Parser Q.DataDef 534dataDef = do 535 link <- many linkage 536 name <- wsNL1 (string "data") >> wsNL global 537 _ <- wsNL (char '=') 538 alignment <- optionMaybe alignSpec 539 bracesNL dataObjs <&> Q.DataDef link name alignment 540 where 541 dataObjs = sepBy dataObj (wsNL $ char ',') 542\end{code} 543 544Data definitions express objects that will be emitted in the compiled 545file. Their visibility and location in the compiled artifact are 546controlled with linkage flags described in the \nameref{sec:linkage} 547section. 548 549They define a global identifier (starting with the sigil \texttt{\$}), that 550will contain a pointer to the object specified by the definition. 551 552\begin{code} 553dataObj :: Parser Q.DataObj 554dataObj = 555 (Q.OZeroFill <$> (wsNL1 (char 'z') >> wsNL decNumber)) 556 <|> do 557 t <- wsNL1 extType 558 i <- many1 (wsNL dataItem) 559 return $ Q.OItem t i 560\end{code} 561 562Objects are described by a sequence of fields that start with a type 563letter. This letter can either be an extended type, or the \texttt{z} letter. 564If the letter used is an extended type, the data item following 565specifies the bits to be stored in the field. 566 567\begin{code} 568dataItem :: Parser Q.DataItem 569dataItem = 570 (Q.DString <$> strLit) 571 <|> (Q.DConst <$> constant) 572 <|> do 573 i <- global 574 off <- optionMaybe (char '+' >> decNumber) 575 return $ Q.DSymbol i off 576\end{code} 577 578Within each object, several items can be defined. When several data items 579follow a letter, they initialize multiple fields of the same size. 580 581\begin{code} 582allocSize :: Parser Q.AllocSize 583allocSize = 584 choice 585 [ bind "4" Q.AlignWord, 586 bind "8" Q.AlignLong, 587 bind "16" Q.AlignLongLong 588 ] 589\end{code} 590 591The members of a struct will be packed. This means that padding has to 592be emitted by the frontend when necessary. Alignment of the whole data 593objects can be manually specified, and when no alignment is provided, 594the maximum alignment from the platform is used. 595 596When the \texttt{z} letter is used the number following indicates the size of 597the field; the contents of the field are zero initialized. It can be 598used to add padding between fields or zero-initialize big arrays. 599 600\subsubsection{Functions} 601\label{sec:functions} 602 603\begin{code} 604funcDef :: Parser Q.FuncDef 605funcDef = do 606 link <- many linkage 607 _ <- ws1 (string "function") 608 retTy <- optionMaybe (ws1 abity) 609 name <- ws global 610 args <- wsNL params 611 body <- between (wsNL1 $ char '{') (wsNL $ char '}') $ many1 block 612 613 case (Q.insertJumps body) of 614 Nothing -> fail $ "invalid fallthrough in " ++ show name 615 Just bl -> return $ Q.FuncDef link name retTy args bl 616\end{code} 617 618Function definitions contain the actual code to emit in the compiled 619file. They define a global symbol that contains a pointer to the 620function code. This pointer can be used in \texttt{call} instructions or stored 621in memory. 622 623\begin{code} 624subWordType :: Parser Q.SubWordType 625subWordType = choice 626 [ try $ bind "sb" Q.SignedByte 627 , try $ bind "ub" Q.UnsignedByte 628 , bind "sh" Q.SignedHalf 629 , bind "uh" Q.UnsignedHalf ] 630 631abity :: Parser Q.Abity 632abity = try (Q.ASubWordType <$> subWordType) 633 <|> (Q.ABase <$> baseType) 634 <|> (Q.AUserDef <$> userDef) 635\end{code} 636 637The type given right before the function name is the return type of the 638function. All return values of this function must have this return type. 639If the return type is missing, the function must not return any value. 640 641\begin{code} 642param :: Parser Q.FuncParam 643param = (Q.Env <$> (ws1 (string "env") >> local)) 644 <|> (string "..." >> pure Q.Variadic) 645 <|> do 646 ty <- ws1 abity 647 Q.Regular ty <$> local 648 649params :: Parser [Q.FuncParam] 650params = parenLst param 651\end{code} 652 653The parameter list is a comma separated list of temporary names prefixed 654by types. The types are used to correctly implement C compatibility. 655When an argument has an aggregate type, a pointer to the aggregate is 656passed by thea caller. In the example below, we have to use a load 657instruction to get the value of the first (and only) member of the 658struct. 659 660\begin{verbatim} 661type :one = { w } 662 663function w $getone(:one %p) { 664@start 665 %val =w loadw %p 666 ret %val 667} 668\end{verbatim} 669 670If a function accepts or returns values that are smaller than a word, 671such as \texttt{signed char} or \texttt{unsigned short} in C, one of the sub-word type 672must be used. The sub-word types \texttt{sb}, \texttt{ub}, \texttt{sh}, and \texttt{uh} stand, 673respectively, for signed and unsigned 8-bit values, and signed and 674unsigned 16-bit values. Parameters associated with a sub-word type of 675bit width N only have their N least significant bits set and have base 676type \texttt{w}. For example, the function 677 678\begin{verbatim} 679function w $addbyte(w %a, sb %b) { 680@start 681 %bw =w extsb %b 682 %val =w add %a, %bw 683 ret %val 684} 685\end{verbatim} 686 687needs to sign-extend its second argument before the addition. Dually, 688return values with sub-word types do not need to be sign or zero 689extended. 690 691If the parameter list ends with \texttt{...}, the function is a variadic 692function: it can accept a variable number of arguments. To access the 693extra arguments provided by the caller, use the \texttt{vastart} and \texttt{vaarg} 694instructions described in the \nameref{sec:variadic} section. 695 696Optionally, the parameter list can start with an environment parameter 697\texttt{env \%e}. This special parameter is a 64-bit integer temporary (i.e., 698of type \texttt{l}). If the function does not use its environment parameter, 699callers can safely omit it. This parameter is invisible to a C caller: 700for example, the function 701 702\begin{verbatim} 703export function w $add(env %e, w %a, w %b) { 704@start 705 %c =w add %a, %b 706 ret %c 707} 708\end{verbatim} 709 710must be given the C prototype \texttt{int add(int, int)}. The intended use of 711this feature is to pass the environment pointer of closures while 712retaining a very good compatibility with C. The \nameref{sec:call} 713section explains how to pass an environment parameter. 714 715Since global symbols are defined mutually recursive, there is no need 716for function declarations: a function can be referenced before its 717definition. Similarly, functions from other modules can be used without 718previous declaration. All the type information necessary to compile a 719call is in the instruction itself. 720 721The syntax and semantics for the body of functions are described in the 722\nameref{sec:control} section. 723 724\section{Control} 725\label{sec:control} 726 727The IL represents programs as textual transcriptions of control flow 728graphs. The control flow is serialized as a sequence of blocks of 729straight-line code which are connected using jump instructions. 730 731\subsection{Blocks} 732\label{sec:blocks} 733 734\begin{code} 735block :: Parser Q.Block' 736block = do 737 l <- wsNL1 label 738 p <- many (wsNL1 $ try phiInstr) 739 s <- many (wsNL1 statement) 740 Q.Block' l p s <$> (optionMaybe $ wsNL1 jumpInstr) 741\end{code} 742 743All blocks have a name that is specified by a label at their beginning. 744Then follows a sequence of instructions that have "fall-through" flow. 745Finally one jump terminates the block. The jump can either transfer 746control to another block of the same function or return; jumps are 747described further below. 748 749The first block in a function must not be the target of any jump in the 750program. If a jump to the function start is needed, the frontend must 751insert an empty prelude block at the beginning of the function. 752 753When one block jumps to the next block in the IL file, it is not 754necessary to write the jump instruction, it will be automatically added 755by the parser. For example the start block in the example below jumps 756directly to the loop block. 757 758\subsection{Jumps} 759 760\begin{code} 761jumpInstr :: Parser Q.JumpInstr 762jumpInstr = (string "hlt" >> pure Q.Halt) 763 -- TODO: Return requires a space if there is an optionMaybe 764 <|> Q.Return <$> ((ws $ string "ret") >> optionMaybe val) 765 <|> try (Q.Jump <$> ((ws1 $ string "jmp") >> label)) 766 <|> do 767 _ <- ws1 $ string "jnz" 768 v <- ws val <* ws (char ',') 769 l1 <- ws label <* ws (char ',') 770 l2 <- ws label 771 return $ Q.Jnz v l1 l2 772\end{code} 773 774A jump instruction ends every block and transfers the control to another 775program location. The target of a jump must never be the first block in 776a function. The three kinds of jumps available are described in the 777following list. 778 779\begin{enumerate} 780 \item \textbf{Unconditional jump.} Jumps to another block of the same function. 781 \item \textbf{Conditional jump.} When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type; because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0. 782 \item \textbf{Function return.} Terminates the execution of the current function, optionally returning a value to the caller. The value returned must be of the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used. 783 \item \textbf{Program termination.} Terminates the execution of the program with a target-dependent error. This instruction can be used when it is expected that the execution never reaches the end of the block it closes; for example, after having called a function such as \texttt{exit()}. 784\end{enumerate} 785 786\section{Instructions} 787\label{sec:instructions} 788 789\begin{code} 790instr :: Parser Q.Instr 791instr = 792 choice 793 [ try $ binaryInstr Q.Add "add", 794 try $ binaryInstr Q.Sub "sub", 795 try $ binaryInstr Q.Mul "mul", 796 try $ binaryInstr Q.URem "urem", 797 try $ binaryInstr Q.Rem "rem", 798 try $ binaryInstr Q.UDiv "udiv", 799 try $ binaryInstr Q.Or "or", 800 try $ binaryInstr Q.Xor "xor", 801 try $ binaryInstr Q.And "and", 802 try $ loadInstr, 803 try $ allocInstr, 804 try $ compareInstr, 805 try $ extInstr 806 ] 807\end{code} 808 809Instructions are the smallest piece of code in the IL, they form the body of 810\nameref{sec:blocks}. This specification distinguishes instructions and 811volatile instructions, the latter do not return a value. For the former, the IL 812uses a three-address code, which means that one instruction computes an 813operation between two operands and assigns the result to a third one. 814 815\begin{code} 816assign :: Parser Q.Statement 817assign = do 818 n <- ws local 819 t <- ws (char '=') >> ws1 baseType 820 Q.Assign n t <$> instr 821 822volatileInstr :: Parser Q.Statement 823volatileInstr = Q.Volatile <$> (storeInstr <|> blitInstr) 824 825-- TODO: Not documented in the QBE BNF. 826statement :: Parser Q.Statement 827statement = (try callInstr) <|> assign <|> volatileInstr 828\end{code} 829 830An instruction has both a name and a return type, this return type is a base 831type that defines the size of the instruction's result. The type of the 832arguments can be unambiguously inferred using the instruction name and the 833return type. For example, for all arithmetic instructions, the type of the 834arguments is the same as the return type. The two additions below are valid if 835\texttt{\%y} is a word or a long (because of \nameref{sec:subtyping}). 836 837\begin{verbatim} 838%x =w add 0, %y 839%z =w add %x, %x 840\end{verbatim} 841 842Some instructions, like comparisons and memory loads have operand types 843that differ from their return types. For instance, two floating points 844can be compared to give a word result (0 if the comparison succeeds, 1 845if it fails). 846 847\begin{verbatim} 848%c =w cgts %a, %b 849\end{verbatim} 850 851In the example above, both operands have to have single type. This is 852made explicit by the instruction suffix. 853 854\subsection{Arithmetic and Bits} 855 856To-Do. 857 858\subsection{Memory} 859\label{sec:memory} 860 861To-Do. 862 863\subsubsection{Store instructions} 864 865\begin{code} 866storeInstr :: Parser Q.VolatileInstr 867storeInstr = do 868 t <- string "store" >> ws1 extType 869 v <- ws val 870 _ <- ws $ char ',' 871 ws val <&> Q.Store t v 872\end{code} 873 874Store instructions exist to store a value of any base type and any extended 875type. Since halfwords and bytes are not first class in the IL, \texttt{storeh} 876and \texttt{storeb} take a word as argument. Only the first 16 or 8 bits of 877this word will be stored in memory at the address specified in the second 878argument. 879 880\subsubsection{Load instructions} 881 882\begin{code} 883loadInstr :: Parser Q.Instr 884loadInstr = do 885 _ <- string "load" 886 t <- ws1 $ choice 887 [ try $ bind "sw" (Q.LBase Q.Word), 888 try $ bind "uw" (Q.LBase Q.Word), 889 try $ Q.LSubWord <$> subWordType, 890 Q.LBase <$> baseType 891 ] 892 ws val <&> Q.Load t 893\end{code} 894 895For types smaller than long, two variants of the load instruction are 896available: one will sign extend the loaded value, while the other will zero 897extend it. Note that all loads smaller than long can load to either a long or a 898word. 899 900The two instructions \texttt{loadsw} and \texttt{loaduw} have the same effect 901when they are used to define a word temporary. A \texttt{loadw} instruction is 902provided as syntactic sugar for \texttt{loadsw} to make explicit that the 903extension mechanism used is irrelevant. 904 905\subsubsection{Blits} 906 907\begin{code} 908blitInstr :: Parser Q.VolatileInstr 909blitInstr = do 910 v1 <- (ws1 $ string "blit") >> ws val <* (ws $ char ',') 911 v2 <- ws val <* (ws $ char ',') 912 nb <- decNumber 913 return $ Q.Blit v1 v2 nb 914\end{code} 915 916The blit instruction copies in-memory data from its first address argument to 917its second address argument. The third argument is the number of bytes to copy. 918The source and destination spans are required to be either non-overlapping, or 919fully overlapping (source address identical to the destination address). The 920byte count argument must be a nonnegative numeric constant; it cannot be a 921temporary. 922 923One blit instruction may generate a number of instructions proportional to its 924byte count argument, consequently, it is recommended to keep this argument 925relatively small. If large copies are necessary, it is preferable that 926frontends generate calls to a supporting \texttt{memcpy} function. 927 928\subsubsection{Stack Allocation} 929 930\begin{code} 931allocInstr :: Parser Q.Instr 932allocInstr = do 933 siz <- (ws $ string "alloc") >> (ws1 allocSize) 934 val <&> Q.Alloc siz 935\end{code} 936 937These instructions allocate a chunk of memory on the stack. The number ending 938the instruction name is the alignment required for the allocated slot. QBE will 939make sure that the returned address is a multiple of that alignment value. 940 941Stack allocation instructions are used, for example, when compiling the C local 942variables, because their address can be taken. When compiling Fortran, 943temporaries can be used directly instead, because it is illegal to take the 944address of a variable. 945 946\subsection{Comparisons} 947 948Comparison instructions return an integer value (either a word or a long), and 949compare values of arbitrary types. The returned value is 1 if the two operands 950satisfy the comparison relation, or 0 otherwise. The names of comparisons 951respect a standard naming scheme in three parts. 952 953\begin{code} 954compareInstr :: Parser Q.Instr 955compareInstr = do 956 _ <- char 'c' 957 op <- compareOp 958 ty <- ws1 baseType 959 lhs <- ws val <* ws (char ',') 960 rhs <- ws val 961 pure $ Q.Compare ty op lhs rhs 962\end{code} 963 964\begin{code} 965compareOp :: Parser Q.CmpOp 966compareOp = choice 967 [ bind "eq" Q.CEq 968 , bind "ne" Q.CNe 969 , try $ bind "sle" Q.CSle 970 , try $ bind "slt" Q.CSlt 971 , try $ bind "sge" Q.CSge 972 , try $ bind "sgt" Q.CSgt 973 , try $ bind "ule" Q.CUle 974 , try $ bind "ult" Q.CUlt 975 , try $ bind "uge" Q.CUge 976 , try $ bind "ugt" Q.CUgt ] 977\end{code} 978 979For example, \texttt{cod} (\texttt{I(dd,dd)}) compares two double-precision floating point numbers and returns 1 if the two floating points are not NaNs, or 0 otherwise. The \texttt{csltw} (\texttt{I(ww,ww)}) instruction compares two words representing signed numbers and returns 1 when the first argument is smaller than the second one. 980 981\subsection{Conversions} 982 983\begin{code} 984subLongType :: Parser Q.SubLongType 985subLongType = try (Q.SLSubWord <$> subWordType) 986 <|> bind "sw" Q.SLSignedWord 987 <|> bind "uw" Q.SLUnsignedWord 988 989extInstr :: Parser Q.Instr 990extInstr = do 991 _ <- string "ext" 992 ty <- ws1 subLongType 993 ws val <&> Q.Ext ty 994\end{code} 995 996Conversion operations change the representation of a value, possibly modifying 997it if the target type cannot hold the value of the source type. Conversions can 998extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or 999convert a floating point into an integer and vice versa.10001001\subsection{Cast and Copy}10021003To-Do.10041005\subsection{Call}1006\label{sec:call}10071008\begin{code}1009-- TODO: Code duplication with 'param'.1010callArg :: Parser Q.FuncArg1011callArg = (Q.ArgEnv <$> (ws1 (string "env") >> val))1012 <|> (string "..." >> pure Q.ArgVar)1013 <|> do1014 ty <- ws1 abity1015 Q.ArgReg ty <$> val10161017callArgs :: Parser [Q.FuncArg]1018callArgs = parenLst callArg10191020callInstr :: Parser Q.Statement1021callInstr = do1022 retValue <- optionMaybe $ do1023 i <- ws local <* ws (char '=')1024 a <- ws1 abity1025 return (i, a)1026 toCall <- ws1 (string "call") >> ws val1027 fnArgs <- callArgs1028 return $ Q.Call retValue toCall fnArgs1029\end{code}10301031The call instruction is special in several ways. It is not a three-address1032instruction and requires the type of all its arguments to be given. Also, the1033return type can be either a base type or an aggregate type. These specifics are1034required to compile calls with C compatibility (i.e., to respect the ABI).10351036When an aggregate type is used as argument type or return type, the value1037respectively passed or returned needs to be a pointer to a memory location1038holding the value. This is because aggregate types are not first-class1039citizens of the IL.10401041Sub-word types are used for arguments and return values of width less than a1042word. Details on these types are presented in the <@ Functions > section.1043Arguments with sub-word types need not be sign or zero extended according to1044their type. Calls with a sub-word return type define a temporary of base type1045\texttt{w} with its most significant bits unspecified.10461047Unless the called function does not return a value, a return temporary must be1048specified, even if it is never used afterwards.10491050An environment parameter can be passed as first argument using the \texttt{env}1051keyword. The passed value must be a 64-bit integer. If the called function does1052not expect an environment parameter, it will be safely discarded. See the1053\nameref{sec:functions} section for more information about environment1054parameters.10551056When the called function is variadic, there must be a \texttt{...} marker1057separating the named and variadic arguments.10581059\subsection{Variadic}1060\label{sec:variadic}10611062To-Do.10631064\subsection{Phi}10651066\begin{code}1067phiBranch :: Parser (Q.BlockIdent, Q.Value)1068phiBranch = do1069 n <- ws1 label1070 v <- val1071 pure (n, v)10721073phiInstr :: Parser Q.Phi1074phiInstr = do1075 -- TODO: code duplication with 'assign'1076 n <- ws local1077 t <- ws (char '=') >> ws1 baseType10781079 _ <- ws1 (string "phi")1080 -- TODO: combinator for sepBy1081 p <- Map.fromList <$> sepBy1 (ws phiBranch) (ws $ char ',')1082 return $ Q.Phi n t p1083\end{code}10841085First and foremost, phi instructions are NOT necessary when writing a frontend1086to QBE. One solution to avoid having to deal with SSA form is to use stack1087allocated variables for all source program variables and perform assignments1088and lookups using \nameref{sec:memory} operations. This is what LLVM users1089typically do.10901091Another solution is to simply emit code that is not in SSA form! Contrary to1092LLVM, QBE is able to fixup programs not in SSA form without requiring the1093boilerplate of loading and storing in memory. For example, the following1094program will be correctly compiled by QBE.10951096\begin{verbatim}1097@start1098 %x =w copy 1001099 %s =w copy 01100@loop1101 %s =w add %s, %x1102 %x =w sub %x, 11103 jnz %x, @loop, @end1104@end1105 ret %s1106\end{verbatim}11071108Now, if you want to know what phi instructions are and how to use them in QBE,1109you can read the following.11101111Phi instructions are specific to SSA form. In SSA form values can only be1112assigned once, without phi instructions, this requirement is too strong to1113represent many programs. For example consider the following C program.11141115\begin{verbatim}1116int f(int x) {1117 int y;1118 if (x)1119 y = 1;1120 else1121 y = 2;1122 return y;1123}1124\end{verbatim}11251126The variable \texttt{y} is assigned twice, the solution to translate it in SSA1127form is to insert a phi instruction.11281129\begin{verbatim}1130@ifstmt1131 jnz %x, @ift, @iff1132@ift1133 jmp @retstmt1134@iff1135 jmp @retstmt1136@retstmt1137 %y =w phi @ift 1, @iff 21138 ret %y1139\end{verbatim}11401141Phi instructions return one of their arguments depending on where the control1142came from. In the example, \texttt{%y} is set to 1 if the \texttt{@ift} branch1143is taken, or it is set to 2 otherwise.11441145An important remark about phi instructions is that QBE assumes that if a1146variable is defined by a phi it respects all the SSA invariants. So it is1147critical to not use phi instructions unless you know exactly what you are1148doing.1149\end{document}