% SPDX-FileCopyrightText: 2015-2024 Quentin Carbonneaux <quentin@c9x.me>
% SPDX-FileCopyrightText: 2025 Sören Tempel <soeren+git@soeren-tempel.net>
%
% SPDX-License-Identifier: MIT AND GPL-3.0-only

\documentclass{article}
%include polycode.fmt

%subst blankline = "\\[5mm]"

% See https://github.com/kosmikus/lhs2tex/issues/58
%format <$> = "\mathbin{\langle\$\rangle}"
%format <&> = "\mathbin{\langle\&\rangle}"
%format <|> = "\mathbin{\langle\:\vline\:\rangle}"
%format <?> = "\mathbin{\langle?\rangle}"
%format <*> = "\mathbin{\langle*\rangle}"
%format <* = "\mathbin{\langle*}"
%format *> = "\mathbin{*\rangle}"

\long\def\ignore#1{}

\usepackage{hyperref}
\hypersetup{
  colorlinks = true,
}

\begin{document}

\title{QBE Intermediate Language\vspace{-2em}}
\date{}
\maketitle
\frenchspacing

\ignore{
\begin{code}
module Language.QBE.Parser (dataDef, typeDef, funcDef) where

import Data.Char (chr)
import Data.Word (Word64)
import Data.Functor ((<&>))
import Data.List (singleton)
import Data.Map qualified as Map
import qualified Language.QBE.Types as Q
import Language.QBE.Util (bind, decNumber, octNumber, float)
import Text.ParserCombinators.Parsec
  ( Parser,
    alphaNum,
    anyChar,
    between,
    char,
    choice,
    letter,
    many,
    many1,
    manyTill,
    newline,
    noneOf,
    oneOf,
    optional,
    optionMaybe,
    sepBy,
    sepBy1,
    skipMany,
    skipMany1,
    string,
    try,
    (<?>),
    (<|>),
  )
\end{code}
}

This is an executable description of the
\href{https://c9x.me/compile/doc/il-v1.2.html}{QBE intermediate language},
specified through \href{https://hackage.haskell.org/package/parsec}{Parsec}
parser combinators and generated from a literate Haskell file. The description
is derived from the original QBE IL documentation, licensed under MIT.
Presently, this implementation targets version 1.2 of the QBE intermediate
language and aims to be equivalent to the original specification.

\section{Basic Concepts}

The intermediate language (IL) is a higher-level language than the
machine's assembly language. It smooths out most of the
irregularities of the underlying hardware and allows an infinite number
of temporaries to be used. This higher abstraction level lets frontend
programmers focus on language design issues.

\subsection{Input Files}

The intermediate language is provided to QBE as text. Usually, one file
is generated per compilation unit of the frontend input language.
An IL file is a sequence of \nameref{sec:definitions} for
data, functions, and types. Once processed by QBE, the resulting file
can be assembled and linked using a standard toolchain (e.g., GNU
binutils).

\begin{code}
comment :: Parser ()
comment = skipMany blankNL >> comment' >> skipMany blankNL
  where
    comment' = char '#' >> manyTill anyChar newline
\end{code}

\ignore{
\begin{code}
skipNoCode :: Parser () -> Parser ()
skipNoCode blankP = try (skipMany1 comment <?> "comments") <|> blankP
\end{code}
}

Here is a complete "Hello World" IL file which defines a function that
prints to the screen. Since the string is not a first class object (only
the pointer is), it is defined outside the function\textquotesingle s
body. Comments start with a \# character and finish with the end of the
line.

\begin{verbatim}
data $str = { b "hello world", b 0 }

export function w $main() {
@start
    # Call the puts function with $str as argument.
    %r =w call $puts(l $str)
    ret 0
}
\end{verbatim}

If you have read the LLVM language reference, you might recognize the
example above. In comparison, QBE makes a much lighter use of types and
the syntax is terser.

\subsection{Parser Combinators}

\ignore{
\begin{code}
bracesNL :: Parser a -> Parser a
bracesNL = between (wsNL $ char '{') (wsNL $ char '}')

quoted :: Parser a -> Parser a
quoted = let q = char '"' in between q q

sepByTrail1 :: Parser a -> Parser sep -> Parser [a]
sepByTrail1 p sep = do
  x <- p
  xs <- many (try $ sep >> p)
  _ <- optional sep
  return (x:xs)

sepByTrail :: Parser a -> Parser sep -> Parser [a]
sepByTrail p sep = sepByTrail1 p sep <|> return []

parenLst :: Parser a -> Parser [a]
parenLst p = between (ws $ char '(') (char ')') inner
  where
    inner = sepBy (ws p) (ws $ char ',')

unaryInstr :: (Q.Value -> Q.Instr) -> String -> Parser Q.Instr
unaryInstr conc keyword = do
  _ <- ws (string keyword)
  conc <$> ws val

binaryInstr :: (Q.Value -> Q.Value -> Q.Instr) -> String -> Parser Q.Instr
binaryInstr conc keyword = do
  _ <- ws (string keyword)
  vfst <- ws val <* ws (char ',')
  conc vfst <$> ws val

-- Can only appear in data and type definitions and hence allows newlines.
alignAny :: Parser Word64
alignAny = (ws1 (string "align")) >> wsNL decNumber
\end{code}
}

The original QBE specification defines the syntax using a BNF grammar. In
contrast, this document defines it using Parsec parser combinators. As such,
this specification is less formal but more accurate as the parsing code is
actually executable. Consequently, this specification also captures constructs
omitted in the original specification (e.g., \nameref{sec:identifiers} or
\nameref{sec:strlit}). Nonetheless, the formal language recognized by these
combinators aims to be equivalent to that of the BNF grammar.

\subsection{Identifiers}
\label{sec:identifiers}

% Ident is not documented in the original QBE specification.
% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n304

\begin{code}
ident :: Parser String
ident = do
  start <- letter <|> oneOf "._"
  rest <- many (alphaNum <|> oneOf "$._")
  return $ start : rest
\end{code}

Identifiers for data, types, and functions can start with any ASCII letter or
the special characters \texttt{.} and \texttt{\_}. This initial character can
be followed by a sequence of zero or more alphanumeric characters and the
special characters \texttt{\$}, \texttt{.}, and \texttt{\_}.

\subsection{Sigils}

\begin{code}
userDef :: Parser Q.UserIdent
userDef = Q.UserIdent <$> (char ':' >> ident)

global :: Parser Q.GlobalIdent
global = Q.GlobalIdent <$> (char '$' >> ident)

local :: Parser Q.LocalIdent
local = Q.LocalIdent <$> (char '%' >> ident)

label :: Parser Q.BlockIdent
label = Q.BlockIdent <$> (char '@' >> ident)
\end{code}

The intermediate language makes heavy use of sigils: all user-defined
names are prefixed with a sigil. This avoids keyword conflicts and
makes it easy to spot the scope and nature of identifiers.

\begin{itemize}
  \item \texttt{:} is for user-defined \nameref{sec:aggregate-types}
  \item \texttt{\$} is for globals (represented by a pointer)
  \item \texttt{\%} is for function-scope temporaries
  \item \texttt{@@} is for block labels
\end{itemize}

\subsection{Spacing}

\begin{code}
blank :: Parser Char
blank = oneOf "\t " <?> "blank"

blankNL :: Parser Char
blankNL = oneOf "\n\t " <?> "blank or newline"
\end{code}

Individual tokens in IL files must be separated by one or more spacing
characters. Both spaces and tabs are recognized as spacing characters.
In data and type definitions, newlines may also be used as spaces to
prevent overly long lines.
When exactly one of two consecutive tokens is
a symbol (for example \texttt{,} or \texttt{=} or \texttt{\{}), spacing may be omitted.

\ignore{
\begin{code}
ws :: Parser a -> Parser a
ws p = p <* skipMany blank

ws1 :: Parser a -> Parser a
ws1 p = p <* skipMany1 blank

wsNL :: Parser a -> Parser a
wsNL p = p <* skipNoCode (skipMany blankNL)

wsNL1 :: Parser a -> Parser a
wsNL1 p = p <* skipNoCode (skipMany1 blankNL)
\end{code}
}

\subsection{String Literals}
\label{sec:strlit}

% The string literal is not documented in the original QBE specification.
% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n287

\begin{code}
strLit :: Parser String
strLit = concat <$> quoted (many strChr)
  where
    strChr :: Parser [Char]
    strChr = (singleton <$> noneOf "\"\\") <|> escSeq

    -- TODO: not documented in the QBE BNF.
    octEsc :: Parser Char
    octEsc = do
      n <- octNumber
      pure $ chr (fromIntegral n)

    escSeq :: Parser [Char]
    escSeq = try $ do
      esc <- char '\\'
      (singleton <$> octEsc) <|> (anyChar <&> (\c -> [esc, c]))
\end{code}

Strings are enclosed by double quotes and are, for example, used to specify a
section name as part of the \nameref{sec:linkage} information. Within a string,
a double quote can be escaped using a \texttt{\textbackslash} character. All
escape sequences, including double quote escaping, are passed through as-is to
the generated assembly file.

\section{Types}

\subsection{Simple Types}

The IL makes minimal use of types. By design, the types used are
restricted to what is necessary for unambiguous compilation to machine
code and C interfacing. Unlike LLVM, QBE does not use types as a means
to safety; they are only here for semantic purposes.

\begin{code}
baseType :: Parser Q.BaseType
baseType = choice
  [ bind "w" Q.Word
  , bind "l" Q.Long
  , bind "s" Q.Single
  , bind "d" Q.Double ]
\end{code}

The four base types are \texttt{w} (word), \texttt{l} (long), \texttt{s} (single), and \texttt{d}
(double); they stand, respectively, for 32-bit and 64-bit integers, and
32-bit and 64-bit floating-point numbers. There are no pointer types
available; pointers are typed by an integer type sufficiently wide to
represent all memory addresses (e.g., \texttt{l} on 64-bit architectures).
Temporaries in the IL can only have a base type.

\begin{code}
extType :: Parser Q.ExtType
extType = (Q.Base <$> baseType)
  <|> bind "b" Q.Byte
  <|> bind "h" Q.HalfWord
\end{code}

Extended types contain base types plus \texttt{b} (byte) and \texttt{h} (half word),
respectively for 8-bit and 16-bit integers. They are used in \nameref{sec:aggregate-types}
and \nameref{sec:data} definitions.

For C interfacing, the IL also provides user-defined aggregate types as
well as signed and unsigned variants of the sub-word extended types.
Read more about these types in the \nameref{sec:aggregate-types}
and \nameref{sec:functions} sections.

\subsection{Subtyping}
\label{sec:subtyping}

The IL has a minimal subtyping feature, for integer types only. Any
value of type \texttt{l} can be used in a \texttt{w} context. In that case, only the
32 least significant bits of the long value are used.

Note that this is the opposite of the usual subtyping on integers (in
C, we can safely use an \texttt{int} where a \texttt{long} is expected). A word value
cannot be used in long context. The rationale is that a word can be
signed or unsigned, so extending it to a long could be done in two ways,
either by zero-extension, or by sign-extension.
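
As a small sketch of this rule, the fragment below uses a long temporary in a
word context and then widens the result again. The temporaries \texttt{\%big},
\texttt{\%lo}, and \texttt{\%wide} are hypothetical names chosen for
illustration, and the \texttt{extsw} instruction is assumed from the conversion
instructions of the QBE IL. The explicit extension is what makes the choice
between sign-extension and zero-extension visible in the code.

\begin{verbatim}
# %big holds 2^32 + 2; its 32 least significant bits are 2.
%big =l copy 4294967298
# Word context: only the low 32 bits of %big are read, so %lo is 3.
%lo =w add %big, 1
# A word must be extended explicitly before it is used as a long.
%wide =l extsw %lo
\end{verbatim}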

\subsection{Constants and Vals}
\label{sec:constants-and-vals}

\begin{code}
-- TODO: the thread keyword is not parsed here; since constant already
-- accepts a bare global, the Q.Thread alternative is currently unreachable.
dynConst :: Parser Q.DynConst
dynConst =
  (Q.Const <$> constant)
    <|> (Q.Thread <$> global)
    <?> "dynconst"
\end{code}

Constants come in two kinds: compile-time constants and dynamic
constants. Dynamic constants include compile-time constants and other
symbol variants that are only known at program-load time or execution
time. Consequently, dynamic constants can only occur in function bodies.

The representation of integers is two's complement.
Floating-point numbers are represented using the single-precision and
double-precision formats of the IEEE 754 standard.

\begin{code}
constant :: Parser Q.Const
constant =
  (Q.Number <$> decNumber)
    <|> (Q.SFP <$> sfp)
    <|> (Q.DFP <$> dfp)
    <|> (Q.Global <$> global)
    <?> "const"
  where
    sfp = string "s_" >> float
    dfp = string "d_" >> float
\end{code}

Constants specify a sequence of bits and are untyped. They are always
parsed as 64-bit blobs. Depending on the context surrounding a constant,
only some of its bits are used. For example, in the program below, the
two variables defined have the same value since the first operand of the
subtraction is a word (32-bit) context.

\begin{verbatim}
%x =w sub -1, 0
%y =w sub 4294967295, 0
\end{verbatim}

Because specifying floating-point constants by their bits makes the code
less readable, syntactic sugar is provided to express them. Standard
scientific notation is prefixed with \texttt{s\_} and \texttt{d\_} for single and
double precision numbers respectively. Once again, the following example
defines the same double-precision constant twice.

\begin{verbatim}
%x =d add d_0, d_-1
%y =d add d_0, -4616189618054758400
\end{verbatim}

Global symbols can also be used directly as constants; they will be
resolved and turned into actual numeric constants by the linker.

When the \texttt{thread} keyword prefixes a symbol name, the
symbol\textquotesingle s numeric value is resolved at runtime in the
thread-local storage.

\begin{code}
val :: Parser Q.Value
val =
  (Q.VConst <$> dynConst)
    <|> (Q.VLocal <$> local)
    <?> "val"
\end{code}

Vals are used as arguments in regular, phi, and jump instructions within
function definitions. They are either constants or function-scope
temporaries.

\subsection{Linkage}
\label{sec:linkage}

\begin{code}
linkage :: Parser Q.Linkage
linkage =
  wsNL (bind "export" Q.LExport)
    <|> wsNL (bind "thread" Q.LThread)
    <|> do
      _ <- ws1 $ string "section"
      (try secWithFlags) <|> sec
  where
    sec :: Parser Q.Linkage
    sec = wsNL strLit <&> (`Q.LSection` Nothing)

    secWithFlags :: Parser Q.Linkage
    secWithFlags = do
      n <- ws1 strLit
      wsNL strLit <&> Q.LSection n . Just
\end{code}

Function and data definitions (see below) can specify linkage
information to be passed to the assembler and eventually to the linker.

The \texttt{export} linkage flag marks the defined item as visible outside the
current file\textquotesingle s scope. If absent, the symbol can only be
referred to locally. Functions compiled by QBE and called from C need to
be exported.

The \texttt{thread} linkage flag can only qualify data definitions. It mandates
that the object defined is stored in thread-local storage. Each time a
runtime thread starts, the supporting platform runtime is in charge of
making a new copy of the object for the fresh thread.
Objects in
thread-local storage must be accessed using the \texttt{thread \$IDENT} syntax,
as specified in the \nameref{sec:constants-and-vals} section.

A \texttt{section} flag can be specified to tell the linker to put the defined
item in a certain section. The use of the section flag is platform
dependent and we refer the user to the documentation of their assembler
and linker for relevant information.

\begin{verbatim}
section ".init_array" data $.init.f = { l $f }
\end{verbatim}

The section flag can be used to add function pointers to a global
initialization list, as depicted above. Note that some platforms provide
a BSS section that can be used to minimize the footprint of uniformly
zeroed data. When this section is available, QBE will automatically make
use of it and no section flag is required.

The section and export linkage flags should each appear at most once in
a definition. If multiple occurrences are present, QBE is free to use
any.

\subsection{Definitions}
\label{sec:definitions}

Definitions are the essential components of an IL file. They can define
three types of objects: aggregate types, data, and functions. Aggregate
types are never exported and do not compile to any code. Data and
function definitions have file scope and are mutually recursive (even
across IL files). Their visibility can be controlled using linkage
flags.

\subsubsection{Aggregate Types}
\label{sec:aggregate-types}

\begin{code}
typeDef :: Parser Q.TypeDef
typeDef = do
  _ <- wsNL1 (string "type")
  i <- wsNL1 userDef
  _ <- wsNL1 (char '=')
  a <- optionMaybe alignAny
  bracesNL (opaqueType <|> unionType <|> regularType) <&> Q.TypeDef i a
\end{code}

Aggregate type definitions start with the \texttt{type} keyword. They have file
scope, but types must be defined before being referenced.
The inner
structure of a type is expressed by a comma-separated list of fields.

\begin{code}
subType :: Parser Q.SubType
subType =
  (Q.SExtType <$> extType)
    <|> (Q.SUserDef <$> userDef)

field :: Parser Q.Field
field = do
  -- TODO: newline is required if there is a number argument
  f <- wsNL subType
  s <- ws $ optionMaybe decNumber
  pure (f, s)

fields :: Bool -> Parser [Q.Field]
fields allowEmpty =
  (if allowEmpty then sepByTrail else sepByTrail1) field (wsNL $ char ',')
\end{code}

A field consists of a subtype, either an extended type or a user-defined type,
and an optional number giving how many consecutive items of that subtype the
field holds. In case many items of the same type are sequenced (like in a C
array), this shorter array syntax can be used.

\begin{code}
regularType :: Parser Q.AggType
regularType = Q.ARegular <$> fields True
\end{code}

Three different kinds of aggregate types are presently supported: regular
types, union types, and opaque types. The fields of regular types will be
packed. By default, the alignment of an aggregate type is the maximum alignment
of its members. The alignment can be explicitly specified by the programmer.

\begin{code}
unionType :: Parser Q.AggType
unionType = Q.AUnion <$> many1 (wsNL unionType')
  where
    unionType' :: Parser [Q.Field]
    unionType' = bracesNL $ fields False
\end{code}

Union types allow the same chunk of memory to be used with different layouts. They are defined by enclosing multiple regular aggregate type bodies in a pair of curly braces. Size and alignment of union types are set to the maximum size and alignment of each variation or, in the case of alignment, can be explicitly specified.
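
The following definitions sketch the forms described so far. The type names are
illustrative only; the bodies show, in order, a plain field list, the array
shorthand, an explicit alignment, and a union of two single-field variations.

\begin{verbatim}
type :fourfloats = { s, s, d, d }
type :abyteandmanywords = { b, w 100 }
type :cryptovector = align 16 { w 4 }
type :un9 = { { b } { s } }
\end{verbatim}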

\begin{code}
opaqueType :: Parser Q.AggType
opaqueType = Q.AOpaque <$> wsNL decNumber
\end{code}

Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined simply by enclosing their size between curly braces.

\subsubsection{Data}
\label{sec:data}

\begin{code}
dataDef :: Parser Q.DataDef
dataDef = do
  link <- many linkage
  name <- wsNL1 (string "data") >> wsNL global
  _ <- wsNL (char '=')
  alignment <- optionMaybe alignAny
  bracesNL dataObjs <&> Q.DataDef link name alignment
  where
    -- TODO: sepByTrail is not documented in the QBE BNF.
    dataObjs = sepByTrail dataObj (wsNL $ char ',')
\end{code}

Data definitions express objects that will be emitted in the compiled
file. Their visibility and location in the compiled artifact are
controlled with linkage flags described in the \nameref{sec:linkage}
section.

They define a global identifier (starting with the sigil \texttt{\$}), that
will contain a pointer to the object specified by the definition.

\begin{code}
dataObj :: Parser Q.DataObj
dataObj =
  (Q.OZeroFill <$> (wsNL1 (char 'z') >> wsNL decNumber))
    <|> do
      t <- wsNL1 extType
      i <- many1 (wsNL dataItem)
      return $ Q.OItem t i
\end{code}

Objects are described by a sequence of fields that start with a type
letter. This letter can either be an extended type, or the \texttt{z} letter.
If the letter used is an extended type, the data item following
specifies the bits to be stored in the field.

\begin{code}
dataItem :: Parser Q.DataItem
dataItem =
  (Q.DString <$> strLit)
    <|> try
      ( do
          i <- ws global
          off <- (ws $ char '+') >> ws decNumber
          return $ Q.DSymOff i off
      )
    <|> (Q.DConst <$> constant)
\end{code}

Within each object, several items can be defined.
When several data items
follow a letter, they initialize multiple fields of the same size.

\begin{code}
allocSize :: Parser Q.AllocSize
allocSize =
  choice
    [ bind "4" Q.AlignWord,
      bind "8" Q.AlignLong,
      bind "16" Q.AlignLongLong
    ]
\end{code}

The members of a struct will be packed. This means that padding has to
be emitted by the frontend when necessary. The alignment of a whole data
object can be manually specified, and when no alignment is provided,
the maximum alignment from the platform is used.

When the \texttt{z} letter is used, the number following indicates the size of
the field; the contents of the field are zero initialized. It can be
used to add padding between fields or zero-initialize big arrays.

\subsubsection{Functions}
\label{sec:functions}

\begin{code}
funcDef :: Parser Q.FuncDef
funcDef = do
  link <- many linkage
  _ <- ws1 (string "function")
  retTy <- optionMaybe (ws1 abity)
  name <- ws global
  args <- wsNL params
  body <- between (wsNL1 $ char '{') (wsNL $ char '}') $ many1 block

  case (Q.insertJumps body) of
    Nothing -> fail $ "invalid fallthrough in " ++ show name
    Just bl -> return $ Q.FuncDef link name retTy args bl
\end{code}

Function definitions contain the actual code to emit in the compiled
file. They define a global symbol that contains a pointer to the
function code. This pointer can be used in \texttt{call} instructions or stored
in memory.

\begin{code}
subWordType :: Parser Q.SubWordType
subWordType = choice
  [ try $ bind "sb" Q.SignedByte
  , try $ bind "ub" Q.UnsignedByte
  , bind "sh" Q.SignedHalf
  , bind "uh" Q.UnsignedHalf ]

abity :: Parser Q.Abity
abity = try (Q.ASubWordType <$> subWordType)
  <|> (Q.ABase <$> baseType)
  <|> (Q.AUserDef <$> userDef)
\end{code}

The type given right before the function name is the return type of the
function.
All return values of this function must have this return type.
If the return type is missing, the function must not return any value.

\begin{code}
param :: Parser Q.FuncParam
param = (Q.Env <$> (ws1 (string "env") >> local))
  <|> (string "..." >> pure Q.Variadic)
  <|> do
    ty <- ws1 abity
    Q.Regular ty <$> local

params :: Parser [Q.FuncParam]
params = parenLst param
\end{code}

The parameter list is a comma-separated list of temporary names prefixed
by types. The types are used to correctly implement C compatibility.
When an argument has an aggregate type, a pointer to the aggregate is
passed by the caller. In the example below, we have to use a load
instruction to get the value of the first (and only) member of the
struct.

\begin{verbatim}
type :one = { w }

function w $getone(:one %p) {
@start
    %val =w loadw %p
    ret %val
}
\end{verbatim}

If a function accepts or returns values that are smaller than a word,
such as \texttt{signed char} or \texttt{unsigned short} in C, one of the sub-word types
must be used. The sub-word types \texttt{sb}, \texttt{ub}, \texttt{sh}, and \texttt{uh} stand,
respectively, for signed and unsigned 8-bit values, and signed and
unsigned 16-bit values. Parameters associated with a sub-word type of
bit width N only have their N least significant bits set and have base
type \texttt{w}. For example, the function

\begin{verbatim}
function w $addbyte(w %a, sb %b) {
@start
    %bw =w extsb %b
    %val =w add %a, %bw
    ret %val
}
\end{verbatim}

needs to sign-extend its second argument before the addition. Dually,
return values with sub-word types do not need to be sign or zero
extended.

If the parameter list ends with \texttt{...}, the function is a variadic
function: it can accept a variable number of arguments.
To access the
extra arguments provided by the caller, use the \texttt{vastart} and \texttt{vaarg}
instructions described in the \nameref{sec:variadic} section.

Optionally, the parameter list can start with an environment parameter
\texttt{env \%e}. This special parameter is a 64-bit integer temporary (i.e.,
of type \texttt{l}). If the function does not use its environment parameter,
callers can safely omit it. This parameter is invisible to a C caller:
for example, the function

\begin{verbatim}
export function w $add(env %e, w %a, w %b) {
@start
    %c =w add %a, %b
    ret %c
}
\end{verbatim}

must be given the C prototype \texttt{int add(int, int)}. The intended use of
this feature is to pass the environment pointer of closures while
retaining very good compatibility with C. The \nameref{sec:call}
section explains how to pass an environment parameter.

Since global symbols are mutually recursive, there is no need
for function declarations: a function can be referenced before its
definition. Similarly, functions from other modules can be used without
previous declaration. All the type information necessary to compile a
call is in the instruction itself.

The syntax and semantics for the body of functions are described in the
\nameref{sec:control} section.

\section{Control}
\label{sec:control}

The IL represents programs as textual transcriptions of control flow
graphs. The control flow is serialized as a sequence of blocks of
straight-line code which are connected using jump instructions.

\subsection{Blocks}
\label{sec:blocks}

\begin{code}
block :: Parser Q.Block'
block = do
  l <- wsNL1 label
  p <- many (wsNL1 $ try phiInstr)
  s <- many (wsNL1 statement)
  Q.Block' l p s <$> (optionMaybe $ wsNL1 jumpInstr)
\end{code}

All blocks have a name that is specified by a label at their beginning.
Then follows a sequence of instructions that have "fall-through" flow.
Finally one jump terminates the block. The jump can either transfer
control to another block of the same function or return; jumps are
described further below.

The first block in a function must not be the target of any jump in the
program. If a jump to the function start is needed, the frontend must
insert an empty prelude block at the beginning of the function.

When one block jumps to the next block in the IL file, it is not
necessary to write the jump instruction; it will be automatically added
by the parser. For example, a \texttt{@@start} block that is directly followed
by a \texttt{@@loop} block falls through to the loop block without an explicit
\texttt{jmp}.

\subsection{Jumps}
\label{sec:jumps}

\begin{code}
jumpInstr :: Parser Q.JumpInstr
jumpInstr = (string "hlt" >> pure Q.Halt)
  -- TODO: Return requires a space if there is an optionMaybe
  <|> Q.Return <$> ((ws $ string "ret") >> optionMaybe val)
  <|> try (Q.Jump <$> ((ws1 $ string "jmp") >> label))
  <|> do
    _ <- ws1 $ string "jnz"
    v <- ws val <* ws (char ',')
    l1 <- ws label <* ws (char ',')
    l2 <- ws label
    return $ Q.Jnz v l1 l2
\end{code}

A jump instruction ends every block and transfers the control to another
program location. The target of a jump must never be the first block in
a function. The four kinds of jumps available are described in the
following list.

\begin{enumerate}
  \item \textbf{Unconditional jump.} Jumps to another block of the same function.
  \item \textbf{Conditional jump.} When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type; because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0.
  \item \textbf{Function return.} Terminates the execution of the current function, optionally returning a value to the caller.
The value returned must be of the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used.
  \item \textbf{Program termination.} Terminates the execution of the program with a target-dependent error. This instruction can be used when it is expected that the execution never reaches the end of the block it closes; for example, after having called a function such as \texttt{exit()}.
\end{enumerate}

\section{Instructions}
\label{sec:instructions}

\begin{code}
instr :: Parser Q.Instr
instr =
  choice
    [ try $ binaryInstr Q.Add "add",
      try $ binaryInstr Q.Sub "sub",
      try $ binaryInstr Q.Mul "mul",
      try $ binaryInstr Q.Div "div",
      try $ binaryInstr Q.URem "urem",
      try $ binaryInstr Q.Rem "rem",
      try $ binaryInstr Q.UDiv "udiv",
      try $ binaryInstr Q.Or "or",
      try $ binaryInstr Q.Xor "xor",
      try $ binaryInstr Q.And "and",
      try $ binaryInstr Q.Sar "sar",
      try $ binaryInstr Q.Shr "shr",
      try $ binaryInstr Q.Shl "shl",
      try $ unaryInstr Q.Neg "neg",
      try $ unaryInstr Q.Cast "cast",
      try $ unaryInstr Q.Copy "copy",
      try $ loadInstr,
      try $ allocInstr,
      try $ compareInstr,
      try $ extInstr
    ]
\end{code}

Instructions are the smallest piece of code in the IL; they form the body of
\nameref{sec:blocks}. This specification distinguishes instructions and
volatile instructions; the latter do not return a value. For the former, the IL
uses a three-address code, which means that one instruction computes an
operation between two operands and assigns the result to a third one.

\begin{code}
assign :: Parser Q.Statement
assign = do
  n <- ws local
  t <- ws (char '=') >> ws1 baseType
  Q.Assign n t <$> instr

volatileInstr :: Parser Q.Statement
volatileInstr = Q.Volatile <$> (storeInstr <|> blitInstr)

-- TODO: Not documented in the QBE BNF.
statement :: Parser Q.Statement
statement = (try callInstr) <|> assign <|> volatileInstr
\end{code}

An instruction has both a name and a return type; this return type is a base
type that defines the size of the instruction's result. The type of the
arguments can be unambiguously inferred using the instruction name and the
return type. For example, for all arithmetic instructions, the type of the
arguments is the same as the return type. The two additions below are valid if
\texttt{\%y} is a word or a long (because of \nameref{sec:subtyping}).

\begin{verbatim}
%x =w add 0, %y
%z =w add %x, %x
\end{verbatim}

Some instructions, like comparisons and memory loads, have operand types
that differ from their return types. For instance, two floating-point
numbers can be compared to give a word result (1 if the comparison
succeeds, 0 if it fails).

\begin{verbatim}
%c =w cgts %a, %b
\end{verbatim}

In the example above, both operands have to have single type. This is
made explicit by the instruction suffix.

\subsection{Arithmetic and Bits}

\begin{quote}
\begin{itemize}
\item \texttt{add}, \texttt{sub}, \texttt{div}, \texttt{mul}
\item \texttt{neg}
\item \texttt{udiv}, \texttt{rem}, \texttt{urem}
\item \texttt{or}, \texttt{xor}, \texttt{and}
\item \texttt{sar}, \texttt{shr}, \texttt{shl}
\end{itemize}
\end{quote}

The base arithmetic instructions in the first bullet are available for
all types, integers and floating points.

When \texttt{div} is used with word or long return type, the arguments are
treated as signed. The unsigned integral division is available as \texttt{udiv}
instruction. When the result of a division is not an integer, it is truncated
towards zero.

The signed and unsigned remainder operations are available as \texttt{rem} and
\texttt{urem}. The sign of the remainder is the same as the one of the
dividend.
Its magnitude is smaller than that of the divisor. These two instructions
and \texttt{udiv} are only available with integer arguments and result.

Bitwise OR, AND, and XOR operations are available for both integer
types. Logical operations of typical programming languages can be
implemented using \nameref{sec:comparisions} and \nameref{sec:jumps}.

The shift instructions \texttt{sar}, \texttt{shr}, and \texttt{shl} shift their
first operand right or left by the amount given in the second operand. The
shifting amount is taken modulo the size of the result type. Shifting right can
either preserve the sign of the value (using \texttt{sar}), or fill the newly
freed bits with zeroes (using \texttt{shr}). Shifting left always fills the
freed bits with zeroes.

Note that an arithmetic shift right (\texttt{sar}) is only equivalent to a
division by a power of two for non-negative numbers. This is because the shift
right ``truncates'' towards minus infinity, while the division truncates towards
zero.

\subsection{Memory}
\label{sec:memory}

The following sections discuss instructions for interacting with values stored in memory.

\subsubsection{Store instructions}

\begin{code}
storeInstr :: Parser Q.VolatileInstr
storeInstr = do
  t <- string "store" >> ws1 extType
  v <- ws val
  _ <- ws $ char ','
  ws val <&> Q.Store t v
\end{code}

Store instructions exist to store a value of any base type and any extended
type. Since halfwords and bytes are not first class in the IL, \texttt{storeh}
and \texttt{storeb} take a word as argument. Only the least significant 16 or 8
bits of this word will be stored in memory at the address specified in the
second argument.
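
As a small illustration of the store instructions described above, the
following sketch writes a full word and then overwrites a single byte of it.
The temporary names \texttt{\%p} and \texttt{\%v} are placeholders chosen for
this example and are assumed to hold an address and a word value, respectively.

\begin{verbatim}
storew %v, %p    # store all 32 bits of %v at address %p
storeb %v, %p    # store only the 8 least significant bits of %v
\end{verbatim}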

\subsubsection{Load instructions}

\begin{code}
loadInstr :: Parser Q.Instr
loadInstr = do
  _ <- string "load"
  t <- ws1 $ choice
    [ try $ bind "sw" (Q.LBase Q.Word),
      try $ bind "uw" (Q.LBase Q.Word),
      try $ Q.LSubWord <$> subWordType,
      Q.LBase <$> baseType
    ]
  ws val <&> Q.Load t
\end{code}

For types smaller than long, two variants of the load instruction are
available: one will sign extend the loaded value, while the other will zero
extend it. Note that all loads smaller than long can load to either a long or a
word.

The two instructions \texttt{loadsw} and \texttt{loaduw} have the same effect
when they are used to define a word temporary. A \texttt{loadw} instruction is
provided as syntactic sugar for \texttt{loadsw} to make explicit that the
extension mechanism used is irrelevant.

\subsubsection{Blits}

\begin{code}
blitInstr :: Parser Q.VolatileInstr
blitInstr = do
  v1 <- (ws1 $ string "blit") >> ws val <* (ws $ char ',')
  v2 <- ws val <* (ws $ char ',')
  nb <- decNumber
  return $ Q.Blit v1 v2 nb
\end{code}

The blit instruction copies in-memory data from its first address argument to
its second address argument. The third argument is the number of bytes to copy.
The source and destination spans are required to be either non-overlapping, or
fully overlapping (source address identical to the destination address). The
byte count argument must be a nonnegative numeric constant; it cannot be a
temporary.

One blit instruction may generate a number of instructions proportional to its
byte count argument; consequently, it is recommended to keep this argument
relatively small. If large copies are necessary, it is preferable that
frontends generate calls to a supporting \texttt{memcpy} function.
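
For illustration, the following sketch copies a 16-byte stack slot into another
one with a single blit. The temporaries \texttt{\%src} and \texttt{\%dst} are
hypothetical names; both slots are obtained from stack allocation instructions,
so the two spans do not overlap.

\begin{verbatim}
%src =l alloc8 16
%dst =l alloc8 16
blit %src, %dst, 16
\end{verbatim}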

\subsubsection{Stack Allocation}

\begin{code}
allocInstr :: Parser Q.Instr
allocInstr = do
  siz <- (ws $ string "alloc") >> (ws1 allocSize)
  val <&> Q.Alloc siz
\end{code}

These instructions allocate a chunk of memory on the stack. The number ending
the instruction name is the alignment required for the allocated slot. QBE will
make sure that the returned address is a multiple of that alignment value.

Stack allocation instructions are used, for example, when compiling C local
variables, because their address can be taken. When compiling Fortran,
temporaries can be used directly instead, because it is illegal to take the
address of a variable.

\subsection{Comparisons}
\label{sec:comparisions}

Comparison instructions return an integer value (either a word or a long), and
compare values of arbitrary types. The returned value is 1 if the two operands
satisfy the comparison relation, or 0 otherwise. The names of comparisons
respect a standard naming scheme in three parts.

\begin{code}
compareInstr :: Parser Q.Instr
compareInstr = do
  _ <- char 'c'
  op <- compareOp
  ty <- ws1 baseType
  lhs <- ws val <* ws (char ',')
  rhs <- ws val
  pure $ Q.Compare ty op lhs rhs
\end{code}

\begin{code}
compareOp :: Parser Q.CmpOp
compareOp = choice
  [ bind "eq" Q.CEq
  , bind "ne" Q.CNe
  , try $ bind "sle" Q.CSle
  , try $ bind "slt" Q.CSlt
  , try $ bind "sge" Q.CSge
  , try $ bind "sgt" Q.CSgt
  , try $ bind "ule" Q.CUle
  , try $ bind "ult" Q.CUlt
  , try $ bind "uge" Q.CUge
  , try $ bind "ugt" Q.CUgt ]
\end{code}

For example, \texttt{cod} compares two double-precision floating point numbers
and returns 1 if the two floating points are not NaNs, or 0 otherwise.
The
\texttt{csltw} instruction compares two words representing signed numbers and
returns 1 when the first argument is smaller than the second one.

\subsection{Conversions}

\begin{code}
subLongType :: Parser Q.SubLongType
subLongType = try (Q.SLSubWord <$> subWordType)
  <|> bind "sw" Q.SLSignedWord
  <|> bind "uw" Q.SLUnsignedWord

extInstr :: Parser Q.Instr
extInstr = do
  _ <- string "ext"
  ty <- ws1 subLongType
  ws val <&> Q.Ext ty
\end{code}

Conversion operations change the representation of a value, possibly modifying
it if the target type cannot hold the value of the source type. Conversions can
extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or
convert a floating point into an integer and vice versa.

\subsection{Cast and Copy}

The \texttt{cast} and \texttt{copy} instructions return the bits of their
argument verbatim. However, a cast will change an integer into a floating point
of the same width and vice versa.

Casts can be used to perform bitwise operations on the representation of
floating point numbers. For example, the following program computes the
negation of the single-precision floating point number \texttt{\%f} and stores
it in \texttt{\%rs}.

\begin{verbatim}
%b0 =w cast %f
%b1 =w xor 2147483648, %b0 # flip the msb
%rs =s cast %b1
\end{verbatim}

\subsection{Call}
\label{sec:call}

\begin{code}
-- TODO: Code duplication with 'param'.
callArg :: Parser Q.FuncArg
callArg = (Q.ArgEnv <$> (ws1 (string "env") >> val))
      <|> (string "..."
            >> pure Q.ArgVar)
      <|> do
            ty <- ws1 abity
            Q.ArgReg ty <$> val

callArgs :: Parser [Q.FuncArg]
callArgs = parenLst callArg

callInstr :: Parser Q.Statement
callInstr = do
  retValue <- optionMaybe $ do
    i <- ws local <* ws (char '=')
    a <- ws1 abity
    return (i, a)
  toCall <- ws1 (string "call") >> ws val
  fnArgs <- callArgs
  return $ Q.Call retValue toCall fnArgs
\end{code}

The call instruction is special in several ways. It is not a three-address
instruction and requires the type of all its arguments to be given. Also, the
return type can be either a base type or an aggregate type. These specifics are
required to compile calls with C compatibility (i.e., to respect the ABI).

When an aggregate type is used as argument type or return type, the value
respectively passed or returned needs to be a pointer to a memory location
holding the value. This is because aggregate types are not first-class
citizens of the IL.

Sub-word types are used for arguments and return values of width less than a
word. Details on these types are presented in the \nameref{sec:functions} section.
Arguments with sub-word types need not be sign or zero extended according to
their type. Calls with a sub-word return type define a temporary of base type
\texttt{w} with its most significant bits unspecified.

Unless the called function does not return a value, a return temporary must be
specified, even if it is never used afterwards.

An environment parameter can be passed as the first argument using the
\texttt{env} keyword. The passed value must be a 64-bit integer. If the called
function does not expect an environment parameter, it will be safely discarded.
See the
\nameref{sec:functions} section for more information about environment
parameters.

When the called function is variadic, there must be a \texttt{...} marker
separating the named and variadic arguments.

\subsection{Variadic}
\label{sec:variadic}

To-Do.

\subsection{Phi}

\begin{code}
phiBranch :: Parser (Q.BlockIdent, Q.Value)
phiBranch = do
  n <- ws1 label
  v <- val
  pure (n, v)

phiInstr :: Parser Q.Phi
phiInstr = do
  -- TODO: code duplication with 'assign'
  n <- ws local
  t <- ws (char '=') >> ws1 baseType

  _ <- ws1 (string "phi")
  -- TODO: combinator for sepBy
  p <- Map.fromList <$> sepBy1 (ws phiBranch) (ws $ char ',')
  return $ Q.Phi n t p
\end{code}

First and foremost, phi instructions are NOT necessary when writing a frontend
to QBE. One solution to avoid having to deal with SSA form is to use stack
allocated variables for all source program variables and perform assignments
and lookups using \nameref{sec:memory} operations. This is what LLVM users
typically do.

Another solution is to simply emit code that is not in SSA form! Contrary to
LLVM, QBE is able to fix up programs not in SSA form without requiring the
boilerplate of loading and storing in memory. For example, the following
program will be correctly compiled by QBE.

\begin{verbatim}
@start
        %x =w copy 100
        %s =w copy 0
@loop
        %s =w add %s, %x
        %x =w sub %x, 1
        jnz %x, @loop, @end
@end
        ret %s
\end{verbatim}

Now, if you want to know what phi instructions are and how to use them in QBE,
you can read the following.

Phi instructions are specific to SSA form. In SSA form, values can only be
assigned once; without phi instructions, this requirement is too strong to
represent many programs.
For example, consider the following C program.

\begin{verbatim}
int f(int x) {
        int y;
        if (x)
                y = 1;
        else
                y = 2;
        return y;
}
\end{verbatim}

The variable \texttt{y} is assigned twice; the solution to translate it into SSA
form is to insert a phi instruction.

\begin{verbatim}
@ifstmt
        jnz %x, @ift, @iff
@ift
        jmp @retstmt
@iff
        jmp @retstmt
@retstmt
        %y =w phi @ift 1, @iff 2
        ret %y
\end{verbatim}

Phi instructions return one of their arguments depending on where the control
came from. In the example, \texttt{\%y} is set to 1 if the
\texttt{\char64{}ift} branch is taken, or it is set to 2 otherwise.

An important remark about phi instructions is that QBE assumes that if a
variable is defined by a phi it respects all the SSA invariants. So it is
critical to not use phi instructions unless you know exactly what you are
doing.
\end{document}