quebex

A software analysis framework built around the QBE intermediate language

git clone https://git.8pit.net/quebex.git

   1% SPDX-FileCopyrightText: 2015-2024 Quentin Carbonneaux <quentin@c9x.me>
   2% SPDX-FileCopyrightText: 2025 Sören Tempel <soeren+git@soeren-tempel.net>
   3%
   4% SPDX-License-Identifier: MIT AND GPL-3.0-only
   5
   6\documentclass{article}
   7%include polycode.fmt
   8
   9%subst blankline = "\\[5mm]"
  10
  11% See https://github.com/kosmikus/lhs2tex/issues/58
  12%format <$> = "\mathbin{\langle\$\rangle}"
  13%format <&> = "\mathbin{\langle\&\rangle}"
  14%format <|> = "\mathbin{\langle\:\vline\:\rangle}"
  15%format <?> = "\mathbin{\langle?\rangle}"
  16%format <*> = "\mathbin{\langle*\rangle}"
  17%format <*  = "\mathbin{\langle*}"
  18%format *>  = "\mathbin{*\rangle}"
  19
  20\long\def\ignore#1{}
  21
  22\usepackage{hyperref}
  23\hypersetup{
  24	colorlinks = true,
  25}
  26
  27\begin{document}
  28
  29\title{QBE Intermediate Language\vspace{-2em}}
  30\date{}
  31\maketitle
  32\frenchspacing
  33
  34\ignore{
  35\begin{code}
  36module Language.QBE.Parser (dataDef, typeDef, funcDef) where
  37
  38import Data.Char (chr)
  39import Data.Word (Word64)
  40import Data.Functor ((<&>))
  41import Data.List (singleton)
  42import Data.Map qualified as Map
  43import qualified Language.QBE.Types as Q
  44import Language.QBE.Util (bind, decNumber, octNumber, float)
  45import Text.ParserCombinators.Parsec
  46  ( Parser,
  47    alphaNum,
  48    anyChar,
  49    between,
  50    char,
  51    choice,
  52    letter,
  53    many,
  54    many1,
  55    manyTill,
  56    newline,
  57    noneOf,
  58    oneOf,
  59    optional,
  60    optionMaybe,
  61    sepBy,
  62    sepBy1,
  63    skipMany,
  64    skipMany1,
  65    string,
  66    try,
  67    (<?>),
  68    (<|>),
  69  )
  70\end{code}
  71}
  72
This is an executable description of the
\href{https://c9x.me/compile/doc/il-v1.2.html}{QBE intermediate language},
specified through \href{https://hackage.haskell.org/package/parsec}{Parsec}
parser combinators and generated from a literate Haskell file. The description
is derived from the original QBE IL documentation, licensed under MIT.
Presently, this implementation targets version 1.2 of the QBE intermediate
language and aims to be equivalent to the original specification.
  80
  81\section{Basic Concepts}
  82
The intermediate language (IL) is a higher-level language than the
machine's assembly language. It smooths out most of the
irregularities of the underlying hardware and allows an infinite number
of temporaries to be used. This higher abstraction level lets frontend
programmers focus on language design issues.
  88
  89\subsection{Input Files}
  90
The intermediate language is provided to QBE as text. Usually, one file
is generated per compilation unit from the frontend input language.
An IL file is a sequence of \nameref{sec:definitions} for
data, functions, and types. Once processed by QBE, the resulting file
can be assembled and linked using a standard toolchain (e.g., GNU
binutils).
  97
  98\begin{code}
  99comment :: Parser ()
 100comment = skipMany blankNL >> comment' >> skipMany blankNL
 101  where
 102    comment' = char '#' >> manyTill anyChar newline
 103\end{code}
 104
 105\ignore{
 106\begin{code}
 107skipNoCode :: Parser () -> Parser ()
 108skipNoCode blankP = try (skipMany1 comment <?> "comments") <|> blankP
 109\end{code}
 110}
 111
Here is a complete ``Hello World'' IL file which defines a function that
prints to the screen. Since the string is not a first-class object (only
the pointer is), it is defined outside the function\textquotesingle s
body. Comments start with a \# character and finish with the end of the
line.
 117
 118\begin{verbatim}
 119data $str = { b "hello world", b 0 }
 120
 121export function w $main() {
 122@start
 123        # Call the puts function with $str as argument.
 124        %r =w call $puts(l $str)
 125        ret 0
 126}
 127\end{verbatim}
 128
 129If you have read the LLVM language reference, you might recognize the
 130example above. In comparison, QBE makes a much lighter use of types and
 131the syntax is terser.
 132
 133\subsection{Parser Combinators}
 134
 135\ignore{
 136\begin{code}
 137bracesNL :: Parser a -> Parser a
 138bracesNL = between (wsNL $ char '{') (wsNL $ char '}')
 139
 140quoted :: Parser a -> Parser a
 141quoted = let q = char '"' in between q q
 142
 143sepByTrail1 :: Parser a -> Parser sep -> Parser [a]
 144sepByTrail1 p sep = do
 145  x <- p
 146  xs <- many (try $ sep >> p)
 147  _ <- optional sep
 148  return (x:xs)
 149
 150sepByTrail :: Parser a -> Parser sep -> Parser [a]
 151sepByTrail p sep = sepByTrail1 p sep <|> return []
 152
 153parenLst :: Parser a -> Parser [a]
 154parenLst p = between (ws $ char '(') (char ')') inner
 155  where
 156    inner = sepBy (ws p) (ws $ char ',')
 157
 158unaryInstr :: (Q.Value -> Q.Instr) -> String -> Parser Q.Instr
 159unaryInstr conc keyword = do
 160  _ <- ws (string keyword)
 161  conc <$> ws val
 162
 163binaryInstr :: (Q.Value -> Q.Value -> Q.Instr) -> String -> Parser Q.Instr
 164binaryInstr conc keyword = do
 165  _ <- ws (string keyword)
 166  vfst <- ws val <* ws (char ',')
 167  conc vfst <$> ws val
 168
 169-- Can only appear in data and type definitions and hence allows newlines.
 170alignAny :: Parser Word64
 171alignAny = (ws1 (string "align")) >> wsNL decNumber
 172\end{code}
 173}
 174
 175The original QBE specification defines the syntax using a BNF grammar. In
 176contrast, this document defines it using Parsec parser combinators. As such,
 177this specification is less formal but more accurate as the parsing code is
 178actually executable. Consequently, this specification also captures constructs
 179omitted in the original specification (e.g., \nameref{sec:identifiers}, or
 180\nameref{sec:strlit}). Nonetheless, the formal language recognized by these
 181combinators aims to be equivalent to the one of the BNF grammar.
 182
 183\subsection{Identifiers}
 184\label{sec:identifiers}
 185
 186% Ident is not documented in the original QBE specification.
 187% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n304
 188
 189\begin{code}
 190ident :: Parser String
 191ident = do
 192  start <- letter <|> oneOf "._"
 193  rest <- many (alphaNum <|> oneOf "$._")
 194  return $ start : rest
 195\end{code}
 196
 197Identifiers for data, types, and functions can start with any ASCII letter or
 198the special characters \texttt{.} and \texttt{\_}. This initial character can
 199be followed by a sequence of zero or more alphanumeric characters and the
 200special characters \texttt{\$}, \texttt{.}, and \texttt{\_}.
 201
 202\subsection{Sigils}
 203
 204\begin{code}
 205userDef :: Parser Q.UserIdent
 206userDef = Q.UserIdent <$> (char ':' >> ident)
 207
 208global :: Parser Q.GlobalIdent
 209global = Q.GlobalIdent <$> (char '$' >> ident)
 210
 211local :: Parser Q.LocalIdent
 212local = Q.LocalIdent <$> (char '%' >> ident)
 213
 214label :: Parser Q.BlockIdent
 215label = Q.BlockIdent <$> (char '@' >> ident)
 216\end{code}
 217
The intermediate language makes heavy use of sigils: all user-defined
names are prefixed with a sigil. This avoids keyword conflicts and
makes the scope and nature of identifiers easy to spot.
 221
 222\begin{itemize}
 223  \item \texttt{:} is for user-defined \nameref{sec:aggregate-types}
 224  \item \texttt{\$} is for globals (represented by a pointer)
 225  \item \texttt{\%} is for function-scope temporaries
 226  \item \texttt{@@} is for block labels
 227\end{itemize}
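
As a small illustration of the four sigils, the following sketch defines an
aggregate type, a global data object, and a function with one temporary and
one block label. All names used here (\texttt{:pair}, \texttt{\$counter},
\texttt{\$get}, \texttt{\%v}) are made up for this example.

\begin{verbatim}
type :pair = { w, w }

data $counter = { w 0 }

function w $get() {
@start
        %v =w loadw $counter
        ret %v
}
\end{verbatim}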
 228
 229\subsection{Spacing}
 230
 231\begin{code}
 232blank :: Parser Char
 233blank = oneOf "\t " <?> "blank"
 234
 235blankNL :: Parser Char
 236blankNL = oneOf "\n\t " <?> "blank or newline"
 237\end{code}
 238
 239Individual tokens in IL files must be separated by one or more spacing
 240characters. Both spaces and tabs are recognized as spacing characters.
 241In data and type definitions, newlines may also be used as spaces to
 242prevent overly long lines. When exactly one of two consecutive tokens is
 243a symbol (for example \texttt{,} or \texttt{=} or \texttt{\{}), spacing may be omitted.
 244
 245\ignore{
 246\begin{code}
 247ws :: Parser a -> Parser a
 248ws p = p <* skipMany blank
 249
 250ws1 :: Parser a -> Parser a
 251ws1 p = p <* skipMany1 blank
 252
 253wsNL :: Parser a -> Parser a
 254wsNL p = p <* skipNoCode (skipMany blankNL)
 255
 256wsNL1 :: Parser a -> Parser a
 257wsNL1 p = p <* skipNoCode (skipMany1 blankNL)
 258\end{code}
 259}
 260
 261\subsection{String Literals}
 262\label{sec:strlit}
 263
 264% The string literal is not documented in the original QBE specification.
 265% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n287
 266
\begin{code}
strLit :: Parser String
strLit = concat <$> quoted (many strChr)
  where
    strChr :: Parser [Char]
    strChr = (singleton <$> noneOf "\"\\") <|> escSeq

    -- TODO: not documented in the QBE BNF.
    octEsc :: Parser Char
    octEsc = do
      n <- octNumber
      pure $ chr (fromIntegral n)

    escSeq :: Parser [Char]
    escSeq = try $ do
      esc <- char '\\'
      (singleton <$> octEsc) <|> (anyChar <&> (\c -> [esc, c]))
\end{code}
 285
 286Strings are enclosed by double quotes and are, for example, used to specify a
 287section name as part of the \nameref{sec:linkage} information. Within a string,
 288a double quote can be escaped using a \texttt{\textbackslash} character. All
 289escape sequences, including double quote escaping, are passed through as-is to
 290the generated assembly file.
 291
 292\section{Types}
 293
 294\subsection{Simple Types}
 295
The IL makes minimal use of types. By design, the types used are
restricted to what is necessary for unambiguous compilation to machine
code and C interfacing. Unlike LLVM, QBE does not use types as a means
of safety; they are only here for semantic purposes.
 300
 301\begin{code}
 302baseType :: Parser Q.BaseType
 303baseType = choice
 304  [ bind "w" Q.Word
 305  , bind "l" Q.Long
 306  , bind "s" Q.Single
 307  , bind "d" Q.Double ]
 308\end{code}
 309
The four base types are \texttt{w} (word), \texttt{l} (long), \texttt{s} (single), and \texttt{d}
(double); they stand, respectively, for 32-bit and 64-bit integers, and
32-bit and 64-bit floating-point numbers. There are no pointer types
available; pointers are typed by an integer type sufficiently wide to
represent all memory addresses (e.g., \texttt{l} on 64-bit architectures).
Temporaries in the IL can only have a base type.
 316
 317\begin{code}
 318extType :: Parser Q.ExtType
 319extType = (Q.Base <$> baseType)
 320       <|> bind "b" Q.Byte
 321       <|> bind "h" Q.HalfWord
 322\end{code}
 323
 324Extended types contain base types plus \texttt{b} (byte) and \texttt{h} (half word),
 325respectively for 8-bit and 16-bit integers. They are used in \nameref{sec:aggregate-types}
 326and \nameref{sec:data} definitions.
 327
 328For C interfacing, the IL also provides user-defined aggregate types as
 329well as signed and unsigned variants of the sub-word extended types.
 330Read more about these types in the \nameref{sec:aggregate-types}
 331and \nameref{sec:functions} sections.
 332
 333\subsection{Subtyping}
 334\label{sec:subtyping}
 335
The IL has a minimal subtyping feature, for integer types only. Any
value of type \texttt{l} can be used in a \texttt{w} context. In that case, only the
32 least significant bits of the long value are used.

Note that this is the opposite of the usual subtyping on integers (in
C, we can safely use an \texttt{int} where a \texttt{long} is expected). A word value
cannot be used in a long context. The rationale is that a word can be
signed or unsigned, so extending it to a long could be done in two ways,
either by zero-extension or by sign-extension.
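
The rule can be illustrated with the following sketch; the temporaries are
made up for this example. The invalid instruction is left commented out, and
the explicit extension instructions \texttt{extsw} and \texttt{extuw} show the
two ways a word can be widened to a long.

\begin{verbatim}
%a =l copy 4294967297
# Valid: %a is a long used in a word context.
# Only its 32 least significant bits (here 1) are used.
%b =w add %a, 1
# Invalid: %b is a word and cannot be used in a long context.
# %c =l add %b, 1
# Instead, the extension has to be made explicit.
%c =l extsw %b
%d =l extuw %b
\end{verbatim}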
 345
 346\subsection{Constants and Vals}
 347\label{sec:constants-and-vals}
 348
\begin{code}
dynConst :: Parser Q.DynConst
dynConst =
  (Q.Const <$> constant)
    -- Thread-local symbols are written as "thread $IDENT".
    <|> (Q.Thread <$> (ws1 (string "thread") >> global))
    <?> "dynconst"
\end{code}
 356
 357Constants come in two kinds: compile-time constants and dynamic
 358constants. Dynamic constants include compile-time constants and other
 359symbol variants that are only known at program-load time or execution
 360time. Consequently, dynamic constants can only occur in function bodies.
 361
 362The representation of integers is two's complement.
 363Floating-point numbers are represented using the single-precision and
 364double-precision formats of the IEEE 754 standard.
 365
 366\begin{code}
 367constant :: Parser Q.Const
 368constant =
 369  (Q.Number <$> decNumber)
 370    <|> (Q.SFP <$> sfp)
 371    <|> (Q.DFP <$> dfp)
 372    <|> (Q.Global <$> global)
 373    <?> "const"
 374  where
 375    sfp = string "s_" >> float
 376    dfp = string "d_" >> float
 377\end{code}
 378
 379Constants specify a sequence of bits and are untyped. They are always
 380parsed as 64-bit blobs. Depending on the context surrounding a constant,
 381only some of its bits are used. For example, in the program below, the
 382two variables defined have the same value since the first operand of the
 383subtraction is a word (32-bit) context.
 384
\begin{verbatim}
%x =w sub -1, 0
%y =w sub 4294967295, 0
\end{verbatim}
 388
Because specifying floating-point constants by their bits makes the code
less readable, syntactic sugar is provided to express them. Standard
scientific notation is prefixed with \texttt{s\_} and \texttt{d\_} for single and
double precision numbers respectively. Once again, the following example
defines the same double-precision constant twice.
 394
 395\begin{verbatim}
 396%x =d add d_0, d_-1
 397%y =d add d_0, -4616189618054758400
 398\end{verbatim}
 399
 400Global symbols can also be used directly as constants; they will be
 401resolved and turned into actual numeric constants by the linker.
 402
 403When the \texttt{thread} keyword prefixes a symbol name, the
 404symbol\textquotesingle s numeric value is resolved at runtime in the
 405thread-local storage.
 406
 407\begin{code}
 408val :: Parser Q.Value
 409val =
 410  (Q.VConst <$> dynConst)
 411    <|> (Q.VLocal <$> local)
 412    <?> "val"
 413\end{code}
 414
 415Vals are used as arguments in regular, phi, and jump instructions within
 416function definitions. They are either constants or function-scope
 417temporaries.
 418
 419\subsection{Linkage}
 420\label{sec:linkage}
 421
 422\begin{code}
 423linkage :: Parser Q.Linkage
 424linkage =
 425  wsNL (bind "export" Q.LExport)
 426    <|> wsNL (bind "thread" Q.LThread)
 427    <|> do
 428      _ <- ws1 $ string "section"
 429      (try secWithFlags) <|> sec
 430  where
 431    sec :: Parser Q.Linkage
 432    sec = wsNL strLit <&> (`Q.LSection` Nothing)
 433
 434    secWithFlags :: Parser Q.Linkage
 435    secWithFlags = do
 436      n <- ws1 strLit
 437      wsNL strLit <&> Q.LSection n . Just
 438\end{code}
 439
 440Function and data definitions (see below) can specify linkage
 441information to be passed to the assembler and eventually to the linker.
 442
 443The \texttt{export} linkage flag marks the defined item as visible outside the
 444current file\textquotesingle s scope. If absent, the symbol can only be
 445referred to locally. Functions compiled by QBE and called from C need to
 446be exported.
 447
 448The \texttt{thread} linkage flag can only qualify data definitions. It mandates
 449that the object defined is stored in thread-local storage. Each time a
 450runtime thread starts, the supporting platform runtime is in charge of
 451making a new copy of the object for the fresh thread. Objects in
 452thread-local storage must be accessed using the \texttt{thread \$IDENT} syntax,
 453as specified in the \nameref{sec:constants-and-vals} section.
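
A minimal sketch of a thread-local counter, assuming the \texttt{thread \$IDENT}
syntax described above; the names \texttt{\$counter} and \texttt{\$next} are
made up for this example.

\begin{verbatim}
thread data $counter = { w 0 }

export function w $next() {
@start
        %c =w loadw thread $counter
        %n =w add %c, 1
        storew %n, thread $counter
        ret %n
}
\end{verbatim}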
 454
 455A \texttt{section} flag can be specified to tell the linker to put the defined
 456item in a certain section. The use of the section flag is platform
 457dependent and we refer the user to the documentation of their assembler
 458and linker for relevant information.
 459
 460\begin{verbatim}
 461section ".init_array" data $.init.f = { l $f }
 462\end{verbatim}
 463
 464The section flag can be used to add function pointers to a global
 465initialization list, as depicted above. Note that some platforms provide
 466a BSS section that can be used to minimize the footprint of uniformly
 467zeroed data. When this section is available, QBE will automatically make
 468use of it and no section flag is required.
 469
 470The section and export linkage flags should each appear at most once in
 471a definition. If multiple occurrences are present, QBE is free to use
 472any.
 473
 474\subsection{Definitions}
 475\label{sec:definitions}
 476
 477Definitions are the essential components of an IL file. They can define
 478three types of objects: aggregate types, data, and functions. Aggregate
 479types are never exported and do not compile to any code. Data and
 480function definitions have file scope and are mutually recursive (even
 481across IL files). Their visibility can be controlled using linkage
 482flags.
 483
 484\subsubsection{Aggregate Types}
 485\label{sec:aggregate-types}
 486
 487\begin{code}
 488typeDef :: Parser Q.TypeDef
 489typeDef = do
 490  _ <- wsNL1 (string "type")
 491  i <- wsNL1 userDef
 492  _ <- wsNL1 (char '=')
 493  a <- optionMaybe alignAny
 494  bracesNL (opaqueType <|> unionType <|> regularType) <&> Q.TypeDef i a
 495\end{code}
 496
 497Aggregate type definitions start with the \texttt{type} keyword. They have file
 498scope, but types must be defined before being referenced. The inner
 499structure of a type is expressed by a comma-separated list of fields.
 500
 501\begin{code}
 502subType :: Parser Q.SubType
 503subType =
 504  (Q.SExtType <$> extType)
 505    <|> (Q.SUserDef <$> userDef)
 506
 507field :: Parser Q.Field
 508field = do
 509  -- TODO: newline is required if there is a number argument
 510  f <- wsNL subType
 511  s <- ws $ optionMaybe decNumber
 512  pure (f, s)
 513
 514fields :: Bool -> Parser [Q.Field]
 515fields allowEmpty =
 516  (if allowEmpty then sepByTrail else sepByTrail1) field (wsNL $ char ',')
 517\end{code}
 518
A field consists of a subtype, either an extended type or a user-defined type,
and an optional number giving how many consecutive items of that subtype the
field holds. When many items of the same type are sequenced (like in a C
array), this shorter array syntax can be used.
 523
 524\begin{code}
 525regularType :: Parser Q.AggType
 526regularType = Q.ARegular <$> fields True
 527\end{code}
 528
Three different kinds of aggregate types are presently supported: regular
types, union types, and opaque types. The fields of regular types will be
packed. By default, the alignment of an aggregate type is the maximum alignment
of its members. The alignment can be explicitly specified by the programmer.
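
The following type definitions, modelled on the examples in the upstream QBE
documentation, sketch regular types: a plain list of fields, the array syntax
(one byte followed by 100 words), and an explicitly specified alignment. The
type names are illustrative.

\begin{verbatim}
type :fourfloats = { s, s, d, d }
type :abyteandmanywords = { b, w 100 }
type :cryptovector = align 16 { w 4 }
\end{verbatim}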
 533
 534\begin{code}
 535unionType :: Parser Q.AggType
 536unionType = Q.AUnion <$> many1 (wsNL unionType')
 537  where
 538    unionType' :: Parser [Q.Field]
 539    unionType' = bracesNL $ fields False
 540\end{code}
 541
 542Union types allow the same chunk of memory to be used with different layouts. They are defined by enclosing multiple regular aggregate type bodies in a pair of curly braces. Size and alignment of union types are set to the maximum size and alignment of each variation or, in the case of alignment, can be explicitly specified.
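
As a sketch, the union below (modelled on the upstream QBE documentation; the
type name is illustrative) overlays a byte and a single-precision float:

\begin{verbatim}
type :un9 = { { b } { s } }
\end{verbatim}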
 543
 544\begin{code}
 545opaqueType :: Parser Q.AggType
 546opaqueType = Q.AOpaque <$> wsNL decNumber
 547\end{code}
 548
 549Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined simply by enclosing their size between curly braces.
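
For instance, the following sketch (modelled on the upstream QBE
documentation; the type name is illustrative) declares an opaque type of 32
bytes with an alignment of 16:

\begin{verbatim}
type :opaque = align 16 { 32 }
\end{verbatim}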
 550
 551\subsubsection{Data}
 552\label{sec:data}
 553
 554\begin{code}
 555dataDef :: Parser Q.DataDef
 556dataDef = do
 557  link <- many linkage
 558  name <- wsNL1 (string "data") >> wsNL global
 559  _ <- wsNL (char '=')
 560  alignment <- optionMaybe alignAny
 561  bracesNL dataObjs <&> Q.DataDef link name alignment
 562 where
 563    -- TODO: sepByTrail is not documented in the QBE BNF.
 564    dataObjs = sepByTrail dataObj (wsNL $ char ',')
 565\end{code}
 566
 567Data definitions express objects that will be emitted in the compiled
 568file. Their visibility and location in the compiled artifact are
 569controlled with linkage flags described in the \nameref{sec:linkage}
 570section.
 571
 572They define a global identifier (starting with the sigil \texttt{\$}), that
 573will contain a pointer to the object specified by the definition.
 574
 575\begin{code}
 576dataObj :: Parser Q.DataObj
 577dataObj =
 578  (Q.OZeroFill <$> (wsNL1 (char 'z') >> wsNL decNumber))
 579    <|> do
 580      t <- wsNL1 extType
 581      i <- many1 (wsNL dataItem)
 582      return $ Q.OItem t i
 583\end{code}
 584
 585Objects are described by a sequence of fields that start with a type
 586letter. This letter can either be an extended type, or the \texttt{z} letter.
 587If the letter used is an extended type, the data item following
 588specifies the bits to be stored in the field.
 589
 590\begin{code}
 591dataItem :: Parser Q.DataItem
 592dataItem =
 593  (Q.DString <$> strLit)
 594    <|> try
 595      ( do
 596          i <- ws global
 597          off <- (ws $ char '+') >> ws decNumber
 598          return $ Q.DSymOff i off
 599      )
 600    <|> (Q.DConst <$> constant)
 601\end{code}
 602
 603Within each object, several items can be defined. When several data items
 604follow a letter, they initialize multiple fields of the same size.
 605
 606\begin{code}
 607allocSize :: Parser Q.AllocSize
 608allocSize =
 609  choice
 610    [ bind "4" Q.AlignWord,
 611      bind "8" Q.AlignLong,
 612      bind "16" Q.AlignLongLong
 613    ]
 614\end{code}
 615
The members of a struct will be packed. This means that padding has to
be emitted by the frontend when necessary. The alignment of the whole data
object can be manually specified, and when no alignment is provided,
the maximum alignment from the platform is used.

When the \texttt{z} letter is used, the number following it indicates the size of
the field; the contents of the field are zero-initialized. It can be
used to add padding between fields or to zero-initialize big arrays.
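
The following data definitions, modelled on the examples in the upstream QBE
documentation, sketch the constructs described in this section; the object
names are illustrative.

\begin{verbatim}
# Three 32-bit values 1, 2, and 3
# followed by a 0 byte.
data $a = { w 1 2 3, b 0 }

# A thousand bytes 0 initialized.
data $b = { z 1000 }

# An object containing two 64-bit
# fields, one with all bits set and the
# other containing a pointer to the
# object itself.
data $c = { l -1, l $c }
\end{verbatim}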
 624
 625\subsubsection{Functions}
 626\label{sec:functions}
 627
 628\begin{code}
 629funcDef :: Parser Q.FuncDef
 630funcDef = do
 631  link <- many linkage
 632  _ <- ws1 (string "function")
 633  retTy <- optionMaybe (ws1 abity)
 634  name <- ws global
 635  args <- wsNL params
 636  body <- between (wsNL1 $ char '{') (wsNL $ char '}') $ many1 block
 637
 638  case (Q.insertJumps body) of
 639    Nothing -> fail $ "invalid fallthrough in " ++ show name
 640    Just bl -> return $ Q.FuncDef link name retTy args bl
 641\end{code}
 642
 643Function definitions contain the actual code to emit in the compiled
 644file. They define a global symbol that contains a pointer to the
 645function code. This pointer can be used in \texttt{call} instructions or stored
 646in memory.
 647
 648\begin{code}
 649subWordType :: Parser Q.SubWordType
 650subWordType = choice
 651  [ try $ bind "sb" Q.SignedByte
 652  , try $ bind "ub" Q.UnsignedByte
 653  , bind "sh" Q.SignedHalf
 654  , bind "uh" Q.UnsignedHalf ]
 655
 656abity :: Parser Q.Abity
 657abity = try (Q.ASubWordType <$> subWordType)
 658    <|> (Q.ABase <$> baseType)
 659    <|> (Q.AUserDef <$> userDef)
 660\end{code}
 661
 662The type given right before the function name is the return type of the
 663function. All return values of this function must have this return type.
 664If the return type is missing, the function must not return any value.
 665
 666\begin{code}
 667param :: Parser Q.FuncParam
 668param = (Q.Env <$> (ws1 (string "env") >> local))
 669    <|> (string "..." >> pure Q.Variadic)
 670    <|> do
 671          ty <- ws1 abity
 672          Q.Regular ty <$> local
 673
 674params :: Parser [Q.FuncParam]
 675params = parenLst param
 676\end{code}
 677
The parameter list is a comma-separated list of temporary names prefixed
by types. The types are used to correctly implement C compatibility.
When an argument has an aggregate type, a pointer to the aggregate is
passed by the caller. In the example below, we have to use a load
instruction to get the value of the first (and only) member of the
struct.
 684
 685\begin{verbatim}
 686type :one = { w }
 687
 688function w $getone(:one %p) {
 689@start
 690        %val =w loadw %p
 691        ret %val
 692}
 693\end{verbatim}
 694
If a function accepts or returns values that are smaller than a word,
such as \texttt{signed char} or \texttt{unsigned short} in C, one of the sub-word types
must be used. The sub-word types \texttt{sb}, \texttt{ub}, \texttt{sh}, and \texttt{uh} stand,
respectively, for signed and unsigned 8-bit values, and signed and
unsigned 16-bit values. Parameters associated with a sub-word type of
bit width N only have their N least significant bits set and have base
type \texttt{w}. For example, the function
 702
 703\begin{verbatim}
 704function w $addbyte(w %a, sb %b) {
 705@start
 706        %bw =w extsb %b
 707        %val =w add %a, %bw
 708        ret %val
 709}
 710\end{verbatim}
 711
 712needs to sign-extend its second argument before the addition. Dually,
 713return values with sub-word types do not need to be sign or zero
 714extended.
 715
 716If the parameter list ends with \texttt{...}, the function is a variadic
 717function: it can accept a variable number of arguments. To access the
 718extra arguments provided by the caller, use the \texttt{vastart} and \texttt{vaarg}
 719instructions described in the \nameref{sec:variadic} section.
 720
 721Optionally, the parameter list can start with an environment parameter
 722\texttt{env \%e}. This special parameter is a 64-bit integer temporary (i.e.,
 723of type \texttt{l}). If the function does not use its environment parameter,
 724callers can safely omit it. This parameter is invisible to a C caller:
 725for example, the function
 726
 727\begin{verbatim}
 728export function w $add(env %e, w %a, w %b) {
 729@start
 730        %c =w add %a, %b
 731        ret %c
 732}
 733\end{verbatim}
 734
 735must be given the C prototype \texttt{int add(int, int)}. The intended use of
 736this feature is to pass the environment pointer of closures while
 737retaining a very good compatibility with C. The \nameref{sec:call}
 738section explains how to pass an environment parameter.
 739
Since global symbols are mutually recursive, there is no need
for function declarations: a function can be referenced before its
definition. Similarly, functions from other modules can be used without
previous declaration. All the type information necessary to compile a
call is in the instruction itself.
 745
 746The syntax and semantics for the body of functions are described in the
 747\nameref{sec:control} section.
 748
 749\section{Control}
 750\label{sec:control}
 751
 752The IL represents programs as textual transcriptions of control flow
 753graphs. The control flow is serialized as a sequence of blocks of
 754straight-line code which are connected using jump instructions.
 755
 756\subsection{Blocks}
 757\label{sec:blocks}
 758
 759\begin{code}
 760block :: Parser Q.Block'
 761block = do
 762  l <- wsNL1 label
 763  p <- many (wsNL1 $ try phiInstr)
 764  s <- many (wsNL1 statement)
 765  Q.Block' l p s <$> (optionMaybe $ wsNL1 jumpInstr)
 766\end{code}
 767
All blocks have a name that is specified by a label at their beginning.
Then follows a sequence of instructions that have ``fall-through'' flow.
Finally, one jump terminates the block. The jump can either transfer
control to another block of the same function or return; jumps are
described further below.
 773
 774The first block in a function must not be the target of any jump in the
 775program. If a jump to the function start is needed, the frontend must
 776insert an empty prelude block at the beginning of the function.
 777
When one block jumps to the next block in the IL file, it is not
necessary to write the jump instruction; it will be automatically added
by the parser. For example, the start block in the example below jumps
directly to the loop block.
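
A minimal sketch of such a fall-through, modelled on the loop example in the
upstream QBE documentation; the function name and temporaries are
illustrative. The start block has no explicit jump and falls through to the
loop block, which counts down from 100.

\begin{verbatim}
function $loop() {
@start
@loop
        %x =w phi @start 100, @loop %x1
        %x1 =w sub %x, 1
        jnz %x1, @loop, @end
@end
        ret
}
\end{verbatim}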
 782
 783\subsection{Jumps}
 784\label{sec:jumps}
 785
 786\begin{code}
 787jumpInstr :: Parser Q.JumpInstr
 788jumpInstr = (string "hlt" >> pure Q.Halt)
 789        -- TODO: Return requires a space if there is an optionMaybe
 790        <|> Q.Return <$> ((ws $ string "ret") >> optionMaybe val)
 791        <|> try (Q.Jump <$> ((ws1 $ string "jmp") >> label))
 792        <|> do
 793          _ <- ws1 $ string "jnz"
 794          v <- ws val <* ws (char ',')
 795          l1 <- ws label <* ws (char ',')
 796          l2 <- ws label
 797          return $ Q.Jnz v l1 l2
 798\end{code}
 799
A jump instruction ends every block and transfers the control to another
program location. The target of a jump must never be the first block in
a function. The four kinds of jumps available are described in the
following list.
 804
 805\begin{enumerate}
 806  \item \textbf{Unconditional jump.} Jumps to another block of the same function.
 807  \item \textbf{Conditional jump.} When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type; because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0.
 808  \item \textbf{Function return.} Terminates the execution of the current function, optionally returning a value to the caller. The value returned must be of the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used.
 809  \item \textbf{Program termination.} Terminates the execution of the program with a target-dependent error. This instruction can be used when it is expected that the execution never reaches the end of the block it closes; for example, after having called a function such as \texttt{exit()}.
 810\end{enumerate}
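
The jumps listed above can be sketched as follows. The function names, labels,
and temporaries are made up for this example, and \texttt{\$exit} is assumed
to be the C library function of that name.

\begin{verbatim}
function w $abs(w %x) {
@start
        %isneg =w csltw %x, 0
        # Conditional jump on a word argument.
        jnz %isneg, @negate, @keep
@negate
        %y =w neg %x
        # Unconditional jump.
        jmp @result
@keep
        %y0 =w copy %x
        jmp @result
@result
        %r =w phi @negate %y, @keep %y0
        # Function return with a value.
        ret %r
}

function $die() {
@start
        call $exit(w 1)
        # Program termination: control is not expected to reach this point.
        hlt
}
\end{verbatim}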
 811
 812\section{Instructions}
 813\label{sec:instructions}
 814
 815\begin{code}
 816instr :: Parser Q.Instr
 817instr =
 818  choice
 819    [ try $ binaryInstr Q.Add "add",
 820      try $ binaryInstr Q.Sub "sub",
 821      try $ binaryInstr Q.Mul "mul",
 822      try $ binaryInstr Q.Div "div",
 823      try $ binaryInstr Q.URem "urem",
 824      try $ binaryInstr Q.Rem "rem",
 825      try $ binaryInstr Q.UDiv "udiv",
 826      try $ binaryInstr Q.Or "or",
 827      try $ binaryInstr Q.Xor "xor",
 828      try $ binaryInstr Q.And "and",
 829      try $ binaryInstr Q.Sar "sar",
 830      try $ binaryInstr Q.Shr "shr",
 831      try $ binaryInstr Q.Shl "shl",
 832      try $ unaryInstr Q.Neg "neg",
 833      try $ unaryInstr Q.Cast "cast",
 834      try $ unaryInstr Q.Copy "copy",
 835      try $ loadInstr,
 836      try $ allocInstr,
 837      try $ compareInstr,
 838      try $ extInstr
 839    ]
 840\end{code}
 841
Instructions are the smallest piece of code in the IL; they form the body of
\nameref{sec:blocks}. This specification distinguishes instructions and
volatile instructions; the latter do not return a value. For the former, the IL
uses a three-address code, which means that one instruction computes an
operation between two operands and assigns the result to a third one.
 847
 848\begin{code}
 849assign :: Parser Q.Statement
 850assign = do
 851  n <- ws local
 852  t <- ws (char '=') >> ws1 baseType
 853  Q.Assign n t <$> instr
 854
 855volatileInstr :: Parser Q.Statement
 856volatileInstr = Q.Volatile <$> (storeInstr <|> blitInstr)
 857
 858-- TODO: Not documented in the QBE BNF.
 859statement :: Parser Q.Statement
 860statement = (try callInstr) <|> assign <|> volatileInstr
 861\end{code}
 862
An instruction has both a name and a return type. The return type is a base
type that defines the size of the instruction's result. The type of the
arguments can be unambiguously inferred from the instruction name and the
return type. For example, for all arithmetic instructions, the type of the
arguments is the same as the return type. The two additions below are valid if
\texttt{\%y} is a word or a long (because of \nameref{sec:subtyping}).
 869
 870\begin{verbatim}
 871%x =w add 0, %y
 872%z =w add %x, %x
 873\end{verbatim}
 874
Some instructions, like comparisons and memory loads, have operand types
that differ from their return types. For instance, two floating points
can be compared to give a word result (1 if the comparison succeeds, 0
if it fails).
 879
 880\begin{verbatim}
 881%c =w cgts %a, %b
 882\end{verbatim}
 883
 884In the example above, both operands have to have single type. This is
 885made explicit by the instruction suffix.
 886
 887\subsection{Arithmetic and Bits}
 888
 889\begin{quote}
 890\begin{itemize}
 891\item \texttt{add}, \texttt{sub}, \texttt{div}, \texttt{mul}
 892\item \texttt{neg}
 893\item \texttt{udiv}, \texttt{rem}, \texttt{urem}
 894\item \texttt{or}, \texttt{xor}, \texttt{and}
 895\item \texttt{sar}, \texttt{shr}, \texttt{shl}
 896\end{itemize}
 897\end{quote}
 898
 899The base arithmetic instructions in the first bullet are available for
 900all types, integers and floating points.
 901
 902When \texttt{div} is used with word or long return type, the arguments are
 903treated as signed. The unsigned integral division is available as \texttt{udiv}
 904instruction. When the result of a division is not an integer, it is truncated
 905towards zero.
 906
The signed and unsigned remainder operations are available as \texttt{rem} and
\texttt{urem}. The sign of the remainder is the same as that of the dividend,
and its magnitude is smaller than that of the divisor. These two instructions
and \texttt{udiv} are only available with integer arguments and result.
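
The rounding and sign rules above can be checked with a short Haskell sketch for the word type. The function names are hypothetical; Haskell's \texttt{quot} and \texttt{rem} happen to follow the same conventions (truncation towards zero, remainder with the sign of the dividend), and the unsigned variants are modelled by reinterpreting the same 32 bits as unsigned.

\begin{verbatim}
import Data.Int (Int32)
import Data.Word (Word32)

-- Signed word division and remainder.
divW, remW :: Int32 -> Int32 -> Int32
divW = quot
remW = rem

-- Unsigned variants: reinterpret the same 32 bits as unsigned first.
udivW, uremW :: Int32 -> Int32 -> Int32
udivW a b = fromIntegral ((fromIntegral a :: Word32) `quot` fromIntegral b)
uremW a b = fromIntegral ((fromIntegral a :: Word32) `rem` fromIntegral b)

-- divW (-7) 2  == -3          remW (-7) 2  == -1
-- divW 7 (-2)  == -3          remW 7 (-2)  == 1
-- udivW (-7) 2 == 2147483644  uremW (-7) 2 == 1
\end{verbatim}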
 911
 912Bitwise OR, AND, and XOR operations are available for both integer
 913types. Logical operations of typical programming languages can be
 914implemented using \nameref{sec:comparisions} and \nameref{sec:jumps}.
 915
 916Shift instructions \texttt{sar}, \texttt{shr}, and \texttt{shl}, shift right or
 917left their first operand by the amount from the second operand. The shifting
 918amount is taken modulo the size of the result type. Shifting right can either
 919preserve the sign of the value (using \texttt{sar}), or fill the newly freed
 920bits with zeroes (using \texttt{shr}). Shifting left always fills the freed
 921bits with zeroes.
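
As a sketch of these rules for the word type, the following Haskell definitions model the three shifts on 32-bit values. All names are hypothetical, and the modulo reduction is written as a bit mask because the width is a power of two.

\begin{verbatim}
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Int (Int32)
import Data.Word (Word32)

-- The shift amount is taken modulo the 32-bit width of a word.
amount :: Word32 -> Int
amount n = fromIntegral (n .&. 31)

shlW, shrW, sarW :: Word32 -> Word32 -> Word32
shlW x n = x `shiftL` amount n                      -- fills with zeroes
shrW x n = x `shiftR` amount n                      -- fills with zeroes
sarW x n = fromIntegral (signed `shiftR` amount n)  -- keeps the sign
  where signed = fromIntegral x :: Int32

-- shlW 1 33          == 2           (33 modulo 32 is 1)
-- shrW 0x80000000 31 == 1
-- sarW 0x80000000 31 == 0xFFFFFFFF
\end{verbatim}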
 922
Note that an arithmetic shift right (\texttt{sar}) is only equivalent to a
division by a power of two for non-negative numbers. This is because the shift
right ``truncates'' towards minus infinity, while the division truncates towards
zero.
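
The difference only shows for negative operands, as the following Haskell comparison illustrates (the function names are hypothetical).

\begin{verbatim}
import Data.Bits (shiftR)
import Data.Int (Int32)

sarBy1, divBy2 :: Int32 -> Int32
sarBy1 x = x `shiftR` 1   -- arithmetic shift: rounds towards minus infinity
divBy2 x = x `quot` 2     -- division: truncates towards zero

-- sarBy1 7    == 3     divBy2 7    == 3
-- sarBy1 (-7) == -4    divBy2 (-7) == -3
\end{verbatim}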
 927
 928\subsection{Memory}
 929\label{sec:memory}
 930
 931The following sections discuss instructions for interacting with values stored in memory.
 932
 933\subsubsection{Store instructions}
 934
 935\begin{code}
 936storeInstr :: Parser Q.VolatileInstr
 937storeInstr = do
 938  t <- string "store" >> ws1 extType
 939  v <- ws val
 940  _ <- ws $ char ','
 941  ws val <&> Q.Store t v
 942\end{code}
 943
Store instructions exist to store a value of any base type and any extended
type. Since halfwords and bytes are not first class in the IL, \texttt{storeh}
and \texttt{storeb} take a word as argument. Only the least significant 16 or 8
bits of this word will be stored in memory at the address specified in the
second argument.
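
A minimal Haskell sketch of this truncation, assuming that the stored bits are the least significant ones of the word. The helper names are hypothetical and only model the value that ends up in memory, not the memory access itself.

\begin{verbatim}
import Data.Word (Word16, Word32, Word8)

-- Hypothetical helpers: the bits that storeb and storeh write to memory.
storedByte :: Word32 -> Word8
storedByte = fromIntegral

storedHalf :: Word32 -> Word16
storedHalf = fromIntegral

-- storedByte 0x12345678 == 0x78
-- storedHalf 0x12345678 == 0x5678
\end{verbatim}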
 949
 950\subsubsection{Load instructions}
 951
 952\begin{code}
 953loadInstr :: Parser Q.Instr
 954loadInstr = do
 955  _ <- string "load"
 956  t <- ws1 $ choice
 957    [ try $ bind "sw" (Q.LBase Q.Word),
 958      try $ bind "uw" (Q.LBase Q.Word),
 959      try $ Q.LSubWord <$> subWordType,
 960      Q.LBase <$> baseType
 961    ]
 962  ws val <&> Q.Load t
 963\end{code}
 964
 965For types smaller than long, two variants of the load instruction are
 966available: one will sign extend the loaded value, while the other will zero
 967extend it. Note that all loads smaller than long can load to either a long or a
 968word.
 969
 970The two instructions \texttt{loadsw} and \texttt{loaduw} have the same effect
 971when they are used to define a word temporary. A \texttt{loadw} instruction is
 972provided as syntactic sugar for \texttt{loadsw} to make explicit that the
 973extension mechanism used is irrelevant.
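
The two extension mechanisms can be sketched in Haskell for a single byte loaded into a word. The functions below are hypothetical models named after the corresponding instructions; they take the byte read from memory rather than an address.

\begin{verbatim}
import Data.Int (Int32, Int8)
import Data.Word (Word8)

-- Hypothetical models: the word defined by loading one byte.
loadsb, loadub :: Word8 -> Int32
loadsb b = fromIntegral (fromIntegral b :: Int8)  -- sign extend
loadub b = fromIntegral b                         -- zero extend

-- loadsb 0xFF == -1     loadub 0xFF == 255
-- loadsb 0x7F == 127    loadub 0x7F == 127
\end{verbatim}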
 974
 975\subsubsection{Blits}
 976
 977\begin{code}
 978blitInstr :: Parser Q.VolatileInstr
 979blitInstr = do
 980  v1 <- (ws1 $ string "blit") >> ws val <* (ws $ char ',')
 981  v2 <- ws val <* (ws $ char ',')
 982  nb <- decNumber
 983  return $ Q.Blit v1 v2 nb
 984\end{code}
 985
 986The blit instruction copies in-memory data from its first address argument to
 987its second address argument. The third argument is the number of bytes to copy.
 988The source and destination spans are required to be either non-overlapping, or
 989fully overlapping (source address identical to the destination address). The
 990byte count argument must be a nonnegative numeric constant; it cannot be a
 991temporary.
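
The overlap requirement can be stated as a predicate. The following Haskell sketch is one possible formulation, not a check performed by this parser; the name \texttt{blitAllowed} is hypothetical and address wraparound is ignored.

\begin{verbatim}
import Data.Word (Word64)

-- Hypothetical check: may a blit copy n bytes from src to dst?
-- The spans must be disjoint or start at the same address.
blitAllowed :: Word64 -> Word64 -> Word64 -> Bool
blitAllowed src dst n = src == dst || src + n <= dst || dst + n <= src

-- blitAllowed 0 8 8   evaluates to True  (disjoint spans)
-- blitAllowed 16 16 8 evaluates to True  (fully overlapping spans)
-- blitAllowed 0 4 8   evaluates to False (partially overlapping spans)
\end{verbatim}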
 992
One blit instruction may generate a number of instructions proportional to its
byte count argument. Consequently, it is recommended to keep this argument
relatively small. If large copies are necessary, it is preferable that
frontends generate calls to a supporting \texttt{memcpy} function.
 997
 998\subsubsection{Stack Allocation}
 999
1000\begin{code}
1001allocInstr :: Parser Q.Instr
1002allocInstr = do
1003  siz <- (ws $ string "alloc") >> (ws1 allocSize)
1004  val <&> Q.Alloc siz
1005\end{code}
1006
1007These instructions allocate a chunk of memory on the stack. The number ending
1008the instruction name is the alignment required for the allocated slot. QBE will
1009make sure that the returned address is a multiple of that alignment value.
1010
Stack allocation instructions are used, for example, when compiling C local
variables, because their address can be taken. When compiling Fortran,
temporaries can be used directly instead, because it is illegal to take the
address of a variable.
1015
1016\subsection{Comparisons}
1017\label{sec:comparisions}
1018
Comparison instructions return an integer value (either a word or a long), and
compare values of arbitrary types. The returned value is 1 if the two operands
satisfy the comparison relation, or 0 otherwise. The names of comparisons
follow a standard naming scheme in three parts: the letter \texttt{c}, the
comparison operator, and a suffix naming the type of the operands.
1023
1024\begin{code}
1025compareInstr :: Parser Q.Instr
1026compareInstr = do
1027  _ <- char 'c'
1028  op <- compareOp
1029  ty <- ws1 baseType
1030  lhs <- ws val <* ws (char ',')
1031  rhs <- ws val
1032  pure $ Q.Compare ty op lhs rhs
1033\end{code}
1034
1035\begin{code}
1036compareOp :: Parser Q.CmpOp
1037compareOp = choice
1038  [ bind "eq" Q.CEq
1039  , bind "ne" Q.CNe
1040  , try $ bind "sle" Q.CSle
1041  , try $ bind "slt" Q.CSlt
1042  , try $ bind "sge" Q.CSge
1043  , try $ bind "sgt" Q.CSgt
1044  , try $ bind "ule" Q.CUle
1045  , try $ bind "ult" Q.CUlt
1046  , try $ bind "uge" Q.CUge
1047  , try $ bind "ugt" Q.CUgt ]
1048\end{code}
1049
1050For example, \texttt{cod} compares two double-precision floating point numbers
1051and returns 1 if the two floating points are not NaNs, or 0 otherwise. The
1052\texttt{csltw} instruction compares two words representing signed numbers and
1053returns 1 when the first argument is smaller than the second one.
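
The result convention and the difference between signed and unsigned comparisons can be sketched in Haskell. The functions below are hypothetical models named after the corresponding instructions; \texttt{cultw} is the unsigned counterpart of \texttt{csltw} under the same naming scheme.

\begin{verbatim}
import Data.Int (Int32)
import Data.Word (Word32)

-- Comparisons yield 1 when the relation holds and 0 otherwise.
flag :: Bool -> Word32
flag b = if b then 1 else 0

-- Ordered comparison: neither operand is a NaN.
cod :: Double -> Double -> Word32
cod a b = flag (not (isNaN a || isNaN b))

csltw, cultw :: Word32 -> Word32 -> Word32
csltw a b = flag ((fromIntegral a :: Int32) < fromIntegral b)  -- signed
cultw a b = flag (a < b)                                       -- unsigned

-- cod 1.0 2.0        == 1    cod (0 / 0) 2.0    == 0
-- csltw 0xFFFFFFFF 1 == 1    cultw 0xFFFFFFFF 1 == 0
\end{verbatim}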
1054
1055\subsection{Conversions}
1056
1057\begin{code}
1058subLongType :: Parser Q.SubLongType
1059subLongType = try (Q.SLSubWord <$> subWordType)
1060  <|> bind "sw" Q.SLSignedWord
1061  <|> bind "uw" Q.SLUnsignedWord
1062
1063extInstr :: Parser Q.Instr
1064extInstr = do
1065  _ <- string "ext"
1066  ty <- ws1 subLongType
1067  ws val <&> Q.Ext ty
1068\end{code}
1069
1070Conversion operations change the representation of a value, possibly modifying
1071it if the target type cannot hold the value of the source type. Conversions can
1072extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or
1073convert a floating point into an integer and vice versa.
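
As an illustration of precision extension, the following Haskell sketch models the extension of a word to a long. The functions are hypothetical models named after the \texttt{extsw} and \texttt{extuw} instructions.

\begin{verbatim}
import Data.Int (Int32, Int64)
import Data.Word (Word32)

-- Hypothetical models: extend a word to a long.
extsw, extuw :: Word32 -> Int64
extsw w = fromIntegral (fromIntegral w :: Int32)  -- sign extend
extuw w = fromIntegral w                          -- zero extend

-- extsw 0xFFFFFFFF == -1
-- extuw 0xFFFFFFFF == 4294967295
\end{verbatim}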
1074
1075\subsection{Cast and Copy}
1076
The \texttt{cast} and \texttt{copy} instructions return the bits of their
argument verbatim. However, a cast will change an integer into a floating point
of the same width and vice versa.

Casts can be used to perform bitwise operations on the representation of
floating point numbers. For example, the following program computes the
opposite of the single-precision floating point number \texttt{\%f} and stores
it in \texttt{\%rs}.
1084
1085\begin{verbatim}
1086%b0 =w cast %f
1087%b1 =w xor 2147483648, %b0  # flip the msb
1088%rs =s cast %b1
1089\end{verbatim}
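
The same computation can be reproduced in Haskell as a sanity check. The sketch assumes GHC, whose \texttt{GHC.Float} module provides bit-preserving casts between \texttt{Float} and \texttt{Word32}; the name \texttt{negateViaBits} is hypothetical.

\begin{verbatim}
import Data.Bits (xor)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Flip the most significant bit (0x80000000 is 2147483648).
negateViaBits :: Float -> Float
negateViaBits f = castWord32ToFloat (castFloatToWord32 f `xor` 0x80000000)

-- negateViaBits 1.5    == -1.5
-- negateViaBits (-2.0) == 2.0
\end{verbatim}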
1090
1091\subsection{Call}
1092\label{sec:call}
1093
1094\begin{code}
1095-- TODO: Code duplication with 'param'.
1096callArg :: Parser Q.FuncArg
1097callArg = (Q.ArgEnv <$> (ws1 (string "env") >> val))
1098    <|> (string "..." >> pure Q.ArgVar)
1099    <|> do
1100          ty <- ws1 abity
1101          Q.ArgReg ty <$> val
1102
1103callArgs :: Parser [Q.FuncArg]
1104callArgs = parenLst callArg
1105
1106callInstr :: Parser Q.Statement
1107callInstr = do
1108  retValue <- optionMaybe $ do
1109    i <- ws local <* ws (char '=')
1110    a <- ws1 abity
1111    return (i, a)
1112  toCall <- ws1 (string "call") >> ws val
1113  fnArgs <- callArgs
1114  return $ Q.Call retValue toCall fnArgs
1115\end{code}
1116
1117The call instruction is special in several ways. It is not a three-address
1118instruction and requires the type of all its arguments to be given. Also, the
1119return type can be either a base type or an aggregate type. These specifics are
1120required to compile calls with C compatibility (i.e., to respect the ABI).
1121
1122When an aggregate type is used as argument type or return type, the value
1123respectively passed or returned needs to be a pointer to a memory location
1124holding the value. This is because aggregate types are not first-class
1125citizens of the IL.
1126
1127Sub-word types are used for arguments and return values of width less than a
1128word. Details on these types are presented in the \nameref{sec:functions} section.
1129Arguments with sub-word types need not be sign or zero extended according to
1130their type. Calls with a sub-word return type define a temporary of base type
1131\texttt{w} with its most significant bits unspecified.
1132
1133Unless the called function does not return a value, a return temporary must be
1134specified, even if it is never used afterwards.
1135
1136An environment parameter can be passed as first argument using the \texttt{env}
1137keyword. The passed value must be a 64-bit integer. If the called function does
1138not expect an environment parameter, it will be safely discarded. See the
1139\nameref{sec:functions} section for more information about environment
1140parameters.
1141
1142When the called function is variadic, there must be a \texttt{...} marker
1143separating the named and variadic arguments.
1144
1145\subsection{Variadic}
1146\label{sec:variadic}
1147
1148To-Do.
1149
1150\subsection{Phi}
1151
1152\begin{code}
1153phiBranch :: Parser (Q.BlockIdent, Q.Value)
1154phiBranch = do
1155  n <- ws1 label
1156  v <- val
1157  pure (n, v)
1158
1159phiInstr :: Parser Q.Phi
1160phiInstr = do
1161  -- TODO: code duplication with 'assign'
1162  n <- ws local
1163  t <- ws (char '=') >> ws1 baseType
1164
1165  _ <- ws1 (string "phi")
1166  -- TODO: combinator for sepBy
1167  p <- Map.fromList <$> sepBy1 (ws phiBranch) (ws $ char ',')
1168  return $ Q.Phi n t p
1169\end{code}
1170
1171First and foremost, phi instructions are NOT necessary when writing a frontend
1172to QBE. One solution to avoid having to deal with SSA form is to use stack
1173allocated variables for all source program variables and perform assignments
1174and lookups using \nameref{sec:memory} operations. This is what LLVM users
1175typically do.
1176
Another solution is to simply emit code that is not in SSA form! Contrary to
LLVM, QBE is able to fix up programs not in SSA form without requiring the
boilerplate of loading and storing in memory. For example, the following
program will be correctly compiled by QBE.
1181
1182\begin{verbatim}
1183@start
1184    %x =w copy 100
1185    %s =w copy 0
1186@loop
1187    %s =w add %s, %x
1188    %x =w sub %x, 1
1189    jnz %x, @loop, @end
1190@end
1191    ret %s
1192\end{verbatim}
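
To make explicit what the program above computes, the following Haskell sketch mirrors its loop with two accumulating parameters in place of the reassigned temporaries. The function name \texttt{loop} is hypothetical, and the word temporaries are modelled as \texttt{Int} for simplicity.

\begin{verbatim}
-- Mirrors the loop: add x to s, decrement x, repeat while x is non-zero.
loop :: Int -> Int -> Int
loop x s =
  let s' = s + x
      x' = x - 1
   in if x' /= 0 then loop x' s' else s'

-- loop 100 0 == 5050
\end{verbatim}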
1193
1194Now, if you want to know what phi instructions are and how to use them in QBE,
1195you can read the following.
1196
Phi instructions are specific to SSA form. In SSA form, values can only be
assigned once; without phi instructions, this requirement is too strong to
represent many programs. For example, consider the following C program.
1200
1201\begin{verbatim}
1202int f(int x) {
1203    int y;
1204    if (x)
1205        y = 1;
1206    else
1207        y = 2;
1208    return y;
1209}
1210\end{verbatim}
1211
The variable \texttt{y} is assigned twice; the solution to translate it to SSA
form is to insert a phi instruction.
1214
1215\begin{verbatim}
1216@ifstmt
1217    jnz %x, @ift, @iff
1218@ift
1219    jmp @retstmt
1220@iff
1221    jmp @retstmt
1222@retstmt
1223    %y =w phi @ift 1, @iff 2
1224    ret %y
1225\end{verbatim}
1226
Phi instructions return one of their arguments depending on where the control
came from. In the example, \texttt{\%y} is set to 1 if the
\texttt{@@ift} branch is taken, or it is set to 2 otherwise.
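
This selection can be modelled as a lookup keyed by the predecessor block. The Haskell sketch below uses a map from labels to values; the type \texttt{Phi} and the function \texttt{evalPhi} are hypothetical and use plain strings and integers rather than the types of this parser.

\begin{verbatim}
import qualified Data.Map as Map

-- Hypothetical model: a phi maps each predecessor label to a value.
type Phi = Map.Map String Int

-- Select the value for the block that control came from.
evalPhi :: Phi -> String -> Maybe Int
evalPhi phi from = Map.lookup from phi

-- The phi instruction from the example above.
yPhi :: Phi
yPhi = Map.fromList [("@ift", 1), ("@iff", 2)]

-- evalPhi yPhi "@ift" == Just 1
-- evalPhi yPhi "@iff" == Just 2
\end{verbatim}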
1230
An important remark about phi instructions is that QBE assumes that if a
variable is defined by a phi, it respects all the SSA invariants. So it is
critical not to use phi instructions unless you know exactly what you are
doing.
1235\end{document}