quebex

A software analysis framework built around the QBE intermediate language

git clone https://git.8pit.net/quebex.git

   1% SPDX-FileCopyrightText: 2015-2024 Quentin Carbonneaux <quentin@c9x.me>
   2% SPDX-FileCopyrightText: 2025 Sören Tempel <soeren+git@soeren-tempel.net>
   3%
   4% SPDX-License-Identifier: MIT AND GPL-3.0-only
   5
   6\documentclass{article}
   7%include polycode.fmt
   8
   9%subst blankline = "\\[5mm]"
  10
  11% See https://github.com/kosmikus/lhs2tex/issues/58
  12%format <$> = "\mathbin{\langle\$\rangle}"
  13%format <&> = "\mathbin{\langle\&\rangle}"
  14%format <|> = "\mathbin{\langle\:\vline\:\rangle}"
  15%format <?> = "\mathbin{\langle?\rangle}"
  16%format <*> = "\mathbin{\langle*\rangle}"
  17%format <*  = "\mathbin{\langle*}"
  18%format *>  = "\mathbin{*\rangle}"
  19
  20\long\def\ignore#1{}
  21
  22\usepackage{hyperref}
  23\hypersetup{
  24	colorlinks = true,
  25}
  26
  27\begin{document}
  28
  29\title{QBE Intermediate Language\vspace{-2em}}
  30\date{}
  31\maketitle
  32\frenchspacing
  33
  34\ignore{
  35\begin{code}
  36module Language.QBE.Parser (dataDef, typeDef, funcDef) where
  37
  38import Data.Word (Word64)
  39import Data.Functor ((<&>))
  40import Data.List (singleton)
  41import Data.Map qualified as Map
  42import qualified Language.QBE.Types as Q
  43import Language.QBE.Util (bind, decNumber, float)
  44import Text.ParserCombinators.Parsec
  45  ( Parser,
  46    alphaNum,
  47    anyChar,
  48    between,
  49    char,
  50    choice,
  51    letter,
  52    many,
  53    many1,
  54    manyTill,
  55    newline,
  56    noneOf,
  57    oneOf,
  58    optionMaybe,
  59    sepBy,
  60    sepBy1,
  61    skipMany,
  62    skipMany1,
  63    string,
  64    try,
  65    (<?>),
  66    (<|>),
  67  )
  68\end{code}
  69}
  70
  71This an executable description of the
  72\href{https://c9x.me/compile/doc/il-v1.2.html}{QBE intermediate language},
  73specified through \href{https://hackage.haskell.org/package/parsec}{Parsec}
  74parser combinators and generated from a literate Haskell file. The description
  75is derived from the original QBE IL documentation, licensed under MIT.
  76Presently, this implementation targets version 1.2 of the QBE intermediate
  77language and aims to be equivalent with the original specification.
  78
  79\section{Basic Concepts}
  80
  81The intermediate language (IL) is a higher-level language than the
  82machine's assembly language. It smoothes most of the
  83irregularities of the underlying hardware and allows an infinite number
  84of temporaries to be used. This higher abstraction level lets frontend
  85programmers focus on language design issues.
  86
  87\subsection{Input Files}
  88
  89The intermediate language is provided to QBE as text. Usually, one file
  90is generated per each compilation unit from the frontend input language.
  91An IL file is a sequence of \nameref{sec:definitions} for
  92data, functions, and types. Once processed by QBE, the resulting file
  93can be assembled and linked using a standard toolchain (e.g., GNU
  94binutils).
  95
  96\begin{code}
  97comment :: Parser ()
  98comment = skipMany blankNL >> comment' >> skipMany blankNL
  99  where
 100    comment' = char '#' >> manyTill anyChar newline
 101\end{code}
 102
 103\ignore{
 104\begin{code}
 105skipNoCode :: Parser () -> Parser ()
 106skipNoCode blankP = try (skipMany1 comment <?> "comments") <|> blankP
 107\end{code}
 108}
 109
 110Here is a complete "Hello World" IL file which defines a function that
 111prints to the screen. Since the string is not a first class object (only
 112the pointer is) it is defined outside the function\textquotesingle s
 113body. Comments start with a \# character and finish with the end of the
 114line.
 115
 116\begin{verbatim}
 117data $str = { b "hello world", b 0 }
 118
 119export function w $main() {
 120@start
 121        # Call the puts function with $str as argument.
 122        %r =w call $puts(l $str)
 123        ret 0
 124}
 125\end{verbatim}
 126
 127If you have read the LLVM language reference, you might recognize the
 128example above. In comparison, QBE makes a much lighter use of types and
 129the syntax is terser.
 130
 131\subsection{Parser Combinators}
 132
 133\ignore{
 134\begin{code}
 135bracesNL :: Parser a -> Parser a
 136bracesNL = between (wsNL $ char '{') (wsNL $ char '}')
 137
 138quoted :: Parser a -> Parser a
 139quoted = let q = char '"' in between q q
 140
 141parenLst :: Parser a -> Parser [a]
 142parenLst p = between (ws $ char '(') (char ')') inner
 143  where
 144    inner = sepBy (ws p) (ws $ char ',')
 145
 146unaryInstr :: (Q.Value -> Q.Instr) -> String -> Parser Q.Instr
 147unaryInstr conc keyword = do
 148  _ <- ws (string keyword)
 149  conc <$> ws val
 150
 151binaryInstr :: (Q.Value -> Q.Value -> Q.Instr) -> String -> Parser Q.Instr
 152binaryInstr conc keyword = do
 153  _ <- ws (string keyword)
 154  vfst <- ws val <* ws (char ',')
 155  conc vfst <$> ws val
 156
 157-- Can only appear in data and type definitions and hence allows newlines.
 158alignAny :: Parser Word64
 159alignAny = (ws1 (string "align")) >> wsNL decNumber
 160\end{code}
 161}
 162
 163The original QBE specification defines the syntax using a BNF grammar. In
 164contrast, this document defines it using Parsec parser combinators. As such,
 165this specification is less formal but more accurate as the parsing code is
 166actually executable. Consequently, this specification also captures constructs
 167omitted in the original specification (e.g., \nameref{sec:identifiers},
 168\nameref{sec:statements}, or \nameref{sec:strlit}). Nonetheless, the formal
 169language recognized by these combinators aims to be equivalent to the one of
 170the BNF grammar.
 171
 172\subsection{Identifiers}
 173\label{sec:identifiers}
 174
 175% Ident is not documented in the original QBE specification.
 176% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n304
 177
 178\begin{code}
 179ident :: Parser String
 180ident = do
 181  start <- letter <|> oneOf "._"
 182  rest <- many (alphaNum <|> oneOf "$._")
 183  return $ start : rest
 184\end{code}
 185
 186Identifiers for data, types, and functions can start with any ASCII letter or
 187the special characters \texttt{.} and \texttt{\_}. This initial character can
 188be followed by a sequence of zero or more alphanumeric characters and the
 189special characters \texttt{\$}, \texttt{.}, and \texttt{\_}.
 190
 191\subsection{Sigils}
 192
 193\begin{code}
 194userDef :: Parser Q.UserIdent
 195userDef = Q.UserIdent <$> (char ':' >> ident)
 196
 197global :: Parser Q.GlobalIdent
 198global = Q.GlobalIdent <$> (char '$' >> ident)
 199
 200local :: Parser Q.LocalIdent
 201local = Q.LocalIdent <$> (char '%' >> ident)
 202
 203label :: Parser Q.BlockIdent
 204label = Q.BlockIdent <$> (char '@' >> ident)
 205\end{code}
 206
 207The intermediate language makes heavy use of sigils, all user-defined
 208names are prefixed with a sigil. This is to avoid keyword conflicts, and
 209also to quickly spot the scope and nature of identifiers.
 210
 211\begin{itemize}
 212  \item \texttt{:} is for user-defined \nameref{sec:aggregate-types}
 213  \item \texttt{\$} is for globals (represented by a pointer)
 214  \item \texttt{\%} is for function-scope temporaries
 215  \item \texttt{@@} is for block labels
 216\end{itemize}
 217
 218\subsection{Spacing}
 219
 220\begin{code}
 221blank :: Parser Char
 222blank = oneOf "\t " <?> "blank"
 223
 224blankNL :: Parser Char
 225blankNL = oneOf "\n\t " <?> "blank or newline"
 226\end{code}
 227
 228Individual tokens in IL files must be separated by one or more spacing
 229characters. Both spaces and tabs are recognized as spacing characters.
 230In data and type definitions, newlines may also be used as spaces to
 231prevent overly long lines. When exactly one of two consecutive tokens is
 232a symbol (for example \texttt{,} or \texttt{=} or \texttt{\{}), spacing may be omitted.
 233
 234\ignore{
 235\begin{code}
 236ws :: Parser a -> Parser a
 237ws p = p <* skipMany blank
 238
 239ws1 :: Parser a -> Parser a
 240ws1 p = p <* skipMany1 blank
 241
 242wsNL :: Parser a -> Parser a
 243wsNL p = p <* skipNoCode (skipMany blankNL)
 244
 245wsNL1 :: Parser a -> Parser a
 246wsNL1 p = p <* skipNoCode (skipMany1 blankNL)
 247\end{code}
 248}
 249
 250\subsection{String Literals}
 251\label{sec:strlit}
 252
 253% The string literal is not documented in the original QBE specification.
 254% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n287
 255
 256\begin{code}
 257strLit :: Parser String
 258strLit = concat <$> quoted (many strChr)
 259  where
 260    strChr :: Parser [Char]
 261    strChr = (singleton <$> noneOf "\"\\") <|> escSeq
 262
 263    escSeq :: Parser [Char]
 264    escSeq = try $ do
 265      esc <- char '\\'
 266      chr <- anyChar
 267      return $ [esc, chr]
 268\end{code}
 269
 270Strings are enclosed by double quotes and are, for example, used to specify a
 271section name as part of the \nameref{sec:linkage} information. Within a string,
 272a double quote can be escaped using a \texttt{\textbackslash} character. All
 273escape sequences, including double quote escaping, are passed through as-is to
 274the generated assembly file.
 275
 276\section{Types}
 277
 278\subsection{Simple Types}
 279
 280The IL makes minimal use of types. By design, the types used are
 281restricted to what is necessary for unambiguous compilation to machine
 282code and C interfacing. Unlike LLVM, QBE is not using types as a means
 283to safety; they are only here for semantic purposes.
 284
 285\begin{code}
 286baseType :: Parser Q.BaseType
 287baseType = choice
 288  [ bind "w" Q.Word
 289  , bind "l" Q.Long
 290  , bind "s" Q.Single
 291  , bind "d" Q.Double ]
 292\end{code}
 293
 294The four base types are \texttt{w} (word), \texttt{l} (long), \texttt{s} (single), and \texttt{d}
 295(double), they stand respectively for 32-bit and 64-bit integers, and
 29632-bit and 64-bit floating-point numbers. There are no pointer types
 297available; pointers are typed by an integer type sufficiently wide to
 298represent all memory addresses (e.g., \texttt{l} on 64-bit architectures).
 299Temporaries in the IL can only have a base type.
 300
 301\begin{code}
 302extType :: Parser Q.ExtType
 303extType = (Q.Base <$> baseType)
 304       <|> bind "b" Q.Byte
 305       <|> bind "h" Q.HalfWord
 306\end{code}
 307
 308Extended types contain base types plus \texttt{b} (byte) and \texttt{h} (half word),
 309respectively for 8-bit and 16-bit integers. They are used in \nameref{sec:aggregate-types}
 310and \nameref{sec:data} definitions.
 311
 312For C interfacing, the IL also provides user-defined aggregate types as
 313well as signed and unsigned variants of the sub-word extended types.
 314Read more about these types in the \nameref{sec:aggregate-types}
 315and \nameref{sec:functions} sections.
 316
 317\subsection{Subtyping}
 318
 319The IL has a minimal subtyping feature, for integer types only. Any
 320value of type \texttt{l} can be used in a \texttt{w} context. In that case, only the
 32132 least significant bits of the word value are used.
 322
 323Make note that it is the opposite of the usual subtyping on integers (in
 324C, we can safely use an \texttt{int} where a \texttt{long} is expected). A long value
 325cannot be used in word context. The rationale is that a word can be
 326signed or unsigned, so extending it to a long could be done in two ways,
 327either by zero-extension, or by sign-extension.
 328
 329\subsection{Constants and Vals}
 330\label{sec:constants-and-vals}
 331
 332\begin{code}
 333dynConst :: Parser Q.DynConst
 334dynConst =
 335  (Q.Const <$> constant)
 336    <|> (Q.Thread <$> global)
 337    <?> "dynconst"
 338\end{code}
 339
 340Constants come in two kinds: compile-time constants and dynamic
 341constants. Dynamic constants include compile-time constants and other
 342symbol variants that are only known at program-load time or execution
 343time. Consequently, dynamic constants can only occur in function bodies.
 344
 345The representation of integers is two's complement.
 346Floating-point numbers are represented using the single-precision and
 347double-precision formats of the IEEE 754 standard.
 348
 349\begin{code}
 350constant :: Parser Q.Const
 351constant =
 352  (Q.Number <$> decNumber)
 353    <|> (Q.SFP <$> sfp)
 354    <|> (Q.DFP <$> dfp)
 355    <|> (Q.Global <$> global)
 356    <?> "const"
 357  where
 358    sfp = string "s_" >> float
 359    dfp = string "d_" >> float
 360\end{code}
 361
 362Constants specify a sequence of bits and are untyped. They are always
 363parsed as 64-bit blobs. Depending on the context surrounding a constant,
 364only some of its bits are used. For example, in the program below, the
 365two variables defined have the same value since the first operand of the
 366subtraction is a word (32-bit) context.
 367
 368\begin{verbatim}
 369%x =w sub -1, 0 %y =w sub 4294967295, 0
 370\end{verbatim}
 371
 372Because specifying floating-point constants by their bits makes the code
 373less readable, syntactic sugar is provided to express them. Standard
 374scientific notation is prefixed with \texttt{s\_} and \texttt{d\_} for single and
 375double precision numbers respectively. Once again, the following example
 376defines twice the same double-precision constant.
 377
 378\begin{verbatim}
 379%x =d add d_0, d_-1
 380%y =d add d_0, -4616189618054758400
 381\end{verbatim}
 382
 383Global symbols can also be used directly as constants; they will be
 384resolved and turned into actual numeric constants by the linker.
 385
 386When the \texttt{thread} keyword prefixes a symbol name, the
 387symbol\textquotesingle s numeric value is resolved at runtime in the
 388thread-local storage.
 389
 390\begin{code}
 391val :: Parser Q.Value
 392val =
 393  (Q.VConst <$> dynConst)
 394    <|> (Q.VLocal <$> local)
 395    <?> "val"
 396\end{code}
 397
 398Vals are used as arguments in regular, phi, and jump instructions within
 399function definitions. They are either constants or function-scope
 400temporaries.
 401
 402\subsection{Linkage}
 403\label{sec:linkage}
 404
 405\begin{code}
 406linkage :: Parser Q.Linkage
 407linkage =
 408  wsNL (bind "export" Q.LExport)
 409    <|> wsNL (bind "thread" Q.LThread)
 410    <|> do
 411      _ <- ws1 $ string "section"
 412      (try secWithFlags) <|> sec
 413  where
 414    sec :: Parser Q.Linkage
 415    sec = wsNL strLit <&> (`Q.LSection` Nothing)
 416
 417    secWithFlags :: Parser Q.Linkage
 418    secWithFlags = do
 419      n <- ws1 strLit
 420      wsNL strLit <&> Q.LSection n . Just
 421\end{code}
 422
 423Function and data definitions (see below) can specify linkage
 424information to be passed to the assembler and eventually to the linker.
 425
 426The \texttt{export} linkage flag marks the defined item as visible outside the
 427current file\textquotesingle s scope. If absent, the symbol can only be
 428referred to locally. Functions compiled by QBE and called from C need to
 429be exported.
 430
 431The \texttt{thread} linkage flag can only qualify data definitions. It mandates
 432that the object defined is stored in thread-local storage. Each time a
 433runtime thread starts, the supporting platform runtime is in charge of
 434making a new copy of the object for the fresh thread. Objects in
 435thread-local storage must be accessed using the \texttt{thread \$IDENT} syntax,
 436as specified in the \nameref{sec:constants-and-vals} section.
 437
 438A \texttt{section} flag can be specified to tell the linker to put the defined
 439item in a certain section. The use of the section flag is platform
 440dependent and we refer the user to the documentation of their assembler
 441and linker for relevant information.
 442
 443\begin{verbatim}
 444section ".init_array" data $.init.f = { l $f }
 445\end{verbatim}
 446
 447The section flag can be used to add function pointers to a global
 448initialization list, as depicted above. Note that some platforms provide
 449a BSS section that can be used to minimize the footprint of uniformly
 450zeroed data. When this section is available, QBE will automatically make
 451use of it and no section flag is required.
 452
 453The section and export linkage flags should each appear at most once in
 454a definition. If multiple occurrences are present, QBE is free to use
 455any.
 456
 457\subsection{Definitions}
 458\label{sec:definitions}
 459
 460Definitions are the essential components of an IL file. They can define
 461three types of objects: aggregate types, data, and functions. Aggregate
 462types are never exported and do not compile to any code. Data and
 463function definitions have file scope and are mutually recursive (even
 464across IL files). Their visibility can be controlled using linkage
 465flags.
 466
 467\subsubsection{Aggregate Types}
 468\label{sec:aggregate-types}
 469
 470\begin{code}
 471typeDef :: Parser Q.TypeDef
 472typeDef = do
 473  _ <- wsNL1 (string "type")
 474  i <- wsNL1 userDef
 475  _ <- wsNL1 (char '=')
 476  a <- optionMaybe alignAny
 477  bracesNL (opaqueType <|> unionType <|> regularType) <&> Q.TypeDef i a
 478\end{code}
 479
 480Aggregate type definitions start with the \texttt{type} keyword. They have file
 481scope, but types must be defined before being referenced. The inner
 482structure of a type is expressed by a comma-separated list of fields.
 483
 484\begin{code}
 485subType :: Parser Q.SubType
 486subType =
 487  (Q.SExtType <$> extType)
 488    <|> (Q.SUserDef <$> userDef)
 489
 490field :: Parser Q.Field
 491field = do
 492  -- TODO: newline is required if there is a number argument
 493  f <- wsNL subType
 494  s <- ws $ optionMaybe decNumber
 495  pure (f, s)
 496
 497-- TODO: "For ease of IL generation, a trailing comma is tolerated by the parser".
 498fields :: Bool -> Parser [Q.Field]
 499fields allowEmpty =
 500  (if allowEmpty then sepBy else sepBy1) field (wsNL $ char ',')
 501\end{code}
 502
 503A field consists of a subtype, either an extended type or a user-defined type,
 504and an optional number expressing the value of this field. In case many items
 505of the same type are sequenced (like in a C array), the shorter array syntax
 506can be used.
 507
 508\begin{code}
 509regularType :: Parser Q.AggType
 510regularType = Q.ARegular <$> fields True
 511\end{code}
 512
 513Three different kinds of aggregate types are presentl ysupported: regular
 514types, union types and opaque types. The fields of regular types will be
 515packed. By default, the alignment of an aggregate type is the maximum alignment
 516of its members. The alignment can be explicitly specified by the programmer.
 517
 518\begin{code}
 519unionType :: Parser Q.AggType
 520unionType = Q.AUnion <$> many1 (wsNL unionType')
 521  where
 522    unionType' :: Parser [Q.Field]
 523    unionType' = bracesNL $ fields False
 524\end{code}
 525
 526Union types allow the same chunk of memory to be used with different layouts. They are defined by enclosing multiple regular aggregate type bodies in a pair of curly braces. Size and alignment of union types are set to the maximum size and alignment of each variation or, in the case of alignment, can be explicitly specified.
 527
 528\begin{code}
 529opaqueType :: Parser Q.AggType
 530opaqueType = Q.AOpaque <$> wsNL decNumber
 531\end{code}
 532
 533Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined simply by enclosing their size between curly braces.
 534
 535\subsubsection{Data}
 536\label{sec:data}
 537
 538\begin{code}
 539dataDef :: Parser Q.DataDef
 540dataDef = do
 541  link <- many linkage
 542  name <- wsNL1 (string "data") >> wsNL global
 543  _ <- wsNL (char '=')
 544  alignment <- optionMaybe alignAny
 545  bracesNL dataObjs <&> Q.DataDef link name alignment
 546 where
 547    dataObjs = sepBy dataObj (wsNL $ char ',')
 548\end{code}
 549
 550Data definitions express objects that will be emitted in the compiled
 551file. Their visibility and location in the compiled artifact are
 552controlled with linkage flags described in the \nameref{sec:linkage}
 553section.
 554
 555They define a global identifier (starting with the sigil \texttt{\$}), that
 556will contain a pointer to the object specified by the definition.
 557
 558\begin{code}
 559dataObj :: Parser Q.DataObj
 560dataObj =
 561  (Q.OZeroFill <$> (wsNL1 (char 'z') >> wsNL decNumber))
 562    <|> do
 563      t <- wsNL1 extType
 564      i <- many1 (wsNL dataItem)
 565      return $ Q.OItem t i
 566\end{code}
 567
 568Objects are described by a sequence of fields that start with a type
 569letter. This letter can either be an extended type, or the \texttt{z} letter.
 570If the letter used is an extended type, the data item following
 571specifies the bits to be stored in the field.
 572
 573\begin{code}
 574dataItem :: Parser Q.DataItem
 575dataItem =
 576  (Q.DString <$> strLit)
 577    <|> (Q.DConst <$> constant)
 578    <|> do
 579      i <- global
 580      off <- optionMaybe (char '+' >> decNumber)
 581      return $ Q.DSymbol i off
 582\end{code}
 583
 584Within each object, several items can be defined. When several data items
 585follow a letter, they initialize multiple fields of the same size.
 586
 587\begin{code}
 588allocSize :: Parser Q.AllocSize
 589allocSize =
 590  choice
 591    [ bind "4" Q.AlignWord,
 592      bind "8" Q.AlignLong,
 593      bind "16" Q.AlignLongLong
 594    ]
 595\end{code}
 596
 597The members of a struct will be packed. This means that padding has to
 598be emitted by the frontend when necessary. Alignment of the whole data
 599objects can be manually specified, and when no alignment is provided,
 600the maximum alignment from the platform is used.
 601
 602When the \texttt{z} letter is used the number following indicates the size of
 603the field; the contents of the field are zero initialized. It can be
 604used to add padding between fields or zero-initialize big arrays.
 605
 606\subsubsection{Functions}
 607\label{sec:functions}
 608
 609\begin{code}
 610funcDef :: Parser Q.FuncDef
 611funcDef = do
 612  link <- many linkage
 613  _ <- ws1 (string "function")
 614  retTy <- optionMaybe (ws1 abity)
 615  name <- ws global
 616  args <- wsNL params
 617  body <- between (wsNL1 $ char '{') (wsNL $ char '}') $ many1 block
 618
 619  case (Q.insertJumps body) of
 620    Nothing -> fail $ "invalid fallthrough in " ++ show name
 621    Just bl -> return $ Q.FuncDef link name retTy args bl
 622\end{code}
 623
 624Function definitions contain the actual code to emit in the compiled
 625file. They define a global symbol that contains a pointer to the
 626function code. This pointer can be used in \texttt{call} instructions or stored
 627in memory.
 628
 629\begin{code}
 630subWordType :: Parser Q.SubWordType
 631subWordType = choice
 632  [ try $ bind "sb" Q.SignedByte
 633  , try $ bind "ub" Q.UnsignedByte
 634  , bind "sh" Q.SignedHalf
 635  , bind "uh" Q.UnsignedHalf ]
 636
 637abity :: Parser Q.Abity
 638abity = try (Q.ASubWordType <$> subWordType)
 639    <|> (Q.ABase <$> baseType)
 640    <|> (Q.AUserDef <$> userDef)
 641\end{code}
 642
 643The type given right before the function name is the return type of the
 644function. All return values of this function must have this return type.
 645If the return type is missing, the function must not return any value.
 646
 647\begin{code}
 648param :: Parser Q.FuncParam
 649param = (Q.Env <$> (ws1 (string "env") >> local))
 650    <|> (string "..." >> pure Q.Variadic)
 651    <|> do
 652          ty <- ws1 abity
 653          Q.Regular ty <$> local
 654
 655params :: Parser [Q.FuncParam]
 656params = parenLst param
 657\end{code}
 658
 659The parameter list is a comma separated list of temporary names prefixed
 660by types. The types are used to correctly implement C compatibility.
 661When an argument has an aggregate type, a pointer to the aggregate is
 662passed by thea caller. In the example below, we have to use a load
 663instruction to get the value of the first (and only) member of the
 664struct.
 665
 666\begin{verbatim}
 667type :one = { w }
 668
 669function w $getone(:one %p) {
 670@start
 671        %val =w loadw %p
 672        ret %val
 673}
 674\end{verbatim}
 675
 676If a function accepts or returns values that are smaller than a word,
 677such as \texttt{signed char} or \texttt{unsigned short} in C, one of the sub-word type
 678must be used. The sub-word types \texttt{sb}, \texttt{ub}, \texttt{sh}, and \texttt{uh} stand,
 679respectively, for signed and unsigned 8-bit values, and signed and
 680unsigned 16-bit values. Parameters associated with a sub-word type of
 681bit width N only have their N least significant bits set and have base
 682type \texttt{w}. For example, the function
 683
 684\begin{verbatim}
 685function w $addbyte(w %a, sb %b) {
 686@start
 687        %bw =w extsb %b
 688        %val =w add %a, %bw
 689        ret %val
 690}
 691\end{verbatim}
 692
 693needs to sign-extend its second argument before the addition. Dually,
 694return values with sub-word types do not need to be sign or zero
 695extended.
 696
 697If the parameter list ends with \texttt{...}, the function is a variadic
 698function: it can accept a variable number of arguments. To access the
 699extra arguments provided by the caller, use the \texttt{vastart} and \texttt{vaarg}
 700instructions described in the \nameref{sec:variadic} section.
 701
 702Optionally, the parameter list can start with an environment parameter
 703\texttt{env \%e}. This special parameter is a 64-bit integer temporary (i.e.,
 704of type \texttt{l}). If the function does not use its environment parameter,
 705callers can safely omit it. This parameter is invisible to a C caller:
 706for example, the function
 707
 708\begin{verbatim}
 709export function w $add(env %e, w %a, w %b) {
 710@start
 711        %c =w add %a, %b
 712        ret %c
 713}
 714\end{verbatim}
 715
 716must be given the C prototype \texttt{int add(int, int)}. The intended use of
 717this feature is to pass the environment pointer of closures while
 718retaining a very good compatibility with C. The \nameref{sec:call}
 719section explains how to pass an environment parameter.
 720
 721Since global symbols are defined mutually recursive, there is no need
 722for function declarations: a function can be referenced before its
 723definition. Similarly, functions from other modules can be used without
 724previous declaration. All the type information necessary to compile a
 725call is in the instruction itself.
 726
 727The syntax and semantics for the body of functions are described in the
 728\nameref{sec:control} section.
 729
 730\section{Control}
 731\label{sec:control}
 732
 733The IL represents programs as textual transcriptions of control flow
 734graphs. The control flow is serialized as a sequence of blocks of
 735straight-line code which are connected using jump instructions.
 736
 737\subsection{Blocks}
 738\label{sec:blocks}
 739
 740\begin{code}
 741block :: Parser Q.Block'
 742block = do
 743  l <- wsNL1 label
 744  p <- many (wsNL1 $ try phiInstr)
 745  s <- many (wsNL1 statement)
 746  Q.Block' l p s <$> (optionMaybe $ wsNL1 jumpInstr)
 747\end{code}
 748
 749All blocks have a name that is specified by a label at their beginning.
 750Then follows a sequence of instructions that have "fall-through" flow.
 751Finally one jump terminates the block. The jump can either transfer
 752control to another block of the same function or return; jumps are
 753described further below.
 754
 755The first block in a function must not be the target of any jump in the
 756program. If a jump to the function start is needed, the frontend must
 757insert an empty prelude block at the beginning of the function.
 758
 759When one block jumps to the next block in the IL file, it is not
 760necessary to write the jump instruction, it will be automatically added
 761by the parser. For example the start block in the example below jumps
 762directly to the loop block.
 763
 764\subsection{Jumps}
 765
 766\begin{code}
 767jumpInstr :: Parser Q.JumpInstr
 768jumpInstr = (string "hlt" >> pure Q.Halt)
 769        -- TODO: Return requires a space if there is an optionMaybe
 770        <|> Q.Return <$> ((ws $ string "ret") >> optionMaybe val)
 771        <|> try (Q.Jump <$> ((ws1 $ string "jmp") >> label))
 772        <|> do
 773          _ <- ws1 $ string "jnz"
 774          v <- ws val <* ws (char ',')
 775          l1 <- ws label <* ws (char ',')
 776          l2 <- ws label
 777          return $ Q.Jnz v l1 l2
 778\end{code}
 779
 780A jump instruction ends every block and transfers the control to another
 781program location. The target of a jump must never be the first block in
 782a function. The three kinds of jumps available are described in the
 783following list.
 784
 785\begin{enumerate}
 786  \item \textbf{Unconditional jump.} Jumps to another block of the same function.
 787  \item \textbf{Conditional jump.} When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type; because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0.
 788  \item \textbf{Function return.} Terminates the execution of the current function, optionally returning a value to the caller. The value returned must be of the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used.
 789  \item \textbf{Program termination.} Terminates the execution of the program with a target-dependent error. This instruction can be used when it is expected that the execution never reaches the end of the block it closes; for example, after having called a function such as \texttt{exit()}.
 790\end{enumerate}
 791
 792\section{Instructions}
 793\label{sec:instructions}
 794
 795\begin{code}
 796instr :: Parser Q.Instr
 797instr =
 798  choice
 799    [ try $ binaryInstr Q.Add "add",
 800      try $ binaryInstr Q.Sub "sub",
 801      try $ binaryInstr Q.Mul "mul",
 802      try $ binaryInstr Q.Div "div",
 803      try $ binaryInstr Q.URem "urem",
 804      try $ binaryInstr Q.Rem "rem",
 805      try $ binaryInstr Q.UDiv "udiv",
 806      try $ binaryInstr Q.Or "or",
 807      try $ binaryInstr Q.Xor "xor",
 808      try $ binaryInstr Q.And "and",
 809      try $ binaryInstr Q.Sar "sar",
 810      try $ binaryInstr Q.Shr "shr",
 811      try $ binaryInstr Q.Shl "shl",
 812      try $ unaryInstr Q.Neg "neg",
 813      try $ loadInstr,
 814      try $ allocInstr,
 815      try $ compareInstr,
 816      try $ extInstr
 817    ]
 818\end{code}
 819
 820Instructions are the smallest piece of code in the IL, they form the body of
 821\nameref{sec:blocks}. This specification distinguishes instructions and
 822volatile instructions, the latter do not return a value. For the former, the IL
 823uses a three-address code, which means that one instruction computes an
 824operation between two operands and assigns the result to a third one.
 825
 826\begin{code}
 827assign :: Parser Q.Statement
 828assign = do
 829  n <- ws local
 830  t <- ws (char '=') >> ws1 baseType
 831  Q.Assign n t <$> instr
 832
 833volatileInstr :: Parser Q.Statement
 834volatileInstr = Q.Volatile <$> (storeInstr <|> blitInstr)
 835
 836-- TODO: Not documented in the QBE BNF.
 837statement :: Parser Q.Statement
 838statement = (try callInstr) <|> assign <|> volatileInstr
 839\end{code}
 840
 841An instruction has both a name and a return type, this return type is a base
 842type that defines the size of the instruction's result. The type of the
 843arguments can be unambiguously inferred using the instruction name and the
 844return type. For example, for all arithmetic instructions, the type of the
 845arguments is the same as the return type. The two additions below are valid if
 846\texttt{\%y} is a word or a long (because of \nameref{sec:subtyping}).
 847
 848\begin{verbatim}
 849%x =w add 0, %y
 850%z =w add %x, %x
 851\end{verbatim}
 852
 853Some instructions, like comparisons and memory loads have operand types
 854that differ from their return types. For instance, two floating points
 855can be compared to give a word result (0 if the comparison succeeds, 1
 856if it fails).
 857
 858\begin{verbatim}
 859%c =w cgts %a, %b
 860\end{verbatim}
 861
 862In the example above, both operands have to have single type. This is
 863made explicit by the instruction suffix.
 864
 865\subsection{Arithmetic and Bits}
 866
 867To-Do.
 868
 869\subsection{Memory}
 870\label{sec:memory}
 871
 872To-Do.
 873
 874\subsubsection{Store instructions}
 875
 876\begin{code}
 877storeInstr :: Parser Q.VolatileInstr
 878storeInstr = do
 879  t <- string "store" >> ws1 extType
 880  v <- ws val
 881  _ <- ws $ char ','
 882  ws val <&> Q.Store t v
 883\end{code}
 884
 885Store instructions exist to store a value of any base type and any extended
 886type. Since halfwords and bytes are not first class in the IL, \texttt{storeh}
 887and \texttt{storeb} take a word as argument. Only the first 16 or 8 bits of
 888this word will be stored in memory at the address specified in the second
 889argument.
 890
 891\subsubsection{Load instructions}
 892
 893\begin{code}
 894loadInstr :: Parser Q.Instr
 895loadInstr = do
 896  _ <- string "load"
 897  t <- ws1 $ choice
 898    [ try $ bind "sw" (Q.LBase Q.Word),
 899      try $ bind "uw" (Q.LBase Q.Word),
 900      try $ Q.LSubWord <$> subWordType,
 901      Q.LBase <$> baseType
 902    ]
 903  ws val <&> Q.Load t
 904\end{code}
 905
 906For types smaller than long, two variants of the load instruction are
 907available: one will sign extend the loaded value, while the other will zero
 908extend it. Note that all loads smaller than long can load to either a long or a
 909word.
 910
 911The two instructions \texttt{loadsw} and \texttt{loaduw} have the same effect
 912when they are used to define a word temporary. A \texttt{loadw} instruction is
 913provided as syntactic sugar for \texttt{loadsw} to make explicit that the
 914extension mechanism used is irrelevant.
 915
 916\subsubsection{Blits}
 917
 918\begin{code}
 919blitInstr :: Parser Q.VolatileInstr
 920blitInstr = do
 921  v1 <- (ws1 $ string "blit") >> ws val <* (ws $ char ',')
 922  v2 <- ws val <* (ws $ char ',')
 923  nb <- decNumber
 924  return $ Q.Blit v1 v2 nb
 925\end{code}
 926
 927The blit instruction copies in-memory data from its first address argument to
 928its second address argument. The third argument is the number of bytes to copy.
 929The source and destination spans are required to be either non-overlapping, or
 930fully overlapping (source address identical to the destination address). The
 931byte count argument must be a nonnegative numeric constant; it cannot be a
 932temporary.
 933
 934One blit instruction may generate a number of instructions proportional to its
 935byte count argument, consequently, it is recommended to keep this argument
 936relatively small. If large copies are necessary, it is preferable that
 937frontends generate calls to a supporting \texttt{memcpy} function.
 938
 939\subsubsection{Stack Allocation}
 940
 941\begin{code}
 942allocInstr :: Parser Q.Instr
 943allocInstr = do
 944  siz <- (ws $ string "alloc") >> (ws1 allocSize)
 945  val <&> Q.Alloc siz
 946\end{code}
 947
 948These instructions allocate a chunk of memory on the stack. The number ending
 949the instruction name is the alignment required for the allocated slot. QBE will
 950make sure that the returned address is a multiple of that alignment value.
 951
 952Stack allocation instructions are used, for example, when compiling the C local
 953variables, because their address can be taken. When compiling Fortran,
 954temporaries can be used directly instead, because it is illegal to take the
 955address of a variable.
 956
 957\subsection{Comparisons}
 958
 959Comparison instructions return an integer value (either a word or a long), and
 960compare values of arbitrary types. The returned value is 1 if the two operands
 961satisfy the comparison relation, or 0 otherwise. The names of comparisons
 962respect a standard naming scheme in three parts.
 963
 964\begin{code}
 965compareInstr :: Parser Q.Instr
 966compareInstr = do
 967  _ <- char 'c'
 968  op <- compareOp
 969  ty <- ws1 baseType
 970  lhs <- ws val <* ws (char ',')
 971  rhs <- ws val
 972  pure $ Q.Compare ty op lhs rhs
 973\end{code}
 974
 975\begin{code}
 976compareOp :: Parser Q.CmpOp
 977compareOp = choice
 978  [ bind "eq" Q.CEq
 979  , bind "ne" Q.CNe
 980  , try $ bind "sle" Q.CSle
 981  , try $ bind "slt" Q.CSlt
 982  , try $ bind "sge" Q.CSge
 983  , try $ bind "sgt" Q.CSgt
 984  , try $ bind "ule" Q.CUle
 985  , try $ bind "ult" Q.CUlt
 986  , try $ bind "uge" Q.CUge
 987  , try $ bind "ugt" Q.CUgt ]
 988\end{code}
 989
 990For example, \texttt{cod} (\texttt{I(dd,dd)}) compares two double-precision floating point numbers and returns 1 if the two floating points are not NaNs, or 0 otherwise. The \texttt{csltw} (\texttt{I(ww,ww)}) instruction compares two words representing signed numbers and returns 1 when the first argument is smaller than the second one.
 991
 992\subsection{Conversions}
 993
 994\begin{code}
 995subLongType :: Parser Q.SubLongType
 996subLongType = try (Q.SLSubWord <$> subWordType)
 997  <|> bind "sw" Q.SLSignedWord
 998  <|> bind "uw" Q.SLUnsignedWord
 999
1000extInstr :: Parser Q.Instr
1001extInstr = do
1002  _ <- string "ext"
1003  ty <- ws1 subLongType
1004  ws val <&> Q.Ext ty
1005\end{code}
1006
1007Conversion operations change the representation of a value, possibly modifying
1008it if the target type cannot hold the value of the source type. Conversions can
1009extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or
1010convert a floating point into an integer and vice versa.
1011
1012\subsection{Cast and Copy}
1013
1014To-Do.
1015
1016\subsection{Call}
1017\label{sec:call}
1018
1019\begin{code}
1020-- TODO: Code duplication with 'param'.
1021callArg :: Parser Q.FuncArg
1022callArg = (Q.ArgEnv <$> (ws1 (string "env") >> val))
1023    <|> (string "..." >> pure Q.ArgVar)
1024    <|> do
1025          ty <- ws1 abity
1026          Q.ArgReg ty <$> val
1027
1028callArgs :: Parser [Q.FuncArg]
1029callArgs = parenLst callArg
1030
1031callInstr :: Parser Q.Statement
1032callInstr = do
1033  retValue <- optionMaybe $ do
1034    i <- ws local <* ws (char '=')
1035    a <- ws1 abity
1036    return (i, a)
1037  toCall <- ws1 (string "call") >> ws val
1038  fnArgs <- callArgs
1039  return $ Q.Call retValue toCall fnArgs
1040\end{code}
1041
1042The call instruction is special in several ways. It is not a three-address
1043instruction and requires the type of all its arguments to be given. Also, the
1044return type can be either a base type or an aggregate type. These specifics are
1045required to compile calls with C compatibility (i.e., to respect the ABI).
1046
1047When an aggregate type is used as argument type or return type, the value
1048respectively passed or returned needs to be a pointer to a memory location
1049holding the value. This is because aggregate types are not first-class
1050citizens of the IL.
1051
1052Sub-word types are used for arguments and return values of width less than a
1053word. Details on these types are presented in the <@ Functions > section.
1054Arguments with sub-word types need not be sign or zero extended according to
1055their type. Calls with a sub-word return type define a temporary of base type
1056\texttt{w} with its most significant bits unspecified.
1057
1058Unless the called function does not return a value, a return temporary must be
1059specified, even if it is never used afterwards.
1060
1061An environment parameter can be passed as first argument using the \texttt{env}
1062keyword. The passed value must be a 64-bit integer. If the called function does
1063not expect an environment parameter, it will be safely discarded. See the
1064\nameref{sec:functions} section for more information about environment
1065parameters.
1066
1067When the called function is variadic, there must be a \texttt{...} marker
1068separating the named and variadic arguments.
1069
1070\subsection{Variadic}
1071\label{sec:variadic}
1072
1073To-Do.
1074
1075\subsection{Phi}
1076
1077\begin{code}
1078phiBranch :: Parser (Q.BlockIdent, Q.Value)
1079phiBranch = do
1080  n <- ws1 label
1081  v <- val
1082  pure (n, v)
1083
1084phiInstr :: Parser Q.Phi
1085phiInstr = do
1086  -- TODO: code duplication with 'assign'
1087  n <- ws local
1088  t <- ws (char '=') >> ws1 baseType
1089
1090  _ <- ws1 (string "phi")
1091  -- TODO: combinator for sepBy
1092  p <- Map.fromList <$> sepBy1 (ws phiBranch) (ws $ char ',')
1093  return $ Q.Phi n t p
1094\end{code}
1095
1096First and foremost, phi instructions are NOT necessary when writing a frontend
1097to QBE. One solution to avoid having to deal with SSA form is to use stack
1098allocated variables for all source program variables and perform assignments
1099and lookups using \nameref{sec:memory} operations. This is what LLVM users
1100typically do.
1101
1102Another solution is to simply emit code that is not in SSA form! Contrary to
1103LLVM, QBE is able to fixup programs not in SSA form without requiring the
1104boilerplate of loading and storing in memory. For example, the following
1105program will be correctly compiled by QBE.
1106
1107\begin{verbatim}
1108@start
1109    %x =w copy 100
1110    %s =w copy 0
1111@loop
1112    %s =w add %s, %x
1113    %x =w sub %x, 1
1114    jnz %x, @loop, @end
1115@end
1116    ret %s
1117\end{verbatim}
1118
1119Now, if you want to know what phi instructions are and how to use them in QBE,
1120you can read the following.
1121
1122Phi instructions are specific to SSA form. In SSA form values can only be
1123assigned once, without phi instructions, this requirement is too strong to
1124represent many programs. For example consider the following C program.
1125
1126\begin{verbatim}
1127int f(int x) {
1128    int y;
1129    if (x)
1130        y = 1;
1131    else
1132        y = 2;
1133    return y;
1134}
1135\end{verbatim}
1136
1137The variable \texttt{y} is assigned twice, the solution to translate it in SSA
1138form is to insert a phi instruction.
1139
1140\begin{verbatim}
1141@ifstmt
1142    jnz %x, @ift, @iff
1143@ift
1144    jmp @retstmt
1145@iff
1146    jmp @retstmt
1147@retstmt
1148    %y =w phi @ift 1, @iff 2
1149    ret %y
1150\end{verbatim}
1151
1152Phi instructions return one of their arguments depending on where the control
1153came from. In the example, \texttt{%y} is set to 1 if the \texttt{@ift} branch
1154is taken, or it is set to 2 otherwise.
1155
1156An important remark about phi instructions is that QBE assumes that if a
1157variable is defined by a phi it respects all the SSA invariants. So it is
1158critical to not use phi instructions unless you know exactly what you are
1159doing.
1160\end{document}