quebex - Parser.lhs

   1% SPDX-FileCopyrightText: 2015-2024 Quentin Carbonneaux <quentin@c9x.me>
   2% SPDX-FileCopyrightText: 2025 Sören Tempel <soeren+git@soeren-tempel.net>
   3%
   4% SPDX-License-Identifier: MIT AND GPL-3.0-only
   5
   6\documentclass{article}
   7%include polycode.fmt
   8
   9%subst blankline = "\\[5mm]"
  10
  11% See https://github.com/kosmikus/lhs2tex/issues/58
  12%format <$> = "\mathbin{\langle\$\rangle}"
  13%format <&> = "\mathbin{\langle\&\rangle}"
  14%format <|> = "\mathbin{\langle\:\vline\:\rangle}"
  15%format <?> = "\mathbin{\langle?\rangle}"
  16%format <*> = "\mathbin{\langle*\rangle}"
  17%format <*  = "\mathbin{\langle*}"
  18%format *>  = "\mathbin{*\rangle}"
  19
  20\long\def\ignore#1{}
  21
  22\usepackage{hyperref}
  23\hypersetup{
  24	colorlinks = true,
  25}
  26
  27\begin{document}
  28
  29\title{QBE Intermediate Language\vspace{-2em}}
  30\date{}
  31\maketitle
  32\frenchspacing
  33
  34\ignore{
  35\begin{code}
  36module Language.QBE.Parser (dataDef, typeDef, funcDef) where
  37
  38import Data.Functor ((<&>))
  39import Data.List (singleton)
  40import Data.Map qualified as Map
  41import qualified Language.QBE.Types as Q
  42import Language.QBE.Util (bind, decNumber, float)
  43import Text.ParserCombinators.Parsec
  44  ( Parser,
  45    alphaNum,
  46    anyChar,
  47    between,
  48    char,
  49    choice,
  50    letter,
  51    many,
  52    many1,
  53    manyTill,
  54    newline,
  55    noneOf,
  56    oneOf,
  57    optionMaybe,
  58    sepBy,
  59    sepBy1,
  60    skipMany,
  61    skipMany1,
  62    string,
  63    try,
  64    (<?>),
  65    (<|>),
  66  )
  67\end{code}
  68}
  69
  70This an executable description of the
  71\href{https://c9x.me/compile/doc/il-v1.2.html}{QBE intermediate language},
  72specified through \href{https://hackage.haskell.org/package/parsec}{Parsec}
  73parser combinators and generated from a literate Haskell file. The description
  74is derived from the original QBE IL documentation, licensed under MIT.
  75Presently, this implementation targets version 1.2 of the QBE intermediate
  76language and aims to be equivalent with the original specification.
  77
  78\section{Basic Concepts}
  79
  80The intermediate language (IL) is a higher-level language than the
  81machine's assembly language. It smoothes most of the
  82irregularities of the underlying hardware and allows an infinite number
  83of temporaries to be used. This higher abstraction level lets frontend
  84programmers focus on language design issues.
  85
  86\subsection{Input Files}
  87
  88The intermediate language is provided to QBE as text. Usually, one file
  89is generated per each compilation unit from the frontend input language.
  90An IL file is a sequence of \nameref{sec:definitions} for
  91data, functions, and types. Once processed by QBE, the resulting file
  92can be assembled and linked using a standard toolchain (e.g., GNU
  93binutils).
  94
  95\begin{code}
  96comment :: Parser ()
  97comment = skipMany blankNL >> comment' >> skipMany blankNL
  98  where
  99    comment' = char '#' >> manyTill anyChar newline
 100\end{code}
 101
 102\ignore{
 103\begin{code}
 104skipNoCode :: Parser () -> Parser ()
 105skipNoCode blankP = try (skipMany1 comment <?> "comments") <|> blankP
 106\end{code}
 107}
 108
 109Here is a complete "Hello World" IL file which defines a function that
 110prints to the screen. Since the string is not a first class object (only
 111the pointer is) it is defined outside the function\textquotesingle s
 112body. Comments start with a \# character and finish with the end of the
 113line.
 114
 115\begin{verbatim}
 116data $str = { b "hello world", b 0 }
 117
 118export function w $main() {
 119@start
 120        # Call the puts function with $str as argument.
 121        %r =w call $puts(l $str)
 122        ret 0
 123}
 124\end{verbatim}
 125
 126If you have read the LLVM language reference, you might recognize the
 127example above. In comparison, QBE makes a much lighter use of types and
 128the syntax is terser.
 129
 130\subsection{Parser Combinators}
 131
 132\ignore{
 133\begin{code}
 134bracesNL :: Parser a -> Parser a
 135bracesNL = between (wsNL $ char '{') (wsNL $ char '}')
 136
 137quoted :: Parser a -> Parser a
 138quoted = let q = char '"' in between q q
 139
 140parenLst :: Parser a -> Parser [a]
 141parenLst p = between (ws $ char '(') (char ')') inner
 142  where
 143    inner = sepBy (ws p) (ws $ char ',')
 144
 145binaryInstr :: (Q.Value -> Q.Value -> Q.Instr) -> String -> Parser Q.Instr
 146binaryInstr conc keyword = do
 147  _ <- ws (string keyword)
 148  vfst <- ws val <* ws (char ',')
 149  conc vfst <$> ws val
 150
 151-- Can only appear in data and type definitions and hence allows newlines.
 152alignSpec :: Parser Q.AllocSize
 153alignSpec = (ws1 (string "align")) >> wsNL allocSize
 154\end{code}
 155}
 156
 157The original QBE specification defines the syntax using a BNF grammar. In
 158contrast, this document defines it using Parsec parser combinators. As such,
 159this specification is less formal but more accurate as the parsing code is
 160actually executable. Consequently, this specification also captures constructs
 161omitted in the original specification (e.g., \nameref{sec:identifiers},
 162\nameref{sec:statements}, or \nameref{sec:strlit}). Nonetheless, the formal
 163language recognized by these combinators aims to be equivalent to the one of
 164the BNF grammar.
 165
 166\subsection{Identifiers}
 167\label{sec:identifiers}
 168
 169% Ident is not documented in the original QBE specification.
 170% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n304
 171
 172\begin{code}
 173ident :: Parser String
 174ident = do
 175  start <- letter <|> oneOf "._"
 176  rest <- many (alphaNum <|> oneOf "$._")
 177  return $ start : rest
 178\end{code}
 179
 180Identifiers for data, types, and functions can start with any ASCII letter or
 181the special characters \texttt{.} and \texttt{\_}. This initial character can
 182be followed by a sequence of zero or more alphanumeric characters and the
 183special characters \texttt{\$}, \texttt{.}, and \texttt{\_}.
 184
 185\subsection{Sigils}
 186
 187\begin{code}
 188userDef :: Parser Q.UserIdent
 189userDef = Q.UserIdent <$> (char ':' >> ident)
 190
 191global :: Parser Q.GlobalIdent
 192global = Q.GlobalIdent <$> (char '$' >> ident)
 193
 194local :: Parser Q.LocalIdent
 195local = Q.LocalIdent <$> (char '%' >> ident)
 196
 197label :: Parser Q.BlockIdent
 198label = Q.BlockIdent <$> (char '@' >> ident)
 199\end{code}
 200
 201The intermediate language makes heavy use of sigils, all user-defined
 202names are prefixed with a sigil. This is to avoid keyword conflicts, and
 203also to quickly spot the scope and nature of identifiers.
 204
 205\begin{itemize}
 206  \item \texttt{:} is for user-defined \nameref{sec:aggregate-types}
 207  \item \texttt{\$} is for globals (represented by a pointer)
 208  \item \texttt{\%} is for function-scope temporaries
 209  \item \texttt{@@} is for block labels
 210\end{itemize}
 211
 212\subsection{Spacing}
 213
 214\begin{code}
 215blank :: Parser Char
 216blank = oneOf "\t " <?> "blank"
 217
 218blankNL :: Parser Char
 219blankNL = oneOf "\n\t " <?> "blank or newline"
 220\end{code}
 221
 222Individual tokens in IL files must be separated by one or more spacing
 223characters. Both spaces and tabs are recognized as spacing characters.
 224In data and type definitions, newlines may also be used as spaces to
 225prevent overly long lines. When exactly one of two consecutive tokens is
 226a symbol (for example \texttt{,} or \texttt{=} or \texttt{\{}), spacing may be omitted.
 227
 228\ignore{
 229\begin{code}
 230ws :: Parser a -> Parser a
 231ws p = p <* skipMany blank
 232
 233ws1 :: Parser a -> Parser a
 234ws1 p = p <* skipMany1 blank
 235
 236wsNL :: Parser a -> Parser a
 237wsNL p = p <* skipNoCode (skipMany blankNL)
 238
 239wsNL1 :: Parser a -> Parser a
 240wsNL1 p = p <* skipNoCode (skipMany1 blankNL)
 241\end{code}
 242}
 243
 244\subsection{String Literals}
 245\label{sec:strlit}
 246
 247% The string literal is not documented in the original QBE specification.
 248% See https://c9x.me/git/qbe.git/tree/parse.c?h=v1.2#n287
 249
 250\begin{code}
 251strLit :: Parser String
 252strLit = concat <$> quoted (many strChr)
 253  where
 254    strChr :: Parser [Char]
 255    strChr = (singleton <$> noneOf "\"\\") <|> escSeq
 256
 257    escSeq :: Parser [Char]
 258    escSeq = try $ do
 259      esc <- char '\\'
 260      chr <- anyChar
 261      return $ [esc, chr]
 262\end{code}
 263
 264Strings are enclosed by double quotes and are, for example, used to specify a
 265section name as part of the \nameref{sec:linkage} information. Within a string,
 266a double quote can be escaped using a \texttt{\textbackslash} character. All
 267escape sequences, including double quote escaping, are passed through as-is to
 268the generated assembly file.
 269
 270\section{Types}
 271
 272\subsection{Simple Types}
 273
 274The IL makes minimal use of types. By design, the types used are
 275restricted to what is necessary for unambiguous compilation to machine
 276code and C interfacing. Unlike LLVM, QBE is not using types as a means
 277to safety; they are only here for semantic purposes.
 278
 279\begin{code}
 280baseType :: Parser Q.BaseType
 281baseType = choice
 282  [ bind "w" Q.Word
 283  , bind "l" Q.Long
 284  , bind "s" Q.Single
 285  , bind "d" Q.Double ]
 286\end{code}
 287
 288The four base types are \texttt{w} (word), \texttt{l} (long), \texttt{s} (single), and \texttt{d}
 289(double), they stand respectively for 32-bit and 64-bit integers, and
 29032-bit and 64-bit floating-point numbers. There are no pointer types
 291available; pointers are typed by an integer type sufficiently wide to
 292represent all memory addresses (e.g., \texttt{l} on 64-bit architectures).
 293Temporaries in the IL can only have a base type.
 294
 295\begin{code}
 296extType :: Parser Q.ExtType
 297extType = (Q.Base <$> baseType)
 298       <|> bind "b" Q.Byte
 299       <|> bind "h" Q.HalfWord
 300\end{code}
 301
 302Extended types contain base types plus \texttt{b} (byte) and \texttt{h} (half word),
 303respectively for 8-bit and 16-bit integers. They are used in \nameref{sec:aggregate-types}
 304and \nameref{sec:data} definitions.
 305
 306For C interfacing, the IL also provides user-defined aggregate types as
 307well as signed and unsigned variants of the sub-word extended types.
 308Read more about these types in the \nameref{sec:aggregate-types}
 309and \nameref{sec:functions} sections.
 310
 311\subsection{Subtyping}
 312
 313The IL has a minimal subtyping feature, for integer types only. Any
 314value of type \texttt{l} can be used in a \texttt{w} context. In that case, only the
 31532 least significant bits of the word value are used.
 316
 317Make note that it is the opposite of the usual subtyping on integers (in
 318C, we can safely use an \texttt{int} where a \texttt{long} is expected). A long value
 319cannot be used in word context. The rationale is that a word can be
 320signed or unsigned, so extending it to a long could be done in two ways,
 321either by zero-extension, or by sign-extension.
 322
 323\subsection{Constants and Vals}
 324\label{sec:constants-and-vals}
 325
 326\begin{code}
 327dynConst :: Parser Q.DynConst
 328dynConst =
 329  (Q.Const <$> constant)
 330    <|> (Q.Thread <$> global)
 331    <?> "dynconst"
 332\end{code}
 333
 334Constants come in two kinds: compile-time constants and dynamic
 335constants. Dynamic constants include compile-time constants and other
 336symbol variants that are only known at program-load time or execution
 337time. Consequently, dynamic constants can only occur in function bodies.
 338
 339The representation of integers is two's complement.
 340Floating-point numbers are represented using the single-precision and
 341double-precision formats of the IEEE 754 standard.
 342
 343\begin{code}
 344constant :: Parser Q.Const
 345constant =
 346  (Q.Number <$> decNumber)
 347    <|> (Q.SFP <$> sfp)
 348    <|> (Q.DFP <$> dfp)
 349    <|> (Q.Global <$> global)
 350    <?> "const"
 351  where
 352    sfp = string "s_" >> float
 353    dfp = string "d_" >> float
 354\end{code}
 355
 356Constants specify a sequence of bits and are untyped. They are always
 357parsed as 64-bit blobs. Depending on the context surrounding a constant,
 358only some of its bits are used. For example, in the program below, the
 359two variables defined have the same value since the first operand of the
 360subtraction is a word (32-bit) context.
 361
 362\begin{verbatim}
 363%x =w sub -1, 0 %y =w sub 4294967295, 0
 364\end{verbatim}
 365
 366Because specifying floating-point constants by their bits makes the code
 367less readable, syntactic sugar is provided to express them. Standard
 368scientific notation is prefixed with \texttt{s\_} and \texttt{d\_} for single and
 369double precision numbers respectively. Once again, the following example
 370defines twice the same double-precision constant.
 371
 372\begin{verbatim}
 373%x =d add d_0, d_-1
 374%y =d add d_0, -4616189618054758400
 375\end{verbatim}
 376
 377Global symbols can also be used directly as constants; they will be
 378resolved and turned into actual numeric constants by the linker.
 379
 380When the \texttt{thread} keyword prefixes a symbol name, the
 381symbol\textquotesingle s numeric value is resolved at runtime in the
 382thread-local storage.
 383
 384\begin{code}
 385val :: Parser Q.Value
 386val =
 387  (Q.VConst <$> dynConst)
 388    <|> (Q.VLocal <$> local)
 389    <?> "val"
 390\end{code}
 391
 392Vals are used as arguments in regular, phi, and jump instructions within
 393function definitions. They are either constants or function-scope
 394temporaries.
 395
 396\subsection{Linkage}
 397\label{sec:linkage}
 398
 399\begin{code}
 400linkage :: Parser Q.Linkage
 401linkage =
 402  wsNL (bind "export" Q.LExport)
 403    <|> wsNL (bind "thread" Q.LThread)
 404    <|> do
 405      _ <- ws1 $ string "section"
 406      (try secWithFlags) <|> sec
 407  where
 408    sec :: Parser Q.Linkage
 409    sec = wsNL strLit <&> (`Q.LSection` Nothing)
 410
 411    secWithFlags :: Parser Q.Linkage
 412    secWithFlags = do
 413      n <- ws1 strLit
 414      wsNL strLit <&> Q.LSection n . Just
 415\end{code}
 416
 417Function and data definitions (see below) can specify linkage
 418information to be passed to the assembler and eventually to the linker.
 419
 420The \texttt{export} linkage flag marks the defined item as visible outside the
 421current file\textquotesingle s scope. If absent, the symbol can only be
 422referred to locally. Functions compiled by QBE and called from C need to
 423be exported.
 424
 425The \texttt{thread} linkage flag can only qualify data definitions. It mandates
 426that the object defined is stored in thread-local storage. Each time a
 427runtime thread starts, the supporting platform runtime is in charge of
 428making a new copy of the object for the fresh thread. Objects in
 429thread-local storage must be accessed using the \texttt{thread \$IDENT} syntax,
 430as specified in the \nameref{sec:constants-and-vals} section.
 431
 432A \texttt{section} flag can be specified to tell the linker to put the defined
 433item in a certain section. The use of the section flag is platform
 434dependent and we refer the user to the documentation of their assembler
 435and linker for relevant information.
 436
 437\begin{verbatim}
 438section ".init_array" data $.init.f = { l $f }
 439\end{verbatim}
 440
 441The section flag can be used to add function pointers to a global
 442initialization list, as depicted above. Note that some platforms provide
 443a BSS section that can be used to minimize the footprint of uniformly
 444zeroed data. When this section is available, QBE will automatically make
 445use of it and no section flag is required.
 446
 447The section and export linkage flags should each appear at most once in
 448a definition. If multiple occurrences are present, QBE is free to use
 449any.
 450
 451\subsection{Definitions}
 452\label{sec:definitions}
 453
 454Definitions are the essential components of an IL file. They can define
 455three types of objects: aggregate types, data, and functions. Aggregate
 456types are never exported and do not compile to any code. Data and
 457function definitions have file scope and are mutually recursive (even
 458across IL files). Their visibility can be controlled using linkage
 459flags.
 460
 461\subsubsection{Aggregate Types}
 462\label{sec:aggregate-types}
 463
 464\begin{code}
 465typeDef :: Parser Q.TypeDef
 466typeDef = do
 467  _ <- wsNL1 (string "type")
 468  i <- wsNL1 userDef
 469  _ <- wsNL1 (char '=')
 470  a <- optionMaybe alignSpec
 471  bracesNL (opaqueType <|> unionType <|> regularType) <&> Q.TypeDef i a
 472\end{code}
 473
 474Aggregate type definitions start with the \texttt{type} keyword. They have file
 475scope, but types must be defined before being referenced. The inner
 476structure of a type is expressed by a comma-separated list of fields.
 477
 478\begin{code}
 479subType :: Parser Q.SubType
 480subType =
 481  (Q.SExtType <$> extType)
 482    <|> (Q.SUserDef <$> userDef)
 483
 484field :: Parser Q.Field
 485field = do
 486  -- TODO: newline is required if there is a number argument
 487  f <- wsNL subType
 488  s <- ws $ optionMaybe decNumber
 489  pure (f, s)
 490
 491-- TODO: "For ease of IL generation, a trailing comma is tolerated by the parser".
 492fields :: Bool -> Parser [Q.Field]
 493fields allowEmpty =
 494  (if allowEmpty then sepBy else sepBy1) field (wsNL $ char ',')
 495\end{code}
 496
 497A field consists of a subtype, either an extended type or a user-defined type,
 498and an optional number expressing the value of this field. In case many items
 499of the same type are sequenced (like in a C array), the shorter array syntax
 500can be used.
 501
 502\begin{code}
 503regularType :: Parser Q.AggType
 504regularType = Q.ARegular <$> fields True
 505\end{code}
 506
 507Three different kinds of aggregate types are presentl ysupported: regular
 508types, union types and opaque types. The fields of regular types will be
 509packed. By default, the alignment of an aggregate type is the maximum alignment
 510of its members. The alignment can be explicitly specified by the programmer.
 511
 512\begin{code}
 513unionType :: Parser Q.AggType
 514unionType = Q.AUnion <$> many1 (wsNL unionType')
 515  where
 516    unionType' :: Parser [Q.Field]
 517    unionType' = bracesNL $ fields False
 518\end{code}
 519
 520Union types allow the same chunk of memory to be used with different layouts. They are defined by enclosing multiple regular aggregate type bodies in a pair of curly braces. Size and alignment of union types are set to the maximum size and alignment of each variation or, in the case of alignment, can be explicitly specified.
 521
 522\begin{code}
 523opaqueType :: Parser Q.AggType
 524opaqueType = Q.AOpaque <$> wsNL decNumber
 525\end{code}
 526
 527Opaque types are used when the inner structure of an aggregate cannot be specified; the alignment for opaque types is mandatory. They are defined simply by enclosing their size between curly braces.
 528
 529\subsubsection{Data}
 530\label{sec:data}
 531
 532\begin{code}
 533dataDef :: Parser Q.DataDef
 534dataDef = do
 535  link <- many linkage
 536  name <- wsNL1 (string "data") >> wsNL global
 537  _ <- wsNL (char '=')
 538  alignment <- optionMaybe alignSpec
 539  bracesNL dataObjs <&> Q.DataDef link name alignment
 540 where
 541    dataObjs = sepBy dataObj (wsNL $ char ',')
 542\end{code}
 543
 544Data definitions express objects that will be emitted in the compiled
 545file. Their visibility and location in the compiled artifact are
 546controlled with linkage flags described in the \nameref{sec:linkage}
 547section.
 548
 549They define a global identifier (starting with the sigil \texttt{\$}), that
 550will contain a pointer to the object specified by the definition.
 551
 552\begin{code}
 553dataObj :: Parser Q.DataObj
 554dataObj =
 555  (Q.OZeroFill <$> (wsNL1 (char 'z') >> wsNL decNumber))
 556    <|> do
 557      t <- wsNL1 extType
 558      i <- many1 (wsNL dataItem)
 559      return $ Q.OItem t i
 560\end{code}
 561
 562Objects are described by a sequence of fields that start with a type
 563letter. This letter can either be an extended type, or the \texttt{z} letter.
 564If the letter used is an extended type, the data item following
 565specifies the bits to be stored in the field.
 566
 567\begin{code}
 568dataItem :: Parser Q.DataItem
 569dataItem =
 570  (Q.DString <$> strLit)
 571    <|> (Q.DConst <$> constant)
 572    <|> do
 573      i <- global
 574      off <- optionMaybe (char '+' >> decNumber)
 575      return $ Q.DSymbol i off
 576\end{code}
 577
 578Within each object, several items can be defined. When several data items
 579follow a letter, they initialize multiple fields of the same size.
 580
 581\begin{code}
 582allocSize :: Parser Q.AllocSize
 583allocSize =
 584  choice
 585    [ bind "4" Q.AlignWord,
 586      bind "8" Q.AlignLong,
 587      bind "16" Q.AlignLongLong
 588    ]
 589\end{code}
 590
 591The members of a struct will be packed. This means that padding has to
 592be emitted by the frontend when necessary. Alignment of the whole data
 593objects can be manually specified, and when no alignment is provided,
 594the maximum alignment from the platform is used.
 595
 596When the \texttt{z} letter is used the number following indicates the size of
 597the field; the contents of the field are zero initialized. It can be
 598used to add padding between fields or zero-initialize big arrays.
 599
 600\subsubsection{Functions}
 601\label{sec:functions}
 602
 603\begin{code}
 604funcDef :: Parser Q.FuncDef
 605funcDef = do
 606  link <- many linkage
 607  _ <- ws1 (string "function")
 608  retTy <- optionMaybe (ws1 abity)
 609  name <- ws global
 610  args <- wsNL params
 611  body <- between (wsNL1 $ char '{') (wsNL $ char '}') $ many1 block
 612
 613  case (Q.insertJumps body) of
 614    Nothing -> fail $ "invalid fallthrough in " ++ show name
 615    Just bl -> return $ Q.FuncDef link name retTy args bl
 616\end{code}
 617
 618Function definitions contain the actual code to emit in the compiled
 619file. They define a global symbol that contains a pointer to the
 620function code. This pointer can be used in \texttt{call} instructions or stored
 621in memory.
 622
 623\begin{code}
 624subWordType :: Parser Q.SubWordType
 625subWordType = choice
 626  [ try $ bind "sb" Q.SignedByte
 627  , try $ bind "ub" Q.UnsignedByte
 628  , bind "sh" Q.SignedHalf
 629  , bind "uh" Q.UnsignedHalf ]
 630
 631abity :: Parser Q.Abity
 632abity = try (Q.ASubWordType <$> subWordType)
 633    <|> (Q.ABase <$> baseType)
 634    <|> (Q.AUserDef <$> userDef)
 635\end{code}
 636
 637The type given right before the function name is the return type of the
 638function. All return values of this function must have this return type.
 639If the return type is missing, the function must not return any value.
 640
 641\begin{code}
 642param :: Parser Q.FuncParam
 643param = (Q.Env <$> (ws1 (string "env") >> local))
 644    <|> (string "..." >> pure Q.Variadic)
 645    <|> do
 646          ty <- ws1 abity
 647          Q.Regular ty <$> local
 648
 649params :: Parser [Q.FuncParam]
 650params = parenLst param
 651\end{code}
 652
 653The parameter list is a comma separated list of temporary names prefixed
 654by types. The types are used to correctly implement C compatibility.
 655When an argument has an aggregate type, a pointer to the aggregate is
 656passed by thea caller. In the example below, we have to use a load
 657instruction to get the value of the first (and only) member of the
 658struct.
 659
 660\begin{verbatim}
 661type :one = { w }
 662
 663function w $getone(:one %p) {
 664@start
 665        %val =w loadw %p
 666        ret %val
 667}
 668\end{verbatim}
 669
 670If a function accepts or returns values that are smaller than a word,
 671such as \texttt{signed char} or \texttt{unsigned short} in C, one of the sub-word type
 672must be used. The sub-word types \texttt{sb}, \texttt{ub}, \texttt{sh}, and \texttt{uh} stand,
 673respectively, for signed and unsigned 8-bit values, and signed and
 674unsigned 16-bit values. Parameters associated with a sub-word type of
 675bit width N only have their N least significant bits set and have base
 676type \texttt{w}. For example, the function
 677
 678\begin{verbatim}
 679function w $addbyte(w %a, sb %b) {
 680@start
 681        %bw =w extsb %b
 682        %val =w add %a, %bw
 683        ret %val
 684}
 685\end{verbatim}
 686
 687needs to sign-extend its second argument before the addition. Dually,
 688return values with sub-word types do not need to be sign or zero
 689extended.
 690
 691If the parameter list ends with \texttt{...}, the function is a variadic
 692function: it can accept a variable number of arguments. To access the
 693extra arguments provided by the caller, use the \texttt{vastart} and \texttt{vaarg}
 694instructions described in the \nameref{sec:variadic} section.
 695
 696Optionally, the parameter list can start with an environment parameter
 697\texttt{env \%e}. This special parameter is a 64-bit integer temporary (i.e.,
 698of type \texttt{l}). If the function does not use its environment parameter,
 699callers can safely omit it. This parameter is invisible to a C caller:
 700for example, the function
 701
 702\begin{verbatim}
 703export function w $add(env %e, w %a, w %b) {
 704@start
 705        %c =w add %a, %b
 706        ret %c
 707}
 708\end{verbatim}
 709
 710must be given the C prototype \texttt{int add(int, int)}. The intended use of
 711this feature is to pass the environment pointer of closures while
 712retaining a very good compatibility with C. The \nameref{sec:call}
 713section explains how to pass an environment parameter.
 714
 715Since global symbols are defined mutually recursive, there is no need
 716for function declarations: a function can be referenced before its
 717definition. Similarly, functions from other modules can be used without
 718previous declaration. All the type information necessary to compile a
 719call is in the instruction itself.
 720
 721The syntax and semantics for the body of functions are described in the
 722\nameref{sec:control} section.
 723
 724\section{Control}
 725\label{sec:control}
 726
 727The IL represents programs as textual transcriptions of control flow
 728graphs. The control flow is serialized as a sequence of blocks of
 729straight-line code which are connected using jump instructions.
 730
 731\subsection{Blocks}
 732\label{sec:blocks}
 733
 734\begin{code}
 735block :: Parser Q.Block'
 736block = do
 737  l <- wsNL1 label
 738  p <- many (wsNL1 $ try phiInstr)
 739  s <- many (wsNL1 statement)
 740  Q.Block' l p s <$> (optionMaybe $ wsNL1 jumpInstr)
 741\end{code}
 742
 743All blocks have a name that is specified by a label at their beginning.
 744Then follows a sequence of instructions that have "fall-through" flow.
 745Finally one jump terminates the block. The jump can either transfer
 746control to another block of the same function or return; jumps are
 747described further below.
 748
 749The first block in a function must not be the target of any jump in the
 750program. If a jump to the function start is needed, the frontend must
 751insert an empty prelude block at the beginning of the function.
 752
 753When one block jumps to the next block in the IL file, it is not
 754necessary to write the jump instruction, it will be automatically added
 755by the parser. For example the start block in the example below jumps
 756directly to the loop block.
 757
 758\subsection{Jumps}
 759
 760\begin{code}
 761jumpInstr :: Parser Q.JumpInstr
 762jumpInstr = (string "hlt" >> pure Q.Halt)
 763        -- TODO: Return requires a space if there is an optionMaybe
 764        <|> Q.Return <$> ((ws $ string "ret") >> optionMaybe val)
 765        <|> try (Q.Jump <$> ((ws1 $ string "jmp") >> label))
 766        <|> do
 767          _ <- ws1 $ string "jnz"
 768          v <- ws val <* ws (char ',')
 769          l1 <- ws label <* ws (char ',')
 770          l2 <- ws label
 771          return $ Q.Jnz v l1 l2
 772\end{code}
 773
 774A jump instruction ends every block and transfers the control to another
 775program location. The target of a jump must never be the first block in
 776a function. The three kinds of jumps available are described in the
 777following list.
 778
 779\begin{enumerate}
 780  \item \textbf{Unconditional jump.} Jumps to another block of the same function.
 781  \item \textbf{Conditional jump.} When its word argument is non-zero, it jumps to its first label argument; otherwise it jumps to the other label. The argument must be of word type; because of subtyping a long argument can be passed, but only its least significant 32 bits will be compared to 0.
 782  \item \textbf{Function return.} Terminates the execution of the current function, optionally returning a value to the caller. The value returned must be of the type given in the function prototype. If the function prototype does not specify a return type, no return value can be used.
 783  \item \textbf{Program termination.} Terminates the execution of the program with a target-dependent error. This instruction can be used when it is expected that the execution never reaches the end of the block it closes; for example, after having called a function such as \texttt{exit()}.
 784\end{enumerate}
 785
 786\section{Instructions}
 787\label{sec:instructions}
 788
 789\begin{code}
 790instr :: Parser Q.Instr
 791instr =
 792  choice
 793    [ try $ binaryInstr Q.Add "add",
 794      try $ binaryInstr Q.Sub "sub",
 795      try $ binaryInstr Q.Mul "mul",
 796      try $ binaryInstr Q.URem "urem",
 797      try $ binaryInstr Q.Rem "rem",
 798      try $ binaryInstr Q.UDiv "udiv",
 799      try $ binaryInstr Q.Or "or",
 800      try $ binaryInstr Q.Xor "xor",
 801      try $ binaryInstr Q.And "and",
 802      try $ loadInstr,
 803      try $ allocInstr,
 804      try $ compareInstr,
 805      try $ extInstr
 806    ]
 807\end{code}
 808
 809Instructions are the smallest piece of code in the IL, they form the body of
 810\nameref{sec:blocks}. This specification distinguishes instructions and
 811volatile instructions, the latter do not return a value. For the former, the IL
 812uses a three-address code, which means that one instruction computes an
 813operation between two operands and assigns the result to a third one.
 814
 815\begin{code}
 816assign :: Parser Q.Statement
 817assign = do
 818  n <- ws local
 819  t <- ws (char '=') >> ws1 baseType
 820  Q.Assign n t <$> instr
 821
 822volatileInstr :: Parser Q.Statement
 823volatileInstr = Q.Volatile <$> (storeInstr <|> blitInstr)
 824
 825-- TODO: Not documented in the QBE BNF.
 826statement :: Parser Q.Statement
 827statement = (try callInstr) <|> assign <|> volatileInstr
 828\end{code}
 829
 830An instruction has both a name and a return type, this return type is a base
 831type that defines the size of the instruction's result. The type of the
 832arguments can be unambiguously inferred using the instruction name and the
 833return type. For example, for all arithmetic instructions, the type of the
 834arguments is the same as the return type. The two additions below are valid if
 835\texttt{\%y} is a word or a long (because of \nameref{sec:subtyping}).
 836
 837\begin{verbatim}
 838%x =w add 0, %y
 839%z =w add %x, %x
 840\end{verbatim}
 841
 842Some instructions, like comparisons and memory loads have operand types
 843that differ from their return types. For instance, two floating points
 844can be compared to give a word result (0 if the comparison succeeds, 1
 845if it fails).
 846
 847\begin{verbatim}
 848%c =w cgts %a, %b
 849\end{verbatim}
 850
 851In the example above, both operands have to have single type. This is
 852made explicit by the instruction suffix.
 853
 854\subsection{Arithmetic and Bits}
 855
 856To-Do.
 857
 858\subsection{Memory}
 859\label{sec:memory}
 860
 861To-Do.
 862
 863\subsubsection{Store instructions}
 864
 865\begin{code}
 866storeInstr :: Parser Q.VolatileInstr
 867storeInstr = do
 868  t <- string "store" >> ws1 extType
 869  v <- ws val
 870  _ <- ws $ char ','
 871  ws val <&> Q.Store t v
 872\end{code}
 873
 874Store instructions exist to store a value of any base type and any extended
 875type. Since halfwords and bytes are not first class in the IL, \texttt{storeh}
 876and \texttt{storeb} take a word as argument. Only the first 16 or 8 bits of
 877this word will be stored in memory at the address specified in the second
 878argument.
 879
 880\subsubsection{Load instructions}
 881
 882\begin{code}
 883loadInstr :: Parser Q.Instr
 884loadInstr = do
 885  _ <- string "load"
 886  t <- ws1 $ choice
 887    [ try $ bind "sw" (Q.LBase Q.Word),
 888      try $ bind "uw" (Q.LBase Q.Word),
 889      try $ Q.LSubWord <$> subWordType,
 890      Q.LBase <$> baseType
 891    ]
 892  ws val <&> Q.Load t
 893\end{code}
 894
 895For types smaller than long, two variants of the load instruction are
 896available: one will sign extend the loaded value, while the other will zero
 897extend it. Note that all loads smaller than long can load to either a long or a
 898word.
 899
 900The two instructions \texttt{loadsw} and \texttt{loaduw} have the same effect
 901when they are used to define a word temporary. A \texttt{loadw} instruction is
 902provided as syntactic sugar for \texttt{loadsw} to make explicit that the
 903extension mechanism used is irrelevant.
 904
 905\subsubsection{Blits}
 906
 907\begin{code}
 908blitInstr :: Parser Q.VolatileInstr
 909blitInstr = do
 910  v1 <- (ws1 $ string "blit") >> ws val <* (ws $ char ',')
 911  v2 <- ws val <* (ws $ char ',')
 912  nb <- decNumber
 913  return $ Q.Blit v1 v2 nb
 914\end{code}
 915
 916The blit instruction copies in-memory data from its first address argument to
 917its second address argument. The third argument is the number of bytes to copy.
 918The source and destination spans are required to be either non-overlapping, or
 919fully overlapping (source address identical to the destination address). The
 920byte count argument must be a nonnegative numeric constant; it cannot be a
 921temporary.
 922
 923One blit instruction may generate a number of instructions proportional to its
 924byte count argument, consequently, it is recommended to keep this argument
 925relatively small. If large copies are necessary, it is preferable that
 926frontends generate calls to a supporting \texttt{memcpy} function.
 927
 928\subsubsection{Stack Allocation}
 929
 930\begin{code}
 931allocInstr :: Parser Q.Instr
 932allocInstr = do
 933  siz <- (ws $ string "alloc") >> (ws1 allocSize)
 934  val <&> Q.Alloc siz
 935\end{code}
 936
 937These instructions allocate a chunk of memory on the stack. The number ending
 938the instruction name is the alignment required for the allocated slot. QBE will
 939make sure that the returned address is a multiple of that alignment value.
 940
 941Stack allocation instructions are used, for example, when compiling the C local
 942variables, because their address can be taken. When compiling Fortran,
 943temporaries can be used directly instead, because it is illegal to take the
 944address of a variable.
 945
 946\subsection{Comparisons}
 947
 948Comparison instructions return an integer value (either a word or a long), and
 949compare values of arbitrary types. The returned value is 1 if the two operands
 950satisfy the comparison relation, or 0 otherwise. The names of comparisons
 951respect a standard naming scheme in three parts.
 952
 953\begin{code}
 954compareInstr :: Parser Q.Instr
 955compareInstr = do
 956  _ <- char 'c'
 957  op <- compareOp
 958  ty <- ws1 baseType
 959  lhs <- ws val <* ws (char ',')
 960  rhs <- ws val
 961  pure $ Q.Compare ty op lhs rhs
 962\end{code}
 963
 964\begin{code}
 965compareOp :: Parser Q.CmpOp
 966compareOp = choice
 967  [ bind "eq" Q.CEq
 968  , bind "ne" Q.CNe
 969  , try $ bind "sle" Q.CSle
 970  , try $ bind "slt" Q.CSlt
 971  , try $ bind "sge" Q.CSge
 972  , try $ bind "sgt" Q.CSgt
 973  , try $ bind "ule" Q.CUle
 974  , try $ bind "ult" Q.CUlt
 975  , try $ bind "uge" Q.CUge
 976  , try $ bind "ugt" Q.CUgt ]
 977\end{code}
 978
 979For example, \texttt{cod} (\texttt{I(dd,dd)}) compares two double-precision floating point numbers and returns 1 if the two floating points are not NaNs, or 0 otherwise. The \texttt{csltw} (\texttt{I(ww,ww)}) instruction compares two words representing signed numbers and returns 1 when the first argument is smaller than the second one.
 980
 981\subsection{Conversions}
 982
 983\begin{code}
 984subLongType :: Parser Q.SubLongType
 985subLongType = try (Q.SLSubWord <$> subWordType)
 986  <|> bind "sw" Q.SLSignedWord
 987  <|> bind "uw" Q.SLUnsignedWord
 988
 989extInstr :: Parser Q.Instr
 990extInstr = do
 991  _ <- string "ext"
 992  ty <- ws1 subLongType
 993  ws val <&> Q.Ext ty
 994\end{code}
 995
 996Conversion operations change the representation of a value, possibly modifying
 997it if the target type cannot hold the value of the source type. Conversions can
 998extend the precision of a temporary (e.g., from signed 8-bit to 32-bit), or
 999convert a floating point into an integer and vice versa.
1000
1001\subsection{Cast and Copy}
1002
1003To-Do.
1004
1005\subsection{Call}
1006\label{sec:call}
1007
1008\begin{code}
1009-- TODO: Code duplication with 'param'.
1010callArg :: Parser Q.FuncArg
1011callArg = (Q.ArgEnv <$> (ws1 (string "env") >> val))
1012    <|> (string "..." >> pure Q.ArgVar)
1013    <|> do
1014          ty <- ws1 abity
1015          Q.ArgReg ty <$> val
1016
1017callArgs :: Parser [Q.FuncArg]
1018callArgs = parenLst callArg
1019
1020callInstr :: Parser Q.Statement
1021callInstr = do
1022  retValue <- optionMaybe $ do
1023    i <- ws local <* ws (char '=')
1024    a <- ws1 abity
1025    return (i, a)
1026  toCall <- ws1 (string "call") >> ws val
1027  fnArgs <- callArgs
1028  return $ Q.Call retValue toCall fnArgs
1029\end{code}
1030
1031The call instruction is special in several ways. It is not a three-address
1032instruction and requires the type of all its arguments to be given. Also, the
1033return type can be either a base type or an aggregate type. These specifics are
1034required to compile calls with C compatibility (i.e., to respect the ABI).
1035
1036When an aggregate type is used as argument type or return type, the value
1037respectively passed or returned needs to be a pointer to a memory location
1038holding the value. This is because aggregate types are not first-class
1039citizens of the IL.
1040
1041Sub-word types are used for arguments and return values of width less than a
1042word. Details on these types are presented in the <@ Functions > section.
1043Arguments with sub-word types need not be sign or zero extended according to
1044their type. Calls with a sub-word return type define a temporary of base type
1045\texttt{w} with its most significant bits unspecified.
1046
1047Unless the called function does not return a value, a return temporary must be
1048specified, even if it is never used afterwards.
1049
1050An environment parameter can be passed as first argument using the \texttt{env}
1051keyword. The passed value must be a 64-bit integer. If the called function does
1052not expect an environment parameter, it will be safely discarded. See the
1053\nameref{sec:functions} section for more information about environment
1054parameters.
1055
1056When the called function is variadic, there must be a \texttt{...} marker
1057separating the named and variadic arguments.
1058
1059\subsection{Variadic}
1060\label{sec:variadic}
1061
1062To-Do.
1063
1064\subsection{Phi}
1065
1066\begin{code}
1067phiBranch :: Parser (Q.BlockIdent, Q.Value)
1068phiBranch = do
1069  n <- ws1 label
1070  v <- val
1071  pure (n, v)
1072
1073phiInstr :: Parser Q.Phi
1074phiInstr = do
1075  -- TODO: code duplication with 'assign'
1076  n <- ws local
1077  t <- ws (char '=') >> ws1 baseType
1078
1079  _ <- ws1 (string "phi")
1080  -- TODO: combinator for sepBy
1081  p <- Map.fromList <$> sepBy1 (ws phiBranch) (ws $ char ',')
1082  return $ Q.Phi n t p
1083\end{code}
1084
1085First and foremost, phi instructions are NOT necessary when writing a frontend
1086to QBE. One solution to avoid having to deal with SSA form is to use stack
1087allocated variables for all source program variables and perform assignments
1088and lookups using \nameref{sec:memory} operations. This is what LLVM users
1089typically do.
1090
1091Another solution is to simply emit code that is not in SSA form! Contrary to
1092LLVM, QBE is able to fixup programs not in SSA form without requiring the
1093boilerplate of loading and storing in memory. For example, the following
1094program will be correctly compiled by QBE.
1095
1096\begin{verbatim}
1097@start
1098    %x =w copy 100
1099    %s =w copy 0
1100@loop
1101    %s =w add %s, %x
1102    %x =w sub %x, 1
1103    jnz %x, @loop, @end
1104@end
1105    ret %s
1106\end{verbatim}
1107
1108Now, if you want to know what phi instructions are and how to use them in QBE,
1109you can read the following.
1110
1111Phi instructions are specific to SSA form. In SSA form values can only be
1112assigned once, without phi instructions, this requirement is too strong to
1113represent many programs. For example consider the following C program.
1114
1115\begin{verbatim}
1116int f(int x) {
1117    int y;
1118    if (x)
1119        y = 1;
1120    else
1121        y = 2;
1122    return y;
1123}
1124\end{verbatim}
1125
1126The variable \texttt{y} is assigned twice, the solution to translate it in SSA
1127form is to insert a phi instruction.
1128
1129\begin{verbatim}
1130@ifstmt
1131    jnz %x, @ift, @iff
1132@ift
1133    jmp @retstmt
1134@iff
1135    jmp @retstmt
1136@retstmt
1137    %y =w phi @ift 1, @iff 2
1138    ret %y
1139\end{verbatim}
1140
1141Phi instructions return one of their arguments depending on where the control
1142came from. In the example, \texttt{%y} is set to 1 if the \texttt{@ift} branch
1143is taken, or it is set to 2 otherwise.
1144
1145An important remark about phi instructions is that QBE assumes that if a
1146variable is defined by a phi it respects all the SSA invariants. So it is
1147critical to not use phi instructions unless you know exactly what you are
1148doing.
1149\end{document}