The Type of Sprintf - Ryan Brewer

The Type of Sprintf

May 14, 2024

Today I decided to write a little post about giving a type to the sprintf function via dependent types. The idea of this and related functions (like printf) is to have the first argument be a string, and special patterns in the string determine what the following arguments should be. For example:

// no special patterns, just prints the string
sprintf("hello world!
"); 
// %d pattern, the following arg should be an integer
sprintf("num: %d
", 42);

For languages that have compile-time (aka "static") types, this function generally has to be a special case in the typechecker. There's no way for the user to create a function that does something similar, since the type of the function depends on the contents of the format string. You generally couldn't even do something like this:

void my_sprintf(char* format, ???) {
    sprintf(format, ???);
}

...because my_sprintf doesn't have the same special case that the sprintf function has in the typechecker.

However, in a dependent type system, we actually do have the power to make functions like these. In fact, sprintf is sometimes given as the main example of what dependent types allow you to do that you couldn't do with only refinement types. For full transparency, I got this example from the lecture notes of Daniel Gratzer. Hopefully this blog post makes a cool example a little more accessible.

Without further ado, the set-up goes like this. As you can imagine, we have to parse the format string at compile-time to determine the types of the arguments. Therefore we'll start by making a simple Token datatype:

data Token: Type where
    IntFormat: Token
    CharFormat: Token
    StrFormat: Token
    NormalChar: Char -> Token

This is using a pseudo-code based on Agda or Idris GADT syntax. Obviously, you can put more format specifier tokens here if you want.

Next, we need to lex the string into tokens:

-- `::` is the "cons" operator, prepending to a linked list
lex: List Char -> List Token
lex chars = case chars of
    [] => []
    '%' :: '%' :: rest => NormalChar '%' :: lex rest
    '%' :: 'd' :: rest => IntFormat :: lex rest
    '%' :: 'c' :: rest => CharFormat :: lex rest
    '%' :: 's' :: rest => StrFormat :: lex rest
    c :: rest => NormalChar c :: lex rest

This appears to be a normal lexing function, because it is. There's no special magic in a dependent type system for being able to run code at compile-time!

What is special magic, in my opinion, is the next step: turning our list of tokens into a type! (This will break the syntax highlighting, unfortunately, but bear with me.)

tokens_to_type: List Token -> Type
tokens_to_type tokens = case tokens of
    NormalChar _ :: rest => tokens_to_type rest
    IntFormat :: rest => Int -> tokens_to_type rest
    CharFormat :: rest => Char -> tokens_to_type rest
    StrFormat :: rest => String -> tokens_to_type rest
    [] => String

This is worth taking a moment to unpack. Normal characters get dropped on the floor since they have no impact on the type we're constructing. Format characters, on the other hand, each prepend a new argument type to the overall function type. The base case, when there are no more characters in the format string, produces the final return type of the function: a `String. For example, if the format string was "%d %s", the tokenization would be [IntFormat, NormalChar ' ', StrFormat], so the resulting type would be Int -> String -> String. This might be confusing because it's a post-order traversal of the list, and might feel kind of backwards.

This process of producing different types based on a case-statement inspection ("elimination") of a value is called a "Large Elimination" and can be thought of as a definitive feature of dependent types that is absent from refinement types. It's extremely important for writing proofs of, say, mathematical theorems, but it's also very useful for programming tasks like the sprintf example we're exploring.

Putting it all together, we can write this handy function:

type_of_sprintf: String -> Type
type_of_sprintf s = tokens_to_type (lex (string_to_char_list s))

Nice! What does it feel like to have to use the types produced by this function?

sprintf: (s: String) -> type_of_sprintf s
sprintf s = sprintf_helper (lex (string_to_char_list s)) ""
\
sprintf_helper: (tokens: List Token) -> String -> tokens_to_type tokens
sprintf_helper tokens out_str = case tokens of
    NormalChar c :: rest => sprintf_helper rest (out_str ++ char_to_string c)
    IntFormat :: rest => \i: Int -> sprintf_helper rest (out_str ++ int_to_string i)
    CharFormat :: rest => \c: Char -> sprintf_helper rest (out_str ++ char_to_string c)
    StrFormat :: rest => \s: String -> sprintf_helper rest (out_str ++ s)
    [] => out_str

Not too bad! It's a bit crazy because each branch of this case-statement produces a value of a different type, but indeed the type of the entire case-statement is tokens_to_type tokens, which is dependent on the value being inspected (tokens). Code that is very similar to this (just syntax adjustments) typechecks in Agda, which many consider to be the state of the art in dependent pattern-matching.

Obviously this code doesn't meet a full specification of sprintf, and is implemented in a pretty unoptimized way (something like \(O(N)\) time complexity in the size of the format string). I hope, however, that you can start to see that dependent types can be extremely powerful and expressive while staying completely at compile-time. In fact, if you don't mind writing a proof or two, our sprintf function can be called on format strings that aren't fully known at compile-time! You only need to be able to prove that the string will have certain format specifiers, and then calling sprintf can typecheck.

Maybe this is just a toy example, and languages without dependent types get by with special-cased sprintf functions just fine. But I think that this example is quite a nice demonstration of what's possible with dependent types that wouldn't be possible with other systems, including even refinement types. Personally, I find large eliminations very useful for writing proofs, such as when I want to prove that certain ADT constructors are impossible at a certain point. But that's a story for another time :)

discussion here.