www.aquamentus.com

  What is C?
  Necessary background
    Understanding the C development model
    Understanding the UNIX process model
    Binary numbers
    man pages
  Hello, world
  Built-in types
    Integer numbers
      char
      short
      int
      long
      long long
      prefixes and suffixes
      aside: big-endian vs. little-endian
    Non-integer numbers
      float
      double
      long double
      Prefixes and suffixes
    Pointers
      Pointer arithmetic
      Pointers to functions
    Aggregate types
      Arrays
        Multidimensional arrays
        Arrays vs Pointers
      struct
      union
      enum
    typedef
  Variables
    Identifiers
    Syntax
    const
    static
    register
    Bit-fields
    Type promotions and typecasting
    Scope
  Characters
    isalpha(int)
    isupper(int)
    islower(int)
    isdigit(int)
    isalnum(int)
    isspace(int)
    toupper(int)
    tolower(int)
  Strings
    strlen(char*)
    strcpy(char *dest, char *src)
    strcmp(char *str1, char *str2)
    strcat(char *dest, char *src)
    strchr(char *str, char ch)
    strstr(char *str, char *substr)
  Numbers
    ++ and --
    assignment operators
    functions in math.h
  Control
    if-else
    ?: (the ternary operator)
    switch
    while
    do-while
    for
    goto
  Functions
    Pass-by-value
    return
    varargs
    Arcania
  Memory management
  System
    argv/argc: command-line arguments
    system(char*)
    rand()
    stat()
  I/O
    stdin/stdout/stderr
      getchar()
      putchar(int)
      gets(char *, int)
      puts(char*)
      printf(char *fmt, ...)
      sprintf(char *str, char *fmt, ...)
      scanf(char *fmt, ...)
      sscanf(char *str, char *fmt, ...)
    files (with file handles)
      fopen(char* name, char* mode)
      getc(FILE*)
      putc(int c, FILE*)
      fscanf(FILE*, char *fmt, ...)
      fprintf(FILE*, char *fmt, ...)
      fclose(FILE*)
      ferror(FILE*)
      feof(FILE*)
      fgets(char*, int, FILE*)
      fputs(char*, FILE*)
    files (with file descriptors)
      open(char*, int flags, int perms)
      creat(char*, int perms)
      read(int fd, char* buf, int max)
      write(int fd, char* buf, int n)
      lseek(int fd, long offset, int origin)
    directories
      opendir(char*)
      readdir(DIR*)
      closedir(DIR*)
      (example)
  The preprocessor
    #include
    #define
      swap token for text
      no value
      with parameters
        adding quotes on the fly
        creating tokens on the fly
    #undef
    #if..#endif
    #ifdef/#ifndef..#endif
  Packaging
    header files
      declarations
      #ifndef
    object files
  Operator precedence

What is C?

C is a programming language developed in the 1970s to address the problem of portability across hardware platforms. It is a very low-level language (that is, hardware and operating system details show up all over your code), but it is also universal to pretty much every computer architecture you would ever care about. Some have called it the world's worst programming language, but in actuality it's the world's best assembly language.

Coding in C can be quite fun, provided you're a wheel-reinventing sadist who's paid by the hour. But sometimes that's exactly the mood you're in.

Necessary background

Understanding the C development model

C is a compiled language. To get source code to execute, you must get it through two broad steps:

compilation converts some number of C source code files into object files
linking connects some number of object files together to make an executable program.

Object files and executables are binary files, in a format such as ELF or Mach-O. Their contents are specific to an architecture (such as ARM or x86), and since they reference the operating system API, they are also specific to an operating system (such as Linux or OS X). Unlike in many higher-level languages, developers need to compile separate binaries for each platform they support.

Since C is a low-level language, practical projects tend to be large and usually span many more files than they would in a high-level language. To deal with the large number of files (and to minimize the amount of recompilation when something changes), most C projects use Make; I won't go into it here, except to mention that it is a very common practical part of the C development model.

Understanding the UNIX process model

The "Von Neumann" model of computer architecture consists of two connected pieces:

an array of memory cells. The size of a memory cell is almost always one byte (eight bits). A memory cell can be read and written through its address, which is really just the index into the array; the range of indexes/addresses starts at zero and goes up as far as the processor's width allows.
processing logic that can read and write the memory, as well as perform arbitrary operations on binary data. "Operations" are things like AND, OR, NOT, arithmetic (adding/subtracting/etc), and comparisons (is-equal, is-greater-than, etc).

For a 32-bit machine, the highest memory address is 2³²-1, which is 4,294,967,295. For a 64-bit machine, the highest memory address is 2⁶⁴-1, which is 18,446,744,073,709,551,615. (If those look like completely arbitrary random numbers to you, remember that in binary 2³²-1 is "11111111111111111111111111111111", and 2⁶⁴-1 is "1111111111111111111111111111111111111111111111111111111111111111".)

A UNIX process occupies memory in approximately the following manner:

memory address	memory contents
0xffffffff (highest address)	The stack. The stack is where temporary memory is allocated, for things like local variables and function frames. The stack grows downward.
...	Between the heap and the stack is open space.
(some address)	The heap. The heap is where dynamic memory is allocated from malloc (et al). The heap grows upwards.
0x00000000 (lowest address)	Code. Code is put in the lowest section of memory. The bigger the program, the bigger this section is.

Binary numbers

Actually I'm just going to go ahead and assume that you, fearless reader, already know about binary (and octal, and hex) numbers. :)

man pages

C is baked into UNIX and all its variants, so unlike almost any other domain you can rely on man pages for any information you need.

% man exit -s3
NAME
     exit, _Exit -- perform normal program termination

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <stdlib.h>

     void
     exit(int status);

     void
     _Exit(int status);

DESCRIPTION
     The exit() and _Exit() functions terminate a process
...

I had to add -s3 to avoid getting the description of the bash exit command. Everything you need for C will be in either section 2 or 3.

Hello, world

To do the required first example of any language tutorial, let's create a file called foo.c containing the following:

#include <stdio.h>

int main() {
  printf("Hello, world.\n");
  return 0;
}

We compile and run it like so:

% cc foo.c
% ./a.out
Hello, world.
%

Things to notice:

You must #include a file (stdio.h, which lives in a mysterious implicit location) in order to be able to call printf later.
All top-level programs must have a function named main. By definition, main is where a C program begins execution.
printf is a function, so it requires parentheses.
The text you give to printf does not implicitly include a newline, so you must include one yourself (that's the \n).
Yep, you need semicolons.
We set the program's return value by calling return inside main.
We did not specify an output file for the "cc" command. Its default is "a.out".
- In fact, we used cc, when most people now use gcc or clang instead.
- Typical actual compiler invocations are much longer because of warning flags, additional search paths, and compiler directives.

Built-in types

The core C language has a bare minimum of types. Its primitives correspond to hardware concepts (integers of various bit-widths, floating-point numbers of various bit-widths, pointers), but it does have the ability to aggregate (with arrays and struct) so that you can create types of arbitrary complexity.

Integer numbers

All of the integral types may be declared as either "signed" or "unsigned". The typical difference is whether the high bit is used to indicate that the number is considered negative, but the standard (before C23) does not specify the use of ones-complement vs. twos-complement vs. bobs-newfangled-contraption for representing negative numbers. short, int, long, and long long are signed unless you say otherwise. Plain "char" is the exception: whether it is signed or unsigned is platform-dependent, so be explicit whenever you use a char as a small number.

char

The fact that "char" is short for "character" is misleading. At the time of the epoch, the only language that existed was English, which contains a mere 52 symbols. Throw in another 10 for the digits, plus a bunch of punctuation symbols, and the total number of characters anyone would ever need to type was well under 256. 256 is the number of numbers you can represent with 8 bits, so the hardware concept of a "byte" got translated to the software concept of a "character".

Unfortunately, immediately after C was deployed in mission-critical systems that could never be refactored, mankind developed 6,500 other languages, plus a bunch of ancient ones that are no longer in use, so "char" is nowhere near adequate for representing a generic linguistic character. Yet here it stays, with us forever.

"char" is defined on every platform to be exactly one byte. It is the only datatype in C with a guaranteed size.

Constants for the char type may take the usual numeric form (char is just a small integer, almost always 8 bits, despite the name), or they may be actual characters enclosed in single quotes. (This is the only use of single quotes in C.) The contents of the single quotes can be a single character or an escape sequence. An "escape sequence" is a backslash followed by something, and the recognized ones are: \0, \a, \b, \f, \n, \r, \t, \v, \\, \?, \', \", \### (one to three octal digits), and \x## (hex digits).

#include <limits.h>

printf("Number of bits in a byte: %d\n", CHAR_BIT);
printf("Number of bytes in a char: %zu\n", sizeof(char));
printf("Range of signed chars: %d to %d\n", SCHAR_MIN, SCHAR_MAX);
printf("Range of unsigned chars: %d to %d\n", 0, UCHAR_MAX);

short

"short" is short for "short int", but the "int" is almost always omitted. Shorts must be at least 2 bytes, but must not be bigger than an "int".

(It's usually 2 bytes.)

#include <limits.h>

printf("Number of bytes in a short: %zu\n", sizeof(short));
printf("Range of signed shorts: %d to %d\n", SHRT_MIN, SHRT_MAX);
printf("Range of unsigned shorts: %d to %d\n", 0, USHRT_MAX);

int

"int" is short for "integer". It also must be at least 2 bytes, but it can be bigger. Typically "int" is whatever the natural size of the machine is.

(It's usually 4 bytes.)

#include <limits.h>

int sfoo = 1234;
unsigned int ufoo = 1234;

printf("Number of bytes in an int: %zu\n", sizeof(int));
printf("Range of signed ints: %d to %d\n", INT_MIN, INT_MAX);
printf("Range of unsigned ints: %d to %u\n", 0, UINT_MAX);

long

"long" is short (haha) for "long int", but again the "int" is almost always omitted. Longs must be at least 4 bytes (but can be bigger), and must be at least as big as ints.

(It's usually 8 bytes on 64-bit UNIX systems, and 4 bytes on 32-bit systems and on 64-bit Windows.)

#include <limits.h>

long sfoo = 1234;
unsigned long ufoo = 1234;

printf("Number of bytes in a long: %zu\n", sizeof(long));
printf("Range of signed longs: %ld to %ld\n", LONG_MIN, LONG_MAX);
printf("Range of unsigned longs: %d to %lu\n", 0, ULONG_MAX);

long long

"long long" has been a standardized type since C99, and most compilers accepted it as an extension before that. It must be at least 64 bits, and at least as big as a long.

(It's usually 8 bytes, which on most 64-bit UNIX systems makes it the same size as long instead of twice the size.)

prefixes and suffixes

On any constants of the integer types, you may add any or all of these suffixes:

l or L: forces the constant to be long (instead of int).
u or U: forces the constant to be unsigned (instead of signed).

Additionally, you may add one of these prefixes:

0x or 0X: specifies that the constant is base-16 (hexadecimal). The constant may include the characters 'a' through 'f' (or 'A' through 'F') to represent numbers 10 through 15.
0: specifies that the constant is base-8 (octal). The constant may not contain characters '8' or '9'.

aside: big-endian vs. little-endian

A passing thought may have occurred to you regarding bigger-than-8-bit numbers. If, say, an "int" is four bytes in memory, there are two sensible ways of ordering those four bytes:

little-endian: this stores the lowest byte in the lowest memory address.
big-endian: this stores the highest byte in the lowest memory address.

For example, if our int consisted of four bytes (byte0, byte1, byte2, byte3) and was put at address "0x1000", memory could look like either of these:

(memory)	0x0fff	0x1000	0x1001	0x1002	0x1003	0x1004
(little-endian)	...	byte0	byte1	byte2	byte3	...
(big-endian)	...	byte3	byte2	byte1	byte0	...

There are no compelling reasons to go with either particular arrangement; it just needs to be understood by consumers. In fact, some architectures do it one way and others do it the other: x86 is little-endian, whereas the 68k is big-endian. (Sparc, ARM, and PowerPC are all bi-endian, which means they let you select which one to use!)

Non-integer numbers

For non-integer numbers, C has direct support only for floating-point numbers. For things like fixed-point numbers, complex numbers, vectors, and fractions, you would have to make your own types (see "struct" later).

Floating-point numbers do not have signed/unsigned variants.

A caution on floating-point numbers: any implementation of floating-point numbers is limited, because many numbers cannot be represented in a finite number of digits. (It's the same problem that "one-third" has in base 10, requiring an infinite number of 3s). Worse, humans do math in base-10 but computers do it in base-2, so there's an added problem of converting between two approximation systems. Avoid using floating-point numbers when your application requires exact results. (For example, money systems should use integers, or maybe a custom implementation of fixed-point numbers.)

float

"float" is what they call "single-precision", though the C standard doesn't give "single-precision" any definition other than "no bigger than double-precision". (In practice it's almost always the 32-bit IEEE 754 format.)

(It's usually 4 bytes.)

float foo = 123.4;

double

"double" is at least as big as "float". The standard does not guarantee that it is larger, but in practice it almost always is.

(It's usually 8 bytes.)

double foo = 123.4;

long double

"long double" is actually part of the language!

(It's usually 16 bytes.)

long double foo = 123.4;

Prefixes and suffixes

On any constants of the floating-point types, you may add one of these suffixes:

f or F: forces the constant to be float (instead of double).
l or L: forces the constant to be long double (instead of double).

Pointers

A pointer is exactly what it sounds like: a variable whose contents point to some memory address. Typically pointers point to the address of some other variable (wherever it got allocated in memory), but there are also pointers to functions (wherever they got placed in memory).

int a = 0;  // 'a' is a variable at some memory address consuming sizeof(int)
  // bytes.  Using 'a' means using the contents of that memory, which will be
  // interpreted as an integer.

int *ap;    // 'ap' is a variable at some memory address consuming sizeof(void*)
  // bytes.  Using 'ap' means using the contents of that memory, which will
  // be interpreted as the memory address of some integer.

ap = &a;    // the contents of 'ap' are set to the address of 'a', wherever
  // it happens to be.  The '&' operator gets the address of something.

*ap = 1;    // the contents of 'ap' are interpreted as a memory address, and
  // the contents of that address are set to '1'.  At this point, the value
  // of 'a' is now '1'!

ap = 1;     // the contents of 'ap' are set to '1', which is not a valid thing
  // to do because you really don't know what's going to be in a hardcoded
  // memory address.  This will generate a compile-time warning or error,
  // depending on the compiler.

ap = 0;     // however, this is okay.  '0' is a special value for pointers,
  // and it means that the pointer is not pointing to anything.  Attempting
  // to dereference it and change the memory at '0' is a runtime error that
  // usually results in a segmentation fault ("segfault").  Sometimes you'll
  // see 'NULL' used for this purpose; NULL is just a #define to 0 or
  // ((void*)0).

You can declare and use pointers to any data type: char, short, int, float, etc. This includes pointers to custom structs.

You can also have a "pointer to void". void is a way of saying something is typeless (and for functions it's a way to say there is no returned value, or has no parameters). void* is used as a general pointer type when we can't specify what the type is.

Far and away the most common use of pointers in C is to implement pass-by-reference. Since function arguments are copied on each invocation, modifying the local arguments does not affect the originals; when you want to modify the original arguments, you can pass the address of the argument instead of the argument itself, and then the function can modify what the argument points to. An example from Kernighan & Ritchie:

void swap(int *x, int *y) {
  int t = *x;  // set t to whatever x is pointing to
  *x = *y;     // set whatever x is pointing to to whatever y is pointing to
  *y = t;      // set whatever y is pointing to to t
}
..
int myvar1 = 10;
int myvar2 = 42;
swap(&myvar1, &myvar2);

The second most common use of pointers is to implement multiple return values. Since C functions can return only one thing, you could make one of the function arguments a pointer, and the function could put an additional return value there. This is very common in the C library.

// returns: nonzero if a character was read.  '0' means we've hit end of
// file and there is no more data.
// 'c': set to the next character, if there was one.
int get_next_thing_from_file(char *c) {
  ..
}
..
char next_char;
while (get_next_thing_from_file(&next_char)) {
  ..
}

Pointer arithmetic

The type of a pointer is very important because C lets you perform some basic math with pointers. Adding to and subtracting from pointer types operates on the size of the pointed-to data structure. Thus adding 1 to a char* increments the numeric value by one, whereas adding 1 to an int* increments the numeric value by sizeof(int).

Subtraction between pointers tells you the number of elements between the two addresses. It does not return the number of addresses between them, even though pointers are just numeric values!

Since there is basic arithmetic, you also have access to basic logic operations on pointers, such as is-equal (==), is-not-equal (!=), less-than (<), greater-than (>), etc.

Pointers to functions

Pointers to functions are syntactically horrible in C, due to the precedence of the various pieces involved. Suppose you have an easy function that takes a float parameter and returns an int. Its declaration and a pointer to it would look like this:

// declare it:
int myfunc(float);

// make a pointer to it:
int (*myptr)(float) = myfunc;

// call it normally:
int res1 = myfunc(4.0);

// call it through the pointer:
int res2 = (*myptr)(4.0);

The K&R C book has a sample program that converts some obfuscated C declarations into English. I repeat the two meanest ones here:

char (*(*x())[])();
 ->  x is a function, returning a pointer to an array (of unknown size) of
  pointers to functions that return a char.

char (*(*x[3])())[5];
 ->  x is an array of 3 pointers to functions returning a pointer to an array
  of 5 chars.

There's a reason why programmers drink.

Aggregate types

Arrays

Arrays are a block of variables grouped together. The variables are separate, but they're accessed by group index instead of by an individual name.

int a[10];   // defines 'a' to be an array of 10 ints.

a[0] = 42;
a[9] = 29;   // completely separate from a[0].

Arrays are implemented as contiguous elements in memory.

You can initialize an array at construction. The syntax uses braces in a unique way not seen anywhere else in the language:

int primes[10] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };

Some nifty things about array initialization:

You can omit the array size, and C will figure it out based on what's in the initialization list.
You don't have to initialize all elements -- if you have fewer elements in the list than your declared array size, the remaining ones will be initialized to 0. Bonus: this means you can initialize your whole array to zero with a list containing a single 0! (An empty list, {}, also works in GCC and Clang, but it is standard C only as of C23.)
```
int arr[10] = {0};
```
Specifying more elements in the initialization list than will fit in the size of the array is a compile-time error.

Since strings are arrays of characters, you can initialize them in the same way. However, since no one wants to type out strings like that, C gives you quotes to do the same thing. The following are equivalent:

char str1[4] = {'m', 'o', 'o', 0};
char str2[4] = "moo";  // note that C adds the terminating null

Multidimensional arrays

C does indeed support multidimensional arrays, but only because they can be viewed as arrays of arrays.

int myarr[3][5] = {
    {0, 1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {2, 3, 4, 5, 6}};
..
for (int i1 = 0; i1 < 3; ++i1) {
  for (int i2 = 0; i2 < 5; ++i2) {
    .. myarr[i1][i2] ..
  }
}

You can pass these around as function parameters, but you need to specify at least n-1 of the dimension sizes so that the compiler can do the offset math correctly. (The one you don't have to specify is the first one, because it's the number of the biggest chunks, which isn't necessary for figuring out addresses.)

void myfunc1(int myarr[3][5]) {
  ..
}
void myfunc2(int myarr[][5]) {
  ..
}
// you could also do this, for the Obfuscated C Contest:
void myfunc3(int (*myvar)[5]) {
  ..
}

Arrays vs Pointers

Arrays are visible via a pointer to their first element, so much of the time arrays and pointers are interchangeable.

int arr[10];   // 'arr' names the whole array; in most expressions it is converted to the address of the first array element
int *ptr;

ptr = arr;     // ptr now points to the same place as 'arr' - the first array element.

// same way to get the first element:
arr[0];
*ptr;

// same way to get the second element:
arr[1];
*(ptr + 1);

// same way to get the address of the second element:
&arr[1];
ptr + 1;

However, arrays and pointers are not exactly equivalent. Arrays are not a primitive type, so you cannot assign them, and you cannot increment/decrement them. Arrays also cannot be null.

arr = ptr;   // NOT okay, even though you can do "ptr = arr"
arr++;       // NOT okay; use pointers if you want to do this

struct

"struct" allows you to create a new type that bundles together any number of variables into a single object. Here's an example of a "rectangle" struct that consists of two 2-dimensional points:

// define what "rect" means:
struct rect {
  int x1;
  int y1;
  int x2;
  int y2;
};
// declare a variable of it:
struct rect myrect1;
// declare a variable and initialize its contents, in defined order:
struct rect myrect2 = {0, 0, 640, 480};  // ready to play Myst!
// print out its members:
printf("(%d, %d)-(%d, %d)\n", myrect2.x1, myrect2.y1, myrect2.x2, myrect2.y2);

structs may be nested, so our definition of rect could instead have included two instantiations of a separate point struct.

structs are implemented as a single block of memory big enough to contain all the members. (The members are necessarily in order, but due to alignment they aren't necessarily contiguous.) They can usually be treated as a primitive data type; sizeof works on them, you can assign them to each other (which works by copying bits, just like primitives), you can pass them to and from functions (which passes by value, just like primitives), and you could allocate them either on the stack or on the heap.

Pointers to structs access members with an arrow (->) instead of a period (.). This is because the precedence of the period is higher than that of the dereference operator (*):

struct rect *foo = ...;
int a = (*foo).x1;  // dereference foo, then look up its x1 member
int b = foo->x1;    // same
int z = *foo.x1;    // WRONG: looks up its x1 member, then tries to dereference it

Arrays of structs are treated like any other arrays. You can still use curly braces to initialize, which is interesting because the contents of the curly braces are then a repeating hodgepodge of whatever types are in the struct:

struct {
  int a;
  float b;
  char c;
} myvar[3] = {
  0,   1.3, 'a',
  42, 14.8, 'j',
  18, -0.3, '\0',
};

Structs may also contain pointers to themselves, for use in things like linked lists and trees. Structs cannot contain actual instances of themselves, because how would that work.

struct my_linked_list_node {
  char *data;
  struct my_linked_list_node *next_node;
};

union

"union" is quite strange. It's a bundle of different types, but only one of them is active at a time. Consider this example:

union my_union {
  int   my_int;
  float my_float;
  char* my_str;
};

union my_union foo;

When you create instance foo of my_union, the compiler will allocate just enough space for the largest of the types. When you use foo.my_int, the space is treated as an int; when you use foo.my_float, the space is treated as a float; when you use foo.my_str, the space is treated as a char*.

The typical application and example of union is compiler tokens. "23" would be turned into an integer, "0.42" would be turned into a floating-point number, and "asdf" would be treated as a string. However, if you made a token struct with all possible types you'd need, you'd waste a ton of memory on all those unused fields. union lets you say this space could be any one of a number of different things.

HOWEVER, it's still up to you to make sure you know what the correct type is -- neither the compiler at compile-time nor the code at run-time will be able to tell you how foo was last set.

You can kind of see how this is one of the precursors to polymorphism in object-oriented programming.

enum

"enum" is short for "enumerated type", whose name comes from giving us the ability to spell out (enumerate) the domain of their possible values:

enum COLORS { RED, ORANGE, YELLOW, GREEN, BLUE, VIOLET };

enum COLORS foreground_color = YELLOW;
enum COLORS background_color = VIOLET;
if (foreground_color == YELLOW && clashes(foreground_color, background_color)) {
  ..
}

Variables of type enum are actually just integers, and their values are just constants, but the enum mechanism lets you write much better self-documenting code.

Another cool thing about enums: the integer values of the constants can be whatever you want, but without specific overrides the first one is given a value of zero, and each subsequent one is automatically set to "previous + 1".

// months don't start at zero:
enum MONTHS { JAN=1, FEB, MAR, APR, MAY, JUN,
    JUL, AUG, SEP, OCT, NOV, DEC };

// use enum instead of #define for bit fields:
enum FLAGS { FIRST=1, SECOND=2, THIRD=4, FOURTH=8, FIFTH=16 };

// use enum to define aliases:
enum MEDIUM {
  TV = 0,
  TELEVISION = 0,
  RADIO = 1,
  SATELLITE = 1,
  STREAMING = 2,
  SPOTIFY = 2,
  SOUNDCLOUD = 2 };

One downside to enums: you cannot use the same compile-time constant in more than one enum, even if they have the same numeric value. For such awesomeness you would have to use an updated language (ahem, Anduin :).

typedef

typedef allows you to create an alias for any other type, which is useful for two things:

centralizing a type you're using, so that you can switch the actual type around by changing only a single line. This is what size_t is.
simplifying complex declarations, especially function pointers.

typedef unsigned long size_t;  // roughly what a system header might contain (size_t is always unsigned)

typedef int (*funcptr)(int);

// but be somewhat careful of this:

int* p1, p2; // p1 is an int*, but p2 is just an int!

typedef int* intptr;
intptr p3, p4;  // both p3 and p4 are int*!

Variables

In C, variables must be declared, both as a variable and with their specific type. Declaring variables is annoying only if you've never tried to make a real program in a language that makes no effort to help you find typos. Declaring as a specific type (as opposed to duck-typing) is annoying only until you realize it uncovers actual conceptual issues in an interface.

Identifiers

In C, identifiers (variables, function names, etc.) may contain any number of letters, numbers, and underscores, with the usual restriction that the first character cannot be a number. Letters are case-sensitive.

Some systems and linkers have restrictions on the number of characters in an identifier. (And in fact, due to historical deployment reasons, the original C89 standard guarantees uniqueness in only the first 6 characters of externally visible names! And those are case-insensitive! C99 raised that limit to 31 characters.) This does not manifest as an error or warning on those systems, but rather the generated binary contains truncated versions of those identifiers. Aliasing is then a real problem, which you debug at the binary level. Program in C for great fun and relaxation!

Syntax

Variables may be simply declared, or they may also simultaneously be initialized.

To declare a variable, you state its type and then its name:

int myvar;

To declare and initialize, you also include a value:

int myvar = 42;

If you do not initialize, the initial value of such variables is whatever crap was left on the stack. (Which is more commonly called "garbage".)

You can declare more than one variable at once:

int myvar1 = 42, myvar2 = 13;

but be careful with pointer declarations, because the asterisk applies to only one variable while the type (int) applies to all of them.

const

Most variables are, well, variable. However, you can declare a variable to be uneditable with the const keyword. The compiler will then ensure that nothing after its initial declaration and assignment will be able to change its value. (Well, assuming you don't bend over backwards to subvert it.)

const int foo = 42;

By convention, constant identifiers are spelled with all capitals, much like I just did not do in that example.

static

Most variables are automatic, which means they're scoped to their enclosing curly braces. (In fact, you could explicitly declare all your variables with auto if you really wanted worker's comp for carpal.) However, you can also declare a variable as static. There are two contexts where you can use static:

inside a function, it means that the variable is persistent between invocations. It is initialized only once, not every time the function is called, and it retains its value between function calls. You can almost think of it as a global variable with limited lexical scope.
outside a function, it means that the variable will not be exported as a visible symbol in the final binary. It will be initialized before main() starts, and is considered a global variable (for that file). In this way, you can make a variable's scope bigger than a function but less than global.

Because those two contexts really have no relation whatsoever, asking about static makes for a great interview question.

Static variables are initialized to zero (unlike automatic variables). If you want to initialize to a specific value, the value must be a constant expression (also unlike automatic variables). This initialization is done before main starts.

register

Declaring a variable as register (instead of auto or static) tells the compiler that it should try to use processor registers instead of main memory. However, register is only a hint (like inline), and compilers are not required to actually do it. Note that if you declare a variable as register, whether or not the compiler puts it in a register, you cannot take the address of it.

Bit-fields

For signed int and unsigned int members of a struct, you can declare that the member should be a specific number of bits instead of its usual full size. The most common use of this is to pack boolean variables together to save space while also making access easy.

The usual way to pack booleans is to use bitwise operations:

enum { IS_FOO=1, IS_BAR=2, IS_BAS=4, IS_BAT=8 };
unsigned int flags;

// turn on the IS_BAS flag:
flags |= IS_BAS;

// turn off the IS_BAR flag:
flags &= ~IS_BAR;

// check the IS_BAT flag:
if (flags & IS_BAT) ..

Bit-fields let you split each flag into its own variable, which means you don't have to do the bitwise math anymore:

// the following requests only 4 bits of space, but it's up to the compiler
// to determine exactly how to pack them.  Bit-fields have to be members of
// a struct (or union):
struct {
  unsigned int is_foo:1;
  unsigned int is_bar:1;
  unsigned int is_bas:1;
  unsigned int is_bat:1;
} flags;

// turn on IS_BAS:
flags.is_bas = 1;

// turn off IS_BAR:
flags.is_bar = 0;

// check IS_BAT:
if (flags.is_bat) ..

You're not limited to declaring 1-bit chunks, that's just the common use for booleans.

If you see ":0", that's a way to force this variable to align on a word boundary.

You cannot take the address of variables declared with bit-widths, because the granularity of addresses is whole bytes.
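
Here is a small sketch of a field wider than one bit. The struct and function names are invented for the example, and the exact memory layout is still up to the compiler.

```c
/* Hypothetical layout, for illustration only: a 3-bit field plus two flags. */
struct packet_info {
  unsigned int priority:3;   /* can hold 0 through 7 */
  unsigned int is_urgent:1;
  unsigned int is_ack:1;
};

/* Stores a value into the 3-bit field and reads it back. */
unsigned int store_priority(unsigned int value) {
  struct packet_info info = {0};
  info.priority = value;     /* only the low 3 bits of value are kept */
  return info.priority;
}
```

Storing 5 gives you back 5, but storing 9 (binary 1001) gives you back 1, because the top bit doesn't fit.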

Type promotions and typecasting

It happens occasionally that you have a variable of type X but need it to be of type Y. This is a problem because the entire point of declaring things as X or Y is to catch when you try to put a square peg in a round hole, and the compiler will usually squawk at you. Usually. Some X-to-Y translations are perfectly harmless, such as when you have an int but want to call a function that takes a long. Since there's really nothing that can go wrong with upsizing an int, the compiler will do that one for you automatically. (That's type promotion.) However, the reverse is not harmless: if you have a double and pass it to a function expecting a float, the compiler will still convert it for you, but the value may lose precision, and many compilers can warn about it if you turn the right warnings on.

Automatic type promotion happens all over the place, and the specific rules for it are actually pretty intense. The rules can be reasonably well summarized with "whichever of the two types is smaller will be promoted to the type of the larger." int is promoted to long, which is promoted to float, etc.

But sometimes you want to translate between types that aren't directly mathematically related, such as converting a pointer to an integer so that you can print it out. In those cases, you have to tell the compiler that you really do know what you're doing and to go ahead with the conversion. This is a typecast. You can force a typecast by prefixing the expression with the parenthesized type, e.g.:

extern void myfunc(float);
...
double myvar = ...;
myfunc( (float)myvar );

Note: C++ has a new mechanism for doing typecasts, though this way is still supported.
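
Another place casts show up all the time is integer division. The sketch below uses two made-up helper functions to show the difference a cast makes.

```c
/* Hypothetical helper: average of two ints, computed in floating point. */
double average(int a, int b) {
  /* the cast makes this a double division, so the .5 survives */
  return (double)(a + b) / 2;
}

/* The same calculation without a cast: integer division drops the fraction. */
int truncated_average(int a, int b) {
  return (a + b) / 2;
}
```

With arguments 3 and 4, the first returns 3.5 and the second returns 3.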

Scope

Variables come into scope when they're defined, and for the most part go out of scope when their enclosing block ends. "Enclosing block" usually means the pair of braces, but it also applies to end-of-file.

Variables are visible to inner sub-scopes, but not to outer parent scopes. Variables declared in inner scopes will hide ("shadow") variables in outer scopes. Once hidden, there is no way to get to them.

int myvar0;  // visible to the rest of this file, and the rest of the world via 'extern'
static int myvar1;  // visible to the rest of this file, but not the rest of the world

void myfunc(int myparam) {  // myparam is visible until the end of the function

  // we can still see myvar0 and myvar1 in this function scope

  int myvar2; // visible to everything in this function
  if (...) {

    // we can see myvar0, myvar1, and myvar2 in this inner scope

    int myvar3; // visible until the corresponding close-brace
  }
  // myvar3 is no longer visible, but we have all the others.
}
// out here, only myvar0 and myvar1 are still visible.  myvar2 and myparam both
// just went out of scope.

These scoping rules for variables also apply to functions, structs, and everything else that's an identifier.

Automatic variables in recursive functions are specific to their invocation of the function, so are not shared. Use static if you want a variable to be common across all invocations of a function.

Characters

All of the character operations work only for ASCII (at least in the default locale), not for Unicode.

isalpha(int)

Returns nonzero if the given character is in [A-Za-z].

isupper(int)

Returns nonzero if the given character is in [A-Z].

islower(int)

Returns nonzero if the given character is in [a-z].

isdigit(int)

Returns nonzero if the given character is in [0-9]. Note that period is not accepted, so be careful when trying to decode floating-points.

isalnum(int)

Returns nonzero if the given character is in [A-Za-z0-9].

isspace(int)

Returns nonzero if the given character is in [\t\n\v\f\r ].

toupper(int)

If the given character is a lowercase letter, returns the uppercase version of it; otherwise, returns the input unchanged.

tolower(int)

If the given character is an uppercase letter, returns the lowercase version of it; otherwise, returns the input unchanged.
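
A quick sketch that puts a couple of these together. The function name shout is invented for the example. The cast to unsigned char is there because these functions take an int and are only defined for values in the unsigned char range (or EOF), so a negative char is trouble.

```c
#include <ctype.h>

/* Hypothetical helper: uppercases a string in place and
   returns how many digit characters it contained. */
int shout(char *s) {
  int digits = 0;
  for (; *s; ++s) {
    unsigned char c = (unsigned char)*s;
    if (isdigit(c)) {
      ++digits;
    }
    *s = (char)toupper(c);  /* non-letters come back unchanged */
  }
  return digits;
}
```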

Strings

C only sort of supports the idea of strings. In C, strings are conventionally an array of char's followed by a zero. (A zero value, not a zero character.)

You can specify a string constant in source code by surrounding it with double-quotes. (Single-quotes are not the same thing.) As a convenience, when the compiler sees double-quoted strings, it converts them into char arrays and adds the terminating 0 for you.

And, as you're now aware, arrays and pointers are nearly interchangeable, so you will see strings as either "char*" or "char[]". However, when you initialize strings with direct text, only "char[]"s can be edited; "char*"s are essentially constant. Also, remember that arrays aren't primitives and can't be assigned:

char str_arr[] = "my string";
char *str_ptr = "my string";
..
str_arr[1] = 'a';  // ok
str_ptr[1] = 'a';  // ILLEGAL
..
str_ptr = str_arr;  // ok
str_arr = str_ptr;  // ILLEGAL

Oddly, C will concatenate adjacent string constants for you, so the following are all equivalent:

char *c1 = "foobar";
char *c2 = "foo" "bar";
char *c3 =
  "foo"
  "bar";

All of the functions for dealing with strings are part of the C library, not part of the language itself.

A note on the string functions below. Almost all of them have length-limited equivalents (such as strncpy instead of just strcpy). They are much safer to use because they stop after N bytes, which prevents them from overwriting memory they don't own.
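
One catch worth knowing: strncpy stops after N bytes, but if the source didn't fit, it does not add the terminating 0 for you. Here's a sketch of a wrapper that always terminates; the name copy_bounded is made up.

```c
#include <string.h>

/* Hypothetical helper: copies src into dest (which holds dest_size bytes),
   truncating if necessary and always adding the terminating 0. */
void copy_bounded(char *dest, size_t dest_size, const char *src) {
  if (dest_size == 0) {
    return;  /* no room for anything, not even the terminator */
  }
  strncpy(dest, src, dest_size - 1);  /* copies at most dest_size-1 chars */
  dest[dest_size - 1] = '\0';         /* strncpy skips this when src is too long */
}
```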

strlen(char*)

Returns the length of given string, not counting the terminating 0.

strcpy(char *dest, char *src)

Copies the given string (and the terminating zero) into dest. Does not allocate memory for dest; it is assumed you already did this, and dest is big enough to hold it.

strcmp(char *str1, char *str2)

Compares str1 to str2 and returns one of three values:

0: the two strings are equal
less than 0: str1 is alphabetically less than str2
greater than 0: str1 is alphabetically greater than str2
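
Since only the sign of the result is promised, it's common to wrap strcmp in something more readable. A sketch, with invented helper names:

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the two strings have the same contents. */
int same_text(const char *a, const char *b) {
  /* strcmp compares contents; a == b would only compare the two addresses */
  return strcmp(a, b) == 0;
}

/* Hypothetical helper: collapses strcmp's result to exactly -1, 0, or 1. */
int order(const char *a, const char *b) {
  int r = strcmp(a, b);
  return (r > 0) - (r < 0);
}
```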

strcat(char *dest, char *src)

Appends src to the end of dest and adds a terminating zero.

strchr(char *str, char ch)

Finds the first occurrence of ch in str. Returns the location as a char*, or zero if the character was not found.

strstr(char *str, char *substr)

Finds the first occurrence of substr in str. Returns the location as a char*, or zero if the string was not found.
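
Both strchr and strstr hand back a pointer into the original string, so subtracting the start of the string gives you an index. A sketch, with helper names I made up:

```c
#include <string.h>

/* Hypothetical helper: index of the first ch in str, or -1 if absent. */
int index_of_char(const char *str, char ch) {
  const char *hit = strchr(str, ch);
  return hit ? (int)(hit - str) : -1;  /* pointer subtraction gives the offset */
}

/* Hypothetical helper: same idea for a substring, using strstr. */
int index_of_substr(const char *str, const char *substr) {
  const char *hit = strstr(str, substr);
  return hit ? (int)(hit - str) : -1;
}
```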

Numbers

All numeric types (integers and floating-points) support these common operators:

+: addition
-: subtraction
*: multiplication
/: division

Integer (non-floating-point) numbers also support:

%: modulus
~: bitwise NOT
&: bitwise AND
|: bitwise OR
^: bitwise XOR
<<: bitwise shift-left
>>: bitwise shift-right

++ and --

C pioneered the "++" and "--" operators for numeric types. They are a shortcut way to increment or decrement a variable's value by one.

There are two ways to specify them:

++myvar (pre-increment): increments myvar and returns its new value.
myvar++ (post-increment): increments myvar and returns its old value.

If you don't use the value of the expression, then they both just increment the variable and are effectively the same.
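
If you do use the value, the difference matters. A tiny sketch with two throwaway functions:

```c
/* Returns what the expression ++x evaluates to when x starts at 'start'. */
int pre_increment_result(int start) {
  int x = start;
  int result = ++x;  /* x is bumped first, so result is start + 1 */
  return result;
}

/* Returns what the expression x++ evaluates to when x starts at 'start'. */
int post_increment_result(int start) {
  int x = start;
  int result = x++;  /* result gets the old value; x is start + 1 afterward */
  return result;
}
```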

assignment operators

Assignment operators are an awesome shortcut. Instead of having to type out this:

mylongvariablename = mylongvariablename + 3;

you can type:

mylongvariablename += 3;

There are lots of assignment operators: +=, -=, *=, /=, %=, <<=, >>=, &=, |=, and ^=.

functions in math.h

sin: sine, in radians
cos: cosine, in radians
atan2: arctangent, in radians
log: base-e logarithm
log10: base-10 logarithm
exp: exponentiation of e
pow: exponentiation of any arbitrary number
sqrt: square root
fabs: absolute value

Control

A quick preamble on truth-value expressions. C does not have a native "boolean" type; instead it uses integers and considers a value of zero to be false while all non-zero values are considered true. Since C is a system language, everything is ultimately interpreted as bits, so everything in a boolean context is interpreted by their bits. (I'm emphasizing this because strings don't work with "==".)

All numeric types support the following logical operators:

<: less-than
<=: less-than-or-equal
>: greater-than
>=: greater-than-or-equal
==: equal
!=: not-equal

Truth-value expressions can be combined using the following logical operators:

&&: and
||: or
!: not

A very important note about && and ||: for as long as C has been around, those operators have been required to evaluate left to right and to stop as soon as the final result can be determined. In particular, this means the right part might not run. For example, "false && anything" is going to be false, regardless of anything; "true || anything" is going to be true, regardless of anything. This is sometimes called "short-circuiting", and has been used extensively by programmers to avoid typing out explicit logic. My personal favorite use of this is to control debugging statements:

// this:
DEBUG && printf("asdf\n");

// ..is a shorter way of writing:
if (DEBUG) {
  printf("asdf\n");
}
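
Probably the most common use of short-circuiting is guarding a pointer before you look through it. A sketch, using an invented function name:

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if line is a non-NULL string starting with '#'. */
int is_comment(const char *line) {
  /* line[0] is only read if the left side (line != NULL) was true */
  return line != NULL && line[0] == '#';
}
```

If the two halves were swapped, passing NULL would read through a null pointer before the check ever ran.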

if-else

This is the most basic control element in C:

if (expr) {
  ..stuff..
}
..next stuff..

If expr is true, then stuff is run. If expr is false, then stuff is not run and the program continues with next stuff.

You can chain ifs together with else:

if (expr1) {
  ..stuff1..
}
else if (expr2) {
  ..stuff2..
}
..next stuff..

Just to be overly clear, this will first evaluate expr1. If true, then stuff1 is run, followed by next stuff. If expr1 was false, then expr2 is evaluated. If true, then stuff2 is run, followed by next stuff. If expr2 was also false, then only next stuff is run.

Finally, you can also specify a final else without an if, which will always run whenever all the other conditions are false:

if (expr) {
  ..stuff1..
}
else {
  ..stuff2..
}
..next stuff..

If expr is true, then stuff1 is run, followed by next stuff; otherwise, stuff2 is run, followed by next stuff.

?: (the ternary operator)

The ternary operator is a compact if-else expression.

// the long way:
if (something) {
  myvar = 1234;
}
else {
  myvar = 5678;
}

// the ternary way:
myvar = something ? 1234 : 5678;

switch

switch is like a cascaded if-else except that it is much more elegantly compact.

switch(something) {
  case 4: ... break;
  case 12: ... break;
  default: ... break;
}

The compactness of switch comes at a cost:

switch determines where to start executing by looking for the case whose expression matches an equality check with something. That means you cannot do range checks (such as less-than or greater-than), and you cannot use them for strings.
while the something can be any arbitrary expression, each of the case values must be constants so that they can be determined at compile-time.
switch is often implemented (quite sneakily!) as a jump-table. When it is, the amount of memory it consumes is proportional to the range of case values, though compilers are free to use a chain of comparisons instead. Caveat programmer.

The default clause is optional.

You may chain multiple cases together. This is both good and bad. It's good because it makes switch even more compact (without sacrificing readability). It's bad because it means you have to explicitly break when you don't want that behavior, which is really easy to forget.

switch(myvar) {
  case 0:
  case 2:
  case 4:
    printf("low-value even!\n");
    break;
  case 1:
  case 3:
  case 5:
    printf("low-value odd!\n");
    break;
  default:
    printf("not a low value\n");
    break;
}

(Strictly speaking, the break on the last case isn't necessary, but I agree with Kernighan and Ritchie that it's a good defensive programming practice in case of later code shuffling.)
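
Inside a function, return does the same job as break. Here's a sketch of chained cases in that style; the function is made up, ignores leap years, and assumes the month really is between 1 and 12.

```c
/* Hypothetical helper: number of days in a month (1-12), ignoring leap years. */
int days_in_month(int month) {
  switch (month) {
    case 4:
    case 6:
    case 9:
    case 11:
      return 30;  /* all four cases above land here */
    case 2:
      return 28;
    default:
      return 31;  /* every other month */
  }
}
```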

while

while is the most basic loop.

while (condition) {
  ..
}

condition is evaluated. If it is true, then the body of the loop is executed, and then condition is evaluated again. If it's still true, the body is executed again, and so forth.

Inside a loop, you have access to two additional loop-control statements:

break transfers execution to the end of the loop, as if it had exited normally with a false condition. You'd use this to stop processing early, instead of using logic to skip the rest of the loop body until the next normal evaluation of condition. You can use break anywhere in a loop body.
continue transfers execution to the start of the loop, as if the body was all done. You'd use this to skip the rest of the loop body but to continue looping. The condition is next evaluated, and life goes on as usual. You can use continue anywhere in a loop body.

Here's an example showing when we'd use these:

// echo lines of input until we see one that starts with an "m":
int has_m = 0;
while (line = get_next_line()) {
  // if the line starts with "#" it's a comment:
  if (line[0] == '#') {
    continue;  // forget him, let's go look at the next one
  }
  // if the line starts with "m" then we're done:
  if (line[0] == 'm') {
    has_m = 1;
    break;  // no need to search any more, the answer won't change
  }
  // otherwise print it out:
  printf("%s", line);  // never pass input as the format string itself
}
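
The example above leans on a get_next_line function that doesn't exist anywhere. Here is a self-contained sketch of the same break/continue logic that walks a NULL-terminated array of strings instead; the function name and the array convention are my own assumptions.

```c
#include <stddef.h>

/* Hypothetical helper: counts the non-comment lines that come before the
   first line starting with 'm'.  'lines' is a NULL-terminated array. */
int count_before_m(const char **lines) {
  int count = 0;
  size_t i = 0;
  while (lines[i] != NULL) {
    const char *line = lines[i];
    ++i;  /* advance first, so 'continue' can't get stuck on the same line */
    if (line[0] == '#') {
      continue;  /* comment: go look at the next one */
    }
    if (line[0] == 'm') {
      break;     /* found it, no need to keep looking */
    }
    ++count;
  }
  return count;
}
```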

do-while

do is almost the same as while, except that it checks the condition at the end of the loop instead of at the beginning. This means the body will always execute at least once.

do {
  ..
} while (condition);

I've seen do-while most commonly used for input operations - query the user for some input, check to see if it's okay, and then re-query if it's not.

char *ans;  // declared outside the braces so the condition can still see it
do {
  printf("Tell me what I need to know!\n");
  ans = get_answer();
} while (answer_does_not_please_me(ans));

You may use both break and continue in do-while loops. break jumps you completely out of the loop body; continue jumps to the condition part, which would then go back to the loop body.
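
Another spot where "at least once" is exactly what you want is counting digits, since even zero has one digit. A sketch (count_digits is a made-up name):

```c
/* Hypothetical helper: counts the decimal digits in a non-negative number. */
int count_digits(unsigned int n) {
  int digits = 0;
  do {
    ++digits;  /* runs once even when n is 0 */
    n /= 10;
  } while (n != 0);
  return digits;
}
```

A plain while loop here would return 0 for an input of 0, which is wrong.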

for

for is basically just while but with conveniently built-in initialization and increment code.

for (init-expr; condition; incr-expr) {
  ..
}

For example:

for (x=0;       // this is run only once, before the loop starts
    x < 10;     // this is the condition of the while
    x += 1) {   // this is done after each execution of the loop body
 ..
}

At this point I get to introduce you to the comma operator, which is seriously just a comma. It is used to stitch multiple expressions into a single expression, which is useful for putting multiple things into the for control code:

for (x=0,y=0;
    ++loop_count,x<10;  // the rightmost returns the expression's overall 'value'
    x+=1,y+=1) {
  ..
}

I'm telling you about the comma operator because you'll see it, not because it's necessarily a great idea.

You may use both break and continue in for loops. break jumps you completely out of the loop; continue jumps to the incr-expr and then to the cond, and then continues on as usual.
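
For what it's worth, here is one place where the comma operator reads reasonably well: walking two indexes toward each other. This is only a sketch, and reverse_ints is a name I invented.

```c
/* Hypothetical helper: reverses the first n elements of arr in place. */
void reverse_ints(int *arr, int n) {
  int i, j;
  /* i climbs from the front while j descends from the back */
  for (i = 0, j = n - 1; i < j; ++i, --j) {
    int tmp = arr[i];
    arr[i] = arr[j];
    arr[j] = tmp;
  }
}
```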

goto

The goto statement is another of C's warts, but only because software has moved so far away from machine code since C was developed. goto allows you to transfer execution to some other point in the function that's marked with a label (a name followed by colon, which you can put before any statement.)

Kernighan and Ritchie are of the opinion that goto should only ever be used to break out of nested loops (since break can only break out of one at a time). I did enough QBASIC programming to agree.

for (...) {
  for (...) {
    ...
    if (..) goto done;
  }
}
..
done: printf("done!\n");
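
A filled-in sketch of that pattern, searching a small grid. The function name and the fixed width of 3 columns are assumptions made just to keep the example short.

```c
/* Hypothetical helper: returns 1 if 'target' appears anywhere in a 3-column grid. */
int grid_contains(int rows, int grid[][3], int target) {
  int found = 0;
  int r, c;
  for (r = 0; r < rows; ++r) {
    for (c = 0; c < 3; ++c) {
      if (grid[r][c] == target) {
        found = 1;
        goto done;  /* a plain break would only leave the inner loop */
      }
    }
  }
done:
  return found;
}
```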

Functions

C allows you to bundle code into lexically-scoped functions that you can invoke from anywhere else with arbitrary arguments. Functions can take any number of arguments (including none), and may return up to one value. They support recursion.

Inside a function, you can do whatever arbitrary code you want. (Well, except for defining a sub-function.)

At any point in a function, you can use the return statement to both exit the function and set its returning value.

// a function with no arguments, returning nothing:
void myfunc1(void) {
  ..
}

// a function with an int argument, returning a long:
long myfunc2(int myarg) {
  ..
}

// a function with several arguments of different types, returning an int:
int myfunc3(int myarg1, int myarg2, char *myarg3) {
  ..
}

If you want to use a function that's defined either later in the file or defined in a completely different file, you'll need to declare the function before you can call it. (This restriction is so the compiler can line up your arguments to make sure they're the right type, in the right order, etc.) You declare a function by copying its signature and replacing the entire body with a semicolon. You can also omit the variable names of the arguments, though you still need their types:

void myfunc1(void);
long myfunc2(int);
int myfunc3(int, int, char*);

Note that C does not allow you to declare multiple functions with the same name, even if they have different signatures. That's one of the major upgrades in C++.

C does not allow you to define nested functions.

Pass-by-value

In C, arguments to functions are passed by value. That means when you call the myfunc2 function above, the integer value you provide as an argument is copied into a new integer, which is the one that myfunc2 will use. myfunc2 is free to change myarg all it likes, because it's a new variable local to the function.

void bad_increment(int val) {
  ++val;
  printf("new value: %d\n", val);  // prints out 5
  return;
}

int myval = 4;
bad_increment(myval);
printf("final value: %d\n", myval);  // prints out 4
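
The usual way around this is to pass a pointer to the variable. The pointer itself is still copied, but the copy points at the caller's int. A sketch of a working counterpart to bad_increment (the name good_increment is mine):

```c
/* Receives the address of the caller's int, so the change is visible outside. */
void good_increment(int *val) {
  ++*val;  /* increment the int that val points to, not val itself */
}
```

Calling good_increment(&myval) with myval at 4 leaves myval at 5.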

return

The aforementioned return statement establishes the returned value of the function and resumes program execution immediately after the function call. The argument to return is just any ol' arbitrary expression. For functions that don't return anything, you can leave the expression out completely.

If a function does have a return type and you don't specify a return, it is said to "fall off the end". The compiler will ensure that a value will be safely returned to the caller (as opposed to corrupting the stack), but the value will be garbage.

varargs

C supports variable-length function arguments, which lets you pass any number of things of arbitrary mixed types to a function without running afoul of the compiler.

This is best explained with an example:

void myfunc(int req_arg, ...) {
  // we'll always have 'req_arg', but we'll have other things after it as well:
  va_list ap;
  va_start(ap, req_arg); // 'ap' points to first thing after 'req_arg'
  while (...still have args to handle..) {
    if (arg-is-an-int)
      int i = va_arg(ap, int);
    else if (arg-is-a-double)
      double f = va_arg(ap, double);  // floats get promoted to double when passed via "..."
    else if (arg-is-a-string)
      char *s = va_arg(ap, char*);
  }
  va_end(ap);
}

The example from K&R is basically what printf does, so it gets the number of args to handle by looking through the fmt arg for "%" constructs. The above example is complete crap, meant for you to just get the idea.
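
If you'd like something that actually compiles, here is a minimal sketch that sidesteps the "how many args?" problem by making the first argument a count. The function name sum_ints is invented, and it trusts the caller to pass exactly that many ints.

```c
#include <stdarg.h>

/* Hypothetical helper: adds up 'count' ints passed after the count itself. */
int sum_ints(int count, ...) {
  va_list ap;
  int total = 0;
  int i;
  va_start(ap, count);          /* 'ap' starts at the first thing after 'count' */
  for (i = 0; i < count; ++i) {
    total += va_arg(ap, int);   /* nothing checks that the caller really passed ints */
  }
  va_end(ap);
  return total;
}
```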

Arcania

Functions were used long before they were standardized, so you may see a few odd things. I'm explaining them so that you'll know what they are, not so that you'll use them.

First, if you don't specify a return type for the function, it assumes int. One would think void, but no.

Next, if you don't specify any arguments, it turns off the compiler argument checking. This is for backwards compatibility with very old code that predates the ability to declare arguments. If you want to declare a function that has no arguments, you should declare the argument list as simply void.

Next, the initial (pre-ANSI) version of C split arguments and their types:

void myfunc(arg1, arg2)
  int arg1;
  char *arg2;
{
  ..
}

Memory management

When you need to get some memory for a new variable, there are two mechanisms for how you can get it: automatically, or dynamically.

"Automatic" variables are allocated via a global stack. When you declare int x; the program pushes sizeof(int) bytes onto the stack, and x becomes a reference to those bytes. x is not a pointer; using x gets you the actual value in those bytes, but you can take the address of x and see where it is in memory. When x goes out of scope, the program pops those bytes back off the stack. All of this stack and scope manipulation is done behind the scenes for you, which is why these are called automatic variables.

int a;
a = 42;

Dynamic memory is allocated via a global pool. When you declare int *x the program pushes the size of a pointer (sizeof(int*)) onto the stack, and x becomes a reference to those bytes. However, the contents of those bytes is an address (the address of an int!), so you need to point x to sizeof(int) bytes so that you can use it as an int. You do this manually by calling malloc to request the memory. It finds enough space from whatever's left in the pool and returns that address. When you're done with it you call free to return the memory to the pool.

int *a = malloc(sizeof(int));
*a = 42;
free(a);

Dynamic memory is useful because it outlives lexical scoping -- if you have a function that returns something that needs to be allocated, then dynamic memory is your only choice since automatic memory goes out of scope as soon as the function exits.

Dynamic memory is also a pain in the butt because it's a manual process. If you forget to free when you're done, you have a memory leak; if you call free on the same address more than once, it's usually bad; worse, free does not reset pointers to zero, so none of the N pointers to your int know that they're now pointing to garbage. Worst, it's not always clear whether you should free a pointer or not:

// declare some external function to get some name:
char *get_name();

// call it:
char *myname = get_name();
// At this point, is 'myname' pointing to the same thing that get_name stores internally,
// or did get_name make a copy for me?
// - If the former, I cannot free it because get_name is still using it;
// - If the latter, I must free it to avoid a memory leak.
// Oh bother.
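
The only real cure is to document who owns what. Here is a sketch of a function on the "I made a copy for you" side of that question. The name copy_name and its ownership rule are my own invention for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a freshly malloc'd copy of 'src', or NULL
   if malloc failed.  Ownership rule: the caller must free() the result. */
char *copy_name(const char *src) {
  size_t len = strlen(src) + 1;  /* +1 for the terminating 0 */
  char *copy = malloc(len);
  if (copy != NULL) {
    memcpy(copy, src, len);
  }
  return copy;
}
```

Because the result lives in dynamic memory, it is still valid after copy_name returns, and it stays valid until somebody frees it.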

A few more functions you might want to know about:

calloc: like malloc but initializes memory to 0.
realloc: takes an existing chunk of memory and resizes it (bigger or smaller), returning a pointer to the resized area, which may have moved to a new address.
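
A sketch of the usual realloc idiom, growing an array by one element. The helper name append_int is invented. The important detail is to put the result in a temporary first: if realloc fails it returns a null pointer and leaves the old block alone, so overwriting your only pointer would leak it.

```c
#include <stdlib.h>

/* Hypothetical helper: appends 'value' to a heap array of *count ints.
   Returns 1 on success, 0 if memory ran out (the array is left untouched). */
int append_int(int **arr, size_t *count, int value) {
  int *resized = realloc(*arr, (*count + 1) * sizeof(int));
  if (resized == NULL) {
    return 0;  /* *arr is still valid and still needs freeing later */
  }
  resized[*count] = value;
  *arr = resized;  /* the block may have moved, so update the caller's pointer */
  *count += 1;
  return 1;
}
```

Starting from a null pointer and a count of 0 works, because realloc on a null pointer behaves like malloc.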

System

argv/argc: command-line arguments

The command-line arguments are passed to a C program through the parameters to the main function. argc is the number of parameters, and argv is an array of char*s. argc is always at least 1 because the first argument (argv[0]) is the name of the program executed.

Power tip on argv[0]: usually argv[0] is what the user typed, not necessarily the actual file being run. This means you can make multiple symlinks to a program, and argv[0] tells you which one the user ran to invoke the program. A nifty way to create wrappers!

int main(int argc, char *argv[]) {
  printf("You passed in %i arguments:\n", argc);
  for(int i = 0; i < argc; ++i) {
    printf("  '%s'\n", argv[i]);
  }
}

system(char*)

system interprets its argument as a shell command and executes it through the system's command interpreter (typically /bin/sh on UNIX). Execution of the program is suspended until the sub-program finishes; use fork et al for concurrent execution.

system's return value contains 8 bits with the process's exit code, 7 bits with the signal that killed it, and 1 bit to say if core was dumped. Consult your local man page for specific bit arrangements.

rand()

Returns a random integer in the range of 0 to RAND_MAX.
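
You usually want a smaller range than that. A common quick-and-dirty sketch follows; the helper name is made up, it assumes lo <= hi with a span no bigger than RAND_MAX, and the modulo makes some values very slightly more likely than others.

```c
#include <stdlib.h>

/* Hypothetical helper: pseudo-random int in the range lo..hi, inclusive. */
int rand_between(int lo, int hi) {
  return lo + rand() % (hi - lo + 1);
}
```

Call srand once at startup (with the current time, say) if you don't want the same sequence on every run.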

stat()

stat looks up a ton of information about the given file and returns it in a huge struct. It's probably best to ask your local man page for details, but here's what mine says:

struct stat { /* when _DARWIN_FEATURE_64_BIT_INODE is NOT defined */
  dev_t    st_dev;    /* device inode resides on */
  ino_t    st_ino;    /* inode's number */
  mode_t   st_mode;   /* inode protection mode */
  nlink_t  st_nlink;  /* number of hard links to the file */
  uid_t    st_uid;    /* user-id of owner */
  gid_t    st_gid;    /* group-id of owner */
  dev_t    st_rdev;   /* device type, for special file inode */
  struct timespec st_atimespec;  /* time of last access */
  struct timespec st_mtimespec;  /* time of last data modification */
  struct timespec st_ctimespec;  /* time of last file status change */
  off_t    st_size;   /* file size, in bytes */
  quad_t   st_blocks; /* blocks allocated for file */
  u_long   st_blksize;/* optimal file sys I/O ops blocksize */
  u_long   st_flags;  /* user defined flags for file */
  u_long   st_gen;    /* file generation number */
};

fstat is similar except it uses a file descriptor (not a filehandle!) instead of a path string.

I/O

stdin/stdout/stderr

getchar()

getchar reads one character from stdin and returns it, or returns EOF if there's nothing left.

putchar(int)

putchar writes one character to stdout. It returns EOF if there was a problem.

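gets(char *)

Reads in the next line of input from stdin, strips off the trailing newline, and stores it in the given buffer. There is no way to tell gets how big the buffer is, so a long enough line will overwrite memory you don't own; for that reason it was removed from the language in C11. Use fgets (described below) with stdin instead, which does take a max number of characters, though note that fgets keeps the newline.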

puts(char*)

Adds a newline to the given string and writes it directly to stdout.

printf(char *fmt, ...)

printf writes an entire string to stdout. fmt is a generic string that contains any number of conversion specifications that say how to handle the remaining arguments to printf. (There must be the same number of conversion specifications in fmt as there are additional arguments to printf or else you'll get yelled at.) These conversion specifications have the following components:

"%". They all start with a percent sign. Use "%%" if you want to print an actual percent sign.
"-" (optional). Makes the field left-aligned instead of right-aligned.
a number (optional) specifying the minimum width of the field, in characters.
"." and a number (optional) specifying different things for different types:
- for strings, the max number of characters
- for floats, the number of digits after the decimal point
- for ints, the min number of digits
a conversion character specifying what the type of this argument is:
- "c": a char
- "s": a char* string
- "d" or "i": a signed int
- "u": an unsigned int
- "o": an int in base-8
- "x" or "X": an int in base-16
- "hd" or "hi" or "hu" or "ho" or "hx" or "hX": a short
- "ld" or "li" or "lu" or "lo" or "lx" or "lX": a long
- "f": a double
- "e" or "E": a double in exponential notation
- "g" or "G": a double that could be treated as either "f" or "e" depending on which would display better.
- "p": a void*, or really any pointer

// print a somewhat unsafe string with no actual modifiers:
printf("hello, world!\n");

// print a safer version:
printf("%s\n", "hello, world!");

// tell me what the integer is:
int foo = ...
printf("foo = %i\n", foo);

printf returns the number of characters written. You could check that for errors, if you're insane.

sprintf(char *str, char *fmt, ...)

sprintf is basically printf except it writes to a string instead of to stdout.

char my_str[256]; // not a great idea to hardcode this number, but hey
int foo=..
sprintf(my_str, "foo= %i\n", foo);
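
If your compiler supports C99, snprintf is the length-limited cousin of sprintf, and it tells you how many characters it wanted to write. A sketch, with an invented wrapper name:

```c
#include <stdio.h>

/* Hypothetical helper: formats "foo= <n>" into buf without ever writing more
   than 'size' bytes.  Returns 1 if the whole string fit, 0 otherwise. */
int format_foo(char *buf, size_t size, int foo) {
  /* snprintf returns the length the full string would have had */
  int needed = snprintf(buf, size, "foo= %i", foo);
  return needed >= 0 && (size_t)needed < size;
}
```

When the buffer is too small, the output is cut short but still terminated with a 0.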

scanf(char *fmt, ...)

scanf reads a formatted string from stdin. The formatting looks the same as printf's format string, and the values read in are stored into the pointers you pass to scanf.

int myint1, myint2;
char mystr[256];    // FYI: bad to hardcode
int res = scanf("%d %d %s", &myint1, &myint2, mystr);
if (res == EOF);    // out of input to read
else if (res < 3);  // ERROR: didn't get all three items!

sscanf(char *str, char *fmt, ...)

sscanf is like scanf except it reads from a string instead of from stdin.

char *existing_str = "12 foo";
int myint;
char mystr[256];   // FYI: still bad to hardcode
int res = sscanf(existing_str, "%d %s", &myint, mystr);
..
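
The hardcoded buffer is less scary if you also put a width on the "%s", which caps how many characters get stored. A sketch, where the helper name and the 16-byte buffer size are assumptions for the example:

```c
#include <stdio.h>

/* Hypothetical helper: parses "<number> <word>" out of text.
   'word' must have room for at least 16 bytes.  Returns 1 on success. */
int parse_pair(const char *text, int *number, char *word) {
  /* %15s stores at most 15 characters plus the terminating 0 */
  return sscanf(text, "%d %15s", number, word) == 2;
}
```

The return value of sscanf is the number of items it managed to fill in, so anything other than 2 here means the input didn't match.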

files (with file handles)

There are three global FILE* variables available to you in C: stdin, stdout, and stderr.

fopen(char* name, char* mode)

fopen tries to open the file at the path contained in name. mode is a string (really!) that indicates reading ("r"), writing ("w"), or appending ("a").

fopen returns a FILE*. FILE is a struct holding lots of info you probably don't want to know about. The important part is that it's not null, so you can pass it around to file-manipulating functions.

FILE *fh = fopen("/path/to/some/file", "r");
if (!fh) {
  // error!
}

getc(FILE*)

Returns the next character from the given FILE stream.

putc(int c, FILE*)

Writes the given character to the given FILE stream. Like putchar, it returns the given character, or EOF if there was an error.

fscanf(FILE*, char *fmt, ...)

fscanf is the same as scanf except that it reads from the given filehandle instead of from stdin.

fprintf(FILE*, char *fmt, ...)

fprintf is the same as printf except that it writes to the given filehandle instead of to stdout.

fclose(FILE*)

Closes the given filehandle, which just tells the system you're done with it. This doesn't mean much for reading, but for writing, this is the point where the buffer is flushed and errors occur when disks are full.

if (fclose(fh)) {
  // error!
}

fclose is called automatically when the program exits, but you really shouldn't be sloppy about closing filehandles when you're done with them - there's usually a limit to the number of files a process can have open at a time, so not cleaning up may make future fopens fail.

ferror(FILE*)

ferror returns nonzero when there's been an error on the given stream.

feof(FILE*)

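feof returns nonzero once a read on the given stream has hit end-of-file. It only becomes true after a read has already failed, so end your loops on the return value of the read itself (fgets, getc, and so on) and use feof afterward to tell end-of-file apart from an error.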

fgets(char*, int, FILE*)

fgets reads the next line of the given stream and stores it into the given string. (Up to a max number of characters.)

FILE *fh = fopen("/some/path", "r");
char line[256];
// fgets returns null at end-of-file (or on error), which ends the loop
while (fgets(line, sizeof line, fh)) {
  printf("%s", line);
}
fclose(fh);

fputs(char*, FILE*)

fputs writes the given string to the given filehandle, without formatting.

files (with file descriptors)

File descriptors are just integers. They are how UNIX thinks of files, as opposed to the handles used above.

open(char*, int flags, int perms)

Opens the given file, in the mode dictated by flags, which could be one of:

O_RDONLY
O_WRONLY
O_RDWR

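Oddly, open historically could not be used to create new files; that was creat's job. Modern systems accept O_CREAT among the flags, in which case perms supplies the permissions for the new file.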

creat(char*, int perms)

Opens the given file for writing. If the file didn't exist before, it does now; if it did exist, it's now empty.

The permissions on the file are controlled by perms, which is usually specified in octal.

read(int fd, char* buf, int max)

Reads up to the max number of characters from the given file descriptor into the given buffer.

read returns the number of characters read. 0 means "end of file", and -1 means there was an error.

write(int fd, char* buf, int n)

Writes the given number of characters from the given buffer into the file pointed to by the file descriptor.

write returns the number of characters written. If that number is not equal to the one you gave it, there was an error.

lseek(int fd, long offset, int origin)

For the given file descriptor, jumps the current position in the file to the given character offset.

origin controls how offset is used:

0 (SEEK_SET): start from beginning
1 (SEEK_CUR): start from current position
2 (SEEK_END): start from end

directories

opendir(char*)

Returns a "file"handle (I guess really a dirhandle) for the given directory.

readdir(DIR*)

Returns a pointer to a dirent struct for the next entry of the directory (the first entry on the first call), or NULL when there are no more entries.

closedir(DIR*)

Tells the system you're done reading the directory.

(example)

DIR *dh = opendir("/some/random/dir");
if (!dh) {
  // error!
} else {
  struct dirent *dir_entry;
  while ((dir_entry = readdir(dh)) != NULL) {
    printf("%s\n", dir_entry->d_name);  // name of each entry
  }
  closedir(dh);
}

The preprocessor

The C preprocessor is the very first part of compilation. "Preprocessing" involves carrying out directives (lines that start with "#") and expanding macros to create the actual source code that's fed to the compiler. Preprocessor directives don't follow the same set of syntax rules as the rest of the language, so be careful of the following:

"//" is not always recognized as a comment. Before C99, "//" wasn't a comment in C at all (it's a C++ thing), so an older preprocessor keeps it as part of the directive. C99 and later compilers strip "//" comments before directives are processed; if your code has to build with older compilers, stick to /* */ on directive lines.
If you want your directive to span more than one line, each line except the last must end with a backslash.

#include

#include copies the content of the named file into the current file. This is mostly used to pull in declarations that have been put in a centralized file. (And since declarations have to be seen before functions can be used, these #includes are usually at the top, which is why they are called header files.)

There are two variations. If you specify the file name in angle brackets, C looks in implementation-specific places for the file; if you specify the file in double-quotes, it first looks in the same directory that has the current file, and then looks in implementation-specific places.

Note that the "-I" compiler switch adds a directory in which to look for these files.

#include <stdio.h>
#include "myproject.h"

#define

#define lets you swap out any compiler token for something else. The token name you specify has to follow the normal rules for C identifiers, but the value you replace it with can be almost literally anything.

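You can #define the same token multiple times; at any point during the preprocess scan, the most recent definition wins. (Strictly speaking, a redefinition with a different value is only legal after an #undef, and most compilers will at least warn if you skip it.)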

#define has several variations.

swap token for text

This is what people usually mean when they say "macro". It's most commonly used to define constant values, but it can do almost anything, such as define a new loop keyword:

#define PI 3.14159
#define forever for(;;)
..
forever {
  ..
  float area = PI * r * r;
  ..
}

no value

You can specify a #define without a value, in which case its value is actually an empty string. This is most commonly used to see if the token has been seen before (with #ifdef or #ifndef).

#define FOO
...
#ifdef FOO
  ..
#endif

with parameters

Yes, you can define directives with parameters. Sneaky! Here's the example from the Kernighan/Ritchie book:

#define MAX(A,B) ((A) > (B) ? (A) : (B))
..
int myvar = MAX(var1, var2);

Using parameters in macros comes with even more warnings. First is that each instance of each parameter is re-evaluated in the code, which is occasionally incorrect. Using the MAX example, consider MAX(i++, j++). That will be expanded to ((i++) > (j++) ? (i++) : (j++)), which executes two increments on one of those variables. In addition to this correctness problem, the re-evaluation is also an optimization concern.
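To make the double increment concrete, here is a small sketch comparing the macro against a plain function. The names max_int, j_after_macro, and j_after_function are made up for this example; starting values of 1 and 5 are arbitrary.

```c
#define MAX(A,B) ((A) > (B) ? (A) : (B))

// Function version: each argument is evaluated exactly once at the call.
static int max_int(int a, int b) {
  return a > b ? a : b;
}

// Returns j after MAX(i++, j++). The comparison bumps both i and j,
// then the winning branch bumps j a second time, so j ends at 7.
int j_after_macro(void) {
  int i = 1, j = 5;
  int m = MAX(i++, j++);  // ((i++) > (j++) ? (i++) : (j++))
  (void)m;                // m is 6, the value of the second j++
  return j;
}

// Returns j after max_int(i++, j++). j is bumped once, so j ends at 6.
int j_after_function(void) {
  int i = 1, j = 5;
  int m = max_int(i++, j++);
  (void)m;                // m is 5
  return j;
}
```

If the arguments might have side effects, a function like max_int is the safer choice; the cost is that it only works for one type.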

The second warning is that since the macro-expanded code is fed back into the compiler, you have to account for the normal rules on precedence. Consider this:

#define square(a) a*a
..
int myvar = square(i+1);  // expands to "int myvar = i+1*i+1;"

Obviously, that did not do what you expected.
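The usual defense is to wrap every parameter, and the whole replacement text, in parentheses. A minimal sketch, with the broken macro renamed BAD_SQUARE so both can live side by side (both macro names and the two wrapper functions are invented for this example):

```c
// The unparenthesized version from above, renamed.
#define BAD_SQUARE(a) a*a

// Parenthesize each parameter and the whole body.
#define SQUARE(a) ((a)*(a))

int bad_square_of_next(int i) {
  return BAD_SQUARE(i+1);  // i+1*i+1, which is i + (1*i) + 1
}

int square_of_next(int i) {
  return SQUARE(i+1);      // ((i+1)*(i+1))
}
```

With i set to 3, the first function returns 7 and the second returns 16. The parentheses fix precedence only; SQUARE(i++) still evaluates its argument twice.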

Also, there are some even sneakier things you can do with parameters: adding quotes on the fly, and creating tokens on the fly.

adding quotes on the fly

Inside of a directive definition, you can have the preprocessor add quotes by prepending a parameter with "#". Again, the Kernighan/Ritchie example is great for showing why you'd want to do that:

#define dprint(expr) printf(#expr "=%g\n", expr)
..
dprint(x/y);  // expands to: printf("x/y" "=%g\n", x/y);  Then the strings
              // are concatenated to create: printf("x/y=%g\n", x/y);
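Here is the same idea as a compilable sketch. EXPR_TEXT, show_ratio, and ratio_text are hypothetical names added so the quoted text can be inspected on its own; dprint is the macro from above.

```c
#include <stdio.h>

// #expr becomes the string literal "expr"; adjacent literals are then joined.
#define dprint(expr) printf(#expr "=%g\n", expr)

// Hypothetical helper macro: just the quoted text of its argument.
#define EXPR_TEXT(expr) #expr

void show_ratio(double x, double y) {
  dprint(x/y);            // same as printf("x/y=%g\n", x/y);
}

const char *ratio_text(void) {
  return EXPR_TEXT(x/y);  // the string "x/y"; nothing is evaluated
}
```

Note that ratio_text compiles even though it has no x or y in scope: the argument is turned into text before the compiler proper ever sees it.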

creating tokens on the fly

Using "##" tells the preprocessor to concatenate two things together and then rescan the result. If you're wondering what that's good for, so am I.
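One use that does turn up in real code is stamping out a family of similarly named functions or variables. A minimal sketch, where struct point and the MAKE_GETTER macro are invented for this example:

```c
struct point { int x; int y; };

// Hypothetical macro: MAKE_GETTER(x) pastes "get_" and "x" into the
// single token get_x, then defines a function with that name.
#define MAKE_GETTER(field) \
  int get_##field(const struct point *p) { return p->field; }

MAKE_GETTER(x)  // defines int get_x(const struct point *p)
MAKE_GETTER(y)  // defines int get_y(const struct point *p)
```

After this, get_x(&p) and get_y(&p) are ordinary functions. Whether that is clearer than just writing the two functions out by hand is a fair question.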

#undef

Things that have been previously #defined may be undefined with #undef.

#if..#endif

The #if directive evaluates an integer expression (at compile time!) and either includes or skips the contents of the block depending on the result.

There are #else and #elif directives you can use as part of an #if block.

The condition expression is pretty much limited to math and logic operations on integers, but you are allowed to use one "function": defined. It returns whether the given identifier has been previously #defined, in any incarnation.

#if MYCONST == VALUE1
..
#elif MYCONST == VALUE2
..
#else
..
#endif
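A filled-in sketch of the same shape, mixing defined with integer comparisons. LOG_LEVEL, LOG_NAME, and log_name are hypothetical names; in a real project the setting would more likely come from a header than from the line right above the test.

```c
// Hypothetical setting, hard-coded here so the example stands alone.
#define LOG_LEVEL 2

#if !defined(LOG_LEVEL)
#define LOG_NAME "off"
#elif LOG_LEVEL >= 2
#define LOG_NAME "verbose"
#elif LOG_LEVEL == 1
#define LOG_NAME "normal"
#else
#define LOG_NAME "quiet"
#endif

const char *log_name(void) {
  return LOG_NAME;  // "verbose", since LOG_LEVEL is 2
}
```

Only one of the four #define LOG_NAME lines survives preprocessing; the compiler never sees the others.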

#ifdef/#ifndef..#endif

In fact, if all you're checking is whether something has or has not been #defined, the preprocessor has shortcuts.

#if defined(foo)
// is the same as:
#ifdef foo


#if !defined(bar)
// is the same as:
#ifndef bar

Far and away the most common use of this is to prevent double-definition of header file content. (See below.)

Packaging

C supports 2 pieces of software packaging:

header files (code files which traditionally end with .h) contain declarations of variables and functions that live elsewhere.
object files (binary files which traditionally end with .o or .so) contain already-compiled definitions of things.

Typically they go together -- when someone creates a package for you to use, they'll deliver both the binary object as well as the text header.

Non-standardized UNIX tradition also includes archives, which are a bundle of object files combined into a single file that ends with .a. Archives are created with the ar utility. Pretty much all linkers understand .a files and handle them as a collection of .o files.

header files

The #include mechanism tells the compiler to copy the contents of the specified file and paste them at that location in the current file. Anything you can do in C code can be segmented up and reconstituted with #include. The intended application of it is to centralize external declarations and definitions. That is, instead of having every one of your code files type out the exact declaration for printf, let's put the declaration in a file called stdio.h, and then your code files need only #include <stdio.h>.

#include is great for code reuse and decoupling, but presents a very real problem: object double-definition. If you define an identifier multiple times (even as the exact same thing), the linker will throw an error. Identifiers must be defined exactly once in the final program.

To create your own header files and avoid double-definitions, you need to know two tricks: how to declare things (vs. define them), and how to keep definitions from being seen more than once.

declarations

Variables are declared by prefixing them with extern. This tells the compiler that the variable will be defined somewhere else, but also gives the compiler all the type information it needs.

Functions are declared by replacing the function body with a semicolon.

Struct types can be declared without being defined (struct foo;). Until the definition is seen you can't create objects of that type or touch its members, but you can use pointers to it just fine, which is also what lets structs be recursive.

extern int myvar1;
extern char *myvar2;

int myfunc1(int);
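To show declarations and definitions next to each other, here is a sketch that squeezes both halves into one file. Normally the first half would live in a header and the second half in exactly one code file. The names call_count and next_id, and the starting value of 100, are made up for this example.

```c
// --- what would go in the header: declarations only ---
extern int call_count;  // type information, but no storage set aside
int next_id(void);      // function body replaced by a semicolon

// --- what would go in exactly one code file: definitions ---
int call_count = 0;

int next_id(void) {
  call_count++;
  return 100 + call_count;
}
```

The declarations can be repeated in as many files as you like; it's the two definitions at the bottom that must appear only once in the final program.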

#ifndef

To avoid double-definitions, you'll quite frequently see a #ifndef trick in header files. The #ifndef trick sets a compile-time variable inside a header file, but then skips the whole header if that variable is already defined.

#ifndef _MYHEADER_H_
#define _MYHEADER_H_
...
#endif

This means you can #include a file multiple times and only the first one has the actual content in it.
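You can watch the guard work without a second file by pasting the same guarded block twice, which is what the compiler effectively sees after two #includes. In this sketch the header name point.h, the guard POINT_H, and point_sum are all hypothetical.

```c
// First copy, as if from the first #include "point.h"
#ifndef POINT_H
#define POINT_H
struct point { int x; int y; };
#endif

// Second copy, as if the header were included again. POINT_H is
// already defined, so everything up to #endif is skipped and
// struct point is not defined a second time.
#ifndef POINT_H
#define POINT_H
struct point { int x; int y; };
#endif

int point_sum(struct point p) {
  return p.x + p.y;
}
```

Delete the #ifndef/#define/#endif lines from both copies and the compiler will complain that struct point is being redefined.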

object files

Object files are the compiled binary of some number of source files. They can be either statically linked (conventionally .o) or they can be dynamically linked (conventionally .so).

Even though they're not in readable ASCII, there are a few things you can do to see what's going on with object files.

nm: shows you the symbol table.
strings: shows you all the constant char* strings.
otool: lists various parts of an object file (Mac OS X; objdump is the rough equivalent elsewhere).
od and hexdump: reads an arbitrary file and prints it out as oct/hex codes.

You consume object files by linking them with your binaries to create the final program. You could do linking yourself with ld, but most compilers will do it for you when you give them binary inputs and/or request a final program as the output.

Operator precedence

One of the most loudly derided warts in C is that it defines 15 distinct levels of operator precedence of varying associativity. If you wish to reach the end of your life sane, now is the time to avert your eyes, because I now present to you the full table in all its resplendent whatthefuckery, in decreasing order of precedence:

operator(s)	associativity	precedence
() [] -> .	left to right	highest
! ~ ++ -- +(no-op) -(negation) *(dereference) &(address-of) (type)(cast) sizeof	right to left
*(multiplication) / %	left to right
+(addition) -(subtraction)	left to right
<< >>	left to right
< <= > >=	left to right
== !=	left to right
&(bitwise and)	left to right
^	left to right
|	left to right
&&	left to right
||	left to right
?:	right to left
= += -= *= /= %= &= ^= |= <<= >>=	right to left
,	left to right	lowest
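A few rows of the table in action, as a sketch. All four function names are invented for this example, and the groupings in the comments follow the table above.

```c
// * and ++ share the unary row, which associates right to left, so
// *p++ means *(p++): fetch the character, then advance the pointer.
char first_then_advance(const char **pp) {
  const char *p = *pp;
  char c = *p++;
  *pp = p;
  return c;
}

// Binary - associates left to right: a - b - c is (a - b) - c.
int subtract_chain(int a, int b, int c) {
  return a - b - c;
}

// = associates right to left: a = b = v is a = (b = v).
int assign_chain(int v) {
  int a, b;
  a = b = v;
  return a + b;
}

// ?: associates right to left, and the comparisons bind tighter than it:
// this reads as (n < 0) ? -1 : ((n == 0) ? 0 : 1).
int sign(int n) {
  return n < 0 ? -1 : n == 0 ? 0 : 1;
}
```

When in doubt, add parentheses. They cost nothing and save the next reader a trip back to the table.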

Chris verBurg
2015-07-12