www.aquamentus.com
What is C?
Necessary background
Understanding the C development model
Understanding the UNIX process model
Binary numbers
man pages
Hello, world
Built-in types
Integer numbers
char
short
int
long
long long
prefixes and suffixes
aside: big-endian vs. little-endian
Non-integer numbers
float
double
long double
Prefixes and suffixes
Pointers
Pointer arithmetic
Pointers to functions
Aggregate types
Arrays
Multidimensional arrays
Arrays vs Pointers
struct
union
enum
typedef
Variables
Identifiers
Syntax
const
static
register
Bit-fields
Type promotions and typecasting
Scope
Characters
isalpha(int)
isupper(int)
islower(int)
isdigit(int)
isalnum(int)
isspace(int)
toupper(int)
tolower(int)
Strings
strlen(char*)
strcpy(char *dest, char *src)
strcmp(char *str1, char *str2)
strcat(char *dest, char *src)
strchr(char *str, char ch)
strstr(char *str, char *substr)
Numbers
++ and --
assignment operators
functions in math.h
Control
if-else
?: (the ternary operator)
switch
while
do-while
for
goto
Functions
Pass-by-value
return
varargs
Arcania
Memory management
System
argv/argc: command-line arguments
system(char*)
rand()
stat()
I/O
stdin/stdout/stderr
getchar()
putchar(int)
gets(char *, int)
puts(char*)
printf(char *fmt, ...)
sprintf(char *str, char *fmt, ...)
scanf(char *fmt, ...)
sscanf(char *str, char *fmt, ...)
files (with file handles)
fopen(char* name, char* mode)
getc(FILE*)
putc(int c, FILE*)
fscanf(FILE*, char *fmt, ...)
fprintf(FILE*, char *fmt, ...)
fclose(FILE*)
ferror(FILE*)
feof(FILE*)
fgets(char*, int, FILE*)
fputs(char*, FILE*)
files (with file descriptors)
open(char*, int flags, int perms)
creat(char*, int perms)
read(int fd, char* buf, int max)
write(int fd, char* buf, int n)
lseek(int fd, long offset, int origin)
directories
opendir(char*)
readdir(DIR*)
closedir(DIR*)
(example)
The preprocessor
#include
#define
swap token for text
no value
with parameters
adding quotes on the fly
creating tokens on the fly
#undef
#if..#endif
#ifdef/#ifndef..#endif
Packaging
header files
declarations
#ifndef
object files
Operator precedence
What is C?
C is a programming language developed in the 1970s to address the problem of
portability across hardware platforms. It is a very low-level language
(that is, hardware and operating system details are common), but it is
also universal to pretty much every computer architecture you would
ever care about. Some have called it the world's worst programming language, but
in actuality it's the world's best assembly language.
Coding in C can be quite fun, provided you're a wheel-reinventing sadist
who's paid by the hour. But sometimes that's exactly the mood you're in.
Necessary background
Understanding the C development model
C is a compiled language. To get source code to execute, you must get it
through two broad steps:
- compilation converts some number of C source code files into
object files
- linking connects some number of object files together to
make an executable program.
Object files and executables are binary files, in a format such as ELF or
Mach-O. Their contents are specific to architecture (such as ARM or x86), and
since they reference the operating system API, they are also specific to
operating system (such as Linux or OS X). Unlike interpreted or
bytecode-compiled languages, C requires developers to compile binaries for
each platform they support.
Since C is a low-level language, practical projects tend to be large and
usually span many more files than they would in a high-level language. To deal
with the large number of files (and to minimize the amount of recompilation
when something changes), most C projects use Make; I won't go into it here,
except to mention that it is a very common practical part of the C
development model.
Understanding the UNIX process model
The "Von Neumann" model of computer architecture consists of two connected
pieces:
- an array of memory cells. The size of a memory cell is almost always
one byte (eight bits). A memory cell can be read and written through its
address, which is really just the index into the array; the range of
indexes/addresses starts at zero and goes up as far as the processor's
width allows.
- processing logic that can read and write the memory, as well as perform
arbitrary operations on binary data. "Operations" are things like AND, OR,
NOT, arithmetic (adding/subtracting/etc), and comparisons (is-equal,
is-greater-than, etc).
(For a 32-bit machine, the highest
memory address is 2^32-1, which is 4,294,967,295. For a 64-bit
machine, the highest memory address is 2^64-1, which is
18,446,744,073,709,551,615. If those look like completely arbitrary
random numbers to you, remember that in binary 2^32-1 is
"11111111111111111111111111111111", and 2^64-1 is
"1111111111111111111111111111111111111111111111111111111111111111".)
A UNIX process occupies memory in approximately the following manner:
memory address | memory contents |
0xffffffff (highest address) | The stack. The stack is where temporary memory is allocated, for things like local variables and function frames. The stack grows downward. |
... | Between the heap and the stack is open space. |
(some address) | The heap. The heap is where dynamic memory is allocated from malloc (et al). The heap grows upwards. |
0x00000000 (lowest address) | Code. Code is put in the lowest section of memory. The bigger the program, the bigger this section is. |
Binary numbers
Actually I'm just going to go ahead and assume that you, fearless reader,
already know about binary (and octal, and hex) numbers. :)
man pages
C is baked into UNIX and all its variants, so unlike almost any other domain
you can rely on man pages for any information you need.
% man exit -s3
NAME
exit, _Exit -- perform normal program termination
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <stdlib.h>
void
exit(int status);
void
_Exit(int status);
DESCRIPTION
The exit() and _Exit() functions terminate a process
...
I had to add -s3 to avoid getting the description of the bash exit
command.
Everything you need for C will be in either section 2 or 3.
Hello, world
To do the required first example of any language tutorial, let's create a
file called foo.c containing the following:
#include <stdio.h>
int main() {
    printf("Hello, world.\n");
    return 0;
}
We compile and run it like so:
% cc foo.c
% ./a.out
Hello, world.
%
Things to notice:
- You must #include a file (stdio.h, which lives in a mysterious
implicit location) in order to be able to call printf later.
- All top-level programs must have a function named main. By definition,
main is where a C program begins execution.
- printf is a function, so it requires parentheses.
- The text you give to printf does not implicitly include a newline,
so you must include one yourself.
- Yep, you need semicolons.
- We set the program's return value by calling return inside main.
- We did not specify an output file for the "cc" command. Its default is
"a.out".
- In fact, we used cc, when most people now use gcc or clang instead.
- Typical actual compiler invocations are much longer because of warning
flags, additional search paths, and compiler directives.
Built-in types
The core C language has a bare minimum of types. Its primitives correspond
to hardware concepts (integers of various bit-widths, floating-point
numbers of various bit-widths, pointers), but it does have the ability
to aggregate (with arrays and struct) so that you can create types of
arbitrary complexity.
Integer numbers
All of the integral types may be declared as either "signed" or "unsigned". The
typical difference is whether the high bit is used to indicate that the
number is considered negative, but the standard does not specify the use of
ones-complement vs. twos-complement vs. bobs-newfangled-contraption for
representing negative numbers. Also, short, int, and long are signed by
default, but whether a plain "char" is signed or unsigned is
platform-dependent, so be explicit whenever the sign of a char matters.
char
The fact that "char" is short for "character" is misleading. At the time of
the epoch, the only language that existed was English, which contains a
mere 52 symbols. Throw in another 10 for the digits, plus a bunch of
punctuation symbols, and the total number of characters anyone would ever
need to type was well under 256. 256 is the number of numbers you can
represent with 8 bits, so the hardware concept of a "byte" got translated to
the software concept of a "character".
Unfortunately, right after C was deployed in mission-critical systems
that could never be refactored, mankind immediately developed 6,500 other
languages, plus a bunch of ancient ones that are no longer used,
so "char" is nowhere near adequate for representing a generic
linguistic character. Yet, here it stays, with us forever like an STD from
some cheap Haskell-programming whore.
"char" is defined on every platform to be exactly one byte. It is the only
datatype in C with a guaranteed size.
Constants for the char type may take the usual numeric form (char is just
a small integer, 8 bits wide on nearly every platform, despite the name), or
they may be actual characters enclosed in single quotes. (This is the only use
of single quotes in C.) The contents of the single-quoted string could be a
single character or it could be an escape sequence. An "escape sequence" is a
backslash followed by something, and the recognized ones are:
\0, \a, \b, \f, \n, \r, \t, \v, \\, \?,
\', \", \### (with octal digits), and \x## (with hex digits).
#include <limits.h>
printf("Number of bits in a byte: %d\n", CHAR_BIT);
printf("Number of bytes in a char: %zu\n", sizeof(char));
printf("Range of signed chars: %d to %d\n", SCHAR_MIN, SCHAR_MAX);
printf("Range of unsigned chars: %d to %d\n", 0, UCHAR_MAX);
short
"short" is short for "short int", but the "int" is almost always omitted.
Shorts must be at least 2 bytes, but must not be bigger than an "int".
(It's usually 2 bytes.)
#include <limits.h>
printf("Number of bytes in a short: %zu\n", sizeof(short));
printf("Range of signed shorts: %d to %d\n", SHRT_MIN, SHRT_MAX);
printf("Range of unsigned shorts: %d to %d\n", 0, USHRT_MAX);
int
"int" is short for "integer". It also must be at least 2 bytes, but it can
be bigger. Typically "int" is whatever the natural size of the machine is.
(It's usually 4 bytes.)
#include <limits.h>
int sfoo = 1234;
unsigned int ufoo = 1234;
printf("Number of bytes in an int: %zu\n", sizeof(int));
printf("Range of signed ints: %d to %d\n", INT_MIN, INT_MAX);
printf("Range of unsigned ints: %d to %u\n", 0, UINT_MAX);
long
"long" is short (haha) for "long int", but again the "int" is almost always
omitted. Longs must be at least 4 bytes (but can be bigger), and must be at least
as big as ints.
(It's usually 8 bytes on 64-bit UNIX systems, and 4 bytes on 32-bit systems
and on 64-bit Windows.)
#include <limits.h>
long sfoo = 1234;
unsigned long ufoo = 1234;
printf("Number of bytes in a long: %zu\n", sizeof(long));
printf("Range of signed longs: %ld to %ld\n", LONG_MIN, LONG_MAX);
printf("Range of unsigned longs: %d to %lu\n", 0, ULONG_MAX);
long long
"long long" has been a standardized type since C99, and most compilers
accepted it as an extension before that. It has to be at least 8 bytes, and
at least as big as a long. You might expect it to be twice the size of a long.
(However, on my systems it's 8 bytes, instead of 16.)
prefixes and suffixes
On any constants of the integer types, you may add any or all of these suffixes:
- l or L: forces the constant to be long (instead of int).
- u or U: forces the constant to be unsigned (instead of signed).
Additionally, you may add one of these prefixes:
- 0x or 0X: specifies that the constant is base-16
(hexadecimal). The constant may include the characters 'a' through 'f'
(or 'A' through 'F') to represent numbers 10 through 15.
- 0: specifies that the constant is base-8 (octal). The constant
may not contain characters '8' or '9'.
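Here is a small sketch of those prefixes and suffixes in action. The function names are made up for illustration; the point is only that the same number can be spelled in three bases, and that a leading zero quietly changes the meaning of a constant.

```c
/* Returns 1 if the decimal, hex, and octal spellings all denote 31. */
int three_spellings_agree(void) {
    int decimal = 31;
    int hex = 0x1F;   /* 1*16 + 15 */
    int octal = 037;  /* 3*8 + 7 */
    return decimal == hex && hex == octal;
}

/* A leading zero means octal, so this returns eight, not ten. */
int leading_zero_trap(void) {
    return 010;
}

/* Returns 1 if the L and U suffixes produce long and unsigned int constants. */
int suffix_changes_type(void) {
    return sizeof(1L) == sizeof(long) && sizeof(1U) == sizeof(unsigned int);
}
```

The leading-zero case is the one that bites people, usually when they zero-pad a column of numbers to make it line up.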
aside: big-endian vs. little-endian
A passing thought may have occurred to you regarding bigger-than-8-bit
numbers. If, say, an "int" is four bytes in memory, there are two sensible
ways of ordering those four bytes:
- little-endian: this stores the lowest byte in the lowest memory address.
- big-endian: this stores the highest byte in the lowest memory address.
For example, if our int consisted of four bytes (byte0, byte1, byte2, byte3) and
was put at address "0x1000", memory could look like either of these:
(memory) | 0x0fff | 0x1000 | 0x1001 | 0x1002 | 0x1003 | 0x1004 |
(little-endian) | ... | byte0 | byte1 | byte2 | byte3 | ... |
(big-endian) | ... | byte3 | byte2 | byte1 | byte0 | ... |
There are no compelling reasons to go with either particular arrangement; it
just needs to be understood by consumers. In fact, some architectures do
it one way and others do it the other: x86 is little-endian, whereas the
68k is big-endian. (Sparc, ARM, and PowerPC are all bi-endian, which means
they let you select which one to use!)
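If you ever need to know which arrangement your machine uses, one common trick is to store a known multi-byte value and look at the byte at the lowest address. This is a minimal sketch; the function name is hypothetical, and it assumes the platform provides uint32_t.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian machine, 0 otherwise. */
int is_little_endian(void) {
    uint32_t value = 0x01020304;
    unsigned char bytes[sizeof value];

    /* Copy the in-memory representation so we can inspect it byte by byte. */
    memcpy(bytes, &value, sizeof value);

    /* Little-endian puts the lowest byte (0x04) at the lowest address. */
    return bytes[0] == 0x04;
}
```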
Non-integer numbers
For non-integer numbers, C has direct support only for floating-point numbers.
(C99 did add complex numbers built on top of them, via complex.h.)
For things like fixed-point numbers, vectors, and fractions,
you would have to make your own types (see "struct" later).
Floating-point numbers do not have signed/unsigned variants.
A caution on floating-point numbers: any implementation of floating-point
numbers is limited, because floating-point numbers have the nasty ability
to not be representable in a finite number of digits. (It's the same problem
that "one-third" has in base 10, requiring an infinite number of 3s).
Worse, humans do math in base-10
but computers do it in base-2, so there's an added problem of converting
between two approximation systems. Avoid using floating-point numbers
when your application requires precision. (For example, money systems
should use integers, or maybe a custom implementation of fixed-point
numbers.)
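To make the caution concrete, here is a sketch with hypothetical function names. It assumes the usual IEEE 754 doubles, where neither 0.1 nor 0.2 nor 0.3 is exactly representable, so the sum does not compare equal to 0.3. The second function shows the integer alternative for money: count cents instead of dollars.

```c
/* Returns 1 if 0.1 + 0.2 compares equal to 0.3 in double arithmetic.
   On IEEE 754 platforms this returns 0. */
int tenth_plus_two_tenths_equals_three_tenths(void) {
    double sum = 0.1 + 0.2;
    return sum == 0.3;
}

/* Money kept as a whole number of cents adds up exactly. */
long add_cents(long a, long b) {
    return a + b;
}
```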
float
"float" is what they call "single-precision", though "single-precision"
doesn't seem to have any definition other than "smaller than double-precision".
(It's usually 4 bytes.)
float foo = 123.4;
double
"double" is at least as big as "float", though the standard does not actually
guarantee that it is larger.
(It's usually 8 bytes.)
double foo = 123.4;
long double
"long double" is actually part of the language!
(It's usually 16 bytes.)
long double foo = 123.4;
Prefixes and suffixes
On any constants of the floating-point types, you may add one of these suffixes:
- f or F: forces the constant to be float (instead of double).
- l or L: forces the constant to be long double (instead of double).
Pointers
A pointer is exactly what it sounds like: a variable whose contents point
to some memory address. Typically pointers point to the address of some
other variable
(wherever it got allocated in memory), but there are also pointers to
functions (wherever they got placed in memory).
int a = 0; // 'a' is a variable at some memory address consuming sizeof(int)
// bytes. Using 'a' means using the contents of that memory, which will be
// interpreted as an integer.
int *ap; // 'ap' is a variable at some memory address consuming sizeof(void*)
// bytes. Using 'ap' means using the contents of that memory, which will
// be interpreted as the memory address of some integer.
ap = &a; // the contents of 'ap' are set to the address of 'a', wherever
// it happens to be. The '&' operator gets the address of something.
*ap = 1; // the contents of 'ap' are interpreted as a memory address, and
// the contents of that address are set to '1'. At this point, the value
// of 'a' is now '1'!
ap = 1; // the contents of 'ap' are set to '1', which is not a valid thing
// to do because you really don't know what's going to be in a hardcoded
// memory address. This will (usually) generate a compile-time error.
ap = 0; // however, this is okay. '0' is a special value for pointers,
// and it means that the pointer is not pointing to anything. Attempting
// to dereference it and change the memory at '0' is a runtime error that
// results in a segmentation fault ("segfault"). Sometimes you'll see
// 'NULL' used for this purpose; NULL is just a #define to 0 (or to
// ((void*)0), depending on the platform).
You can declare and use pointers to any data type: char, short, int, float,
etc. This includes pointers to custom structs.
You can also have a "pointer to void". void is a way of saying something
is typeless (and for functions it's a way to say there is no returned value,
or has no parameters). void* is used as a general pointer type when we
can't specify what the type is.
Far and away the most common use of pointers in C is to implement pass-by-reference. Since
function arguments are copied on each invocation, modifying the local arguments
does not affect the originals; when you want to modify the original
arguments, you can pass the address of the argument instead of the argument
itself, and then the function can modify what the argument points to. An
example from Kernighan & Ritchie:
void swap(int *x, int *y) {
    int t = *x; // set t to whatever x is pointing to
    *x = *y;    // set whatever x is pointing to to whatever y is pointing to
    *y = t;     // set whatever y is pointing to to t
}
..
int myvar1 = 10;
int myvar2 = 42;
swap(&myvar1, &myvar2);
The second most common use of pointers is to implement multiple return
values. Since C functions can return only one thing, you could make one
of the function arguments a pointer, and the function could put an additional
return value there. This is very common in the C library.
// returns: whether we've hit end of file. '0' means no more data.
// 'c': set to the next character, if there was one.
int get_next_thing_from_file(char *c) {
..
}
..
char next_char;
while (get_next_thing_from_file(&next_char)) {
..
}
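A complete (if small) instance of the multiple-return-values pattern might look like the following. The function divide is a hypothetical example, not something from the C library: the ordinary return value reports success, and the two results come back through pointers.

```c
/* Returns 1 on success, or 0 if the divisor is zero (outputs untouched). */
int divide(int dividend, int divisor, int *quotient, int *remainder) {
    if (divisor == 0) {
        return 0;
    }
    *quotient = dividend / divisor;
    *remainder = dividend % divisor;
    return 1;
}
```

A caller declares two ints, passes their addresses (divide(17, 5, &q, &r)), and checks the return value before trusting q and r.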
Pointer arithmetic
The type of a pointer is very important because C lets you perform some
basic math with pointers. Adding to and subtracting from pointer types
operates on the size of the pointed-to data structure. Thus adding 1 to
a char* increments the numeric value by one, whereas adding 1 to
an int* increments the numeric value by sizeof(int).
Subtraction between pointers tells you the number of elements between
the two addresses. It does not return the number of addresses between
them, even though pointers are just numeric values!
Since there is basic arithmetic, you also have access to basic logic
operations on pointers, such as is-equal (==), is-not-equal (!=),
less-than (<), greater-than (>), etc.
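The element-versus-byte distinction can be sketched like this (all function names are made up for illustration). Subtracting two int* values counts ints; casting them to char* first counts bytes. The last function walks an array using pointer comparison and increment.

```c
#include <stddef.h>

/* Number of int elements between two pointers into the same array. */
ptrdiff_t elements_between(const int *first, const int *last) {
    return last - first;
}

/* Number of bytes between the same two pointers. */
ptrdiff_t bytes_between(const int *first, const int *last) {
    return (const char *)last - (const char *)first;
}

/* Sums an array by moving a pointer from the first element up to,
   but not including, one past the last element. */
int sum(const int *values, size_t count) {
    int total = 0;
    for (const int *p = values; p < values + count; ++p) {
        total += *p;
    }
    return total;
}
```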
Pointers to functions
Pointers to functions are syntactically horrible in C, due to the precedence
of the various pieces involved. Suppose you have an easy function that takes
a float parameter and returns an int. Its declaration and a pointer to it
would look like this:
// declare it:
int myfunc(float);
// make a pointer to it:
int (*myptr)(float) = myfunc;
// call it normally:
int res1 = myfunc(4.0);
// call it through the pointer:
int res2 = (*myptr)(4.0);
The K&R C book has a sample program that converts some obfuscated C declarations
into English. I repeat the two meanest ones here:
char (*(*x())[])();
-> x is a function, returning a pointer to an array (of unknown size) of
pointers to functions that return a char.
char (*(*x[3])())[5];
-> x is an array of 3 pointers to functions returning a pointer to an array
of 5 chars.
There's a reason why programmers drink.
Aggregate types
Arrays
Arrays are a block of variables grouped together. The variables are
separate, but they're accessed by group index instead of by an individual name.
int a[10]; // defines 'a' to be an array of 10 ints.
a[0] = 42;
a[9] = 29; // completely separate from a[0].
Arrays are implemented as contiguous elements in memory.
You can initialize an array at construction. The syntax uses braces in a
unique way not seen anywhere else in the language:
int primes[10] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };
Some nifty things about array initialization:
- You can omit the array size, and C will figure it out based on what's in
the initialization list.
- You don't have to initialize all elements -- if you have fewer
elements in the list than your declared array size, the remaining ones will
be initialized to 0. Bonus: this means you can initialize your whole array to
zero with a single-element initialization list! (A completely empty list, {},
works in C++ and C23, but older C standards require at least one element.)
int arr[10] = {0};
- Specifying more elements in the initialization list than will fit in the
size of the array is a compile-time error.
Since strings are arrays of characters, you can initialize them in the same
way. However, since no one wants to type out strings like that, C gives you
quotes to do the same thing. The following are equivalent:
char str1[4] = {'m', 'o', 'o', 0};
char str2[4] = "moo"; // note that C adds the terminating null
Multidimensional arrays
C does indeed support multidimensional arrays, but only because they can
be viewed as arrays of arrays.
int myarr[3][5] = {
{0, 1, 2, 3, 4},
{1, 2, 3, 4, 5},
{2, 3, 4, 5, 6}};
..
for (int i1 = 0; i1 < 3; ++i1) {
    for (int i2 = 0; i2 < 5; ++i2) {
        .. myarr[i1][i2] ..
    }
}
You can pass these around as function parameters, but you need to specify
at least n-1 of the dimension sizes so that the compiler can do the offset
math correctly. (The one you don't have to specify is the first one, because
it's the number of the biggest chunks, which isn't necessary for figuring
out addresses.)
void myfunc1(int myarr[3][5]) {
..
}
void myfunc2(int myarr[][5]) {
..
}
// you could also do this, for the Obfuscated C Contest:
void myfunc3(int (*myvar)[5]) {
..
}
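As a filled-in version of the myfunc2 style, here is a hypothetical function that accepts a table with any number of rows, as long as every row has exactly 5 columns. The column count has to be in the parameter type so the compiler can compute where each row starts.

```c
/* Sums every element of a table with 'rows' rows and 5 columns. */
int sum_table(int table[][5], int rows) {
    int total = 0;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < 5; ++c) {
            total += table[r][c];
        }
    }
    return total;
}
```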
Arrays vs Pointers
Arrays are visible via a pointer to their first element, so much of the time
arrays and pointers are interchangeable.
int arr[10]; // in most expressions, 'arr' is converted to the address of its first element
int *ptr;
ptr = arr; // ptr now points to the same place as 'arr' - the first array element.
// same way to get the first element:
arr[0];
*ptr;
// same way to get the second element:
arr[1];
*(ptr + 1);
// same way to get the address of the second element:
&arr[1];
ptr + 1;
However, arrays and pointers are not exactly equivalent. Arrays are not a primitive
type, so you cannot assign them, and you cannot increment/decrement them. Arrays
also cannot be null.
arr = ptr; // NOT okay, even though you can do "ptr = arr"
arr++; // NOT okay; use pointers if you want to do this
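Another place the difference shows up is sizeof. This sketch (with hypothetical function names) shows that sizeof on a real array gives the size of the whole array, but once the array has been passed to a function it has become a pointer, and sizeof only reports the size of that pointer.

```c
#include <stddef.h>

/* sizeof applied to an actual array: the size of all ten ints. */
size_t size_of_real_array(void) {
    int arr[10];
    return sizeof arr;
}

/* sizeof applied to a pointer parameter: just the size of a pointer,
   no matter how big the caller's array was. */
size_t size_seen_by_callee(int *ptr) {
    return sizeof ptr;
}
```

This is why functions that take arrays almost always take a separate length parameter too.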
struct
"struct" allows you to create a new type that bundles together any number of
variables into a single object. Here's an example of a "rectangle" struct
that consists of two 2-dimensional points:
// define what "rect" means:
struct rect {
int x1;
int y1;
int x2;
int y2;
};
// declare a variable of it:
struct rect myrect1;
// declare a variable and initialize its contents, in defined order:
struct rect myrect2 = {0, 0, 640, 480}; // ready to play Myst!
// print out its members:
printf("(%d, %d)-(%d, %d)\n", myrect2.x1, myrect2.y1, myrect2.x2, myrect2.y2);
structs may be nested, so our definition of rect could instead have included
two instantiations of a separate point struct.
structs are implemented as a single block of memory big enough to contain all
the members. (The members are necessarily in order, but due to alignment they
aren't necessarily contiguous.) They can usually be treated as a primitive
data type; sizeof works on them, you can assign them to each other (which
works by copying bits, just like primitives), you can pass them to and
from functions (which passes by value, just like primitives), and you could
allocate them either on the stack or on the heap.
Pointers to structs access members with an arrow (->) instead of a period
(.). This is because the precedence of the period is higher than the
dereference operator (*):
struct rect *foo = ...;
int a = (*foo).x1; // dereference foo, then look up its x1 member
int b = foo->x1; // same
int z = *foo.x1; // WRONG: looks up its x1 member, then tries to dereference it
Arrays of structs are treated like any other arrays. You can still use curly
braces to initialize, which is interesting because the contents of the curly
braces is then a repeating hodgepodge of whatever types are in the struct:
struct {
int a;
float b;
char c;
} myvar[3] = {
0, 1.3, 'a',
42, 14.8, 'j',
18, -0.3, '\0',
};
Structs may also contain pointers to themselves, for use in things like
linked lists and trees. Structs cannot contain actual instances of
themselves, because how would that work.
struct my_linked_list_node {
char *data;
struct my_linked_list_node *next_node;
};
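To show the self-referencing pointer doing some work, here is a minimal sketch of a singly linked list built on that node struct. The helper names (push_front, list_length, free_list) are my own, and the nodes only borrow the strings they point to; they do not copy or free them.

```c
#include <stdlib.h>

struct my_linked_list_node {
    char *data;
    struct my_linked_list_node *next_node;
};

/* Adds a node at the front. Returns the new head, or NULL if malloc
   fails (in which case the old list is left untouched). */
struct my_linked_list_node *push_front(struct my_linked_list_node *head,
                                       char *data) {
    struct my_linked_list_node *node = malloc(sizeof *node);
    if (node == NULL) {
        return NULL;
    }
    node->data = data;
    node->next_node = head;
    return node;
}

/* Counts nodes by following next_node until it hits the null pointer. */
int list_length(const struct my_linked_list_node *head) {
    int count = 0;
    for (; head != NULL; head = head->next_node) {
        ++count;
    }
    return count;
}

/* Frees every node (but not the strings they point to). */
void free_list(struct my_linked_list_node *head) {
    while (head != NULL) {
        struct my_linked_list_node *next = head->next_node;
        free(head);
        head = next;
    }
}
```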
union
"union" is quite strange. It's a bundle of different types, but only one of
them is active at a time. Consider this example:
union my_union {
    int my_int;
    float my_float;
    char* my_str;
};
union my_union foo;
When you create instance foo of my_union, the compiler will allocate just
enough space for the largest of the types. When you use foo.my_int, the
space is treated as an int; when you use foo.my_float, the space is
treated as a float; when you use foo.my_str, the space is treated as
a char*.
The typical application and example of union is compiler tokens. "23"
would be turned into an integer, "0.42" would be turned into a floating-point
number, and "asdf" would be treated as a string. However, if you made a
token struct with all possible types you'd need, you'd waste a ton of
memory on all those unused fields. union lets you say this space could be
any one of a number of different things.
HOWEVER, it's still up to you to make sure you know what the correct type
is -- neither the compiler at compile-time nor the code at run-time will
be able to tell you how foo
was last set.
You can kind of see how this is one of the precursors to polymorphism in
object-oriented programming.
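The usual way to keep track of the correct type yourself is a "tagged union": wrap the union in a struct alongside an enum that records which member was last set. The following is a sketch of the compiler-token idea under that assumption; every name in it is hypothetical.

```c
enum token_kind { TOKEN_INT, TOKEN_FLOAT, TOKEN_STR };

struct token {
    enum token_kind kind;  /* records which union member is active */
    union {
        int my_int;
        float my_float;
        char *my_str;
    } value;
};

struct token make_int_token(int v) {
    struct token t;
    t.kind = TOKEN_INT;
    t.value.my_int = v;
    return t;
}

struct token make_float_token(float v) {
    struct token t;
    t.kind = TOKEN_FLOAT;
    t.value.my_float = v;
    return t;
}

/* Reads only the member that 'kind' says is active.
   String tokens have no numeric value here, so they yield 0.0. */
double token_as_number(struct token t) {
    switch (t.kind) {
    case TOKEN_INT:   return t.value.my_int;
    case TOKEN_FLOAT: return t.value.my_float;
    default:          return 0.0;
    }
}
```

The compiler still won't stop you from reading the wrong member; the tag just gives your own code something to check.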
enum
"enum" is short for "enumerated type", whose name comes from giving us the
ability to spell out (enumerate) the domain of their possible values:
enum COLORS { RED, ORANGE, YELLOW, GREEN, BLUE, VIOLET };
enum COLORS foreground_color = YELLOW;
enum COLORS background_color = VIOLET;
if (foreground_color == YELLOW && clashes(foreground_color, background_color)) {
..
}
Variables of type enum
are actually just integers, and their values are
just constants, but the enum
mechanism lets you write much better
self-documenting code.
Another cool thing about enums: the integer values of the constants can be
whatever you want, but without specific overrides the first one is given a
value of zero, and each subsequent one is automatically set to "previous + 1".
// months don't start at zero:
enum MONTHS { JAN=1, FEB, MAR, APR, MAY, JUN,
JUL, AUG, SEP, OCT, NOV, DEC };
// use enum instead of #define for bit fields:
enum FLAGS { FIRST=1, SECOND=2, THIRD=4, FOURTH=8, FIFTH=16 };
// use enum to define aliases:
enum MEDIUM {
TV = 0,
TELEVISION = 0,
RADIO = 1,
SATELLITE = 1,
STREAMING = 2,
SPOTIFY = 2,
SOUNDCLOUD = 2 };
One downside to enums: you cannot use the same compile-time constant in more
than one enum, even if they have the same numeric value. For such awesomeness
you would have to use an updated language (ahem,
Anduin :).
typedef
typedef
allows you to create an alias for any other type, which is useful for
two things:
- centralizing a type you're using, so that you can switch the actual type
around by changing only a single line. This is what
size_t
is.
- simplifying complex declarations, especially function pointers.
typedef unsigned long size_t; // illustrative only: the library picks the real type per platform
typedef int (*funcptr)(int);
// but be somewhat careful of this:
int* p1, p2; // p1 is an int*, but p2 is just an int!
typedef int* intptr;
intptr p3, p4; // both p3 and p4 are int*!
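Here is a short sketch of the function-pointer case, since that's where typedef earns its keep. The names int_op, twice, negate, and apply_twice are all invented for this example.

```c
/* int_op: pointer to a function that takes an int and returns an int. */
typedef int (*int_op)(int);

int twice(int x) { return 2 * x; }
int negate(int x) { return -x; }

/* Applies 'op' to 'value', then applies it again to the result.
   Without the typedef the first parameter would be: int (*op)(int) */
int apply_twice(int_op op, int value) {
    return op(op(value));
}
```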
Variables
In C, variables must be declared, both as a variable and with their specific
type. Declaring variables is annoying only if you've never
tried to make a real program in a language that makes no effort to
help you find typos. Declaring as a specific type (as opposed to
duck-typing) is annoying only until
you realize it uncovers actual conceptual issues in an interface.
Identifiers
In C, identifiers (variables, function names, etc.) may contain any number
of letters, numbers, and
underscores, with the usual restriction that the first character cannot be
a number. Letters are case-sensitive.
Some systems and linkers have restrictions on the number of characters in an
identifier. (And in fact, due to historical deployment reasons, the original
C89 standard could guarantee uniqueness in only the first 6 characters of
externally visible names! And those could be case-insensitive! C99 raised the
guarantee to 31 characters.) This does not manifest as an error or warning on
those systems, but rather the generated binary contains truncated versions of
those identifiers. Aliasing is then a real problem, which you debug at the
binary level.
Program in C for great fun and relaxation!
Syntax
Variables may be simply declared, or they may also simultaneously be
initialized.
To declare a variable, you state its type and then its name:
int myvar;
To declare and initialize, you also include a value:
int myvar = 42;
If you do not initialize, the initial value of such variables is whatever
crap was left on the stack. (Which is more commonly called "garbage".)
You can declare more than one variable at once:
int myvar1 = 42, myvar2 = 13;
but be careful with pointer declarations, because the asterisk applies
to only one variable while the type (int) applies to all of them.
const
Most variables are, well, variable. However, you can declare a variable
to be uneditable with the const keyword. The compiler will then ensure that
nothing after its initial declaration and assignment will be able to change
its value. (Well, assuming you don't bend over backwards to subvert it.)
const int foo = 42;
By convention, constant identifiers are spelled with all capitals, much like
I just did not do in that example.
static
Most variables are automatic, which means they're scoped to their enclosing
curly braces. (In fact, you could explicitly declare all your variables with
auto if you really wanted worker's comp for carpal.) However, you can also
declare a variable as static. There are two contexts where you can use static:
- inside a function, it means that the variable is persistent between
invocations. It is initialized only once (before the program starts running,
not on every call), and retains its value between function calls. You can
almost think of it as a global variable with limited lexical scope.
- outside a function, it means that the variable will not be exported as a
visible symbol in the final binary. It will be initialized before main()
starts, and is considered a global variable (for that file). In this way,
you can make a variable's scope bigger than a function but less than global.
Because those two contexts really have no relation whatsoever, asking about
static makes for a great interview question.
Static variables are initialized to zero (unlike automatic variables). If you
want to initialize to a specific value, the value must be a constant
expression (also unlike automatic variables). This initialization is done
before main starts.
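Both uses of static fit in one short sketch (the names are hypothetical). The file-level variable is invisible to other source files, and the function-level variable keeps its value from one call to the next.

```c
/* Outside a function: global to this file only, and starts at zero. */
static int calls_seen;

/* Inside a function: 'id' is initialized once and then persists, so
   each call hands out the next number. */
int next_id(void) {
    static int id = 100;
    return id++;
}

/* Counts how many times it has been called, using the file-level static. */
int count_call(void) {
    return ++calls_seen;
}
```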
register
Declaring a variable as register (instead of auto or static) tells the
compiler that it should try to use processor registers instead of main memory.
However, register is only a hint (like inline), and compilers are not required
to actually do it. Note that if you declare a variable as register, whether
or not the compiler puts it in a register, you cannot take the address of it.
Bit-fields
For signed int and unsigned int members of a struct (or union), you can
declare that a member should be a specific number of bits instead of its usual
full size. The most common use of this is to pack boolean variables together
to save space while also making access easy.
The usual way to pack booleans is to use bitwise operations:
enum { IS_FOO=1, IS_BAR=2, IS_BAS=4, IS_BAT=8 };
unsigned int flags = 0;
// turn on the IS_BAS flag:
flags |= IS_BAS;
// turn off the IS_BAR flag:
flags &= ~IS_BAR;
// check the IS_BAT flag:
if (flags & IS_BAT) ..
Bit-fields let you split each flag into its own variable, which means you
don't have to do the bitwise math anymore:
// the following requests only 4 bits of space, but it's up to the compiler
// to determine exactly how to pack them. Bit-fields have to be members of
// a struct (or union):
struct {
    unsigned int is_foo:1;
    unsigned int is_bar:1;
    unsigned int is_bas:1;
    unsigned int is_bat:1;
} flags = {0, 0, 0, 0};
// turn on IS_BAS:
flags.is_bas = 1;
// turn off IS_BAR:
flags.is_bar = 0;
// check IS_BAT:
if (flags.is_bat) ..
You're not limited to declaring 1-bit chunks, that's just the common use for
booleans.
If you see an unnamed field declared with ":0", that's a way to force the
next bit-field to start on a new storage-unit (word) boundary.
You cannot take the address of variables declared with bit-widths, because
the granularity of addresses is whole bytes.
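Here is a minimal sketch of a bit-field wider than one bit. The struct name flags and the helper store_level are hypothetical, chosen only for this example:

```c
/* Bit-fields must be members of a struct (or union). */
struct flags {
    unsigned int is_foo : 1;
    unsigned int is_bar : 1;
    unsigned int level  : 3;   /* 3 bits: can hold 0..7 */
};

/* Store a value into the 3-bit field and report what was actually kept. */
unsigned int store_level(unsigned int value) {
    struct flags f = {0};
    f.level = value;   /* unsigned bit-fields wrap modulo 2^3 */
    return f.level;
}
```

Values that fit come back unchanged; a value such as 9 does not fit in 3 bits and comes back as 1.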
Type promotions and typecasting
It happens occasionally that you have a variable of type X but need it to
be of type Y. This is a problem because the entire point of declaring things
as X or Y is to catch when you try to put a square peg in a round hole, and
the compiler will usually squawk at you.
Usually. Some X-to-Y translations are perfectly harmless, such as when you have
an int
but want to call a function that takes a long
. Since there's really
nothing that can go wrong with upsizing an int
, the compiler will do that one
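for you automatically. (That's type promotion.) However, the reverse is not
as safe: if you have a double and pass it to a function expecting a float,
the compiler still converts it for you, but precision can be lost, and many
compilers will warn about it if asked to (e.g. gcc's -Wconversion).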
Automatic type promotion happens all over the place, and the specific
rules for it are actually pretty intense. The rules can be reasonably
well summarized with "whichever of the two types is smaller will be promoted
to the type of the larger." int
is promoted to long
, which is promoted
to float
, etc.
But sometimes you want to translate between types that aren't directly
mathematically related, such as converting a pointer to an integer so that
you can print it out. In those cases, you have to tell the compiler that
you really do know what you're doing and to go ahead with the conversion.
This is a typecast. You can force a typecast by prefixing the expression
with the parenthesized type, e.g.:
extern void myfunc(float);
...
double myvar = ...;
myfunc( (float)myvar );
Note: C++ has a new mechanism for doing typecasts, though this way is still
supported.
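Two common places where a cast earns its keep are integer division and printing pointers. The function names below are invented for this sketch:

```c
#include <stdio.h>

/* Without a cast, total / count is integer division: 7 / 2 yields 3. */
double average_wrong(int total, int count) {
    return total / count;
}

/* Casting one operand to double forces floating-point division. */
double average_right(int total, int count) {
    return (double)total / count;
}

/* "%p" expects a void*, so the pointer is cast explicitly. */
void print_address(int *p) {
    printf("address: %p\n", (void *)p);
}
```

With 7 and 2 as inputs, the first function returns 3.0 and the second returns 3.5.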
Scope
Variables come into scope when they're defined, and for the most part go out of
scope when their enclosing block ends. "Enclosing block" usually means the
pair of braces, but it also applies to end-of-file.
Variables are visible to inner sub-scopes, but not to outer parent scopes.
Variables declared in inner scopes will hide ("shadow") variables in outer
scopes. Once hidden, there is no way to get to them.
int myvar0; // visible to the rest of this file, and the rest of the world via 'extern'
static int myvar1; // visible to the rest of this file, but not the rest of the world
void myfunc(int myparam) { // myparam is visible until the end of the function
    // we can still see myvar0 and myvar1 in this function scope
    int myvar2; // visible to everything in this function
    if (...) {
        // we can see myvar0, myvar1, and myvar2 in this inner scope
        int myvar3; // visible until the corresponding close-brace
    }
    // myvar3 is no longer visible, but we have all the others.
}
// out here, only myvar0 and myvar1 are still visible. myvar2 and myparam both
// just went out of scope.
These scoping rules for variables also apply to functions, structs, and
everything else that's an identifier.
Automatic variables in recursive functions are specific to their invocation of
the function, so are not shared. Use static
if you want a variable to be
common across all invocations of a function.
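Shadowing and the automatic-versus-static distinction in recursion can be sketched as follows. The names myvar, shadow_demo, and countdown are hypothetical:

```c
int myvar = 1;   /* file scope */

int shadow_demo(void) {
    int myvar = 2;          /* hides the file-scope myvar */
    {
        int myvar = 3;      /* hides the function-level myvar */
        (void)myvar;
    }
    return myvar;           /* the innermost one is gone; this is 2 */
}

/* Each invocation gets its own automatic 'n'; the static 'calls'
   is shared across every invocation. */
int countdown(int n) {
    static int calls = 0;
    ++calls;
    if (n > 0) {
        countdown(n - 1);
    }
    return calls;
}
```

A first call of countdown(3) runs four invocations (n = 3, 2, 1, 0), so the shared counter reads 4 when the outermost call returns.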
Characters
All of the character operations work only for ASCII, not for Unicode. They
are declared in ctype.h.
isalpha(int)
Returns nonzero if the given character is in [A-Za-z].
isupper(int)
Returns nonzero if the given character is in [A-Z].
islower(int)
Returns nonzero if the given character is in [a-z].
isdigit(int)
Returns nonzero if the given character is in [0-9]. Note that period is not
accepted, so be careful when trying to decode floating-points.
isalnum(int)
Returns nonzero if the given character is in [A-Za-z0-9].
isspace(int)
Returns nonzero if the given character is in [\t\n\v\f\r ].
toupper(int)
If the given character is a lowercase letter, returns the uppercase version
of it; otherwise, returns the input unchanged.
tolower(int)
If the given character is an uppercase letter, returns the lowercase version
of it; otherwise, returns the input unchanged.
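As a small sketch of these functions working together, here is an in-place upper-casing helper. The name upcase is an assumption of this example, not a library function:

```c
#include <ctype.h>
#include <stddef.h>

/* Upper-case a string in place; returns how many characters changed. */
size_t upcase(char *s) {
    size_t changed = 0;
    for (; *s; ++s) {
        /* the ctype functions expect a value in the range of
           unsigned char (or EOF), hence the cast */
        unsigned char c = (unsigned char)*s;
        if (islower(c)) {
            *s = (char)toupper(c);
            ++changed;
        }
    }
    return changed;
}
```

Digits, spaces, and letters that are already uppercase pass through untouched.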
Strings
C only sort of supports the idea of strings. In C, strings are
conventionally an array of char's followed by a zero. (A zero value, not
a zero character.)
You can specify a string constant in source code by surrounding it with
double-quotes. (Single-quotes are not the same thing.) As a convenience,
when the compiler
sees double-quoted strings, it converts them into char arrays and adds the
terminating 0 for you.
And, as you're now
aware, arrays and pointers are nearly interchangeable, so you will see
strings as either "char*" or "char[]". However, when you initialize
strings with direct text, only "char[]"s can be edited; "char*"s are
essentially constant. Also, remember that arrays aren't primitives and
can't be assigned:
char str_arr[] = "my string";
char *str_ptr = "my string";
..
str_arr[1] = 'a'; // ok
str_ptr[1] = 'a'; // ILLEGAL
..
str_ptr = str_arr; // ok
str_arr = str_ptr; // ILLEGAL
Oddly, C will concatenate adjacent string constants for you, so the following
are all equivalent:
char *c1 = "foobar";
char *c2 = "foo" "bar";
char *c3 =
"foo"
"bar";
All of the functions for dealing with strings are part of the C library, not
part of the language itself.
A note on the string functions below. Almost all of them have length-limited
equivalents (such as strncpy instead of just strcpy). They are much
safer to use because they stop after N bytes, which prevents them from
overwriting memory they don't own. One caveat: strncpy does not add the
terminating 0 if src is N bytes or longer, so you have to add it yourself.
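One way to use strncpy safely is to wrap it so the result is always terminated. The helper name copy_bounded is made up for this sketch:

```c
#include <string.h>

/* Copy src into dest (which has room for dest_size bytes), always
   leaving a terminated string. Returns 1 if src fit, 0 if truncated. */
int copy_bounded(char *dest, size_t dest_size, const char *src) {
    if (dest_size == 0) {
        return 0;
    }
    strncpy(dest, src, dest_size - 1);
    dest[dest_size - 1] = '\0';   /* strncpy does not terminate on truncation */
    return strlen(src) < dest_size;
}
```

With a 4-byte buffer, "hi" fits whole, while "hello" is cut down to "hel" and the function reports the truncation.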
strlen(char*)
Returns the length of given string, not counting the terminating 0.
strcpy(char *dest, char *src)
Copies the given string (and the terminating zero) into dest
. Does not
allocate memory for dest
; it is assumed you already did this, and dest
is big enough to hold it.
strcmp(char *str1, char *str2)
Compares str1
to str2
and returns one of three values:
- 0: the two strings are equal
- less than 0:
str1
is alphabetically less than str2
- greater than 0:
str1
is alphabetically greater than str2
strcat(char *dest, char *src)
Appends src
to the end of dest
and adds a terminating zero.
strchr(char *str, char ch)
Finds the first occurrence of ch
in str
. Returns the location as
a char*
, or zero if the character was not found.
strstr(char *str, char *substr)
Finds the first occurrence of substr
in str
. Returns the location
as a char*
, or zero if the string was not found.
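Because strchr and strstr return pointers into the original string, subtracting the start of the string gives the offset of the match. The wrapper names below are hypothetical:

```c
#include <string.h>

/* Offset of the first ch in str, or -1 if it is not there. */
long index_of_char(const char *str, char ch) {
    const char *hit = strchr(str, ch);
    return hit ? (long)(hit - str) : -1;
}

/* Offset of the first occurrence of sub in str, or -1 if not found. */
long index_of_substr(const char *str, const char *sub) {
    const char *hit = strstr(str, sub);
    return hit ? (long)(hit - str) : -1;
}
```

Always check for the zero (NULL) return before using the pointer; dereferencing it is a crash waiting to happen.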
Numbers
All numeric types (integers and floating-points) support these common operators:
- +: addition
- -: subtraction
- *: multiplication
- /: division
Integer (non-floating-point) numbers also support:
- %: modulus
- ~: bitwise NOT
- &: bitwise AND
- |: bitwise OR
- ^: bitwise XOR
- <<: bitwise shift-left
- >>: bitwise shift-right
++ and --
C popularized the "++" and "--" operators for numeric types. They are a
shortcut way to increment or decrement a variable's value by one.
There are two ways to specify them:
- ++myvar (pre-increment): increments
myvar
and returns its new value.
- myvar++ (post-increment): increments
myvar
and returns its old value.
If you don't use the value of the expression, then they both just increment
the variable and are effectively the same.
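The difference only shows up when the value of the expression is used. A tiny sketch, with function names invented for the purpose:

```c
/* Pre-increment: bump the variable, then yield the new value. */
int pre_increment(int *x) {
    return ++(*x);
}

/* Post-increment: yield the old value, then bump the variable. */
int post_increment(int *x) {
    return (*x)++;
}
```

Starting from 5, the first returns 6 and the second returns 5; either way the variable ends up as 6.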
assignment operators
Assignment operators are an awesome shortcut. Instead of having to type
out this:
mylongvariablename = mylongvariablename + 3;
you can type:
mylongvariablename += 3;
There are lots of assignment operators: +=, -=, *=, /=, %=, <<=, >>=, &=,
|=, and ^=.
functions in math.h
- sin: sine, in radians
- cos: cosine, in radians
- atan2: arctangent, in radians
- log: base-e logarithm
- log10: base-10 logarithm
- exp: exponentiation of e
- pow: exponentiation of any arbitrary number
- sqrt: square root
- fabs: absolute value
Control
A quick preamble on truth-value expressions. C does not have a native "boolean"
type; instead it uses integers and considers a value of zero to be false while
all non-zero values are considered true. Since C is a system language,
everything is ultimately interpreted as bits, so everything in a boolean
context is interpreted by its bits. (I'm emphasizing this because strings
don't work with "==": it compares the two addresses, not the characters.
Use strcmp instead.)
All numeric types support the following comparison operators:
- <: less-than
- <=: less-than-or-equal
- >: greater-than
- >=: greater-than-or-equal
- ==: equal
- !=: not-equal
Truth-value expressions can be combined using the following logical operators:
- &&: logical AND
- ||: logical OR
- !: logical NOT
A very important note about && and ||: for as long as C has
been around, those operators have been required to evaluate left to right
and to stop as soon as the final result can be determined. In particular,
this means the right part might not run. For example, "false && anything"
is going to be false, regardless of anything
; "true || anything" is
going to be true, regardless of anything
. This is sometimes called
"short-circuiting", and has been used extensively
by programmers to avoid typing out explicit logic. My personal favorite use
of this is to control debugging statements:
// this:
DEBUG && printf("asdf\n");
// ..is a shorter way of writing:
if (DEBUG) {
printf("asdf\n");
}
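Another very common use of short-circuiting is guarding a pointer before looking through it. The function name has_text is an assumption of this sketch:

```c
#include <stddef.h>

/* Returns 1 if s is non-NULL and non-empty, else 0.
   The right-hand side only runs when s is known to be non-NULL,
   so s[0] is never read through a null pointer. */
int has_text(const char *s) {
    return s != NULL && s[0] != '\0';
}
```

Swapping the two operands would break the guarantee, since the dereference would then happen first.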
if-else
This is the most basic control element in C:
if (expr) {
..stuff..
}
..next stuff..
If expr is true, then stuff is run. If expr is false, then stuff is
not run and the program continues with next stuff.
You can chain if
s together with else
:
if (expr1) {
..stuff1..
}
else if (expr2) {
..stuff2..
}
..next stuff..
Just to be overly clear, this will first evaluate expr1. If true, then
stuff1 is run, followed by next stuff. If expr1 was false, then
expr2 is evaluated. If true, then stuff2 is run, followed by
next stuff. If expr2 was also false, then only next stuff is run.
Finally, you can also specify a final else
without an if
, which will
always run whenever all the other conditions are false:
if (expr) {
..stuff1..
}
else {
..stuff2..
}
..next stuff..
If expr is true, then stuff1 is run, followed by next stuff; otherwise,
stuff2 is run, followed by next stuff.
?: (the ternary operator)
The ternary operator is a compact if-else expression.
// the long way:
if (something) {
myvar = 1234;
}
else {
myvar = 5678;
}
// the ternary way:
myvar = something ? 1234 : 5678;
switch
switch
is like a cascaded if-else
except that it is much more elegantly
compact.
switch(something) {
case 4: ... break;
case 12: ... break;
default: ... break;
}
The compactness of switch comes at a cost:
- switch determines where to start executing by looking for the case
whose expression matches an equality check with something. That means
you cannot do range checks (such as less-than or greater-than), and you
cannot use them for strings.
- while the something can be any integer expression, each of the
case values must be constants so that they can be determined at
compile-time.
- switch is often implemented (quite sneakily!) as a jump-table. When it is,
the amount of memory it consumes is proportional to the range of case
values. Caveat programmer.
The default
clause is optional.
You may chain multiple case
s together. This is both good and bad. It's
good because it makes switch
even more compact (without sacrificing readability).
It's bad because it means you have to explicitly break
when you don't want
that behavior, which is really easy to forget.
switch(myvar) {
    case 0:
    case 2:
    case 4:
        printf("low-value even!\n");
        break;
    case 1:
    case 3:
    case 5:
        printf("low-value odd!\n");
        break;
    default:
        printf("not a low value\n");
        break;
}
(Strictly speaking, the break
on the last case
isn't necessary, but I agree
with Kernighan and Ritchie that it's a good defensive programming practice in
case of later code shuffling.)
while
while
is the most basic loop.
while (condition) {
..
}
condition is evaluated. If it is true, then the body of the loop is executed, and
then condition is evaluated again. If it's still true, the body is executed again,
and so forth.
Inside a loop, you have access to two additional loop-control statements:
break
transfers execution to the end of the loop, as if it had exited
normally with a false condition. You'd use this to stop processing
early, instead of using logic to skip the rest of the loop body until the
next normal evaluation of condition. You can use break
anywhere in
a loop body.
continue
transfers execution to the start of the loop, as if the body
was all done. You'd use this to skip the rest of the loop body but to
continue looping. The condition is next evaluated, and life goes on
as usual. You can use continue
anywhere in a loop body.
Here's an example showing when we'd use these:
// echo lines of input until we see one that starts with an "m":
int has_m = 0;
char *line;
while ((line = get_next_line())) {
    // if the line starts with "#" it's a comment:
    if (line[0] == '#') {
        continue; // forget him, let's go look at the next one
    }
    // if the line starts with "m" then we're done:
    if (line[0] == 'm') {
        has_m = 1;
        break; // no need to search any more, the answer won't change
    }
    // otherwise print it out (never pass input as the format string):
    printf("%s", line);
}
do-while
do
is almost the same as while
, except that it checks the condition at the
end of the loop instead of at the beginning. This means the body will always
execute at least once.
do {
..
} while (condition);
I've seen do-while most commonly used for input operations - query the user for
some input, check to see if it's okay, and then re-query if it's not.
char *ans; // declared outside the loop so the condition can see it
do {
    printf("Tell me what I need to know!\n");
    ans = get_answer();
} while (answer_does_not_please_me(ans));
You may use both break
and continue
in do-while
loops. break
jumps
you completely out of the loop body; continue
jumps to the condition part,
which would then go back to the loop body.
for
for
is basically just while
but with conveniently built-in initialization and increment code.
for (init-expr; condition; incr-expr) {
..
}
For example:
for (x=0; // this is run only once, before the loop starts
x < 10; // this is the condition of the while
x += 1) { // this is done after each execution of the loop body
..
}
At this point I get to introduce you to the comma operator, which is
seriously just a comma. It is used to stitch multiple expressions into
a single expression, which is useful for putting multiple things into
the for
control code:
for (x=0,y=0;
++loop_count,x<10; // the rightmost returns the expression's overall 'value'
x+=1,y+=1) {
..
}
I'm telling you about the comma operator because you'll see it, not because
it's necessarily a great idea.
You may use both break
and continue
in for
loops. break
jumps you
completely out of the loop; continue
jumps to the incr-expr and then to
the condition, and then continues on as usual.
goto
The goto
statement is another of C's warts, but only because software has
moved so far away from machine code since C was developed. goto
allows you
to transfer execution to some other point in the function that's marked with
a label (a name followed by colon, which you can put before any statement.)
Kernighan and Ritchie are of the opinion that goto
should only ever be
used to break out of nested loops (since break
can only break out of one
at a time). I did enough QBASIC programming to agree.
for (...) {
    for (...) {
        ...
        if (..) goto done;
    }
}
..
done: printf("done!\n");
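A runnable version of the same pattern, searching a small grid, might look like this. The function find_in_grid and its fixed 3x3 size are assumptions made for the sketch:

```c
/* Look for target in a 3x3 grid. On success, stores the position
   through row and col and returns 1; otherwise returns 0. */
int find_in_grid(int grid[3][3], int target, int *row, int *col) {
    int found = 0;
    int r, c;
    for (r = 0; r < 3; ++r) {
        for (c = 0; c < 3; ++c) {
            if (grid[r][c] == target) {
                found = 1;
                goto done;   /* leaves both loops at once */
            }
        }
    }
done:
    if (found) {
        *row = r;
        *col = c;
    }
    return found;
}
```

A plain break at that point would only leave the inner loop, and the outer loop would keep scanning the remaining rows.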
Functions
C allows you to bundle code into lexically-scoped functions that you can
invoke from anywhere else with arbitrary arguments. Functions can take
any number of arguments (including none), and may return up to one value.
They support recursion.
Inside a function, you can do whatever arbitrary code you want. (Well,
except for defining a sub-function.)
At any point in a function, you can use the return
statement to both exit
the function and set its returning value.
// a function with no arguments, returning nothing:
void myfunc1(void) {
..
}
// a function with an int argument, returning a long:
long myfunc2(int myarg) {
..
}
// a function with several arguments of different types, returning an int:
int myfunc3(int myarg1, int myarg2, char *myarg3) {
..
}
If you want to use a function that's defined either later in the file or defined
in a completely different file, you'll need to declare the function before you
can call it. (This restriction is so the compiler can line up your arguments to make sure
they're the right type, in the right order, etc.) You declare a function by
copying its signature and replacing the entire body with a semicolon. You can
also omit the variable names of the arguments, though you still need their
types:
void myfunc1(void);
long myfunc2(int);
int myfunc3(int, int, char*);
Note that C does not allow you to declare multiple functions with the same
name, even if they have different signatures. That's one of the
major upgrades in C++.
C does not allow you to define nested functions.
Pass-by-value
In C, arguments to functions are passed by value. That means when you call
the myfunc2
function above, the integer value you provide as an argument
is copied into a new integer, which is the one that myfunc2
will use.
myfunc2
is free to change myarg
all it likes, because it's a new variable
local to the function.
void bad_increment(int val) {
++val;
printf("new value: %d\n", val); // prints out 5
return;
}
int myval = 4;
bad_increment(myval);
printf("final value: %d\n", myval); // prints out 4
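If the function really does need to change the caller's variable, the usual workaround is to pass a pointer to it. The pointer itself is still passed by value, but both copies point at the same int. The name good_increment is made up to mirror the example above:

```c
/* Takes the address of the caller's variable and increments
   the int that lives there. */
void good_increment(int *val) {
    ++(*val);
}
```

Called as good_increment(&myval) with myval at 4, the caller's myval becomes 5.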
return
The aforementioned return
statement establishes the returned value of the
function and resumes program execution immediately after the function call.
The argument to return
is just any ol' arbitrary expression. For functions
that don't return anything, you can leave the expression out completely.
If a function does have a return type and you don't specify a return, it
is said to "fall off the end". The program will still return to the caller
(as opposed to corrupting the stack), but the value will be garbage, and
using it is undefined behavior.
varargs
C supports variable-length function arguments, which lets you pass any
number of things of arbitrary mixed types to a function without running
afoul of the compiler.
This is best explained with an example:
#include <stdarg.h>
void myfunc(int req_arg, ...) {
    // we'll always have 'req_arg', but we'll have other things after it as well:
    va_list ap;
    va_start(ap, req_arg); // 'ap' points to first thing after 'req_arg'
    while (...still have args to handle..) {
        if (arg-is-an-int)
            int i = va_arg(ap, int);
        else if (arg-is-a-float)
            // floats are promoted to double when passed through "..."
            double f = va_arg(ap, double);
        else if (arg-is-a-string)
            char *s = va_arg(ap, char*);
    }
    va_end(ap);
}
The example from K&R is basically what printf
does, so it gets the
number of args to handle by looking through the fmt
arg for "%"
constructs. The above example is complete crap, meant for you to just get the
idea.
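For something that actually compiles, here is a minimal sketch where the first argument says how many ints follow. The name sum_ints and the count-first convention are choices made for this example, not anything the language requires:

```c
#include <stdarg.h>

/* Sum 'count' int arguments. The callee has no way to discover how
   many arguments were passed, so the caller must say so. */
int sum_ints(int count, ...) {
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; ++i) {
        total += va_arg(ap, int);
    }
    va_end(ap);
    return total;
}
```

sum_ints(3, 1, 2, 3) adds up to 6. Lying about the count, or passing something other than ints, is undefined behavior that the compiler will not catch.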
Arcana
Functions were used long before they were standardized, so you may see a few
odd things. I'm explaining them so that you'll know what they are, not so
that you'll use them.
First, if you don't specify a return type for the function, it assumes
int
. One would think void
, but no.
Next, if you don't specify any arguments, it turns off the compiler argument checking. This
is for backwards compatibility with very old code that predates the ability
to declare arguments. If you want to declare a function that has no arguments,
you should declare the argument list as simply void
.
Next, the initial (pre-ANSI) version of C split arguments and their types:
void myfunc(arg1, arg2)
int arg1;
char *arg2;
{
..
}
Memory management
When you need to get some memory for a new variable, there are two mechanisms
for how you can get it: automatically, or dynamically.
"Automatic" variables are allocated via a global stack. When you declare
int x;
the program pushes sizeof(int)
bytes onto the stack, and x
becomes a reference to those bytes. x is not a pointer; using x gets
you the actual value in those bytes, but you can take the address of x
and see where it is in memory. When x goes out of scope, the program
pops those bytes back off the stack. All of this stack and scope manipulation
is done behind the scenes for you, which is why these are called automatic
variables.
int a;
a = 42;
Dynamic memory is allocated via a global pool. When you declare int *x
the
program pushes the size of a pointer (sizeof(int*)
) onto the stack, and x
becomes a reference to those bytes. However, the contents of those bytes are
an address (the address of an int!), so you need to point x to
sizeof(int)
bytes so that you can use it as an int
. You do this manually
by calling malloc
to request the memory. It finds enough space from
whatever's left in the pool and returns that address. When you're done with
it you call free
to return the memory to the pool.
int *a = malloc(sizeof(int));
*a = 42;
free(a);
Dynamic memory is useful because it outlives lexical scoping -- if you have a
function that returns something that needs to be allocated, then dynamic
memory is your only choice since automatic memory goes out of scope as soon as
the function exits.
Dynamic memory is also a pain in the butt because it's a manual process. If
you forget to free
when you're done, you have a memory leak; if you call
free
on the same address more than once, it's usually bad; worse, free
does not reset pointers to zero, so none of the N pointers to your int know
that they're now pointing to garbage. Worst, it's not always clear whether you
should free
a pointer or not:
// declare some external function to get some name:
char *get_name();
// call it:
char *myname = get_name();
// At this point, is 'myname' pointing to the same thing that get_name stores internally,
// or did get_name make a copy for me?
// - If the former, I cannot free it because get_name is still using it;
// - If the latter, I must free it to avoid a memory leak.
// Oh bother.
A few more functions you might want to know about:
- calloc: like malloc but initializes memory to 0.
- realloc: takes an existing chunk of memory and resizes it (larger or
smaller), returning the pointer to the resized area, which may have moved.
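The following sketch shows malloc, realloc, and free working together, including the failure checks that are easy to skip. The function make_sequence and its doubling growth policy are assumptions of this example:

```c
#include <stdlib.h>

/* Build a heap array holding 0..n-1, growing it with realloc as needed.
   The caller owns the result and must free it. Returns NULL on failure. */
int *make_sequence(size_t n) {
    size_t cap = 2;
    int *data = malloc(cap * sizeof *data);
    if (!data) {
        return NULL;
    }
    for (size_t i = 0; i < n; ++i) {
        if (i == cap) {
            cap *= 2;
            /* keep the old pointer until realloc is known to have worked */
            int *bigger = realloc(data, cap * sizeof *data);
            if (!bigger) {
                free(data);
                return NULL;
            }
            data = bigger;
        }
        data[i] = (int)i;
    }
    return data;
}
```

Assigning realloc's result straight back to data would leak the original block whenever realloc fails, which is why the temporary is there.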
System
argv/argc: command-line arguments
The command-line arguments are passed to a C program through the parameters to
the main
function. argc
is the number of parameters, and argv
is an array
of char*
s. argc
is always at least 1 because the first argument (argv[0]
)
is the name of the program executed.
Power tip on argv[0]: usually argv[0] is what the user typed, not necessarily
the actual file being run. This means you can make multiple symlinks to a
program, and argv[0] tells you which one the user ran to invoke the program.
A nifty way to create wrappers!
int main(int argc, char *argv[]) {
printf("You passed in %i arguments:\n", argc);
for(int i = 0; i < argc; ++i) {
printf(" '%s'\n", argv[i]);
}
}
system(char*)
system interprets its argument as a shell command and executes it through
the shell (typically /bin/sh). Execution of the program is suspended until
the sub-program finishes; use fork et al for concurrent execution.
system's return value contains 8 bits with the process's errcode, 7 bits
with the signal that killed it, and 1 bit to say if core was dumped.
Consult your local man page for specific bit arrangements.
rand()
Returns a random integer in the range of 0 to RAND_MAX.
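A typical use is squeezing that range down to something smaller. The helper roll_die below is hypothetical, and the modulo trick it uses is slightly biased, which is fine for a sketch but not for anything serious:

```c
#include <stdlib.h>

/* Roll a six-sided die: a value from 1 to 6. */
int roll_die(void) {
    return rand() % 6 + 1;
}
```

Without a call to srand first, rand produces the same sequence on every run; seeding once at startup with something like srand(time(NULL)) is the usual remedy.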
stat()
stat
looks up a ton of information about the given file and returns it
in a huge struct. It's probably best to ask your local man page for details,
but here's what mine says:
struct stat { /* when _DARWIN_FEATURE_64_BIT_INODE is NOT defined */
dev_t st_dev; /* device inode resides on */
ino_t st_ino; /* inode's number */
mode_t st_mode; /* inode protection mode */
nlink_t st_nlink; /* number of hard links to the file */
uid_t st_uid; /* user-id of owner */
gid_t st_gid; /* group-id of owner */
dev_t st_rdev; /* device type, for special file inode */
struct timespec st_atimespec; /* time of last access */
struct timespec st_mtimespec; /* time of last data modification */
struct timespec st_ctimespec; /* time of last file status change */
off_t st_size; /* file size, in bytes */
quad_t st_blocks; /* blocks allocated for file */
u_long st_blksize;/* optimal file sys I/O ops blocksize */
u_long st_flags; /* user defined flags for file */
u_long st_gen; /* file generation number */
};
fstat
is similar except it uses a file descriptor (not a filehandle!)
instead of a path string.
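Most of the time you only want one or two fields out of the struct. A minimal sketch, assuming a POSIX system, with the helper name file_size invented here:

```c
#include <sys/stat.h>

/* Returns the size of the file at path in bytes, or -1 if stat fails
   (for example, because the file does not exist). */
long file_size(const char *path) {
    struct stat sb;
    if (stat(path, &sb) != 0) {
        return -1;
    }
    return (long)sb.st_size;
}
```

stat returns 0 on success and -1 on failure, filling in the struct through the pointer you hand it.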
I/O
stdin/stdout/stderr
getchar()
getchar
reads one character of stdin, or EOF
if there's nothing left.
putchar(int)
putchar
writes one character to stdout. It returns EOF
if there was a problem.
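gets(char *)
Reads in the next line of input from stdin, strips off the trailing newline,
and stores it in the given buffer. Note that gets takes no max length, so it
can overrun the buffer; it was removed from the language in C11. Use fgets
(described below) with stdin instead, which does take a max number of
characters (but keeps the newline).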
puts(char*)
Adds a newline to the given string and writes it directly to stdout
.
printf(char *fmt, ...)
printf
writes an entire string to stdout. fmt
is a generic string that
contains any number of conversion specifications that say how to handle the
remaining arguments to printf
. (There must be the same number of
conversion specifications in fmt
as there are additional arguments to
printf
or else you'll get yelled at.) These conversion specifications have
the following components:
- "%". They all start with a percent sign. Use "%%" if you want to
print an actual percent sign.
- "-" (optional). Makes the field left-aligned instead of right-aligned.
- a number (optional) specifying the minimum width of the field, in characters.
- "." and a number (optional) specifying different things for different types:
- for strings, the max number of characters
- for floats, the number of digits after the decimal point
- for ints, the min number of digits
- a conversion character specifying what the type of this argument is:
- "c": a char
- "s": a char* string
- "d" or "i": a signed int
- "u": an unsigned int
- "o": an int in base-8
- "x" or "X": an int in base-16
- "hd" or "hi" or "hu" or "ho" or "hx" or "hX": a short
- "ld" or "li" or "lu" or "lo" or "lx" or "lX": a long
- "f": a double
- "e" or "E": a double in exponential notation
- "g" or "G": a double that could be treated as either "f" or "e"
depending on which would display better.
- "p": a void*, or really any pointer
// print a somewhat unsafe string with no actual modifiers:
printf("hello, world!\n");
// print a safer version:
printf("%s\n", "hello, world!");
// tell me what the integer is:
int foo = ...
printf("foo = %i\n", foo);
printf
returns the number of characters written. You could check that for
errors, if you're insane.
sprintf(char *str, char *fmt, ...)
sprintf
is basically printf
except it writes to a string instead of
to stdout.
char my_str[256]; // not a great idea to hardcode this number, but hey
int foo=..
sprintf(my_str, "foo= %i\n", foo);
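Since sprintf has no idea how big the destination is, the length-limited snprintf (standard since C99) is usually the better choice. In this sketch the wrapper name format_foo is an assumption:

```c
#include <stdio.h>

/* Format foo into buf, which has room for size bytes.
   Returns 1 if the whole text fit, 0 if it was truncated or failed. */
int format_foo(char *buf, size_t size, int foo) {
    /* snprintf returns the length the full text would have had */
    int needed = snprintf(buf, size, "foo= %i", foo);
    return needed >= 0 && (size_t)needed < size;
}
```

snprintf always terminates the buffer (when size is nonzero), so a too-small buffer ends up holding a shortened but valid string.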
scanf(char *fmt, ...)
scanf
reads a formatted string from stdin. The formatting looks the
same as printf
's format string, and the values read in are stored into the
pointers you pass to scanf
.
int myint1, myint2;
char mystr[256]; // FYI: bad to hardcode
int res = scanf("%d %d %255s", &myint1, &myint2, mystr); // 255 leaves room for the 0
if (res == EOF); // out of input to read
else if (res < 3); // ERROR: didn't get all three items!
sscanf(char *str, char *fmt, ...)
sscanf
is like scanf
except it reads from a string instead of from stdin.
char *existing_str = "12 foo";
int myint;
char mystr[256]; // FYI: still bad to hardcode
int res = sscanf(existing_str, "%d %255s", &myint, mystr); // 255 leaves room for the 0
..
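Checking the return value is what makes sscanf usable as a small parser. In this sketch, the helper parse_pair and the 32-byte word buffer are assumptions:

```c
#include <stdio.h>

/* Parse text of the form "<int> <word>". word must have room for
   32 bytes. Returns 1 if both items were read, else 0. */
int parse_pair(const char *text, int *num, char word[32]) {
    /* %31s limits the word to 31 characters plus the terminating 0 */
    return sscanf(text, "%d %31s", num, word) == 2;
}
```

sscanf returns how many items it managed to convert, so anything short of 2 here means the input did not match.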
files (with file handles)
There are three global FILE* variables available to you in C: stdin
,
stdout
, and stderr
.
fopen(char* name, char* mode)
fopen
tries to open the file at the path contained in name
. mode
is
a string (really!) that indicates reading ("r"), writing ("w"), or appending
("a").
fopen
returns a FILE*
. FILE
is a struct holding lots of info you probably
don't want to know about. The important part is that it's not null, so you can
pass it around to file-manipulating functions.
FILE *fh = fopen("/path/to/some/file", "r");
if (!fh) {
// error!
}
getc(FILE*)
Returns the next character from the given FILE
stream.
putc(int c, FILE*)
Writes the given character to the given FILE
stream. Like putchar
, it
returns the given character, or EOF if there was an error.
fscanf(FILE*, char *fmt, ...)
fscanf
is the same as scanf
except that it reads from the given filehandle
instead of from stdin.
fprintf(FILE*, char *fmt, ...)
fprintf
is the same as printf
except that it writes to the given
filehandle instead of to stdout.
fclose(FILE*)
Closes the given filehandle, which just tells the system you're done with it.
This doesn't mean much for reading, but for writing, this is the point where
the buffer is flushed and errors occur when disks are full.
if (fclose(fh)) {
// error!
}
fclose
is called automatically when the program exits, but you really
shouldn't be sloppy about closing filehandles when you're done with them -
there's usually a limit to the number of files a process can have open
at a time, so not cleaning up may make future fopen
s fail.
ferror(FILE*)
ferror
returns nonzero when there's been an error on the given stream.
feof(FILE*)
feof
returns nonzero if the end-of-file has occurred for the given stream.
This is what you check after a read fails, to tell end-of-file apart from an
error; it only becomes true once a read has actually hit the end.
fgets(char*, int, FILE*)
fgets
reads the next line of the given stream and stores it into the given
string. (Up to a max number of characters.)
FILE *fh = fopen("/some/path", "r");
char line[256];
// fgets returns NULL at end-of-file (or on error), which ends the loop
while (fgets(line, sizeof line, fh)) {
    printf("%s", line);
}
fclose(fh);
fputs(char*, FILE*)
fputs
writes the given string to the given filehandle, without formatting.
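Putting the filehandle functions together, here is a sketch that writes two lines and reads them back. To avoid inventing a path it uses the standard tmpfile function, which is not covered above; the function name count_lines_written is made up:

```c
#include <stdio.h>

/* Write two lines to a temporary file, rewind, and count the lines
   read back. Returns the count, or -1 on any failure. */
int count_lines_written(void) {
    FILE *fh = tmpfile();   /* anonymous file, opened for read and write */
    if (!fh) {
        return -1;
    }
    fputs("first line\n", fh);
    fputs("second line\n", fh);
    rewind(fh);             /* back to the start before reading */

    char line[256];
    int count = 0;
    while (fgets(line, sizeof line, fh)) {
        ++count;
    }
    int failed = ferror(fh);   /* did the loop stop on an error or on EOF? */
    fclose(fh);
    return failed ? -1 : count;
}
```

The loop ends when fgets returns NULL, and ferror is then used to tell a real error from a plain end-of-file.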
files (with file descriptors)
File descriptors are just integers. They are how UNIX thinks of files, as
opposed to the handles used above.
open(char*, int flags, int perms)
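Opens the given file, in the mode dictated by flags, which could be one of:
- O_RDONLY: open for reading only
- O_WRONLY: open for writing only
- O_RDWR: open for both reading and writing
Unless you also OR in the O_CREAT flag, open cannot be used to create new
files.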
creat(char*, int perms)
Opens the given file for writing. If the file didn't exist before, it does
now; if it did exist, it's now empty.
The permissions on the file are controlled by perms
, which is usually
specified in octal.
read(int fd, char* buf, int max)
Reads up to the max number of characters from the given file descriptor
into the given buffer.
read
returns the number of characters read. 0 means "end of file", and
-1 means there was an error.
write(int fd, char* buf, int n)
Writes the given number of characters from the given buffer into the file
pointed to by the file descriptor.
write
returns the number of characters written. If that number is not
equal to the one you gave it, there was an error.
lseek(int fd, long offset, int origin)
For the given file descriptor, jumps the current position in the file to
the given character offset.
origin
controls how offset
is used:
- 0: start from beginning
- 1: start from current position
- 2: start from end
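The descriptor calls can be exercised with a short round trip. This sketch assumes a POSIX system with a writable /tmp, and it gets its descriptor from mkstemp (not covered above) rather than open so that no real file name has to be invented. The function name fd_roundtrip is hypothetical:

```c
#include <stdlib.h>
#include <unistd.h>

/* Write "hello", seek back to offset 1, and read 3 bytes into out
   (which must hold at least 4 bytes). Returns bytes read, or -1. */
long fd_roundtrip(char out[4]) {
    char path[] = "/tmp/fd_demo_XXXXXX";
    int fd = mkstemp(path);   /* creates and opens a unique temp file */
    if (fd < 0) {
        return -1;
    }
    unlink(path);             /* the file disappears once fd is closed */

    long n = -1;
    /* SEEK_SET is the symbolic name for origin 0 (start from beginning) */
    if (write(fd, "hello", 5) == 5 && lseek(fd, 1, SEEK_SET) == 1) {
        n = (long)read(fd, out, 3);
        if (n >= 0) {
            out[n] = '\0';    /* read does not terminate the buffer */
        }
    }
    close(fd);
    return n;
}
```

Unlike the FILE* functions, read and write deal in raw bytes with no buffering or string termination, so the caller adds the trailing 0.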
directories
opendir(char*)
Returns a "file"handle (I guess really a dirhandle) for the given directory.
readdir(DIR*)
Returns a pointer to a dirent struct describing the next entry of the
directory, or NULL when there are no more entries.
closedir(DIR*)
Tells the system you're done reading the directory.
(example)
DIR *dh = opendir("/some/random/dir");
if (!dh) {
    //error!
}
struct dirent *dir_entry;
while ((dir_entry = readdir(dh))) {
    // dir_entry->d_name is the name of this entry
}
closedir(dh);
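A slightly fuller sketch counts the entries in a directory while skipping the "." and ".." entries that readdir normally reports. It assumes a POSIX system; the function name count_entries is made up:

```c
#include <dirent.h>
#include <string.h>

/* Count the entries in a directory, skipping "." and "..".
   Returns -1 if the directory cannot be opened. */
int count_entries(const char *path) {
    DIR *dh = opendir(path);
    if (!dh) {
        return -1;
    }
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dh)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0) {
            continue;
        }
        ++count;
    }
    closedir(dh);
    return count;
}
```

The pointer returned by readdir belongs to the directory handle, so copy d_name if you need it after the next readdir or closedir call.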
The preprocessor
The C preprocessor is the very first part of compilation. "Preprocessing"
involves processing directives (lines that start with "#") and expanding macros to create the actual source code
that's fed to the compiler. Preprocessor directives don't follow the same
set of syntax rules as the rest of the language, so be careful of the following:
- "//" is not recognized as a comment. Well, "//" isn't always
recognized as a comment in C anyway (it's a C++ thing), but most modern
C compilers have added it. However, the preprocessor can't assume that
it's necessarily a comment, so it has to be kept.
- If you want your directive to span more than one line, each continuation
line must end with a backslash.
#include
#include
copies the content of the named file into
the current file. This is mostly used to pull in declarations that have
been put in a centralized file. (And since declarations have to be seen
before functions can be used, these #includes are usually at the top, which
is why they are called header files.)
There are two variations. If you specify the file name in angle brackets, C
looks in implementation-specific places for the file; if you specify the file
in double-quotes, it first looks in the same directory that has the current
file, and then looks in implementation-specific places.
Note that the "-I" compiler switch adds a directory in which to look for
these files.
#include <stdio.h>
#include "myproject.h"
#define
#define lets you swap out any token for something else. The token
name you specify has to follow the normal rules for C identifiers, but the
value you replace it with can be almost literally anything.
You can #define the same token multiple times; at any point during the
preprocess scan, the most recent definition wins. (Compilers complain if the
new value differs from the old one, so #undef the token first.)
#define has several variations.
swap token for text
This is what people usually mean when they say "macro". It's most commonly
used to define constant values, but it can do anything, such as define a new
loop keyword:
#define PI 3.14159
#define forever for(;;)
..
forever {
..
float area = PI * r * r;
..
}
no value
You can specify a #define without a value, in which case the token expands
to nothing at all. This is most commonly used to see if the token
has been seen before (with #ifdef or #ifndef).
#define FOO
...
#ifdef FOO
..
#endif
with parameters
Yes, you can define macros with parameters. Sneaky! Here's the example
from the Kernighan/Ritchie book:
#define MAX(A,B) ((A) > (B) ? (A) : (B))
..
int myvar = MAX(var1, var2);
Using parameters in macros comes with even more warnings. First is that
each instance of each parameter is re-evaluated in the code, which is
occasionally incorrect. Using the MAX example, consider MAX(i++, j++).
That will be expanded to ((i++) > (j++) ? (i++) : (j++)), which executes
two increments on one of those variables. In addition to this correctness
problem, the re-evaluation is also an optimization concern.
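To make the double increment concrete, here is a small sketch. The function max_side_effect_demo and its starting values (5 and 3) are invented for the example.

```c
#define MAX(A,B) ((A) > (B) ? (A) : (B))

/* Hypothetical demo: returns what MAX(i++, j++) evaluated to and stores
   the final values of i and j through the pointers. */
int max_side_effect_demo(int *i_out, int *j_out)
{
    int i = 5, j = 3;
    int m = MAX(i++, j++);  /* ((i++) > (j++) ? (i++) : (j++)) */

    *i_out = i;   /* 7: incremented by the comparison and again by the result */
    *j_out = j;   /* 4: incremented once, by the comparison only */
    return m;     /* 6: the value of i after its first increment */
}
```

Had MAX been a real function, the result would have been 5 and both variables would have gone up by exactly one.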
The second warning is that since the macro-expanded code is fed back into the
compiler, you have to account for the normal rules on precedence. Consider
this:
#define square(a) a*a
..
int myvar = square(i+1); // expands to "int myvar = i+1*i+1;"
Obviously, that did not do what you expected: multiplication binds tighter
than addition, so the result is i + (1*i) + 1.
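The usual defense is to wrap every parameter, and the whole body, in parentheses. A quick sketch comparing the two forms (the names SQUARE_BAD, SQUARE, and the two wrapper functions are made up for the comparison):

```c
#define SQUARE_BAD(a)  a*a            /* the unparenthesized version */
#define SQUARE(a)      ((a) * (a))    /* parameters and body parenthesized */

/* Hypothetical wrappers so both expansions can be called and compared. */
int square_bad_demo(int i) { return SQUARE_BAD(i+1); }  /* i+1*i+1 */
int square_demo(int i)     { return SQUARE(i+1); }      /* ((i+1) * (i+1)) */
```

With i equal to 3, the first version yields 3 + 3 + 1 = 7 and the second yields 16.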
Also, there are some even sneakier things you can do with parameters:
adding quotes on the fly, and creating tokens on the fly.
adding quotes on the fly
Inside of a macro definition, you can have the preprocessor add quotes
by prefixing a parameter with "#". Again, the Kernighan/Ritchie example is
great for showing why you'd want to do that:
#define dprint(expr) printf(#expr "=%g\n", expr)
..
dprint(x/y); // expands to: printf("x/y" "=%g\n", x/y); Then the strings
// are concatenated to create: printf("x/y=%g\n", x/y);
creating tokens on the fly
Using "##" tells the preprocessor to concatenate two things together and
then rescan the result. If you're wondering what that's good for, so
am I.
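One use that does come up in practice is stamping out a family of near-identical functions, one per type. This is only a sketch; DEFINE_MAX_FN and the generated names are invented for the example.

```c
/* max_##T pastes "max_" and the argument into a single identifier,
   such as max_int or max_double. */
#define DEFINE_MAX_FN(T) \
    T max_##T(T a, T b) { return a > b ? a : b; }

DEFINE_MAX_FN(int)      /* defines: int max_int(int a, int b) */
DEFINE_MAX_FN(double)   /* defines: double max_double(double a, double b) */
```

After those two lines, max_int(2, 9) and max_double(1.5, 0.5) are ordinary functions you can call.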
#undef
Things that have been previously #defined may be undefined with #undef.
#if..#endif
The #if directive evaluates an integer expression (at compile time!) and
either includes or skips the contents of the block depending on the result.
There are #else and #elif directives you can use as part of an #if block.
The condition expression is pretty much limited to math and logic operations
on integers, but you are allowed to use one "function": defined. It returns
whether the given identifier has been previously #defined, in any
incarnation.
#if MYCONST == VALUE1
..
#elif MYCONST == VALUE2
..
#else
..
#endif
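Here is a filled-in sketch of the same shape, including the defined "function". LOG_LEVEL, LOG_NAME, and log_name are hypothetical names chosen for the example.

```c
#define LOG_LEVEL 2   /* hypothetical setting; try 0, 1, or deleting this line */

#if !defined(LOG_LEVEL)
#define LOG_NAME "default"
#elif LOG_LEVEL == 0
#define LOG_NAME "quiet"
#elif LOG_LEVEL == 1
#define LOG_NAME "normal"
#else
#define LOG_NAME "verbose"
#endif

/* Only one of the four #define lines above survives preprocessing. */
const char *log_name(void) { return LOG_NAME; }
```

With LOG_LEVEL set to 2, none of the first three conditions hold, so the #else branch is the one the compiler sees.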
#ifdef/#ifndef..#endif
In fact, if all you're checking is whether something has or has not been
#defined, the preprocessor has shortcuts.
#if defined(foo)
// is the same as:
#ifdef foo
#if !defined(bar)
// is the same as:
#ifndef bar
Far and away the most common use of this is to prevent double-definition of
header file content. (See below.)
Packaging
C supports 2 pieces of software packaging:
- header files (code files which traditionally end with .h) contain
declarations of variables and functions that live elsewhere.
- object files (binary files which traditionally end with .o or .so)
contain already-compiled definitions of things.
Typically they go together -- when someone creates a package for you to use,
they'll deliver both the binary object as well as the text header.
Non-standardized UNIX tradition also includes archives, which are a
bundle of object files combined into a single file that ends with .a. Archives
are created with the ar utility. Pretty much all linkers understand .a files
and handle them as a collection of .o files.
header files
The #include mechanism tells the compiler to copy the contents of the specified
file and paste them at that location in the current file. Anything you can do
in C code can be segmented up and reconstituted with #include. The intended
application of it is to centralize external declarations and definitions. That
is, instead of having every one of your code files type out the exact
declaration for printf, let's put the declaration in a file called stdio.h,
and then your code files need only #include <stdio.h>.
#include is great for code reuse and decoupling, but presents a very real
problem: object double-definition. If you define an identifier multiple
times (even as the exact same thing), the linker will throw an error. Identifiers
must be defined exactly once in the final program.
To create your own header files and avoid double-definitions, you need to know
two tricks: how to declare things (vs. define them), and how to keep definitions
from being seen more than once.
declarations
Variables are declared by prefixing them with extern. This tells the
compiler that the variable will be defined somewhere else, but also gives
the compiler all the type information it needs.
Functions are declared by replacing the function body with a semicolon.
Structs can be declared without a body ("struct foo;"), but until the
compiler has seen the full definition you can't create one or touch its
members. However, you can use pointers to them just fine.
extern int myvar1;
extern char *myvar2;
int myfunc1(int);
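To show the two halves side by side, here is a sketch that puts declarations and their matching definitions in one file. In a real project the first group would live in a header and the second in exactly one .c file; the names counter and bump are invented for the example.

```c
/* What a header would carry: declarations only. */
extern int counter;     /* declared here, defined elsewhere */
int bump(int amount);   /* declared here, body is elsewhere */

/* What exactly one .c file would carry: the definitions. */
int counter = 0;

int bump(int amount)
{
    counter += amount;
    return counter;
}
```

Seeing a declaration and then its definition in the same file is fine; what the linker objects to is two definitions.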
#ifndef
To avoid double-definitions, you'll quite frequently see a #ifndef trick in
header files. The #ifndef trick sets a compile-time variable inside a
header file, but then skips the whole header if that variable is already
defined.
#ifndef MYHEADER_H
#define MYHEADER_H
...
#endif
This means you can #include a file multiple times and only the first
one has the actual content in it.
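You can watch the guard work without creating any extra files by pasting a guarded "header" into one file twice, which is all two #includes would do anyway. This is a sketch; POINT_H, struct point, and point_sum are hypothetical names.

```c
/* First "inclusion" of a pretend point.h: POINT_H is not yet defined,
   so the content is kept. */
#ifndef POINT_H
#define POINT_H
struct point { int x, y; };
#endif

/* Second "inclusion": POINT_H is now defined, so everything down to
   the #endif is skipped. */
#ifndef POINT_H
#define POINT_H
struct point { int x, y; };
#endif

int point_sum(struct point p) { return p.x + p.y; }
```

If you delete the #ifndef/#define/#endif lines, the compiler sees struct point defined twice and rejects the file.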
object files
Object files are the compiled binary of some number of source files. They
can be either statically linked (conventionally .o) or they can be
dynamically linked (conventionally .so).
Even though they're not in readable ASCII, there are a few things you can do
to see what's going on with object files.
- nm: shows you the symbol table.
- strings: shows you all the constant char* strings.
- otool: lists various parts of an object file. (This one is a macOS tool;
objdump plays a similar role on Linux.)
- od and hexdump: read an arbitrary file and print it out as oct/hex codes.
You consume object files by linking them with your binaries to create
the final program. You could do linking yourself with ld, but most
compilers will do it for you when you give them binary inputs and/or
request a final program as the output.
Operator precedence
One of the most loudly derided warts in C is that it defines 15 distinct
levels of operator precedence of varying associativity. If you wish to
reach the end of your life sane, now is the time to avert your eyes, because
I now present to you the full table in all its resplendent whatthefuckery,
in decreasing order of precedence:
operator(s) | associativity | precedence |
() [] -> . | left to right | highest |
! ~ ++ -- +(no-op) -(negation) *(dereference) &(address-of) (type)(typecast) sizeof | right to left |
*(multiplication) / % | left to right |
+(addition) -(subtraction) | left to right |
<< >> | left to right |
< <= > >= | left to right |
== != | left to right |
&(bitwise and) | left to right |
^ | left to right |
|(bitwise or) | left to right |
&& | left to right |
|| | left to right |
?: | right to left |
= += -= *= /= %= &= ^= |= <<= >>= | right to left |
, | left to right | lowest |
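Two rows of the table bite people more often than the rest: == sits above the bitwise &, and + sits above <<. A small sketch of both (the function names are made up, and most compilers will warn about the unparenthesized forms, which is rather the point):

```c
/* == binds tighter than &, so this is x & (1 == 0), which is x & 0. */
int is_even_wrong(int x) { return x & 1 == 0; }

/* Parentheses force the grouping that was actually intended. */
int is_even(int x) { return (x & 1) == 0; }

/* + binds tighter than <<, so this is 1 << (n + 1), not (1 << n) + 1. */
int shift_then_add(int n) { return 1 << n + 1; }
```

When in doubt, add the parentheses; nobody has ever been fired for an extra pair.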
Chris verBurg
2015-07-12