Part 1 - Introduction and Setting up the REPL

As a web developer, I use relational databases every day at my job, but they’re a black box to me. Some questions I have:

  • What format is data saved in? (in memory and on disk)
  • When does it move from memory to disk?
  • Why can there only be one primary key per table?
  • How does rolling back a transaction work?
  • How are indexes formatted?
  • When and how does a full table scan happen?
  • What format is a prepared statement saved in?
    In other words, how does a database work?

To figure things out, I’m writing a database from scratch. It’s modeled off sqlite because it is designed to be small with fewer features than MySQL or PostgreSQL, so I have a better hope of understanding it. The entire database is stored in a single file!

Sqlite

There’s lots of on their website, plus I’ve got a copy of SQLite Database System: Design and Implementation.

|

A query goes through a chain of components in order to retrieve or modify data. The front-end consists of the:

  • tokenizer
  • parser
  • code generator
    The input to the front-end is a SQL query. the output is sqlite virtual machine bytecode (essentially a compiled program that can operate on the database).

The back-end consists of the:

  • virtual machine
  • B-tree
  • pager
  • os interface
    The virtual machine takes bytecode generated by the front-end as instructions. It can then perform operations on one or more tables or indexes, each of which is stored in a data structure called a B-tree. The VM is essentially a big switch statement on the type the bytecode instruction.

Each B-tree consists of many nodes. Each node is one page in length. The B-tree can retrieve a page from disk or save it back to disk by issuing commands to the pager.

The pager receives commands to read or write pages of data. It is responsible for reading/writing at appropriate offsets in the database file. It also keeps a cache of recently-accessed pages in memory, and determines when those pages need to be written back to disk.

A journey of a thousand miles begins with a single step, so let’s start with something a little more straightforward: the REPL.

Sqlite starts a read-execute-print loop when you start it from the command line:

To do that, our main function will have an infinite loop that prints the prompt, gets a line of input, then processes that line of input:

  1. InputBuffer* input_buffer = new_input_buffer();
  2. while (true) {
  3. print_prompt();
  4. read_input(input_buffer);
  5. if (strcmp(input_buffer->buffer, ".exit") == 0) {
  6. exit(EXIT_SUCCESS);
  7. } else {
  8. printf("Unrecognized command '%s'.\n", input_buffer->buffer);
  9. }
  10. }
  11. }

We’ll define InputBuffer as a small wrapper around the state we need to store to interact with . (More on that in a minute)

  1. struct InputBuffer_t {
  2. char* buffer;
  3. size_t buffer_length;
  4. ssize_t input_length;
  5. };
  6. typedef struct InputBuffer_t InputBuffer;
  7. InputBuffer* new_input_buffer() {
  8. input_buffer->buffer = NULL;
  9. input_buffer->buffer_length = 0;
  10. input_buffer->input_length = 0;
  11. return input_buffer;
  12. }

Next, print_prompt() prints a prompt to the user. We do this before reading each line of input.

To read a line of input, use getline():

  1. ssize_t getline(char **lineptr, size_t *n, FILE *stream);

lineptr : a pointer to the variable we use to point to the buffer containing the read line.

n : a pointer to the variable we use to save the size of allocated buffer.

return value : the number of bytes read, which may be less than the size of the buffer.

We tell getline to store the read line in input_buffer->buffer and the size of the allocated buffer in . We store the return value in input_buffer->input_length.

buffer starts as null, so getline allocates enough memory to hold the line of input and makes buffer point to it.

  1. void read_input(InputBuffer* input_buffer) {
  2. ssize_t bytes_read =
  3. getline(&(input_buffer->buffer), &(input_buffer->buffer_length), stdin);
  4. if (bytes_read <= 0) {
  5. printf("Error reading input\n");
  6. exit(EXIT_FAILURE);
  7. }
  8. // Ignore trailing newline
  9. input_buffer->input_length = bytes_read - 1;
  10. input_buffer->buffer[bytes_read - 1] = 0;
  11. }

Finally, we parse and execute the command. There is only one recognized command right now : .exit, which terminates the program. Otherwise we print an error message and continue the loop.

Let’s try it out!

  1. ~ ./db
  2. db > .tables
  3. Unrecognized command '.tables'.
  4. db > .exit
  5. ~

Alright, we’ve got a working REPL. In the next part, we’ll start developing our command language. Meanwhile, here’s the entire program from this part:

  1. #include <stdbool.h>
  2. #include <stdio.h>
  3. #include <stdlib.h>
  4. #include <string.h>
  5. struct InputBuffer_t {
  6. char* buffer;
  7. size_t buffer_length;
  8. ssize_t input_length;
  9. };
  10. typedef struct InputBuffer_t InputBuffer;
  11. input_buffer->buffer = NULL;
  12. input_buffer->buffer_length = 0;
  13. input_buffer->input_length = 0;
  14. return input_buffer;
  15. }
  16. void print_prompt() { printf("db > "); }
  17. void read_input(InputBuffer* input_buffer) {
  18. ssize_t bytes_read =
  19. getline(&(input_buffer->buffer), &(input_buffer->buffer_length), stdin);
  20. if (bytes_read <= 0) {
  21. printf("Error reading input\n");
  22. exit(EXIT_FAILURE);
  23. }
  24. // Ignore trailing newline
  25. input_buffer->input_length = bytes_read - 1;
  26. input_buffer->buffer[bytes_read - 1] = 0;
  27. }
  28. int main(int argc, char* argv[]) {
  29. InputBuffer* input_buffer = new_input_buffer();
  30. while (true) {
  31. print_prompt();
  32. read_input(input_buffer);
  33. if (strcmp(input_buffer->buffer, ".exit") == 0) {
  34. exit(EXIT_SUCCESS);
  35. } else {
  36. printf("Unrecognized command '%s'.\n", input_buffer->buffer);
  37. }
  38. }
  39. }