Database Design, MySQL, and the SQL Language - Part 1

The original document from Christopher Sung can be found [here](http://itp.nyu.edu/~cs220/dwd/class03.php). I would read it from there since mine is only an aggregation of the original. ## Relationship Modeling Relational databases are based on tables, and how they correspond to each other. You can think of a table as a set of rows and columns. Each column represents a general attribute (i.e. "name of musical style"), and each row is an instance of that attribute ("rock", "jazz"). Tables are often represented in short form by simply listing the columns. For example, here's a table of musical styles in full form with all its data, and in its short form with just its attributes:

This is a short display of a table

This is display of the column with attributes. Specifically, table attributes (i.e. the columns) are called fields, and the instances of these fields (i.e. the rows) are called records. If you look above, you can see that we have two fields (id, name) and there are 21 entries. Thus we have 21 records in our musical style table.

The main task of designing tables lies in what fields you decide to include as part of your table. Often, certain fields that you’d like to include may be better suited to being in their own table, and then being linked to the original table. One example of this would be if we had a table of audio recordings. Part of it might look like this:

and its data view might look like this:

But as we fill it with recordings, we find we are often repeating the same data over again in the “musical_style” field. So, to make our design more compact, we can separate musical style into its own table, such as the one we looked at above. We can then replace the musical_style field with the ID of the given musical style, such that our database is now two tables that look like this:

and the fields of the two tables in short form are now:

This type of mapping is called a one-to-many relationship as one recording has many possible musical styles of which it can be an example. The relationship of a book to a bookpage is another example of this, as there are many possible pages that can a part of a specific book. At WholeNote, lessons have their own table, and lesson pages occupy another. The field that links them is the lesson’s ID, which is a required field for every lesson page (i.e. I am a lesson page and my lesson_id field tells me which lesson I belong to).

A field that uniquely identifies a specific row in a table is called a primary key. In general, I like to make an id field in most every table I make. It is always an integer that starts at 1, and that increments by 1 every time a new record is added. When you reference one table’s ID in another table, the name of that field in the second table is called a foreign key. It doesn’t function as the primary key since we already have one, but it gives us a way to uniquely identify how our second table relates to the first. In our above example, musical_style_id is the foreign key in the recordings table that binds it to the musical_style table.

The only problem with our current database is that it forces each recording to belong to one and only one musical style. Some might argue that Hello, Nasty is also a rock album, which means it could belong both to the rap style AND the rock style. This is an example of a many-to-many relationship. When this is the case, a third table is needed to define every possible relationship between the two existing tables, which in our example, is recordings and musical styles. We no longer need the foreign key in our recording table because our new third table will define the relationships and will do so totally using the “id” field of each table:

Database Design, MySQL, and the SQL Language - Part 2

Now that we have a basic understanding of how to create a simple table, we need to develop a better understanding of how to separate our data into a particular configuration of tables. Database design has evolved to the point where there are specific steps you can take to optimize any database configuration. This process is called normalization and it has three distinct forms.

Here is a set of raw data about several known musicians:

First Normal Form

n order for a set of data to be in first normal form, the following must be true:

Each field in a table must contain only a single item of data
Each type of data must be stored in only one place

The first condition relates to what variable type we choose for a specific field. If you tried to put a student’s GPA in the same field as their e-mail address, you’d have a hard time extracting either when you ask for a value from that column. Thus each field must contain only a single item of data.

The second condition says that if you have similar types of data in different columns, move them to one single column, or create a new table for them if you have to.

As you can see, the column “birth_information” violated our first rule, since it contained both a place and a date. Thus, we had to separate them into their own columns.

Additionally, for our instrument fields, we had the same type of data in three different columns. The only way for us to put this data in first normal form is to put it in its own table, with a foreign pointing back to the musician, in order to keep track of which musician plays which instrument. This necessitated us to add an ID field to the musician table in order to uniquely identify them.

This is first normal form.

Second Normal Form

In order for a set of data to be in second normal form, the following must be true:

All data must be in first normal form
Each field in a table must contain data about one and only one entity.

The first condition was just satisfied when we revamped our raw data into our current two tables. The second condition says that not only must each field contain a single item of data, but that they must contain one and only one attribute or entity.

So, what happens when we apply these rules to our current data set?

Our old “birthplace” actually contained two entities: a city, and a country. Thus, we needed to split this values into two values: birth_city and birth_country

This is second normal form.

Third Normal Form

Third normal form is all about removing dependencies in our table design. In order for a set of data to be in third normal form, the following must be true:

All data must be in second normal form
All non-key data dependent on other non-key data should either be eliminated (if possible) or moved to its own table.

The first condition was just satisfied when we revamped our data into second normal form. The second condition says that if we have a field that’s dependent upon another field(s), we should either get rid of it, or put it in another table.

So, what happens when we apply these rules to our current data set?

As you can see, we were able to eliminate the “age” field because it was dependent upon the “birthdate” field. There is no need for us to store age because we can always just derive it by subtracting it from the current date.

This is second normal form.

Removing Further Redundency

Just because something is normalized doesn’t mean you can’t streamline things a bit more. There is one other thing you can investigate to make sure your design is as compact as possible. For this, I pay attention to two things:

Move any field that represents a finite set of possible values to its own table
Replace any of these fields in other tables with a foreign key to the new table

What this means is that if you have a field where there are a reasonably small number of possible values (US states, NFL football teams, cereal brands), create a new table for them, give it a primary key, and replace their values in the old table with a foreign key to the newly created table.

So, what happens when we apply these rules to our current data set?

We found that the number of countries and the number of instruments to be finite in their range of possible values. Thus, we created two new tables: “instrument” and “country”. For instrument, we actually replaced the old “instrument” table with a new one with its own primary key. We then renamed the old “instrument” table as “musician_x_instrument” and replaced the “instrument” field with “instrument_id”, which is a foreign key to the new “instrument” table. By doing this, we created a many-to-many relationship between musicians and the instruments they play. For the country of birth for each musician, we created a “country” table with a primary key, and then replaced the old “birth_country” column in the “musician” table with “birth_country_id”, which is a foreign key to the “country” table.

Multiple Table Joins

The whole point of having a relational database is to be able to connect two or more different tables such that we can extract various sets of data. The process of pulling data from two or more related tables is called a join. The most common is the inner join which has two different syntaxes. For our purposes, we will be using the older syntax, which is a bit easier to read. This format is:


SELECT [DISTINCT] column1[,column2]
FROM table1,table2[,table3]
[WHERE "conditions"]
[GROUP BY "column-list" [ASC | DESC] ]
[HAVING "conditions]
[ORDER BY "column-list" [ASC | DESC] ]
[LIMIT [offset,] number-rows]

For example, if we want to get the names of our musicians and their respective birth countries from our example above, we would say:

SELECT m.firstname, m.lastname, c.name FROM musician m, country c WHERE m.birth_country_id=c.id ORDER BY m.lastname, m.firstname

which would return the following results:

Note that in order to identify columns, it is necessary to give each table its own abbreviation. In the above case, we use “m” for the musician table, and “c” for the country table. Thus, if we want to refer to a column name in the musician table, we use the syntax “m.[column name]”.

Also note that in order to equate the two tables, we are using the foreign key in the musician table (m.birth_country_id), that points to the primary key in the country table (c.id), and making it a necessary condition that they must be equal to other. If we don’t we get this:


SELECT m.firstname, m.lastname, c.name
FROM musician m, country c
ORDER BY m.lastname, m.firstname

which would return the following results:

This redundency occurs because we never included our country-related reference in our musician table (i.e. birth_country_id) in our query. Thus, the query returned every musician in combination with every country. It’s only when we make use of the “WHERE m.birth_country_id=c.id” clause that we look for the birth country for each individual musician.