ORM's are better than databases, right?

home

 

ORM's are clearly better, aren't they?

Pulling a database into a nicer in-memory data structure is the new thing. It's called ORM -- object relational mapping. Microsoft's C# Entity Framework is one. Hibernate for Java is another. The idea is that programmers can already quickly and easily navigate traditional intricate data-structures. Having them clunk around with the archaic language SQL is a waste of their and the CPU's time. Essentially, an ORM is an interface, hiding the arcane yucky details of a relational database. Clearly an excellent idea.

But now I'm wondering. Most new language features these days are unproductive oversimpifications made to appeal to novices. Could ORM's be like that? Is it possible that relational databases and SQL were made for good reasons? SQL was designed before the original Star Wars. There's no way it's still the best at what it does, is there? In short, could the basic idea behind ORM's be flawed? Are ORM's more difficult to learn and less useful than simply learning SQL? Naaah.

We know none of that is true since normal coding is just easier. I took an SQL course a long time ago and once taught the basics of Microsoft Access, so I know what I'm talking about. SQL queries must be carefully crafted as single big statements. That's fine for simple stuff but not for anything complex. For that we need code and a normal data-structure, provided by our ORM.

SQL is clunky and awkward

To start out, SELECT is a terrible keyword. It doesn't select. WHERE is the real select: SELECT id FROM people WHERE name<>"unknown"; uses WHERE to select people with real names. The SELECT merely formats results to display only the ID. That's the first thing people see about SQL -- a confusing keyword.

But the real problem with SQL isn't poor cosmetic choices -- it's the backwards way it works. That is what convinces most people that databases are clunky. Suppose we want to store Cats and the foods each likes. Here's how normal humans would do that:

class Cat {
  int ID; // someone somewhere will probably need this
  string name="Fluffy"; // default name
  List<string> Foods; //  <-- a lovely list of this cat's favorite foods
}

List<Cat> Cats;

We get a nice list of Cats, where each cat has its own list of the foods it eats. Say we want to know what cat number 7 eats -- Cats[7].Foods. Beautiful.

Relational databases starts out a tiny bit nicer -- we define a table and create it at the same time. This creates a list named Cats:

create table Cats (
  id int primary key,
  name varchar(20) default 'fluffy'
  /* names have maximum size 20. Eeeew */
);

I suupose that's not too bad, so far. Adding items is also fine. This adds Boots and Max:

insert  into cats (id, name) values (101, "Boots");
insert  into cats (id, name) values (102, "Max");

I can't even really complain about that terrible varchar(20). Database strings have a maximum length, which programs stopped needing in 1985. But we'll need a limit at some point, so it's fine for it to be up-front in databases.

Strangely, the cat table doesn't say what they eat. That's on purpose. We need another table for foods:

create table catEats (
  food varchar(20),
  cat_id int
  /* this should match the id of some Cat */
);

We can insert items so that Boots (Cat 101) eats mice and milk, and Max (Cat 102) eats corn:

/* Boots eats mice and milk: */
insert into catEats (food, cat_id) values ("mice", 101);
insert into catEats (food, cat_id) values ("milk", 101);
/* Max eats corn: */
insert into catEats (food, cat_id) values ("corn", 102);

Holy crap! That table is just everything eaten by all cats, jammed together, with those weird backlinks to their cat id. For example:

select * from catEats;
+--------+--------+
| food   | cat_id |
+--------+--------+
| mice   |    101 |  // in a working database...
| milk   |    101 |  // ...these wouldn't even be...
| corn   |    102 |  // ...grouped by cat_id!
| bread  |    103 |
| mice   |    103 |
| apples |    103 |
+--------+--------+

This all-sublists-together hack is the approved way. It's what really convinces coders that databases are clunky.

Or course, needing to use raw ints as non-functioning pointers doesn't help. To be fair, database people know this is ugly and created a special annotation. Listing a field as a Foreign Key identifies it as a pointer, constrained to the correct table and field:

// better definition of catEats table:
create table catEats (
  food varchar(20),
  cat_id int,
  
  /* this is mark-up for the cat_id field above */
  /* our cat_id matches with the "id" field of the Cats table: */
  foreign key (cat_id) references cats(id)
};

To see how terribly this set-up works, let's find the foods eaten by the third cat:

select food from catEats
  where cat_id=103;

Huh. That's actually fine. We didn't even need to write an explicit loop. But let's defeat SQL by asking for foods with 4 or more letters. Here's our attempt at it:

select distinct food from catEats
  where length(food)>3;

Huh. Well that's short and works perfectly. I should have known SQL has a string-length function. And I guess removing duplicates is such a common thing that of course a single keyword distinct exists for it. But that proves nothing -- we knew SQL was good for simple things. Let's do a hard one: "names of cats that eat mice". To gloat, here's how easy the programming version is:

// cats that eat mice:
for(int c=0; c<Cats.length; c++) {
  if(Cats[i].Food.contains("mice")) print( Cats[i].name );

It's easy since cats have that nice list of what they eat, unlike databases.

For the database version we'll finally need to follow the cat_id link. The JOIN command does this, sort of:

select * from
  cats join catEats
    on cats.id=catEats.cat_id; /* get row with matching cat_id's */

It gives us the same table, but with the linked cat tacked onto the end. This is SQL's idea of following pointers:

mice 101 Boots
milk 101 Boots
corn 102 Max
bread 103 Cat#3
mice 103 Cat#3
apples 103 Cat#3

Clearly this is already a mess, but to be fair let's finish it. Another WHERE selects the mice, then the SELECT displays only the names:

select name from
  cats join catEats
    on cats.id=catEats.cat_id
  where food="mice"; /* <--- the new line */

Huh. This is short -- shorter than the programming version -- works perfectly, and looks fine. Output is "Boots" and "Cat#3", on two lines.

It's not even as inefficient as we think. SQL is a descriptive language, not procedural. The system probably decides to handle the WHERE first, getting rid of non-mice foods, which results in much less work for the JOIN. Rumor has it that common actions are even pre-computed.

But coding still has the advantage of being able start in different ways and still solve the problem. Programming is creative, SQL isn't. Except it turns out we can do this another way in SQL. This query starts with all cat/food pairs (essentially a nested loop) and grabs the ones we want:

// names of cats that eat mice, version #2:
select name from cats, catEats /* ALL pairs of cat & food */
  where food="mice" and id=cat_id;

The approach with JOIN seems better. Here we're just joining by-hand in the WHERE, but what do you know -- SQL has an AND.

Let's change tracks. As we know, SQL will always be awkward since it doesn't have functions. What? It does, called VIEW's? You'll read that VIEW's are virtual tables. They're actaully stored SQL queries run as needed to create virtual tables. Hmmm...that's a function. A parameterless one, but a function. This sets up VIEW catsAndFoods as the basic JOIN on cat-id's:

/* creates catsAndFoods as an alias for this SQL */
create view catsAandFoods as
  /* the function body */
  select cats.id, name, food from
    cats join catEats
    on cats.id=catEats.cat_id;

catsAndFoods is used exactly like a table. This gets "names of cats that eat mice":

/* casts who eat mice: */
select name from catsAndFoods where food="mice";

Damn that's short, and not really complex at all.

But as coders we can still mock how VIEW's are globals. Real languages have scope and local variables. It turns out ... SQL has local variables. WITH is the same as VIEW but for only that query. This uses one to solve "foods eaten by cats whose names are at least 6 letters long" (for example, foods eaten by Sebastian, but not by Max):

/* set-up longNameCats as an alias: */ 
with longNameCats as (
  /* it's simply Cats, but with short rows removed: */
  select name, id from cats
    where length(name)>5
)
/* now the real query, a basic JOIN: */
select distinct food from
  longNameCats join catEats
    on longNameCats.id=catEats.cat_id;

If that's not enough, there's even an alternate syntax which feels like assigning to a variable. Enclose any SQL in parens followed by "as myAlias". Below we set-up longNameCats and use it later:

/* more of: foods eaten by long-name-cats */
select food from
  (select name, id from cats where length(name)>5) as longNameCats
    join catEats
      on longNameCats.id=catEats.cat_id;

If I didn't know better, I'd say SQL was written in the general spirit of a real programming language.

SQL has conditional expressions??

That string-length function is making me so mad. I'll conceed that SQL has some useful built-ins, the same as coding. But when we run out, code can brute force anything. Let's rummage through SQL's various tests to prove how limited its conditional expressions are.

We already know it has AND and OR. It's also got something named IN, but that's merely an "or" shortcut. For example:

/* the foods fluffy and boots can eat, using an OR: */
select distinct food
  from catsAndFoods /* the VIEW */
    where name="fluffy" or name="boots";

/* same thing, with IN shortcut */ 
  "   "   "  "  where name IN ("fluffy", "boots");

SQL has the full complement of less-than and so on. Well, of course. Charmingly, not-equals is written as the old-fashioned <>. BASIC did that. It adds BETWEEN as another shortcut: amt BETWEEN 5 and 10 is really amt>=5 AND amt<=10. I suppose we can feel superior by saying coders don't need the BETWEEN shortcut.

But then we come to SQL's LIKE keyword. It checks for regular expressions. Regular freaking expressions! I can barely do that in coding languages that aren't Perl. In SQL it's easy. Some examples (underscore is one letter, % is any number of letters):

/* 2nd letter is "h": */
select name from cats
  where name like '_h%'; /* sheena chip zharf */

/* ends with Z: */
...
  where name like '%Z';

I don't even know why I'm trying to explain this. They work in the regular way is all that needs to be said. I was going to get mad about the different syntax, except every language is that way. And SQL actually allows the more common syntax, through regexp_like:

select name from cats
  where regexp_like(name, '[aeiou]$';  /* ends with a vowel */

  where regexp_like(name, '^[a-z]*[a-z0-9]$';  /* only letters, may end in number */

It's almost as if SQL kept borrowing the best features from all programming languages. It's still clearly inferior to writing a program, but not as inferior as I first thought.

Arg. It has pivot tables

Here's the catEats table again. Recall that it's every cat's personal food list jammed together:

select * from catEats;
+--------+--------+
| food   | cat_id |
+--------+--------+
| mice   |    101 |
| milk   |    101 |
| corn   |    102 |
| bread  |    103 |
| mice   |    103 |
| apples |    103 |
+--------+--------+

Tables like this feel like a weird SQL hack, but it turns out they're useful. Like a spreadsheet pivot table, we can arrange the items into groups by choosing a column. We can make groups with the same id:

101 mice milk
102 corn
103 bread mice apples

...or groups with the same food ([mice 101 103], [corn 102] ...). We can't use these yet since they aren't proper tables (the rows are different lengths), but we can gets stats -- how many, the maximum, and so on.

The command to do this is GROUPBY. The stats you want to see go into the SELECT. This displays the the number of foods each cat eats:

select cat_id, count(*)
  group by cat_id; /* makes 3 rows, 1 for each cat */

It makes sublists, then counts them because we asked. Pretty neat:

101  2 /* the length of that Cat's food list */
102  1
103  3

Or we can group them by food (not as interesting). For fun we'll also print the sum of the cat_id's in each group

select food, count(*), sum(cat_id)
  from catEats
  group by food;

mice 2 204 /* 204 is the sum of the id's: 101+103 */
milk 1 102
  ...

Other fun statistics besides COUNT and SUM are MIN, MAX and AVG. COUNT has no input (so we write *) -- it just adds them -- but others need a column, for example: max(cat_id).

This makes it a lot tougher to explain the inherent inferiority of SQL. We can even nest functions. This displays each cat with the total letters of foods it eats:

select name, sum(length(food)) as fLen
/* "as fLen" replaces the column heading */
  from catOneFood /* the joined VIEW [name id food] */
  group by name

Max    9 /* mice + milk */
Boots  4 /* corn */
Cat#3 15 /* bread + mice + apples */
Here's a tough one. Display each cat with the name of it's longest-name food. We use MAX to get the length, then use that to get the actual name:
/* save new local table catAndFoodLen: */
with catAndFoodLen as ( /* <-- catAndOneFood is an alias */
  select name, max(length(food)) as fLen
    from catOneFood
    group by name
)
/* rows now look like: [Max,4] [Boots,4] [Cat#3,6] */

/* JOIN to get [Max,4,mice] [Max,4,milk] ... */
/* then find the rows w/max length name: */
select catOneFood.name, catOneFood.food from
  catOneFood join catAndFoodLen
    on catOneFood.name=catAndFoodLen.name and
       fLen=length(catOneFood.food);

But sql is still terrible, and here's proof. We can filter the results of a GROUPBY using a where-like clause (which is fine) except the name is confusingly changed to HAVING. Jerks. Here's "foods eaten by at least 2 cats", using a stupid HAVING:

select food count(cat_id) from catEats
  group by food   /* sample row: "mice 4" */ 
    having count(cat_id)>1;
    /* the same as a WHERE, removes some rows */

Group-by's can have a WHERE. It filters out rows from the original table, before the group-by happens. A HAVING works after the group-by, on the new rows. That's pretty neat. And it also means I have to take back everything bad I just wrote about HAVING. Of course it's a different keyword since it does a different thing.

This uses WHERE and HAVING to get foods eaten by at least two cats with short names:
/* foods eaten by at least 2 cats with short names: */
select food count(cat_id)
  from catEats
    where length(name)<5
  group by food
    having count(cat_id)>1;

The last thing I can complain about is how count and max can be used without a group-by. This means that when you see select count(*) you can't tell how it's used. In a normal query it replaces the entire table with its length. But in a group-by it counts the lengths of subrows. That's confusing. But we already know we need to read SQL inside-out. So it's a pretty stupid complaint.

I may have found the real problem with SQL: it's too minimal and straight-forward. How are we expected to learn a language without a bunch of wordy boiler-plate and funny quirks?

Who owns what?

In our cat example, the two tables feel different. Cats is a regular list of cats, but catEats isn't a list of foods. It is, but really it's a way of assigning foods to cats. It's a bunch of sublists, each belonging to a cat. SQL diagrams have a special notation for expressing that in the most obscure way possible.

SQL tables are linked through foreign keys -- links into other tables which are effectively pointers. SQL turns that into graph theory, which says all links are just lines. It doesn't matter which table writes down the link -- if 3 foods link to Mittens, that's the same thing as Mittens linking to three foods. Or, preferably, a seperate list of lines.

Graph theory calls the entire set of connections a "relation" and classifies each side based on how many lines can touch an item: either only 1, or any amount. A food links to only 1 cat, but a cat can link to many foods, so Cats to catEats is a "many-to-one" relation. That's a confusing way to describe a simple link, but to make it more confusing, database diagrams draw it as a line with a backwards V-shaped arrow (or with a 3rd line, like a pitchfork). Here's the diagram for the cats database showing the many-one relation:

 cats -------< catEats
 

We're suppose to mentally pull the narrow end of that V back to Cats, representing all of the lines coming from it (or ending at it, depending on your perspective):

  CATS       CATEATS
         /  101 mice
        /   101 catnip
cat #101--  101 bread
        \   ...
         \  ...
         /  102 rice
cat #102 -- 102 shoes
         \  102 mice

To sum up, when we see a funny line with a V, it means the V-end table is owned by the single-end one, the way catEats belongs to Cats. But is that true? Let's make another many-one relation: Cats know what US state they live in. They get a 1-50 foreign key into the States table. Since each cat lives in 1 state, and a state can have many cats, it's technically a many-to-one. The diagram:

states -------< cats -------< catEats
How the SQL adding this looks:
  /* master table for every state */
  create table states ( id int, stateName varchar(16) );
  insert into states (id, stateName) values (0,"catLand");
   ...
  
  /* grow the cat table to have state_id's: */
  alter table cats add state_id int;
  /* start adding state_id's to cats */
  update cats set state_id=1 where id=101;

In theory, this means Cats is no longer a real table -- its true purpose is now to be 50 mini-lists of cats. But that's clearly wrong. The secret is that database charts don't tell you how to think; they just tell you what's possible. We can still think of Cats as a main table. That many-to-one to States means we could group them by state.

Here we can perform a GROUPBY on cats by state, to count the stately cat population:

/* slicing cats into one list per state */
/* counts cats living in each state: */
select states.id, count(*)
  from cats
  group by state_id;

But say we want to print the names of cats living in one particular state. We treat Cats as one big, normal list and check for the state we want:

/* cats living in state #2 */
/* just a simple use of WHERE */
select name from cats where state_id=2;
"cats from states starting with a vowel?" is the same idea, but more complicated
select name from
  cats join states
    on cats.state_id=states.id
  where regexp_like(states.stateName, ^[aeiou]);

This next one is "How many different foods do the cats in each state eat?" (for example [Alaska,1] means every cat in Alaska eats the same thing, whereas [Alabama,20] means a full-service cat-food store there sells 20 typs of food.) As usual, read it from the inside-out:

/* these 1st two lines run last */
/* they replace the state_id with the stateName: */
select stateName, amt from
  states join
  /* this does the work of getting [stateID,food count] */
  (select cats.state_id, count(distinct food) as amt
     from cats join catEats on id=cat_id
     group by state_id) as t2
   on states.id=t2.state_id;

So what does all this mean? It means ORM's are doing something that doesn't need to be done. They organize the database into structures to make some queries easier: should each cat have a list of what it eats, or should each food have a list of cats eating it? Databases don't worry about that -- they store connections agnostically and let the system handle the details.

Links everywhere

Besides having many-to-one relations, we can have one-to-one. But those are boring; all they do is let us split one table into several. The interesting relation is many-to-many. It doesn't seem like much at first since it only means we can draw lines however we like. But it's tricky to create and has some cool uses.

To make one, let's redo the food table to be more of a master list of foods. Pretend each entry has all sort of useful food stats (only vitamin-A, for now):

/* new master list of foods: */
create table foods (
  id int primary key, /* real tables need id's */
  name varchar(20) default 'chicken'
  vitA float /* amount of vitamin A */
);

The old way that we linked items won't work anymore. Before, when a cat ate mice, it created its own mouse entry, pointing back to it. Now all of those links need to come from our single master mouse row. We can't do that -- database rows can't hold expanding lists.

The trick is to create another table for the lines. Make it the obvious way: an entry for each end:

/* table holding the lines between cats and foods: */
create table catToFood (
  cat_id int,
  foreign key (cat_id) references cats(id),

  food_id int,
  foreign key (food_id) references food(id)
);

SQL diagrams don't even show this as a many-to-many. They insist on writing out the middle table. Technically, that table has a many-to-one with Cats and Food. Silly SQL shows it like this:

cats---< catToFood >--- foods

By that logic, we can use just one of these as a regular many-to-one. For example, this would get names of cats who eat tuna (assume we know Tuna is food#3):

/* cats that eat tuna (food id 3): */
select cats.name from
  cats join catToFood on cats.id=catToFood.cat_id
  where catToFood.food_id=3;

Huh. That actually works. It really is a many-to-one. We can probably do the same thing with the many-to-one from foods. For example we can group by foods to see how many cats eat each one:

select food, count(*)
  from foods join catToFood
    on id=food_id
  group by food;

mice  3
bread 6
 ...

It turns out that stupid pair of many-to-one arrows to the middle "lines" table is actually good. It's almost as if decades of thought by people who use databases is better than my uninformed opinion.

But can we just follow lines from cats to foods? Yes. That's just a pair of JOINS:

select cats.name, foods.name from
  cats join catToFood on cats.id=catToFood.cat_id
    /* join the result with foods: */
    join foods on food.id=catToFood.food_id;

Maybe my basic problem with sql is I suck at abstract thinking. If I saw Cats[j].Food[i]==3 in a program, I'd know we were checking for cats which eat food 3. It's a little tougher understanding the above SQL. But not that much tougher.

But this many-to-many gives us a new way to prove data structures are better than datbases, Suppose besides State, cats also record their breed. Now cats have a many-to-one both ways, giving us this picture:

states ------- < cats > ------- breeds
 

Ah-ha! That's the same diagram as an edge-list, but Cats isn't an edgelist! Except, crap, I'm wrong. We can use Cats as an edgelist. States can look at which Breeds live in them, and Breeds can look at which States they occur in, through Cats. That's a freaking edgelist and that stupid SQL diagram just told me something useful.

This will count the number of states where each breed of cat can be found (it's a GROUPBY):

select breedName, count(distinct cats.state_id) from
  breeds join cats on breeds.id=cats.breed_id
    group by breedName; /* <-- grouping by breed */

SQL doesn't have sophisticated logic

I didn't mention this earlier, since it's so obvious -- programs can do math, IF's and other logic that databases simply can't. SQL has length(name) but not much else. All it has is ... gah!! SQL has a ton of built-ins:

/* toLower and toUpper: */
select upper(name) from cats; /* FLUFFY */

/* padding, left and right: */
select lpad(name,10,"-") from cats /* ----fluffy */

/* string replace: */
select replace(name,'f','X') from cats; /* XluXXy */
select replace(name,'ff','Z') from cats; /* fluZy */

/* conditions with strcmp: names past "h": */
select * from cats where strcmp(name, "h")>=0; 

But we still have it beat on math, right? SQL can't find names with an even number of letters. What! It can? Using the same math a program would?:

/* even-length cat names: */
select * from cat where length(name)%2=0;

/* we can even do this (silly) thing: */
select * from cats as c1 join cats as c2
  where c1.id-c2.id>2;
       ^^^^^^^^^^^^^ Are you kidding me?

It even has IF's. This one uses a standard nested-if to print "boat", "ship" or "trawler" depending on a ships type:

/* spreadsheet-style nested value-returning if's: */
select name, if(shipType=0,"boat", if(shipType=1,"ship","trawler")) from fleet;

This silly query uses an IF to choose the test in the WHERE (we get short-name cats from state#1, combined with long-name cats from state#2):

/* an IF to choose the WHERE test: */
select * from cats
  where if(length(name)>5, state_id=1, state_id=2);

Are ORM's better?

Obviously ORM's are better, because it's code. But it turns out that every example shows databases are easier and more flexible at doing database stuff. It's difficult to explain to a novice why ORM's are better despite having no obvious advantages.

So why?

I think things like ASP.net are the key to understanding the ORM idea. ASP lets you write round-trip webpages using only C#. It translates it to and from javascript. The goal is for an ASP.net programmer not to have to learn anything. Of course ASP is super clunky, but there's an appeal to not having to learn new things.

If you've already gone that far to avoid learning any web tech, you clearly don't care about what works the best, and an ORM looks pretty good.

 

 

 

 

 

Comments. or email adminATtaxesforcatses.com