Marko Ćilimković, Friday, March 4, 2016

There comes a time in the project lifecycle when everything is working fine, but the stress test gives really bad results, especially on the web pages where a query traverses a jungle of rows in the database. What should you do then? The majority of blog posts, articles and forums suggest adding an index on a column in the table, but the theory behind adding indexes goes much deeper. In this blog post, I'll explain what to do and what not to do in different scenarios. Have you ever wondered how much a well-placed index speeds up your queries?

Welcome to the Jungle

Have you ever started studying just a couple of days before a test? You woke up in the morning, decided to go to a quiet place, and remembered that there's a really cool library just a couple of blocks away from your home. Of course, there are also tons of books with the information waiting to be found. (Nowadays there's Google, so a lot of you don't know what I'm talking about, but there's a point to this, I promise ;) )

You come to the library, ask the librarian if they have a book about Einstein's Cosmological Constant, and the librarian tells you they do, but you need to find it somewhere in the library. Naturally, you go to the beginning of the library, to the first or second row of bookshelves, looking for where the books labeled with "E" start, and realize...the books aren't sorted alphabetically, nor by color, nor in any other way. The day passes and you fail your test.

That probably never happened to anyone, because we all know about Einstein's Cosmological Constant, but also because all libraries have their books sorted in one way or another (...I'm gonna find you ;) ). That's the sole purpose of indexes in database tables: to help your database find any element faster than it normally would without them.

The Theory of Everything

In the programming world, an index is a database construct that provides faster searching. Following that logic, we could safely assume that adding indexes on all columns would make the database as fast as possible, right?

Not really. Since indexes speed up lookup queries, the database has to do additional work whenever a row is inserted or updated, so creating and updating rows becomes slower. Having too many indexes can also be compared to real-life situations: with too many details and instructions on how to find something, you can easily get confused, forget a piece of information or prioritize poorly. It's the same with your database.

To index, or not to index?

It's hard to say when to index, except in these situations:

  1. foreign keys - always add indexes on foreign keys. Associating two tables with a foreign key isn't going to be fast enough without an index (see the migration sketch after this list).
  2. polymorphic associations - two tables associated with each other through a third table that contains only the respective ids (again, foreign keys)
  3. often used columns - if there's a column in your table that you use in a lot of your queries for sorting, or just for fetching the data
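For illustration, here's a minimal Rails migration sketch that covers the first two cases: a hypothetical join table whose foreign keys both get an index (all the names here are made up for the example):

class CreateAuthorships < ActiveRecord::Migration
  def change
    create_table :authorships do |t|
      # index: true creates a b-tree index on the foreign key column
      t.references :author, index: true
      t.references :book, index: true
    end
  end
end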

Don't index the table if you find yourself in one of these situations:

  1. a small number of fields - if the table has just a few fields, it may not benefit from an index when a large percentage of its rows is returned most of the time
  2. small static tables - these are faster to read with a table scan than an index scan, since on such a small scale it's less work for the database
  3. many indexes on a small table - if the physical size of the table is small but you have indexes on many columns, you should remove them (leave the ones on primary and foreign keys)

Dissecting an index

Explaining the bits and pieces of an index can be very confusing and probably isn't crucial for a web developer to understand, but if you're interested, the guide to database performance for developers is a really good free online book that covers the theory behind indexes and SQL in general.

Indexes are, in fact, constructs: physical parts of the database that take up disk space and are redundant in the sense that they hold an indexed copy of the table's data.

Most databases support several index types with different algorithms, each tailored for a specific type of query.

Balanced tree index

The B-tree index is the primary index type and the one used by default in PostgreSQL. It is best used for queries that return a set of records based on range, ordering or equality. Basically, any query that filters with one of these operators can be served by a b-tree: <, <=, >, >=, =.

It's also well suited for handling queries that involve NULL values, and it works with all datatypes. The algorithm keeps the tree balanced, so the amount of data is approximately the same on both sides of the tree.
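Since b-tree is the default, a plain add_index in a Rails migration gives you one. A small sketch (the released column is borrowed from the examples later in this post):

# add_index creates a b-tree index unless told otherwise
add_index :books, :released

# all of these queries can then be served by that index
Book.where("released >= ?", 1.year.ago)  # range
Book.order(:released)                    # ordering
Book.where(released: Date.today)         # equality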

Generalized search tree

GiST indexes are best used for building balanced tree structures over data that goes beyond range and equality comparisons, and for performing full-text searches. They are lossy, however, meaning that the index may produce false matches, and when that happens the actual table row has to be checked to resolve them.

Generalized inverted indexes

GIN indexes are optimized to handle data types where many values are associated with a single key, like arrays, as well as full-text searches. Unlike GiST, GIN is not lossy, but its performance depends on the actual number of unique words in the table, which means it's better suited for standard queries.

Most developers get confused when they see that both GiST and GIN are used for full-text searches, but knowing the differences between them can have a substantial impact on your query response times.

[Figure: GiST vs. GIN pros and cons]
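In Rails (4 and up), you can pick the index type in a migration with the using option; leave it out and you get a b-tree. A hedged sketch, assuming a documents table with a tsvector column named tsv (both hypothetical):

# pick one depending on your read/write patterns
add_index :documents, :tsv, using: :gin   # slower to build, faster to search
add_index :documents, :tsv, using: :gist  # faster to build and update, but lossy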

It's a trap!

Some of the databases out there (*cough* MySQL *cough*) can use only one index per table in a single query, but sometimes you need to run a query that, for example, fetches all books that are available and orders them alphabetically. Remembering that will help you immensely in optimizing your query times. The fact that you can tell MySQL which index to use is even more useful, because now you can go back to all your past projects and optimize the queries...OR you can continue reading this blog post :)
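On MySQL, one way to pass such a hint from Rails is through a raw FROM clause; a sketch under the assumption that an index named index_books_on_available exists (the name is made up for this example):

# USE INDEX is a MySQL-specific hint; the index name here is hypothetical
Book.from("books USE INDEX (index_books_on_available)")
    .where(available: true)
    .order(:title)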

Another way to improve the query time for the example above is to use compound indexes, which index across two or more columns. But there's a pitfall to this approach as well: the query times will differ depending on whether you construct your compound index with the available column first or the title column first. Deciding which column to put first is sometimes a tough job, but you can test it and see which column has more distinct values:

SELECT COUNT(DISTINCT name), COUNT(DISTINCT created_at) FROM books;

 count | count
-------+-------
   750 | 23564

Basically, you need to figure out which queries occur in your application more often, and then architect your compound indexes based on that.

Pics or it Didn't Happen

Why don't we roll up our sleeves and check them indexes in action? How much can we speed up our application just by using indexes?

Disclaimer: when doing application optimization, this is only one of many steps you should take to make your app snappy, but we are scientists, and we are experimenting with indexes only :)

So, we need an idea for an app that we'll test, and a decision on how much data to insert into the database. Let's keep it simple and create a library app that is going to have 100,000 authors and 1,000,000 books.

The good thing about this is that we only need the models, so relax and forget about all the controllers and views :)

After running these generators and their migrations

rails generate model Author name:string
rails generate model Book title:string author_id:integer available:boolean released:date

we get our basic models

# app/models/author.rb
class Author < ActiveRecord::Base
  has_many :books
end

# app/models/book.rb
class Book < ActiveRecord::Base
  belongs_to :author
end

and now we populate our PostgreSQL database with some data

# db/seeds.rb (uses the faker and activerecord-import gems)
authors = []
100_000.times do
  authors << Author.new(name: Faker::Book.author)
end
Author.import authors

author_ids = Author.all.map(&:id)

books = []
1_000_000.times do
  books << Book.new(title: Faker::Book.title,
                    available: rand(0..9) == 0 ? false : true, # roughly 9:1 available to rented
                    released: Faker::Date.between(10.years.ago, Date.today),
                    author_id: author_ids.sample)
end
Book.import books

I also fixed the ratio of available books to make it more realistic: I assumed there are far more available books than rented ones (9:1).

Now that we have our data, let's jump into our rails console and check some query times.

Importance of foreign key indexing

You might have noticed that I didn't add an index on the author_id field while generating the Book model. I left it plain, without an index, so we can compare the response times. After we gather our non-indexed time, we'll add the index (the migration takes about 2 seconds to finish) and then rerun the query.
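The migration itself is a one-liner (the class name is illustrative):

class AddIndexToBooksAuthorId < ActiveRecord::Migration
  def change
    add_index :books, :author_id
  end
end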

Book.where(author_id: 125000).explain
#Without the index
Book Load (110.8ms) SELECT "books".* FROM "books" WHERE "books"."author_id" = $1 [["author_id", 125000]]
=> EXPLAIN for: SELECT "books".* FROM "books" WHERE "books"."author_id" = $1 [["author_id", 125000]]
QUERY PLAN
-----------------------------------------------------------
Seq Scan on books (cost=0.00..22578.00 rows=11 width=47)
Filter: (author_id = 125000)

#With the index
Book Load (0.5ms) SELECT "books".* FROM "books" WHERE "books"."author_id" = $1 [["author_id", 125000]]
=> EXPLAIN for: SELECT "books".* FROM "books" WHERE "books"."author_id" = $1 [["author_id", 125000]]
QUERY PLAN
----------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=4.51..47.56 rows=11 width=47)
Recheck Cond: (author_id = 125000)
-> Bitmap Index Scan on index_books_on_author_id (cost=0.00..4.51 rows=11 width=0)
Index Cond: (author_id = 125000)

It's an incredible drop from 110ms to 0.5ms. That's why everyone is talking about adding indexes on foreign keys :)

Compound indexes

What if a visitor comes to the library and asks specifically for books released on a particular date? We then need to run a query that checks for books released on that date and makes sure they're available.

Without an index the query times look like this:

Book.where(available: true, released: 2.years.ago).explain
Book Load (126.9ms) SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:45:44.604341"]]
=> EXPLAIN for: SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:45:44.604341"]]
QUERY PLAN
------------------------------------------------------------
Seq Scan on books (cost=0.00..27308.00 rows=259 width=47)
Filter: (available AND (released = '2014-03-03'::date))

126ms...that's around 120ms too much.

We have 3 options now (sketched as migrations right after this list):

  1. add an index on released and available separately
  2. add a compound index on both fields, where released is first
  3. add a compound index on both fields, where available is first
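In migration terms, the three options would look roughly like this:

# 1. separate indexes on each field
add_index :books, :released
add_index :books, :available

# 2. compound index, released first
add_index :books, [:released, :available]

# 3. compound index, available first
add_index :books, [:available, :released]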
#index on both fields separately

Book.where(available: true, released: 2.years.ago).explain
Book Load (2.4ms) SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:46:37.761993"]]
=> EXPLAIN for: SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:46:37.761993"]]
QUERY PLAN
----------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=6.49..963.48 rows=259 width=47)
Recheck Cond: (released = '2014-03-03'::date)
Filter: available
-> Bitmap Index Scan on index_books_on_released (cost=0.00..6.43 rows=267 width=0)
Index Cond: (released = '2014-03-03'::date)
#compound index, released field first

Book.where(available: true, released: 2.years.ago).explain
Book Load (2.0ms) SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:47:39.602750"]]
=> EXPLAIN for: SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:47:39.602750"]]
QUERY PLAN
------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=7.08..936.75 rows=259 width=47)
Recheck Cond: (released = '2014-03-03'::date)
Filter: available
-> Bitmap Index Scan on index_books_on_released_and_available (cost=0.00..7.01 rows=259 width=0)
Index Cond: ((released = '2014-03-03'::date) AND (available = true))
#compound index, available field first

Book.where(available: true, released: 2.years.ago).explain
Book Load (1.9ms) SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:48:14.801610"]]
=> EXPLAIN for: SELECT "books".* FROM "books" WHERE "books"."available" = $1 AND "books"."released" = $2 [["available", "t"], ["released", "2014-03-03 08:48:14.801610"]]
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Index Scan using index_books_on_available_and_released on books (cost=0.42..513.44 rows=259 width=47)
Index Cond: ((available = true) AND (released = '2014-03-03'::date))
Filter: available

We can see that the best option is the compound index with the available field first, not only because its time is the smallest, but also because its query cost is the lowest!

In search of authors

A really useful feature in apps like these is the search functionality. If someone is searching for text (not ids, dates, etc.), it's best to use one of the index types tailored for full-text searches, such as GiST or GIN. But let's also check how a b-tree performs, as well as no index at all.
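Since '%John%' starts with a wildcard, a plain b-tree can't do much with it; one common way to build GIN and GiST indexes for this kind of pattern matching is the trigram operator classes from the pg_trgm extension. A sketch of such a migration (index names are illustrative, and you'd create one index at a time to measure each fairly):

class AddTrigramIndexesToAuthors < ActiveRecord::Migration
  def up
    enable_extension "pg_trgm" # provides gin_trgm_ops and gist_trgm_ops
    execute "CREATE INDEX index_authors_on_name_gin ON authors USING gin (name gin_trgm_ops);"
    execute "CREATE INDEX index_authors_on_name_gist ON authors USING gist (name gist_trgm_ops);"
  end

  def down
    execute "DROP INDEX index_authors_on_name_gist;"
    execute "DROP INDEX index_authors_on_name_gin;"
    disable_extension "pg_trgm"
  end
end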

#without an index
Author.where("name LIKE ?", '%John%').explain
Author Load (210.8ms) SELECT "authors".* FROM "authors" WHERE (name LIKE '%John%')
=> EXPLAIN for: SELECT "authors".* FROM "authors" WHERE (name LIKE '%John%')
QUERY PLAN
--------------------------------------------------------------
Seq Scan on authors (cost=0.00..2081.00 rows=1003 width=35)
Filter: ((name)::text ~~ '%John%'::text)
#b-tree index
Author.where("name LIKE ?", '%John%').explain
Author Load (25.5ms) SELECT "authors".* FROM "authors" WHERE (name LIKE '%John%')
=> EXPLAIN for: SELECT "authors".* FROM "authors" WHERE (name LIKE '%John%')
QUERY PLAN
--------------------------------------------------------------
Seq Scan on authors (cost=0.00..2081.00 rows=1003 width=35)
Filter: ((name)::text ~~ '%John%'::text)
#GIN index
Author.where("name LIKE ?", '%John%').explain
Author Load (17.3ms) SELECT "authors".* FROM "authors" WHERE (name LIKE '%John%')
=> EXPLAIN for: SELECT "authors".* FROM "authors" WHERE (name LIKE '%John%')
QUERY PLAN
--------------------------------------------------------------
Seq Scan on authors (cost=0.00..2081.00 rows=1003 width=35)
Filter: ((name)::text ~~ '%John%'::text)
#GiST index
Author.where("name LIKE ?", '%Smi%').explain
Author Load (36.5ms) SELECT "authors".* FROM "authors" WHERE (name LIKE '%Smi%')
=> EXPLAIN for: SELECT "authors".* FROM "authors" WHERE (name LIKE '%Smi%')
QUERY PLAN
------------------------------------------------------------
Seq Scan on authors (cost=0.00..2081.00 rows=17 width=35)
Filter: ((name)::text ~~ '%Smi%'::text)

As mentioned in the sections above, the theory is confirmed: GIN is better for full-text searches where there are more than 100,000 unique words. And since it's a library, I don't think there's much write traffic going on, so it's not a problem if creating and updating records takes a bit longer.

The Takeaway

Keep in mind that indexing can be backbreaking work, and most of the time you won't see much difference, but it is essential for applications with big databases and a lot of traffic, so it's good to have an ace or two up your sleeve. Most importantly, remember to:

  • not add indexes automatically during development (except for foreign keys and join tables that only contain foreign keys)
  • always index the primary and foreign keys
  • consider indexing columns that are frequently accessed by the WHERE, ORDER BY, GROUP BY, TOP and DISTINCT clauses
  • identify what queries are run and how often they are run

A highly efficient way to optimize your database is to test out different index types on the most commonly run queries. Only by testing different indexing strategies can you be sure that you've fully optimized your database.


