Database Tuning 1 (Thu Mar 24, lect 18) | previous | next | slides |

How database architecture affects performance and scale and how to improve it.

Logistics

  • Magic code is:
  • Looking at “nano twitter core” submissions
  • Excellent progress! Keep it up.
  • There is still a lot of time but there is also a lot to do! I do glance at the commit history and in some cases there are big lags with no updates.
  • Each team has received a slack message today from me with a question about your submissions
  • Remember to update readme with name of who contributed to a feature or change. How can we overcome the “server down” problem? Maybe a staging server? Not sure.
  • In most cases the functionality that was required for the submission was not there. Believe me a try really hard to find it!
  • The grade weight of these interim submissions is not super high. But you will not do well if you try to cram it all in at the end!

Scalability Pattern: Database Paritiioning

  • What are the considerations in deciding whether and how to parition the database?

Considering the code

  • Minimize the number of times code calls the database (which is usually the same as the number of SQL statements are sent.)
  • Investigate and know the capabilities of your database system
  • Check whether there is a bulk operations which will do the job (e.g inserting ten records with one call)

Consider the Schema

  • Are the right columns indexed? Either too many or too few can be bad for performance, depending on the scenario.
  • Check whether there are database constraints that you can add
  • Check whether there are stored procedures that could be useful
  • Check whether denormalizing might help in some cases
teams What specific columns in what specific tables would you index or are you indexing?

Scalability Pattern: Database Partitioning

  • Advanced and central technique to deal with database scaling
  • It can be performance (how long does an operation take)
  • Or throughput (how many operations can be done per second)
  • Or both

Conflicting definitions - Partitioning and Sharding

  • When you divide a big database into several smaller ones
  • Partitioning: Horizontal and Vertical
  • Sharding; Horizontal Partitioning

What’s the problem?

  • When the database is the bottleneck
  • Add a second database server
  • What to do with the data?

Some options

  • Replication: Put a complete copy of the data on the second db server
  • Pay attention to read vs. write
  • What to do about data consistency?
  • Parition

Scenario

  • User Database
    1. Happens to often be a monster
    2. Lots of records
    3. Each record with lots of information
    4. Accessed a lot
  • Schema
    1. User: (id, name, email, biography, hobbies, college, last_login, encrypted_pw, profile_photo_jpg, …)

Vertical partition

  • Often associated with an SOA
  • Divide the User table into three different database servers:
    1. User: (id, name, biography, email)
    2. Autentication: (id, last_login, encrypted_pw)
    3. Photos: (id, profile_photo_jpg)
  • How it changes your application
  • Pretty basic rearchitecture into separate services

Sharding (Horizontal partition)

  • “buckets” of users (== shards)
  • How? Create multiple database servers with
    1. the same schema
    2. different subset or clump of records
  • Need a way to direct requests to the right “shard”
    1. inspect something about the record
    2. determine what shard to look in

Three common algorithms to decide what bucket gets a record

  • Range Based: Range of some scalar value (record id, first letter of name, etc.)
  • List Based: Take some other property (e.g. zipcode, department)
  • Hash Based: Compute a hash on some value
  • How it changes your application
    1. Whenever you either read, write, or search
    2. Require to include enough information to pick the right shard

Pros and Cons

  • Joins become a problem
    1. What was once one db is now spread over more than one db
    2. Can lead to denormalization
  • Data Integrity
    1. Foreign keys might now point to another database
    2. Databases can get out of sync
NBBoth kinds of sharding are advanced techniques and you should only use them when you have quantitative reasons to believe they will improve a measured performance issue.

NB!

TeamsWork out a plan for sharding your databases. What would you shard, why and how?

Scalability Pattern: Database Caching

  • Using caching (e.g. redis) to reduce db access

Example: Social Graph

  • Schema (like all of you have)
    1. User(id, name)
    2. Follow(id, follower_id, following_id)
    3. Content(id, author_id)
  • Nicely normalized
    1. First, Second and Third Normal form
    2. Origins of the relational database
  • Queries like:
    1. How many people are following user X?
    2. Who is following user Y?
    3. What are the most recent “n” posts (i.e. content) for user “u”?
    4. What are the most recent “n” posts for users that “u” is following?
  • But to display each and every user, a join is needed!

Measurement

  • Ask database system to analyze SQL queries that are slow
    1. Discover that the social graph access was very slow
  • Discussion
    1. Have you started using redis yet in your projects?
    2. What do your redis keys look like?
    3. How do you compute your cache key?

DB: Caching

  • Use Network scale caching (Redis) to store and share across servers
    1. count:followers:u = number
    2. count:following:u = number
  • How to maintain this number?
  • How important is it that it is correct?
  • What might make it incorrect?

Add APIs

  • get_follower_count(user), get_following_count(user), incr_follower_count(user), decr_follower_count(user), incr_following_count(user), decr_following_count(user)
  • What class has those methods?
  • Where are they invoked?
  • Result of queries?
  • Result of search?
  • Creating the cache key
  • What do you store in the cache?
Teams Discuss and design how you would incorporate this idea in your specific nanoTwitter

Thank you. Questions?  (random Image from picsum.photos)