Database Tuning 1 (Thu Mar 24, lect 18) | previous | next | slides |

How database architecture affects performance and scale and how to improve it.

Logistics

Magic code is:
Looking at “nano twitter core” submissions
Excellent progress! Keep it up.
There is still a lot of time but there is also a lot to do! I do glance at the commit history and in some cases there are big lags with no updates.
Each team has received a slack message today from me with a question about your submissions
Remember to update readme with name of who contributed to a feature or change. How can we overcome the “server down” problem? Maybe a staging server? Not sure.
In most cases the functionality that was required for the submission was not there. Believe me a try really hard to find it!
The grade weight of these interim submissions is not super high. But you will not do well if you try to cram it all in at the end!

Scalability Pattern: Database Paritiioning

What are the considerations in deciding whether and how to parition the database?

Considering the code

Minimize the number of times code calls the database (which is usually the same as the number of SQL statements are sent.)
Investigate and know the capabilities of your database system
Check whether there is a bulk operations which will do the job (e.g inserting ten records with one call)

Consider the Schema

Are the right columns indexed? Either too many or too few can be bad for performance, depending on the scenario.
Check whether there are database constraints that you can add
Check whether there are stored procedures that could be useful
Check whether denormalizing might help in some cases

teams What specific columns in what specific tables would you index or are you indexing?

Scalability Pattern: Database Partitioning

Advanced and central technique to deal with database scaling
It can be performance (how long does an operation take)
Or throughput (how many operations can be done per second)
Or both

Conflicting definitions - Partitioning and Sharding

When you divide a big database into several smaller ones
Partitioning: Horizontal and Vertical
Sharding; Horizontal Partitioning

What’s the problem?

When the database is the bottleneck
Add a second database server
What to do with the data?

Some options

Replication: Put a complete copy of the data on the second db server
Pay attention to read vs. write
What to do about data consistency?
Parition

Scenario

User Database
1. Happens to often be a monster
2. Lots of records
3. Each record with lots of information
4. Accessed a lot
Schema
1. User: (id, name, email, biography, hobbies, college, last_login, encrypted_pw, profile_photo_jpg, …)

Vertical partition

Often associated with an SOA
Divide the User table into three different database servers:
1. User: (id, name, biography, email)
2. Autentication: (id, last_login, encrypted_pw)
3. Photos: (id, profile_photo_jpg)
How it changes your application
Pretty basic rearchitecture into separate services

Sharding (Horizontal partition)

“buckets” of users (== shards)
How? Create multiple database servers with
1. the same schema
2. different subset or clump of records
Need a way to direct requests to the right “shard”
1. inspect something about the record
2. determine what shard to look in

Three common algorithms to decide what bucket gets a record

Range Based: Range of some scalar value (record id, first letter of name, etc.)
List Based: Take some other property (e.g. zipcode, department)
Hash Based: Compute a hash on some value
How it changes your application
1. Whenever you either read, write, or search
2. Require to include enough information to pick the right shard

Pros and Cons

Joins become a problem
1. What was once one db is now spread over more than one db
2. Can lead to denormalization
Data Integrity
1. Foreign keys might now point to another database
2. Databases can get out of sync

NBBoth kinds of sharding are advanced techniques and you should only use them when you have quantitative reasons to believe they will improve a measured performance issue.

NB!

TeamsWork out a plan for sharding your databases. What would you shard, why and how?

Scalability Pattern: Database Caching

Using caching (e.g. redis) to reduce db access

Schema (like all of you have)
1. User(id, name)
2. Follow(id, follower_id, following_id)
3. Content(id, author_id)
Nicely normalized
1. First, Second and Third Normal form
2. Origins of the relational database
Queries like:
1. How many people are following user X?
2. Who is following user Y?
3. What are the most recent “n” posts (i.e. content) for user “u”?
4. What are the most recent “n” posts for users that “u” is following?
But to display each and every user, a join is needed!

Measurement

Ask database system to analyze SQL queries that are slow
1. Discover that the social graph access was very slow
Discussion
1. Have you started using redis yet in your projects?
2. What do your redis keys look like?
3. How do you compute your cache key?

DB: Caching

Use Network scale caching (Redis) to store and share across servers
1. count:followers:u = number
2. count:following:u = number
How to maintain this number?
How important is it that it is correct?
What might make it incorrect?

Add APIs

get_follower_count(user), get_following_count(user), incr_follower_count(user), decr_follower_count(user), incr_following_count(user), decr_following_count(user)
What class has those methods?
Where are they invoked?
Result of queries?
Result of search?
Creating the cache key
What do you store in the cache?

Teams Discuss and design how you would incorporate this idea in your specific nanoTwitter

Database Tuning 1 (Thu Mar 24, lect 18) | previous | next | slides |

Logistics

Scalability Pattern: Database Paritiioning

Considering the code

Consider the Schema

Scalability Pattern: Database Partitioning

Conflicting definitions - Partitioning and Sharding

What’s the problem?

Some options

Scenario

Vertical partition

Sharding (Horizontal partition)

Three common algorithms to decide what bucket gets a record

Pros and Cons

NB!

Scalability Pattern: Database Caching

Example: Social Graph

Measurement

DB: Caching

Add APIs

Thank you. Questions? (random Image from picsum.photos)