### Logistics * Magic code is: * Looking at "nano twitter core" submissions * Excellent progress! Keep it up. * There is still a lot of time but there is also a lot to do! I do glance at the commit history and in some cases there are big lags with no updates. * Each team has received a slack message today from me with a question about your submissions * Remember to update readme with name of who contributed to a feature or change. How can we overcome the "server down" problem? Maybe a staging server? Not sure. * In most cases the functionality that was required for the submission was not there. Believe me a try really hard to find it! * The grade weight of these interim submissions is not super high. But you will not do well if you try to cram it all in at the end!
Scalability Pattern: Database Paritiioning
What are the considerations in deciding whether and how to parition the database?
Considering the code
Minimize the number of times code calls the database (which is usually the same as the number of SQL statements are sent.)
Investigate and know the capabilities of your database system
Check whether there is a bulk operations which will do the job (e.g inserting ten records with one call)
Consider the Schema
Are the right columns indexed? Either too many or too few can be bad for performance, depending on the scenario.
Check whether there are database constraints that you can add
Check whether there are stored procedures that could be useful
Check whether denormalizing might help in some cases
teams
What specific columns in what specific tables would you index or are you indexing?
Scalability Pattern: Database Partitioning
Advanced and central technique to deal with database scaling
It can be performance (how long does an operation take)
Or throughput (how many operations can be done per second)
Or both
Conflicting definitions - Partitioning and Sharding
When you divide a big database into several smaller ones
Partitioning: Horizontal and Vertical
Sharding; Horizontal Partitioning
What’s the problem?
When the database is the bottleneck
Add a second database server
What to do with the data?
Some options
Replication: Put a complete copy of the data on the second db server
Pay attention to read vs. write
What to do about data consistency?
Parition
Scenario
User Database
Happens to often be a monster
Lots of records
Each record with lots of information
Accessed a lot
Schema
User: (id, name, email, biography, hobbies, college, last_login, encrypted_pw, profile_photo_jpg, …)
Vertical partition
Often associated with an SOA
Divide the User table into three different database servers:
User: (id, name, biography, email)
Autentication: (id, last_login, encrypted_pw)
Photos: (id, profile_photo_jpg)
How it changes your application
Pretty basic rearchitecture into separate services
Sharding (Horizontal partition)
“buckets” of users (== shards)
How? Create multiple database servers with
the same schema
different subset or clump of records
Need a way to direct requests to the right “shard”
inspect something about the record
determine what shard to look in
Three common algorithms to decide what bucket gets a record
Range Based: Range of some scalar value (record id, first letter of name, etc.)
List Based: Take some other property (e.g. zipcode, department)
Hash Based: Compute a hash on some value
How it changes your application
Whenever you either read, write, or search
Require to include enough information to pick the right shard
Pros and Cons
Joins become a problem
What was once one db is now spread over more than one db
Can lead to denormalization
Data Integrity
Foreign keys might now point to another database
Databases can get out of sync
NB
Both kinds of sharding are advanced techniques and you should only use them when you have quantitative reasons to believe they will improve a measured performance issue.
NB!
Teams
Work out a plan for sharding your databases. What would you shard, why and how?
Scalability Pattern: Database Caching
Using caching (e.g. redis) to reduce db access
Example: Social Graph
Schema (like all of you have)
User(id, name)
Follow(id, follower_id, following_id)
Content(id, author_id)
Nicely normalized
First, Second and Third Normal form
Origins of the relational database
Queries like:
How many people are following user X?
Who is following user Y?
What are the most recent “n” posts (i.e. content) for user “u”?
What are the most recent “n” posts for users that “u” is following?
But to display each and every user, a join is needed!
Measurement
Ask database system to analyze SQL queries that are slow
Discover that the social graph access was very slow
Discussion
Have you started using redis yet in your projects?
What do your redis keys look like?
How do you compute your cache key?
DB: Caching
Use Network scale caching (Redis) to store and share across servers
count:followers:u = number
count:following:u = number
How to maintain this number?
How important is it that it is correct?
What might make it incorrect?
Add APIs
get_follower_count(user), get_following_count(user), incr_follower_count(user), decr_follower_count(user), incr_following_count(user), decr_following_count(user)
What class has those methods?
Where are they invoked?
Result of queries?
Result of search?
Creating the cache key
What do you store in the cache?
Teams
Discuss and design how you would incorporate this idea in your specific nanoTwitter
Thank you. Questions?
(random Image from picsum.photos)