Logistics
- Magic code is:
- Looking at “nano twitter core” submissions
- Excellent progress! Keep it up.
- There is still a lot of time, but there is also a lot to do! I do glance at the commit history, and in some cases there are big gaps with no updates.
- Each team received a Slack message from me today with a question about your submissions
- Remember to update the README with the name of whoever contributed to a feature or change. How can we overcome the “server down” problem? Maybe a staging server? Not sure.
- In most cases the functionality that was required for the submission was not there. Believe me, I try really hard to find it!
- The grade weight of these interim submissions is not super high. But you will not do well if you try to cram it all in at the end!
Scalability Pattern: Database Partitioning
- What are the considerations in deciding whether and how to partition the database?
Considering the code
- Minimize the number of times your code calls the database (which is usually the same as the number of SQL statements sent)
- Investigate and know the capabilities of your database system
- Check whether there is a bulk operation that will do the job (e.g., inserting ten records with one call)
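As a minimal sketch of the bulk-operation point, using Python's built-in sqlite3 module (the table and column names here are made up for illustration):

```python
import sqlite3

# In-memory database just for the demo
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

rows = [(i, f"user{i}") for i in range(10)]

# One executemany call sends all ten inserts at once,
# instead of ten separate execute() round trips.
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 10
```

Most database drivers and ORMs have an equivalent (multi-row `INSERT ... VALUES`, `COPY`, `insert_all`, etc.); check what yours supports.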
Consider the Schema
- Are the right columns indexed? Either too many or too few can be bad for performance, depending on the scenario.
- Check whether there are database constraints that you can add
- Check whether there are stored procedures that could be useful
- Check whether denormalizing might help in some cases
Teams: What specific columns in what specific tables would you index, or are you indexing?
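One way to check whether an index is actually used is to ask the database for its query plan. A small sketch with sqlite3, assuming a hypothetical `follows` table where lookups filter on `following_id`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE follows (id INTEGER PRIMARY KEY, follower_id INTEGER, following_id INTEGER)"
)

# Hypothetical choice: queries like "who follows user X?" filter on
# following_id, so that column is a candidate for an index.
conn.execute("CREATE INDEX idx_follows_following ON follows (following_id)")

# EXPLAIN QUERY PLAN reports whether SQLite will use the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT follower_id FROM follows WHERE following_id = ?",
    (42,),
).fetchone()
print(plan[-1])  # detail string mentions the index when it is used
```

Other systems have the same idea under a different name (`EXPLAIN` in PostgreSQL/MySQL); measuring the plan beats guessing.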
Scalability Pattern: Database Partitioning
- An advanced but central technique for dealing with database scaling
- The concern can be performance (how long an operation takes)
- Or throughput (how many operations can be done per second)
- Or both
Conflicting definitions - Partitioning and Sharding
- When you divide a big database into several smaller ones
- Partitioning: Horizontal and Vertical
- Sharding: Horizontal Partitioning
What’s the problem?
- When the database is the bottleneck
- Add a second database server
- What to do with the data?
Some options
- Replication: Put a complete copy of the data on the second db server
- Pay attention to read vs. write
- What to do about data consistency?
- Partition
Scenario
- User Database
- Often turns out to be a monster
- Lots of records
- Each record with lots of information
- Accessed a lot
- Schema
- User: (id, name, email, biography, hobbies, college, last_login, encrypted_pw, profile_photo_jpg, …)
Vertical partition
- Often associated with an SOA
- Divide the User table into three different database servers:
- User: (id, name, biography, email)
- Authentication: (id, last_login, encrypted_pw)
- Photos: (id, profile_photo_jpg)
- How it changes your application
- Pretty basic rearchitecture into separate services
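The split above can be sketched as data movement: one wide record is divided column-wise across per-service stores that share only the id. The service names and field values below are assumptions for illustration:

```python
# A wide user record, as in the schema above
user = {
    "id": 1, "name": "alice", "email": "a@example.com", "biography": "...",
    "last_login": "2024-01-01", "encrypted_pw": "xxxx", "profile_photo_jpg": b"...",
}

# Each partition lives on its own database server, behind its own service
PARTITIONS = {
    "user_service": ["id", "name", "biography", "email"],
    "auth_service": ["id", "last_login", "encrypted_pw"],
    "photo_service": ["id", "profile_photo_jpg"],
}

def vertical_partition(record, partitions):
    """Split one record into per-service sub-records, keyed by the shared id."""
    return {svc: {col: record[col] for col in cols}
            for svc, cols in partitions.items()}

parts = vertical_partition(user, PARTITIONS)
print(sorted(parts["auth_service"]))  # ['encrypted_pw', 'id', 'last_login']
```

The application-level cost is that anything needing fields from two partitions now makes two service calls instead of one query.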
Sharding (Horizontal partition)
- “buckets” of users (== shards)
- How? Create multiple database servers with
- the same schema
- different subset or clump of records
- Need a way to direct requests to the right “shard”
- inspect something about the record
- determine what shard to look in
Three common algorithms to decide what bucket gets a record
- Range Based: Range of some scalar value (record id, first letter of name, etc.)
- List Based: Take some other property (e.g. zipcode, department)
- Hash Based: Compute a hash on some value
- How it changes your application
- Whenever you read, write, or search
- You must include enough information to pick the right shard
Pros and Cons
- Joins become a problem
- What was once one db is now spread over more than one db
- Can lead to denormalization
- Data Integrity
- Foreign keys might now point to another database
- Databases can get out of sync
NB! Both kinds of partitioning are advanced techniques, and you should only use them when you have quantitative reasons to believe they will improve a measured performance problem.
Teams: Work out a plan for sharding your databases. What would you shard, why, and how?
Scalability Pattern: Database Caching
- Using caching (e.g. redis) to reduce db access
Example: Social Graph
- Schema (like all of you have)
- User(id, name)
- Follow(id, follower_id, following_id)
- Content(id, author_id)
- Nicely normalized
- First, Second and Third Normal form
- Origins of the relational database
- Queries like:
- How many people are following user X?
- Who is following user Y?
- What are the most recent “n” posts (i.e. content) for user “u”?
- What are the most recent “n” posts for users that “u” is following?
- But to display each and every user, a join is needed!
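For instance, "who is following user Y?" already needs a join between Follow and User just to show names. A sketch of that query with sqlite3 and a tiny sample of the schema above (names and rows made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (id INTEGER PRIMARY KEY,
                          follower_id INTEGER, following_id INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO follows (follower_id, following_id) VALUES (2, 1), (3, 1);
""")

# "Who is following user 1?" -- the Follow rows alone only give ids,
# so a join against users is needed to display anything human-readable.
names = [row[0] for row in conn.execute("""
    SELECT u.name
    FROM follows f JOIN users u ON u.id = f.follower_id
    WHERE f.following_id = 1
    ORDER BY u.name
""")]
print(names)  # ['bob', 'carol']
```

Run on every profile page view, this join is exactly the kind of repeated work a cache can absorb.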
Measurement
- Ask database system to analyze SQL queries that are slow
- Discover that the social graph access was very slow
- Discussion
- Have you started using redis yet in your projects?
- What do your redis keys look like?
- How do you compute your cache key?
DB: Caching
- Use Network scale caching (Redis) to store and share across servers
count:followers:u = number
count:following:u = number
- How to maintain this number?
- How important is it that it is correct?
- What might make it incorrect?
Add APIs
- get_follower_count(user), get_following_count(user), incr_follower_count(user), decr_follower_count(user), incr_following_count(user), decr_following_count(user)
- What class has those methods?
- Where are they invoked?
- Result of queries?
- Result of search?
- Creating the cache key
- What do you store in the cache?
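A minimal sketch of the counter-cache API above, using the `count:followers:<id>` / `count:following:<id>` key scheme from the slide. A plain dict stands in for Redis here; with redis-py the same methods would call `get`, `incr`, and `decr` on the same keys:

```python
class FollowCounts:
    """Counter cache for follower/following counts.

    An in-memory dict plays the role of Redis for this sketch.
    """

    def __init__(self):
        self.store = {}

    def _key(self, kind, user_id):
        # Cache key scheme from the slide: count:followers:<id>, count:following:<id>
        return f"count:{kind}:{user_id}"

    def _get(self, kind, user_id):
        return self.store.get(self._key(kind, user_id), 0)

    def _add(self, kind, user_id, delta):
        k = self._key(kind, user_id)
        self.store[k] = self.store.get(k, 0) + delta

    def get_follower_count(self, user_id):  return self._get("followers", user_id)
    def get_following_count(self, user_id): return self._get("following", user_id)
    def incr_follower_count(self, user_id):  self._add("followers", user_id, 1)
    def decr_follower_count(self, user_id):  self._add("followers", user_id, -1)
    def incr_following_count(self, user_id): self._add("following", user_id, 1)
    def decr_following_count(self, user_id): self._add("following", user_id, -1)

# Invoked wherever a Follow record is created: e.g. bob follows alice
counts = FollowCounts()
counts.incr_follower_count("alice")
counts.incr_following_count("bob")
print(counts.get_follower_count("alice"))   # 1
print(counts.get_following_count("bob"))    # 1
```

The hard part is not this class but calling it in exactly the places where Follow records are created and destroyed, and deciding how wrong the number is allowed to be when a call is missed.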
Teams: Discuss and design how you would incorporate this idea in your specific nanoTwitter
Thank you. Questions?