system-design-primer/resources/noat.cards/05 Availability patterns.md

119 lines
5.7 KiB
Markdown
Raw Normal View History

2021-03-14 13:08:05 +03:00
+++
noatcards = True
isdraft = False
2021-03-27 13:16:09 +03:00
weight = 50:weight
2021-03-14 13:08:05 +03:00
+++
# Availability patterns
There are two main patterns to support high availability:fail-over and replication.
2021-03-21 13:45:39 +03:00
## Active-passive (Fail-Over)
2021-03-14 13:08:05 +03:00
With active-passive fail-over, heartbeats are sent between the active and the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active's IP address and resumes service.
The length of downtime is determined by whether the passive server is already running in 'hot' standy or whether it needs to start up from 'cold' standby. Only the active server handles traffic.
Active-passive failover can also be referred to as master-slave failover.
2021-03-21 13:55:58 +03:00
## Active-active (Fail-Over)
2021-03-14 13:08:05 +03:00
In active-active, both servers are managing traffic, spreading the load between them.
If the servers are public-facing, the DNS would need to know about the public IPs of both servers. If the servers are internal-facing, application logic would need to know about both servers.
Active-active failover can also be referred to as master-master failover.
2021-03-21 13:55:58 +03:00
## Disadvantage(s) : failover
2021-03-14 13:08:05 +03:00
2021-03-14 18:13:28 +03:00
- Fail-over adds more hardware and additional complexity.
- There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive.
2021-03-14 13:08:05 +03:00
2021-03-21 13:55:58 +03:00
## Master-slave replication
2021-03-14 13:08:05 +03:00
The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.
2021-03-21 13:55:58 +03:00
![](https://camo.githubusercontent.com/6a097809b9690236258747d969b1d3e0d93bb8ca/687474703a2f2f692e696d6775722e636f6d2f4339696f47746e2e706e67)
2021-03-21 14:33:05 +03:00
[Source: Scalability, availability, stability, patterns](http://www.slideshare.net/jboner/scalability-availability-stability-patterns/)
2021-03-14 13:08:05 +03:00
2021-03-21 13:55:58 +03:00
## Disadvantage(s) : master-slave replication
2021-03-14 13:08:05 +03:00
2021-03-14 18:13:28 +03:00
- Additional logic is needed to promote a slave to a master.
2021-03-21 13:12:09 +03:00
- See [Disadvantage(s) : replication](https://github.com/donnemartin/system-design-primer#disadvantages-replication) for points related to both master-slave and master-master.
2021-03-14 13:08:05 +03:00
2021-03-21 13:55:58 +03:00
## Master-master replication
2021-03-14 13:08:05 +03:00
Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to operate with both reads and writes.
2021-03-21 13:55:58 +03:00
![](https://camo.githubusercontent.com/5862604b102ee97d85f86f89edda44bde85a5b7f/687474703a2f2f692e696d6775722e636f6d2f6b7241484c47672e706e67)
2021-03-21 14:33:05 +03:00
[Source: Scalability, availability, stability, patterns](http://www.slideshare.net/jboner/scalability-availability-stability-patterns/)
2021-03-14 13:08:05 +03:00
2021-03-21 13:55:58 +03:00
## Disadvantage(s) : master-master replication
2021-03-14 13:08:05 +03:00
2021-03-14 18:13:28 +03:00
- You'll need a load balancer or you'll need to make changes to your application logic to determine where to write.
- Most master-master systems are either loosely consistent (violating ACID) or have increased write latency due to synchronization.
- Conflict resolution comes more into play as more write nodes are added and as latency increases.
2021-03-21 13:12:09 +03:00
- See [Disadvantage(s) : replication](https://github.com/donnemartin/system-design-primer#disadvantages-replication) for points related to both master-slave and master-master.
2021-03-14 13:08:05 +03:00
2021-03-21 13:55:58 +03:00
## Disadvantage(s) : replication
2021-03-14 13:08:05 +03:00
2021-03-14 18:13:28 +03:00
- There is a potential for loss of data if the master fails before any newly written data can be replicated to other nodes.
- Writes are replayed to the read replicas. If there are a lot of writes, the read replicas can get bogged down with replaying writes and can't do as many reads.
- The more read slaves, the more you have to replicate, which leads to greater replication lag.
- On some systems, writing to the master can spawn multiple threads to write in parallel, whereas read replicas only support writing sequentially with a single thread.
- Replication adds more hardware and additional complexity.
2021-03-14 13:08:05 +03:00
2021-03-21 13:55:58 +03:00
## Source(s) and further reading: replication
2021-03-14 13:08:05 +03:00
2021-03-14 18:13:28 +03:00
- [Scalability, availability, stability, patterns](http://www.slideshare.net/jboner/scalability-availability-stability-patterns/)
2021-03-27 12:47:14 +03:00
- [Multi-master replication](https://en.wikipedia.org/wiki/Multi-master_replication)
## Availability in numbers
Availability is often quantified by uptime (or downtime) as a percentage of time the service is available. Availability is generally measured in number of 9s--a service with 99.99% availability is described as having four 9s.
## 99.9% availability - three 9s
| Duration | Acceptable downtime|
|---------------------|--------------------|
| Downtime per year | 8h 45min 57s |
| Downtime per month | 43m 49.7s |
| Downtime per week | 10m 4.8s |
| Downtime per day | 1m 26.4s |
## 99.99% availability - four 9s
| Duration | Acceptable downtime|
|---------------------|--------------------|
| Downtime per year | 52min 35.7s |
| Downtime per month | 4m 23s |
| Downtime per week | 1m 5s |
| Downtime per day | 8.6s |
## Availability in parallel vs in sequence
If a service consists of multiple components prone to failure, the service's overall availability depends on whether the components are in sequence or in parallel.
## In sequence
Overall availability decreases when two components with availability < 100% are in sequence:
```
Availability (Total) = Availability (Foo) * Availability (Bar)
```
If both `Foo` and `Bar` each had 99.9% availability, their total availability in sequence would be 99.8%.
## In parallel
Overall availability increases when two components with availability < 100% are in parallel:
```
Availability (Total) = 1 - (1 - Availability (Foo)) * (1 - Availability (Bar))
```
If both `Foo` and `Bar` each had 99.9% availability, their total availability in parallel would be 99.9999%.