revert to origin

This commit is contained in:
Vu
2021-03-26 23:50:38 +07:00
parent c0531421c8
commit 9b92b8963b
45 changed files with 3384 additions and 3384 deletions

View File

@@ -1,6 +1,6 @@
# 设计 Mint.com
**注意:这个文档中的链接会直接指向[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题索引) 中的有关部分,以避免重复的内容。您可以参考链接的相关内容,来了解其总的要点、方案的权衡取舍以及可选的替代方案。**
**注意:这个文档中的链接会直接指向[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题索引)中的有关部分,以避免重复的内容。您可以参考链接的相关内容,来了解其总的要点、方案的权衡取舍以及可选的替代方案。**
## 第一步:简述用例与约束条件
@@ -80,7 +80,7 @@
> 列出所有重要组件以规划概要设计。
![Imgur](http://i.imgur.com/E8klrBh.png)
![Imgur](http://i.imgur.com/E8klrBh.png)
## 第三步:设计核心组件
@@ -88,9 +88,9 @@
### 用例:用户连接到一个财务账户
我们可以将 1000 万用户的信息存储在一个[关系数据库](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) 中。我们应该讨论一下[选择SQL或NoSQL之间的用例和权衡](https://github.com/donnemartin/system-design-primer#sql-or-nosql) 了。
我们可以将 1000 万用户的信息存储在一个[关系数据库](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)中。我们应该讨论一下[选择SQL或NoSQL之间的用例和权衡](https://github.com/donnemartin/system-design-primer#sql-or-nosql)了。
* **客户端** 作为一个[反向代理](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server) ,发送请求到 **Web 服务器**
* **客户端** 作为一个[反向代理](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server),发送请求到 **Web 服务器**
* **Web 服务器** 转发请求到 **账户API** 服务器
* **账户API** 服务器将新输入的账户信息更新到 **SQL数据库**`accounts`
@@ -106,13 +106,13 @@ account_url varchar(255) NOT NULL
account_login varchar(32) NOT NULL
account_password_hash char(64) NOT NULL
user_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
```
我们将在`id``user_id``created_at`等字段上创建一个[索引](https://github.com/donnemartin/system-design-primer#use-good-indices) 以加速查找(对数时间而不是扫描整个表)并保持数据在内存中。从内存中顺序读取 1 MB数据花费大约250毫秒而从SSD读取是其4倍从磁盘读取是其80倍。<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
我们将在`id``user_id``created_at`等字段上创建一个[索引](https://github.com/donnemartin/system-design-primer#use-good-indices)以加速查找(对数时间而不是扫描整个表)并保持数据在内存中。从内存中顺序读取 1 MB数据花费大约250毫秒而从SSD读取是其4倍从磁盘读取是其80倍。<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
我们将使用公开的[**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
我们将使用公开的[**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
```
$ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
@@ -120,7 +120,7 @@ $ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
https://mint.com/api/v1/account
```
对于内部通信,我们可以使用[远程过程调用](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
对于内部通信,我们可以使用[远程过程调用](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)。
接下来,服务从账户中提取交易。
@@ -136,8 +136,8 @@ $ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
* **客户端**向 **Web服务器** 发送请求
* **Web服务器** 将请求转发到 **帐户API** 服务器
* **帐户API** 服务器将job放在 **队列** 中,如 [Amazon SQS](https://aws.amazon.com/sqs/) 或者 [RabbitMQ](https://www.rabbitmq.com/)
* 提取交易可能需要一段时间,我们可能希望[与队列异步](https://github.com/donnemartin/system-design-primer#asynchronism) 地来做,虽然这会引入额外的复杂度。
* **帐户API** 服务器将job放在 **队列** 中,如 [Amazon SQS](https://aws.amazon.com/sqs/) 或者 [RabbitMQ](https://www.rabbitmq.com/)
* 提取交易可能需要一段时间,我们可能希望[与队列异步](https://github.com/donnemartin/system-design-primer#asynchronism)地来做,虽然这会引入额外的复杂度。
* **交易提取服务** 执行如下操作:
***Queue** 中拉取并从金融机构中提取给定用户的交易,将结果作为原始日志文件存储在 **对象存储区**
* 使用 **分类服务** 来分类每个交易
@@ -156,25 +156,25 @@ created_at datetime NOT NULL
seller varchar(32) NOT NULL
amount decimal NOT NULL
user_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
```
我们将在 `id``user_id`,和 `created_at`字段上创建[索引](https://github.com/donnemartin/system-design-primer#use-good-indices)
我们将在 `id``user_id`,和 `created_at`字段上创建[索引](https://github.com/donnemartin/system-design-primer#use-good-indices)。
`monthly_spending`表应该具有如下结构:
```
id int NOT NULL AUTO_INCREMENT
month_year date NOT NULL
category varchar(32)
category varchar(32)
amount decimal NOT NULL
user_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
```
我们将在`id``user_id`字段上创建[索引](https://github.com/donnemartin/system-design-primer#use-good-indices)
我们将在`id``user_id`字段上创建[索引](https://github.com/donnemartin/system-design-primer#use-good-indices)。
#### 分类服务
@@ -183,7 +183,7 @@ FOREIGN KEY(user_id) REFERENCES users(id)
**告知你的面试官你准备写多少代码**
```python
class DefaultCategories(Enum) :
class DefaultCategories(Enum):
HOUSING = 0
FOOD = 1
@@ -200,19 +200,19 @@ seller_category_map['Target'] = DefaultCategories.SHOPPING
对于一开始没有在映射中的卖家,我们可以通过评估用户提供的手动类别来进行众包。在 O(1) 时间内,我们可以用堆来快速查找每个卖家的顶端的手动覆盖。
```python
class Categorizer(object) :
class Categorizer(object):
def __init__(self, seller_category_map, self.seller_category_crowd_overrides_map) :
def __init__(self, seller_category_map, self.seller_category_crowd_overrides_map):
self.seller_category_map = seller_category_map
self.seller_category_crowd_overrides_map = \
seller_category_crowd_overrides_map
def categorize(self, transaction) :
def categorize(self, transaction):
if transaction.seller in self.seller_category_map:
return self.seller_category_map[transaction.seller]
elif transaction.seller in self.seller_category_crowd_overrides_map:
self.seller_category_map[transaction.seller] = \
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
return self.seller_category_map[transaction.seller]
return None
```
@@ -220,9 +220,9 @@ class Categorizer(object) :
交易实现:
```python
class Transaction(object) :
class Transaction(object):
def __init__(self, created_at, seller, amount) :
def __init__(self, created_at, seller, amount):
self.timestamp = timestamp
self.seller = seller
self.amount = amount
@@ -234,13 +234,13 @@ class Transaction(object) :
`TABLE budget_overrides`中存储此覆盖。
```python
class Budget(object) :
class Budget(object):
def __init__(self, income) :
def __init__(self, income):
self.income = income
self.categories_to_budget_map = self.create_budget_template()
self.categories_to_budget_map = self.create_budget_template()
def create_budget_template(self) :
def create_budget_template(self):
return {
'DefaultCategories.HOUSING': income * .4,
'DefaultCategories.FOOD': income * .2
@@ -249,7 +249,7 @@ class Budget(object) :
...
}
def override_category_budget(self, category, amount) :
def override_category_budget(self, category, amount):
self.categories_to_budget_map[category] = amount
```
@@ -275,26 +275,26 @@ user_id timestamp seller amount
**MapReduce** 实现:
```python
class SpendingByCategory(MRJob) :
class SpendingByCategory(MRJob):
def __init__(self, categorizer) :
def __init__(self, categorizer):
self.categorizer = categorizer
self.current_year_month = calc_current_year_month()
self.current_year_month = calc_current_year_month()
...
def calc_current_year_month(self) :
def calc_current_year_month(self):
"""返回当前年月"""
...
def extract_year_month(self, timestamp) :
def extract_year_month(self, timestamp):
"""返回时间戳的年,月部分"""
...
def handle_budget_notifications(self, key, total) :
def handle_budget_notifications(self, key, total):
"""如果接近或超出预算调用通知API"""
...
def mapper(self, _, line) :
def mapper(self, _, line):
"""解析每个日志行,提取和转换相关行。
参数行应为如下形式:
@@ -303,31 +303,31 @@ class SpendingByCategory(MRJob) :
使用分类器来将卖家转换成类别生成如下形式的key-value对
(user_id, 2016-01, shopping) , 25
(user_id, 2016-01, shopping) , 100
(user_id, 2016-01, gas) , 50
(user_id, 2016-01, shopping), 25
(user_id, 2016-01, shopping), 100
(user_id, 2016-01, gas), 50
"""
user_id, timestamp, seller, amount = line.split('\t')
category = self.categorizer.categorize(seller)
period = self.extract_year_month(timestamp)
user_id, timestamp, seller, amount = line.split('\t')
category = self.categorizer.categorize(seller)
period = self.extract_year_month(timestamp)
if period == self.current_year_month:
yield (user_id, period, category) , amount
yield (user_id, period, category), amount
def reducer(self, key, value) :
def reducer(self, key, value):
"""将每个key对应的值求和。
(user_id, 2016-01, shopping) , 125
(user_id, 2016-01, gas) , 50
(user_id, 2016-01, shopping), 125
(user_id, 2016-01, gas), 50
"""
total = sum(values)
yield key, sum(values)
total = sum(values)
yield key, sum(values)
```
## 第四步:设计扩展
> 根据限制条件,找到并解决瓶颈。
![Imgur](http://i.imgur.com/V5q57vU.png)
![Imgur](http://i.imgur.com/V5q57vU.png)
**重要提示:不要从最初设计直接跳到最终设计中!**
@@ -337,20 +337,20 @@ class SpendingByCategory(MRJob) :
我们将会介绍一些组件来完成设计,并解决架构扩张问题。内置的负载均衡器将不做讨论以节省篇幅。
**为了避免重复讨论**,请参考[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) 相关部分来了解其要点、方案的权衡取舍以及可选的替代方案。
**为了避免重复讨论**,请参考[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引)相关部分来了解其要点、方案的权衡取舍以及可选的替代方案。
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
* [负载均衡器](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
* [水平拓展](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
* [反向代理web 服务器)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* [API 服务(应用层)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
* [缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
* [关系型数据库管理系统 (RDBMS) ](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
* [SQL 故障主从切换](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
* [主从复制](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
* [异步](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#异步)
* [一致性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
* [可用性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
* [负载均衡器](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
* [水平拓展](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
* [反向代理web 服务器)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* [API 服务(应用层)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
* [缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
* [关系型数据库管理系统 (RDBMS)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
* [SQL 故障主从切换](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
* [主从复制](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
* [异步](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#异步)
* [一致性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
* [可用性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
我们将增加一个额外的用例:**用户** 访问摘要和交易数据。
@@ -366,7 +366,7 @@ class SpendingByCategory(MRJob) :
* 如果URL在 **SQL 数据库**中,获取该内容
* 以其内容更新 **内存缓存**
参考 [何时更新缓存](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) 中权衡和替代的内容。以上方法描述了 [cache-aside缓存模式](https://github.com/donnemartin/system-design-primer#cache-aside) .
参考 [何时更新缓存](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) 中权衡和替代的内容。以上方法描述了 [cache-aside缓存模式](https://github.com/donnemartin/system-design-primer#cache-aside).
我们可以使用诸如 Amazon Redshift 或者 Google BigQuery 等数据仓库解决方案,而不是将`monthly_spending`聚合表保留在 **SQL 数据库** 中。
@@ -376,10 +376,10 @@ class SpendingByCategory(MRJob) :
*平均* 200 次交易写入每秒(峰值时更高)对于单个 **SQL 写入主-从服务** 来说可能是棘手的。我们可能需要考虑其它的 SQL 性能拓展技术:
* [联合](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
* [分片](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
* [非规范化](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
* [SQL 调优](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
* [联合](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
* [分片](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
* [非规范化](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
* [SQL 调优](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
我们也可以考虑将一些数据移至 **NoSQL 数据库**
@@ -389,50 +389,50 @@ class SpendingByCategory(MRJob) :
#### NoSQL
* [键-值存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
* [文档类型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
* [列型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
* [图数据库](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
* [键-值存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
* [文档类型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
* [列型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
* [图数据库](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
### 缓存
* 在哪缓存
* [客户端缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
* [CDN 缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
* [Web 服务器缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
* [数据库缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
* [应用缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
* [客户端缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
* [CDN 缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
* [Web 服务器缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
* [数据库缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
* [应用缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
* 什么需要缓存
* [数据库查询级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
* [对象级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
* [数据库查询级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
* [对象级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
* 何时更新缓存
* [缓存模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
* [直写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
* [回写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
* [刷新](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
* [缓存模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
* [直写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
* [回写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
* [刷新](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
### 异步与微服务
* [消息队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
* [任务队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
* [背压](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
* [微服务](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
* [消息队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
* [任务队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
* [背压](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
* [微服务](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
### 通信
* 可权衡选择的方案:
* 与客户端的外部通信 - [使用 REST 作为 HTTP API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
* 服务器内部通信 - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
* [服务发现](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
* 与客户端的外部通信 - [使用 REST 作为 HTTP API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
* 服务器内部通信 - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
* [服务发现](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
### 安全性
请参阅[「安全」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全) 一章。
请参阅[「安全」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全)一章。
### 延迟数值
请参阅[「每个程序员都应该知道的延迟数」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数)
请参阅[「每个程序员都应该知道的延迟数」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数)。
### 持续探讨

View File

@@ -80,7 +80,7 @@ Handy conversion guide:
> Outline a high level design with all important components.
![Imgur](http://i.imgur.com/E8klrBh.png)
![Imgur](http://i.imgur.com/E8klrBh.png)
## Step 3: Design core components
@@ -88,9 +88,9 @@ Handy conversion guide:
### Use case: User connects to a financial account
We could store info on the 10 million users in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) . We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) .
We could store info on the 10 million users in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms). We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
* The **Web Server** forwards the request to the **Accounts API** server
* The **Accounts API** server updates the **SQL Database** `accounts` table with the newly entered account info
@@ -106,13 +106,13 @@ account_url varchar(255) NOT NULL
account_login varchar(32) NOT NULL
account_password_hash char(64) NOT NULL
user_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
```
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id `, and `created_at` to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
```
$ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
@@ -120,7 +120,7 @@ $ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
https://mint.com/api/v1/account
```
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
Next, the service extracts transactions from the account.
@@ -136,8 +136,8 @@ Data flow:
* The **Client** sends a request to the **Web Server**
* The **Web Server** forwards the request to the **Accounts API** server
* The **Accounts API** server places a job on a **Queue** such as [Amazon SQS](https://aws.amazon.com/sqs/) or [RabbitMQ](https://www.rabbitmq.com/)
* Extracting transactions could take awhile, we'd probably want to do this [asynchronously with a queue](https://github.com/donnemartin/system-design-primer#asynchronism) , although this introduces additional complexity
* The **Accounts API** server places a job on a **Queue** such as [Amazon SQS](https://aws.amazon.com/sqs/) or [RabbitMQ](https://www.rabbitmq.com/)
* Extracting transactions could take awhile, we'd probably want to do this [asynchronously with a queue](https://github.com/donnemartin/system-design-primer#asynchronism), although this introduces additional complexity
* The **Transaction Extraction Service** does the following:
* Pulls from the **Queue** and extracts transactions for the given account from the financial institution, storing the results as raw log files in the **Object Store**
* Uses the **Category Service** to categorize each transaction
@@ -156,8 +156,8 @@ created_at datetime NOT NULL
seller varchar(32) NOT NULL
amount decimal NOT NULL
user_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
```
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id `, and `created_at`.
@@ -167,11 +167,11 @@ The `monthly_spending` table could have the following structure:
```
id int NOT NULL AUTO_INCREMENT
month_year date NOT NULL
category varchar(32)
category varchar(32)
amount decimal NOT NULL
user_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
PRIMARY KEY(id)
FOREIGN KEY(user_id) REFERENCES users(id)
```
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id` and `user_id `.
@@ -183,7 +183,7 @@ For the **Category Service**, we can seed a seller-to-category dictionary with t
**Clarify with your interviewer how much code you are expected to write**.
```python
class DefaultCategories(Enum) :
class DefaultCategories(Enum):
HOUSING = 0
FOOD = 1
@@ -200,19 +200,19 @@ seller_category_map['Target'] = DefaultCategories.SHOPPING
For sellers not initially seeded in the map, we could use a crowdsourcing effort by evaluating the manual category overrides our users provide. We could use a heap to quickly lookup the top manual override per seller in O(1) time.
```python
class Categorizer(object) :
class Categorizer(object):
def __init__(self, seller_category_map, seller_category_crowd_overrides_map) :
def __init__(self, seller_category_map, seller_category_crowd_overrides_map):
self.seller_category_map = seller_category_map
self.seller_category_crowd_overrides_map = \
seller_category_crowd_overrides_map
def categorize(self, transaction) :
def categorize(self, transaction):
if transaction.seller in self.seller_category_map:
return self.seller_category_map[transaction.seller]
elif transaction.seller in self.seller_category_crowd_overrides_map:
self.seller_category_map[transaction.seller] = \
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
return self.seller_category_map[transaction.seller]
return None
```
@@ -220,9 +220,9 @@ class Categorizer(object) :
Transaction implementation:
```python
class Transaction(object) :
class Transaction(object):
def __init__(self, created_at, seller, amount) :
def __init__(self, created_at, seller, amount):
self.created_at = created_at
self.seller = seller
self.amount = amount
@@ -233,13 +233,13 @@ class Transaction(object) :
To start, we could use a generic budget template that allocates category amounts based on income tiers. Using this approach, we would not have to store the 100 million budget items identified in the constraints, only those that the user overrides. If a user overrides a budget category, which we could store the override in the `TABLE budget_overrides`.
```python
class Budget(object) :
class Budget(object):
def __init__(self, income) :
def __init__(self, income):
self.income = income
self.categories_to_budget_map = self.create_budget_template()
self.categories_to_budget_map = self.create_budget_template()
def create_budget_template(self) :
def create_budget_template(self):
return {
DefaultCategories.HOUSING: self.income * .4,
DefaultCategories.FOOD: self.income * .2,
@@ -248,7 +248,7 @@ class Budget(object) :
...
}
def override_category_budget(self, category, amount) :
def override_category_budget(self, category, amount):
self.categories_to_budget_map[category] = amount
```
@@ -274,26 +274,26 @@ user_id timestamp seller amount
**MapReduce** implementation:
```python
class SpendingByCategory(MRJob) :
class SpendingByCategory(MRJob):
def __init__(self, categorizer) :
def __init__(self, categorizer):
self.categorizer = categorizer
self.current_year_month = calc_current_year_month()
self.current_year_month = calc_current_year_month()
...
def calc_current_year_month(self) :
def calc_current_year_month(self):
"""Return the current year and month."""
...
def extract_year_month(self, timestamp) :
def extract_year_month(self, timestamp):
"""Return the year and month portions of the timestamp."""
...
def handle_budget_notifications(self, key, total) :
def handle_budget_notifications(self, key, total):
"""Call notification API if nearing or exceeded budget."""
...
def mapper(self, _, line) :
def mapper(self, _, line):
"""Parse each log line, extract and transform relevant lines.
Argument line will be of the form:
@@ -303,31 +303,31 @@ class SpendingByCategory(MRJob) :
Using the categorizer to convert seller to category,
emit key value pairs of the form:
(user_id, 2016-01, shopping) , 25
(user_id, 2016-01, shopping) , 100
(user_id, 2016-01, gas) , 50
(user_id, 2016-01, shopping), 25
(user_id, 2016-01, shopping), 100
(user_id, 2016-01, gas), 50
"""
user_id, timestamp, seller, amount = line.split('\t')
category = self.categorizer.categorize(seller)
period = self.extract_year_month(timestamp)
user_id, timestamp, seller, amount = line.split('\t')
category = self.categorizer.categorize(seller)
period = self.extract_year_month(timestamp)
if period == self.current_year_month:
yield (user_id, period, category) , amount
yield (user_id, period, category), amount
def reducer(self, key, value) :
def reducer(self, key, value):
"""Sum values for each key.
(user_id, 2016-01, shopping) , 125
(user_id, 2016-01, gas) , 50
(user_id, 2016-01, shopping), 125
(user_id, 2016-01, gas), 50
"""
total = sum(values)
yield key, sum(values)
total = sum(values)
yield key, sum(values)
```
## Step 4: Scale the design
> Identify and address bottlenecks, given the constraints.
![Imgur](http://i.imgur.com/V5q57vU.png)
![Imgur](http://i.imgur.com/V5q57vU.png)
**Important: Do not simply jump right into the final design from the initial design!**
@@ -339,19 +339,19 @@ We'll introduce some components to complete the design and to address scalabilit
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
* [Web server (reverse proxy) ](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
* [API server (application layer) ](https://github.com/donnemartin/system-design-primer#application-layer)
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
* [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
* [Asynchronism](https://github.com/donnemartin/system-design-primer#asynchronism)
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
* [Asynchronism](https://github.com/donnemartin/system-design-primer#asynchronism)
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
We'll add an additional use case: **User** accesses summaries and transactions.
@@ -367,20 +367,20 @@ User sessions, aggregate stats by category, and recent transactions could be pla
* If the url is in the **SQL Database**, fetches the contents
* Updates the **Memory Cache** with the contents
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside) .
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside).
Instead of keeping the `monthly_spending` aggregate table in the **SQL Database**, we could create a separate **Analytics Database** using a data warehousing solution such as Amazon Redshift or Google BigQuery.
We might only want to store a month of `transactions` data in the database, while storing the rest in a data warehouse or in an **Object Store**. An **Object Store** such as Amazon S3 can comfortably handle the constraint of 250 GB of new content per month.
To address the 200 *average* read requests per second (higher at peak) , traffic for popular content should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. The **SQL Read Replicas** should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
To address the 200 *average* read requests per second (higher at peak), traffic for popular content should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. The **SQL Read Replicas** should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
2,000 *average* transaction writes per second (higher at peak) might be tough for a single **SQL Write Master-Slave**. We might need to employ additional SQL scaling patterns:
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
We should also consider moving some data to a **NoSQL Database**.
@@ -390,50 +390,50 @@ We should also consider moving some data to a **NoSQL Database**.
#### NoSQL
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
### Caching
* Where to cache
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
* What to cache
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
* When to update the cache
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
* [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
### Asynchronism and microservices
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
### Communications
* Discuss tradeoffs:
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
### Security
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
### Latency numbers
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
### Ongoing

View File

@@ -3,55 +3,55 @@
from mrjob.job import MRJob
class SpendingByCategory(MRJob) :
class SpendingByCategory(MRJob):
def __init__(self, categorizer) :
def __init__(self, categorizer):
self.categorizer = categorizer
...
def current_year_month(self) :
def current_year_month(self):
"""Return the current year and month."""
...
def extract_year_month(self, timestamp) :
def extract_year_month(self, timestamp):
"""Return the year and month portions of the timestamp."""
...
def handle_budget_notifications(self, key, total) :
def handle_budget_notifications(self, key, total):
"""Call notification API if nearing or exceeded budget."""
...
def mapper(self, _, line) :
def mapper(self, _, line):
"""Parse each log line, extract and transform relevant lines.
Emit key value pairs of the form:
(2016-01, shopping) , 25
(2016-01, shopping) , 100
(2016-01, gas) , 50
(2016-01, shopping), 25
(2016-01, shopping), 100
(2016-01, gas), 50
"""
timestamp, category, amount = line.split('\t')
period = self. extract_year_month(timestamp)
if period == self.current_year_month() :
yield (period, category) , amount
timestamp, category, amount = line.split('\t')
period = self. extract_year_month(timestamp)
if period == self.current_year_month():
yield (period, category), amount
def reducer(self, key, values) :
def reducer(self, key, values):
"""Sum values for each key.
(2016-01, shopping) , 125
(2016-01, gas) , 50
(2016-01, shopping), 125
(2016-01, gas), 50
"""
total = sum(values)
self.handle_budget_notifications(key, total)
yield key, sum(values)
total = sum(values)
self.handle_budget_notifications(key, total)
yield key, sum(values)
def steps(self) :
def steps(self):
"""Run the map and reduce steps."""
return [
self.mr(mapper=self.mapper,
reducer=self.reducer)
reducer=self.reducer)
]
if __name__ == '__main__':
SpendingByCategory.run()
SpendingByCategory.run()

View File

@@ -3,7 +3,7 @@
from enum import Enum
class DefaultCategories(Enum) :
class DefaultCategories(Enum):
HOUSING = 0
FOOD = 1
@@ -17,34 +17,34 @@ seller_category_map['Exxon'] = DefaultCategories.GAS
seller_category_map['Target'] = DefaultCategories.SHOPPING
class Categorizer(object) :
class Categorizer(object):
def __init__(self, seller_category_map, seller_category_overrides_map) :
def __init__(self, seller_category_map, seller_category_overrides_map):
self.seller_category_map = seller_category_map
self.seller_category_overrides_map = seller_category_overrides_map
def categorize(self, transaction) :
def categorize(self, transaction):
if transaction.seller in self.seller_category_map:
return self.seller_category_map[transaction.seller]
if transaction.seller in self.seller_category_overrides_map:
seller_category_map[transaction.seller] = \
self.manual_overrides[transaction.seller].peek_min()
self.manual_overrides[transaction.seller].peek_min()
return self.seller_category_map[transaction.seller]
return None
class Transaction(object) :
class Transaction(object):
def __init__(self, timestamp, seller, amount) :
def __init__(self, timestamp, seller, amount):
self.timestamp = timestamp
self.seller = seller
self.amount = amount
class Budget(object) :
class Budget(object):
def __init__(self, template_categories_to_budget_map) :
def __init__(self, template_categories_to_budget_map):
self.categories_to_budget_map = template_categories_to_budget_map
def override_category_budget(self, category, amount) :
def override_category_budget(self, category, amount):
self.categories_to_budget_map[category] = amount